Toward a study of corpora within the framework of Translation Science: a general overview

Шичкина Мария Геннадьевна

This article examines the concept and phenomenon of corpus linguistics in modern science, in particular in translation studies. The computerization of modern science has enabled researchers to collect and accumulate all kinds of text samples for the further study of linguistic features. A corpus can ease the translator’s task most effectively because it can offer a whole range of translation solutions for each specific situation.

Keywords: corpus, corpora, TS, texts, translation

With the advent of computers, which give translators and interpreters a much greater variety of options, there has been a shift toward the analysis of naturally occurring language data. This approach has gained popularity among researchers in various areas of linguistics, including Translation Science, commonly abbreviated as TS. Translation Science has been a rapidly developing area of linguistics that is growing into a discipline with its own laws and regulations, and, like any science, it has followed the latest developments in computer technology. Integrating the two fields makes more successful investigations possible, and this integration gave rise to a new area within Translation Science and comparative linguistics: corpus linguistics. This branch of linguistics studies blocks of texts, sentences, phrases and words collected together to exemplify certain rules of the language. A resurgence of interest in empirical and statistical methods of language analysis has turned corpus analysis into common practice, with frequently occurring sentences, phrases and words included in the collected blocks. These blocks are used whenever knowledge-based research is needed, given that contemporary corpora are becoming more and more complete. As the popularity of corpus linguistics rose, new works on the subject appeared: …

All of these works, from experimental studies to corpus-based ones, can be considered a significant contribution to the body of science. They investigate problems that are partially or wholly interconnected, and their common goal is to enrich corpus linguistics. The branch is divided into parts, each of which studies its own area, for example a set of phrases covering a certain topic or a set of words with the same grammatical meaning. This feature of corpus linguistics stems from the differentiation of parts of speech and of words by meaning in English, as well as from the numerous variables that exist in the language. These variables should be divided into groups, which together make up the bulk of the English vocabulary. The division is necessary for a better understanding of the rules that operate within the language. For example, “girl” is a noun and “to do” is a verb, and words with the same grammatical function as these can be grouped with them into corpora that serve educational and translational purposes. Dictionaries exist to fulfil a range of tasks, such as learning new words in an arranged order or finding a particular meaning of a given word, and in this sense dictionaries could be called corpora. As can be seen, there are many corpora that remain to be studied and analyzed. Our work aims to bring to light the specific features of corpora and to describe them in general terms, so that the description can be used in further research and for a better understanding of the concept.

Interest in language development led scientists to systematic diary studies, which began in the 19th century (Jaeger 1985) and lasted into the first decades of the 20th century. Diaries can be divided into two types: comprehensive diaries with a broader focus and topical diaries with a narrower focus. The 20th century was dominated more by scientific research, although that research was still limited by the scarce resources of the time. Nevertheless, the results of these studies led to further surveys in the 21st century, when more means of research became available. Technical and methodological resources have grown in number, and data once written in diaries has been transferred to more reliable printed and electronic sources. Most of the data has been collected and published on the Internet, which enables scientists all over the world, and any researcher interested in a particular subject, to search for information, pick out the relevant details and contribute personally to the body of science on the basis of previous research. With the information collected and studied, it became possible to define the term “corpus” (plural “corpora”, from Latin) as it is now commonly used by scientists and researchers: “a collection of texts assumed to be representative of a given language, dialect or other subset of a language, to be used for linguistic analysis”. There is also an opinion that dictionaries can be considered corpora, because they contain words arranged in a specific order and processed through the careful analysis of an expert or a group of experts. The Latin word “corpus” itself means “body”, and so it can refer to anything that can be viewed as a whole.
Dictionaries are quite comprehensive and give the learner information on the field of interest. Some are organized by subject, for instance law or medicine; others are arranged alphabetically, which makes it easier to find a single word. It is therefore possible to consider dictionaries corpora to some extent. However, the linguists who view corpora exclusively as texts collected together on the basis of some common feature believe it is more useful to apply the term only in that sense. This view is supported by a generally accepted principle: the narrower a definition is, the easier the term is to understand and use. Thus, we will use the term “corpus” (plural “corpora”) to denote a collection of texts assembled to achieve some goal. There is no “standard size” for a corpus: it can range from a single sentence to a large body of text of no predefined form, such as the British National Corpus (BNC) or any other corpus currently accessible on the Internet. Some dictionaries, for instance the ABBYY dictionary, also include a corpus of sentences or short passages, each of which usually contains the word in question. As a rule, these corpora consist of sentences taken from literary, scientific and other texts that have gained some popularity and become well known. The source could be a fragment of a famous novel by Terry Pratchett, Daniel Defoe, Jonathan Swift, Mark Twain or any other world-famous author, part of a scientific study published by a world-renowned organization, or a fragment of the Constitution, all of which serve educational and translational purposes.
The origin of a fragment is usually displayed to help the corpus user navigate the large collection of texts: the origin indicates how the fragment, or a part of it, should be used. If a fragment belongs to the medical sphere, its words should be used with exactly the same meaning in the same medical context, without alteration, because meanings in English are semantically segregated. The word “attack”, for example, occurs as a component of the phrases “heart attack”, “anxiety attack”, “panic attack”, “terrorist attack”, “air attack”, “DoS attack” and so on. The first three examples show that the meaning of one word can differ slightly even within a single sphere; in the remaining examples “attack” denotes an act of aggression directed at some group or person. As is well known, each word has its own definition and sometimes more than one. Corpora help to establish these definitions, since each definition should be grounded in the frequency with which a given meaning occurs in context. Corpora therefore enable researchers to determine that frequency, prepare an exact definition for each meaning of the word and set it out in a dictionary or an article, which in turn gives other researchers the opportunity to continue contributing to the body of science. Contextual relations, which are one of the main factors in understanding the functions of a word, play an important role in corpus formation. Words belonging to the same theme or subject define the meaning of another word that co-occurs with them in the same context, such as the word “attack”. In the sentences below we have tried to differentiate the meanings of this word using the phrases mentioned in this paragraph:

1) After the serious heart attack he could hardly walk or speak, and the doctor recommended that he stay in bed more.

2) The anxiety attack she had ended in the hospital.

3) Due to serious panic attacks she was prescribed medicine, which she was to take twice a day.

4) The terrorist attack was carefully planned, and the military forces could not prevent it.

5) The subsequent air attack on the terrorists’ positions was successful.

6) It was an attempt to make a computer resource unavailable to its intended users; it was a DoS attack.

In every sentence it is possible to single out words and phrases that together convey the context. In the first sentence these are “doctor”, “heart”, “to recommend” and “to stay in bed”. The first two point to the context directly, while the other two reveal it only in relation to the first two; in this sense, the words support one another. In the second sentence such words, the so-called “indicators”, are “anxiety” and “hospital”, which immediately show what kind of situation is described. In the third sentence they are “prescribed” and “medicine”. In the fourth they are “terrorist” and “military forces”, and in the fifth “air” and “terrorists’ positions”. In the sixth sentence they are “computer resource”, “intended users” and “DoS”. On the basis of these indicators, researchers who compile corpora separate groups of fragments from one another and place each fragment in the appropriate corpus, so that anyone interested in a certain word can find it quickly.
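The sorting by indicators described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not a method taken from any existing corpus tool: the three domain labels, the indicator lists (adapted from the words named in the paragraph, with a few additions) and the function names are all hypothetical. Each example sentence is assigned to the domain with which it shares the most indicator words.

```python
import re

# Hypothetical indicator lists: adapted from the words singled out in the text,
# plus a few assumed additions ("panic", "air", "positions").
INDICATORS = {
    "medical": {"doctor", "heart", "anxiety", "hospital", "prescribed", "medicine", "panic"},
    "military": {"terrorist", "terrorists", "military", "forces", "air", "positions"},
    "computing": {"computer", "resource", "users", "dos"},
}

# The six example sentences from the article.
SENTENCES = [
    "After the serious heart attack he could hardly walk and speak and the doctor recommended him to stay more in bed.",
    "The anxiety attack she had finished in the hospital.",
    "Due to serious panic attacks she was prescribed medicine which she was to take twice a day.",
    "The terrorist attack was carefully planned and the military forces could not prevent it.",
    "The subsequent air attack on the terrorists' positions was successful.",
    "It was an attempt to make a computer resource unavailable to its intended users, it was a DoS attack.",
]


def tokenize(sentence):
    """Lowercase the sentence and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def assign_domain(sentence):
    """Return the domain sharing the most indicators with the sentence, or None."""
    tokens = set(tokenize(sentence))
    scores = {domain: len(tokens & words) for domain, words in INDICATORS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def build_corpora(sentences):
    """Group sentences into one list per assigned domain."""
    corpora = {}
    for sentence in sentences:
        corpora.setdefault(assign_domain(sentence), []).append(sentence)
    return corpora


if __name__ == "__main__":
    for domain, fragments in build_corpora(SENTENCES).items():
        print(domain, len(fragments))
```

A real corpus would need far richer indicator lists and a rule for ambiguous fragments; the sketch only shows how shared context words can route a fragment to the right collection.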

Another way to differentiate corpora is by the origin of the texts, as well as by author, type and even year of publication, since researchers may pursue different goals in their search. A study may aim to contribute not only to linguistics but to other branches of science as well. Translation may still be required, because some studies are based on foreign texts. Thus, a corpus is not only an object of scientific research but also a tool for a range of operations; for instance, it can be immensely useful for improving the quality of translation. It is sometimes difficult to determine precisely in which context or phrase a word can be used as a component, and a corpus, as a carefully collected body of text, can help. It can also be difficult to find the exact translation of a word or phrase even when it is, so to speak, on the tip of one’s tongue. A corpus user can type in one component of the phrase, or a word associated with the unit in question, and will usually obtain matching results. The corpus is a great aid to a translator for the following reasons:

1) A corpus usually returns many results united by the feature the searcher is interested in, and the variety of sentences or fragments illustrates that feature most vividly. A dictionary can also provide a number of results, but they do not display the contextual relations of the word so clearly, which makes the translator less likely to use the word. Contextual relations matter greatly to the translator because, as a rule, translation requires a clear understanding of the context, the interrelated grammatical functions of words, the syntactic peculiarities and the overall structure of the text, all of which corpus examples show most vividly.

2) A corpus usually contains texts separated by different traits, such as, as mentioned above, the author and the year. If the translator is interested, for instance, in the vocabulary of the nineteenth century and in how the word in question correlated with the archaisms and now outdated words used at the time, it is possible to search for sentences within the appropriate corpus, which provides as much of the required information as possible.

3) Corpora are growing larger and can therefore provide a translator with a great number of results, which makes it possible to view a word from different angles.

4) Corpora give a person with an analytical bent the opportunity to apply it by studying fragments of texts and using the same phrases and constructions in a translated fragment.
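The lookup described before this list (typing in one component of a phrase and receiving matching fragments, optionally narrowed by a trait of the text) can be sketched as a simple keyword-in-context search. The Python code below is a hedged illustration rather than the interface of any real corpus: the mini-corpus reuses this article’s own example sentences, the `domain` label is a hypothetical stand-in for metadata such as author or year, and the function name `concordance` is invented for the example.

```python
import re

# Hypothetical mini-corpus: the article's example sentences, each tagged with an
# illustrative "domain" label (a stand-in for author, year or text type).
CORPUS = [
    {"domain": "medical", "text": "After the serious heart attack he could hardly walk and speak and the doctor recommended him to stay more in bed."},
    {"domain": "medical", "text": "The anxiety attack she had finished in the hospital."},
    {"domain": "medical", "text": "Due to serious panic attacks she was prescribed medicine which she was to take twice a day."},
    {"domain": "military", "text": "The terrorist attack was carefully planned and the military forces could not prevent it."},
    {"domain": "military", "text": "The subsequent air attack on the terrorists' positions was successful."},
    {"domain": "computing", "text": "It was an attempt to make a computer resource unavailable to its intended users, it was a DoS attack."},
]


def concordance(corpus, keyword, width=30, **filters):
    """Return keyword-in-context lines for every match of `keyword`.

    `filters` are metadata equality constraints, for example domain="medical".
    Word forms that start with the keyword (such as "attacks") also match.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\w*", re.IGNORECASE)
    lines = []
    for entry in corpus:
        # Skip entries whose metadata does not satisfy every filter.
        if any(entry.get(key) != value for key, value in filters.items()):
            continue
        text = entry["text"]
        for match in pattern.finditer(text):
            left = text[max(0, match.start() - width):match.start()]
            right = text[match.end():match.end() + width]
            lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines


if __name__ == "__main__":
    for line in concordance(CORPUS, "attack", domain="medical"):
        print(line)
```

Aligning the keyword in a column, with its left and right context on either side, lets a translator compare at a glance which words accompany it; filtering by author or year would work the same way if the entries carried those fields.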

It is thus possible to conclude that corpora offer their users great opportunities that should not be underestimated. Corpora have existed for a long time, and their study, which began in the nineteenth century, has led to the current situation in which corpora are accessible to almost every researcher, mostly in electronic but also in printed form. Corpora are of great use to everybody interested in them, in particular researchers and translators. They provide specialists with a wide variety of words, phrases and expressions that co-exist within a context and obey certain rules. This very clear manifestation of contextual relations is the reason why translators should use corpora: every text is translated on the basis of a deep analysis of its context. For these reasons, corpora merit further study.

References:

Banko M., Brill E. Scaling to very very large corpora for natural language disambiguation // Proceedings of the 39th annual meeting on association for computational linguistics. Association for Computational Linguistics, 2001. P. 26–33.
Chierchia G. Reference to kinds across language // Natural language semantics. 1998. Vol. 6. No. 4. P. 339–405.
Fishman J. A. Language modernization and planning in comparison with other types of national modernization and planning // Language in Society. 1973. Vol. 2. No. 1. P. 23–43.
Leech G., Smith N. Extending the possibilities of corpus-based research on English in the twentieth century: A prequel to LOB and FLOB // ICAME Journal. 2005. Vol. 29. P. 83–98.
Nesselhauf N. Learner corpora and their potential for language teaching // How to use corpora in language teaching. 2004. Vol. 12. P. 125–156.
Quirk R. Language varieties and standard language // English Today. 1990. Vol. 6. No. 1. P. 3–10.

Молодой учёный
