Free Republic

Corpus Linguistics - The statistical analysis of George Orwell’s 1984 Wording.
Located on the Internet via Google Search ^ | Timeless FR Post 09-06-01 | Not Specified

Posted on 09/06/2001 11:49:17 AM PDT by vannrox

1. Introduction

1.1 Object of the Study

    Corpus linguistics, a relatively new field of linguistics, has expanded rapidly in the last thirty years and is now one of the most dynamic and promising areas of the discipline. Corpus linguists have developed a number of valuable approaches to language research that are widely accepted and used.

    The goal of this work is to introduce this new field and its terminology, and to show how the methods of corpus linguistics can be applied to the analysis of a literary text. Specifically, we will analyse George Orwell’s novel 1984 by means of statistical methods. It must be emphasized, however, that the object of the study is not a thorough analysis of the novel but rather a presentation of what statistical methods make possible. The literary text and the corpora serve as empirical data. We will therefore not attempt a comprehensive interpretation of the novel; instead, we will show how the methods of corpus linguistics can help to explain certain features of its language.

1.2 Conditions of Research

    This study is the first in our department to use corpus-based methods, and one of the first in our university. The novelty of the work causes several difficulties.

    First of all, there are no established rules or standards for how research in corpus linguistics should be presented. We have had to answer several questions: How deeply can a linguist go into the theory of statistics, mathematics, or computer science? What information should be presented in the work? How should the empirical data be presented? How should the computer programs and other software used in the work be introduced? We have relied on our intuition, and we have also drawn on the experience of the English Department at Stockholm University (Johannesson 1988).

    The second difficulty is access to empirical data. At present, there is no English corpus available in Lithuania. A corpus linguist therefore has four options: choose a short text for the study, compile their own corpus, try to connect to a corpus via the Internet, or use the results of other studies. All four options have their shortcomings and limitations. Analysis of a short text can aim to explain the language of the text itself, but cannot support general claims about the language. Compiling a small corpus is technically possible even for an individual researcher, but the work is very expensive and time-consuming. Most often it is the privilege of a team of researchers, which may include typists, linguists, grammarians, programmers, annotators, etc. Access to a corpus via the Internet is, perhaps, a solution for the future. However, we could not use it for this study, since we found the connection too slow and problematic. Consequently, the only way to get a glimpse into the larger corpora is through the eyes of other researchers.

    We have also tried to address the problem of terminology in this work. Corpus linguistics, as a new science, has created new terminology, new concepts and new methods of research. Its terminology is therefore not yet clearly established or systematized: it is still under development. To our knowledge, there is so far no comprehensive dictionary that thoroughly covers the terminology of corpus linguistics. A Systematic Dictionary of Corpus Linguistics is presented in Appendix B. It is an attempt to group, systematize, define and explain the basic English terms in corpus linguistics and related fields. It may be used not only for reference, but also as introductory material to the new field.

    At present, there are very few comprehensive works or course-books that deal with the basic theory of corpus linguistics. Our list of references consists mainly of collections of articles. As a rule, these articles deal with very specific problems and cannot serve as sources of introductory, general theory. However, two works were particularly helpful for understanding the basic trends and methods of corpus linguistics: a course-book for undergraduate students (one of the first books of this kind) by Tony McEnery and Andrew Wilson (1996), and a comprehensive book about methods in corpus linguistics by John Sinclair (1991).

1.3 Structure of the Work

    The first part of the paper introduces the basic trends of corpus linguistics. This brief introductory material is based mainly on the articles by Aijmer and Altenberg (1991) and Leech (1991), and on the book by McEnery and Wilson (1996). The basic terminology is explained in Appendix B.

    The second part is devoted to the statistical analysis of George Orwell’s 1984. It describes and calculates two distinct statistical measures: the type/token ratio and the chi-square value. The first experiment, described in Chapter 4, compares the type/token ratio of 1984 with the type/token ratios of other corpora, and compares type/token ratios across different chapters of the text. It makes use of the counts presented in the book Corpus Linguistics (McEnery and Wilson 1996, pp. 149-158). The lingware used for the study is Wordsmith Tools (Scott 1996).
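
    To make the first measure concrete: the type/token ratio is the number of distinct word forms (types) divided by the number of running words (tokens). The sketch below is only an illustration and is not the procedure used by Wordsmith Tools; the function names and the simple tokenisation rule (lowercase runs of letters and apostrophes) are assumptions made for the example.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens.

    Assumption for this sketch: a token is any run of letters or
    apostrophes. Real corpus tools apply more careful rules.
    """
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(text):
    """Return distinct word forms (types) divided by running words (tokens)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0  # avoid dividing by zero on empty input
    return len(set(tokens)) / len(tokens)


# Three short sentences from the novel: 9 tokens, 7 types ("is" occurs 3 times).
sample = "War is peace. Freedom is slavery. Ignorance is strength."
ratio = type_token_ratio(sample)
print(f"type/token ratio: {ratio:.3f}")
```

    Because the ratio tends to fall as a text grows longer, comparisons of this kind are usually only meaningful between samples of similar size.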

    The second experiment, described in Chapter 5, tries to identify the key words in 1984. It is based on the calculation of the chi-square value. This analysis is more complex than the previous one and touches upon many important topics in corpus linguistics, such as frequency lists, lemmatisation, concordancing, disambiguation, the chi-squared test and others.
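
    As a rough illustration of how a chi-square value can flag a key word, the sketch below compares the frequency of one word in a text with its frequency in a reference corpus using a 2x2 contingency table. The exact form of the test used in Chapter 5 is not given in this introduction, so the formula here (Pearson's chi-square without continuity correction) is an assumption, as are the function name and the counts, which are invented for the example.

```python
def chi_square_2x2(freq_text, size_text, freq_ref, size_ref):
    """Pearson chi-square for one word in a text versus a reference corpus.

    The 2x2 table is:
                     text    reference
        the word      a         b
        other words   c         d
    Assumption: no continuity (Yates) correction is applied.
    """
    a = freq_text
    b = freq_ref
    c = size_text - freq_text
    d = size_ref - freq_ref
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # degenerate table: the word is absent or a sample is empty
    return n * (a * d - b * c) ** 2 / denominator


# Hypothetical counts: a word occurring 30 times in a 1,000-word text
# and 10 times in a 1,000-word reference sample.
value = chi_square_2x2(30, 1000, 10, 1000)

# With one degree of freedom, a value above 3.84 is significant at p < 0.05.
print(f"chi-square: {value:.2f}, key word candidate: {value > 3.84}")
```

    A word whose relative frequency is the same in both samples gives a value of zero; the larger the value, the less likely it is that the difference in frequency arose by chance.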

    The results of the analysis are summarized and evaluated in the Conclusions.

    The work has two appendices: Appendix A describes the Collins Cobuild English Collocations on CD-ROM, and Appendix B is a thematic dictionary of the terminology of corpus linguistics.

    An index of the basic terms and names is provided at the end of the work.

 


TOPICS: Culture/Society; Editorial
I found this a very interesting read.
1 posted on 09/06/2001 11:49:17 AM PDT by vannrox


FreeRepublic, LLC, PO BOX 9771, FRESNO, CA 93794
FreeRepublic.com is powered by software copyright 2000-2008 John Robinson