CORPORA

Many scholars in the humanities make use of corpora to conduct quantitative research. A corpus (Latin for body) is essentially a large, searchable database of written or spoken text. According to corpus linguist Nancy Ide, "the corpus is a fundamental tool for any type of research on language" (Companion, par. 1). From the perspective of many corpus linguists, commenting on what language is, does, or constructs, without looking to an appropriate corpus, is mere conjecture based on a paucity of samples.

Some grammarians, particularly disciples of Chomsky's generative grammar, find no value in corpus research because any native speaker of a language, they argue, can in an instant generate any grammatically allowable sentence in her language; thus, there is no need to search through endless text when each speaker's brain is already capable of producing endless text. (The number of legal sentences in a grammar is believed to be infinite.) However, corpus linguists counter that the Chomskian position overlooks the fact that a given speaker may not be familiar with every register or dialect of her language and thus would not be able to produce all possible sentences within a grammar. In addition, corpus linguists and other researchers who make use of corpora are typically interested in how a language functions "on the ground," so to speak, rather than in elucidating a formal grammar. In their view, research in generative grammar relies too much on an individual's "intuition" about her native language; corpus linguistics, on the other hand, looks to large samples of "language-in-use" to describe what speakers of a language may or may not do.

Corpora allow researchers to search "automatically for a variety of language features, and compute frequency, distributional characteristics, and other descriptive statistics" (Ide, par. 1). Of course, computers are indispensable aids to corpus researchers. Recording and storing corpora, to say nothing of analyzing them, is virtually impossible without the aid of computer technology. Just as discourse analysis was impractical before Philips introduced Compact Cassettes, so too was corpus linguistics impossible before the advent of hard drives.
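The frequency counts that Ide mentions can be sketched in a few lines of Python. This is only a minimal illustration, not the method of any particular corpus tool: the function names `tokenize` and `frequency_table` and the sample sentence are hypothetical, and the tokenizer is deliberately crude (lowercase letters and apostrophes only).

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_table(text):
    """Return a Counter mapping each word to its raw count in the text."""
    return Counter(tokenize(text))

# Hypothetical sample standing in for a (much larger) corpus.
sample = "The cat sat on the mat. The dog sat too."
counts = frequency_table(sample)
total = sum(counts.values())

# Print the three most frequent words with raw and relative frequency.
for word, n in counts.most_common(3):
    print(f"{word}\t{n}\t{n / total:.2%}")
```

A real corpus would need more careful tokenization (hyphens, numerals, non-English characters), but the principle of counting and then dividing by the total is the same.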

TYPES OF CORPORA
A corpus can be based on written or spoken language. When a corpus is based on written language, the data can be preserved in its original form via scanning before it is "normalized" into an orthography acceptable for processing or publication. A drawback to spoken corpora, however, is that they are often normalized into orthography without regard for tone, pauses, and all the other wonderful vagaries of speech. For researchers trained in phonetics, the International Phonetic Alphabet can be a way to record the sounds of speech in a somewhat faithful orthography. (See Phonetics as a Corpus Tool, below.) But for most researchers in rhetoric and composition, who are rarely concerned with phonetic issues, spoken corpora can nonetheless be invaluable resources both for original research and for testing hypotheses put forward within the contexts of ethnographies and other qualitative projects.

Corpora can be finite or non-finite. This means that a corpus can either consist of a closed selection of texts or be open-ended, constantly being enlarged with new text. Non-finite corpora (called monitor corpora by corpus linguist John Sinclair) are valuable for tracking changes in words and morphemes over time but can be unwieldy when it comes to truly quantifying the linguistic data they provide. (It is impossible to quantify a constantly changing sample.) Finite corpora, on the other hand, have been closed to further addition. They capture a language during a particular time and can consist of as few as one million words or as many as several hundred million words. The International Corpus of English, a project which has sought to archive regional differences in English, provides smaller corpora. The Corpus of Contemporary American English contains 425 million words across a variety of spoken and written texts. And the same researchers who have compiled the Corpus of Contemporary American English have also compiled an entirely written corpus from Google Books. The Google Books Corpus contains 155 billion words. Most corpora that have been compiled nationally and internationally are available through the Linguistic Data Consortium or the European Language Resources Association.
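Because corpora differ so widely in size, raw counts from two corpora cannot be compared directly. One common way around this is to normalize counts to occurrences per million words. The sketch below assumes that approach; the function name `per_million` and the raw counts are hypothetical, and only the corpus sizes echo figures mentioned above.

```python
def per_million(raw_count, corpus_size):
    """Normalize a raw count to occurrences per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count * 1_000_000 / corpus_size

# Hypothetical raw counts for the same word in two corpora of different sizes.
small = per_million(raw_count=50, corpus_size=1_000_000)
large = per_million(raw_count=8_500, corpus_size=425_000_000)

print(f"one-million-word corpus: {small:.1f} per million")
print(f"425-million-word corpus: {large:.1f} per million")
```

In this invented case the word has the larger raw count in the bigger corpus but is proportionally more frequent in the smaller one.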

A monolingual corpus is composed of texts from a single language. A multilingual corpus is composed of texts from multiple languages. An aligned parallel corpus is a multilingual corpus that has been arranged for side-by-side comparison.
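One simple way to picture an aligned parallel corpus is as a list of sentence pairs. The sketch below is an assumption about how such data might be stored, not a description of any published corpus format; the `AlignedPair` type, the `find_aligned` helper, and the English/German sentences are all made up for illustration.

```python
import re
from typing import NamedTuple

class AlignedPair(NamedTuple):
    source: str  # sentence in the first language
    target: str  # corresponding sentence in the second language

# Hypothetical, hand-aligned English/German sentence pairs.
parallel_corpus = [
    AlignedPair("Good morning.", "Guten Morgen."),
    AlignedPair("The book is on the table.", "Das Buch liegt auf dem Tisch."),
    AlignedPair("I read the book.", "Ich lese das Buch."),
]

def find_aligned(corpus, word):
    """Return every pair whose source sentence contains the word (case-insensitive)."""
    needle = word.lower()
    return [
        pair for pair in corpus
        if needle in re.findall(r"\w+", pair.source.lower())
    ]

# Show each English hit next to its aligned German sentence.
for pair in find_aligned(parallel_corpus, "book"):
    print(f"{pair.source}  |  {pair.target}")
```

Searching on the source side and reading off the target side is what makes the side-by-side arrangement useful for comparison.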


USES OF CORPORA
All of the available corpora can be searched for patterns across a wide variety of text. There are few limits to the number of research questions that a large corpus can answer. According to Nancy Ide's article in the Companion to Digital Humanities, some of the methods used to annotate corpora (that is, to mark them up so that they can be searched and analyzed) are the following:

1. Morpho-syntactic annotation tags each word in the data with its part of speech. Disambiguation is necessary because many words in a text have more than one possible part of speech, depending on the context. "Bugs" can be a plural noun or a finite verb; "that" can be a relative pronoun or a demonstrative adjective. A search algorithm that does not take these contexts into consideration would be of little use.

2. Parallel alignment uses an algorithm to align two or more translations of the same text. Parallel alignment "maps" one text to another (for example, an English and a German translation of Plato's Republic), matching words and sentences to each other for comparative analysis.

3. Syntactic annotation shows relations between parts of speech within a clause. Many algorithms for syntactic annotation provide "treebanks" which show constituency (i.e., how noun phrases are embedded within prepositional phrases which are embedded in verb phrases, et cetera).

4. Semantic annotation tags linguistic elements with extra information about the "meaning" of those elements. This kind of annotation is commonly used by sociolinguists (who tag data for ideas like "agent" or "act") and by literary scholars (who tag data for themes and concepts).

5. Discourse-level annotation analyzes data in order to a) detect topics across a wide range of text and b) track cross-references within the same text (for example, by following a chain of pronouns).
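The first item in the list above, morpho-syntactic annotation, can be illustrated with a toy example. In this sketch each token is stored with a part-of-speech tag, so that a search can distinguish "bugs" the noun from "bugs" the verb. The tags are assigned by hand, and the tag names, the sentences, and the `search` function are illustrative assumptions; a real corpus would rely on an automatic tagger and a documented tagset.

```python
# Each sentence is a list of (word, tag) pairs. The tags are hand-assigned
# and the tag names are an illustrative assumption, not a standard tagset.
tagged_corpus = [
    [("The", "DET"), ("bugs", "NOUN"), ("crawl", "VERB"), ("slowly", "ADV")],
    [("It", "PRON"), ("bugs", "VERB"), ("me", "PRON")],
    [("She", "PRON"), ("collects", "VERB"), ("bugs", "NOUN")],
]

def search(corpus, word, tag=None):
    """Return sentences (as strings) containing the word, optionally limited to one tag."""
    hits = []
    for sentence in corpus:
        for token, token_tag in sentence:
            if token.lower() == word.lower() and (tag is None or token_tag == tag):
                hits.append(" ".join(t for t, _ in sentence))
                break  # one hit per sentence is enough
    return hits

# Without a tag, every occurrence of the word form is returned.
print(search(tagged_corpus, "bugs"))
# With a tag, only the verbal use is returned.
print(search(tagged_corpus, "bugs", tag="VERB"))
```

The untagged search cannot tell the two uses apart, which is the problem that disambiguation is meant to solve.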

More important than any particular corpus or any particular type of annotation is the idea of the corpus itself. We can complain that a single corpus (such as the Corpus of Contemporary American English, or COCA) does not adequately represent such-and-such segment of society that we want to research. But understanding what a corpus is allows us to address any such gaps: we can form partnerships to create our own corpus of spoken or written language as used by any community that we wish to study.

CURRENT CORPUS RESEARCH
Stanford University's Literary Lab "discusses, designs, and pursues literary research of a digital and quantitative nature." Run by Matthew Jockers and Franco Moretti, the lab envisions itself as a space where corpus research is placed at the center of literary study. Staffed primarily by graduate students under the supervision of Jockers and Moretti, it is dedicated to quantitative and statistical research of style, genre, plot, and character in literature. The following are some of the research projects being pursued at the Literary Lab:

1. A Geography of 19th Century English and American Fiction
2. Towards a Stylistics of the Novelistic Sentence
3. Network Theory and Dramatic Structure



The Centre for Corpus Research is housed at the University of Birmingham in England. Since the 1970s, Birmingham has been at the forefront of corpus research, compiling the Bank of English and the 17 million word Birmingham Collection of English Text. The following is a short video from Dr. Nicholas Groom, a lecturer at the Centre, talking about what corpus linguistics is and how it is situated as a challenge to the dominant theories of linguistics.






Located at the Université catholique de Louvain, the Centre for English Corpus Linguistics "specializes in the collection and use of corpora for linguistic and pedagogical purposes. Its main areas of focus are learner and multilingual corpora." Louvain's Centre compiled the International Corpus of Learner English, which contains argumentative essays written by learners of English from various linguistic backgrounds. The following are some of the research projects being pursued there:

1. Cognitive Grammar and English as a Foreign Language
2. Quantification and Approximation in Business Language
3. LEAD: An English for Academic Purposes Dictionary for non-native speakers

THE CORPUS AS A TOOL FOR ESL TEACHERS
The video below briefly explains how a corpus can be used to assist L2 learners by providing a plethora of examples of how native speakers construct specific sentences.
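A common way to present such examples to learners is a keyword-in-context (KWIC) concordance, which lists every occurrence of a word together with a few words on either side. The following is a minimal sketch under simple assumptions: the function name `kwic`, the window size, and the sample sentences are hypothetical, and the context window ignores sentence boundaries.

```python
import re

def kwic(text, keyword, width=3):
    """Return keyword-in-context lines: `width` words on each side of every hit."""
    tokens = re.findall(r"[A-Za-z']+", text)
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

# Hypothetical sample; a teacher would draw on a real corpus instead.
sample = ("She is interested in music. He was interested in the results. "
          "They are not interested in politics.")

for line in kwic(sample, "interested"):
    print(line)
```

Lined up this way, the hits let a learner see at a glance that "interested" is regularly followed by "in".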





PHONETICS AS A CORPUS TOOL
Because researchers in rhetoric and composition tend to use written texts as primary sources, the sounds of language may not often seem relevant. While this may be true in many cases, there are other research projects that may benefit from a method for analyzing speech sounds. For instance, two areas in which speech sounds remain very relevant are the study of "language stereotyping" and the socio-political issues surrounding "proper" pronunciation among L2 and non-traditional students. One way to talk about these issues is with the sometimes unwieldy terms provided by critical theory and socio-cultural rhetorical discourse. Another way of talking about them is through phonetics and phonology, which allow us to analyze, from a linguistic standpoint, why stereotypes may emerge in the first place. (Phonetics is the study of speech sounds themselves; phonology is the study of the rules that native speakers follow to make sense of the speech sounds.)

The International Phonetic Alphabet is what linguists use to describe speech sounds. In a way, it is a corpus of all the sounds that the human vocal tract is capable of producing. In fact, if a researcher needs to transcribe details from a spoken corpus, the IPA is the most faithful and detailed means of describing the spoken language in a form that other researchers can recognize. (Also see Feature Analysis for an even more thorough way of describing the phonetics and phonology of a language.) Thus, the IPA is a valuable tool, and researchers outside of linguistics can (with effort) familiarize themselves with it so that they can talk about speech sounds in a more detailed and measurable way. Particularly in ethnic studies, the IPA can be valuable for discussing different dialects of English, as well as the issues L2 speakers face when transitioning to an English-only environment. How people pronounce words is a powerful marker of class and identity, and both phonetic and phonological analyses can be powerful ways of describing, quantitatively, what pronunciations are and how they work at a psychological and physiological level.


[Figure: chart of the International Phonetic Alphabet]