 Selected Bibliography and Links
Journals
Monographs
- Barnbrook, Geoff. (1996) Language and Computers: A
Practical Introduction to the Computer Analysis of Language. Edinburgh: Edinburgh
University Press.
[Overview of text analysis with chapters on: data capture; frequency lists;
concordances; collocations; tagging and parsing; natural language processing and some case
studies.]
- Kennedy, Graeme. (1998) An Introduction to Corpus
Linguistics. London: Longman.
[A useful overview. Includes chapters on: introduction; the design and development
of corpora; corpus-based descriptions of English; corpus analysis; implications and
applications of corpus-based analysis.]
- McEnery, Tony, and Andrew Wilson. (1996) Corpus
Linguistics. Edinburgh: Edinburgh University Press.
[Intended as an undergraduate course book, but useful for other beginners. Includes
chapters on: early corpus linguistics and the Chomskyan revolution; what is a corpus and
what is in it; quantitative data; the use of corpora in language studies; corpora and
computational linguistics; a case study (investigates the hypothesis that a sublanguage
will show a high degree of closure at various levels of description, using two general
corpora and one of IBM manuals). See also the book's Web site and the Web course
based on the book.]
- Ooi, Vincent B.Y. (1998) Computer Corpus Lexicography.
Edinburgh: Edinburgh University Press.
[Overview of the role of corpora in dictionary making and computational lexicons.
Discusses convergence of computational linguistics, computational lexicography and corpus
linguistics.]
- Sinclair, John. (1991) Corpus, Concordance, Collocation.
Oxford: Oxford University Press.
[Discusses concordances and collocations mostly with reference to specific words
and phrases. Each chapter is an updated version of an earlier paper.]
- Stubbs, Michael. (1996) Text and Corpus Analysis. Oxford:
Blackwell.
[Another introductory volume. Situates corpus linguistics within broader areas of
linguistics, followed by some case studies.]
Collections
- Garside, Roger, Geoffrey Leech, and Geoffrey Sampson. (Eds)
(1987) The Computational Analysis of English: A Corpus-based Approach. London:
Longman.
[Papers describing research at the Unit for Computer Research on the English
Language (UCREL), Lancaster. Although published in 1987, a good deal of this is still
current.]
- Hickey, Raymond, Merja Kytö, Ian Lancashire, and Matti
Rissanen. (Eds) (1997) Tracing the Trail of Time: Proceedings from the Second
Diachronic Corpora Workshop. Amsterdam: Rodopi.
[Most recent collection of papers on the use of diachronic corpora.]
- Sinclair, J.M. (Ed) (1987) Looking Up: An Account of the
COBUILD Computing Project in Lexical Computing. London: Collins ELT.
[Papers from the Cobuild group describing how they created the Cobuild dictionary.
Topics include corpus development, grammar, definitions, examples, pronunciation and
"moving on".]
- Thomas, Jenny, and Mick Short. (Eds) (1996) Using
Corpora for Language Research: Studies in Honour of Geoffrey Leech. London: Longman.
[Contains papers by most leading corpus linguists. Sections on: using corpora for
language research; corpus-based language studies; applications of corpus-based research to
speech and language technology; wider applications of corpus-based research (teaching,
lexicography, multilingual work).]
- Wichmann, Anne, Steven Fligelstone, Tony McEnery, and Gerry
Knowles. (Eds) (1997) Teaching and Language Corpora. London: Addison Wesley
Longman.
[Papers on the use of corpora in teaching language and linguistics. From the TALC
Conference in spring 1994.]
Many of the volumes listed above contain useful
bibliographies. For a comprehensive bibliography see also Parts 2 (1989-) and 3 (1990-98)
of Bengt Altenberg's ICAME bibliography accessible via the ICAME home page.
Specific Corpora
- Aston, Guy, and Lou Burnard. (1998) The BNC handbook:
Exploring the British National Corpus with SARA. Edinburgh: Edinburgh University
Press.
[Contains worked examples on how to use the BNC for specific enquiries.]
- Greenbaum, Sydney. (Ed) (1996) Comparing English
Worldwide: the International Corpus of English. Oxford: Oxford University Press.
[Collection of papers on the compilation and applications of ICE.]
- Johansson, Stig, and Knut Hofland. (1989) Frequency
Analysis of English Vocabulary and Grammar Based on the LOB Corpus. 2 volumes.
Oxford: Clarendon Press.
[Mostly lists of words and tags, but the introduction has some useful discussion of
the methodology and limitations.]
Specific Topics
Introductory
- Hockey, Susan. (1998) "Textual Databases". In
Lawler, John and Aristar Dry, Helen, Eds. Using Computers in Linguistics: a Practical
Guide. London: Routledge. 101-133.
[Introduction to textual databases and concordance tools.]
Corpus Design
- Atkins, Sue, Jeremy Clear, and Nicholas Ostler. (1992)
"Corpus Design Criteria". Literary and Linguistic Computing 7, 1-16.
[Discussion of design of written corpora at the time when the BNC was being planned.]
- Burr, Elisabeth. (1996) "A Computer Corpus of Italian
Newspaper Language" in Hockey, Susan and Ide, Nancy, Eds. Research in Humanities
Computing 4: Selected Papers from the 1992 ALLC-ACH Conference. Oxford: Oxford
University Press. 216-239.
[Insightful discussion on the creation of a newspaper corpus by a linguist.]
- Crowdy, S. (1993) "Spoken Corpus Design". Literary
and Linguistic Computing 8, 259-265.
[Plans for the design of the spoken part of the British National Corpus.]
Text Encoding
- Burnard, Lou. (1995) "What is SGML and How Does it Help?" Computers
and the Humanities, 29, 41-50.
[Overview of why SGML is important.]
- Renear, Allen. (1997) "Out of Praxis: Three
(Meta)Theories of Textuality". Electronic Text: Investigations in Theory and
Method. Ed. Kathryn Sutherland. Oxford: Clarendon Press. 107-26.
[Discusses intellectual rationale for structured markup.]
- Sperberg-McQueen, C. Michael, and Lou Burnard. (Eds) (1994) Guidelines
for the Encoding and Interchange of Electronic Texts. Chicago and Oxford: ACH, ACL,
ALLC.
[Full specification of the Text Encoding Initiative encoding scheme. Somewhat daunting
volumes, but Chapter 2, "A Gentle Introduction to SGML", is a useful starting point. A
searchable version of the TEI Guidelines is online at the University of Michigan
Humanities Text Initiative.]
- The SGML/XML Web
Page
[Very comprehensive list of SGML- and XML-related information. The compiler of this site
was originally a biblical scholar and so it includes plenty of information for academic
users as well as commercial users. Updated almost daily.]
- Text Encoding
Initiative Home Page
[Information about the TEI Project and how to obtain the Guidelines. Also a list of some
TEI projects and some bibliography.]
- Oxford University Computing Services. (1988) Micro-OCP
Manual. Oxford: Oxford University Press.
[Chapter 2 contains a description of the COCOA markup scheme.]
Metadata - Data About the Data
- Giordano, Richard. (1995) "The TEI Header and the
Documentation of Electronic Texts". Computers and the Humanities 29, 75-84.
[Discussion and description of the TEI header by a librarian/computer scientist who
has designed corpus headers.]
- Miller, Paul. Metadata for the Masses.
[An introduction to the Dublin Core. Published in Ariadne, from the UK Electronic
Libraries Programme.]
- Oxford Text Archive. (1997) Metadata for
Electronic Texts Workshop Report
[Report of meeting held in Oxford in May 1997. Discusses metadata needs for electronic
texts with reference to the Dublin Core and TEI Headers. Appendix 4 maps Dublin Core
elements against elements in the TEI header.]
Spoken Texts
- Leech, Geoffrey, Greg Myers, and Jenny Thomas. (1995) Spoken
English on Computer: Transcription, Mark-up and Application. London: Longman.
[Papers discussing problems in working with "computer corpora of spoken
discourse". Sections on issues and practices, applications and more specialized uses,
and samples and systems of transcription.]
Software Tools for Text Analysis
- Concordance.
Developed by Rob Watt from his Web
Concordances.
[For Windows NT 4.0 and Windows 95/98. Makes word frequencies,
concordances, and web concordances. User-definable alphabet and user-definable reference
system (based on COCOA format).]
- MonoConc Pro. Developed
by Michael Barlow.
[For Windows 95. Designed for linguists. Makes word frequencies and concordances.]
- WordSmith
Tools. Developed by Mike Scott.
[For Windows 95. Designed for linguists. Makes word frequencies and concordances.]
- Language
Technology Group Software. Developed at the University of Edinburgh.
[Set of routines callable from C programs for corpus processing and analysis.
Includes SGML/XML tools.]
Collocations
- Jones, S. and Sinclair, J. (1974) "English Lexical
Collocations: A Study in Computational Linguistics". Cahiers de Lexicologie,
24.1, 15-61.
[A ground-breaking article on the use of collocations in the study of lexis.]
- Church, Kenneth, William Gale, Patrick Hanks, and Donald
Hindle. (1991) "Using Statistics in Lexical Analysis". Lexical Acquisition:
Exploiting On-Line Resources to Build a Lexicon. Ed. Uri Zernik. Hillsdale: Lawrence
Erlbaum. 115-64.
[Discussion of various statistical methods for identifying lexical relations in
both raw and tagged text.]
Corpus Annotation
- Garside, Roger, Geoffrey Leech, and Anthony McEnery. (Eds)
(1997) Corpus Annotation: Linguistic Information from Text Corpora. London:
Addison Wesley Longman.
[Papers from the Lancaster group who tagged the British National Corpus. Discusses
the nature of annotation including grammatical tagging, syntactic annotation,
semantic annotation, discourse annotation, software tools for annotation, and the broader
applications of annotated corpora.
A free demo of the
Lancaster CLAWS4 tagging system is restricted to 300 words.]
Lexical Databases and Corpora
- Calzolari, Nicoletta, and Antonio Zampolli. (1991)
"Lexical Databases and Textual Corpora: A Trend of Convergence Between Computational
Linguistics and Literary and Linguistic Computing". Research in Humanities
Computing 1: Papers from the 1989 ACH-ALLC Conference. Eds Susan Hockey and
Nancy Ide. Guest ed Ian Lancashire. Oxford: Oxford University Press. 272-307.
[Important paper discussing the use of lexical databases in corpus analysis and
developments in computational linguistics and literary and linguistic computing. Based on
the work of the Istituto di Linguistica Computazionale, Pisa.]
- Walker, Donald, Nicoletta Calzolari, and Antonio Zampolli.
(Eds) (1995) Automating the Lexicon: Research and Practice in a
Multilingual Environment. Oxford: Oxford University Press.
[Collection of papers derived from the Grosseto Workshop, where the groundwork was laid
for the development of reusable linguistic resources and corpora.]
- WordNet
[On-line lexical database of English developed originally for research into
psycholinguistic theories of human lexical memory. Developed at the
Cognitive Science Laboratory at Princeton University under the direction
of George Miller.]
Corpus Validation
- McEnery, Tony, and Lou Burnard, with Andrew Wilson and Paul
Baker. Validation of Linguistic Corpora.
[Report prepared for ELRA. Discusses methods for the formal validation of language
corpora, particularly with reference to annotation and SGML-based schemes.]
Corpus Web Sites and Discussion Lists
- Corpora discussion list.
- Corpus
Linguistics Web Site. Maintained by Michael Barlow at Rice University.
[Pointers to centres, corpora, software, bibliography etc.]
- Tutorial: Concordances and
Corpora. Developed by Catherine N. Ball.
[For a course at Georgetown. Good starting point with discussion and pointers to
other material.]

Web page prepared by Susan Hockey for a workshop given at
the North American Symposium
on Corpora in Linguistics and Language Teaching, University of Michigan, Thursday 20
May 1999, 9am - 12pm.
