
Selected Bibliography and Links

Journals

Monographs

  • Barnbrook, Geoff. (1996) Language and Computers: A Practical Introduction to the Computer Analysis of Language. Edinburgh: Edinburgh University Press.
    [Overview of text analysis with chapters on: data capture; frequency lists; concordances; collocations; tagging and parsing; natural language processing and some case studies.]
  • Kennedy, Graeme. (1998) An Introduction to Corpus Linguistics. London: Longman.
    [A useful overview. Includes chapters on: introduction; the design and development of corpora; corpus-based descriptions of English; corpus analysis; implications and applications of corpus-based analysis.]
  • McEnery, Tony, and Andrew Wilson. (1996) Corpus Linguistics. Edinburgh: Edinburgh University Press.
    [Intended as an undergraduate course book, but useful for other beginners. Includes chapters on: early corpus linguistics and the Chomskyan revolution; what is a corpus and what is in it; quantitative data; the use of corpora in language studies; corpora and computational linguistics; a case study (investigates the hypothesis that a sublanguage will show a high degree of closure at various levels of description, using two general corpora and one of IBM manuals). See also the book's Web site and the Web course based on the book.]
  • Ooi, Vincent B.Y. (1998) Computer Corpus Lexicography. Edinburgh: Edinburgh University Press.
    [Overview of the role of corpora in dictionary making and computational lexicons. Discusses convergence of computational linguistics, computational lexicography and corpus linguistics.]
  • Sinclair, John. (1991) Corpus, Concordance, Collocation. Oxford: Oxford University Press.
    [Discusses concordances and collocations mostly with reference to specific words and phrases. Each chapter is an updated version of an earlier paper.]
  • Stubbs, Michael. (1996) Text and Corpus Analysis. Oxford: Blackwell.
    [Another introductory volume. Situates corpus linguistics within broader areas of linguistics, followed by some case studies.]

Collections

  • Garside, Roger, Geoffrey Leech, and Geoffrey Sampson. (Eds) (1987) The Computational Analysis of English: A Corpus-based Approach. London: Longman.
    [Papers describing research at the Unit for Computer Research on the English Language (UCREL), Lancaster. Although published in 1987, a good deal of this is still current.]
  • Hickey, Raymond, Merja Kytö, Ian Lancashire, and Matti Rissanen. (Eds) (1997) Tracing the Trail of Time: Proceedings from the Second Diachronic Corpora Workshop. Amsterdam: Rodopi.
    [Most recent collection of papers on the use of diachronic corpora.]
  • Sinclair, J.M. (Ed) (1987) Looking Up: An Account of the COBUILD Project in Lexical Computing. London: Collins ELT.
    [Papers from the Cobuild group describing how they created the Cobuild dictionary. Topics include corpus development, grammar, definitions, examples, pronunciation and "moving on".]
  • Thomas, Jenny, and Mick Short. (Eds) (1996) Using Corpora for Language Research: Studies in Honour of Geoffrey Leech. London: Longman.
    [Contains papers by most leading corpus linguists. Sections on: using corpora for language research; corpus-based language studies; applications of corpus-based research to speech and language technology; wider applications of corpus-based research (teaching, lexicography, multilingual work).]
  • Wichmann, Anne, Steven Fligelstone, Tony McEnery, and Gerry Knowles. (Eds) (1997) Teaching and Language Corpora. London: Addison Wesley Longman.
    [Papers on the use of corpora in teaching language and linguistics. From the TALC Conference in spring 1994.]

Many of the volumes listed above contain useful bibliographies. For a comprehensive bibliography see also Parts 2 (1989-) and 3 (1990-98) of Bengt Altenberg's ICAME bibliography, accessible via the ICAME home page.

Specific Corpora

  • Aston, Guy, and Lou Burnard. (1998) The BNC handbook: Exploring the British National Corpus with SARA. Edinburgh: Edinburgh University Press.
    [Contains worked examples on how to use the BNC for specific enquiries.]
  • Greenbaum, Sydney. (Ed) (1996) Comparing English Worldwide: the International Corpus of English. Oxford: Oxford University Press.
    [Collection of papers on the compilation and applications of ICE.]
  • Johansson, Stig, and Knut Hofland. (1989) Frequency Analysis of English Vocabulary and Grammar Based on the LOB Corpus. 2 volumes. Oxford: Clarendon Press.
    [Mostly lists of words and tags, but the introduction has some useful discussion of the methodology and limitations.]

Specific Topics

Introductory

  • Hockey, Susan. (1998) "Textual Databases". In Lawler, John and Aristar Dry, Helen, Eds. Using Computers in Linguistics: a Practical Guide. London: Routledge. 101-133.
    [Introduction to textual databases and concordance tools.]

Corpus Design

  • Atkins, Sue, Jeremy Clear, and Nicholas Ostler. (1992) "Corpus Design Criteria". Literary and Linguistic Computing 7, 1-16.
    [Discussion of design of written corpora at the time when the BNC was being planned.]
  • Burr, Elisabeth. (1996) "A Computer Corpus of Italian Newspaper Language" in Hockey, Susan and Ide, Nancy, Eds. Research in Humanities Computing 4: Selected Papers from the 1992 ALLC-ACH Conference. Oxford: Oxford University Press. 216-239.
    [Insightful discussion on the creation of a newspaper corpus by a linguist.]
  • Crowdy, S. (1993) "Spoken Corpus Design". Literary and Linguistic Computing 8, 259-265.
    [Plans for the design of the spoken part of the British National Corpus.]

Text Encoding

  • Burnard, Lou. (1995) "What is SGML and How Does it Help?" Computers and the Humanities, 29, 41-50.
    [Overview of why SGML is important.]
  • Renear, Allen. (1997) "Out of Praxis: Three (Meta)Theories of Textuality". Electronic Text: Investigations in Theory and Method. Ed. Kathryn Sutherland. Oxford: Clarendon Press. 107-26.
    [Discusses intellectual rationale for structured markup.]
  • Sperberg-McQueen, C. Michael, and Lou Burnard. (Eds) (1994) Guidelines for the Encoding and Interchange of Electronic Texts. Chicago and Oxford: ACH, ACL, ALLC.
    [Full specification of the Text Encoding Initiative encoding scheme. Somewhat daunting volumes, but Chapter 2, "A Gentle Introduction to SGML", is a useful starting point. A searchable version of the TEI Guidelines is online at the University of Michigan Humanities Text Initiative.]
  • The SGML/XML Web Page
    [Very comprehensive list of SGML- and XML-related information. The compiler of this site was originally a biblical scholar, so it includes plenty of information for academic users as well as commercial users. Updated almost daily.]
  • Text Encoding Initiative Home Page
    [Information about the TEI Project and how to obtain the Guidelines. Also a list of some TEI projects and some bibliography.]
  • Oxford University Computing Services. (1988) Micro-OCP Manual. Oxford: Oxford University Press.
    [Chapter 2 contains a description of the COCOA markup scheme.]

Metadata - Data About the Data

  • Giordano, Richard. (1995) "The TEI Header and the Documentation of Electronic Texts". Computers and the Humanities 29, 75-84.
    [Discussion and description of the TEI header by a librarian/computer scientist who has designed corpus headers.]
  • Miller, Paul. Metadata for the Masses.
    [An introduction to the Dublin Core. Published in Ariadne, from the UK Electronic Libraries Programme.]
  • Oxford Text Archive. (1997) Metadata for Electronic Texts Workshop Report
    [Report of meeting held in Oxford in May 1997. Discusses metadata needs for electronic texts with reference to the Dublin Core and TEI Headers. Appendix 4 maps Dublin Core elements against elements in the TEI header.]

Spoken Texts

  • Leech, Geoffrey, Greg Myers, and Jenny Thomas. (1995) Spoken English on Computer: Transcription, Mark-up and Application. London: Longman.
    [Papers discussing problems in working with "computer corpora of spoken discourse". Sections on issues and practices, applications and more specialized uses, and samples and systems of transcription.]

Software Tools for Text Analysis

  • Concordance. Developed by Rob Watt from his Web Concordances.
    [For Windows NT 4.0 and Windows 95/98. Makes word frequencies, concordances, and web concordances. User-definable alphabet and user-definable reference system (based on COCOA format).]
  • MonoConc Pro. Developed by Michael Barlow.
    [For Windows 95. Designed for linguists. Makes word frequencies and concordances.]
  • WordSmith Tools. Developed by Mike Scott.
    [For Windows 95. Designed for linguists. Makes word frequencies and concordances.]
  • Language Technology Group Software. Developed at the University of Edinburgh.
    [Set of routines callable from C programs for corpus processing and analysis. Includes SGML/XML tools.]

Collocations

  • Jones, S. and Sinclair, J. (1974) "English Lexical Collocations: A Study in Computational Linguistics". Cahiers de Lexicologie, 24.1, 15-61.
    [A ground-breaking article on the use of collocations in the study of lexis.]
  • Church, Kenneth, William Gale, Patrick Hanks, and Donald Hindle. (1991) "Using Statistics in Lexical Analysis". Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon. Ed. Uri Zernik. Hillsdale: Lawrence Erlbaum. 115-64.
    [Discussion of various statistical methods for identifying lexical relations in both raw and tagged text.]

Corpus Annotation

  • Garside, Roger, Geoffrey Leech, and Anthony McEnery. (Eds) (1997) Corpus Annotation: Linguistic Information from Text Corpora. London: Addison Wesley Longman.
    [Papers from the Lancaster group who tagged the British National Corpus. Discusses the nature of annotation, including grammatical tagging, syntactic annotation, semantic annotation, discourse annotation, software tools for annotation, and the broader applications of annotated corpora.
    A free demo of the Lancaster CLAWS4 tagging system is restricted to 300 words.]

Lexical Databases and Corpora

  • Calzolari, Nicoletta, and Antonio Zampolli. (1991) "Lexical Databases and Textual Corpora: A Trend of Convergence Between Computational Linguistics and Literary and Linguistic Computing". Research in Humanities Computing 1: Papers from the 1989 ACH-ALLC Conference. Eds Susan Hockey and Nancy Ide. Guest ed Ian Lancashire. Oxford: Oxford University Press. 272-307.
    [Important paper discussing the use of lexical databases in corpus analysis and developments in computational linguistics and literary and linguistic computing. Based on the work of the Istituto di Linguistica Computazionale, Pisa.]
  • Walker, Donald, Nicoletta Calzolari, and Antonio Zampolli. (Eds) (1995) Automating the Lexicon: Research and Practice in a Multilingual Environment. Oxford: Oxford University Press.
    [Collection of papers derived from the Grosseto Workshop, where the groundwork was laid for the development of reusable linguistic resources and corpora.]
  • WordNet
    [On-line lexical database of English developed originally for research into psycholinguistic theories of human lexical memory. Developed at the Cognitive Science Laboratory at Princeton University under the direction of George Miller.]

Corpus Validation

  • McEnery, Tony, and Lou Burnard, with Andrew Wilson and Paul Baker. Validation of Linguistic Corpora.
    [Report prepared for ELRA. Discusses methods for the formal validation of language corpora, particularly with reference to annotation and SGML-based schemes.]

Corpus Web Sites and Discussion Lists

  • Corpora discussion list.
    [Discusses corpora for language analysis and computational linguistics. List address: corpora@hd.uib.no. See also the Corpora List Archive.]
  • Corpus Linguistics Web Site. Maintained by Michael Barlow at Rice University.
    [Pointers to centres, corpora, software, bibliography etc.]
  • Tutorial: Concordances and Corpora. Developed by Catherine N. Ball.
    [For a course at Georgetown. Good starting point with discussion and pointers to other material.]


Web page prepared by Susan Hockey for a workshop given at the North American Symposium on Corpora in Linguistics and Language Teaching, University of Michigan, Thursday 20 May 1999, 9am - 12pm.
