Last modified on 09/21/08 20:42:41
DIY corpus using Search Engines (like Google)
- Open source development of large corpora
- Creating general-purpose corpora using automated search engine queries
- An Crúbadán - web crawling software for the automatic development of large text corpora for minority languages
Natural Language Toolkit (NLTK)
- NLTK - over 40 corpora and corpus samples are included with the NLTK Corpus Distribution (750 MB).
ARPA
- CMU Hub4 - language model in ARPA format
- CMU ARPA-format bigram language model - 57,138 unigrams and about 10 million bigrams
- xvoice-sphinx language model
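The unigram and bigram totals quoted for models like these come from the `\data\` header at the top of an ARPA-format file, which lists one `ngram N=count` line per n-gram order. The following is a minimal sketch of reading that header, not code taken from any of the projects listed above. The function name `read_arpa_counts` and the toy model in `SAMPLE` are made up for illustration.

```python
import re


def read_arpa_counts(lines):
    """Return {order: count} from the \\data\\ header of an ARPA-format model."""
    counts = {}
    in_header = False
    for raw in lines:
        line = raw.strip()
        if line == "\\data\\":
            in_header = True
            continue
        if not in_header:
            continue
        match = re.fullmatch(r"ngram\s+(\d+)\s*=\s*(\d+)", line)
        if match:
            counts[int(match.group(1))] = int(match.group(2))
        elif line.startswith("\\"):
            # The first section marker (such as \1-grams:) ends the header.
            break
    return counts


# Hypothetical toy model, invented for this example only.
SAMPLE = """\\data\\
ngram 1=3
ngram 2=2

\\1-grams:
-0.4771 <s> -0.3010
-0.4771 hello -0.3010
-0.4771 </s>

\\2-grams:
-0.3010 <s> hello
-0.3010 hello </s>

\\end\\
"""

print(read_arpa_counts(SAMPLE.splitlines()))  # {1: 3, 2: 2}
```

To inspect a downloaded model, pass an open file object instead of `SAMPLE.splitlines()`. Only the header is read, so this stays fast even for a model with millions of bigrams.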
Possible sources of written data (written corpora) for the creation of Language Models
- U.S. Government Printing Office
- Project Gutenberg
- The Universal Digital Library
- Wikipedia Spoken Articles
- Hansard Canada
- Google Research word n-gram models and training corpora
- Complete Works of William Shakespeare
- Moby Project
- Google Books
- ibiblio
- Corpora and Corpus-based Computational Linguistics - Manuel Barbera's Web Resources Reference Guide
- Alex: A Catalogue of Electronic Texts on the Internet
- Wiretap Electronic Text Archive - Gopher-based (accessible via a browser)
- The Electronic Text Center of the University of Virginia
- Aligned Hansards of the 36th Parliament of Canada
- Enron emails
- Enron Email Dataset - William W. Cohen, CMU
- Federal Energy Regulatory Commission website
- 20+ Places for Public Domain E-Books
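Text from sources like these is typically reduced to n-gram counts as a first step toward a language model. The following is a rough sketch of that counting step only, using a deliberately crude tokenizer. The function names `tokenize` and `count_ngrams` and the two sample sentences are assumptions for illustration; a real toolkit would also handle sentence splitting, normalization, smoothing and back-off weights.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep runs of letters and apostrophes (a crude tokenizer).
    return re.findall(r"[a-z']+", text.lower())


def count_ngrams(sentences, order):
    """Count n-grams of the given order, padding each sentence with <s> and </s>."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + tokenize(sentence) + ["</s>"]
        for i in range(len(tokens) - order + 1):
            counts[tuple(tokens[i:i + order])] += 1
    return counts


# Illustrative input only; in practice the sentences would come from a corpus.
sentences = ["To be, or not to be.", "That is the question."]
unigrams = count_ngrams(sentences, 1)
bigrams = count_ngrams(sentences, 2)

print(unigrams[("to",)])       # 2
print(bigrams[("to", "be")])   # 2
```

Using tuples as keys keeps unigrams and bigrams in the same shape, so the same function works for any order.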
Other Sources but with Licensing Restrictions
- TAPoR Text Analysis Portal for Research at the University of Alberta
- WestburyLAB USENET corpus - Creative Commons Attribution-NonCommercial-NoDerivs license
- Google
- Web 1T 5-gram Version 1 - linguistic education and research only
Multilingual Corpora
- European Parliament Proceedings Parallel Corpus 1996-2006
- Leipzig Corpora Collection - licensing restrictions