== Natural Language Toolkit (NLTK) ==
* [http://nltk.org/index.php/Corpora#Text_Corpora NLTK] - Over 40 corpora and corpus samples are included with the NLTK Corpus Distribution (750 MB).
** [http://nltk.org/index.php/Corpora#Parsed_Corpora Parsed Corpora]
** [http://nltk.org/index.php/Corpora#Tagged_Corpora Tagged Corpora]
** [http://nltk.org/index.php/Corpora#Text_Corpora Text Corpora]
** [http://nltk.org/index.php/Corpora#Lexicons Lexicons]
** [http://nltk.org/index.php/Corpora#Categorized_Corpora Categorized Corpora]
** [http://nltk.org/index.php/Corpora#Miscellaneous Miscellaneous]

== ARPA ==
* [http://www.speech.cs.cmu.edu/sphinx/models/hub4opensrc_jan2002/ CMU Hub4] - language model in ARPA format
* [http://www.speech.cs.cmu.edu/sphinxman/fr5.html CMU ARPA-format bigram language model] - 57138 unigrams and about 10 million bigrams ([http://cmusphinx.org/models/lm/bn.bigram.arpa link])
* [http://xvoice.sourceforge.net/xvoice-sphinx/status.php xvoice-sphinx language model]

== Possible sources of written data (written corpora) for the creation of Language Models ==
* [http://www.gpoaccess.gov U.S. Government Printing Office]
* [http://www.gutenberg.org Project Gutenberg]
* [http://www.ulib.org/index.html The Universal Digital Library]
* [http://en.wikipedia.org Wikipedia Spoken Articles]
* Hansard Canada
** [http://www.parl.gc.ca/common/Chamber_House_Debates.asp?Language=E&Parl=39&Ses=1 House of Commons]
** [http://www.parl.gc.ca/common/Chamber_Senate_Debates.asp?Language=E&Parl=39&Ses=1 Senate]
* [http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html Google Research word n-gram models and training corpora]
* [http://www-tech.mit.edu/Shakespeare/ Complete Works of William Shakespeare]
* [http://www.dcs.shef.ac.uk/research/ilash/Moby/ Moby Project]
* [http://books.google.com/ Google Books]
* [http://www.ibiblio.org/ ibiblio]
* [http://www.bmanuel.org Corpora and Corpus-based Computational Linguistics] - Manuel Barbera's Web Resources Reference Guide
* [http://www.infomotions.com/alex/downloads/ Alex: A Catalogue of Electronic Texts on the Internet]
* [http://wiretap.area.com/ Wiretap Electronic Text Archive] - Gopher-based (accessible via a browser)
* [http://etext.lib.virginia.edu/ The Electronic Text Center of the University of Virginia]
* [http://www.isi.edu/natural-language/download/hansard/ Aligned Hansards of the 36th Parliament of Canada]
* Enron emails
** [http://www.cs.cmu.edu/~enron/ Enron Email Dataset] - William W. Cohen, CMU
** [http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp# Federal Energy Regulatory Commission website]
* [http://mashable.com/2007/11/12/public-domain-ebook-sources/ 20+ Places for Public Domain E-Books]

== Other Sources but with Licensing Restrictions ==
* [http://tapor.ualberta.ca/News/news.php?channel=alberta&id=177 TAPoR] - Text Analysis Portal for Research at the University of Alberta
* [http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html WestburyLAB] USENET corpus - Creative Commons Attribution-NonCommercial-NoDerivs
* Google
** [http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13 Web 1T 5-gram Version 1] - linguistic education and research only
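
For the ARPA-format language models listed above, it can be useful to confirm the n-gram counts (for example, the 57138 unigrams quoted for the CMU bigram model) before loading a large file. The following is a minimal sketch in Python, assuming the common ARPA layout in which a <code>\data\</code> header contains lines of the form <code>ngram N=COUNT</code>. The function name <code>read_arpa_counts</code> and the sample data are hypothetical and are not taken from any of the linked projects.

```python
import re
from typing import Dict, Iterable

# Header lines look like "ngram 1=57138" (assumed ARPA header layout).
_COUNT_LINE = re.compile(r"^ngram\s+(\d+)\s*=\s*(\d+)$")


def read_arpa_counts(lines: Iterable[str]) -> Dict[int, int]:
    """Return {n-gram order: count} read from the \\data\\ header."""
    counts: Dict[int, int] = {}
    in_header = False
    for raw in lines:
        line = raw.strip()
        if line == "\\data\\":
            in_header = True
            continue
        if not in_header:
            continue  # skip any free-form text before the header
        match = _COUNT_LINE.match(line)
        if match:
            counts[int(match.group(1))] = int(match.group(2))
        elif line.startswith("\\"):
            break  # first "\N-grams:" section reached; header is over
    return counts


# Tiny made-up model, for illustration only.
SAMPLE = [
    "\\data\\",
    "ngram 1=3",
    "ngram 2=2",
    "",
    "\\1-grams:",
    "-0.4771 <s> -0.3010",
    "-0.4771 hello -0.3010",
    "-0.4771 </s>",
    "",
    "\\2-grams:",
    "-0.3010 <s> hello",
    "-0.3010 hello </s>",
    "",
    "\\end\\",
]

if __name__ == "__main__":
    print(read_arpa_counts(SAMPLE))
```

Because the loop stops at the first n-gram section, an open file object can be passed in directly and only the header of a multi-gigabyte model is read.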