Frequency Lists and Corpora

Frequency Lists

Frequency lists, i.e. lists of the most frequently used words or phrases, have several purposes. Some language learners use them to decide what vocabulary to learn first. In the context of accessibility, they can be used when producing easy-to-read texts or when evaluating the readability of texts. (Readability encompasses more than just vocabulary or word frequency.)

Frequency lists should be based on a corpus. Today, an electronic corpus of a million words is considered small. This page gives a non-exhaustive overview of frequency lists, organised by language or language family. Many important or “big” languages are missing from this overview because no frequency dictionaries or lists based on a sufficiently large corpus were found for them. For example, the following languages, each with 60 million or more native speakers, are missing: Hindi, Bengali, Punjabi, Javanese, Malay, Telugu, Vietnamese, Marathi, Tamil, Urdu and Italian.

The case of Italian illustrates the problem. The Italian English Frequency Dictionary - Essential Vocabulary: 2500 Most Used Words & 421 Most Common Verbs by J. L. Laide does not say whether it was based on a corpus. The Frequency Dictionary of Italian Words by Alphonse Juilland dates from 1973 and is probably out of date. The frequency lists on Wiktionary are based on very limited corpora (e.g. just subtitles). Lexiteria has frequency lists with the top 200 words for 40 languages; they may be able to produce longer lists on demand.

Frequency Lists for English

Frequency Lists for Other Germanic Languages

Frequency Lists for Chinese Languages

Frequency Lists for Romance Languages

Frequency Lists for Slavic Languages

Frequency Lists for Finno-Ugric Languages

Frequency Lists for Austronesian Languages

Frequency Lists for Indo-Iranian Languages

Frequency Lists for Semitic Languages

Frequency Lists for Turkic Languages

Frequency Lists for Language Isolates

Other Frequency Lists

Corpora

Corpora for Semitic Languages

Algorithms and Code for Creating Frequency Lists
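
As a starting point, the basic idea behind a corpus-based frequency list can be sketched in a few lines of Python: split the text into word tokens, count them, and rank them by count. This is a minimal illustration and not the method behind any of the lists on this page. The function name `frequency_list`, the tokenisation pattern and the sample sentence are assumptions made for this example. A real project would also need to handle lemmatisation, multi-word expressions and languages that do not separate words with spaces.

```python
import re
from collections import Counter

# Hypothetical tokenisation rule (an assumption for this sketch):
# runs of Unicode letters, optionally joined by an internal
# apostrophe or hyphen, e.g. "didn't" or "easy-to-read".
WORD_PATTERN = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")


def frequency_list(text, top_n=None):
    """Return (word, count) pairs, most frequent first.

    Words are lower-cased before counting. Ties are broken
    alphabetically so that the output is deterministic.
    """
    tokens = WORD_PATTERN.findall(text.lower())
    counts = Counter(tokens)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked if top_n is None else ranked[:top_n]


if __name__ == "__main__":
    # Tiny made-up sample; a usable list needs a corpus of many
    # millions of words.
    sample = "The cat saw the dog. The dog didn't see the cat!"
    for word, count in frequency_list(sample, top_n=3):
        print(word, count)
```

Counting surface forms like this treats "see" and "saw" as different words. Whether that is acceptable depends on the purpose of the list, for example vocabulary learning versus readability evaluation.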

Research on Corpora and Frequency Lists

Corpus and Concordance Tools