There are likely over two million English words in all forms when scientific terms are included, and likely over four million if organism and species designations were included (Crystal, 1990).
Webster’s Third New International Dictionary contains about 267,000 entries. Paul Nation classified 113,161 of those entries as word families (Nation, 1990). [We are happy to disclose here that Paul Nation advises Lexxica and the development of its services.]
The largest credible estimate we know of is from Henry Kucera. He has suggested the probable existence of some 375,000 English words, including proper words and special terms. He further suggested the 375,000 words would extend to about 600,000 English words in all forms based on his widely accepted ratio of 1 to 1.6 (Kucera, 1982).
There are a variety of ways to count the words in the English language. Take for example the following six words: ‘accept’, ‘accepts’, ‘accepting’, ‘acceptable’, ‘acceptance’, and ‘unacceptable’.
If we were to count these in terms of a “word family”, there would be just one word, ‘accept’. If we were to count individual word forms, all six items would be counted. Which is correct? We believe the answer lies somewhere in between. Our preliminary findings indicate that the statistical item difficulty factors for ‘accept’, ‘accepts’ and ‘accepting’ are very close, whereas the statistical difficulties for ‘acceptable’, ‘acceptance’ and ‘unacceptable’ are all quite different. One hypothesis is that the brain treats these six items as four different Base Words: ‘accept’, ‘acceptable’, ‘acceptance’, and ‘unacceptable’. As our database grows we will be able to identify with increasing accuracy how many discrete Base Words there are in any language.
By our reckoning, a Base Word is any word, or set of word forms, that the brain recognizes as one lexical unit. Base Words may manifest in multiple related word forms, as with ‘accept’, ‘accepts’, and ‘accepting’, or Base Words may manifest in just one form, as with the word ‘the’. Base Words that manifest in multiple forms share a defining characteristic in our approach in that each form must have the same, or almost the same, statistical difficulty factor among a population. In practical terms the Base Word designation means that whenever a person indicates recognition of any one form of a Base Word, there can be a high degree of confidence that they will recognize all forms of that Base Word.
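The grouping idea above can be illustrated with a minimal sketch. This is not Lexxica's procedure: the function name `group_base_words`, the `tolerance` threshold, and the difficulty values are all invented for illustration. It simply places related word forms in the same group when their difficulty values are close to each other.

```python
def group_base_words(difficulties, tolerance=0.25):
    """Group related word forms whose difficulty values are close.

    `difficulties` maps each word form to a difficulty value. A form joins
    the current group when it lies within `tolerance` of that group's
    first (easiest) member; otherwise it starts a new group.
    """
    groups = []
    for form, diff in sorted(difficulties.items(), key=lambda kv: kv[1]):
        if groups and abs(diff - groups[-1][0][1]) <= tolerance:
            groups[-1].append((form, diff))
        else:
            groups.append([(form, diff)])
    return [[form for form, _ in group] for group in groups]


# Invented difficulty values, for illustration only.
example = {
    "accept": -1.0, "accepts": -0.9, "accepting": -0.95,
    "acceptable": 0.2, "acceptance": 0.9, "unacceptable": 1.6,
}
base_words = group_base_words(example)
```

With these made-up values the three inflected forms fall into one group and each derived form stands alone, which mirrors the four-Base-Word hypothesis described above.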
V-Check English Base Word recognition findings to date:
NOTICE: Lexxica systems are still early in their evolution. As of the date first indicated above, V-Check is certified for Japanese persons to assess English vocabulary knowledge up to the 6,000th most important Base Word. This chart is provided solely to demonstrate how word recognition will differ between different population groups. It represents mean vocabulary sizes by population group; however, any findings marked with an asterisk (*) are UNCERTIFIED PROVISIONAL findings generated from very limited samples.
| Culture / Demographic | Average (mean) number of all known English Base Words |
|---|---|
| Japan / Age 21-25, C/U, M&F | 3,708 |
| Japan / Age 17-20, HS, M&F | 2,984 |
| Japan / Age 14-16, JHS, M&F | 2,102 |
| USA / Age 35+, M/PhD, M&F | 42,732 * |
| USA / Age 25-55, C/U, M&F | 33,739 * |
| USA / Age 17-20, HS, M&F | 22,996 * |
| USA / Age 14-16, ES, M&F | 17,239 * |
| China / Age 35+, M/PhD, M&F | 12,854 * |
| Korea / Age 21-25, C/U, M&F | 4,244 * |
| Taiwan / Age 21-25, C/U, M&F | 4,294 * |
Our methods and applications are lexical in nature - not grammatical or structural. Speaking from a lexical perspective, knowing an average of 19 out of every 20 words (95 percent coverage) of a written text is sufficient for effective comprehension. 95 percent coverage would permit a reader to comprehend the meaning without aid of a dictionary. The meanings of the 5 percent (or fewer) unrecognized words could be adequately grasped through context.
The term “coverage” describes the proportion of words in a text or spoken dialog that are known by the receiver. For reading, research indicates that knowing 19 of every 20 words, or 95 percent coverage, is the important threshold beyond which people can self-learn new words without the aid of a dictionary. Our research indicates that the 5,000 most important English Base Words are more than sufficient to “cover” 95 percent of general written English, and just the 1,500 most important Base Words will effectively “cover” communication in spoken English.
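As a rough illustration of the definition, coverage can be computed as the share of running words in a text that appear in a reader's known vocabulary. The sketch below is a simplification: the function name `coverage`, the tokenizing pattern, and the sample data are assumptions, and it matches raw word forms rather than Base Words.

```python
import re


def coverage(text, known_words):
    """Return the fraction of running words in `text` found in `known_words`."""
    # Simplistic tokenizer: lowercase runs of letters and straight apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in known_words)
    return known / len(tokens)


text = "The cat sat on the mat near the door"
known = {"the", "cat", "sat", "on", "near", "door"}
ratio = coverage(text, known)  # 8 of the 9 running words are known
```

A reader whose ratio for a given text reaches 0.95 would be at the threshold described above.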
Certain words tend to be better known among populations, and to occur more frequently in print, than other words. Looking at frequency of occurrence, for example, ‘the’ is the most frequent word in the English language, representing, or covering, about 7 percent of all the English words one is likely to ever encounter. Knowledge of the top 10 most frequent words represents, or covers, 25 percent of the words used in almost all written texts. Coverage, then, generally describes the relationship between known vocabulary and the lexicon of a corpus. The chart below describes the relationship between high frequency English words and the well-known British National Corpus (BNC).
| High Frequency Words | Percentage Coverage of BNC |
|---|---|
| 1 | 7 |
| 10 | 25 |
| 100 | 50 |
| 1000 | 75 |
| 2000 | 85 |
| 3300 | 90 |
| 4000 | 95 |
| 6000 | 98 |
| 375,000 | 100 |
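A table of this kind can be derived from any tokenized corpus by ranking words by frequency and accumulating their share of the running words. The sketch below uses a hypothetical helper, `words_needed`, and a toy token list; it counts raw word forms and is not the method used to produce the BNC figures above.

```python
from collections import Counter


def words_needed(tokens, target):
    """Return how many of the most frequent words reach `target` coverage."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    # Walk the words from most to least frequent, accumulating coverage.
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered / total >= target:
            return rank
    return len(counts)


tokens = "the cat and the dog and the bird saw the cat".split()
half = words_needed(tokens, 0.5)   # top-ranked words needed for 50 percent
full = words_needed(tokens, 1.0)   # every distinct word is needed for 100 percent
```

The steep early rise and long tail seen in the table come from the same mechanism: a handful of very frequent words account for a large share of all running words.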
These BNC findings are based on the word family method of counting. Lexxica organizes and counts words using a Base Word approach. Base Words are single citations that represent sets of related word forms. Base Words include standard inflected word forms and in some cases derived word forms. The more widely known word family method, described by Nation (1991), includes multiple derived word forms in each citation based on a fixed set of criteria and without regard for difficulty.
We have found that derived forms of words tend to vary widely in terms of difficulty. Lexxica hypothesizes that related word forms having the same measure of difficulty are being stored and processed similarly by the brain. Word forms that have different difficulties are likely being treated as different words. As a result, at the 95 percent level of coverage, the Base Word method will typically indicate about 25 percent more word citations than the word family method. For example, BNC researchers have estimated that 4,000 words cover 95 percent of general texts, whereas Lexxica estimates that 5,000 words are required. Admittedly there is tremendous overlap in the two approaches, and regardless of which is favored, it is highly recommended to make instruction of these most important words an integral part of any language program.
English is a remarkably efficient language with which people can easily survive and even thrive with limited vocabularies. Beyond the first several thousand important words, the remaining low importance words add tremendous depth, flexibility and color to the language, but for most communications they are optional. However, even low frequency words quickly become statistically important to comprehension and coverage when the subject matter concerned is a special purpose domain such as a professional vocation, an academic focus, or a career interest.
Research shows that 95 percent coverage enables one to comprehend meaning without the aid of a dictionary, that coverage of less than 95 percent requires the use of a dictionary, and that coverage of less than 85 percent will generally defy comprehension regardless of dictionary use.
Following are sample paragraphs concerning two popular media figures. The first is shown at 67 percent coverage where just 13.4 out of every 20 running words are recognizable. Selected words in the text have been scrambled to simulate the experience of reading unknown words. Try to figure out the meaning of this passage and/or identify the missing words:
Brad Pitt told Marchilate mviswabe that he and Angelina Jolie will not be winplurtzd until the smorte to winplurtz is fromptes to bilps and plortes. Pitt, who trimpted the fitzleg of the smigteglar Bortslig fratmack, says, “Angie and I will consider gigrit the tonk when everyone else in the nonctron who wants to be winplurtzd is bleah.”
Lexxica co-founder Charles Browne’s research has identified that 67 percent is the average coverage Japanese high school students have for their EFL textbooks. Regardless of the purpose or focus of any textbook, at 67 percent coverage, reading it will be nearly impossible.
Below is the same paragraph shown at 95 percent coverage, where 19 out of every 20 running words are recognizable and only a few words are scrambled. Again, try to figure out the meaning and/or identify the missing words.
Brad Pitt told Esquire magazine that he and Angelina Jolie will not be married until the right to marry is given to gays and plortes. Pitt, who graced the cover of the magazine’s October issue, says, “Angie and I will consider tying the tonk when everyone else in the country who wants to be married is able.”
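A demonstration like the two passages above can be approximated programmatically. The following sketch is an assumption-laden simplification: the helper `mask_text` hides randomly chosen words by shuffling their letters, whereas the passages above replace words with invented nonsense forms, and a real reader's unknown words would tend to be the less frequent ones rather than a random selection.

```python
import random


def mask_text(words, coverage, seed=0):
    """Scramble roughly (1 - coverage) of the words; return (words, hidden positions)."""
    rng = random.Random(seed)  # fixed seed so the demonstration is repeatable
    n_hide = round(len(words) * (1 - coverage))
    hidden = set(rng.sample(range(len(words)), n_hide))
    masked = []
    for index, word in enumerate(words):
        if index in hidden:
            letters = list(word)
            rng.shuffle(letters)  # a shuffled word may occasionally equal the original
            masked.append("".join(letters))
        else:
            masked.append(word)
    return masked, hidden


sample = ("the quick brown fox jumps over the lazy dog while a small "
          "bird sings in the tall green tree nearby").split()
masked_95, hidden_95 = mask_text(sample, 0.95)  # hides 1 of the 20 words
masked_67, hidden_67 = mask_text(sample, 0.67)  # hides 7 of the 20 words
```

Because only whole words can be hidden, a 20-word sample can only approximate 67 percent coverage; the figure of 13.4 words in 20 is an average over longer texts.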
Lexxica’s online V-Check test and V-Flash level check employ similar regimented data collection and statistical processes to determine the mathematical probability of any semantic or lexical item being recognized by a member of a particular population group.
The collective responses of a population group are compiled to establish the aggregate difficulty factor for each semantic item. The result is what we call a Recognition Ogive that is unique to its population group.
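One common way to turn aggregated yes/no responses into a difficulty value per item is to take the log-odds of non-recognition within the group. The sketch below shows that generic calculation only; the function name `item_difficulties`, the add-one smoothing, and the response data are assumptions, and the source does not state how Lexxica computes its difficulty factors.

```python
import math


def item_difficulties(responses):
    """Map each item to a logit difficulty from a group's 0/1 recognition responses."""
    difficulties = {}
    for item, answers in responses.items():
        # Add-one smoothing keeps the proportion strictly between 0 and 1.
        p_known = (sum(answers) + 1) / (len(answers) + 2)
        # Higher values mean fewer people in the group recognized the item.
        difficulties[item] = math.log((1 - p_known) / p_known)
    return difficulties


# Invented responses from four test takers (1 = recognized, 0 = not recognized).
responses = {
    "the": [1, 1, 1, 1],
    "accept": [1, 1, 0, 0],
    "ogive": [0, 0, 0, 1],
}
difficulty = item_difficulties(responses)
ordered = sorted(difficulty, key=difficulty.get)  # easiest item first
```

Sorting items by such a value gives an ordering from easiest to hardest for that population group, which is the kind of ordering along which test items could be selected.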
When taking a Lexxica test or level check, an individual is presented with semantic items (words, terms, expressions, polywords, idioms, constructs, signs, images, etc.) selected from points along a Recognition Ogive belonging to the user’s population group. Lexxica’s IRT-based Computer Adaptive Test quickly determines an accurate measure of user ability along the user’s Recognition Ogive. An essential and proprietary element of the process is our inclusion of false semantic items to control for guessing. False semantic items are introduced in accordance with standard precepts of Signal Detection Theory.
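Since the actual process is described as proprietary, the following is only a generic textbook sketch of the two ideas mentioned: a one-parameter (Rasch) IRT response curve, and a standard correction for guessing that uses the rate at which a test taker claims to know false items. The function names and the sample numbers are assumptions and do not describe Lexxica's implementation.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of recognizing an item under a one-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def corrected_hit_rate(hits, real_items, false_alarms, false_items):
    """Adjust the share of real items claimed as known for guessing.

    Uses the common correction (h - f) / (1 - f), where h is the hit rate
    on real items and f is the false-alarm rate on false items.
    """
    h = hits / real_items
    f = false_alarms / false_items
    if f >= 1.0:
        return 0.0  # every false item was claimed as known; no usable signal
    return max(0.0, (h - f) / (1.0 - f))


# Hypothetical session: 80 of 100 real items and 2 of 20 false items claimed as known.
adjusted = corrected_hit_rate(80, 100, 2, 20)
p_equal = rasch_probability(0.0, 0.0)  # ability equal to difficulty
```

In an adaptive test, a curve like `rasch_probability` is what lets each response narrow the estimate of the test taker's position, while the false items give a basis for discounting over-claimed knowledge.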
Lexical item difficulty data is systematically collected with each new Lexxica test. Lexxica’s processes are adept at assessing not only Base Word recognition but also Base Word depth of knowledge. Over time, our understanding of how the brain treats and processes all forms of semantic items will be greatly advanced.
Several important observations have emerged.