Language
Research Index

Measuring Vocabulary Size via Online Technology

Browne, Cihi and Culligan

Copyright Lexxica 2007. All rights reserved. Last updated March 2007

Background and Research

Lexxica has developed a comprehensive system for individualized lexical knowledge assessments and the generation of individualized courses of language study. Most of the references in this report describe the assessment and instruction of English; however, the reader should bear in mind that all of the processes described apply equally to the assessment and instruction of languages (and semantic systems) other than English. Furthermore, it is the express intention of Lexxica to apply its systems to a wide variety of languages and language sub-domains, irrespective of whether they are a native or a non-native language of the learner.

Until recently, vocabulary learning was seen as peripheral to language acquisition, both theoretically and practically. Linguistic theory assigned word learning to a simple functional-associative model which of course could not accommodate syntax, and applied language researchers and teachers largely concurred with this view in an effort to be aligned with proper theories, and also in the knowledge that vocabulary was anyway too vast a quantity for direct instruction (but fortunately could be picked up more or less by itself).

With the grammar-translation method, and its focus on the syntax of the sentence, it was thought that once the students learned the grammar of the sentences, they would be able to slot in vocabulary and therefore generate language. The advent of the Audiolingual method, based on habit-formation, was much the same regarding vocabulary. Words were taught only within the structures that were the main focus. Since then, subsequent research has often attempted to account for second language acquisition (“SLA”) by looking at grammatical features in such areas as the developmental sequence (Cancino, Rosansky, & Schumann, 1978; Pienemann, 1989), the role of input (Loschky, 1994; Shook, 1994; White, Spada, Lightbown, & Ranta, 1991), and instruction (Dulay, & Burt, 1973; Ellis, 1992; Sharwood Smith, 1981; VanPatten, & Cadierno, 1993). From the publication of Corder's seminal paper in 1967 to Larsen-Freeman writing in 1991 on SLA research, the study of grammar and its acquisition has almost become synonymous with SLA.

Much of what was believed has now been reversed. Theoretically, it appears likely that language acquisition begins with word learning rather than syntax triggering, with words gradually "grammaticalized" through experience on a largely associative basis. Practically, studies throughout the 1980s and 1990s showed that vocabulary skill and knowledge are the precondition for most other language abilities and, in addition, the main source of variance in the final state of such abilities. It now seems clear that vocabulary acquisition does not happen by itself to any satisfactory degree, particularly as needed for first language literacy or a second language generally.

Over the years, a relatively small group of scholars has worked consistently to consider the needs of learners from a predominantly lexical perspective. Many of the questions they asked, and the results they found, are still relevant today. These questions included how many words a student needed to know, how these words should be sequenced, and what the student needed to know about these words.

One of the first debates centered on the number of words that a student needed to know. This necessarily led to defining what a word is, and what it means to know a word. While this research primarily focused on first language acquisition, there are obvious implications for SLA as well. The central argument was whether it would be possible to increase a learner's vocabulary by the direct instruction of words and their meaning. If estimates of native speakers' vocabulary were large, explicit instruction would not be feasible, and early research seemed to indicate that this was the case. Studies cited in D'Anna, Zechmeister, & Hall (1991) suggested a recognition vocabulary of 155,736 words (Seashore & Eckerson) and over 200,000 words (Hartman), but both studies suffered from methodological problems in defining what a word is.

Nagy and Anderson (1984) used six semantic categories to organize lexis from a corpus of high school English and found that students were exposed to 45,000 base words and 88,500 word families. They suggested that teaching children "words one by one, ten by ten, or even hundred by hundred would appear to be an exercise in futility" (p. 328), and that teachers should concentrate on teaching skills and strategies for independent word learning. Later research by Goulden, Nation, & Read (1990) questioned whether native speakers actually knew these words. By designing tests based on the frequencies of the words, the researchers determined that native speakers' vocabulary averages 17,200 words. This number suggests that the learning burden is not as insurmountable as previously suggested. Other research by D'Anna et al. (1991) found a similar result of 16,785 words.

Hazenberg & Hulstijn (1996) found that native Dutch speakers had a vocabulary of 18,807 words, but they also looked at the vocabulary of non-native students writing a Dutch university entrance exam and concluded that these students needed a minimum of 10,000 base words for entry into university. Laufer (1992) compared vocabulary size and reading comprehension scores and found that a recognition vocabulary of at least 3,000 words was the threshold for being able to read unsimplified texts. While this in no way negates Nagy and Anderson's argument that learning vocabulary from reading is important, there is sufficient evidence that teaching at least some of the words explicitly can have a meaningful effect on the students’ vocabulary.

While an assessment of vocabulary size provided part of the picture, other researchers looked at which words ESL students needed and how they should be sequenced. From the 1930s through the 1950s, a few researchers (Ogden, 1930; Richards, 1943; Thorndike, & Lorge, 1938; West, 1953) ranked vocabulary, using criteria mostly based on frequency and coverage. Ogden's 850 basic words and West's 2,000-word General Service List sought to help learners acquire a sufficient vocabulary to overcome what Coady (1993) would later refer to as “the paradox of learning words through context”, whereby students must have a command of enough words to read in the first place. With the 2,000 high-frequency words accounting for 81 percent of the running words in a text (Nation, 2001), students who have mastered this list are better prepared to handle the demands of reading.

However, research by Laufer (1989, 1992) clearly shows that even this amount may not be sufficient for academic study in an L2 environment or for reading unsimplified texts.

Another important issue involved the depth of knowledge necessary to understand the various dimensions of a meaningful and full representation of a given word. Depth of word knowledge has been categorized along a continuum from receptive to productive, into four categories consisting of form, position, function and meaning (Nation, 1990), or into comprehension processes (Qian, 1999) including pronunciation and spelling, morphological properties, syntactic properties, meaning, register, and word frequency. Many techniques have been suggested to increase vocabulary knowledge (Bauer & Nation, 1993; Crow & Quigley, 1985; Hafiz & Tudor, 1990; Joe, 1995; Nation, 1990, 1994a, 1994b; Nattinger, 1988; Williams, 1986; Wodinsky & Nation, 1988), varying in the explicitness of presentation from word list memorization techniques (Crow & Quigley, 1985) to learning through communicative interaction (Joe, 1995). The need to expand vocabulary learning in line with overall linguistic development has received considerable attention (Carter & McCarthy, 1988; Chall, 1987; Nation, 1990, 1994a, 1994b; Parry, 1993), but until now there has not been a technologically feasible way to achieve this expansion. Emerging technologies in communication and personal computers are ideally suited to support and advance understanding of vocabulary such that an efficient, personalized learning experience can be provided.

According to Brown (1995), an essential component of any pedagogical program is a needs analysis. Before designing and presenting materials, it is imperative to gather “information to find out how much the students already know and what they still need to learn” (p.35). In a vocabulary program, the first requirement is to identify, through the analysis of corpora, what words the students need to learn. The second is to test those words to find out how many of them the student already knows.

Since the pioneering work of George Kingsley Zipf and E. L. Thorndike, statistical analyses of large collections of texts have helped to determine some of the more valuable properties of usage. One such field of study concerns the relationship between the rank of a word, the frequency with which it occurs in text, and the cumulative coverage of the text. The most common word in English, the, occurs about 7 times in every 100 words of text. About a quarter of all the words in a text will be one of the 10 most common words. As words become less frequent, their contribution to the coverage of the text decreases. While the 100 most common words account for about half of all the words in a text, the next 100 account for only 7 percent, bringing the coverage up to 57 percent of the running words in a text. Nation (1990) summarizes Carroll, Davies, and Richman's research on frequency counts in the Brown corpus in Figure 1 below. Column one represents the cumulative number of words starting from the highest frequency. There are 86,741 different words in the Brown corpus. The second column shows the percentage of words in the corpus that the words account for. For example, the 10 most frequent words account for 23.7 percent of all the words in the corpus.

Figure 1

Different words    Percentage of running words
86,741             100
43,831             99
5,000              89.4
3,000              85.2
2,000              81.3
100                49
10                 23.7
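
A coverage table of this kind can be computed from any tokenized text. The following is a minimal sketch, not the procedure used in the cited research: the function name `cumulative_coverage` and the toy sentence are illustrative assumptions, and a real analysis would use a large corpus and a principled definition of what counts as a word.

```python
from collections import Counter

def cumulative_coverage(tokens, top_n_values):
    """Return {n: percent of running words covered by the n most frequent word types}."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Frequencies sorted from the most to the least frequent word type.
    ranked = [count for _, count in counts.most_common()]
    return {n: 100.0 * sum(ranked[:n]) / total for n in top_n_values}

# Toy text standing in for a corpus (illustrative only).
sample = "the cat sat on the mat and the dog sat on the rug".split()
coverage = cumulative_coverage(sample, [1, 3, 8])
for n, percent in coverage.items():
    print(f"{n:>2} most frequent types cover {percent:.1f}% of running words")
```

Even in this tiny sample, the single most frequent word covers a large share of the running words, and each additional word type adds less.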

While these figures differ slightly from corpus to corpus, the general trend is consistent. After about 2,000 words, lower frequency words contribute little to the coverage. Learners with fewer than 2,000 words would have great difficulty comprehending natural text, as approximately one out of every five words is unknown. Learners with 2,000 to 3,000 word vocabularies would still struggle with the text, with one unknown word out of seven. Nation (1991) argues that at least 4,000 word families, derived from the analysis of academic corpora, are needed before learners can read unsimplified academic texts, since this would provide about 95 percent coverage, leaving only one out of twenty words as unknown. For reading general, non-academic texts, the number of word families needed for 95 percent coverage would be much lower.
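
The "one out of every five" and "one out of seven" figures follow directly from the coverage percentages in Figure 1. The short sketch below shows the arithmetic; the function name `words_per_unknown` is an illustrative assumption.

```python
def words_per_unknown(coverage_percent):
    """Average number of running words per unknown word at a given coverage."""
    unknown_fraction = 1.0 - coverage_percent / 100.0
    return 1.0 / unknown_fraction

# Coverage values for 2,000 and 3,000 words are taken from Figure 1;
# 95 percent is the target coverage quoted for academic reading.
for label, coverage in [("2,000 words", 81.3), ("3,000 words", 85.2), ("95% target", 95.0)]:
    print(f"{label}: about one unknown word in every {words_per_unknown(coverage):.1f}")
```

At 81.3 percent coverage the result is roughly 5.3 running words per unknown word, at 85.2 percent roughly 6.8, and at 95 percent it is 20.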

As seen above, it is possible to extract lexical units that are common and frequent in a given genre of text by comparing their frequency in the genre to their expected frequency in general text. By this process, we can identify vocabulary for special purposes. For example, in the 100-million-word British National Corpus, the word nocturnal appears twice per million words. In a book about wildlife, we would expect to see it more frequently than that. This deviation, clustered with similar lexical deviations, would identify the text as being different from general text. Alternatively, by analyzing genre-specific text, we can identify the specialized vocabulary. This process was used to compile academic word lists (Coxhead, 2000; Xue & Nation, 1984). One of our goals is to identify and compile lists of words and multiword lexical units for a number of fields.
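
The comparison described above can be sketched as a simple frequency ratio. This is only an illustration: the wildlife-book counts are hypothetical, the function names are assumptions, and the source does not say which statistic is actually used to flag specialized vocabulary.

```python
def per_million(count, corpus_size):
    """Normalize a raw count to occurrences per million running words."""
    return count * 1_000_000 / corpus_size

def frequency_ratio(genre_count, genre_size, reference_per_million):
    """How many times more frequent a word is in the genre text than in the reference corpus."""
    return per_million(genre_count, genre_size) / reference_per_million

# Hypothetical wildlife book: 'nocturnal' occurs 12 times in 80,000 running words.
# The reference rate of 2 per million is the British National Corpus figure quoted above.
ratio = frequency_ratio(12, 80_000, reference_per_million=2.0)
print(f"'nocturnal' is {ratio:.0f} times more frequent than in general text")
```

Words with a ratio far above 1 are candidates for a special-purpose list; in practice a minimum count threshold would also be needed so that a single chance occurrence does not qualify.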

Traditionally, after having identified which words were necessary in order to comprehend a written text, researchers such as Thorndike and Lorge (1944) and West (1955) evaluated the words for usability and generalizability in order to compile a list for teaching. Until now, however, given the large number of words, and the problems with test equating and item indices under classical test theory, it has been practically impossible to find out which words the students knew. Scores on different tests by different groups could not be compared. Under classical test theory, comparisons between scores on different tests could only be done by a regression analysis of the two tests with normal distributions from a common population. Regression analysis assumes that both tests have equivalent variance. Given the large number of tests that would be required, these assumptions could never be met in practice. Word frequency was thus used as a substitute for word difficulty.

The advent of Item Response Theory (“IRT”) in the late 1950s and 1960s brought with it many benefits. Because of its focus on the probabilistic relationship between the ability of the test taker and the difficulty of the item, many of the assumptions of classical test theory, such as equivalence of variance and normal distribution, simply do not apply. Two other significant benefits are the application of IRT to large scale testing, and the ability to assign a score to the difficulty of an item regardless of the group who took the test. The System uses a unique IRT model to estimate word difficulty from large scale vocabulary testing, and applies the findings to generate both ability estimates for the person and specific sequences of target vocabulary for learning.

IRT is a theory in the scientific sense of the word in that it attempts to predict an outcome based on observable phenomena. It is a probabilistic model that attempts to explain the response of a person to an item. In its simplest form, item response theory posits that the probability of a random person j with ability θ_j answering a random item i with difficulty b_i correctly is conditioned upon the ability of the person and the difficulty of the item. In other words, if a person has a high ability and the item is easy, he or she will probably get the item correct. Conversely, if a person has a low ability and the item is difficult, he or she will probably get the item wrong. When we analyze item responses, we are trying to answer the question, “what is the probability of a person with a given ability responding correctly to an item with a given difficulty?” The probabilities of a given response can be expressed mathematically through a number of different formulae, depending upon the situation.
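
One widely used formula of this kind is the one-parameter logistic (Rasch) model. It is given here only as an illustration; the source does not state which formula Lexxica's own model uses. Here x_ij = 1 denotes a correct response by person j to item i:

```latex
P(x_{ij} = 1 \mid \theta_j, b_i) = \frac{e^{\theta_j - b_i}}{1 + e^{\theta_j - b_i}}
```

When ability equals difficulty the probability is 0.5; it rises toward 1 as the ability exceeds the difficulty and falls toward 0 as the difficulty exceeds the ability.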

With large-scale testing of our wordlists, we have been able to compare the measured difficulty of the word with a mathematical manifestation of the rank of the word. This can be seen in Figure 2 below. The horizontal axis represents the ranking of the frequency of the words. The data are arranged in ascending order, with the highest frequency words on the left. For this particular manifestation, the vertical axis shows the difficulty index as calculated from 4,217 Yes/No tests on 6,000 words. The data are arranged in increasing difficulty with the easiest items at the bottom and the hardest items at the top. The data show the relationship between frequency and difficulty as represented by the regression line. They also show that there are many words of low frequency that are well recognized, and many high frequency words that may not be known.

Figure 2

Yes/No tests, also known as Lexical Decision Tasks, ask learners to identify known words from a list of real and non-words, or pseudo-words. While these types of tests are not common in most language classrooms, they have a long history in research in the field of psycholinguistics, where they have played an important role in our understanding of how the mental lexicon works. These tests are often analyzed using a branch of Decision Theory known as Signal Detection Theory (“SDT”), which compares the learner’s responses to the real words and non-words, and determines the probability of a correct decision as well as the degree of accuracy with which the learner makes the decision. With the increasing availability of computer adaptive testing (“CAT”), these methods are now making the jump from the research lab to the digital classroom.
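
As an illustration of how SDT separates knowledge from guessing, the sketch below computes the standard sensitivity index d' from a learner's Yes/No responses. It is a generic textbook calculation, not a description of Lexxica's scoring; the function name and the correction used to keep rates away from 0 and 1 are assumptions.

```python
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' for a Yes/No vocabulary test.

    hits: real words marked as known; misses: real words marked as unknown;
    false_alarms: pseudo-words marked as known;
    correct_rejections: pseudo-words marked as unknown.
    """
    # Adding 0.5 to each count and 1 to each total keeps both rates strictly
    # between 0 and 1 (an assumed correction), so the inverse normal is defined.
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    false_alarm_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)

# A learner who says "yes" to most real words and to few pseudo-words.
print(round(d_prime(hits=35, misses=5, false_alarms=2, correct_rejections=18), 2))
```

A learner who says "yes" to real words and pseudo-words at the same rate obtains a d' of zero, however many real words were claimed as known.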

Unlike conventional pencil and paper tests, where the reliability and accuracy of the test can only be established through the statistical analysis of the responses after the test has been taken, CAT predetermines the level of accuracy and then interactively administers items, based upon the response pattern of the test taker, until the desired level of accuracy has been achieved. Since the test is constantly zeroing in on a respondent’s level based on their correct or incorrect responses, far fewer questions are needed to accurately estimate their level.

The accuracy of a measure is associated with the Standard Error of Measurement (“SEM”). With conventional pencil and paper tests, the SEM is derived from the Standard Deviation (“SD”) and reliability of the test as shown in Formula 1 below.
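
In its standard classical test theory form, assumed here to be the relationship Formula 1 expresses, with r_xx denoting the reliability of the test:

```latex
SEM = SD \sqrt{1 - r_{xx}}
```

For example, a test with SD = 10 and reliability 0.91 has SEM = 10 × √0.09 = 3 score points.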

With IRT, the standard error of the estimate, a statistic related to SEM, is derived from the amount of information that each item contributes to the test results. Formula 2 shows the information function for the estimate based on a test, and Formula 3 illustrates the relationship with the standard error of the estimate.
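
In their standard IRT form, assumed here to be the relationships Formulas 2 and 3 express, the test information is the sum of the item information functions, and the standard error of the estimate is the reciprocal of its square root. For the one-parameter logistic model, each item's information is I_i(θ) = P_i(θ)(1 − P_i(θ)), where P_i(θ) is the probability of a correct response to item i:

```latex
I(\theta) = \sum_{i=1}^{n} I_i(\theta)
\qquad
SE(\hat{\theta}) = \frac{1}{\sqrt{I(\hat{\theta})}}
```

Each administered item adds information, so the standard error shrinks as the test proceeds; items whose difficulty is close to the ability estimate contribute the most.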

In a CAT, the respondent is presented with the first item, usually drawn from a pool of items very close to the population mean. Depending on how the test taker responds, the next item will be drawn from approximately one standard deviation from the mean. This will continue until there is at least one item answered correctly and one item answered incorrectly, or in the case of a Yes/No test, one real word is identified as being known and one real word is identified as unknown. At this point, a maximum likelihood estimate of the test taker’s ability is calculated using the derivative of the likelihood function, as well as the test information function and standard error shown above.

Each next item is selected to give the maximum amount of information at the estimate of the ability. Then the maximum likelihood, test information, and standard error of the estimate are calculated again. This process is repeated until the desired level of accuracy is achieved. The amount of time necessary to take the test is variable because the process depends on the responses of the test taker. However, because each item is selected to maximize the information and minimize the error based on an individual's responses, these tests are always more efficient than conventional pencil and paper tests or non-interactive computer tests.
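
The adaptive loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Lexxica's implementation: it uses the one-parameter logistic (Rasch) model, a Newton-Raphson search for the maximum likelihood estimate, a step of one logit unit during the opening phase, and a hypothetical item bank and simulated test taker. It also ignores the pseudo-words of a Yes/No test.

```python
import math

def p_correct(theta, b):
    """Rasch probability of a correct response at ability theta, difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_information(theta, b):
    p = p_correct(theta, b)
    return p * (1.0 - p)

def mle_ability(responses, theta=0.0, iterations=25):
    """Newton-Raphson maximum likelihood estimate from (difficulty, score) pairs.

    Requires at least one correct and one incorrect response.
    Returns the ability estimate and its standard error.
    """
    for _ in range(iterations):
        gradient = sum(score - p_correct(theta, b) for b, score in responses)
        information = sum(item_information(theta, b) for b, _ in responses)
        # Clamp the estimate so a large step cannot run away.
        theta = max(-6.0, min(6.0, theta + gradient / information))
    information = sum(item_information(theta, b) for b, _ in responses)
    return theta, 1.0 / math.sqrt(information)

def run_cat(item_bank, answer, target_se=0.5, max_items=50):
    """Administer items until the standard error reaches target_se."""
    remaining = list(item_bank)
    responses = []
    theta, se = 0.0, float("inf")
    while remaining and len(responses) < max_items and se > target_se:
        if {score for _, score in responses} == {0, 1}:
            # Mixed responses so far: pick the most informative remaining item.
            b = max(remaining, key=lambda d: item_information(theta, d))
        else:
            # Opening phase: start at the mean, then step up after a correct
            # response and down after an incorrect one.
            if responses:
                last_b, last_score = responses[-1]
                target = last_b + (1.0 if last_score else -1.0)
            else:
                target = 0.0
            b = min(remaining, key=lambda d: abs(d - target))
        remaining.remove(b)
        responses.append((b, answer(b)))
        if {score for _, score in responses} == {0, 1}:
            theta, se = mle_ability(responses, theta)
    return theta, se, len(responses)

# Hypothetical bank of 61 item difficulties from -3.0 to 3.0, and a simulated
# test taker who answers correctly only items easier than 1.0.
bank = [round(-3 + 0.1 * i, 1) for i in range(61)]
theta, se, n_items = run_cat(bank, lambda b: 1 if b < 1.0 else 0)
print(f"ability estimate {theta:.2f}, standard error {se:.2f}, items used {n_items}")
```

The loop stops well before the bank is exhausted, which illustrates why an adaptive test can need far fewer items than a fixed-length test to reach the same standard error.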

Much research in the area of second language vocabulary acquisition has focused or depended on estimates of a learner’s overall vocabulary size (i.e., breadth of vocabulary). Tests such as Nation’s (2001) Vocabulary Levels Test attempt to measure respondents’ passive recognition of vocabulary words at different frequency bands for purposes such as measuring group gains, program evaluation, or student placement. While useful, such tests have a number of limitations, including an inability to assess how well particular words are known (Read, 1988). More recent work (Nassaji, 2004; Vermeer, 2001; Paribakht & Wesche, 1993; Wesche & Paribakht, 1996) has begun to explore how to assess a learner’s level of familiarity with a given word.

In general, knowledge of a given lexical item is considered to exist on a continuum of less knowledge to more knowledge (see Figure 4), from a receptive understanding of the item at the beginning stages to a more productive understanding at later stages of learning. In other words, early stages of vocabulary knowledge might include the receptive ability of being able to recognize a word in a written sentence or stream of speech, while later stages might include the ability to use the word productively in a written or spoken sentence.

Over the years a variety of depth of knowledge scales based on student self-assessment-type questionnaires have been developed. These include Eichholz and Barbe’s test of word knowledge (1961), D’Anna, Zechmeister, and Hall’s vocabulary knowledge scale (1991) and Zimmerman’s 4-point vocabulary knowledge scale (1997).

Lexxica’s assessment of vocabulary knowledge builds on ideas from the above models in order to provide a fast and efficient means of assessing certain aspects of a respondent’s depth of vocabulary knowledge utilizing an interactive computer interface. In order to make the system’s online test as efficient as possible, no depth of knowledge questions are asked until after the Yes/No section of the test is complete and the system has been able to determine the approximate number of words the respondent knows. Once this has been established, a small number of depth of knowledge questions will be asked, at the respondent’s estimated level of difficulty and next at progressively lower difficulty levels. The reason for testing at lower levels is that respondents’ depth of knowledge of words located toward the high end of the respondent’s level of difficulty will most likely be quite shallow. Deeper understanding is to be expected for easier words. The system seeks to generate information about a respondent’s depth of knowledge at different levels of difficulty in order to best determine a more useful and effective individualized course of study.

For non-native language knowledge assessments, the system is capable of testing respondents on certain words that have been identified by the respondent as being known, in order to ascertain which of these items are false friends (i.e., words from the respondent’s native language that are spelled or sound like words in the non-native language being tested but whose usage or meaning in the native language is actually very different), and which are genuinely known.

Once the system has obtained the respondent’s ability estimate based on the test results, the system can convert the score into an estimate of the number of words the respondent knows through use of its regression formula. By converting ability estimates into the number of words known, respondents and their teachers can receive a useful absolute assessment of language knowledge. Not only can the assessment score be used to accurately gauge a respondent’s learning progress over periods of time, it can also provide a more meaningful way to interpret respondent test results, and it can be used to create, select or assign ability appropriate graded reading material at any level of ability.

In developing individualized courses of vocabulary study, one likely approach would be to prioritize words as follows. The first group of words to be presented for study would be important and highly frequent general vocabulary words for which a learner has indicated a low depth of knowledge. In other words, these are common lexical items that a respondent thinks they know, but of which they have little or incorrect knowledge as revealed by the test. The next group of words to be presented would be important high frequency general vocabulary words at a slightly higher difficulty level. These words will be presented in order of importance, as will all words within their specific sub-domain. Where possible (not all learners have a special field of interest), the next group of words to be presented would be drawn from specialized words appropriate to a learner’s professional field or area of special interest at or near the learner’s assessed level of ability, and for which indications are that little depth of knowledge is possessed. The fourth group of words to be presented would be specialized words appropriate to a learner’s professional field or area of interest that are above a learner’s assessed ability level. The fifth group of words to be presented would be important low frequency general vocabulary words slightly above a learner’s assessed vocabulary ability. The sixth group of words to be presented for study would be important low frequency general vocabulary words that are well above a learner’s assessed vocabulary size and at a higher level of difficulty.
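
The six-group ordering above can be sketched as a sort key. Everything specific in this sketch is a hypothetical assumption made for illustration: the `Word` fields, the `near` threshold separating "slightly above" from "well above", and the rule that words fitting none of the six groups are left unscheduled. The source describes the groups only in prose.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Word:
    text: str
    difficulty: float       # difficulty estimate on the same scale as ability
    high_frequency: bool    # important high frequency general vocabulary
    specialized: bool       # belongs to the learner's field of interest
    importance: float       # larger means more important
    shallow: bool = False   # test revealed little depth of knowledge

def priority_group(word: Word, ability: float, near: float = 0.5) -> Optional[int]:
    """Map a word to one of the six groups, or None if it fits no group."""
    gap = word.difficulty - ability
    if word.specialized:
        if gap > near:
            return 4                          # specialized, above ability
        return 3 if word.shallow else None    # specialized, at or near ability
    if word.high_frequency:
        if word.shallow:
            return 1                          # frequent but poorly known
        return 2 if 0 < gap <= near else None # frequent, slightly harder
    if gap <= 0:
        return None
    return 5 if gap <= near else 6            # low frequency general words

def study_sequence(words, ability):
    """Order words by group, then by descending importance within a group."""
    grouped = [(priority_group(w, ability), w) for w in words]
    scheduled = [(g, w) for g, w in grouped if g is not None]
    scheduled.sort(key=lambda pair: (pair[0], -pair[1].importance))
    return [w.text for _, w in scheduled]

# Hypothetical word data for a learner with an ability estimate of 0.0.
words = [
    Word("obsequious", 2.0, False, False, 0.4),
    Word("estoppel", 1.5, False, True, 0.6),
    Word("issue", -1.0, True, False, 0.9, shallow=True),
    Word("meander", 0.4, False, False, 0.5),
    Word("the", -3.0, True, False, 1.0),
    Word("tort", 0.2, False, True, 0.7, shallow=True),
    Word("derive", 0.3, True, False, 0.8),
]
print(study_sequence(words, ability=0.0))
```

In this example the well-known word "the" is left out of the sequence, and the remaining words appear in group order from the poorly known frequent word to the hardest low frequency word.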
