# Vocabulary size to learn a language well



## pimlicodude

Hi, we all know languages have various ways of forming vocabulary. Some have an open-ended vocabulary size due to suffixation and compounds. Some like English have roots from multiple languages, leading to the notion that English has an unusually large vocabulary. I think we can leave that discussion for another thread. That is not what this is about here.

I think 10,000 lemmas is regarded as good for fluency in most languages. In the Frequency Dictionary of Russian, there are 10,000 entries. It is stated there that in their large corpus, those 10,000 words are all the ones that occur at least 8 times per one million words of text. Clearly it depends on your corpus, and what books and genres are covered. But most lists of 10,000 most frequent words in a language will have a large overlap between them, even if the ordering of words is not the same. That list of 10,000 words has imperfective and perfective forms of Russian words listed separately. 

Nick Brown, the author, states that beyond this list of 10,000, the words are not worth learning for a foreigner, because you could read Russian for years and not come across them. He cites дятел, "woodpecker", which native speakers will know, but which doesn't really improve an L2 speaker's proficiency as the likelihood of ever coming across it is low. So my presumption is:

A1 command of a language = approx 1,000 lemmas
A2 command of a language = approx 2,000 lemmas
B1 command of a language = approx 4,000 lemmas
B2 command of a language = approx 6,000 lemmas
C1 command of a language = approx 8,000 lemmas (Brown states these are the words found more than 10 times per million)
C2 command of a language = approx 10,000 lemmas (all words more than 8 times per million)

But note that this will give you only "proficiency" (and that assumes you have the pronunciation and grammar right). It does not mean you can read literature without encountering new words. Unfortunately, for learners, to read all literature like a native does, you would need at least 50,000 lemmas, and the more educated native speakers of any language have 100,000+ lemmas. So to aim to read as fluently as a native speaker really requires a lifetime of learning.

I am in awe of the many L2 speakers of English who do indeed read English novels very fluently without needing a dictionary. That reflects the ubiquitousness of English nowadays, but it is indeed a very high level of command of the language. 

It almost seems like a madcap enterprise, to attempt to learn a WHOLE LANGUAGE. 100,000 lemmas? The problem is the low-frequency lemmas you come across might never be seen again, making it difficult to master the less frequent vocabulary. Another issue is active vs. passive vocabularies. If you can master 10,000 lemmas in your active vocabulary and get the grammar perfect, then you are doing well. To have 10,000 lemmas in your passive vocabulary but actively use far fewer (I'm probably in this category in Russian - there are thousands of words I recognise but don't use) isn't so great.

[Note: the plural of lemma can be lemmas or lemmata, but I've never seen lemmata in real use. I think few native speakers would recognise it.]


----------



## AndrasBP

pimlicodude said:


> Nick Brown, the author, states that beyond this list of 10,000, the words are not worth learning for a foreigner, because you could read Russian for years and not come across them. He cites дятел, "woodpecker", which native speakers will know, but which doesn't really improve an L2 speaker's proficiency as the likelihood of ever coming across it is low.


I don't think "woodpecker" is a good example. Interest in wildlife is not uncommon and it's a well-known bird species.
It might be infrequent in books but if you go for a hike in the woods or a walk in the park with Russian speakers, you're likely to be able to use it.
Some Lithuanian animal words that I learned and later made active use of include _woodpecker, stork, owl, hawk, seagull, eel, pike, catfish, wasp, tick, mosquito, mole, lynx, boar, tortoise_ and _lizard_.


----------



## pimlicodude

AndrasBP said:


> I don't think "woodpecker" is a good example. Interest in wildlife is not uncommon and it's a well-known bird species.
> It might be infrequent in books but if you go for a hike in the woods or a walk in the park with Russian speakers, you're likely to be able to use it.
> Some Lithuanian animal words that I learned and later made active use of include _woodpecker, stork, owl, hawk, seagull, eel, pike, catfish, wasp, tick, mosquito, mole, lynx, boar, tortoise_ and _lizard_.


In English, I only remember "woodpecker" from children's books.
Andras, do you find there is a big gap between effective fluency and being able to read literature like a native?


----------



## AndrasBP

pimlicodude said:


> In English, I only remember "woodpecker" from children's books.


Really? They might not be common in your area. I see/hear them almost every day.



pimlicodude said:


> Andras, do you find there is a big gap between effective fluency and being able to read literature like a native?


Yes, I think so, but the size of the gap depends on what kind of literature you read.


----------



## elroy

AndrasBP said:


> Some Lithuanian animal words that I learned and later made active use of include _woodpecker, stork, owl, hawk, seagull, eel, pike, catfish, wasp, tick, mosquito, mole, lynx, boar, tortoise_ and _lizard_.


Interesting!  The only ones I would expect most C2 speakers to know are owl, hawk, mosquito, tortoise/turtle, and lizard.

Let’s see.  I would say I’m a C2 in German.  Let’s see how many I know without looking them up:

_woodpecker = Hmmm… I’m almost positive it’s something -stecher.  Baumstecher??? Holzstecher??? 
stork = Storch (only because it’s a cognate!)
owl = Eule
hawk = I wanna say it’s a cognate of “falcon”: Falke??? 
seagull = ??? 
eel = Aal (cognate)
pike = ??? (Even in English all I know is that it’s a fish!) 
catfish = I’m not sure, but it’s probably a cognate: Katzenfisch
wasp = Wespe (cognate)
tick = ??? 
mosquito = Mücke
mole = Maulwurf 
lynx = ??? 
boar = I’m not sure, I would guess Wildschwein (wild pig) 
tortoise = Schildkröte
lizard = I’m sure I’ve known this in the past, but it’s not coming to me!  _

So I only know about half (7/16) for sure, and I’m partially sure about a quarter (4/16).

Even in Arabic, my first language, I don’t know all of them because I was schooled in English:

_woodpecker = نقّار
stork = ???
owl = بومة
hawk = صقر
seagull = نورس
eel =  ???
pike = ???
catfish = ??? If it’s a literal translation, then سمكة القط
wasp = I know دبّور is a bee-like insect, but not a bee, so it might be a wasp (or a hornet or a bumblebee…) 
tick = ??? 
mosquito = ناموسة
mole = خلد 
lynx = ??? 
boar = I’m pretty sure this one is “wild pig”: خنزير برّي
tortoise = سلحفاة
lizard = سحلية_

Not that different from German!  I know half for sure (8/16), and I’m partially sure about almost a quarter (3/16).  (I counted “boar” under “partially sure” even though I’m 95%+ sure.)


----------



## elroy

I just looked up the ones I didn’t know or wasn’t sure of:

_woodpecker = Hmmm… I’m almost positive it’s something -stecher. Baumstecher??? Holzstecher??? 
*I wasn’t even close.  It’s “Specht.”  (I’ve heard this before.) *
stork = Storch (only because it’s a cognate!)
owl = Eule
hawk = I wanna say it’s a cognate of “falcon”: Falke??? 
*I was right! *
seagull = ??? 
*Möwe (I’ve heard this before.) *
eel = Aal (cognate)
pike = ??? (Even in English all I know is that it’s a fish!) 
*Hecht (never heard) *
catfish = I’m not sure, but it’s probably a cognate: Katzenfisch
*This seems to exist, but it seems that “Wels” and “Katfisch” are more common.  I’ve heard “Wels” before. *
wasp = Wespe (cognate)
tick = ??? 
*Zecke (never heard — it’s a cognate!) *
mosquito = Mücke
mole = Maulwurf 
lynx = ??? 
*Luchs (never heard — probably a cognate) *
boar = I’m not sure, I would guess Wildschwein (wild pig) 
I was right! 
tortoise = Schildkröte
lizard = I’m sure I’ve known this in the past, but it’s not coming to me!  
*Eidechse (definitely heard before!) *_

So I only know about half (7/16) for sure, and I’m partially sure about a quarter (4/16).

Even in Arabic, my first language, I don’t know all of them because I was schooled in English:

_woodpecker = نقّار
stork = ???
*لقلق (I’ve heard this before.) *
owl = بومة
hawk = صقر
seagull = نورس
eel = ???
*أنقليس (I’ve heard this before.) *
pike = ???
*كراكي (never heard) *
catfish = ??? If it’s a literal translation, then سمكة القط
*سلور (never heard) *
wasp = I know دبّور is a bee-like insect, but not a bee, so it might be a wasp (or a hornet or a bumblebee…) 
*I was right! *
tick = ??? 
*قراد (I’ve heard this before.) *
mosquito = ناموسة
mole = خلد 
lynx = ??? 
*وشق (never heard) *
boar = I’m pretty sure this one is “wild pig”: خنزير برّي
*I was right! *
tortoise = سلحفاة
lizard = سحلية_

It’s interesting that for most of these, I either guessed right or had heard them before.


----------



## pimlicodude

elroy said:


> I just looked up the ones I didn’t know or wasn’t sure of:
> 
> _woodpecker = Hmmm… I’m almost positive it’s something -stecher. Baumstecher??? Holzstecher???
> *I wasn’t even close.  It’s “Specht.”  (I’ve heard this before.) *
> stork = Storch (only because it’s a cognate!)
> owl = Eule
> hawk = I wanna say it’s a cognate of “falcon”: Falke???
> *I was right! *
> seagull = ???
> *Möwe (I’ve heard this before.) *
> eel = Aal (cognate)
> pike = ??? (Even in English all I know is that it’s a fish!)
> *Hecht (never heard) *
> catfish = I’m not sure, but it’s probably a cognate: Katzenfisch
> *This seems to exist, but it seems that “Wels” and “Katfisch” are more common.  I’ve heard “Wels” before. *
> wasp = Wespe (cognate)
> tick = ???
> *Zecke (never heard — it’s a cognate!) *
> mosquito = Mücke
> mole = Maulwurf
> lynx = ???
> *Luchs (never heard — probably a cognate) *
> boar = I’m not sure, I would guess Wildschwein (wild pig)
> I was right!
> tortoise = Schildkröte
> lizard = I’m sure I’ve known this in the past, but it’s not coming to me!
> *Eidechse (definitely heard before!) *_
> 
> So I only know about half (7/16) for sure, and I’m partially sure about a quarter (4/16).
> 
> Even in Arabic, my first language, I don’t know all of them because I was schooled in English:
> 
> _woodpecker = نقّار
> stork = ???
> *لقلق (I’ve heard this before.) *
> owl = بومة
> hawk = صقر
> seagull = نورس
> eel = ???
> *أنقليس (I’ve heard this before.) *
> pike = ???
> *كراكي (never heard) *
> catfish = ??? If it’s a literal translation, then سمكة القط
> *سلور (never heard) *
> wasp = I know دبّور is a bee-like insect, but not a bee, so it might be a wasp (or a hornet or a bumblebee…)
> *I was right! *
> tick = ???
> *قراد (I’ve heard this before.) *
> mosquito = ناموسة
> mole = خلد
> lynx = ???
> *وشق (never heard) *
> boar = I’m pretty sure this one is “wild pig”: خنزير برّي
> *I was right! *
> tortoise = سلحفاة
> lizard = سحلية_
> 
> It’s interesting that for most of these, I either guessed right or had heard them before.


Elroy, if you're C2 in German, does that mean you can read German literature and understand every word? This is my point: the 10,000 words for C2 proficiency can still leave an L2 speaker floundering when it comes to literature, as you need to up that to at least 50,000 to tackle real fiction. Maybe it just takes years of practice? If you can read German fiction well, then good on you! I just don't know how the learners of English around the world get to be so good at English. The Scandinavians and the Dutch are amazing.


----------



## Sobakus

My conclusion from elroy's messages is that different languages and their associated cultures have different "useful to know animals", the knowledge of whose names is more or less ubiquitous, and a native speaker would be confused to learn that someone fluent in the language doesn't know them. These animals tend to regularly find their way into fairytales, proverbs and sayings as well as productively coined metaphors. _дя́тел_ 'woodpecker' is one such word - this is a very stereotypical bird in the Russian culture and there's a very common modern use of its name as a humorous insult for a stupid and inept person, from the popular belief that spending most of one's time drilling trees with one's head can't have a very positive _impact_ on one's brain.

Thus elroy's attempt at translating the names from English isn't entirely on the mark - the Lithuanian animal names whose knowledge came in handy to AndrasBP aren't the same that an English speaker might want to learn in the first place, and translating them into German or Arabic makes the connection even more tenuous. What's more, I often find myself knowing both the Russian and the English name for an animal or a plant but not realising they're translations of each other, or even refer to the exact same species.

I suppose this goes to show that broad generalisations about vocabulary size without reference to the exact semantic fields, parts of speech as well as specific contexts and communication types/mediums might be a misleading way to talk about this topic.


----------



## elroy

pimlicodude said:


> Elroy, if you're C2 in German, does that mean you can read German literature and understand every word?


Of course not!!!



pimlicodude said:


> the 10,000 words for C2 proficiency can still leave an L2 speaker floundering when it comes to literature, as you need to up that to at least 50,000 to tackle real fiction.


It depends on the work of fiction!  I don’t think we can generalize.



pimlicodude said:


> I just don't know how the learners of English around the world get to be so good at English.


1. On average, learners of English are exposed to a hell of a lot more English than learners of other languages.
2. The sheer number of English learners across the world means that, by the law of something-or-other (probability?), we’re going to get a decent number of English learners with very advanced vocabularies (beyond C2).  So they’re fairly visible because there’s a sizable number of them in _absolute_ terms, but they’re still probably a tiny, maybe even negligible, fraction of all English learners, so the figure (whatever it is) is not as striking in _relative_ terms.
3. Many non-native speakers of English study at English-speaking universities, probably way more than is the case for other languages, increasing the number of C2+ people.
4. How do you know how many of these people can comfortably read literature?  Yes, many Scandinavians and Dutch people are amazing in terms of practical, everyday proficiency, but I would wager that many of them would “flounder” when reading literature.
5. Even many native speakers can’t read literature comfortably and effortlessly, so I’m not really sure this is a good benchmark for anything.
6. Even the most amazing C2+ speakers still have some basic flaws and imperfections (at least I’ve never met one who didn't).  I remember meeting a Danish guy with incredible, astonishing English, who said “on all four,” a nails-on-chalkboard non-native mistake (although “four” is perfectly logical!).


----------



## AndrasBP

Sobakus said:


> My conclusion from elroy's messages is that different languages and their associated cultures have different "useful to know animals"





Sobakus said:


> the Lithuanian animal names whose knowledge came in handy to AndrasBP aren't the same that an English speaker might want to learn in the first place,


I agree but I'd still like to entertain @elroy with eight more that I know in Lithuanian. 

_hedgehog, squirrel, swan, sparrow, pigeon, ant, dragonfly, snail_


----------



## elroy

Sobakus said:


> Thus elroy's attempt at translating the names from English isn't entirely on the mark


I’m not sure I follow you.  My exercise actually proved the very point you’re making!  Note that I said:


elroy said:


> The only ones I would expect most C2 speakers to know are owl, hawk, mosquito, tortoise/turtle, and lizard.


That’s because these animals are fairly universally known/familiar across cultures.  The others are not.


----------



## pimlicodude

elroy said:


> Of course not!!!
> 
> 
> It depends on the work of fiction!  I don’t think we can generalize.
> 
> 
> 1. On average, learners of English are exposed to a hell of a lot more English than learners of other languages.
> 2. The sheer number of English learners across the world means that, by the law of something-or-other (probability?), we’re going to get a decent number of English learners with very advanced vocabularies (beyond C2).  So they’re fairly visible because there’s a sizable number of them in _absolute_ terms, but they’re still probably a tiny, maybe even negligible, fraction of all English learners, so the figure (whatever it is) is not as striking in _relative_ terms.
> 3. Many non-native speakers of English study at English-speaking universities, probably way more than is the case for other languages, increasing the number of C2+ people.
> 4. How do you know how many of these people can comfortably read literature?  Yes, many Scandinavians and Dutch people are amazing in terms of practical, everyday proficiency, but I would wager that many of them would “flounder” when reading literature.
> 5. Even many native speakers can’t read literature comfortably and effortlessly, so I’m not really sure this is a good benchmark for anything.
> 6. Even the most amazing C2+ speakers still have some basic flaws and imperfections (at least I’ve never met one who hasn’t).  I remember meeting a Danish guy with incredible, astonishing English, who said “on all four,” a nails-on-chalkboard non-native mistake (although “four” is perfectly logical!).


2. Yes, you're right. Most learners of any language don't get that far with it. Languages are an odd teaching specialism: in what other subject area do teachers teach their subject in the knowledge that the majority of their students will not master the subject?
4. Yes, I think people may appear super-proficient in speech, but that doesn't necessarily mean they're native-equivalent when it comes to reading. Even the Swedes.
6. Ah, yes, you're right. That happens. I have also seen some podcasts by very fluent speakers of English on YouTube that are really hard to understand. Some learners can be so fluent that they don't realise they still have an accent and so speed up in their English to the point where it is hard to understand.

Also - I noticed this on the Be Fluent in Russian YouTube channel - the (great) presenter stated that his English is indistinguishable from that of a native speaker and went on to explain how he got so good in English. That was followed by scores of comments below the video telling him he most definitely does have an accent and most definitely does make linguistic mistakes. Many/most learners of a language overestimate their level.


----------



## pimlicodude

AndrasBP said:


> _woodpecker, stork, owl, hawk, seagull, eel, pike, catfish, wasp, tick, mosquito, mole, lynx, boar, tortoise_ and _lizard_.


woodpecker: дятел
stork:? *[аист - I've encountered it a couple of times, but couldn't recall it]*
owl: сова
hawk: ястреб
seagull:? I feel like I might have come across it, but don't know it. *[чайка - I've encountered it many times, particularly as I read Chekhov at university, but couldn't recall it]*
eel:? *[угорь  - never heard of it]*
pike:? *[щука - I've encountered it a couple of times, but couldn't recall it]*
catfish:? *[сом - I've encountered it once in the supermarket, but couldn't recall it]*
wasp: оса
tick:? *[клещ/клещик - never heard of it]*
mosquito: something likе комар, but I'm not sure if I remember it correctly.
mole: ? *[крот - I've encountered it a few times, but couldn't recall it]*
lynx: рысь, I think
boar: I've come across it, but can't remember it. Maybe кабун or something like it? *[actually кабан]*
tortoise:? I think черепаха is turtle. I might have come across tortoise, but I can't remember it. *[OK, черепаха seems right for both tortoise and turtle]*

As you can see, I didn't do well at this.


----------



## elroy

AndrasBP said:


> _hedgehog, squirrel, swan, sparrow, pigeon, ant, dragonfly, snail_


This is a much easier list!

I would expect most C2 speakers to know squirrel, pigeon, ant, and snail — but not hedgehog, swan, sparrow, and dragonfly. 

German:
Igel
Eichhörnchen
Schwan
Spacht, I think???  Something like that. 
Taube
Ameise
??? 
Schnecke 

Arabic:
??? قنفذ = porcupine, and I have a hunch this may also be used for hedgehog. OR قنفذ is actually hedgehog in Standard Arabic but used for porcupine in colloquial. 
سنجاب
بجعة
This is an interesting one.  عصفور is used as a generic word for “small songbird,” but I _believe_ it’s also the word for “sparrow.” 
حمامة
نملة
??? 
حلزونة

Again, very similar results for both languages!


----------



## pimlicodude

AndrasBP said:


> _hedgehog, squirrel, swan, sparrow, pigeon, ant, dragonfly, snail_


hedgehog: ёж
squirrel: белка (I came across an alternative, dialectal, word векша recently, but don't know which region it is from)
swan: лебедь
sparrow: I've come across it, but don't remember it. *[воробей - I've encountered it many times, but couldn't recall it]*
pigeon: same as dove? голубь? The film Love and Pigeons is also translated Love and Doves.
ant: муравей
dragonfly:? I've just remembered стрекоза.
snail: I've come across it, but can't remember it. *[улитка - I've encountered it a couple of times, but couldn't recall it]*


----------



## elroy

Checking…

German:
Igel
Eichhörnchen
Schwan
Spacht, I think??? Something like that.
*Spatz.  I was close!  Definitely heard before.  *
Taube
Ameise
??? 
*Libelle.  Never heard! *
Schnecke 

Arabic:
??? قنفذ = porcupine, and I have a hunch this may also be used for hedgehog. OR قنفذ is actually hedgehog in Standard Arabic but used for porcupine in colloquial. 
*The latter seems to be true! *
سنجاب
بجعة
This is an interesting one. عصفور is used as a generic word for “small songbird,” but I _believe_ it’s also the word for “sparrow.” 
*It looks like there’s a specific name for a sparrow: دوري.  Never heard. *
حمامة
نملة
??? 
*يعسوب.  Never heard!*
حلزونة


----------



## pimlicodude

elroy said:


> Checking…
> 
> German:
> Igel
> Eichhörnchen
> Schwan
> Spacht, I think??? Something like that.
> *Spatz.  I was close!  Definitely heard before.  *
> Taube
> Ameise
> ???
> *Libelle.  Never heard! *
> Schnecke
> 
> Arabic:
> ??? قنفذ = porcupine, and I have a hunch this may also be used for hedgehog. OR قنفذ is actually hedgehog in Standard Arabic but used for porcupine in colloquial.
> *The latter seems to be true! *
> سنجاب
> بجعة
> This is an interesting one. عصفور is used as a generic word for “small songbird,” but I _believe_ it’s also the word for “sparrow.”
> *It looks like there’s a specific name for a sparrow: دوري.  Never heard. *
> حمامة
> نملة
> ???
> *يعسوب.  Never heard!*
> حلزونة


You did well, but won't there be many variants in the Arabic dialects? Maybe there is a parallel list for the colloquial Arabic you speak?


----------



## elroy

pimlicodude said:


> You did well, but won't there be many variants in the Arabic dialects? Maybe there is a parallel list for the colloquial Arabic you speak?


Thank you!

I went with what I would use in Palestinian Arabic.  Yes, it's a bit complicated with Arabic, because there are differences between dialects and also differences between the dialects and Standard Arabic. (See the hedgehog/porcupine example.)  But there's a lot of overlap too!  The more low-frequency an animal name is, the more likely it is to be the same across dialects and Standard Arabic (with the possible exception of pronunciation).

(But let's not derail this thread!  There are plenty of threads about Arabic varieties.)


----------



## elroy

pimlicodude said:


> Languages are an odd teaching specialism: in what other subject area do teachers teach their subject in the knowledge that the majority of their students will not master the subject?


Well, it's not a binary; it's not all or nothing.  It's a spectrum, and there are many levels between zero and mastery that can still be useful to the learner, depending on the person and their circumstances.



pimlicodude said:


> Some learners can be so fluent that they don't realise they still have an accent


I wasn't actually thinking/talking about accent.  This thread is about lexis, and phonology is a whole other kettle of fish.  There are many people with great accents but meager vocabularies and vice versa.



pimlicodude said:


> Many/most learners of a language overestimate their level.


Hmmm... In my experience, there are plenty of people who _under_estimate their skills.  I wouldn't say one is significantly more likely than the other.


----------



## Sobakus

elroy said:


> I’m not sure I follow you.  My exercise actually proved the very point you’re making!  Note that I said:
> [...]
> That’s because these animals are fairly universally known/familiar across cultures.  The others are not.


Yes, I suppose one of the points I thought I was making I never did make properly at all. When I first looked at your translations, and even more so when I look at the recent messages, it occurred to me that being able to translate animal/plant names is not the same as being able to understand them in texts. This was what I was hinting at when I said:


Sobakus said:


> What's more, I often find myself knowing both the Russian and the English name for an animal or a plant but not realising they're translations of each other, or even refer to the exact same species.


As your and pimlicodude's messages show, many animal names are known by learners as passive vocabulary, and in fact this is just as true of native speakers! Many won't be able to match the picture with the word precisely; nevertheless they will be able to understand what animal is being talked about in general terms. Moreover it's the general associations that are the most crucial in literary texts, and these are usually easier to internalise than specific reference. This effect may have been predicted (not to say intended) by the author, similar to when using specialised hunting, shipping, whaling or other "traditional" terminology.

This feeds into the crucial difference between "knowing" vocabulary in the sense of being able to translate it, and having truly acquired it through repeated comprehension (and use) in context and with a specific communicative intent, i.e. when using the language as an actual means of communication. Having memorised a translation leads to temporary success at the former task, but that is normally short-lived; a genuinely acquired feeling for the word's meaning, connotations, associations etc. is much more valuable (especially when reading literature!) and is practically permanent.

Even partial acquisition and passive comprehension proves highly beneficial, and it would be entirely incorrect to say that the learner "doesn't know the word" just because they weren't able to come up with it in a translation exercise. This principle extends well beyond taxonomy, though notably in the domain of modern technical terminology direct equivalence between different languages is common - and this is quite abnormal.


----------



## AndrasBP

The fact that I know all 24 animals in both English and Lithuanian doesn't mean that I can read literature in Lithuanian as easily as in English. Far from it!


----------



## elroy

AndrasBP said:


> The fact that I know all 24 animals in both English and Lithuanian doesn't mean that I can read literature in Lithuanian as easily as in English. Far from it!


I think this speaks to another important point: language learning is not linear.  In other words, if we were to list all the lemmas of a language in order of decreasing frequency, it's certainly not the case that anyone who knows a given word will know all the ones above it!  You've acquired these 24 lemmas, some of which are fairly low-frequency, in both Lithuanian and English, yet I suspect that among the lemmas between the most low-frequency item on the list and the most high-frequency item in the language, you know a whole lot more English lemmas than Lithuanian ones!

I know some pretty low-frequency lemmas in German and Spanish without knowing some lemmas that are much more high-frequency!  For example, I know the Spanish word for "celery," which I've stumped a couple very advanced non-native Spanish speakers with!  In German, I can say "thimble," a _very_ low-frequency word.  Do you know the words for "celery" and "thimble" in Lithuanian?


----------



## AndrasBP

elroy said:


> I suspect that among the lemmas between the most low-frequency item on the list and the most high-frequency item in the language, you know a whole lot more English lemmas than Lithuanian ones!


Yes, definitely.



elroy said:


> Do you know the words for "celery" and "thimble" in Lithuanian?


I don't!


----------



## elroy

Yeah, I probably "shouldn't" know them in Spanish and German, either.   The German word for "thimble" is literally "finger hat" ("Fingerhut"), which is adorable enough to be easy to remember, and that's probably why I know it!  I have no idea why I know the Spanish word for "celery" ("apio").  Maybe I remembered it because it's so different from the English!  In German it's a cognate ("Sellerie").


----------



## Frank78

pimlicodude said:


> Elroy, if you're C2 in German, does that mean you can read German literature and understand every word? This is my point: the 10,000 words for C2 proficiency can still leave an L2 speaker floundering when it comes to literature, as you need to up that to at least 50,000 to tackle real fiction. Maybe it just takes years of practice? If you can read German fiction well, then good on you! I just don't know how the learners of English around the world get to be so good at English. The Scandinavians and the Dutch are amazing.



10,000 words of active vocabulary is the amount an uneducated native speaker has in his inventory in German. For the average Joe it's between 12,000 and 16,000. So the 10,000 for an L2 speaker are pretty realistic for a C2 level.


----------



## pimlicodude

Frank78 said:


> 10,000 words of active vocabulary is the amount an uneducated native speaker has in his inventory in German. For average Joe it's between 12,000 and 16,000. So the 10,000 for an L2 speaker are pretty realistic for a C2 level.


I've heard many times that uneducated native speakers have smallish vocabularies - especially the active vocabulary - but never seen any evidence. When I look at a list of the most common 10,000 words in English, it in no way includes any words that are not really common, and it is hard to believe even uneducated people don't have many more words. Nearly everyone in the UK can read novels - and that requires a passive vocabulary of 50,000+ lemmata.


----------



## Frank78

pimlicodude said:


> I've heard many times that uneducated native speakers have smallish vocabularies - especially the active vocabulary - but never seen any evidence. When I look at a list of the most common 10,000 words in English, it in no way includes any words that are not really common, and it is hard to believe even uneducated people don't have many more words. Nearly everyone in the UK can read novels - and that requires a passive vocabulary of 50,000+ lemmata.



Let's take, for example, "algorithm", which is in the top 10,000. I guess most people have at least a vague idea what it is, but how many actively use this word?


----------



## pimlicodude

Frank78 said:


> Let's take for example "algorithm" which is in the top 10,000. I guess, most people have at least a vague idea what it is but how many do actively use this word?


Corpora are all different and not all of them will have algorithm in the top 10K. It depends what books and works are fed into them. For a start, your assumption that if an "oik" doesn't use algorithm in his daily life, he will actively use NONE of the words in a given corpus deemed to be rarer than algorithm does not hold. There must be studies of the active vocabularies of native speakers. Maybe you have seen some - can you tell me what they have found?

Also, the fact that daily chit-chat doesn't use every possible word in the language does not change the fact that the passive vocabularies of native speakers are huge. Any native speaker who can read a Stephen King novel has a passive vocabulary that it would be hard for an L2 speaker to get up to (although not impossible).

What is the average vocabulary size of English native speakers?


----------



## pimlicodude

COCA, the Corpus of Contemporary American English is a corpus of 1bn words:


> The corpus contains more than one billion words of text (25+ million words each year 1990-2019) from eight genres: spoken, fiction, popular magazines, newspapers, academic texts, and (with the update in March 2020): TV and Movies subtitles, blogs, and other web pages.


In this corpus the word algorithm occurs 7,058 times, i.e. around 7 times per 1m words. Should I add on the 4,433 times "algorithms" occurs, or is that included in the 7,058? I think it will be a little bit outside the top 10,000 words. It may depend on whether the list is lemmatised or not.
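
As a rough check on that arithmetic, here is a minimal sketch that converts the raw counts quoted above into a per-million rate. It assumes a corpus of exactly one billion words (COCA only says "more than one billion") and shows both cases, since it is unclear whether the 7,058 figure already includes the plural.

```python
# Raw counts quoted above. The corpus size is an assumption:
# COCA describes itself as "more than one billion words".
CORPUS_SIZE = 1_000_000_000
COUNT_ALGORITHM = 7_058
COUNT_ALGORITHMS = 4_433


def per_million(count: int, corpus_size: int = CORPUS_SIZE) -> float:
    """Return occurrences per one million running words."""
    return count / corpus_size * 1_000_000


# Case 1: the singular form only (an unlemmatised word-form count).
singular_only = per_million(COUNT_ALGORITHM)

# Case 2: singular and plural merged into one lemma,
# assuming the 7,058 figure does NOT already include the plural.
lemma_total = per_million(COUNT_ALGORITHM + COUNT_ALGORITHMS)

print(f"singular only:       {singular_only:.2f} per million")
print(f"lemma (sing. + pl.): {lemma_total:.2f} per million")
```

If a list used a cut-off of around 8 per million, as the Russian frequency dictionary mentioned at the start of the thread does, the singular alone would fall just below it and the merged lemma above it, so whether the list is lemmatised could decide whether the word makes the cut.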

I've just read "VOCABULARY SIZE, TEXT COVERAGE AND WORD LISTS", by Paul Nation and Robert Waring. It refers to The Teacher's Word Book of 30,000 Words, which they say contains 30,000 lemmata and 13,000 word families.

The concepts of "word families" and "lemmata" are different.
e.g. "abandonment" and "to abandon" are not the same lemma. One is a noun and the other a verb, and lemmatisation strips out conjugated forms (abandon, abandons, abandoning, abandoned are one lemma), but doesn't merge cognates in the same word family. Abandonment and abandon are, however, in the same word family. Also in that article, it says "A university graduate will have a vocabulary of around 20,000 word families (Goulden, Nation and Read, 1990)". This should be equivalent to about 45,000 lemmas. A list of the top 10,000 lemmas is a good level for a foreigner, but less than one-quarter of that of a university student in an English-speaking country.
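
One way to arrive at a figure of that order is to scale by the lemma-to-family ratio of the word list itself. The sketch below does that with the numbers quoted above; the assumption that the same ratio holds for a graduate's vocabulary is only a guess at how the conversion could be made, not something the article states.

```python
# Figures quoted above from Nation and Waring.
LIST_LEMMAS = 30_000        # lemmata in The Teacher's Word Book
LIST_FAMILIES = 13_000      # word families in the same list
GRADUATE_FAMILIES = 20_000  # Goulden, Nation and Read (1990)
LEARNER_LEMMAS = 10_000     # the top-10,000 list discussed in this thread

# Assumption: the lemma-to-family ratio of the word list also holds
# for a university graduate's vocabulary.
lemmas_per_family = LIST_LEMMAS / LIST_FAMILIES
graduate_lemmas = GRADUATE_FAMILIES * lemmas_per_family
learner_share = LEARNER_LEMMAS / graduate_lemmas

print(f"lemmas per family: {lemmas_per_family:.2f}")
print(f"graduate lemmas:   {graduate_lemmas:,.0f}")
print(f"learner share:     {learner_share:.0%}")
```

On that assumption the graduate figure comes out at roughly 46,000 lemmas, close to the 45,000 above, and a 10,000-lemma learner stays under one-quarter of it.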


----------



## jimquk

pimlicodude said:


> COCA, the Corpus of Contemporary American English is a corpus of 1bn words:
> 
> In this corpus the word algorithm occurs 7058 times, i.e. around 7 times per 1m words. Should I add on the 4,433 times "algorithms" occurs, or is that included in the 7,058???? I think it will be a little bit outside the top 10,000 words. It may depend if the list is lemmatised or not.
> 
> I've just read "VOCABULARY SIZE, TEXT COVERAGE AND WORD LISTS", by Paul Nation and Robert Waring. It refers to The Teacher's Word Book of 30,000 Words, which they say contains 30,000 lemmata and 13,000 word families.
> 
> The concepts of "word families" and "lemmata" are different.
> e.g. "abandonment" and "to abandon" are not the same lemma. One is a noun and the other a verb, and lemmatisation strips out conjugated forms (abandon, abandons, abandoning, abandoned are one lemma), but doesn't merge cognates in the same word family. Abandonment and abandon are, however, in the same word family. Also in that article, it says "A university graduate will have a vocabulary of around 20,000 word families (Goulden, Nation and Read, 1990)". This should be equivalent to about 45,000 lemmas. A list of the top 10,000 lemmas is a good level for a foreigner, but less than one-quarter of that of a university student in an English-speaking country.


It seems odd to me to class Abandon and Abandonment as distinct lemmas. The meaning of Abandonment is completely obvious, I would suggest, if you know Abandon, so you only need to learn Abandon and the usage of -ment. Perhaps then word families are what we should be speaking of, but this brings us to another point: it's not just the number of words that need learning, but all the ways they can be used. Write Up, Write Down, Write Off, etc all have distinct non-obvious uses that have to be learnt.


----------



## pimlicodude

jimquk said:


> It seems odd to me to class Abandon and Abandonment as distinct lemmas. The meaning of Abandonment is completely obvious, I would suggest, if you know Abandon, so you only need to learn Abandon and the usage of -ment. Perhaps then word families are what we should be speaking of, but this brings us to another point: it's not just the number of words that need learning, but all the ways they can be used. Write Up, Write Down, Write Off, etc all have distinct non-obvious uses that have to be learnt.


Abandon and Abandonment are distinct lemmas, because the latter cannot be derived from the former. I mean: how do you know it is Abandonment and not Abandonance or Abandonation? It is a separate, but cognate, word. So yes, you are talking about word families, which is not the same as lemmata. Because of the peculiarities of English grammar, I think phrasal verbs should be separately entered in a list of lemmata. In any case, whatever the correct approach, a list of lemmata will always contain many cognates and so is longer than a list of word families. This is why I pointed out that a careless researcher might cite the 20,000 word families a university student knows, and mis-cite that as lemmata, whereas this breaks down into 45,000 lemmata. A foreign learner who knows 10,000 word families is doing considerably better than a foreign learner who knows 10,000 lemmata.


----------



## elroy

pimlicodude said:


> Abandon and Abandonment are distinct lemmas, because the latter cannot be derived from the former. I mean: how do you know it is Abandonment and not Abandonance or Abandonation?


I’m not sure that’s a convincing argument. According to Wikipedia, “broke” is considered to fall under the lemma “break,” yet equally, there’s no way to predict that it’s “broke” and not “breaked” or “brook.”


----------



## pimlicodude

elroy said:


> I’m not sure that’s a convincing argument. According to Wikipedia, “broke” is considered to fall under the lemma “break,” yet equally, there’s no way to predict that it’s “broke” and not “breaked” or “brook.”


Abandonment is not a conjugated form of the verb. It's not a verb. Some languages have many conjugated forms that could seem to suggest that they have huge vocabularies, but knowing the conjugations and noun plurals is part of the grammar. True, some languages have complex grammar. Break, breaking, broke, broken, breaks - these are all one lemma. Abandon, abandons, abandoning, abandoned are also a single lemma. But a noun is just not part of the verb conjugation.


----------



## elroy

You didn’t say anything about verb conjugations.  What’s your definition of a lemma?  (Warning: Wikipedia says it’s at least partly arbitrary.)


----------



## pimlicodude

elroy said:


> You didn’t say anything about verb conjugations.  What’s your definition of a lemma?  (Warning: Wikipedia says it’s at least partly arbitrary.)


Yes, it is partly arbitrary. But I'm bearing in mind the need for intercomparison between languages. So I started off referring to a Russian Frequency Dictionary where imperfective and perfective verbs are separately entered, and nouns cognate with verbs are separately entered. The author says those 10,000 words are all you need to speak Russian fluently. Had he organised them as word families, where nouns cognate with verbs are listed with the verb, there would have been fewer than 10,000 entries.

Some languages don't have any conjugation or plurals at all. Even the distinction between nouns and verbs may not hold perfectly between languages. E.g. slow is an adjective. And a verb (slow down). In Chinese, man4 is an adjective, man4xialai "slow down" is a verb, but there will be other languages where words can't be multiple parts of speech. So there is a large arbitrary element to it. But broadly speaking, 10,000 is a nice round number in any language (as long as you don't list plurals and declined and conjugated forms all separately). In some languages, due to the details of the grammar, it could be 8,000 or 12,000.

The other point I was trying to make clear is that this 10,000 is a small fraction of the 40,000-50,000 an educated native speaker will know. Some lists claim native speakers know only 20,000, but when you look into the academic literature further, that is usually referring to word families, where cognates are often grouped.

But you are absolutely right. It is arbitrary at some point. As jimquk pointed out, phrasal verbs are an obvious difficulty. In Russian, "to go out" is one word (vyxodit'), so there is an obvious problem in intercomparison there.


----------



## elroy

I think what you’re talking about is lexemes, not lemmas.

*Inflected* forms of a word are part of the same lexeme (break, breaking, broke, broken).

*Derived* forms of a word are separate lexemes (abandon, abandonment, abandoner).

Lexeme - Wikipedia

I still don’t think it makes sense to treat “abandon/abandonment” just like “abandon/watermelon.”  Someone who knows “abandon” may be able to guess “abandonment,” or guess the wrong suffix (“abandonance”) and still be understood.  Even if they don’t add a suffix at all and use “abandon” instead of “abandonment” they’ll still be understood.  And they’ll certainly understand “abandonment” themselves when they come across it.  None of this can be said of “watermelon.”  So I think derived forms should be treated differently.  They’re more predictable than completely unrelated words (“watermelon”) and no less predictable than inflected forms.

Phrasal verbs are a bit of a gray area.  Some, like “go out,” are transparent and can be understood if you know the constituent parts (“go” and “out”),  while others, like “make out” in the sense of kissing, etc., are totally opaque, and knowing “make” and “out” won’t help you at all.

To me it would make sense to take into account what related words someone with a certain vocabulary can reasonably be expected to understand and have a reasonable shot at accurately producing, and what words they have no way of knowing or guessing.


----------



## Penyafort

One only has to compare some frequency dictionaries to see that not all animals would follow the same order. But I'd say that, in general, a sort of basic list could indeed be made. The problem has more to do with the fact that many basic words are polysemic, often belonging to more than one morphological category, so that always makes the counting quite complicated.

Regarding the animals mentioned above (_woodpecker, stork, owl, hawk, seagull, eel, pike, catfish, wasp, tick, mosquito, mole, lynx, boar, tortoise_ and _lizard_), I knew them all in English but two, _catfish_ and _pike_, which I had to translate. I can translate them all into Catalan and Spanish. The problems arise if I want to translate them into other languages I know, even those I thought I knew better. There are always like two or three I just don't know, and some more I'm unsure of. And that hasn't stopped me from reading literature in those languages at all.


----------



## elroy

Penyafort said:


> One only has to compare some frequency dictionaries to see that not all animals would follow the same order.


Do you mean different orders in different languages?  Yes, of course that’s the case.  Not all words are used with the same frequency in every language.



Penyafort said:


> The problem has more to do with the fact that many basic words are polysemic, often belonging to more than one morphological category, so that always makes the counting quite complicated.


I think different meanings of the same word need to be counted separately.  For example,   “pen” (writing instrument) and “pen” (enclosure for animals) obviously need to be counted separately!  Counting them as one would make no sense.  The former is extremely high-frequency and familiar to most A1 speakers, while the latter is very low-frequency and probably not familiar to many C2 speakers.  But even if they were similar in frequency, they would of course still need to be counted separately.

I’m not sure what you mean by polysemous words belonging to different morphological categories.  Can you give an example?  Do you mean something like “book” (noun, publication that people read) and “book” (verb, to purchase in advance)?  Those should of course also be counted separately!



Penyafort said:


> Regarding the animals mentioned above (_woodpecker, stork, owl, hawk, seagull, eel, pike, catfish, wasp, tick, mosquito, mole, lynx, boar, tortoise_ and _lizard_), I knew them all in English but two, _catfish_ and _pike_, which I had to translate. I can translate them all into Catalan and Spanish.


Impressive!


----------



## elroy

Just for fun, I’ll try Spanish.  I think my Spanish is at least a C1, maybe even a C2, but I predict I won’t know more than 25% for sure!

woodpecker = ???
stork = ???
owl = búho
hawk = falcón
seagull = ???
eel = anguilla
pike = ???
catfish = ??? [Unless it’s “pez de gato.” ]
wasp = ???
tick = ???
mosquito = mosquito [This comes from Spanish!]
mole = ???
lynx = ???
boar = ??? [Unless it’s “cerdo salvaje.” ]
tortoise = tortuga
lizard = ??? [Maybe it’s “cameleón”?  I doubt it.]
hedgehog = ???
squirrel = ardilla
swan = cigueña (90% sure)
sparrow = alondra??? [I know this is a bird, not sure which.  I’m only 20% sure of this one.]
pigeon = paloma
ant = hormiga
dragonfly = ???
snail = caracol (99% sure)

Okay, I did a _bit_ better than predicted.


----------



## Penyafort

elroy said:


> I think different meanings of the same word need to be counted separately.  For example,   “pen” (writing instrument) and “pen” (enclosure for animals) obviously need to be counted separately!  Counting them as one would make no sense.  The former is extremely high-frequency and familiar to most A1 speakers, while the latter is very low-frequency and probably not familiar to many C2 speakers.  But even if they were similar in frequency, they would of course still need to be counted separately.
> 
> I’m not sure what you mean by polysemous words belonging to different morphological categories.  Can you give an example?  Do you mean something like “book” (noun, publication that people read) and “book” (verb, to purchase in advance)?  Those should of course also be counted separately!


Polysemy implies different meanings, but not necessarily meanings that are very different. They could just be related, and therefore sometimes it could be hard to draw a line. Which is why dictionaries may vary in the number of meanings for the same word. Often these words are just divided according to their etymological origin.

By difference in morphology, that's exactly what I meant. The problem here might lie in things such as substantivized adjectives, adjectivized past participles, and so on.



elroy said:


> Impressive!


Well, I wasn't schooled in English, so I learned them in my native languages. Nothing special. 

As for English, some of those animals aren't taught at all in school, I certainly must have learnt about them later in life.


----------



## Penyafort

elroy said:


> Just for fun, I’ll try Spanish.  I think my Spanish is at least a C1, maybe even a C2, but I predict I won’t know more than 25% for sure!
> 
> woodpecker = ???
> stork = ???
> owl = búho
> hawk = falcón
> seagull = ???
> eel = anguilla
> pike = ???
> catfish = ??? [Unless it’s “pez de gato.” ]
> wasp = ???
> tick = ???
> mosquito = mosquito [This comes from Spanish!]
> mole = ???
> lynx = ???
> boar = ??? [Unless it’s “cerdo salvaje.” ]
> tortoise = tortuga
> lizard = ??? [Maybe it’s “cameleón”?  I doubt it.]
> hedgehog = ???
> squirrel = ardilla
> swan = cigueña (90% sure)
> sparrow = alondra??? [I know this is a bird, not sure which.  I’m only 20% sure of this one.]
> pigeon = paloma
> ant = hormiga
> dragonfly = ???
> snail = caracol (99% sure)
> 
> Okay, I did a _bit_ better than predicted.



Not bad at all! But how could you not know how to say boar, when it comes from Arabic! Instead of a 'wild pig', imagine a 'mountain pig' and voilà, un jabalí.


----------



## Red Arrow

In English (I don't know if this is also true for other languages), fiction seems to have a much greater vocabulary than anything else. I have zero trouble reading posts on this forum, reading/watching the news, watching American TV series without subtitles, reading scientific articles about topics I am familiar with etc. I can also speak and write fluently in English. I would say I am C1 (but I have never done an official test, so who knows). But when I open a novel, every page is filled with words I have never heard before.

I dare say that you need much less English vocabulary to read Wordreference's Culture Café than to read an average novel. And this is a forum for linguists! Just let that sink in for a moment.

I highly doubt the average Dutchman or Scandinavian can read English fiction without encountering new words.

EDIT: According to Pimlicodude's definition, I am C2 in English.


----------



## elroy

Damn it!  I was back and forth on the diaeresis in “cigüeña” (one of the reasons I was only 90% sure), and it turns out I knew “stork” without realizing it!   I’ve heard “cisne” before, and probably many of the others. 



Penyafort said:


> Polysemy implies different meanings, but not necessarily meanings that are very different. They could just be related, and therefore sometimes it could be hard to draw a line.


Yes, of course.  We agree, then.



Penyafort said:


> I learned them in my native languages.


Were you schooled in both Catalan and Spanish?


----------



## elroy

OMG I stopped and wondered whether it was “halcón,” but for some reason I thought that “f” had been maintained!  And I blame the island of Anguilla for the extra “l.”  

(In all seriousness, these are good examples of the difference between active and passive proficiency.)


----------



## elroy

@Red Arrow, that supports what I said earlier:


elroy said:


> 4. How do you know how many of these people can comfortably read literature? Yes, many Scandinavians and Dutch people are amazing in terms of practical, everyday proficiency, but I would wager that many of them would “flounder” when reading literature.


----------



## Penyafort

elroy said:


> woodpecker = ???  pájaro carpintero (although it has some other variants)
> stork = ??? cigüeña
> owl = búho
> hawk = falcón (remember f>h; you wrote it in Aragonese)
> seagull = ??? gaviota
> eel = anguilla (anguila, otherwise it sounds like y)
> pike = ??? lucio
> catfish = ??? [Unless it’s “pez de gato.” ]  This is a weird one for me, what I've heard more is siluro
> wasp = ??? avispa
> tick = ??? garrapata
> mosquito = mosquito [This comes from Spanish!]
> mole = ???  topo
> lynx = ???  lince (as the lince ibérico)
> boar = ??? [Unless it’s “cerdo salvaje.” ]
> tortoise = tortuga
> lizard = ??? [Maybe it’s “cameleón”?  I doubt it.] lagartija
> hedgehog = ??? erizo
> squirrel = ardilla
> swan = cigueña (90% sure) cisne
> sparrow = alondra??? [I know this is a bird, not sure which.  I’m only 20% sure of this one.]  An alondra is a lark, a sparrow is a gorrión
> pigeon = paloma  , but careful, as it can also be a dove
> ant = hormiga
> dragonfly = ???  libélula
> snail = caracol (99% sure)





elroy said:


> Were you schooled in both Catalan and Spanish?


Here Catalan is the language of education, but we've had Spanish language and Spanish literature classes since we were little children, and we usually know those names in Spanish too because of the media.


----------



## elroy

Penyafort said:


> Here Catalan is the language of education, but we've had Spanish language and Spanish literature classes since we were little children, and we usually know those names in Spanish too because of the media.


Well, it’s still impressive that 1) you knew all of them in Spanish even though you weren’t schooled in it, and 2) you knew all but two in English!


----------



## Penyafort

elroy said:


> Well, it’s still impressive that 1) you knew all of them in Spanish even though you weren’t schooled in it, and 2) you knew all but two in English!


Bear in mind that often those animals were learnt for the first time outside school, whether in cartoons, children's books, etc. I mean, I don't even remember in which language I first heard some of those. I consider myself fully bilingual, so I see no real problem with that.

In English, that's a different story. It probably has more to do with all that I've read and watched in English after school than what I actually learnt there. Knowing vocabulary is something quite relative, after all. I'd probably struggle with two teenagers using modern slang but I often feel I might understand Shakespeare better than them.


----------



## elroy

Penyafort said:


> Bear in mind that often those animals were learnt for the first time outside school, whether in cartoons, children's books, etc.


That wasn’t the case for me.  I was schooled in English and had Arabic as a predominant language at home and in the community, and with the exception of “pike,” which I learned in adulthood, I was exposed to all of those English animal names through my schooling, but I was only exposed to some of the Arabic ones outside of school (and in Arabic class).  I watched cartoons in Arabic, but some of those animals never came up.


----------



## Olaszinhok

pimlicodude said:


> Nearly everyone in the UK can read novels - and that requires a passive vocabulary of 50,000+ lemmata.


There are some novels which are far easier to read than others, that's for sure. As for English, Romance language speakers generally have the advantage of understanding around 50% of the Latin-based vocabulary of most novels written in English. Some idioms, phrasal verbs and collocations can be tricky, though.
Regarding animals, I probably know more than 100 mammals and birds and so on (in English) because I am a big fan of nature, wildlife and the environment. I do think personal interests matter in this regard.


----------



## elroy

Olaszinhok said:


> There are some novels which are far easier to read than others, that's for sure.


Indeed: “literature” and “novels” are very broad categories!



Olaszinhok said:


> I do think personal interests matter in this regard.


Of course.  But there are some animal names that most/all native speakers can be expected to know regardless of interests, and there’s a subgroup that most/all C2 speakers can be expected to know regardless of interests. 

Of the 24 animal names @AndrasBP gave, I would say that 23 belong in the first group (all but “pike”) and 9 belong in the second group (owl, hawk, mosquito, tortoise/turtle, lizard, squirrel, pigeon, ant, and snail).  The other 15 (woodpecker, stork, seagull, eel, pike, catfish, wasp, tick, mole, lynx, boar, hedgehog, swan, sparrow, and dragonfly) I would say are hit and miss for C2 speakers.


----------



## Olaszinhok

elroy said:


> The other 15 (woodpecker, stork, seagull, eel, pike, catfish, wasp, tick, mole, lynx, boar, hedgehog, swan, sparrow, and dragonfly) I would say are hit and miss for C2 speakers.


I'd add peafowl (cock and hen) pheasant, swallow (one swallow does not make a summer), blackbird, magpie, robin, nightingale (at least in Europe) leopard, cheetah, lion, eagle, rhino, elk, bison, wolverine, wolf, parrot, canary and so forth.


----------



## elroy

I would not include lion, eagle, wolf, or parrot.  "lion" for me is A2 vocabulary.


----------



## elroy

By the way, I think the proverb is "One swallow does not *a summer* *make*," with unusual/archaic word order!


----------



## Olaszinhok

elroy said:


> By the way, I think the proverb is "One swallow does not *a summer* *make*," with unusual/archaic word order!


I reckon both versions are possible:
One swallow doesn't make a summer definition and meaning | Collins English Dictionary


----------



## elroy

Maybe.  I've only ever heard the one version.


----------



## Abaye

pimlicodude said:


> In English, I only remember "woodpecker" from children's books.


Apparently you're too young: Woody Woodpecker used to be a kids' animated telly celebrity until the 1970s. Which demonstrates how a word's popularity may vary over such a short time period.


----------



## bearded

elroy said:


> "One swallow does not *a summer* *make*,"


In Italy ''one swallow does not make a spring/a spring make'' (_una rondine non fa primavera_). It must depend on the Mediterranean climate.


----------



## Hulalessar

Sobakus said:


> I suppose this goes to show that broad generalisations about vocabulary size without reference to the exact semantic fields, parts of speech as well as specific contexts and communication types/mediums might be a misleading way to talk about this topic.


Agreed. If a native C2 level speaker does not know the name of every bird found in the country he lives in, no one is going to say that his knowledge of English is incomplete. I know what an owl looks like, but if shown a picture of an owl I could not tell you what sort of an owl it is.

If you are a non-native speaker it comes down to what you come across. I am not too bad with the names of animals in Spanish as I watch a lot of nature programmes on Spanish television and have been to the local zoo more than once. Just now I got the text of the Spanish constitution up on screen. Scrolling down at random I read half a dozen sections and understood them all. I deal with doctors, accountants and lawyers in Spanish. I am though not very good on the meat items on a menu as I do not eat meat or on the parts of a car as I do not drive. I cannot make much sense of the text in Spanish comics. What is to be made of my overall ability?


----------



## elroy

Hulalessar said:


> a native C2 level speaker


Do you mean a native speaker?
C2 is a designation used for non-native speakers only.


----------



## Hulalessar

elroy said:


> Do you mean a native speaker?
> C2 is a designation used for non-native speakers only.


I thought C2 was the equivalent of the level of a well-educated native speaker.


----------



## elroy

No matter what anyone says, C2 is not *equivalent* to native proficiency.  I have no doubt I could pass the German C2 exam with flying colors, and my German is definitely not native-level.

Be that as it may, the designation C2 is not used to refer to native speakers.  A native speaker is a native speaker, not a C2 speaker.


----------



## Hulalessar

elroy said:


> No matter what anyone says, C2 is not *equivalent* to native proficiency.  I have no doubt I could pass the German C2 exam with flying colors, and my German is definitely not native-level.
> 
> Be that as it may, the designation C2 is not used to refer to native speakers.  A native speaker is a native speaker, not a C2 speaker.


In that case in post 59 read "If a native C2 level speaker does not know" as "If a well-educated native speaker does not know".


----------



## apmoy70

bearded said:


> In Italy ''one swallow does not make a spring/a spring make'' (_una rondine non fa primavera_). It must depend on the Mediterranean climate.


To us, the swallow is the _bearer_ of spring: «ένα χελιδόνι δεν φέρνει την άνοιξη» one swallow does not bear spring


----------



## pimlicodude

Olaszinhok said:


> I reckon both versions are possible:
> One swallow doesn't make a summer definition and meaning | Collins English Dictionary


I’ve only ever heard “does not a summer make”. The word order is not optional.


----------



## Olaszinhok

pimlicodude said:


> I’ve only ever heard “does not a summer make”. The word order is not optional


That version must be the most traditional and common one, but apparently there are alternative forms, even with the word spring instead of summer.
one swallow does not a summer make - Wiktionary


----------



## pimlicodude

Olaszinhok said:


> That version must be the most traditional and  common one but apparently there are other versions even with the word spring instead of summer.
> one swallow does not a summer make - Wiktionary


Wiktionary is not an authoritative source. I could go in and enter "one swallow does not a winter make" if I wished to. Returning to the actual subject: maybe swallows are particularly common in the spring in some parts of the English-speaking world and so people might have amended the English phrase? However, it is likely this phrase has a Greek origin, and that the original Greek phrase had "spring".


----------



## Olaszinhok

pimlicodude said:


> Wiktionary is not an authoritative source.


You may be right but Collins generally is. Anyway, you are the native speaker, not me so I stand corrected.


----------



## pimlicodude

Olaszinhok said:


> You may be right but Collins generally is. Anyway, you are the native speaker, not me so I stand corrected.


Maybe you can comment at one swallow does not a summer make. I want to discuss the role of vocabulary size in this thread.


----------



## dojibear

When talking about vocabulary size, we should talk about "jargon". Those are specific terms (words or phrases) with specific meanings *only* in a specific field. Even a "fluent" speaker doesn't know most of these terms -- just the terms in fields they know about. There are many fields with jargon terms: astronomy, particle physics, cell biology, medicine, linguistics, ballet, Argentine tango, classical music, hard rock music, baseball, cricket, volleyball, cooking, bird-watching, skiing, snorkeling, piloting aircraft, software, computer hardware, website programming, sculpting, knitting, sewing clothing, bronze casting, sales, marketing, sociology, economics...

Everyone knows many jargon terms in their native language, but not most of them. One estimate says there are more English "jargon" words than normal English words. Professional interpreters study the "jargon" of a specific field (in both languages) before attending a meeting where that field will be discussed.

So how much "jargon" should we consider in "vocabulary size"? A particle physicist talking to laymen won't use jargon -- they will translate into general English. But people talking about more common fields may assume everyone knows the words.


----------



## pimlicodude

dojibear said:


> When talking about vocabulary size, we should talk about "jargon". Those are specific terms (words or phrases) with specific meanings *only* in a specific field. Even a "fluent" speaker doesn't know most of these terms -- just the terms in fields they know about. There are many fields with jargon terms: astronomy, particle physics, cell biology, medicine, linguistics, ballet, Argentine tango, classical music, hard rock music, baseball, cricket, volleyball, cooking, bird-watching, skiing, snorkeling, piloting aircraft, software, computer hardware, website programming, sculpting, knitting, sewing clothing, bronze casting, sales, marketing, sociology, economics...
> 
> Everyone knows many jargon terms in their native language, but not most of them. One estimate says there are more English "jargon" words than normal English words. Professional interpreters study the "jargon" of a specific field (in both languages) before attending a meeting where that field will be discussed.
> 
> So how much "jargon" should we consider in "vocabulary size"? A particle physicist talking to laymen won't use jargon -- they will translate into general English. But people talking about more common fields may assume everyone knows the words.


Well, the basic 10,000 words for proficiency probably don't include jargon in any field. At some point, the issue becomes "what sort of reading are you expecting to do in the target language?"


----------



## dojibear

Each Chinese written character is one syllable. 20% of words are one syllable, but 80% of words are two syllables. So one character might be in many different words. Some Chinese courses trick potential students by talking about vocabulary in terms of characters, not words. They say things like "learn 1,000 characters and you know 98% of the characters in a newspaper". True, but you don't know 98% of the words. You can't read a newspaper.

A similar trick is confusing "words" and "sentences" (in many languages). Many languages have a core of 700-900 "most frequently used" words. But most common sentences don't consist 100% of those words. Instead, they use those words plus a few less-common words. If you want to know every (non-jargon) word you will see, 10,000 is more accurate than 1,000.
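The gap between "words" and "sentences" can be sketched with a toy example. Everything below is made up for illustration (the four sentences, the list of known words, and the function names are assumptions, not real frequency data); it only shows that a high share of known running words can coexist with a low share of fully understood sentences.

```python
# Toy illustration: high word coverage does not mean fully known sentences.
# The sentences and the "known" word list are invented for this sketch.

def token_coverage(sentences, known):
    """Fraction of running words (tokens) that are in the known set."""
    tokens = [w for s in sentences for w in s.lower().split()]
    return sum(w in known for w in tokens) / len(tokens)


def fully_known_share(sentences, known):
    """Fraction of sentences in which every single word is known."""
    full = sum(all(w in known for w in s.lower().split()) for s in sentences)
    return full / len(sentences)


SENTENCES = [
    "she took the rudder and turned the boat",
    "he put the kettle on the stove",
    "they saw the heron near the river",
    "i like the book",
]

# A hypothetical learner's vocabulary: frequent words plus a few nouns.
KNOWN = {"she", "he", "they", "i", "took", "put", "saw", "like", "the",
         "and", "on", "near", "turned", "book", "boat", "river"}

print(f"token coverage:        {token_coverage(SENTENCES, KNOWN):.0%}")
print(f"sentences fully known: {fully_known_share(SENTENCES, KNOWN):.0%}")
```

In this toy case 22 of the 26 running words are known, yet only one sentence in four contains no unknown word.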



pimlicodude said:


> At some point, the issue becomes "what sort of reading are you expecting to do in the target language?"


I agree. For example, I was reading a Spanish version of a police drama. So I had to learn a few "police talk" words in Spanish. But that wasn't a few new words each page. The police terms occurred over and over.


----------



## Apollodorus

pimlicodude said:


> At some point, the issue becomes "what sort of reading are you expecting to do in the target language?"



I think that hits the nail on the head. It makes sense to first learn the vocabulary that you are most likely to need in your particular situation, so the size of your immediate target vocabulary will vary significantly depending on whether someone anticipates taking up employment, say, as a university professor or as a pickpocket, etc.

If we go by the assumption that 3000 high-frequency English words (out of the Oxford English Dictionary’s 175,000) provide coverage for about 95% of written English, you may well get away with an initial vocabulary of no more than a few hundred to a thousand words.

Of course, a certain percentage will be words that you don't actually "know" but the meaning of which you can guess from the context.


----------



## pimlicodude

Apollodorus said:


> I think that hits the nail on the head. It makes sense to first learn the vocabulary that you are most likely to need in your particular situation, so the size of your immediate target vocabulary will vary significantly depending on whether someone anticipates taking up employment, say, as a university professor or as a pickpocket, etc.
> 
> If we go by the assumption that 3000 high-frequency English words (out of the Oxford English Dictionary’s 175,000) provide coverage for about 95% of written English, you may well get away with an initial vocabulary of no more than a few hundred to a thousand words.
> 
> Of course, a certain percentage will be words that you don't actually "know" but the meaning of which you can guess from the context.


Where did you get the figure that 3,000 words account for 95% of written English? 3,000 words isn't even intermediate English.


----------



## Apollodorus

pimlicodude said:


> Where did you get the figure that 3,000 words account for 95% of written English? 3,000 words isn't even intermediate English.



See, for example, Nation, I. S. P., “How large a vocabulary is needed for reading and listening?”, _Canadian Modern Language Review_, 63(1), 2006, pp. 59-82. There must be some online studies as well based on statistics. I will post some links when I have a minute.

Also, note that I said "provides coverage for" not "accounts for". 3,000 English words don't amount to 95% of the total vocabulary of the English language. But they may be sufficient to enable you to understand 95% of what you're reading or hearing. 

Obviously, the percentage of coverage will vary according to the degree to which the topic is covered by the vocabulary you actually know, i.e. the degree to which it belongs to your particular specialised field. That's why the size and type of vocabulary needed to "learn a language well" will differ from one individual to another.
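The difference between "accounts for" (a share of the distinct words) and "provides coverage for" (a share of the running text) can be shown with a short sketch. The sample text and the function name are invented for illustration; real figures would need a real corpus.

```python
from collections import Counter


def coverage_of_top_n(tokens, n):
    """Return (share of distinct words, share of running text) for the n most frequent words."""
    counts = Counter(tokens)
    top = counts.most_common(n)
    type_share = len(top) / len(counts)                  # share of the vocabulary
    token_share = sum(c for _, c in top) / len(tokens)   # share of the text itself
    return type_share, token_share


# Made-up sample text, only to make the two ratios visible.
TEXT = ("the cat sat on the mat and the dog sat on the rug "
        "while the heron watched the cat and the dog").split()

type_share, token_share = coverage_of_top_n(TEXT, 3)
print(f"top 3 words = {type_share:.0%} of distinct words")
print(f"top 3 words = {token_share:.0%} of running text")
```

Here the three most frequent words are about 27% of the distinct words but 50% of the running text, which is the same kind of asymmetry the coverage figures rely on.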


----------



## dojibear

The problem with "most frequent words" is that it does not correspond to "most frequent sentences". The most frequent sentences do not consist 100% of most frequent words. Instead they use those frequent words plus some less frequent words.

In a language I am learning, I very often see/hear sentences like "And then she (held/took/gripped) the XXXX, hoping to stop the XXXX before it XXXXed." Lots of frequent words, along with some words I don't know. Was she cooking? Sailing? Driving a car? Playing a sport? Aucune idée.

That is the result if you only learn the most frequent words: unknown words in most sentences.


----------



## pimlicodude

Apollodorus said:


> See, for example, Nation, I. S. P., “How large a vocabulary is needed for reading and listening?”, _Canadian Modern Language Review_, 63(1), 2006, pp. 59-82. There must be some online studies as well based on statistics. I will post some links when I have a minute.


Does it say words, word families, or lemmas? There is a huge difference. 3,000 words is laughably low for listening comprehension.


----------



## pimlicodude

dojibear said:


> The problem with "most frequent words" is that it does not correspond to "most frequent sentences". The most frequent sentences do not consist 100% of most frequent words. Instead they use those frequent words plus some less frequent words.
> 
> In a language I am learning, I very often see/hear sentences like "And then she (held/took/gripped) the XXXX, hoping to stop the XXXX before it XXXXed." Lots of frequent words, along with some words I don't know. Was she cooking? Sailing? Driving a car? Playing a sport? Aucune idée.
> 
> That is the result if you only learn the most frequent words: unknown words in most sentences.


I find you can sail through some pages of a book in a foreign language with few new words, and then suddenly the next paragraph has 20 new words because an odd topic is being raised. It's not even at all.


----------



## Apollodorus

pimlicodude said:


> 3,000 words is laughably low for listening comprehension.



I agree that 3,000 words sounds low, but the implication from the literature seems to be that you don't need to know 100% of the words in a given text in order to understand a sufficiently high percentage of it.

The percentage of understood content increases with the number of high-frequency words you've learned that are relevant to your particular area of interest. That's exactly why good language-courses (or teachers) tailor their study material to students' individual needs.

Obviously, if you take "learning a language well" in a more general sense, then things may be different.


----------



## Apollodorus

MostUsedWords.com publish frequency dictionaries.

I’m not sure if they’ve got a Russian dictionary, but in volume 1 of their Greek Frequency Dictionary (2019), they write:



> Pareto’s law, also known as the 80/20 rule, states that, for many events, roughly 80% of the effects come from 20% of the causes.
> 
> In language learning, this principle seems to be on steroids. It seems that just 20% of the 20% (95/5) of the most used words in a language account for roughly all the vocabulary you need.
> 
> The authoritative Dictionary of Modern Greek (4th edition) by George Babiniotis lists over 150 000 references in current use. You will only need to know 3.3% (5000 words) to achieve 95% and 89% fluency in speaking and writing respectively. Knowing the most common 10,000 words, or just 6.6%, will net you 98% fluency in spoken language and 95% fluency in written texts.
> 
> This dictionary [vol. 1] contains 2.500 most used Greek words, listed by frequency. If you know these words, you can understand 92% of all daily spoken Greek, and 82% of all written Greek texts.



I could be mistaken, but their dictionaries seem to be getting favourable reviews on Amazon, so until I see evidence to the contrary, I’m assuming there is some truth to their claims.
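As a quick sanity check on the percentages in the quoted passage, the sketch below recomputes them from the 150,000-entry figure. Only the numbers come from the quote; the function name is made up for illustration, and the fluency percentages themselves cannot be checked this way.

```python
# Figures taken from the quoted passage; the helper name is hypothetical.
DICTIONARY_ENTRIES = 150_000


def share_of_dictionary(words_known, total=DICTIONARY_ENTRIES):
    """Percentage of dictionary entries that words_known represents."""
    return 100 * words_known / total


for n in (2_500, 5_000, 10_000):
    print(f"{n:>6} words = {share_of_dictionary(n):.1f}% of entries")

# "20% of the 20%" taken literally:
pareto_squared = 0.20 * 0.20
print(f"20% of 20% = {pareto_squared:.0%}")
```

The 5,000-word figure does come out at 3.3%. The 10,000-word figure is 6.67%, so the quoted "6.6%" is truncated rather than rounded, and 20% of 20% is 4%, not the 5% implied by "95/5".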


----------



## Hulalessar

When I first went to France I had been studying French for 2 years. If we assume 40 chapters in a book for a year's course and 30 new words in each chapter that comes to 2400 words. I was in France in the summer and not doing anything in particular. I found I was rarely lost for words and was talking for the same amount of time as I would have talked if I had been staying with a family in England which included someone my age. I went to church on Sunday and understood about two thirds of the sermon. It is unlikely that each of the 2400 words came up. That suggests that for everyday conversation about nothing special you do not need to know that many words.

Edit: On Googling I found that the vocabulary for the old O level French exam (taken at 16 after starting at 11) was 2000 words so the above assumptions are probably too high. Certainly if you passed O level French with the top grade you would expect to be able to get by travelling in France only speaking French, that is booking into a hotel, ordering food, buying tickets, asking the way, complaining you had been given the wrong change, chatting to people on a train, and so on.


----------



## Red Arrow

During the 6 years of high school, we typically had 6 chapters per year with 60 new words per chapter. So 360 words times 6 = 2160 words.

In the final two years, we had to make our own vocabulary lists on top of the prefab ones. That's another 2 times 25 pages times 33 words/page = 1650 words, but this is definitely an overestimation because my list also contained some words from the previous 2160 that I had forgotten.

So in the end I had learned 3000 to 4000 French words in school. Not nearly enough to be fluent.
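The back-of-the-envelope estimate above can be written out as follows. The chapter, page and word counts are the ones given in this post; the overlap fraction is a purely hypothetical parameter, since the post only says that some words on the personal lists repeated earlier ones.

```python
# Counts from the post above.
YEARS = 6
CHAPTERS_PER_YEAR = 6
WORDS_PER_CHAPTER = 60
textbook_words = YEARS * CHAPTERS_PER_YEAR * WORDS_PER_CHAPTER   # 2160

LIST_YEARS = 2
PAGES_PER_YEAR = 25
WORDS_PER_PAGE = 33
own_list_words = LIST_YEARS * PAGES_PER_YEAR * WORDS_PER_PAGE    # 1650


def estimate_total(overlap_fraction):
    """Total distinct words, assuming a given share of the personal lists
    repeats textbook words (overlap_fraction is a guess, not a known value)."""
    return textbook_words + own_list_words * (1 - overlap_fraction)


print(estimate_total(0.0))   # upper bound, no overlap at all
print(estimate_total(0.25))  # hypothetical: a quarter of the list was repeats
```

With no overlap the total is 3810; with the assumed 25% overlap it drops to about 3400, so both fall inside the 3000 to 4000 range.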

But my English and "dialectal" skills improved my French a lot.


----------



## elroy

Red Arrow said:


> we had to make our own vocabulary lists


What do you mean by this?


----------



## Red Arrow

The lists contained new words that were written on the blackboard + new words from texts we had to read in class + old words I had forgotten about. So everyone had their own list.


----------



## Apollodorus

Hulalessar said:


> I went to church on Sunday and understood about two thirds of the sermon. It is unlikely that each of the 2400 words came up. That suggests that for everyday conversation about nothing special you do not need to know that many words.



That's my impression too. I think we need to bear in mind that in situations of this type, what we understand from the context can play a significant role. The understanding is obviously increased when the foreign language is related to that of the listener, e.g. French to Spanish.

If an English person were to listen to Russian, or an Arab to Chinese, it might be different.


----------



## pimlicodude

Another interesting thing is the order in which words appear in frequency lists. This varies from book to book, of course, based on the sources used. But in the Russian Frequency Dictionary, the month "October" is the most frequent month name, because of the "October Revolution", which is referred to in many texts. Of course, no one would learn just October and not the others. Nick Brown, in the preface, explains that in frequency dictionaries for most languages "Friday" is the most common day, possibly because it is so often referred to as the end of the working week, but of course you do need to know all seven of the days of the week.


----------



## Hulalessar

Apollodorus said:


> That's my impression too. I think we need to bear in mind that in situations of this type, what we understand from the context can play a significant role. The understanding is obviously increased when the foreign language is related to that of the listener, e.g. French to Spanish.
> 
> If an English person were to listen to Russian, or an Arab to Chinese, it might be different.


When it comes to groups such as Romance a speaker of one language is going to get a lot of help when it comes to vocabulary, especially in the higher registers. English does of course have a lot of words ultimately derived from classical languages and in certain disciplines you just do not need a dictionary if reading a text in, say, French or Spanish. Everyday words can be different. Compare the words likely to be met early on in French and Spanish lessons at school:



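| English | French | Spanish |
|---|---|---|
| head | tête | cabeza |
| arm | bras | brazo |
| leg | jambe | pierna |
| foot | pied | pie |
| hand | main | mano |
| mouth | bouche | boca |
| ear | oreille | oreja |
| nose | nez | nariz |
| eye | oeil | ojo |
| knee | genou | rodilla |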




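| English | French | Spanish |
|---|---|---|
| table | table | mesa |
| chair | chaise | silla |
| desk | pupitre | pupitre |
| blackboard | tableau noir | pizarra |
| pupil | élève | alumno |
| book | livre | libro |
| exercise book | cahier | cuaderno |
| boy | garçon | niño |
| window | fenêtre | ventana |
| door | porte | puerta |
| floor | plancher | suelo |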


----------



## elroy

desk = escritorio, for me!


----------



## djmc

Most people have gaps in their vocabulary for technical subjects; this is normal. Textbooks about botany or medicine may leave me perplexed. I don't know the latest teenage textspeak etc. Books for children may be easier, but some of the vocabulary may not be obvious to adults. An example is 'bunny rabbits' or 'screes'.


----------



## pimlicodude

djmc said:


> Most people have gaps in their vocabulary for technical subjects; this is normal. Textbooks about botany or medicine may leave me perplexed. I don't know the latest teenage textspeak etc. Books for children may be easier, but some of the vocabulary may not be obvious to adults. An example is 'bunny rabbits' or 'screes'.


Well, I don't know what scree is. I thought it was a hill with loose gravel on it, or maybe that's a similar word.


----------



## Hulalessar

elroy said:


> desk = escritorio, for me!


My understanding is that "pupitre" has the narrow meaning of the sort of desk children use in school, while "escritorio" (when it refers to furniture) means something a bit more impressive. If you do a Google image search for "pupitre" Spanish sites come up showing pictures of school desks.

We need the opinion of a native Spanish speaker.


----------



## danieleferrari

Hulalessar said:


> the sort of desk children use in school


----------



## Olaszinhok

I know the word pupitre in Spanish and it is actually quite easy to memorise 'cause it comes from French but I hope this will help you.
¿Se usa la palabra pupitre para signficar un escritorio de una escuela en España? Solo españoles nativos por favor. Gracias.


----------



## Penyafort

They're right, _pupitre_ is still used but becoming slightly dated, and mostly restricted to a school desk. _Escritorio_ would rather be the desk you have at home.

By the way, boy is more often translated as _chico_ in Spanish, _niño_ being closer to the idea of child.



Hulalessar said:


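> | English | French | Spanish |
> |---|---|---|
> | head | tête | cabeza |
> | leg | jambe | pierna |
> | knee | genou | rodilla |
> 
> | English | French | Spanish |
> |---|---|---|
> | table | table | mesa |
> | chair | chaise | silla |
> | blackboard | tableau noir | pizarra |
> | pupil | élève | alumno |
> | exercise book | cahier | cuaderno |
> | boy | garçon | niño |
> | window | fenêtre | ventana |
> | floor | plancher | suelo |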


I left the ones that are different. Here's where you see that the level of knowledge of one's own language is also essential when it comes to relating words. The three body parts in French can be related to _testa_, _gamba_ (or _jamón_) and _hinojo_, although it's true that most Spanish speakers probably won't do so, as testa and gamba are restricted and hinojo is dated. Oftentimes the word in French is so eroded (cahier, chaise) that it is difficult to see they're cuaderno and cadera. Even so, cadera meaning chair (as well as tabla meaning table) are obsolete usages in Spanish, in which silla and mesa have always prevailed. Same thing for _hiniestra_ (fenêtre). Élève, garçon and plancher, though, are French particularities, even if for the latter French knows _sol_ too.


----------



## Apollodorus

Hulalessar said:


> English does of course have a lot of words ultimately derived from classical languages



Correct. But I think this tends to apply more to reading written material than listening to spoken language, due to the pronunciation that may sound like "Chinese" to a Continental. My French teacher used to say that in English you write "elastic" and read "rubber"! 😄


----------



## Apollodorus

Penyafort said:


> By the way, boy is more often translated as _chico_ in Spanish



But we mustn't forget "_chaval_". 😉


----------



## pimlicodude

Apollodorus said:


> But we mustn't forget "_chaval_". 😉


muchacho for an older teen?


----------



## Penyafort

Apollodorus said:


> But we mustn't forget "_chaval_". 😉





pimlicodude said:


> muchacho for an older teen?


Both are possibilities too, but _chaval_ is slightly informal, and _muchacho_, at least in Spain, is used but perhaps sounds slightly dated in some contexts, as if it were something you'd use for a translation of 'boy' in a 19th-century novel rather than in a contemporary one. This is obviously an opinion; others may have their say.


----------



## Apollodorus

Penyafort said:


> Both are possibilities too, but _chaval_ is slightly informal, and _muchacho_, at least in Spain, is used but perhaps sounds slightly dated in some contexts



Interestingly, _chaval_ is supposed to be from Caló (Romani) and _muchacho_ from an African language via Portuguese. _Chico_, on the other hand, may have a Latin origin. Is there a common Catalan, Galician, or Asturian equivalent?


----------



## kentix

elroy said:


> Even many native speakers can’t read literature comfortably and effortlessly, so I’m not really sure this is a good benchmark for anything.


Ain't that the truth.



pimlicodude said:


> But note that this will give you only "proficiency" (and that assumes you have the pronunciation and grammar right). It does not mean you can read literature without encountering new words.


You can't even do that as a native speaker, unless you are reading something very easy. Even reading a book by Stephen King, a "popular fiction" writer, will put you in that position. If you are reading "literature" (in that high-falutin' sense), it's likely that one reason it belongs in that category is that you can't read it effortlessly, even as a native speaker.

I made the mistake of buying a small, compact dictionary to take with me to Africa when I went to live in a remote area as a teacher. It turns out, every word I found difficult enough to have to look up was not in there, and all the ones in there (still thousands, I suppose) were ones I knew. So it was pretty much useless for me. (I was fortunate to be able to trade for a better, more comprehensive one with someone passing through from the U.S. I forget what the trade was.)


----------



## pimlicodude

kentix said:


> You can't even do that as a native speaker, unless you are reading something very easy. Even reading a book by Stephen King, a "popular fiction" writer, will put you in that position. If you are reading "literature" (in that high-falutin' sense), it's likely that one reason it belongs in that category is that you can't read it effortlessly, even as a native speaker.


I suppose it depends on the native speaker and the educational background. I've never read a popular fiction book and come across a word I didn't know. But I think the next generation of school leavers may indeed find many words they don't know.


----------



## jimquk

Apollodorus said:


> Interestingly, _chaval_ is supposed to be from Caló (Romani) and _muchacho_ from an African language via Portuguese. _Chico_, on the other hand, may have a Latin origin. Is there a common Catalan, Galician, or Asturian equivalent?



Chaval would then possibly have the same origin as British English Chav, supposedly from Romani Chavo.


----------



## Apollodorus

jimquk said:


> Chaval would then possibly have the same origin as British English Chav, supposedly from Romani Chavo.



Correct. And ultimately, from northwestern Indian (Indo-Aryan) languages (Rajasthani, Haryanvi, Punjabi) and related to Modern Hindi < Sanskrit javan (जवान), "young man".


----------



## Odysseus54

pimlicodude said:


> Does it say words, word families, or lemmas? There is a huge difference. 3,000 words is laughably low for listening comprehension.



For listening comprehension, yes.  Reading is a different story, though.  You have the time to guess, decode, surmise, invoke the water spirits.

From the first article on the DW right now:

Gestörte Lieferketten, explodierende Lebensmittelpreise: Singapur spürt die Folgen der Pandemie und des Ukraine-Krieges. Bundespräsident Steinmeier sucht Partner für freien Handel und verbindliche internationale Regeln.

I am learning German.  I can't really say how many words my vocabulary is made of, but this paragraph is an example of what happens when I read.  In blue are the words I know.  In green those whose meaning I can easily guess ( I know liefern=to deliver and Kette=chain, the leap to 'supply chains' is a short and easy one, although if I had had to guess the other way, I would have put my money on 'Versorgungsketten', which does exist.  At this point, I have no idea which one of the two is more common, if there are differences in usage etc.) The only word I had to look up was 'spüren' - 'fühlen' is, for obvious reasons, the verb that pops up with the idea of 'feeling'.

That's the process the way I know it.  It goes without saying that news articles are generally easy.  More literary texts or technical/topical texts are more challenging.  But I wouldn't be surprised if the 3,000 words had the potential to allow the understanding in context of three times as much 'latent' vocabulary.


----------



## Red Arrow

Let me try, first article on Le Soir 

Annulation de vol et indemnisation: à quoi il faut s’attendre
En cas d’annulation, les compagnies doivent proposer un autre vol ou le remboursement. Et, parfois, en plus, des indemnités. Dans certaines conditions.
L'actualité aérienne en Belgique illustre les différentes variantes de l'application du droit à une indemnisation en cas d'annulation de vols. Comme ce sera pour environ 250 vols, lundi, au départ ou à l'arrivée à Brussels Airport et, éventuellement, en cas de grève chez Brussels Airlines et Ryanair d'ici la fin du mois.
(the rest is behind a paywall)

Those words (_indemnisation_, _indemnités_) look related to "damn" but there is no way to guess what they mean. You get your money back and sometimes on top of that... damn?

Long live the dictionary.


----------



## kentix

I have never studied French but I can  understand quite a few bits and pieces. I don't know if my French vocabulary is 200 words or 600 but I have a background studying Spanish (not so good at the speaking) and a passing exposure to a few snippets of French here and there (plus French imports in English). That's enough that this isn't complete gobbledygook.

Indemnity is a common enough word in English.

*indemnity*
in•dem•ni•ty _/ɪnˈdɛmnɪti/_  n., pl. *-ties.*

Business [uncountable] security against damage or loss.
Business [countable] money as payment for loss sustained.

I couldn't fully understand it overall but I got the general topic and the exact meaning of some specific parts. Looking up the meaning of just a couple of words made it 50% clearer. I had the right general idea but could not zero in on the exact meanings.

I agree with Odysseus you can go a long way when you don't have to speak or respond.


----------



## Apollodorus

kentix said:


> I agree with Odysseus you can go a long way when you don't have to speak or respond.



We mustn't lose sight of the fact that it's 3,000 _high-frequency words_, not random words. By definition, these are the words you are most likely to come across in everyday situations.

And MostUsedWords do say that if you know the 2,500 Greek words in their Volume 1 (there are four Volumes amounting to a total of 10,000 words), you can _understand_ 92% of all daily spoken Greek.

Obviously, comprehension comes first, speaking comes second, exactly as is the case with children.

Interestingly, I once asked some Swedish friends of mine how they manage to learn English so well. Their answer was that they watch English-language films with Swedish subtitles (whereas in some countries they tend to be dubbed in the local language).

According to MostUsedWords their frequency list is based on data from subtitles that apparently "included over 1 million entries or different "words"". And I must say from my own experience, that watching movies, TV programmes, etc. in the language you're learning does help a lot precisely because it's everyday vocabulary that will help you become fluent fast.

Plus, at the end of the day, you've got to start _somewhere_. And knowing two or three thousand high-frequency words, or even a few hundred, can't be a bad thing. The main thing is to learn some words every day, keep repeating them, mentally create sentences with as many learned words as possible and communicate with native speakers as often as you can. If you do that, you just can't go wrong.

In terms of spoken language, pronunciation is something that you have to try and get right from the start, unless your main interest is reading or communicating in writing.


----------



## dojibear

I once visited France for a week. I didn't have much trouble expressing myself or asking questions. Then I had dinner with 4 friends, all speaking French. They were having a random conversation -- that is, they were talking about many different things.

For me, much of it was "I like the XXXX way better than the XXXX because the XXXX...". I knew the common words that string things together, but not the less-common nouns used in 50 different topics. Sometimes I didn't even know what the topic was.


----------



## dojibear

Apollodorus said:


> And I must say from my own experience, that watching movies, TV programmes, etc. in the language you're learning does help a lot precisely because it's everyday vocabulary that will help you become fluent fast.


My experience too. I watch TV drama series in Mandarin, with both English and Mandarin subtitles. The Mandarin sub-titles help me train my ears to "hear" the sounds of full-speed, imperfect colloquial speech. The English sub-titles give me a rough idea of what is being said. Dramas also give you tons of context. An egotistical CEO is an egotistical CEO, in any language...


----------



## Apollodorus

dojibear said:


> The Mandarin sub-titles help me train my ears to "hear" the sounds of full-speed, imperfect colloquial speech.



I agree. Obviously, there will be differences from language to language, but I think it’s very important to systematise one’s learning from the start.

Listening to the language being used by native speakers is definitely a big part of it. Watching a film or TV programme is very helpful but just _listening_ to the language being spoken can be equally useful.

Doing some listening every day does help familiarise oneself with the sounds of the language. Over time, you begin to make out individual words, and you learn what each word means as you go, sometimes from the context.

Depending on the language, another important step is to try and learn nouns with their corresponding articles. For example, German “das Fenster” (“the window”). The minute you learn it, start using it by forming bilingual sentences, e.g. “I’m looking out of/aus dem Fenster” for "I'm looking out of the window". I.e., don’t wait until you've learned all the words to build your sentence!

Likewise, when learning verbs, learn the main tenses at once, e.g. “sehen”, “sah”, “gesehen” (“see”, “saw”, “seen”). When learning words in general, try to learn them within a phrase, expression, or sentence, etc., etc.

In other words, the more structured and systematic the information input, the faster the brain will be able to process, assimilate and use all that in a coherent and meaningful way.


----------



## Penyafort

Apollodorus said:


> Interestingly, _chaval_ is supposed to be from Caló (Romani) and _muchacho_ from an African language via Portuguese. _Chico_, on the other hand, may have a Latin origin. Is there a common Catalan, Galician, or Asturian equivalent?


In Catalan, *xaval* exists (attested since 1893) and is accepted. But with Romani words that are the same as in Spanish, one can never be too sure if it came straight from Catalan Romani or via Spanish. With other words such as _calés_ 'money' or _catipén_ 'stench', one can be more confident because they don't exist in Spanish.

Yes, _chico_ comes from Latin. The cognate in Catalan is *xic*, which is used with the meaning of 'boy' in Western varieties.

_Muchacho_ doesn't come from an African language. What is the source for that?


----------



## Apollodorus

Penyafort said:


> _Muchacho_ doesn't come from an African language. What is the source for that?



Well, I can’t _guarantee_ that it’s from an African language, but according to Wiktionary,

Etymology mucho +‎ -acho and/or macho +‎ -acho, Portuguese macaco. And macaco includes “from Kongo makaku (“monkeys”)” among suggested derivations.

muchacho – Wiktionary

macaco – Wiktionary

Ultimately, the exact etymology seems uncertain.


----------



## pimlicodude

Apollodorus said:


> Well, I can’t _guarantee_ that it’s from an African language, but according to Wiktionary,
> 
> Etymology mucho +‎ -acho and/or macho +‎ -acho, Portuguese macaco. And macaco includes “from Kongo makaku (“monkeys”)” among suggested derivations.
> 
> muchacho – Wiktionary
> 
> macaco – Wiktionary
> 
> Ultimately, the exact etymology seems uncertain.


Wiktionary doesn't necessarily know....


----------



## Apollodorus

pimlicodude said:


> Wiktionary doesn't necessarily know....


I agree! 🙂 But in the absence of a more credible source we must at least consider it as a possibility (for the time being). Unless @Penyafort has a better idea ....


----------



## Hulalessar

Wikcionario says: Del castellano antiguo mochacho, derivado de mocho, tronco, y este del latín mutilus, mutilado.


----------



## pimlicodude

Hulalessar said:


> Wikcionario says: Del castellano antiguo mochacho, derivado de mocho, tronco, y este del latín mutilus, mutilado.


That is much more credible....


----------



## Penyafort

Apollodorus said:


> I agree! 🙂 But in the absence of a more credible source we must at least consider it as a possibility (for the time being). Unless @Penyafort has a better idea ....


One thing is certain: the oldest form of the word was _mochacho_, with -o-, apparently preferred between the 13th and 16th centuries, and that means that it most likely comes indeed from _*mocho*_ (+ suffix -_*acho*_, which is typically Mozarabic in origin, the genuine Castilian suffix coming from -ACEU being -_azo_). Corominas considers this origin as clear, based on articles by Baist, Schuchardt, Castro and Rohlfs, _mocho_ meaning ‘cut-off, cut short, close-cropped’ (not trunk) when referred to hair, as that was something characteristic in young boys in the Middle Ages. (Other Romance forms where you can see the relationship between cropped hair and young boy are _caruso_ in southern Italy and _toset_ in Old Catalan/Occitan or _toso_ in northern Italy. The shaven wheat or _Spica mutila_, called _blat tosell_ in Catalan or _grano carosella/tosello_ in Italy, is called _trigo mocho_ in Castile).

But the real origin for _mocho_ is uncertain. It can’t come straight from Latin _mutilu_, because that would have given *_mojo_ in Castilian. We see _motz_ in Basque, meaning ‘short, blunt, shorn’, and we clearly have the word _mozo_ in Castilian and Aragonese for a lad too, so that leads us to a likely early form *_muttiu_, that would expand west (Portuguese _moço_) and east (Catalan _mosso_, French _mousse_, Italian _mozzo_). But what about _mocho_, with ch? Corominas ended up saying that it was probably expressive in origin. In any case, _mochacho_ must have had its origin in a Mozarabic form.

As you can gather from all of this, whoever suggested an African origin in that etymology is clearly misguided.


----------



## Apollodorus

Hulalessar said:


> Wikcionario says: From Old Castilian mochacho, derived from mocho, "trunk", and this from Latin mutilus, "mutilated".



So, Wikcionario gives a different etymology from Wiktionary! And when I click the “mochacho” link it says “Wikcionario does not yet have a page called mochacho.”

But the mochacho derivation seems to be supported by the Diccionario de la lengua española, published by the Real Academia Española.

So it seems you are right. 🙂

BTW, any ideas why a young man would be called "mutilated"?


----------



## Apollodorus

Penyafort said:


> As you can interpret from all of this, whoever suggested an African origin in that etymology is clearly misguided.



In other words, Wiktionary must be taken with a pinch of salt ....


----------



## Apollodorus

Penyafort said:


> But the real origin for _mocho_ is uncertain. It can’t come straight from Latin _mutilu_



That's what I thought, too.

Presumably, this is why the Diccionario de la lengua española is more cautious and says "muchacho - from the old form mochacho, itself derived from mocho" and "mocho - of _uncertain_ origin".


----------



## Odysseus54

Apollodorus said:


> So, Wikcionario gives a different etymology from Wiktionary! And when I click the “mochacho” link it says “Wikcionario does not yet have a page called mochacho.”
> 
> But the mochacho derivation seems to be supported by the Diccionario de la lengua española, published by the Real Academia Española.
> 
> So it seems you are right. 🙂
> 
> BTW, any ideas why a young man would be called "mutilated"?



According to the Oxford Languages dictionary, it seems that

Mocho
2 Having shaved or very short hair.

In Colombia, young people (or 'chinos') are called 'pelaos/pelados'. So it's not a matter of mutilation, but of a simple haircut 🙂


----------



## pimlicodude

This thread is about the role of vocabulary size in learning a language, not the etymology of muchacho.
Another issue we haven't considered is active vs. passive vocabulary.
If you have 10,000 words in the target language in your active vocabulary, then you speak it quite well, even if fiction in that language uses many more words. But if you have 10,000 words in your passive vocabulary and only, say, 3,000 in your active vocabulary, your speaking will fall a long way behind your reading.


----------



## Apollodorus

Odysseus54 said:


> So it's not a matter of mutilation



Besides, it may have nothing to do with Latin “mutilus”. It seems the Wiktionary people like to make things up. Or to pull our leg …. 😀


----------



## Apollodorus

pimlicodude said:


> This thread is about the role of vocabulary size in learning a language, not the etymology of muchacho.


True. But this forum section is about etymology and history of languages, and a small diversion can't do any harm. On the contrary, it might enrich the thread by making it more varied and more inclusive. 😉

I agree that an active vocabulary of "only" 3,000 words might cause your speaking to fall behind your reading, but isn't this what tends to happen anyway?


----------



## pimlicodude

Apollodorus said:


> I agree that an active vocabulary of "only" 3,000 words might cause your speaking to fall behind your reading, but isn't this what tends to happen anyway?


It is, yes, but ideally we should aim to stop it. I find that conversations with native speakers rarely touch on advanced topics, so the rarer words among the 10,000 needed for fluency don't tend to get an outing.


----------



## Apollodorus

pimlicodude said:


> ideally we should aim to stop it


OK, if that's what the aim is, then the question is how to get there.

In other words, if we're agreed (a) that an active vocabulary of 10,000 words is needed for speaking a language well, and (b) that speaking (as opposed to merely understanding) that language well is the target, all that remains to be established is by what method this can be achieved.

Presumably, learning the 10,000 words listed in a frequency dictionary would be the core strategy. But how can we speed up the process (assuming that we don't want to invest too many years in the project)?


----------



## Red Arrow

SRS / Anki in combination with immersion (reading, listening) speeds up vocabulary learning immensely.
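
For anyone unfamiliar with the idea behind SRS (spaced repetition), here is a rough Python sketch of a Leitner-style scheduler. It is only an illustration and not how Anki actually schedules cards: the `INTERVALS` values, the `Card` class and the `review` / `due_cards` helpers are all made up for the example. The principle is that a word you get right is shown again after a longer gap, while a word you get wrong goes back to the start.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical review gaps in days, one per box (assumed values, not Anki's).
INTERVALS = [1, 2, 4, 8, 16]


@dataclass
class Card:
    word: str
    box: int = 0                   # index into INTERVALS
    due: date = date(2024, 1, 1)   # arbitrary starting date for the example


def review(card: Card, correct: bool, today: date) -> Card:
    """Promote the card one box on a correct answer, reset it to box 0 otherwise."""
    if correct:
        card.box = min(card.box + 1, len(INTERVALS) - 1)
    else:
        card.box = 0
    # Schedule the next review according to the card's new box.
    card.due = today + timedelta(days=INTERVALS[card.box])
    return card


def due_cards(cards: list[Card], today: date) -> list[Card]:
    """Return the cards whose review date has arrived."""
    return [c for c in cards if c.due <= today]
```

Words met again during reading or listening effectively get extra "reviews" for free, which is presumably why the combination with immersion works so well.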


----------

