Wednesday, June 17, 2009

Which language has the most words?

I heard an interesting factoid at lunch yesterday. There are apparently one million words in the English language, which supposedly puts it far ahead of any other language. This came as a surprise to me. While I've grown to enjoy English for its precision, this would mean it contains over ten times the number of words of the French language (80k). It is no surprise to me, however, that French is stagnating, simply because the language is regulated by committee, and we all know how efficient those are. And so when we need a new word, say to convey the meaning of "email", a group of people in France have to get together and pick one (courriel). Hardly a natural and vibrant process.

But does the English language truly have a million words? Wolfram Alpha will quickly tell you that the Oxford dictionary has 600k words, putting it ahead of, say, Spanish with a mere 225k. English is growing very quickly because it incorporates other languages readily, perhaps because most English speakers are not native. In fact China has the largest population of anglophones in the world.

The original claim that there are 1 million words comes from the Global Language Monitor in Texas, which has an algorithm caching and crawling social networking sites all along identifying any expression used over 25 000 times as a word. Of course many people have problems with this, since it is not an accepted definition of a word. It incorporates any deformation or misspelling of common words, commercial trademarks and onomatopoeia, which would include for example: "what", "wat", "waht", "wh4t", "whaaaaat", "whassssssssssssup", etc. Many people have criticized the methodology itself, since the Global Language Monitor claimed the millionth word in 2006, 2007, 2008 and 2009.

So does this mean English really has the most words? Not likely. Agglutinative languages such as Finnish most certainly have more words, and more readily create new ones to adapt to changing times by fusing existing words, creating an almost infinite number of combinations. Yet English remains dominant, perhaps not in the number of words, but in the reach and power of the language.
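
To see why a bare frequency cutoff worries people, here is a minimal sketch of the idea as described above. It is not the Global Language Monitor's actual code: the function name `candidate_words`, the toy `posts` list and the scaled-down `MIN_USES` threshold are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical cutoff: the post mentions 25 000 uses; scaled down for a toy corpus.
MIN_USES = 3

def candidate_words(posts, min_uses=MIN_USES):
    """Return every token that appears at least `min_uses` times across all posts."""
    counts = Counter()
    for post in posts:
        # Lowercase the text and treat any run of letters or digits as one token.
        counts.update(re.findall(r"[a-z0-9]+", post.lower()))
    return {token: n for token, n in counts.items() if n >= min_uses}

# Made-up sample data standing in for crawled social network posts.
posts = [
    "wat is this",
    "wat wat",
    "what is this",
    "what is wh4t",
    "what now",
]

print(candidate_words(posts))  # {'wat': 3, 'is': 3, 'what': 3}
```

With nothing but a count to go on, "wat" passes the cutoff just as "what" does, and the only thing keeping "wh4t" out is that it has not been typed often enough yet.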


29 comments:

Anonymous Coward said...

We must be pretty close to having Bayblab accepted as a word.

Kamel said...

caching and crawling social networking sites all along identifying any expression used over 25 000 times as a word

That's what we want, a bunch of semi-illiterate texters determining what's a word.

Paul JJ Payack said...

The Global Language Monitor's criterion is not 'any expression used over 25 000 times as a word'. That is the minimum number of times a word or expression must be cited before it is even considered worthy of study. This eliminates typos and the like. Then it is subjected to the two quality tests (breadth and depth) before it moves forward for further consideration. For more information check the site (www.LanguageMonitor.com).

Also, we have never sought the approval of linguists who have long maintained that words are impossible to count, since what a word is, itself, is open to dispute.

Paul JJ Payack

Rob said...

That's what we want, a bunch of semi-illiterate texters determining what's a word.
This is how language evolves. Do you prefer the French committee method? Words are used by people, not just very literate people but everyone, including texters (is that a word?). They would not catch on if they weren't useful and meaningful. The 'success' of English is exactly as the post describes. It can adapt quickly.
Personally, I enjoy new words. We were discussing the word noob the other day. I really don't think there is a better word to describe a.... noob.

Anonymous Coward said...

I'm sorry, I think you're referring to a newb...

Anonymous Coward said...

@Paul JJ: according to your site "Just missing the top spot was n00b, a mixture of letters and numbers that is a derisive term for newcomer. It is also the only mainstream English word that contains within itself two numerals. Rounding out the final five were another technical term, cloud computing, meaning services that are delivered via the cloud (or Internet), and a term from the Climate Change debate, carbon neutral."

Isn't a word spelled with numerals (1337 speak) a misspelling? Isn't carbon neutral two words? As far as I know English isn't an agglutinative language.

Kamel said...

I was expecting that response. I have no problem with new words, for example google as a verb.

The 25000 use approach (which is apparently only a starting point as mentioned above) would make l8r or l33t into words.

But I have more detailed thoughts on the matter that are probably better served in their own post.

Kamel said...

Isn't a word spelled with numerals (1337 speak) a misspelling? Isn't carbon neutral two words?

Thanks AC, that's more the point I was making. Just because somebody can't, or is too lazy, to spell "you're" correctly doesn't make "ur" a word or make "your" mean "you are" any more than the common typos 'teh' or 'jsut' are valid replacements for 'the' and 'just' no matter how common or easily understood.

Anonymous said...

"This is how language evolves."

We should evolve math and science this way too! It doesn't matter what's right, only what's popular! Hooray 6000 year old earth!

Anonymous said...

rob said: This is how language evolves.

Is that a fact? The French don't seem to think so.

Is it even historically true? Does most language evolve from trolling popular media for usage frequency?

Or are you just making it up?

Anonymous Coward said...

The Urban Dictionary is to Webster what wikipedia is to Britannica?

Anonymous said...

stFU /../..an, i r teh r0xx0rz liek emin3m, u cna go tO EHLL OR ATLE4St help m3 wit hthIS!!111!!!!!!!1~~1!!``!! LOLLOLOLLOLOLlOoLLOlollLLl u n00b

Rob said...

At the two anonymous comments
The science and math comment:
Your comment implies that you think there are right and wrong words. Who decides what word is right and what is wrong?
The other comment:
I'm pretty sure it is a well known fact that languages evolve. My comment was taken out of context, I wasn't talking about the trolling of websites to count words, I was talking about Kamel's objection to texters creating new words.
I will add that, despite attempts to 'regulate' French, I think you'll find that Canadian French differs from France French, i.e. it evolves.

Kamel said...

I was talking about Kamel's objection to texters creating new words.

And you're taking my comment out of context which is clearly a direct response to the method of crawling social network sites and identifying anything that appears more than 25000 times as a word (notice the quote that I'm replying to)

Anonymous said...

rob said: "I'm pretty sure it is a well known fact that languages evolve."

Nobody questioned that. You proposed a method. ("That's how...") That's what I question.

See what happens when you don't use the right words?

Anonymous said...

"Your comment implies that you think there are right and wrong words. Who decides what word is right and what is wrong?"

Obviously some words are right and wrong. If I say table when I mean chair, that's wrong. If I say gumba when I mean chair, that's wrong. Even some right words are wrong - most people agree that ain't isn't a proper word even though we all know its meaning. But what my sarcastic comment was taking issue with was this: "Words are used by people, not just very literate people but everyone."

That's just a weak argument. Municipal infrastructure is used by everyone, not just city planners and engineers but not just anybody gets to decide where the sewers run, or the road maintenance schedule. You haven't convinced me that language should be special in its democracy.

Kamel said...

Rob got me thinking a bit about right and wrong words and I think I have a reasonable example of a wrong one. Truthiness.

Don't get me wrong, I like the word but it's a great example of the popular vote screwing it up. (And it was, in fact, picked by popular vote.) Most people don't know what it means. Even if you search the Bayblab you'll probably find examples of incorrect use (i.e. contrary to its definition). Most people take it as a synonym for truthfulness, or the state of being true when, in fact, it means almost the opposite. In its first usage on The Colbert Report, it was defined as "truth that comes from the gut, not books" and is included in the Merriam-Webster dictionary as "the quality of preferring concepts or facts one wishes to be true, rather than concepts or facts known to be true".

But people use it with the other meaning (degree to which something is true) because of rules and conventions of the English language. The suffix -ness we recognize as expressing quality or state, so happiness is the state of being happy. Saltiness is the state of being salty. It follows that truthiness is the state of being true. Which it isn't.

Now you can make an argument that this is an example of language evolution. The popular vote made it a word, and an even more popular 'vote' changed the definition (which it hasn't yet as far as dictionaries are concerned), but that only happened - if it has - because one way is wrong and the other objectively right.

Rob said...

For the record I don't think the mechanism of language evolution is through trolling social media. That's pretty obvious. Sorry if I was unclear.
You haven't convinced me that language should be special in its democracy.
In my mind, the last two comments are confusing democracy and evolution. These are very different concepts, the definitions of which I won't bother with on the Bayblab. I never said that people vote on words, nor did I mention anything about majority rules on the creation of new words. I said language evolves.
For the truthiness example voting on the word is irrelevant. It is perhaps part of the evolution of the word but evolution has not finished with ANY word. 'Truthiness' is an interesting example I think of why many people have a problem with the trolling of social media for new words. These words are in their infancy, the definition is very dynamic. Because of this, the meme that is the word 'truthiness' perhaps has weak replicative power.
Could texting and words that contain numbers and other abominations that are being rejected in these comments, just be niche words? Perhaps they will never work their way into scientific discourse, for example, as this is another niche.
BTW this is nothing new. The telegraph had its own shorthand that had similarities to texting. I just point this out in case you are rejecting words simply because they are new, and new ideas from young people scare you. HHOJ

Bayman said...

Kamel says: Just because somebody can't, or is too lazy, to spell "you're" correctly

Actually if you consult a definitive authority on the English language, such as Chaucer or the King James bible, I believe you'll find the correct spelling would be "THOU ART".

Stop getting lazy with your words.

Kamel said...

Ha ha. Good point.

Kamel said...

Since you bring it up, how many words were born out of the telegraph code?

Kamel said...

And of course new things scare me - that's why I don't have a cellphone and avoid revolving doors! (They're the work of the devil, damn it!)

Rob said...

Yeah. I don't think too many words survived from the telegraph code aka the Phillips code. So perhaps these texting words will have a short lifespan. Much like those who would choose to walk through a revolving door.

Kamel said...

As a random, but interesting, aside - there was a movement that started in the late 1700s and spanned almost 200 years to simplify spelling, and it attracted a lot of "big names". Darwin, Mark Twain, Tennyson and even Webster pushed for changes to spelling such as 'have' to 'hav', 'guard' to 'gard' and more recognizable changes like 'though' to 'tho', 'through' to 'thru' and 'nite' in place of 'night'. Many of those changes would probably have been ideal for telegraph or Twitter, where cost per letter or space puts limits on your message.

"So while spelling reform has exercised some of our finest minds for nearly two centuries, the changes attributable to these efforts have been few and frequently short lived."

(source: "The Mother Tongue, English and how it got that way" by Bill Bryson, which is a really interesting book on the evolution of English)

So it seems there's some historical precedent for some of these 'new media' spellings and/or words being a flash-in-the-pan.

Bayman said...

That's interesting. As writers, I assume those guys were pushing for shorter spellings to decrease publishing costs associated with disseminating their ideas.

So who has fought to maintain archaic and ornate letter-packed "correct" spellings? Perhaps fat-ass owners of printing press monopolies who could charge by the letter for access to their presses? Nobility and other land-owning oligarchs who felt safer when written ideas were spread by a few wealth-friendly voices rather than many diverse, dissenting, skeptical or "destabilizing" viewpoints?

Have profiteering and information control on the part of the wealthy minority been the true motives behind the language snobbery that so prominently figures in certain aspects of British culture?

Josie said...

The 25000 use approach (which is apparently only a starting point as mentioned above) would make l8r or l33t into words.

Isn't a word spelled with numerals (1337 speak) a misspelling?

Why do you disagree that these are words? L8r or n00b are not spelled that way accidentally; it is obvious that they require more effort to type. They are words that have been created: the first to save space, while the second carries a certain specific meaning and also serves as somewhat of a cultural marker for those who use it. And they have spread because they are useful for their purposes.

To me, that is exactly how language works. We have always used different words in academic writing, in casual speech and in news articles. It does not mean only some of those many words are really part of the language.

Anonymous said...

'Actually if you consult a definitive authority on the English language, such as Chaucer or the King James bible, I believe you'll find the correct spelling would be "THOU ART".'

It's not laziness to scrap thou and use you for all second-person pronouns; something similar occurred in Brazilian Portuguese, in Spanish (twice), and in several other languages.

And the people who want to fight for correct spelling are not 'fat-ass owners of printing monopolies'; they are people who want to use the English language correctly and maintain traditional spelling that shows the homologous relationship between it and other Germanic languages, as well as the analogous relationship with Italic languages and the Hellenic influence.

If a word has a number in it, it's no longer a word. The Latin alphabet (English variant) only has 26 letters.

And language is invented by all people, regardless of literacy, education, social class, etc. Whether or not these words are accepted by others, or even last, is another thing entirely. The committee for French in question does post extensive rules in an attempt to keep the language organised (e.g., a French speaker may use the word 'hamburger' for 'biftek haché,' but not 'e-mail' for 'courriel'), but many French speakers simply ignore this, especially in spoken language or in Internet slang.

Lastly, (one of) the reason(s) that English has so many words is its three great influences: the Germanic, Latinate (Latin+French), and Hellenic languages. But it has also been an aspect of the language since its birth. Old English had dozens of synonyms for common terms as well. In addition, due to its being the lingua franca of the modern world, most new scientific terms/names are developed in English.

As to which language has the most words, it all depends on how you define a word. Most people agree that the English language ranks very highly on that list.

Anonymous said...

We seem to forget that "words" such as n00b only exist in a visual form. Language was first spoken; writing came after. Since these new words exist first in a visual form, they exist separately from spoken language, making the zeros significant in the meaning of the word. The difference between a newb and a n00b is a very real difference. Newb is a short form of newbie, which can be used in any real life situation. n00b is a short form for a newbie in an internet-based social network site or video game. Both words convey that the speaker is not a newb/n00b; however, n00b has a greater level of condescension to it.

Anonymous said...

The Finnish base dictionary (maintained by the University of Helsinki) has over one million entries, which is more than any other single language dictionary has. Therefore, one could say that the Finnish language has the largest vocabulary of all the languages.