Language Identification

Finding out what language a text is written in

Language identification of written text means finding out what language the text is written in. This has concerned historians, linguists, and anthropologists for ages, but lately it has also come to affect ordinary Internet users, because the number of languages we come across in everyday life has grown enormously with the coming of the Internet. I am personally very interested in the subject and have also implemented a language identifier of my own. More on that further down.

How does language identification work?

Using my own whatlanguageisthis.com as an example

Traditionally, identification of language has been a manual task, where people had to look at the written text and figure out from their own knowledge and sources what language it might be. This often relied on looking up the scripts, letters, and frequently used words of particular candidate languages. Of course, for major world languages identification has often been trivial, but for lesser-known languages, and especially those that are rarely written or taught, it has always been a major hurdle.

Nowadays computers are frequently employed, and as long as the text source in question has been digitized, the language can often be identified by statistical approaches and by relying on vast databases of known text material. This is how I implemented my language identifier, What Language Is This?, which is available freely on the web and runs in the web browser.

I used Wikipedia pages as a corpus, since Wikipedia is a large, freely accessible collection of texts in known languages. First I collected "sample" pages from Wikipedia, choosing pages that seemed to be high in content and low in noise. (That is hard to judge when you not only don't know the language but can't read the writing at all! Still, one is able to make a guess.) I then ran scripts of my own to analyze the content of these pages and pick out the most frequent characters and character string combinations.
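As a rough illustration of this kind of analysis, here is a minimal sketch in Python. It is not the actual script behind What Language Is This?; the function name `build_profile`, the n-gram length of 3, and the cut-off values are all assumptions made for the example.

```python
from collections import Counter


def build_profile(text, n=3, top_chars=20, top_ngrams=50):
    """Return the most frequent letters and character n-grams in a text.

    Hypothetical helper: n, top_chars and top_ngrams are illustrative
    defaults, not values taken from the original identifier.
    """
    # Lowercase the text and turn anything that is not a letter into a
    # space, so digits and punctuation do not skew the counts.
    cleaned = "".join(ch.lower() if ch.isalpha() else " " for ch in text)
    words = cleaned.split()

    # Count single letters across all words.
    char_counts = Counter(ch for word in words for ch in word)

    # Count n-grams inside each word (no n-grams across word boundaries).
    ngram_counts = Counter(
        word[i:i + n] for word in words for i in range(len(word) - n + 1)
    )

    return {
        "chars": [c for c, _ in char_counts.most_common(top_chars)],
        "ngrams": [g for g, _ in ngram_counts.most_common(top_ngrams)],
    }


profile = build_profile("the thing is that the king sings")
print(profile["chars"][:3])
print(profile["ngrams"][:3])
```

In practice the input would be the text of many sample pages per language, and one such profile would be stored for each language.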

"Characters" in this case can of course be in any script: Latin, Arabic, Cyrillic, Devanagari, Chinese, or Hebrew, to name just a few. "Frequent character string combinations" refers to frequently used short sequences of characters, such as "ing" and "the" in English, and their counterparts in other languages. But since the ending "ing" is common in the other northern European languages as well, it is not "worth" much when identifying a particular language. For that reason, I found that weighting the frequently used strings was not effective. Weighting the frequently used characters, however, I did find effective, and that is used as a "first pass" filter to find candidate languages for the more processing-intensive character string analysis.
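The two-pass idea can be sketched as follows. This is a simplified illustration, not the scoring that What Language Is This? actually uses: the three profiles are tiny hand-made examples rather than corpus-derived data, both passes use plain unweighted overlap, and the names `PROFILES`, `char_score`, `ngram_score`, and `identify` are hypothetical.

```python
# Hypothetical, hand-made profiles for illustration only. Real profiles
# would be derived from a corpus and would be much larger.
PROFILES = {
    "english": {
        "chars": set("etaoinshr"),
        "ngrams": {"the", "ing", "and", "ion", "ent"},
    },
    "swedish": {
        "chars": set("eantrsildoäåö"),
        "ngrams": {"och", "att", "för", "ing", "det", "som"},
    },
    "russian": {
        "chars": set("оеаинтсрвл"),
        "ngrams": {"ого", "ени", "ост", "что", "про"},
    },
}


def char_score(text, chars):
    """Fraction of the letters in text that belong to the profile's set."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch in chars for ch in letters) / len(letters)


def ngram_score(text, ngrams, n=3):
    """Fraction of the text's character n-grams found in the profile."""
    cleaned = "".join(ch.lower() if ch.isalpha() else " " for ch in text)
    grams = [
        word[i:i + n]
        for word in cleaned.split()
        for i in range(len(word) - n + 1)
    ]
    if not grams:
        return 0.0
    return sum(g in ngrams for g in grams) / len(grams)


def identify(text, profiles=PROFILES, keep=2):
    """Guess the language of text with a cheap filter, then a finer pass."""
    # First pass: rank all languages by character overlap, keep the best few.
    ranked = sorted(
        profiles,
        key=lambda lang: char_score(text, profiles[lang]["chars"]),
        reverse=True,
    )
    candidates = ranked[:keep]
    # Second pass: the costlier n-gram comparison, on the candidates only.
    return max(
        candidates,
        key=lambda lang: ngram_score(text, profiles[lang]["ngrams"]),
    )


print(identify("the king is singing and the queen is listening"))
print(identify("det är en bok som jag läser och tycker om"))
print(identify("что это за язык"))
```

The point of the first pass is cost: comparing characters is cheap and immediately rules out languages written in a different script, so the more expensive string comparison only runs on a handful of candidates.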

Languages that are especially hard to distinguish

These languages in particular pose a challenge for language identification

  • Indonesian/Malay: these two languages are essentially one language, but for political purposes the variant spoken in Malaysia and the one spoken in Indonesia are considered separate. There are differences between them in the written language, so a language identifier has to take these variations into account. The differences are small, but some words are different, and word order and usage differ.
  • Bosnian, Croatian, Serbian, and to some degree Slovenian: these languages are very, very similar, and actually form a continuum of different dialects of the same Slavic language. But, for political reasons, it is very important to distinguish these correctly. The task gets even more complicated by the fact that there are different written standards corresponding to speech forms within the languages as well. Honestly, distinguishing these is a really tough challenge.
  • Scandinavian languages, i.e. Swedish, Danish, and Norwegian, are very similar. Spoken Swedish and Norwegian are close and mutually intelligible for trained speakers, while written Norwegian and Danish are confusingly similar. These are not as hard to distinguish as the above though. At least not for me, since I am a native speaker of one. :-)
  • Dutch/Afrikaans: Afrikaans essentially evolved from Dutch, with some influence from English and African languages mixed in. While the two look very similar in a statistical analysis, a cleverly written algorithm can distinguish between them easily and with good accuracy.
  • Tagalog and Cebuano: these Philippine languages sound different enough when you hear them, but a statistical analysis of their written forms shows confusingly similar patterns. However, as in the case above, they can be distinguished quite easily algorithmically.

Comments, questions...?

Please give me a shout if you're interested in language identification too

by

hefa

Hi! I'm Henrik. I'm from Sweden but moved to Japan five years ago to work in the mobile software industry, and I'm loving it here and plan on staying.

The rise and fall of languages and cultures 

spanning over five millennia and six continents

Empires of the Word: A Language History of the World

This book deals with the rise, spread, and eventual fall of the world's languages (and their associated empires), spanning more than five thousand years of history in the process, beginning in Sumer and ending with our present-day English. The author is open-minded in his reasoning and doesn't stick to just his own ideas of why and how languages spread, and the style is easy-going with plenty of entertaining anecdotes. I could not recommend a book more highly than I recommend this one.

Exploring the way languages change  

and building a complex model of language relationships

The Power of Babel: A Natural History of Language

This book discusses the development and spread of language throughout the world, but it also gets down to the details of specific languages with ample examples. I think this mix is good. However, while the author is open-minded and doesn't preach his own views, he does sometimes take sides, as in insisting that there is no proto-world language, and he strongly favors spoken language over written language in his discussions. I'm alright with that though.

A comprehensive guide to the world's languages  

this book is the very definition of extensiveness...

The World's Major Languages

This book has been a tremendous asset for me when developing the online language identifier What Language Is This?. It contains an impressive overview of all the world's language families and the major languages belonging to them. The best point of this book is that each chapter is written by an expert in its field. So it's not one person's view of all of the languages of the world, but rather a comprehensive collection of professional research.