Finding out what language a text is written in
Language identification of written text means finding out what language a text is written in. This has concerned historians, linguists, and anthropologists for ages, but lately it has also come to affect ordinary Internet users, because the number of languages we come across in everyday life has grown enormously with the coming of the Internet. I am personally very interested in the subject and have also implemented a language identifier of my own. More on that further down.

How does language identification work?
Using my own whatlanguageisthis.com as an example
Traditionally, identification of language has been a manual task, where people had to look at the written text and figure out from their own knowledge and sources what language it might be. This often relied on looking up frequently used scripts, letters, and words of particular candidate languages. Of course, for major world languages identification has often been trivial, but for lesser known languages, and especially those that are rarely written or taught, it has always been a major hurdle.
Nowadays computers are frequently employed, and as long as the text in question has been digitized, the language can often be identified by statistical approaches that rely on vast databases of known text material. This is how I implemented my language identifier, What Language Is This?, which is freely available on the web and runs in the web browser.
I used Wikipedia pages as a corpus, since Wikipedia is a large, freely accessible collection of text in known languages. After collecting "sample" pages from Wikipedia, pages that seemed to be high in content and low in noise (that is hard to tell when you not only don't know the language but can't read the writing at all! Still, one is able to make a guess), I ran scripts of my own to analyze the content of these pages and pick out the most frequent characters and word string combinations.
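To make the analysis step concrete, here is a minimal sketch of how such a profile could be built. It is not the actual code behind What Language Is This?; the function name `build_profile`, the choice of three-character strings, and the cutoffs are all assumptions made for illustration.

```python
from collections import Counter

def build_profile(text, n=3, top_chars=20, top_ngrams=50):
    """Return the most frequent letters and n-character strings in a sample text.

    Hypothetical helper: the name, the n-gram length, and the cutoffs are
    illustrative assumptions, not taken from the real identifier.
    """
    text = text.lower()
    # Count letters only, so digits, punctuation, and spaces do not dominate.
    char_counts = Counter(ch for ch in text if ch.isalpha())
    ngram_counts = Counter()
    for word in text.split():
        # Strip non-letters from each word, then slide a window of n characters.
        word = "".join(ch for ch in word if ch.isalpha())
        for i in range(len(word) - n + 1):
            ngram_counts[word[i:i + n]] += 1
    return {
        "chars": [ch for ch, _ in char_counts.most_common(top_chars)],
        "ngrams": [gram for gram, _ in ngram_counts.most_common(top_ngrams)],
    }

# Tiny made-up sample; a real profile would be built from many pages of text.
sample = "the king is singing and the thing is ringing"
profile = build_profile(sample, n=3, top_chars=5, top_ngrams=3)
print(profile["ngrams"][0])  # the most frequent three-letter string in the sample
```

Since `str.isalpha` is Unicode-aware, the same sketch works for text in any script, not just Latin.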
"Characters" in this case can of course be in any script: Latin, Arabic, Cyrillic, Devanagari, Chinese, Hebrew, to name just a few. Frequent word string combinations are frequently used short sequences of characters, such as "ing" and "the" in English, and whatever the equivalents may be in other languages. But since the ending "ing" is common in the other northern European languages as well, it is not "worth" much when identifying a particular language. For that reason, I found that weighting the frequently used strings was not effective. Weighting the frequently used characters, however, I did find effective, and that is used as a "first pass" filter to find candidate languages for the more processing-intensive character string analysis.
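The two passes can be sketched roughly as follows. This is only an illustration under stated assumptions, not the real implementation: the function names, the overlap measure used for the character pass, the shortlist size, and the tiny reference texts are all made up for the example. In keeping with what I found, the character pass uses frequency weights while the string pass just counts matches.

```python
from collections import Counter

def char_distribution(text):
    """Relative frequency of each letter in the text."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    total = len(letters) or 1  # avoid dividing by zero on empty input
    return {ch: count / total for ch, count in Counter(letters).items()}

def char_similarity(dist_a, dist_b):
    """Overlap of two letter distributions: 1.0 is identical, 0.0 is disjoint."""
    return sum(min(p, dist_b.get(ch, 0.0)) for ch, p in dist_a.items())

def trigrams(text):
    """Count three-letter strings inside the words of the text."""
    grams = Counter()
    for word in text.lower().split():
        word = "".join(ch for ch in word if ch.isalpha())
        for i in range(len(word) - 2):
            grams[word[i:i + 3]] += 1
    return grams

def identify(text, references, shortlist=2):
    """Guess the language of text given {language: reference text}."""
    text_chars = char_distribution(text)
    # First pass: cheap, frequency-weighted character comparison.
    ranked = sorted(
        references,
        key=lambda lang: char_similarity(text_chars, char_distribution(references[lang])),
        reverse=True,
    )
    candidates = ranked[:shortlist]
    # Second pass: unweighted count of the text's trigrams found in each candidate.
    text_grams = trigrams(text)

    def string_score(lang):
        known = trigrams(references[lang])
        return sum(count for gram, count in text_grams.items() if gram in known)

    return max(candidates, key=string_score)

# Made-up miniature references; real ones would come from a large corpus.
references = {
    "English": "the quick brown fox jumps over the lazy dog and then it is running through the morning fields",
    "Swedish": "den snabba bruna räven hoppar över den lata hunden och sedan springer den genom fälten på morgonen",
    "Russian": "быстрая коричневая лиса прыгает через ленивую собаку и бежит по полям утром",
}
print(identify("the dog is jumping over the fox", references))
```

The point of the shortlist is that the expensive string comparison only runs for the few languages whose character usage already looks plausible.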
Especially hard to distinguish languages
These languages in particular pose a challenge for language identification:
- Indonesian/Malay: these two languages are essentially one language, but for political purposes the variant spoken in Malaysia and the one spoken in Indonesia are considered separate. There are also differences between them in the written language, so a language identifier has to take these variations into account. The differences are small, but some words are different, and word order and usage differ.
- Bosnian, Croatian, Serbian, and to some degree Slovenian: these languages are very, very similar, and actually form a continuum of different dialects of the same Slavic language. But for political reasons, it is very important to distinguish them correctly. The task is made even more complicated by the fact that there are different written standards corresponding to speech forms within the languages as well. Honestly, distinguishing these is a really tough challenge.
- Scandinavian languages, i.e. Swedish, Danish, and Norwegian: these are very similar. While spoken Swedish and Norwegian are close and mutually intelligible for trained speakers, written Norwegian and Danish are confusingly similar. These are not as hard to distinguish as the ones above, though. At least not for me, since I am a native speaker of one. :-)
- Dutch/Afrikaans: Afrikaans essentially evolved from Dutch, mixing in some influence from English and African languages. While the two look very similar in a statistical analysis, a cleverly written algorithm can distinguish between them easily and with good accuracy.
- Tagalog and Cebuano: these Filipino languages sound different enough when you hear them, but when doing a statistical analysis of their written forms they end up showing confusingly similar patterns. However, as in the case above they can be distinguished quite easily algorithmically.
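As one illustration of how an algorithm can separate close relatives after the statistical passes leave a tie, here is a small sketch based on marker letters. It is an assumption about one possible approach, not a description of how What Language Is This? does it. The `MARKERS` table and the `break_tie` function are hypothetical; the only fact relied on is that Swedish writes ä and ö where Danish writes æ and ø (å is shared and therefore useless as a marker).

```python
# Hypothetical marker table: letters that occur in one language of a
# confusable pair but not in the other.
MARKERS = {
    "Swedish": set("äö"),
    "Danish": set("æø"),
}

def break_tie(text, candidates, markers=MARKERS):
    """Pick the candidate whose marker letters occur most often in the text.

    Returns None when no candidate clearly wins, so the caller can fall
    back to the statistical result.
    """
    text = text.lower()
    counts = {
        lang: sum(text.count(ch) for ch in markers.get(lang, ()))
        for lang in candidates
    }
    best = max(counts.values())
    winners = [lang for lang in candidates if counts[lang] == best]
    return winners[0] if len(winners) == 1 else None

print(break_tie("jag är från Sverige och tycker om öl", ["Swedish", "Danish"]))
print(break_tie("jeg er fra Danmark og kan godt lide øl", ["Swedish", "Danish"]))
```

Marker letters alone would not help with Danish versus Norwegian, since both use æ and ø; a pair like that would need other evidence, such as distinctive words.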
by hefa
Hi! I'm Henrik. I'm from Sweden but moved to Japan five years ago to work in the mobile software industry, and I'm loving it here and plan on staying.