How to Do Language Detection Using Python, NLTK, and Some Easy Statistics
By Katherine Munro, January 2023


Photo by Etienne Girardet on Unsplash

Ever wondered how Google Translate’s ‘detect language’ feature works? Of course you didn’t, you had better things to do. But I went looking, and couldn’t find the answer (even though I’ve literally written a book on Natural Language Processing (NLP)). It’s Google’s secret sauce. So today, I’ll instead show you a super simple way to do language detection yourself, using one highly underrated NLP tool and some really easy maths. You’ll be adding it to your GitHub portfolio in no time.

What is Language Detection and why is it used?

Language detection just means identifying the language of a piece of input text. It’s a first step for many tasks in Natural Language Processing, including many that you use every day:

  • Spelling and grammar correction (think MS Word, Google Docs, etc)
  • Next word prediction (your phone does this all the time!)
  • Machine translation (e.g. in Google Translate’s ‘detect language’ option)

How can we detect a language?

A simple way to do language identification would be this: build vocabularies (word lists) for different languages, then count how many times each language’s words occur in a text. So if the test text contained five Japanese words and two English ones, we might conclude that it’s Japanese. We could even focus on so-called ‘stop words’: words that occur very frequently, carry little meaning on their own, but are important for the grammar, such as ‘the’, ‘a’ and ‘and’.
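As a rough sketch of that idea (my own illustration, using tiny made-up vocabularies rather than real word lists), the whole approach boils down to counting vocabulary hits per language:

# Toy vocabularies purely for illustration; real ones would contain thousands of words
vocabularies = {
    "english": {"the", "a", "and", "is", "gift"},
    "german": {"der", "die", "das", "und", "ist", "gift"},
}

def guess_by_word_counts(text: str) -> str:
    words = text.lower().split()
    # Count how many words from each language's vocabulary appear in the text
    scores = {lang: sum(1 for w in words if w in vocab) for lang, vocab in vocabularies.items()}
    return max(scores, key=scores.get)

print(guess_by_word_counts("das gift ist stark"))  # Ambiguous words make this approach fragile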

The problem is, many words occur in multiple languages, even if they have different meanings. For example, ‘gift’ means ‘present’ in English, but ‘poison’ in German. So the phrase ‘Das Gift’ could present a problem, especially if there are typos*. Imagine we wanted to say ‘the poison is strong’: Das Gift ist stark. If we forgot the ‘-t’ in ‘ist’, we would have ‘Das Gift is stark’. Since ‘stark’ occurs in both languages, we now really have a problem. And focusing on stopwords could make it even worse. For example, French and German both frequently use ‘des’ and ‘du’, so if we only look at those, we’re going to come unstuck.

*Fun fact: in Natural Language Processing, there are always typos.

An alternative is to concentrate on the distribution of letters, instead of words. For example, compared to English, German uses umlauts (ä, ö, ü), and French uses lots of special characters (ç, â/ê/î/ô/û, à/è/ù, ë/ï/ü). Combinations of 2 and 3 letters, called bi- and trigrams, work even better. That’s because different languages have letter combinations that rarely or never occur in other languages.

Building a Language Detection Model in Python

Our language detection method will use uni-, bi- and tri-grams: that is, individual letters, and combinations of two and three letters. The generic term for such combinations is ‘n-grams’. We will create statistical models for different languages by counting their n-gram frequencies. Then we’ll compare these with the frequencies of n-grams in a test text. The language whose n-gram frequencies best match the test sentence will be our winner.

This approach is based on [1].

Visualise N-Grams

Let’s start by visualising some n-grams of different lengths:

text = "This is a test text"
n = 3 # 1 = Unigram, 2 = bigram, 3 = trigram
text_len = len(text)
num_ngrams = text_len - n + 1 # How many ngrams of length n will fit in this text
print(f"The text is {text_len} characters long and will fit {num_ngrams} n-grams of length {n}.")

for p in range(num_ngrams):
    print(f"{p}: {text[p:p+n]}")

Build an N-Gram Extractor

Let’s define a function extract_xgrams(). It will take a text and a list of numbers, n_vals, and extract n-grams of those lengths from the text:

import typing

def extract_xgrams(text: str, n_vals: typing.List[int]) -> typing.List[str]:
    """
    Extract a list of n-grams of different sizes from a text.
    Params:
        text: the text from which to extract ngrams
        n_vals: the sizes of n-grams to extract
            (e.g. [1, 2, 3] will produce uni-, bi- and tri-grams)
    """
    xgrams = []

    for n in n_vals:
        # If n > len(text) then no ngrams of that size will fit, so skip it
        if n <= len(text):
            for i in range(len(text) - n + 1):
                ng = text[i:i+n]
                xgrams.append(ng)

    return xgrams

text = "I was taught that the way of progress was neither swift nor easy.".lower()
# Quote from Marie Curie, the first woman to win a Nobel Prize, the only woman to win it twice, and the only human to win it in two different sciences.

# Extract all ngrams of size 1 to 3.
xgrams = extract_xgrams(text, n_vals=range(1,4))

print(xgrams)

Note that we lowercase our test text. This reduces the number of n-grams we get back, without losing much information about the language itself (think about it: if I say ‘i went to new york’, you still understand me, even without the capitalisation).
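If you want to see the effect for yourself, here is a quick check of my own (not part of the original tutorial) comparing the number of distinct n-grams extracted with and without lowercasing:

mixed_case = "I was taught that the way of progress was neither swift nor easy."
print(len(set(extract_xgrams(mixed_case, n_vals=range(1, 4)))))          # distinct n-grams, original casing
print(len(set(extract_xgrams(mixed_case.lower(), n_vals=range(1, 4)))))  # distinct n-grams, lowercased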

Define A Function for Building a Language Model

Our build_model() function makes use of collections.Counter. The Counter takes a list, counts all occurrences of each item in the list, and returns a dictionary with each item and its frequency.

So for any language, we can model it by creating a dictionary of n-grams and their probability of occurring in that language. What’s the probability of an n-gram? It’s simply its frequency, divided by the total number of extracted n-grams. Let’s run the code and print the language model, sorted so that the most frequent n-gram comes first:

import collections

def build_model(text: str, n_vals: typing.List[int]) -> typing.Dict[str, float]:
    """
    Build a simple model of the probabilities of xgrams of various lengths in a text
    Params:
        text: the text from which to extract the n_grams
        n_vals: a list of n_gram sizes to extract
    Returns:
        A dictionary of ngrams and their probabilities given the input text
    """
    model = collections.Counter(extract_xgrams(text, n_vals))
    num_ngrams = sum(model.values())

    for ng in model:
        model[ng] = model[ng] / num_ngrams

    return model

test_model = build_model(text, n_vals=range(1,4))
print({k: v for k, v in sorted(test_model.items(), key=lambda item: item[1], reverse=True)})
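As a quick sanity check (my own addition, not in the original code), the probabilities across all n-gram sizes should sum to one, since every count was divided by the total number of extracted n-grams:

assert abs(sum(test_model.values()) - 1.0) < 1e-9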

Install NLTK and Download Our Text Data

The Natural Language Toolkit (NLTK) is a hidden gem for Natural Language Processing. It contains classes and methods for text processing, and a large selection of text corpora (collections of prepared text data) for you to practice with. If you don’t have NLTK installed, you can install it using your preferred method, e.g. pip install nltk (see the installation guide).

For testing our language identifier we will use the Universal Declaration of Human Rights (UDHR), which is included in NLTK in 300 languages. In reality, such a dataset is too small and the text is too clean (the UN presumably proofread their work, and they probably don’t use hashtags and emojis. Those buzzkills). But this dataset is enough to demonstrate the concepts of what we’re trying to do here. Plus, it will introduce you to working with NLTK:

import nltk
nltk.download('udhr') # udhr = Universal Declaration of Human Rights

# Now import corpus and print number of files and the fileids (as these reveal the languages)
from nltk.corpus import udhr
print(f"There are {len(udhr.fileids())} files with the following ids: {udhr.fileids()}")

For simplicity, I’ll choose just a handful of languages to work with. They all use similar characters, so it’ll be a tougher test for our detector. Feel free to add more languages: the code comments will show you how:

languages = ['english', 'german', 'dutch', 'french', 'italian', 'spanish']

# I chose the above sample of languages as they all use similar characters.
language_ids = ['English-Latin1', 'German_Deutsch-Latin1', 'Dutch_Nederlands-Latin1',
                'French_Francais-Latin1', 'Italian_Italiano-Latin1', 'Spanish_Espanol-Latin1']

### Optional: If you want to add more languages:

# First use this function to find the language file id
def retrieve_fileid_by_first_letter(fileids, letter):
    return [id for id in fileids if id.lower().startswith(letter.lower())]

# Example usage
print(f"Fileids beginning with 'R': {retrieve_fileid_by_first_letter(udhr.fileids(), letter='R')}")

# Then copy-paste the language name and language id into the relevant list:
languages += []
language_ids += []

The command udhr.raw(fileids) returns the complete text of the specified fileid(s). We’ll use it to build a dictionary with each language name and its text, and from this dictionary we’ll build a model of each language:

raw_texts = {language: udhr.raw(language_id) for language, language_id in zip(languages, language_ids)}
print(raw_texts['english'][:1000]) # Just print the first 1000 characters

# Build a model of each language
models = {language: build_model(text=raw_texts[language], n_vals=range(1,4)) for language in languages}
print(models['german'])
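To get a feel for the size of these models, you can print the number of distinct n-grams each one contains (a quick check of my own, not from the original tutorial):

for language, model in models.items():
    print(f"{language}: {len(model)} distinct n-grams")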

Determine the language for a given piece of text

We can now take a test text and compare its n-gram frequencies to those of our various language models. The aim will be to see which language has the closest frequencies to our test text.

We do this by calculating the cosine similarity, as per the formula below:

The cosine similarity formula: cos(A, B) = (A · B) / (‖A‖ ‖B‖) = Σᵢ aᵢbᵢ / (√(Σᵢ aᵢ²) √(Σᵢ bᵢ²))

It looks scary, but we won’t go into the math. Basically, cosine similarity is used to compare two numeric vectors. The result will be in the range of −1, meaning exactly opposite, to 1, meaning exactly the same. Our calculate_cosine() function implements the math:

import math

def calculate_cosine(a: typing.Dict[str, float], b: typing.Dict[str, float]) -> float:
    """
    Calculate the cosine between two numeric vectors
    Params:
        a, b: two dictionaries containing items and their corresponding numeric values
            (e.g. ngrams and their corresponding probabilities)
    """
    numerator = sum([a[k] * b[k] for k in a if k in b])
    denominator = math.sqrt(sum([a[k]**2 for k in a])) * math.sqrt(sum([b[k]**2 for k in b]))
    return numerator / denominator
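A quick toy example of my own, for illustration: two identical ‘models’ give a similarity of (floating-point) 1.0, while models with no shared keys give 0.0:

print(calculate_cosine({'a': 0.5, 'b': 0.5}, {'a': 0.5, 'b': 0.5}))  # ~1.0
print(calculate_cosine({'a': 1.0}, {'b': 1.0}))                      # 0.0, no overlapping n-grams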

It’s time to build an identify_language() function. This will take a test text, build a model for it using n-grams of different sizes (specified by n_vals), and compare it to a dictionary of language models. The output will be the name of the language most similar to the test text.

For demonstration purposes, I added a print statement in the function to show the similarity of each language to the test text. You can delete this after you’ve gotten a feeling for what cosine values look like.

Running this function for our original test text, the highest similarity correctly occurs for English:

def identify_language(
    text: str,
    language_models: typing.Dict[str, typing.Dict[str, float]],
    n_vals: typing.List[int]
) -> str:
    """
    Given a text and a dictionary of language models, return the name of the language
    whose ngram probabilities best match those of the test text
    Params:
        text: the text whose language we want to identify
        language_models: a Dict of Dicts, where each key is a language name and
            each value is a dictionary of ngram: probability pairs
        n_vals: a list of n_gram sizes to extract to build a model of the test
            text; these should ideally reflect the n_gram sizes used in 'language_models'
    """
    text_model = build_model(text, n_vals)
    language = ""
    max_c = 0
    for m in language_models:
        c = calculate_cosine(language_models[m], text_model)
        # The following line is just for demonstration, and can be deleted
        print(f'Language: {m}; similarity with test text: {c}')
        if c > max_c:
            max_c = c
            language = m
    return language

print(f"Test text: {text}")
print(f"Identified language: {identify_language(text, models, n_vals=range(1,4))}")

# Prints
# Test text: i was taught that the way of progress was neither swift nor easy.
# Language: english; similarity with test text: 0.7812347488239613
# Language: german; similarity with test text: 0.6638235631734796
# Language: dutch; similarity with test text: 0.6495872103674768
# Language: french; similarity with test text: 0.7073331083503462
# Language: italian; similarity with test text: 0.6635204671187273
# Language: spanish; similarity with test text: 0.6811923819801172
# Identified language: english

Before you delete that print line, look what happens when we test the function with text in a language our models don’t cover: the similarity scores bunch more closely together, and the ‘winner’ is wrong:

# An example text in Slovenian
tricky_text = "učili so me, da pot napredka ni ne hitra ne lahka."
print(f"Identified language: {identify_language(tricky_text, models, n_vals=range(1,4))}")

# Prints
# Language: english; similarity with test text: 0.7287873650203188
# Language: german; similarity with test text: 0.6721847143945305
# Language: dutch; similarity with test text: 0.6794130641102911
# Language: french; similarity with test text: 0.7395592659566902
# Language: italian; similarity with test text: 0.7673665450525412
# Language: spanish; similarity with test text: 0.7588017776235897
# Identified language: italian

Test our language detector on different languages

Let’s see how we go with texts in Dutch, French and Spanish:

t = "mij werd geleerd dat de weg van vooruitgang noch snel noch gemakkelijk is."  
print(identify_language(t, models, n_vals=range(1,4)))

t = "on m'a appris que la voie du progrès n'était ni rapide ni facile."
print(identify_language(t, models, n_vals=range(1,4)))

t = "me enseñaron que el camino hacia el progreso no es ni rápido ni fácil."
print(identify_language(t, models, n_vals=range(1,4)))

The results are correct, except that Italian is output for the second example, instead of French. Merde!

Improve the model

Clearly our models aren’t perfect, but there are lots of ways we could improve them:

Use bigger and more representative data: I admitted earlier that our training texts are too short and clean to realistically reflect language identification in the wild. In fact, they are only a sample of the Declaration, with each text truncated to approximately 1000 characters. You can see this by exploring the number of words and characters in the text for each language:


from nltk.tokenize import word_tokenize  # A function from nltk for splitting strings into individual words
nltk.download('punkt')  # Required for word_tokenize to function

print("Number of characters and words per text per language:")
for language in raw_texts.keys():
    print(f"\n{language}: {len(raw_texts[language])} characters, {len(word_tokenize(raw_texts[language]))} words")

The training texts were good enough to introduce you to NLTK, and to this simple method of language detection, but in order to improve our generalisability, we need to build models using longer, more diverse, real-world text data: typos, hashtags, emojis and all.

I asked ChatGPT for the UDHR in ‘Twitter speak’. Even this is fairly clean and clear; it can get a lot worse. Source: author’s screenshot of a ChatGPT dialogue (OpenAI).

Why do we need longer, more diverse texts? Simple: it’s the only way to capture each language, and what makes it different from other languages.

Take the word ‘gnome’, for example. Unless access to a garden gnome is considered a universal human right, the trigram ‘ gn’ (whitespace, g, n) probably doesn’t feature in our English sample data. You might think that’s ok, because there aren’t many English words beginning with ‘gn’. (There are a few, but they’re rarely used). The problem is, what if this is a common pattern in other languages? There are, in fact, lots of common German words like this, but not one of them occurs in the UDHR (I checked). So if we see a test text with the trigram ‘ gn’, it won’t contribute to the x-gram probabilities we’re summing up, for either language. And that means it won’t help us differentiate between them.

Add more features: We could have added character x-grams of additional lengths, like quadgrams (four letters). Word-based x-grams might also help. The benefit in both cases is the same as with using longer texts: more features help capture differentiating factors between languages. For example, I’m unlikely to say ‘die Marmelade!’, even though I hate the stuff. But some Germans would say this every day at breakfast (‘die’ is just one version of ‘the’). So using word x-grams could capture this difference.

There are some problems with word x-grams though. Most languages have many more words than they have characters in their alphabet, so just adding word bi-grams will explode the number of items in our language models. Tri-grams and larger will only make the matter worse. Bigger models will slow the entire process down, and for very little gain: the majority of these x-gram word combinations will barely ever occur, so they won’t even contribute much to helping differentiate between languages at test time.
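As a rough illustration (my own sketch; extract_word_ngrams is a hypothetical helper, not part of the tutorial), word n-grams can be extracted much like character n-grams, just over a list of tokens instead of a string:

def extract_word_ngrams(text: str, n: int) -> typing.List[str]:
    # Split on whitespace; in practice you might use a proper tokenizer like nltk.word_tokenize
    tokens = text.lower().split()
    return [" ".join(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

print(extract_word_ngrams("die Marmelade ist lecker", n=2))
# ['die marmelade', 'marmelade ist', 'ist lecker']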

A better approach could be to use stopwords, as each language has their own set that occur very frequently and are thus good indicators. I said earlier that modelling languages using only stopwords is risky, as they can appear in multiple languages. But using them as additional features to our character x-grams, or using them as part of word x-grams, tackles this problem.
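NLTK ships stopword lists for many languages, so one way to experiment with this is a sketch of my own (count_stopwords is a hypothetical helper, not the tutorial’s code) that counts how many of a given language’s stopwords appear in the test text and could feed those counts in as extra evidence:

from nltk.corpus import stopwords
nltk.download('stopwords')

def count_stopwords(text: str, language: str) -> int:
    # Count how many tokens in the text appear in the given language's stopword list
    tokens = text.lower().split()
    stops = set(stopwords.words(language))
    return sum(1 for token in tokens if token in stops)

print(count_stopwords("on m'a appris que la voie du progrès", "french"))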

Similarly, we can add the top 1000 or 10,000 words in each language (or use them in word x-grams). The theory behind this is that words tend to follow ‘Zipf’s law’, where the most common word tends to occur about twice as often as the next most common, three times as often as the third most common, and so on. So by taking just the top n words, you can capture probabilities for the majority of words in your input data and — crucially — your test data. And these probabilities are what our language detection decision will be based on.
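Again as a sketch of my own (top_n_words is a hypothetical helper), collections.Counter makes it easy to grab the most frequent words per language from the raw texts we already loaded:

def top_n_words(text: str, n: int = 1000) -> typing.List[str]:
    # Return the n most frequent (lowercased) words in a text
    counts = collections.Counter(text.lower().split())
    return [word for word, _ in counts.most_common(n)]

top_words = {language: top_n_words(raw_texts[language]) for language in languages}
print(top_words['english'][:10])  # The ten most frequent English words in our sample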

Use machine learning: You can’t talk about ‘features’ without thinking of machine learning. There are many algorithms we could try, including some surprisingly simple but effective options like Naive Bayes.

Understanding those algorithms will take an entire new blog post, but the curious can read about Naive Bayes classifiers for language detection here.
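If you want to experiment right away, scikit-learn (assumed to be installed; it isn’t used anywhere in this tutorial) lets you build a character n-gram Naive Bayes classifier in a few lines. This is only a minimal sketch trained on our tiny UDHR samples, so don’t expect production accuracy:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Character uni- to tri-grams as features, mirroring our hand-rolled models
clf = make_pipeline(
    CountVectorizer(analyzer='char', ngram_range=(1, 3)),
    MultinomialNB()
)
clf.fit([raw_texts[language] for language in languages], languages)

print(clf.predict(["on m'a appris que la voie du progrès n'était ni rapide ni facile."]))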

Add more languages: I used just a few, but there are thousands of languages in the world, and they all deserve love. So here’s a challenge to you: add more languages, and see how you go (I even gave you the code already).

Applying this concept to other NLP tasks

The concepts covered in this post can easily be applied to other challenges, and NLTK’s built-in corpora can help you do it. So follow me for future posts, where we’ll cover document classification and speaker identification using — you guessed it — Python, NLTK, and those really simple statistics.

Thanks for reading!

Firstly, cheers to the creators of the tutorial which inspired this piece.

If this article helped you — great! — please subscribe for more content on Natural Language Processing and other data science fundamentals. You can also connect with me on Twitter (where I post loads of interesting content on AI, tech, ethics, and more).

[1] Cavnar, W.B., Trenkle, J.M.: N-gram-based text categorization (1994), Proceedings of SDAIR-94, 3rd annual symposium on document analysis and information retrieval.




