2.9 Project: Language Detection#

Next, let’s build a simple project that detects the language of texts. Language detection is usually done as a preprocessing step to filter the texts that will be provided as input to mono-lingual models (e.g. a sentiment analysis model that works on English only).

Language Detection#

The problem of detecting the language of a text can be formalized as a text classification problem, where each language is a class. As we learned in the previous lessons, the Hugging Face Hub hosts many pre-trained models, so let’s see if there are suitable ones that we can leverage in our project.

After a quick search, here are some available pre-trained models from the Hub:

  • papluca/xlm-roberta-base-language-detection: This model is a fine-tuned version of xlm-roberta-base on the Language Identification dataset, a collection of 90k samples consisting of text passages and their corresponding language labels. The model supports the following 20 languages: arabic (ar), bulgarian (bg), german (de), modern greek (el), english (en), spanish (es), french (fr), hindi (hi), italian (it), japanese (ja), dutch (nl), polish (pl), portuguese (pt), russian (ru), swahili (sw), thai (th), turkish (tr), urdu (ur), vietnamese (vi), and chinese (zh).

  • huggingface/CodeBERTa-language-id: This model is a fine-tuned version of CodeBERTa-small-v1 for classifying a sample of code by the programming language it’s written in (i.e. programming language identification). CodeBERTa is a RoBERTa-like model trained on the CodeSearchNet dataset, a corpus of code collected from GitHub.

Since we are interested in detecting the language of natural language texts, we’ll test the papluca/xlm-roberta-base-language-detection model.
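
As a side note, you can also search the Hub programmatically instead of through the website. Here is a minimal sketch using the huggingface_hub library (an assumption: the library is installed, e.g. with pip install huggingface_hub, and attribute names can vary slightly across versions):

# search the Hub for models matching a query
from huggingface_hub import list_models

models = list_models(search="language detection", limit=5)
for m in models:
    print(m.id)  # model identifier, e.g. "author/model-name"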

Install and Import Libraries#

Let’s install and import the necessary libraries.

!pip install transformers datasets plotly
from datasets import load_dataset
from transformers import pipeline
import pandas as pd
import plotly.express as px

Get Texts with Different Languages#

We’ll test the model on texts from the Language Identification dataset, i.e. a collection of 90k samples consisting of text passages and their corresponding language labels. Let’s load its test split, which the model hasn’t seen during fine-tuning.

# download the language identification dataset
dataset = load_dataset("papluca/language-identification", split="test")
print(dataset)
Dataset({
    features: ['labels', 'text'],
    num_rows: 10000
})

Then, we convert the dataset to a pandas dataframe and keep only the first 100 samples to make computations faster for this project.

# convert the dataset to a pandas dataframe, dropping the ground-truth labels
df = pd.DataFrame(dataset).drop("labels", axis=1)

# keep only the first 100 texts (using a copy avoids pandas' SettingWithCopyWarning later)
df_subset = df[:100].copy()
df_subset.head()
text
0 Een man zingt en speelt gitaar.
1 De technologisch geplaatste Nasdaq Composite I...
2 Es muy resistente la parte trasera rígida y lo...
3 "In tanti modi diversi, l'abilità artistica de...
4 منحدر يواجه العديد من النقاشات المتجهه إزاء ال...

Using the Language Detection Model#

We can now download the pre-trained model from the Hub. Note that we are setting device=0 in order to run inference on the first GPU, thus speeding up the computations. If you don’t have a GPU available, set device=-1 (or omit the argument) to run on the CPU.

# download pre-trained language detection model
model = pipeline(
    "text-classification",
    model="papluca/xlm-roberta-base-language-detection",
    device=0
)
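
If you want the same code to work both with and without a GPU, you can also pick the device dynamically. Here is a small sketch using torch (which transformers already depends on):

import torch

# use the first GPU if one is available, otherwise fall back to the CPU
device = 0 if torch.cuda.is_available() else -1
model = pipeline(
    "text-classification",
    model="papluca/xlm-roberta-base-language-detection",
    device=device
)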

Next, we use the model to predict the language of each text and add the predicted language codes to the dataframe.

# detect the language of each text, truncating long texts to the model's max length
all_texts = df_subset["text"].values.tolist()
all_langs = model(all_texts, truncation=True)
df_subset["language_label"] = [d["label"] for d in all_langs]
df_subset.head()
text language_label
0 Een man zingt en speelt gitaar. nl
1 De technologisch geplaatste Nasdaq Composite I... nl
2 Es muy resistente la parte trasera rígida y lo... es
3 "In tanti modi diversi, l'abilità artistica de... it
4 منحدر يواجه العديد من النقاشات المتجهه إزاء ال... ar
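
Each prediction returned by the pipeline is a dictionary containing both a label and a score, i.e. the model’s confidence in the prediction. If you want to keep the confidence as well, e.g. to spot uncertain predictions, here is a small sketch (the language_score column name is just an illustrative choice):

# keep the model's confidence for each prediction
df_subset["language_score"] = [d["score"] for d in all_langs]

# inspect the predictions the model is least confident about
print(df_subset.sort_values("language_score").head())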

Let’s see which languages were predicted most often by the model.

# show the predicted languages
plot_title = "Detected languages"
labels = {
  "x": "Language",
  "y": "Count of texts"
}
fig = px.histogram(df_subset, x="language_label", template="plotly_dark", title=plot_title, labels=labels)
fig.show()
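
If you just want the counts without a plot, the same information is available with pandas:

# count how many texts were assigned to each language
print(df_subset["language_label"].value_counts())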

To see if the model performed well, let’s check some texts predicted as Italian or as Spanish.

# show up to 10 italian texts
texts_italian = df_subset[df_subset["language_label"] == "it"]

print("Italian texts:")
for text in texts_italian["text"].head(10):
  print(f"- {text}")
Italian texts:
- "In tanti modi diversi, l'abilità artistica dei musicisti neri ha trasmesso l'esperienza dei neri americani nel corso della nostra storia", ha detto Bush.
- Un uomo sta tagliando una patata.
- L'orso panda giaceva sui tronchi.
- Un uomo con una camicia blu.
- Il rendimento dei buoni del Tesoro a 10 anni di riferimento <US10YT=RRR> è sceso al di sotto del 4,20% martedì.
- Un okapi sta mangiando da un albero.
# show up to 10 spanish texts
texts_spanish = df_subset[df_subset["language_label"] == "es"]

print("Spanish texts:")
for text in texts_spanish["text"].head(10):
  print(f"- {text}")
Spanish texts:
- Es muy resistente la parte trasera rígida y los laterales de silicona para evitar arañar el metal. Muy buena
- No funciona lo he devuelto, no hace nada
- Esta correa cumple con lo anunciado, tiene un agarré de goma que facilita la sujeción del perro en caso de que tire. La cuerda tiene puntadas con tejido reflectante que permite ser visto en la noche. Respecto del porta bolsas para excrementos, es el normal, eso sí, es una gran idea la argolla que lleva la correa para sujetar dicho portabolsas.
- Lo compre en una oferta flash mucho más barato de su precio. En líneas generales estamos satisfechos, pero tiene algo que hace inclinarse hacia la derecha constantemente...
- Protege muy bien las columnas de posibles golpes. El adhesivo es fuerte y no se despega. Buena compra, estoy contenta porque cumple su cometido.
- Pues el olor de la colonia genial como siempre ha sido. El envío se retrasó un día y el anuncio del producto cuanto menos confuso por no decir engañoso. 200ml si contamos la colonia y la loción corporal.

Why Use Neural Networks for Language Detection#

The task of language detection seems rather easy: isn’t it enough to count how many of a text’s words occur in each language-specific dictionary and then return the language with the highest count?
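
Something like that can indeed serve as a baseline. Here is a toy sketch of the idea; the mini-dictionaries are made up for illustration and are far too small to be useful in practice:

# toy language detection via dictionary lookups (illustrative mini-dictionaries)
dictionaries = {
    "en": {"the", "is", "and", "man", "a"},
    "es": {"el", "es", "y", "hombre", "un"},
    "it": {"il", "è", "e", "uomo", "un"},
}

def detect_language(text):
    words = text.lower().split()
    # count how many words of the text appear in each language's dictionary
    counts = {lang: sum(w in vocab for w in words) for lang, vocab in dictionaries.items()}
    # return the language with the highest count
    return max(counts, key=counts.get)

print(detect_language("the man is playing a guitar"))  # en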

Looking at the model card of the papluca/xlm-roberta-base-language-detection model, we see that it achieves an accuracy of 99.6% on the test set of the Language Identification dataset. Another common library for language detection, langid, works along these lines but at a lower level: instead of whole-word dictionary matches, it scores byte n-gram features with per-language weights learned with Naive Bayes. The langid library achieves 98.5% accuracy on the same test set.
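
For reference, langid is very easy to try; a minimal sketch, assuming the library is installed (e.g. with pip install langid):

import langid

# returns a (language code, score) tuple
print(langid.classify("Un uomo sta tagliando una patata."))  # e.g. ('it', <score>)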

So, there’s a small improvement in accuracy when using neural networks, with the downside that they are slower and more expensive to run. Whether to use them depends on whether that ~1% improvement in accuracy is important for your specific use case.
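
If you want to get a feeling for the speed on your own hardware, here is a quick-and-dirty timing sketch (results will vary a lot with hardware, batch size, and text length):

import time

# time the transformer pipeline on our 100 texts
start = time.perf_counter()
model(all_texts, truncation=True)
print(f"Elapsed: {time.perf_counter() - start:.2f}s")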


Quiz#

What is Language Detection commonly used for?

  1. To develop a better understanding of the structure of a language.

  2. To detect text that has been plagiarized from another source.

  3. To automatically translate a text from one language to another.

  4. As a preprocessing step before passing texts to mono-lingual models.

How many languages are present in the Language Identification dataset?

  1. 10

  2. 20

  3. 50

  4. 200

What’s the gap in accuracy between deep learning models and other machine learning/statistical models for language detection?

  1. ~1%

  2. ~5%

  3. ~10%

Questions and Feedback#

Have questions about this lesson? Would you like to exchange ideas? Or would you like to point out something that needs to be corrected? Join the NLPlanet Discord server and interact with the community! There’s a specific channel for this course called practical-nlp-nlplanet.