2.9 Project: Language Detection#

Next, let’s build a simple project that detects the language of a text. Language detection is usually a preprocessing step for filtering texts that will be fed to monolingual models (e.g. a sentiment analysis model).

Language Detection#

Detecting the language of a text can be formalized as a text classification problem, where each language is an output class. As we learned in the previous lessons, let’s first check whether there are suitable pre-trained models on the Hub that we can leverage for our project.

After a quick search, here are some available pre-trained models from the Hub:

  • papluca/xlm-roberta-base-language-detection: This model is a fine-tuned version of xlm-roberta-base on the Language Identification dataset, a collection of 90k samples consisting of text passages and their corresponding language labels. The model supports the following 20 languages: Arabic (ar), Bulgarian (bg), German (de), Modern Greek (el), English (en), Spanish (es), French (fr), Hindi (hi), Italian (it), Japanese (ja), Dutch (nl), Polish (pl), Portuguese (pt), Russian (ru), Swahili (sw), Thai (th), Turkish (tr), Urdu (ur), Vietnamese (vi), and Chinese (zh).

  • huggingface/CodeBERTa-language-id: This model is a fine-tuned version of CodeBERTa-small-v1 for classifying a code sample into the programming language it’s written in (i.e. programming language identification). CodeBERTa is a RoBERTa-like model trained on the CodeSearchNet dataset from GitHub, a collection of code in several programming languages.

Since we are interested in detecting the language of natural language texts, we’ll test the papluca/xlm-roberta-base-language-detection model.

Install and Import Libraries#

Let’s install and import the necessary libraries.

!pip install transformers datasets plotly
from datasets import load_dataset
from transformers import pipeline
import pandas as pd
import plotly.express as px

Get Texts with Different Languages#

We’ll test the model on texts from the Language Identification dataset, a collection of 90k samples consisting of text passages and their corresponding language labels. Let’s load its test split.

# download the language identification dataset
dataset = load_dataset("papluca/language-identification", split="test")
print(dataset)
Dataset({
    features: ['labels', 'text'],
    num_rows: 10000
})

Then, we convert the dataset to a pandas dataframe and keep only the first 100 samples to make computations faster for this project.

# convert dataset to pandas dataframe
df = pd.DataFrame(dataset).drop("labels", axis=1)

# keep only the first 100 texts (copy to avoid a SettingWithCopyWarning
# when we add a column later)
df_subset = df[:100].copy()
df_subset.head()
text
0 Een man zingt en speelt gitaar.
1 De technologisch geplaatste Nasdaq Composite I...
2 Es muy resistente la parte trasera rígida y lo...
3 "In tanti modi diversi, l'abilità artistica de...
4 منحدر يواجه العديد من النقاشات المتجهه إزاء ال...

Using the Language Detection Model#

We can now download the pre-trained model from the Hub. Note that we are setting device=0 in order to use the GPU during inference, thus speeding up the computations. If no GPU is available, remove the device argument (or set device=-1) to run on the CPU.

# download pre-trained language detection model
model = pipeline(
    "text-classification",
    model="papluca/xlm-roberta-base-language-detection",
    device=0
)
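For each input text, the text-classification pipeline returns a dict with a label (the predicted language code) and a score (the model’s confidence). As a minimal sketch with hypothetical outputs for three texts (the real scores come from the model), here is how to extract the codes and filter by confidence:

```python
# hypothetical pipeline outputs, one dict per input text;
# real outputs have the same structure with model-computed scores
predictions = [
    {"label": "nl", "score": 0.9986},
    {"label": "es", "score": 0.9991},
    {"label": "en", "score": 0.6210},
]

# extract the predicted language codes
codes = [p["label"] for p in predictions]

# keep only predictions above a confidence threshold
confident = [p for p in predictions if p["score"] > 0.9]

print(codes)           # ['nl', 'es', 'en']
print(len(confident))  # 2
```

Thresholding like this is useful when low-confidence predictions should be discarded rather than passed downstream.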

Next, we use the model to predict the language of each text and add the predicted language codes to the dataframe.

# detect the language of each article using the model
all_texts = df_subset["text"].values.tolist()
all_langs = model(all_texts)
df_subset["language_label"] = [d["label"] for d in all_langs]
df_subset.head()
text language_label
0 Een man zingt en speelt gitaar. nl
1 De technologisch geplaatste Nasdaq Composite I... nl
2 Es muy resistente la parte trasera rígida y lo... es
3 "In tanti modi diversi, l'abilità artistica de... it
4 منحدر يواجه العديد من النقاشات المتجهه إزاء ال... ar
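Since the original dataset also ships gold labels (the "labels" column we dropped earlier), we could measure how often the model agrees with them. A minimal sketch with hypothetical gold and predicted label lists (in the project, these would come from the dataset and from df_subset["language_label"]):

```python
# hypothetical gold labels and model predictions for five texts
gold = ["nl", "nl", "es", "it", "ar"]
pred = ["nl", "nl", "es", "it", "fr"]

# fraction of texts where the predicted language matches the gold label
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(accuracy)  # 0.8
```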

Let’s see which languages the model predicted most often.

# show the predicted languages
plot_title = "Detected languages"
labels = {
  "x": "Language",
  "y": "Count of texts"
}
fig = px.histogram(df_subset, x="language_label", template="plotly_dark", title=plot_title, labels=labels)
fig.show()
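If you prefer raw counts over a plot, the same tally can be computed with collections.Counter. A minimal sketch on hypothetical predicted codes (in the project, these would be df_subset["language_label"]):

```python
from collections import Counter

# hypothetical predicted language codes
labels = ["nl", "es", "nl", "it", "ar", "nl", "es"]

# count how many texts were assigned to each language
counts = Counter(labels)
print(counts.most_common(2))  # [('nl', 3), ('es', 2)]
```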