2.11 Project: Multilingual Search and Recsys over Wikipedia#

In this project we’ll put together several concepts that we learned about in the previous lessons:

  • How to leverage embedding models to build semantic search engines and recommender systems.

  • How to use those models over a large number of texts efficiently.

Moreover, as a new challenge, we are going to use a multilingual embedding model to perform semantic search in Italian (or other languages as well).

The Wikipedia Dataset#

On the Hugging Face Hub we can find the Wikipedia dataset, which comprises several gigabytes of Wikipedia articles in different languages.

As mentioned on the dataset card, some subsets of the dataset have already been pre-processed by Hugging Face, and you can load them easily with a few lines of code. There’s also a pre-processed subset of Italian articles, which we’ll use in this project.

Multilingual Embedding Models#

The SBERT library showcases several multilingual models; have a look at its pre-trained models page.

Here are two available models:

  • distiluse-base-multilingual-cased-v1: Multilingual knowledge distilled version of multilingual Universal Sentence Encoder. Supports 15 languages: Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish.

  • distiluse-base-multilingual-cased-v2: Multilingual knowledge distilled version of multilingual Universal Sentence Encoder. This version supports 50+ languages, but performs a bit weaker than the v1 model.

In our project we are going to use the v1 model.
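
To see what this means in practice, here is a minimal sketch (the English/Italian sentence pair is a made-up example) showing that a sentence and its translation are mapped to nearby points in the shared embedding space, which is exactly what makes cross-language search possible. It assumes a recent sentence-transformers version that provides util.cos_sim.

# quick check: a sentence and its translation get very similar embeddings
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("distiluse-base-multilingual-cased-v1")

# made-up English/Italian translation pair
sentences = ["The cat sleeps on the sofa.", "Il gatto dorme sul divano."]
embeddings = model.encode(sentences)

# cosine similarity should be close to 1 for a good translation pair
print(util.cos_sim(embeddings[0], embeddings[1]))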

Coding with Python#

We are now ready to write the code for the semantic search engine and the recommender system.

Install and Import Libraries#

Let’s install the necessary libraries. The apache_beam and mwparserfromhell libraries are required to load the Wikipedia dataset, as explained in its dataset card. Apache Beam is an open-source SDK for defining and executing data processing workflows, whereas mwparserfromhell (a.k.a. the MediaWiki Parser from Hell) is a Python package that provides an easy-to-use parser for MediaWiki wikicode.

pip install datasets apache_beam mwparserfromhell faiss-cpu sentence-transformers
from datasets import load_dataset
import pandas as pd
from sentence_transformers import SentenceTransformer
import faiss

Download and Prepare Dataset#

Let’s load the Italian subset of the Wikipedia dataset, according to the instructions in the dataset card.

# download the Italian subset of the Wikipedia dataset
dataset = load_dataset("wikipedia", "20220301.it", split="train")
print(dataset)
Dataset({
    features: ['id', 'url', 'title', 'text'],
    num_rows: 1743035
})

The dataset contains more than 1.7M articles. To make computations faster in this project, let’s keep only a random subset of 10k articles.

# keep a random subset of 10k articles to make computations faster
dataset_subset = dataset.train_test_split(train_size=10000)["train"]
print(dataset_subset)
Dataset({
    features: ['id', 'url', 'title', 'text'],
    num_rows: 10000
})

Then we convert the dataset to a pandas dataframe.

# convert dataset to pandas dataframe
df = pd.DataFrame(dataset_subset)
df.head()
   id       url                                                 title                text
0  861695   https://it.wikipedia.org/wiki/049                   049                  049 – codice che designa l'Osservatorio astron...
1  8403470  https://it.wikipedia.org/wiki/Pietro%20Neri-Ba...   Pietro Neri-Baraldi  \n\nBiografia\nNacque a Minerbio, una piccola ...
2  1171573  https://it.wikipedia.org/wiki/Obergr%C3%B6ningen    Obergröningen        Obergröningen è un comune tedesco di 460 abita...
3  174512   https://it.wikipedia.org/wiki/Saint-%C3%89lix-...   Saint-Élix-Theux     Saint-Élix-Theux è un comune francese di 117 a...
4  543890   https://it.wikipedia.org/wiki/Catch%20Thirtythree   Catch Thirtythree    Catch Thirtythree è il quinto album in studio ...

The only preprocessing we need to do is to concatenate the title and body content of each article into a new column that we call full_text.

# join article title and text into a single column
df["full_text"] = df["title"] + ". " + df["text"]

Create Article Embeddings#

Then, we download the distiluse-base-multilingual-cased-v1 pre-trained embedding model using the sentence-transformers library.

# download the sentence embeddings model
embedder = SentenceTransformer('distiluse-base-multilingual-cased-v1')

Next, we use the model to compute the embeddings of the Wikipedia articles. This step may take some time to execute, but it needs to be performed only once. The resulting embeddings have 512 dimensions.

# embed article texts
corpus_embeddings = embedder.encode(df["full_text"].values)
print(corpus_embeddings.shape)
(10000, 512)

Create Faiss Index#

Let’s create an IndexIVFFlat index with faiss and train it on the article embeddings. This step builds a space-partitioning data structure that makes computing nearest neighbors more efficient.

# create faiss index
n_cells = 1000
num_dimensions = corpus_embeddings.shape[1]
quantizer = faiss.IndexFlatL2(num_dimensions)
index = faiss.IndexIVFFlat(quantizer, num_dimensions, n_cells)
index.train(corpus_embeddings)
index.add(corpus_embeddings)
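
At search time, an IndexIVFFlat index only visits the few cells whose centroids are closest to the query, controlled by its nprobe attribute (which defaults to 1). As a rough sketch of the speed/recall trade-off you can tune:

# visit more cells per query: slower, but results get closer to exact search
index.nprobe = 10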

Recommender System#

Let’s now write the code for a simple content-based recommender system: we’ll pick an article and suggest articles that are semantically similar to it. In our example we take the article at index 10 of the dataframe, whose title is “Salame di cioccolato” (“Chocolate Salami” in English).

# choose an article
article_row = df.iloc[10] # Salame di Cioccolato
print(article_row["title"])

We then embed its text and look for similar embeddings.
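
A minimal sketch of this retrieval step, reusing the embedder model and the faiss index built earlier (the indexes array is what the result-printing code below iterates over):

# embed the chosen article and retrieve its 5 nearest neighbors (including itself)
query_embedding = embedder.encode([article_row["full_text"]])
distances, indexes = index.search(query_embedding, 5)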

# show results
relevant_rows = df.iloc[indexes[0]]
for i, row in relevant_rows.iterrows():
    print(f"- {row['title']}")
- Salame di cioccolato
- Stroopwafel
- Smörgåstårta
- Cjarsons
- Sklandrausis

All the results are articles about similar traditional dishes!
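
The semantic search engine promised at the beginning of the lesson works the same way: we embed a free-text query with the multilingual model and look up its nearest neighbors in the same index. Here is a minimal sketch with a made-up Italian query string:

# semantic search: embed a free-text query and retrieve the closest articles
query = "dolci tipici italiani"  # hypothetical query, "typical Italian desserts"
query_embedding = embedder.encode([query])
distances, indexes = index.search(query_embedding, 5)
for i, row in df.iloc[indexes[0]].iterrows():
    print(f"- {row['title']}")

Thanks to the multilingual model, an equivalent query written in another supported language would retrieve similar results from the same index.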

Quiz#

What are models that are able to work with text in different languages called?

  1. Polyglot models

  2. Machine translation

  3. Cross-lingual models

  4. Multilingual models

  5. Universal language models

What is Apache Beam?

  1. A cloud computing platform for big data analytics

  2. A distributed processing system for data-parallel applications

  3. A machine learning algorithm for text classification

  4. A natural language processing library

  5. An open-source SDK for defining and executing data processing workflows

True or False. With multilingual models, the distance between the embeddings of a sentence and its translation in another language is close to zero.

Questions and Feedback#

Have questions about this lesson? Would you like to exchange ideas? Or would you like to point out something that needs to be corrected? Join the NLPlanet Discord server and interact with the community! There’s a specific channel for this course called practical-nlp-nlplanet.