1.13 Project: Search Engine over Medium with Embeddings#

In the previous lesson, we saw how to compute the semantic similarity between two sentences using pre-trained embedding models. Now we’ll put this knowledge into practice, upgrading our search engine over Medium articles with sentence embeddings.

Install Libraries#

We’ll need the datasets library (and its huggingface_hub dependency) to download the Medium articles dataset from the Hugging Face Hub, and the sentence-transformers library, which makes it easy to use sentence embedding models.

!pip install datasets sentence-transformers

Import Libraries#

Let’s import the necessary libraries and functions.

from huggingface_hub import hf_hub_download

import pandas as pd
import numpy as np

from sentence_transformers import SentenceTransformer, util
import torch

Download Dataset#

We download the dataset leveraging the hf_hub_download function and load it as a pandas dataframe. We’ll keep only the first 1,000 articles in this example to make computations faster. Indeed, producing embeddings with a sentence embedding model (typically a neural network) is rather slow on a CPU, where it can take a few minutes, while on a GPU it takes only a few seconds.

# download dataset of Medium articles from 
# https://huggingface.co/datasets/fabiochiu/medium-articles
df_articles = pd.read_csv(
  hf_hub_download("fabiochiu/medium-articles", repo_type="dataset", filename="medium_articles.csv")
)

# There are 192,368 articles in total, but let's keep only the first 1,000 to
# make computations faster
df_articles = df_articles[:1000].reset_index(drop=True)

df_articles.head()
title text url authors timestamp tags
0 Brexit isn’t the issue. A hostile and convolut... With the catastrophe of Brexit comes a cacopho... https://liambarrett1996.medium.com/brexit-isnt... ['Liam Barrett'] 2019-04-05 15:24:38.241000+00:00 ['UK Politics', 'Opinion', 'Parliament', 'Cult...
1 The Warm Rays of Benevolence SPIRITUALITY | NEWSLETTER\n\nThe Warm Rays of ... https://medium.com/spiritual-secrets/the-warm-... ['Darshak Rana'] 2020-11-03 17:54:57.911000+00:00 ['Gratitude', 'Spiritual Secrets', 'Letters', ...
2 A Cult of Personalities Waste Billions A Cult of Personalities Waste Billions\n\nDelu... https://medium.com/@maxrottersman/crap-tech-ha... ['Max Rottersman'] 2021-09-16 12:53:06.680000+00:00 ['Spacex', 'Starlink', 'Tesla', 'Elon Musk', '...
3 Why We Still Need Our Fathers The BLM movement is indeed different this time... https://thesecretaspirant.medium.com/why-we-st... ['The Secret Aspirant'] 2020-11-12 03:06:33.724000+00:00 ['American Dad', 'Fathers Day 2020', 'Social J...
4 Counting Derangements Hey everyone here is the easiest explanation o... https://medium.com/@harshittheone007/counting-... [] 2021-07-06 18:43:30.646000+00:00 ['Programming', 'Coding', 'Software Engineerin...

Download the Model#

Let’s use the SentenceTransformer class to download a pre-trained model, such as all-MiniLM-L6-v2, and instantiate the embedder object. The list of available sentence embedding models can be found in the Sentence Transformers documentation.

# download the sentence embeddings model
embedder = SentenceTransformer('all-MiniLM-L6-v2')
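As a quick check, we can embed a single example sentence and look at its dimensionality; get_sentence_embedding_dimension reports the size of the vectors the model produces (the example sentence below is arbitrary, just for illustration).

# quick check: embed one sentence and inspect the embedding size
example_embedding = embedder.encode("This is an example sentence")
print(example_embedding.shape)  # (384,)
print(embedder.get_sentence_embedding_dimension())  # 384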

Embed Documents and Queries, and Compute Cosine Similarity#

Now it’s time to use the embedder.encode method to create an embedding for each article in the dataset. Using the all-MiniLM-L6-v2 model, each embedding has 384 dimensions.

# Embed article texts.
# It's slow, but it must be done only once
corpus = df_articles["text"].values
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)
print(corpus_embeddings.shape)
torch.Size([1000, 384])
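Since encoding the corpus is the slow part, it can be convenient to save the embeddings to disk and reload them in later sessions instead of recomputing them. Here is a minimal sketch using torch.save and torch.load (the file name corpus_embeddings.pt is arbitrary).

# cache the corpus embeddings so they don't have to be recomputed every time
torch.save(corpus_embeddings, "corpus_embeddings.pt")

# ...in a later session, reload them instead of calling embedder.encode again
corpus_embeddings = torch.load("corpus_embeddings.pt")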

Lastly, let’s embed the query as well and find the articles with the highest cosine similarity with respect to the query. We leverage the util.cos_sim function to compute the cosine similarity between vectors, and the torch.topk function to easily find the top_k highest cosine similarities.
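Recall that cosine similarity is simply the dot product of two vectors divided by the product of their norms, which is what util.cos_sim computes. A quick sanity check with two toy vectors (values made up purely for illustration):

# cosine similarity by hand: dot product divided by the product of the norms
a = torch.tensor([1.0, 2.0, 3.0])  # toy vectors, not real embeddings
b = torch.tensor([2.0, 1.0, 0.5])
print(torch.dot(a, b) / (torch.norm(a) * torch.norm(b)))  # manual cosine similarity
print(util.cos_sim(a, b))  # same value, returned as a 1x1 tensor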

# embed the query
query = "data science nlp"
query_embedding = embedder.encode(query, convert_to_tensor=True)

# print the top_k articles with the highest cosine similarity to the query
def show_results(query_embedding, corpus_embeddings, df_articles, top_k=10):
  cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
  top_results = torch.topk(cos_scores, k=top_k)
  for position, (score, idx) in enumerate(zip(top_results.values, top_results.indices), start=1):
    title = df_articles.iloc[idx.item()]["title"]
    print(f"{position} [score = {score}]: {title}")

show_results(query_embedding, corpus_embeddings, df_articles)
1 [score = 0.5880753397941589]: spaCy NER Model to Identify Scientific Datasets — Coleridge Initiative
2 [score = 0.4981346130371094]: De-mystifying English text for computers through Natural Language Processing (NLP) in Python
3 [score = 0.41203635931015015]: The Case for Humanitarian AI: Using data to proactively address complex problems
4 [score = 0.3967365622520447]: One-Stop News
5 [score = 0.36300334334373474]: Kaggle Session 4 (Toxic Comments Classification)
6 [score = 0.3227446675300598]: Spreadsheets Revolutionizing Business Analytics
7 [score = 0.3211461305618286]: How to Build the Perfect Dashboard with Power BI
8 [score = 0.30576711893081665]: Processing Data To Improve Machine Learning Models Accuracy
9 [score = 0.30441880226135254]: Sentiment Analysis of a Youtube video (Part 2)
10 [score = 0.29799699783325195]: Predict Car Accident Severity In Seattle with Machine Learning

The results look reasonable. Let’s now try another query, this time one that contains stopwords.

# embed the query
query = "how to learn data science"
query_embedding = embedder.encode(query, convert_to_tensor=True)
show_results(query_embedding, corpus_embeddings, df_articles)
1 [score = 0.4695347845554352]: Here’s why so many data scientists are leaving their jobs
2 [score = 0.3661503791809082]: Forget the 10,000-Hour Rule — You Only Need 20 to Learn a New Skill
3 [score = 0.34999096393585205]: The Best Software Engineering Books I Read in 2020
4 [score = 0.34676799178123474]: Data Analysis on Korean Triage and Acuity Scale
5 [score = 0.3398664593696594]: How To Swap Two Values Without Temporary Variables Using JavaScript
6 [score = 0.3272307217121124]: Top 20 benefits of having an .edu Email Address
7 [score = 0.3194262683391571]: How to Build the Perfect Dashboard with Power BI
8 [score = 0.31751155853271484]: Best Online Computer Courses
9 [score = 0.31078311800956726]: Data Visualization using Pandas, NumPy, and Matplotlib Python Libraries
10 [score = 0.3093498945236206]: PWiC at Women in Analytics, 2019

Again, the results are reasonable. Remember that the search engine is working with only 1,000 articles, so there are few relevant candidates to choose from.
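As a side note, the sentence-transformers library also provides the util.semantic_search helper, which performs the cosine similarity computation and the top-k selection in a single call; the sketch below should produce the same ranking as our show_results function (each hit is a dictionary with corpus_id and score keys).

# alternative: let the library compute cosine similarities and pick the top-k hits
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=10)[0]
for position, hit in enumerate(hits, start=1):
  title = df_articles.iloc[hit["corpus_id"]]["title"]
  print(f"{position} [score = {hit['score']}]: {title}")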

Sentence Embeddings and Stopwords

Sentence embedding models handle stopwords automatically because they are trained to produce similar representations for semantically similar sentences. As a consequence, the model learns that stopwords, which appear in most training sentences, have little influence on the final embedding.
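We can check this empirically by embedding the same query with and without stopwords and comparing the two vectors; the cosine similarity should come out high (the exact value depends on the model, so treat this as an illustrative sketch).

# compare a query with stopwords against the same query with the stopwords removed
with_stopwords = embedder.encode("how to learn data science", convert_to_tensor=True)
without_stopwords = embedder.encode("learn data science", convert_to_tensor=True)
print(util.cos_sim(with_stopwords, without_stopwords))  # expected to be close to 1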

Code Exercises#

Questions and Feedback#

Have questions about this lesson? Would you like to exchange ideas? Or would you like to point out something that needs to be corrected? Join the NLPlanet Discord server and interact with the community! There’s a specific channel for this course called practical-nlp-nlplanet.