1.13 Project: Search Engine over Medium with Embeddings#
In the previous lesson, we saw how to compute the semantic similarity between two sentences using pre-trained embedding models. Now we’ll put that knowledge into practice, upgrading our search engine over Medium articles with sentence embeddings.
Install Libraries#
We’ll need the datasets library to download the Medium articles dataset from the Hugging Face Hub, and the sentence-transformers library, which makes it easy to use sentence embedding models.
!pip install datasets sentence-transformers
Import Libraries#
Let’s import the necessary libraries and functions.
from huggingface_hub import hf_hub_download
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer, util
import torch
Download Dataset#
We download the dataset with the hf_hub_download function and load it as a pandas dataframe. We’ll keep only the first 1,000 articles in this example to make computations faster: producing embeddings with a sentence embedding model (typically a neural network) is rather slow on a CPU, where it takes a few minutes, while it takes only a few seconds on a GPU.
# download dataset of Medium articles from
# https://huggingface.co/datasets/fabiochiu/medium-articles
df_articles = pd.read_csv(
hf_hub_download("fabiochiu/medium-articles", repo_type="dataset", filename="medium_articles.csv")
)
# There are 192,368 articles in total, but let's keep only the first 1,000 to
# make computations faster
df_articles = df_articles[:1000].reset_index(drop=True)
df_articles.head()
|   | title | text | url | authors | timestamp | tags |
|---|-------|------|-----|---------|-----------|------|
| 0 | Brexit isn’t the issue. A hostile and convolut... | With the catastrophe of Brexit comes a cacopho... | https://liambarrett1996.medium.com/brexit-isnt... | ['Liam Barrett'] | 2019-04-05 15:24:38.241000+00:00 | ['UK Politics', 'Opinion', 'Parliament', 'Cult... |
| 1 | The Warm Rays of Benevolence | SPIRITUALITY \| NEWSLETTER\n\nThe Warm Rays of ... | https://medium.com/spiritual-secrets/the-warm-... | ['Darshak Rana'] | 2020-11-03 17:54:57.911000+00:00 | ['Gratitude', 'Spiritual Secrets', 'Letters', ... |
| 2 | A Cult of Personalities Waste Billions | A Cult of Personalities Waste Billions\n\nDelu... | https://medium.com/@maxrottersman/crap-tech-ha... | ['Max Rottersman'] | 2021-09-16 12:53:06.680000+00:00 | ['Spacex', 'Starlink', 'Tesla', 'Elon Musk', '... |
| 3 | Why We Still Need Our Fathers | The BLM movement is indeed different this time... | https://thesecretaspirant.medium.com/why-we-st... | ['The Secret Aspirant'] | 2020-11-12 03:06:33.724000+00:00 | ['American Dad', 'Fathers Day 2020', 'Social J... |
| 4 | Counting Derangements | Hey everyone here is the easiest explanation o... | https://medium.com/@harshittheone007/counting-... | [] | 2021-07-06 18:43:30.646000+00:00 | ['Programming', 'Coding', 'Software Engineerin... |
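As a side note, if you would rather work with a random subset of articles instead of the first 1,000 rows, you can swap the slicing step above for pandas’ sample method. This is a minimal sketch; the random_state value is arbitrary and only makes the sample reproducible.
# alternative to df_articles[:1000]: keep 1,000 random articles instead
# (random_state is arbitrary, it just makes the sample reproducible)
df_articles = df_articles.sample(n=1000, random_state=42).reset_index(drop=True)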
Download the Model#
Let’s use the SentenceTransformer class to download a pre-trained model, such as all-MiniLM-L6-v2, and instantiate the embedder object. You can find the list of available sentence embedding models in the Sentence Transformers documentation.
# download the sentence embeddings model
embedder = SentenceTransformer('all-MiniLM-L6-v2')
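As mentioned earlier, encoding the corpus is much faster on a GPU. If one is available, you can ask SentenceTransformer to use it explicitly. Here is a minimal sketch, assuming the same model name used above; if you don’t have a GPU, it simply falls back to the CPU.
# use a GPU if available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
embedder = SentenceTransformer('all-MiniLM-L6-v2', device=device)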
Embed Documents and Queries, and Compute Cosine Similarity#
Now it’s time to use the embedder.encode method to create an embedding for each article in the dataset. With the all-MiniLM-L6-v2 model, each embedding has 384 dimensions.
# Embed the article texts.
# This step is slow, but it needs to be done only once
corpus = df_articles["text"].values
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)
print(corpus_embeddings.shape)
torch.Size([1000, 384])
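Since encoding the corpus is the slow step, you may want to compute the embeddings once and persist them to disk, so that later sessions can skip the encoding entirely. A minimal sketch using torch.save and torch.load; the file name corpus_embeddings.pt is just an example.
# save the corpus embeddings so they can be reused in later sessions
torch.save(corpus_embeddings, "corpus_embeddings.pt")

# ...later, load them back instead of re-encoding the whole corpus
corpus_embeddings = torch.load("corpus_embeddings.pt")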
Lastly, let’s embed the query as well and find the articles with the highest cosine similarity with respect to the query. We leverage the util.cos_sim function to compute the cosine similarity between vectors, and the torch.topk function to easily find the top_k highest cosine similarities.
# embed the query
query = "data science nlp"
query_embedding = embedder.encode(query, convert_to_tensor=True)
# find the articles with the highest cosine similarity to the query
def show_results(query_embedding, corpus_embeddings, df_articles, top_k=10):
cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
top_results = torch.topk(cos_scores, k=top_k)
position = 1
for score, idx in zip(top_results[0], top_results[1]):
row = df_articles.iloc[idx.item()]
title = row["title"]
print(f"{position} [score = {score}]: {title}")
position += 1
show_results(query_embedding, corpus_embeddings, df_articles)
1 [score = 0.5880753397941589]: spaCy NER Model to Identify Scientific Datasets — Coleridge Initiative
2 [score = 0.4981346130371094]: De-mystifying English text for computers through Natural Language Processing (NLP) in Python
3 [score = 0.41203635931015015]: The Case for Humanitarian AI: Using data to proactively address complex problems
4 [score = 0.3967365622520447]: One-Stop News
5 [score = 0.36300334334373474]: Kaggle Session 4 (Toxic Comments Classification)
6 [score = 0.3227446675300598]: Spreadsheets Revolutionizing Business Analytics
7 [score = 0.3211461305618286]: How to Build the Perfect Dashboard with Power BI
8 [score = 0.30576711893081665]: Processing Data To Improve Machine Learning Models Accuracy
9 [score = 0.30441880226135254]: Sentiment Analysis of a Youtube video (Part 2)
10 [score = 0.29799699783325195]: Predict Car Accident Severity In Seattle with Machine Learning
The results are rather good. Let’s try another query, this time one containing stopwords.
# embed the query
query = "how to learn data science"
query_embedding = embedder.encode(query, convert_to_tensor=True)
show_results(query_embedding, corpus_embeddings, df_articles)
1 [score = 0.4695347845554352]: Here’s why so many data scientists are leaving their jobs
2 [score = 0.3661503791809082]: Forget the 10,000-Hour Rule — You Only Need 20 to Learn a New Skill
3 [score = 0.34999096393585205]: The Best Software Engineering Books I Read in 2020
4 [score = 0.34676799178123474]: Data Analysis on Korean Triage and Acuity Scale
5 [score = 0.3398664593696594]: How To Swap Two Values Without Temporary Variables Using JavaScript
6 [score = 0.3272307217121124]: Top 20 benefits of having an .edu Email Address
7 [score = 0.3194262683391571]: How to Build the Perfect Dashboard with Power BI
8 [score = 0.31751155853271484]: Best Online Computer Courses
9 [score = 0.31078311800956726]: Data Visualization using Pandas, NumPy, and Matplotlib Python Libraries
10 [score = 0.3093498945236206]: PWiC at Women in Analytics, 2019
Again, the results are quite good. Keep in mind that the search engine is working on only 1,000 articles, so there are relatively few candidates to choose from.
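As a side note, the sentence-transformers library also ships a util.semantic_search helper that performs the same rank-by-cosine-similarity step for us, returning for each query the top_k most similar corpus entries as dictionaries with a corpus_id and a score. A minimal sketch, reusing the query_embedding and corpus_embeddings computed above:
# util.semantic_search does no encoding itself: it takes pre-computed embeddings
# and returns, for each query, the top_k most similar corpus entries
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=10)[0]
for position, hit in enumerate(hits, start=1):
    row = df_articles.iloc[hit["corpus_id"]]
    print(f"{position} [score = {hit['score']}]: {row['title']}")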
Questions and Feedback#
Have questions about this lesson? Would you like to exchange ideas? Or would you like to point out something that needs to be corrected? Join the NLPlanet Discord server and interact with the community! There’s a specific channel for this course called practical-nlp-nlplanet.