1.10 Project: Search Engine over Medium with TF-IDF#

We are now ready to create a new search engine over our Medium articles dataset using TF-IDF. To do so, we leverage the TfidfVectorizer class from sklearn, which will do most of the work for us.


Let’s import the necessary libraries and functions.

from huggingface_hub import hf_hub_download

import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer

Download Dataset#

We can now download the dataset of Medium articles from the Hugging Face Hub.

# download dataset of Medium articles from 
# https://huggingface.co/datasets/fabiochiu/medium-articles
df_articles = pd.read_csv(
  hf_hub_download("fabiochiu/medium-articles", repo_type="dataset", filename="medium_articles.csv")
)

# There are 192,368 articles in total, but let's sample 10,000 of them to
# make computations faster
df_articles = df_articles[:10000]

|   | title | text | url | authors | timestamp | tags |
|---|-------|------|-----|---------|-----------|------|
| 0 | Mental Note Vol. 24 | Photo by Josh Riemer on Unsplash\n\nMerry Chri... | https://medium.com/invisible-illness/mental-no... | ['Ryan Fan'] | 2020-12-26 03:38:10.479000+00:00 | ['Mental Health', 'Health', 'Psychology', 'Sci... |
| 1 | Your Brain On Coronavirus | Your Brain On Coronavirus\n\nA guide to the cu... | https://medium.com/age-of-awareness/how-the-pa... | ['Simon Spichak'] | 2020-09-23 22:10:17.126000+00:00 | ['Mental Health', 'Coronavirus', 'Science', 'P... |
| 2 | Mind Your Nose | Mind Your Nose\n\nHow smell training can chang... | https://medium.com/neodotlife/mind-your-nose-f... | [] | 2020-10-10 20:17:37.132000+00:00 | ['Biotechnology', 'Neuroscience', 'Brain', 'We... |
| 3 | The 4 Purposes of Dreams | Passionate about the synergy between science a... | https://medium.com/science-for-real/the-4-purp... | ['Eshan Samaranayake'] | 2020-12-21 16:05:19.524000+00:00 | ['Health', 'Neuroscience', 'Mental Health', 'P... |
| 4 | Surviving a Rod Through the Head | You’ve heard of him, haven’t you? Phineas Gage... | https://medium.com/live-your-life-on-purpose/s... | ['Rishav Sinha'] | 2020-02-26 00:01:01.576000+00:00 | ['Brain', 'Health', 'Development', 'Psychology... |

Using the TfidfVectorizer#

Let’s create a TfidfVectorizer object and call its fit_transform method on our corpus. Fitting the vectorizer learns the vocabulary and the IDF of each token, while transforming computes the TF-IDF score of each token with respect to every article.

As a result, the corpus_vectorized variable is a scipy sparse matrix containing 10k rows (one row for each article) and ~110k columns (one column for each token found in the corpus).

# apply the TfidfVectorizer to the corpus
corpus = df_articles["text"]
vectorizer = TfidfVectorizer()
corpus_vectorized = vectorizer.fit_transform(corpus)
print(corpus_vectorized.shape)
(10000, 110038)

We can then reuse the vectorizer with the transform method to compute the TF-IDF values of the tokens in the query.

# vectorize query
query = "data science nlp"
query_vectorized = vectorizer.transform([query])
print(query_vectorized.shape)
(1, 110038)

Now, both the query and each article have been mapped to vectors of TF-IDF scores with the same dimensions.

Compute Similarities between Queries and Articles#

Next, we compute the similarity between the query vector and each article vector by multiplying query_vectorized with the transpose of corpus_vectorized. The result is an array of 10k elements, where each element is the score of an article.

# compute scores as the dot product between the query vector
# and the documents vectors
scores = query_vectorized.dot(corpus_vectorized.transpose())
scores_array = scores.toarray()[0]

Computing the similarity between vectors

There are multiple similarity measures to choose from for computing the similarity between two vectors, such as the dot product, cosine similarity, and Euclidean distance.

Learn more about their differences here.
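In our case, the dot product and cosine similarity coincide: TfidfVectorizer L2-normalizes its rows by default (norm="l2"), so every document vector already has unit length. The following sketch checks this on a toy corpus (documents invented for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus: with the default norm="l2", each TF-IDF row is unit-length,
# so the dot product between query and document equals their cosine similarity
docs = [
    "data science and nlp",
    "cooking recipes for pasta",
    "machine learning for nlp",
]
vec = TfidfVectorizer()
doc_vecs = vec.fit_transform(docs)
q = vec.transform(["nlp data"])

dot_scores = q.dot(doc_vecs.T).toarray()[0]
cos_scores = cosine_similarity(q, doc_vecs)[0]
print(np.allclose(dot_scores, cos_scores))  # prints True
```

This is why the lesson can rank articles with a plain matrix multiplication instead of calling a dedicated cosine-similarity function.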

Show Results#

Now we just have to find the indices with the highest scores in scores_array, retrieve the corresponding articles from df_articles, and show them.

# retrieve the top_n articles with the highest scores and show them
def show_best_results(df_articles, scores_array, top_n=10):
  sorted_indices = scores_array.argsort()[::-1]
  for position, idx in enumerate(sorted_indices[:top_n]):
    row = df_articles.iloc[idx]
    title = row["title"]
    score = scores_array[idx]
    print(f"{position + 1} [score = {score}]: {title}")

show_best_results(df_articles, scores_array)
1 [score = 0.5913069114145734]: What in the “Hello World” is Natural Language Processing (NLP)?
2 [score = 0.47487715081627846]: The Story of how Natural Language Processing is changing Financial Services in 2020
3 [score = 0.3672260843689108]: The Application of Natural Language Processing in OpenSearch
4 [score = 0.3483482100035714]: 5 Steps to Become a Data Scientist
5 [score = 0.3413479210936063]: Data Science Scholarships-Full-list Compilations.
6 [score = 0.3139018781861753]: Data science… without any data?!
7 [score = 0.3106738813439215]: Transform your Data Science Projects with these 5 Steps of Design Thinking
8 [score = 0.29735216501354833]: The Top Online Data Science Courses for 2019
9 [score = 0.2820392959961161]: How bad data is weakening the study of big data
10 [score = 0.27787107649790765]: I ranked every Intro to Data Science course on the internet, based on thousands of data points

The results are decent, but let’s also try a query containing some stopwords.

# try a different query
query = "how to learn data science"
query_vectorized = vectorizer.transform([query])
scores = query_vectorized.dot(corpus_vectorized.transpose())
scores_array = scores.toarray()[0]
show_best_results(df_articles, scores_array)
1 [score = 0.5141045309102721]: 5 Steps to Become a Data Scientist
2 [score = 0.48323515081273327]: Data science… without any data?!
3 [score = 0.47352171167560625]: Data Science Scholarships-Full-list Compilations.
4 [score = 0.4716663492490108]: The Top Online Data Science Courses for 2019
5 [score = 0.46151667508531957]: Transform your Data Science Projects with these 5 Steps of Design Thinking
6 [score = 0.4573418924098517]: Roadmap to Becoming a Successful Data Scientist
7 [score = 0.44829250702849616]: Data Science, the Good, the Bad, and the… Future
8 [score = 0.4308820988444495]: Freelance your way into Data Science now
9 [score = 0.4271310483696457]: A Layman’s Guide to Data Science: How to Become a (Good) Data Scientist
10 [score = 0.42437577911736074]: The Difference Between Theory and Theorem and What It Tells Us About Data ‘Science’

Even without filtering stopwords, the TF-IDF heuristic gives low importance to common words and high importance to rare words, thus producing a similar effect.
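We can see this downweighting directly in the learned IDF values. Here is a small self-contained sketch on a toy corpus (documents invented for illustration; in the lesson you could look up `vectorizer.idf_` of the fitted vectorizer instead):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: a token that appears in every document ("how") gets the
# minimum IDF, while a rare token ("science") gets a much higher one,
# so common words contribute little to the scores even without stopword removal
docs = [
    "how to cook pasta",
    "how to train a model",
    "how to learn data science",
    "how natural language processing works",
]
vec = TfidfVectorizer()
vec.fit(docs)

vocab = vec.vocabulary_
print(f"idf('how')     = {vec.idf_[vocab['how']]:.3f}")
print(f"idf('science') = {vec.idf_[vocab['science']]:.3f}")
```

With the default smoothed IDF, a token present in every document gets an IDF of exactly 1.0, the lowest possible value.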

Code Exercises#

Questions and Feedback#

Have questions about this lesson? Would you like to exchange ideas? Or would you like to point out something that needs to be corrected? Join the NLPlanet Discord server and interact with the community! There’s a specific channel for this course called practical-nlp-nlplanet.