# 1.10 Project: Search Engine over Medium with TF-IDF

We are now ready to create a new search engine over our Medium articles dataset using TF-IDF. To do so, we leverage the [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) class from `sklearn` that will do most of the work for us.

## Libraries

Let's import the necessary libraries and functions.

In [None]:
from huggingface_hub import hf_hub_download

import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer

## Download Dataset

We can now download the dataset of Medium articles from the Hugging Face Hub.

In [None]:
# download dataset of Medium articles from 
# https://huggingface.co/datasets/fabiochiu/medium-articles
df_articles = pd.read_csv(
  hf_hub_download("fabiochiu/medium-articles", repo_type="dataset", filename="medium_articles.csv")
)

# There are 192,368 articles in total, but let's keep only the first 10,000
# of them to make computations faster
df_articles = df_articles[:10000]

df_articles.head()

In [None]:
df_articles = pd.read_csv("df_articles_10_head.csv")
df_articles.head()

## Using the TfidfVectorizer

Let's create a `TfidfVectorizer` object and call its `fit_transform` method on our corpus. This call learns the vocabulary and the IDF weights from the corpus, then computes the TF-IDF score of each token in every article.

As a result, the `corpus_vectorized` variable is a `scipy` sparse matrix with 10k rows (one per article) and ~110k columns (one per token found in the corpus).

In [None]:
# apply the TfidfVectorizer to the corpus
corpus = df_articles["text"]
vectorizer = TfidfVectorizer()
corpus_vectorized = vectorizer.fit_transform(corpus)
print(corpus_vectorized.shape)

In [None]:
print("(10000, 110038)")

We can then reuse the vectorizer with the `transform` method to compute the TF-IDF values of the tokens in the query.

In [None]:
# vectorize query
query = "data science nlp"
query_vectorized = vectorizer.transform([query])
print(query_vectorized.shape)

In [None]:
print("(1, 110038)")

Now both the query and every article are represented as vectors of TF-IDF scores with the same number of dimensions.
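
One consequence of reusing the fitted vectorizer is that query tokens never seen during fitting have no column and are simply ignored. The sketch below illustrates this on a hypothetical toy corpus (the corpus, the query, and the `toy_*` names are assumptions for illustration only).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "blockchain" never appears in it
toy_corpus = ["data science is fun", "cooking pasta is fun"]
toy_vectorizer = TfidfVectorizer().fit(toy_corpus)

# The query vector has as many columns as the fitted vocabulary
query_vec = toy_vectorizer.transform(["data science blockchain"])
print(query_vec.shape)

# inverse_transform lists the tokens with a non-zero weight in the query;
# the unseen token "blockchain" does not appear among them
known_tokens = sorted(toy_vectorizer.inverse_transform(query_vec)[0])
print(known_tokens)
```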

## Compute Similarities between Queries and Articles

Next, we compute the similarity between the query vector and each article vector by multiplying `query_vectorized` by the transpose of `corpus_vectorized`. The result is an array of 10k elements, where each element is the score of one article.

In [None]:
# compute scores as the dot product between the query vector
# and the document vectors
scores = query_vectorized.dot(corpus_vectorized.transpose())
scores_array = scores.toarray()[0]
print(scores_array.shape)

In [None]:
print("(10000,)")

```{admonition} Computing the similarity between vectors
There are multiple similarity measures to choose from for computing the similarity between two vectors, such as:
- [Dot Product](https://en.wikipedia.org/wiki/Dot_product)
- [Cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity)
- [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance)

Learn more about their differences [here](https://developers.google.com/machine-learning/clustering/similarity/measuring-similarity).
```
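
With the default settings of `TfidfVectorizer` (`norm="l2"`), every row it produces has unit length, so the dot product used above should coincide with the cosine similarity. The following sketch checks this on a hypothetical toy corpus using `cosine_similarity` from `sklearn.metrics.pairwise`; the corpus and variable names are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus, for illustration only
toy_corpus = [
    "data science is fun",
    "nlp is a part of data science",
    "cooking pasta is fun",
]

toy_vectorizer = TfidfVectorizer()  # norm="l2" is the default
docs = toy_vectorizer.fit_transform(toy_corpus)
q = toy_vectorizer.transform(["data science nlp"])

# Same computation as in the lesson: query times transposed corpus
dot_scores = q.dot(docs.transpose()).toarray()[0]

# Cosine similarity computed explicitly
cos_scores = cosine_similarity(q, docs)[0]

# The two arrays match because all vectors are L2-normalized
print(np.allclose(dot_scores, cos_scores))
```

If the vectorizer were created with `norm=None`, the two measures would generally differ, and longer articles would tend to get larger dot-product scores.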

## Show Results

Now we just have to find the indices of `scores_array` with the highest scores, find their corresponding articles in `df_articles` and show them.

In [None]:
# retrieve the top_n articles with the highest scores and show them
def show_best_results(df_articles, scores_array, top_n=10):
  sorted_indices = scores_array.argsort()[::-1]
  for position, idx in enumerate(sorted_indices[:top_n]):
    row = df_articles.iloc[idx]
    title = row["title"]
    score = scores_array[idx]
    print(f"{position + 1} [score = {score}]: {title}")

show_best_results(df_articles, scores_array)

In [None]:
text = """1 [score = 0.5913069114145734]: What in the “Hello World” is Natural Language Processing (NLP)?
2 [score = 0.47487715081627846]: The Story of how Natural Language Processing is changing Financial Services in 2020
3 [score = 0.3672260843689108]: The Application of Natural Language Processing in OpenSearch
4 [score = 0.3483482100035714]: 5 Steps to Become a Data Scientist
5 [score = 0.3413479210936063]: Data Science Scholarships-Full-list Compilations.
6 [score = 0.3139018781861753]: Data science… without any data?!
7 [score = 0.3106738813439215]: Transform your Data Science Projects with these 5 Steps of Design Thinking
8 [score = 0.29735216501354833]: The Top Online Data Science Courses for 2019
9 [score = 0.2820392959961161]: How bad data is weakening the study of big data
10 [score = 0.27787107649790765]: I ranked every Intro to Data Science course on the internet, based on thousands of data points"""

print(text)
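
The vectorize, score, and rank steps can also be wrapped in a single helper. The sketch below is one possible way to do it, not the lesson's own code: the `search` function, the toy DataFrame, and its contents are hypothetical and only illustrate the flow.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy dataset with the same "title" and "text" columns
toy_df = pd.DataFrame({
    "title": ["Intro to NLP", "Pasta recipes", "Data science careers"],
    "text": [
        "nlp is a part of data science",
        "cooking pasta is fun",
        "how to start a career in data science",
    ],
})

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_df["text"])


def search(query, vectorizer, matrix, df, top_n=2):
    """Return the top_n rows of df ranked by TF-IDF dot-product score."""
    query_vec = vectorizer.transform([query])
    scores = query_vec.dot(matrix.transpose()).toarray()[0]
    # indices of the highest scores, best first
    top_idx = scores.argsort()[::-1][:top_n]
    return df.iloc[top_idx].assign(score=scores[top_idx])[["title", "score"]]


results = search("pasta", toy_vectorizer, toy_matrix, toy_df, top_n=1)
print(results)
```

Returning a DataFrame instead of printing makes the results easier to inspect or filter further, at the cost of a slightly longer function.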

The results are reasonable. Let's also try a query that contains some stopwords.

In [None]:
# try a different query
query = "how to learn data science"
query_vectorized = vectorizer.transform([query])
scores = query_vectorized.dot(corpus_vectorized.transpose())
scores_array = scores.toarray()[0]
show_best_results(df_articles, scores_array)

In [None]:
text = """1 [score = 0.5141045309102721]: 5 Steps to Become a Data Scientist
2 [score = 0.48323515081273327]: Data science… without any data?!
3 [score = 0.47352171167560625]: Data Science Scholarships-Full-list Compilations.
4 [score = 0.4716663492490108]: The Top Online Data Science Courses for 2019
5 [score = 0.46151667508531957]: Transform your Data Science Projects with these 5 Steps of Design Thinking
6 [score = 0.4573418924098517]: Roadmap to Becoming a Successful Data Scientist
7 [score = 0.44829250702849616]: Data Science, the Good, the Bad, and the… Future
8 [score = 0.4308820988444495]: Freelance your way into Data Science now
9 [score = 0.4271310483696457]: A Layman’s Guide to Data Science: How to Become a (Good) Data Scientist
10 [score = 0.42437577911736074]: The Difference Between Theory and Theorem and What It Tells Us About Data ‘Science’"""
print(text)

Even though we did not filter out stopwords, TF-IDF gives low weight to words that appear in many articles and high weight to rare words, which produces an effect similar to stopword removal.

# Code Exercises

<button
    class="colab-button"
    onclick="window.open('https://colab.research.google.com/drive/1psAK1koyiQKdCmLSk3eIrj8aHOsjjCcb?usp=sharing','_blank')"
    type="button">
    Go to Notebook
</button>

# Questions and Feedback

Have questions about this lesson? Would you like to exchange ideas? Or would you like to point out something that needs to be corrected? Join the [NLPlanet Discord server](https://discord.gg/zfC862H2dJ) and interact with the community! There's a specific channel for this course called **practical-nlp-nlplanet**.