1.7 Project: Search Engine over Medium with Bag of Words#

Let’s do another project in which we build a search engine over Medium articles leveraging their bag of words representation.

Our search engine will be simple: it returns the articles with the most token matches between their content and the query. We’ll then see how to improve it by filtering out stopwords.
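
To make the scoring idea concrete before diving in, here is a tiny toy sketch (with a made-up one-sentence “article”, not part of the project code):

from collections import Counter

# toy example: one tiny "article" and a two-token query
article_tokens = Counter("data science is fun and data is everywhere".split())
query = ["data", "science"]

# the score is the sum of the occurrences of each query token in the article
score = sum(article_tokens[token] for token in query)
print(score)  # 3: "data" appears twice, "science" once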

Libraries#

First, let’s install the datasets library, which allows for easy access to datasets on the Hugging Face Hub.

!pip install datasets

Let’s then import the necessary libraries:

  • The hf_hub_download function from huggingface_hub to download our dataset.

  • The word_tokenize function from nltk.tokenize for tokenization.

  • pandas, numpy, and the Counter class, which is a Python class specialized in counting objects. Note that we could use the CountVectorizer class from sklearn to do the same job (a sketch of this alternative is shown after the token counting below).

from huggingface_hub import hf_hub_download

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

from collections import Counter
import numpy as np
import pandas as pd

Download the Dataset#

We can now download the dataset from the Hugging Face Hub. Since the dataset contains ~190k articles, let’s keep only the first 10k to make computations faster.

# download dataset of Medium articles from 
# https://huggingface.co/datasets/fabiochiu/medium-articles
df_articles = pd.read_csv(
  hf_hub_download("fabiochiu/medium-articles", repo_type="dataset",
                  filename="medium_articles.csv")
)

# There are 192,368 articles in total, but let's keep only the first 10,000 to
# make computations faster
df_articles = df_articles[:10000]

df_articles.head()
|   | title | text | url | authors | timestamp | tags |
|---|-------|------|-----|---------|-----------|------|
| 0 | Mental Note Vol. 24 | Photo by Josh Riemer on Unsplash\n\nMerry Chri... | https://medium.com/invisible-illness/mental-no... | ['Ryan Fan'] | 2020-12-26 03:38:10.479000+00:00 | ['Mental Health', 'Health', 'Psychology', 'Sci... |
| 1 | Your Brain On Coronavirus | Your Brain On Coronavirus\n\nA guide to the cu... | https://medium.com/age-of-awareness/how-the-pa... | ['Simon Spichak'] | 2020-09-23 22:10:17.126000+00:00 | ['Mental Health', 'Coronavirus', 'Science', 'P... |
| 2 | Mind Your Nose | Mind Your Nose\n\nHow smell training can chang... | https://medium.com/neodotlife/mind-your-nose-f... | [] | 2020-10-10 20:17:37.132000+00:00 | ['Biotechnology', 'Neuroscience', 'Brain', 'We... |
| 3 | The 4 Purposes of Dreams | Passionate about the synergy between science a... | https://medium.com/science-for-real/the-4-purp... | ['Eshan Samaranayake'] | 2020-12-21 16:05:19.524000+00:00 | ['Health', 'Neuroscience', 'Mental Health', 'P... |
| 4 | Surviving a Rod Through the Head | You’ve heard of him, haven’t you? Phineas Gage... | https://medium.com/live-your-life-on-purpose/s... | ['Rishav Sinha'] | 2020-02-26 00:01:01.576000+00:00 | ['Brain', 'Health', 'Development', 'Psychology... |

Data Preprocessing#

Next, let’s count the number of occurrences of each token in each article. The counts are stored as Counter objects in the token_counters list.

Below you can see the tokens with 10 or more occurrences in the first article of the dataset. Notice that there are several common function words such as “for”, “to”, and “and”.

# count the number of occurrences of each token in each text
texts_lowercase = df_articles["text"].str.lower()
texts_lowercase_tokenized = texts_lowercase.apply(word_tokenize)
token_counters = texts_lowercase_tokenized.apply(Counter).values.tolist()

# show the tokens found in the first article with at least 10 occurrences
print({token: n_occ for token, n_occ in token_counters[0].items() if n_occ >= 10})
{'and': 32, ',': 52, 'we': 15, 'to': 30, 'for': 13, '.': 39, '’': 23, 'be': 13, 'that': 14, 'of': 23, '“': 12, 'the': 31, 'a': 16, '”': 11, 'i': 25, 'can': 16, 'it': 17, 's': 10}
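
As mentioned in the imports section, the same counting could be done with the CountVectorizer class from sklearn. Here is a minimal sketch of that alternative (assuming scikit-learn is installed; the vectorizer is configured to reuse the NLTK tokenizer so that the tokens match):

from sklearn.feature_extraction.text import CountVectorizer

# build a document-term matrix using the same NLTK tokenizer as above
vectorizer = CountVectorizer(tokenizer=word_tokenize, lowercase=True)
counts_matrix = vectorizer.fit_transform(df_articles["text"])  # shape: (n_articles, vocabulary_size)

# show the tokens found in the first article with at least 10 occurrences
vocabulary = vectorizer.get_feature_names_out()
first_article_counts = counts_matrix[0].toarray().ravel()
print({vocabulary[i]: int(n) for i, n in enumerate(first_article_counts) if n >= 10})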

Make Queries#

Now that we have counted the number of occurrences of each token in each article, we can write the code that makes our search engine work. Let’s first tokenize our query, which is "data science nlp" in this example.

# tokenize the query
query = "data science nlp"
query_tokens = word_tokenize(query)

Let’s write the get_scores function, which computes a score for each article in the dataset, taking into account how many times each token of the query can be found in the article.

# Compute a matching score for each text with respect to the query. The score is
# the number of times each token in the query can be found in a specific text.
def get_scores(query_tokens, token_counters):
  scores = []
  for token_counter in token_counters:
    matches = [token_counter[query_token] for query_token in query_tokens]
    total_score = sum(matches)
    scores.append(total_score)
  return scores

scores = get_scores(query_tokens, token_counters)

Now that each article has a score, we just have to sort the articles and show the titles of those with the highest scores. We can implement the show_best_results function, which does this for us.

# retrieve the top_n articles with the highest scores and show them
def show_best_results(df_articles, scores, top_n=10):
  best_indexes = np.argsort(scores)[::-1]
  for position, idx in enumerate(best_indexes[:top_n]):
    row = df_articles.iloc[idx]
    title = row["title"]
    score = scores[idx]
    print(f"{position + 1} [score = {score}]: {title}")

show_best_results(df_articles, scores)
1 [score = 186]: The Top Online Data Science Courses for 2019
2 [score = 132]: How Much Do You Know About Your Data And Is Your Product Ready To Benefit From Data Science?
3 [score = 122]: Under the Hood of K-Nearest Neighbors (KNN) and Popular Model Validation Techniques
4 [score = 122]: Streaming Real-time data to AWS Elasticsearch using Kinesis Firehose
5 [score = 120]: Financial Times Data Platform: From zero to hero
6 [score = 118]: No data governance, no data intelligence!
7 [score = 107]: Data Science for Everyone: Getting To Know Your Data — Part 1
8 [score = 102]: Data Science, the Good, the Bad, and the… Future
9 [score = 102]: A Layman’s Guide to Data Science: How to Become a (Good) Data Scientist
10 [score = 98]: Data Science Minimum: 10 Essential Skills You Need to Know to Start Doing Data Science

All of the returned articles are about data science, so our search engine seems to be working properly! Let’s try it with the query "how to learn data science".

# try a different query
query = "how to learn data science"
query_tokens = word_tokenize(query)
scores = get_scores(query_tokens, token_counters)
show_best_results(df_articles, scores)
1 [score = 589]: How to Make Your First $10,000 as a Freelance Writer
2 [score = 583]: Russ Roberts and Tyler on COVID-19 (Ep. 90 — BONUS)
3 [score = 526]: The Big Disruption
4 [score = 461]: Sam Altman on Loving Community, Hating Coworking, and the Hunt for Talent (Ep. 61 — Live)
5 [score = 394]: Paul Romer on a Culture of Science and Working Hard (Ep. 96)
6 [score = 392]: Nicholas Bloom on Management, Productivity, and Scientific Progress (Ep. 102)
7 [score = 349]: SXSW 2019 Ultimate Guide to the Panels, Popups and Parties
8 [score = 341]: Glen Weyl on Fighting COVID-19 and the Role of the Academic Expert (Ep. 94 — BONUS)
9 [score = 319]: The Top Online Data Science Courses for 2019
10 [score = 308]: A Deep Conceptual Guide to Mutual Information

Unfortunately, this time the results are not as good as before. Why?

That’s because the query contains tokens like “how” and “to”, which are very frequent in most of the articles in the dataset. As a consequence, the articles that contain these common tokens most often get the highest scores, and tokens like “data” and “science” have less influence on the results.
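
To see this concretely, we can use the token_counters computed earlier to compare how often a stopword like “to” appears across the articles versus a content word like “data”:

# total occurrences across the 10k articles of a stopword vs. a content word
total_to = sum(counter["to"] for counter in token_counters)
total_data = sum(counter["data"] for counter in token_counters)
print(f"'to': {total_to} occurrences, 'data': {total_data} occurrences")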

Removing Stopwords#

We can simply filter out these “insignificant” words upfront so that they are not taken into account when scoring the articles. Let’s download the list of English stopwords from nltk. Along with the stopwords, let’s also filter out punctuation from the tokens.

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
import string

english_stopwords = stopwords.words('english')

All punctuation characters can be found in the string.punctuation string.

print(string.punctuation)
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

Then, just before counting the occurrences of each token, we filter out all the tokens that are stopwords or punctuation characters.

# count the number of occurrences of each token in each text
texts_lowercase = df_articles["text"].str.lower()
texts_lowercase_tokenized = texts_lowercase.apply(word_tokenize)
texts_lowercase_tokenized_no_sw = texts_lowercase_tokenized.apply(
  lambda token_list: [token for token in token_list
                      if token not in english_stopwords and
                      token not in string.punctuation]
)
token_counters = texts_lowercase_tokenized_no_sw.apply(Counter).values.tolist()

# show the tokens found in the first article with at least 6 occurrences
print({token: n_occ for token, n_occ in token_counters[0].items() if n_occ >= 6})
{'’': 23, '“': 12, 'us': 6, '”': 11, 'life': 6}

Let’s now try the "how to learn data science" query again and see the results.

# tokenize the query and remove stopwords
query = "how to learn data science"
query_tokens = word_tokenize(query)
query_tokens_no_sw = [token for token in query_tokens
                      if token not in english_stopwords and
                      token not in string.punctuation]
print(f"Tokenized query without stopwords: {query_tokens_no_sw}")
print()

# show best results
scores = get_scores(query_tokens_no_sw, token_counters)
show_best_results(df_articles, scores)
Tokenized query without stopwords: ['learn', 'data', 'science']

1 [score = 200]: The Top Online Data Science Courses for 2019
2 [score = 132]: How Much Do You Know About Your Data And Is Your Product Ready To Benefit From Data Science?
3 [score = 124]: Under the Hood of K-Nearest Neighbors (KNN) and Popular Model Validation Techniques
4 [score = 123]: Streaming Real-time data to AWS Elasticsearch using Kinesis Firehose
5 [score = 120]: Financial Times Data Platform: From zero to hero
6 [score = 119]: No data governance, no data intelligence!
7 [score = 107]: Data Science for Everyone: Getting To Know Your Data — Part 1
8 [score = 107]: A Layman’s Guide to Data Science: How to Become a (Good) Data Scientist
9 [score = 104]: Data Science, the Good, the Bad, and the… Future
10 [score = 99]: Data Science Minimum: 10 Essential Skills You Need to Know to Start Doing Data Science

Indeed, the results are again relevant to the query!

Questions and Feedback#

Have questions about this lesson? Would you like to exchange ideas? Or would you like to point out something that needs to be corrected? Join the NLPlanet Discord server and interact with the community! There’s a specific channel for this course called practical-nlp-nlplanet.