1.9 Representing Texts as Vectors: TF-IDF

In the lesson on building a search engine over Medium articles with bag of words, we saw that some words matter more than others when scoring articles against a query. In particular, we noticed heuristically that ignoring stopwords (i.e. giving them a weight of zero and every other token a weight of one) results in a better search engine.
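As a reminder of what that heuristic amounts to, here is a minimal sketch of bag-of-words scoring with zero weight for stopwords. The stopword list, the `token_weight` and `score_article` helpers, and the toy texts are all hypothetical and only illustrate the idea; this is not the code from the earlier lesson.

```python
from collections import Counter

# Hypothetical, tiny stopword list used only for illustration
STOPWORDS = {"the", "of", "a", "to", "in", "and", "is"}


def token_weight(token):
    # Stopwords get weight 0, every other token gets weight 1
    return 0 if token in STOPWORDS else 1


def score_article(query, article):
    # Bag of words: count how often each token appears in the article
    article_counts = Counter(article.lower().split())
    # Sum the weighted counts of the distinct query tokens
    query_tokens = set(query.lower().split())
    return sum(token_weight(t) * article_counts[t] for t in query_tokens)


query = "the history of python"
article_a = "the the the of of a list of the things"
article_b = "python is a language and the history of python is long"

print(score_article(query, article_a))
print(score_article(query, article_b))
```

With these weights the first text scores 0 and the second scores 3. If every token had weight one instead, the first text would score 7 and the second 5, so a text made of nothing but stopwords would rank above the relevant one.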

In this lesson, we learn about another heuristic that assigns different weights to tokens and often results in better search engines: TF-IDF (term frequency-inverse document frequency).

Zipf’s Law

First, note that word frequencies in natural languages typically follow a power law known as Zipf’s law. Here’s how Wikipedia describes it:

Zipf’s law was originally formulated in terms of quantitative linguistics, stating that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. […] For example, in the Brown Corpus of American English text, the word “the” is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences. True to Zipf’s Law, the second-place word “of” accounts for slightly over 3.5% of words […]. Only 135 vocabulary items are needed to account for half the Brown Corpus.

So, when the words of a corpus are sorted by rank, their frequencies should fall off steeply: a few very frequent words followed by a long tail of rare ones.
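To make the inverse proportionality concrete, here is a minimal sketch of the shares an idealized Zipf's law would predict for the first few ranks. It assumes the exponent is exactly one and takes the roughly 7% share of “the” from the quote above as the starting point; the helper name `zipf_expected_share` is hypothetical.

```python
def zipf_expected_share(top_share, rank):
    # Idealized Zipf's law: frequency is inversely proportional to rank
    return top_share / rank


# "the" accounts for nearly 7% of the Brown Corpus (figure from the quote above)
top_share = 0.07

for rank in range(1, 6):
    share = zipf_expected_share(top_share, rank)
    print(f"rank {rank}: {share:.2%}")
```

The predicted share at rank 2 is 3.5%, which is in line with the share quoted for “of”. Real corpora only follow this curve approximately.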


Let’s quickly verify whether this distribution also holds for the dataset of Medium articles that we used in the previous projects.

Zipf’s Law in Medium Articles

Let’s import the necessary libraries. The only library we haven’t used in the previous lessons is plotly, a data visualization library for making interactive plots.

from huggingface_hub import hf_hub_download
import nltk
from nltk.tokenize import word_tokenize

import pandas as pd
import plotly.express as px

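from collections import Counter
import re

# word_tokenize needs the NLTK "punkt" tokenizer data; download it once
# (assumption: recent NLTK releases may ask for "punkt_tab" instead)
nltk.download("punkt")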

We then download the dataset of Medium articles from the Hugging Face Hub and convert it to a pandas dataframe.

# download dataset of Medium articles from 
# https://huggingface.co/datasets/fabiochiu/medium-articles
df_articles = pd.read_csv(
  hf_hub_download("fabiochiu/medium-articles", repo_type="dataset", filename="medium_articles.csv")
)

# There are 192,368 articles in total, but let's sample only 30,000 of them to
# make computations faster
df_articles = df_articles.sample(n=30000)

  | title | text | url | authors | timestamp | tags
0 | What if Robots Take Over the World | Image Courtesy: Unsplash.com\n\nYou might have... | https://medium.com/@minisculestories/what-if-r... | ['Miniscule Stories'] | 2020-12-05 19:35:52.037000+00:00 | ['Robots']
1 | Spice It Up With Language Honey | Spice It Up With Language Honey\n\nPhoto by An... | https://2madness.com/spice-it-up-with-language... | ['Jodi Urgitus'] | 2020-09-16 01:59:13.972000+00:00 | ['Millennials', 'Language', 'Exercise', 'Lifes...
2 | How NOT to get scammed on the Internet | I have an interesting story for you to read. J... | https://medium.com/@angeliichoo77/how-not-to-g... | [] | 2020-12-14 01:13:46.746000+00:00 | ['Scam', 'Online Safety', 'Online Shopping', '...
3 | Three months dating an Artificial ‘almost’ Int... | Three months dating an Artificial ‘almost’ Int... | https://medium.com/q-n-english/three-months-da... | [] | 2017-11-29 12:21:28.429000+00:00 | ['AI', 'Voice Assistant', 'Qnenglish', 'Amazon...
4 | Streaming Payments and the Future of Interoper... | Talk given by Evan Schwartz\n\nEvan Schwartz i... | https://medium.com/mitbitcoinclub/streaming-pa... | ['Sathya Peri'] | 2018-05-10 21:45:37.760000+00:00 | ['Blockchain', 'Altcoins', 'Cryptocurrency', '...

Next, let’s do some simple text preprocessing and count how many occurrences we have for each token in the whole dataset.

def from_text_to_counter(full_text):
  # convert texts to lowercase
  full_text_lower = full_text.lower()

  # remove punctuation from articles
  full_text_no_punctuation = re.sub(r'[^\w\s]', ' ', full_text_lower)

  # tokenize texts
  full_text_tokenized = word_tokenize(full_text_no_punctuation)

  # count the occurrences of each token
  token_counter = Counter(full_text_tokenized)

  return token_counter

# count the occurrences of each token
full_text = " ".join(df_articles["text"].values)
token_counter = from_text_to_counter(full_text)
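
To get a feel for what `token_counter` holds, here is a small self-contained sketch on a toy sentence. As a simplifying assumption it tokenizes with `str.split` instead of `word_tokenize`, so it runs without any NLTK data; on real text the two tokenizers can give different tokens. The name `toy_text_to_counter` is hypothetical.

```python
from collections import Counter
import re


def toy_text_to_counter(text):
    # Same steps as above: lowercase, replace punctuation with spaces, tokenize, count
    text_lower = text.lower()
    text_no_punctuation = re.sub(r"[^\w\s]", " ", text_lower)
    # Assumption: a whitespace split stands in for word_tokenize
    tokens = text_no_punctuation.split()
    return Counter(tokens)


toy_counter = toy_text_to_counter("The cat sat. The dog sat, too!")

# most_common(n) returns a list of (token, count) pairs, most frequent first
print(toy_counter.most_common(2))
print(toy_counter["cat"])
```

A `Counter` behaves like a dictionary from token to count, and `most_common` is the method used to pick the most frequent tokens for plotting.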

We are now ready to plot the number of occurrences of the most frequent tokens in the dataset.

# get occurrences of top 30 tokens
top_n = 30
xs, ys = list(zip(*token_counter.most_common(top_n)))

# plot
plot_title = "Tokens and their occurrences in the Medium dataset"
labels = {
  "x": "Tokens",
  "y": "Number of occurrences"
}
fig = px.bar(x=xs, y=ys, template="plotly_dark", title=plot_title, labels=labels)