1.11 Project: Recommender System over Medium with TF-IDF#

Do you remember the purpose of a recommender system? The goal of recommender systems is to suggest items that users are likely to enjoy and interact with. In the case of Medium, we’d like to suggest relevant articles to a user who has already read several articles in the past.

We previously saw how to leverage TF-IDF to retrieve articles according to a search query:

  1. We vectorize all the articles with TF-IDF.

  2. We vectorize the query as well.

  3. The vectors of the articles and the query vector are compared, looking for high similarity.

  4. The articles are ranked according to the similarity of their vectors to the query vector.

With small changes to these steps (essentially, replacing the query vector with a vector that represents the user), let’s see how to build a recommender system with TF-IDF.

Libraries#

Let’s import the necessary libraries. We’ll also need the sparse module from scipy, which helps with handling sparse matrices.

from huggingface_hub import hf_hub_download

import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from scipy import sparse
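
As a quick aside on why sparsity matters: a dense 10,000 × 110,000 matrix of float64 values (the shape of the TF-IDF matrix we’ll build below) would occupy roughly 8.8 GB of RAM, while the sparse representation stores only the non-zero entries. Here’s a back-of-the-envelope check (just illustrative arithmetic, not part of the recommender):

# memory needed by a dense 10,000 x 110,000 float64 matrix
n_rows, n_cols, bytes_per_float = 10_000, 110_000, 8
print(f"{n_rows * n_cols * bytes_per_float / 1e9:.1f} GB")  # prints "8.8 GB"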

Download Dataset#

We can now download the dataset of Medium articles from the Hugging Face Hub.

# download dataset of Medium articles from 
# https://huggingface.co/datasets/fabiochiu/medium-articles
df_articles = pd.read_csv(
  hf_hub_download("fabiochiu/medium-articles", repo_type="dataset", filename="medium_articles.csv")
)

# There are 192,368 articles in total, but let's keep only 10,000 of them to
# make computations faster
df_articles = df_articles[:10000].reset_index(drop=True)

df_articles.head()
|   | title | text | url | authors | timestamp | tags |
|---|-------|------|-----|---------|-----------|------|
| 0 | Mental Note Vol. 24 | Photo by Josh Riemer on Unsplash\n\nMerry Chri... | https://medium.com/invisible-illness/mental-no... | ['Ryan Fan'] | 2020-12-26 03:38:10.479000+00:00 | ['Mental Health', 'Health', 'Psychology', 'Sci... |
| 1 | Your Brain On Coronavirus | Your Brain On Coronavirus\n\nA guide to the cu... | https://medium.com/age-of-awareness/how-the-pa... | ['Simon Spichak'] | 2020-09-23 22:10:17.126000+00:00 | ['Mental Health', 'Coronavirus', 'Science', 'P... |
| 2 | Mind Your Nose | Mind Your Nose\n\nHow smell training can chang... | https://medium.com/neodotlife/mind-your-nose-f... | [] | 2020-10-10 20:17:37.132000+00:00 | ['Biotechnology', 'Neuroscience', 'Brain', 'We... |
| 3 | The 4 Purposes of Dreams | Passionate about the synergy between science a... | https://medium.com/science-for-real/the-4-purp... | ['Eshan Samaranayake'] | 2020-12-21 16:05:19.524000+00:00 | ['Health', 'Neuroscience', 'Mental Health', 'P... |
| 4 | Surviving a Rod Through the Head | You’ve heard of him, haven’t you? Phineas Gage... | https://medium.com/live-your-life-on-purpose/s... | ['Rishav Sinha'] | 2020-02-26 00:01:01.576000+00:00 | ['Brain', 'Health', 'Development', 'Psychology... |

Using the TfidfVectorizer#

As with the search engine we implemented in the previous lesson, we start by vectorizing all the articles with TfidfVectorizer.

# apply the TfidfVectorizer to the corpus
corpus = df_articles["text"]
vectorizer = TfidfVectorizer()
corpus_vectorized = vectorizer.fit_transform(corpus)
print(corpus_vectorized.shape)
(10000, 110038)

The corpus_vectorized variable is a sparse matrix with 10k rows (one row for each article) and ~110k columns (one column for each token in the corpus).
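
If you’re curious about how sparse this matrix actually is, you can count its stored non-zero entries (an optional check, not part of the original lesson; the exact numbers depend on the dataset version you download):

# optional check: fraction of non-zero cells in the TF-IDF matrix
n_nonzero = corpus_vectorized.nnz
n_cells = corpus_vectorized.shape[0] * corpus_vectorized.shape[1]
print(f"{n_nonzero} non-zero entries, density = {n_nonzero / n_cells:.4%}")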

Representing Users with Vectors#

Now comes the interesting part: let’s find a way to represent a user with a vector (with the same dimensions as the article vectors), so that we can compare it with the article vectors to find the ones to recommend.

There are several heuristics for doing this. A simple one is to represent the user with the average of the vectors of the articles they have read. In this example, we consider a user who has read three random articles tagged “Data Science” (and therefore we expect the user to receive suggestions of other articles about data science). Let’s write a function get_sparse_user_vector_for_tag that randomly selects articles with a specific tag, retrieves their TF-IDF vectors, and averages them, returning a user_vector.

def get_sparse_user_vector_for_tag(df_articles, tag, corpus_vectorized, n_articles=3):
  # get the indices of "n_articles" random articles with the "tag" tag
  # (the "tags" column stores stringified lists, hence the eval)
  df_articles_with_tag = df_articles[df_articles["tags"].apply(lambda tags: tag in eval(tags))]
  read_articles_indices = np.array(df_articles_with_tag.sample(n=n_articles).index.values)
  df_read_articles = df_articles.loc[read_articles_indices]

  # compute user vector as the average of the vectors of the read articles
  read_articles_rows = []
  for idx in read_articles_indices:
    article_row = corpus_vectorized.getrow(idx).toarray()[0]
    read_articles_rows.append(article_row)
  read_articles_rows = np.array(read_articles_rows)
  user_vector_dense = np.average(read_articles_rows, axis=0).reshape((1, -1))
  user_vector = sparse.csr_matrix(user_vector_dense)

  return user_vector, df_read_articles

# suppose the user has read three articles about data science
user_vector, df_read_articles = get_sparse_user_vector_for_tag(df_articles, "Data Science",
                                             corpus_vectorized, n_articles=3)
print(user_vector.shape)
(1, 110038)

The user_vector variable is a sparse matrix with one row and ~110k columns (one column for each token in the corpus). We need it to be sparse because later we’ll compute the dot product between it and corpus_vectorized, which is a sparse matrix as well. Multiplying a dense matrix with a sparse one would force the sparse matrix to be converted to dense before the multiplication, using a lot of memory because corpus_vectorized is very large.
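
We can verify that user_vector is itself very sparse: since it’s the average of just three article vectors, its non-zero columns are the union of the tokens appearing in those three articles (again, an optional check that isn’t part of the original lesson):

# optional check: count the non-zero entries of the user vector
print(f"non-zero entries: {user_vector.nnz} out of {user_vector.shape[1]} columns")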

Let’s also check that the read articles are indeed about data science.

df_read_articles["title"].values
array(['6 Types of Neural Networks Every Data Scientist Must Know',
       'Why is Everyone Going to Iceland?',
       'RL — Model-based Reinforcement Learning'], dtype=object)

Compute Similarities between User and Articles#

We are now ready to compute the similarity of the user vector with respect to the vectors of the articles. Let’s multiply user_vector by the transpose of corpus_vectorized to get a similarity score for each article (the higher the score, the more similar the user and the article). Since TfidfVectorizer L2-normalizes the article vectors by default, this dot product is proportional to the cosine similarity between the user vector and each article vector.

# compute scores as the dot product between the user vector
# and the document vectors
scores = user_vector.dot(corpus_vectorized.transpose())
scores_array = scores.toarray()[0]
print(scores_array.shape)
(10000,)
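
As a sanity check of the cosine-similarity argument above, we can compare these scores with scikit-learn’s cosine_similarity: dividing the dot products by the norm of user_vector (which is the same for every article) should reproduce the cosine scores, confirming that the two measures produce the same ranking. This check is a sketch and not part of the original pipeline:

from sklearn.metrics.pairwise import cosine_similarity

# the dot-product scores equal the cosine similarities multiplied by the
# norm of user_vector, a constant that doesn't affect the ranking
cosine_scores = cosine_similarity(user_vector, corpus_vectorized)[0]
user_norm = np.linalg.norm(user_vector.toarray())
print(np.allclose(cosine_scores, scores_array / user_norm))  # expected: True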

Show Results#

Finally, let’s sort the articles by their similarity score and show the most relevant ones.

# retrieve the top_n articles with the highest scores and show them
def show_best_results(df_articles, scores_array, top_n=10):
  sorted_indices = scores_array.argsort()[::-1]
  for position, idx in enumerate(sorted_indices[:top_n]):
    row = df_articles.iloc[idx]
    title = row["title"]
    score = scores_array[idx]
    print(f"{position + 1} [score = {score}]: {title}")

show_best_results(df_articles, scores_array)
1 [score = 0.43243171290023863]: RL — Model-based Reinforcement Learning
2 [score = 0.41497633316730165]: 6 Types of Neural Networks Every Data Scientist Must Know
3 [score = 0.39197994149298554]: Why is Everyone Going to Iceland?
4 [score = 0.3386122842326819]: Neural Networks with Memory
5 [score = 0.3308103459804056]: Beyond DQN/A3C: A Survey in Advanced Reinforcement Learning
6 [score = 0.32658852811652495]: The definitive guide to Neural Networks
7 [score = 0.3169999894877312]: How to Build a Simple Image Recognition System with TensorFlow (Part 2)
8 [score = 0.31644579390692273]: You will never believe how machines can learn like humans: Part 2
9 [score = 0.31497011651937956]: Hand-written Digit Recognition Using CNN Classification(Process Explanation)
10 [score = 0.31441080874061944]: Neural Networks, Demystified

The code is suggesting other articles about data science: that’s great! Notice that the first three suggested articles have a significantly higher similarity score than the others: that’s because they are exactly the three articles read by our fictitious user. A good recommender system would not recommend them again: it would filter out already-read articles and suggest from the fourth result onwards.
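
Here is a minimal sketch of that filtering, reusing scores_array and df_read_articles from the data-science example above: we mask the scores of the read articles so they can never reach the top of the ranking.

# set the scores of already-read articles to -inf so that they are
# ranked last and never recommended again
filtered_scores = scores_array.copy()
filtered_scores[df_read_articles.index] = -np.inf
show_best_results(df_articles, filtered_scores)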

Let’s also try with users who have read articles with other tags.

# suppose the user has read three articles about Computer Vision
user_vector, _ = get_sparse_user_vector_for_tag(df_articles, "Computer Vision",
    corpus_vectorized, n_articles=3)
scores = user_vector.dot(corpus_vectorized.transpose())
scores_array = scores.toarray()[0]
show_best_results(df_articles, scores_array)
1 [score = 0.4323438402663196]: Image Processing with OpenCV
2 [score = 0.4304814358924063]: What Are RBMs, Deep Belief Networks and Why Are They Important to Deep Learning?
3 [score = 0.425601246236392]: Image Segmentation with Python
4 [score = 0.3105837947926712]: Essential OpenCV Functions to Get You Started into Computer Vision
5 [score = 0.29096883859107076]: Intro to Segmentation
6 [score = 0.2901207909231179]: Hand-written Digit Recognition Using CNN Classification(Process Explanation)
7 [score = 0.2797908107017236]: Introduction to Artificial Intelligence
8 [score = 0.2796386323906408]: Image Creation for Non-Artists (OpenCV Project Walkthrough)
9 [score = 0.27887870547034405]: Image Recognition APIs: Google, Amazon, IBM, Microsoft, and more
10 [score = 0.2691248832865112]: A Classic Computer Vision Project — How to Add an Image Behind Objects in a Video

# suppose the user has read three articles about Reinforcement Learning
user_vector, _ = get_sparse_user_vector_for_tag(df_articles, "Reinforcement Learning",
    corpus_vectorized, n_articles=3)
scores = user_vector.dot(corpus_vectorized.transpose())
scores_array = scores.toarray()[0]
show_best_results(df_articles, scores_array)
1 [score = 0.5157439852729353]: Chess Playing Algorithm Explained
2 [score = 0.49783164482294656]: Reinforcement learning with Skinner
3 [score = 0.48277763028818627]: MuZero 101: a brief introduction to DeepMind’s latest AI
4 [score = 0.39286004017311016]: Reinforcement Learning: Super Mario, AlphaGo and beyond
5 [score = 0.36120858642711723]: Crash Course: Reinforcement Learning
6 [score = 0.3522303935785226]: Tools Of The Best: Software Development Edition
7 [score = 0.35045555231717096]: Industries AI Is Poised to Revolutionize in the Next 20 Years
8 [score = 0.3433698143394783]: How Entrepreneurs Can Thrive in a New Era of Uncertainty
9 [score = 0.34228377246756575]: The Carrot and the Stick — Reinforcement Learning
10 [score = 0.3417871901120011]: How do you get to Carnegie Hall?

The results are mostly relevant for the “Computer Vision” and “Reinforcement Learning” users as well (a few off-topic articles sneak into the reinforcement-learning list, which is expected from such a simple heuristic).

Code Exercises#

Questions and Feedback#

Have questions about this lesson? Would you like to exchange ideas? Or would you like to point out something that needs to be corrected? Join the NLPlanet Discord server and interact with the community! There’s a specific channel for this course called practical-nlp-nlplanet.