1.6 Project: Classify Medium Articles with Bag of Words
Now that we have some basic knowledge about n-grams and text normalization (e.g. stemming, lemmatization, tokenization), let’s build a simple project where we train a model to distinguish whether or not Medium articles are about data science.
In this project, we’ll use the medium-articles dataset from Hugging Face, a collection of scraped articles with their title, text, and tags. We won’t cover how to scrape content from the web.
Libraries
First, let’s install the `datasets` library, which allows for easy access to datasets on the Hugging Face Hub.
!pip install datasets
Then, we import the necessary libraries and modules:

- The `hf_hub_download` function from the `huggingface_hub` module makes it easy to download datasets from the Hugging Face Hub.
- `pandas` lets us work with tabular data using dataframes.
- We split the dataset into a train set and a test set with the `train_test_split` helper function from `sklearn`.
- We use the `CountVectorizer` class to build a bag of words representation of the articles.
- Our classifier will be a `LogisticRegression`.
- We analyse the performance of our classification model on the test set with some standard `sklearn` helper functions like `classification_report` and `confusion_matrix`.
from huggingface_hub import hf_hub_download
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (classification_report, confusion_matrix,
                             ConfusionMatrixDisplay)
Download the Dataset
Let’s download the dataset and convert it to a pandas dataframe.
# download dataset of Medium articles from
# https://huggingface.co/datasets/fabiochiu/medium-articles
df_articles = pd.read_csv(
    hf_hub_download("fabiochiu/medium-articles", repo_type="dataset",
                    filename="medium_articles.csv")
)
df_articles.head()
|   | title | text | url | authors | timestamp | tags |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Mental Note Vol. 24 | Photo by Josh Riemer on Unsplash\n\nMerry Chri... | https://medium.com/invisible-illness/mental-no... | ['Ryan Fan'] | 2020-12-26 03:38:10.479000+00:00 | ['Mental Health', 'Health', 'Psychology', 'Sci... |
| 1 | Your Brain On Coronavirus | Your Brain On Coronavirus\n\nA guide to the cu... | https://medium.com/age-of-awareness/how-the-pa... | ['Simon Spichak'] | 2020-09-23 22:10:17.126000+00:00 | ['Mental Health', 'Coronavirus', 'Science', 'P... |
| 2 | Mind Your Nose | Mind Your Nose\n\nHow smell training can chang... | https://medium.com/neodotlife/mind-your-nose-f... | [] | 2020-10-10 20:17:37.132000+00:00 | ['Biotechnology', 'Neuroscience', 'Brain', 'We... |
| 3 | The 4 Purposes of Dreams | Passionate about the synergy between science a... | https://medium.com/science-for-real/the-4-purp... | ['Eshan Samaranayake'] | 2020-12-21 16:05:19.524000+00:00 | ['Health', 'Neuroscience', 'Mental Health', 'P... |
| 4 | Surviving a Rod Through the Head | You’ve heard of him, haven’t you? Phineas Gage... | https://medium.com/live-your-life-on-purpose/s... | ['Rishav Sinha'] | 2020-02-26 00:01:01.576000+00:00 | ['Brain', 'Health', 'Development', 'Psychology... |
The dataframe has six columns:

- The `title` of the article.
- The `text` of the article, i.e. its body.
- The `url` from which the article was scraped.
- The `authors` of the article.
- The `timestamp`, i.e. the date and time at which the article was published.
- The `tags` of the article.
Many blogs and newspaper websites use tags to better organize the articles within the website. For example, an article titled “Transcribing YouTube videos with Whisper in a few lines of code” may have as tags “Data Science”, “Artificial Intelligence”, “Machine Learning”, “NLP”, and “Speech”.
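Note that, since we loaded the dataset from a CSV file, the `tags` column contains string representations of Python lists (e.g. "['Mental Health', 'Health']") rather than actual lists. If you ever need real lists, e.g. for exact tag matching, here is a minimal sketch using the standard-library `ast` module (the `tags_list` column name is just an illustrative choice):

# parse the stringified tag lists into actual Python lists
import ast
df_articles["tags_list"] = df_articles["tags"].apply(ast.literal_eval)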
Data Preprocessing
In this small project, we’ll train a classifier to distinguish whether an article has the “Data Science” tag or not, so let’s add a new column `is_data_science` to our dataframe. The classifier will learn from the title and the text of the articles, so let’s also add a `full_text` column containing the concatenation of the `title` and `text` columns.
# create two columns:
# - full_text: contains the concatenation of the title and the text of the article
# - is_data_science: a boolean which is True if the article has the "Data Science" tag
# note: the tags column was read from CSV as a string like "['Tag1', 'Tag2']",
# so the `in` check below is a substring match on that string
df_articles["is_data_science"] = df_articles["tags"] \
    .apply(lambda tags_str: "Data Science" in tags_str)
df_articles["full_text"] = df_articles["title"] + " " + df_articles["text"]
df_articles.head()
|   | title | text | url | authors | timestamp | tags | is_data_science | full_text |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Mental Note Vol. 24 | Photo by Josh Riemer on Unsplash\n\nMerry Chri... | https://medium.com/invisible-illness/mental-no... | ['Ryan Fan'] | 2020-12-26 03:38:10.479000+00:00 | ['Mental Health', 'Health', 'Psychology', 'Sci... | False | Mental Note Vol. 24 Photo by Josh Riemer on Un... |
| 1 | Your Brain On Coronavirus | Your Brain On Coronavirus\n\nA guide to the cu... | https://medium.com/age-of-awareness/how-the-pa... | ['Simon Spichak'] | 2020-09-23 22:10:17.126000+00:00 | ['Mental Health', 'Coronavirus', 'Science', 'P... | False | Your Brain On Coronavirus Your Brain On Corona... |
| 2 | Mind Your Nose | Mind Your Nose\n\nHow smell training can chang... | https://medium.com/neodotlife/mind-your-nose-f... | [] | 2020-10-10 20:17:37.132000+00:00 | ['Biotechnology', 'Neuroscience', 'Brain', 'We... | False | Mind Your Nose Mind Your Nose\n\nHow smell tra... |
| 3 | The 4 Purposes of Dreams | Passionate about the synergy between science a... | https://medium.com/science-for-real/the-4-purp... | ['Eshan Samaranayake'] | 2020-12-21 16:05:19.524000+00:00 | ['Health', 'Neuroscience', 'Mental Health', 'P... | False | The 4 Purposes of Dreams Passionate about the ... |
| 4 | Surviving a Rod Through the Head | You’ve heard of him, haven’t you? Phineas Gage... | https://medium.com/live-your-life-on-purpose/s... | ['Rishav Sinha'] | 2020-02-26 00:01:01.576000+00:00 | ['Brain', 'Health', 'Development', 'Psychology... | False | Surviving a Rod Through the Head You’ve heard ... |
Then, to make training faster (just for convenience and learning purposes), let’s sample 1000 articles with the `is_data_science` column set to `True` and 1000 articles with it set to `False`.
# sample 1000 articles with is_data_science = True and 1000 articles with
# is_data_science = False
df = pd.concat([
    df_articles[df_articles["is_data_science"]].sample(n=1000),
    df_articles[~df_articles["is_data_science"]].sample(n=1000)
])
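After this sampling step the dataset is balanced by construction; a quick optional check:

# df now contains 2000 articles, 1000 per class
print(df["is_data_science"].value_counts())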
Last, let’s split the dataframe into a train set and a test set using the `train_test_split` function. Notice that we are passing the labels to the `stratify` argument, so that the distribution of the labels in the train and test sets is approximately the same. Read about Stratified Sampling to learn more.
# train/test split
X = df[["full_text"]]
y = df["is_data_science"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42, stratify=y)
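As an optional sanity check, we can verify that stratification kept the class balance roughly equal across the two splits (both ratios should be close to 0.5, since we sampled 1000 articles per class):

# with stratify=y, the share of positive labels is roughly the same in both splits
print(f"train positive ratio: {y_train.mean():.2f}")
print(f"test positive ratio: {y_test.mean():.2f}")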
Model Training
We are now ready to fit the `CountVectorizer` on the training set and then train our `LogisticRegression`.
Remember to fit the CountVectorizer after the train/test split to avoid data leakage.
# fit vectorizer, vectorize train set, and train the classification model
vectorizer = CountVectorizer(ngram_range=(1, 1))
full_texts_vectorized = vectorizer.fit_transform(X_train["full_text"])
model = LogisticRegression()
model.fit(full_texts_vectorized, y_train)
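To get a feel for how wide the bag of words representation is, we can optionally inspect the fitted vectorizer (the exact count depends on which articles were sampled):

# number of distinct unigrams learned from the training set
print(f"vocabulary size: {len(vectorizer.vocabulary_)}")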
Get Metrics on Test Data
Now that our model is ready, let’s vectorize the test data as well and compute the predictions.
# vectorize test set and predict
full_texts_vectorized = vectorizer.transform(X_test["full_text"])
predictions = model.predict(full_texts_vectorized)
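We can also try the model on arbitrary new text. Here is a small illustrative example (the title below is made up, and the predicted label may vary from run to run):

# predict on a new, made-up title using the same fitted vectorizer
new_texts = ["A gentle introduction to machine learning with Python and pandas"]
print(model.predict(vectorizer.transform(new_texts)))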
The `classification_report` function prints the results of several classification metrics on the test set, such as precision, recall, and f1-score.
# plot precision, recall, f1-score on test set
print(classification_report(y_test, predictions))
precision recall f1-score support
False 0.88 0.91 0.89 200
True 0.90 0.88 0.89 200
accuracy 0.89 400
macro avg 0.89 0.89 0.89 400
weighted avg 0.89 0.89 0.89 400
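If you only need the accuracy, it can also be computed directly with the `accuracy_score` helper from `sklearn`:

# fraction of test articles classified correctly
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, predictions))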
The `confusion_matrix` function builds an array representation of the confusion matrix over the test set. `ConfusionMatrixDisplay` then plots this confusion matrix using the plotting library `matplotlib`.
# plot confusion matrix
cm = confusion_matrix(y_test, predictions, labels=model.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=["No Data Science", "Data Science"])
p = disp.plot()
fig = p.figure_
fig.set_facecolor('white')

Last, to get some insight into what the model has learned, let’s show the top 20 words with the highest weights in the `LogisticRegression` model, i.e. the words whose presence contributes the most to a `True` prediction of `is_data_science`.
# show top 20 ngrams by logistic regression weight
# vocabulary_ maps each ngram to its column index: sorting by index aligns
# the ngrams with the coefficients in model.coef_[0]
ngram_indices_sorted = sorted(vectorizer.vocabulary_.items(), key=lambda t: t[1])
ngram_sorted = list(zip(*ngram_indices_sorted))[0]
# pair each ngram with its weight and sort by descending weight
ngram_weight_pairs = list(zip(ngram_sorted, model.coef_[0]))
ngram_weight_pairs_sorted = sorted(ngram_weight_pairs, key=lambda t: t[1], reverse=True)
ngram_weight_pairs_sorted[:20]
[('science', 0.8269998908269114),
('data', 0.7649240028489988),
('grafiti', 0.5164785998868469),
('average', 0.45935491367469017),
('install', 0.44434687858001426),
('python', 0.42677800598516474),
('apa', 0.3764474411540241),
('problem', 0.376067783595879),
('itu', 0.37441604838059017),
('mengenal', 0.37441604838059017),
('saja', 0.37441604838059017),
('tipe', 0.37441604838059017),
('what', 0.3682372634460391),
('dataset', 0.3660530682179952),
('intelligence', 0.35362543691872533),
('graphs', 0.3468732224423762),
('math', 0.3442278904564499),
('hey', 0.3341740770629924),
('were', 0.3326927860423783),
('alexey', 0.32750477125896876)]
Among these words we can find `data` and `python`, as can be expected. (Some non-English tokens such as `mengenal` and `saja` show up too, presumably coming from non-English articles in our random sample.)
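Symmetrically, we can inspect the 20 ngrams with the most negative weights, i.e. the words whose presence pushes the prediction toward `False` (the exact words will vary with the sampled articles):

# show the 20 ngrams that contribute most to a False prediction
ngram_weight_pairs_sorted[-20:]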
Code Exercises
Questions and Feedback
Have questions about this lesson? Would you like to exchange ideas? Or would you like to point out something that needs to be corrected? Join the NLPlanet Discord server and interact with the community! There’s a specific channel for this course called practical-nlp-nlplanet.