1.6 Project: Classify Medium Articles with Bag of Words#
Now that we have some basic knowledge of n-grams and text normalization (e.g. stemming, lemmatization, tokenization), let’s build a simple project where we train a model to distinguish whether or not a Medium article is about data science.
In this project, we’ll use the medium-articles dataset from the Hugging Face Hub, a collection of scraped Medium articles along with their titles, texts, and tags. We won’t cover how to scrape content from the web here.
Libraries#
First, let’s install the datasets library, which allows for easy access to datasets on the Hugging Face Hub.
!pip install datasets
Then, we import the necessary libraries and modules:
- The hf_hub_download function from the huggingface_hub module makes it easy to download datasets from the Hugging Face Hub.
- pandas lets us work with tabular data using dataframes.
- We split the dataset into a train set and a test set with the train_test_split helper function from sklearn.
- We use the CountVectorizer class to build a bag of words representation of the articles.
- Our classifier will be a LogisticRegression.
- We analyse the performance of our classification model on the test set with some standard sklearn helper functions like classification_report and confusion_matrix.
from huggingface_hub import hf_hub_download
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (classification_report, confusion_matrix,
ConfusionMatrixDisplay)
Download the Dataset#
Let’s download the dataset and convert it to a pandas dataframe.
# download dataset of Medium articles from
# https://huggingface.co/datasets/fabiochiu/medium-articles
df_articles = pd.read_csv(
hf_hub_download("fabiochiu/medium-articles", repo_type="dataset",
filename="medium_articles.csv")
)
df_articles.head()
| | title | text | url | authors | timestamp | tags |
|---|---|---|---|---|---|---|
| 0 | Mental Note Vol. 24 | Photo by Josh Riemer on Unsplash\n\nMerry Chri... | https://medium.com/invisible-illness/mental-no... | ['Ryan Fan'] | 2020-12-26 03:38:10.479000+00:00 | ['Mental Health', 'Health', 'Psychology', 'Sci... |
| 1 | Your Brain On Coronavirus | Your Brain On Coronavirus\n\nA guide to the cu... | https://medium.com/age-of-awareness/how-the-pa... | ['Simon Spichak'] | 2020-09-23 22:10:17.126000+00:00 | ['Mental Health', 'Coronavirus', 'Science', 'P... |
| 2 | Mind Your Nose | Mind Your Nose\n\nHow smell training can chang... | https://medium.com/neodotlife/mind-your-nose-f... | [] | 2020-10-10 20:17:37.132000+00:00 | ['Biotechnology', 'Neuroscience', 'Brain', 'We... |
| 3 | The 4 Purposes of Dreams | Passionate about the synergy between science a... | https://medium.com/science-for-real/the-4-purp... | ['Eshan Samaranayake'] | 2020-12-21 16:05:19.524000+00:00 | ['Health', 'Neuroscience', 'Mental Health', 'P... |
| 4 | Surviving a Rod Through the Head | You’ve heard of him, haven’t you? Phineas Gage... | https://medium.com/live-your-life-on-purpose/s... | ['Rishav Sinha'] | 2020-02-26 00:01:01.576000+00:00 | ['Brain', 'Health', 'Development', 'Psychology... |
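Before preprocessing, it can be useful to check how large the dataset is (the exact number of rows depends on the version of the dataset you download):
# (number of articles, number of columns)
df_articles.shape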
The dataframe has six columns:
- The title of the article.
- The text of the article, i.e. its body.
- The url from which the article was scraped.
- The authors of the article.
- The timestamp, i.e. the date and time at which the article was published.
- The tags of the article.
Many blogs and news websites use tags to better organize their articles. For example, an article titled “Transcribing YouTube videos with Whisper in a few lines of code” may have the tags “Data Science”, “Artificial Intelligence”, “Machine Learning”, “NLP”, and “Speech”.
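One caveat: since the dataset is loaded from a CSV file, the tags column contains string representations of lists (e.g. "['Mental Health', 'Health']") rather than actual Python lists. A minimal optional sketch to parse them into real lists with ast.literal_eval:
import ast

# the "tags" column is read from the CSV as strings like "['Health', 'Python']";
# parse each entry into an actual Python list of tag strings
df_articles["tags"] = df_articles["tags"].apply(ast.literal_eval)
This step isn’t strictly required for what follows: checking whether "Data Science" appears in the raw string works too (as a substring check), but parsing the lists makes the membership test exact.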
Data Preprocessing#
In this small project, we’ll train a classifier to distinguish whether an article has the “Data Science” tag, so let’s add a new boolean column is_data_science to our dataframe. The classifier will learn from the title and the text of each article, so let’s also add a full_text column containing the concatenation of the title and text columns.
# create two columns:
# - full_text: contains the concatenation of the title and the text of the article.
# - is_data_science: a boolean which is True if the article has the "Data Science" tag
df_articles["is_data_science"] = df_articles["tags"] \
.apply(lambda tags_list: "Data Science" in tags_list)
df_articles["full_text"] = df_articles["title"] + " " + df_articles["text"]
df_articles.head()
| | title | text | url | authors | timestamp | tags | is_data_science | full_text |
|---|---|---|---|---|---|---|---|---|
| 0 | Mental Note Vol. 24 | Photo by Josh Riemer on Unsplash\n\nMerry Chri... | https://medium.com/invisible-illness/mental-no... | ['Ryan Fan'] | 2020-12-26 03:38:10.479000+00:00 | ['Mental Health', 'Health', 'Psychology', 'Sci... | False | Mental Note Vol. 24 Photo by Josh Riemer on Un... |
| 1 | Your Brain On Coronavirus | Your Brain On Coronavirus\n\nA guide to the cu... | https://medium.com/age-of-awareness/how-the-pa... | ['Simon Spichak'] | 2020-09-23 22:10:17.126000+00:00 | ['Mental Health', 'Coronavirus', 'Science', 'P... | False | Your Brain On Coronavirus Your Brain On Corona... |
| 2 | Mind Your Nose | Mind Your Nose\n\nHow smell training can chang... | https://medium.com/neodotlife/mind-your-nose-f... | [] | 2020-10-10 20:17:37.132000+00:00 | ['Biotechnology', 'Neuroscience', 'Brain', 'We... | False | Mind Your Nose Mind Your Nose\n\nHow smell tra... |
| 3 | The 4 Purposes of Dreams | Passionate about the synergy between science a... | https://medium.com/science-for-real/the-4-purp... | ['Eshan Samaranayake'] | 2020-12-21 16:05:19.524000+00:00 | ['Health', 'Neuroscience', 'Mental Health', 'P... | False | The 4 Purposes of Dreams Passionate about the ... |
| 4 | Surviving a Rod Through the Head | You’ve heard of him, haven’t you? Phineas Gage... | https://medium.com/live-your-life-on-purpose/s... | ['Rishav Sinha'] | 2020-02-26 00:01:01.576000+00:00 | ['Brain', 'Health', 'Development', 'Psychology... | False | Surviving a Rod Through the Head You’ve heard ... |
Then, to make training faster (just for convenience and learning purposes), let’s sample 1000 articles with the is_data_science column set to True and 1000 articles with it set to False. Note that the sampling is random, so results may differ slightly between runs.
# sample 1000 articles is_data_science = True and 1000 articles with
# is_data_science = False
df = pd.concat([
df_articles[df_articles["is_data_science"]].sample(n=1000),
df_articles[~df_articles["is_data_science"]].sample(n=1000)
])
Lastly, let’s split the dataframe into a train set and a test set using the train_test_split function. Notice that we pass the labels to the stratify argument, so that the distribution of the labels in the train and test sets is approximately the same. Read about Stratified Sampling to learn more.
# train/test split
X = df[["full_text"]]
y = df["is_data_science"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42, stratify=y)
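To verify that the stratification worked, we can compare the label proportions in the two splits; given our balanced sample, both should be close to 50/50:
# sanity check: label proportions should be roughly the same in both splits
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))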
Model Training#
We are now ready to fit the CountVectorizer on the training set and then train our LogisticRegression.
Remember to fit the CountVectorizer on the training set only, i.e. after the train/test split: fitting it on the whole dataset would leak information about the test set into the model, a form of data leakage.
# fit vectorizer, vectorize train set, and train the classification model
vectorizer = CountVectorizer(ngram_range=(1, 1))
full_texts_vectorized = vectorizer.fit_transform(X_train["full_text"])
model = LogisticRegression()
model.fit(full_texts_vectorized, y_train)
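As a quick check, we can see how many distinct unigrams the vectorizer learned (we use unigrams only, since ngram_range=(1, 1)); the exact count depends on which articles were sampled:
# number of distinct unigrams in the learned vocabulary
len(vectorizer.vocabulary_)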
Get Metrics on Test Data#
Now that our model is ready, let’s vectorize the test data as well and compute the predictions.
# vectorize test set and predict
full_texts_vectorized = vectorizer.transform(X_test["full_text"])
predictions = model.predict(full_texts_vectorized)
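We can also try the model on a made-up headline (a hypothetical example, so the exact prediction depends on the sampled training data):
# classify a hypothetical, made-up headline with the trained model
sample_text = ["A gentle introduction to gradient boosting for tabular data"]
sample_vectorized = vectorizer.transform(sample_text)
model.predict(sample_vectorized)  # e.g. array([ True])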
The classification_report function prints the results of several classification metrics on the test set, such as precision, recall, and f1-score.
# plot precision, recall, f1-score on test set
print(classification_report(y_test, predictions))
precision recall f1-score support
False 0.88 0.91 0.89 200
True 0.90 0.88 0.89 200
accuracy 0.89 400
macro avg 0.89 0.89 0.89 400
weighted avg 0.89 0.89 0.89 400
The confusion_matrix function builds an array representation of the confusion matrix over the test set. ConfusionMatrixDisplay then plots this confusion matrix using the plotting library matplotlib.
# plot confusion matrix
cm = confusion_matrix(y_test, predictions, labels=model.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
display_labels=["No Data Science", "Data Science"])
p = disp.plot()
fig = p.figure_
fig.set_facecolor('white')
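Given the per-class recalls above and 200 test articles per class, the plotted matrix corresponds to roughly 182 true negatives, 18 false positives, 24 false negatives, and 176 true positives.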
Lastly, to get some insight into what the model has learned, let’s show the 20 words with the highest weights in the LogisticRegression model, i.e. the words whose presence contributes the most to a True prediction of is_data_science.
# show top 20 ngrams by logistic regression weight
# vocabulary_ maps each ngram to its column index in the bag of words matrix;
# sorting by index aligns the ngrams with the columns of model.coef_
ngram_indices_sorted = sorted(vectorizer.vocabulary_.items(), key=lambda t: t[1])
ngrams_sorted = list(zip(*ngram_indices_sorted))[0]
# pair each ngram with its weight and sort by weight in descending order
ngram_weight_pairs = list(zip(ngrams_sorted, model.coef_[0]))
ngram_weight_pairs_sorted = sorted(ngram_weight_pairs, key=lambda t: t[1], reverse=True)
ngram_weight_pairs_sorted[:20]
[('science', 0.8269998908269114),
('data', 0.7649240028489988),
('grafiti', 0.5164785998868469),
('average', 0.45935491367469017),
('install', 0.44434687858001426),
('python', 0.42677800598516474),
('apa', 0.3764474411540241),
('problem', 0.376067783595879),
('itu', 0.37441604838059017),
('mengenal', 0.37441604838059017),
('saja', 0.37441604838059017),
('tipe', 0.37441604838059017),
('what', 0.3682372634460391),
('dataset', 0.3660530682179952),
('intelligence', 0.35362543691872533),
('graphs', 0.3468732224423762),
('math', 0.3442278904564499),
('hey', 0.3341740770629924),
('were', 0.3326927860423783),
('alexey', 0.32750477125896876)]
Among these words we find data, python, and dataset, as we might expect.
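A more concise way to get the same ranking (assuming scikit-learn >= 1.0, where the fitted vectorizer exposes get_feature_names_out) is to index the feature names with the sorted coefficient positions:
import numpy as np

# feature_names[i] is the ngram associated with column i of the bag of words
# matrix, and therefore with the weight model.coef_[0][i]
feature_names = vectorizer.get_feature_names_out()
top_indices = np.argsort(model.coef_[0])[::-1][:20]
list(zip(feature_names[top_indices], model.coef_[0][top_indices]))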
Questions and Feedback#
Have questions about this lesson? Would you like to exchange ideas, or point out something that needs to be corrected? Join the NLPlanet Discord server and interact with the community! There’s a dedicated channel for this course called practical-nlp-nlplanet.