1.6 Project: Classify Medium Articles with Bag of Words#

Now that we have some basic knowledge about n-grams and text normalization (e.g. stemming, lemmatization, tokenization), let’s build a simple project where we train a model to distinguish whether Medium articles are about data science or not.

In this project, we’ll use the medium-articles dataset from Hugging Face, a collection of scraped articles, each with its title, text, and tags. We won’t deal with how to scrape content from the web.

Libraries#

First, let’s install the datasets library, which allows for easy access to datasets on the Hugging Face Hub.

!pip install datasets

Then, we import the necessary libraries and modules:

  • The hf_hub_download function from the huggingface_hub library makes it easy to download files, such as dataset files, from the Hugging Face Hub.

  • pandas lets us work with tabular data using dataframes.

  • We split the dataset into a train set and a test set with the train_test_split helper function from sklearn.

  • We use the CountVectorizer class to build a bag of words representation of the articles.

  • Our classifier will be a LogisticRegression.

  • We analyse the performance of our classification model on the test set with some standard sklearn helper functions like classification_report and confusion_matrix.

from huggingface_hub import hf_hub_download

import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (classification_report, confusion_matrix,
  ConfusionMatrixDisplay)

Download the Dataset#

Let’s download the dataset and convert it to a pandas dataframe.

# download dataset of Medium articles from 
# https://huggingface.co/datasets/fabiochiu/medium-articles
df_articles = pd.read_csv(
  hf_hub_download("fabiochiu/medium-articles", repo_type="dataset",
                  filename="medium_articles.csv")
)
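
As an alternative, since we installed the datasets library, we could load the same CSV through its load_dataset function and convert the result to pandas. This is a minimal sketch, assuming the library can read the CSV directly from this repository:

# alternative: load the same CSV with the datasets library
from datasets import load_dataset

dataset = load_dataset("fabiochiu/medium-articles",
                       data_files="medium_articles.csv")
df_articles = dataset["train"].to_pandas()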

df_articles.head()
|   | title | text | url | authors | timestamp | tags |
|---|-------|------|-----|---------|-----------|------|
| 0 | Mental Note Vol. 24 | Photo by Josh Riemer on Unsplash\n\nMerry Chri... | https://medium.com/invisible-illness/mental-no... | ['Ryan Fan'] | 2020-12-26 03:38:10.479000+00:00 | ['Mental Health', 'Health', 'Psychology', 'Sci... |
| 1 | Your Brain On Coronavirus | Your Brain On Coronavirus\n\nA guide to the cu... | https://medium.com/age-of-awareness/how-the-pa... | ['Simon Spichak'] | 2020-09-23 22:10:17.126000+00:00 | ['Mental Health', 'Coronavirus', 'Science', 'P... |
| 2 | Mind Your Nose | Mind Your Nose\n\nHow smell training can chang... | https://medium.com/neodotlife/mind-your-nose-f... | [] | 2020-10-10 20:17:37.132000+00:00 | ['Biotechnology', 'Neuroscience', 'Brain', 'We... |
| 3 | The 4 Purposes of Dreams | Passionate about the synergy between science a... | https://medium.com/science-for-real/the-4-purp... | ['Eshan Samaranayake'] | 2020-12-21 16:05:19.524000+00:00 | ['Health', 'Neuroscience', 'Mental Health', 'P... |
| 4 | Surviving a Rod Through the Head | You’ve heard of him, haven’t you? Phineas Gage... | https://medium.com/live-your-life-on-purpose/s... | ['Rishav Sinha'] | 2020-02-26 00:01:01.576000+00:00 | ['Brain', 'Health', 'Development', 'Psychology... |

The dataframe has six columns:

  • The title of the article.

  • The text of the article, i.e. its body.

  • The url from which the article was scraped.

  • The authors of the article.

  • The timestamp, i.e. the date and time at which the article was published.

  • The tags of the article.

Many blogs and newspaper websites use tags to better organize the articles within the website. For example, an article titled “Transcribing YouTube videos with Whisper in a few lines of code” may have as tags “Data Science”, “Artificial Intelligence”, “Machine Learning”, “NLP”, and “Speech”.
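
To get a feel for which tags are common in this dataset, we can count them. Note that pd.read_csv leaves the tags column as string-encoded lists (e.g. "['Health', 'Python']"), so we parse each one with ast.literal_eval first:

# count the most frequent tags across all articles
# (the tags column holds string-encoded lists, so parse it first)
from ast import literal_eval
from collections import Counter

tag_counter = Counter(
    tag
    for tags in df_articles["tags"].apply(literal_eval)
    for tag in tags
)
print(tag_counter.most_common(10))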

Data Preprocessing#

In this small project, we’ll train a classifier to distinguish whether an article has the “Data Science” tag or not, so let’s add a new column is_data_science to our dataframe. The classifier will learn from the title and the text of the articles, so let’s also add a full_text column containing the concatenation of the title and text columns.

# create two columns:
# - is_data_science: True if the article has the "Data Science" tag
#   (the tags column is a string here, so this is a substring check)
# - full_text: the concatenation of the title and the text of the article
df_articles["is_data_science"] = df_articles["tags"] \
  .apply(lambda tags_list: "Data Science" in tags_list)
df_articles["full_text"] = df_articles["title"] + " " + df_articles["text"]
df_articles.head()
|   | title | text | url | authors | timestamp | tags | is_data_science | full_text |
|---|-------|------|-----|---------|-----------|------|-----------------|-----------|
| 0 | Mental Note Vol. 24 | Photo by Josh Riemer on Unsplash\n\nMerry Chri... | https://medium.com/invisible-illness/mental-no... | ['Ryan Fan'] | 2020-12-26 03:38:10.479000+00:00 | ['Mental Health', 'Health', 'Psychology', 'Sci... | False | Mental Note Vol. 24 Photo by Josh Riemer on Un... |
| 1 | Your Brain On Coronavirus | Your Brain On Coronavirus\n\nA guide to the cu... | https://medium.com/age-of-awareness/how-the-pa... | ['Simon Spichak'] | 2020-09-23 22:10:17.126000+00:00 | ['Mental Health', 'Coronavirus', 'Science', 'P... | False | Your Brain On Coronavirus Your Brain On Corona... |
| 2 | Mind Your Nose | Mind Your Nose\n\nHow smell training can chang... | https://medium.com/neodotlife/mind-your-nose-f... | [] | 2020-10-10 20:17:37.132000+00:00 | ['Biotechnology', 'Neuroscience', 'Brain', 'We... | False | Mind Your Nose Mind Your Nose\n\nHow smell tra... |
| 3 | The 4 Purposes of Dreams | Passionate about the synergy between science a... | https://medium.com/science-for-real/the-4-purp... | ['Eshan Samaranayake'] | 2020-12-21 16:05:19.524000+00:00 | ['Health', 'Neuroscience', 'Mental Health', 'P... | False | The 4 Purposes of Dreams Passionate about the ... |
| 4 | Surviving a Rod Through the Head | You’ve heard of him, haven’t you? Phineas Gage... | https://medium.com/live-your-life-on-purpose/s... | ['Rishav Sinha'] | 2020-02-26 00:01:01.576000+00:00 | ['Brain', 'Health', 'Development', 'Psychology... | False | Surviving a Rod Through the Head You’ve heard ... |
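
Before sampling, it’s worth checking how the two classes are distributed in the full dataset:

# check the class balance of the full dataset
print(df_articles["is_data_science"].value_counts())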

Then, to make training faster (just for convenience and learning purposes), let’s sample 1000 articles with the is_data_science column set to True and 1000 articles with it set to False.

# sample 1000 articles with is_data_science = True and 1000 articles with
# is_data_science = False
df = pd.concat([
    df_articles[df_articles["is_data_science"]].sample(n=1000),
    df_articles[~df_articles["is_data_science"]].sample(n=1000)
])

Last, let’s split the dataframe into a train set and a test set using the train_test_split function. Notice that we are passing the labels to the stratify argument, so that the distribution of the labels in the train and test sets is approximately the same. Read about Stratified Sampling to learn more.

# train/test split
X = df[["full_text"]]
y = df["is_data_science"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42, stratify=y)
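
We can sanity-check the stratification by comparing the label proportions of the two splits, which should be nearly identical:

# verify that stratification preserved the label distribution
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))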

Model Training#

We are now ready to fit the CountVectorizer on the training set and then train our LogisticRegression.

Remember to fit the CountVectorizer after the train/test split, and only on the training set, to avoid data leakage: the vocabulary must be built without ever seeing the test documents.

# fit vectorizer, vectorize train set, and train the classification model
vectorizer = CountVectorizer(ngram_range=(1, 1))
full_texts_vectorized = vectorizer.fit_transform(X_train["full_text"])
model = LogisticRegression()
model.fit(full_texts_vectorized, y_train)
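
To get a sense of the dimensionality of the bag-of-words representation, we can inspect the vectorized training data:

# one row per article, one column per unique token in the training vocabulary
print(full_texts_vectorized.shape)
print(len(vectorizer.vocabulary_))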

Get Metrics on Test Data#

Now that our model is ready, let’s vectorize the test data as well and compute the predictions.

# vectorize test set and predict
full_texts_vectorized = vectorizer.transform(X_test["full_text"])
predictions = model.predict(full_texts_vectorized)
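
Besides hard predictions, LogisticRegression also exposes class probabilities through predict_proba, which is useful if you want to rank articles or tune a decision threshold:

# class probabilities for the first five test articles
# (columns follow model.classes_, i.e. [False, True])
probabilities = model.predict_proba(full_texts_vectorized)
print(model.classes_)
print(probabilities[:5])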

The classification_report function prints the results of several classification metrics on the test set, such as precision, recall, and f1-score.

# plot precision, recall, f1-score on test set
print(classification_report(y_test, predictions))
              precision    recall  f1-score   support

       False       0.88      0.91      0.89       200
        True       0.90      0.88      0.89       200

    accuracy                           0.89       400
   macro avg       0.89      0.89      0.89       400
weighted avg       0.89      0.89      0.89       400
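
If you need these numbers programmatically rather than as a printed report, the same metrics are available as standalone sklearn functions:

# compute individual metrics on the test set
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

print("accuracy :", accuracy_score(y_test, predictions))
print("precision:", precision_score(y_test, predictions))
print("recall   :", recall_score(y_test, predictions))
print("f1-score :", f1_score(y_test, predictions))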

The confusion_matrix function builds an array-shaped representation of the confusion matrix over the test set. The ConfusionMatrixDisplay class then plots this confusion matrix using the plotting library matplotlib.

# plot confusion matrix
cm = confusion_matrix(y_test, predictions, labels=model.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=["No Data Science", "Data Science"])
p = disp.plot()
fig = p.figure_
fig.set_facecolor('white')
[Figure: confusion matrix of the classifier on the test set]
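
Since this is a binary problem, the four cells of the matrix can also be unpacked directly (rows are true labels, columns are predicted labels, both in model.classes_ order):

# unpack the binary confusion matrix
tn, fp, fn, tp = cm.ravel()
print(f"true negatives: {tn}, false positives: {fp}")
print(f"false negatives: {fn}, true positives: {tp}")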

Last, to get some insight into what the model has learned, let’s show the top 20 words with the highest weights in the LogisticRegression model, i.e. the words whose presence contributes the most to a True prediction of is_data_science.

# show top 20 ngrams by logistic regression weight
# (get_feature_names_out returns the vocabulary in column order,
# matching the columns of model.coef_)
feature_names = vectorizer.get_feature_names_out()
ngram_weight_pairs = list(zip(feature_names, model.coef_[0]))
ngram_weight_pairs_sorted = sorted(ngram_weight_pairs, key=lambda t: t[1],
                                   reverse=True)
ngram_weight_pairs_sorted[:20]
[('science', 0.8269998908269114),
 ('data', 0.7649240028489988),
 ('grafiti', 0.5164785998868469),
 ('average', 0.45935491367469017),
 ('install', 0.44434687858001426),
 ('python', 0.42677800598516474),
 ('apa', 0.3764474411540241),
 ('problem', 0.376067783595879),
 ('itu', 0.37441604838059017),
 ('mengenal', 0.37441604838059017),
 ('saja', 0.37441604838059017),
 ('tipe', 0.37441604838059017),
 ('what', 0.3682372634460391),
 ('dataset', 0.3660530682179952),
 ('intelligence', 0.35362543691872533),
 ('graphs', 0.3468732224423762),
 ('math', 0.3442278904564499),
 ('hey', 0.3341740770629924),
 ('were', 0.3326927860423783),
 ('alexey', 0.32750477125896876)]

Among these words we find science, data, and python, as expected.
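
As a final check, we can run the model on a couple of made-up titles (hypothetical inputs, not taken from the dataset) to see the whole pipeline in action:

# classify two made-up article titles
examples = [
    "A gentle introduction to pandas dataframes",
    "My favorite hiking trails in the Alps",
]
examples_vectorized = vectorizer.transform(examples)
print(model.predict(examples_vectorized))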

Code Exercises#

Questions and Feedback#

Have questions about this lesson? Would you like to exchange ideas? Or would you like to point out something that needs to be corrected? Join the NLPlanet Discord server and interact with the community! There’s a specific channel for this course called practical-nlp-nlplanet.