# 1.6 Project: Classify Medium Articles with Bag of Words

Now that we have some basic knowledge of n-grams and text normalization (e.g. stemming, lemmatization, tokenization), let's build a simple project in which we train a model to predict whether [Medium](https://medium.com/) articles are about data science or not.

In this project, we'll use the [*medium-articles*](https://huggingface.co/datasets/fabiochiu/medium-articles) dataset from Hugging Face, which is a collection of scraped articles containing their title, text, and tags. We won't deal with how to scrape content from the web.

## Libraries

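First, let's install the [`datasets`](https://github.com/huggingface/datasets) library, which allows for easy access to datasets on the [Hugging Face Hub](https://huggingface.co/docs/hub/index). Installing it also installs `huggingface_hub` as a dependency, which is the package we use below to download the dataset file.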

In [None]:
!pip install datasets

Then, we import the necessary libraries and modules:
- The [`hf_hub_download`](https://huggingface.co/docs/huggingface_hub/v0.10.1/en/package_reference/file_download#huggingface_hub.hf_hub_download) function from the [`huggingface_hub`](https://github.com/huggingface/huggingface_hub) package downloads a single file from a repository on the Hugging Face Hub.
- `pandas` lets us work with tabular data using dataframes.
- We split the dataset into a train set and a test set with the [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) helper function from `sklearn`.
- We use the `CountVectorizer` class to build a bag of words representation of the articles.
- Our classifier will be a `LogisticRegression`.
- We analyse the performance of our classification model on the test set with some standard `sklearn` helper functions like [`classification_report`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) and [`confusion_matrix`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html).
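
To make the bag of words idea concrete before applying it to the articles, here is a minimal sketch on a made-up corpus (the sentences and the `toy_` names are illustrative assumptions, not part of the dataset). Each document becomes a row of word counts, with one column per word in the learned vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (made up for illustration, not taken from the dataset)
toy_docs = [
    "data science is fun",
    "cooking is fun",
    "data data everywhere",
]

toy_vectorizer = CountVectorizer()
# Sparse matrix of shape (number of documents, vocabulary size)
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# Vocabulary words ordered by their column index
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(toy_vocab)
print(toy_counts.toarray())
```

Word order inside each document is lost: only the counts survive, which is why the third document has a 2 in the `data` column.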

In [None]:
from huggingface_hub import hf_hub_download

import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (classification_report, confusion_matrix,
  ConfusionMatrixDisplay)

## Download the Dataset

Let's download the dataset and convert it to a `pandas` dataframe.

In [None]:
# download dataset of Medium articles from 
# https://huggingface.co/datasets/fabiochiu/medium-articles
df_articles = pd.read_csv(
  hf_hub_download("fabiochiu/medium-articles", repo_type="dataset",
                  filename="medium_articles.csv")
)

df_articles.head()

The dataframe has six columns:
- The `title` of the article.
- The `text` of the article, i.e. its body.
- The `url` from which the article was scraped.
- The `authors` of the article.
- The `timestamp`, i.e. the date and time at which the article was published.
- The `tags` of the article.

Many blogs and newspaper websites use tags to organize their articles. For example, an article titled ["Transcribing YouTube videos with Whisper in a few lines of code"](https://medium.com/nlplanet/transcribing-youtube-videos-with-whisper-in-a-few-lines-of-code-f57f27596a55) may be tagged with "Data Science", "Artificial Intelligence", "Machine Learning", "NLP", and "Speech".

## Data Preprocessing

In this small project, we'll train a classifier to predict whether an article has the "Data Science" tag or not, so let's add a new column `is_data_science` to our dataframe. The classifier will learn from the title and the text of the articles, so let's also add a `full_text` column containing the concatenation of the `title` and the `text` columns.

In [None]:
# create two columns:
# - is_data_science: a boolean which is True if the article has the "Data Science" tag.
#   Note: read_csv loads `tags` as a plain string, so `in` is a substring check here.
# - full_text: contains the concatenation of the title and the text of the article.
df_articles["is_data_science"] = df_articles["tags"].apply(
    lambda tags: "Data Science" in tags
)
df_articles["full_text"] = df_articles["title"] + " " + df_articles["text"]
df_articles.head()
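
Since `pd.read_csv` returns the `tags` column as strings, the `in` check above matches substrings rather than whole tags. The following sketch contrasts the two behaviours on made-up rows. It assumes the strings are formatted as Python list literals, which we have not verified for every row of the dataset, and the `toy` dataframe and its column names are hypothetical.

```python
import ast

import pandas as pd

# Toy rows (hypothetical): the strings mimic how a list column looks
# after a round trip through a CSV file
toy = pd.DataFrame({
    "tags": [
        "['Data Science', 'Python']",
        "['Data Science Jobs', 'Career']",
        "['Cooking']",
    ]
})

# Substring check on the raw string: also matches "Data Science Jobs"
toy["substring_match"] = toy["tags"].apply(lambda s: "Data Science" in s)

# Exact check: parse the string into a real list first
toy["exact_match"] = toy["tags"].apply(
    lambda s: "Data Science" in ast.literal_eval(s)
)
print(toy[["substring_match", "exact_match"]])
```

Whether the stricter check is preferable depends on whether related tags should count as data science articles too.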

Then, to make training faster (just for convenience and learning purposes), let's sample 1000 articles with the `is_data_science` column set to `True` and 1000 articles with it set to `False`.

In [None]:
# sample 1000 articles with is_data_science = True and 1000 articles with
# is_data_science = False (random_state makes the sample reproducible)
df = pd.concat([
    df_articles[df_articles["is_data_science"]].sample(n=1000, random_state=42),
    df_articles[~df_articles["is_data_science"]].sample(n=1000, random_state=42)
])
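
As an alternative sketch of the same balanced sampling, `groupby(...).sample(...)` draws a fixed number of rows per class in one call. The data below is made up, and the `toy_` names are assumptions introduced for illustration.

```python
import pandas as pd

# Toy, imbalanced dataframe (hypothetical data): 3 positives and 7 negatives
toy_df = pd.DataFrame({
    "full_text": [f"article {i}" for i in range(10)],
    "is_data_science": [True] * 3 + [False] * 7,
})

# Draw the same number of rows from each class
toy_balanced = toy_df.groupby("is_data_science").sample(n=2, random_state=0)
print(toy_balanced["is_data_science"].value_counts())
```

Balancing the classes like this keeps accuracy easy to interpret, since a model that always predicts one class scores only 50%.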

Last, let's split the dataframe into a train set and a test set using the `train_test_split` function. Notice that we pass the labels to the `stratify` argument, so that the distribution of the labels in the train and test sets is approximately the same. Read about [Stratified Sampling](https://en.wikipedia.org/wiki/Stratified_sampling) to learn more.

In [None]:
# train/test split
X = df[["full_text"]]
y = df["is_data_science"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42, stratify=y)
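
The effect of `stratify` is easier to see on imbalanced labels than on our 50/50 sample. The sketch below uses made-up labels (the `toy_` names are hypothetical) and checks the share of positives in each split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels (hypothetical): 80 negatives and 20 positives
toy_y = pd.Series([False] * 80 + [True] * 20)
toy_X = pd.DataFrame({"full_text": [f"article {i}" for i in range(100)]})

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# Share of positives in each split; both should match the original 20%
print(toy_y_train.mean(), toy_y_test.mean())
```

Without `stratify`, the share of positives in a small test set can drift away from the original one purely by chance.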

## Model Training

We are now ready to fit the `CountVectorizer` on the training set and then train our `LogisticRegression`.

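Remember to fit the `CountVectorizer` on the training set only, i.e. **after** the train/test split, to avoid [data leakage](https://machinelearningmastery.com/data-leakage-machine-learning/): otherwise the vocabulary would be built with the help of the test articles.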

In [None]:
# fit vectorizer, vectorize train set, and train the classification model
vectorizer = CountVectorizer(ngram_range=(1, 1))
full_texts_vectorized = vectorizer.fit_transform(X_train["full_text"])
model = LogisticRegression()
model.fit(full_texts_vectorized, y_train)
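
One way to make the "fit on the training set only" rule hard to break is to bundle the vectorizer and the classifier in a scikit-learn pipeline. This is a sketch on made-up texts, not the approach used in the rest of this lesson, and all `toy_` names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data (hypothetical), standing in for X_train["full_text"] and y_train
toy_train_texts = [
    "python data analysis with pandas",
    "machine learning model on data",
    "my favourite pasta recipe",
    "travel diary from a trip to rome",
]
toy_train_labels = [True, True, False, False]
toy_test_texts = ["analysis of data with python", "a quantum recipe"]

# fit() learns both the vocabulary and the weights from the training texts only
toy_pipeline = make_pipeline(CountVectorizer(), LogisticRegression())
toy_pipeline.fit(toy_train_texts, toy_train_labels)

# predict() only transforms the new texts, so words seen only at test time
# (e.g. "quantum") never enter the vocabulary and are simply ignored
toy_vocabulary = toy_pipeline.named_steps["countvectorizer"].vocabulary_
print("quantum" in toy_vocabulary)
toy_predictions = toy_pipeline.predict(toy_test_texts)
print(toy_predictions)
```

With a pipeline, calling `fit` on the training data and `predict` on the test data is enough, and there is no separate vectorizer object to fit by mistake on the full dataset.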

## Get Metrics on Test Data

Now that our model is ready, let's vectorize the test data as well and compute the predictions.

In [None]:
# vectorize test set and predict
full_texts_vectorized = vectorizer.transform(X_test["full_text"])
predictions = model.predict(full_texts_vectorized)

The `classification_report` function returns a text summary of several classification metrics on the test set, such as precision, recall, and f1-score, which we then print.

In [None]:
# print precision, recall, f1-score on test set
print(classification_report(y_test, predictions))
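
To see where the numbers in such a report come from, the sketch below computes precision, recall, and f1-score by hand for the positive class and compares them with the scikit-learn helpers. The labels and predictions are made up (they are not results of the model above), and the variable names are hypothetical.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels and predictions (hypothetical): 1 = data science, 0 = not
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 1, 0, 0, 0]

pairs = list(zip(toy_true, toy_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

precision = tp / (tp + fp)  # how many predicted positives are correct
recall = tp / (tp + fn)     # how many actual positives are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
print(precision, recall, f1)

# The scikit-learn helpers should agree with the manual computation
print(precision_score(toy_true, toy_pred),
      recall_score(toy_true, toy_pred),
      f1_score(toy_true, toy_pred))
```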

The `confusion_matrix` function computes the [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix) over the test set and returns it as an array. The `ConfusionMatrixDisplay` class then plots this confusion matrix using the plotting library [`matplotlib`](https://matplotlib.org/).

In [None]:
# plot confusion matrix
cm = confusion_matrix(y_test, predictions, labels=model.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=["No Data Science", "Data Science"])
p = disp.plot()
fig = p.figure_
fig.set_facecolor('white')
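
As a small sketch of how to read the array behind the plot, the example below builds a confusion matrix from made-up labels and predictions (the `cm_` and `toy_cm` names are hypothetical). Rows correspond to true classes and columns to predicted classes, in the order given by `labels`.

```python
from sklearn.metrics import confusion_matrix

# Toy labels and predictions (hypothetical): 0 = "No Data Science", 1 = "Data Science"
cm_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
cm_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# Rows are true classes, columns are predicted classes
toy_cm = confusion_matrix(cm_true, cm_pred, labels=[0, 1])
print(toy_cm)

# For a binary problem, ravel() yields the four cells in this order
tn, fp, fn, tp = toy_cm.ravel()
print(tn, fp, fn, tp)
```

The diagonal holds the correct predictions, so a good classifier concentrates its counts there.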

Last, to get some insight into what the model has learned, let's show the 20 words with the highest weights in the `LogisticRegression` model, i.e. the words whose presence contributes the most to a `True` prediction of `is_data_science`.

In [None]:
# show top 20 ngrams by logistic regression weight
# sort the vocabulary by column index, so that ngrams line up with model.coef_[0]
ngram_indices_sorted = sorted(vectorizer.vocabulary_.items(), key=lambda t: t[1])
ngram_sorted = [ngram for ngram, _ in ngram_indices_sorted]
# pair each ngram with its weight, then sort by weight, highest first
ngram_weight_pairs = list(zip(ngram_sorted, model.coef_[0]))
ngram_weight_pairs_sorted = sorted(ngram_weight_pairs, key=lambda t: t[1], reverse=True)
ngram_weight_pairs_sorted[:20]

Among these words we find `data` and `python`, as expected.

# Code Exercises

<button
    class="colab-button"
    onclick="window.open('https://colab.research.google.com/drive/1Yc19Amenop1FET4jgdfd6OYMS3wVWQAQ?usp=sharing','_blank')"
    type="button">
    Go to Notebook
</button>

# Questions and Feedback

Have questions about this lesson? Would you like to exchange ideas? Or would you like to point out something that needs to be corrected? Join the [NLPlanet Discord server](https://discord.gg/zfC862H2dJ) and interact with the community! There's a specific channel for this course called **practical-nlp-nlplanet**.