2.12 Project: Clustering Newspaper Articles#

In this lesson, we’ll see how to leverage sentence embeddings to cluster newspaper articles.

The Clustering Pipeline#

Here’s what we are going to do:

  1. Create embeddings for the newspaper articles using sentence-transformers.

  2. Reduce the dimensionality of the embeddings using umap.

  3. Visualize the embeddings with plotly.

  4. Cluster the embeddings using hdbscan.

  5. Assign a meaningful name to each cluster leveraging keyword extraction techniques from keybert.

The Dataset#

The AG (Antonio Gulli) dataset is a collection of more than 1 million news articles. Here’s its description from its Hugging Face dataset card:

News articles have been gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc), information retrieval (ranking, search, etc), XML, data compression, data streaming, and any other non-commercial activity.

We’ll cluster a subset of 3k articles from it.

Dimensionality Reduction with UMAP#

UMAP is a flexible non-linear dimensionality reduction algorithm based on manifold learning, and it’s very useful for visualizing high-dimensional datasets. Read this article to get an understanding of how it works and to see a comparison with t-SNE.
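To make this concrete, here’s a minimal sketch of how UMAP is typically used (the data is random and the hyperparameter values are illustrative; later in this lesson we’ll use values tuned on our articles):

# minimal UMAP sketch: random data, illustrative hyperparameters
import numpy as np
import umap

# e.g. 500 sentence embeddings with 768 dimensions each
X = np.random.rand(500, 768)

# n_neighbors trades off local vs. global structure;
# min_dist controls how tightly points are packed in the reduced space
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1)
X_2d = reducer.fit_transform(X)
print(X_2d.shape)  # (500, 2)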

Clustering with HDBSCAN#

HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is a variation of the DBSCAN clustering algorithm that can find clusters of varying densities (unlike DBSCAN) and is more robust to parameter selection. You can learn more about it in its documentation.
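As a preview, here’s a minimal sketch of HDBSCAN on toy data (the min_cluster_size value, its main hyperparameter, is illustrative):

# minimal HDBSCAN sketch on toy 2-D blobs
import hdbscan
from sklearn.datasets import make_blobs

# 300 points grouped in 3 blobs with different spreads (densities)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=[0.5, 1.0, 2.0],
                  random_state=42)

clusterer = hdbscan.HDBSCAN(min_cluster_size=10)
labels = clusterer.fit_predict(X)

# points that don't fall in any dense region get the noise label -1
print(set(labels))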

I’d also suggest reading this article about clustering methods in sklearn to learn about commonly used clustering algorithms. The following image is taken from that article.

[Image: comparison of clustering algorithms in scikit-learn]

Extracting Keywords with KeyBERT#

KeyBERT is an easy-to-use library for extracting keywords from text. Here’s how it’s described in its repository:

KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to create keywords and keyphrases that are most similar to a document.

In short, the library uses BERT embeddings and simple cosine similarity to find the sub-phrases in a document that are most similar to the document itself. You can learn more about how it works in this Medium post.
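For example, here’s a minimal sketch of keyword extraction with KeyBERT (the document and the embedding model name are illustrative):

# minimal KeyBERT sketch: extract keyphrases from a single document
from keybert import KeyBERT

doc = "The stock market rallied today as tech companies reported strong earnings."

kw_model = KeyBERT(model="all-MiniLM-L6-v2")
keywords = kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), top_n=5)

# list of (keyphrase, cosine similarity to the document) tuples
print(keywords)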

Coding with Python#

We are now ready to implement the clustering pipeline over the AG dataset.

Install and Import Libraries#

Let’s install the necessary libraries.

pip install datasets sentence-transformers umap-learn hdbscan keybert

We’re going to use the umap-learn library for its implementation of the UMAP dimensionality reduction algorithm, hdbscan for its implementation of the clustering algorithm HDBSCAN, and keybert for its keyword extraction techniques.

# manage data
from datasets import load_dataset
import pandas as pd

# embeddings
from sentence_transformers import SentenceTransformer

# dimensionality reduction
import umap

# clustering
import hdbscan

# extract keywords from texts
# used to assign meaningful names to clusters
from keybert import KeyBERT

# visualization
import plotly.express as px

Download and Prepare Dataset#

As we often do, we load the dataset using the load_dataset function.

# download data
dataset = load_dataset("ag_news", split="train")
print(dataset)
Dataset({
    features: ['text', 'label'],
    num_rows: 120000
})

The dataset contains 120k articles. Let’s keep only 3k of them to make computations faster in this project.

# sample 3k articles to make computations faster
dataset_subset = dataset.train_test_split(train_size=3000)["train"]
print(dataset_subset)
Dataset({
    features: ['text', 'label'],
    num_rows: 3000
})

Then, we convert the dataset to a pandas dataframe and drop its label column (it’s used for benchmarking text classification, which is not our goal in this project).

# convert dataset to pandas dataframe
df = pd.DataFrame(dataset_subset).drop("label", axis=1)
df.head()
text
0 Mistakes cost Dallas dearly Mistakes will driv...
1 Bomb scare forces Singapore plane to UK A Sing...
2 Only Drills, but Houston Looks Ready to Return...
3 Pact to speed Navy Yard plans The Boston Redev...
4 Pixar's Waiting for Summer Plus, IBM's win-win...

Create Article Embeddings#

Let’s download an embedding model using the sentence-transformers library.

# download the sentence embeddings model
embedder = SentenceTransformer('all-mpnet-base-v2')

We then compute the embeddings of all the articles in our dataset.

# embed article texts
corpus_embeddings = embedder.encode(df["text"].values)
print(corpus_embeddings.shape)
(3000, 768)

Reduce Embedding Dimensionality#

To visualize the embeddings, let’s reduce their dimensionality to 2 using UMAP. UMAP is rather sensitive to its hyperparameters, so the following code uses the hyperparameter values that worked best in my experiments.

# reduce the size of the embeddings using UMAP
reduced_embeddings = umap.UMAP(n_components=2, n_neighbors=100, min_dist=0.02).fit_transform(corpus_embeddings)
print(reduced_embeddings.shape)

# put the values of the two dimensions inside the dataframe
df["x"] = reduced_embeddings[:, 0]
df["y"] = reduced_embeddings[:, 1]

# substring of the full text, for visualization purposes
df["text_short"] = df["text"].str[:100]
(3000, 2)

Embeddings Visualization#

We are now ready to visualize our embeddings in 2 dimensions using plotly.

# scatter plot
hover_data = {
    "text_short": True,
    "x": False,
    "y": False
}
fig = px.scatter(df, x="x", y="y", template="plotly_dark",
                   title="Embeddings", hover_data=hover_data)
fig.update_layout(showlegend=False)
fig.show()