1.8 Project: Text Generation with N-Grams#

N-grams are also useful for building (naive) probabilistic text generation models. Let’s see how to build one from Shakespeare’s plays.

Download Books from Project Gutenberg#

First, let’s import the necessary libraries. We’ll download Shakespeare’s plays using the corpus module of the NLTK library. Its gutenberg module provides easy access to texts from Project Gutenberg, a library of over 60,000 free books and cultural works, including some of Shakespeare’s plays.

import numpy as np
import nltk
nltk.download('gutenberg')

from collections import defaultdict
# list all corpora from project Gutenberg in NLTK
nltk.corpus.gutenberg.fileids()
['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

As we can see, the gutenberg module allows easy access to a subset of all the books available on the website. In this example, we’ll focus on shakespeare-caesar, shakespeare-hamlet, and shakespeare-macbeth. The plays come as .txt files and can be read with the nltk.corpus.gutenberg.words function, which returns them already tokenized.

# list of Shakespeare corpora
shakespeare_corpora = [
    "shakespeare-caesar.txt",
    "shakespeare-hamlet.txt",
    "shakespeare-macbeth.txt"
]

# get all corpora
corpora = {
    corpus_name: nltk.corpus.gutenberg.words(corpus_name)
    for corpus_name in shakespeare_corpora
}

# print some sentences from a corpus...
some_n_tokens = corpora["shakespeare-caesar.txt"][1002:1050]
print(" ".join(some_n_tokens))
Set him before me , let me see his face Cassi . Fellow , come from the throng , look vpon Caesar Caes . What sayst thou to me now ? Speak once againe , Sooth . Beware the Ides of March Caes . He is a Dreamer

Count Occurrences#

We now count how many times a token is followed by another token and save that info in the from_token_to_next_token_counts dictionary.

# count how many times a specific token is right after another specific token
# in the corpora

# example: from the text "the dog is under the table.", we want to obtain the dictionary
# {
#  "the": { "dog": 1, "table": 1 },
#  "dog": { "is": 1 },
#  "is": { "under": 1 },
#  "under": { "the": 1 },
#  "table": { ".": 1 }
# }

# from_token_to_next_token_counts = { token: { next_token: num_of_occurrences } }
from_token_to_next_token_counts = defaultdict(dict)

for corpus in corpora.values():
  for i in range(len(corpus) - 1):
    token = corpus[i].lower()
    next_token = corpus[i + 1].lower()
    if next_token not in from_token_to_next_token_counts[token]:
      from_token_to_next_token_counts[token][next_token] = 0
    from_token_to_next_token_counts[token][next_token] += 1

# print 10 examples of tokens that followed the token "from" in the corpora, along
# with their counts of occurrences
print(list(from_token_to_next_token_counts["from"].items())[:10])
[('the', 37), ('caesars', 1), ('your', 5), ('their', 3), ('brutus', 1), ('that', 2), ('seuerall', 1), ('qualitie', 1), ('bondage', 1), ('power', 1)]

The token “from” is followed 37 times by the token “the”, 5 times by “your”, 3 times by “their”, and so on.
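As a quick sanity check (not part of the lesson code), the same counts can be recomputed with collections.Counter over adjacent token pairs; the bigram_counts name below is just illustrative.

# sanity check: recount adjacent token pairs with collections.Counter
from collections import Counter

bigram_counts = Counter()
for corpus in corpora.values():
  tokens = [t.lower() for t in corpus]
  bigram_counts.update(zip(tokens, tokens[1:]))

# should match the 37 occurrences of "the" after "from" printed above
print(bigram_counts[("from", "the")])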

Estimate Probabilities#

Then, from the counts of occurrences, we can easily estimate the probability of a specific token following another specific token. We save that info in the from_token_to_next_token_probs dictionary.

# transform occurrences into probabilities

# example: from the text "the dog is under the table.", we want to obtain the dictionary
# {
#  "the": { "dog": 0.5, "table": 0.5 },
#  "dog": { "is": 1 },
#  "is": { "under": 1 },
#  "under": { "the": 1 },
#  "table": { ".": 1 }
# }

# from_token_to_next_token_probs = { token: { next_token: probability } }
from_token_to_next_token_probs = {}

for token, d_token in from_token_to_next_token_counts.items():
  sum_of_counts_for_token = sum(d_token.values())
  from_token_to_next_token_probs[token] = {
      next_token: count / sum_of_counts_for_token
      for next_token, count
      in d_token.items()
  }

# print 10 examples of tokens that followed the token "from" in the corpora, along
# with their probabilities
print(list(from_token_to_next_token_probs["from"].items())[:10])
[('the', 0.19170984455958548), ('caesars', 0.0051813471502590676), ('your', 0.025906735751295335), ('their', 0.015544041450777202), ('brutus', 0.0051813471502590676), ('that', 0.010362694300518135), ('seuerall', 0.0051813471502590676), ('qualitie', 0.0051813471502590676), ('bondage', 0.0051813471502590676), ('power', 0.0051813471502590676)]
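Since each inner dictionary is a probability distribution over next tokens, its values should sum to (approximately) 1. Here is a minimal check on a couple of arbitrarily chosen tokens:

# each distribution of next-token probabilities should sum to ~1
for token in ["from", "the"]:
  print(token, sum(from_token_to_next_token_probs[token].values()))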

Now we just have to write a small sample_next_token function that samples the next token given a starting token, according to the computed probabilities. We can leverage numpy.random.choice, a utility function that samples elements from a list according to specified probabilities.

# sample the next token according to the computed probabilities
def sample_next_token(token, from_token_to_next_token_probs):
  next_tokens, next_tokens_probs = list(zip(*from_token_to_next_token_probs[token].items()))
  next_token_sampled = np.random.choice(next_tokens, size=1, p=next_tokens_probs)[0]
  return next_token_sampled

print(sample_next_token("from", from_token_to_next_token_probs))
him
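Because the output is random, each call may return a different token. Sampling many times should roughly reproduce the estimated probabilities; here is an illustrative check (the sample size of 1000 is arbitrary):

# empirical frequency of "the" after "from" over many samples
samples = [sample_next_token("from", from_token_to_next_token_probs) for _ in range(1000)]
print(samples.count("the") / len(samples))  # should be close to the ~0.19 printed above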

Finally, we can call the sample_next_token function repeatedly inside a new generate_text_from_token function to generate a text several tokens long.

# repeatedly sample tokens to generate long text
def generate_text_from_token(token, from_token_to_next_token_probs, n_words_to_generate):
  text = token
  for _ in range(n_words_to_generate):
    next_token = sample_next_token(token, from_token_to_next_token_probs)
    text += " " + next_token
    token = next_token
  return text

first_token = "from"
n_words_to_generate = 50
generated_text = generate_text_from_token(first_token, from_token_to_next_token_probs, n_words_to_generate)
print(generated_text)
from what is a day caes the maleuolence of griefe due to disgest his way : vnequall match , and goes hence , this i pray god of watchfull tyranny is begun . macb . and wisely and tell him three to the faire world , wee ' twer well spoken

The generated text is remotely reminiscent of English, although it has numerous grammatical flaws.
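Note that, since generation is random, each run produces a different text. If you want reproducible outputs, one option is to fix NumPy’s random seed before generating (the seed value below is arbitrary):

# optional: fix the random seed so repeated runs generate the same text
np.random.seed(42)
print(generate_text_from_token("from", from_token_to_next_token_probs, 20))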

The N-grams Tradeoff#

The results are not great, but you can see some regularities, such as articles usually being followed by nouns. To generate better text, we need to consider more than just the one previous token: we could build the same model with 2-grams or 3-grams, for example, which typically works better (a sketch of this follows the list below). However, when moving to 4-grams and 5-grams, a tradeoff becomes clear:

  • When using 1-grams, the model has little context information, but it can estimate the probabilities well because there are many repeated 1-grams in the corpora.

  • When using 2-grams and 3-grams, the model has a bit more context information and can still estimate the probabilities reasonably well, as there are some repeated 2-grams and 3-grams in the corpora.

  • When using n-grams with a high n (e.g. > 3), each n-gram tends to appear only a few times (maybe only once) in the corpora, so the model degenerates to always reproducing the same text from the corpora, a sort of copy-paste.
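To make the tradeoff concrete, here is a minimal sketch (not part of the lesson code) of how the same pipeline could condition on the previous two tokens instead of one; the from_bigram_to_next_token_counts and from_bigram_to_next_token_probs names are illustrative.

# sketch: condition on the last two tokens instead of one (a 3-gram model)
from_bigram_to_next_token_counts = defaultdict(dict)

for corpus in corpora.values():
  tokens = [t.lower() for t in corpus]
  for i in range(len(tokens) - 2):
    bigram = (tokens[i], tokens[i + 1])
    next_token = tokens[i + 2]
    from_bigram_to_next_token_counts[bigram][next_token] = \
        from_bigram_to_next_token_counts[bigram].get(next_token, 0) + 1

# turn counts into probabilities, exactly as before
from_bigram_to_next_token_probs = {
    bigram: {t: c / sum(d.values()) for t, c in d.items()}
    for bigram, d in from_bigram_to_next_token_counts.items()
}

# next-token distribution after the two tokens "from the"
print(list(from_bigram_to_next_token_probs[("from", "the")].items())[:5])

Such a model typically produces more fluent local phrases, but many two-token contexts appear only once in the corpora. To see this sparsity directly, let’s count how often some specific n-grams appear in the corpora: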

# merge all the corpora in a single string
all_corpora_tokens = corpora["shakespeare-caesar.txt"] + corpora["shakespeare-hamlet.txt"] + corpora["shakespeare-macbeth.txt"]
all_corpora_tokens = [token.lower() for token in all_corpora_tokens]
all_corpora_text = " ".join(all_corpora_tokens)

# see how many specific 1-grams, 2-grams, 3-grams can be found in the corpus
print(all_corpora_text.count("from ")) # 1-grams
print(all_corpora_text.count("from the ")) # 2-grams
print(all_corpora_text.count("from the streets ")) # 3-grams
193
37
1

As you can see in the previous example, the 1-gram “from” appears 193 times in the corpora, while the 2-gram “from the” appears 37 times, and the 3-gram “from the streets” only once. As the length of the n-gram increases, the number of times it appears in the corpus decreases.

So, how do we generate better text? One solution is to use a bigger corpus, so that we can find many occurrences of 4-grams and 5-grams as well. However, even though the Internet contains billions and billions of texts, the probability of finding repeated n-grams decreases rapidly as n increases. Moreover, to generate coherent text we need to consider all the previous text (possibly hundreds of tokens long), not only the last few tokens. As a consequence, a change of approach is needed.

Still, as we saw in a previous lesson, today we can generate much longer, coherent text.

[Figure: text_generation.png — example of longer coherent generated text]

How do we do that? With word embeddings and neural networks, as you’ll learn throughout the course.

Code Exercises#

Questions and Feedback#

Have questions about this lesson? Would you like to exchange ideas? Or would you like to point out something that needs to be corrected? Join the NLPlanet Discord server and interact with the community! There’s a specific channel for this course called practical-nlp-nlplanet.