Big Data/Analytics Zone is brought to you in partnership with:

Gary Sieling is a software developer interested in dev-ops, database technologies, and machine learning. He has a computer science degree from the Rochester Institute of Technology. He has worked on many products in the legal and regulatory industries, having worked on and supported several data warehousing applications. Gary is a DZone MVB and is not an employee of DZone and has posted 62 posts at DZone. You can read more from them at their website. View Full User Profile

Rhyming with NLP and Shakespeare

07.22.2013
| 3835 views |
  • submit to reddit

Natural Language Processing with Python” (read my review) has lots of motivating examples for natural language processing, focused on NLTK, which among other things, does a nice job of collecting NLP datasets and algorithms into one library.

Let’s take one of Shakespeare’s sonnets and see if we can recommend alternate rhymes:

import nltk
from nltk.corpus import cmudict
 
sonnet = ['O thou, my lovely boy, who in thy power', \
   'Dost hold Time\'s fickle glass, his sickle, hour;', \
   'Who hast by waning grown, and therein showst', \
   'Thy lovers withering as thy sweet self growst;', \
   'If Nature, sovereign mistress over wrack,', \
   'As thou goest onwards, still will pluck thee back,', \
   'She keeps thee to this purpose, that her skill', \
   'May time disgrace and wretched minutes kill.', \
   'Yet fear her, O thou minion of her pleasure!', \
   'She may detain, but not still keep, her treasure:', \
   'Her audit, though delayed, answered must be,', \
   'And her quietus is to render thee.']

First, we tokenize the sonnet and grab the last word of each line.

from nltk.tokenize import word_tokenize, wordpunct_tokenize, sent_tokenize
tokens = [wordpunct_tokenize(s) for s in sonnet]
punct = set(['.', ',', '!', ':', ';'])
filtered = [ [w for w in sentence if w not in punct ] for sentence in tokens]
last = [ sentence[len(sentence) - 1] for sentence in filtered]

One of the corpora included in NLTK is the CMU pronouncing dictionary, which gives us pronunciations for words, so the next step is to look up pronunciations:

from nltk.corpus import cmudict
[[(w, p) for (word, pron) in cmudict.entries() if word == w] for w in last]
 
[[('power', ['P', 'AW1', 'ER0'])],
[('hour', ['AW1', 'ER0']),
('hour', ['AW1', 'R'])],
[], [],
[('wrack', ['R', 'AE1', 'K'])],
[('back', ['B', 'AE1', 'K'])],
[('skill', ['S', 'K', 'IH1', 'L'])],
[('kill', ['K', 'IH1', 'L'])],
[('pleasure', ['P', 'L', 'EH1', 'ZH', 'ER0'])],
[('treasure', ['T', 'R', 'EH1', 'ZH', 'ER0'])],
[('be', ['B', 'IY1']), ('be', ['B', 'IY0'])],
[('thee', ['DH', 'IY1'])]]

Unfortunately this falls apart for unrecognizable words. In a real system, we might want a fallback, such as comparing the last letters of the word to other words in a dictionary. An implementation of this would reduce letters to a common set and remove silent letters, e.g. “remove e from the end of words”, reduce some letter combinations to common sounds, “ie” -> “y”, “c” -> “k”, etc. Obviously this is tightly tied to English, but would help with made-up or rare words.

We can join the prior result to our list of words, and see pronunciations:

[[(w, p) for (word, pron) in cmudict.entries() if word == w] for w in last]
 
[[('power', ['P', 'AW1', 'ER0'])],
 [('hour', ['AW1', 'ER0']),
  ('hour', ['AW1', 'R'])], [], [],
  [('wrack', ['R', 'AE1', 'K'])],
  [('back', ['B', 'AE1', 'K'])],
  [('skill', ['S', 'K', 'IH1', 'L'])],
  [('kill', ['K', 'IH1', 'L'])],
  [('pleasure', ['P', 'L', 'EH1', 'ZH', 'ER0'])],
  [('treasure', ['T', 'R', 'EH1', 'ZH', 'ER0'])],
  [('be', ['B', 'IY1']),
  ('be', ['B', 'IY0'])],
  [('thee', ['DH', 'IY1'])]]

We can retrieve the number of sounds in a similar fashion:

syllables = \
  [[(word, len(pron)) for (word, pron) in cmudict.entries() if word == w] \
    for w in last]
 
[[('power', 3)],
 [('hour', 2),
  ('hour', 2)],
  [],
  [],
  [('wrack', 3)],
  [('back', 3)],
  [('skill', 4)],
  [('kill', 3)],
  [('pleasure', 5)],
  [('treasure', 5)],
  [('be', 2),
   ('be', 2)],
   [('thee', 2)]]

Combining the last two together, we can get the word, number of sounds, and phenome list in one tuple. What the following code shows is that even though Python is simpler than Perl, it’s still easy to build data structures that are impossible to read.

syllables = \
[[(w, len(p), p) for (w, p) in cmudict.entries() if word == w] \
   for w in last]
 
[[('power', 3, ['P', 'AW1', 'ER0'])],
 [('hour', 2, ['AW1', 'ER0']),
 ('hour', 2, ['AW1', 'R'])],
  [],
  [],
 [('wrack', 3, ['R', 'AE1', 'K'])],
 [('back', 3, ['B', 'AE1', 'K'])],
 [('skill', 4, ['S', 'K', 'IH1', 'L'])],
 [('kill', 3, ['K', 'IH1', 'L'])],
 [('pleasure', 5, ['P', 'L', 'EH1', 'ZH', 'ER0'])],
 [('treasure', 5, ['T', 'R', 'EH1', 'ZH', 'ER0'])],
 [('be', 2, ['B', 'IY1']), ('be', 2, ['B', 'IY0'])],
 [('thee', 2, ['DH', 'IY1'])]]

Now that we have this, we need to write a test to see whether two words rhyme. To do this, we remove the first sound, and compare the ends of two words:

def rhymes(s):
  try:
    (w, l, p) = s[0]
    try:
      return [wt for (wt, pt) in cmudict.entries() \
             if l == len(pt) and p[1:] == pt[1:] ]
    except:
      return [w]
  except:
    return []
 
[rhymes(s) for s in syllables]

This returns a lot of results – much of it is good, but a lot is also garbage.

[['bauer', 'baur', 'bougher', 'bower', 'cower',
 'dauer', 'dour', 'gauer', 'gower', 'hauer', 'hower',
 "how're", 'kauer', 'knauer', 'lauer', 'mauer',
 'nauer', 'power', 'rauer', 'sauer', 'schauer',
'shower', 'sour', 'tauer', 'tower'], ['auer', 'ayer',
'ayer', 'eyer', 'for', 'her', 'hour', 'iyer', 'our',
 'oyer', 'uher', 'were'], [], [], ['back', 'backe',
 'bak', 'bakke', 'cac', 'caq', 'dac', 'dack', 'dak',
 'fac', 'hack', 'hacke', 'haq', 'haque', 'jac', 'jack',
 'jacques', 'knack', 'lac', 'lack', 'lak', 'mac',
 'mack', 'macke', 'mak', 'nack', 'nacke', 'pac',
'pack', 'pak', 'paque', 'ptak', 'rack', 'rak', 'sac',
 'sack', 'sak', 'schack', 'shack', 'shaq', 'tac',
't_a_c', 'tack', 'tacke', 'tak', 'wack', 'whack',
'wrack', 'yack', 'yak', 'zach', 'zack', 'zak'],
['back', 'backe', 'bak', 'bakke', 'cac', 'caq',
'dac', 'dack', 'dak', 'fac', 'hack', 'hacke', 'haq',
 'haque', 'jac', 'jack', 'jacques', 'knack', 'lac',
'lack', 'lak', 'mac', 'mack', 'macke', 'mak', 'nack',
 'nacke', 'pac', 'pack', 'pak', 'paque', 'ptak', 'rack',
... truncated, since this goes on for a while...

One issue with the above list is a lot of these “words” are names or garbage tokens, but not really used in a literary sense. We can filter these down by joining to Wordnet, which is a corpus with some semantic information, and more focused on real words.

def rhymes(s):
  try:
    (w, l, p) = s[0]
    try:
      filtered = [wt for (wt, pt) in cmudict.entries() \
                  if l == len(pt) \
                  and p[1:] == pt[1:] \
                  and len(wordnet.synsets(wt)) > 0]
      return filtered
    except:
      return [w]
  except:
    return []

This is better – we get a lot less, although not all of these words could be used in context, and most of them are pretty obvious (if you look for a rhyming word, you can always run through the alphabet in your head, which doesn’t require software)

[['bower', 'cower', 'dour', 'power', 'shower', 'sour', 'tower'],
['hour', 'were'], [], [], ['back', 'dak', 'hack',
'jack', 'knack', 'lac', 'lack', 'mac', 'mack', 'mak',
'pac', 'pack', 'rack', 'sac', 'sack', 'shack', 'tack',
 'whack', 'wrack', 'yack', 'yak'], ['back', 'dak', 'hack',
 'jack', 'knack', 'lac', 'lack', 'mac', 'mack', 'mak',
 'pac', 'pack', 'rack', 'sac', 'sack', 'shack', 'tack',
 'whack', 'wrack', 'yack', 'yak'], ['skill'], ['bill',
'chill', 'dill', 'fill', 'gill', 'hill', 'kill', 'lille',
 'mil', 'mill', 'nil', 'pill', 'rill', 'shill', 'sill',
 'thill', 'till', 'will', 'zill'], ['pleasure'],
['treasure'], ['b', 'be', 'bee', 'c', 'd', 'de', 'dea',
'fee', 'g', 'gee', 'ghee', 'he', 'ji', 'kea', 'key', 'ki',
 'knee', 'lea', 'lee', 'leigh', 'li', 'me', 'mi', 'ne',
'nee', 'ni', 'p', 'pea', 'pee', 'qi', 'quay', 're', 'sea',
'see', 'si', 't', 'te', 'tea', 'tee', 'ti', 'v', 'vi', 'wee',
'xi', 'yi', 'z', 'zea', 'zee'], ['b', 'be', 'bee', 'c', 'd',
'de', 'dea', 'fee', 'g', 'gee', 'ghee', 'he', 'ji', 'kea',
'key', 'ki', 'knee', 'lea', 'lee', 'leigh', 'li', 'me', 'mi',
 'ne', 'nee', 'ni', 'p', 'pea', 'pee', 'qi', 'quay', 're',
 'sea', 'see', 'si', 't', 'te', 'tea', 'tee', 'ti', 'v',
'vi', 'wee', 'xi', 'yi', 'z', 'zea', 'zee']]

The next version is a minor improvement, checking the ends, rather than the whole words. Note that for very short words, it removes all the random garbage (all the single sound words were treated as matching, since we compared 0 length lists)

def rhymes(s):
  try:
    (w, l, p) = s[0]
    try:
      filtered = [wt for (wt, pt) in cmudict.entries()
                  if l == len(pt)
                  and p[-2:] == pt[-2:]
                  and len(nltk.corpus.wordnet.synsets(wt)) > 0]
      return filtered
    except:
      return [w]
  except:
    return []
 
[['bower', 'cower', 'dour', 'power', 'shower', 'sour', 'tower'],
['hour'], [], [], ['back', 'dak', 'hack', 'jack', 'knack', 'lac',
 'lack', 'mac', 'mack', 'mak', 'pac', 'pack', 'rack', 'sac',
'sack', 'shack', 'tack', 'whack', 'wrack', 'yack', 'yak'],
['back', 'dak', 'hack', 'jack', 'knack', 'lac', 'lack', 'mac',
 'mack', 'mak', 'pac', 'pack', 'rack', 'sac', 'sack', 'shack',
 'tack', 'whack', 'wrack', 'yack', 'yak'], ['brill', 'drill',
'frill', 'grill', 'grille', 'krill', 'quill', 'shrill', 'skill',
'spill', 'still', 'swill', 'thrill', 'trill', 'twill'], ['bill',
'chill', 'dill', 'fill', 'gill', 'hill', 'kill', 'lille', 'mil',
'mill', 'nil', 'pill', 'rill', 'shill', 'sill', 'thill', 'till',
 'will', 'zill'], ['closure', 'crozier', 'pleasure', 'treasure'],
 ['closure', 'crozier', 'pleasure', 'treasure'],
['b', 'be', 'bee'], []]

This returns a few words that are very similar to the original. To reduce this, we can check the “edit distance”, the number of letters you would have to change to turn one letter into another. Unfortunately on it’s own this takes away some good rhymes, so we also to check if the first letters differ too. This allos “tower” and “power” to match, but not “old” and “olde” (the reason you will still see some examples like this is words are compared to the original – not each other).

def rhymes(s):
  try:
    (w, l, p) = s[0]
    try:
      filtered = [wt for (wt, pt) in cmudict.entries()
                  if l == len(pt)
                  and p[-2:] == pt[-2:]
                  and (nltk.distance.edit_distance(w, wt) > 2 \
                  or not w[0:2] == wt[0:2])
                  and len(nltk.corpus.wordnet.synsets(wt)) > 0]
      return filtered
    except:
      return [w]
  except:
    return []
 
[['bower', 'cower', 'dour', 'shower', 'sour', 'tower'],
[], [], [], ['back', 'dak', 'hack', 'jack', 'knack',
'lac', 'lack', 'mac', 'mack', 'mak', 'pac', 'pack',
'rack', 'sac', 'sack', 'shack', 'tack', 'whack', 'yack',
 'yak'], ['dak', 'hack', 'jack', 'knack', 'lac', 'lack',
 'mac', 'mack', 'mak', 'pac', 'pack', 'rack', 'sac',
'sack', 'shack', 'tack', 'whack', 'wrack', 'yack',
'yak'], ['brill', 'drill', 'frill', 'grill', 'grille',
'krill', 'quill', 'shrill', 'spill', 'still', 'swill',
'thrill', 'trill', 'twill'], ['bill', 'chill', 'dill',
'fill', 'gill', 'hill', 'lille', 'mil', 'mill', 'nil',
'pill', 'rill', 'shill', 'sill', 'thill', 'till', 'will',
 'zill'], ['closure', 'crozier', 'treasure'], ['closure',
'crozier', 'pleasure'], ['b'], []]

From here, we can remove some words that don’t make sense in context by checking what parts of speech Wordnet has seen them in. We get the set of all possible part of speech tags for a word, and compare that to a potentially rhyming word – if there is overlap, we keep it.

def cmp(p, wt):
  pt = set([s.pos for s in wordnet.synsets(wt)])
  return len(pt & p) > 0 or len(p) == 0 or len(pt) == 0
 
def rhymes(s):
  try:
    (w, l, p) = s[0]
    try:
      pos = set([s.pos for s in wordnet.synsets(w)])
      filtered = [wt for (wt, pt) in cmudict.entries()
                  if l == len(pt)
                  and p[-2:] == pt[-2:]
                  and (nltk.distance.edit_distance(w, wt) > 2 \
                  or not w[0:2] == wt[0:2])
                  and cmp(pos, wt)
                  and len(nltk.corpus.wordnet.synsets(wt)) > 0]
      return filtered
    except:
      return [w]
  except:
    return []

Note this removes “dour” in the first part, where we expect to see a noun of some sort. The list isn’t too repetitive now, although it is getting quite short.

[['bower', 'cower', 'shower', 'sour', 'tower'],
[], [], [], ['back', 'dak', 'hack', 'jack', 'knack',
'lac', 'lack', 'mac', 'mack', 'mak', 'pac', 'pack',
'rack', 'sac', 'sack', 'shack', 'tack', 'whack',
'yack', 'yak'], ['dak', 'hack', 'jack', 'knack', 'lac',
 'lack', 'mac', 'mack', 'mak', 'pac', 'pack', 'rack',
'sac', 'sack', 'shack', 'tack', 'whack', 'wrack', 'yack',
'yak'], ['brill', 'drill', 'frill', 'grill', 'grille',
 'krill', 'quill', 'spill', 'still', 'swill', 'thrill',
 'trill', 'twill'], ['bill', 'chill', 'dill', 'fill',
'gill', 'hill', 'lille', 'mil', 'mill', 'nil', 'pill',
'rill', 'shill', 'sill', 'thill', 'till', 'will',
'zill'], ['closure', 'crozier', 'treasure'],
['closure', 'crozier', 'pleasure'], ['b'], []]

One minor issue here is that “sour” is in the first set. Let’s try one last attempt – instead of checking all possible roles a word could play, lets check the role it actually plays, by running the text through a part of speech tagger. Finally can justify having the entire text the whole text of each line, even though we’ve been ignoring it thus far.

The NLKT part of speech tagger is much more detailed than what Wordnet has – this is because Wordnet acts as a dictionary of nouns, verbs, adjectives, and adverbs, but is less concerned about their roles, so we need to establish a mapping.

 
tagged = [nltk.pos_tag(t, simplify_tags = True) for t in tokens]
[[('O', 'NNP'), ('thou', 'VBD'), (',', ','),
('my', 'PRP

Interestingly, this actually returns more than the prior technique. This is because NLTK can tag many more words than Wordnet, since Wordnet is aiming at specific semantic interests. This maintains properties similar to the previous list (“dour” is still out), but less has been removed. Overall the word diversity is quite high as well. Clearly, one could spend a lot of time even on this simple task.

[rhymes(r) for r in syllables]
 
[['bower', 'shower', 'sour', 'tower'],
['aerial', 'amble', 'angel', 'angle', 'ankle',
'anvil', 'april', 'arousal', 'arrival', 'axle',
'babble', 'babel', 'baffle', 'bagel', 'barrel',
'barrel', 'basel', 'basil', 'basle', 'battle',
 'bauble', 'beadle', 'beagle', 'beetle', 'beigel',
 'beryl', 'betel', 'bethel', 'bevel', 'bible',
'bobble', 'boodle', 'bottle', 'bubble', 'buckle',
'bushel', 'bustle', 'cable', 'cackle', 'camel',
'carol', 'carol', 'carrel', 'carroll', 'carroll',
 'castle', 'cattle', 'channel', 'chapel', 'chattel',
'chisel', 'choral', 'chuckle', 'chunnel', 'circle',
 'cobble', 'cockle', 'colonel', 'coral', 'couple',
 'cuddle', 'cudgel', 'cycle', 'cyril', 'dazzle',
 'dental', 'devil', 'dibble', 'diesel', 'diesel',
 'doodle', 'double', 'duffel', 'durrell', 'equal',
 'fable', 'facial', 'faisal', 'fennel', 'fiddle',
 'final', 'fipple', 'fizzle', 'foible', 'fossil',
'fuel', 'funnel', 'gable', 'gaggle', 'gavel', 'geisel',
 'gesell', 'giggle', 'girdle', 'gobble', 'goral',
 'gurgle', 'hackle', 'haggle', 'hassel', 'hassle',
 'hatchel', 'havel', 'hazel', 'heckle', 'hegel', 'herbal',
 'herschel', 'hobble', 'hovel', 'hubble', 'huddle',
'hurdle', 'hustle', 'jackal', 'jiggle', 'jostle',
 'journal', 'juggle', 'kennel', 'kernel', 'kettle',
 'kibble', 'knuckle', 'label', 'ladle', 'laurel',
'level', 'libel', 'little', 'local', 'lovell',
 'mammal', 'maple', 'medal', 'memel', 'metal',
'methyl', 'mettle', 'michael', 'mickle', 'middle',
 'missal', 'missile', 'mitchell', 'mobile', 'modal',
 'model', 'mogul', 'moral', 'mosul', 'motile',
'muckle', 'muddle', 'muffle', 'muscle', 'mussel',
'muzzle', 'myrtle', 'nasal', 'natal', 'navel', 'needle',
'nestle', 'nettle', 'nibble', 'nickel', 'nipple', 'noble',
'noodle', 'novel', 'nozzle', 'paddle', 'panel', 'pebble',
 'pedal', 'people', 'peril', 'petal', 'phenol', 'pickle',
'piddle', 'pommel', 'poodle', 'pottle', 'puddle', 'purple',
 'puzzle', 'rabble', 'rachel', 'raffle', 'rattle', 'ravel',
'razzle', 'rebel', 'revel', 'riddle', 'riffle', 'rifle',
 'rigel', 'ripple', 'rival', 'roble', 'rommel', 'rouble',
 'rubble', 'rubel', 'ruble', 'ruddle', 'ruffle', 'russell',
 'rustle', 'sable', 'saddle', 'seckel', 'segal', 'seidel',
 'settle', 'shackle', 'shekel', 'shovel', 'shuffle',
 'shuttle', 'sibyl', 'social', 'sorrel', 'table',
'tackle', 'tassel', 'tattle', 'tercel', 'thermal',
'thistle', 'tickle', 'tipple', 'title', 'tittle',
 'toggle', 'tootle', 'total', 'trial', 'tunnel',
'turtle', 'tussle', 'umbel', 'uncle', 'vassal', 'vergil',
'vessel', 'vigil', 'vinyl', 'virgil', 'vocal', 'waddle',
 'waffle', 'wattle', 'weasel', 'weevil', 'whistle',
'whittle', 'wiesel', 'wiggle', 'wobble', 'wrestle',
'wriggle', 'yodel'], [], [], ['back', 'dak', 'hack',
 'jack', 'knack', 'lac', 'lack', 'mac', 'mack', 'mak',
 'pac', 'pack', 'rack', 'sac', 'sack', 'shack', 'tack',
 'whack', 'yack', 'yak'], [], ['brill', 'drill', 'frill',
 'grill', 'grille', 'krill', 'quill', 'spill', 'still',
 'swill', 'thrill', 'trill', 'twill'], ['bill', 'chill',
 'fill', 'hill', 'mill', 'shill', 'till', 'will'],
['closure', 'crozier', 'treasure'],
 ['closure', 'crozier', 'pleasure'], [], []]

Published at DZone with permission of Gary Sieling, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)