Text analysis using Python

I wanted to do a quick analysis of word frequency and decided that writing a Python script was the way to go. The main reason to use Python was that I wanted to keep practicing so I do not ‘lose it’; of course, I could not remember how to do a few things. For example:

  1. Removing punctuation marks: this involved creating a translation table that maps every character to itself (string.maketrans("","") creates an ‘identity’ map) and then passing string.punctuation to translate() as the set of characters to delete.
  2. Sorting a dictionary by values (not keys). This one was tricky for me until I found this explanation on Stack Overflow, which creates a sorted list of tuples with the key as the first element and the value as the second: sorted(huck.iteritems(), key=operator.itemgetter(1), reverse=True)
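
The two calls above are Python 2 idioms (string.maketrans and dict.iteritems do not exist in Python 3). As a rough sketch of how the same two steps might look in Python 3, assuming ASCII punctuation only, something like the following should work. The helper name clean and the sample counts are made up for illustration.

```python
import operator
import string

# Python 3: str.maketrans with three arguments builds a table that deletes
# every character listed in the third argument.
strip_punct = str.maketrans("", "", string.punctuation)


def clean(word):
    """Lowercase a word and drop ASCII punctuation (hypothetical helper)."""
    return word.lower().translate(strip_punct)


# Hypothetical counts, only to illustrate sorting a dictionary by value.
counts = {"the": 3, "and": 5, "raft": 1}
ranked = sorted(counts.items(), key=operator.itemgetter(1), reverse=True)

print(clean("Well,"))  # well
print(ranked)          # [('and', 5), ('the', 3), ('raft', 1)]
```

Note that string.punctuation covers ASCII punctuation only, so typographic quotes or dashes in a text file would pass through unchanged.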

In addition, I had never used matplotlib (the scientific graphics library) before. For the test, I downloaded a text file (TwainHuckFinn.txt) with Mark Twain’s ‘The Adventures of Huckleberry Finn’ from Project Gutenberg. The script below is written for Python 2.

import operator, time, string
import matplotlib.pyplot as plt

folder = '/Users/Luis/Documents/Programming/python-ideas/HuckFinn/'
f = open(folder + 'TwainHuckFinn.txt', 'r')

start = time.time()

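# Build the translation table once rather than once per word
identity = string.maketrans("", "")

# Count occurrences of each lowercased, punctuation-free word
huck = {}
for line in f:
    for word in line.split():
        new_word = word.lower().translate(identity, string.punctuation)
        if not new_word:
            continue  # token was only punctuation, e.g. '--'
        huck[new_word] = huck.get(new_word, 0) + 1
f.close()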

# List of (word, count) tuples, most frequent first
sorted_huck = sorted(huck.iteritems(), key=operator.itemgetter(1), reverse=True)
elapsed = time.time() - start

print 'Run took ', elapsed, ' seconds.'
print 'Number of distinct words: ', len(sorted_huck)

# Print and plot the most popular words
npopular = 50
x = range(npopular)
y = []
for pair in sorted_huck[:npopular]:
    y.append(pair[1])  # pair is (word, count)
    print pair

plt.plot(x, y, 'ro')
plt.xlabel('Word ranking')
plt.ylabel('Word frequency')
plt.show()

Running the program takes less than one second and prints the list of the 50 most common words: (‘and’, 6299), (‘the’, 4949), (‘i’, 3209), (‘to’, 2994), etc. It also creates a plot like so:

Figure: Word frequency in Mark Twain’s Huckleberry Finn
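
For comparison, the counting and ranking steps can probably be written more compactly in Python 3 with collections.Counter. This is only a sketch under that assumption, not the script used for the numbers above; the function name word_counts and the two sample lines are hypothetical.

```python
import string
from collections import Counter


def word_counts(lines):
    """Count lowercased, punctuation-stripped words across an iterable of lines."""
    strip_punct = str.maketrans("", "", string.punctuation)
    counts = Counter()
    for line in lines:
        for word in line.split():
            cleaned = word.lower().translate(strip_punct)
            if cleaned:  # skip tokens that were only punctuation
                counts[cleaned] += 1
    return counts


# Hypothetical two-line sample standing in for the Gutenberg text file.
sample = ["The river, the raft -- and the night.", "And the river again!"]
counts = word_counts(sample)

# most_common(n) returns the n highest (word, count) pairs, largest first.
top = counts.most_common(3)
print(top)
```

Because word_counts accepts any iterable of lines, an open file object can be passed in directly, for example inside a with open(...) block, which also takes care of closing the file.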

By the way, i- the word nigger shows up 150 times and ii- I think it is a crime to change it to water down the book.