This page was updated on 2011-04-28 (NZST) and is tagged text analysis, python, programming.
I wanted to do a quick analysis of word frequency and decided that writing a Python script was the way to go. The main reason for choosing Python was that I wanted to keep practicing it so I do not 'lose it'; of course, I could not remember how to do a few things. For example:
Removing punctuation marks: this involved creating a translation table
that maps every character to itself (string.maketrans("","") creates
an 'identity' map) and then passing string.punctuation to translate() as the
set of characters to delete.
Sorting a dictionary by values (not keys). This one was tricky for me until
I found this explanation on Stack Overflow, which creates a sorted list of tuples with
the key as the first element and the value as the second:
sorted(huck.iteritems(), key=operator.itemgetter(1), reverse = True)
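For anyone on Python 3, where string.maketrans and dict.iteritems are no longer available, the two tricks above can be sketched roughly as follows. The helper names strip_punctuation and sort_by_count and the sample data are made up for illustration; they are not part of the script in this post.

```python
import operator
import string

# Python 3: the three-argument form of str.maketrans builds a table that
# deletes every character listed in the third argument.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(word):
    """Return word with ASCII punctuation removed (hypothetical helper)."""
    return word.translate(PUNCT_TABLE)


def sort_by_count(counts):
    """Return (key, value) tuples ordered by value, largest first
    (hypothetical helper)."""
    return sorted(counts.items(), key=operator.itemgetter(1), reverse=True)


# Made-up sample data, just to exercise the two helpers.
sample = {"the": 3, "raft": 1, "river": 2}
print(strip_punctuation("don't,"))
print(sort_by_count(sample))
```

Note that string.punctuation only covers ASCII punctuation, so typographic quotes and dashes in a text would survive this step.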
In addition, I had never used matplotlib (the scientific graphics library) before. For the test I downloaded Mark Twain's 'The Adventures of Huckleberry Finn' from Project Gutenberg as a text file (called TwainHuckFinn.txt).
    import operator, time, string
    import matplotlib.pyplot as plt

    folder = '/Users/Luis/Documents/Programming/python-ideas/HuckFinn/'
    f = open(folder + 'TwainHuckFinn.txt', 'r')

    start = time.time()

    huck = {}
    for line in f:
        line = line.split()
        for word in line:
            word = word.lower()
            new_word = word.translate(string.maketrans("",""), string.punctuation)
            if new_word in huck:
                huck[new_word] += 1
            else:
                huck[new_word] = 1

    sorted_huck = sorted(huck.iteritems(), key=operator.itemgetter(1), reverse = True)
    elapsed = time.time() - start

    print 'Run took ', elapsed, ' seconds.'
    print 'Number of distinct words: ', len(sorted_huck)

    # Printing and plotting most popular words
    npopular = 50
    x = range(npopular)
    y = []
    for pair in range(npopular):
        y = y + [sorted_huck[pair][1]]
        print sorted_huck[pair]

    plt.plot(x, y, 'ro')
    plt.xlabel('Word ranking')
    plt.ylabel('Word frequency')
    plt.show()
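The counting loop could also be written with collections.Counter from the standard library, which takes care of the "is this word already in the dictionary?" bookkeeping and offers most_common() in place of the explicit sort. This is only a sketch in Python 3, not the script that produced the numbers in this post: count_words is a hypothetical name, the sample lines are invented, and unlike the loop above it skips tokens that end up empty once punctuation is stripped.

```python
import string
from collections import Counter

# Deletes ASCII punctuation when passed to str.translate (Python 3).
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def count_words(lines):
    """Count lower-cased, punctuation-free words in an iterable of lines."""
    counts = Counter()
    for line in lines:
        for word in line.split():
            cleaned = word.lower().translate(PUNCT_TABLE)
            if cleaned:  # assumption: ignore tokens that were only punctuation
                counts[cleaned] += 1
    return counts


# Any iterable of strings works here, including an open text file.
lines = ["The river and the raft.", "And -- the river!"]
counts = count_words(lines)
print(counts.most_common(1))
```

Passing an open file object instead of the list would count a whole book, and counts.most_common(50) would give the same kind of ranked list of (word, count) tuples used for the plot.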
Running the program takes less than one second and prints the list of the 50 most common words: ('and', 6299), ('the', 4949), ('i', 3209), ('to', 2994), etc. It also plots word frequency against word ranking for those 50 words.
By the way, i- the word nigger shows up 150 times and ii- I think it is a crime to change it to water down the book.