Text Classification using Naive Bayes algorithm

This was a 3-day assignment that I worked on while I was in the Analytics program at Northwestern University. It is an implementation of the multinomial Naive Bayes algorithm in Java. Text Analytics was by far my favorite course in the program, and I thoroughly enjoyed working on this one. I hope you find it helpful. Suggestions, comments, and criticism are welcome!

Problem: Classify books into pre-defined categories based on their title, author name, and content. The categories were:


Input data format:

The first line contains N, H, where N = number of training data records and H = list of headers. N lines of training data follow, with each field tab-separated. The next line contains M, H, where M = number of test data records and H = list of headers. M lines of test data follow, again with each field tab-separated.

Training data has the following columns:

categoryLabel, bookID, bookTitle, bookAuthor

Test data has the following columns:

bookID, bookTitle, bookAuthor

Example of training data:

N=3 H=[categoryLabel, bookID, bookTitle, bookAuthor]

AMERICAN HISTORY b9418230 American History Survey Brinkley, Alan

SOCIOLOGY b16316063 Life In Society Henslin, James M.

ENGLISH b14731993 Reading for Results Flemming, Laraine E.

M=2 H=[bookID, bookTitle, bookAuthor]

b15140145 Efficient and Flexible Reading McWhorter

b15857527 These United States Unger, Irwin


Output: a list of all books from the test dataset with their book IDs and their predicted category.
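As a rough sketch of reading this block format (illustrative only; the class and method names are made up, and the header line is assumed to look exactly like the examples above):

```java
import java.util.*;

// Minimal sketch of parsing the block format described above.
// Assumes a header line like "N=3 H=[col1, col2, ...]" followed by
// N tab-separated record lines; names here are illustrative.
public class InputParser {
    public static int recordCount(String headerLine) {
        // "N=3 H=[...]" -> 3
        String n = headerLine.split("\\s+")[0].split("=")[1];
        return Integer.parseInt(n);
    }

    public static List<String[]> readBlock(Iterator<String> lines) {
        int n = recordCount(lines.next());
        List<String[]> records = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            records.add(lines.next().split("\t"));  // tab-separated fields
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
            "N=2 H=[categoryLabel, bookID, bookTitle, bookAuthor]",
            "SOCIOLOGY\tb16316063\tLife In Society\tHenslin, James M.",
            "ENGLISH\tb14731993\tReading for Results\tFlemming, Laraine E.");
        List<String[]> train = readBlock(input.iterator());
        System.out.println(train.size() + " records, first category: " + train.get(0)[0]);
    }
}
```

The same routine can be called twice in a row to pick up the training block and then the test block.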

Solution introduction:

For the given document classification problem, I decided to implement the multinomial Naive Bayes model. Classification process: classify(feat1, …, featN) = argmax over cat of P(cat) * PROD(P(feat_i|cat)). I implemented this in Java (using Eclipse). Here, the features are words.

  • Multinomial model (a document is represented by a feature vector with integer elements whose values are the frequencies of the words in the document), preferred over the Bernoulli model (a document is represented by a feature vector with binary elements taking value 1 if the corresponding word is present in the document and 0 if it is not)
  • Laplace’s law of succession (add-one smoothing) included to eliminate the possibility of zero probabilities
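A minimal sketch of this scoring scheme (illustrative only, not the assignment code; the class and method names are made up). Scores are computed in log space, which gives the same argmax while avoiding underflow from multiplying many small probabilities:

```java
import java.util.*;

// Sketch of multinomial Naive Bayes scoring with add-one (Laplace) smoothing.
public class MultinomialNB {
    // P(word|cat) = (count(word,cat) + 1) / (totalWordsInCat + |V|)
    public static double wordProb(Map<String,Integer> catCounts, String word,
                                  int totalWordsInCat, int vocabSize) {
        int count = catCounts.getOrDefault(word, 0);
        return (count + 1.0) / (totalWordsInCat + vocabSize);
    }

    // log P(cat) + sum_i log P(feat_i|cat)
    public static double logScore(double categoryPrior, Map<String,Integer> catCounts,
                                  List<String> features, int totalWordsInCat, int vocabSize) {
        double score = Math.log(categoryPrior);
        for (String f : features) {
            score += Math.log(wordProb(catCounts, f, totalWordsInCat, vocabSize));
        }
        return score;
    }

    public static void main(String[] args) {
        Map<String,Integer> history = new HashMap<>();
        history.put("american", 3); history.put("history", 4);
        Map<String,Integer> english = new HashMap<>();
        english.put("reading", 5); english.put("results", 2);
        List<String> doc = Arrays.asList("american", "history", "survey");
        double h = logScore(0.5, history, doc, 7, 6);
        double e = logScore(0.5, english, doc, 7, 6);
        System.out.println(h > e ? "AMERICAN HISTORY" : "ENGLISH");  // AMERICAN HISTORY
    }
}
```

Note how an unseen word like "survey" still gets a small non-zero probability thanks to the add-one term.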


  1. Read the input data and split it into training and test datasets
  2. Built the multinomial Naïve Bayes classifier using the training dataset
    • I first started by considering just the ‘title’ field to build and classify documents (excluding stop words and normalizing the remaining words)
    • Next, I tried using the ‘title’ and ‘categoryName’ fields to build and classify documents
    • Then, I tried using the ‘title’, ‘categoryName’ and ‘author’ fields
    • Lastly, I tried combinations of the ‘title’, ‘author’ and ‘contents’ fields
    • I also experimented with excluding the ‘categoryPriorProbability’ from the final computation
    • The results of each of these are summarized below
  3. Classified the documents in the test dataset using this classifier




Class ‘Category’:

  • Attributes:
    • categoryName – Name of the Category
    • categoryProbability – Prior Probability of the Category
    • wordProbability – HashMap of (word, probability) pairs for the ‘title’ field of the documents in the category, where the probability is the category-conditional probability of that word
    • authWordProb – HashMap of (word, probability) pairs for the ‘author’ field of the documents in the category, where the probability is the category-conditional probability of that word
    • contentWordProb – HashMap of (word, probability) pairs for the ‘contents’ field (table of contents from input2.txt) of the documents in the category, where the probability is the category-conditional probability of that word
  • Methods:
    • Methods to ‘set’ and ‘get’ each of these attributes
    • probabilityCalculation – Calculates P(feat(i)|C) – the probability of feat(i) occurring in that document class
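A stripped-down sketch of what such a Category class might look like (only the ‘title’ map is shown; the smoothing inside probabilityCalculation assumes the add-one scheme mentioned earlier, and the exact signatures are my guesses, not the assignment code):

```java
import java.util.*;

// Sketch of the 'Category' class described above.
public class Category {
    private String categoryName;
    private double categoryProbability;                            // prior P(C)
    private Map<String,Double> wordProbability = new HashMap<>();  // P(word|C) for 'title'

    public Category(String name, double prior) {
        this.categoryName = name;
        this.categoryProbability = prior;
    }

    public String getCategoryName() { return categoryName; }
    public double getCategoryProbability() { return categoryProbability; }
    public Map<String,Double> getWordProbability() { return wordProbability; }

    // probabilityCalculation: turn raw word counts into smoothed P(word|C)
    public void probabilityCalculation(Map<String,Integer> counts, int vocabSize) {
        int total = counts.values().stream().mapToInt(Integer::intValue).sum();
        for (Map.Entry<String,Integer> e : counts.entrySet()) {
            wordProbability.put(e.getKey(), (e.getValue() + 1.0) / (total + vocabSize));
        }
    }

    public static void main(String[] args) {
        Category c = new Category("ENGLISH", 0.3);
        Map<String,Integer> counts = new HashMap<>();
        counts.put("reading", 3);
        c.probabilityCalculation(counts, 7);
        System.out.println(c.getCategoryName() + " P(reading|C) = "
                           + c.getWordProbability().get("reading"));
    }
}
```

The authWordProb and contentWordProb maps would be built the same way from the ‘author’ and ‘contents’ counts.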

Class ‘BayesClassifier’:

  • readData – to read in data from both input files and split the first into training and test datasets
  • buildClassifier – Builds the classifier. Creates an array of Category objects, each of which has the vocabulary of features (feat) and P(feat_i|C), the probability of feat_i occurring in that document class (computed in the class method), for ‘title’, ‘author’ and ‘table of contents’. Also calculates the category prior probabilities
  • createWordList – takes a String of words as input; the words are tokenized and normalized using an English Analyzer
  • classifyDocuments – Classifies each document into the category with the highest class-conditional probability. Words are selected from the ‘title’, ‘author’ and ‘table of contents’ columns separately (only words from ‘title’ were used in the final implementation) for each document in the test dataset, and the corresponding P(cat) * PROD(P(feat_i|cat)) values are calculated. Finally, classify(feat1, …, featN) = argmax over cat of P(cat) * PROD(P(feat_i|cat))
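The write-up uses Lucene’s EnglishAnalyzer for createWordList; as a rough stdlib-only approximation of that step (no stemming, and a tiny sample stop-word list of my own, so this is not the real analyzer’s output), tokenization and normalization might look like:

```java
import java.util.*;

// Rough stdlib-only approximation of the createWordList step:
// lower-case, split on non-letters/digits, drop stop words.
// The actual implementation used Lucene's EnglishAnalyzer, which
// also applies stemming; the stop-word list here is a tiny sample.
public class WordList {
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "the", "for", "in", "of", "and"));

    public static List<String> createWordList(String text) {
        List<String> words = new ArrayList<>();
        for (String tok : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!tok.isEmpty() && !STOP_WORDS.contains(tok)) {
                words.add(tok);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(createWordList("Reading for Results"));  // [reading, results]
    }
}
```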


  • The final implementation achieved 86.67% accuracy (52/60)
  • This final model considered only the ‘title’ field to build and classify the documents
  • The prior category probabilities were not included in the P(C_i|D_k) calculation
  • I started using the contents of the books, but it wasn’t very helpful in improving the accuracy (more time and appropriate tweaking of the model might improve it)


Category | Documents | Prior Probability

The above table is compiled from the training dataset

Trial # | Fields included (Title, Author, Category Name, Prior Probabilities, Words Tokenized, Numeric values in fields) | Accuracy


  • The first trial was based only on ‘Title’, using the standard Naïve Bayes formula. When I examined a few misclassifications, I found documents that contained the word “Historical” and yet weren’t categorized as “American History”. So I thought I could include the category names as part of the vocabulary (Trial #2)
  • As we can see, there wasn’t any significant change in the overall model accuracy, so I decided not to use it
  • Also, in Trial #1, I had normalized (excluding stop words) all the words appearing in the titles of the training documents, both while building the model and while classifying the test data. I then tried retaining the words as they were, and the accuracy dipped; hence, normalization helped
  • Including the ‘Author’ field didn’t improve the results; in fact, it deteriorated them further
  • Excluding numeric occurrences in the fields didn’t improve accuracy either (numeric years help in prediction)
  • A few documents were being misclassified narrowly because of the prior category probabilities (one prior was far greater than the other, and without it the classification of those documents would have been correct). Hence, I decided to try excluding the prior category probabilities, and the accuracy improved considerably; that was the best I could get from these experiments (86.67%)



Why so serious…?

This is one of the projects that I worked on with Ling Jin (@ljin8118) and Peter Schmidt (@pjschmidt007) as part of our Text Analytics course at Northwestern University. It was one of the most exciting projects to have worked on, and in the process we learned the latest cutting-edge techniques used in the field of Text Analytics and Text Mining. We hope you will enjoy it as much as we enjoyed doing it! Cheers 🙂


Provide a textual analysis of the movie script of The Dark Knight, which was robbed of the Best Picture Oscar at the 81st Annual Academy Awards on February 22, 2009. All project team members are still bitter about this fact. This assignment hopes to resurrect the greatness that is The Dark Knight.

More seriously, though, given a script, the text analytics conducted in this assignment can produce insights into its genre, mood, plot, themes and characters. Ideally, the analysis is intended to understand and answer the who, what, when, where and why of a movie.


Specifically, the objectives of the textual analysis of The Dark Knight will cover:

  • Determine the major characters in the script 
  • Show the character to character interaction 
  • Provide insights into sentiment by character 
  • Show how sentiment changes over time 
  • Determine major themes/topics of the script


Processing Steps

  1. Acquire the movie script of choice 
  2. Parse the script into lines by scenes and lines by character 
  3. Tokenize, normalize, stem the lines of dialogue as appropriate 
  4. Build an index based on available components for subsequent queries 
  5. Perform part of speech (POS) Tagging on the lines of dialogue 
  6. From the POS Tagging, perform sentiment scoring on the lines of dialogue 
  7. Perform named entity recognition 
  8. Perform co-reference 
  9. Perform topic modeling 
  10. Analyze results 
  11. Produce visualizations


Character Identification

The above two visuals carry the same information about the important characters in the movie, just in two different representations. The first visual is a bubble chart where the size of each bubble is proportional to the number of lines spoken by the character.

The second is a heat map where, again, area represents the quantity of lines of dialogue across scenes by character. These two visuals help us identify the major characters of the movie. One can see that Harvey Dent (aka Two-Face), Gordon, The Joker, Bruce Wayne, Batman, Rachel, Fox, and Alfred were easily the major characters, with Lau, Chechen, Maroni and Ramirez all playing supporting roles. It is interesting to note that in the script, Two-Face is never named as a separate character, unlike Bruce Wayne and Batman. Combining Bruce Wayne’s and Batman’s lines would have made him the most prominent character, ahead of Harvey Dent.
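The tally behind the bubble chart is a simple count of dialogue lines per character; it can be sketched like this (assuming the script has already been parsed into (character, dialogue) pairs, which is my simplification of the parsing step):

```java
import java.util.*;

// Sketch of the tally behind the bubble chart: count lines of dialogue
// per character. Assumes each parsed line is a (character, dialogue) pair.
public class LineCounts {
    public static Map<String,Integer> linesPerCharacter(List<String[]> parsedLines) {
        Map<String,Integer> counts = new HashMap<>();
        for (String[] line : parsedLines) {
            counts.merge(line[0], 1, Integer::sum);  // line[0] = character name
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> lines = Arrays.asList(
            new String[]{"JOKER", "Why so serious?"},
            new String[]{"BATMAN", "Where is Dent?"},
            new String[]{"JOKER", "You complete me."});
        System.out.println(linesPerCharacter(lines));  // {BATMAN=1, JOKER=2}
    }
}
```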

Character Interactions

Now that the major characters are established, the next obvious step would be to identify how these characters interact with each other.

The above visual gives us insight into this. Each node is a character, and each edge tells us that the two characters it connects have interacted at least once in the movie. Our definition of interaction is two or more characters speaking in a single scene. Hence, the more interactions a character has with distinct characters, the bigger its node.
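This interaction definition can be sketched as a small graph builder (illustrative names; assumes each scene has already been reduced to the set of characters speaking in it):

```java
import java.util.*;

// Sketch of the interaction graph: two characters interact if they both
// speak in the same scene; node size is the number of distinct partners.
public class InteractionGraph {
    // scenes: each entry is the set of characters speaking in one scene
    public static Map<String,Set<String>> build(List<Set<String>> scenes) {
        Map<String,Set<String>> partners = new HashMap<>();
        for (Set<String> scene : scenes) {
            for (String a : scene) {
                for (String b : scene) {
                    if (!a.equals(b)) {
                        partners.computeIfAbsent(a, k -> new HashSet<>()).add(b);
                    }
                }
            }
        }
        return partners;
    }

    public static void main(String[] args) {
        List<Set<String>> scenes = Arrays.asList(
            new HashSet<>(Arrays.asList("BATMAN", "JOKER")),
            new HashSet<>(Arrays.asList("BATMAN", "ALFRED")));
        Map<String,Set<String>> g = build(scenes);
        System.out.println("BATMAN partners: " + g.get("BATMAN").size());  // 2
    }
}
```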

The nodes (characters) marked in red are the central characters. Most characters who have a lot of dialogue also have more interactions with distinct characters. But are there exceptions, i.e. characters who have a lot of lines but few interactions (perhaps someone like Alfred, having watched the movie), or vice versa? Let’s look further to see what was observed.

Sentiment over time

Below is a visual description of the sentiment of the scenes over time. To calculate the sentiment for each scene, we first split each scene into dialogues by individual. Each dialogue was then run through the design process explained above. At the end of this, we have a score for each dialogue, and the average of the senti scores of all these dialogues gave us the senti score of the scene. As we can see, this was a dark, dark movie.
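The scene-level roll-up described here is a plain average of per-dialogue scores; as a sketch (the scores themselves are hypothetical):

```java
import java.util.*;

// Sketch of the scene-level sentiment roll-up: each dialogue gets a
// senti score, and the scene score is the average of those scores.
public class SceneSentiment {
    public static double sceneScore(List<Double> dialogueScores) {
        if (dialogueScores.isEmpty()) return 0.0;
        double sum = 0.0;
        for (double s : dialogueScores) sum += s;
        return sum / dialogueScores.size();
    }

    public static void main(String[] args) {
        // hypothetical per-dialogue scores for one scene
        System.out.println(sceneScore(Arrays.asList(-0.5, -0.25, 0.0)));  // -0.25
    }
}
```

The per-character variant mentioned next simply groups the dialogue scores by character instead of by scene before averaging.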

We also looked at how character sentiment varied over time. The methodology was similar to the one above, but aggregated by character rather than by scene.


What does the Batman say?

As the superhero of this movie, Batman does not talk that much (based on the identification of important characters, and partly on the real movie). He does mention his opponents, all the killings and, of course, the word “hero”.

What does the Joker say?

The Joker actually talks quite often, which was confirmed earlier. He talks about his scars/the smile, his childhood and the whole plan. He also mentions all the names quite often.

Design and Implementation Challenges


One of the first tasks was to find an appropriate script, which turned out to be a little harder than expected; it was sort of like finding a needle in a haystack. But after some perseverance, The Dark Knight script was found at:


There were 8704 actual lines that needed to be parsed and fit together in the above script.


There were several nuances to take into consideration for the actual script parsing. First of all, the script that was found was not broken down by multiple HTML tags representing the different portions of the script. Instead, the entire script was basically under one tag, which meant parsing one entire block of unstructured text. Hence, we had to carefully find patterns and parse the script accordingly.

Tokenization and Lemmatization

The Lucene StandardAnalyzer was chosen to handle the tokenization and any normalization required; it also provides lower-casing and stop-word filtering. As it was decided that stemming would not be necessary for the analysis to be conducted, the StandardAnalyzer was chosen over the EnglishAnalyzer, whose aggressive PorterStemFilter stemming was not needed to support the other downstream pipeline processes. The StandardAnalyzer was then used consistently across the pipeline to prevent any inconsistency concerns.

POS and Sentiment 

There were a couple of options available to perform sentiment mining on the dialogue in the script. 

Option 1:

The initial selection was SentiWordNet (http://sentiwordnet.isti.cnr.it), a lexical resource based on WordNet 3.0 (http://wordnet.princeton.edu) that is used for opinion mining. SentiWordNet assigns a score to each synset (a set of one or more synonyms) of a word for a particular part of speech. The parts of speech in SentiWordNet are defined as:

a = adjective

n = noun

r = adverb

v = verb

Obtaining the parts of speech from the Stanford NLP part-of-speech annotator then requires mapping from the parts of speech defined in the Penn Treebank tag set (http://www.computing.dcu.ie/~acahill/tagset.html) to the parts of speech defined in SentiWordNet so that a sentiment score can be produced.
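That mapping can be sketched as follows (the tag-prefix grouping follows the standard Penn Treebank families; returning null for tags outside those families is my assumption about how unmapped tags were handled):

```java
// Sketch of the Penn Treebank -> SentiWordNet part-of-speech mapping:
// tag prefixes collapse to the four SentiWordNet letters (a/n/r/v).
public class PosMapper {
    public static String toSentiWordNetPos(String pennTag) {
        if (pennTag.startsWith("JJ")) return "a";  // JJ, JJR, JJS -> adjective
        if (pennTag.startsWith("NN")) return "n";  // NN, NNS, NNP, NNPS -> noun
        if (pennTag.startsWith("RB")) return "r";  // RB, RBR, RBS -> adverb
        if (pennTag.startsWith("VB")) return "v";  // VB, VBD, VBG, VBN, VBP, VBZ -> verb
        return null;  // no sentiment lookup for other tags (assumption)
    }

    public static void main(String[] args) {
        System.out.println(toSentiWordNetPos("VBD"));  // v
        System.out.println(toSentiWordNetPos("NNP"));  // n
    }
}
```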

The SentiWordNet resource is constructed as follows:

POS = part of speech

ID = along with POS, uniquely identifies a WordNet (3.0) synset.

PosScore = the positivity score assigned by SentiWordNet to the synset.

NegScore = the negativity score assigned by SentiWordNet to the synset.

SynsetTerms = the terms, including the sense number, belonging to the synset

Gloss = glossary

Note: The objectivity score can be calculated as: ObjScore = 1 – (PosScore + NegScore)
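Given the field layout above, reading one SentiWordNet record and computing the objectivity score might look like this (hypothetical class name; assumes tab-separated fields in the order listed, which is how the resource is described here):

```java
// Sketch of reading one SentiWordNet record. Assumes the tab-separated
// layout described above: POS, ID, PosScore, NegScore, SynsetTerms, Gloss.
public class SentiRecord {
    public final String pos;
    public final double posScore, negScore;

    public SentiRecord(String line) {
        String[] f = line.split("\t");
        pos = f[0];
        posScore = Double.parseDouble(f[2]);
        negScore = Double.parseDouble(f[3]);
    }

    // ObjScore = 1 - (PosScore + NegScore)
    public double objScore() { return 1.0 - (posScore + negScore); }

    public static void main(String[] args) {
        // illustrative record, not a real entry from the resource
        SentiRecord r = new SentiRecord("a\t00001740\t0.125\t0\table#1\tgloss text");
        System.out.println(r.objScore());  // 0.875
    }
}
```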

Option 2

Another option instead of SentiWordNet was to use the sentiment annotator in the Stanford NLP pipeline. The team discovered this new addition to the Stanford NLP during the course of the project; it is recent “bleeding edge” sentiment technology that Stanford now includes in the Stanford NLP. As excerpted from their website: most sentiment prediction systems work just by looking at words in isolation, giving positive points for positive words and negative points for negative words and then summing up these points (which, in essence, is what we were doing). That way, the order of words is ignored and important information is lost. In contrast, the sentiment annotator in the Stanford NLP uses a new deep learning model that builds up a representation of whole sentences based on the sentence structure, computing the sentiment based on how words compose the meaning of longer phrases. There are 5 classes of sentiment: very negative, negative, neutral, positive, and very positive.

Sentiment  Scoring 

There were several methods available for calculating a sentiment score for character lines in a given scene. This is because the actual “sense” of a word was not known when it was passed to the parser for sentiment lookup: if “good” had n senses in the lexical resource, it was not known which sense was used in the dialogue.

Method 1

The first method was to sum all the senti scores within the body of text then divide by the sum of all scores. This was the method that was provided in the demo on the SentiWordNet web site.

Method 2

The second method was to sum all the senti scores within the body of text then divide by the count of all scores.

Method 3

The third method was to simply sum all the senti scores within the body of text.

Although we implemented both options for obtaining sentiment (i.e. SentiWordNet and the Stanford NLP sentiment annotator), the option chosen to score the dialogue was SentiWordNet. The scoring method then used, although each was explored, was method 2 defined above; this is the method we finally used in our sentiment analysis of the script.
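The three methods can be sketched side by side (note: for method 1, I read “divide by the sum of all scores” as dividing by the sum of score magnitudes, since dividing a sum by itself would always give 1; that reading is an assumption, not something the write-up states):

```java
import java.util.*;

// Sketch of the three scoring methods compared above.
// Method 2 (sum divided by count) was the one used in the final analysis.
public class Scoring {
    public static double method1(List<Double> scores) {   // sum / sum of magnitudes (assumed reading)
        double sum = 0, mag = 0;
        for (double s : scores) { sum += s; mag += Math.abs(s); }
        return mag == 0 ? 0 : sum / mag;
    }

    public static double method2(List<Double> scores) {   // sum / count
        return scores.isEmpty() ? 0 : method3(scores) / scores.size();
    }

    public static double method3(List<Double> scores) {   // plain sum
        double sum = 0;
        for (double s : scores) sum += s;
        return sum;
    }

    public static void main(String[] args) {
        // hypothetical per-word senti scores for one line of dialogue
        List<Double> scores = Arrays.asList(0.5, -0.25, -0.75);
        System.out.println(Scoring.method2(scores));
    }
}
```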

Concluding Remarks

Text analytics is quite an involved process. As with most data analysis activities, a major portion of the time is spent identifying, acquiring and cleansing the source data. The field of text analytics is quite broad, with many best-of-breed components. However, it does not have well-integrated toolsets, as you can observe from the solution that was crafted: we had to leverage several technologies (Java, R, Excel, Gephi, Tableau) using different libraries (Stanford NLP, Lucene) and various other packages to perform specific functions within the data pipeline. All in all, though, it has been shown that with some blood, sweat and tears (over 1,200 lines of code were written for this assignment), and by all means time, a text analytics tool can be built to analyze movie scripts with a pretty accurate view of the overall reality of the movie. And lastly, it should be mentioned that the inherent complexities of the dialogue and the richness of the script should have guaranteed the Oscar for The Dark Knight!