2019 India vs. S. Africa Test series: A victory to savor for the 90s Indian cricket fan

The Indian men’s cricket team beat their South African counterparts by an innings and 202 runs in the third Test to win the series 3-0. They won the first match by 203 runs and the second by an innings and 137 runs. All three wins were comprehensive, and the dominance of the Indian team was on full display. Given the recent form and quality of the Indian team and the lack of both on the South African side, one could say that nothing unexpected happened and leave it at that. But for an Indian cricket fan who started following the game in the early 90s, this result is greatly satisfying and worth rejoicing over.

26-30 Nov 1992: Sachin Tendulkar of India in action during the Second Test match against South Africa at The Wanderers in Johannesburg, South Africa. The match ended in a draw. Mandatory Credit: Mike Hewitt/Allsport

Pravin Amre batting for India during the 1st Test against South Africa in Durban, November 1992. Amre scored a century in the match, which was his Test debut. (Photo by Mike Hewitt/Getty Images)

South Africa is a great sporting nation and is among the best in a lot of team sports, cricket being just one of them. In rugby, for example, the South African national team, the Springboks, won the 1995 and 2007 Rugby World Cups and is in the semi-finals of the ongoing one. South Africans are also highly competitive in individual sports, having won 29 medals in athletics at the Olympics. So, despite not playing international cricket for more than two decades because of apartheid, South Africa’s quality of cricket never dipped. They still had a robust domestic structure producing some of the world’s best players, many of whom, unfortunately, could never showcase their skills and talents on the international stage. With the release of Nelson Mandela in 1990, South African cricket found new hope of returning to the international fold. In fact, India was instrumental in fast-tracking them back onto the world arena by hosting them for their first-ever series after readmission. It was a 3-match ODI series which India won 2-1, but one could see the quality of the South Africans in those games. It was the start of a bilateral rivalry that the Proteas dominated so heavily, which is what makes the current series win so sweet.

A tour of South Africa for a Test series was close to nightmarish almost every single time, for the Indian team and for cricket fans like me. Having played most of their domestic and home international matches on either flat tracks or dust bowls, the Indians were found wanting in every series there. No wonder we still haven’t won a Test series there. The first Test series there, in 92/93, was probably the least brutal of all. A 1-0 defeat in a 4-match series was definitely not a bad return against a pace-filled bowling attack. Memorable knocks by Tendulkar and a century on debut by Pravin Amre were the highlights of the tour. Things only went downhill after that in SA. I remember watching the 96/97 series as a young kid, hoping that a Tendulkar-led side would dominate SA in their home conditions. But the spicy wickets were too hot to handle for the Indians, who were not used to playing horizontal-bat shots and dealing with the swinging ball. The first Test in Durban hurt the most. Winning the toss, India opted to field first and did a decent job of restricting SA to 235. What followed was one of the worst batting performances by India in the last 2-3 decades: they were bowled out for 100 in the first innings and 66 in the second. The result did not change in the second Test either, although there was a better fight, and the 200+ run stand between Tendulkar and Azharuddin will always be fondly remembered. Some respect was restored in the 3rd Test, with India coming very close to a victory and Dravid emerging as an all-weather star.

The tour of 2001/02 was marred by the Mike Denness controversy, which only made the final scoreline look prettier: a 1-0 loss instead of 2-0. The final, unofficial Test resulted in a drubbing by an innings, reflecting how ill-equipped India were in overseas conditions. However, there was a silver lining to this series. This was a new-look Indian team under a new captain, Sourav Ganguly. They had just beaten the mighty Aussies at home in one of the most famous Test series ever. They were headed in the right direction and would go on to win Tests and draw series in England and Australia. And in 2006, India finally managed to win a Test in SA, on their 4th tour there. That Test is one of the few things Sreesanth is fondly remembered for in an otherwise topsy-turvy career. But then again, India squandered the lead and went on to lose the remaining two games and the series. In 2010, we finally managed not to lose a series in South Africa, drawing it 1-1. The trajectory dipped again after that, with India losing a short 2-Test series 1-0 in 2013 and their most recent tour, in 2018, by a margin of 2-1. However, the fight shown by India in these last two series is something the team can be proud of. In fact, the 2018 tour as a whole was highly successful, as India won the ODI series 5-1 and the T20I series 2-1. Nevertheless, a Test series win in SA remains elusive, and it must definitely be on Kohli’s to-do list.

India, on the other hand, has been a powerhouse at home right from the early 90s, when they built a good batting line-up around Tendulkar, full of excellent players of spin, and had an oversupply of good spinners led by Anil Kumble. But the current streak of 26-1 at home since the last home series defeat, against England in 2012/13, is one for the ages. One team, however, always challenged India even at home and on multiple occasions got the better of them during this period. And that was South Africa. SA toured India 5 times for a Test series between readmission and 2010. Of those, they won one (2-0 in 2000), drew two (1-1 in 2008 and 2010), and lost two, with an overall win-loss record of 5-5. Compare this with 2-8 for the Indians in SA till 2010. That is pretty impressive for a team that has not produced a lot of quality spinners. It goes to show how good their batsmen were, how smart their fast bowlers were, and how well they fought in every series: they just never gave up! India did win 3-0 in the last home series in 2015 too, but the pitches were a shame and even Kohli, Shastri and co. do not take pride in that win anymore. This time around the pitches were of great quality, with much more bounce and pace in the wickets even on Day 4. Indian pacers outdid their South African counterparts, and it was heartening to see them dominate and pick up so many wickets.

All in all, it was a very satisfying win for a cricket fan like me. Hopefully the next time the Proteas tour India they will have a more robust and stable side, but I also hope that India maintain their dominance and win just as comprehensively as they did this time. Till then, the Indian cricket aficionado in me will celebrate and savor this victory as much as the Indian team itself.

Tit-bits from the 90s India vs. SA series:

– In the ODI series in 1992, Md. Azharuddin, one of the finest fielders of the era, dropped 3 high catches, which drew a lot of unwanted attention and gave rise to some theories about it

– The 1996 Boxing Day Test loss in Durban is the 3rd lowest match total for India

– Azharuddin was known for his counter-attacking batting in Tests. Two such innings came against SA: one at home, at his favorite venue, Eden Gardens, and another away in Cape Town.

– The partnership of 222 between Tendulkar and Azharuddin at Cape Town still remains the highest partnership for any wicket for India in South Africa

– In the final of the 1996/97 ODI tri-series, Rahul Dravid, then considered a Test-only batsman, smashed Allan Donald for a six over his head, which angered Donald and made him give Dravid a piece of his mind. India narrowly lost the rain-affected match chasing more than 6 an over, but the fight shown was heartening to watch

– The 2000 home series was followed by the match fixing scandal which cost Hansie Cronje, Azharuddin and several others their careers

Text Classification using Naive Bayes algorithm

This was a 3-day assignment that I worked on while I was in the Analytics program at Northwestern University. It is an implementation of the Multinomial Naive Bayes algorithm in Java. Text Analytics was by far my favorite course in the program and I thoroughly enjoyed working on this one. I hope you like it and find it helpful. Suggestions/comments/criticism are welcome!

Problem: Classify books based on their Title name, Author Name, and Content into pre-defined categories. The categories were:

AMERICANHISTORY
BIOLOGY
COMPUTERSCIENCE
CRIMINOLOGY
ENGLISH
MANAGEMENT
MARKETING
NURSING
SOCIOLOGY

Input data format:

The first line contains N, H where N = number of training data records and H = list of headers. N lines of training data will follow this. Each field in the N lines is tab separated. The next line will have M, H where M = number of test data records and H = list of headers. M lines of test data will follow this; each field in a line is tab separated.

Training data has following columns:

categoryLabel, bookID, bookTitle, bookAuthor

Test data has following columns:

bookID, bookTitle, bookAuthor

Example of training data:

N=3 H=[categoryLabel, bookID, bookTitle, bookAuthor]

AMERICAN HISTORY b9418230 American History Survey Brinkley, Alan

SOCIOLOGY b16316063 Life In Society Henslin, James M.

ENGLISH b14731993 Reading for Results Flemming, Laraine E.

M=2 H=[bookID, bookTitle, bookAuthor]

b15140145 Efficient and Flexible Reading McWhorter

b15857527 These United States Unger, Irwin
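The input layout described above can be read with a few lines of code. The following is a minimal Python sketch, not the Java code from the repository. The function names (`read_block`, `read_input`) and the shortened sample are made up for illustration, and it assumes the header lines look exactly like the `N=3 H=[...]` example.

```python
import re

# Matches header lines such as "N=3 H=[categoryLabel, bookID, bookTitle, bookAuthor]".
HEADER = re.compile(r"^[NM]=(\d+)\s+H=\[(.*)\]$")

def read_block(lines, start):
    """Read one header line plus the records it announces.

    Returns (records, index_of_next_unread_line)."""
    match = HEADER.match(lines[start].strip())
    if match is None:
        raise ValueError(f"bad header line: {lines[start]!r}")
    count = int(match.group(1))
    columns = [name.strip() for name in match.group(2).split(",")]
    records = []
    for line in lines[start + 1 : start + 1 + count]:
        fields = line.rstrip("\n").split("\t")  # fields are tab separated
        records.append(dict(zip(columns, fields)))
    return records, start + 1 + count

def read_input(text):
    """Split the whole input into (training_records, test_records)."""
    lines = [line for line in text.splitlines() if line.strip()]  # skip blank lines
    training, next_index = read_block(lines, 0)
    test, _ = read_block(lines, next_index)
    return training, test

# Shortened, hypothetical sample in the format described above.
SAMPLE = (
    "N=2 H=[categoryLabel, bookID, bookTitle, bookAuthor]\n"
    "SOCIOLOGY\tb16316063\tLife In Society\tHenslin, James M.\n"
    "ENGLISH\tb14731993\tReading for Results\tFlemming, Laraine E.\n"
    "M=1 H=[bookID, bookTitle, bookAuthor]\n"
    "b15857527\tThese United States\tUnger, Irwin\n"
)

training, test = read_input(SAMPLE)
print(training[0]["categoryLabel"], "|", test[0]["bookTitle"])
```

Splitting on tabs only is what keeps author names such as "Unger, Irwin" in one field even though they contain a comma.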

Output:

A list of all books from the Test dataset with their Book Ids and their Predicted Category.

Solution introduction:

For the given document classification problem, I decided to implement a Multinomial Naive Bayes model. Classification process: classify(feat1,…,featN) = argmax over cat of P(cat) * PROD(P(featI|cat)), where the product runs over the features I = 1,…,N. I implemented this in Java (using Eclipse). Here the features are words.

  • Multinomial (a document is represented by a feature vector with integer elements whose value is the frequency of that word in the document) preferred over Bernoulli (a document is represented by a feature vector with binary elements taking value 1 if the corresponding word is present in the document and 0 if the word is not present)
  • Laplace’s law of succession, or add-one smoothing, included to eliminate the possibility of zero probabilities
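To make the classification rule concrete, here is a small Python sketch of multinomial Naive Bayes with add-one smoothing. It is not the Java implementation from the repository. The toy documents and the function names are invented, and it sums logarithms instead of multiplying probabilities, which is a common way to avoid numeric underflow and does not change the argmax.

```python
import math
from collections import Counter, defaultdict

def train(documents):
    """documents: list of (category, list_of_words) pairs."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)  # category -> word -> frequency
    vocabulary = set()
    for category, words in documents:
        doc_counts[category] += 1
        word_counts[category].update(words)
        vocabulary.update(words)
    total_docs = sum(doc_counts.values())
    priors = {c: n / total_docs for c, n in doc_counts.items()}
    return priors, word_counts, vocabulary

def word_log_prob(word, category, word_counts, vocabulary):
    """log P(word | category) with add-one (Laplace) smoothing."""
    count = word_counts[category][word]
    total = sum(word_counts[category].values())
    return math.log((count + 1) / (total + len(vocabulary)))

def classify(words, priors, word_counts, vocabulary):
    """Return argmax over categories of log P(cat) + sum of log P(word | cat)."""
    best_category, best_score = None, -math.inf
    for category, prior in priors.items():
        score = math.log(prior)
        for word in words:
            if word in vocabulary:  # words never seen in training are skipped
                score += word_log_prob(word, category, word_counts, vocabulary)
        if score > best_score:
            best_category, best_score = category, score
    return best_category

# Invented toy training set, already tokenized.
TOY = [
    ("BIOLOGY", ["cell", "biology", "cell"]),
    ("ENGLISH", ["reading", "writing"]),
    ("ENGLISH", ["reading", "results"]),
]

priors, word_counts, vocabulary = train(TOY)
prediction = classify(["cell", "biology"], priors, word_counts, vocabulary)
print(prediction)
```

Because every count gets one added to it, a word that never occurs in a category still has a small non-zero probability there, so a single unseen word cannot zero out the whole product.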

Design:

  1. Read the input data and split it into Training dataset and Test dataset
  2. Built the Multinomial Naïve Bayes Classifier using the Training dataset
    • I first started with considering just the ‘title’ field to build and classify documents (excluded Stop words and normalized the remaining words)
    • Next, I tried using ‘title’ and ‘categoryName’ fields to build and classify documents
    • Then, I tried using ‘title’, ‘categoryName’ and ‘author’ fields to build and classify documents
    • Lastly, I tried combinations of ‘title’, ‘author’ and ‘contents’  fields to build and classify documents
    • Also, I experimented by excluding the ‘categoryPriorProbability’ in the final computation
    • The Results of each of these are summarized below
  3. Classified the documents in Test dataset using this classifier

Code:

https://github.com/aniketd006/NaiveBayes

Details:

Class ‘Category’:

  • Attributes:
    • categoryName – Name of the Category
    • categoryProbability – Prior Probability of the Category
    • wordProbability – HashMap of (word-probability) pair of the ‘title’ field of the documents in the category where probability is the category conditional probability of that word
    • authWordProb – HashMap of (word-probability) pair of the ‘author’ field of the documents in the category where probability is the category conditional probability of that word
    • contentWordProb – HashMap of (word-probability) pair of the ‘contents’ (table of contents from input2.txt) field of the documents in the category where probability is the category conditional probability of that word
  • Methods:
    • Methods to ‘set’ and ‘get’ each of these attributes
    • probabilityCalculation – Calculates P(feat(i)|C) – the probability of feat(i) occurring in that document class

Class ‘BayesClassifier’:

  • readData – to read in data from both input files and split the first into training and test datasets
  • buildClassifier – Builds the classifier. Creates an array of Category objects, each of which holds the vocabulary of features (feat) and P(feat(i)|C), the probability of feat(i) occurring in that document class (computed in the Category class method), for ‘title’, ‘author’ and ‘table of contents’. Also calculates the category prior probabilities
  • createWordList – Takes a String of words as input, then tokenizes and normalizes it using an English Analyzer
  • classifyDocuments – Classifies each document into the category with the highest class conditional probability. Words are selected from the ‘title’, ‘author’ and ‘table of contents’ columns separately (only words from ‘title’ are used in the final implementation) for each document in the test dataset, and the corresponding P(cat)*PROD(P(feat(i)|cat)) values are calculated. Finally, classify(feat1,…,featN) = argmax over cat of P(cat)*PROD(P(feat(i)|cat))
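As a rough illustration of what a createWordList-style step does, here is a Python sketch. It is only an approximation: the real code uses an English Analyzer in Java, while this version simply lower-cases, splits on non-alphanumeric characters and drops words from a tiny made-up stop-word list. The `keep_numbers` switch is a hypothetical parameter added to mirror the experiment with numeric values in the fields.

```python
import re

# A tiny illustrative stop-word list; a real analyzer ships a much longer one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "for", "to", "is"}

def create_word_list(text, keep_numbers=True):
    """Lower-case the text, split it into tokens and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    words = []
    for token in tokens:
        if token in STOP_WORDS:
            continue
        if not keep_numbers and token.isdigit():
            continue  # optionally drop purely numeric tokens such as years
        words.append(token)
    return words

title_words = create_word_list("Reading for Results")
print(title_words)
```

The same function has to be applied both when building the model and when classifying, otherwise the words in the test titles would not line up with the vocabulary.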

Results:

Category         Actual  Predicted  Difference
AMERICANHISTORY       8         10           2
BIOLOGY               7          4          -3
COMPUTERSCIENCE       4          4           0
CRIMINOLOGY           6          6           0
ENGLISH              12         11          -1
MANAGEMENT            4          5           1
MARKETING             6          7           1
NURSING               7          6          -1
SOCIOLOGY             6          7           1

  • The final implementation achieved an 86.67% accuracy (52/60)
  • This final model had only the ‘title’ field considered to build and classify the documents
  • The prior category probabilities were not included in the P(C(i)|D(k)) calculation
  • I started using the contents of the books, but this wasn’t too helpful in improving the accuracy (more time and appropriate tweaking of the model might improve it)

Trials:

Category         Documents  Prior Probability
AMERICANHISTORY         19                 7%
BIOLOGY                 22                 8%
COMPUTERSCIENCE         18                 7%
CRIMINOLOGY             28                10%
ENGLISH                 55                20%
MANAGEMENT              22                 8%
MARKETING               19                 7%
NURSING                 24                 9%
SOCIOLOGY               63                23%

The above table is compiled from the training dataset
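The prior probabilities are just each category's share of the training documents. A short Python sketch (not the original Java code) that reproduces the percentages from the document counts in the table:

```python
# Document counts per category, taken from the table above.
TRAINING_COUNTS = {
    "AMERICANHISTORY": 19,
    "BIOLOGY": 22,
    "COMPUTERSCIENCE": 18,
    "CRIMINOLOGY": 28,
    "ENGLISH": 55,
    "MANAGEMENT": 22,
    "MARKETING": 19,
    "NURSING": 24,
    "SOCIOLOGY": 63,
}

def prior_probabilities(counts):
    """P(category) = documents in the category / all training documents."""
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}

priors = prior_probabilities(TRAINING_COUNTS)
for category, p in priors.items():
    print(f"{category:16s} {p:6.1%}")
```

With 270 training documents in total, SOCIOLOGY and ENGLISH together account for over 40% of the prior mass, which is worth keeping in mind when reading the trials below.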

Fields included and accuracy, by trial:

Trial #  Title  Author  Category Name  Prior Probabilities  Words Tokenized  Numeric values in fields  Accuracy
1        Yes    No      No             Yes                  Yes              Yes                       75%
2        Yes    No      Yes            Yes                  Yes              Yes                       75%
3        Yes    No      No             Yes                  No               Yes                       67%
4        Yes    Yes     No             Yes                  Yes              Yes                       63%
5        Yes    Yes     No             Yes                  Yes              No                        60%
6        Yes    No      No             No                   Yes              Yes                       87%

Findings:

  • The first trial was based only on ‘Title’, using the standard formula of the Naïve Bayes model. When I looked at a few misclassifications, I found some documents that had the word “Historical” in them and yet weren’t categorized as “American History”. So I thought I could include the category names as part of the vocabulary (Trial #2)
  • As we can see, there wasn’t any change in the overall model accuracy (75% in both trials). Hence I decided not to use it
  • Also, in Trial #1, I had normalized (excluding stop words) all the words that appear in the titles of the documents, both when building the model from the training dataset and when classifying the test data. So I tried retaining the words as they were (Trial #3) and the accuracy dipped. Hence normalization helped
  • Including the ‘Author’ field didn’t improve the results. In fact, it made them worse (Trial #4)
  • Excluding numeric values from the fields doesn’t improve accuracy either (numeric years help in prediction)
  • There were a few documents that were being narrowly misclassified because of the prior category probabilities (one prior was far greater than the other, and without the prior the classification of that document would have been correct). Hence I decided to try excluding the prior category probabilities; the accuracy improved considerably, and that was the best I could get from these experiments (87%)
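The effect of the prior in the last finding can be shown with a two-category toy case in Python. The priors are the rounded values for SOCIOLOGY and MARKETING from the training table. The two likelihood values are invented purely to illustrate a narrow call, so this is a sketch of the mechanism and not a reproduction of any actual misclassified document.

```python
import math

PRIORS = {"SOCIOLOGY": 0.23, "MARKETING": 0.07}
# Hypothetical PROD(P(word | category)) values for one test title.
LIKELIHOODS = {"SOCIOLOGY": 0.0010, "MARKETING": 0.0020}

def pick(priors, likelihoods, use_prior):
    """Return the argmax category, with or without the prior term."""
    def score(category):
        s = math.log(likelihoods[category])
        if use_prior:
            s += math.log(priors[category])
        return s
    return max(likelihoods, key=score)

with_prior = pick(PRIORS, LIKELIHOODS, use_prior=True)
without_prior = pick(PRIORS, LIKELIHOODS, use_prior=False)
print(with_prior, without_prior)
```

Here the words favor MARKETING by a factor of two, but the SOCIOLOGY prior is more than three times larger, so including the prior flips the decision.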

Information: FAQs about the MSiA program

I have written this post to serve as a source of information about Northwestern University’s MS in Analytics (MSiA) program beyond what is provided on the official university website: http://www.analytics.northwestern.edu/. If you are interested in this program, I recommend that you go through the official website thoroughly before reading this post.

I hope this post answers the questions you have in mind. If there are still questions that are not touched upon here, please leave a comment and I shall try to answer them for you.

NOTE: ALL THE ANSWERS MENTIONED BELOW ARE STRICTLY PERSONAL BASED ON MY UNDERSTANDING, EXPERIENCE AND DISCUSSIONS WITH THE ADMINISTRATORS AND STUDENTS HERE.

1)         Background and skills relevant to this program

·          A background (education/work) in any field like Economics/Econometrics, Math, Stats, Computer/Information Science/Technology, Business Administration, etc. is relevant to this program. This is because Analytics is primarily a combination of Business, Math and Technology. To evaluate whether you are the right candidate for this program, you could ask yourself the following questions and research the answers:

  • What do I know about Analytics?
  • Why Analytics for me?
  • Where is it applied?
  • Is it aligned to my career goals?
  • Where do I see myself in future after this program?
  • What skill-set do I already possess and what would I need to develop to be successful in this field?

2)         What are the acceptance criteria?

·            In my opinion, the following are the criteria on which the Admission Committee would base their decision:

  • Education or courses taken in undergrad in a relevant field (Economics, Math, Stats, Computer/Information Science/Technology, Business Administration). The class distribution can be found here: http://www.analytics.northwestern.edu/current-students/index.html
  • Performance in undergrad will add a lot of weight to your case (good academics*, good university*)
  • Relevant work experience, if any (any work experience related to working with data, technology or business management)
  • Some prior exposure to computer programming is needed (for applicants from a non-computer science background)
  • GRE/GMAT test scores (the 2015-16 batch will be the first one for which standardized test scores will be considered, so it is hard to provide a benchmark for this one at this stage)
  • Needless to say, your SOP has to be top-notch and very convincing, your resume very professional and of a really high standard, and your recommendations aligned with your case and from reliable, credible supervisors/colleagues

*see next question

3)        But do they really focus on undergraduate grades/university for international students?

·            Since there is no direct conversion from international grades to the US 4-point GPA, I do not think they would focus too much on your undergrad grades as long as you are above the 3.0/4.0 cutoff or its equivalent (there are various avenues where you can get this conversion done if you are an international student, and I think if you have more than 50% from any of the Indian universities then you should be good). Their main focus is your suitability for the program through your undergrad courses and/or your work experience, how your career goals align with this field, etc.

4)        How important is work experience to get into this program?

·            There are students in both cohorts thus far who were just out of undergrad, with minimal work experience (internships), when they joined this program. But all of them had the relevant educational background required for this program. So work experience is not necessary, but relevant experience would definitely add a lot of weight to your application.

5)         Prerequisite courses or preparation required before joining the MSiA program

·            There are no prerequisites as such, since students come from really diverse backgrounds. But knowledge of elementary stats, probability and calculus is pretty important, so brush up on these whenever you get time. More information can be found here: http://www.analytics.northwestern.edu/prospective-students/index.html

6)         Placements at MSiA

·            Job prospects are excellent after graduating from MSiA. Most of the students from the first cohort of 2012-13 got multiple good offers across industries like finance, technology, retail and insurance. If you have any specific questions regarding jobs, you can try getting in touch with them or with the program director/asst. director. This page will provide more information: http://www.analytics.northwestern.edu/current-students/career%20placement.html

7)         Program expenses (entire duration excluding internship period)

·            $60k-61k (tuition fees) + $3k (health insurance) + $10k (off-campus living expenses) + an additional $2k (textbooks, travel, partying, etc.). These are rough estimates and vary from person to person. The tuition fee is $15,038 per quarter for 2013-14 (3 quarters). You can expect it to increase by maybe another 5% for the Fall quarter of next year (can’t bet on this!). So altogether it will be around $60-61k in tuition fees. Additional costs would be a $3.4k annual health insurance for international students and living expenses (staying off-campus is cheaper; the average rent per month per person could be anywhere between $400 and $600). One can stay on-campus, which is generally more expensive than staying off-campus (http://www.northwestern.edu/living/). Most textbooks either have an e-version or are available in the library, so you can expect to buy very few textbooks. For more information: http://www.analytics.northwestern.edu/prospective-students/tuition-and-fees.html.

8)         Paid internships, on-campus jobs and assistantships

·            There are on-campus opportunities here, which you can research on the University website. Also, a few students work part-time for companies (paid). As far as I know there are no assistantships available, but you can always mail Lindsay (lindsaymontanari@northwestern.edu) and enquire about it. NU also provides 7 scholarships (50% tuition waiver) for this program, but I am not sure on what basis they are awarded (I personally feel they go to the early applicants). For more information: http://www.analytics.northwestern.edu/prospective-students/tuition-and-fees.html

9)         Industry exposure and relevance in this program

·            MSiA is a professional program instituted to cater to the growing industry demand for skilled Analytics professionals, the dearth of which is plaguing small and large businesses alike. Hence this program is structured so that industry exposure is maximized, enabling students to directly apply the concepts learned in the classroom out in the real world. The following link gives more information: http://www.analytics.northwestern.edu/prospective-students/index.html

10)    Is the program eligible for OPT?

·            Yes, it is eligible for OPT since it comes under the STEM category

11)     Acceptance rate at MSiA?

·            I do not have an exact figure to quote but all I can say is that there are very few established pure-analytics programs in this country (and probably the world), though this count is increasing year over year, and MSiA is one among them. But at the same time it’s a new field and yet to reach its peak in terms of popularity among candidates unlike, say, a computer science program. So if you have a great case for yourself on why you should be admitted to the program and if you satisfy the necessary pre-requisites mentioned above, then you can definitely get through.

12)      Which other universities offer Analytics degree programs?

·         As mentioned earlier, there is a huge surge in Analytics related programs in the US. A few years back there were only a handful that offered a full-fledged degree concentrated purely on Analytics and candidates didn’t have a lot of options to get a degree in this field. But this is changing at a rapid pace. The following link gives you a great overview of all the analytics degree programs in this country: https://analytics.ncsu.edu/?page_id=4184

13)      For more FAQs: http://www.analytics.northwestern.edu/prospective-students/faqs.html

Why so serious….????

This is one of the projects that I worked on with Ling Jin (@ljin8118) and Peter Schmidt (@pjschmidt007) as a part of our Text Analytics course at Northwestern University. It was one of the most exciting projects I have worked on, and in the process we learnt some of the latest, cutting-edge techniques used in the field of Text Analytics and Text Mining. Hope you will enjoy it as much as we enjoyed doing it! Cheers 🙂

Goal

Provide a textual analysis of the movie script, The Dark Knight, which was robbed of the best picture Oscar at the 81st Annual Academy Awards on February 22, 2009. All project team members are still bitter about this fact. This assignment hopes to resurrect the greatness that is The Dark Knight.

More seriously though, given a script, the text analytics conducted in this assignment should be able to produce insights into the genre, mood, plot, theme and characters. Ideally the analysis is intended to understand and answer the who, what, when, where and why of a movie.

Objectives 

Specifically, the objectives of the textual analysis of The Dark Knight will cover:

  • Determine the major characters in the script 
  • Show the character to character interaction 
  • Provide insights into sentiment by character 
  • Show how sentiment changes over time 
  • Determine major themes/topics of the script

Approach

Processing Steps

  1. Acquire the movie script of choice 
  2. Parse the script into lines by scenes and lines by character 
  3. Tokenize, normalize, stem the lines of dialogue as appropriate 
  4. Build an index based on available components for subsequent queries 
  5. Perform part of speech (POS) Tagging on the lines of dialogue 
  6. From the POS Tagging, perform sentiment scoring on the lines of dialogue 
  7. Perform named entity recognition 
  8. Perform co-reference 
  9. Perform topic modeling 
  10. Analyze results 
  11. Produce visualizations

Results

Character Identification

The above two visuals carry the same information about the important characters in the movie, in two different representations. The first visual is a bubble chart where the size of each bubble is proportional to the number of lines spoken by the character.

The second one is a heat map where, again, the area represents the quantity of lines of dialogue across scenes by character. These two visuals help us identify the major characters of the movie. One can see that Harvey Dent (aka Two-Face), Gordon, The Joker, Bruce Wayne, Batman, Rachel, Fox, and Alfred were easily the major characters of the movie, with Lau, Chechen, Maroni and Ramirez all playing supporting roles. It is interesting to note that in the script, Two-Face is never named as a separate character, unlike Bruce Wayne and Batman. Combining Bruce Wayne’s and Batman’s lines would have made him the most prominent character, ahead of Harvey Dent.
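The counting behind both visuals is straightforward. The Python sketch below is not the project code. It assumes the parsed script is available as (scene, character, line) tuples, uses invented placeholder lines, and introduces a hypothetical `aliases` argument to show how merging Bruce Wayne and Batman changes the ranking.

```python
from collections import Counter

def lines_per_character(dialogue, aliases=None):
    """Count lines of dialogue per character, optionally merging aliases."""
    aliases = aliases or {}
    return Counter(
        aliases.get(character, character) for _scene, character, _text in dialogue
    )

# Invented placeholder data: (scene_number, character, line_of_dialogue).
DIALOGUE = [
    (1, "DENT", "placeholder"),
    (1, "DENT", "placeholder"),
    (1, "DENT", "placeholder"),
    (2, "WAYNE", "placeholder"),
    (2, "WAYNE", "placeholder"),
    (3, "BATMAN", "placeholder"),
    (3, "BATMAN", "placeholder"),
    (3, "GORDON", "placeholder"),
]

separate = lines_per_character(DIALOGUE)
merged = lines_per_character(DIALOGUE, aliases={"WAYNE": "BATMAN"})
print(separate.most_common(1), merged.most_common(1))
```

In this toy data DENT leads when the two identities are counted separately, and BATMAN leads once they are merged, which mirrors the observation above.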

Character Interactions

Now that the major characters are established, the next obvious step would be to identify how these characters interact with each other.

The above visual gives us an insight into this. Each node is a character, and each edge tells us that the two characters it connects have interacted at least once in the movie. Our definition of an interaction is two or more characters speaking in a single scene. Hence, the more distinct characters a character interacts with, the bigger the node.
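Under that definition, the graph can be built from nothing more than the set of speakers in each scene. A minimal Python sketch with invented scene data (the function name and the scene contents are made up for illustration):

```python
from collections import defaultdict
from itertools import combinations

def interaction_graph(scene_speakers):
    """scene_speakers: dict mapping a scene number to the set of characters
    who speak in that scene. Returns character -> set of distinct partners."""
    neighbours = defaultdict(set)
    for speakers in scene_speakers.values():
        # Every pair of speakers in the same scene counts as an interaction.
        for a, b in combinations(sorted(speakers), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    return neighbours

# Invented example scenes.
SCENES = {
    1: {"BATMAN", "GORDON", "DENT"},
    2: {"WAYNE", "ALFRED"},
    3: {"BATMAN", "JOKER"},
    4: {"GORDON", "DENT"},
}

graph = interaction_graph(SCENES)
# Node size is the number of distinct characters interacted with.
node_size = {character: len(partners) for character, partners in graph.items()}
print(node_size)
```

Using sets means a repeated pairing, such as GORDON and DENT in scenes 1 and 4, adds an edge only once.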

The nodes (characters) marked in red are the central characters. Most of the characters who have a lot of dialogue also have more interactions with distinct characters. But are there exceptions, i.e. are there characters who have a lot of lines but fewer interactions (maybe someone like Alfred, having watched the movie) or vice versa? Let’s look further to see what was observed.

Sentiment over time

Below is a visual description of the sentiment of the scenes over time. To calculate the sentiment for each scene, we first split the scene into dialogues by individual characters. Then each dialogue was run through the processing steps listed above. At the end we get a score for each dialogue, and the average of the senti scores of all the dialogues in a scene gives the senti score of that scene. As we can see, this was a dark, dark movie.

We also looked at how character sentiment varied over time. Again, the methodology was similar to the one above, but by character rather than by scene.
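The averaging step can be sketched in a few lines of Python. This assumes each dialogue has already been given a senti score; the scores below are invented, and `average_score` is a made-up helper name.

```python
from collections import defaultdict
from statistics import mean

def average_score(scored_dialogues, key):
    """Average the dialogue scores grouped by `key` ("scene" or "character")."""
    groups = defaultdict(list)
    for dialogue in scored_dialogues:
        groups[dialogue[key]].append(dialogue["score"])
    return {name: mean(scores) for name, scores in groups.items()}

# Invented scores, one entry per dialogue.
SCORED = [
    {"scene": 1, "character": "GORDON", "score": 0.25},
    {"scene": 1, "character": "JOKER", "score": -0.75},
    {"scene": 2, "character": "JOKER", "score": -0.5},
    {"scene": 2, "character": "BATMAN", "score": -0.25},
]

by_scene = average_score(SCORED, "scene")
by_character = average_score(SCORED, "character")
print(by_scene, by_character)
```

Grouping by character here gives one overall average per character; to follow a character over time, the same averaging would be applied per character within each scene.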

BATMAN vs. JOKER

What does the Batman say?



As the superhero in this movie, Batman does not talk that much (based on the character identification above and on the movie itself). He does mention his opponents, all the killings and, of course, the word “hero”.


What does the Joker say?

The Joker talks quite often actually, as was confirmed earlier. He talks about his scars/the smile, his childhood and the whole plan stuff. He also mentions all the names quite often.

Design and Implementation Challenges

Script

One of the first things was to find an appropriate script, which turned out to be a little harder than expected. It was sort of like finding a needle in a haystack. But after some perseverance, The Dark Knight Script was found at:

http://www.pages.drexel.edu/~ina22/splaylib/Screenplay-Dark_Knight.HTM

There were 8704 actual lines that needed to be parsed and fit together in the above script.

Parsing

There were several nuances that needed to be taken into consideration for the actual script parsing. First of all, the script that was found was not broken down by multiple HTML tags representing the different portions of the script. Instead, the entire script was basically under one tag, which meant parsing one entire block of unstructured text. Hence we had to carefully find patterns in the text and parse the script using them.
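The patterns themselves depend on how the script is laid out. As an illustration only, the Python sketch below assumes common screenplay conventions: scene headings start with INT. or EXT., a character cue is a line in capitals, and the dialogue runs until the next blank line. These assumptions and the placeholder sample are not taken from the actual script, whose layout may well differ.

```python
import re

SCENE_HEADING = re.compile(r"^(INT\.|EXT\.)")
CHARACTER_CUE = re.compile(r"^[A-Z][A-Z .'\-]*$")  # a line written in capitals

def parse_script(text):
    """Return a list of (scene_number, character, dialogue) tuples."""
    results = []
    scene, speaker, buffer = 0, None, []

    def flush():
        # Store the dialogue collected so far for the current speaker.
        nonlocal speaker, buffer
        if speaker and buffer:
            results.append((scene, speaker, " ".join(buffer)))
        speaker, buffer = None, []

    for raw in text.splitlines():
        line = raw.strip()
        if SCENE_HEADING.match(line):
            flush()
            scene += 1
        elif not line:
            flush()  # a blank line ends the current block of dialogue
        elif speaker is None and CHARACTER_CUE.match(line):
            speaker = line
        elif speaker is not None:
            buffer.append(line)
        # Anything else is treated as scene description and ignored.
    flush()
    return results

# Placeholder text in the assumed layout.
SAMPLE = "\n".join([
    "INT. OFFICE -- DAY",
    "",
    "GORDON",
    "Placeholder line one.",
    "Placeholder line two.",
    "",
    "A man walks in.",
    "",
    "EXT. STREET -- NIGHT",
    "",
    "DENT",
    "Placeholder line three.",
])

parsed = parse_script(SAMPLE)
print(parsed)
```

A real script would need more rules than this, for example for parentheticals and for description lines written entirely in capitals, which this sketch would mistake for character cues.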

Tokenization and Lemmatization

The Standard Analyzer from Apache Lucene was chosen to handle the tokenization and any normalization required. It also provided lower-casing and stop word filtering. Since it was decided that stemming was not necessary for the analysis, the Standard Analyzer was chosen over the English Analyzer: the aggressive stemming performed by the latter’s PorterStemFilter was not needed to support the downstream pipeline processes. The Standard Analyzer was then used consistently across the pipeline to prevent any inconsistency concerns.

POS and Sentiment 


There were a couple of options available to perform sentiment mining on the dialogue in the script. 

Option 1:

The initial selection was to use SentiWordNet, http://sentiwordnet.isti.cnr.it. SentiWordNet is a lexical resource that is based on WordNet 3.0, http://wordnet.princeton.edu, and is used for opinion mining. SentiWordNet assigns scores to each synset (a set of one or more synonyms) of a word for a particular part of speech. The parts of speech in SentiWordNet are defined as:

a = adjective

n = noun

r = adverb

v = verb

Obtaining the parts of speech from the Stanford NLP part-of-speech annotator would then require mapping from the parts of speech defined in the Penn Treebank tag set, http://www.computing.dcu.ie/~acahill/tagset.html, to the parts of speech defined in SentiWordNet so that a sentiment score could be produced.
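One simple way to do this mapping is by tag prefix, since the Penn Treebank adjective, noun, adverb and verb tags start with JJ, NN, RB and VB respectively. The Python sketch below shows that idea; it is an assumption about how the mapping could be done, not the project's actual code.

```python
def penn_to_sentiwordnet(tag):
    """Map a Penn Treebank POS tag to a SentiWordNet POS letter.

    Returns None for tags with no SentiWordNet counterpart."""
    if tag.startswith("JJ"):  # JJ, JJR, JJS
        return "a"
    if tag.startswith("NN"):  # NN, NNS, NNP, NNPS
        return "n"
    if tag.startswith("RB"):  # RB, RBR, RBS
        return "r"
    if tag.startswith("VB"):  # VB, VBD, VBG, VBN, VBP, VBZ
        return "v"
    return None

print([penn_to_sentiwordnet(t) for t in ["JJS", "NNP", "RB", "VBD", "DT"]])
```

Words whose tag maps to None (determiners, prepositions and so on) simply get no sentiment score.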

The SentiWordNet resource is constructed as follows:

POS = part of speech

ID = along with POS, uniquely identifies a WordNet (3.0) synset.

PosScore = the positivity score assigned by SentiWordNet to the synset.

NegScore = the negativity score assigned by SentiWordNet to the synset.

SynsetTerms = terms, including the sense number, belonging to the synset

Gloss = glossary

Note: The objectivity score can be calculated as: ObjScore = 1 – (PosScore + NegScore)
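To make the record layout concrete, here is a small sketch of how one entry could be read and its objectivity score derived. It assumes the tab-separated layout of the SentiWordNet 3.0 file, with "#" comment lines and terms written as word#sense. The names SynsetEntry and parse_line and the sample line are made up for illustration and are not taken from the project or from the real resource.

```python
from dataclasses import dataclass


@dataclass
class SynsetEntry:
    pos: str          # a, n, r or v
    synset_id: str    # with pos, identifies the WordNet 3.0 synset
    pos_score: float
    neg_score: float
    terms: list       # (word, sense_number) pairs
    gloss: str

    @property
    def obj_score(self):
        # ObjScore = 1 - (PosScore + NegScore)
        return 1.0 - (self.pos_score + self.neg_score)


def parse_line(line):
    """Parse one tab-separated record; return None for blank or comment lines."""
    if not line.strip() or line.startswith("#"):
        return None
    pos, synset_id, pos_score, neg_score, terms, gloss = (
        line.rstrip("\n").split("\t", 5)
    )
    parsed_terms = []
    for term in terms.split():
        word, sense = term.rsplit("#", 1)
        parsed_terms.append((word, int(sense)))
    return SynsetEntry(
        pos, synset_id, float(pos_score), float(neg_score), parsed_terms, gloss
    )


# Invented sample record, not a real SentiWordNet entry.
sample = "a\t00000001\t0.5\t0.25\tgood#1 fine#2\tmade-up gloss for illustration"
entry = parse_line(sample)
```

For the invented record above, the positive and negative scores add up to 0.75, so the objectivity score is 0.25.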

Option 2

Another option instead of SentiWordNet was to use the sentiment annotator in the Stanford NLP pipeline. The team discovered this new addition to the Stanford NLP during the course of the project. It is recent “bleeding edge” sentiment technology that Stanford now includes in the Stanford NLP. As explained on their website, most sentiment prediction systems work just by looking at words in isolation, giving positive points for positive words and negative points for negative words and then summing up these points, which, incidentally, is in essence what we were doing. That way, the order of words is ignored and important information is lost. In contrast, the sentiment annotator that is part of the Stanford NLP uses a new deep learning model that builds up a representation of whole sentences based on the sentence structure, computing the sentiment based on how words compose the meaning of longer phrases. There are 5 sentiment classes: very negative, negative, neutral, positive, and very positive.

Sentiment  Scoring 

There were several methods available for calculating a sentiment for character lines in a given scene. This is because the actual “sense” of a word was not known when it was passed to the parser for sentiment scoring. So if “good” had n senses in the lexical resource, it was not known which sense was used in the dialogue.

Method 1

The first method was to sum all the senti scores within the body of text then divide by the sum of all scores. This was the method that was provided in the demo on the SentiWordNet web site.

Method 2

The second method was to sum all the senti scores within the body of text then divide by the count of all scores.

Method 3

The third method was to just sum all the senti scores within the body of text.

Although we implemented both options for obtaining sentiment (i.e. SentiWordNet and the Stanford NLP sentiment annotation), the option chosen to score the dialogue was SentiWordNet. Each scoring method was explored, and method 2 as defined above is the one we finally used in our sentiment analysis of the script.
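The second and third methods can be sketched as below. This is only an illustration under stated assumptions: each word is taken to have one signed score, obtained here by averaging PosScore minus NegScore over all of its senses (one possible answer to the unknown-sense problem, not necessarily the one the project used). The function names and numbers are invented. Method 1 is not sketched, since the text does not spell out its normalization.

```python
def word_score(senses):
    """Assumed per-word score: unweighted mean of (pos - neg) over all senses.

    senses is a list of (pos_score, neg_score) pairs, one per sense.
    """
    if not senses:
        return 0.0
    return sum(pos - neg for pos, neg in senses) / len(senses)


def score_method_2(scores):
    """Method 2: sum of the senti scores divided by the count of scores."""
    if not scores:
        return 0.0
    return sum(scores) / len(scores)


def score_method_3(scores):
    """Method 3: plain sum of the senti scores."""
    return sum(scores)


# Invented per-word scores for one line of dialogue.
line_scores = [0.5, -0.25, 0.125]
mean_score = score_method_2(line_scores)
total_score = score_method_3(line_scores)

# A word with two senses: (pos, neg) = (0.5, 0.0) and (0.25, 0.25).
good_score = word_score([(0.5, 0.0), (0.25, 0.25)])
```

One practical difference is that the plain sum grows with the length of the dialogue, while the mean stays comparable between short and long lines.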

Concluding Remarks

Text analytics is quite an involved process. As with most data analysis activities, a major portion of the time is spent identifying, acquiring and cleansing the source data. The field of text analytics is quite broad, with many best-of-breed components. However, it does not have well-integrated toolsets, as you can observe from the solution that was crafted: it had to leverage several technologies (Java, R, Excel, Gephi, Tableau), different libraries (Stanford NLP, Lucene) and various other packages to perform specific functions within the data pipeline. All in all, though, it has been shown that with some blood, sweat and tears (over 1200 lines of code were written for this assignment), and certainly time, a text analytics tool can be built to analyze movie scripts and give a pretty accurate view when compared to the overall reality of the movie. And lastly, it should be mentioned that the inherent complexities in the dialogue and the richness of the script should have guaranteed the Oscar for The Dark Knight!

WHY SO SERIOUS?

Power of Friendship

It’s been more than two and a half years since we started out with our initiative ABC (www.abc-org.blogspot.in), and I am happy as well as a little disappointed with how the organization has turned out during this period. Happy because it still exists (yes, trust me, it’s pretty easy for such initiatives to die down soon after the initial burst), is bigger (in terms of the projects that we have been undertaking) and is functioning better as well (for the school we have been associated with during this time). And disappointed because we had envisaged it becoming a larger organization (in terms of the no. of members) and one with a larger reach (in terms of the no. of schools we are helping out). But at the end of the day I am glad that at least the people who are still actively participating in ABC’s activities are really committed and self-driven in this common journey of ours.

During this time, we have undertaken a number of activities for the kids of these schools like

  • Stationery Drive
  • Shoe Drives in 2010 and 2013
  • Annual scholarship program for the 7th std kids
  • Annual sponsorship of 10th std. students’ exam fees
  • Water filter project, the biggest project we have undertaken 

to name a few…

But the main objective of starting this organization was to provide good quality education to these kids, similar to the one which the likes of me were blessed to have, and we are definitely striving towards it.

But during this time I also started realizing that we, a group of about 5-7 friends, were able to pull off something that helped improve, in whatever little way, the studying conditions of more than 400 kids. 

They had uniforms and shoes to wear, pens and pencils to write and give their exams, money to pay their exam fee and access to clean drinking water. This triggered a thought in my mind: why can’t each group of friends pick a small school in their locality and provide whatever support they can, financial if nothing else, to improve the studying environment of the school? Because I believe that in this context and at the juncture where India is today, something is definitely and always better than nothing.

From my little experience and interactions with people across all age groups and different backgrounds, everyone has been appreciative when I have talked about ABC to them, and most of them also expressed a desire to do something like this themselves. This shows that people are grateful to our society, from which we have received so much, and feel a sense of giving back to it. But still only a small percentage of these people actually venture out and do something about it. So who/what is stopping them?

The most common reasons I can think of that people give themselves, more than anyone else, are: lack of time and money, and a feeling that a lot has to be sacrificed to do something in this area. But according to me none of them is relevant today.

As I see it, our country is definitely more prosperous, especially the middle class, compared to where the nation was a couple of decades ago. More people are better educated, have stable jobs and good salaries; more members of the family (women mainly) are working, which increases the household income as well. Hence the youth coming from this class definitely can’t complain of a lack of either money or time. I am not saying that the youth are filthy rich or that they have all the time in the world. They do have career ambitions, work/study and certain other responsibilities to fulfill as well. In fact, none of us in ABC are any different. But we also know that we can definitely spare some time to fulfill our responsibilities in this regard.

Then there is SACRIFICE, something that is commonly associated with serving society. Of course, it is true, and I greatly appreciate the likes of Anna Hazare and Medha Patkar for giving their entire lives for the upliftment of people. But if you don’t intend to do that and still wish to contribute in your own little way, there is nothing wrong with that at all. And I assure you, after having been a part of the activities undertaken by ABC, I haven’t sacrificed anything at all. Not studies, job, family, friends, fun or anything else. All you need is the willingness to take up something like this and then the drive to continue doing it. Other practical issues such as finance and funds for bigger projects, the right place to start etc. will automatically be eliminated as hindrances.

So now, what exactly can a group of friends do?

      1)      They can just visit a nearby school which is not in a good state, talk to the principal and enquire about the problems that the school children are facing in getting the right education.

      2)      Then you can see where you can fill in. Teaching is something I personally don’t advocate unless it is done in a very structured and professional way. Else it is of no use. But again if someone believes otherwise 

      3)      Filling in the infrastructure voids is definitely where each one of us can contribute. The projects that we have done at ABC are testament to that. Every year thousands of students, many of them bright and talented, are not able to write their annual exams just because their families can’t afford their exam fees. And what’s the cost of the exam fee? A few hundred rupees at max. Can’t we fill in this void to at least ensure kids are able to give their exams? Kids walk barefoot to school; can’t we provide them with a pair of shoes, which hardly costs 100 bucks but can go a long way in enabling that kid to go to school, study and play!

      4)      The next thing that comes to your mind is finance, since you can’t provide only one kid with shoes. What if there are about 100 children in the school who don’t have shoes, which means you need to raise 10k bucks? I think that shouldn’t be a really big problem either, since in this era of social networking I don’t think it’s hard to find about 15-20 people in our circle of family and friends who can contribute 500-1000 bucks each to buy these shoes.

     5)      And finally on a very selfish level, there is always this nice feeling of playing your part in nation building. Also it is a great reason for all your friends to meet up more regularly and enjoy even more!!! A definite win-win situation as I see it.

This is what according to me is the power and potential of the youth and friendship in particular. I hope I have made sense in whatever I have jotted down and it appeals to all who read this.

Cheers :)

SBI… IPL… Hockey…

April 2008: That was the month that saw the start of the greatest sporting championship the country had ever seen, the Indian Premier League, better known as the IPL.

Based on the concept of the hugely successful and popular English Premier League for football in England, the IPL broke all the traditional barriers to embrace the latest, shortest and the most exciting form of one of the oldest field sports recorded in the history of mankind. The formation of the IPL was something similar to the story of State Bank of India (SBI) eventually embracing ATMs in India. Well, the story goes something like this:

SBI initially criticized and ridiculed the concept of ATMs, saying that in a country with so many illiterates, people living in villages, lawlessness in many parts of the nation etc., a person using a card with a protected magnetic strip, interacting with a machine to withdraw money, and that too safely, was definitely going to be a flop show. But other banks thought otherwise and started setting up their own ATMs in various parts of the country. Yes, there were isolated incidents of loot outside ATMs, people not being comfortable dealing with the machine etc., just like the problems faced initially whenever a new system is introduced. Slowly and steadily there was a rise in the popularity of ATMs, and banks having more ATMs started becoming more profitable as well. This was when SBI realized its folly, found itself lagging behind the competition from newer and smaller banks, and thus accepted this wonder invention, and how. Today SBI has by far the maximum no. of ATMs in the country and is one of the few govt. owned institutions in the country to be hugely successful, giving private banks a run for their money!

Indian cricket too trod the same path. After the wonder invention of T20 cricket, the board dismissed it as a too-short-for-cricket format and a diluter of the traditional 5-day game, TEST cricket, which still remains the ultimate test for a cricketer. It was 2007 and the most spectacular event in cricket, the ICC World Cup, was about to start. India sent a decently strong team under the leadership of Rahul Dravid. But India’s hopes of making it to the knockout stages took a blow in the very first match itself, with a loss to Bangladesh. It was a big flop show from the Indian cricket team, which in turn flopped the whole of the WC itself, considering that even Pakistan crashed out unceremoniously. But the team got a great chance to heal its World Cup wounds by doing nothing but performing better in the upcoming World Cup, only that it was a different ball game (not literally!), since it was the T20 WC, a format which the Indians had neither favored since its inception nor had enough experience in. The experienced players took a back seat and withdrew from the tournament, and a young team under a young leader was sent to the competition. And the rest, as they say, is history, and one which no one expected. The T20 WC came home, opened the eyes of many people who were against the format and showed the potential it held in this country.

In the meantime came the Indian Cricket League (ICL) as well. The ICL went against the all-powerful BCCI to start a T20 cricket league of its own. BCCI used its entire might and even threatened the people who showed interest in being associated with the league that they would be banned for life from playing matches for their country or in their respective domestic tournaments. Still the ICL happened, defying all the warnings and threats from cricket boards. It made a lot of noise and had its fair share of success in terms of viewership, but more importantly it sent a message to the BCCI that cricket without it can happen! Suddenly the young lot of Indian cricket started getting lured by this new league and disregarded the threats issued by the board. National prospects quit domestic cricket and participated in the league. This was the state not just in India but even outside. Former and current overseas players too started giving up their desire of playing for their nation by accepting the lucrative offers of the ICL. The ghost of Kerry Packer was back to haunt cricket, since he had trod a similar path back in the late 70s, which ruined the careers of many bright cricketers like Tony Greig. This was too much for the egoistic BCCI to stay mum and meekly accept defeat. And merely to teach the ICL a lesson for its deeds, BCCI came up with the concept of the IPL. Loosely based on the functioning of the English Premier League for football in England, it was an instant success amongst the players, team-owners and fans all around the world.

April/May 2011: IPL is still alive and kicking in its 4th season. Just like the SBI, the BCCI too took time to realize the potential of a new invention. But after the realization the invention was taken to an altogether different level in terms of its reach!

Now where does hockey come into the picture? Once the pride of our country, hockey today has been relegated to merely being called the national game in school text books, without living up to the name and benchmarks set by its predecessors in this beautiful and enthralling game. The game is definitely not getting its due, and it is high time we started taking it seriously and brought back the lost glory. Hockey needs a large-scale revamp to match the popularity that cricket has managed to garner. And the IPL is one great model which the hockey authorities can emulate to create a league of their own. Yes, it was tried previously and didn’t succeed to the expected levels, but the way it is organized can be changed with the IPL model coming into the picture. Two things about hockey that need urgent attention are: improving the image of the game and attracting youngsters to pick up the “stick” and not the “bat”.

Short sports have always been more exciting than their longer-duration cousins. Still, cricket has managed to become popular amongst the young and the old alike in this country. Surely the quality of cricket and of the players playing it has improved, for which credit should be given to the various cricket boards for having a sound domestic set-up to nurture young cricketers. This surely can be done for hockey too. But as all of us know, the ever efficient Hockey India, had it wished, would have sweated it out and done that long back. So this option no longer remains an option. Then what is the way out? Commercialization of the game is one way that comes to my mind. So many youngsters today are attracted by the name and fame that T20 cricket brings its players, which makes them take their game seriously and not just as a hobby. This in turn churns out talent from the lot. Any day, having more options to choose from is better than having few. That is what the advent of the IPL has done.
So can’t we have an HPL for hockey too, where corporates can be asked to pick up teams and players and have a tournament among them? What will this ultimately do?

a) It will serve as a great platform for young players to showcase their talent and be recognized

b) More youngsters will take up the game seriously as a career option since they can see a bright future in the game with the corporate bigwigs involved.

c) Better playing facilities (which ideally should have been provided by the hockey board, but… never mind!) from the team owners.

d) The game will again get a chance to connect with the people since they can catch it live on the television.

For starters, the hockey board can approach the BCCI itself to help them out with the whole procedure to begin with and I think the BCCI would be more than happy to extend a helping hand to Hockey India. But the initiative should come from the hockey board. The existing IPL team owners can be called for a meeting and can be asked for their opinions on it which would come in very handy, since they have vast experience in their field and also for the very fact that they might ultimately be the likely owners of the teams. Hopefully this happens and we can get to cheer our regional teams.

Finally, I am not sure if this idea is novel, since the IPL’s success is very visible and at the same time Hockey India is thinking of how to get the sport back into the news for the right reasons. The question and the answer both are with the hockey board. But the big question is whether they are really looking for the solution??? Your guess is as good as mine.