In [1]:
import re
from collections import Counter

# For the example later
example_text = open("review_polarity/txt_sentoken/pos/cv750_10180.txt").read()
bag_of_words = Counter({'and': 37, 'is': 26, 'he': 11, 'great': 10, 'carlito': 9, 'film': 8, 'but': 8, 'some': 7, 'pacino': 7, "carlito's": 7, 'palma': 5, 'well': 5, 'like': 5, 'woman': 4, 'amazing': 4})

Text classification

A very common problem in NLP:

Given a piece of text, assign a label from a predefined set

What could the labels be?

  • positive vs negative (e.g. sentiment in reviews)
  • about world politics or not
  • author name (author identification)
  • pass or fail in essay grading

In this section

We will see how to:

  • represent text with numbers
  • learn a classifier using the perceptron rule

... and you will ask me questions!

The maths we need is addition, subtraction, multiplication and division!

Sentiment analysis on film reviews

Representing text with numbers

In [4]:
print(example_text)
what's shocking about " carlito's way " is how good it is . 
having gotten a bit of a bad rap for not being a big box office hit like pacino's previous film , " scent of a woman , " and not having as strong a performance as he did in that one ( he had just won an oscar ) , " carlito's way " was destined for underrated heaven . 
that's what it is : an underrated gem of a movie . 
and what a shame because pacino and de palma both do amazing jobs with it , and turn it into a great piece of a pulpy character study . 
 " carlito's way " deals with , well , carlito brigante ( pacino ) , a puerto rican ex-drug kingpin , who gets out of a long jailterm when his coke-addicted , curly-haired lawyer ( sean penn ) points out a legal technicality . 
of course , carlito was actually awoken in prison , and has decided to go straight , even if he's really a crook at heart . 
carlito , like barry lyndon , is a man who is trapped by fate at every turn , and can't escape into something he is not . 
carlito's attempts at a clean , legal life are thwarted at nearly every turn . 
when he first gets out , a friend of his ends up leading him into a big shoot-out , where he has to kill a couple people to survive . 
he's constantly getting bugged by the government to see if he's doing anything illegal , and his lawyer finds himself neck-deep in a pile of shit , needing him to try and help him out , which includes him doing some prison breaking . 
carlito , like ratso rizzo , wants to go to miami ( since , according to film logic , that's where it's at ) , but needs some funding . 
being a legend , he is quickly able to get a nice job running a big dance club ( this is the 70s , by the way , and since some of this takes place in night clubs , we get to hear all sorts of 70s classics , including several k . c . 
and the sunshine band tunes - my personal favorite ) . 
he gets a bodyguard ( the great luis guzman , at his best ) , and is soon running a pretty good business , even if he's constantly attracting underworld young thugs , like benny blanco " from the bronx " ( john leguiziamo ) , who is more than once pointed out to be a young version of carlito . 
on the other side , the symbol of promise and hope , is gail ( penelope ann miller - what happened to her ? ) , his girlfriend from before prison . 
she's a goregeous ballerina , and a stripper , and soon carlito is trying to get back with her , and take her with him when he finally leaves for miami . 
while this relationship is never fully defined or anything , we get a sense of love between them , and they have some truly interesting scenes between them ( she never gives him addresses or locales - he always has to track her down ) . 
all of these elements clash together at the end , in a brilliantly executed , emotional climax , which is inevitable . 
when i say inevitable , i mean we see it at the beginning and then backtrack , putting a great spin on it . 
sure , it's going to eleveate some of the tension , but it gives the film a lot of depth , as carlito is seen trapped by fate . 
what's amazing is the big chase sequence ( amazingly done by de palma ) has a lot of tension and thrills . 
like " apollo 13 , " we know what's going to happen , but we're still thrilled by what happens in the middle . 
it's also very emotional , thanks to a great script by david koepp , and amazing performances by pacino and miller . 
de palma is famous ( or infamous ) for lots of violence in his films . 
his earlier flim , " scarface " ( which starred pacino in the lead ) , has a ton of it , especially at the end ( and a nasty chainsaw scene towards the beginning which i'm still not over ) . 
but de palma actually reigns in more quieter scenes . 
to me , the best scene in the film is when carlito is on top of a building , looking down into the room where gail is doing ballet . 
this is the most brilliantly done , and most emotionally stimulating scene in the entire film , and probably the best in de palma film history . 
with a gorgeous soprano duet in the background , and rain pouring down onto a trashcan lid covering carlito's head , and a saddened , remorseful look on pacino's face , it's a tear-jerker ( well , for me , i dunno about you . . . ) . 
the acting from all is great , especially from the three leads . 
pacino was panned for his performance , chiefly because his accent wasn't puerto rican enough and , well , it wasn't as " strong " as his oscar-winning role in " scent of a woman . " 
well , his " scent of a woman " performance was great and all , but it was nothing really but , as comic kevin pollack said , a " foghorn leghorn impression . " 
in " carlito's way , " he's emotional , and strong , despite the fact that he's remorseful over his entire lifestyle , which he cannot change . 
i felt more for carlito brigante than i did for the tango-dancing , insult-throwing blind guy in " scent of a woman . " 
as i said , penelope ann miller is great , and she and pacino actually have very good chemistry . 
and they're scenes are well-written , with some good clever dialogue which adds some interest to an otherwise bland relationship . 
and sean penn is amazing as the coke-addicted rat attorney . 
every scene he's in , he has great energy , and even measures up to the greatness that is pacino . 
in smaller roles , john leguiziamo and luis guzman are great . 
 " carlito's way " is one of those films which you heard about briefly , but when you finally watch it , you're absolutely blown away . 
it's a wonderful film , a highly underrated little masterpiece which was shelved after it didn't do so hot . 
but trust me and check it out . 
it's a great little film , and proof that the residential critics and mass populus are not always right . 

Counting words

In [2]:
dictionary = Counter(re.sub(r"[^\w']", " ", example_text).split())
print(dictionary)
Counter({'a': 47, 'and': 37, 'the': 31, 'is': 26, 'of': 24, 'to': 20, 'in': 18, 'it': 13, 'his': 12, 'he': 11, 'great': 10, 'at': 9, 'carlito': 9, 'film': 8, 'for': 8, 'which': 8, 'as': 8, 'by': 8, 'but': 8, "he's": 7, 'out': 7, 'some': 7, "carlito's": 7, 'pacino': 7, 'when': 6, 'him': 6, "it's": 6, 'i': 6, 'with': 6, 'has': 6, 'way': 6, 'was': 6, 'not': 5, 'de': 5, 'like': 5, 'palma': 5, 'well': 5, 'her': 4, 'all': 4, 'we': 4, 'that': 4, 'this': 4, 'scent': 4, 'into': 4, 'what': 4, 'woman': 4, 'get': 4, 'on': 4, 'are': 4, 'from': 4, 'good': 4, 'big': 4, 'amazing': 4, 'scene': 4, 'gets': 3, 'or': 3, 'who': 3, 'prison': 3, "what's": 3, 'more': 3, 'about': 3, 'me': 3, 'if': 3, 'even': 3, 'miller': 3, 'best': 3, 'actually': 3, 'down': 3, 'turn': 3, 'scenes': 3, 'where': 3, 'an': 3, 'doing': 3, 'strong': 3, 'underrated': 3, 'performance': 3, 'you': 3, 'every': 3, 'emotional': 3, 'inevitable': 2, 'especially': 2, 'finally': 2, '70s': 2, 'coke': 2, 'tension': 2, "pacino's": 2, 'soon': 2, 'guzman': 2, 'entire': 2, 'addicted': 2, 'one': 2, 'them': 2, 'going': 2, 'films': 2, 'since': 2, "that's": 2, 'penn': 2, 'legal': 2, 'trapped': 2, 'puerto': 2, 'oscar': 2, 'than': 2, 'sean': 2, 'running': 2, 'do': 2, 'constantly': 2, 'rican': 2, 'really': 2, 'remorseful': 2, 'have': 2, 'always': 2, 'because': 2, 'did': 2, 'done': 2, 'never': 2, 'most': 2, 'brigante': 2, 'leguiziamo': 2, 'relationship': 2, "wasn't": 2, 'penelope': 2, 'end': 2, 'little': 2, 'lot': 2, 'said': 2, 'over': 2, 'being': 2, 'lawyer': 2, 'gives': 2, 'fate': 2, 'ann': 2, 'still': 2, 'miami': 2, 'gail': 2, 'up': 2, 'go': 2, 'beginning': 2, 'having': 2, 'very': 2, 'anything': 2, 'she': 2, 'between': 2, 'john': 2, 'young': 2, 'brilliantly': 2, 'see': 2, 'luis': 2, 'sense': 1, 'famous': 1, 'straight': 1, 'c': 1, 'leading': 1, 'points': 1, 'stripper': 1, 'defined': 1, 'character': 1, 'shocking': 1, 'after': 1, 'gorgeous': 1, 'room': 1, 'technicality': 1, 'shit': 1, 'benny': 1, 'tear': 1, 'amazingly': 1, 'greatness': 1, 'escape': 1, 'panned': 1, 'apollo': 1, 'duet': 1, "i'm": 1, 'hope': 1, 'able': 1, 'probably': 1, 'felt': 1, '13': 1, 'acting': 1, 'locales': 1, 'according': 1, 'k': 1, 'populus': 1, 'interesting': 1, 'several': 1, 'happens': 1, 'addresses': 1, 'decided': 1, "she's": 1, 'nasty': 1, 'eleveate': 1, 'funding': 1, 'friend': 1, 'watch': 1, 'ratso': 1, 'box': 1, 'earlier': 1, "they're": 1, 'hot': 1, 'thrilled': 1, 'three': 1, 'top': 1, 'background': 1, 'ex': 1, 'how': 1, 'blind': 1, 'man': 1, 'bodyguard': 1, 'piece': 1, 'place': 1, 'long': 1, 'while': 1, 'drug': 1, 'climax': 1, 'blown': 1, 'sequence': 1, 'truly': 1, 'crook': 1, 'help': 1, 'jobs': 1, 'study': 1, 'thwarted': 1, 'trying': 1, 'goregeous': 1, 'band': 1, 'illegal': 1, 'finds': 1, 'bland': 1, 'adds': 1, 'shoot': 1, 'emotionally': 1, 'critics': 1, 'track': 1, 'pouring': 1, 'including': 1, 'absolutely': 1, 'dunno': 1, 'thugs': 1, 'sunshine': 1, 'head': 1, 'destined': 1, 'check': 1, 'fully': 1, 'say': 1, 'fact': 1, 'course': 1, 'energy': 1, 'favorite': 1, 'reigns': 1, 'heard': 1, 'accent': 1, 'trust': 1, 'dancing': 1, 'thanks': 1, 'clever': 1, 'people': 1, 'violence': 1, 'bit': 1, 'ends': 1, 'highly': 1, 'deep': 1, 'blanco': 1, 'spin': 1, 'quickly': 1, 'clean': 1, 'shame': 1, 'roles': 1, 'couple': 1, 'other': 1, 'won': 1, 'bad': 1, 'leads': 1, 'away': 1, 'includes': 1, 'lifestyle': 1, 'something': 1, 'sorts': 1, 'girlfriend': 1, 'happen': 1, 'try': 1, 'enough': 1, 'know': 1, 'breaking': 1, 'promise': 1, 'survive': 1, 'script': 1, 'otherwise': 1, 'kevin': 1, 'once': 1, 'guy': 1, 
'kingpin': 1, 'personal': 1, 'despite': 1, 'ton': 1, 'cannot': 1, 'they': 1, 'lyndon': 1, 'pointed': 1, 'history': 1, 'side': 1, 'thrills': 1, 'performances': 1, 'just': 1, 'love': 1, 'my': 1, 'seen': 1, 'saddened': 1, 'jailterm': 1, 'lid': 1, 'onto': 1, 'awoken': 1, 'middle': 1, 'right': 1, 'happened': 1, 'attracting': 1, 'lots': 1, 'shelved': 1, 'office': 1, 'nice': 1, 'rizzo': 1, 'neck': 1, 'tango': 1, 'haired': 1, 'classics': 1, 'jerker': 1, 'written': 1, 'scarface': 1, 'flim': 1, 'those': 1, 'leaves': 1, 'attempts': 1, 'hear': 1, 'foghorn': 1, 'back': 1, 'comic': 1, 'hit': 1, 'clubs': 1, 'needing': 1, 'needs': 1, 'tunes': 1, 'gotten': 1, 'chainsaw': 1, 'also': 1, 'getting': 1, 'infamous': 1, 'stimulating': 1, 'version': 1, 'gem': 1, "you're": 1, 'david': 1, 'pollack': 1, 'briefly': 1, 'clash': 1, 'building': 1, 'depth': 1, 'club': 1, 'symbol': 1, 'dance': 1, 'quieter': 1, 'previous': 1, 'koepp': 1, 'backtrack': 1, 'towards': 1, 'bugged': 1, 'business': 1, 'lead': 1, 'trashcan': 1, 'ballerina': 1, 'bronx': 1, 'together': 1, 'smaller': 1, 'logic': 1, 'takes': 1, "can't": 1, 'elements': 1, 'both': 1, 'nothing': 1, 'legend': 1, 'these': 1, 'chase': 1, 'heaven': 1, 'had': 1, 'government': 1, 'night': 1, 'curly': 1, 'so': 1, 'throwing': 1, 'masterpiece': 1, 'impression': 1, 'first': 1, 'pulpy': 1, 'look': 1, 'executed': 1, 'take': 1, 'interest': 1, 'covering': 1, 'face': 1, 'pile': 1, 'pretty': 1, "we're": 1, 'residential': 1, 'barry': 1, 'sure': 1, 'rat': 1, 'chemistry': 1, 'looking': 1, 'deals': 1, 'ballet': 1, 'role': 1, 'wants': 1, 'soprano': 1, 'proof': 1, 'before': 1, 'measures': 1, 'rap': 1, 'starred': 1, "didn't": 1, 'putting': 1, 'dialogue': 1, 'nearly': 1, 'movie': 1, 'insult': 1, 'chiefly': 1, 'mean': 1, 'leghorn': 1, 'underworld': 1, 'heart': 1, 'change': 1, 'then': 1, 'mass': 1, 'winning': 1, 'life': 1, 'be': 1, 'wonderful': 1, 'job': 1, 'himself': 1, 'attorney': 1, 'rain': 1, 'kill': 1})

Bag of words representation

  • The higher the counts for a word, the more important it is for the document
  • No document contains every word in the vocabulary; most words have a count of 0 (implicitly)

Anything missing?

  • which words to keep?
  • how to value their presence/absence?
  • word order is ignored, could we add bigrams? (see the sketch below)

Choice of representation (features) matters a lot!
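
As hinted in the list above, one cheap way to recover a little word order is to add bigram features next to the unigram counts. A minimal sketch, reusing the re and Counter imports from the first cell (the function name bag_of_ngrams is my own, not from the lecture code):

def bag_of_ngrams(text, use_bigrams=True):
    # unigram counts, as before
    tokens = re.sub(r"[^\w']", " ", text.lower()).split()
    features = Counter(tokens)
    if use_bigrams:
        # add adjacent word pairs as extra features
        features.update(zip(tokens, tokens[1:]))
    return features

print(bag_of_ngrams("what a shame because pacino and de palma both do amazing jobs"))
# each unigram plus each adjacent word pair now appears as a feature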

Our first classifier

Now we have represented a text as counts over words/features.

We need a model to decide whether the review is positive or negative.

If each word $n$ has count $x_n$ in the review and is associated with a weight $w_n$, then:

$$\hat y = sign(\sum_{n=1}^N w_nx_n) = sign(\mathbf{w} \cdot \mathbf{x})$$
In [3]:
print(bag_of_words)
Counter({'and': 37, 'is': 26, 'he': 11, 'great': 10, 'carlito': 9, 'film': 8, 'but': 8, 'pacino': 7, 'some': 7, "carlito's": 7, 'well': 5, 'palma': 5, 'like': 5, 'amazing': 4, 'woman': 4})
In [4]:
weights = {'and': 0.0, 'is': 0.0, 'he': 0.0, 'great': 0.0,
           'carlito': 0.0, 'but': 0.0, 'film': 0.0, 'some': 0.0,
           "carlito's": 0.0, 'pacino': 0.0, 'like': 0.0,
           'palma': 0.0, 'well': 0.0, 'amazing': 0.0, 'woman': 0.0}
In [12]:
score = 0.0
for word, counts in bag_of_words.items():
    score += counts * weights[word]
print(score)
print("positive" if score >= 0.0 else "negative")
0.0
positive

Another view

How to learn the weights $\mathbf{w}$?

The perceptron

Proposed by Rosenblatt in 1958 and still in use by researchers

Supervised learning

Given training documents with the correct labels

$$D_{train} = \{(\mathbf{x}^1,y^1)...(\mathbf{x}^M,y^M)\}$$

Find the weights $\mathbf{w}$ for the linear classifier

$$\hat y = sign(\sum_{n=1}^N w_nx_n) = sign(\mathbf{w} \cdot \mathbf{x})$$

so that we can predict the labels of unseen documents


Learning with the perceptron

\begin{align}
& \textbf{Input:} \; D_{train} = \{(\mathbf{x}^1,y^1)...(\mathbf{x}^M,y^M)\}\\
& set\; \mathbf{w} = \mathbf{0} \\
& \mathbf{for} \; (\mathbf{x},y) \in D_{train} \; \mathbf{do}\\
& \quad predict \; \hat y = sign(\mathbf{w}\cdot \phi(\mathbf{x}))\\
& \quad \mathbf{if} \; \hat y \neq y \; \mathbf{then}\\
& \quad \quad \mathbf{if} \; \hat y\; \mathbf{is}\; 1 \; \mathbf{then}\\
& \quad \quad \quad update \; \mathbf{w} = \mathbf{w} - \phi(\mathbf{x})\\
& \quad \quad \mathbf{else}\\
& \quad \quad \quad update \; \mathbf{w} = \mathbf{w} + \phi(\mathbf{x})\\
& \mathbf{return} \; \mathbf{w}
\end{align}

  • error-driven, online learning
  • $x$ is the raw document; $\phi(x)$ is its feature representation (bag of words, bigrams, etc.); a runnable sketch follows below
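
A minimal runnable sketch of this loop in Python, reusing the re and Counter imports from the first cell; the names extract_features and perceptron_train and the epochs parameter are my own additions, not the lecture's code:

def extract_features(text):
    # phi(x): bag-of-words feature map, word -> count
    return Counter(re.sub(r"[^\w']", " ", text.lower()).split())

def perceptron_train(train_data, epochs=1):
    # train_data: list of (text, label) pairs with label in {+1, -1}
    weights = Counter()                      # unseen features implicitly get weight 0
    for _ in range(epochs):
        for text, label in train_data:
            phi = extract_features(text)
            score = sum(weights[word] * count for word, count in phi.items())
            y_hat = 1 if score >= 0.0 else -1
            if y_hat != label:               # error-driven: update only on mistakes
                for word, count in phi.items():
                    weights[word] += label * count   # equivalent to w +/- phi(x)
    return weights

With epochs=1 this matches the single pass in the pseudocode above; extra passes over a small training set usually help.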

A little test

Given the following tweets labeled with sentiment:

Label Tweet
negative Very sad about Iran.
negative No Sat off...Need to work 6 days a week.
negative I’m a sad panda today.
positive such a beautiful satisfying day of bargain shopping. loves it.
positive who else is in a happy mood??
positive actually quite happy today.

What features would the perceptron find indicative of the positive/negative class?

Would they generalize to unseen test data?
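
As a usage example for this little test (assuming the perceptron_train and extract_features sketches defined above), the six tweets could be encoded with labels in {+1, -1}; the largest and smallest learned weights hint at which features the perceptron treats as positive or negative evidence:

tweets = [
    ("Very sad about Iran.", -1),
    ("No Sat off...Need to work 6 days a week.", -1),
    ("I'm a sad panda today.", -1),
    ("such a beautiful satisfying day of bargain shopping. loves it.", +1),
    ("who else is in a happy mood??", +1),
    ("actually quite happy today.", +1),
]
w = perceptron_train(tweets, epochs=5)
print(w.most_common(5))       # features pushed towards the positive class
print(w.most_common()[-5:])   # features pushed towards the negative class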

Sparsity and the bias

In NLP, no matter how large our training dataset, we will never see (enough of) all the words/features.

  • features unseen in training are ignored in testing
  • there are ways to ameliorate this issue (e.g. word clusters), but it never goes away
  • there will be texts containing only unseen words

Bias: a feature that appears in every instance

  • its value is hardcoded to 1 (see the sketch below)
  • it is the constant 1 input in the perceptron diagram
  • its weight effectively learns to predict the majority class
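
A minimal sketch of adding such a bias feature, assuming the extract_features helper from the perceptron sketch above (the feature name '<BIAS>' is my own choice):

def extract_features_with_bias(text):
    phi = extract_features(text)   # bag-of-words features from the earlier sketch
    phi['<BIAS>'] = 1              # constant feature, present in every instance
    return phi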

Evaluation

The standard way to evaluate our classifier is:

$$ Accuracy = \frac{correctLabels}{allInstances}$$

What could go wrong?

When one class is much more common than the other, always predicting it gives high accuracy.
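
A quick sketch with made-up numbers: on a hypothetical test set where 90 of 100 reviews are negative, a "classifier" that always predicts the majority class already reaches 90% accuracy while never finding a single positive review:

# hypothetical test set: 90 negative (-1) and 10 positive (+1) reviews
gold = [-1] * 90 + [+1] * 10
pred = [-1] * 100                 # always predict the majority class
accuracy = sum(p == g for p, g in zip(pred, gold)) / len(gold)
print(accuracy)                   # 0.9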

Evaluation

Predicted/Correct   MinorityClass   MajorityClass
MinorityClass       TruePositive    FalsePositive
MajorityClass       FalseNegative   TrueNegative

$$ Precision = \frac{TruePositive}{TruePositive+FalsePositive}$$

$$ Recall = \frac{TruePositive}{TruePositive+FalseNegative}$$
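
Continuing the made-up majority-class example from the accuracy sketch above (so gold and pred are assumed to be defined), precision and recall for the positive (minority) class expose what accuracy hides:

tp = sum(p == +1 and g == +1 for p, g in zip(pred, gold))   # 0
fp = sum(p == +1 and g == -1 for p, g in zip(pred, gold))   # 0
fn = sum(p == -1 and g == +1 for p, g in zip(pred, gold))   # 10
precision = tp / (tp + fp) if (tp + fp) else 0.0   # no positive predictions at all
recall = tp / (tp + fn)                            # 0.0: every positive review is missed
print(precision, recall)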

Time for (another!) test

Discuss in pairs what features you would use in a classifier that predicts FAIL/PASS for an essay!