import re
from collections import Counter
# For the example later
example_text = open("review_polarity/txt_sentoken/pos/cv750_10180.txt").read()
bag_of_words = Counter({'and': 37, 'is': 26, 'he': 11, 'great': 10, 'carlito': 9, 'film': 8, 'but': 8, 'some': 7, 'pacino': 7, "carlito's": 7, 'palma': 5, 'well': 5, 'like': 5, 'woman': 4, 'amazing': 4})
A very common problem in NLP:
Given a piece of text, assign a label from a predefined set
What could the labels be?
We will see how to:
... and you will ask me questions!
The maths we need is addition, subtraction, multiplication and division!
print(example_text)
# Build a bag of words: keep only word characters and apostrophes, then count tokens
dictionary = Counter(re.sub(r"[^\w']", " ", example_text).split())
print(dictionary)
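The full counter is long; `Counter.most_common` pulls out the top entries, which is presumably how the truncated `bag_of_words` shown earlier was built:

# The 15 most frequent tokens (matches the bag_of_words shown earlier)
print(dictionary.most_common(15))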
Anything missing?
Choice of representation (features) matters a lot!
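For example, one common alternative to raw counts (just an illustration, not used in the running example) is binary presence features:

# Illustration: binary presence instead of counts (1 if the word occurs at all)
binary_features = {word: 1 for word in dictionary}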
Now we have represented a text as counts over words/features.
We need a model to decide whether the review is positive or negative.
If each word $n$ has count $x_n$ in the review and is associated with a weight $w_n$, then:
$$\hat y = sign(\sum_{n=1}^N w_nx_n) = sign(\mathbf{w} \cdot \mathbf{x})$$
print(bag_of_words)
weights = {'and': 0.0, 'is': 0.0, 'he': 0.0, 'great': 0.0,
           'carlito': 0.0, 'but': 0.0, 'film': 0.0, 'some': 0.0,
           "carlito's": 0.0, 'pacino': 0.0, 'like': 0.0,
           'palma': 0.0, 'well': 0.0, 'amazing': 0.0, 'woman': 0.0}
score = 0.0
for word, counts in bag_of_words.items():
    score += counts * weights[word]
print(score)
print("positive" if score >= 0.0 else "negative")
The perceptron was proposed by Rosenblatt in 1958 and is still in use by researchers
Given training documents with the correct labels
$$D_{train} = \{(\mathbf{x}^1,y^1)...(\mathbf{x}^M,y^M)\}$$
Find the weights $\mathbf{w}$ for the linear classifier
$$\hat y = sign(\sum_{n=1}^N w_nx_n) = sign(\mathbf{w} \cdot \mathbf{x})$$
so that we can predict the labels of unseen documents
\begin{align} & \textbf{Input:} \; D_{train} = \{(\mathbf{x}^1,y^1)...(\mathbf{x}^M,y^M)\}\\ & set\; \mathbf{w} = \mathbf{0} \\ & \mathbf{for} \; (\mathbf{x},y) \in D_{train} \; \mathbf{do}\\ & \quad predict \; \hat y = sign(\mathbf{w}\cdot \phi(\mathbf{x}))\\ & \quad \mathbf{if} \; \hat y \neq y \; \mathbf{then}\\ & \quad \quad \mathbf{if} \; \hat y\; \mathbf{is}\; 1 \; \mathbf{then}\\ & \quad \quad \quad update \; \mathbf{w} = \mathbf{w} - \phi(\mathbf{x})\\ & \quad \quad \mathbf{else}\\ & \quad \quad \quad update \; \mathbf{w} = \mathbf{w} + \phi(\mathbf{x})\\ & \mathbf{return} \; \mathbf{w} \end{align}
Given the following tweets labeled with sentiment:
Label | Tweet |
---|---|
negative | Very sad about Iran. |
negative | No Sat off...Need to work 6 days a week. |
negative | I’m a sad panda today. |
positive | such a beautiful satisfying day of bargain shopping. loves it. |
positive | who else is in a happy mood?? |
positive | actually quite happy today. |
What features would the perceptron find indicative of positive/negative class?
Would they generalize to unseen test data?
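One way to find out (a sketch, reusing the train_perceptron function sketched above and the same crude tokenizer):

tweets = [("Very sad about Iran.", -1),
          ("No Sat off...Need to work 6 days a week.", -1),
          ("I'm a sad panda today.", -1),
          ("such a beautiful satisfying day of bargain shopping. loves it.", 1),
          ("who else is in a happy mood??", 1),
          ("actually quite happy today.", 1)]
train_data = [(Counter(re.sub(r"[^\w']", " ", text.lower()).split()), label)
              for text, label in tweets]
print(train_perceptron(train_data))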
In NLP, no matter how large our training dataset, we will never see (enough of) all the words/features.
Bias: a feature that appears in each instance (always with value 1), letting the classifier learn a default preference for one class
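In code, the bias is just one extra feature added to every instance (an illustration; the feature name '__BIAS__' is made up):

bag_of_words['__BIAS__'] = 1  # fires on every instance; its weight acts as a default preference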
The standard way to evaluate our classifier is:
$$ Accuracy = \frac{correctLabels}{allInstances}$$
What could go wrong?
When one class is much more common than the other, predicting it always gives high accuracy.
Predicted/Correct | MinorityClass | MajorityClass |
---|---|---|
MinorityClass | TruePositive | FalsePositive |
MajorityClass | FalseNegative | TrueNegative |
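A concrete illustration with made-up counts: 100 test instances, 90 in the majority class, and a classifier that always predicts the majority class:

# Always predicting the majority class on an imbalanced test set (made-up counts)
tp, fp = 0, 0    # the minority class is never predicted
fn, tn = 10, 90  # all 10 minority instances are missed
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(accuracy)  # 0.9: high accuracy while never finding the minority class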
Discuss in pairs what features you would use in a classifier that predicts FAIL/PASS for an essay!