# CS224N: Natural Language Processing with Deep Learning

## Lecture 1 Introduction to NLP and Deep Learning

• Representations of NLP levels: Semantics
• Traditional NLP vs. DL (hand-crafted rules vs. sophisticated learned algorithms)
• Applications:
• Sentiment Analysis
• Dialogue agents / response generation

## Lecture 2 Word Vector Representations: word2vec

• "one-hot" representation, localist representation
• distributional similarity based representations
• "You shall know a word by the company it keeps” (J. R. Firth 1957:11)"
• dense vector for each word type, chosen so that it is good at predicting other words appearing in its context (gets a bit recursive)
• Learning neural network word embeddings
• model $p(\text{context} | w_t) = ?$
• loss function: $J = 1 - p(w_{-t}\mid w_{t})$, where $w_{-t}$ denotes the context words, i.e., all words other than the center word $w_t$.
• word2vec
• Skip-grams (SG) - predict context words given target center words
• Continuous Bag of Words (CBOW) - predict target center word from bag-of-words context words
• 2 training methods
• hierarchical softmax
• negative sampling: train a binary logistic regression for a true pair versus a couple of noise pairs.
• Core ideas of SG prediction
• maximize the model's probability $p(\text{context} \mid w_t)$ for all context words, expressed through the cost function $J(\theta)$.
• cost function (likelihood of all context windows): $$J'(\theta) = \prod_{t=1}^T\prod_{\substack{-m \le j \le m \\ j \ne 0}} p(w_{t+j}\mid w_t; \theta)$$
• negative log-likelihood: $$J(\theta) = -\frac{1}{T} \sum_{t=1}^T\sum_{\substack{-m \le j \le m \\ j \ne 0}} \log p(w_{t+j}\mid w_t)$$
• softmax $$p(o|c) = \frac{\exp(u_o^T v_c)}{\sum_{w=1}^{V}\exp(u_w^T v_c)}$$
• What it really means to train a word2vec model
• optimize the parameters $\theta \in \mathbb{R}^{2 d V}$, where $d$ is the word vector dimension and $V$ is the vocabulary size; each word is represented by 2 vectors! (see the numpy sketch below)
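A minimal numpy sketch of the skip-gram pieces above: the softmax probability $p(o|c)$ and one pair's contribution to the negative log-likelihood. The variable names (`U`, `V_emb`, `center_idx`, `outside_idx`) and the toy sizes are illustrative assumptions, not from the lecture.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a vector."""
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

# Toy setup: vocabulary of V words, d-dimensional vectors.
V, d = 10, 4
rng = np.random.default_rng(0)
U = rng.normal(size=(V, d))      # "outside" (context) vectors u_w
V_emb = rng.normal(size=(V, d))  # "center" vectors v_w -> 2 vectors per word, so theta has 2*d*V entries

def p_outside_given_center(o, c):
    """softmax p(o|c) = exp(u_o . v_c) / sum_w exp(u_w . v_c)"""
    scores = U @ V_emb[c]        # u_w^T v_c for every word w in the vocabulary
    return softmax(scores)[o]

# One (center, outside) pair's contribution to the negative log-likelihood J(theta).
center_idx, outside_idx = 3, 7
loss = -np.log(p_outside_given_center(outside_idx, center_idx))
print(loss)
```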

## Lecture 3 Advanced Word Vector Representations

• Compare count based and direct prediction
• count based: LSA, HAL (Lund & Burgess), COALS (Rohde et al), Hellinger-PCA (Lebret & Collobert)
• Fast training
• Efficient usage of statistics
• Primarily used to capture word similarity
• Disproportionate importance given to large counts
• direct prediction: NNLM, HLBL, RNN, Skip-gram/CBOW (Bengio et al.; Collobert & Weston; Huang et al.; Mnih & Hinton; Mikolov et al.; Mnih & Kavukcuoglu)
• Scales with corpus size
• Inefficient usage of statistics
• Can capture complex patterns beyond word similarity
• Generate improved performance on other tasks
• Combining the best of both worlds: GloVe
• Fast training
• Scalable to huge corpora
• Good performance even with small corpus, and small vectors
• How to evaluate word2vec? (a small analogy-test sketch follows this list)
• Intrinsic:
• Evaluation on a specific/intermediate subtask
• Fast to compute
• Helps to understand that system
• Not clear if really helpful unless correlation to real task is established
• Extrinsic:
• Evaluation on a real task
• Can take a long time to compute accuracy
• Unclear if the subsystem is the problem, or its interaction with other subsystems
• If replacing exactly one subsystem with another improves accuracy --> Winning!
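A hedged sketch of one common intrinsic evaluation, the word-analogy test ("a is to b as c is to ?") solved by cosine similarity. The toy vocabulary and the random vectors are made up for illustration; with trained word2vec/GloVe vectors the nearest word would ideally be the analogy answer.

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a, b, c, vecs, vocab):
    """Return the word d maximizing cos(vec_b - vec_a + vec_c, vec_d),
    excluding the three query words themselves."""
    target = vecs[vocab[b]] - vecs[vocab[a]] + vecs[vocab[c]]
    best, best_sim = None, -np.inf
    for w, i in vocab.items():
        if w in (a, b, c):
            continue
        sim = cosine(target, vecs[i])
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# Toy vocabulary and random vectors (stand-ins for trained embeddings).
vocab = {w: i for i, w in enumerate(["king", "queen", "man", "woman", "apple"])}
vecs = np.random.default_rng(1).normal(size=(len(vocab), 50))
print(analogy("man", "king", "woman", vecs, vocab))  # ideally "queen" with real vectors
```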

## Assignment 1 (Spring 2019)

• Singular Value Decomposition (SVD) is a kind of generalized PCA (Principal Components Analysis).
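A tiny numpy demonstration of that connection (my own illustration, not from the assignment): the SVD of the centered data matrix yields the principal directions and explained variances that PCA would report.

```python
import numpy as np

# PCA via SVD of the centered data matrix.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))          # 100 samples, 5 features
Xc = X - X.mean(axis=0)                # center each feature

# SVD: Xc = U S V^T ; rows of V^T are the principal directions,
# and S**2 / (n - 1) are the variances PCA would report.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained_variance = S**2 / (X.shape[0] - 1)

# Project onto the top-2 principal components.
X_pca = Xc @ Vt[:2].T
print(explained_variance[:2], X_pca.shape)
```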

### Review materials

• Singular Value Decomposition (SVD)
• cross entropy loss
• max-margin loss
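For quick reference, a minimal sketch of the two losses listed above, with made-up toy inputs:

```python
import numpy as np

def cross_entropy(probs, true_idx):
    """Cross-entropy loss -log p(true class) for a predicted distribution."""
    return -np.log(probs[true_idx])

def max_margin(s, s_corrupted, margin=1.0):
    """Max-margin loss max(0, margin - s + s_corrupted): push the true score
    above the corrupted score by at least `margin`."""
    return max(0.0, margin - s + s_corrupted)

# Toy usage.
probs = np.array([0.1, 0.7, 0.2])          # e.g. output of a softmax
print(cross_entropy(probs, true_idx=1))    # -log 0.7
print(max_margin(s=2.0, s_corrupted=1.5))  # 0.5: margin not yet satisfied
```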

## Lecture 4 Word Window Classification and Neural Networks

• Window classification: train a softmax classifier to assign a label to a center word, taking as input the concatenation of the word vectors in the window around it (see the sketch after this list).
• max-margin loss: $J(\theta) = \max(0, 1 - s + s_{corrupted})$, where $s$ is the score of the true (good) window and $s_{corrupted}$ the score of a corrupted (bad) window; we want the corrupted score to satisfy $s_{corrupted} \le s - 1$.
• backpropagation:
• insight: reuse the derivative computed previously
• Hadamard product ($\circ, \odot, \otimes$)
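A minimal sketch of window classification as described above, assuming a 5-word window, random toy embeddings, and a single softmax layer (all names and sizes here are illustrative):

```python
import numpy as np

# Toy window classification: classify the center word of a 5-word window
# by concatenating the word vectors in the window and applying a softmax layer.
rng = np.random.default_rng(3)
V, d, window, n_classes = 20, 10, 5, 3
embeddings = rng.normal(size=(V, d))            # one d-dim vector per word
W = rng.normal(size=(n_classes, window * d))    # softmax weights over the concatenated window
b = np.zeros(n_classes)

window_word_ids = [4, 7, 1, 9, 2]               # made-up word indices; the middle one is the center word
x = np.concatenate([embeddings[i] for i in window_word_ids])  # shape (window * d,)

logits = W @ x + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs)   # p(class | window), e.g. an NER label for the center word
```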

## Lecture 5 Backpropagation (Feb 24, 2019)

### Details of backpropagation

The backprop algorithm essentially computes the gradient (partial derivatives) of the cost function with respect to all the parameters $U, W, b, x$.

With the following setup:

• max-margin cost function: $J = \max(0, 1 - s + s_c)$
• Scores: $s = U^T f(Wx + b), s_c = U^T f(Wx_c + b)$
• layers: pre-activation $z = Wx + b$, hidden activation $a = f(z)$, score $s = U^T a$
• Derivatives:
• $\frac{\partial s}{\partial U} = \frac{\partial}{\partial U} U^T a = a$
• wrt one weight $W_{ij}$: $\frac{\partial s}{\partial W_{ij}} = \delta_i x_j$, where $\delta_i = U_i f'(z_i)$ and $f'(z) = f(z)(1 - f(z))$ when $f$ is the logistic (sigmoid) function.
• wrt all weights $W$: $\frac{\partial s}{\partial W} = \delta x^T$
• wrt word vectors $x$: $\frac{\partial s}{\partial x} = W^T\delta$
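A minimal numpy check of these derivatives under the setup above, assuming logistic $f$ and toy dimensions chosen only for illustration; the finite-difference comparison at the end confirms the $\partial s/\partial W$ formula.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy single-hidden-layer scorer s = U^T f(Wx + b).
rng = np.random.default_rng(4)
d_in, d_h = 6, 4
x = rng.normal(size=d_in)
W = rng.normal(size=(d_h, d_in))
b = rng.normal(size=d_h)
U = rng.normal(size=d_h)

# Forward pass.
z = W @ x + b          # pre-activation
a = sigmoid(z)         # hidden activation
s = U @ a              # score

# Backward pass (gradients of s).
dU = a                                       # ds/dU = a
delta = U * sigmoid(z) * (1 - sigmoid(z))    # delta_i = U_i f'(z_i)
dW = np.outer(delta, x)                      # ds/dW = delta x^T
db = delta                                   # ds/db = delta
dx = W.T @ delta                             # ds/dx = W^T delta

# Finite-difference check of one entry of dW.
eps = 1e-6
W2 = W.copy(); W2[1, 2] += eps
s2 = U @ sigmoid(W2 @ x + b)
print(dW[1, 2], (s2 - s) / eps)              # should be nearly identical
```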