Introduction to syntax. Context-free grammars and languages. Treebanks. Normal forms. Dependency grammars. Syntactic parsing: top-down and bottom-up. Structural ambiguity. Backtracking vs. dynamic programming for parsing. The CKY algorithm. The Earley algorithm. Probabilistic CFGs (PCFGs). PCFGs for disambiguation: the probabilistic CKY algorithm. PCFGs for language modeling.
Friday, April 5, 2013
Introduction to part-of-speech (POS) tagging. POS tagsets: the Penn Treebank tagset and the Google Universal Tagset. Rule-based POS tagging. Stochastic part-of-speech tagging. Hidden markov models. Deleted interpolation. Linear and logistic regression: Maximum Entropy models. Transformation-based POS tagging. Handling out-of-vocabulary words.
The third lecture was about language models. You discovered how important language models are and how we can approximate real language with them. N-gram models (unigrams, bigrams, trigrams) were discussed, together with their probability modeling and issues. We discussed perplexity and its close relationship with entropy, we introduced smoothing and interpolation techniques to deal with the issue of data sparsity.