Course Description
Statistical Natural Language Processing (SNLP) has revolutionised Computational Linguistics in the last 20 years. Much of our recent success in building robust real-world NLP systems, e.g. for Machine Translation, Automatic Summarisation and Question Answering, can be attributed to techniques appropriated from Statistics and Machine Learning.
This course is a grab-bag of SNLP theory and applications. I will attempt to interleave basic statistics and machine learning theory applied to NLP with demos of applications which highlight some of the advantages and limitations of statistical techniques.
Prerequisites
This course assumes only senior high school level mathematics, but will build on it fairly rapidly. Even so, it will emphasise conceptual and intuitive understanding of the statistical theory rather than working through too many equations.
It assumes practically no background in linguistics or computer science.
Content
This course will sample the following five key areas:
- Basic statistics and information theory: including interpretation of probability; joint and conditional distributions; independence; random variables and expectations; information and uncertainty; entropy and mutual information.
Demos: extracting collocations from corpora.
- Machine Learning and Naive Bayes: including annotated training data; evaluation; the chain rule and Bayes' rule; relative frequencies and the Maximum Likelihood Estimate; smoothing and feature-based models.
Demos: Text Categorisation.
- Markov models: including n-grams; the Markov assumption; sequence tagging; Markov models and Hidden Markov Models; and the Viterbi algorithm.
Demos: Part-of-Speech tagging and Named Entity Recognition.
- Statistical Parsing: including Context-Free Grammars (CFGs); Probabilistic CFGs; and Combinatory Categorial Grammar (CCG).
Demos: C&C Combinatory Categorial Grammar Parsing.
- Distributional Similarity: including the distributional hypothesis; extracting contextual information; similarity measures and clustering.
Demos: Automatic Thesaurus Extraction.
Presenter
James Curran
Presenter Biography
James Curran is a senior lecturer in the School of Information Technologies at the University of Sydney and is the Research Leader in Language Technology for the Capital Markets Cooperative Research Centre (CMCRC).
James works in the computer science end of computational linguistics, focusing on robust statistical approaches to broad-coverage large-scale natural language processing (NLP). His interests range from the design of fundamental NLP components, including text processing and tagging tools, through to statistical parsers and high-level systems for question answering and information extraction.
Together with Stephen Clark, James has developed the C&C tools, a suite of extremely efficient state-of-the-art NLP tools, including taggers and a wide-coverage CCG parser. The C&C parser is orders of magnitude faster than similar parsers. The C&C tools have been publicly released, and over 500 researchers from over 200 institutions in 50 countries are using the tools.