IMPORTANT: This is the DIAG course named "Multilingual Natural Language Processing", usually taken by AI & Robotics, Computer Engineering and Data Science Master's students.
Semester: Spring 2024
When and where: from February 29 till May 31, 2024 on the following days:
- Thursday (8.00-10.00), S. Pietro in Vincoli, via delle Sette Sale, 29 (room/aula 41) **UPDATED**
- Friday (10.00-13.00), S. Pietro in Vincoli, via delle Sette Sale, 29 (room/aula 41)
Phone number: 06 77274109
Email: surname chiocciola diag plus uniroma1 plus it (if you are a human being, please replace plus with . and chiocciola with @)
TAs: Edoardo Barba, Luca Gioffrè, Luca Moroni, Lu Xu, more...
Basic information
The Multilingual Natural Language Processing course introduces a key field of Artificial Intelligence which deals with the automatic processing, understanding and generation of natural language. The course is taught in English. The student will understand the theoretical and practical fundamentals of how to process natural language automatically at the different levels of language modeling, part-of-speech tagging, syntax, semantics, generation and translation. Deep learning will be introduced and used pervasively throughout the course, also at a practical level through hands-on sessions with PyTorch.
The course is currently in the curriculum of the Master's Degree in AI and Robotics, Master's Degree in Engineering in Computer Science and Master's Degree in Computer Science and often attended by students of the Master's Degree in Data Science. There is also the possibility for students of linguistics and digital humanities to attend the course by establishing a personalized path in computational linguistics.
Textbooks
- Jurafsky and Martin. Speech and Language Processing, Prentice Hall, third edition.
- Jacob Eisenstein. Introduction to Natural Language Processing, MIT Press, 2019.
- Yoav Goldberg. Neural Network Methods for Natural Language Processing, Morgan & Claypool, 2017.
- Attending students: two homeworks, one to be delivered during the course and one to be delivered 7 days before each exam session and, in any case, by the September session. Attending students can submit homework 3 (assigned to non-attending students) as a bonus exercise, specifying clearly in the report that it is a bonus exercise (up to +5 on the final grade).
- Non-attending students, students taking the exam after the September session, and attending students who failed the homeworks: a project made up of three parts (two of which are the first two homeworks assigned during the class), to be delivered 7 days before each exam session (for instance, if the exam date is December 25th, the homework must be delivered by the 18th by midnight++; the ++ means we will not be strict with the submission hour).
Submission deadlines and presentation dates (all submissions via the Google form given in the project slides):
2024 exam dates (for both attending and non-attending students):
- June session: June 25, 2024 (8.30-13, room A4), homework delivery by June 18 (midnight, Anywhere on Earth (AoE); -1 point per day of delay until June 22 -- delay admitted only for this exam date)
- July session: July 22-23-24, 2024 (8-13, room A2 on July 22, A3 on July 23-24), homework delivery by July 15 (AoE)
- September session: September 18, 2024 (8-11, room A4), homework delivery by September 11th (AoE)
All exams consist of an oral, slide-based presentation of all the homeworks delivered. The presentation will be given in the classroom according to the following schedule:
- a 12-minute presentation of the 2 or 3 homeworks. Time is a very strict requirement.
- Up to 10 minutes for questions
Course outline
Introduction to NLP
Logistic Regression and its use for classification. Explicit vs. implicit features. The cross-entropy loss function.
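As a quick reference for this lecture, the binary cross-entropy loss minimized by logistic regression, in its standard textbook form, is:

```latex
\mathcal{L}(\mathbf{w}, b) = -\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \,\Big],
\qquad
\hat{y}_i = \sigma(\mathbf{w}^\top \mathbf{x}_i + b) = \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x}_i + b)}}
```

where y_i in {0, 1} is the gold label and ŷ_i is the probability predicted for example x_i.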
Machine Learning for NLP and intro to neural networks.
Introduction to Supervised, Unsupervised & Reinforcement Learning. The Supervised Learning framework. From real to computational: feature extraction and feature vectors. Feature engineering and inferred features. The perceptron model. What Deep Learning is, training weights and backpropagation.
First hands-on with PyTorch with language detection
Recap of the Supervised Learning framework, hands-on practice with PyTorch on the Language Detection Model: tensors, gradient tracking, the Dataset class, the Module class, the backward step, the training loop, evaluating a model.
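Below is a minimal sketch of the kind of PyTorch training loop seen in this hands-on session; the toy dataset, model and hyperparameters are illustrative placeholders, not the actual course notebook.

```python
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader

class ToyLangIDDataset(Dataset):
    """Illustrative dataset: one feature vector and one language id per example."""
    def __init__(self, features, labels):
        self.features = features   # tensor of shape (num_examples, num_features)
        self.labels = labels       # tensor of shape (num_examples,)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

class LanguageDetector(nn.Module):
    """A single linear layer mapping features to language logits."""
    def __init__(self, num_features, num_languages):
        super().__init__()
        self.linear = nn.Linear(num_features, num_languages)

    def forward(self, x):
        return self.linear(x)

# Random data, just to make the sketch runnable end to end.
X, y = torch.rand(100, 50), torch.randint(0, 3, (100,))
loader = DataLoader(ToyLangIDDataset(X, y), batch_size=16, shuffle=True)

model = LanguageDetector(num_features=50, num_languages=3)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):
    for batch_x, batch_y in loader:
        optimizer.zero_grad()                    # reset gradients from the previous step
        loss = loss_fn(model(batch_x), batch_y)  # forward pass + loss
        loss.backward()                          # backward step: compute gradients
        optimizer.step()                         # update the weights
```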
Word embeddings
word2vec (CBOW and skip-gram); PyTorch notebook on word2vec. More on word embeddings, their properties, etc. Longer-text embeddings. Multilinguality.
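A minimal sketch of the CBOW objective in PyTorch (predict the center word from the average of its context embeddings); vocabulary size, dimensions and data are placeholders.

```python
import torch
from torch import nn

class CBOW(nn.Module):
    """CBOW: average the context embeddings and predict the center word."""
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.output = nn.Linear(embedding_dim, vocab_size)

    def forward(self, context_ids):
        # context_ids: (batch, context_size) -> (batch, embedding_dim)
        context_vectors = self.embeddings(context_ids).mean(dim=1)
        return self.output(context_vectors)   # logits over the vocabulary

model = CBOW(vocab_size=5000, embedding_dim=100)
context = torch.randint(0, 5000, (8, 4))   # 8 examples, 4 context words each
center = torch.randint(0, 5000, (8,))      # the center words to predict
loss = nn.functional.cross_entropy(model(context), center)
loss.backward()
```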
Probabilistic and neural language modeling, RNNs and LSTMs
Probabilistic language modeling, the chain rule and n-gram estimation, an overview of smoothing.
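A toy illustration of count-based bigram estimation with add-alpha smoothing; the corpus and the helper function are invented for the example.

```python
from collections import Counter

# Tiny toy corpus; real estimation uses a large corpus.
corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1))

def bigram_prob(prev, word, alpha=1.0):
    """P(word | prev) with add-alpha smoothing, using the unigram count of prev as denominator."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab_size)

print(bigram_prob("the", "cat"))   # higher than bigram_prob("the", "sat"), which was never observed
```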
Recurrent neural networks (RNNs), Long Short-Term Memory networks (LSTMs).
Static vs. contextualized embeddings. Different inputs and outputs for RNNs.
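A minimal sketch of running an LSTM over a batch of embedded token sequences in PyTorch; the sizes are illustrative only.

```python
import torch
from torch import nn

vocab_size, embedding_dim, hidden_dim = 10_000, 128, 256   # illustrative sizes

embedding = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)

token_ids = torch.randint(0, vocab_size, (4, 12))   # batch of 4 sequences, 12 tokens each
outputs, (h_n, c_n) = lstm(embedding(token_ids))
# outputs: one contextualized vector per token -> (4, 12, 256), useful for tagging
# h_n:     final hidden state per sequence     -> (1, 4, 256), useful for classification
```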
Handbook of real-world classification problems; homework 1
More on LSTMs. Notebook on training, dev, test. Notebook on a real-world NLP problem.
The attention mechanism, the Transformer
Introduction to the attention mechanism.
Introduction to the Transformer architecture.
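For reference, a minimal implementation of scaled dot-product attention, the core operation of the Transformer; shapes are illustrative.

```python
import math
import torch

def scaled_dot_product_attention(queries, keys, values, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = queries.size(-1)
    scores = queries @ keys.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)   # attention distribution over the keys
    return weights @ values, weights

q = torch.rand(2, 5, 64)   # (batch, query positions, d_k)
k = torch.rand(2, 7, 64)   # (batch, key positions, d_k)
v = torch.rand(2, 7, 64)
context, attn = scaled_dot_product_attention(q, k, v)
print(context.shape)       # torch.Size([2, 5, 64])
```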
Transfer learning and pre-trained language models
Transfer learning
Pre-trained language models:
Encoders: BERT/mBERT, RoBERTa, XLM-R.
Sentence embeddings: baseline
Encoders-decoders: BART/mBART, T5/mT5
Decoders: GPT-2
Evaluation of Pre-trained Language Models
Overview of NLP libraries: Hugging Face Transformers, Datasets and Evaluate; Sentence Transformers. Monolingual benchmarks. Multilingual benchmarks. The meaning of superhuman performance in NLU.
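A minimal sketch of loading a pre-trained multilingual encoder with the Transformers library; xlm-roberta-base is just one public checkpoint, used here for illustration.

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")

# The same encoder handles different languages out of the box.
inputs = tokenizer(["NLP is fun", "L'elaborazione del linguaggio è divertente"],
                   padding=True, return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)   # (batch size, number of tokens, hidden size)
```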
Multilingual lexical semantics
Introduction to lexical semantics (Differences between discrete and continuous representations). Word Sense Disambiguation. Multilingual WSD. Computational resources: WordNet and BabelNet. Other tasks: Lexical substitution, Word-in-Context. PLM’s capability to separate word senses; Explicit and latent sense embeddings. Notebook on WordNet, BabelNet and multilingual WSD.
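A minimal sketch of querying WordNet (and its multilingual lemmas via the Open Multilingual Wordnet) with NLTK; this is a generic example, not the course notebook.

```python
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet")    # one-time download of the English WordNet data
nltk.download("omw-1.4")    # Open Multilingual Wordnet, for non-English lemmas

# The different senses (synsets) of an ambiguous word
for synset in wn.synsets("bank"):
    print(synset.name(), "-", synset.definition())

# Italian lemmas for the 'sloping land beside a body of water' sense
print(wn.synset("bank.n.01").lemma_names("ita"))
```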
seq2seq, Machine Translation
The encoder-decoder architecture, seq2seq. Introduction to machine translation (MT) and history of MT. Minimal overview of statistical MT. Beam search for decoding. Introduction to neural machine translation (NMT). The BLEU evaluation score. Performance and recent improvements. Attention in NMT. Disambiguation bias in machine translation and potential solutions.
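As a reference for the evaluation part, a minimal sketch of computing corpus-level BLEU with the sacrebleu package (one possible tool; the sentences are invented).

```python
import sacrebleu

hypotheses = ["the cat is on the mat"]        # one system output per segment
references = [["the cat sits on the mat"]]    # one reference stream, aligned with the hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)   # corpus-level BLEU, on a 0-100 scale
```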
Multilingual sentence-level semantics
Multilingual Semantic Role Labeling and Semantic Parsing.
Introduction to Question Answering. Homework 2 assignment.
Multilingual and Cross-lingual Retrieval-augmented NLP
Bi-encoder, cross-encoder. Retrieval Augmented Generation
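A minimal sketch of bi-encoder retrieval with the Sentence Transformers library; the checkpoint name and the texts are illustrative.

```python
from sentence_transformers import SentenceTransformer, util

# One public multilingual bi-encoder checkpoint; any sentence encoder would do.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

query = "Who wrote the Divine Comedy?"
passages = [
    "Dante Alighieri wrote the Divina Commedia in the 14th century.",
    "The Eiffel Tower is located in Paris.",
]

query_emb = model.encode(query, convert_to_tensor=True)        # encoded independently...
passage_embs = model.encode(passages, convert_to_tensor=True)  # ...hence "bi-encoder"

scores = util.cos_sim(query_emb, passage_embs)   # cosine similarity: higher = more relevant
print(scores)
```

A cross-encoder, by contrast, scores each (query, passage) pair jointly in a single forward pass, which is more accurate but too slow to score a whole collection, so it is typically used only to re-rank the top retrieved passages.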
Large Language Models (LLMs)
General models: LLaMA, Mistral, OLMo. Instruction fine-tuning, RLHF. Transfer learning only. Prompt engineering. Benchmarks.
Trends and problems in LLMs
Data scaling and processing. LoRA, quantization and distillation. Chinchilla law. Chain of Thought reasoning. Factuality, Hallucinations and Stochastic Parrots
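A minimal sketch of attaching LoRA adapters to a small causal language model with the PEFT library; the base model and the hyperparameters are illustrative placeholders.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Wrap a small causal LM with low-rank adapters instead of fine-tuning all weights.
model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # only the small adapter matrices are trainable
```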
Text summarization
Introduction to text summarization, extractive vs. abstractive. Evaluation metrics (BLEU, ROUGE, BERTScore, alternatives).
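A minimal sketch of computing ROUGE with the Hugging Face evaluate library (one possible tool; the texts are invented).

```python
import evaluate

rouge = evaluate.load("rouge")
predictions = ["the cat sat on the mat"]
references = ["a cat was sitting on the mat"]

scores = rouge.compute(predictions=predictions, references=references)
print(scores)   # rouge1, rouge2, rougeL, rougeLsum F-measures
```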
Entity Linking, Relation Extraction, closed IE