Syllabus

IMPORTANT: This is the DIAG course "Multilingual Natural Language Processing", usually taken by Master's students in AI & Robotics, Computer Engineering, and Data Science.
 
Semester: Spring 2024
When and where: from February 29 to May 31, 2024, on the following days:
  • Thursday (8.00-10.00), S. Pietro in Vincoli, via delle Sette Sale, 29 (room/aula 41) **UPDATED**
  • Friday (10.00-13.00), S. Pietro in Vincoli, via delle Sette Sale, 29 (room/aula 41)
Contact information

Instructor: Prof. Roberto Navigli
Office: room B119, via Ariosto, 25
Phone number: 06 77274109
Email: surname chiocciola diag plus uniroma1 plus it (if you are a human being, please replace plus with . and chiocciola with @)
TAs: Edoardo Barba, Luca Gioffrè, Luca Moroni, Lu Xu, more...

Basic information

The Multilingual Natural Language Processing course introduces a key field of Artificial Intelligence: the automatic processing, understanding and generation of natural language. The course is taught in English. Students will learn the theoretical and practical fundamentals of processing natural language automatically at different levels: language modeling, part-of-speech tagging, syntax, semantics, generation, and translation. Deep learning will be introduced and used pervasively throughout the course, also at a practical level, through hands-on PyTorch sessions.

The course is currently in the curricula of the Master's Degrees in AI and Robotics, Engineering in Computer Science, and Computer Science, and is often attended by students of the Master's Degree in Data Science. Students of linguistics and digital humanities may also attend the course by arranging a personalized path in computational linguistics.

Textbooks

Exams

  • Attending students: two homeworks, one delivered during the course and one delivered 7 days before the chosen exam session and, in any case, by the September session. Attending students may also submit homework 3 (the one assigned to non-attending students) as a bonus exercise, stating clearly in the report that it is a bonus exercise (worth up to +5 on the final grade).
  • Non-attending students, students taking the exam after September, and attending students who failed the homeworks: a project made up of three parts (two of which are the first two homeworks assigned during the class), to be delivered 7 days before each exam session. For instance, if the exam date is December 25th, the homework must be delivered by the 18th, by midnight++ (the ++ means we will not be strict about the submission hour).
See the project slides (provided during the course) for information about the homeworks.
Submission deadlines and presentation dates (all submissions via the Google form given in the project slides):
  
2024 exam dates (for both attending and non-attending students):
  • June session: June 25, 2024 (8.30-13, room A4), homework delivery by June 18 (midnight Anywhere on Earth, AoE; -1 point per day of delay until June 22 -- delay admitted only for this exam date)
  • July session: July 22-23-24, 2024 (8-13, room A2 on July 22, A3 on July 23-24), homework delivery by July 15 (AoE)
  • September session: September 18, 2024 (8-11, room A4), homework delivery by September 11th (AoE)
All exams consist of an oral, slide-based presentation of all the homeworks delivered. The presentation is given in the classroom with slides, according to the following schedule:

- A 12-minute presentation of the 2 or 3 homeworks. The time limit is a very strict requirement.
- Up to 10 minutes for questions.

Course outline

    Introduction to NLP

    Logistic Regression and its use for classification. Explicit vs. implicit features. The cross-entropy loss function.
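
    A minimal sketch of the idea in PyTorch (toy data and hyperparameters invented for illustration): logistic regression is a single linear layer whose output goes through a sigmoid and is trained with the cross-entropy loss.

```python
import torch
import torch.nn as nn

# Toy binary classification data: 2-dimensional feature vectors (made up).
X = torch.tensor([[0.5, 1.2], [1.0, 0.8], [-0.7, -1.1], [-1.3, -0.4]])
y = torch.tensor([1.0, 1.0, 0.0, 0.0])

model = nn.Linear(2, 1)           # logistic regression = one linear layer
loss_fn = nn.BCEWithLogitsLoss()  # numerically stable sigmoid + cross-entropy
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(100):
    optimizer.zero_grad()
    logits = model(X).squeeze(1)  # raw scores, shape (4,)
    loss = loss_fn(logits, y)     # binary cross-entropy on the logits
    loss.backward()               # backpropagate the gradients
    optimizer.step()              # gradient descent update

print(torch.sigmoid(model(X)).squeeze(1))  # predicted probabilities
```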

    Machine Learning for NLP and intro to neural networks.


    Introduction to Supervised, Unsupervised & Reinforcement Learning. The Supervised Learning framework. From real to computational: feature extraction and feature vectors. Feature engineering and inferred features. The perceptron model. What deep learning is, training weights, and backpropagation.
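
    A minimal sketch of the classic perceptron learning rule (toy, linearly separable AND-style data, invented for illustration): the weights move toward misclassified examples until every point is on the correct side of the boundary.

```python
import numpy as np

# Toy linearly separable data (made up): AND-like labels in {-1, +1}.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])

w = np.zeros(2)   # weight vector
b = 0.0           # bias

# Classic perceptron rule: update only on misclassified examples.
for epoch in range(10):
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:   # wrong (or zero-margin) prediction
            w += yi * xi             # nudge the boundary toward the example
            b += yi

print(w, b, np.sign(X @ w + b))      # learned parameters and predictions
```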

    First hands-on session with PyTorch: language detection

    Recap of the Supervised Learning framework, hands-on practice with PyTorch on the language detection model: tensors, gradient tracking, the Dataset class, the Module class, the backward step, the training loop, and evaluating a model.
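
    The session covers roughly the skeleton below; this is a self-contained sketch with synthetic data and an invented toy model, not the actual course notebook.

```python
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader

# A minimal Dataset wrapping synthetic (feature vector, label) pairs.
class ToyDataset(Dataset):
    def __init__(self, n=64):
        self.x = torch.randn(n, 8)               # random stand-in "features"
        self.y = (self.x.sum(dim=1) > 0).long()  # synthetic binary labels
    def __len__(self):
        return len(self.x)
    def __getitem__(self, i):
        return self.x[i], self.y[i]

# A minimal Module: one hidden layer, two output classes.
class ToyClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
    def forward(self, x):
        return self.net(x)

loader = DataLoader(ToyDataset(), batch_size=16, shuffle=True)
model = ToyClassifier()
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(5):                 # the training loop
    for xb, yb in loader:
        opt.zero_grad()                # reset accumulated gradients
        loss = loss_fn(model(xb), yb)  # forward pass + loss
        loss.backward()                # the backward step
        opt.step()                     # parameter update
```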

    Word embeddings

    word2vec (CBOW and skip-gram), PyTorch notebook on word2vec. More on word embeddings, their properties, etc. Longer-text embeddings. Multilinguality.
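
    A minimal skip-gram sketch in PyTorch (tiny invented corpus; a full softmax instead of the negative sampling used in practice):

```python
import torch
from torch import nn

# Tiny corpus and vocabulary (made up).
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# Skip-gram pairs: predict each context word from the center word (window = 1).
pairs = [(idx[corpus[i]], idx[corpus[j]])
         for i in range(len(corpus))
         for j in (i - 1, i + 1) if 0 <= j < len(corpus)]

emb_dim = 16
center = nn.Embedding(len(vocab), emb_dim)        # center-word vectors (the embeddings)
out = nn.Linear(emb_dim, len(vocab), bias=False)  # scores over context words
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam(list(center.parameters()) + list(out.parameters()), lr=0.05)

centers = torch.tensor([c for c, _ in pairs])
contexts = torch.tensor([c for _, c in pairs])
for step in range(200):
    opt.zero_grad()
    loss = loss_fn(out(center(centers)), contexts)  # softmax over the whole vocabulary
    loss.backward()
    opt.step()

print(center.weight[idx["cat"]].detach()[:5])  # first dimensions of the vector for "cat"
```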

    Probabilistic and neural language modeling, RNNs and LSTMs

    Probabilistic language modeling, the chain rule and n-gram estimation, basics of smoothing (see the sketch at the end of this block).
    Recurrent neural networks (RNNs) and Long Short-Term Memory networks (LSTMs).
    Static vs. contextualized embeddings. Different inputs and outputs for RNNs.
    Hands-on with a real-world classification problem; homework 1.
    More on LSTMs. Notebook on training/dev/test splits. Notebook on a real-world NLP problem.
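
    The sketch referenced above: a bigram language model with add-one (Laplace) smoothing on an invented two-sentence corpus, using the chain rule to score a sentence.

```python
from collections import Counter

# Tiny corpus (made up); <s> and </s> mark sentence boundaries.
sentences = [["<s>", "the", "cat", "sat", "</s>"],
             ["<s>", "the", "dog", "sat", "</s>"]]

unigrams = Counter(w for s in sentences for w in s)
bigrams = Counter((s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1))
V = len(unigrams)  # vocabulary size

def p(w2, w1):
    """Add-one smoothed bigram probability P(w2 | w1)."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

# Chain rule: P(<s> the cat sat </s>) as a product of bigram probabilities.
sent = ["<s>", "the", "cat", "sat", "</s>"]
prob = 1.0
for w1, w2 in zip(sent, sent[1:]):
    prob *= p(w2, w1)
print(prob)
```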

    The attention mechanism, the Transformer

    Introduction to the attention mechanism (a sketch follows this list).
    Introduction to the Transformer architecture.
    Transfer learning and pre-trained language models.
    Pre-trained language models:
      • Encoders: BERT/mBERT, RoBERTa, XLM-R
      • Sentence embeddings: baselines
      • Encoder-decoders: BART/mBART, T5/mT5
      • Decoders: GPT-2
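
    The sketch referenced above: scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, applied here as self-attention over random stand-in activations.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # query-key similarities
    weights = torch.softmax(scores, dim=-1)            # attention distribution per query
    return weights @ V                                 # weighted sum of the values

x = torch.randn(5, 8)                        # one "sentence": 5 tokens, d_model = 8
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V = x
print(out.shape)                             # torch.Size([5, 8])
```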

    Evaluation of Pre-trained Language Models

    Overview of NLP libraries: Hugging Face Transformers, Datasets and Evaluate; Sentence Transformers. Monolingual benchmarks. Multilingual benchmarks. The meaning of superhuman performance in NLU.
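
    A first taste of the Transformers library: masked-token prediction with a multilingual encoder (the checkpoint name is one public model among many; the first run downloads it).

```python
from transformers import pipeline

# Fill-in-the-blank with a multilingual BERT encoder.
fill = pipeline("fill-mask", model="bert-base-multilingual-cased")
for pred in fill("Rome is the [MASK] of Italy."):
    print(pred["token_str"], round(pred["score"], 3))
```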

    Multilingual lexical semantics

    Introduction to lexical semantics (differences between discrete and continuous representations). Word Sense Disambiguation (WSD). Multilingual WSD. Computational resources: WordNet and BabelNet. Other tasks: lexical substitution, Word-in-Context. PLMs' capability to separate word senses; explicit and latent sense embeddings. Notebook on WordNet, BabelNet and multilingual WSD.
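
    A minimal look at WordNet through NLTK: the senses of a polysemous word, and the lemmas of one sense in another language (the WordNet and Open Multilingual WordNet data must be downloaded first, as shown).

```python
import nltk
nltk.download("wordnet", quiet=True)   # one-time resource downloads
nltk.download("omw-1.4", quiet=True)
from nltk.corpus import wordnet as wn

# Each synset is one sense of the polysemous word "bank".
for synset in wn.synsets("bank")[:4]:
    print(synset.name(), "-", synset.definition())

# Italian lemmas for the river-bank sense, via the Open Multilingual WordNet.
print(wn.synset("bank.n.01").lemma_names("ita"))
```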

    Seq2seq and Machine Translation

    The encoder-decoder architecture, seq2seq. Introduction to machine translation (MT) and the history of MT. Minimal overview of statistical MT. Beam search for decoding. Introduction to neural machine translation (NMT); attention in NMT. The BLEU evaluation score. Performance and recent improvements. Disambiguation bias in machine translation and potential solutions.
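
    A minimal beam search sketch over a hand-written toy next-token table, standing in for a trained translation or language model:

```python
import math

# Toy next-token distributions (made up); a real decoder would come from an NMT model.
def next_probs(prefix):
    table = {
        (): {"the": 0.6, "a": 0.4},
        ("the",): {"cat": 0.5, "dog": 0.3, "</s>": 0.2},
        ("a",): {"cat": 0.3, "dog": 0.6, "</s>": 0.1},
    }
    return table.get(prefix, {"</s>": 1.0})

def beam_search(beam_size=2, max_len=5):
    beams = [((), 0.0)]  # (prefix, log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == "</s>":  # finished hypothesis: keep as-is
                candidates.append((prefix, score))
                continue
            for tok, p in next_probs(prefix).items():
                candidates.append((prefix + (tok,), score + math.log(p)))
        # Prune: keep only the beam_size highest-scoring hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams

for prefix, score in beam_search():
    print(" ".join(prefix), round(score, 3))
```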

    Multilingual sentence-level semantics

    Multilingual Semantic Role Labeling and Semantic Parsing.

    Introduction to Question Answering. Homework 2 assignment.

    Multilingual and Cross-lingual Retrieval-augmented NLP

    Bi-encoders vs. cross-encoders. Retrieval-Augmented Generation (RAG).
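
    A minimal bi-encoder retrieval sketch with Sentence Transformers (the checkpoint is one public multilingual model among many; the documents and query are invented). Each text is embedded independently and compared by cosine similarity, which is what makes bi-encoders cheap over large collections.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

docs = ["Rome is the capital of Italy.",
        "La Torre Eiffel si trova a Parigi.",        # Italian
        "Berlin ist die Hauptstadt Deutschlands."]   # German
query = "Where is the Eiffel Tower?"

# Bi-encoder: documents and query are encoded independently.
doc_emb = model.encode(docs, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_emb, doc_emb)[0]  # one cosine score per document
best = int(scores.argmax())
print(docs[best], float(scores[best]))        # cross-lingual match on the Italian document
```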

    Large Language Models (LLMs)

    General models: LLaMA, Mistral, OLMo. Instruction fine-tuning, RLHF. Transfer learning only. Prompt engineering. Benchmarks.
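
    A minimal few-shot prompting sketch: the "training examples" live entirely in the prompt. gpt2 is used here only because it is small and public; the instruction-tuned LLMs discussed in class follow such prompts far more reliably.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Few-shot prompt: task description plus in-context examples (made up).
prompt = (
    "Translate English to Italian.\n"
    "English: good morning\nItalian: buongiorno\n"
    "English: thank you\nItalian: grazie\n"
    "English: good night\nItalian:"
)
out = generator(prompt, max_new_tokens=5, do_sample=False)
print(out[0]["generated_text"][len(prompt):])  # only the continuation
```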

    Trends and problems in LLMs

    Data scaling and processing. LoRA, quantization, and distillation. The Chinchilla scaling law. Chain-of-Thought reasoning. Factuality, hallucinations, and stochastic parrots.
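
    A minimal sketch of the LoRA idea on a single linear layer (a hypothetical class, not the peft library): the pre-trained weight is frozen and only a low-rank update BA is trained, so the effective weight is W + (alpha/r) * BA.

```python
import torch
from torch import nn

class LoRALinear(nn.Module):
    """Hypothetical illustration: a frozen nn.Linear plus a trainable rank-r update."""
    def __init__(self, base: nn.Linear, r=4, alpha=8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)    # freeze the pre-trained weight and bias
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(trainable, "of", total, "parameters are trainable")  # 4096 of 266752
```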

    Text summarization

    Introduction to text summarization: extractive vs. abstractive approaches. Evaluation metrics (BLEU, ROUGE, BERTScore, and alternatives).
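
    A minimal ROUGE-1 sketch from clipped unigram overlap (no stemming or synonym handling, unlike full implementations):

```python
from collections import Counter

def rouge1(candidate: str, reference: str):
    """ROUGE-1 precision, recall and F1 from clipped unigram matches."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f1

print(rouge1("the cat sat on the mat", "the cat lay on the mat"))  # ~(0.833, 0.833, 0.833)
```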

    Closed Information Extraction

    Entity Linking, Relation Extraction, and closed Information Extraction.

