Syllabus

IMPORTANT: This is the DIAG course "Multilingual Natural Language Processing", usually taken by Master's students in AI & Robotics, Computer Engineering, and Data Science.
 
Semester: Spring 2024
When and where: from February 29 to May 31, 2024, on the following days:
  • Thursday (8.00-10.00), S. Pietro in Vincoli, via delle Sette Sale, 29 (room/aula 41) **UPDATED**
  • Friday (10.00-13.00), S. Pietro in Vincoli, via delle Sette Sale, 29 (room/aula 41)
Contact information

Instructor: Prof. Roberto Navigli
Office: room B119, via Ariosto, 25
Phone number: 06 77274109
Email: surname chiocciola diag plus uniroma1 plus it (if you are a human being, please replace plus with . and chiocciola with @)
TAs: Edoardo Barba, Luca Gioffrè, Luca Moroni, Lu Xu, more...

Basic information

The Multilingual Natural Language Processing course introduces a key field of Artificial Intelligence: the automatic processing, understanding and generation of natural language. The course is taught in English. Students will learn the theoretical and practical fundamentals of processing natural language automatically at different levels: language modeling, part-of-speech tagging, syntax, semantics, generation, and translation. Deep learning will be introduced and used pervasively throughout the course, also at a practical level, through hands-on PyTorch sessions.

The course is currently in the curricula of the Master's Degrees in AI and Robotics, Engineering in Computer Science, and Computer Science, and is often attended by students of the Master's Degree in Data Science. Students of linguistics and digital humanities may also attend the course by arranging a personalized path in computational linguistics.

Textbooks

Exams

  • Attending students: two homeworks, one delivered during the course and one delivered 7 days before the chosen exam session and, in any case, by the September session. Attending students may also submit homework 3 (the one assigned to non-attending students) as a bonus exercise, stating clearly in the report that it is a bonus exercise (worth up to +5 on the final grade).
  • Non-attending students, students taking the exam after September, and attending students who failed the homeworks: a project made up of three parts (two of which are the first two homeworks assigned during the class), to be delivered 7 days before each exam session. For instance, if the exam date is December 25th, the homework must be delivered by the 18th, by midnight++ (the ++ means we will not be strict about the submission hour).
See the project slides (provided during the course) for information about the homeworks.
Submission deadlines and presentation dates (all submissions via the Google form given in the project slides):
  
2024 exam dates (for both attending and non-attending students):
  • June session: June 25, 2024 (8.30-13, room A4), homework delivery by June 18 (midnight Anywhere on Earth, AoE; -1 point per day of delay until June 22 -- delay admitted only for this exam date)
  • July session: July 22-23-24, 2024 (8-13, room A2 on July 22, A3 on July 23-24), homework delivery by July 15 (AoE)
  • September session: September 18, 2024 (8-11, room A4), homework delivery by September 11th (AoE)
All exams consist of an oral, slide-based presentation of all the homeworks delivered. The presentation is given in the classroom with slides, according to the following schedule:

- A 12-minute presentation of the 2 or 3 homeworks. The time limit is a very strict requirement.
- Up to 10 minutes for questions.

Course outline

    Introduction to NLP

    Logistic Regression and its use for classification. Explicit vs. implicit features. The cross-entropy loss function.
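
    A minimal sketch of the idea in PyTorch (toy data and hyperparameters invented for illustration): logistic regression is a single linear layer whose output goes through a sigmoid and is trained with the cross-entropy loss.

```python
import torch
import torch.nn as nn

# Toy binary classification data: 2-dimensional feature vectors (made up).
X = torch.tensor([[0.5, 1.2], [1.0, 0.8], [-0.7, -1.1], [-1.3, -0.4]])
y = torch.tensor([1.0, 1.0, 0.0, 0.0])

model = nn.Linear(2, 1)           # logistic regression = one linear layer
loss_fn = nn.BCEWithLogitsLoss()  # numerically stable sigmoid + cross-entropy
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(100):
    optimizer.zero_grad()
    logits = model(X).squeeze(1)  # raw scores, shape (4,)
    loss = loss_fn(logits, y)     # binary cross-entropy on the logits
    loss.backward()               # backpropagate the gradients
    optimizer.step()              # gradient descent update

print(torch.sigmoid(model(X)).squeeze(1))  # predicted probabilities
```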

    Machine Learning for NLP and intro to neural networks.


    Introduction to Supervised, Unsupervised & Reinforcement Learning. The Supervised Learning framework. From real to computational: feature extraction and feature vectors. Feature engineering and inferred features. The perceptron model. What deep learning is, training weights, and backpropagation.
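
    A minimal sketch of the classic perceptron learning rule (toy, linearly separable AND-style data, invented for illustration): the weights move toward misclassified examples until every point is on the correct side of the boundary.

```python
import numpy as np

# Toy linearly separable data (made up): AND-like labels in {-1, +1}.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])

w = np.zeros(2)   # weight vector
b = 0.0           # bias

# Classic perceptron rule: update only on misclassified examples.
for epoch in range(10):
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:   # wrong (or zero-margin) prediction
            w += yi * xi             # nudge the boundary toward the example
            b += yi

print(w, b, np.sign(X @ w + b))      # learned parameters and predictions
```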

    First hands-on session with PyTorch: language detection

    Recap of the Supervised Learning framework, hands-on practice with PyTorch on the language detection model: tensors, gradient tracking, the Dataset class, the Module class, the backward step, the training loop, and evaluating a model.
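
    The session covers roughly the skeleton below; this is a self-contained sketch with synthetic data and an invented toy model, not the actual course notebook.

```python
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader

# A minimal Dataset wrapping synthetic (feature vector, label) pairs.
class ToyDataset(Dataset):
    def __init__(self, n=64):
        self.x = torch.randn(n, 8)               # random stand-in "features"
        self.y = (self.x.sum(dim=1) > 0).long()  # synthetic binary labels
    def __len__(self):
        return len(self.x)
    def __getitem__(self, i):
        return self.x[i], self.y[i]

# A minimal Module: one hidden layer, two output classes.
class ToyClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
    def forward(self, x):
        return self.net(x)

loader = DataLoader(ToyDataset(), batch_size=16, shuffle=True)
model = ToyClassifier()
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(5):                 # the training loop
    for xb, yb in loader:
        opt.zero_grad()                # reset accumulated gradients
        loss = loss_fn(model(xb), yb)  # forward pass + loss
        loss.backward()                # the backward step
        opt.step()                     # parameter update
```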

    Word embeddings

    word2vec (CBOW and skip-gram), PyTorch notebook on word2vec. More on word embeddings, their properties, etc. Longer-text embeddings. Multilinguality.
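
    A minimal skip-gram sketch in PyTorch (tiny invented corpus; a full softmax instead of the negative sampling used in practice):

```python
import torch
from torch import nn

# Tiny corpus and vocabulary (made up).
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# Skip-gram pairs: predict each context word from the center word (window = 1).
pairs = [(idx[corpus[i]], idx[corpus[j]])
         for i in range(len(corpus))
         for j in (i - 1, i + 1) if 0 <= j < len(corpus)]

emb_dim = 16
center = nn.Embedding(len(vocab), emb_dim)        # center-word vectors (the embeddings)
out = nn.Linear(emb_dim, len(vocab), bias=False)  # scores over context words
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam(list(center.parameters()) + list(out.parameters()), lr=0.05)

centers = torch.tensor([c for c, _ in pairs])
contexts = torch.tensor([c for _, c in pairs])
for step in range(200):
    opt.zero_grad()
    loss = loss_fn(out(center(centers)), contexts)  # softmax over the whole vocabulary
    loss.backward()
    opt.step()

print(center.weight[idx["cat"]].detach()[:5])  # first dimensions of the vector for "cat"
```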

    Probabilistic and neural language modeling, RNNs and LSTMs

    Probabilistic language modeling, the chain rule and n-gram estimation, basics of smoothing (see the sketch at the end of this block).
    Recurrent neural networks (RNNs) and Long Short-Term Memory networks (LSTMs).
    Static vs. contextualized embeddings. Different inputs and outputs for RNNs.
    Hands-on with a real-world classification problem; homework 1.
    More on LSTMs. Notebook on training/dev/test splits. Notebook on a real-world NLP problem.
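
    The sketch referenced above: a bigram language model with add-one (Laplace) smoothing on an invented two-sentence corpus, using the chain rule to score a sentence.

```python
from collections import Counter

# Tiny corpus (made up); <s> and </s> mark sentence boundaries.
sentences = [["<s>", "the", "cat", "sat", "</s>"],
             ["<s>", "the", "dog", "sat", "</s>"]]

unigrams = Counter(w for s in sentences for w in s)
bigrams = Counter((s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1))
V = len(unigrams)  # vocabulary size

def p(w2, w1):
    """Add-one smoothed bigram probability P(w2 | w1)."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

# Chain rule: P(<s> the cat sat </s>) as a product of bigram probabilities.
sent = ["<s>", "the", "cat", "sat", "</s>"]
prob = 1.0
for w1, w2 in zip(sent, sent[1:]):
    prob *= p(w2, w1)
print(prob)
```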

    The attention mechanism, the Transformer

    Introduction to the attention mechanism (a sketch follows this list).
    Introduction to the Transformer architecture.
    Transfer learning and pre-trained language models.
    Pre-trained language models:
      • Encoders: BERT/mBERT, RoBERTa, XLM-R
      • Sentence embeddings: baselines
      • Encoder-decoders: BART/mBART, T5/mT5
      • Decoders: GPT-2
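
    The sketch referenced above: scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, applied here as self-attention over random stand-in activations.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # query-key similarities
    weights = torch.softmax(scores, dim=-1)            # attention distribution per query
    return weights @ V                                 # weighted sum of the values

x = torch.randn(5, 8)                        # one "sentence": 5 tokens, d_model = 8
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V = x
print(out.shape)                             # torch.Size([5, 8])
```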

    Evaluation of Pre-trained Language Models

    Overview of NLP libraries: Hugging Face Transformers, Datasets and Evaluate; Sentence Transformers. Monolingual benchmarks. Multilingual benchmarks. The meaning of superhuman performance in NLU.
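
    A first taste of the Transformers library: masked-token prediction with a multilingual encoder (the checkpoint name is one public model among many; the first run downloads it).

```python
from transformers import pipeline

# Fill-in-the-blank with a multilingual BERT encoder.
fill = pipeline("fill-mask", model="bert-base-multilingual-cased")
for pred in fill("Rome is the [MASK] of Italy."):
    print(pred["token_str"], round(pred["score"], 3))
```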

    Multilingual lexical semantics

    Introduction to lexical semantics (differences between discrete and continuous representations). Word Sense Disambiguation (WSD). Multilingual WSD. Computational resources: WordNet and BabelNet. Other tasks: lexical substitution, Word-in-Context. PLMs' capability to separate word senses; explicit and latent sense embeddings. Notebook on WordNet, BabelNet and multilingual WSD.
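
    A minimal look at WordNet through NLTK: the senses of a polysemous word, and the lemmas of one sense in another language (the WordNet and Open Multilingual WordNet data must be downloaded first, as shown).

```python
import nltk
nltk.download("wordnet", quiet=True)   # one-time resource downloads
nltk.download("omw-1.4", quiet=True)
from nltk.corpus import wordnet as wn

# Each synset is one sense of the polysemous word "bank".
for synset in wn.synsets("bank")[:4]:
    print(synset.name(), "-", synset.definition())

# Italian lemmas for the river-bank sense, via the Open Multilingual WordNet.
print(wn.synset("bank.n.01").lemma_names("ita"))
```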

    Seq2seq and Machine Translation

    The encoder-decoder architecture, seq2seq. Introduction to machine translation (MT) and the history of MT. Minimal overview of statistical MT. Beam search for decoding. Introduction to neural machine translation (NMT); attention in NMT. The BLEU evaluation score. Performance and recent improvements. Disambiguation bias in machine translation and potential solutions.
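
    A minimal beam search sketch over a hand-written toy next-token table, standing in for a trained translation or language model:

```python
import math

# Toy next-token distributions (made up); a real decoder would come from an NMT model.
def next_probs(prefix):
    table = {
        (): {"the": 0.6, "a": 0.4},
        ("the",): {"cat": 0.5, "dog": 0.3, "</s>": 0.2},
        ("a",): {"cat": 0.3, "dog": 0.6, "</s>": 0.1},
    }
    return table.get(prefix, {"</s>": 1.0})

def beam_search(beam_size=2, max_len=5):
    beams = [((), 0.0)]  # (prefix, log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == "</s>":  # finished hypothesis: keep as-is
                candidates.append((prefix, score))
                continue
            for tok, p in next_probs(prefix).items():
                candidates.append((prefix + (tok,), score + math.log(p)))
        # Prune: keep only the beam_size highest-scoring hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams

for prefix, score in beam_search():
    print(" ".join(prefix), round(score, 3))
```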

    Multilingual sentence-level semantics

    Multilingual Semantic Role Labeling and Semantic Parsing.

    Introduction to Question Answering. Homework 2 assignment.

    Multilingual and Cross-lingual Retrieval-augmented NLP

    Bi-encoders vs. cross-encoders. Retrieval-Augmented Generation (RAG).
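
    A minimal bi-encoder retrieval sketch with Sentence Transformers (the checkpoint is one public multilingual model among many; the documents and query are invented). Each text is embedded independently and compared by cosine similarity, which is what makes bi-encoders cheap over large collections.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

docs = ["Rome is the capital of Italy.",
        "La Torre Eiffel si trova a Parigi.",        # Italian
        "Berlin ist die Hauptstadt Deutschlands."]   # German
query = "Where is the Eiffel Tower?"

# Bi-encoder: documents and query are encoded independently.
doc_emb = model.encode(docs, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_emb, doc_emb)[0]  # one cosine score per document
best = int(scores.argmax())
print(docs[best], float(scores[best]))        # cross-lingual match on the Italian document
```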

    Large Language Models (LLMs)

    General models: LLaMA, Mistral, OLMo. Instruction fine-tuning, RLHF. Transfer learning only. Prompt engineering. Benchmarks.
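
    A minimal few-shot prompting sketch: the "training examples" live entirely in the prompt. gpt2 is used here only because it is small and public; the instruction-tuned LLMs discussed in class follow such prompts far more reliably.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Few-shot prompt: task description plus in-context examples (made up).
prompt = (
    "Translate English to Italian.\n"
    "English: good morning\nItalian: buongiorno\n"
    "English: thank you\nItalian: grazie\n"
    "English: good night\nItalian:"
)
out = generator(prompt, max_new_tokens=5, do_sample=False)
print(out[0]["generated_text"][len(prompt):])  # only the continuation
```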

    Trends and problems in LLMs

    Data scaling and processing. LoRA, quantization, and distillation. The Chinchilla scaling law. Chain-of-Thought reasoning. Factuality, hallucinations, and stochastic parrots.
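
    A minimal sketch of the LoRA idea on a single linear layer (a hypothetical class, not the peft library): the pre-trained weight is frozen and only a low-rank update BA is trained, so the effective weight is W + (alpha/r) * BA.

```python
import torch
from torch import nn

class LoRALinear(nn.Module):
    """Hypothetical illustration: a frozen nn.Linear plus a trainable rank-r update."""
    def __init__(self, base: nn.Linear, r=4, alpha=8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)    # freeze the pre-trained weight and bias
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(trainable, "of", total, "parameters are trainable")  # 4096 of 266752
```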

    Text summarization

    Introduction to text summarization: extractive vs. abstractive approaches. Evaluation metrics (BLEU, ROUGE, BERTScore, and alternatives).
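
    A minimal ROUGE-1 sketch from clipped unigram overlap (no stemming or synonym handling, unlike full implementations):

```python
from collections import Counter

def rouge1(candidate: str, reference: str):
    """ROUGE-1 precision, recall and F1 from clipped unigram matches."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f1

print(rouge1("the cat sat on the mat", "the cat lay on the mat"))  # ~(0.833, 0.833, 0.833)
```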

    Closed Information Extraction

    Entity Linking, Relation Extraction, and closed Information Extraction.

