Algorithms for massive datasets

A.Y. 2020/2021
Learning objectives
The course aims at describing the big data processing framework, both in terms of methodologies and technologies.
Expected learning outcomes
At the end of the course, students:
- will be able to use technologies for the distributed storage of datasets;
- will know the map-reduce distributed processing framework and its leading extensions;
- will know the principal algorithms used to deal with classical big data problems, and will be able to implement them using a distributed processing framework;
- will be able to choose appropriate methods for solving big data problems.
Course syllabus and organization

Single session

Lesson period
Second semester
Teaching methods: lectures will be delivered via a videoconference system; students will be able to attend them either live via streaming, according to the course schedule, or by downloading the recordings at a later time from the course Web page.

Program and reference material: There will be no change.

Assessment methods and criteria: depending on the regulations in force at the time of the exam, examination procedures may be carried out remotely. The evaluation criteria will not change.
Course syllabus
The course will cover the main processing techniques for data at massive scale, and their implementation on distributed computational frameworks. More precisely, lectures will review the principal application contexts characterized by amounts of data that cannot be handled using standard computing facilities and procedures. Such contexts will be analyzed in terms of tailored algorithms. In addition, some general big data processing techniques, such as those falling under the umbrella of machine learning, will be considered.

In particular, the following topics will be covered.
- Mathematical preliminaries.
- Technical preliminaries.
- Bases of MapReduce, Hadoop, and Spark.
- Analysis of MapReduce algorithms.
- NoSQL databases: MongoDB.
- Link analysis.
- Regression.
- Logistic regression.
- Stream analysis.
- Deep learning.
- Clustering.
- Recurrent neural networks.
- Finding similar items.
- Market-basket analysis.
- Gradient boosting.
- Recommender systems.
- Dimensionality reduction.
- Embeddings.
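As an informal taste of the MapReduce paradigm listed above (this sketch is illustrative only and not part of the official course material), the classic word-count example can be expressed in pure Python as a map phase emitting key/value pairs, a shuffle grouping values by key, and a reduce phase aggregating each group:

```python
from collections import defaultdict

def map_phase(documents):
    # Mapper: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts emitted for each word.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big problems", "big data"]
print(reduce_phase(shuffle(map_phase(docs))))
# {'big': 3, 'data': 2, 'problems': 1}
```

In a real framework such as Hadoop or Spark, the shuffle is performed transparently across machines; the programmer only supplies the map and reduce functions.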
Prerequisites for admission
The course requires knowledge of the main topics of bachelor-level computer programming, linear algebra, calculus, probability, and statistics.
Teaching methods
Lectures
Teaching Resources
- Anand Rajaraman and Jeff Ullman, Mining of Massive Datasets, Cambridge University Press (ISBN:9781107015357).

Suggested readings:
- Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia, Learning Spark. Lightning-Fast Big Data Analysis, O'Reilly, 2015 (ISBN:978-1-449-35862-4)
- Sandy Ryza, Uri Laserson, Sean Owen, Josh Wills, Advanced Analytics with Spark. Patterns for Learning from Data at Scale, O'Reilly, 2015 (ISBN:978-1-491-91276-8)

Lecture notes, supplementary material, and sample code:
Assessment methods and criteria
The exam consists of a project and an oral test, both related to the topics covered in the course. The project, described in a written report, requires students to process one or more datasets through the critical application of the techniques described during classes. The project is evaluated with a pass/fail mark, considering the level of mastery of the topics and the clarity of the report. The oral test, which can be taken after a positive evaluation of the project, is based on the discussion of topics covered in the course and on in-depth questions about the presented project. The oral test is evaluated on a scale from 0 to 30, taking into account the level of mastery of the topics, clarity, and language skills.
INF/01 - INFORMATICS - University credits: 6
Lessons: 48 hours
Professor: Malchiodi Dario
By appointment (via e-mail)