Dario Malchiodi — Università degli Studi di Milano

2019-20

The course aims at describing the big data processing framework, both in terms of methodologies and technologies.

Expected results

Students:

will be able to use technologies for the distributed storage of datasets;
will know the MapReduce distributed processing framework and its leading extensions;
will know the principal algorithms used to deal with classical big data problems, and how to implement them using a distributed processing framework;
will be able to choose appropriate methods for solving big data problems.

News

Date	Info
14/05/2020	Projects for the «Algorithms for massive datasets» course The descriptions of the projects for the «Algorithms for massive datasets» course are available, as well as that of a joint project with the «Statistical methods for machine learning» course.
08/05/2020	Springer books available for free Springer is making books on basic and advanced computer science topics available for free.
04/05/2020	Cancellation of the Algorithms for massive datasets lecture of 4/5 The lecture of 4/5 is canceled. It will be held later during the course.
28/04/2020	Course evaluation The Web-based procedure for course evaluation is available. Students are invited to evaluate each course before the end of lectures.
20/04/2020	Lab for the Algorithms for massive datasets course (DSE master) The 23/04 lab lecture for students of the DSE master will be accessible through Zoom, between 14:30 and 18:30. Zoom can be installed as an extension of Microsoft Teams. The lab Web page contains the assigned exercises. Students are asked to bring their own solutions, which will be discussed during the lab.
17/03/2020	Relocation of video lectures In a few days, the recordings of the lectures will be moved to the University's OneDrive space. Students are therefore invited to check that their academic account linked to Office365 is active.
12/03/2020	Remote office hours organization Starting today, office hours will take place remotely. Every Thursday from 17:00, students can connect to the meeting «ricevimento-malchiodi» on meet.jit.si, write their name and surname in the chat, and wait to be called. The channel is open to all participants, so students who need private office hours must say so in the chat when connecting.
06/03/2020	Organization of distance learning Until further notice, the lectures of «Statistics and data analysis» and «Algorithms for massive datasets» will take place via distance learning. On the days when a course is scheduled, a video recording of the lesson will be made available on the corresponding Web page. Students can e-mail the teacher to ask for clarifications: the following day, a document containing the answers to the questions of general interest will be published.
06/03/2020	Recording of the lecture «Technical preliminaries» for the Algorithms for massive datasets course The recording of the lecture «Technical preliminaries» for the Algorithms for massive datasets course is available.
05/03/2020	Restricted access to lecture recordings Access to confidential content has changed. The «Course material» section in the pages of the involved courses describes the new method.
04/03/2020	Recording of the lecture «Mathematical preliminaries» for the Algorithms for massive datasets course The recording of the lecture «Mathematical preliminaries» for the Algorithms for massive datasets course is available.
23/02/2020	Cancellation of teaching activities All teaching activities are canceled until 29/2.
13/02/2020	Office hours on February 20th The office hours of February 20th are canceled.
21/01/2020	Beginning of the Algorithms for massive datasets course The lectures of Algorithms for massive datasets will start on Wednesday, February 26th at 14:30 in classroom alfa of the Computer science department. Starting from the following week, lectures will take place as shown in the course timetable.

Language

Lectures are in English.

Course schedule

Lectures take place at the educational sector of Città Studi, according to the following tentative schedule:

Day	Hour	Place
Monday	15:30 - 17:30 (*)	G9
Wednesday	14:30 - 18:30	G12

(*) Monday lectures, aimed at students of the Master in Computer Science, take place only in the weeks shown in the calendar below.
Any change to the schedule will be announced in class and published in the News section of this page.

Office hours

By appointment, room 5015 of the Computer Science Department. The teacher can be contacted by e-mail: students should first read the guide prepared by Prof. Sebastiano Vigna and clearly specify the course name and the academic year in the message. In particular, students are encouraged to always use their academic address (i.e. one based on the domain studenti.unimi.it) and to sign with their name and student ID number, bearing in mind that the response time may vary depending on the teacher's commitments.

Course material

Lectures are based:

on the textbook Mining of Massive Datasets, written by A. Rajaraman and J. Ullman (marked by RU in the calendar of lectures), available as a free download from the authors' Web site and published in hardcopy by Cambridge University Press (ISBN:9781107015357);
on the notes and sample code published in the calendar of lectures.

Recordings of some lectures, marked with (R) in the schedule, are available until the end of the course. Authentication is done using the Office365 academic account.

The following additional material is also suggested.

To practice with Spark: H. Karau, A. Konwinski, P. Wendell, M. Zaharia, Learning Spark. Lightning-Fast Big Data Analysis, O'Reilly, 2015 (ISBN:978-1-449-35862-4).
For a deeper study of Spark: S. Ryza, U. Laserson, S. Owen, J. Wills, Advanced Analytics with Spark. Patterns for Learning from Data at Scale, O'Reilly, 2015 (ISBN:978-1-491-91276-8).
About distributed file systems and the MapReduce paradigm: Yahoo! Hadoop Tutorial (besides Chapter 2 in RU).
For a deeper study of the practical parts: the Data Science and Engineering with Spark program on edX.

Syllabus

The course explains the topics listed in the lecture calendar, covering the textbook contents as well as the contents of the remaining documents listed in Course material.

Prerequisites

The course requires knowledge of the main topics of bachelor-level computer programming, calculus, probability, and statistics.

Lectures calendar

Exam modalities

The exam consists of a project and an oral test, both related to the topics covered in the course. The project requires processing one or more datasets through the critical application of the techniques described during the classes, and is documented in a written report. Four projects are available, as well as a joint project with the «Statistical methods for machine learning» course. The evaluation of the project, expressed with a pass/fail mark, considers the level of mastery of the topics and the clarity of the report. The oral test, which can be taken only after a positive evaluation of the project, is based on the discussion of some topics covered in the course and on in-depth questions about the presented project. The evaluation of the oral test, expressed on a scale between 0 and 30, takes into account the level of mastery of the topics, clarity, and language skills.

Exam sessions

Session	Date
June	16/06/2020
July	14/07/2020
September	07/09/2020 11/09/2020
September	24/09/2020
January	22/01/2021
February	N/A