The course aims to describe the big data processing framework, in terms of both methodologies and technologies. At the end of the course, students:
- will be able to use technologies for the distributed storage of datasets;
- will know the MapReduce distributed processing framework and its leading extensions;
- will know the principal algorithms for classical big data problems and how to implement them using a distributed processing framework;
- will be able to choose appropriate methods for solving big data problems.
Office hours on February 20th
The office hours of February 20th are canceled.
Beginning of the Algorithms for massive datasets course
The lectures of Algorithms for massive datasets will start on Wednesday, February 26th at 14:30 in classroom alfa of the Computer Science department. From the following week onward, lectures will take place as shown in the course timetable.
Lectures are in English.
Lectures take place at the educational sector of Città Studi, according to the following tentative schedule:
|Day|Time|Classroom|
|Monday|15:30 - 17:30 (*)|G9|
|Wednesday|14:30 - 18:30|G12|
(*) Monday lectures, aimed at students of the Master in Computer Science, take place only in the weeks shown in the calendar below.
Any change to the schedule will be announced in class and published in the News section of this page.
Office hours: Thursday at 17:00.
The teacher can be contacted by e-mail. Before writing, please read the guide prepared by Prof. Sebastiano Vigna, and clearly specify the course name and the academic year in the message. In particular, students are encouraged to always use their academic address (i.e. one based on the domain studenti.unimi.it) and to sign with their name and student ID number. The response time may vary depending on the teacher's commitments.
Lectures are based:
- on the textbook Mining of Massive Datasets, written by A. Rajaraman and J. Ullman (marked by RU in the calendar of lectures), available as a free download from the authors' website and published in hardcopy by Cambridge University Press (ISBN: 9781107015357);
- on the notes and sample code published in the calendar of lectures.
The following material is also recommended.
- To practice with Spark: H. Karau, A. Konwinski, P. Wendell, M. Zaharia, Learning Spark. Lightning-Fast Big Data Analysis, O'Reilly, 2015 (ISBN:978-1-449-35862-4).
- For a deeper study of Spark: S. Ryza, U. Laserson, S. Owen, J. Wills, Advanced Analytics with Spark. Patterns for Learning from Data at Scale, O'Reilly, 2015 (ISBN:978-1-491-91276-8).
- About distributed file systems and the MapReduce paradigm: Yahoo! Hadoop Tutorial (besides Chapter 2 in RU).
- For a deeper study of the practical parts: the Data Science and Engineering with Spark program on edX.
The course requires knowledge of the main topics of bachelor-level computer programming, calculus, probability, and statistics.
The exam consists of a project and an oral test, both related to the topics covered in the course. The project, documented in a report, requires processing one or more datasets through the critical application of the techniques described in class. The project is graded pass/fail, based on the level of mastery of the topics and the clarity of the report. The oral test, which can be taken only after the project has received a passing grade, consists of a discussion of some topics covered in the course and of in-depth questions about the submitted project. The oral test is graded on a scale from 0 to 30 and takes into account the level of mastery of the topics, clarity, and language skills.