ALGORITHMS FOR MASSIVE DATA

Academic year
2024/2025 Syllabus of previous years
Official course title
ALGORITHMS FOR MASSIVE DATA
Course code
CM0637 (AF:513729 AR:286758)
Modality
On campus classes
ECTS credits
6
Degree level
Master's Degree Programme (DM270)
Educational sector code
INF/01
Period
2nd Semester
Course year
1
Where
VENEZIA
Moodle
Go to Moodle page
The course introduces efficient algorithms and data structures for the representation and analysis of massive datasets. While traditional algorithmic techniques require the complete input data in an accessible format in order to work properly, the algorithms discussed in this course employ lossy and lossless compression techniques to analyze data whose size often exceeds the capacity of traditional computers. We will discuss sketching techniques to reduce (often, exponentially) the size of the data at hand, algorithms on data streams, and compressed data structures to represent and manipulate data in compressed format.
Knowledge and understanding
The student:
- knows the methodologies for designing and developing computer systems related to Data Engineering;
- knows the techniques for evaluating the performance, scalability, and reliability of software and algorithms for analyzing massive data;
- develops his/her skills in the field of programming by learning calculation techniques and knowledge of algorithms, at the state of the art;
- knows programming environments and algorithms for analyzing massive data;

Ability to apply knowledge and understanding
The graduate:
- is able to design and develop systems for storing, managing, and analyzing large-scale data;
- is able to design and evaluate highly scalable systems;
- is able to use advanced programming techniques in the fields of high-performance computing and algorithms for analyzing large amounts of data;
- is able to verify the functional and non-functional requirements (performance) of an algorithm;
- is able to access scientific literature to identify potential solutions to problems with innovative state-of-the-art methods.

Judgment skills:
At the end of the course the student will be able to use the knowledge acquired to:
- Identify the algorithm and the data structure best suited to solve a given problem in the context of massive data.
- Rigorously analyze the performance of randomized algorithms (runtime, approximation, probability of success).
- Independently read and understand research articles in the field.
- Implement existing algorithms and design new ones.
- Probability theory (expected value, variance, random variables, events)
- Algorithms and data structures (asymptotic complexity, basic data structures)
- Basics of C or C++ programming
- Discrete mathematics (modular arithmetics, combinatorics)
(1) lossless compressed data structures

- Course intro. Basics of information theory (Worst-case Entropy, statistical entropy, data compression)
- compressed data structures for sets and strings (sorted integers, Elias-Fano, zero-order compressed bitvectors, wavelet trees)
- Introduction to indexing, compressed suffix array
- FM-index

(2) lossy randomized algorithms and data structures

2.1 Probability theory recap
- Probability theory, basic definitions, concentration bounds
- Hashing

2.2 Filters
- Bloom, counting Bloom
- Quotient filters

2.3 Similarity-preserving sketching
- Rabin hashing, Shingling
- MinHash (Jaccard distance), Min-wise permutations.
- locality-sensitive hashing
- nearest neighbor search

2.4 Pattern matching on streams
- Pattern matching & streaming: applications, Karp-Rabin algorithm
- Porat-Porat's algorithm
- extension to approximate pattern matching (under Hamming distance)

2.5 Sketching on streams
- Morris’ algorithm
- Idealized Flajolet-Martin
- Frequent itemsets. Misra-Gries algorithm
- Tug-of-war algorithm and dimensionality reduction
- Datar-Gionis-Indyk-Motwani (count ones), extensions to sums of integers in a window
- Teacher's notes: https://arxiv.org/abs/2301.00754
- Original research articles provided during the course
The exam consists of two parts:

1. A mandatory individual oral test that aims at verifying the knowledge of the topics of the course. The maximum evaluation that can be obtained with the oral test is 30. The goal of the oral test is to verify knowledge of the theoretical tools explained during the course, as well as the ability to apply them in application contexts of algorithms for massive data. The assessment of this part of the examination is formulated according to this scheme: (1) knowledge (statements and proofs of the theoretical results) and ability to apply knowledge in the solution of exercises (range 40%), (2) detail and completeness of the answers (range 40%), (3) communication skills, including the use of specific terminology related to algorithms and data structures for massive data (range 20%).

2. Facultative individual assignment. The assignment allows the student to obtain an extra score of up to 5 additional points in addition to the mark of the oral test. The assignment consists of a problem (challenge) related to massive data processing and will require the implementation of software (in the C or C++ languages) in order to be solved, in addition to a short description of the proposed solution. The students’ solutions will be compared in the form of a contest, by evaluating the quality (accuracy) of the solution computed by the software, as well as its usage of computational resources (memory and speed). The assessment of this part of the examination is formulated according to this scheme: (1) overall ranking in the contest (range 80%), (2) originality of the solution (range 20%).

Honors (30 cum laude) will be awarded to the students that obtained an evaluation of 30 at the oral exam and that solved the practical assignments with an excellent solution (obtaining 5 extra points).
The topics are presented using a blackboard and the projector.
English
oral
Definitive programme.
Last update of the programme: 17/01/2025