ALGORITHMS FOR MASSIVE DATA

Academic year
2024/2025 Syllabus of previous years
Official course title
ALGORITHMS FOR MASSIVE DATA
Course code
CM0637 (AF:513729 AR:286758)
Modality
On campus classes
ECTS credits
6
Degree level
Master's Degree Programme (DM270)
Educational sector code
INF/01
Period
2nd Semester
Course year
1
Where
VENEZIA
Moodle
Go to Moodle page
The course introduces efficient algorithms and data structures for the representation and analysis of massive datasets. While traditional algorithmic techniques require the complete input data in an accessible format in order to work properly, the algorithms discussed in this course employ lossy and lossless compression techniques to analyze data whose size often exceeds the capacity of traditional computers. We will discuss sketching techniques to reduce (often, exponentially) the size of the data at hand, algorithms on data streams, and compressed data structures to represent and manipulate data in compressed format.
At the end of the course, the student will be able to apply advanced algorithmic techniques in order to analyze massive data, and will have the theoretical requirements needed in order to independently read and understand scientific articles in the field of the course. In particular, the expected learning outcomes will include:

1. Knowledge and understanding:
At the end of the course, the student will know the main algorithmic techniques concerning sketching, streaming, and compressed data structures.

2. Ability to apply knowledge and understanding:
At the end of the course, the student will be able to apply the learned techniques (sketching, streaming, and compressed data structures) in order to solve problems which are typical of massive data.

3. Ability to make judgments:
At the end of the course, the student will be able to apply the learned techniques as follows:
- Identification of the best algorithm/data structure to solve a particular problem on massive data.
- Analyze with rigorous techniques the performance of randomized algorithms (runtime, approximation ratio, success probability).
- Read and understand scientific articles in the field of the course.
- Implement existing algorithms and design new ones.
- Probability theory (expected value, variance, random variables, events)
- Algorithms and data structures (asymptotic complexity, basic data structures)
- Basics of C or C++ programming
- Discrete mathematics (modular arithmetics, combinatorics)
(1) lossless compressed data structures

- Course intro. Basics of information theory (Worst-case Entropy, statistical entropy, data compression)
- compressed data structures for sets and strings (sorted integers, Elias-Fano, zero-order compressed bitvectors, wavelet trees)
- Introduction to indexing, compressed suffix array
- FM-index

(2) lossy randomized algorithms and data structures

2.1 Probability theory recap
- Probability theory, basic definitions, concentration bounds
- Hashing

2.2 Filters
- Bloom, counting Bloom
- Quotient filters

2.3 Similarity-preserving sketching
- Rabin hashing, Shingling
- MinHash (Jaccard distance), Min-wise permutations.
- locality-sensitive hashing
- nearest neighbor search

2.4 Pattern matching on streams
- Pattern matching & streaming: applications, Karp-Rabin algorithm
- Porat-Porat's algorithm
- extension to approximate pattern matching (under Hamming distance)

2.5 Sketching on streams
- Morris’ algorithm
- Idealized Flajolet-Martin
- Frequent itemsets. Misra-Gries algorithm
- Tug-of-war algorithm and dimensionality reduction
- Datar-Gionis-Indyk-Motwani (count ones), extensions to sums of integers in a window
- Teacher's notes: https://arxiv.org/abs/2301.00754
- Original research articles provided during the course
The exam consists of two parts:

1. A mandatory individual oral test that aims at verifying the knowledge of the topics of the course. The maximum evaluation that can be obtained with the oral test is 30. The goal of the oral test is to verify knowledge of the theoretical tools explained during the course, as well as the ability to apply them in application contexts of algorithms for massive data. The assessment of this part of the examination is formulated according to this scheme: (1) knowledge (statements and proofs of the theoretical results) and ability to apply knowledge in the solution of exercises (range 40%), (2) detail and completeness of the answers (range 40%), (3) communication skills, including the use of specific terminology related to algorithms and data structures for massive data (range 20%).

2. Facultative individual assignment. The assignment allows the student to obtain an extra score of up to 5 additional points in addition to the mark of the oral test. The assignment consists of a problem (challenge) related to massive data processing and will require the implementation of software (in the C or C++ languages) in order to be solved, in addition to a short description of the proposed solution. The students’ solutions will be compared in the form of a contest, by evaluating the quality (accuracy) of the solution computed by the software, as well as its usage of computational resources (memory and speed). The assessment of this part of the examination is formulated according to this scheme: (1) overall ranking in the contest (range 80%), (2) originality of the solution (range 20%).

Honors (30 cum laude) will be awarded to the students that obtained an evaluation of 30 at the oral exam and that solved the practical assignments with an excellent solution (obtaining 5 extra points).
The topics are presented using a blackboard and the projector.
English
oral
Definitive programme.
Last update of the programme: 14/10/2024