ALGORITHMS FOR MASSIVE DATA
- Academic year
- 2025/2026 Syllabus of previous years
- Official course title
- ALGORITHMS FOR MASSIVE DATA
- Course code
- CM0637 (AF:576784 AR:323780)
- Modality
- On campus classes
- ECTS credits
- 6
- Degree level
- Master's Degree Programme (DM270)
- Educational sector code
- INF/01
- Period
- 2nd Semester
- Course year
- 1
- Where
- VENEZIA
Contribution of the course to the overall degree programme goals
Expected learning outcomes
The student:
- knows the methodologies for designing and developing computer systems related to Data Engineering;
- knows the techniques for evaluating the performance, scalability, and reliability of software and algorithms for analyzing massive data;
- develops his/her skills in the field of programming by learning calculation techniques and knowledge of algorithms, at the state of the art;
- knows programming environments and algorithms for analyzing massive data;
Ability to apply knowledge and understanding
The graduate:
- is able to design and develop systems for storing, managing, and analyzing large-scale data;
- is able to design and evaluate highly scalable systems;
- is able to use advanced programming techniques in the fields of high-performance computing and algorithms for analyzing large amounts of data;
- is able to verify the functional and non-functional requirements (performance) of an algorithm;
- is able to access scientific literature to identify potential solutions to problems with innovative state-of-the-art methods.
Judgment skills:
At the end of the course the student will be able to use the knowledge acquired to:
- Identify the algorithm and the data structure best suited to solve a given problem in the context of massive data.
- Rigorously analyze the performance of randomized algorithms (runtime, approximation, probability of success).
- Independently read and understand research articles in the field.
- Implement existing algorithms and design new ones.
Pre-requirements
- Algorithms and data structures (asymptotic complexity, basic data structures)
- Basics of C or C++ programming
- Discrete mathematics (modular arithmetics, combinatorics)
Contents
- Course intro. Basics of information theory (Worst-case Entropy, statistical entropy, data compression)
- compressed data structures for sets and strings (sorted integers, Elias-Fano, zero-order compressed bitvectors, wavelet trees)
- Introduction to indexing, compressed suffix array
- FM-index
(2) lossy randomized algorithms and data structures
2.1 Probability theory recap
- Probability theory, basic definitions, concentration bounds
- Hashing
2.2 Filters
- Bloom filters, counting Bloom filters
- Quotient filters
2.3 Similarity-preserving sketching
- Rabin hashing, Shingling
- MinHash (Jaccard distance), Min-wise permutations.
- locality-sensitive hashing
- nearest neighbor search
2.4 Pattern matching on streams
- Pattern matching & streaming: applications, Karp-Rabin algorithm
- Porat-Porat's algorithm
- extension to approximate pattern matching (under Hamming distance)
2.5 Sketching on streams
- Morris’ algorithm
- Idealized Flajolet-Martin
- Frequent itemsets. Misra-Gries algorithm
- Count-min sketch
- Tug-of-war algorithm and dimensionality reduction
- sketching for relational algebra
- Datar-Gionis-Indyk-Motwani (count ones), extensions to sums of integers in a window
Referral texts
- Original research articles provided during the course
Assessment methods
1. Individual assignment. The maximum score that can be obtained with the assignment is 15. The assignment consists of a problem (challenge) related to the processing of massive data and requires the implementation of software (in C or C++ languages) to be solved, in addition to a brief description of the proposed solution (5 pages, pdf). The students' solutions will be compared in a competition, evaluating the quality (accuracy) of the solution calculated by the software, as well as its use of computational resources (memory and speed).
2. Individual oral exam. The oral exam must be taken after the delivery of the individual assignment. The maximum score that can be obtained with the oral exam is 20. The aim of the oral exam is to verify the knowledge of the theoretical tools presented during the course, as well as to discuss the solution of the individual assignment.
The final evaluation is given by the sum of the evaluations of the assignment and the oral exam. Honors (laude) will be awarded to students who have obtained a sum of at least 34 points in the two tests.
Type of exam
Grading scale
- overall ranking in the contest (range 80%)
- originality of the solution (range 20%).
Oral exam (maximum 20 points):
- knowledge (statements and proofs of the theoretical results) and ability to apply knowledge in the solution of exercises (range 40%)
- detail and completeness of the answers (range 40%)
- communication skills, including the use of specific terminology related to algorithms and data structures for massive data (range 20%).
The final evaluation is given by the sum of the evaluations of the assignment and the oral exam. Honors (laude) will be awarded to students who have obtained a sum of at least 34 points in the two tests.