ALGORITHMS FOR MASSIVE DATA

Academic year
2023/2024
Official course title
ALGORITHMS FOR MASSIVE DATA
Course code
CM0622 (AF:451557 AR:245287)
Modality
On campus classes
ECTS credits
6 out of 12 of ALGORITHMS AND LEARNING OVER MASSIVE DATA
Degree level
Master's Degree Programme (DM270)
Educational sector code
INF/01
Period
2nd Semester
Course year
1
Where
VENEZIA
The course introduces efficient algorithms and data structures for the representation and analysis of massive datasets. Traditional algorithmic techniques require the complete input to be available in an accessible format. The algorithms discussed in this course instead employ lossy and lossless compression techniques to analyze data whose size often exceeds the capacity of traditional computers. We will discuss sketching techniques that reduce (often exponentially) the size of the data at hand, algorithms on data streams, and compressed data structures that represent and manipulate data in compressed format.
At the end of the course, the student will be able to apply advanced algorithmic techniques to analyze massive data, and will have the theoretical background needed to independently read and understand scientific articles in the field of the course. In particular, the expected learning outcomes include:

1. Knowledge and understanding:
At the end of the course, the student will know the main algorithmic techniques concerning sketching, streaming, and compressed data structures.

2. Ability to apply knowledge and understanding:
At the end of the course, the student will be able to apply the learned techniques (sketching, streaming, and compressed data structures) in order to solve problems which are typical of massive data.

3. Ability to make judgments:
At the end of the course, the student will be able to:
- Identify the best algorithm or data structure to solve a particular problem on massive data.
- Rigorously analyze the performance of randomized algorithms (runtime, approximation ratio, success probability).
- Read and understand scientific articles in the field of the course.
- Implement existing algorithms and design new ones.

Prerequisites:
- Probability theory (expected value, variance, random variables, events)
- Algorithms and data structures (asymptotic complexity, basic data structures)
- Discrete mathematics (modular arithmetic)

(1) lossless compressed data structures

- Course introduction. Basics of information theory (worst-case entropy, statistical entropy, data compression)
- Compressed data structures for sets and strings (sorted integers, Elias-Fano, succinct bitvectors, wavelet trees)
- Introduction to indexing, compressed suffix array
- FM-index / r-index
- Indexes for graphs and regular languages

(2) lossy randomized algorithms and data structures

2.1 Probability theory recap
- Probability theory, basic definitions, concentration bounds
- Hashing

2.2 Filters
- Bloom filters, counting Bloom filters
- Quotient filters
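
As an illustration of the filters in 2.2, the following is a minimal Python sketch of a Bloom filter. It is not taken from the course material: the class name, the parameters num_bits and num_hashes, and the use of SHA-256 with double hashing to derive the bit positions are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Approximate set membership: no false negatives, some false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Assumption for this sketch: derive all positions from one SHA-256
        # digest via double hashing, (h1 + i * h2) mod num_bits.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # forced odd, never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        # Set the bit at every position associated with the item.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True if every associated bit is set; may be a false positive.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

An inserted item is always reported as present. An item that was never inserted is reported as present only if all of its positions happen to be set, which becomes less likely as num_bits grows relative to the number of inserted items.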

2.3 Similarity-preserving sketching
- Rabin hashing, Shingling
- MinHash (Jaccard similarity), min-wise permutations
- Locality-sensitive hashing
- Nearest neighbor search
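
The shingling and MinHash topics of 2.3 can be illustrated with a short Python sketch. It is an example under stated assumptions, not the course's reference implementation: seeded BLAKE2b hashes stand in for random min-wise permutations, and all function names and default parameters are hypothetical.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of all length-k substrings (k-shingles) of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def _seeded_hash(seed, item):
    # Assumption: a salted 64-bit BLAKE2b hash approximates one random
    # permutation of the universe of shingles.
    h = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "big")
    )
    return int.from_bytes(h.digest(), "big")


def minhash_signature(items, num_hashes=128):
    """One minimum per hash function; items must be a non-empty set."""
    return [min(_seeded_hash(seed, x) for x in items)
            for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two sets agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def jaccard(a, b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(a & b) / len(a | b)
```

For each hash function, the two minima coincide with probability approximately equal to the Jaccard similarity of the two sets, so the fraction of agreeing positions estimates it. Increasing num_hashes reduces the variance of the estimate at the cost of a longer signature.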

2.4 Pattern matching on streams
- Pattern matching & streaming: applications, Karp-Rabin algorithm
- Porat-Porat's algorithm
- extension to approximate pattern matching (under Hamming distance)
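
As an illustration of the Karp-Rabin algorithm listed in 2.4, the following Python sketch searches a text with a rolling fingerprint. It assumes the whole text is available in memory and verifies each candidate match explicitly, so it is not a streaming algorithm. The function name, the base, and the modulus (the Mersenne prime 2^61 - 1) are choices made for this example.

```python
def karp_rabin_search(text, pattern, base=256, mod=(1 << 61) - 1):
    """Return all start positions at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Weight of the character that leaves the window at each shift.
    high = pow(base, m - 1, mod)
    p_hash = 0
    w_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        w_hash = (w_hash * base + ord(text[i])) % mod
    hits = []
    for i in range(n - m + 1):
        # Equal fingerprints may still be a collision, so compare the strings.
        if w_hash == p_hash and text[i:i + m] == pattern:
            hits.append(i)
        if i + m < n:
            # Roll the window: drop text[i], append text[i + m].
            w_hash = ((w_hash - ord(text[i]) * high) * base
                      + ord(text[i + m])) % mod
    return hits
```

Dropping the explicit string comparison would give a Monte Carlo variant that can report false matches when two fingerprints collide.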

2.5 Sketching on streams
- Morris’ algorithm
- Idealized Flajolet-Martin, Bottom-k algorithm
- Tidemark algorithm, BJKST algorithm
- Frequent itemsets. Misra-Gries algorithm
- Tug-of-war algorithm and dimensionality reduction
- Estimating higher moments: AMS algorithm
- Datar-Gionis-Indyk-Motwani algorithm (counting ones in a sliding window), extensions to sums of integers in a window
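
As an illustration of the streaming algorithms in 2.5, the following is a minimal Python sketch of the Misra-Gries algorithm for frequent items. The function name and the convention of using at most k - 1 counters are assumptions of this example; other presentations parameterize the algorithm differently.

```python
def misra_gries(stream, k):
    """Summarize a stream in one pass using at most k - 1 counters.

    For a stream of length n, every item occurring more than n / k times
    keeps a counter, and each counter underestimates the true frequency
    of its item by at most n / k.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all and drop those reaching zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

With k = 2 this reduces to a single counter, as in the classical majority vote algorithm. A second pass over the stream can replace the estimates with exact frequencies when the stream can be read twice.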

References:
- Rajaraman, A. and Ullman, J.D., 2011. Mining of massive datasets. Cambridge University Press. http://www.mmds.org/
- Teacher's notes: https://arxiv.org/abs/2301.00754
- Chakrabarti, Amit, 2020. Data stream algorithms - lecture notes: https://www.cs.dartmouth.edu/~ac/Teach/data-streams-lecnotes.pdf
- Navarro, Gonzalo, 2016. Compact data structures: a practical approach. Cambridge University Press.
- Teacher's slides and original research articles
The exam consists of an individual oral discussion of the topics presented during the course's lectures.
Teacher's slides and blackboard.
English
oral
Definitive programme.
Last update of the programme: 25/02/2023