ALGORITHMS FOR MASSIVE DATA

Academic year
2022/2023 Syllabus of previous years
Official course title
ALGORITHMS FOR MASSIVE DATA
Course code
CM0637 (AF:398290 AR:214338)
Modality
On campus classes
ECTS credits
6
Degree level
Master's Degree Programme (DM270)
Educational sector code
INF/01
Period
2nd Semester
Course year
1
Where
VENEZIA
Moodle
Go to Moodle page
The course introduces efficient algorithms and data structures for the representation and analysis of massive datasets. While traditional algorithmic techniques require the complete input data in an accessible format in order to work properly, the algorithms discussed in this course employ lossy and lossless compression techniques to analyze data whose size often exceeds the capacity of traditional computers. We will discuss sketching techniques to reduce (often, exponentially) the size of the data at hand, algorithms on data streams, and compressed data structures to represent and manipulate data in compressed format.
At the end of the course, the student will be able to apply advanced algorithmic techniques in order to analyze massive data, and will have the theoretical requirements needed in order to independently read and understand scientific articles in the field of the course. In particular, the expected learning outcomes will include:

1. Knowledge and understanding:
At the end of the course, the student will know the main algorithmic techniques concerning sketching, streaming, and compressed data structures.

2. Ability to apply knowledge and understanding:
At the end of the course, the student will be able to apply the learned techniques (sketching, streaming, and compressed data structures) in order to solve problems which are typical of massive data.

3. Ability to make judgments:
At the end of the course, the student will be able to apply the learned techniques as follows:
- Identification of the best algorithm/data structure to solve a particular problem on massive data.
- Analyze with rigorous techniques the performance of randomized algorithms (runtime, approximation ratio, success probability).
- Read and understand scientific articles in the field of the course.
- Implement existing algorithms and design new ones.
- Probability theory (expected value, variance, random variables, events)
- Algorithms and data structures (asymptotic complexity, basic data structures)
- Discrete mathematics (modular arithmetics)
Part I: recap

- Probability theory (Expected value, Variance, Bernoullian R.V., Markov, Chebyshev, Chernoff)
- Hashing (universal, hash tables)

Part II: Sketching

- Shingling
- Rabin hashing (identity)
- MinHash (Jaccard similarity), Min-wise permutations
- SimHash (Cosine similarity)
- sketching for Hamming and Euclidean distance
- Locality-sensitive hashing, nearest neighbor search, other applications
- Bloom filters, Count-min sketch

Part III: algorithms on streams

- the data stream model
- Streamed pattern matching (Karp-Rabin, Porat-Porat algorithms)
- Morris' algorithm (count with small registers)
- Idealized Flajolet-Martin's algorithm
- Bottom-k algorithm
- LogLog algorithms family
- lower bounds
- Estimating higher moments (AMS, tug-of-war, heavy hitters)
- Datar-Gionis-Indyk-Motwani algorithm (count ones)

Part IV: Compressed data structures

- Entropy, data compression
- bitvectors
- wavelet trees
- compressed suffix array
- FM-index
- compressing and indexing highly structured data (graphs)
- Rajaraman, A. and Ullman, J.D., 2011. Mining of massive datasets. Cambridge University Press. http://www.mmds.org/
- Teacher's notes: https://arxiv.org/abs/2301.00754
- Chakrabarti, Amit, 2020. Data stream algorithms - lecture notes: https://www.cs.dartmouth.edu/~ac/Teach/data-streams-lecnotes.pdf
- Navarro, Gonzalo, 2016. Compact data structures: a practical approach. Cambridge University Press.
- Teacher's slides and original research articles
The exam consists in an individual oral discussion concerning the topics presented during the course's lectures.
Teacher's slides and blackboard.
English
oral
Definitive programme.
Last update of the programme: 21/01/2023