ADVANCED COMPUTER SCIENCE - MOD. 2
- Academic year
- 2024/2025 Syllabus of previous years
- Official course title
- ADVANCED COMPUTER SCIENCE - MOD. 2
- Course code
- CM0604 (AF:509707 AR:286758)
- Modality
- On campus classes
- ECTS credits
- 6 out of 12 of ADVANCED COMPUTER SCIENCE
- Degree level
- Master's Degree Programme (DM270)
- Educational sector code
- ING-INF/05
- Period
- 2nd Semester
- Course year
- 1
- Where
- VENEZIA
- Moodle
- Go to Moodle page
Contribution of the course to the overall degree programme goals
Expected learning outcomes
1. Knowledge and understanding:
At the end of the course, the student will know the main algorithmic techniques concerning sketching, streaming, and compressed data structures.
2. Ability to apply knowledge and understanding:
At the end of the course, the student will be able to apply the learned techniques (sketching, streaming, and compressed data structures) in order to solve problems which are typical of massive data.
3. Ability to make judgments:
At the end of the course, the student will be able to apply the learned techniques as follows:
- Identification of the best algorithm/data structure to solve a particular problem on massive data.
- Analyze with rigorous techniques the performance of randomized algorithms (runtime, approximation ratio, success probability).
- Read and understand scientific articles in the field of the course.
- Implement existing algorithms and design new ones.
Pre-requirements
- Algorithms and data structures (asymptotic complexity, basic data structures)
- Basics of C or C++ programming
- Discrete mathematics (modular arithmetics, combinatorics)
Contents
- Course intro. Basics of information theory (Worst-case Entropy, statistical entropy, data compression)
- compressed data structures for sets and strings (sorted integers, Elias-Fano, zero-order compressed bitvectors, wavelet trees)
- Introduction to indexing, compressed suffix array
- FM-index
(2) lossy randomized algorithms and data structures
2.1 Probability theory recap
- Probability theory, basic definitions, concentration bounds
- Hashing
2.2 Filters
- Bloom, counting Bloom
- Quotient filters
2.3 Similarity-preserving sketching
- Rabin hashing, Shingling
- MinHash (Jaccard distance), Min-wise permutations.
- locality-sensitive hashing
- nearest neighbor search
2.4 Pattern matching on streams
- Pattern matching & streaming: applications, Karp-Rabin algorithm
- Porat-Porat's algorithm
- extension to approximate pattern matching (under Hamming distance)
2.5 Sketching on streams
- Morris’ algorithm
- Idealized Flajolet-Martin
- Frequent itemsets. Misra-Gries algorithm
- Tug-of-war algorithm and dimensionality reduction
- Datar-Gionis-Indyk-Motwani (count ones), extensions to sums of integers in a window
Referral texts
- Original research articles provided during the course
Assessment methods
1. A mandatory individual oral test that aims at verifying the knowledge of the topics of the course. The maximum evaluation that can be obtained with the oral test is 30. The goal of the oral test is to verify knowledge of the theoretical tools explained during the course, as well as the ability to apply them in application contexts of algorithms for massive data. The assessment of this part of the examination is formulated according to this scheme: (1) knowledge (statements and proofs of the theoretical results) and ability to apply knowledge in the solution of exercises (range 40%), (2) detail and completeness of the answers (range 40%), (3) communication skills, including the use of specific terminology related to algorithms and data structures for massive data (range 20%).
2. Facultative individual assignment. The assignment allows the student to obtain an extra score of up to 5 additional points in addition to the mark of the oral test. The assignment consists of a problem (challenge) related to massive data processing and will require the implementation of software (in the C or C++ languages) in order to be solved, in addition to a short description of the proposed solution. The students’ solutions will be compared in the form of a contest, by evaluating the quality (accuracy) of the solution computed by the software, as well as its usage of computational resources (memory and speed). The assessment of this part of the examination is formulated according to this scheme: (1) overall ranking in the contest (range 80%), (2) originality of the solution (range 20%).
Honors (30 cum laude) will be awarded to the students that obtained an evaluation of 30 at the oral exam and that solved the practical assignments with an excellent solution (obtaining 5 extra points).