ALGORITHMS FOR MASSIVE DATA

Academic year: 2024/2025 Syllabus of previous years

Official course title: ALGORITHMS FOR MASSIVE DATA

Course code: CM0637 (AF:513730 AR:286758)

Teaching language: English

Modality: On campus classes

ECTS credits: 6

Degree level: Master's Degree Programme (DM270)

Academic Discipline: INF/01

Period: 2nd Semester

Course year: 1

Where: VENEZIA

Moodle: Go to Moodle page

Contribution of the course to the overall degree programme goals

The course introduces efficient algorithms and data structures for the representation and analysis of massive datasets. While traditional algorithmic techniques require the complete input data in an accessible format in order to work properly, the algorithms discussed in this course employ lossy and lossless compression techniques to analyze data whose size often exceeds the capacity of traditional computers. We will discuss sketching techniques to reduce (often, exponentially) the size of the data at hand, algorithms on data streams, and compressed data structures to represent and manipulate data in compressed format.

Expected learning outcomes

Knowledge and understanding
The student:
- knows the methodologies for designing and developing computer systems related to Data Engineering;
- knows the techniques for evaluating the performance, scalability, and reliability of software and algorithms for analyzing massive data;
- develops his/her skills in the field of programming by learning calculation techniques and knowledge of algorithms, at the state of the art;
- knows programming environments and algorithms for analyzing massive data;

Ability to apply knowledge and understanding
The graduate:
- is able to design and develop systems for storing, managing, and analyzing large-scale data;
- is able to design and evaluate highly scalable systems;
- is able to use advanced programming techniques in the fields of high-performance computing and algorithms for analyzing large amounts of data;
- is able to verify the functional and non-functional requirements (performance) of an algorithm;
- is able to access scientific literature to identify potential solutions to problems with innovative state-of-the-art methods.

Judgment skills:
At the end of the course the student will be able to use the knowledge acquired to:
- Identify the algorithm and the data structure best suited to solve a given problem in the context of massive data.
- Rigorously analyze the performance of randomized algorithms (runtime, approximation, probability of success).
- Independently read and understand research articles in the field.
- Implement existing algorithms and design new ones.

Pre-requirements

- Probability theory (expected value, variance, random variables, events)
- Algorithms and data structures (asymptotic complexity, basic data structures)
- Basics of C or C++ programming
- Discrete mathematics (modular arithmetics, combinatorics)

(1) lossless compressed data structures

- Course intro. Basics of information theory (Worst-case Entropy, statistical entropy, data compression)
- compressed data structures for sets and strings (sorted integers, Elias-Fano, zero-order compressed bitvectors, wavelet trees)
- Introduction to indexing, compressed suffix array
- FM-index

(2) lossy randomized algorithms and data structures

2.1 Probability theory recap
- Probability theory, basic definitions, concentration bounds
- Hashing

2.2 Filters
- Bloom, counting Bloom
- Quotient filters

2.3 Similarity-preserving sketching
- Rabin hashing, Shingling
- MinHash (Jaccard distance), Min-wise permutations.
- locality-sensitive hashing
- nearest neighbor search

2.4 Pattern matching on streams
- Pattern matching & streaming: applications, Karp-Rabin algorithm
- Porat-Porat's algorithm
- extension to approximate pattern matching (under Hamming distance)

2.5 Sketching on streams
- Morris’ algorithm
- Idealized Flajolet-Martin
- Frequent itemsets. Misra-Gries algorithm
- Tug-of-war algorithm and dimensionality reduction
- Datar-Gionis-Indyk-Motwani (count ones), extensions to sums of integers in a window

Referral texts

- Teacher's notes: https://arxiv.org/abs/2301.00754
- Original research articles provided during the course

Assessment methods

The exam consists of two parts:

1. A mandatory individual oral test that aims at verifying the knowledge of the topics of the course. The maximum evaluation that can be obtained with the oral test is 30. The goal of the oral test is to verify knowledge of the theoretical tools explained during the course, as well as the ability to apply them in application contexts of algorithms for massive data. The assessment of this part of the examination is formulated according to this scheme: (1) knowledge (statements and proofs of the theoretical results) and ability to apply knowledge in the solution of exercises (range 40%), (2) detail and completeness of the answers (range 40%), (3) communication skills, including the use of specific terminology related to algorithms and data structures for massive data (range 20%).

2. Facultative individual assignment. The assignment allows the student to obtain an extra score of up to 5 additional points in addition to the mark of the oral test. The assignment consists of a problem (challenge) related to massive data processing and will require the implementation of software (in the C or C++ languages) in order to be solved, in addition to a short description of the proposed solution. The students’ solutions will be compared in the form of a contest, by evaluating the quality (accuracy) of the solution computed by the software, as well as its usage of computational resources (memory and speed). The assessment of this part of the examination is formulated according to this scheme: (1) overall ranking in the contest (range 80%), (2) originality of the solution (range 20%).

Honors (30 cum laude) will be awarded to the students that obtained an evaluation of 30 at the oral exam and that solved the practical assignments with an excellent solution (obtaining 5 extra points).

Type of exam

oral

Teaching methods

The topics are presented using a blackboard and the projector.

Definitive programme.

Last update of the programme: 17/01/2025

Type	Name	Sender (Domain)	Description	Duration	Policy
Essential	_shibsession[], _shibsstate[]	Unive.it (www.unive.it)	They maintain the session data of the SingleSignOn.	session	Information by Ca' Foscari University
Essential	PHPSESSID	Unive.it (www.unive.it)	Unique user identifier for the website applications.	session	Information by Ca' Foscari University
Essential	cookie[*]	Unive.it (www.unive.it)	It stores the user's preferences on cookies. user preferences on cookies.	1 month	Information by Ca' Foscari University
Essential	cookie	idp.unive.it	It stores the user's preferences on cookies.	1 month	Information by Ca' Foscari University
Essential	fe_typo_user	Unive.it (www.unive.it)	Unique user identifier for the reserved area of the website	session	Information by Ca' Foscari University
Essential	JSESSIONID	Unive.it (www.unive.it)	Used to create web sessions into the Personal Area.	session	Information by Ca' Foscari University
Essential	ADMCMD_prev	Unive.it (www.unive.it)	Used to create web sessions into the Personal Area.	session	Information by Ca' Foscari University
Essential	unive.it	Unive.it (www.unive.it)	It stores the user's preferences on cookies.	6 months	Information by Ca' Foscari University
Essential	noiframe	Unive.it (www.unive.it)	It stores the user's preferences on cookies.	6 months	Information by Ca' Foscari University
Essential	_pk_id[*]	unive/WAI	*	30 days	Information by Matomo
Essential	_pk_ses[*]	unive/WAI	*	1 day	Information by Matomo
Essential	_pk_ref[*]	unive/WAI	*	6 months	Information by Matomo
Essential	_gsas[*]	unive/google	It stores the user's preferences on cookies.	3 months	Information by Google
Essential	_opensaml_req_cookie%[*]	unive	Authentication and SingleSignOn (shibboleth)	session	Information by Ca' Foscari University
Google - Youtube	__Secure-1PAPISID	Google (google.com)	Used for targeting purposes in order to acquire web visitors' interests and show them pertinent and customised Google advertising.	2 years	Information by Google
Google - Youtube	CONSENT	Google (google.com)	Used by Google to store the user's preferences.	17 years	Information by Google
Google - Youtube	__Secure-1PSID	Google (google.com)	Used for targeting purposes in order to acquire web visitors' interests and show them pertinent and customised Google advertising.	2 years	Information by Google
Essential	Socialpix	Unive.it (www.unive.it)	They are used to record cookie preferences	6 months	Information by Ca' Foscari University
Facebook - Pixel	_fbp	Unive.it (www.unive.it)	Tracks users for retargeting advertising on Facebook	3 months	Information by Facebook
Facebook - Pixel	datr	Facebook	Marketing	2 anni	Information by Facebook