Data Mining and Information Retrieval

Research at the DM&IR Lab develops novel models, algorithms and data structures for extracting and representing knowledge and for managing information efficiently. Research topics include:

  • Data and Web Mining;
  • Explainable AI;
  • Mobility Data Science;
  • Distributed and Parallel Data-Intensive Algorithms;
  • Compressed Data Structures for Strings and Graphs.

Research Group

Collaborators

  • Francesco Busolin (PhD Student)
  • Federico Marcuzzi (PhD Student)
  • Alberto Veneri (PhD Student)

Website: https://sites.google.com/unive.it/dmir

Collaborations

Publications

  • Claudio Lucchese, Franco Maria Nardini, Salvatore Orlando, Raffaele Perego, Nicola Tonellotto, Rossano Venturini: QuickScorer: A Fast Algorithm to Rank Documents with Additive Ensembles of Regression Trees. SIGIR 2015: 73-82. (Best Paper) (ACM Notable Article)
  • Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. Journal of the ACM (JACM). 2020 Jan 15;67(1):1-54. https://doi.org/10.1145/3375890
  • B. Brandoli, A. Raffaetà, M. Simeoni, P. Adibi, F. K. Bappee, F. Pranovi, G. Rovinelli, E. Russo, C. Silvestri, A. Soares, S. Matwin. From multiple aspect trajectories to predictive analysis: a case study on fishing vessels in the Northern Adriatic sea. GeoInformatica, pp. 1-29, March 2022
  • Stefano Calzavara, Claudio Lucchese, Gabriele Tolomei, Seyum Assefa Abebe, Salvatore Orlando: Treant: training evasion-aware decision trees. Data Min. Knowl. Discov. 34(5): 1390-1420 (2020)
  • Seyum Assefa Abebe, Claudio Lucchese, Salvatore Orlando: EiFFFeL: Enforcing Fairness in Forests by Flipping Leaves. ACM SAC 2021
  • Giulio Ermanno Pibiri and Rossano Venturini. "Techniques for Inverted Index Compression". ACM Computing Surveys. 53, 6, Article 125, 2021, 36 pages. https://doi.org/10.1145/3415148

Awards

  • 2015 - Best Paper at ACM SIGIR Conference on Research & Development on Information Retrieval

Research Projects

REGINDEX - Compressed Indexes for Regular Languages with Applications to Computational Pan-genomics

The research project, funded by the Horizon Europe programme with an ERC Starting Grant, tackles the problem of organizing large, structured data sets so as to reduce their space usage and accelerate searches inside them. At a high level, the idea is very similar to how a common dictionary works: it is much easier to search for a term in a dictionary than in a book because, in the former, terms are sorted alphabetically. The REGINDEX project extends this simple idea to much more complex data: labeled graphs (or, equivalently, regular languages). While sentences in a book are formed by consecutive words, in a labeled graph "jumps" between words, even very distant ones, are permitted. Although this makes searching for sentences in a graph much more complicated, the project will show that the idea of sorting still applies. The techniques developed will find immediate application in the design of algorithms for searching for mutations inside sets of genomes. The DNA of two human beings is never perfectly identical. In fact, the differences among all human genomes can be modeled as a very large labeled graph: a pangenomic graph. The problem of searching for a particular mutation in the data set then translates to that of searching for a "sentence" (a path) inside this graph.
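As a rough illustration of the two ideas in this description, the Python sketch below first uses sorting to speed up search in plain text (binary search over alphabetically sorted suffixes), and then searches for a "sentence" as a path in a tiny labeled graph. It is not the REGINDEX technique: the graph search is a naive baseline, and all function names and the toy graph (which spells the two hypothetical sequences GATTACA and GATCACA) are assumptions made for this example.

```python
def build_suffix_array(text):
    """Sort all suffix start positions alphabetically (naive construction)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def occurs_in_text(text, suffix_array, pattern):
    """Binary-search the sorted suffixes for one that starts with `pattern`."""
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        # Compare only the first len(pattern) characters of the suffix.
        if text[start:start + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    if lo == len(suffix_array):
        return False
    start = suffix_array[lo]
    return text[start:start + len(pattern)] == pattern


def occurs_in_graph(labels, edges, pattern):
    """Naive search: is there a path whose node labels spell `pattern`?

    labels: dict mapping node -> single-character label
    edges:  dict mapping node -> list of successor nodes
    """
    if not pattern:
        return True
    # Nodes where a match of the pattern read so far can currently end.
    frontier = {v for v, c in labels.items() if c == pattern[0]}
    for ch in pattern[1:]:
        frontier = {w for v in frontier for w in edges.get(v, ()) if labels[w] == ch}
        if not frontier:
            return False
    return True


# Hypothetical toy graph: G-A-T, then a variant (T or C), then A-C-A.
LABELS = {0: "G", 1: "A", 2: "T", 3: "T", 4: "C", 5: "A", 6: "C", 7: "A"}
EDGES = {0: [1], 1: [2], 2: [3, 4], 3: [5], 4: [5], 5: [6], 6: [7]}

TEXT = "GATTACA"
SUFFIX_ARRAY = build_suffix_array(TEXT)
print(occurs_in_text(TEXT, SUFFIX_ARRAY, "TTA"))
print(occurs_in_graph(LABELS, EDGES, "TCA"))
```

In the text case, sorting lets each query inspect only a logarithmic number of suffixes. The graph search above has no such ordering to exploit and must track every node where a partial match could end, which is the kind of cost that an index based on sorting aims to avoid.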

MASTER - Multiple aspect trajectories representation and analysis

Multiple ASpects TrajEctoRy management and analysis (2018-2022) is a Marie Sklodowska-Curie RISE (Research and Innovation Staff Exchange) project that involves 10 international partners and is intended to strengthen an international thematic network. The project is motivated by the growing number of applications, from mobile phone calls to social media to land, sea and air surveillance systems, that produce massive amounts of spatio-temporal data about moving objects. The project aims to develop methods for constructing, managing and analyzing holistic trajectories, i.e., sequences of spatio-temporal points enriched with semantic information coming from heterogeneous data sources such as social media, Linked Open Data and knowledge bases. In these contexts, holistic trajectories make it possible, for example, to identify and monitor different types of tourist flows, define customized itineraries based on tourists' interests, acquire knowledge of fishing patterns to enforce fisheries management and conservation measures worldwide, identify the routes of migrants and detect the presence of suspicious boats.
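To make the notion of a trajectory enriched with semantic information more concrete, the following Python sketch shows one possible in-memory representation. It is only an illustration, not the project's data model: the class names, field names, aspect keys and sample values are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List


@dataclass
class TrajectoryPoint:
    """A spatio-temporal point plus free-form semantic aspects."""
    timestamp: datetime
    lat: float
    lon: float
    # Hypothetical aspect keys, e.g. "activity" or "weather".
    aspects: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Trajectory:
    """A time-ordered sequence of enriched points for one moving object."""
    object_id: str
    points: List[TrajectoryPoint] = field(default_factory=list)

    def add(self, point: TrajectoryPoint) -> None:
        """Insert a point and keep the sequence ordered by time."""
        self.points.append(point)
        self.points.sort(key=lambda p: p.timestamp)

    def aspect_values(self, name: str) -> List[Any]:
        """Return the values of one aspect, in time order, where present."""
        return [p.aspects[name] for p in self.points if name in p.aspects]


# Invented sample data for a fishing vessel, added out of order on purpose.
vessel = Trajectory("vessel-001")
vessel.add(TrajectoryPoint(datetime(2022, 3, 1, 8, 0), 45.20, 12.60,
                           {"activity": "fishing", "weather": "calm"}))
vessel.add(TrajectoryPoint(datetime(2022, 3, 1, 6, 0), 45.43, 12.33,
                           {"activity": "leaving port"}))
print(vessel.aspect_values("activity"))
```

Keeping the aspects in a dictionary lets each point carry a different set of annotations, which suits data merged from heterogeneous sources; a typed schema per aspect would be a reasonable alternative when the sources are known in advance.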

Website: http://www.master-project-h2020.eu/

Last update: 17/04/2024