COMPUTATIONAL PHILOLOGY: DATA STRUCTURES AND ALGORITHMS
- Academic year
- 2022/2023
- Official course title
- COMPUTATIONAL PHILOLOGY: DATA STRUCTURES AND ALGORITHMS
- Course code
- FM0488 (AF:354986 AR:208518)
- Modality
- On campus classes
- ECTS credits
- 6
- Degree level
- Master's Degree Programme (DM270)
- Educational sector code
- L-LIN/01
- Period
- 3rd Term
- Course year
- 2
- Where
- VENEZIA
Contribution of the course to the overall degree programme goals
The main goals of this course are:
- to provide the students with the basic technical tools for the computational treatment of textual data
- to introduce the students to the fundamental linguistic annotation techniques and tools
- to strengthen the students' knowledge of the Python programming language as well as to introduce them to some of its NLP modules, among which Stanza and gensim
- to stimulate critical thinking and the ability to think outside the box
Expected learning outcomes
1. Knowledge and understanding
- familiarity with the Python programming language and with some of its NLP/text mining packages (Stanza, gensim)
- familiarity with the most commonly used techniques of (morphosyntactic) linguistic annotation
- learning of the basic techniques for the extraction of linguistic knowledge from corpora
- knowledge of the principal levels of linguistic annotation
- familiarity with the most commonly used techniques for the representation of structured information extracted from text
2. Applying knowledge and understanding
- knowledge of the features and limitations of the most common computational linguistics tools and approaches, so as to be able to pick the most appropriate solution for a given linguistic research issue
- use of Python for the implementation of scripts for the quantitative and computational analysis of text
- ability to advance and test original and sound hypotheses
3. Making judgements
- ability to implement self-development strategies to improve technical skills
- awareness of the technical and ethical issues connected to the automatic treatment of language
- ability to retrieve the most relevant literature and to use it critically
- ability to compare competing hypotheses
4. Communication skills
- ability to write a report describing the process, progress and results of an original piece of scientific research
- ability to interact with the other students and the professor
5. Learning skills
- ability to learn novel scripting languages (among which R, Perl, MATLAB, JavaScript)
- ability to acquire technical knowledge pertaining to issues only indirectly linked to the automatic treatment of language (e.g. statistical analysis)
- ability to learn novel technical tools for the automatic treatment of language (e.g. annotation tools)
Prerequisites
Contents
week 2: Automatic corpus annotation
week 3: Distributional semantics
week 4: Topic modeling
week 5: Stylometry & authorship attribution
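As a rough illustration of the idea behind the distributional semantics session, the sketch below builds word vectors from co-occurrence counts and compares them with cosine similarity. It uses only the Python standard library rather than Stanza or gensim; the function names, the window size and the toy corpus are assumptions made for this example and are not taken from the course materials.

```python
from collections import Counter, defaultdict
from math import sqrt


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, target in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[target][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus, invented for illustration only.
toy_corpus = [
    "the cat chased the mouse".split(),
    "the dog chased the cat".split(),
    "the dog ate the bone".split(),
]

vectors = cooccurrence_vectors(toy_corpus, window=2)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_bone = cosine(vectors["cat"], vectors["bone"])
```

Raw counts over three sentences are far too sparse to say anything about meaning; realistic models rely on large corpora and on weighting or dimensionality reduction, as discussed in the reference texts.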
Reference texts
- M. Baroni (2009) *Distributions in text*. In A. Lüdeling and M. Kytö (eds.), Corpus linguistics: An international handbook, Vol. 2, Mouton de Gruyter: 803-821. Available online at: http://sslmit.unibo.it/~baroni/publications/hsk_39_dist_rev2.pdf
- D.M. Blei (2012) *Probabilistic topic models*. Communications of the ACM, 55 (4): 77-84. Available online at: http://www.cs.columbia.edu/~blei/papers/Blei2012.pdf
- M. Davies (2015) Corpora: An introduction. In D. Biber and R. Reppen (eds.), The Cambridge Handbook of English Corpus Linguistics, Cambridge University Press: 11-31.
- M. C. de Marneffe and J. Nivre (2019) Dependency Grammar. Annual Review of Linguistics 5: 197-218.
- S.T. Gries and A. L. Berez (2017) Linguistic Annotation in/for Corpus Linguistics. In N. Ide and J. Pustejovsky (eds.), Handbook of Linguistic Annotation, Springer: 379-409. Available online at: http://www.stgries.info/research/2017_STG-ALB_LingAnnotCorpLing_HbOfLingAnnot.pdf
- M. Hammond (2020) Python for Linguists. Cambridge University Press
- D. Hovy (2021) Text Analysis in Python for Social Scientists: Discovery and Exploration. Cambridge University Press
- D. Jurafsky and J. H. Martin (2020) Speech and Language Processing, 3rd edition, DRAFT (ch. 4, 6). Available online at: https://web.stanford.edu/~jurafsky/slp3/
- A. Lenci (2018) Distributional Models of Word Meaning, Annual Review of Linguistics, 4: 151-171.
- T. Neal, K. Sundararajan, A. Fatima, Y. Yan, Y. Xiang and D. Woodard (2017) Surveying stylometry techniques and applications. ACM Computing Surveys (CSUR), 50 (6): 1-36. Available online at: https://dl.acm.org/doi/abs/10.1145/3132039
Assessment methods
The project will be graded as follows:
- quality of the code: 40% of the final grade
- knowledge of the relevant literature and of the state-of-the-art: 20% of the final grade
- quality of the report: 30% of the final grade
- one-on-one discussion with the instructor: 10% of the final grade
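The weighting above can be read as a weighted average of four component scores. The sketch below shows the arithmetic; the component names, the function name and the example scores on a 0-30 scale are hypothetical, since the syllabus states only the percentages.

```python
# Weights taken from the assessment breakdown above.
WEIGHTS = {
    "code": 0.40,
    "literature": 0.20,
    "report": 0.30,
    "discussion": 0.10,
}


def final_grade(scores):
    """Weighted average of component scores, all expressed on the same scale."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the four graded components")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical component scores; the 0-30 scale is an assumption.
example = {"code": 28, "literature": 26, "report": 27, "discussion": 30}
grade = final_grade(example)
```

With these made-up scores the result is about 27.5, that is 11.2 + 5.2 + 8.1 + 3.0.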
Teaching methods
- discussion of some programming exercises from the past homework
- overview of the session key concepts and principles
- work on the programming exercises in the relevant Jupyter notebook available on [the university e-learning platform](https://moodle.unive.it/)