Data Mining and Computational Statistics

A.Y. 2020/2021
Max ECTS: 9
Overall hours: 80
SSD: SECS-S/01
Language: English
Learning objectives
This course introduces basic techniques of Data Mining and Computational Statistics and their applications in finance and economics, within the more general framework of data science. Students will develop programming skills using the R software in the Data Mining part and the OpenBUGS software for Bayesian Markov chain Monte Carlo random variable generation. Students will learn to study Data Mining and Computational Statistics topics independently and will be able to solve practical problems in economic and financial data analysis.
Expected learning outcomes
At the end of the course students will be able to apply machine learning techniques and algorithms in economic and financial applications. Specifically, students will be familiar with supervised and unsupervised models. In the supervised framework, students will be able to fit advanced regression models such as ridge and lasso regression, and to use classification techniques such as the Bayes classifier, the K-NN classifier and the logistic model. In the unsupervised framework, students will become familiar with dimensionality reduction techniques and cluster analysis. More sophisticated techniques such as decision tree-based classification will also be presented. In Computational Statistics, students will acquire knowledge of resampling techniques, random number and random variable generation, and numerical integration.
Course syllabus and organization

Single session

Responsible
Lesson period: Third trimester
If required by the health emergency, lectures and exercise sessions will take place remotely and synchronously on the Microsoft Teams platform. Recordings and any further teaching material will be available on ARIEL.

The exam consists of a 30-minute written test with 15 multiple-choice questions, conducted remotely through the exam.net platform if necessary, plus a report (5-8 pages) on a specific topic assigned during the course, to be delivered to the professor by e-mail. For attending students the report is group work (max 5 people); for non-attending students it is an individual report. The test determines 2/3 of the mark and the report 1/3. The professor may, at their discretion, ask questions about the submitted report.
Course syllabus
(0) Introduction to R software.
(i) Review of likelihood inference.
(ii) Introduction to data mining and difference between observational and experimental data.
(iii) Exploratory data analysis and visualization.
(iv) Supervised vs. unsupervised methods: introduction.
(v) Parametric vs. nonparametric methods: introduction.
(vi) Multiple linear regression.
(vii) The importance of statistical planning and the ANOVA technique.
(viii) An introduction to optimal design theory: the information matrix; criterion functions; exact and continuous designs.
(ix) Some algorithms for computing optimal designs.
(x) Classification methods: logistic regression, linear discriminant analysis and the K-nearest neighbors method. The Bayes classifier.
(xi) Resampling methods: cross validation and the bootstrap.
(xii) Shrinkage methods: Ridge regression, the Lasso and other shrinkage methods.
(xiii) Regression splines and local regression.
(xiv) Tree-based methods: random forest, bagging and boosting. Support vector machines (main ideas).
(xv) Unsupervised learning: PCA.
(xvi) Clustering.

If time permits:
(xvii) Computer-intensive statistical methods: overview.
(xviii) Pseudo-random number and variable generation.
(xix) Monte Carlo methods for numerical integration.
(xx) Simulation-based inference.
Prerequisites for admission
A good knowledge of basic statistics is required, together with basic mathematics, especially linear algebra. Some basic computer programming knowledge is welcome but not essential.
Teaching methods
Lectures will be given on the blackboard, since the topics are quite demanding for the students. We will try to work interactively with students by encouraging their oral and written contributions.
In addition to the lectures, there will be 20 hours of exercises in which the concepts presented in class are applied using the R software.
Teaching Resources
Main textbooks:
(i) An Introduction to Statistical Learning, with applications in R (2013) by G. James, D. Witten, T. Hastie, R. Tibshirani, Springer.
(ii) Optimum Experimental Designs (1992) by A.C. Atkinson and A.N. Donev, Clarendon Press.
(iii) Introducing Monte Carlo Statistical Methods with R (2010) by C.P. Robert, G. Casella, Springer.
Suggested reading for insights into some topics in main textbooks:
(i) The Elements of Statistical Learning, 2nd edition (2009) by T. Hastie, R. Tibshirani, J. Friedman, Springer.
(ii) Machine Learning: a Probabilistic Perspective (2012) by K.P. Murphy, The MIT Press.
(iii) Monte Carlo Statistical Methods (2004) by C.P. Robert, G. Casella, Springer.
Further reading will be suggested during the course.
Assessment methods and Criteria
Two thirds of the mark is determined by a 30-minute written test with 15 multiple-choice questions, and one third by a report (5-8 pages) on a specific topic assigned during the course, to be delivered to the professor by e-mail. For attending students the report is group work (max 5 people); for non-attending students it is an individual report. The professor may, at their discretion, ask questions about the submitted report.
SECS-S/01 - STATISTICS - University credits: 9
Practicals: 40 hours
Lessons: 40 hours
Professor: Tommasi Chiara