Statistical Learning and Big Data

Course at the University of Vienna, October 2017
Home page at http://www.tau.ac.il/~saharon/StatLearn-Vienna.html

Lecturer:	Saharon Rosset
	`saharon@post.tau.ac.il`
Textbook:	Elements of Statistical Learning by Hastie, Tibshirani & Friedman

Announcements and handouts

Homework submission: You may submit at least 7 of the 11 homework problems by Friday, 20 October. Please send your submission by email to both me and Dominique Sundt. To earn an extra 10 points in your grade (equivalent to solving one more problem correctly in the final), the problems you submit must be done well: clear and correct.
Note: You may discuss the problems and solutions among yourselves, but each student who wants to submit the HW must prepare the actual submission alone.

Homework problems: 0 (warmup), 1, 2, 3 (uses the code in kNN-prob3.r), 4, 5+6 (weekend), 7, 8, 9 (uses the code in boost-Prob9.r), 10

(2 October) Slides from class 1 and code from class.

(5 October) Slides on bias-variance decomposition of linear regression.
Code for running regularized linear regression variants and PCA on Netflix data.

(10 October) Code for running classification methods on Netflix data.
Code for running tree methods on Netflix data.

Syllabus

The goal of this course is to gain familiarity with the basic ideas and methodologies of statistical (machine) learning. The focus is on supervised learning and predictive modeling, i.e., fitting y ≈ f̂(x), in regression and classification.
We will start by thinking about some of the simpler, but still highly effective methods, like nearest neighbors and linear regression, and gradually learn about more complex and "modern" methods and their close relationships with the simpler ones.
As time permits, we will also cover one or more industrial "case studies" where we track the process from problem definition, through development of appropriate methodology and its implementation, to deployment of the solution and examination of its success in practice.
The homework and exam will combine hands-on programming and modeling with theoretical analysis.

Topics list (we will cover some of these, as time permits):

Introduction (text chap. 1,2): Local vs. global modeling; Overview of statistical considerations: Curse of dimensionality, bias-variance tradeoff; Selection of loss functions; Basis expansions and kernels
Linear methods for regression and their extensions (text chap. 3): Regularization, shrinkage and principal components regression; Quantile regression
Linear methods for classification (text chap. 4): Linear discriminant analysis; Logistic regression; Linear support vector machines (SVM)
Classification and regression trees (text chap. 9.2)
Model assessment and selection (text chap. 7): Bias-variance decomposition; In-sample error estimates, including C_p and BIC; Cross validation; Bootstrap methods
Basis expansions, regularization and kernel methods (text chap. 5,6): Splines and polynomials; Reproducing kernel Hilbert spaces and non-linear SVM
Committee methods in embedded spaces (material from chaps 8-10): Random Forest and boosting
Deep learning and its relation to statistical learning
Learning with sparsity: Lasso, marginal modeling etc.
Case studies: Customer wallet estimation; Netflix prize competition; Testing on public databases
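
The contrast between local and global modeling in the topics above can be illustrated with a minimal sketch. It is not taken from the course code (which is in R); it uses only the Python standard library, and the function names (`knn_predict`, `fit_line`, `simulate`), the simulated model y = 2x + 1 + noise, the sample sizes and the values of k are all assumptions chosen for illustration.

```python
import random


def knn_predict(x_train, y_train, x0, k):
    """Local fit: average the responses of the k training points nearest to x0."""
    nearest = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x0))[:k]
    return sum(y_train[i] for i in nearest) / k


def fit_line(x_train, y_train):
    """Global fit: least-squares (intercept, slope) for a single predictor."""
    n = len(x_train)
    x_bar = sum(x_train) / n
    y_bar = sum(y_train) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(x_train, y_train))
    sxx = sum((x - x_bar) ** 2 for x in x_train)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def mse(y_true, y_pred):
    """Mean squared prediction error."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def simulate(n, rng, noise_sd=0.5):
    """Synthetic data from y = 2x + 1 + Gaussian noise (an illustrative choice)."""
    x = [rng.uniform(0.0, 1.0) for _ in range(n)]
    y = [2.0 * xi + 1.0 + rng.gauss(0.0, noise_sd) for xi in x]
    return x, y


rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
x_tr, y_tr = simulate(200, rng)  # training sample
x_te, y_te = simulate(200, rng)  # independent test sample

intercept, slope = fit_line(x_tr, y_tr)
lin_pred = [intercept + slope * x for x in x_te]
print("linear regression test MSE:", round(mse(y_te, lin_pred), 3))

for k in (1, 10, 50):
    knn_pred = [knn_predict(x_tr, y_tr, x, k) for x in x_te]
    print(f"{k}-NN test MSE:", round(mse(y_te, knn_pred), 3))
```

Because the simulated truth is exactly linear, one would expect the global linear fit to do well here and 1-NN to suffer from high variance; with a non-linear truth the ranking could change, which is the bias-variance tradeoff listed in the introduction.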

Prerequisites

Basic knowledge of mathematical foundations: Calculus; Linear Algebra; Geometry
Undergraduate courses in: Probability; Theoretical Statistics
Statistical programming experience in R is not a prerequisite, but an advantage

Books and resources

Textbook:
Elements of Statistical Learning by Hastie, Tibshirani & Friedman
Book home page (including downloadable PDF of the book, data and errata)

Other recommended books:
Computer Age Statistical Inference by Efron and Hastie
Modern Applied Statistics with Splus by Venables and Ripley
Neural Networks for Pattern Recognition by Bishop
(Several other books on Pattern Recognition contain similar material)
All of Statistics and All of Nonparametric Statistics by Wasserman

Online Resources:
Data Mining and Statistics by Jerry Friedman
Statistical Modeling: The Two Cultures by the late, great Leo Breiman
Course on Machine Learning from Stanford, on Coursera.
The Netflix Prize competition is now over, but will still play a substantial role in our course.

Course work and grading (updated!)

The grading will be based primarily on a final exam that will be held on the last day of the course (Friday 13 October) in class. It will be a multiple choice exam, and will cover the material we study in class.

In parallel, a homework problem will be given every day. You are strongly encouraged to work on the problems in real time, as a means of integrating the material we discuss in class.

After the course ends (probably the week after), there will be an option to submit the homework problems and get a bonus in the course grade.

Computing

The course will require use of statistical modeling software. It is strongly recommended to use R (freely available for PC/Unix/Mac) or its commercial kin Splus.
The R Project website also contains extensive documentation.
A basic "getting you started in R" tutorial, using the Boston Housing Data (thanks to Giles Hooker).
Modern Applied Statistics with Splus by Venables and Ripley is an excellent source for statistical computing help for R/Splus.

File translated from TeX by TtH, version 4.10, on 14 Oct 2017, 09:01.