Syllabus : Modern machine learning heavily relies on optimization tools, typically to minimize the so called loss functions on training sets. The objective of this course is to cover the necessary theoretical results of convex optimization as well as the computational aspects. This course contains a fair amount of programming as all algorithms presented will be implemented and tested on real data. At the end of the course, students shall be able to decide what algorithm is the most adapted to the machine learning problem given the size of the data (number of samples, sparsity, dimension of each observation).         

Evaluation:   

Labs. 2-3 Labs with Jupyter graded (30% of the final grade).

Project. Implementation of solvers for a machine learning model. 30% of final grade.

Exam. 3h Exam (40% of the final grade). 

Detailed Program

1 Lecture [2.00 hours lecture + 1.50 hours exe, Robert] Foundations of convex optimization: gradient, sub-differential, strong convexity, conditioning. Examples with linear regression and logistic regression for which we compute gradients, Hessian, Lipchitz constants, etc. -Exercise list : Convexity and smoothness, Ridge regression and gradient descent

2 Lecture [2.00 hours lecture + 1.50 hours exe, Robert] First order algorithms: gradient descent, proximal operator, proximal gradient descent, convergence proofs in the smooth and smooth + strongly convex case. Give exercises. -Exercise list : Proximal Operator

3 Lab [3.50 hours Lab Robert + Alex + TAs] Lab (grad + prox grad) in jupyter notebooks. Lab to be graded. -Lab1.ipynb

4 Lecture [2.0 hours lecture + 1.50 hours exe, Robert] Stochastic algorithms, SGD, with and without moment convex, strongly convex, with proofs. -Exercise list : SGD and CD for ridge regression

5 Lecture [1.50 hours lecture + 2.0 hours Lab, Robert] Stochastic variance reduced methods (SAGA, SVRG) with numerical tricks: lazy updating, sparse tricks. - Lab SGD + variants (not graded) -Exercise list : Variance reduction

6 Lab [3.50 hours Lab Robert + Alex + TAs] Lab on all methods seen so far (GD, prox GD, Fast prox GD, SGD, SAG, SVRG, SDCA). Lab to be graded -Lab2.ipynb

7 Lecture [2 hour + 1.50 hours Lab, Robert] Online variance reduction and scale invariant methods (Natural gradient descent, ADagrad, Adam, randomized Newton)

8 Lecture [3.5 hours, Alex] Part I: Coordinate descent algorithms + Lab: coordinate descent implementation on logistic and ridge + Project introduction

9 Lecture [1 hour + 2.5 hours Lab, Alex] Solvers for quadratic functions (SVD, woodbury matrix inversion lemma), conjuguate gradient (dense vs sparse data). Line search methods.

10 Lecture [2 hours + 1.5 hours Lab, Alex] Second order methods: Newton, Quasi-Newton (BFGS, L-BFGS). Illustrate convergence problem of Newton, initialisation, construction of BFGS et L-BFGS.

11 Lecture [3.5 hours, Alex] non-convex and beyond convexity (adaptive Lasso / SCAD), Frank-Wolfe.

Additional Books and resources

Book 1. Boyd & Vandenberghe: Convex Optimization. Chapters 2, 3 and 4 for a revision on convexity and chapter 9 for a revision on unconstrained optimization. Freely available here

Book 2. Shalev-Shwartz & Ben-David: Understanding Machine Learning, from Theory to Algorithms. Chapters 1 and 2 for a frequentist introduction to Machine Learning. Freely available here.

Book 3. Bubeck: Convex Optimization: Algorithms and Complexity. Chapter 6 for additional proofs for stochastic gradient methods including SGD and SVRG. Freely available here.

Paper 1. Amir Beck and Marc Teboulle (2009), SIAM J. Imaging Sciences, A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. Freely available here.

Paper 2. RMG et all (2019), Proceedings of Machine Learning Research, Volume 97, SGD: general analysis and improved rates freely available here.