We have entered the Big Data era. The explosion of available data across a wide range of application domains raises new challenges and opportunities in many disciplines, from science and engineering to business and society in general. A major challenge is how to take advantage of this unprecedented scale of data to gain insights and knowledge that improve the quality of the services offered. This is where Data Science comes in, capitalizing on techniques and methodologies from data engineering (acquisition, storage, indexing, retrieval, pre-processing, quality assurance, validation), exploration (statistical profiling, visualization) and machine learning (identifying patterns, correlations, groupings, modeling, etc.). This lifecycle is universal and spans all application domains.

The Data Science and Mining - Introduction to Machine Learning class will cover the following aspects:

  • The Machine Learning Pipeline
  • Data Preprocessing and Exploration
  • Feature Selection/Engineering & Dimensionality reduction
  • Supervised Learning 
  • Unsupervised Learning
  • Web Mining: recommendations, collaborative filtering, opinion/sentiment analysis, web advertising & algorithms.
  • Learning from graphs: ranking in graphs, ranked lists comparison, learning to rank, community detection and graph clustering, applications


1. The course will take place on Monday afternoons from 19/09 until 28/11. It runs for 9 weeks and is divided into nine 4-hour sessions (plus a final exam). Each session consists of a lecture (14:00 - 16:00) in Amphi Faurre followed by a lab (16:15 - 18:15). The lab will be split between three rooms (Amphi Grégory, Amphi Painlevé, Amphi Sauvy); the split will be based on the initial letter of each student's surname.

2. Due to the high number of enrolled students, labs will not take place in rooms equipped with workstations. Students are therefore expected to come to class with their own laptops (preferably with a Unix environment such as Linux or Mac OS X, for compatibility reasons). As for software, we will be using Python, among others (to be installed locally on the laptops).
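As a quick sanity check of a local Python installation before the first lab, something along the following lines may help. This is only a sketch: the package names below are an assumption for illustration, since this announcement does not specify which libraries the labs require, and the helper name `check_environment` is hypothetical.

```python
import importlib.util
import sys

# Assumed package list, for illustration only: the course announcement
# does not name the libraries that will actually be used in the labs.
ASSUMED_PACKAGES = ["numpy", "scipy", "sklearn", "matplotlib"]


def check_environment(packages=ASSUMED_PACKAGES):
    """Return a dict mapping each top-level package name to True if it is importable."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


if __name__ == "__main__":
    # Print the interpreter version, then one status line per package.
    print("Python version:", sys.version.split()[0])
    for name, found in check_environment().items():
        print(f"{name:12s} {'ok' if found else 'MISSING'}")
```

Running the script prints the interpreter version followed by one line per package; any package reported as MISSING can then be installed before coming to class.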

We will be using the e-learning platform Moodle to share the course materials (slides and lab statements) and to collect assignments. It is therefore imperative that students enroll in the course; enrollment is available once logged in to Moodle with their @polytechnique.edu account, using the enrollment key specified in the welcome email they received. Additionally, the forum should be used to communicate with the staff, following the guidelines.

The Data Science and Machine Learning team 2016.

Detailed syllabus of the course*

*Minor changes may apply as the course evolves.

Machine Learning Pipeline

-       Task and Metrics

-       Models and parameter estimation (ML, Optimization, Penalization)

-       Case study: logistic regression, penalization

Data Preprocessing and Exploration 

-       Distance/similarity measures

-       Data normalization / standardization / cleaning / missing values

-       Dimensionality reduction (SVD, PCA, MDS)

Supervised Learning 

-       Generative vs. non-generative

-       Naive Bayes vs. k-NN, perceptron

-       Logistic regression

-       Trees (Decision, Extra trees)

Feature Selection/Engineering & Dimensionality reduction 

-       Feature Selection, feature engineering, wrapper methods

-       Dimensionality reduction (Spectral methods: PCA, MDS, and applications)

-       Linear Discriminant Analysis

-       Non-linear dimensionality reduction

Overfitting & Regularization

-       Model validation, Resampling methods

-       Overfitting, penalization, bias-variance tradeoff, Cross Validation (CV), Regularization (e.g., Lasso, Ridge, SVM)

Supervised Learning  

-       Classification and regression trees

-       Bagging, Boosting

-       Ensemble methods (AdaBoost, Random Forest)

Unsupervised Learning 

-      Principles of Clustering, K-means, Hierarchical clustering, SOMs, Association Rules

Bayesian Learning 

-      Introduction to Bayesian learning (EM)