Modern data analysis methods are expected to handle massive amounts of high-dimensional data collected in a variety of domains. The high dimensionality of such data introduces numerous challenges, typically referred to as the “curse of dimensionality”, which render traditional statistical learning approaches impractical or ineffective. To cope with these challenges, significant effort has been devoted to developing geometric data analysis approaches that model and capture the intrinsic geometry of the processed data, rather than directly modeling their distribution. In this course we will explore such approaches and provide an analytical study of the models and algorithms they use. We will start by considering supervised learning and distinguish classifiers that are based on geometric principles from statistical learning approaches, such as Bayesian classification. Next, we will consider the unsupervised learning task of clustering data and contrast density-based clustering with partitional and hierarchical clustering methods that rely on metric spaces or graph constructions. Finally, we will consider more fundamental tasks in intrinsic representation learning, with particular focus on dimensionality reduction and manifold learning methods, such as Isomap, Diffusion Maps, LLE, and tSNE. Time permitting, the course will also include guest talks discussing recent developments in related research areas.
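
As a rough illustration of the “curse of dimensionality” mentioned above, the following Python sketch measures how the gap between the nearest and farthest neighbor of a query point shrinks, relative to the nearest distance, as the dimension grows. It is not part of the course materials: the function name `distance_contrast`, the uniform sampling in the unit cube, the sample size, and the fixed seed are all assumptions chosen for illustration.

```python
import math
import random


def distance_contrast(dim, n_points=200, seed=0):
    """Relative contrast (farthest - nearest) / nearest for one random query.

    Points are drawn uniformly from the unit cube [0, 1]^dim (an
    illustrative assumption, not a model of any real data set).
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        # Euclidean distance between the query and the sampled point
        dists.append(math.dist(query, point))
    nearest, farthest = min(dists), max(dists)
    return (farthest - nearest) / nearest


if __name__ == "__main__":
    for dim in (2, 10, 100, 1000):
        print(f"dim={dim:5d}  relative contrast={distance_contrast(dim):.3f}")
```

Under these assumptions the relative contrast typically drops sharply as the dimension increases, which is one reason methods that rely on raw distances need care with high-dimensional data.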

The course will be suitable for CS, statistics, and applied math students interested in data science and machine learning.

- Introduction to Data Mining, Pang-Ning Tan, Michael Steinbach, Vipin Kumar, 2005.
- Data Mining: Concepts and Techniques, 3rd Ed., Jiawei Han, Micheline Kamber, Jian Pei, 2011.

- Topic 01 - Introduction
- Topic 02 - Data Exploration & Visualization (incl. summary statistics & data types)
- Topic 03 - Bayesian Classification (incl. decision boundaries, Bayes error rate & Bayesian belief networks)
- Topic 04 - Decision Trees & Random Forests (incl. random projections)
- Topic 05 - Support Vector Machines (incl. the "kernel trick" & Mercer kernels)
- Topic 06 - Principal Component Analysis (incl. preprocessing & dimensionality reduction)
- Topic 07 - Density-based Clustering [no slides yet]
- Topic 08 - Partitional Clustering (incl. lazy learners, kNN, Voronoi partitions) [no slides yet]
- Topic 09 - Hierarchical Clustering (incl. large-scale & graph partitioning) [no slides yet]
- Topic 10 - Multidimensional Scaling (incl. spectral theorem & distance metrics) [no slides yet]
- Topic 11 - Manifold Learning (incl. Isomap & LLE) [no slides yet]
- Topic 12 - Diffusion Maps [no slides yet]

- Irina Rish (DIRO, Mila) -- Compressive sensing (2019-10-16)
- Guillaume Lajoie (DMS, Mila) -- Extracting low-dimensional dynamics in brain data (2019-10-30)
- Jian Tang (HEC, Mila) -- Large scale data visualization & tSNE (2019-11-20)
- Will Hamilton (McGill, Mila) -- Graph representations & graph neural networks (2019-11-20)

- 30% -- homework
- 45% -- final project report
- 25% -- final project presentation

- For the final project, students can work individually or form small groups (at most 3 team members).
- Students working in a group should designate a contact person for the group.
- Group members are expected to contribute equally to the project.
- Each group member will be expected to present their individual contribution after the final report is submitted.

- By Friday, Oct. 11, 23h59, each group should submit a project proposal (instructions will be announced/posted on StudiUM).
- Proposals are expected to span 2-3 pages and include at least the following sections:
  - Project description & goals;
  - Planned contributions of each team member;
  - Data / data sources to be used.

- Projects should involve multiple methods applied to data analysis tasks chosen by each team (subject to approval of the submitted proposal), demonstrating understanding of underlying principles learned in class.

- Problem Set I - due Oct. 03, 2019, 23h59 (via StudiUM).
  - Data: tweets.txt

- Problem Set II - due Nov. 08, 2019, 23h59 (via StudiUM).
  - Problem 2:
    - Data: simple_iris.mat, simple_nonlinear.mat
    - Code templates: script2.m, script2.py
  - Problem 3:
    - Code templates: script3.m, script3.py
  - Problem 4:
    - Data: leaf.mat, leaf_key.txt
    - Code templates: script4.m, script4.py
  - Problem 5:
    - Data: mixed.csv, target.mat
    - Code templates: script5.m, script5.py

- Problem Set III - planned to be posted in November.
  - Data: TBD