Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering and communicating useful information, informing conclusions, and supporting decision-making. It involves many overlapping areas of study, such as descriptive statistics (concerned with summarizing information in a sample), statistical inference (concerned with inferences about a population based on properties of a sample), machine learning (concerned with performing tasks using algorithms that rely on patterns and inference), data mining (sometimes considered a subfield of machine learning and concerned with discovering patterns in large data sets), information visualization (concerned with visual representations of abstract data that reinforce human cognition), and several other areas. See this interesting review for comments about the related term data science.
Geometry is concerned with questions of shape, size, relative position of figures, isometry, and the properties of space. It underlies much of data analysis, as can be seen from textbooks such as those by [Kendall], [Le Roux], [Kirby], [Tossdorff], [Hartmann], [Outot], [Tierny], [Edelsbrunner], [Patrangenaru], [Biau], [Wichura], [Dryden].
The recent online textbook Mathematical Foundations for Data Analysis by Jeff M. Phillips again emphasizes the importance of a geometric understanding of techniques when applying them to data analysis.
This module focuses on some geometric methods used in data analysis. It covers the geometric and algorithmic aspects of these methods, as well as their implementation as Python code on Linux computers, and their application to a range of different types of data.
The first half of the course emphasizes geometric aspects of classical techniques such as least squares fitting, principal component analysis, hierarchical clustering, nearest neighbour searching, and the Johnson–Lindenstrauss Theorem for dimensionality reduction.
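To give a flavour of what a geometric view of a classical technique means, here is a minimal sketch of least squares fitting of a line in plain Python. It is an illustration only, not code from the module: the data values and the helper names `fit_line` and `dot` are assumptions made for this example. The geometric point is that the fitted values are the orthogonal projection of the observations onto the subspace spanned by the constant vector and the vector of x-values, so the residual vector is orthogonal to both.

```python
def fit_line(xs, ys):
    """Least squares fit of y = intercept + slope * x (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products about the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


# Illustrative data (hypothetical values, roughly on a line of slope 2).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]

intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# The residual vector is orthogonal to both spanning vectors of the
# model subspace, so both inner products are zero up to rounding error.
ones = [1.0] * len(xs)
print("intercept, slope:", intercept, slope)
print("residual . ones :", dot(residuals, ones))
print("residual . xs   :", dot(residuals, xs))
```

Both inner products print as values very close to zero, which is the orthogonality condition expressed by the normal equations.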
The second half of the course covers more recent techniques, developed over the last two or three decades, and emphasizes topological as well as geometric aspects. It makes use of R interfaces to Python Mapper and to efficient C++ routines for persistent homology.
Part I = CS4102: Classical Techniques (5 ECTS, first 24 lectures)
- Least Squares Fitting
- Principal Component Analysis
- Hierarchical Clustering and Persistence
- Nearest Neighbours and the Johnson–Lindenstrauss Theorem
Part II = CS4102: Topological Data Analysis (5 ECTS, second 24 lectures)
- Topological Preliminaries
- Mapper Clustering
- Persistent Homology