Linear Algebra and Learning from Data: A Deep Dive into Mathematical Foundations

This article explores "Linear Algebra and Learning from Data," a textbook designed to bridge the gap between linear algebra and the mathematical underpinnings of data science and deep learning. The book, authored by MIT mathematics professor Dr. Gilbert Strang, aims to provide readers with the necessary mathematical architecture to understand machine learning.

Introduction: The Importance of Mathematical Foundations

While leading-edge machine learning applications evolve rapidly, the underlying mathematical concepts remain constant. Establishing a strong foundation in these concepts is crucial for data scientists. This book serves as a valuable resource for data scientists seeking to solidify their understanding of the mathematical principles driving machine learning.

Core Concepts of Linear Algebra

Part I of the book highlights the fundamental elements of linear algebra, including matrix multiplication, eigenvalues and eigenvectors, and singular value decomposition (SVD). These topics are essential for understanding the mechanics of machine learning algorithms. It is important to note that this section provides an overview, and readers are expected to have prior exposure to linear algebra concepts such as vector spaces, subspaces, independence, bases, and linear operators. Principal components are also covered in this part.
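As a small illustration of the SVD idea from Part I, here is a minimal sketch in NumPy (an assumption on our part; the book's own brief code example uses MATLAB). It factors a matrix, confirms the singular values come out ordered, and rebuilds the matrix from its factors:

```python
import numpy as np

# A small matrix whose structure the SVD will reveal.
A = np.array([[3.0, 0.0],
              [4.0, 5.0]])

# Full SVD: A = U @ diag(s) @ Vt, singular values sorted in descending order.
U, s, Vt = np.linalg.svd(A)

# Multiplying the three factors back together recovers the original matrix.
A_rebuilt = U @ np.diag(s) @ Vt
```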

Computation with Large Matrices

Part II focuses on computation with large matrices, covering matrix factorization and iterative methods. It also delves into algorithms for approximating matrix problem solutions using randomization and projection.
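The randomization-and-projection idea can be sketched as follows (a NumPy illustration of the standard randomized range finder, not the book's exact code): multiply the large matrix by a few random vectors, orthonormalize the result, and use that small basis to approximate the whole matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tall matrix that is exactly rank 2 by construction.
A = rng.standard_normal((200, 2)) @ rng.standard_normal((2, 50))

# Randomized range finder: project A onto a few random directions,
# then orthonormalize to get an approximate basis Q for its column space.
k = 5
Omega = rng.standard_normal((50, k))
Q, _ = np.linalg.qr(A @ Omega)

# Projecting A onto that basis gives the approximation Q @ (Q.T @ A).
A_approx = Q @ (Q.T @ A)
error = np.linalg.norm(A - A_approx)
```

Because A has rank 2 and we used k = 5 random directions, the small basis captures the column space almost surely and the error is near machine precision.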

Low-Rank and Sparse Approximations

Part III explores low-rank and sparse approximation techniques, including methods such as LASSO and matrix completion algorithms. Often the goal is a low-rank approximation A = CR (column times row) of a large matrix of data, in order to see its most important part.


Special Matrices and Their Applications

Part IV is dedicated to special matrices and their applications in data and signal analysis. It covers a range of matrices, from discrete Fourier transforms to graph node-adjacency matrices used for clustering. The section includes a brief example using MATLAB code and a discussion of k-means clustering and its applications.
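To make the k-means discussion concrete, here is a minimal sketch of Lloyd's algorithm in NumPy (an assumption on our part; the book's brief example uses MATLAB). It alternates between assigning points to their nearest center and moving each center to the mean of its points:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two well-separated 2-D clusters of 50 points each.
X = np.vstack([rng.normal(0.0, 0.1, (50, 2)),
               rng.normal(5.0, 0.1, (50, 2))])

# Plain Lloyd's algorithm for k = 2, seeded with one point from each region.
centers = X[[0, 50]].copy()
for _ in range(10):
    # Assignment step: each point joins its nearest center.
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # Update step: each center moves to the mean of its assigned points.
    centers = np.array([X[labels == j].mean(axis=0) for j in range(2)])
```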

Probability and Statistics for Data Science

Part V provides a foundation in probability and statistics, covering topics such as mean and variance, probability distributions, covariance matrices, multivariate Gaussian distributions, and weighted least squares. The Central Limit Theorem is also discussed.
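Weighted least squares can be illustrated in a few lines (a NumPy sketch under our own setup, not the book's): minimize the weighted sum of squared residuals by solving the weighted normal equations.

```python
import numpy as np

rng = np.random.default_rng(3)

# Fit y = a + b*x by weighted least squares: minimize sum_i w_i * r_i^2.
x = np.linspace(0.0, 1.0, 20)
y = 2.0 + 3.0 * x                          # exact line, no noise
A = np.column_stack([np.ones_like(x), x])  # design matrix [1, x]
w = rng.uniform(0.5, 2.0, size=x.shape)    # per-sample weights

# Weighted normal equations with W = diag(w): (A^T W A) c = A^T W y.
W = np.diag(w)
coef = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
```

Since the data lie exactly on the line y = 2 + 3x, the solution recovers the coefficients (2, 3) for any positive weights.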

Optimization Techniques in Machine Learning

Part VI delves into optimization techniques, which are essential for many machine learning algorithms. The book examines linear programming, gradient descent, and stochastic gradient descent. It also defines the "argmin" expression. The big problem of optimization (the heart of the calculation) is to choose weights so that, in the digit-recognition example, the function assigns the correct output: 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9.
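Gradient descent and the argmin notation can be shown together in a tiny example (our illustration): for f(x) = (x - 3)^2, argmin f = 3, and repeatedly stepping against the gradient converges to it.

```python
# Minimize f(x) = (x - 3)^2 by gradient descent; argmin_x f(x) = 3.
def grad(x):
    # Derivative of f: f'(x) = 2(x - 3).
    return 2.0 * (x - 3.0)

x = 0.0       # starting point
lr = 0.1      # step size (learning rate)
for _ in range(200):
    x -= lr * grad(x)   # step downhill against the gradient
```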

Learning from Data: The Heart of Machine Learning

Part VII focuses on the mathematics of machine learning, covering deep neural networks, convolutional neural networks (CNNs), recurrent neural networks (RNNs), the backpropagation algorithm, the bias-variance tradeoff, and the use of hyperparameters. The chapter emphasizes the importance of the chain rule in calculus.

Linear and Non-Linear Learning Functions

The inputs are the samples v; the outputs are the computed classifications w = F(v). The simplest learning function would be linear: w = Av. The entries in the matrix A are the weights to be learned: not too difficult. Frequently the function also learns a bias vector b, so that F(v) = Av + b. This function is "affine". Affine functions can be quickly learned, but by themselves they are too simple. More exactly, linearity is a very limiting requirement. If MNIST used Roman numerals, then II might be halfway between I and III (as linearity demands). But what would be halfway between I and XIX? Certainly affine functions Av + b are not always sufficient. Nonlinearity could come from squaring the components of the input vector v. That step might help to separate a circle from a point inside it, which linear functions cannot do. But the construction of F moved toward "sigmoidal functions" with S-shaped graphs. It is remarkable that big progress came by inserting these standard nonlinear S-shaped functions between matrices A and B to produce A(S(Bv)). Eventually it was discovered that the smoothly curved logistic functions S could be replaced by the extremely simple ramp function now called ReLU(x) = max(0, x).
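The gain from inserting a nonlinearity between two matrices can be seen in a tiny example (our sketch, with hypothetical weights): one hidden ReLU layer computes |v1 - v2|, a function no single affine map Av + b can represent.

```python
import numpy as np

def relu(x):
    # The ramp function ReLU(x) = max(0, x), applied componentwise.
    return np.maximum(0.0, x)

# One hidden layer: F(v) = A2 @ relu(A1 @ v + b1) + b2.
A1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])
b1 = np.zeros(2)
A2 = np.array([[1.0, 1.0]])
b2 = np.zeros(1)

def F(v):
    return A2 @ relu(A1 @ v + b1) + b2

# The hidden layer holds max(0, v1 - v2) and max(0, v2 - v1);
# their sum is |v1 - v2|, which is not an affine function of v.
```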


Neural Nets and the Structure of F(v)

The functions that yield deep learning have the form F(v) = L(R(L(R(…(Lv))))). This is a composition of affine functions Lv = Av + b with nonlinear functions R which act on each component of the vector Lv. The matrices A and the bias vectors b are the weights in the learning function F. It is the A's and b's that must be learned from the training data, so that the outputs F(v) will be (nearly) correct. Then F can be applied to new samples from the same population. If the weights (A's and b's) are well chosen, the outputs F(v) from the unseen test data should be accurate. More layers in the function F will typically produce more accuracy in F(v). Properly speaking, F(x, v) depends on the input v and the weights x (all the A's and b's). The outputs v1 = ReLU(A1 v + b1) from the first step produce the first hidden layer in our neural net. The complete net starts with the input layer v and ends with the output layer w = F(v). The affine part Lk(vk-1) = Ak vk-1 + bk of each step uses the computed weights Ak and bk. All those weights together are chosen in the giant optimization of deep learning: choose the weights Ak and bk to minimize the total loss over all training samples. The total loss is the sum of the individual losses on each sample. The loss function for least squares has the familiar form ||F(v) - true output||^2. Often least squares is not the best loss function for deep learning.
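The composition F(v) = L(R(L(R(…(Lv))))) and the squared-error loss can be sketched directly (a NumPy illustration with hypothetical layer sizes and random initial weights, before any training):

```python
import numpy as np

rng = np.random.default_rng(4)

def relu(x):
    return np.maximum(0.0, x)

# The weights x = (A_k, b_k) for a 3-step net with layer sizes 4 -> 8 -> 8 -> 3.
sizes = [4, 8, 8, 3]
weights = [(rng.standard_normal((m, n)) * 0.1, np.zeros(m))
           for n, m in zip(sizes[:-1], sizes[1:])]

def F(v):
    # Composition of affine steps L_k(v) = A_k v + b_k with ReLU between
    # them; no nonlinearity after the final (output) step.
    for A, b in weights[:-1]:
        v = relu(A @ v + b)
    A, b = weights[-1]
    return A @ v + b

# Squared-error loss ||F(v) - true output||^2 for one training sample.
v = rng.standard_normal(4)
target = np.array([1.0, 0.0, 0.0])
loss = np.sum((F(v) - target) ** 2)
```

Training would adjust every A_k and b_k to shrink the sum of these losses over all samples; that gradient computation is exactly where the chain rule (backpropagation) enters.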

Prerequisites and Recommendations

The book is intended for readers with some prior experience in mathematics and data science. It is best used as a roadmap for further study, supplemented by additional resources such as papers, books, and videos. For example, while the book introduces Fourier transforms, readers may benefit from exploring the subject in greater depth using external learning materials.

Limitations

The book does not explicitly connect specific mathematics topics with parallel topics in data science. Additionally, the book lacks a bibliography.

Additional Resources

Strang provides video lectures on MIT OpenCourseWare for Math 18.06 and 18.065. An instructor's manual with solutions to the problem sets is also available.

