Machine Learning is a Subfield of Statistics

April 29, 2021 • John Vandivier

I have a background in statistics, with an applied emphasis in political science and econometrics. I am also a programmer, but I am just beginning to get serious about machine learning. Before and after looking more deeply, it seems to me that machine learning (ML) is simply a special subfield of statistics.

When looking into "statistics vs machine learning," here are some fails that I found:

The Actual Difference Between Statistics and Machine Learning
1. "Machine learning models are designed to make the most accurate predictions possible. Statistical models are designed for inference about the relationships between variables."
  1. The statement on ML seems fair enough, but statistical models are used for more than inference.
  2. Moreover, how do we decide whether an ML model is accurate? We have to use statistics, such as goodness-of-fit measures.
2. "...we do not know how well the model will perform until we ‘test’ this data on additional data that was not present during training, called the test set"
  1. This is a general problem in applied statistics. There is nothing strictly special about out-of-sample validation in ML. Sure, it can be granted that emphasis, jargon, and norms of practice are different but nbd.
3. "Likewise, machine learning models provide various degrees of interpretability, from the highly interpretable lasso regression to impenetrable neural networks..."
  1. My dude, Lasso Regression is applied stats. It did not originate in ML.
  2. This actually highlights the intersectional point: many statistical models are re-implemented in ML. This alone shows there is at least some intersection, but I will argue further that ML is entirely a subfield.
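
Points 1.2 and 2.1 above can be sketched in a few lines of Python (standard library only). This is a minimal illustration on synthetic data, not code from the quoted articles; `fit_line` and `r_squared` are hypothetical helper names, and the true line y = 1 + 2x is an assumption made up for the demo. The model is fit on a training split and then scored on held-out data with R-squared, which is an ordinary goodness-of-fit statistic.

```python
import random


def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(xs, ys, intercept, slope):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Synthetic data (assumption for the demo): y = 1 + 2x plus Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [1 + 2 * x + rng.gauss(0, 1) for x in xs]

# Hold out the last 50 points; the model never sees them while fitting.
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

intercept, slope = fit_line(train_x, train_y)
train_r2 = r_squared(train_x, train_y, intercept, slope)
test_r2 = r_squared(test_x, test_y, intercept, slope)

print(f"slope={slope:.3f} intercept={intercept:.3f}")
print(f"in-sample R^2={train_r2:.3f} out-of-sample R^2={test_r2:.3f}")
```

Whether you call the held-out split a "test set" or "out-of-sample validation," the score being reported is the same statistic.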
Machine Learning vs Statistical Modeling
1. "Machine learning is an algorithm"
  1. Sure, but basically all our statistical models are algorithms too.
  2. Still, somewhat interesting: There are some statistics that are not algorithms, e.g. the average, but there is no such thing as a non-algorithmic ML model.
  3. Which algorithms are used in ML, specifically? Refer back to 1.3 above and see that somewhere between very many and all ML algorithms are statistical algorithms.
2. "Machine Learning uses less assumptions"
  1. #fail - your ML algo doesn't use fewer assumptions than I use when I compute an average.

Now, let's come to an explainer that I actually found useful, from a PhD ML researcher:

https://www.youtube.com/watch?v=oUMFObEcQrQ

He emphasizes that prediction and application are the main focus of ML, and he goes through some interesting facts that I didn't know. For example, he says "neural networks...are essentially just extensions of logistic regressions on top of each other."
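
That quote can be illustrated with a rough sketch, assuming sigmoid activations (the claim is a simplification and does not describe every architecture). The names `logistic_unit` and `tiny_network` and all the weights below are hypothetical and hand-picked for the demo, not learned from data. Each hidden unit is a logistic regression on the inputs, and the output unit is a logistic regression on the hidden units' outputs.

```python
import math


def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_unit(inputs, weights, bias):
    """A logistic regression: a weighted sum passed through a sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


def tiny_network(inputs, hidden_layer, output_unit):
    """One hidden layer of logistic units feeding one logistic output unit."""
    hidden = [logistic_unit(inputs, w, b) for w, b in hidden_layer]
    out_weights, out_bias = output_unit
    return logistic_unit(hidden, out_weights, out_bias)


# Hand-picked weights, purely illustrative.
hidden_layer = [([1.0, -1.0], 0.0), ([-1.0, 1.0], 0.0)]
output_unit = ([2.0, 2.0], -2.0)

p = tiny_network([0.5, 0.5], hidden_layer, output_unit)
print(p)  # 0.5: both hidden units output 0.5, so the output unit sees z = 0
```

Drop the hidden layer and what remains is exactly a logistic regression, which is a standard statistical model.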

What became obvious to me during his presentation is that:

Statistics is about things called statistics.
1. Examples include the mean, median, mode, standard deviation, coefficients, and so on.
2. Parameter estimates are statistics too.
Machine Learning is a case of applied statistics.
1. ML executes statistical optimization and as such it utilizes statistics including goodness of fit, bias, variance, error, and so on.
2. ML includes supervised and unsupervised learning. In the case of supervised learning, a practitioner is specifying features in the same way a statistician would. In fact, to even make sensible specifications, I believe a practitioner needs to have some statistical understanding of the underlying data.
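
The "statistical optimization" point can be sketched as follows. This is an illustration on synthetic data, with a learning rate and step count that are assumptions picked for this particular data set. A statistician's closed-form least squares fit and an ML-style gradient descent loop on mean squared error are run on the same data, and they should land on essentially the same coefficients.

```python
import random

# Synthetic data (assumption for the demo): y = 3 + 2x plus small noise.
rng = random.Random(42)
xs = [rng.uniform(0, 1) for _ in range(100)]
ys = [3.0 + 2.0 * x + rng.gauss(0, 0.1) for x in xs]
n = len(xs)

# Statistician's route: closed-form ordinary least squares.
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
ols_slope = sxy / sxx
ols_intercept = mean_y - ols_slope * mean_x

# ML route: gradient descent on mean squared error.
gd_intercept, gd_slope = 0.0, 0.0
learning_rate = 0.1  # assumption: small enough to be stable on this data
for _ in range(5000):
    residuals = [gd_intercept + gd_slope * x - y for x, y in zip(xs, ys)]
    grad_intercept = 2 * sum(residuals) / n
    grad_slope = 2 * sum(r * x for r, x in zip(residuals, xs)) / n
    gd_intercept -= learning_rate * grad_intercept
    gd_slope -= learning_rate * grad_slope

print(f"OLS:              intercept={ols_intercept:.4f} slope={ols_slope:.4f}")
print(f"Gradient descent: intercept={gd_intercept:.4f} slope={gd_slope:.4f}")
```

The loss being minimized, the residuals, and the fitted coefficients are all statistics; the training loop is just one way of computing them.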
In any real application of ML, stats will be needed too.
1. Communicating the findings of ML requires statistics. E.g. "75 percent of input images were identified as cats."
2. The video above brilliantly quotes the statistician Ronald Fisher: "The quantity of data which usually by its mere bulk is incapable of entering the mind is to be replaced by relatively few quantities which shall adequately represent the whole of the original data."
3. The above quote defines what a statistic is, and it also shows why big data and ML can't be communicated to an end-user without consolidation into statistics.
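
The reduction described in the points above can be shown with a toy sketch. The list of predicted labels is made up for illustration and stands in for the raw output of some hypothetical image classifier. Nobody wants to read the raw list, so it gets consolidated into a few shares.

```python
from collections import Counter

# Hypothetical raw output of a classifier: one predicted label per image.
predicted_labels = ["cat"] * 75 + ["dog"] * 20 + ["other"] * 5

# Reduce the bulk data to a few quantities that represent the whole.
counts = Counter(predicted_labels)
total = len(predicted_labels)
shares = {label: count / total for label, count in counts.items()}

cat_share = shares["cat"]
print(f"{cat_share:.0%} of input images were identified as cats")
```

The printed sentence is a statistic in Fisher's sense: one number standing in for a hundred individual predictions.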
ML emerged later in history and grew out of statistics. Yes, it is also a subfield of computer science. There is no competition or exclusivity here. Essentially all meaningful statistics today are implemented as programs too.

As a concluding note, my findings today reassure me that:

A background in statistics and econometrics is a great entry into machine learning.
There is no major conflict between usual statistical and ML approaches.
1. ML is in many ways just the automation of complex statistics, including some statistical practices that were previously manual and laborious.
A good ML practitioner needs statistical, and I would suggest econometric, skills as well.