Job of a Data Scientist: Separating Signal from Noise. Machine Learning Under the Hood
Posted: October 18, 2017
All data is a combination of signal and noise. Signal represents the valuable, consistent relationships that we want to learn. Noise is the random fluctuations and coincidental correlations that will not occur again in the future. Together, signal and noise take on familiar patterns or shapes that we can use to build a model.
Models can capture varying degrees of signal and noise. At one end of the spectrum is the non-model. Consider this common upsell question:
“Would you like fries with that?” This approach requires no model. It disregards signal and noise, rigidly canvassing everyone regardless of who they are or what they ordered:
“I’ll have a salad, hold the dressing. And a bottled water.”
“Would you like fries with that?”
“Carbs? Uh, no thank you.”
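In code, the non-model is just a constant predictor. Here is a minimal sketch using scikit-learn’s DummyClassifier; the order features and labels are invented for illustration:

```python
# A "non-model": it always gives the same answer, no matter the input.
import numpy as np
from sklearn.dummy import DummyClassifier

# Invented data: feature = 1 if the order was health-conscious,
# label = 1 if the customer said yes to fries.
X = np.array([[0], [0], [1], [0], [1]])
y = np.array([1, 1, 0, 1, 0])

non_model = DummyClassifier(strategy="most_frequent")
non_model.fit(X, y)

# Every customer gets the same upsell, salad or not.
print(non_model.predict([[0], [1]]))  # -> [1 1]
```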
At the other end of the spectrum are highly flexible models that treat everything in the data as signal, noise included. Their predictions are influenced by every data point, including outliers, which can (and do) skew outcomes. These models can be more damaging than using no model at all: they lose predictive ability on new data because they are too tightly bound to their original training data. This is called overfitting.
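Here is a minimal sketch of overfitting using NumPy polynomial fits on made-up data; the noise level and polynomial degrees are arbitrary choices for illustration:

```python
# Overfitting sketch: the same noisy data fit with a simple model and a
# wildly flexible one. The flexible fit chases the noise.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = 2 * x + 1 + rng.normal(0, 0.3, x.size)  # signal + noise

simple = np.polyfit(x, y, deg=1)     # models the signal
flexible = np.polyfit(x, y, deg=15)  # memorizes the noise too

# On fresh data from the same process, the flexible fit falls apart.
x_new = np.linspace(0.025, 0.975, 20)
y_new = 2 * x_new + 1 + rng.normal(0, 0.3, x_new.size)
err_simple = np.mean((np.polyval(simple, x_new) - y_new) ** 2)
err_flexible = np.mean((np.polyval(flexible, x_new) - y_new) ** 2)
print(f"test error, simple: {err_simple:.3f}, flexible: {err_flexible:.3f}")
```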
This post is a continuation of my last post, where I went over the difference between machine learning and data science. This post goes a step further and talks about how we use machine learning to separate the signal from the noise. There are many machine learning algorithms. Think of them as tools in a toolbox. Data scientists use these pre-built algorithms to tease models out of their data sets.
A handful of these tools are based on classical statistical methods, which makes them easy to interpret. If the model is being used to help a human make decisions, it’s a good idea to develop it with one of these classical methods (see the sketch after the list):
- Linear regression
- Logistic regression
- K-means clustering
- K-nearest neighbors
- Hierarchical clustering
- Naive Bayes
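As a sketch of why these methods are easy to interpret, here is a small logistic regression in scikit-learn; the feature names and churn data are invented for illustration:

```python
# Classical method sketch: logistic regression on made-up churn data.
# The fitted coefficients can be read off and explained to a human.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented features: [monthly_spend, support_tickets]
X = np.array([[20, 0], [35, 1], [50, 4], [15, 0], [60, 5], [40, 3]])
y = np.array([0, 0, 1, 0, 1, 1])  # 1 = customer churned (made up)

model = LogisticRegression()
model.fit(X, y)

# Each coefficient has a direct reading: its sign and size say how that
# feature pushes the odds of churn up or down.
for name, coef in zip(["monthly_spend", "support_tickets"], model.coef_[0]):
    print(f"{name}: {coef:+.3f}")
```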
There are more options if there is no human involvement in the decision process; think of Netflix making a recommendation: no one reviews the recommendation before you get it. If there is no human element, or if there is a high tolerance for opaque black-box methods, there is an additional group of modern machine learning algorithms (sketched after the list):
- Random Forest
- LASSO
- Hidden Markov Models
- Support Vector Regression
- Artificial Neural Networks
- Apriori Algorithms
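For contrast, here is a minimal random forest sketch on the same invented churn data; it often predicts well, but there is no small set of coefficients to show a stakeholder:

```python
# Modern method sketch: a random forest on the same invented churn data.
# One hundred decision trees vote; often accurate, but hard to explain.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[20, 0], [35, 1], [50, 4], [15, 0], [60, 5], [40, 3]])
y = np.array([0, 0, 1, 0, 1, 1])

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

print(forest.predict([[45, 2]]))    # a prediction, but no equation behind it
print(forest.feature_importances_)  # a rough clue, not an explanation
```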
Discussing each of these is outside the scope of this post. But if you are interested in learning more, make sure to subscribe to my email list for data-savvy professionals and get a copy of “Bull Doze Thru Bull.”
Takeaway: if you are evaluating a machine learning project, a great question to ask is which algorithms were used. LASSO and Random Forest are as close to out-of-the-box, all-purpose tools as you will find, so they are quite common. The classical methods are a conservative choice. The modern machine learning methods are essentially black-box solutions, which usually means the team tried several of them and went with the tool that performed best in testing.
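That try-everything workflow looks roughly like this in scikit-learn; the candidate models, the stand-in data set, and the scoring setup are all illustrative choices:

```python
# Sketch of the typical selection loop: score several candidate
# algorithms with cross-validation and keep the best performer.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, random_state=0)  # stand-in data

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "naive Bayes": GaussianNB(),
    "random forest": RandomForestClassifier(random_state=0),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```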