Three Learnings from Real World Large-Scale Machine Learning

I spend much of my day working on and tuning machine learning models here at Intent Media. We use predictive analytics for a number of business tasks, but most of our models currently fall into two major categories. The first is our segmentation models, which predict how likely a user is to book or convert on a site; we use these to decide when to show ads. The second is our quality score models, which predict how likely an ad is to be clicked; we use these to decide which ads to show. We have recently moved our logistic regression model over to Apache Spark's MLlib on Amazon EMR. In a future post I would love to show some of the code it took to do that. Until then, I will be posting thoughts on machine learning I have picked up from empirical testing and from the wisdom of my more knowledgeable colleagues. Let me know if you (dis)agree or find it useful.


1. Treating weights (betas) as equivalent to feature value is dangerous

The absolute value of a weight is only a very loose proxy for a feature's contribution. Business users will often ask "which features in our model are most valuable?" and it's tempting to answer by looking at the absolute values of the weights. That will give you a misleading picture of your data, though, for a couple of reasons. With a linear model it's really hard to understand feature interactions, and the weights alone give no visibility into how frequently a given feature fires. We have some relatively rare features that can be predictive but require the website user to have entered via a certain path, used a promo code, purchased a certain sort of product in the past, or some other low-frequency event. We have other features that capture similar but slightly different notions, and many of them will often 'fire' on a single prediction. In aggregate they represent a valuable signal in the model, but individually they have low weights.
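
To make that concrete, here is a tiny Python sketch; the weights, features, and firing rates are made up for illustration, not pulled from our production model:

```python
import numpy as np

# Illustrative numbers only, not our production weights.
# w[0]: a rare "used a promo code" style feature; w[1]: a common behavioral feature.
w = np.array([4.0, 0.3])

# Fraction of predictions on which each feature is non-zero ("fires").
firing_rate = np.array([0.02, 0.60])

# A crude contribution estimate: weight magnitude scaled by how often the feature fires.
# The "small" weight ends up doing more work across the traffic than the "big" one.
print(np.abs(w) * firing_rate)   # [0.08  0.18]
```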

I have seen models with weights that have large positive or negative values and often end up canceling each other out in odd ways, yet the model can still be quite predictive. Perhaps A (-3.5) and B (2.5) frequently occur together, so combined they contribute only -1.0 to the log-odds. We cannot really say those features are more valuable than another feature with a smaller absolute weight.
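
Here is a minimal scikit-learn sketch of that effect on synthetic data (illustrative only, and not our Spark pipeline): two nearly collinear features where the data pins down the sum of the weights, but not the individual values.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
a = rng.normal(size=n)
b = a + rng.normal(scale=0.01, size=n)        # B is nearly a copy of A

# The true signal is on the shared component, with a total effect of about 1.0.
log_odds = 0.5 * a + 0.5 * b
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-log_odds))).astype(int)

model = LogisticRegression(C=1e6, max_iter=5000)   # effectively unregularized
model.fit(np.column_stack([a, b]), y)

w_a, w_b = model.coef_[0]
# The individual weights are typically large and opposite-signed,
# while their sum stays close to the true combined effect of roughly 1.0.
print(w_a, w_b, w_a + w_b)
```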

Of course, if your model suffers from any overfitting, or if training has stopped at a local optimum, these issues will be even more pronounced.


2. Be careful making linearity assumptions about your data

Assuming your data is linear is a little bit like assuming a normal distribution in statistics or assuming your random variables are independent in probability. It's really tempting because you can run simple algorithms that perform really, really well. Linear and logistic regression are well studied, perform well at scale, have some great libraries available, and are conceptually simple to work with. The problem with all this fun and games is that your features may interact in unforeseen ways.

We have attacked the linearity problem on a couple of fronts. We use a logistic regression model for segmentation, but we know that our data interacts in a variety of ways, so we spend a lot of time developing and testing higher-order features that combine raw data or apply mathematical transformations such as taking logs of values. We are also testing models that learn feature interactions on their own; we have had some success with decision trees and will continue exploring random forests for future models.
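
As a flavor of what those higher-order features can look like, here is a toy pandas sketch; the column names are hypothetical, and the real transformations live in our feature pipeline rather than in a snippet like this:

```python
import numpy as np
import pandas as pd

# Hypothetical raw features for a handful of site visits.
df = pd.DataFrame({
    "num_searches":    [1, 4, 40],
    "days_to_travel":  [2, 30, 90],
    "used_promo_code": [0, 1, 0],
})

# Log-transform heavy-tailed counts so one outlier does not dominate a linear model.
df["log_num_searches"] = np.log1p(df["num_searches"])

# An explicit interaction term lets a linear model express
# "promo-code users who also search a lot behave differently".
df["promo_x_searches"] = df["used_promo_code"] * df["num_searches"]
```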


3. Regularization doesn't matter so much when you have big data

This one was fun to realize empirically. I was recently comparing different optimizers for our logistic regression, L-BFGS and SGD. As I was testing these on a large dataset I wanted to make sure I was regularizing appropriately, so I wrote a script that would train the model several times overnight with regularization parameter values ranging from 0.000001 to 1.0, alternating between L1 and L2 penalties.
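
The sweep was roughly in the spirit of the sketch below, using the RDD-based MLlib API; the RDD names (train_lp, validation_lp) and the AUC evaluation are stand-ins for illustration:

```python
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.evaluation import BinaryClassificationMetrics

# train_lp and validation_lp are assumed to be RDDs of LabeledPoint
# built earlier in the pipeline.
reg_params = [0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1, 1.0]

for reg_type in ("l1", "l2"):
    for reg_param in reg_params:
        model = LogisticRegressionWithLBFGS.train(
            train_lp, regParam=reg_param, regType=reg_type)
        model.clearThreshold()   # return probabilities instead of hard 0/1 labels

        scores_and_labels = validation_lp.map(
            lambda p: (float(model.predict(p.features)), p.label))
        auc = BinaryClassificationMetrics(scores_and_labels).areaUnderROC
        print(reg_type, reg_param, auc)
```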

As it turns out, when you have lots of good data, it becomes difficult to overfit. A little bit of regularization helped, but not by as much as I was expecting.