Chet Mancini

I have talked in previous posts about the inherent temptation to assume linearity in data, and perhaps more importantly, the inherent desire we have as humans to simplify interpretations of our data. Occam's Razor is a useful tool in which we are encouraged to seek the most simple explanation in our data, but we must remember to rule out simple explanations which do not describe our data.

Take an example of a company with a straightforward growth rate. Each month some marketing brings in some set of new clients and growth is steady. We can run a simple regression on this dataset and see the slope of the graph to make projections about the next quarter or year.

Lets say we hire a new marketing director who initiates a new campaign that has a viral effect. How do we extrapolate on our data now? Perhaps we should wait and see before making a new projection, but our trendline no longer fits the data at the edge. Is this an outlier or a new normal?

The campaign continues and growth is really picking up. Our trendline is clearly not helping. Someone on the team decides we should make projections from now on with an exponential function. We decide to draw our data with an exponential function.

And yet that still doesn't explain all the points very well. The exponent isn't as large as it should be because we're growing faster than it predicts. The linear component of the data at the beginning is throwing off the projection. Even so, the exponential view is certainly better than what we had before.

The viral campaign finishes and things unfortunately do not stick for our imaginary company. Growth slows.

How would you add a trendline to this? A bit tougher, right? The reason is trendlines in a single dimension are really bad for almost all real-world data. One thing we might be able to do to model our growth is to put the X data in a two dimensional vector like this: [TIME, HAS_VIRAL_CAMPAIGN]. The other thing we might do is to build a step function. For certain range values of X the function fits Y = M*X + B, for other ranges it fits Y = X^N + C. We need three such components in our step function, or three related single-dimension functions to accurately model this data.

It should be clear that all these solutions are merely academic. The data will change next month and we must revise our model. Perhaps we launch another campaign, a competitor launches a similar product, an act of God hits the supply chain, or the CEO says something distasteful in a room full of journalists. These things can be modeled looking backwards but could be difficult to predict forwards.

All models are wrong. Seek simplicity, but when making your model, make sure it is rich and complex enough to be helpful.

Chet Mancini

Writing about data, programming, and travel

Tag data

A Simple Graphical Example of Common Excel Data Assumptions