It's becoming more difficult for governments to willfully miscalculate the realities of inflation. The reason is increasingly broad availability of pricing data and improved abilities to harvest and analyze it. Even those who try to calculate CPIs and similar statistics do poorly because it's very difficult to say an iPhone today costs 45% more than it did 15 years ago--it just didn't exist then. Given enough gadgets and enough price points, these hiccups in baskets of goods and services get smoothed over.
Tim Harford: A Billion Prices Can't be Wrong
Sharing knowledge and information increases freedom and improves justice. It could be that thoughtful analysis could preclude future economic calamities due to false or obscured information.
Note, this is not an argument for you to buy gold or TIPS. Talk to your financial expert about what investments are right for you.
Gene Amdahl has died. For decades he was, and now will be forever one of the greatest computer technologists of history. The NYTimes has a good obituary. He was one of the primary engineers behind IBM's System/360 which became one of the most successful mainframes ever.
Amdahl's Law is simple yet quite fascinating. It explores the notion of how we think about parallel computation, or even more generally, performing a series of tasks, so there are some relevant considerations for assembly line process and/or people management. It puts a theoretical limit on the speedup we can achieve in a system, and those limits are startlingly low. For example, lets say you have a task, of which 50% is parallelizable (can be processed side by side). Amdahl's Law says it doesn't matter how many workers you have, you can never get a speedup beyond 2x. It makes no difference if you have a thousand or a million workers on a task which is 90% parallelizable: you have have an upper bound on a 10x speedup. It's a formulation of the classic law of diminishing returns, with a very straightforward equation to back it up.
]]>The Intent Media team just returned from a trip up to Ithaca, NY to recruit Cornell Engineering students for several open junior software engineer positions. We met a lot of smart, talented, and motivated students, and we learned a lot about campus recruiting along the way. Here are a few early impressions I had that might help other university students as they navigate this unusual world of technical hiring.
Take the time to stop at the less busy booths at a career fair. There are no lines to wait in, and those recruiters will have more time to have a long discussion with you individually. At a larger company you will be rushed and it is unlikely they will remember your conversation.
Think about the type company you would flourish at and specifically research those firms. Big companies have great mentorship and guidance programs, but you will likely have an impact on a very specific feature in the product and may not be given a lot of flexibility in implementation. Small firms might have a smaller impact overall, and you will likely be required to work independently and creatively. You may be expected to design and implement entire systems without a lot of coaching. Capitalize on your strengths and find roles that will stretch your weaknesses.
What separates a good candidate from a great candidate are their incisive, thoughtful questions. Turn the short interview around and act confident--as an engineering student you might have several offers and it is incumbent on the company to impress you with how great they are.
Average candidates ask more obvious questions like:
Do you have any jobs?
What does your company do?
Great candidates ask challenging questions like:
I know you sell ABC. How do you tackle the competition from companies like XYZ?
Why is your company ranked as a Top Place to Work in NYC?
Tell me about a current challenge you all are facing (bonus: weave in a skill you have that could help solve it!)
There is so much more to a company than it's brand and the salary. Understand the leadership, the team, the culture, the side perks, the location, the technical challenges, and the business challenges. You will end up with a better job and will appear more attractive to interviewers.
Think about what differentiates you from all the other people coming to the booth, and highlight that. Mention the amazing internship you had the previous summer. Call out if you are at the top of your class or win coding competitions with your friends. Remember, the smaller the company, the less GPA matters. Big firms often need a simple number to “weed out” resumes, while smaller ones will be more interested if you have a unique background, skill set, or area of interest.
Great candidates can balance honesty and salesmanship. Remember, you are a highly sought after candidate and have something of value to provide to employers. On the other hand, you might be fairly junior and not have extensive experience in a specific area. Call out your strengths loud and clear on your resume. However, if you do not really know a language, do not claim that you do. Authenticity and confidence are more desirable than exaggerated claims.
Finally, have a conversation, not a script. The person you are talking to is a human, not a robot, and wants to hire humans, not robots. For example, this sounds scripted:
“Hello, my name is Bob Smith and I’m a senior graduating in 2016 from the X University College of Engineering with a major in Computer Science and a minor in Data Analysis from the College of Arts and Letters and I am interested in Full Time employment.”
This is a much more natural intro:
“Hi, I’m Susan! I’ll be getting my CS degree next year and after browsing your website I was really interested in learning more about the XYZ role you have open.”
When talking about distributed systems, it is common to describe Eric Brewer's CAP Theorem as "Consistency, Availability, and Partition Tolerance. Pick Two." This is usually illustrated with a triangle or Venn Diagram. While that's not terribly wrong, it's not particularly helpful.
Given that network partitions are rare, but do happen, the better way to consider the theorem is: "In the event of a network partition, does your system guarantee consistency or availability?"
Lets say you have a system with incoming data.
I have talked in previous posts about the inherent temptation to assume linearity in data, and perhaps more importantly, the inherent desire we have as humans to simplify interpretations of our data. Occam's Razor is a useful tool in which we are encouraged to seek the most simple explanation in our data, but we must remember to rule out simple explanations which do not describe our data.
Take an example of a company with a straightforward growth rate. Each month some marketing brings in some set of new clients and growth is steady. We can run a simple regression on this dataset and see the slope of the graph to make projections about the next quarter or year.
Lets say we hire a new marketing director who initiates a new campaign that has a viral effect. How do we extrapolate on our data now? Perhaps we should wait and see before making a new projection, but our trendline no longer fits the data at the edge. Is this an outlier or a new normal?
The campaign continues and growth is really picking up. Our trendline is clearly not helping. Someone on the team decides we should make projections from now on with an exponential function. We decide to draw our data with an exponential function.
And yet that still doesn't explain all the points very well. The exponent isn't as large as it should be because we're growing faster than it predicts. The linear component of the data at the beginning is throwing off the projection. Even so, the exponential view is certainly better than what we had before.
The viral campaign finishes and things unfortunately do not stick for our imaginary company. Growth slows.
How would you add a trendline to this? A bit tougher, right? The reason is trendlines in a single dimension are really bad for almost all real-world data. One thing we might be able to do to model our growth is to put the X data in a two dimensional vector like this: [TIME, HAS_VIRAL_CAMPAIGN]. The other thing we might do is to build a step function. For certain range values of X the function fits Y = M*X + B, for other ranges it fits Y = X^N + C. We need three such components in our step function, or three related single-dimension functions to accurately model this data.
It should be clear that all these solutions are merely academic. The data will change next month and we must revise our model. Perhaps we launch another campaign, a competitor launches a similar product, an act of God hits the supply chain, or the CEO says something distasteful in a room full of journalists. These things can be modeled looking backwards but could be difficult to predict forwards.
All models are wrong. Seek simplicity, but when making your model, make sure it is rich and complex enough to be helpful.
I spend much of my day working on and tuning machine learning models here at Intent Media. Here we use predictive analytics for a number of business tasks, but most models currently fall into two major categories. The first are our segmentation models, which predict the likelihood of a user booking or converting on a site; we use these to make decisions about when to show ads. The second are our quality score models, which predict the likelihood of a click, which we use to decide which ads to show. We have recently moved our logistic regression model over to Apache Spark's MLLib on Amazon EMR. In a future post I would love to show some of the code it took to do that. Until then, I will be posting different thoughts on machine learning I have picked up from empirical testing and the wisdom of my more knowledgeable colleagues. Let me know if you (dis)agree or find it useful.
Absolute value of a weight is only a very loose proxy for feature contribution. Often business users will come and ask "which features in our model are most valuable?" and it's tempting to go to look at the absolute value of the weights. This will yield a misunderstanding of your data though, for a couple reasons. When using a linear model it's really hard to understand feature interactions and there is no visibility into the frequency a certain feature fires. We have some relatively rare features that can be predictive, but require the website user to have entered via a certain path, used a promo code, purchased a certain sort of product in the past, or similar occurrence with low frequency. We have other features which capture similar but slightly different notions--often many will 'fire' on each prediction. In aggregate they represent a valuable signal in the model, but individually they have low weights.
I have seen some models with weights that have large positive or negative values and often end up canceling each other out in odd ways, but can still be quite predictive. Perhaps A (-3.5) and B (2.5) frequently occur together, but combined only yield a value of 1.0. We cannot really say those features are as valuable as another with a smaller absolute weight.
Of course, if your model suffers from any overfitting or if training has found a local maxima then these issues will be even more pronounced.
Assuming your data is linear is a little bit like assuming a normal distribution in statistics or your random variables are independent in probability. It's really tempting because you can run simple algorithms that perform really, really well. Linear and logistic regression are well studied, perform well at scale, have some great libraries available, and are conceptually simple to work with. The problem with all this fun and games is that your data may interact in unforeseen ways.
We have attacked the linearity problem on a couple fronts; we use a logistic regression model for our segmentation, but we know that our data interacts in a variety of ways. We spend a lot of time developing and testing some higher-order features that combine data or apply mathematical transformations like logs of values and similar. We are also testing models that learn feature interactions better; we have had some success with decision trees and will continue to be exploiting random forests for future models.
This was fun to realize empirically. I was recently working on trying different optimizers for our logistic regression algorithm, trying an LBFGS and SGD. As I was testing these on a large dataset I wanted to make sure I was regularizing appropriately, so I wrote a script that would train the model several times overnight with various regularization parameter values from 0.000001 to 1.0 alternating between both L1 and L2 methods.
As it turns out, when you have lots of good data, it becomes difficult to overfit. A little bit of regularization helped, but not by what I was expecting.
I have been living in NYC for almost three years now. Like many New Yorkers, I am willing to share my city hacks. These will be shorter posts with a few words of of advice for the traveler or new residents. There is no order--the numbering is just to provide a monotonically increasing reference or index.
On my long President's Day weekend I took shelter from the NYC cold at the recently-reopened Cooper Hewitt Museum on the Upper East Side.
The Cooper Hewitt is museum about design. On three floors and multple rooms the curators have arranged various sets of objects, often related to each other in some way other than the expanse of time. For instance, a Sumerian cuneiform tablet might be juxtaposed with an original iPhone. The focus is on the art of design and the ways that design helps us as humans. It is well worth a couple hour visit, though I believe even more additions will be made shortly.
My favorite piece was a display of pieces from the Free Universal Construction Set.
As a child, I owned some number of most of the major construction kits. Lego was by far the most dominant, but I also used Duplo, Playmobil, Lincoln Logs, Erector sets, K'nex, Hot Wheels, Tinkertoy, and Fishertechnik. Much of my playtime was spent combining these systems. Sometimes the scales could be frustrating but there was some satisfaction to send a Lego train through a western town made of Lincoln Logs while attacked by Playmobil giants.
The FUCS is a free and open set of 3D printable adaptors for a large number of toy systems, designed to make these toys interact with each other. Finally, a child (or child at heart) with a Makerbot can construct toys and systems that cross beyond the borders of corporate imaginations.
I got to thinking how these toy systems speak some internal language, but in order to work together the appropriate universal adaptors need to be supplied, and connected with other systems.
This is the same in data and computer science. To bridge two systems we need to agree on standard APIs, formats, and schemas. Oftentimes when one does not control the data itself in a section of the pipeline an engineer must define an appropriate adaptor to parse, transform, or display some set of information. Similarly, an interface needs to be implemented on the receiving end of a pipeline to do some sort of data load process. Good engineers live by defining appropriate, lightweight abstractions to complexity, allowing systems to communicate and lever each other's power, features, or datasets.
One example of a parallel in the software world are Protocol Buffers from Google. If you're not familiar with the concept, essentially JSON has become a standard data interchange format, except it's quite inefficient to read and write at high performance and since it doesn't compress well, is expensive to transfer. Additionally, the types are limited so we're left with a variety of add-ons and supersets to try to describe types like dates and times. Protocol buffers are not the solution to every problem, of course, but they do tackle the issue well on some fronts. Essentially you define a schema like file that describes your data, and then the system can generate an adaptor for your language of choice, creating a standardized way to serialize data to a binary format from a C program and read it in with something like Javascript. Protocol buffers are the abstraction that each subsystem can look to to communicate, allowing us to build polyglot systems.
The FUCS is a great, tangible representation of simple abstractions and interfaces for toy-engineering. Children today might get a better understanding of a system stack just by expanding their toy creations, and that is exciting.
As a data lover I have a particular nonpolitical aversion to Fox News. Their travesties in representing data have been well documented on other sites.
I enjoy poking fun at bad graphs, but over time one comes to recognize common traps many fall into when designing a graph. Recently I came across this particularly bad graph on a social network. If you know the source I would appreciate the ability to cite where it came from, but perhaps it is just as well that we let the origin remain anonymous.
Let's take this graph one piece at a time and let it instruct us on how to make our own bad graphs.
You'll note that the chart is measuring two continuous measures of exactly the same type of data (employment/population ratio). However, they're plotted against two different Y axes. It can be difficult to add more than two axes, but if you can add more, it really helps aid incomprehensibility so people can't compare data well.
Your readers expect you to make zero the Y-intercept on your axes, so make sure you don't do that. Pick an arbitrary number for each axis that suits the message you wish your data to give, and allows you to exaggerate its effect.
The author of this chart was very clever. Look at the difference in scale from the top to the bottom of the chart. On the male trendline the graph measures 78-62=16 units, and the female trendline measures 60-38=22 units. The more axes with misaligned scales the better. Your readers will try to compare the relative scales, and you will have have them making incorrect inferences in no time.
It is well known that people do not mentally grasp ratios well. People understand absolute numbers, so whenever you need to shroud information, put it in the form of of a ratio over or under some other number. In this particular case the ratio actually might make more sense, except it would make more sense if it were quoted as a percentage of a certain population.
Here's a big one. Graph consumers love trendlines. Adding a trendline to the graph is like adding a puppy under the Christmas tree. They reinforce the message that everyone should assume the data you're displaying is linear. All data is linear, right? It lets your readers draw wonderfully ridiculous extrapolations off into eternity on either end, and lets you make a strong policy case that change is needed to modify the graph. Trendlines are also excellent for reinforcing the idea in your reader's minds that there is just 1 relevant variable into the function, usually time. Everyone remembers "y=mx+b". Give your readers an intercept b (see above regarding how to choose your intercepts), a comfortable m slope in the form of a trendline, and a nice range of x values along the bottom, and they will solve for truth on their own. This will keep your readers from considering other reasons you didn't have time to research that might be affecting the lines.
Hopefully this has been a valuable lesson in data-obfuscation. I'll continue to bring more tips as I find good examples of bad graphs.
]]>Recently I have been considering the nature of the software engineering job market for junior engineers. I have a lot of peripheral experience with training, recruiting, and hiring junior programmers. I have spoken to and coached non-programmer friends and acquaintances through the programming bootcamp experience, and it has been awesome to see where they end up--great companies like GoPro, Amplify, Thoughtbot, the NYTimes, and Priceline. I have recruited bootcamp graduates from most of the major programs in NYC like Dev Bootcamp, Flatiron School, Fullstack Academy, AppAcademy, and General Assembly.
Years ago I was that junior engineer coming out of university looking for any company that would take an interest while seeing senior engineers get hired at ever-increasing rates.
The market for senior software engineers at the time of this writing could not be more insane. Passively put your resume on a site like Hired.com and you will get 10-30 companies of varying size bidding ever-higher salaries (120-160k+ is common) for your skills. The stress of telling a long series of amazing companies "no I don't have time to even talk with you" should cause a cognitive dissonance when you remember how many millions of Americans have completely given up looking for jobs.
Why is it so hard to get a job as a junior engineer? Let's look at a typical profile.
The average tech startup is building a Rails or Django app depending on what programming language their CTO knows better. They might have some funding after an A or B round and a few senior developers on the team. Cash is king, but their investors obviously want them to spend that money on engineer's salaries so they can build product. Everyone is pretty busy and is trying to scale as fast as possible. Employees are expensive, but if they can build product fast enough, they're worth every penny.
The average junior engineer is looking for a job at a startup. They have touched all the tools and aren't afraid of code, but they know pretty much nothing. They've built an app, but only by adding 50 lines of Rails code. They have a Github account, but don't understand anything about snapshots or tags or why you don't force push to master. They know some basic data structures so they can reverse a linked list in an interview question but do not know what a red black tree is. They've written tests but don't understand testing. They've queried a database but haven't written a database. Or dealt with data corruption. They know what a queue is but haven't dealt with a queue that's overloading at millions of items per minute. It's the problem with any job. As a company, you want someone who can say "I think I've dealt with this problem in the past" and then nails it.
Junior engineer salaries out of these bootcamps ballpark 60-100k. This is the attraction of the bootcamp: spend 9 weeks with us and you could get a six figure job.
Here's where I think there's a market imbalance. Junior engineers at the outset aren't worth anything close to 100k. After a year of learning, they definitely might be, but in the meantime they present a huge risk to a growing company. A junior engineer will need coaching, mentoring, and encouragement. They might merge code in that breaks a production system. They might erase your git history by accident. They will break the build. Multiple times. I know because I've done all those things and seen junior engineers do all those things after me.
In a couple years of working that junior engineer might be paid 100k, but be doing output on that Rails app comparable to a senior engineer that might be paid 160k. Sure, there are those really obscure issues that only someone who has programmed for ten years will nail down, but for most product-oriented stories, those two years of experience will be paying big dividends.
My personal experience is aligned here. I made a lot of mistakes my first year from breaking the build to swallowing exceptions. A few years later, while I'm absolutely still learning and improving, I deliver more stories and merge more pull requests than most senior engineers on our team.
So here is my proposal. Let's all be honest about what we're worth. I think more companies should snatch up junior engineers, but pay them 40k--enough to pay the rent and buy groceries. But here's the catch: tell them from the outset "you're worth 40k to us now, but we'll guarantee a compensation discussion every six months, in which we'll plan to ramp your salary to 100k in 2-3 years depending on performance." That would really help stabilize the market. I think it would keep junior engineers humble but ambitious. It would de-risk companies looking to hire. It would reduce unemployment for recent grads, but guarantee a pipeline of competent programmers for startups.
I'm not sure how viable this could be in the real world, but when I launch a startup, I'd be willing to give it a try. Your best future programmer might be the one you're overlooking.
]]>This is a cross post of the same article on the Intent Media Blog
The Intent Media data team uses Hadoop extensively as way of doing distributed computation. The big data ecosystem is evolving very quickly, and the patterns and versions we had been using are being replaced with faster and more stable platforms. This post is about how we successfully and safely upgraded our Hadoop jobs and paid down some technical debt.
Many of Intent Media’s backend job flows use Hadoop to manage the sheer size of the input datasets. We have jobs that range from basic aggregation to learning models. Hadoop is an impressive framework, but we had increasingly been confronting issues with stability, missing features, and a lack of documentation. A number of indicators pointed to our version of Hadoop being too far out of date.
For our original MapReduce jobs, we were using Hadoop 0.2, which was state of the art at the time those jobs were written, but now the framework has matured through several major versions and Hadoop 2.x is gaining adoption. We needed to at least get to 1.x to give us a path for future upgrades.
On the data team, we write most of our MapReduce jobs in Cascalog, a Clojure domain specific language (DSL) for Cascading, a library that provides job flow abstractions for Hadoop. The Clojure syntax is fairly terse, and the code is easy to test. Cascalog itself has matured since we first used it and we needed to upgrade our library from 1.x to 2.1 to stay current with the latest stable versions.
Before initiating our upgrade project, we needed a way to automatically test our jobs in a cluster environment. We use TDD, so most jobs had a great unit test suite and ran against local execution engines, but job behavior on a cluster with real world data can vary quite a bit. Unfortunately, a clean run with sample data on a local Hadoop jar does not guarantee flawless execution on a 30 machine cluster against production datasets. Most of the time when we define a new job, our QA team can verify expected output and execution under a variety of conditions. However, this manual process quickly becomes tedious when we have to regression test an entire set of jobs on a broad-sweeping version change.
Our first project was to write a remote job triggering and testing harness which could pull a medium sized subset of production data, execute a long running remote job, and make sanity assertions about the output. Running remote jobs is expensive, so we did not put these tests in the integration test suite (which is run on every push to a branch). Instead, engineers can run these job tests prior to a large, possibly breaking change, and we can trust than a version upgrade is safe within a few hours.
Our job execution application is written in Java, and most of our other integration tests are in cucumber, so we decided to continue with our existing technology choices for our remote cluster tests. We chose to define scenarios in Cucumber and use Ruby to trigger jobs, tail logs, and make assertions on result data. Engineers and product managers all have the ability to see the expected job flow of each job, understand the input and output expectations, all based on a job execution run on a snapshot of real production data.
We run all our Hadoop jobs on AWS Elastic MapReduce (EMR). EMR is a powerful way to spin up entire clusters of optimized cloud instances preconfigured with Hadoop with a single API call. The Amazon team has worked to abstract a lot of the work creating and tuning a Hadoop cluster, so all we need to do is submit a job request. EMR is configured to start instances with particular virtual machine images, called Amazon Machine Images (AMIs). AMIs come in different versions and are packaged with different software, such as a specific Hadoop version. As early adopters of EMR, we were using AMI version 1.0.0, which shipped with Hadoop 0.20. To run our jobs against Hadoop 1.x, we had to modify the AMI versions in our job triggering configuration.
The Hadoop transition from 0.x to 1.x consisted of mostly versioning and lower level changes, so fortunately our application code didn’t need to change much. We just needed to swap out the Hadoop library we compiled against and test each job carefully. Most of these job upgrades went smoothly.
Some of our Cascalog based jobs did not perform as expected in the transition to running on Hadoop 1.x and returned unusual output. These will need to be significantly debugged, refactored, or replaced altogether. To prevent these few jobs from blocking our path forward with the rest, we built backwards compatibility into our EMR wrapper. This lets us specify a certain AMI version to use for each type of job, so we can continue to upgrade jobs and fix incompatible ones over time.
This had the additional benefit of reducing risk in the production system. Rather than a massive, risky release trying to forcibly upgrade everything at once and hoping we tested every job scenario, we could merge each small upgrade change to master and evolve the system over time. We started slowly and found job performance and stability improved dramatically, giving us confidence to take larger steps and move faster. Rather than fearing upgrades, we could accelerate our transition.
Last year Cascalog 2.0 launched, and it was a huge leap forward under the hood, separating the maturing Clojure API from the execution engine. After we went through the upgrade, Sam Ritchie has since written on some of the details fairly extensively here: Cascalog 2.0 In Depth
There were some breaking changes, but we found the migration to be fairly smooth. Here’s what we had to modify to make the transition:
Operations are functions, so now every operation macro is now renamed to show that it works like a function.
Many of the namespaces have changed to be more specific. Some functions are either particular to Cascalog or the underlying Cascading and the namespaces now reflect that.
Cascalog 2 uses Hadoop 1 instead of 0.2x.
The taps are much more rich and configurable for doing both input and output. We could remove a lot of serialization/deserialization hacks for working with delimited files. For example, in Cascalog 1 the textline tap doesn’t support a tap exporting compressed strings, but it does in later versions.
For example, our old workaround:
(?- (hfs-delimited output-path :delimiter "," :compress true :sinkmode :replace :quote "")
(<- [?part0 ?part1 ?part2 ?part3 ?part4 ?part5]
(get-aggregations ?field-1 ?field-2 ?field-3 ?field-4 ?field-5 ?count :> ?json)
(clojure.string/split ?json #"," :> ?part0 ?part1 ?part2 ?part3 ?part4 ?part5))))))
Could in Cascalog 2 just become:
(?- (hfs-textline output-path :compress true :sinkmode :replace)
(<- [?json]
(get-aggregations ?field-1 ?field-2 ?field-3 ?field-4 ?field-5 ?count :> ?json)))))
All our jobs continued to work as expected on the new EMR AMIs, and now we have access to the latest API and documentation, making Cascalog development much easier.
Tests are essential. You can have assurances that your application is correct after an upgrade if you have the right tests. Make sure you have unit tests to quickly iterate and test changes but make sure you have larger integration tests as well. We have integration tests for all our Hadoop jobs. These tests assured us quickly that each library and machine image version was working together properly and computing the expected results correctly.
Tech debt gets more painful over time. Many software packages use some form of semantic versioning to help you understand how important an upgrade is or what it will break. For example, we knew our upgrades could both require significant prioritization because they increased by major versions. However, you can often upgrade minor versions often to get smaller stability improvements and bug fixes. Make sure the client/customer for your product recognizes the importance of allocating time to major version upgrades, and the product owner leaves bandwidth for minor upgrades along the way.
Branches should be as short-lived as possible. Long-lived branches will make your life difficult by requiring you to rebase frequently. Merge often to master to minimize conflicts.
Use feature flags to hold off execution flow to major components. We used this strategy when we built a method to allow us to specify the Hadoop version based on job, which let us test and upgrade our jobs over a period of time rather than all at once.
Pairing is great. Often an upgrade process pushes into new technical territory, but that doesn’t mean the types of issues will be totally new. Get another engineer’s input on the problem, even if you can only pair for part of the project.
One of my favorite places to go in NYC is the Brandy Library. It's a library quite like no other, with bottles of every imaginable spirit on bookshelves surrounding the quiet TriBeCa bar.
This is a cross-post of the same article on the Intent Media tech blog.
As the web evolves, user identification has become key when it comes to research, privacy, product customization and engineering. Companies are always balancing the need to respect users’ data collection wishes with the product and economic benefits of providing a customized user experience. The days of relying on third party cookies are gone. Web companies continue to need more effective, trustworthy ways of identifying visitors as previously seen customers or households.
Anonymous user shopping patterns form the core of our ecommerce predictive decisioning platform. We have several classification and regression models that predict the probability of conversion and expected purchase price. We also model a visitor’s expected CTR (Click Through Rate) to perform customized ad selection. To accomplish all this, site page visit history is saved in a “user profile” which represents the best view we have of a site visitor.
When we see a new user, we generate a new UUID and store that in a first party cookie on a partner’s site. That becomes a partner-specific identifier for a that user’s browser.
While first-party visitor cookies are more effective than typical third-party advertising cookies at maintaining a persistent ID for a user, they suffer from a number of drawbacks.
Users still tend to delete their cookies from time to time.
Users buy new computers, discarding machines with the old visitor cookies.
Users switch browsers. If a visitor searches on Chrome and moves to Safari, they get a new visitor cookie and appear like a new user.
Users switch among multiple desktop and mobile devices. A visit from each device generates its own visitor ID cookie.
When you delete cookies, switch browsers or devices, a new visit looks like it is from a previously unseen visitor. This creates low-quality data and degrades the quality of our predictive models.
There are a number of sophisticated methods for probabilistic user identification from a variety of vendors that have built businesses around this difficult problem. In the interest of delivering a backwards-compatible system that could improve our performance in the presence of these issues, we decided to build a way to identify site visitors by their anonymous member ID in addition to the cookie-based visitor ID. This is a unique string that has a one-to-one mapping with usernames, but contains absolutely no personal information. The member ID provides limited coverage since only a small percentage of visitors log in while browsing on e-commerce sites. It still is effective given people tend to have just a single account across devices. This information retrieves and appends otherwise lost user history to a new visitor ID.
From an engineering perspective, we use a lot of Amazon Web Services (AWS) for our infrastructure ranging from EC2 for servers to S3 for long-term data storage. We use Amazon DynamoDB for storing user profiles for its simplicity and speed.
DynamoDB is a key/value store. While the actual specification is more complex, it takes a String for a key, which maps to an object that has a value with any number of attributes of various types, like Integers and Strings. For our key we use a GUID stored in a first-party cookie, that is, in the domain of the publisher sites that use our products.
We serve ads within 200 milliseconds. That is enough time for processing, but we cannot waste milliseconds. On an ad call, we send a request to DynamoDB and receive a record with the user’s history, which we use in our decisioning model. After the ad request, we put an update request on a queue for backend processing that then updates Dynamo with any new events we collected from the user.
If our data were small enough to fit in a traditional relational database, this would not be a difficult solution. We would only need to add a new index on a second column for whichever field we wanted to look up on, and then modify our request to look at both attributes with an OR in the where clause.
In the world of distributed key value stores, this is a bit trickier to implement without degrading performance. Records can only be retrieved by key. We could make secondary queries based on the results of previous queries pulled back. This however requires a sequence of requests to the database which adds significant time to process each ad request.
DynamoDB supports local secondary indices. We considered these, but found that in general they are for downsizing a multiple-row result set towards an intended record rather than expanding the result set from a single record into a larger view of a user’s history.
As we were developing, Dynamo released a feature called Global Secondary Indices. This allows multiple fields to serve as hashes for a record. GSIs may have been effective for a single additional field, but it is difficult to arbitrarily define new lookup fields and query them all in a single request. From the start we wanted the flexibility to expand to add new indices quickly. An example would be to extend this bridging across our third-party visitor ID and other future identifiers. GSIs must be defined at table creation and added to existing tables by completely rehashing. Since they did not lend themselves to our iterative development process, we looked to other options.
The solution was to develop a second table, which we call our lookups table, in addition to our main profiles table.
Alternate ID (String, Hash Key) | ID Type (Enum) | Mapped Visitor ID (String)
Visitor ID (String, Hash Key) | Alternate ID 1 | Alternate ID 2 | Other profile data
When we first see a visitor with an alternate ID, such as a member ID, we write a lookup record with a reverse index, thus mapping the member ID to the visitor ID. Now suppose that user opens a new browser. As soon as they log in, we pull the new profile for the new visitor ID, and we realize we have an existing profile mapped under an old visitor ID through the member ID. We can then pull down that old profile and merge in the historical data to populate the new one. The lookup record is then rewritten to map to the new, merged profile record.
Thus, the lookups table contains a mapping of alternate IDs to the visitor ID of the last browser seen so far of a given individual. Each profile record contains the best knowledge available for a given individual from the last time we have seen that browser. The resting state of the system at any point trades off some duplication of data for query-optimized profiles for a visit from any browser.
+----------------------+-----------+----------------------+ | User Actions | Lookups | Profiles | +----------------------+-----------+----------------------+ | User Visits Site |[empty] | abc-> {1 page} | | Gets Cookie “abc” | | | +----------------------+-----------+----------------------+ | User Logs In | 123-> abc | abc-> {2 pages} | +----------------------+-----------+----------------------+ | User Purchases | 123-> abc | abc-> {2 pages, 1 $} | +----------------------+-----------+----------------------+ | User Switches Devices| 123-> abc | abc-> {2 pages, 1 $} | | Gets cookie "def" | | def-> {1 pages} | +----------------------+-----------+----------------------+ | User logs in | 123-> def | abc-> {2 pages, 1 $} | | | | def-> {4 pages, 1 $} | +----------------------+-----------+----------------------+
The performance hit for a secondary lookup is only taken once, on the first request where we see an alternate ID. Subsequent lookups are a guaranteed hit on the primary store. Once we realize we have an existing historical profile for a user, we pull all that data and merge it into the newest profile.
This scales to support an arbitrary number of alternate ID types. Since DynamoDB supports batch requests, we do a batch request for all the alternate IDs, gather the unique visitor IDs that this retrieves, and then do a batch request for the historical profiles of the returned visitor IDs. This limits us to a maximum of two sequential requests, slashing latency.
Merging is nondestructive. We retain the profiles mapped from the old visitor ID, so if the user alternates between devices multiple times, we can retrieve accurate data quickly.
We can track how many devices we have merged together for a given user, and feed that as a signal into our model. Do users who use multiple browsers, tablets, and smartphones shop differently than users who only shop from their desktop? Our model knows.
We have demonstrated how with simple tools we have built a sophisticated cross-browser user profile. Intent Media’s data team does a lot of work to optimize ad performance for publishers and advertisers as well as the overall experience for users. User profile histories are one of many tools we use. This is just one small enhancement we’ve added to our platform as we continue to iterate and innovate.