This is a cross-post of the same article on the Intent Media tech blog.
Rationale & Background
As the web evolves, user identification has become key to research, privacy, product customization and engineering. Companies are always balancing the need to respect users’ data collection wishes with the product and economic benefits of providing a customized user experience. The days of relying on third-party cookies are gone. Web companies continue to need more effective, trustworthy ways of identifying visitors as previously seen customers or households.
Anonymous user shopping patterns form the core of our ecommerce predictive decisioning platform. We have several classification and regression models that predict the probability of conversion and expected purchase price. We also model a visitor’s expected CTR (Click Through Rate) to perform customized ad selection. To accomplish all this, site page visit history is saved in a “user profile” which represents the best view we have of a site visitor.
When we see a new user, we generate a new UUID and store it in a first-party cookie on a partner’s site. That becomes a partner-specific identifier for that user’s browser.
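As an illustration of the cookie step, here is a minimal sketch using only the Python standard library. The cookie name `visitor_id`, the one-year lifetime, and the function `get_or_create_visitor_id` are assumptions made for the example, not details of our production code.

```python
import uuid
from http.cookies import SimpleCookie
from typing import Optional, Tuple

COOKIE_NAME = "visitor_id"          # hypothetical cookie name
ONE_YEAR_SECONDS = 60 * 60 * 24 * 365  # assumed lifetime


def get_or_create_visitor_id(cookie_header: str) -> Tuple[str, Optional[str]]:
    """Return (visitor_id, set_cookie_value).

    set_cookie_value is None when the browser already sent a visitor cookie,
    otherwise it is the value to send back in a Set-Cookie header.
    """
    cookies = SimpleCookie()
    cookies.load(cookie_header or "")
    if COOKIE_NAME in cookies:
        # Returning browser: reuse the existing identifier.
        return cookies[COOKIE_NAME].value, None

    # New browser: mint a random UUID and ask the browser to store it.
    visitor_id = str(uuid.uuid4())
    outgoing = SimpleCookie()
    outgoing[COOKIE_NAME] = visitor_id
    outgoing[COOKIE_NAME]["path"] = "/"
    outgoing[COOKIE_NAME]["max-age"] = ONE_YEAR_SECONDS
    return visitor_id, outgoing[COOKIE_NAME].OutputString()
```

Because the cookie is set in the partner's own domain, the identifier is only meaningful on that partner's site.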
While first-party visitor cookies are more effective than typical third-party advertising cookies at maintaining a persistent ID for a user, they suffer from a number of drawbacks.
Users still tend to delete their cookies from time to time.
Users buy new computers, discarding machines with the old visitor cookies.
Users switch browsers. If a visitor searches on Chrome and moves to Safari, they get a new visitor cookie and appear like a new user.
Users switch among multiple desktop and mobile devices. A visit from each device generates its own visitor ID cookie.
When a user deletes cookies or switches browsers or devices, the next visit looks like it comes from a previously unseen visitor. This creates low-quality data and degrades the quality of our predictive models.
There are a number of sophisticated methods for probabilistic user identification from a variety of vendors that have built businesses around this difficult problem. In the interest of delivering a backwards-compatible system that could improve our performance in the presence of these issues, we decided to build a way to identify site visitors by their anonymous member ID in addition to the cookie-based visitor ID. The member ID is a unique string that has a one-to-one mapping with usernames, but contains absolutely no personal information. It provides limited coverage, since only a small percentage of visitors log in while browsing e-commerce sites. It is still effective, given that people tend to have just a single account across devices. We use it to retrieve otherwise lost user history and append it to the profile for a new visitor ID.
Current Architecture
From an engineering perspective, we use a lot of Amazon Web Services (AWS) for our infrastructure ranging from EC2 for servers to S3 for long-term data storage. We use Amazon DynamoDB for storing user profiles for its simplicity and speed.
DynamoDB is a key/value store. While the actual specification is more complex, it takes a String for a key, which maps to an object that has a value with any number of attributes of various types, like Integers and Strings. For our key we use a GUID stored in a first-party cookie, that is, in the domain of the publisher sites that use our products.
We serve ads within 200 milliseconds. That is enough time for processing, but we cannot waste milliseconds. On an ad call, we send a request to DynamoDB and receive a record with the user’s history, which we use in our decisioning model. After the ad request, we put an update request on a queue for backend processing that then updates Dynamo with any new events we collected from the user.
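The read-then-queue pattern can be sketched as follows. This is only an illustration: a plain dictionary stands in for the DynamoDB profiles table, `queue.Queue` stands in for the backend queue, and `choose_ad` is a hypothetical placeholder for the decisioning model.

```python
import queue

# Stand-ins for real infrastructure (assumptions for this sketch).
profiles = {"abc": {"pages": ["/home"]}}   # visitor ID -> profile
update_queue = queue.Queue()               # pending profile updates


def choose_ad(profile):
    """Hypothetical placeholder for the decisioning model."""
    return "returning-visitor-ad" if profile["pages"] else "new-visitor-ad"


def handle_ad_call(visitor_id, event):
    """Serve an ad using only a single read on the request path."""
    profile = profiles.get(visitor_id, {"pages": []})
    ad = choose_ad(profile)
    # The write is deferred: enqueue it and return immediately.
    update_queue.put((visitor_id, event))
    return ad


def drain_updates():
    """Backend worker: apply queued events to the profile store."""
    while not update_queue.empty():
        visitor_id, event = update_queue.get()
        profiles.setdefault(visitor_id, {"pages": []})["pages"].append(event)
```

The point of the split is that the ad request pays only for the read; the write happens off the request path.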
Implementation
If our data were small enough to fit in a traditional relational database, this would not be a difficult problem. We would only need to add a new index on a second column for whichever field we wanted to look up on, and then modify our request to look at both attributes with an OR in the WHERE clause.
In the world of distributed key-value stores, this is a bit trickier to implement without degrading performance. Records can only be retrieved by key. We could issue secondary queries based on the results of earlier ones, but that requires a sequence of requests to the database, which adds significant time to each ad request.
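For comparison, the relational version might look like the sketch below, using an in-memory SQLite database. The table and column names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE profiles ("
    " visitor_id TEXT PRIMARY KEY,"
    " member_id TEXT,"
    " pages INTEGER)"
)
# The second index makes lookups by member ID as cheap as lookups by key.
conn.execute("CREATE INDEX idx_profiles_member_id ON profiles (member_id)")
conn.executemany(
    "INSERT INTO profiles VALUES (?, ?, ?)",
    [("abc", "123", 2), ("def", None, 1), ("xyz", "999", 5)],
)


def find_profiles(conn, visitor_id, member_id):
    """Fetch every profile matching either identifier in one query."""
    rows = conn.execute(
        "SELECT visitor_id, member_id, pages FROM profiles"
        " WHERE visitor_id = ? OR member_id = ?"
        " ORDER BY visitor_id",
        (visitor_id, member_id),
    )
    return rows.fetchall()
```

Here `find_profiles(conn, "def", "123")` returns both the new browser's row and the older row that shares the member ID, all in a single round trip.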
Options Considered
Local Secondary Indices
DynamoDB supports local secondary indices. We considered these, but found that in general they are for narrowing a multiple-row result set down to an intended record rather than expanding the result set from a single record into a larger view of a user’s history.
Global Secondary Indices (GSIs)
As we were developing, Dynamo released a feature called Global Secondary Indices. This allows multiple fields to serve as hash keys for a record. GSIs might have been effective for a single additional field, but it is difficult to arbitrarily define new lookup fields and query them all in a single request. From the start we wanted the flexibility to add new indices quickly. An example would be to extend this bridging across our third-party visitor ID and other future identifiers. GSIs must be defined at table creation, or added to an existing table by completely rehashing it. Since they did not lend themselves to our iterative development process, we looked to other options.
Solution
The solution was to develop a second table, which we call our lookups table, in addition to our main profiles table.
Lookups Table
Alternate ID (String, Hash Key) | ID Type (Enum) | Mapped Visitor ID (String)
Profiles Table
Visitor ID (String, Hash Key) | Alternate ID 1 | Alternate ID 2 | Other profile data
When we first see a visitor with an alternate ID, such as a member ID, we write a lookup record with a reverse index, thus mapping the member ID to the visitor ID. Now suppose that user opens a new browser. As soon as they log in, we pull the new profile for the new visitor ID, and we realize we have an existing profile mapped under an old visitor ID through the member ID. We can then pull down that old profile and merge in the historical data to populate the new one. The lookup record is then rewritten to map to the new, merged profile record.
Thus, the lookups table maps each alternate ID to the visitor ID of the browser on which we most recently saw a given individual. Each profile record contains the best knowledge available for a given individual as of the last time we saw that browser. At rest, the system trades some duplication of data for profiles that are optimized for querying on a visit from any browser.
Sample Flow
+----------------------+-----------+----------------------+
| User Actions         | Lookups   | Profiles             |
+----------------------+-----------+----------------------+
| User Visits Site     | [empty]   | abc-> {1 page}       |
| Gets Cookie “abc”    |           |                      |
+----------------------+-----------+----------------------+
| User Logs In         | 123-> abc | abc-> {2 pages}      |
+----------------------+-----------+----------------------+
| User Purchases       | 123-> abc | abc-> {2 pages, 1 $} |
+----------------------+-----------+----------------------+
| User Switches Devices| 123-> abc | abc-> {2 pages, 1 $} |
| Gets cookie "def"    |           | def-> {1 pages}      |
+----------------------+-----------+----------------------+
| User logs in         | 123-> def | abc-> {2 pages, 1 $} |
|                      |           | def-> {4 pages, 1 $} |
+----------------------+-----------+----------------------+
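The flow above can be replayed with a small in-memory sketch. Two dictionaries stand in for the lookups and profiles tables, and a profile is modeled as a set of numbered events so that merging is a simple set union. The names `record_event` and `summary` and the event model are assumptions for illustration, not our production schema.

```python
import itertools

lookups = {}    # alternate ID -> visitor ID of the most recently seen browser
profiles = {}   # visitor ID -> set of (event_number, kind) tuples
_event_numbers = itertools.count(1)


def record_event(visitor_id, kind, member_id=None):
    """Record one event, merging in older history when a member ID is known."""
    profile = profiles.setdefault(visitor_id, set())
    if member_id is not None:
        previous = lookups.get(member_id)
        if previous is not None and previous != visitor_id:
            # Copy the old history into the new profile; the old record stays.
            profile |= profiles[previous]
        # Point the lookup at the browser we are seeing now.
        lookups[member_id] = visitor_id
    profile.add((next(_event_numbers), kind))
    return profile


def summary(visitor_id):
    """Count pages and purchases in a profile."""
    kinds = [kind for _, kind in profiles[visitor_id]]
    return {"pages": kinds.count("page"), "purchases": kinds.count("purchase")}


# Replay the sample flow.
record_event("abc", "page")                    # user visits site, gets cookie "abc"
record_event("abc", "page", member_id="123")   # user logs in
record_event("abc", "purchase")                # user purchases
record_event("def", "page")                    # user switches devices, gets cookie "def"
record_event("def", "page", member_id="123")   # user logs in on the new device
```

After the replay, `lookups` maps `123` to `def`, `summary("abc")` still reports 2 pages and 1 purchase, and `summary("def")` reports 4 pages and 1 purchase, matching the last row of the table. Using a set of events rather than counters means that merging the same history twice does not double count it.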
Key Benefits
The performance hit for a secondary lookup is only taken once, on the first request where we see an alternate ID. Subsequent lookups are a guaranteed hit on the primary store. Once we realize we have an existing historical profile for a user, we pull all that data and merge it into the newest profile.
This scales to support an arbitrary number of alternate ID types. Since DynamoDB supports batch requests, we do a batch request for all the alternate IDs, gather the unique visitor IDs that this retrieves, and then do a batch request for the historical profiles of the returned visitor IDs. This limits us to a maximum of two sequential requests, no matter how many alternate ID types we add, which keeps latency low.
Merging is nondestructive. We retain the profiles mapped from the old visitor ID, so if the user alternates between devices multiple times, we can retrieve accurate data quickly.
We can track how many devices we have merged together for a given user, and feed that as a signal into our model. Do users who use multiple browsers, tablets, and smartphones shop differently than users who only shop from their desktop? Our model knows.
Conclusion
We have demonstrated how with simple tools we have built a sophisticated cross-browser user profile. Intent Media’s data team does a lot of work to optimize ad performance for publishers and advertisers as well as the overall experience for users. User profile histories are one of many tools we use. This is just one small enhancement we’ve added to our platform as we continue to iterate and innovate.