One of the most interesting and surprising tests we ran was on personalization caching: whether the same query should be reranked or personalized within a single session for a given user.
As background, Constructor was built on the idea that both ranking and personalization have strong applicability in real time. Rankings can change quickly (for example, during a sale, or when items go out of stock), and personalization is often most valuable in real time (for example, for most ecommerce businesses, users do not return regularly, and so if personalization is to have value for those users, it has to be applied right now).
We’re one of the few systems with this capability, and the capability itself was central to us winning some of our early A/B tests because we could adapt faster than the systems we were competing against. But it also left an interesting corner case: what should happen if a user has a query, tells us what kind of personalization they like via their actions, and then comes back to the same query? Should the results stay the same because that is what most users would expect, or should they personalize immediately to surface products that are more likely to appeal to the user given the information we now know?
Imagine you’re shopping for new clothes for work and search for “jeans” on an apparel retailer’s website. On the results page, you click into one or two “slim fit” options and open their PDPs (product detail pages). For whatever reason (maybe you don’t like the colors, or maybe you just want to see all of the options again), you hit the back button to return to the full list of results. Should the products be ranked in the same order, or should you now see more slim fit options appear earlier in the results? On the one hand, it may be confusing to see the order change. On the other hand, we may be able to show you more jeans we think you’ll like before you become distracted or frustrated and leave the site.
This is a topic that people have very strong opinions about. Some of our customers demanded it and would not partner with us unless we told them we were capable of it: they wanted their users to very clearly see personalization happening. Other customers hated the idea and said they would not partner with us unless it was disabled.
Our solution was personalization caching, or the ability to keep the results the same (to cache them) for individual queries a user has recently tried. This meant giving users the benefit of personalization for queries they haven’t searched for before, but for queries a user has recently run and which have a set of results the user already expects, we’d give them those same results and avoid confusing them by personalizing or reranking.
The ability to turn this on or off satisfied everyone, but it still raised the question: what is best for users? What should we recommend? And how would we really know?
Enter the A/B Test
Luckily for us, there was a way to answer the question. We could talk to some of our customers and offer a test: turn on personalization caching for half of your users, turn it off for the other half, and see what the changes in algorithm do to the metrics that matter like add-to-carts, conversions, revenue per visitor, etc.
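Splitting users in half for a test like this is typically done with a deterministic hash of the user ID, so a user stays in the same group across sessions without any stored state. A sketch, with illustrative names (not Constructor's actual assignment logic):

```python
import hashlib

def in_test_group(user_id: str,
                  experiment: str = "personalization_caching",
                  split: float = 0.5) -> bool:
    """Deterministically assign a user to the test or control group.

    Hashing (experiment, user_id) gives a stable pseudo-random bucket
    in [0, 1); users below `split` get the feature under test.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    return bucket < split
```

Including the experiment name in the hash keeps assignments independent across experiments, so being in the test group for one feature doesn't correlate with being in the test group for another.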
Twelve large ecommerce companies we work with agreed to the test. They span different languages (English, French, German, and Dutch), operate across North America, Europe, and Australia, and come from different industries (clothing, cosmetics, shoes, baby products, DIY project products, and more).
Normally, this is the sort of test a single company would run for itself, but we could standardize it and run it for many companies in parallel. And the resulting data would be fascinating, revealing how outcomes differ across different types of companies.
The Results
The results were not what any of us expected.
The primary metrics we cared about were purchases and revenue, but we also tracked metrics like add-to-carts and clicks, which happen closer to the query itself and are more numerous, so they reach statistical significance faster. We ran the test for a different length of time at each company, depending on its level of traffic.
In short, surprising all of us, personalization caching didn’t much matter one way or the other. For 9 of the 12 companies, results were either flat or did not reach statistical significance after 101 days for any of the tracked metrics. Of the three companies that did reach statistical significance, none reached it on revenue, and only one reached it on purchases (all three reached it on add-to-carts). The one company that reached significance on purchases and add-to-carts did so only on their German index; interestingly, the English version of the same index was flat. With personalization caching turned on, the German index saw a 2.47% increase in purchases and a 1.26% increase in add-to-carts. The numbers from that one test are a little doubtful, though, because it was such an outlier among all the tests, and by the statistics of how A/B tests work, the more of them you run, the more likely you are to get one that is statistically significant by chance (XKCD has a delightful comic explaining this).
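The significance-by-chance point can be made concrete with a little arithmetic. If 12 independent tests are each run at a 5% significance level and there is truly no effect anywhere, the chance of at least one test coming up "significant" purely by chance is substantial:

```python
# Probability of at least one false positive across 12 independent
# A/B tests, each at a 5% significance level (alpha), with no real effect.
alpha = 0.05
n_tests = 12
p_at_least_one = 1 - (1 - alpha) ** n_tests
print(round(p_at_least_one, 2))  # → 0.46
```

In other words, under these assumptions there is nearly a coin flip's chance of one spurious "winner" among twelve tests, which is why a lone outlier result deserves skepticism.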
For the two companies that saw statistically significant changes in add-to-carts only, one saw a small increase of 0.43%, and one saw a small decrease of 0.95% (but, interestingly, a non-statistically significant increase of 0.63% in purchases).
You can see additional, more specific numbers from the statistically significant tests below, but the overall conclusion from them was the most interesting part: from a pure performance standpoint, whether results change for a query a user recently tried doesn’t really matter.
What to Conclude
Within Constructor, both camps found the results a little anticlimactic. With the amount of debate and strong opinions on the subject, we all expected it to matter somehow. None of us expected results that were basically flat.
But there was a silver lining. As we spoke to more customers about the question, we found that they still had strong opinions, and the results almost seemed freeing: they meant customers could decide whether to turn personalization caching on or off purely based on what they considered a better user experience, or better for their brand. Just because something does not have a statistically significant impact on immediate business KPIs does not mean it doesn’t matter.
And that’s the way we now talk about the question to our customers. New companies just starting to partner with us will often still have strong opinions on this subject — and that’s OK! Now, instead of debating how either decision will affect performance, we can instead show them the data and let them know they can turn personalization caching on or off within Constructor depending on what they think is best for their brand and their shoppers, without needing to worry about how it will affect their revenue and purchase numbers in the short term.
And at the end of the day, this is one of the most important things I believe our customers look to us for: it’s not just functionality and features, it’s expertise and experience. Lots of search engines have lots of features, and we want to have the ones our customers are most excited about. But more importantly, we want to have the experience and the data to share with them on what will most affect their business KPIs immediately, and what is more a decision of taste.
Note: within the data below, which shows metrics from the three A/B tests that reached some kind of statistical significance, the specific metrics that reached significance are denoted with a 🌟. Metrics that showed a positive effect are denoted with a 💚, and metrics that showed a negative effect are denoted with a 🔻.