The Split/Test Digest | Insights from 1,000+ Ecommerce A/B Tests

See How 40M Lipstick Searches Expose Need for Behavior-Driven Discovery | Constructor

Written by Anastasia Yakunina | Nov 14, 2025 4:21:31 PM

For ecommerce teams, the search bar is high-stakes real estate. It is often the first and most critical touchpoint for a shopper on a retailer’s website, and shoppers who use search convert at a rate 2.5X higher than those who browse. Yet despite its importance, most retail search experiences fall short. Shoppers in every category, from apparel to electronics to beauty, often use broad, familiar terms that don’t adequately describe their true intent, making it challenging for traditional keyword-based search engines to deliver satisfactory results.

It’s no surprise, then, that according to the 2025 State of Ecommerce Report, 68% of shoppers think search on retail sites needs an upgrade, and 86% report needing to reformulate their queries at least once before seeing satisfactory results. 

So what makes it so difficult to get search results right? For starters, many initial searches begin with vague, high-level queries. Take a query like “blue jeans.” On the surface, it looks specific. In practice, it could mean dozens of washes, finishes, and fits. The words look the same, but the intent behind them couldn’t be more varied. 

To explore this topic in more depth, we studied millions of real queries from one of the world’s largest beauty retailers, comparing the shopper’s initial query to what they ultimately purchased. The data makes it clear: if you limit your understanding of a query to just keywords, you risk serving irrelevant results, losing conversions, and weakening the brand experience right when purchase intent is highest.

Only behavior-driven personalization, anchored in what shoppers actually click and buy, can surface the right product at the right time, regardless of query length. Because shoppers effectively vote with their actions — the products they click, compare, and buy — their behavior reveals the true preferences behind their words. When the underlying search engine can factor those signals into the results, even broad or vague queries can be interpreted with a level of precision that keyword engines simply can’t achieve.

Let’s explore how.

What the Data Shows

We began by examining more than 40 million search logs that resulted in purchases of lip products. Then, we took it a step further, reviewing over 5,000 product images to map every purchased shade across two color dimensions:

  • Light to dark

  • Beige to red

Each unique SKU became a data point, sized by purchase volume.
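To make the mapping concrete, here is a minimal Python sketch of how a shade could be placed on two axes and sized by purchase volume. It is not the method used in the study: the swatch colors, SKU names, and purchase list are invented, and the choice of HSV value for light-to-dark and HSV saturation for beige-to-red is only a rough assumption.

```python
import colorsys
from collections import Counter


def shade_coordinates(rgb):
    """Map an (r, g, b) swatch with 0-255 channels to (lightness, redness), each in [0, 1]."""
    r, g, b = (channel / 255 for channel in rgb)
    _hue, saturation, value = colorsys.rgb_to_hsv(r, g, b)
    # Assumption: HSV value stands in for light-to-dark, and HSV saturation
    # stands in for beige-to-red (beige is muted, a true red is saturated).
    return value, saturation


def build_points(purchased_skus, swatches):
    """Return one plot point per SKU, with purchase volume as the dot size."""
    volume = Counter(purchased_skus)
    points = []
    for sku, count in volume.most_common():
        lightness, redness = shade_coordinates(swatches[sku])
        points.append({"sku": sku, "lightness": lightness, "redness": redness, "volume": count})
    return points


# Invented sample data: swatch colors and one entry per purchased unit.
SWATCHES = {"classic-red": (255, 0, 0), "deep-berry": (120, 30, 60), "soft-beige": (245, 245, 220)}
PURCHASES = ["deep-berry", "classic-red", "deep-berry", "soft-beige", "deep-berry"]

for point in build_points(PURCHASES, SWATCHES):
    print(point)
```

A real pipeline would need a more careful color model and a way to extract a representative swatch from each product image, but the shape of the output is the same: one point per SKU, two coordinates, and a volume.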

All of these shades are technically relevant to the query “red lipstick”. At the top of the graphic, a single, outsized dot represents the most popular product in this category — the classic red lipstick shade most shoppers probably picture when they imagine “red.” What’s striking, though, is the cluster of hundreds of other dots below it, each representing different shades that thousands of shoppers actually purchased. 

The densest part of the cluster sits in the darker, more beige range — berry and brown tones that consistently outperformed lighter shades across the board. Despite the universal familiarity of the term “red,” the data makes clear it doesn’t mean the same thing to every shopper.

A traditional keyword engine would miss many of these high-performing shades simply because they don’t contain that exact term, even though shoppers repeatedly chose them when searching for it.
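The gap can be illustrated with a small Python sketch that compares strict keyword matching against a ranking built from what shoppers bought after the same query. The catalog, product titles, and purchase log below are invented for illustration, and both functions are simplified stand-ins rather than how any production engine works.

```python
from collections import Counter

# Invented catalog: SKU -> product title.
CATALOG = {
    "sku-1": "Classic Red Lipstick",
    "sku-2": "Deep Berry Lip Stain",
    "sku-3": "Brick Matte Lip Color",
}

# Invented log of (query, purchased SKU) pairs.
PURCHASE_LOG = [
    ("red lipstick", "sku-1"),
    ("red lipstick", "sku-2"),
    ("red lipstick", "sku-2"),
    ("red lipstick", "sku-3"),
    ("nude lipstick", "sku-3"),
]


def keyword_match(query, catalog):
    """Return SKUs whose title contains every query word (a strict keyword match)."""
    tokens = query.lower().split()
    return [
        sku
        for sku, title in catalog.items()
        if all(token in title.lower().split() for token in tokens)
    ]


def behavioral_rank(query, log):
    """Return SKUs ordered by how often they were purchased after this query."""
    counts = Counter(sku for logged_query, sku in log if logged_query == query)
    return [sku for sku, _count in counts.most_common()]


print(keyword_match("red lipstick", CATALOG))
print(behavioral_rank("red lipstick", PURCHASE_LOG))
```

In this toy data the keyword match returns only the product with "red" and "lipstick" in its title, while the purchase-based ranking puts the berry stain first, because that is what shoppers bought most often after typing the query.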

Shade Isn’t the Only Player

Color wasn’t the only pattern we found in the data. Product format varied just as much. Even when the starting query was “lipstick,” shoppers often purchased glosses, oils, stains, or balms instead.

These shifts underscore just how much variation can hide behind the same query, and how differently shoppers can interpret even a seemingly straightforward one. And those patterns don’t stand still: shifts in format trends over time make the picture even more complicated.

Over the past two years, lip balm has overtaken lipstick to become the most popular format, while lip oils surged briefly in early 2023 before giving way to balms.

The format also impacts shade preferences. Shoppers favored bold, saturated colors in traditional lipsticks, but leaned toward lighter, more translucent tones when buying glosses or balms.

A keyword system can’t navigate these cross-preferences or shifting trends, because it treats shade and format as isolated terms instead of interconnected signals that shape what each shopper is most likely to buy.

The Keyword Gap

So what does this all mean? While we’ve shown that darker shades sold best in the data we looked at, those shoppers rarely used keywords like “dark” or “berry” in their searches. They defaulted to generic terms like “red,” “pink,” or “nude.”

Herein lies the problem with traditional keyword-based search engines: those words don’t mean the same thing to everyone. A “red” that converts for one shopper might be completely the wrong thing for another. 

The result is a wide spread of clicks across many shades, demonstrating that keywords alone can’t fully parse intent and shopper preferences.

Why This Matters for Retailers

Retailers face a massive translation challenge: understanding what shoppers mean, not just what they say. Keywords still matter. After all, they’re the first clue to intent. But they’re an imperfect one. 

Humans are messy communicators. We use shortcuts, forget product names, and often struggle to describe what we want. A shopper might type “red lipstick” but actually want a sheer berry balm for everyday wear. Multiply that uncertainty across thousands of categories — skincare, fragrances, foundation, and beyond — and the scale of the problem becomes enormous.

When search engines rely only on keywords, they misinterpret intent in three ways: they treat language as literal, they fail to learn from shopper behavior, and they assume every searcher means the same thing. The result is a systematic mismatch: products that technically fit the query but fail to capture the moment. 

Across a full retail catalog, those small disconnects multiply into broken journeys, wasted impressions, and lost trust. Each irrelevant result tells the shopper, “You don’t understand me.”

At enterprise scale, this challenge grows exponentially. Tens of thousands of SKUs, seasonal trends, shifting preferences, and constant product launches make it impossible to manually tune relevance. Even the best merchandising teams can’t keep pace with how quickly shopper intent evolves. 

Retailers who can’t interpret behavior at scale risk delivering search experiences that feel static in a world that moves in real time.

How Retailers Can Close the Intent Gap

Closing the intent gap begins with transitioning from static retrieval to true learning. Traditional keyword-based systems don’t actually learn; they just match words to products using fixed rules that never adapt. 

To close the gap, retailers need search engines that evolve with every interaction. That means observing what shoppers actually do (what they click, ignore, add to their cart, and buy) and updating results in real time based on those signals. Each of those micro-actions provides evidence of what a shopper really meant, not just what they typed.

This is where reinforcement learning excels. It continuously tests, measures, and adapts to outcomes, rewarding behaviors that lead to successful journeys: clicks that convert, products that stay in carts, sessions that end in satisfaction (and penalizing those that don’t). Over time, it becomes better at predicting what will engage each individual shopper, not the “average” one. Instead of optimizing for keyword match, it optimizes for intent fulfillment.
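As a rough illustration of that test, measure, and adapt loop, the Python sketch below uses an epsilon-greedy bandit, a deliberately simplified stand-in for reinforcement learning. It is not Constructor's system: the class name, the reward values (1.0 for a purchase, 0.0 for an ignored result), and the SKU names are all assumptions made for the example.

```python
import random
from collections import defaultdict


class QueryBandit:
    """Epsilon-greedy bandit that learns which SKU to show for each query."""

    def __init__(self, epsilon=0.1, seed=None):
        self.epsilon = epsilon                # share of traffic used to explore
        self.rng = random.Random(seed)
        self.shows = defaultdict(int)         # (query, sku) -> times shown
        self.rewards = defaultdict(float)     # (query, sku) -> total reward

    def value(self, query, sku):
        """Average reward observed so far for this query and SKU."""
        shown = self.shows[(query, sku)]
        return self.rewards[(query, sku)] / shown if shown else 0.0

    def choose(self, query, skus):
        """Usually show the best-known SKU, occasionally test another one."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(skus)
        return max(skus, key=lambda sku: self.value(query, sku))

    def update(self, query, sku, reward):
        """Record the outcome of showing a SKU (hypothetical reward scale)."""
        self.shows[(query, sku)] += 1
        self.rewards[(query, sku)] += reward


# Usage with invented outcomes: the berry balm was bought, the classic red was ignored.
bandit = QueryBandit(epsilon=0.0, seed=0)
candidates = ["classic-red", "berry-balm"]
bandit.update("red lipstick", "classic-red", 0.0)
bandit.update("red lipstick", "berry-balm", 1.0)
print(bandit.choose("red lipstick", candidates))
```

Ignored results pull a product's average reward down, which is the sketch's version of penalizing unsuccessful outcomes. A personalized system would additionally key these statistics on shopper context rather than on the query alone.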

That’s why the beauty retailer we based this study on didn’t have to manually tune for changing shopper behavior. Their search engine, powered by reinforcement learning and real-time behavioral data, did it for them. As shoppers searched, clicked, and purchased, the system automatically learned what “red lipstick” meant for each individual: a deep berry gloss for one shopper, a matte brick for another. It continually refined relevance behind the scenes, surfacing the right product at the right time. The results were clear: higher engagement, stronger conversion, and a smoother path from intent to purchase.

Constructor makes this kind of reinforcement learning–driven discovery achievable at scale. Our platform continuously analyzes clickstream data across billions of interactions, running live experiments to tune search and recommendations toward business goals like conversion, margin, and inventory balance. The system doesn’t just learn “red” — it learns what your shoppers mean by red, in every context, across every category.

The Takeaway

Every shopper speaks their own retail language. Words like “red,” “nude,” or “classic fit” are just rough estimates of intent (i.e., placeholders for what people really want). 

As we stated earlier, humans are messy communicators. We use shortcuts, guess at terms, and often don’t know the perfect words to say what we want. Closing that gap requires systems that can learn from what shoppers do, not just what they say.

The future of product discovery will belong to retailers who pair linguistic signals with behavioral evidence, allowing reinforcement learning to translate clicks and purchases into meaningful understanding. 

Keyword search was about matching products to words. Behavior-driven discovery is about matching products to people. And that shift, from static retrieval to real-time learning, is what will separate retailers who keep up from those who get left behind.
