Skip to content
Logo-black
  • Native Commerce Core™
    • IDC-marketscape-q4-2024-resources-hub-cover-562x562@2x

      Constructor is a leader in the IDC marketscape

      Constructor is a Leader in the IDC MarketScape on Worldwide Knowledge Discovery Software for External-Facing Use Cases, 2024.

        • Search & Autosuggest

          Hyper-personalized ecommerce search driven by AI and real-time shopper behavior.

        • Browse

          Personalize your ecommerce category pages for maximum conversions.

        • Recommendations

          Drive upsells and increase RPV with hyper-personalized product recommendations.

        • Collections

          Dynamic, personalized ecommerce collection pages that learn from every search, click, and purchase.

        • Quizzes

          Create smart product recommendation quizzes that drive shopper engagement.

        • Merchant Controls & Intelligence

          Get the ecommerce merchandising controls and insights you need to make high-impact changes that drive revenue.

        • Attribute Enrichment

          Automatically enrich product attributes and categories to give shoppers the best possible experience.

        • AI Shopping Assistant

          Help your online customers discover great products faster.

        • Retail Media

          Constructor Sponsored Listings complement (instead of compete with) your organic results.

        • Cross-Channel & Offsite Discovery

          Create and launch personalized, data-driven marketing campaigns across digital channels.

        • B2B Use-Cases

          Product discovery technology that’s purpose-built to handle the complex needs of B2B ecommerce.

  • Customers
    • A_B Testing Navigation Image

      The Ultimate Guide to A/B Testing for Search & Discovery

      Discover what your ecommerce team needs to know about fixing broken testing strategies, avoiding wasted effort, and unlocking the full value of A/B testing.

        • Upcoming Events

          Visit us in person at ecommerce industry events.

        • Reports & Guides

          New research, guidance, and insights for the future of retail.

        • Webinars

          Educational webinars, conversations with Constructor partners, and more.

        • Blog

          The latest ideas and news from the frontier of commerce search and discovery.

        • Split/Test Digest

          The most interesting learnings from A/B testing at Constructor.

        • News

          Read Constructor's latest news coverage and press announcements.

        • Documentation

          Learn about the components of Constructor's advanced, AI-powered product discovery suite.

        • About Us

          With AI in our DNA and by keeping ecommerce as our core focus, we’ve built the best product search and discovery solution specifically for the unique needs of retailers.

        • Partners

          Constructor partners with the world’s top retail consultants, systems integrators, and technology platforms.

        • Careers

          Constructor is building the next generation of search solutions.

Login
Book a Demo
  • Native Commerce Core™
    • IDC-marketscape-q4-2024-resources-hub-cover-562x562@2x

      Constructor is a leader in the IDC marketscape

      Constructor is a Leader in the IDC MarketScape on Worldwide Knowledge Discovery Software for External-Facing Use Cases, 2024.

        • Search & Autosuggest

          Hyper-personalized ecommerce search driven by AI and real-time shopper behavior.

        • Browse

          Personalize your ecommerce category pages for maximum conversions.

        • Recommendations

          Drive upsells and increase RPV with hyper-personalized product recommendations.

        • Collections

          Dynamic, personalized ecommerce collection pages that learn from every search, click, and purchase.

        • Quizzes

          Create smart product recommendation quizzes that drive shopper engagement.

        • Merchant Controls & Intelligence

          Get the ecommerce merchandising controls and insights you need to make high-impact changes that drive revenue.

        • Attribute Enrichment

          Automatically enrich product attributes and categories to give shoppers the best possible experience.

        • AI Shopping Assistant

          Help your online customers discover great products faster.

        • Retail Media

          Constructor Sponsored Listings complement (instead of compete with) your organic results.

        • Cross-Channel & Offsite Discovery

          Create and launch personalized, data-driven marketing campaigns across digital channels.

        • B2B Use-Cases

          Product discovery technology that’s purpose-built to handle the complex needs of B2B ecommerce.

  • Customers
    • A_B Testing Navigation Image

      The Ultimate Guide to A/B Testing for Search & Discovery

      Discover what your ecommerce team needs to know about fixing broken testing strategies, avoiding wasted effort, and unlocking the full value of A/B testing.

        • Upcoming Events

          Visit us in person at ecommerce industry events.

        • Reports & Guides

          New research, guidance, and insights for the future of retail.

        • Webinars

          Educational webinars, conversations with Constructor partners, and more.

        • Blog

          The latest ideas and news from the frontier of commerce search and discovery.

        • Split/Test Digest

          The most interesting learnings from A/B testing at Constructor.

        • News

          Read Constructor's latest news coverage and press announcements.

        • Documentation

          Learn about the components of Constructor's advanced, AI-powered product discovery suite.

        • About Us

          With AI in our DNA and by keeping ecommerce as our core focus, we’ve built the best product search and discovery solution specifically for the unique needs of retailers.

        • Partners

          Constructor partners with the world’s top retail consultants, systems integrators, and technology platforms.

        • Careers

          Constructor is building the next generation of search solutions.

Login
Book a Demo
Back to Blog

Optimizing trie-based spelling correction algorithms at Constructor.io

Posted on:
July 12, 2017
Author:
George Mossessian
Table of Contents:

At Constructor.io, we take pride in not only serving relevant, typo-tolerant results with ranking learned from user behavior, but also in doing so quickly. This is no mean feat — it’s not unusual for a dataset to have tens of millions of items with an average of 10 words per item, and if results are to update in real-time as the user types, we have maybe 100ms, including network latency, to go from user input to ranked results. This includes performing any spelling corrections, phonetic and morphological analysis, re-ordering, ranking analysis, and all sorts of other bells and whistles that go into providing your users with the results that they are most likely looking for. In this post, I want to talk just about the spelling correction step, and how we took ideas from various existing spelling correction algorithms that did not quite meet our needs to design one that does.

Whatever approach we come up with has to meet some constraints. As mentioned before, it has to be fast. The data structures involved should take up as little memory as possible. Building those data structures should be fast and memory-efficient. If at all possible, it would be best to re-use data structures that we already have.

The trie-based solution

It’s no secret that the standard data structure for predictive text search is a trie [Wiki], and we keep all the individual tokens, or keywords, of a customer dataset in a trie-like structure that’s been tweaked to best suit our needs. A quick refresher: a trie is basically a tree with one character per node. The children of a node containing character X are the characters that can come after X in the corpus, so it’s easy to see what words in your corpus could start with a given prefix.

There’s also a standard, well-known way to build a spell checker using a trie, described for example here. To summarize, you assemble candidate words one character at a time as you traverse the trie in a breadth-first search, throwing out branches when it becomes evident that they differ from the user input too strongly. This approach usually uses the Damerau-Levenshtein distance [Wiki] as a comparison metric, which just counts the number of insertions, deletions, transpositions, and substitutions needed to change one string into another.

And when we were just starting out, that’s exactly what we did. Except instead of Levenshtein distance, we used our own much more advanced string comparison metrics that take into account things like language-specific phonetics, term relevance, keyboard distance, etc. On the face of it, this was perfect. It used no new data structures, runtime memory use was a non-issue, and as long as the datasets we were dealing with had fewer than 10,000 or so unique tokens, the entire process could be done in single milliseconds or faster. However, as the size of our customers’ datasets grew, so did the computation time. When we got our first customer with hundreds of thousands of unique tokens, this was still fast enough, but barely, and it became apparent that we would eventually need to find another solution if we wanted to eventually serve customers with millions of unique tokens.

I wanted to check how this algorithm would have performed on such a large dataset if we had kept using it. Here’s a quick timing check through a python wrapper on an AWS C4 instance:

In [7]: get_spelling_suggestions(‘optimizashun’) Out[7]: [‘optimization’] In [8]: timeit get_spelling_suggestions(u’optimizashun’) 10 loops, best of 3: 131 ms per loop

And that’s a relatively fast one.

Moving to precomputation

When runtime computation is too slow, one natural question to ask is how much can be precomputed. Not wanting to reinvent any wheels, we came across an approach called a symmetric-delete dictionary. This is a Damerau-Levenshtein-based approach which begins with the observation that if two strings \(s_1\) and \(s_2\) have Damerau-Levenshtein distance \(D\), then some string \(s’_1\) is equal to some string \(s’_2\), where each \(s’_i\) is derived from \(s_i\) by \(d\) deletions, where \(0\leq d\leq D\). Thus, if you precompute a multimap whose values are strings which lead to their key after \(d\) deletions, the runtime computation of all strings within a certain Damerau Levenshtein distance of a given input becomes very fast.

For us, this comes with a few problems. First, the precomputation step is cumbersome and slow, which translates to longer index update times for our customers. And second, as I mentioned above, our approach to evaluating string comparisons is not based on Damerau distance! Meaning we had to set \(D\) to a sufficiently high number, and then apply our in-house comparisons to winnow down the results. But it gave a noticeable performance improvement over the trie-based solution. However, a length-\(n\) word has \({n\choose D}\) possible distance-\(D\) deletions, so for a dataset with \(N\) tokens, the multimap size grows as \(Nn^D\). Once we got to work with datasets with millions of unique tokens, this approach once again started to become unsustainable: the size of the multimap grew into the gigabytes both on disk and in memory, the precomputation of this multimap alone could take over an hour, and some input strings could have potentially thousands of Damerau-close strings which then had to all be scored and ranked by our own comparison algorithms.

So it was back to the drawing board. Was there a way to combines these two approaches?

The Middle Way

The most computationally expensive part of the trie-based spelling correction, it turns out, is unwrapping the first few characters during the breadth-first search: with short prefixes, there isn’t enough information present to confidently throw out many of the bad branches, and the uncertainty in phonetic interpretation of just a few characters makes our own comparison metrics slower. But what if we could just precompute prefixes of potential spelling corrections for prefixes of user input queries, and use those to bypass the most expensive part of the computation? That is, given a 5-character prefix, we would be precomputing the set of 3-character prefixes for all possible spelling corrections of all possible user inputs that begin with that 5-character prefix.

This approach has a number of bonuses. Unlike the second approach, where one string could affect the values of hundreds of keys in the precomputed multimap (“optimization” has 298 possible distance-1, 2, and 3 deletions), these prefixes could be computed for one value at a time, so the precomputation routine was low on memory use and relatively easy to parallelize. Since we’re only precomputing these values for short prefixes of frequent user-input tokens, the number of values is greatly limited, and does not grow with dataset size or token length: even for our largest customers, this precomputation takes less than 10 minutes and uses a few hundred kilobytes of memory. Finally, we’re using actual user data to dictate what gets precomputed — yet another way that, when you use Constructor.io, your search learns from your users.

And by the way, it’s plenty fast again.

In [9]: timeit get_spelling_suggestions_precomputed_prefixes(‘optimizashun’) 100 loops, best of 3: 4.35 ms per loop

And that’s a relatively slow one.

Subscribe to our Blog

Get conversion-driving insights delivered straight to your inbox.

What's next?

Uncover lost revenue opportunities with a complimentary Search Experience Audit.

According to the 2024 State of Ecommerce Search & Product Discovery Survey, nearly 70% of shoppers think the search function on retail websites needs an upgrade. Our team has run over 1000 A/B tests to identify easy-to-implement algorithmic and UX improvements that get results. Use their research to your advantage with a complimentary Search Experience Audit — no strings attached.

Request an Audit
  • Analyze your search results quality
  • Identify "no results" pages
  • Pinpoint irrelevant results for long-form queries
  • Uncover UX opportunities for Search, Autocomplete, and PLPs

According to the 2024 State of Ecommerce Search & Product Discovery Survey, nearly 70% of shoppers think the search function on retail websites needs an upgrade. Our team has run over 1000 A/B tests to identify easy-to-implement algorithmic and UX improvements that get results. Use their research to your advantage with a complimentary Search Experience Audit — no strings attached.

Request an Audit
  • MACH Certified
  • 21972-312_SOC_NonCPA-150x150
  • Cartification-mark_04-09-300x253-1-150x150
Constructor.io Logo
  • Technology
    • Native Commerce Core™
    • Request a Demo
    • The Proof Schedule®
  • Solutions
    • Search & Autosuggest
    • Browse
    • Recommendations
    • Collections
    • Quizzes
    • Merchant Controls & Intelligence
    • Attribute Enrichment
    • AI Shopping Assistant
    • B2B Solutions
  • Resources Hub
    • Events
    • Reports, Guides & Webinars
    • Blog
    • Experiments
    • News
    • Constructor vs Alternatives
    • Documentation
  • About
    • About Us
    • Partners
    • Security & Compliance
    • Careers
  • Technology
    • Native Commerce Core™
    • Request a Demo
    • The Proof Schedule®
  • Solutions
    • Search & Autosuggest
    • Browse
    • Recommendations
    • Collections
    • Quizzes
    • Merchant Controls & Intelligence
    • Attribute Enrichment
    • AI Shopping Assistant
    • B2B Solutions
  • Resources Hub
    • Events
    • Reports, Guides & Webinars
    • Blog
    • Experiments
    • News
    • Constructor vs Alternatives
    • Documentation
  • About
    • About Us
    • Partners
    • Security & Compliance
    • Careers
© 2025 Constructor All rights reserved
  • Terms of Service
  • Privacy Policy