It is a crisp February evening, you’re wrapped up in a blanket with your favorite beverage in hand, and you’re getting ready to watch your favorite TV show. You open up your laptop, fetch your headphones, click connect, and… nothing happens. Your headphones have finally breathed their last; after five years of reliable service, it is time for a replacement. You are actually kind of excited—this is your opportunity to treat yourself, maybe even get one of those ANC (active noise cancellation) headphones you hear so much about.
You open up Google and search: “over-ear headphones ANC”. The search returns 10,000,000 results, the first 20 being various blog posts arguing why brand X is better than brand Y and why you don’t really need headphones at all; a tin can and some string will do the job just fine. Forget it, you think, I’ll just search the store’s website directly.
1000+ results.
You sigh and modify your search: “over-ear headphones with ANC black”
200+ results.
With another sigh, you try again: “over-ear headphones with ANC black premium HiFi 2025”, apply five different filters, and click search.
50 results — finally something manageable! You start scrolling, opening each headphone pair’s specifications and details, trying to remember and compare every single one of the 50 available pairs. Before you know it, it is 1 AM, you haven’t watched anything, and you don’t even have anything to show for it—you still have five headphones you can’t decide between. You sigh, close your laptop, and go to sleep, exhausted.
Thinking about this situation, which we all experience in one way or another daily, two obvious issues come to mind. The first one is that we have all been conditioned to think like machines when typing up a search query—writing only the most relevant and important keywords, removing words that make it a question (e.g., where, this, how), and making our query as short as possible. This results in you writing a query like this:
“over-ear headphones ANC black Bluetooth 2025 model premium microphone Croatia”
instead of asking a question as you would a friend:
“What over-ear Bluetooth headphones, suitable for loud office environments and calls and with an excellent-quality microphone, can I find for under €300 in my country?”
For years, this approach to search has been all we have known—learning how machines parse our queries and adapting ourselves, our language, and our queries to simplify their jobs so they could provide us with better results. People have trained themselves to think like machines instead of machines learning to understand how people naturally think and talk.
The second issue, or perhaps better put, opportunity for improvement, is the way retrieved information is displayed. Often, the exact information we’re looking for is buried under a bunch of irrelevant content. To extract useful knowledge, we need to spend time digging through, prioritizing, and sorting relevant from irrelevant results returned by the information retrieval system. For a long time, this has been all we have known, and thus we’ve accepted our fate with silent resignation, considering this inefficiency a necessary evil in our quest for information.
However, the latest advancements in the field of generative artificial intelligence, led by the advent of large language models alongside text embedding models, could promise a long-awaited paradigm shift – if their capabilities are utilized to their full extent.
In this series of blogs, we’re going to cover various approaches to search—starting from traditional and currently most widely used methods such as word matching and BM25 to more recent AI-driven approaches, including embedding search and retrieval-augmented generation (RAG). The main focus will be on searching text-based information, but we’ll also touch on searching data in other formats, such as images or audio. We’ll go over each step needed to set up a fully functional AI-supported search service from scratch, inspired by an actual AI-augmented search project we recently implemented at CROZ AI. By the end, I hope you will be inspired and excited to augment your existing search with AI!
Let’s kick off our journey by exploring the capabilities of traditional search approaches, still the most widely used search paradigm in most business cases.
Traditional approaches before AI-powered search
There are many (traditional) ways to search over a database of textual data.
Boolean Search
The simplest and most naive approach is Boolean search – retrieving documents that contain text exactly matching the search query, optionally combined with logical operators such as AND, OR, and NOT. While this approach is simple and easy to implement and understand, it comes with an obvious drawback – the user has to be very precise when formulating their query. If they include additional search terms that the target document doesn’t contain, the search may fail to find an otherwise relevant document. Furthermore, in its most basic implementation, Boolean search is also relatively slow compared to other approaches, as it requires scanning the full corpus of text documents to find matches.
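To make this concrete, here is a minimal Python sketch of naive Boolean search over a small, hypothetical in-memory document collection. It illustrates the idea rather than a production implementation – note that every document is scanned on every query, which is exactly the slowness mentioned above.

```python
# A minimal sketch of naive Boolean search over an in-memory collection.
# The document texts are hypothetical examples.
documents = {
    101: "Sony ANC Bluetooth headphones",
    102: "Panasonic in-ear headphones",
    103: "Bose ANC in-ear headphones",
    104: "Sony Bluetooth speaker",
}

def boolean_search(must=(), should=(), must_not=()):
    """Return IDs of documents containing all `must` terms (AND),
    at least one `should` term (OR), and none of the `must_not` terms (NOT)."""
    results = []
    for doc_id, text in documents.items():  # scans the full corpus every time
        tokens = set(text.lower().split())
        if any(term.lower() not in tokens for term in must):
            continue
        if should and not any(term.lower() in tokens for term in should):
            continue
        if any(term.lower() in tokens for term in must_not):
            continue
        results.append(doc_id)
    return results

# "headphones AND ANC NOT in-ear" -> [101]
print(boolean_search(must=["headphones", "ANC"], must_not=["in-ear"]))
```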
TF-IDF
An improvement over Boolean search is TF-IDF (term frequency–inverse document frequency), a statistical measure used to score and rank documents based on how relevant they are to the search query. TF-IDF measures how important a word or search term is within a document, relative to a collection of documents. It does this in three steps:
- Check how frequently a word appears in the document (term frequency, TF). The more often a term appears in a document, the more likely that document is to be relevant for the term, and thus the term is assigned a higher TF score in that document.
- Measure how unique/rare that word is across the whole collection of documents (inverse document frequency, IDF). Terms appearing frequently across many documents (such as “and,” “the,” etc.) get a much lower score than rarer terms.
- Get the final TF-IDF score for a given term in a document by multiplying its TF score by its IDF score. The score for a whole query is typically the sum of the scores of its individual terms.
This approach is often used in conjunction with other retrieval steps and provides a way to highlight terms that are most relevant to a specific query while ignoring ones too common to be meaningful.
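As an illustration, here is a minimal Python sketch of TF-IDF scoring over the same toy documents. Note that several variants of the TF and IDF formulas exist in practice; the smoothed IDF below is just one reasonable choice.

```python
import math

# Toy document collection (hypothetical texts).
documents = {
    101: "Sony ANC Bluetooth headphones",
    102: "Panasonic in-ear headphones",
    103: "Bose ANC in-ear headphones",
    104: "Sony Bluetooth speaker",
}
tokenized = {doc_id: text.lower().split() for doc_id, text in documents.items()}
n_docs = len(tokenized)

def tf_idf(term, doc_id):
    term = term.lower()
    tokens = tokenized[doc_id]
    tf = tokens.count(term) / len(tokens)                  # term frequency within the document
    df = sum(term in toks for toks in tokenized.values())  # number of documents containing the term
    idf = math.log(n_docs / (1 + df))                      # smoothed inverse document frequency
    return tf * idf

# Score a whole query by summing per-term scores, then rank the documents.
query = "ANC headphones".lower().split()
scores = {doc_id: sum(tf_idf(t, doc_id) for t in query) for doc_id in tokenized}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```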
A further improvement over the TF-IDF ranking algorithm is the famous BM25 ranking function, probably the most widely used ranking function in today’s retrieval applications. It builds upon TF-IDF with the following improvements (a minimal scoring sketch follows the list below):
- Term frequency saturation – Ensures that after a certain point, repeating the same term more frequently in the document doesn’t (disproportionately) increase its relevance.
- Document length normalization – Adjusts for document length by penalizing longer documents to avoid giving them an unfair advantage simply because they contain more terms.
- Tunable parameters – Allows further refinement of how term frequency and document length affect relevance by modifying its parameters.
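Here is a minimal Python sketch of the Okapi BM25 scoring formula over the same toy documents, with the tunable parameters k1 (term-frequency saturation) and b (document length normalization) exposed as arguments:

```python
import math

# Minimal Okapi BM25 sketch over the same toy documents (hypothetical texts).
documents = {
    101: "Sony ANC Bluetooth headphones",
    102: "Panasonic in-ear headphones",
    103: "Bose ANC in-ear headphones",
    104: "Sony Bluetooth speaker",
}
tokenized = {doc_id: text.lower().split() for doc_id, text in documents.items()}
N = len(tokenized)
avgdl = sum(len(toks) for toks in tokenized.values()) / N  # average document length

def bm25(query, doc_id, k1=1.5, b=0.75):
    tokens = tokenized[doc_id]
    score = 0.0
    for term in query.lower().split():
        tf = tokens.count(term)
        df = sum(term in toks for toks in tokenized.values())
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # BM25's smoothed IDF
        # Saturating TF component, normalized by document length relative to the average.
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avgdl))
    return score

ranked = sorted(tokenized, key=lambda doc_id: bm25("ANC headphones", doc_id), reverse=True)
print(ranked)  # document IDs ordered by BM25 relevance to the query
```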
Inverted indexing
Another important component in the traditional text retrieval process is inverted indexing. An inverted index is a data structure used to map keywords (possible search terms) to documents that contain them. This enables the retrieval engine to locate documents containing given search terms extremely fast.
The easiest way to understand inverted indexing is through a simplified example:
Let’s say we have the following database containing textual documents. Each document has its ID and content:
| ID  | CONTENT                       |
|-----|-------------------------------|
| 101 | Sony ANC Bluetooth headphones |
| 102 | Panasonic in-ear headphones   |
| 103 | Bose ANC in-ear headphones    |
| 104 | Sony Bluetooth speaker        |
The following is an inverted index for the given table:
| TOKEN      | DOCUMENT IDs  |
|------------|---------------|
| Sony       | 101, 104      |
| ANC        | 101, 103      |
| headphones | 101, 102, 103 |
| in-ear     | 102, 103      |
| Bose       | 103           |
| Panasonic  | 102           |
| Bluetooth  | 101, 104      |
| speaker    | 104           |
To build the inverted index, the content of each document is first split into tokens (for simplicity, here one token = one word), and each token is then mapped to the documents it appears in. This allows us to quickly fetch only the documents in which the tokens of the search query appear, instead of having to iterate over every relevant and irrelevant document looking for matches. For example, if our query is “ANC”, we can simply look up “ANC” in the index and fetch only the documents with IDs 101 and 103, instead of searching through all documents. The index can then be used in conjunction with ranking algorithms such as BM25 to calculate a relevance score for each candidate document before returning results to the user.
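A minimal Python sketch of building and querying such an inverted index for the table above might look like this (real search engines add proper tokenization, stemming, and on-disk index structures on top of this idea):

```python
from collections import defaultdict

# The toy document collection from the table above.
documents = {
    101: "Sony ANC Bluetooth headphones",
    102: "Panasonic in-ear headphones",
    103: "Bose ANC in-ear headphones",
    104: "Sony Bluetooth speaker",
}

# Build the inverted index: token -> set of document IDs containing it.
inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    for token in text.lower().split():  # for simplicity: token = one word
        inverted_index[token].add(doc_id)

def lookup(query):
    """Return IDs of documents containing every query token."""
    result = None
    for token in query.lower().split():
        ids = inverted_index.get(token, set())
        result = ids if result is None else result & ids
    return sorted(result or [])

print(lookup("ANC"))             # [101, 103]
print(lookup("Sony Bluetooth"))  # [101, 104]
```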
The cost of creating an index over a table in a database comes in terms of storage space – depending on the type of index used and the size of the table, an index can quickly grow in size. Still, this is usually an acceptable trade-off, and inverted indexes remain one of the most common data structures used to support quick and efficient information retrieval.
So far, we’ve looked at traditional search methods and their limitations. In the next part of this blog, we’ll dive into how we can leverage machine learning and AI to make finding information feel more natural and intuitive.