Simple Text Search Beats Embeddings for Small Projects

What It Is

Traditional text search algorithms like BM25 and TF-IDF can outperform modern embedding-based approaches for smaller document collections. These statistical methods rank documents by analyzing term frequency and distribution patterns without requiring neural networks or vector representations.

BM25 (Best Matching 25) calculates relevance scores based on how often query terms appear in documents, adjusted for document length and term rarity across the collection. TF-IDF (Term Frequency-Inverse Document Frequency) works similarly, weighing terms by their frequency in a document against their commonness in the entire corpus.
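The scoring described above fits in a few lines of Python. This is a simplified illustration of the Okapi BM25 formula (using the common k1=1.2, b=0.75 defaults), not Elasticsearch's exact implementation:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each document against the query with Okapi BM25.

    docs is a list of token lists; returns one score per document.
    """
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: how many docs contain each query term.
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue
            # Rare terms across the collection get a higher IDF weight.
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            # Term frequency, saturated by k1 and length-normalized via b.
            norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc) / avg_len))
            score += idf * norm
        scores.append(score)
    return scores

docs = [["machine", "learning", "basics"],
        ["cooking", "pasta"],
        ["learning", "to", "cook"]]
scores = bm25_scores(["machine", "learning"], docs)
# The first document matches both terms and ranks highest.
```

Note how a document matching both query terms outranks one matching a single term, and a document with no matches scores zero; no model or index beyond term counts is required.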

Search platforms like Elasticsearch and OpenSearch ship with these algorithms built-in, requiring minimal configuration. For teams wanting slightly better semantic understanding without full embedding infrastructure, compact BERT models around 100MB can run on standard CPUs, bridging the gap between pure keyword matching and heavyweight vector search.

Why It Matters

The AI industry has pushed vector embeddings as the default solution for search and retrieval, often overlooking simpler alternatives that work perfectly well for common scenarios. This creates unnecessary complexity and cost for teams building applications with modest document collections.

For startups and small teams working with under 10,000 documents, traditional search avoids GPU infrastructure, embedding model hosting, and vector database management entirely. The operational overhead drops dramatically: no model versioning, no embedding regeneration pipelines, no specialized hardware requirements.

Development velocity improves when teams can prototype search functionality using tools they already have. Most projects already run Elasticsearch or similar platforms for logging and analytics. Activating BM25 scoring takes minutes rather than weeks of infrastructure planning.

The performance difference matters less than expected for diverse document sets. When documents cover different topics or domains, keyword matching captures enough signal to return relevant results. Semantic similarity becomes critical mainly when documents discuss nearly identical subjects using varied terminology.

Getting Started

Elasticsearch enables BM25 by default. A basic search query looks like this:

{
  "query": {
    "match": {
      "content": "machine learning algorithms"
    }
  }
}

This automatically applies BM25 scoring without additional configuration. For more control over ranking factors:

{
  "query": {
    "match": {
      "content": {
        "query": "machine learning algorithms",
        "operator": "or",
        "fuzziness": "AUTO"
      }
    }
  }
}
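From application code, these bodies are plain dictionaries. A minimal sketch using the official elasticsearch-py client; the index name `docs` and the local cluster URL are assumptions:

```python
# Query bodies mirror the JSON above.
simple_query = {"match": {"content": "machine learning algorithms"}}

tuned_query = {
    "match": {
        "content": {
            "query": "machine learning algorithms",
            "operator": "or",     # any query term may match
            "fuzziness": "AUTO",  # tolerate small typos, scaled to term length
        }
    }
}

# Sending the query requires a running cluster, e.g.:
# from elasticsearch import Elasticsearch
# es = Elasticsearch("http://localhost:9200")
# hits = es.search(index="docs", query=tuned_query)["hits"]["hits"]
```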

Teams wanting semantic capabilities without heavy infrastructure can add Elasticsearch’s inference processors with small models. The eland Python library uploads compact models directly:

eland_import_hub_model --url https://localhost:9200 \
 --hub-model-id sentence-transformers/all-MiniLM-L6-v2 \
 --task-type text_embedding

This 80MB model runs inference during indexing or query time on CPU, providing semantic search without separate vector infrastructure.
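Once deployed, the model can embed query text server-side at search time. A sketch of a kNN query body, shown as a Python dictionary: the `content_vector` field name and index mapping are assumptions, and the model ID reflects how eland rewrites the hub ID (slashes become double underscores, lowercased):

```python
# Sketch of a kNN query that embeds the query text server-side.
# "content_vector" is a hypothetical dense_vector field; k and
# num_candidates trade recall against latency.
knn_query = {
    "knn": {
        "field": "content_vector",
        "query_vector_builder": {
            "text_embedding": {
                "model_id": "sentence-transformers__all-minilm-l6-v2",
                "model_text": "machine learning algorithms",
            }
        },
        "k": 10,
        "num_candidates": 100,
    }
}
```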

Context

The choice between traditional search and embeddings depends on document characteristics and query patterns. Collections with clear topical separation favor keyword approaches - searching technical documentation, blog archives, or product catalogs typically works well with BM25.

Embeddings become valuable when documents use synonyms extensively, when queries need conceptual matching beyond exact terms, or when the collection exceeds tens of thousands of documents with overlapping content. Customer support tickets, research papers in narrow fields, and legal documents often benefit from semantic understanding.

Hybrid approaches combine both methods. Elasticsearch supports rescoring BM25 results with semantic models, applying expensive neural ranking only to top candidates. This balances accuracy with computational cost.
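The two-stage idea is simple to express: a cheap ranker like BM25 produces candidates, and the expensive scorer reorders only the head of the list. A minimal sketch with a stand-in semantic scorer (in practice the second stage would call a model):

```python
def hybrid_rank(candidates, semantic_score, rescore_top=5):
    """Rerank the head of a cheap ranking with an expensive scorer.

    candidates: (doc_id, bm25_score) pairs, best first.
    semantic_score: callable doc_id -> float; stands in for a neural model.
    Only the first rescore_top candidates pay the expensive call.
    """
    head = candidates[:rescore_top]
    tail = candidates[rescore_top:]
    rescored = sorted(head, key=lambda pair: semantic_score(pair[0]), reverse=True)
    return rescored + tail

bm25_hits = [("a", 9.1), ("b", 8.7), ("c", 4.2)]
# Hypothetical semantic scores: "b" is the best conceptual match.
reranked = hybrid_rank(bm25_hits, {"a": 0.2, "b": 0.9, "c": 0.5}.get, rescore_top=2)
# "b" moves ahead of "a"; "c" never pays the expensive scoring call.
```

Capping the rescore depth is the design choice that keeps cost bounded: neural scoring runs on a handful of documents per query regardless of collection size.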

The infrastructure requirements differ substantially. BM25 needs only inverted indexes that most search platforms maintain anyway. Vector search requires dimension-specific indexes, embedding generation pipelines, and often GPU acceleration for acceptable latency at scale.

For teams building MVPs or internal tools, starting with traditional search provides immediate functionality. Migration to embeddings remains straightforward if usage patterns later demand semantic capabilities. The reverse path - simplifying from vectors to keywords - rarely happens despite often being the better choice.