Search: From 30ms to 3 Seconds
A deep dive into the performance characteristics of different search approaches—model sizes, cold starts, query latency, and cost. Use the interactive demo to measure real performance in your browser.
The Intent & Action series
- Search is Dead. Long Live Action. — From retrieval to outcomes
- Understanding Embeddings — How vectors capture meaning
- Intent Approaches — Seven ways to understand queries
- Building Hybrid Intent — Keyword + semantic in practice
- Search Performance (this post) — What 30ms vs 3s actually costs
Interactive Demo: Compare Answer Quality
See how different approaches interpret the same query. Keyword and Fuzzy return matching posts; Vector, LLM, and RAG understand meaning and generate answers.
Performance Comparison
The demo above focuses on answer quality. This section compares performance characteristics — startup time, query latency, and cost.
| Approach | Readiness | Query | 10 Queries (warm) | Cost |
|---|---|---|---|---|
| Keyword (JS) | Instant | <1ms | ~10ms total | Free |
| Fuzzy (JS) | Instant | 1-5ms | ~50ms total | Free |
| Vector (WASM) | 3-10s cold / 1-2s warm | 5-50ms (local) | ~2s + 500ms = 2.5s | Free |
| LLM Local (WASM) | 60-180s cold / 10-30s warm | 0.5-2s (local, no network) | ~20s + 10s = 30s | Free forever |
| LLM Remote (API) | Instant (no model) | 200-500ms (network + inference) | 0s + 3.5s = 3.5s | ~$0.17/1K queries |
| RAG Local (WASM) | 60-180s cold / 10-30s warm | 0.5-2s (local, no network) | ~20s + 10s = 30s | Free forever |
| RAG Remote (WASM+API) | 3-10s cold / 1-2s warm | 200-550ms (network + inference) | ~2s + 3.5s = 5.5s | ~$0.17/1K queries |
Readiness = Time until first query possible (cold = first visit, warm = cached)
Key insight: Local WASM pays an upfront load cost but adds no network latency per query. Remote starts instantly but pays a network round trip on every query.
Detailed Analysis
1. Syntactic Search (Keyword & Fuzzy)
Characteristics:
- Zero cold start — No models to load
- Sub-millisecond queries — Pure JavaScript string operations
- Zero cost — Runs entirely client-side
- Linear scaling — O(n) with document count
Keyword Search:
Cold Start: 0ms (nothing to load)
Query Time: 0.1-0.5ms (string.includes())
Memory: ~0 (just the index)
Accuracy: Low (exact matches only)
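At this tier, "search" is little more than a case-insensitive `string.includes()` over an in-memory index. A minimal sketch (the posts and field names are illustrative):

```javascript
// Minimal keyword search: case-insensitive substring match over an
// in-memory index of posts. An O(n) scan of string.includes() checks
// stays sub-millisecond for a few hundred documents.
const posts = [
  { title: "Understanding Embeddings", body: "How vectors capture meaning" },
  { title: "Building Hybrid Intent", body: "Keyword plus semantic search" },
];

function keywordSearch(query, docs) {
  const q = query.toLowerCase();
  return docs.filter(
    (d) => d.title.toLowerCase().includes(q) || d.body.toLowerCase().includes(q)
  );
}
```

`keywordSearch("embed", posts)` finds the first post; a typo like `"embeddngs"` finds nothing, which is exactly the gap fuzzy search fills.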
Fuzzy Search (Fuse.js):
Cold Start: 5-10ms (build index)
Query Time: 1-5ms (edit distance calculation)
Memory: ~2x index size
Accuracy: Medium (handles typos)
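Fuzzy matchers like Fuse.js score candidates by approximate string distance (Fuse.js itself uses a Bitap-style matcher). A classic Levenshtein implementation shows what "handles typos" means in practice:

```javascript
// Levenshtein edit distance: the number of single-character insertions,
// deletions, and substitutions needed to turn one string into another.
// A fuzzy matcher accepts a candidate when its distance (or a normalized
// score derived from it) falls within a threshold.
function levenshtein(a, b) {
  const dp = Array.from({ length: a.length + 1 }, (_, i) => [i]);
  for (let j = 1; j <= b.length; j++) dp[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
  }
  return dp[a.length][b.length];
}
```

The dynamic-programming table is why fuzzy queries cost milliseconds rather than microseconds: each comparison is O(len(a) × len(b)) instead of a single substring scan.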
When to use:
- Autocomplete suggestions
- Known-item search (user knows exact title)
- Fallback when models fail to load
2. Vector Search
Model: all-MiniLM-L6-v2
- 22M parameters
- 384-dimension vectors
- Trained on 1B+ sentence pairs
Performance Profile:
Download: ~30MB (quantized ONNX)
Cold Start: 3-10s (download + WASM compile + warm-up)
Warm Start: 1-2s (load from IndexedDB + initialize)
Query Time: 5-50ms (encode + cosine similarity)
Memory: ~100MB (model + vectors)
Accuracy: High (semantic understanding)
Cold vs Warm Breakdown:
- Cold (first visit): Download 30MB → Compile WASM → Initialize model → Warm-up inference
- Warm (return visit): Load from IndexedDB → Initialize model → Ready
Breakdown of Query Time:
- Encode query (3-20ms) — Transform text to 384-dim vector
- Cosine similarity (1-5ms) — Compare against all document vectors
- Sort & rank (<1ms) — Return top results
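Steps 2 and 3 above are only a few lines of JavaScript. A sketch with toy 3-dimensional vectors standing in for the 384-dimension MiniLM embeddings:

```javascript
// Cosine similarity between two vectors: dot product over the product
// of magnitudes. Real embeddings are 384-dim; 3-dim keeps the math visible.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Rank documents by similarity to the query vector and return the top k.
function topK(queryVec, docs, k = 3) {
  return docs
    .map((d) => ({ ...d, score: cosine(queryVec, d.vec) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

This is the O(n) brute-force pass; once the corpus reaches the 100K+ range mentioned under scaling, an ANN index replaces the linear scan.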
Scaling Considerations:
- Pre-computed document embeddings (stored in JSON)
- Query-time embedding is the bottleneck
- Linear with document count for similarity search
- Can use approximate nearest neighbors (ANN) for 100K+ docs
3. LLM (Local via WebLLM)
Model: Llama 3.2 1B Instruct
- 1 billion parameters
- 4-bit quantization (q4f16_1)
- Runs entirely in browser via WebGPU
Performance Profile:
Download: ~2GB (quantized weights)
Cold Start: 60-180s (download + compile shaders)
Warm Start: 10-30s (load from cache + compile shaders)
Query Time: 500-2000ms (inference)
Memory: ~3-4GB (model + KV cache)
Accuracy: Very High (full reasoning)
Cold vs Warm Breakdown:
| Phase | Cold (First Visit) | Warm (Return Visit) |
|---|---|---|
| Download | 30-90s (2GB) | 0s (cached) |
| Shader Compile | 20-60s | 10-25s |
| Model Init | 5-15s | 5-10s |
| Total | 60-180s | 10-30s |
Why Still Slow When Warm?
- WebGPU shader compilation happens every page load
- Shaders are GPU-specific, can't be fully cached
- Model weights cached, but must be loaded into GPU memory
Token Economics:
- Input: ~500-1000 tokens (15 posts × ~50 tokens each)
- Output: ~50-100 tokens (answer + sources)
- Speed: ~10-30 tokens/second on good GPU
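Those numbers translate directly into perceived generation time: decode speed divides output length. A one-line sketch:

```javascript
// Approximate time to generate an answer: output tokens / decode speed.
const genTimeSeconds = (outputTokens, tokensPerSecond) =>
  outputTokens / tokensPerSecond;
```

For example, a 60-token answer at 30 tokens/second streams out in about 2 seconds, which is why GPU decode speed dominates local LLM query time.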
Caching Strategy:
- Model weights cached in IndexedDB (~2GB)
- Shader compilation still required each session
- KV cache not persisted between queries
4. LLM (Remote via OpenAI)
Model: GPT-4o-mini
- Hosted by OpenAI
- No local resources needed
- Pay-per-token pricing
Performance Profile:
Download: 0 (API call)
Cold Start: 0ms (already running)
Query Time: 200-500ms (network + inference)
Memory: ~0 (server-side)
Cost: ~$0.15/1M input, $0.60/1M output
Cost Breakdown per Query:
Input: ~800 tokens × $0.15/1M = $0.00012
Output: ~80 tokens × $0.60/1M = $0.00005
Total: ~$0.00017 per query (~$0.17 per 1,000 queries)
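That arithmetic as a sketch, using the per-million-token rates quoted above (prices change; treat them as illustrative):

```javascript
// Per-query cost at the quoted GPT-4o-mini rates:
// $0.15 per 1M input tokens, $0.60 per 1M output tokens.
const RATES = { inputPerM: 0.15, outputPerM: 0.60 };

function queryCost(inputTokens, outputTokens, rates = RATES) {
  return (inputTokens / 1e6) * rates.inputPerM +
         (outputTokens / 1e6) * rates.outputPerM;
}
```

`queryCost(800, 80)` comes out to about $0.000168 per query, i.e. roughly $0.17 per 1,000 queries.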
Latency Breakdown:
- Network RTT (50-150ms) — Request to OpenAI servers
- Queue time (0-100ms) — Variable based on load
- Inference (100-300ms) — Actual model computation
- Streaming — First token faster than full response
5. RAG (Retrieval-Augmented Generation)
RAG combines vector retrieval with LLM generation:
RAG Local (MiniLM + Llama):
Cold Start: 60-180s (download both models)
Warm Start: 10-30s (LLM shader compile dominates)
Query Time:
- Retrieval: 5-50ms (vector search)
- Generation: 500-2000ms (LLM)
- Total: 500-2050ms
Memory: ~3-4GB
Cost: Free
RAG Remote (MiniLM + GPT-4o-mini):
Cold Start: 3-10s (vector model only)
Warm Start: 1-2s (vector model from cache)
Query Time:
- Retrieval: 5-50ms (local vector search)
- Generation: 200-500ms (API)
- Total: 200-550ms
Memory: ~100MB
Cost: ~$0.0001/query
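The pipeline behind those numbers can be sketched end to end: local retrieval narrows the corpus to a few posts, and only those go to the LLM. Here `embedQuery()` and `callLLM()` are illustrative stand-ins for the MiniLM encoder and the chat-completions request:

```javascript
// Cosine similarity, as in the vector-search section.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// RAG skeleton: retrieve the top-k posts locally (5-50ms), then send
// only those k posts to the remote LLM instead of the whole corpus.
async function ragQuery(question, docs, { embedQuery, callLLM, k = 3 }) {
  const qVec = await embedQuery(question);
  const hits = docs
    .map((d) => ({ ...d, score: cosine(qVec, d.vec) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
  const context = hits.map((h) => `- ${h.title}: ${h.body}`).join("\n");
  const prompt = `Answer using only these posts:\n${context}\n\nQ: ${question}`;
  return { answer: await callLLM(prompt), sources: hits.map((h) => h.title) };
}
```

The prompt carries only k documents, which is where the 60-80% input-token reduction comes from.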
Why RAG Remote is the Sweet Spot:
- Vector model is small (~30MB) and caches well
- Only sends relevant documents to LLM (3-5 instead of 15+)
- Reduces input tokens by 60-80%
- 1-2s warm start vs 10-30s for full local
- Best accuracy with reasonable latency
Performance Trade-offs Matrix

| Approach | Ready (cold / warm) | Query | Accuracy | Network Latency | Cost |
|---|---|---|---|---|---|
| Keyword | ⚡ Instant | ⚡ <1ms | 🔴 Low | None | Free |
| Fuzzy | ⚡ Instant | ⚡ 1-5ms | 🟡 Med | None | Free |
| Vector | 🟡 3-10s / 1-2s | 🟢 5-50ms | 🟢 High | None | Free |
| LLM Local | 🔴 60-180s / 10-30s | 🟡 0.5-2s | 🟢 V.High | None ✅ | Free |
| LLM Remote | ⚡ Instant | 🟡 200-500ms | 🟢 V.High | Every query ⚠️ | ~$0.17/1K |
| RAG Local | 🔴 60-180s / 10-30s | 🟡 0.5-2s | 🟢 V.High | None ✅ | Free |
| RAG Remote | 🟡 3-10s / 1-2s | 🟡 200-550ms | 🟢 V.High | Every query ⚠️ | ~$0.17/1K |

Ready = cold / warm (first visit / return visit)
Network Latency = round trip on every query (remote) vs local processing (WASM)
The Crossover Point: Local vs Remote
The key insight: Local WASM has upfront cost, Remote has per-query cost.
Time to complete N queries:
Local: ReadinessTime + (N × LocalQueryTime)
Remote: 0 + (N × NetworkLatency)
LLM Example (warm start):
- Local: 20s + (N × 1s)
- Remote: 0s + (N × 0.35s)
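Those two formulas, with the warm-start numbers above plugged in, as a sketch:

```javascript
// Total time (in seconds) to complete n queries on each deployment.
const localTotal = (n, readinessSec, perQuerySec) => readinessSec + n * perQuerySec;
const remoteTotal = (n, rttPerQuerySec) => n * rttPerQuerySec;

// Warm-start LLM example: 20s readiness, 1s/query local, 0.35s/query remote.
const local = (n) => localTotal(n, 20, 1);
const remote = (n) => remoteTotal(n, 0.35);
```

Note that with these LLM numbers, remote's per-query time (0.35s) is also below local's (1s), so local never catches up. A true crossover only exists when the local query is faster than the network round trip: warm vector search (2s readiness + ~30ms/query) overtakes a 350ms round trip after roughly six queries.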
| Queries | Local Total | Remote Total | Winner |
|---|---|---|---|
| 1 | 21s | 0.35s | Remote |
| 10 | 30s | 3.5s | Remote |
| 30 | 50s | 10.5s | Remote |
| 60 | 80s | 21s | Remote |
| 100 | 120s | 35s | Remote |
Even at 100 queries, remote wins on total time: local's per-query latency (1s) is itself higher than remote's (0.35s), so in this example local never catches up. But consider:
- Privacy: Local = data never leaves browser
- Offline: Local = works without internet
- Cost: Local = free forever, Remote = ~$0.02 for 100 queries
- Consistency: Local = same latency always, Remote = varies with network
Bottom line: Choose remote for speed, local for privacy/offline/cost.
Recommendations by Use Case
High-Traffic Public Site
Recommendation: Keyword + Fuzzy + Vector Search
- No API costs at scale
- Fast for 99% of queries
- Vector search handles semantic queries
- ~30MB one-time download
Internal Tool / Low Traffic
Recommendation: RAG Remote
- Best accuracy
- Fast response (~300ms)
- Cost negligible at low volume
- No heavy client downloads
Privacy-Critical Application
Recommendation: RAG Local
- All processing client-side
- No data leaves browser
- Accept cold start trade-off
- Cache model in IndexedDB
Mobile / Low-Bandwidth
Recommendation: Keyword + Fuzzy only
- Zero download required
- Instant responses
- Works offline
- Consider server-side embeddings API
Measuring Your Own Performance
The demo above shows real metrics from your browser:
- Cold Start — Time to load and initialize model
- Query Time — Time to process search after model ready
- Model/Size — What's running and memory footprint
Tips for accurate measurement:
- Clear cache to measure true cold start
- Run multiple queries to see warm performance
- Try different devices — Mobile vs Desktop vs GPU
- Monitor memory in browser DevTools
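A minimal timing harness for those measurements, using `performance.now()` (available in browsers and in modern Node without imports):

```javascript
// Wrap an async search function and collect wall-clock timings per call.
// The first call reflects cold start + query; later calls show warm queries.
function instrument(searchFn) {
  const timings = [];
  return async function timed(query) {
    const t0 = performance.now();
    const result = await searchFn(query);
    timings.push(performance.now() - t0);
    return { result, timings };
  };
}
```

Wrap a search entry point once, run a few queries, and compare `timings[0]` (cold) with the rest (warm).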
Key Takeaways
- Cold start dominates — LLM local takes 60-180s on first load
- Vector search is the sweet spot — 30MB for semantic understanding
- RAG Remote is fastest for quality — ~300ms with best accuracy
- Syntactic is free — Always have keyword/fuzzy as fallback
- Cache aggressively — IndexedDB for models, localStorage for settings
The best search isn't the most sophisticated—it's the one that fits your constraints.
Series Navigation
- Search is Dead. Long Live Action. — From retrieval to outcomes
- Understanding Embeddings — How vectors capture meaning
- Intent Approaches — Seven ways to understand queries
- Building Hybrid Intent — Keyword + semantic in practice
- Search Performance (this post) — What 30ms vs 3s actually costs