A deep dive into the performance characteristics of different search approaches—model sizes, cold starts, query latency, and cost. Use the interactive demo to measure real performance in your browser.

Part of the Intent & Action series.


Interactive Demo: Compare Answer Quality

See how different approaches interpret the same query. Keyword and Fuzzy return matching posts; Vector, LLM, and RAG understand meaning and generate answers.


Performance Comparison

The demo above focuses on answer quality. This section compares performance characteristics — startup time, query latency, and cost.

Syntactic (No ML)

Keyword (JS)

Readiness:   Instant
Query:       <1ms
10 Queries:  ~10ms total
Cost:        Free

Fuzzy (JS)

Readiness:   Instant
Query:       1-5ms
10 Queries:  ~50ms total
Cost:        Free

Vector Search (Semantic)

Vector (WASM)

Readiness:   3-10s cold / 1-2s warm
Query:       5-50ms (local)
10 Queries:  ~2s + 500ms = 2.5s
Cost:        Free

LLM (Full Reasoning)

LLM Local (WASM)

Readiness:   60-180s cold / 10-30s warm
Query:       0.5-2s (local, no network latency)
10 Queries:  ~20s + 10s = 30s
Cost:        Free forever

LLM Remote (API)

Readiness:   Instant (no model to load)
Query:       200-500ms (network RTT + inference)
10 Queries:  0s + 3.5s = 3.5s
Cost:        ~$0.17/1K queries

RAG (Retrieve + Generate)

RAG Local (WASM)

Readiness:   60-180s cold / 10-30s warm
Query:       0.5-2s (local, no network latency)
10 Queries:  ~20s + 10s = 30s
Cost:        Free forever

RAG Remote (WASM + API)

Readiness:   3-10s cold / 1-2s warm
Query:       200-550ms (local retrieval + network RTT)
10 Queries:  ~2s + 3.5s = 5.5s
Cost:        ~$0.17/1K queries

Readiness = Time until first query possible (cold = first visit, warm = cached)
Key insight: Local WASM pays an upfront cost but adds no network latency per query. Remote starts instantly but pays network RTT on every query.


Detailed Analysis

1. Syntactic Search (Keyword & Fuzzy)

Characteristics:

  • Zero cold start — No models to load
  • Sub-millisecond queries — Pure JavaScript string operations
  • Zero cost — Runs entirely client-side
  • Linear scaling — O(n) with document count

Keyword Search:

Cold Start:  0ms (nothing to load)
Query Time:  0.1-0.5ms (string.includes())
Memory:      ~0 (just the index)
Accuracy:    Low (exact matches only)

Fuzzy Search (Fuse.js):

Cold Start:  5-10ms (build index)
Query Time:  1-5ms (edit distance calculation)
Memory:      ~2x index size
Accuracy:    Medium (handles typos)

When to use:

  • Autocomplete suggestions
  • Known-item search (user knows exact title)
  • Fallback when models fail to load
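
Both modes fit in a few lines of plain JavaScript. This is a sketch, not the production index: a hand-rolled Levenshtein distance stands in for Fuse.js's more sophisticated weighted scoring, and `posts` is a hypothetical two-document corpus.

```javascript
// Hypothetical two-document index for illustration.
const posts = [
  { title: "Vector search in the browser" },
  { title: "Caching models in IndexedDB" },
];

// Keyword: exact substring match — the string.includes() path.
function keywordSearch(query, docs) {
  const q = query.toLowerCase();
  return docs.filter((d) => d.title.toLowerCase().includes(q));
}

// Levenshtein edit distance via dynamic programming. Fuse.js uses a
// more refined scoring model, but this captures the idea.
function editDistance(a, b) {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) =>
      i === 0 ? j : j === 0 ? i : 0
    )
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,        // deletion
        dp[i][j - 1] + 1,        // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// Fuzzy: accept any document with a word within maxDistance edits.
function fuzzySearch(query, docs, maxDistance = 2) {
  const q = query.toLowerCase();
  return docs.filter((d) =>
    d.title.toLowerCase().split(/\s+/).some((w) => editDistance(q, w) <= maxDistance)
  );
}
```

The typo "vectr" misses the keyword path entirely but still matches "vector" under fuzzy search, which is exactly the accuracy gap the profiles above describe.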

2. Vector Search

Model: all-MiniLM-L6-v2

  • 22M parameters
  • 384-dimension vectors
  • Trained on 1B+ sentence pairs

Performance Profile:

Download:    ~30MB (quantized ONNX)
Cold Start:  3-10s (download + WASM compile + warm-up)
Warm Start:  1-2s (load from IndexedDB + initialize)
Query Time:  5-50ms (encode + cosine similarity)
Memory:      ~100MB (model + vectors)
Accuracy:    High (semantic understanding)

Cold vs Warm Breakdown:

  • Cold (first visit): Download 30MB → Compile WASM → Initialize model → Warm-up inference
  • Warm (return visit): Load from IndexedDB → Initialize model → Ready

Breakdown of Query Time:

  1. Encode query (3-20ms) — Transform text to 384-dim vector
  2. Cosine similarity (1-5ms) — Compare against all document vectors
  3. Sort & rank (<1ms) — Return top results

Scaling Considerations:

  • Pre-computed document embeddings (stored in JSON)
  • Query-time embedding is the bottleneck
  • Linear with document count for similarity search
  • Can use approximate nearest neighbors (ANN) for 100K+ docs
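
The query path above can be sketched as follows, assuming document embeddings were pre-computed at build time and shipped as JSON. The encoder call that produces `queryVec` (the 3-20ms step) is left out; only the similarity scan and ranking are shown.

```javascript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// docs: [{ id, embedding }] with pre-computed 384-dim vectors;
// queryVec comes from encoding the user's query at search time.
// Linear scan — fine for hundreds of docs, ANN territory beyond ~100K.
function vectorSearch(queryVec, docs, topK = 5) {
  return docs
    .map((d) => ({ id: d.id, score: cosineSimilarity(queryVec, d.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```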

3. LLM (Local via WebLLM)

Model: Llama 3.2 1B Instruct

  • 1 billion parameters
  • 4-bit quantization (q4f16_1)
  • Runs entirely in browser via WebGPU

Performance Profile:

Download:    ~2GB (quantized weights)
Cold Start:  60-180s (download + compile shaders)
Warm Start:  10-30s (load from cache + compile shaders)
Query Time:  500-2000ms (inference)
Memory:      ~3-4GB (model + KV cache)
Accuracy:    Very High (full reasoning)

Cold vs Warm Breakdown:

Phase            Cold (First Visit)   Warm (Return Visit)
Download         30-90s (2GB)         0s (cached)
Shader Compile   20-60s               10-25s
Model Init       5-15s                5-10s
Total            60-180s              10-30s

Why Still Slow When Warm?

  • WebGPU shader compilation happens every page load
  • Shaders are GPU-specific, can't be fully cached
  • Model weights cached, but must be loaded into GPU memory

Token Economics:

  • Input: ~500-1000 tokens (15 posts × ~50 tokens each)
  • Output: ~50-100 tokens (answer + sources)
  • Speed: ~10-30 tokens/second on a good GPU
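
A back-of-envelope check of these numbers (ignoring prompt prefill, which adds time proportional to input length):

```javascript
// Generation time ≈ output tokens / decode speed. A deliberately
// crude estimate: prefill and first-token latency are ignored.
function estimateGenerationSeconds(outputTokens, tokensPerSecond) {
  return outputTokens / tokensPerSecond;
}

// Best case from the ranges above: 50 tokens at 30 tok/s ≈ 1.7s.
const best = estimateGenerationSeconds(50, 30);
// Worst case: 100 tokens at 10 tok/s = 10s — why slower GPUs feel sluggish.
const worst = estimateGenerationSeconds(100, 10);
```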

Caching Strategy:

  • Model weights cached in IndexedDB (~2GB)
  • Shader compilation still required each session
  • KV cache not persisted between queries

4. LLM (Remote via OpenAI)

Model: GPT-4o-mini

  • Hosted by OpenAI
  • No local resources needed
  • Pay-per-token pricing

Performance Profile:

Download:    0 (API call)
Cold Start:  0ms (already running)
Query Time:  200-500ms (network + inference)
Memory:      ~0 (server-side)
Cost:        ~$0.15/1M input, $0.60/1M output

Cost Breakdown per Query:

Input:  ~800 tokens × $0.15/1M = $0.00012
Output: ~80 tokens  × $0.60/1M ≈ $0.00005
Total:  ~$0.00017 per query (~$0.17 per 1000 queries)
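
The same arithmetic as a helper. The pricing constants are the GPT-4o-mini rates quoted above; actual prices can change, so treat them as inputs rather than facts.

```javascript
// $ per 1M tokens, per the rates in the performance profile above.
const PRICE_PER_M = { input: 0.15, output: 0.60 };

function costPerQuery(inputTokens, outputTokens) {
  return (
    (inputTokens / 1e6) * PRICE_PER_M.input +
    (outputTokens / 1e6) * PRICE_PER_M.output
  );
}

const perQuery = costPerQuery(800, 80); // ≈ $0.000168
const per1000 = perQuery * 1000;        // ≈ $0.17
```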

Latency Breakdown:

  1. Network RTT (50-150ms) — Request to OpenAI servers
  2. Queue time (0-100ms) — Variable based on load
  3. Inference (100-300ms) — Actual model computation
  4. Streaming — first token arrives well before the full response completes

5. RAG (Retrieval-Augmented Generation)

RAG combines vector retrieval with LLM generation:

RAG Local (MiniLM + Llama):

Cold Start:  60-180s (download both models)
Warm Start:  10-30s (LLM shader compile dominates)
Query Time:
  - Retrieval: 5-50ms (vector search)
  - Generation: 500-2000ms (LLM)
  - Total: 500-2050ms
Memory:      ~3-4GB
Cost:        Free

RAG Remote (MiniLM + GPT-4o-mini):

Cold Start:  3-10s (vector model only)
Warm Start:  1-2s (vector model from cache)
Query Time:
  - Retrieval: 5-50ms (local vector search)
  - Generation: 200-500ms (API)
  - Total: 200-550ms
Memory:      ~100MB
Cost:        ~$0.0001/query

Why RAG Remote is the Sweet Spot:

  • Vector model is small (~30MB) and caches well
  • Only sends relevant documents to LLM (3-5 instead of 15+)
  • Reduces input tokens by 60-80%
  • 1-2s warm start vs 10-30s for full local
  • Best accuracy with reasonable latency
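
A minimal sketch of the retrieve-then-generate flow. `encode` and `generate` are placeholder hooks: in the real setup, `encode` would call the local MiniLM model and `generate` would POST the prompt to the remote chat API.

```javascript
// Pack only the top-k retrieved documents into the prompt —
// this is where RAG cuts input tokens by 60-80%.
function buildPrompt(query, retrievedDocs) {
  const context = retrievedDocs
    .map((d, i) => `[${i + 1}] ${d.text}`)
    .join("\n");
  return `Answer using only the sources below.\n\n${context}\n\nQuestion: ${query}`;
}

function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]; na += a[i] ** 2; nb += b[i] ** 2;
  }
  return dot / Math.sqrt(na * nb);
}

// encode: text -> embedding (local, 5-50ms in the profile above).
// generate: prompt -> answer (remote API, 200-500ms).
async function ragQuery(query, docs, { encode, generate, topK = 3 }) {
  const queryVec = await encode(query);
  const top = docs
    .map((d) => ({ ...d, score: cosine(queryVec, d.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
  return generate(buildPrompt(query, top));
}
```

Swapping the `generate` hook between a local WebLLM call and a remote API call is all that separates RAG Local from RAG Remote in this design.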

Performance Trade-offs Matrix

Approach     Ready (cold / warm)     Query       Accuracy      Network Latency    Cost
Keyword      ⚡ Instant              <1ms        🔴 Low        None               Free
Fuzzy        ⚡ Instant              1-5ms       🟡 Medium     None               Free
Vector       🟡 3-10s / 1-2s         5-50ms      🟢 High       None               Free
LLM Local    🔴 60-180s / 10-30s     0.5-2s      🟢 V. High    None ✅            Free
LLM Remote   ⚡ Instant              200-500ms   🟢 V. High    Every query ⚠️     ~$0.17/1K
RAG Local    🔴 60-180s / 10-30s     0.5-2s      🟢 V. High    None ✅            Free
RAG Remote   🟡 3-10s / 1-2s         200-550ms   🟢 V. High    Every query ⚠️     ~$0.17/1K

Ready = cold / warm (first visit / return visit)
Latency = Network round-trip on every query (remote) vs local processing (WASM)


The Crossover Point: Local vs Remote

The key insight: Local WASM has upfront cost, Remote has per-query cost.

Time to complete N queries:

Local:  ReadinessTime + (N × LocalQueryTime)
Remote: 0 + (N × RemoteQueryTime), where RemoteQueryTime = network RTT + inference

LLM Example (warm start):

  • Local: 20s + (N × 1s)
  • Remote: 0s + (N × 0.35s)

Queries   Local Total   Remote Total   Winner
1         21s           0.35s          Remote
10        30s           3.5s           Remote
30        50s           10.5s          Remote
60        80s           21s            Remote
100       120s          35s            Remote
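
The totals above follow directly from the two formulas, with the warm-start example's numbers (20s readiness and 1s per local query versus 0.35s per remote query) as defaults:

```javascript
// Total wall-clock time to answer n queries under each strategy.
const localTotal = (n, readiness = 20, perQuery = 1) => readiness + n * perQuery;
const remoteTotal = (n, perQuery = 0.35) => n * perQuery;

// A crossover in local's favor only exists when remote is slower
// per query: n* = readiness / (remotePerQuery - localPerQuery).
// Here remote is faster per query too, so remote wins at every n.
```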

Even at 100 queries, remote wins on total time. But consider:

  • Privacy: Local = data never leaves browser
  • Offline: Local = works without internet
  • Cost: Local = free forever, Remote = ~$0.02 for 100 queries
  • Consistency: Local = same latency always, Remote = varies with network

Bottom line: Choose remote for speed, local for privacy/offline/cost.


Recommendations by Use Case

High-Traffic Public Site

Recommendation: Keyword + Fuzzy + Vector Search

  • No API costs at scale
  • Fast for 99% of queries
  • Vector search handles semantic queries
  • ~30MB one-time download

Internal Tool / Low Traffic

Recommendation: RAG Remote

  • Best accuracy
  • Fast response (~300ms)
  • Cost negligible at low volume
  • No heavy client downloads

Privacy-Critical Application

Recommendation: RAG Local

  • All processing client-side
  • No data leaves browser
  • Accept cold start trade-off
  • Cache model in IndexedDB

Mobile / Low-Bandwidth

Recommendation: Keyword + Fuzzy only

  • Zero download required
  • Instant responses
  • Works offline
  • Consider server-side embeddings API

Measuring Your Own Performance

The demo above shows real metrics from your browser:

  • Cold Start — Time to load and initialize model
  • Query Time — Time to process search after model ready
  • Model/Size — What's running and memory footprint

Tips for accurate measurement:

  1. Clear cache to measure true cold start
  2. Run multiple queries to see warm performance
  3. Try different devices — Mobile vs Desktop vs GPU
  4. Monitor memory in browser DevTools
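
The repeated-query tip can be automated with a small harness built on `performance.now()`, which is available in both browsers and recent Node. The warm-up call lets JIT compilation and caches settle before measuring.

```javascript
// Run fn several times and report median/min/max in milliseconds.
// Median is more robust than mean against GC pauses and tab throttling.
function timeIt(fn, runs = 10) {
  fn(); // warm-up pass: excluded from measurement
  const times = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    fn();
    times.push(performance.now() - start);
  }
  times.sort((a, b) => a - b);
  return {
    median: times[Math.floor(runs / 2)],
    min: times[0],
    max: times[runs - 1],
  };
}
```

Usage: `timeIt(() => keywordSearch("vector", posts))` measures warm query time; cold start has to be measured once per fresh page load instead.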

Key Takeaways

  1. Cold start dominates — local LLM takes 60-180s on first load
  2. Vector search is the sweet spot — 30MB for semantic understanding
  3. RAG Remote is fastest for quality — ~300ms with best accuracy
  4. Syntactic is free — Always have keyword/fuzzy as fallback
  5. Cache aggressively — IndexedDB for models, localStorage for settings

The best search isn't the most sophisticated—it's the one that fits your constraints.

