A deep dive into the performance characteristics of different search approaches—model sizes, cold starts, query latency, and cost. Use the interactive demo to measure real performance in your browser.

Part of the Intent & Action series.


Interactive Demo: Compare Answer Quality

See how different approaches interpret the same query. Keyword and Fuzzy return matching posts; Vector, LLM, and RAG understand meaning and generate answers.


Performance Comparison

The demo above focuses on answer quality. This section compares performance characteristics — startup time, query latency, and cost.

Syntactic (No ML)

Keyword (JS)

Readiness:   Instant
Query:       <1ms
10 Queries:  ~10ms total
Cost:        Free

Fuzzy (JS)

Readiness:   Instant
Query:       1-5ms
10 Queries:  ~50ms total
Cost:        Free

Vector Search (Semantic)

Vector (WASM)

Readiness:   3-10s cold / 1-2s warm
Query:       5-50ms (local)
10 Queries:  ~2s + 500ms = 2.5s
Cost:        Free

LLM (Full Reasoning)

LLM Local (WASM)

Readiness:   60-180s cold / 10-30s warm
Query:       0.5-2s (local, no network latency)
10 Queries:  ~20s + 10s = 30s
Cost:        Free forever

LLM Remote (API)

Readiness:   Instant (no model to load)
Query:       200-500ms (network RTT + inference)
10 Queries:  0s + 3.5s = 3.5s
Cost:        ~$0.17/1K queries

RAG (Retrieve + Generate)

RAG Local (WASM)

Readiness:   60-180s cold / 10-30s warm
Query:       0.5-2s (local, no network latency)
10 Queries:  ~20s + 10s = 30s
Cost:        Free forever

RAG Remote (WASM + API)

Readiness:   3-10s cold / 1-2s warm
Query:       200-550ms (local retrieval + network RTT)
10 Queries:  ~2s + 3.5s = 5.5s
Cost:        ~$0.17/1K queries

Readiness = Time until first query possible (cold = first visit, warm = cached)
Key insight: Local WASM pays an upfront cost but adds no network latency per query. Remote starts instantly but pays network RTT on every query.


Detailed Analysis

1. Syntactic Search (Keyword & Fuzzy)

Characteristics:

  • Zero cold start — No models to load
  • Sub-millisecond queries — Pure JavaScript string operations
  • Zero cost — Runs entirely client-side
  • Linear scaling — O(n) with document count

Keyword Search:

Cold Start:  0ms (nothing to load)
Query Time:  0.1-0.5ms (string.includes())
Memory:      ~0 (just the index)
Accuracy:    Low (exact matches only)

Fuzzy Search (Fuse.js):

Cold Start:  5-10ms (build index)
Query Time:  1-5ms (edit distance calculation)
Memory:      ~2x index size
Accuracy:    Medium (handles typos)

When to use:

  • Autocomplete suggestions
  • Known-item search (user knows exact title)
  • Fallback when models fail to load
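
Both modes fit in a few lines of plain JavaScript. This is a sketch, not the production index: a hand-rolled Levenshtein distance stands in for Fuse.js's more sophisticated weighted scoring, and `posts` is a hypothetical two-document corpus.

```javascript
// Hypothetical two-document index for illustration.
const posts = [
  { title: "Vector search in the browser" },
  { title: "Caching models in IndexedDB" },
];

// Keyword: exact substring match — the string.includes() path.
function keywordSearch(query, docs) {
  const q = query.toLowerCase();
  return docs.filter((d) => d.title.toLowerCase().includes(q));
}

// Levenshtein edit distance via dynamic programming. Fuse.js uses a
// more refined scoring model, but this captures the idea.
function editDistance(a, b) {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) =>
      i === 0 ? j : j === 0 ? i : 0
    )
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,        // deletion
        dp[i][j - 1] + 1,        // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// Fuzzy: accept any document with a word within maxDistance edits.
function fuzzySearch(query, docs, maxDistance = 2) {
  const q = query.toLowerCase();
  return docs.filter((d) =>
    d.title.toLowerCase().split(/\s+/).some((w) => editDistance(q, w) <= maxDistance)
  );
}
```

The typo "vectr" misses the keyword path entirely but still matches "vector" under fuzzy search, which is exactly the accuracy gap the profiles above describe.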

2. Vector Search

Model: all-MiniLM-L6-v2

  • 22M parameters
  • 384-dimension vectors
  • Trained on 1B+ sentence pairs

Performance Profile:

Download:    ~30MB (quantized ONNX)
Cold Start:  3-10s (download + WASM compile + warm-up)
Warm Start:  1-2s (load from IndexedDB + initialize)
Query Time:  5-50ms (encode + cosine similarity)
Memory:      ~100MB (model + vectors)
Accuracy:    High (semantic understanding)

Cold vs Warm Breakdown:

  • Cold (first visit): Download 30MB → Compile WASM → Initialize model → Warm-up inference
  • Warm (return visit): Load from IndexedDB → Initialize model → Ready

Breakdown of Query Time:

  1. Encode query (3-20ms) — Transform text to 384-dim vector
  2. Cosine similarity (1-5ms) — Compare against all document vectors
  3. Sort & rank (<1ms) — Return top results

Scaling Considerations:

  • Pre-computed document embeddings (stored in JSON)
  • Query-time embedding is the bottleneck
  • Linear with document count for similarity search
  • Can use approximate nearest neighbors (ANN) for 100K+ docs
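
The query path above can be sketched as follows, assuming document embeddings were pre-computed at build time and shipped as JSON. The encoder call that produces `queryVec` (the 3-20ms step) is left out; only the similarity scan and ranking are shown.

```javascript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// docs: [{ id, embedding }] with pre-computed 384-dim vectors;
// queryVec comes from encoding the user's query at search time.
// Linear scan — fine for hundreds of docs, ANN territory beyond ~100K.
function vectorSearch(queryVec, docs, topK = 5) {
  return docs
    .map((d) => ({ id: d.id, score: cosineSimilarity(queryVec, d.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```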

3. LLM (Local via WebLLM)

Model: Llama 3.2 1B Instruct

  • 1 billion parameters
  • 4-bit quantization (q4f16_1)
  • Runs entirely in browser via WebGPU

Performance Profile:

Download:    ~2GB (quantized weights)
Cold Start:  60-180s (download + compile shaders)
Warm Start:  10-30s (load from cache + compile shaders)
Query Time:  500-2000ms (inference)
Memory:      ~3-4GB (model + KV cache)
Accuracy:    Very High (full reasoning)

Cold vs Warm Breakdown:

Phase            Cold (First Visit)   Warm (Return Visit)
Download         30-90s (2GB)         0s (cached)
Shader Compile   20-60s               10-25s
Model Init       5-15s                5-10s
Total            60-180s              10-30s

Why Still Slow When Warm?

  • WebGPU shader compilation happens every page load
  • Shaders are GPU-specific, can't be fully cached
  • Model weights cached, but must be loaded into GPU memory

Token Economics:

  • Input: ~500-1000 tokens (15 posts × ~50 tokens each)
  • Output: ~50-100 tokens (answer + sources)
  • Speed: ~10-30 tokens/second on a good GPU
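
A back-of-envelope check of these numbers (ignoring prompt prefill, which adds time proportional to input length):

```javascript
// Generation time ≈ output tokens / decode speed. A deliberately
// crude estimate: prefill and first-token latency are ignored.
function estimateGenerationSeconds(outputTokens, tokensPerSecond) {
  return outputTokens / tokensPerSecond;
}

// Best case from the ranges above: 50 tokens at 30 tok/s ≈ 1.7s.
const best = estimateGenerationSeconds(50, 30);
// Worst case: 100 tokens at 10 tok/s = 10s — why slower GPUs feel sluggish.
const worst = estimateGenerationSeconds(100, 10);
```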

Caching Strategy:

  • Model weights cached in IndexedDB (~2GB)
  • Shader compilation still required each session
  • KV cache not persisted between queries

4. LLM (Remote via OpenAI)

Model: GPT-4o-mini

  • Hosted by OpenAI
  • No local resources needed
  • Pay-per-token pricing

Performance Profile:

Download:    0 (API call)
Cold Start:  0ms (already running)
Query Time:  200-500ms (network + inference)
Memory:      ~0 (server-side)
Cost:        ~$0.15/1M input, $0.60/1M output

Cost Breakdown per Query:

Input:  ~800 tokens × $0.15/1M = $0.00012
Output: ~80 tokens  × $0.60/1M ≈ $0.00005
Total:  ~$0.00017 per query (~$0.17 per 1000 queries)
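
The same arithmetic as a helper. The pricing constants are the GPT-4o-mini rates quoted above; actual prices can change, so treat them as inputs rather than facts.

```javascript
// $ per 1M tokens, per the rates in the performance profile above.
const PRICE_PER_M = { input: 0.15, output: 0.60 };

function costPerQuery(inputTokens, outputTokens) {
  return (
    (inputTokens / 1e6) * PRICE_PER_M.input +
    (outputTokens / 1e6) * PRICE_PER_M.output
  );
}

const perQuery = costPerQuery(800, 80); // ≈ $0.000168
const per1000 = perQuery * 1000;        // ≈ $0.17
```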

Latency Breakdown:

  1. Network RTT (50-150ms) — Request to OpenAI servers
  2. Queue time (0-100ms) — Variable based on load
  3. Inference (100-300ms) — Actual model computation
  4. Streaming — first token arrives well before the full response completes

5. RAG (Retrieval-Augmented Generation)

RAG combines vector retrieval with LLM generation:

RAG Local (MiniLM + Llama):

Cold Start:  60-180s (download both models)
Warm Start:  10-30s (LLM shader compile dominates)
Query Time:
  - Retrieval: 5-50ms (vector search)
  - Generation: 500-2000ms (LLM)
  - Total: 500-2050ms
Memory:      ~3-4GB
Cost:        Free

RAG Remote (MiniLM + GPT-4o-mini):

Cold Start:  3-10s (vector model only)
Warm Start:  1-2s (vector model from cache)
Query Time:
  - Retrieval: 5-50ms (local vector search)
  - Generation: 200-500ms (API)
  - Total: 200-550ms
Memory:      ~100MB
Cost:        ~$0.0001/query

Why RAG Remote is the Sweet Spot:

  • Vector model is small (~30MB) and caches well
  • Only sends relevant documents to LLM (3-5 instead of 15+)
  • Reduces input tokens by 60-80%
  • 1-2s warm start vs 10-30s for full local
  • Best accuracy with reasonable latency
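
A minimal sketch of the retrieve-then-generate flow. `encode` and `generate` are placeholder hooks: in the real setup, `encode` would call the local MiniLM model and `generate` would POST the prompt to the remote chat API.

```javascript
// Pack only the top-k retrieved documents into the prompt —
// this is where RAG cuts input tokens by 60-80%.
function buildPrompt(query, retrievedDocs) {
  const context = retrievedDocs
    .map((d, i) => `[${i + 1}] ${d.text}`)
    .join("\n");
  return `Answer using only the sources below.\n\n${context}\n\nQuestion: ${query}`;
}

function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]; na += a[i] ** 2; nb += b[i] ** 2;
  }
  return dot / Math.sqrt(na * nb);
}

// encode: text -> embedding (local, 5-50ms in the profile above).
// generate: prompt -> answer (remote API, 200-500ms).
async function ragQuery(query, docs, { encode, generate, topK = 3 }) {
  const queryVec = await encode(query);
  const top = docs
    .map((d) => ({ ...d, score: cosine(queryVec, d.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
  return generate(buildPrompt(query, top));
}
```

Swapping the `generate` hook between a local WebLLM call and a remote API call is all that separates RAG Local from RAG Remote in this design.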

Performance Trade-offs Matrix

Approach     Ready (cold / warm)     Query       Accuracy      Network Latency    Cost
Keyword      ⚡ Instant              <1ms        🔴 Low        None               Free
Fuzzy        ⚡ Instant              1-5ms       🟡 Medium     None               Free
Vector       🟡 3-10s / 1-2s         5-50ms      🟢 High       None               Free
LLM Local    🔴 60-180s / 10-30s     0.5-2s      🟢 V. High    None ✅            Free
LLM Remote   ⚡ Instant              200-500ms   🟢 V. High    Every query ⚠️     ~$0.17/1K
RAG Local    🔴 60-180s / 10-30s     0.5-2s      🟢 V. High    None ✅            Free
RAG Remote   🟡 3-10s / 1-2s         200-550ms   🟢 V. High    Every query ⚠️     ~$0.17/1K

Ready = cold / warm (first visit / return visit)
Latency = Network round-trip on every query (remote) vs local processing (WASM)


The Crossover Point: Local vs Remote

The key insight: Local WASM has upfront cost, Remote has per-query cost.

Time to complete N queries:

Local:  ReadinessTime + (N × LocalQueryTime)
Remote: 0 + (N × RemoteQueryTime), where RemoteQueryTime = network RTT + inference

LLM Example (warm start):

  • Local: 20s + (N × 1s)
  • Remote: 0s + (N × 0.35s)

Queries   Local Total   Remote Total   Winner
1         21s           0.35s          Remote
10        30s           3.5s           Remote
30        50s           10.5s          Remote
60        80s           21s            Remote
100       120s          35s            Remote
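
The totals above follow directly from the two formulas, with the warm-start example's numbers (20s readiness and 1s per local query versus 0.35s per remote query) as defaults:

```javascript
// Total wall-clock time to answer n queries under each strategy.
const localTotal = (n, readiness = 20, perQuery = 1) => readiness + n * perQuery;
const remoteTotal = (n, perQuery = 0.35) => n * perQuery;

// A crossover in local's favor only exists when remote is slower
// per query: n* = readiness / (remotePerQuery - localPerQuery).
// Here remote is faster per query too, so remote wins at every n.
```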

Even at 100 queries, remote wins on total time. But consider:

  • Privacy: Local = data never leaves browser
  • Offline: Local = works without internet
  • Cost: Local = free forever, Remote = ~$0.02 for 100 queries
  • Consistency: Local = same latency always, Remote = varies with network

Bottom line: Choose remote for speed, local for privacy/offline/cost.


Recommendations by Use Case

High-Traffic Public Site

Recommendation: Keyword + Fuzzy + Vector Search

  • No API costs at scale
  • Fast for 99% of queries
  • Vector search handles semantic queries
  • ~30MB one-time download

Internal Tool / Low Traffic

Recommendation: RAG Remote

  • Best accuracy
  • Fast response (~300ms)
  • Cost negligible at low volume
  • No heavy client downloads

Privacy-Critical Application

Recommendation: RAG Local

  • All processing client-side
  • No data leaves browser
  • Accept cold start trade-off
  • Cache model in IndexedDB

Mobile / Low-Bandwidth

Recommendation: Keyword + Fuzzy only

  • Zero download required
  • Instant responses
  • Works offline
  • Consider server-side embeddings API

Measuring Your Own Performance

The demo above shows real metrics from your browser:

  • Cold Start — Time to load and initialize model
  • Query Time — Time to process search after model ready
  • Model/Size — What's running and memory footprint

Tips for accurate measurement:

  1. Clear cache to measure true cold start
  2. Run multiple queries to see warm performance
  3. Try different devices — Mobile vs Desktop vs GPU
  4. Monitor memory in browser DevTools
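
The repeated-query tip can be automated with a small harness built on `performance.now()`, which is available in both browsers and recent Node. The warm-up call lets JIT compilation and caches settle before measuring.

```javascript
// Run fn several times and report median/min/max in milliseconds.
// Median is more robust than mean against GC pauses and tab throttling.
function timeIt(fn, runs = 10) {
  fn(); // warm-up pass: excluded from measurement
  const times = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    fn();
    times.push(performance.now() - start);
  }
  times.sort((a, b) => a - b);
  return {
    median: times[Math.floor(runs / 2)],
    min: times[0],
    max: times[runs - 1],
  };
}
```

Usage: `timeIt(() => keywordSearch("vector", posts))` measures warm query time; cold start has to be measured once per fresh page load instead.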

Key Takeaways

  1. Cold start dominates — local LLM takes 60-180s on first load
  2. Vector search is the sweet spot — 30MB for semantic understanding
  3. RAG Remote is fastest for quality — ~300ms with best accuracy
  4. Syntactic is free — Always have keyword/fuzzy as fallback
  5. Cache aggressively — IndexedDB for models, localStorage for settings

The best search isn't the most sophisticated—it's the one that fits your constraints.

