Entity extraction is one of the most powerful tools in Ema workflows—and one of the most misunderstood. This deep dive covers schemas, instructions, output handling, and the gotchas that trip up even experienced builders.

What is Entity Extraction?

Entity extraction turns unstructured text (conversations, documents) into structured data: it pulls out the specific fields you define.

Think of it as a smart form-filler that reads a conversation and populates a JSON object:

Input: "Hi, this is Michael Thompson. I'd like to check my portfolio
        and maybe send an update to my advisor at advisor@firm.com"

Output: {
  "caller_name": "Michael Thompson",
  "request_type": "portfolio_review",
  "needs_email": true,
  "recipient_email": "advisor@firm.com"
}

When to Use Entity Extraction

| Use Case | Why Entity Extraction |
|---|---|
| Get caller name, email | Extract specific values reliably |
| Identify client being discussed | Pull client_id from conversation |
| Determine request type | Classify into predefined types |
| Check if email is needed | Boolean: needs_email = true/false |
| Build query parameters | Get ticker, date range, etc. |

Defining Extraction Columns

Each column defines a field to extract:

| Property | Description |
|---|---|
| Name | Field name (e.g., caller_name) |
| Description | What to extract |
| Type | String (1), Number (2), Boolean (3), Array (4), Object (5) |
| Enum values | If limited options (e.g., ["advisor", "client"]) |

Example Schema

entity_extraction:
  columns:
    - name: caller_name
      description: "Full name of person calling"
      dataType: 1 # String

    - name: caller_type
      description: "Is caller an advisor or client?"
      dataType: 1
      isEnum: true
      possibleValues: ["advisor", "client", "unknown"]

    - name: needs_email
      description: "Does this request require sending an email?"
      dataType: 3 # Boolean

    - name: mentioned_tickers
      description: "Stock tickers mentioned in conversation"
      dataType: 4 # Array
      arrayElementType: { dataType: 1 } # Array of strings
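
To make the type codes concrete, here is a minimal Python sketch that checks an extracted blob against the schema above. It is an illustration only and not part of Ema: `SCHEMA`, `TYPE_CHECKS`, `validate`, and the `elementType` key are hypothetical names, and the numeric codes are taken from the Type row of the property table.

```python
# Hypothetical type checks keyed by the dataType codes
# (1=String, 2=Number, 3=Boolean, 4=Array, 5=Object).
TYPE_CHECKS = {
    1: lambda v: isinstance(v, str),
    # bool is a subclass of int in Python, so exclude it explicitly
    2: lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    3: lambda v: isinstance(v, bool),
    4: lambda v: isinstance(v, list),
    5: lambda v: isinstance(v, dict),
}

# Hypothetical in-code mirror of the example schema.
SCHEMA = [
    {"name": "caller_name", "dataType": 1},
    {"name": "caller_type", "dataType": 1,
     "possibleValues": ["advisor", "client", "unknown"]},
    {"name": "needs_email", "dataType": 3},
    {"name": "mentioned_tickers", "dataType": 4, "elementType": 1},
]


def validate(blob, schema=SCHEMA):
    """Return a list of problems; an empty list means the blob fits the schema."""
    problems = []
    for col in schema:
        name = col["name"]
        value = blob.get(name)
        if value is None:
            continue  # null means "could not be determined", which is allowed
        if not TYPE_CHECKS[col["dataType"]](value):
            problems.append(f"{name}: wrong type")
            continue
        allowed = col.get("possibleValues")
        if allowed is not None and value not in allowed:
            problems.append(f"{name}: {value!r} not in {allowed}")
        elem = col.get("elementType")
        if elem is not None and not all(TYPE_CHECKS[elem](x) for x in value):
            problems.append(f"{name}: wrong element type")
    return problems


print(validate({"caller_name": "Michael Thompson", "needs_email": "yes"}))
# prints ['needs_email: wrong type']
```

A check like this is most useful in tests: feed it sample outputs and confirm that the descriptions and instructions produce values of the expected shape.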

Writing Good Extraction Instructions

The instructions field tells the LLM how to extract. This is where most mistakes happen.

❌ Bad Instructions

Extract the data.

✅ Good Instructions

Extract caller information from the ENTIRE conversation history.

RULES:
1. CALLER NAME: Look for "This is [Name]", "My name is [Name]"
   - Scan ALL messages, not just the latest
   - If corrected ("Actually, it's Mike not Michael"), use corrected value

2. CALLER TYPE:
   - "advisor" if they say "my client" or are calling about someone else
   - "client" if they say "my portfolio", "my account"
   - "unknown" if not yet clear

3. NEEDS EMAIL:
   - true if: "send", "email", "communicate to", "reach out"
   - false if: just asking questions

4. If a value cannot be determined, return null (don't guess)

The Output Problem: extraction_columns is a Blob

Entity extraction outputs to extraction_columns—a JSON blob containing all your fields:

{
  "caller_name": "Michael Thompson",
  "caller_type": "client",
  "needs_email": true,
  "mentioned_tickers": ["NVDA", "AAPL"]
}

What You CAN Do With It

| Destination | How |
|---|---|
| Agent named_inputs | Pass whole blob, agent reads fields |
| Categorizer custom_data | Categorizer evaluates fields for routing |
| Another LLM node | Pass as named_input, LLM extracts what it needs |

What You CAN'T Do

# ❌ CAN'T index into the blob for direct wiring
email_node:
  to: entity_extraction.extraction_columns.recipient_email

# ❌ CAN'T evaluate fields in runIf
some_node:
  runIf: entity_extraction.extraction_columns.needs_email == true

The Solution: JSON Extractor

flowchart LR
    A["entity_extraction"] --> B["JSON Extractor"]
    B --> C["caller_name: 'Michael'"]
    B --> D["needs_email: true"]
    B --> E["caller_type: 'client'"]
    C & D & E --> F["Now you can wire directly!"]
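
Conceptually, a JSON Extractor parses the blob and returns one value per path. The Python sketch below illustrates that idea under stated assumptions: `extract_path` is a hypothetical helper that handles only dotted keys such as `$.recipient_email`, and it is not the node's actual implementation or its full path syntax.

```python
import json


def extract_path(blob_json, path):
    """Resolve a simple '$.a.b' path against a JSON string.

    Returns None when any key along the path is missing.
    """
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    value = json.loads(blob_json)
    for key in path[1:].split("."):
        if not key:
            continue  # skip the empty piece before the first dot
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


# Sample blob shaped like extraction_columns
extraction_columns = json.dumps({
    "caller_name": "Michael Thompson",
    "needs_email": True,
    "recipient_email": "advisor@firm.com",
})

print(extract_path(extraction_columns, "$.recipient_email"))  # prints advisor@firm.com
print(extract_path(extraction_columns, "$.needs_email"))      # prints True
```

Each extracted value is a plain scalar, which is what makes direct wiring and condition checks possible downstream.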

Entity Extraction vs call_llm JSON

Both can extract structured data. When to use which?

| Aspect | Entity Extraction | call_llm JSON |
|---|---|---|
| UI | Column builder | Write in instructions |
| Type safety | Built-in dataType | None |
| Enum enforcement | Platform enforces | LLM might hallucinate |
| Flexibility | Fixed schema | Dynamic |
| Best for | Known, stable fields | Complex/nested structures |

Choose Entity Extraction When:

  • You have well-defined fields that don't change
  • You want UI-based column management
  • You need enum validation
  • Multiple people manage the workflow

Choose call_llm When:

  • Schema varies by context
  • You need complex nested structures
  • You want full prompt control
  • One-off extraction tasks

Handling Multi-Turn Conversations

Entity extraction must scan the FULL conversation, not just the current message:

Turn 1: Bot asks "What's your email?"
Turn 2: User says "john@example.com"
Turn 3: User says "Send the report"

On Turn 3:
- Current message has no email
- But chat_conversation has the email from Turn 2
- Instructions must say "scan entire conversation"

Instructions for Multi-Turn

Extract from the ENTIRE conversation history.

ACCUMULATION RULES:
- Scan ALL user and bot messages
- If value provided in later turn, it fills earlier gap
- If user corrects a value, use the NEW value
- "Yes"/"Confirm" → confirmation_status = "confirmed"
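
The accumulation rules amount to "the latest non-null value wins". Here is a minimal Python sketch of that behavior, assuming each turn has already been reduced to a partial dictionary of fields. In practice the LLM does this in a single pass over the history; `accumulate` and the sample turns are hypothetical.

```python
def accumulate(turn_values):
    """Fold per-turn partial extractions into one record, oldest turn first."""
    merged = {}
    for partial in turn_values:
        for field, value in partial.items():
            if value is not None:
                # Later turns fill earlier gaps and override earlier values.
                merged[field] = value
    return merged


turns = [
    {"recipient_email": None},                # bot asks for the email
    {"recipient_email": "john@example.com"},  # user provides it
    {"recipient_email": "mike@corp.com"},     # user corrects it
    {"confirmation_status": "confirmed", "recipient_email": None},  # "Yes"
]

print(accumulate(turns))
# prints {'recipient_email': 'mike@corp.com', 'confirmation_status': 'confirmed'}
```

A function like this can serve as an expected-result oracle when you test whether your instructions handle gaps and corrections correctly.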

Connecting Extraction to Other Nodes

To Categorizer (for routing)

# Pass extraction to categorizer for state-aware routing
chat_categorizer:
  custom_data: entity_extraction.extraction_columns
  categories:
    - NEEDS_EMAIL: "needs_email is true AND recipient_email is null"
    - READY: "has all required data"
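
When testing the categorizer, it can help to write the expected routing as plain code. The sketch below assumes READY is the fallback whenever the NEEDS_EMAIL condition does not hold, which the category description leaves open. `route` is a hypothetical helper, not an Ema API.

```python
def route(entities):
    """Return the category the categorizer is expected to pick."""
    needs_email = entities.get("needs_email") is True
    has_recipient = entities.get("recipient_email") is not None
    if needs_email and not has_recipient:
        return "NEEDS_EMAIL"
    return "READY"  # assumption: everything else counts as ready


print(route({"needs_email": True, "recipient_email": None}))                # prints NEEDS_EMAIL
print(route({"needs_email": True, "recipient_email": "advisor@firm.com"}))  # prints READY
```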

To Agent (for processing)

custom_agent:
  named_inputs:
    - name: "Entities"
      source: entity_extraction.extraction_columns
  task_instructions: |
    Client: 
    Request type: 

To Email Node (via JSON Extractor)

flowchart LR
    A["entity_extraction"] --> B["JSON Extractor"] --> C["email_value"] --> D["send_email.to"]

Common Patterns

Pattern 1: Caller Identification

columns:
  - caller_name: "string"
  - caller_id: "string (c_XXX or a_XXX)"
  - caller_type: "enum: advisor/client/unknown"
  - ok_to_contact: "boolean"

Pattern 2: Request Analysis

columns:
  - request_type: "enum: portfolio/compliance/market"
  - urgency: "enum: low/medium/high"
  - focus_client: "string"
  - focus_tickers: "array of strings"

Pattern 3: Communication Readiness

columns:
  - needs_email: "boolean"
  - recipient_email: "string"
  - confirmation_status: "enum: pending/confirmed/cancelled"
  - missing_fields: "array of strings"
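
The missing_fields column in Pattern 3 can also be cross-checked deterministically. Below is a small Python sketch, assuming recipient_email is the only field required when needs_email is true. The required list is an assumption, and `compute_missing_fields` is a hypothetical helper.

```python
def compute_missing_fields(entities, required=("recipient_email",)):
    """List required fields that are still null or empty when an email is needed."""
    if not entities.get("needs_email"):
        return []  # nothing is required if no email will be sent
    return [name for name in required if entities.get(name) in (None, "")]


print(compute_missing_fields({"needs_email": True, "recipient_email": None}))
# prints ['recipient_email']
```

Comparing this list with the LLM-produced missing_fields value is a cheap way to spot extraction drift.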

Troubleshooting Guide

Problem: Extraction Returns Null for Known Values

Symptoms: User clearly stated their name, but caller_name is null.

Diagnosis: Instructions only look at current message.

Solution:

# Before
instructions: "Extract caller name from the message"

# After
instructions: |
  Extract caller name from the ENTIRE conversation history.
  The name may have been provided in an earlier turn.

Problem: Enum Field Has Unexpected Value

Symptoms: caller_type should be "advisor" or "client" but returns "financial professional".

Diagnosis: LLM is paraphrasing instead of using enum values.

Solution:

# Add explicit enum instruction
instructions: |
  CALLER TYPE must be exactly one of: "advisor", "client", "unknown"
  Do not paraphrase or use synonyms.
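
If paraphrased values still slip through, a downstream guard can coerce them. Below is a minimal Python sketch, assuming that falling back to "unknown" is acceptable for your workflow. `normalize_caller_type` is a hypothetical helper that would live outside the extraction node.

```python
ALLOWED_CALLER_TYPES = ("advisor", "client", "unknown")


def normalize_caller_type(value):
    """Map a raw extracted value onto the enum, defaulting to 'unknown'."""
    if isinstance(value, str):
        cleaned = value.strip().lower()
        if cleaned in ALLOWED_CALLER_TYPES:
            return cleaned
    return "unknown"


print(normalize_caller_type("Advisor "))                # prints advisor
print(normalize_caller_type("financial professional"))  # prints unknown
```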

Problem: Can't Wire extraction_columns to Node Input

Symptoms: Connecting the email value to the send_email node fails.

Diagnosis: extraction_columns is a single blob, and node inputs can't index into it directly.

Solution: Add JSON Extractor between extraction and target node:

entity_extraction → JSON Extractor (path: $.recipient_email) → send_email.to

Problem: Extraction Works Once but Fails on Follow-Up

Symptoms: First extraction is correct, subsequent turns return nulls.

Diagnosis: Each turn is stateless, so every turn must re-extract from the full history.

Solution: Ensure you're passing trigger.chat_conversation (full history), not trigger.user_query (current message only).

Problem: Corrections Not Being Applied

Symptoms: User says "Actually, use mike@corp.com instead" but old email persists.

Diagnosis: No correction handling in instructions.

Solution:

instructions: |
  CORRECTION HANDLING:
  - If user says "Actually...", "Instead...", "Change to..."
    use the NEW value, not the original
  - Latest correction wins

Best Practices

| Practice | Why |
|---|---|
| Be specific in descriptions | "Email in format user@domain.com", not just "Email" |
| Use enums for known values | Prevents hallucinated categories |
| Include extraction guidance | Tell LLM WHERE to look for values |
| Handle nulls explicitly | "If not found, return null" |
| Scan full history | Not just current message |
| Use JSON Extractor for wiring | When you need individual values |

Summary

| Question | Answer |
|---|---|
| What is it? | Structured data extraction from text |
| Output format | JSON blob (extraction_columns) |
| Can I index into it? | No; use JSON Extractor |
| When to use? | Known fields, UI management, type safety |
| Multi-turn? | Must scan full chat_conversation |
| Alternative? | call_llm with JSON instructions |

Entity extraction is your structured data powerhouse. Use it when you need reliable, typed field extraction. Just remember: the output is a blob, so plan your wiring accordingly.


For more on using extracted data with routing, see Ema Workflows: Routing and Branching.