Ema Workflows: Entity Extraction

Entity extraction is one of the most powerful tools in Ema workflows—and one of the most misunderstood. This deep dive covers schemas, instructions, output handling, and the gotchas that trip up even experienced builders.

What is Entity Extraction?

Entity extraction is a structured data extractor—it takes unstructured text (conversations, documents) and pulls out specific fields you define.

Think of it as a smart form-filler that reads a conversation and populates a JSON object:

Input: "Hi, this is Michael Thompson. I'd like to check my portfolio
        and maybe send an update to my advisor at advisor@firm.com"

Output: {
  "caller_name": "Michael Thompson",
  "request_type": "portfolio_review",
  "needs_email": true,
  "recipient_email": "advisor@firm.com"
}

When to Use Entity Extraction

Use Case	Why Entity Extraction
Get caller name, email	Extract specific values reliably
Identify client being discussed	Pull client_id from conversation
Determine request type	Classify into predefined types
Check if email is needed	Boolean: needs_email = true/false
Build query parameters	Get ticker, date range, etc.

Defining Extraction Columns

Each column defines a field to extract:

Property	Description
Name	Field name (e.g., `caller_name`)
Description	What to extract
Type	String (1), Number (2), Boolean (3), Array (4), Object (5)
Enum values	If limited options (e.g., `["advisor", "client"]`)

Example Schema

entity_extraction:
  columns:
    - name: caller_name
      description: "Full name of person calling"
      dataType: 1 # String

    - name: caller_type
      description: "Is caller an advisor or client?"
      dataType: 1
      isEnum: true
      possibleValues: ["advisor", "client", "unknown"]

    - name: needs_email
      description: "Does this request require sending an email?"
      dataType: 3 # Boolean

    - name: mentioned_tickers
      description: "Stock tickers mentioned in conversation"
      dataType: 4 # Array
      arrayElementType: { dataType: 1 } # Array of strings

Writing Good Extraction Instructions

The instructions field tells the LLM how to extract. This is where most mistakes happen.

❌ Bad Instructions

Extract the data.

✅ Good Instructions

Extract caller information from the ENTIRE conversation history.

RULES:
1. CALLER NAME: Look for "This is [Name]", "My name is [Name]"
   - Scan ALL messages, not just the latest
   - If corrected ("Actually, it's Mike not Michael"), use corrected value

2. CALLER TYPE:
   - "advisor" if they say "my client", calling about someone else
   - "client" if they say "my portfolio", "my account"
   - "unknown" if not yet clear

3. NEEDS EMAIL:
   - true if: "send", "email", "communicate to", "reach out"
   - false if: just asking questions

4. If a value cannot be determined, return null (don't guess)

The Output Problem: extraction_columns is a Blob

Entity extraction outputs to extraction_columns—a JSON blob containing all your fields:

{
  "caller_name": "Michael Thompson",
  "caller_type": "client",
  "needs_email": true,
  "mentioned_tickers": ["NVDA", "AAPL"]
}

What You CAN Do With It

Destination	How
Agent named_inputs	Pass whole blob, agent reads fields
Categorizer custom_data	Categorizer evaluates fields for routing
Another LLM node	Pass as named_input, LLM extracts what it needs

What You CAN'T Do

# ❌ CAN'T index into the blob for direct wiring
email_node:
  to: entity_extraction.extraction_columns.recipient_email

# ❌ CAN'T evaluate fields in runIf
some_node:
  runIf: entity_extraction.extraction_columns.needs_email == true

The Solution: JSON Extractor

flowchart LR
    A["entity_extraction"] --> B["JSON Extractor"]
    B --> C["caller_name: 'Michael'"]
    B --> D["needs_email: true"]
    B --> E["caller_type: 'client'"]
    C & D & E --> F["Now you can wire directly!"]

Entity Extraction vs call_llm JSON

Both can extract structured data. When to use which?

Aspect	Entity Extraction	call_llm JSON
UI	Column builder	Write in instructions
Type safety	Built-in dataType	None
Enum enforcement	Platform enforces	LLM might hallucinate
Flexibility	Fixed schema	Dynamic
Best for	Known, stable fields	Complex/nested structures

Choose Entity Extraction When:

You have well-defined fields that don't change
You want UI-based column management
You need enum validation
Multiple people manage the workflow

Choose call_llm When:

Schema varies by context
You need complex nested structures
You want full prompt control
One-off extraction tasks

Handling Multi-Turn Conversations

Entity extraction must scan the FULL conversation, not just the current message:

Turn 1: Bot asks "What's your email?"
Turn 2: User says "john@example.com"
Turn 3: User says "Send the report"

On Turn 3:
- Current message has no email
- But chat_conversation has the email from Turn 2
- Instructions must say "scan entire conversation"

Instructions for Multi-Turn

Extract from the ENTIRE conversation history.

ACCUMULATION RULES:
- Scan ALL user and bot messages
- If value provided in later turn, it fills earlier gap
- If user corrects a value, use the NEW value
- "Yes"/"Confirm" → confirmation_status = "confirmed"

Connecting Extraction to Other Nodes

To Categorizer (for routing)

# Pass extraction to categorizer for state-aware routing
chat_categorizer:
  custom_data: entity_extraction.extraction_columns
  categories:
    - NEEDS_EMAIL: "needs_email is true AND recipient_email is null"
    - READY: "has all required data"

To Agent (for processing)

custom_agent:
  named_inputs:
    - name: "Entities"
      source: entity_extraction.extraction_columns
  task_instructions: |
    Client: 
    Request type:

To Email Node (via JSON Extractor)

flowchart LR
    A["entity_extraction"] --> B["JSON Extractor"] --> C["email_value"] --> D["send_email.to"]

Common Patterns

Pattern 1: Caller Identification

columns:
  - caller_name: "string"
  - caller_id: "string (c_XXX or a_XXX)"
  - caller_type: "enum: advisor/client/unknown"
  - ok_to_contact: "boolean"

Pattern 2: Request Analysis

columns:
  - request_type: "enum: portfolio/compliance/market"
  - urgency: "enum: low/medium/high"
  - focus_client: "string"
  - focus_tickers: "array of strings"

Pattern 3: Communication Readiness

columns:
  - needs_email: "boolean"
  - recipient_email: "string"
  - confirmation_status: "enum: pending/confirmed/cancelled"
  - missing_fields: "array of strings"

Troubleshooting Guide

Problem: Extraction Returns Null for Known Values

Symptoms: User clearly stated their name, but caller_name is null.

Diagnosis: Instructions only look at current message.

Solution:

# Before
instructions: "Extract caller name from the message"

# After
instructions: |
  Extract caller name from the ENTIRE conversation history.
  The name may have been provided in an earlier turn.

Problem: Enum Field Has Unexpected Value

Symptoms: caller_type should be "advisor" or "client" but returns "financial professional".

Diagnosis: LLM is paraphrasing instead of using enum values.

Solution:

# Add explicit enum instruction
instructions: |
  CALLER TYPE must be exactly one of: "advisor", "client", "unknown"
  Do not paraphrase or use synonyms.

Problem: Can't Wire extraction_columns to Node Input

Symptoms: Trying to connect email value to send_email node fails.

Diagnosis: Can't index into blob directly.

Solution: Add JSON Extractor between extraction and target node:

entity_extraction → JSON Extractor (path: $.recipient_email) → send_email.to

Problem: Extraction Works Once but Fails on Follow-Up

Symptoms: First extraction is correct, subsequent turns return nulls.

Diagnosis: Each turn is stateless—need to re-extract from history.

Solution: Ensure you're passing trigger.chat_conversation (full history), not trigger.user_query (current message only).

Problem: Corrections Not Being Applied

Symptoms: User says "Actually, use mike@corp.com instead" but old email persists.

Diagnosis: No correction handling in instructions.

Solution:

instructions: |
  CORRECTION HANDLING:
  - If user says "Actually...", "Instead...", "Change to..."
    use the NEW value, not the original
  - Latest correction wins

Best Practices

Practice	Why
Be specific in descriptions	"Email in format user@domain.com" not just "Email"
Use enums for known values	Prevents hallucinated categories
Include extraction guidance	Tell LLM WHERE to look for values
Handle nulls explicitly	"If not found, return null"
Scan full history	Not just current message
Use JSON Extractor for wiring	When you need individual values

Summary

Question	Answer
What is it?	Structured data extraction from text
Output format	JSON blob (`extraction_columns`)
Can I index into it?	No—use JSON Extractor
When to use?	Known fields, UI management, type safety
Multi-turn?	Must scan full chat_conversation
Alternative?	call_llm with JSON instructions

Entity extraction is your structured data powerhouse. Use it when you need reliable, typed field extraction. Just remember: the output is a blob, so plan your wiring accordingly.

For more on using extracted data with routing, see Ema Workflows: Routing and Branching.

What is Entity Extraction?

When to Use Entity Extraction

Defining Extraction Columns

Example Schema

Writing Good Extraction Instructions

❌ Bad Instructions

✅ Good Instructions

The Output Problem: extraction_columns is a Blob

What You CAN Do With It

What You CAN'T Do

The Solution: JSON Extractor

Entity Extraction vs call_llm JSON

Choose Entity Extraction When:

Choose call_llm When:

Handling Multi-Turn Conversations

Instructions for Multi-Turn

Connecting Extraction to Other Nodes

To Categorizer (for routing)

To Agent (for processing)

To Email Node (via JSON Extractor)

Common Patterns

Pattern 1: Caller Identification

Pattern 2: Request Analysis

Pattern 3: Communication Readiness

Troubleshooting Guide

Problem: Extraction Returns Null for Known Values

Problem: Enum Field Has Unexpected Value

Problem: Can't Wire extraction_columns to Node Input

Problem: Extraction Works Once but Fails on Follow-Up

Problem: Corrections Not Being Applied

Best Practices

Summary

Content Calendar