Add a Data Source - Airia Explore

Add and Configure a Data Source in an Agent

This page is also available under Context Engineering > Retrieval Methods, where it’s framed alongside hybrid search, reranking, and graph-enhanced retrieval as part of the full context pipeline.

Data sources enable your Agent to access specific knowledge, grounding its responses with relevant content. The Agent uses the configured search settings and the indexes created for the data source to retrieve information, which produces more accurate outputs for user queries.

Prerequisites

A data source must be created and configured. For details, refer to Data Source Connectors.

Two Ways to Add a Datasource to Your Agent

Airia supports two methods for connecting datasources to your agent pipeline. Both methods enable the agent to retrieve information from your knowledge base, but they differ in how retrieval is performed and when to use them.

Data Search Step — A dedicated pipeline step that performs a single, embedding-based search against a datasource. The full user input is used as the search query, and the retrieved chunks are passed directly to the LLM or the next step. Best for simple, predictable queries and workflows where speed and low cost are priorities.
Datasource in the AI Step — Datasources are attached directly to the LLM, which dynamically decides which sources to query, which retrieval tools to use, and how many times to search. Best for conversational agents and complex queries where answer accuracy is the priority.

Add Datasource in the AI Step

Overview

Add Datasource in the AI Step lets you connect one or more knowledge sources directly to an AI (LLM) step in your agent pipeline, turning data retrieval from a power-user-only capability into a configurable toolset available to anyone building AI agents on Airia. The AI queries those sources dynamically, deciding which sources to search, which retrieval method to use, and how many times to search, in order to produce an accurate, contextually rich response. This is powered by multi-hop retrieval via the Airia Datasource MCP Server, a tool-based retrieval architecture that allows the AI to query multiple knowledge sources within a single agent run.

How It Works

When you enable a datasource in the AI step:

The description and ID of each selected datasource are automatically injected into the LLM’s context, helping the AI understand what each source contains and when to query it.
The Airia Datasource MCP Server is automatically deployed and attached to the LLM. This server exposes a set of retrieval tools the AI can call during inference.
The AI autonomously decides which tool(s) to invoke, how many times, and in what sequence — enabling multi-hop retrieval across all connected databases and indexes.
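
The flow above can be pictured with a small sketch. It is not Airia's implementation: the `Datasource` class, the `keyword_search` tool, and the `stub_policy` function that stands in for the LLM's tool-selection decision are all hypothetical names invented for illustration. The sketch only shows the shape of a multi-hop loop, in which datasource IDs and descriptions are placed in context and the model repeatedly picks a source and a query until it decides it has enough.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Datasource:
    id: str
    description: str
    documents: list[str]


def build_context(sources: list[Datasource]) -> str:
    """Render each datasource's ID and description for the model's context."""
    return "\n".join(f"- {s.id}: {s.description}" for s in sources)


def keyword_search(source: Datasource, query: str) -> list[str]:
    """Toy retrieval tool: return documents that share a word with the query."""
    words = set(query.lower().split())
    return [d for d in source.documents if words & set(d.lower().split())]


ToolCall = tuple[str, str]  # (datasource ID, search query)
Policy = Callable[[str, str, list[str]], Optional[ToolCall]]


def run_multi_hop(question: str, sources: list[Datasource],
                  choose_next_call: Policy, max_hops: int = 3) -> list[str]:
    """Let the policy pick a source and query each hop until it returns None."""
    by_id = {s.id: s for s in sources}
    gathered: list[str] = []
    for _ in range(max_hops):
        call = choose_next_call(question, build_context(sources), gathered)
        if call is None:  # the policy decided it has enough context
            break
        source_id, query = call
        gathered.extend(keyword_search(by_id[source_id], query))
    return gathered


def stub_policy(question: str, context: str, gathered: list[str]) -> Optional[ToolCall]:
    """Hypothetical stand-in for the LLM: search HR first, then IT, then stop."""
    if not gathered:
        return ("hr-policies", "vacation")
    if len(gathered) == 1:
        return ("it-runbooks", "vpn")
    return None


hr = Datasource("hr-policies", "HR policy documents",
                ["Vacation requests need manager approval", "Payroll runs monthly"])
it = Datasource("it-runbooks", "IT runbooks",
                ["VPN setup guide for remote staff", "Printer troubleshooting"])

results = run_multi_hop("Can I work remotely while on vacation?", [hr, it], stub_policy)
```

In a real agent run the decision made by `stub_policy` would come from the model itself, which is why clear datasource descriptions matter: they are the only signal the model has about which source to query.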

Step-by-Step: Configuring a Datasource in the AI Step

Step 1 — Enable Datasource in the AI Step

Open your agent pipeline in the Airia builder.
Navigate to the AI Step you want to configure.
Toggle on Enable Datasource.

Step 2 — Select Your Datasource(s)

Click the datasource dropdown that appears.
Select one or more datasources. Multi-selection is supported — you can connect as many knowledge sources as your use case requires.
The description and ID of each selected datasource are automatically passed to the LLM context at search time.

💡 Tip: Make sure your datasource descriptions are clear and specific. The AI uses these descriptions to determine which source is most relevant for a given query.

Step 3 — Review Retrieval Tools

Once a datasource is selected, the Airia Datasource MCP Server is automatically deployed and attached to your AI step.

By default, all available retrieval tools are enabled.
You can manually disable individual tools based on your use case or requirements (e.g., if you only want vector search and not keyword search).

⚠️ Important: If neither the Airia Datasource MCP Server nor any Airia native retrieval tools are configured, the LLM will not have access to your knowledge base and may produce incorrect or hallucinated answers.

What If No Datasource Is Selected?

If a datasource is not selected in the AI step, the LLM will still require a datasource ID to search against. In this case, you must provide it in one of these ways:

In the LLM prompt (system or user prompt)
In the user input passed to the AI step at runtime

⚠️ Warning: If no datasource ID is supplied through any of these methods and no retrieval tool is configured, the AI has no knowledge source to query. This will likely result in hallucinated or factually incorrect responses.

Limitations & Known Issues

Agent-in-Agent Tool Calls

⚠️ Known Limitation: Tool calls — including datasource retrieval tools — do not currently work within nested agent (agent-in-agent) configurations. This is a platform-wide limitation affecting all MCPs, not specific to the Datasource MCP Server.

Add a Data Search step

To add a data source to your Agent:

While creating your Agent, drag and drop the desired data source from the Data Sources section in the left side panel into your Agent workflow as a separate step.

Configure Search Settings

After adding a data source, configure its search behavior:

Select Files for the Agent (Optional): By default, the Agent retrieves data from the entire data source. To narrow the search to specific documents, click the Select files for this Agent button and choose the desired files.
Choose Search Type: Select the search type best suited for your use case:

Semantic Search

Adjust your workspace’s search settings to get the most relevant and useful results from semantic or hybrid searches. Each setting fine-tunes how your system finds and ranks content based on your query.

1. Max Results

This setting controls the maximum number of text chunks returned based on semantic similarity to your query.

How to Configure

Choose a number for Max Results (e.g., 5, 10, or 20). The system will retrieve up to this many most semantically relevant chunks.

When to Use

Use this setting to limit the volume of results, preventing information overload for your Large Language Model (LLM) or focusing strictly on the most pertinent information. Example:

Query: “How do I integrate Jira with ServiceNow?”
Max Results: 3

The system returns the 3 most semantically related chunks (e.g., “integration setup,” “API configuration,” “permissions”).
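
As a rough illustration of what a Max Results cap does, the sketch below ranks chunks by cosine similarity to a query vector and keeps the top few. The three-dimensional vectors are made up for the example, and the function names are hypothetical; real embeddings have many more dimensions and the platform's actual ranking logic is not described here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_chunks(query_vec: list[float],
                 chunks: list[tuple[str, list[float]]],
                 max_results: int) -> list[str]:
    """Return up to max_results chunk texts, most similar first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:max_results]]


# Toy embeddings, invented for illustration only.
query = [1.0, 0.0, 0.0]
chunks = [
    ("integration setup", [0.9, 0.1, 0.0]),
    ("API configuration", [0.8, 0.2, 0.1]),
    ("permissions", [0.7, 0.3, 0.2]),
    ("holiday schedule", [0.0, 0.1, 0.9]),
]

top_three = top_k_chunks(query, chunks, max_results=3)
```

Note that the cap is an upper bound: if fewer chunks exist than Max Results, all of them are returned.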

2. Relevance Threshold (1–100)

This setting filters out chunks that do not meet a minimum semantic similarity score. The score is internally converted to a 1–100 scale.

How to Configure

Choose a Relevance Threshold (e.g., 70). Only chunks with a relevance score equal to or greater than your chosen threshold will be returned. A setting of 0 means no threshold is applied.

When to Use

Higher threshold (e.g., 80–90): For highly precise results, such as searching a technical knowledge base.
Lower threshold (e.g., 40–60): For broader context, suitable for brainstorming or research.

Example:

Query: “Jira integration errors”
Relevance Threshold: 80

Only chunks very closely related to Jira errors will be retrieved, excluding general setup or unrelated tool information.
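
The filtering rule can be sketched in a few lines. The example assumes each chunk already carries a relevance score on the 1–100 scale; how that score is derived from the underlying similarity measure is internal to the platform and is not modelled here. The function name and sample scores are hypothetical.

```python
def apply_relevance_threshold(scored_chunks: list[tuple[str, int]],
                              threshold: int) -> list[tuple[str, int]]:
    """Keep chunks scoring at or above the threshold; 0 disables filtering."""
    if threshold <= 0:
        return list(scored_chunks)
    return [(text, score) for text, score in scored_chunks if score >= threshold]


# Sample scores, invented for illustration only.
scored = [
    ("Jira error codes", 91),
    ("Jira sync failure troubleshooting", 80),
    ("Jira initial setup", 62),
    ("Slack notifications", 35),
]

strict = apply_relevance_threshold(scored, 80)
unfiltered = apply_relevance_threshold(scored, 0)
```

With a threshold of 80, the chunk scoring exactly 80 is kept because the comparison is "equal to or greater than".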

3. Neighboring Chunks

When a chunk matches your search, this option allows you to include surrounding chunks (before and after it within the same document) to provide additional context.

How to Configure

Choose how many Neighboring Chunks to include:

0: Return only the matching chunk.
1–5: Include a few nearby chunks for context.
Full document: Include the entire document if one of its chunks matches.

When to Use

Use this when context is crucial, especially when a single sentence or paragraph alone doesn’t convey the full meaning. Example:

Query: “ServiceNow workflow automation”
Neighboring Chunks: Full Document

If a match is found in one paragraph, the entire document detailing the automation setup will be sent to the LLM, ensuring comprehensive context.
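
A minimal sketch of neighbor expansion follows. It assumes a setting of N means N chunks on each side of the match, which is one plausible reading of "before and after"; the function name is hypothetical, and `None` is used here to represent the Full document option.

```python
from typing import Optional


def expand_with_neighbors(doc_chunks: list[str], hit_index: int,
                          neighbors: Optional[int]) -> list[str]:
    """Return the matching chunk plus its neighbors from the same document.

    neighbors=None stands for the "Full document" option.
    """
    if neighbors is None:
        return list(doc_chunks)
    start = max(0, hit_index - neighbors)                    # clamp at document start
    end = min(len(doc_chunks), hit_index + neighbors + 1)    # clamp at document end
    return doc_chunks[start:end]


doc = ["c0", "c1", "c2", "c3", "c4"]

only_hit = expand_with_neighbors(doc, hit_index=2, neighbors=0)
with_one = expand_with_neighbors(doc, hit_index=2, neighbors=1)
at_edge = expand_with_neighbors(doc, hit_index=0, neighbors=2)
whole_doc = expand_with_neighbors(doc, hit_index=2, neighbors=None)
```

Larger neighbor settings send more text to the LLM for every hit, so they trade context for token usage.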

4. Hybrid Search (Keyword + Semantic)

Hybrid Search combines semantic search (understanding meaning) with keyword search (exact word matches). You can assign a weight to each method.

Keyword Search: Finds exact words or identifiers (e.g., “JIRA-1234,” “Project Falcon”).
Semantic Search: Finds similar meanings (e.g., “how to connect Jira” will match “Jira integration steps”).

How to Configure

Adjust the balance using a slider or numeric values for Keyword Search and Semantic Search weights:

100% Keyword / 0% Semantic: Relies solely on exact word matches.
50% Keyword / 50% Semantic: Gives equal importance to meaning and exact words.
20% Keyword / 80% Semantic: Prioritizes meaning while still allowing for precise terms.

When to Use

Keyword-heavy: For searches involving product codes, specific names, or identifiers.
Semantic-heavy: For conceptual or general questions.
Balanced: For queries that blend both precise terms and broader concepts.

Example:

Query: “Banana”
Keyword weight 100%: Finds documents containing the exact word “banana.”
Semantic weight 100%: Finds documents about “fruit,” “tropical food,” or “smoothies.”
Hybrid (50/50): Finds both exact matches and semantically related concepts.
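
One simple way to picture the weighting is a linear blend of a keyword score and a semantic score. This is an assumption made for illustration: the source does not say how Airia fuses the two rankings, and other fusion methods exist. The function names and the sample scores (all in the range 0 to 1) are hypothetical.

```python
def hybrid_score(keyword_score: float, semantic_score: float,
                 keyword_weight: float) -> float:
    """Blend two scores; the semantic weight is the remainder of the keyword weight."""
    if not 0.0 <= keyword_weight <= 1.0:
        raise ValueError("keyword_weight must be between 0 and 1")
    return keyword_weight * keyword_score + (1.0 - keyword_weight) * semantic_score


def rank_hybrid(candidates: list[tuple[str, float, float]],
                keyword_weight: float) -> list[tuple[str, float, float]]:
    """Sort (text, keyword_score, semantic_score) tuples by blended score."""
    return sorted(candidates,
                  key=lambda c: hybrid_score(c[1], c[2], keyword_weight),
                  reverse=True)


# Sample scores for the query "Banana", invented for illustration only.
candidates = [
    ("Banana bread recipe", 1.0, 0.6),        # contains the exact word
    ("Tropical fruit smoothies", 0.0, 0.9),   # related meaning, no exact match
    ("Quarterly tax filing", 0.0, 0.1),       # unrelated
]

keyword_only = rank_hybrid(candidates, keyword_weight=1.0)
semantic_only = rank_hybrid(candidates, keyword_weight=0.0)
balanced = rank_hybrid(candidates, keyword_weight=0.5)
```

With these sample scores, the exact-match document leads under keyword-only and balanced weighting, while the semantically related document leads under semantic-only weighting.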

Summary of Search Settings

Setting	What It Controls	Best For	Example
Max Results	How many chunks are returned	Controlling the size of results	“Top 5 relevant answers”
Relevance Threshold	How relevant chunks must be	Filtering out weak matches	“Only results with a relevance score of 80 or higher”
Neighboring Chunks	How much context to include	Providing context-rich answers	“Return the full document when a hit is found”
Hybrid Search	Balance between meaning and exact match	Combining precise + conceptual queries	“Product codes + topic meaning”

Text-to-SQL Search

Text-to-SQL search is suitable for .csv and .xlsx files, especially when the data is primarily numerical and lacks deep semantic meaning. This method allows the Agent to generate SQL queries from natural language input to retrieve structured results.

Model Selection: Choose the LLM responsible for generating SQL queries within the Agent workflow.
Recommendation: For stable and accurate results, select “High Quality Capable” models.
- High Quality (best performance):
  - Claude 4 Sonnet
  - GPT 4.1
  - Claude 3.7 Sonnet
  - GPT 4o
- Sufficient Quality:
  - GPT 4.1 mini
  - Claude 3.5 Sonnet
  - GPT 4o mini
Fuzzy Search: Enable to let the system match records even when the user’s query contains misspellings. Note that fuzzy search can increase query generation complexity.

Important: For both Semantic and Text-to-SQL search to function, indexes must be created and the data source configured during its creation. Check Ingestion settings for details.
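
To make the idea concrete, here is a minimal sketch of the Text-to-SQL pattern using Python's standard `sqlite3`, `csv`, and `difflib` modules. It is not how Airia implements the feature: the table name `sales`, the sample data, and the hard-coded `generated_sql` string (standing in for what a model might produce from the question "What is the total revenue per region?") are all assumptions. The fuzzy-matching helper shows one way a misspelled value could be mapped to a known one.

```python
import csv
import difflib
import io
import sqlite3
from typing import Optional

# Sample data, invented for illustration only.
CSV_TEXT = """region,product,units,revenue
North,Widget,10,250.0
South,Widget,4,100.0
North,Gadget,3,90.0
"""


def load_csv_into_sqlite(csv_text: str) -> sqlite3.Connection:
    """Load the CSV rows into an in-memory table named 'sales'."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (region TEXT, product TEXT, units INTEGER, revenue REAL)"
    )
    rows = [
        (r["region"], r["product"], int(r["units"]), float(r["revenue"]))
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    return conn


def fuzzy_match_value(user_term: str, known_values: list[str]) -> Optional[str]:
    """Map a possibly misspelled term to the closest known value, if any."""
    matches = difflib.get_close_matches(user_term, known_values, n=1, cutoff=0.6)
    return matches[0] if matches else None


conn = load_csv_into_sqlite(CSV_TEXT)

# Stand-in for model-generated SQL; a real agent would obtain this from the LLM.
generated_sql = "SELECT region, SUM(revenue) FROM sales GROUP BY region ORDER BY region"
totals = conn.execute(generated_sql).fetchall()

products = [row[0] for row in conn.execute("SELECT DISTINCT product FROM sales")]
corrected = fuzzy_match_value("Widgit", products)
```

Aggregations such as the `SUM` above are the kind of question that embedding-based search handles poorly, which is why structured, numerical files are better served by SQL.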

Choosing the Right Search Method

The optimal search method depends on your query type, data structure, and desired outcome:

Use SQL Retrieval for:
- Structured files (.csv, .xlsx).
- Precise, structured queries.
- Efficiently answering quantitative questions.
- Data that is mostly numerical and does not have strong semantic meaning.
Use Semantic Retrieval for:
- Natural language queries.
- Unstructured or text-heavy documents.
- Cases requiring semantic understanding.

Combining both methods can offer the most flexible and effective solution, especially when Agents interact with users through natural language.
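
The file-type guidance above can be reduced to a simple rule of thumb, sketched below. The function name and the returned labels are hypothetical, and the rule considers only the file extension; in practice the content of the file and the kind of question also matter.

```python
from pathlib import Path

# Extensions the guidance above associates with Text-to-SQL search.
STRUCTURED_EXTENSIONS = {".csv", ".xlsx"}


def suggest_search_method(filename: str) -> str:
    """Suggest a search type from the file extension alone (a rough heuristic)."""
    suffix = Path(filename).suffix.lower()
    return "text-to-sql" if suffix in STRUCTURED_EXTENSIONS else "semantic"


suggestions = {
    name: suggest_search_method(name)
    for name in ["sales_2024.CSV", "headcount.xlsx", "handbook.pdf", "notes.md"]
}
```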

Datasource in the AI Step vs. Data Search Step

Choosing the right retrieval approach depends on your use case. Here’s a comparison to help you decide:

	Data Search Step	Datasource in the AI Step
Retrieval type	Single-hop	Multi-hop
How it works	The full user input is embedded as a query; embedding search is performed; matching chunks are passed to the LLM	The LLM selects which retrieval tool(s) to call, how many times, and in what order based on the query and context
Number of search calls	Always exactly one	One or more — the LLM decides
Best for	Simple queries, linear workflows (non-chat), batch processing	Complex queries, conversational chat, agentic workflows requiring reasoning
Accuracy	Good for straightforward lookups	Higher accuracy — the LLM can retrieve additional context if the first results are insufficient
Speed	Faster	Slower (additional tool calls add latency)
Cost	Lower	Higher (each additional tool call incurs cost)
Recommended when	Speed and cost are priorities; queries are simple and predictable	Answer quality and accuracy are priorities; use case involves multi-turn chat or complex reasoning

When to use the Data Search Step

Use the Data Search Step for simple, predictable retrieval scenarios — such as keyword-triggered lookups, document summarization workflows, or batch processing pipelines where the query structure is known and consistent.

When to use Datasource in the AI Step

Use Datasource in the AI Step when you need the AI to reason about what to search for and how much information to retrieve. This is especially valuable for:

Conversational agents / chatbots where follow-up questions require multiple rounds of retrieval
Research or synthesis tasks that span multiple knowledge sources
Complex queries where a single retrieval pass may not return sufficient context

Best Practices

✅ Write clear datasource descriptions. The LLM uses these to decide which sources to search — descriptive names like “HR Policy Documents (2023–2025)” outperform vague ones like “DB1”.
✅ Only disable retrieval tools you’re confident you won’t need. Removing tools limits what the LLM can do at search time.
✅ Always ensure a retrieval tool or MCP server is configured. An AI step without any retrieval configuration cannot reach your knowledge base and is likely to hallucinate when asked knowledge-base questions.
✅ Use the Data Search Step for simple workflows. Reserve Datasource in the AI Step for accuracy-critical or conversational use cases.
✅ Test with representative queries. Multi-hop retrieval is powerful but may produce unexpected behavior with ambiguous or out-of-scope queries — test thoroughly before deploying.
