Optimized ingestion settings ensure your data is prepared for efficient search and retrieval by the LLM. You can adjust these settings to optimize data ingestion based on your specific use case and document type.

Scan Document for Images

This feature allows the system to generate descriptions for images found within your documents, making image content discoverable through search.
💡 Note: This feature is enabled by default.
An OCR (Optical Character Recognition) solution is used to extract text from images. This extracted text, along with generated image descriptions, enhances search capabilities by indexing visual content.

Configure Text-to-SQL for Structured Data

Text-to-SQL allows you to interact with your structured data (specifically .csv and .xlsx files) using natural language queries, which are then translated into SQL.

When to Use Text-to-SQL

Use Text-to-SQL when you need to ask precise, quantitative questions about your structured data, such as:
  • “What is the revenue generated by product A for the year to date?”
  • “How many leads have we generated for the last year?”

How to Use Text-to-SQL

1. Set Up Your Data Source

Begin by setting up your data source with the relevant .csv or .xlsx files. The data source can also contain other file types.

2. Activate SQL Indexing

In the Ingestion settings for your data source, activate the SQL indexing option. For your .csv/.xlsx files, choose one of the following:
  • Semantic: When selected, only vectors will be generated for the structured files. This enables text search based on meaning and context. Choose this for semi-structured tabular data where natural language understanding is key.
    💡 Example: For a survey documented in an Excel file with open-ended customer answers, use Semantic. Question: “What are the common complaints customers have about Agent Builder?”
  • SQL Only: When selected, the file will be indexed as SQL only, without enabling semantic search. Choose this for highly structured data where precise, quantitative answers are expected.
    💡 Example: Question: “How many complaints are registered as High priority?”
  • Both: When selected, both vectors and SQL indexes will be generated for the structured files. This can improve retrieval accuracy, but dual retrieval increases latency and cost.
    💡 Note: Both is the default option for the Text-to-SQL setting. For all other file types within the same data source, only vector embeddings (semantic search) will be generated.
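
To make the SQL Only case more concrete, the sketch below shows how a structured file can be answered with a precise SQL query once its rows sit in a table. It is an illustration only and does not describe how the platform builds its SQL index: the CSV content, the `complaints` table name, and the `load_csv_into_sqlite` helper are all hypothetical, and SQLite is used simply because it ships with Python.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content standing in for an uploaded complaints file.
CSV_TEXT = """id,priority,summary
1,High,Login fails after password reset
2,Low,Typo on the pricing page
3,High,Export to PDF times out
4,Medium,Slow dashboard loading
"""


def load_csv_into_sqlite(csv_text: str, table: str) -> sqlite3.Connection:
    """Load CSV rows into an in-memory SQLite table (one TEXT column per header)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    conn = sqlite3.connect(":memory:")
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)
    return conn


conn = load_csv_into_sqlite(CSV_TEXT, "complaints")

# "How many complaints are registered as High priority?" expressed as SQL.
high_priority_count = conn.execute(
    "SELECT COUNT(*) FROM complaints WHERE priority = ?", ("High",)
).fetchone()[0]
print(high_priority_count)
```

A count like this is exact because it filters on a column value, which is why highly structured data tends to suit SQL indexing better than vector search.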

3. Use in the Agent

In your Agent’s workflow, activate the Text-to-SQL retrieval option in the Data Source step. By default, this option is disabled, and the Data Source relies on Semantic retrieval. When Text-to-SQL retrieval is enabled, the step queries the .csv and .xlsx files from the connected Data Source.
💡 Example: To retrieve all sales records from an Excel file where sales exceed $5,000 and the date is within Q1 2025, a SQL query like SELECT * FROM sales WHERE amount > 5000 AND date >= '2025-01-01' AND date < '2025-04-01' provides an efficient and precise solution by leveraging the file’s structured format.
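
The Q1 2025 sales example can be tried locally with the sketch below. It assumes a hypothetical `sales` table with `amount` and ISO-formatted `date` columns and uses SQLite from the Python standard library; the query an LLM actually generates for your file may differ. A half-open date range is used so that January, February, and March are all covered.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL, date TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        (1, 7200.0, "2025-01-15"),
        (2, 4800.0, "2025-02-03"),  # in Q1 but below the $5,000 threshold
        (3, 9100.0, "2025-03-28"),
        (4, 6400.0, "2025-04-02"),  # above the threshold but outside Q1
    ],
)

# ISO dates (YYYY-MM-DD) sort lexicographically, so string comparison works here.
q1_rows = conn.execute(
    "SELECT id, amount, date FROM sales "
    "WHERE amount > 5000 AND date >= '2025-01-01' AND date < '2025-04-01' "
    "ORDER BY date"
).fetchall()
print(q1_rows)
```

Only rows 1 and 3 satisfy both conditions. A pattern such as `date LIKE '2025-01%'` would match January alone, so a range comparison is the safer way to express a full quarter.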
💡 Hint: If you want to enable both Semantic and SQL search types (e.g., when your Data Source contains both .csv/.xlsx files and other file types, or if you chose the Both option for your structured files), you can drag and drop the Data Source step twice onto the canvas. Configure one copy to use Semantic retrieval and the other to use SQL retrieval, then connect both to the LLM.

Text-to-SQL Agent Settings

Model Selection

You need to select the LLM that will be used in the agentic workflow for Text-to-SQL. The LLM is fully responsible for SQL query generation. We recommend using “High Quality Capable” models to achieve stable and accurate results. Recommended models (tested):
  • High Quality (best performance):
    • Claude 4 Sonnet
    • GPT 4.1
    • Claude 3.7 Sonnet
    • GPT 4o
  • Sufficient Quality:
    • GPT 4.1 mini
    • Claude 3.5 Sonnet
    • GPT 4o mini
You can enable Fuzzy search to allow the system to match records even when the user’s query contains misspellings. Note that Fuzzy search can increase the complexity of query generation.

When the Agent runs with the configured Data Source step, it produces results based on the chosen settings. The Text-to-SQL retrieval flow outputs a structured result from the SQL query that is dynamically generated from the user’s natural language input.

The choice between semantic retrieval and SQL retrieval depends on the query type, data structure, scalability needs, and maintenance considerations. For structured files like .csv and .xlsx with precise, structured queries, SQL retrieval is preferred for its efficiency, accuracy, and ability to answer quantitative questions. For natural language queries, or for text fields that require semantic understanding, semantic retrieval is advantageous. In practice, combining both methods often provides the most flexible and effective solution, especially for agents that interact with users through natural language.
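
As a rough illustration of what the Fuzzy search option mentioned above is meant to achieve, the sketch below maps a misspelled term to the closest known value before it would be used in a query. This is not the platform's implementation: the `KNOWN_PRODUCTS` list and the `resolve_term` helper are hypothetical, and the matching relies on `difflib` from the Python standard library.

```python
import difflib

# Hypothetical distinct values from a "product" column in a structured file.
KNOWN_PRODUCTS = ["Agent Builder", "Data Connector", "Model Gateway"]


def resolve_term(user_term, known_values, cutoff=0.6):
    """Return the closest known value for a possibly misspelled term, or None."""
    matches = difflib.get_close_matches(user_term, known_values, n=1, cutoff=cutoff)
    return matches[0] if matches else None


resolved = resolve_term("Agnet Bilder", KNOWN_PRODUCTS)
print(resolved)
```

The `cutoff` parameter shows the trade-off: a lower value tolerates more typos but raises the chance of matching the wrong record.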

Configure Vector Database

The Vector Database you choose determines which search capabilities are available, in particular whether hybrid search can be used.

Available Options

  • Airia DB: This proprietary database supports only semantic search and generates dense vectors for your content. This is the default vector database option.
  • Pinecone BYOK (Bring Your Own Key): Depending on the index you provide in your Pinecone database, it can enable Hybrid Search. If the index supports hybrid search (i.e., it’s configured for both dense and sparse vectors), Airia will, by default, generate both sparse and dense vectors in your Pinecone database to enable this capability.
  • Weaviate: Hybrid Search is always available with Weaviate. Weaviate applies fusion algorithms to rank the combined results from keyword (lexical) and semantic searches, which improves relevance. You can learn more about fusion algorithms in the Weaviate blog.
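
To give a sense of how a fusion step can merge keyword and semantic results, here is a minimal sketch of reciprocal rank fusion, a generic and widely used technique. It is not presented as the exact algorithm Weaviate or Pinecone applies; the document IDs, the `reciprocal_rank_fusion` function, and the constant `k=60` are assumptions chosen for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one fused ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents ranked well by both searches rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a keyword search and a semantic search.
keyword_hits = ["doc_b", "doc_a", "doc_d"]
semantic_hits = ["doc_a", "doc_c", "doc_b"]

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)
```

Documents that appear in both lists (`doc_a` and `doc_b` here) outrank those found by only one search, which is the main benefit hybrid search aims for.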