Scan Document for Images
This feature allows the system to generate descriptions for images found within your documents, making image content discoverable through search.
💡 Note: This feature is enabled by default.
An OCR (Optical Character Recognition) solution is used to extract text from images. This extracted text, along with generated image descriptions, enhances search capabilities by indexing visual content.
Select PDF Parser
💡 Note: This feature is currently available to selected customers who are granted early access. Please contact your sales representative if you wish to also receive early access. Capabilities and pricing for parsers are subject to change.
Airia uses different PDF parsers to extract content from your PDF documents. Selecting the correct parser ensures optimal data extraction and searchability, especially for complex layouts.
- Basic: Default option. Optimized for simple documents such as plain-text PDFs and repetitive simple layouts.
- Advanced: Optimized for images, math expressions, tables, scanned documents, and complex layouts.
- Universal: Handles diverse layouts, handwritten notes, and noisy scans.
Edit the Selected Parser
You can change the PDF parser for your data source. Go to the option menu next to your data source and click Edit. From the edit screen, select a new parser. This new parser will be applied to all newly added or updated files within the data source after sync. To apply the new parser to all existing files, you must create a new data source with the desired parser setting.
Configure Text-to-SQL for Structured Data
Text-to-SQL allows you to interact with your structured data (specifically .csv and .xlsx files) using natural language queries, which are then translated into SQL.
When to Use Text-to-SQL
Use Text-to-SQL when you need to ask precise, quantitative questions about your structured data, such as:
- “What is the revenue generated by product A for the year to date?”
- “How many leads have we generated for the last year?”
How to Use Text-to-SQL
1. Set Up Your Data Source
Begin by setting up your data source with the relevant .csv or .xlsx files. The data source can also contain other file types.
2. Activate SQL Indexing
In the Ingestion settings for your data source, activate the SQL indexing option. For your .csv/.xlsx files, choose one of the following:
- Semantic: When selected, only vectors will be generated for the structured files. This enables text search based on meaning and context. Choose this for semi-structured tabular data where natural language understanding is key.
  💡 Example: For a survey documented in an Excel file with open-ended customer answers, use Semantic. Question: “What are the common complaints customers have about Agent Builder?”
- SQL Only: When selected, the file will be indexed as SQL only, without enabling semantic search. Choose this for highly structured data where precise, quantitative answers are expected.
  💡 Example: Question: “How many complaints are registered as High priority?”
- Both: When selected, both vectors and SQL indexes will be generated for the structured files. This can enhance retrieval accuracy, but it increases latency and cost because both retrieval paths are used.
💡 Note: Both is the default option for the Text-to-SQL setting. For all other file types within the same data source, only vector embeddings (semantic search) will be generated.
3. Check Ingestion Status for Structured Files
For .csv/.xlsx files, you can monitor their ingestion status directly within the data source view. The status indicates the success of both SQL and vector indexing:
- Ready: Both the SQL index and vector embeddings have been successfully created.
- Failed: Neither the SQL index nor the vector embeddings could be created. You can check the reason for failure in the Failed files logs (indicated by a red button at the top of the page).
- Partial: One of the two indexes (either SQL or vector) has failed, while the other was successful. Hover over the “Partial” status to see which specific index is ready and which has failed. The reason for the failed index can also be found in the Failed files logs.
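The three statuses above amount to a simple combination of two per-index outcomes. The following is a minimal sketch of that logic for a file indexed with both options; the names `IngestionStatus` and `combined_status` are hypothetical and are not part of the product.

```python
from enum import Enum


class IngestionStatus(Enum):
    """File-level statuses as described in the list above (names are illustrative)."""
    READY = "Ready"
    PARTIAL = "Partial"
    FAILED = "Failed"


def combined_status(sql_index_ok: bool, vector_index_ok: bool) -> IngestionStatus:
    """Combine the SQL and vector indexing outcomes into one status."""
    if sql_index_ok and vector_index_ok:
        return IngestionStatus.READY
    if not sql_index_ok and not vector_index_ok:
        return IngestionStatus.FAILED
    # Exactly one of the two indexes succeeded.
    return IngestionStatus.PARTIAL


if __name__ == "__main__":
    for sql_ok, vec_ok in [(True, True), (True, False), (False, True), (False, False)]:
        print(f"sql={sql_ok} vector={vec_ok} -> {combined_status(sql_ok, vec_ok).value}")
```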
4. Use in the Agent
In your Agent’s workflow, activate the Text-to-SQL retrieval option in the Data Source step. By default, this option is disabled, and the Data Source relies on Semantic retrieval. When enabled, Text-to-SQL search queries only the .csv and .xlsx files in the connected Data Source.
💡 Example: To retrieve all sales records from an Excel file where sales exceed $5,000 and the date is within Q1 2025, a SQL query like SELECT * FROM sales WHERE amount > 5000 AND date >= '2025-01-01' AND date < '2025-04-01' provides an efficient and precise solution by leveraging the file’s structured format.
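To make the idea concrete, here is a small, self-contained sketch that runs a query of this kind locally with Python's built-in sqlite3 module. It only illustrates the kind of SQL an LLM might generate; it is not how the platform stores or queries your files. The table name, column names, and sample rows are assumptions, and dates are assumed to be ISO-formatted strings (YYYY-MM-DD). Note that a half-open date range covers the whole quarter, whereas a prefix pattern such as LIKE '2025-01%' would match January only.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for a .csv or .xlsx file.
SALES_CSV = """order_id,amount,date
1,7200,2025-01-15
2,4800,2025-02-03
3,9100,2025-03-28
4,6500,2025-04-02
"""


def load_sales(conn: sqlite3.Connection, csv_text: str) -> None:
    """Create a 'sales' table and fill it from CSV text."""
    conn.execute("CREATE TABLE sales (order_id INTEGER, amount REAL, date TEXT)")
    rows = [
        (int(r["order_id"]), float(r["amount"]), r["date"])
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)


def q1_2025_large_sales(conn: sqlite3.Connection) -> list:
    """Sales above 5000 dated within Q1 2025 (ISO date strings compare correctly as text)."""
    query = """
        SELECT order_id, amount, date
        FROM sales
        WHERE amount > 5000
          AND date >= '2025-01-01' AND date < '2025-04-01'
        ORDER BY order_id
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    load_sales(connection, SALES_CSV)
    for row in q1_2025_large_sales(connection):
        print(row)
```

In this sample, order 2 is excluded by the amount filter and order 4 by the date range, so only orders 1 and 3 are returned.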
💡 Hint: If you want to enable both Semantic and SQL search types (e.g., when your Data Source contains both .csv/.xlsx files and other file types, or if you chose the Both option for your structured files), you can drag and drop the Data Source step twice onto the canvas. Configure one copy to use Semantic retrieval and the other to use SQL retrieval, then connect both to the LLM.
Text-to-SQL Agent Settings
Model Selection
You need to select the LLM that will be used in the agentic workflow for Text-to-SQL. The LLM is fully responsible for SQL query generation. We recommend using “High Quality Capable” models to achieve stable and accurate results.
Recommended models (tested):
- High Quality (best performance):
  - Claude 4 Sonnet
  - GPT 4.1
  - Claude 3.7 Sonnet
  - GPT 4o
- Sufficient Quality:
  - GPT 4.1 mini
  - Claude 3.5 Sonnet
  - GPT 4o mini
Fuzzy Search
You can enable Fuzzy search to allow the system to search through records even if there are misspellings in the user’s query. Note that Fuzzy search can increase query generation complexity.
When the Agent runs with the configured Data Source step, it will produce results based on the chosen settings. The Text-to-SQL retrieval agentic flow will output a structured result from the dynamically generated SQL query, based on the user’s natural language input.
The choice between semantic retrieval and SQL retrieval for agents depends on the query type, data structure, scalability needs, and maintenance considerations. For structured files like .csv and .xlsx with precise, structured queries, SQL retrieval is preferred for its efficiency, accuracy, and ability to answer quantitative questions. For natural language queries or when dealing with text fields requiring semantic understanding, semantic retrieval is advantageous. In practice, combining both methods often provides the most flexible and effective solution, especially for agents interacting with users through natural language.
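As a rough illustration of what the Fuzzy search option described above is meant to tolerate, the sketch below maps a misspelled term from a user's question to the closest known value in a column before it would be used in a query. It uses Python's standard difflib module and is not the platform's implementation; the candidate values, the function name `resolve_term`, and the similarity cutoff are all assumptions.

```python
import difflib
from typing import Optional

# Hypothetical distinct values from a text column in a structured file.
KNOWN_PRODUCTS = ["Agent Builder", "Data Source", "Model Library"]


def resolve_term(term: str, candidates: list, cutoff: float = 0.6) -> Optional[str]:
    """Return the candidate closest to a possibly misspelled term, or None."""
    # Compare case-insensitively, but return the original spelling.
    lowered = {c.lower(): c for c in candidates}
    matches = difflib.get_close_matches(term.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None


if __name__ == "__main__":
    print(resolve_term("Agent Bulder", KNOWN_PRODUCTS))
    print(resolve_term("zzzzqqq", KNOWN_PRODUCTS))
```

A higher cutoff makes matching stricter and reduces false matches, at the cost of missing heavier misspellings.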
Configure Vector Database
The chosen Vector Database significantly impacts search capabilities, especially regarding hybrid search.
Available Options
- Airia DB: This proprietary database supports only semantic search and generates dense vectors for your content. This is the default vector database option.
- Pinecone BYOK (Bring Your Own Key): Depending on the index you provide in your Pinecone database, it can enable Hybrid Search. If the index supports hybrid search (i.e., it’s configured for both dense and sparse vectors), Airia will, by default, generate both sparse and dense vectors in your Pinecone database to enable this capability.
- Weaviate: Hybrid Search is always available with Weaviate. Weaviate applies fusion algorithms to rank the combined results of keyword (lexical) and semantic searches, enhancing relevance. You can learn more about fusion algorithms in the Weaviate blog.
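To give a sense of how hybrid search can merge keyword and semantic results, here is a simplified score-normalizing fusion sketch. It is an illustration only: Weaviate's actual fusion algorithms and defaults may differ, and the function names, the `alpha` weighting parameter, and the sample scores are assumptions.

```python
def min_max_normalize(scores: dict) -> dict:
    """Rescale scores to the range [0, 1] so the two result lists are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(keyword_scores: dict, vector_scores: dict, alpha: float = 0.5) -> list:
    """Blend normalized scores; alpha weights the semantic (vector) side."""
    kw = min_max_normalize(keyword_scores)
    vec = min_max_normalize(vector_scores)
    fused = {
        doc: alpha * vec.get(doc, 0.0) + (1 - alpha) * kw.get(doc, 0.0)
        for doc in set(kw) | set(vec)
    }
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    keyword = {"doc_a": 12.0, "doc_b": 4.0, "doc_c": 8.0}   # e.g. lexical scores
    semantic = {"doc_b": 0.9, "doc_c": 0.8, "doc_d": 0.5}    # e.g. vector similarities
    for doc, score in fuse(keyword, semantic):
        print(doc, round(score, 3))
```

In this sample, doc_c ranks first because it scores reasonably well in both lists, even though it is not the top hit in either one.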