So, here we go...
The project is in its very early stages and currently does the following:
It allows interaction with your file (currently only .csv, but support for more formats like XML and JSON is being added).
- The file is stored as an SQL table.
- When a user enters a query, an LLM is provided with two key inputs:
- The column names.
- The user query.
- If the column names are expressive enough, the LLM can generate an accurate SQL query.
- This SQL query is then executed on the stored data, and the results are made available for download.
- Providing expressive column names helps the LLM map the user’s natural language query to the correct columns.
- If we can also supply the data types of the columns, the query generation improves significantly, leading to better results.
- Based on this observation, we could perform schema analysis to infer data types and supply them alongside the column names; a rough sketch of the whole pipeline follows this list.
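Here is a minimal sketch of how this file-to-SQL pipeline could look, assuming pandas and an in-memory SQLite store. `generate_sql` is a placeholder for whatever LLM client is used, and all names here are illustrative, not the project's actual API:

```python
import sqlite3
import pandas as pd

def load_csv(path: str, conn: sqlite3.Connection, table: str = "data") -> dict:
    """Store the CSV as an SQL table and return {column: inferred dtype}."""
    df = pd.read_csv(path)
    df.to_sql(table, conn, if_exists="replace", index=False)
    return {col: str(dtype) for col, dtype in df.dtypes.items()}

def build_prompt(schema: dict, user_query: str, table: str = "data") -> str:
    """Give the LLM the column names (with inferred dtypes) and the user query."""
    cols = "\n".join(f"- {name} ({dtype})" for name, dtype in schema.items())
    return (
        f"Table `{table}` has these columns:\n{cols}\n\n"
        f"Write a single SQLite query answering: {user_query}\n"
        "Return only SQL."
    )

def run_generated_sql(conn: sqlite3.Connection, sql: str) -> pd.DataFrame:
    """Execute the LLM's SQL on the stored data; the resulting DataFrame
    can then be exported (e.g. df.to_csv) for download."""
    return pd.read_sql_query(sql, conn)

# Usage (generate_sql stands in for the actual LLM call):
# conn = sqlite3.connect(":memory:")
# schema = load_csv("sales.csv", conn)
# prompt = build_prompt(schema, "total revenue per region in 2023")
# results = run_generated_sql(conn, generate_sql(prompt))
```

Including the inferred dtypes in the prompt is what lets the LLM pick sensible comparisons and aggregations (e.g. `SUM` over a numeric column rather than a string one).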
It helps generate queries for MongoDB (currently only native MongoDB, not Azure Cosmos DB's MongoDB API).
- We ask the user for the connection string, database name, and collection names (supporting multiple collections).
- We perform schema analysis to understand the data types better.
- We feed the schema analysis and user query to the LLM for query generation.
- Similar to file-based querying, expressive field names and known data types significantly improve query generation (see the sketch below).
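A rough sketch of what the schema analysis step could look like, assuming pymongo and MongoDB's `$sample` aggregation stage. The helper names are hypothetical, and nested documents are not flattened here:

```python
from collections import defaultdict
from pymongo import MongoClient

def infer_schema(client: MongoClient, db_name: str, coll_name: str,
                 sample_size: int = 100) -> dict:
    """Sample documents and record which types each field takes."""
    field_types = defaultdict(set)
    pipeline = [{"$sample": {"size": sample_size}}]
    for doc in client[db_name][coll_name].aggregate(pipeline):
        for field, value in doc.items():
            field_types[field].add(type(value).__name__)
    return {field: sorted(types) for field, types in field_types.items()}

def build_prompt(schemas: dict, user_query: str) -> str:
    """Feed the per-collection schema summaries and the user query to the LLM."""
    parts = []
    for coll, schema in schemas.items():
        fields = "\n".join(f"- {f}: {'/'.join(t)}" for f, t in schema.items())
        parts.append(f"Collection `{coll}`:\n{fields}")
    return (
        "\n\n".join(parts)
        + f"\n\nWrite a MongoDB aggregation pipeline answering: {user_query}"
    )

# Usage (connection string, database, and collection names come from the user):
# client = MongoClient("mongodb://localhost:27017")
# schemas = {c: infer_schema(client, "shop", c) for c in ["orders", "users"]}
# prompt = build_prompt(schemas, "average order value per user this month")
```

Sampling rather than scanning the whole collection keeps schema analysis cheap while still surfacing the field names and types the LLM needs.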
Why do I ask users for the database and collection names instead of relying on LLMs or RAG techniques?
After reviewing multiple articles on text-to-query systems (notably, Uber's QueryGPT case study), I found that retrieving the correct database and collection names is challenging for LLMs and RAG-based approaches, primarily because those names are often not expressive enough. For now, I have chosen to leave this task to users to ensure accuracy.