Experimental natural language retrievers using duck db #15642

colombod · 2024-08-26T11:16:23Z

Description

This pull request adds support for natural language retrievers on top of DuckDB.
Compared to other approaches, this uses DuckDB to perform KQL queries instead of running Python code. This is important because it addresses the security concerns of running arbitrary code. The DuckDB session is an in-memory one, and the original data cannot be altered by the retriever.

The schema is also used to generate a description of the dataset and what it could be used for. The description and ontology are then used to calculate a ranking score against the query bundle.
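The idea of scoring a schema description against a query can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: the commit messages indicate the PR asks an LLM for the score, while this sketch swaps in a simple word-overlap measure so that it runs without a model. `describe_schema` and `overlap_score` are hypothetical names.

```python
import re


def describe_schema(table, columns):
    """Build a one-line description from a table name and a {column: type} mapping."""
    cols = ", ".join(f"{name} ({dtype})" for name, dtype in columns.items())
    return f"Table '{table}' with columns: {cols}."


def overlap_score(description, query):
    """Stand-in relevance score in [0, 1]: share of query words found in the description."""
    desc_words = set(re.findall(r"[a-z0-9]+", description.lower()))
    query_words = set(re.findall(r"[a-z0-9]+", query.lower()))
    if not query_words:
        return 0.0
    return len(query_words & desc_words) / len(query_words)


description = describe_schema("cities", {"city": "VARCHAR", "population": "INTEGER"})
relevant = overlap_score(description, "population of each city")
unrelated = overlap_score(description, "average salary")
print(description, relevant, unrelated)
```

A retriever built this way can attach the score to each returned node, so that results from several datasets can be ordered by how well each schema fits the query.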

New Package?

Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?

Yes
No

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

Yes
No

Type of Change

Please delete options that are not relevant.

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

Added new unit/integration tests
Added new notebook (that tests end-to-end)
I stared at the code and made sure it makes sense

Suggested Checklist:

I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have added Google Colab support for the newly added notebooks.
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
I ran make format; make lint to appease the lint gods

nerdai

Thanks @colombod. Do you want to include this in our llama-index-experimental package or create your own retrievers integration?

It looks like you might be trying to do the latter but incorporate it into the llama-index-experimental package, which is probably not the way to go. Specifically, if we put it in the experimental package, then we wouldn't need the pyproject.toml, nor would we need the llama-index-retrievers-natural-language subfolder.

colombod · 2024-08-26T23:04:08Z

Hi, thank you for the comment. I'm not sure I'm totally following; my idea was to add it to the llama-index-experimental package. What do you suggest?

llama-index-experimental/pyproject.toml

logan-markewich · 2024-09-03T17:35:11Z

@colombod Yeah, as Andrei mentioned, if you want this to be in the experimental package, we can make a folder like llama_index/experimental/retriever/duckdb? (The current folder name is what I would use for a standalone integration package.)

...ma_index/experimental/retrievers/llama-index-retrievers-natural-language/nl_csv_retriever.py

...x/experimental/retrievers/llama-index-retrievers-natural-language/nl_data_frame_retierver.py

colombod · 2024-10-07T12:28:59Z

This is more of a natural language retriever than a DuckDB one.

colombod · 2024-10-14T14:44:37Z

@logan-markewich What is the issue with the build? Can I get a hint or some help?

This commit adds the following files:

- `llama-index-retrievers-natural-language/__init__.py`: Imports `PandasQueryEngine` and `PandasInstructionParser`.
- `llama-index-retrievers-natural-language/BUILD`: Adds Python sources.
- `llama-index-retrievers-natural-language/nl_csv_retriever.py`: Defines the `NLCSVRetriever` class, which retrieves data from a CSV file using natural language queries.
- `llama-index-retrievers-natural-language/nl_data_frame_retierver.py`: Defines the `NLDataframeRetriever` class, which retrieves data from a pandas DataFrame using natural language queries.
- `llama-index-retrievers-natural-language/nl_json_retriever.py`: Defines the `NLJsonRetriever` class, which retrieves data from a JSON file using natural language queries.

These retrievers provide capabilities to retrieve data based on natural language queries. They utilize the Llama Index query engine and support querying CSV, JSON, and pandas DataFrames.

This commit adds a new feature to LlamaIndex that enables the use of natural language to retrieve information from pandas DataFrames, CSV files, and JSON objects. Instead of running Python code, this feature uses DuckDB to perform KQL queries, addressing security concerns when running arbitrary code. The DuckDB session is in memory and does not alter the original data. Additionally, the schema is used to generate a description of the dataset and its potential uses. This description and ontology are then used to calculate a ranking score against the query bundle. These changes enhance LlamaIndex's capabilities by providing an alternative approach for retrieving information using natural language queries.

This commit adds a new result ranking prompt to the NLDataframeRetriever class. The prompt allows users to provide a schema and query, and asks them to rate the relevance of the schema in modeling the domain of the query. The relevance must be a number between 0 and 1, where 1 indicates high relevance and 0 indicates low relevance.

The significant changes include:

- Added DEFAULT_RESULT_RANKING_TMPL constant for the result ranking template
- Added DEFAULT_RESULT_RANKING_PROMPTROMPT constant for the result ranking prompt template
- Updated NLDataframeRetriever constructor to accept a result_ranking_prompt parameter
- Initialized self._result_ranking_prompt with either the provided parameter or the default prompt template
- Modified NLDataframeRetriever.complete() method to use self._result_ranking_prompt as part of the LLM completion request

These changes allow users of NLDataframeRetriever to easily rank the relevance of schemas in modeling their queries, providing more accurate results.
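The pattern this commit message describes (a default ranking prompt that callers can override, plus a relevance value between 0 and 1) might look roughly like the sketch below. It is an assumption-laden illustration rather than the merged code: the template wording, the `RankingPromptSketch` class, and the `complete` callable standing in for an LLM are all hypothetical. Only the `result_ranking_prompt` parameter name and the `_result_ranking_prompt` attribute come from the commit message.

```python
# Hypothetical template text; the wording of the real default prompt is not shown in this PR page.
DEFAULT_RANKING_TMPL = (
    "Schema:\n{schema}\n\nQuery:\n{query}\n\n"
    "Rate from 0 to 1 how relevant the schema is to the query. "
    "Reply with the number only."
)


class RankingPromptSketch:
    """Hypothetical stand-in showing a configurable ranking prompt."""

    def __init__(self, complete, result_ranking_prompt=None):
        # `complete` is any callable that maps a prompt string to a reply string.
        self._complete = complete
        # Fall back to the default template when no prompt is supplied.
        self._result_ranking_prompt = result_ranking_prompt or DEFAULT_RANKING_TMPL

    def score(self, schema, query):
        prompt = self._result_ranking_prompt.format(schema=schema, query=query)
        reply = self._complete(prompt)
        try:
            value = float(reply.strip())
        except ValueError:
            return 0.0  # Treat an unparseable reply as "not relevant".
        return min(1.0, max(0.0, value))  # Clamp into the required [0, 1] range.


# Usage with a fake model that always answers 0.8.
ranker = RankingPromptSketch(lambda prompt: " 0.8 ")
print(ranker.score("cities(city, population)", "largest city"))
```

Clamping and the fallback to 0.0 are design choices of this sketch: model replies are free text, so the caller cannot rely on the reply already being a valid number in range.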

This commit adds new natural language retrievers to the codebase. The `llama-index-retrievers-natural-language` package has been removed, and a new package called `natrual_language` has been created.

Significant changes:

- Deleted the `llama-index-retrievers-natural-language` package
- Added the `natrual_language` package
- Renamed the `BUILD` file from `llama-index-retrievers-natural-language` to `natrual_language`
- Renamed the following files from `llama-index-retrievers-natural-language` to `natrual_language`:
  - nl_csv_retriever.py
  - nl_data_frame_retierver.py
  - nl_json_retriever.py
  - README.md

The new natural language retrievers provide capabilities for retrieving data using natural language queries. This change enhances the functionality of the codebase by introducing more flexible and user-friendly retrieval options.

logan-markewich · 2025-02-17T23:35:21Z

@colombod great, it merged! Now just need to cook up an example notebook ;)

masci · 2025-02-18T14:49:45Z

@logan-markewich @colombod Are we still in time to fix the typo in the integration name? It will affect the import paths, so it would be nice to fix it before this spreads.

nerdai reviewed Aug 26, 2024

logan-markewich reviewed Sep 3, 2024

llama-index-experimental/pyproject.toml

logan-markewich reviewed Sep 3, 2024

...ma_index/experimental/retrievers/llama-index-retrievers-natural-language/nl_csv_retriever.py

logan-markewich reviewed Sep 3, 2024

...x/experimental/retrievers/llama-index-retrievers-natural-language/nl_data_frame_retierver.py

colombod force-pushed the natural_language_retriver branch from 28772db to f319e51 Compare September 9, 2024 01:35

colombod force-pushed the natural_language_retriver branch from 8c6045d to 5553944 Compare October 7, 2024 12:20

colombod marked this pull request as ready for review October 7, 2024 12:28

dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Oct 7, 2024

colombod force-pushed the natural_language_retriver branch from e0a1910 to 2377bef Compare October 7, 2024 19:57

colombod added 6 commits October 14, 2024 15:46

spelling mistakes

4055ec3

make prompt configurable

415ba0d

colombod force-pushed the natural_language_retriver branch from 2377bef to d86da4f Compare October 14, 2024 14:46

Merge branch 'main' into natural_language_retriver

06a7667

logan-markewich approved these changes Feb 17, 2025

dosubot bot added the lgtm This PR has been approved by a maintainer label Feb 17, 2025

logan-markewich enabled auto-merge (squash) February 17, 2025 23:28

vbump

2c15ff8

logan-markewich merged commit 8d8c823 into run-llama:main Feb 17, 2025
11 checks passed
