Experimental natural language retrievers using duck db #15642

Merged
merged 8 commits into from
Feb 17, 2025

Contributor

@colombod colombod commented Aug 26, 2024

Description

This pull request adds support for natural language retrievers on top of DuckDB.
Compared to other approaches, this one uses DuckDB to perform KQL queries instead of running Python code. This is important because it addresses the security concerns of running arbitrary code. The DuckDB session is an in-memory one, and the original data cannot be altered by the retriever.
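
The isolation argument above can be sketched with any in-memory database engine. The example below is only an illustration and is not the code in this PR: it uses Python's built-in `sqlite3` module as a stand-in for DuckDB, it uses SQL as the query language, and `query_snapshot`, the `data` table name, and the sample rows are all hypothetical.

```python
import sqlite3


def query_snapshot(rows, columns, sql):
    """Copy rows into a throwaway in-memory database and run one SELECT on it.

    The query only ever sees the copy, so it cannot modify `rows`.
    """
    # Assumption for this sketch: only read-only SELECT statements are accepted.
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are accepted")

    con = sqlite3.connect(":memory:")  # the database lives only for this call
    try:
        col_defs = ", ".join(f'"{c}"' for c in columns)
        con.execute(f"CREATE TABLE data ({col_defs})")
        placeholders = ", ".join("?" for _ in columns)
        con.executemany(f"INSERT INTO data VALUES ({placeholders})", rows)
        return con.execute(sql).fetchall()
    finally:
        con.close()


# Hypothetical source data; it is copied, never queried directly.
source_rows = [("Alice", 34), ("Bob", 27), ("Carol", 41)]
result = query_snapshot(
    source_rows,
    ["name", "age"],
    "SELECT name FROM data WHERE age > 30 ORDER BY name",
)
```

In this sketch the generated query text is the only thing that comes from the language model, so the worst a bad query can do is fail or return the wrong rows from the copy.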

The schema is also used to generate a description of the dataset and what it could be used for. The description and ontology are then used to calculate a ranking score against the query bundle.

New Package?

Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?

  • Yes
  • No

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

  • Yes
  • No

Type of Change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

  • Added new unit/integration tests
  • Added new notebook (that tests end-to-end)
  • I stared at the code and made sure it makes sense

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added Google Colab support for the newly added notebooks.
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran make format; make lint to appease the lint gods

Contributor

@nerdai nerdai left a comment

Thanks @colombod. Do you want to include this in our llama-index-experimental package or create your own retrievers integration?

It looks like you might be trying to do the latter but incorporate it into the llama-index-experimental package, which is probably not the way to go. Specifically if we put it in the experimental package, then we wouldn't need the pyproject.toml nor would we need llama-index-retrievers-natural-language subfolder

@colombod
Contributor Author

> Thanks @colombod. Do you want to include this in our llama-index-experimental package or create your own retrievers integration?
>
> It looks like you might be trying to do the latter but incorporate it into the llama-index-experimental package, which is probably not the way to go. Specifically if we put it in the experimental package, then we wouldn't need the pyproject.toml nor would we need llama-index-retrievers-natural-language subfolder

Hi, thank you for the comment. I'm not sure I'm totally following; my idea was to add it to the llama-index-experimental package. What do you suggest?

@logan-markewich
Collaborator

@colombod Yea as Andrei mentioned, if you want this to be in the experimental package, we can make a folder like llama_index/experimental/retriever/duckdb ? (The current folder name is what I would use for a standalone integration package)

@colombod colombod force-pushed the natural_language_retriver branch from 28772db to f319e51 Compare September 9, 2024 01:35
@colombod colombod force-pushed the natural_language_retriver branch from 8c6045d to 5553944 Compare October 7, 2024 12:20
@colombod colombod marked this pull request as ready for review October 7, 2024 12:28
@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Oct 7, 2024
@colombod
Contributor Author

colombod commented Oct 7, 2024

> @colombod Yea as Andrei mentioned, if you want this to be in the experimental package, we can make a folder like llama_index/experimental/retriever/duckdb ? (The current folder name is what I would use for a standalone integration package)

This is more about a natural language retriever than a DuckDB one.

@colombod colombod force-pushed the natural_language_retriver branch from e0a1910 to 2377bef Compare October 7, 2024 19:57
@colombod
Contributor Author

@logan-markewich what is the issue with the build? Can I get a hint or some help?

This commit adds the following files:
- `llama-index-retrievers-natural-language/__init__.py`: Imports `PandasQueryEngine` and `PandasInstructionParser`.
- `llama-index-retrievers-natural-language/BUILD`: Adds Python sources.
- `llama-index-retrievers-natural-language/nl_csv_retriever.py`: Defines the `NLCSVRetriever` class, which retrieves data from a CSV file using natural language queries.
- `llama-index-retrievers-natural-language/nl_data_frame_retierver.py`: Defines the `NLDataframeRetriever` class, which retrieves data from a pandas DataFrame using natural language queries.
- `llama-index-retrievers-natural-language/nl_json_retriever.py`: Defines the `NLJsonRetriever` class, which retrieves data from a JSON file using natural language queries.

These retrievers provide capabilities to retrieve data based on natural language queries. They utilize the Llama Index query engine and support querying CSV, JSON, and pandas DataFrames.

This commit adds a new feature to LlamaIndex that enables the use of natural language to retrieve information from pandas DataFrames, CSV files, and JSON objects. Instead of using Python code, this feature utilizes DuckDB to perform KQL queries, addressing security concerns when running arbitrary code. The DuckDB session is in memory and does not alter the original data.

Additionally, the schema is used to generate a description of the dataset and its potential uses. This description and ontology are then used to calculate a ranking score against the query bundle.

These changes enhance LlamaIndex's capabilities by providing an alternative approach for retrieving information using natural language queries.

This commit adds a new result ranking prompt to the NLDataframeRetriever class. The prompt allows users to provide a schema and query, and asks them to rate the relevance of the schema in modeling the domain of the query. The relevance must be a number between 0 and 1, where 1 indicates high relevance and 0 indicates low relevance.

The significant changes include:
- Added DEFAULT_RESULT_RANKING_TMPL constant for the result ranking template
- Added DEFAULT_RESULT_RANKING_PROMPTROMPT constant for the result ranking prompt template
- Updated NLDataframeRetriever constructor to accept a result_ranking_prompt parameter
- Initialized self._result_ranking_prompt with either the provided parameter or the default prompt template
- Modified NLDataframeRetriever.complete() method to use self._result_ranking_prompt as part of the LLM completion request
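
The ranking step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `RANKING_TMPL`, `build_ranking_prompt`, and `parse_relevance` are hypothetical names, the template wording is invented for the example, and the LLM call itself is left out so that only prompt construction and score parsing are shown.

```python
import re

# Hypothetical template; the wording of the PR's actual ranking template is not shown here.
RANKING_TMPL = (
    "Schema:\n{schema}\n\n"
    "Query:\n{query}\n\n"
    "Rate how relevant the schema is for modeling the domain of the query. "
    "Answer with a single number between 0 and 1."
)


def build_ranking_prompt(schema, query, template=RANKING_TMPL):
    """Fill the template with the dataset schema and the user's query."""
    return template.format(schema=schema, query=query)


def parse_relevance(llm_text):
    """Extract the first number from the LLM reply and clamp it to [0, 1].

    Assumption for this sketch: a reply with no number counts as not relevant (0.0).
    """
    match = re.search(r"-?\d*\.?\d+", llm_text)
    if match is None:
        return 0.0
    return max(0.0, min(1.0, float(match.group())))


prompt = build_ranking_prompt(
    schema="name VARCHAR, age INTEGER",
    query="Who is older than 30?",
)
score = parse_relevance("Relevance: 0.7")
```

Clamping and the fallback to 0.0 keep a malformed reply from producing a score outside the range the prompt asks for.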

These changes allow users of NLDataframeRetriever to easily rank the relevance of schemas in modeling their queries, providing more accurate results.

This commit adds new natural language retrievers to the codebase. The `llama-index-retrievers-natural-language` package has been removed, and a new package called `natrual_language` has been created.

Significant changes:
- Deleted the `llama-index-retrievers-natural-language` package
- Added the `natrual_language` package
- Renamed the `BUILD` file from `llama-index-retrievers-natural-language` to `natrual_language`
- Renamed the following files from `llama-index-retrievers-natural-language` to `natrual_language`:
  - nl_csv_retriever.py
  - nl_data_frame_retierver.py
  - nl_json_retriever.py
  - README.md

The new natural language retrievers provide capabilities for retrieving data using natural language queries. This change enhances the functionality of the codebase by introducing more flexible and user-friendly retrieval options.

@colombod colombod force-pushed the natural_language_retriver branch from 2377bef to d86da4f Compare October 14, 2024 14:46
@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Feb 17, 2025
@logan-markewich logan-markewich enabled auto-merge (squash) February 17, 2025 23:28
@logan-markewich logan-markewich merged commit 8d8c823 into run-llama:main Feb 17, 2025
11 checks passed
@logan-markewich
Collaborator

@colombod great, it merged! Now just need to cook up an example notebook ;)

@masci
Member

masci commented Feb 18, 2025

@logan-markewich @colombod Are we still in time to fix the typo in the integration name? It will affect the import paths, so it would be nice to fix it before this spreads.

Labels
lgtm This PR has been approved by a maintainer size:L This PR changes 100-499 lines, ignoring generated files.