limit rows for queries #812

Open
wants to merge 2 commits into main

Conversation

azawalich

Adding a row limit to the SQL query executors. This is handy for environments with less RAM, to avoid OOM errors.

I'm not sure where else this argument should be added in the main functions, apart from the executors themselves. Any help?


llamapreview (bot) left a comment


Auto Pull Request Review from LlamaPReview

1. Overview

1.1 Core Changes

  • Primary purpose and scope: The primary purpose of this pull request is to enhance the memory safety of the Vanna application by adding a row limiting feature to SQL query executors. This is particularly useful in environments with limited RAM to prevent Out-Of-Memory (OOM) errors.
  • Key components modified: The core change involves modifying the run_sql methods for all supported database connectors (Snowflake, SQLite, PostgreSQL, MySQL, Oracle, Presto, Hive) in src/vanna/base/base.py.
  • Cross-component impacts: This change impacts all database connector implementations, introducing a new execution flow control for handling query results. It implicitly requires all run_sql implementations to support row limiting.
  • Business value alignment: This change aligns with the business value of improving application stability and reliability, especially in resource-constrained environments. It reduces the risk of application crashes due to OOM errors, thus enhancing user experience and reducing potential support costs.

1.2 Technical Architecture

  • System design modifications: The PR introduces a new rows_limit parameter to the run_sql_* functions across various database connectors.
  • Component interaction changes: The interaction between the query execution logic and the database connectors is modified to use fetchmany() with a limit instead of fetchall(), when a limit is provided.
  • Integration points impact: This change affects the integration points with all supported databases, ensuring a consistent approach to limiting the number of rows fetched.
  • Dependency changes and implications: No new dependencies were added. The existing database connector libraries (e.g., snowflake.connector, psycopg2, pymysql, oracledb, presto, hive) are used with their existing fetchmany() methods, but with corrected parameter names where necessary.

2. Critical Findings

2.1 Must Fix (P0🔴)

Issue: Incorrect Fetch Parameter Names in PostgreSQL connector.

  • Impact: This will lead to TypeError exceptions during runtime when using the PostgreSQL connector with a rows_limit, causing query execution to fail. This impacts the application's ability to handle large datasets and prevent OOM errors with PostgreSQL.
  • Resolution: Change numRows=rows_limit to size=rows_limit in the run_sql_postgres function.

Issue: Unbounded LLM Payload in generate_followup_questions.

  • Impact: Sending the entire DataFrame to the LLM can lead to excessive token usage, potentially causing performance issues, higher costs, and even exceeding LLM context limits. This undermines the PR's goal of memory safety.
  • Resolution: Revert to using df.head(25).to_markdown() or a configurable constant (e.g., MAX_LLM_ROWS) to limit the data sent to the LLM.

Issue: Missing Input Validation for rows_limit.

  • Impact: Without input validation, invalid values for rows_limit (e.g., negative numbers, non-integers) can be passed to the run_sql_* functions, leading to unexpected behavior or errors.
  • Resolution: Add input validation to all run_sql_* functions to ensure rows_limit is either None or a positive integer.

2.2 Should Fix (P1🟡)

Issue: Documentation Gaps in run_sql_* functions.

  • Impact: The new rows_limit parameter is not documented in the function docstrings, making it unclear for users how to use this new feature.
  • Suggested Solution: Add a description of the rows_limit parameter to the docstrings of all run_sql_* functions.

Issue: Inconsistent Error Handling and Parameter Names across connectors.

  • Impact: Inconsistencies in parameter names (e.g., size vs. numRows) and the lack of uniform validation across different database connectors can lead to confusion and potential errors.
  • Suggested Solution: Ensure all connectors use consistent parameter names (preferring the correct one for each library) and implement the same input validation logic for rows_limit.

2.3 Consider (P2🟢)

Area: Centralized Limit Handling.

  • Improvement Opportunity: Reduce code duplication and improve maintainability by creating a helper function to handle the fetching of results with or without a limit.
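
    A minimal sketch of such a helper follows (the name _fetch_results is hypothetical, not part of the PR; it folds in the P0 validation as well):

    def _fetch_results(cursor, rows_limit: int = None) -> list:
        """Fetch results from an open cursor, optionally capped at rows_limit rows."""
        # Shared validation, so every connector behaves identically
        if rows_limit is not None:
            if not isinstance(rows_limit, int) or rows_limit < 1:
                raise ValueError("rows_limit must be a positive integer or None")
            # DB-API 2.0 defines fetchmany(size), so passing the limit
            # positionally sidesteps per-driver keyword differences
            return cursor.fetchmany(rows_limit)
        return cursor.fetchall()

    Each run_sql_* method could then reduce its fetch logic to results = _fetch_results(cs, rows_limit).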

Area: Configuration Defaults.

  • Improvement Opportunity: Allow users to configure a default row limit globally through the config object, providing a more convenient way to manage this setting.
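
    One possible shape, assuming Vanna's existing pattern of a config dict passed to the constructor (the key name max_rows_returned is illustrative, not an existing setting):

    # Inside __init__:
    self.default_rows_limit = (config or {}).get("max_rows_returned")

    # At the top of each run_sql_* method, the per-call argument wins:
    if rows_limit is None:
        rows_limit = self.default_rows_limit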

Area: Performance Instrumentation.

  • Improvement Opportunity: Add metrics to track query size and whether a limit was applied, providing valuable insights into query performance and the effectiveness of the row limiting feature.
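
    Even plain logging would do as a first pass; a sketch using the standard logging module (function and logger names are illustrative):

    import logging
    import time

    logger = logging.getLogger("vanna.sql")

    def _fetch_with_metrics(cursor, rows_limit=None):
        start = time.perf_counter()
        if rows_limit is not None:
            results = cursor.fetchmany(rows_limit)
        else:
            results = cursor.fetchall()
        # Log row count, fetch time, and whether the cap was likely hit
        logger.info(
            "fetched %d rows in %.3fs (limit=%s, truncated=%s)",
            len(results),
            time.perf_counter() - start,
            rows_limit,
            rows_limit is not None and len(results) == rows_limit,
        )
        return results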

2.4 Summary of Action Items

  1. Immediate (P0): Fix incorrect parameter names (PostgreSQL), revert unbounded LLM payload, and add input validation for rows_limit.
  2. Before Merge (P1): Update documentation and ensure consistent error handling and parameter names across all connectors.
  3. Future Iteration (P2): Consider centralized limit handling, configuration defaults, and performance instrumentation.

3. Technical Analysis

3.1 Code Logic Analysis

📁 src/vanna/base/base.py - run_sql_postgres

  • Submitted PR Code:
    def run_sql_postgres(sql: str, rows_limit: int = None) -> Union[pd.DataFrame, None]:
        conn = None
        try:
            conn = connect_to_db()  # Initial connection attempt
            cs = conn.cursor()
            cs.execute(sql)
            if rows_limit != None:
                results = cs.fetchmany(numRows=rows_limit)
            else:
                results = cs.fetchall()

            # Create a pandas dataframe from the results
            df = pd.DataFrame(results, columns=[desc[0] for desc in cs.description])
            return df
  • Analysis:
    • The code attempts to add row limiting to the PostgreSQL connector using cs.fetchmany(numRows=rows_limit).
    • Issue: psycopg2's fetchmany method uses the parameter name size, not numRows. This will cause a TypeError.
    • There's no input validation for rows_limit.
  • LlamaPReview Suggested Improvements:
    def run_sql_postgres(sql: str, rows_limit: int = None) -> Union[pd.DataFrame, None]:
        conn = None
        # Input validation for rows_limit
        if rows_limit is not None:
            if not isinstance(rows_limit, int) or rows_limit < 1:
                raise ValueError("rows_limit must be a positive integer or None")
        try:
            conn = connect_to_db()  # Initial connection attempt
            cs = conn.cursor()
            cs.execute(sql)
            if rows_limit is not None:
                results = cs.fetchmany(size=rows_limit)  # Corrected parameter name
            else:
                results = cs.fetchall()

            # Create a pandas dataframe from the results
            df = pd.DataFrame(results, columns=[desc[0] for desc in cs.description])
            return df
  • Improvement rationale:
    • Technical benefits: Corrects the parameter name for fetchmany, preventing a TypeError. Adds input validation to prevent unexpected behavior.
    • Business value: Ensures the function works as intended, preventing application crashes and improving reliability.
    • Risk assessment: Low risk; the changes are localized and address a clear error.
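
    With the corrected keyword, usage would look like this (table names are illustrative; connection setup elided):

    # Cap the result set at 1,000 rows to bound memory usage
    df = run_sql_postgres("SELECT * FROM big_table", rows_limit=1000)

    # Omit rows_limit to keep the previous fetchall() behavior
    df_all = run_sql_postgres("SELECT * FROM small_table")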

📁 src/vanna/base/base.py - generate_followup_questions

  • Submitted PR Code:
    message_log = [
        self.system_message(
            f"You are a helpful data assistant. The user asked the question: '{question}'\n\nThe SQL query for this question was: {sql}\n\nThe following is a pandas DataFrame with the results of the query: \n{df.to_markdown()}\n\n"
        ),
        self.user_message(
            f"Generate a list of {n_questions} followup questions that the user might ask about this data. Respond with a list of questions, one per line. Do not answer with any explanations -- just the questions. Remember that there should be an unambiguous SQL query that can be generated from the question. Prefer questions that are answerable outside of the context of this conversation. Prefer questions that are slight modifications of the SQL query that was generated that allow digging deeper into the data. Each question will be turned into a button that the user can click to generate a new SQL query so don't use 'example' type questions. Each question must have a one-to-one correspondence with an instantiated SQL query." +
            self._response_language()
        ),
    ]
  • Analysis:
    • The code sends the entire DataFrame (df.to_markdown()) to the LLM, which can be very large.
    • Issue: This can lead to excessive token usage, high costs, and potential LLM context limit issues.
  • LlamaPReview Suggested Improvements:
    MAX_LLM_ROWS = 25  # Define a constant for maximum rows

    message_log = [
        self.system_message(
            f"You are a helpful data assistant. The user asked the question: '{question}'\n\nThe SQL query for this question was: {sql}\n\nThe following is a pandas DataFrame with the results of the query:\n{df.head(MAX_LLM_ROWS).to_markdown()}\n\n"
        ),
        self.user_message(
            f"Generate a list of {n_questions} followup questions that the user might ask about this data. Respond with a list of questions, one per line. Do not answer with any explanations -- just the questions. Remember that there should be an unambiguous SQL query that can be generated from the question. Prefer questions that are answerable outside of the context of this conversation. Prefer questions that are slight modifications of the SQL query that was generated that allow digging deeper into the data. Each question will be turned into a button that the user can click to generate a new SQL query so don't use 'example' type questions. Each question must have a one-to-one correspondence with an instantiated SQL query." +
            self._response_language()
        ),
    ]
  • Improvement rationale:
    • Technical benefits: Limits the amount of data sent to the LLM, preventing potential issues with large DataFrames.
    • Business value: Reduces LLM costs and improves performance.
    • Risk assessment: Low risk; this change restores a previous safety measure.

3.2 Key Quality Aspects

  • System scalability considerations: The PR improves scalability by allowing the application to handle larger datasets without running out of memory.
  • Performance bottlenecks and optimizations: The use of fetchmany instead of fetchall is a performance optimization for large result sets.
  • Testing strategy and coverage: The PR description doesn't mention any specific tests. Unit tests should be added to verify the rows_limit functionality for each database connector (a sketch follows this list).
  • Documentation needs: The rows_limit parameter needs to be documented in the docstrings of all affected functions.
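
As a starting point, connector-level tests might look like the following sketch (it assumes a vn pytest fixture wired to a test database whose table t holds more than five rows, and that the P0 input-validation fix is in place):

    import pandas as pd
    import pytest

    def test_rows_limit_caps_result_size(vn):
        df = vn.run_sql("SELECT * FROM t", rows_limit=5)
        assert isinstance(df, pd.DataFrame)
        assert len(df) == 5  # exactly the cap when the table holds more rows

    def test_rows_limit_rejects_invalid_values(vn):
        with pytest.raises(ValueError):
            vn.run_sql("SELECT * FROM t", rows_limit=-1)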

4. Overall Evaluation

  • Technical assessment: The PR introduces a valuable feature for memory safety, but it has some critical issues that need to be addressed before merging. The core logic of limiting rows is sound, but the implementation needs refinement.
  • Business impact: The PR positively impacts the business by improving application stability and reducing the risk of OOM errors.
  • Risk evaluation: The risk is moderate. The identified issues can cause runtime errors, but they are localized and can be addressed with relatively simple fixes.
  • Notable positive aspects and good practices: The PR addresses a real-world problem (OOM errors) and provides a general solution that applies to multiple database connectors.
  • Implementation quality: The implementation quality needs improvement. The identified issues with parameter names, missing validation, and unbounded LLM payloads indicate a need for more careful coding and testing.
  • Final recommendation: Request Changes. The PR cannot be merged in its current state due to the P0 issues. The author should address the P0 and P1 issues before merging.

💡 LlamaPReview Community
Have feedback on this AI Code review tool? Join our GitHub Discussions to share your thoughts and help shape the future of LlamaPReview.

Commit: fix argument name