limit rows for queries #812
base: main
Conversation
Auto Pull Request Review from LlamaPReview
1. Overview
1.1 Core Changes
- Primary purpose and scope: The primary purpose of this pull request is to enhance the memory safety of the Vanna application by adding a row limiting feature to SQL query executors. This is particularly useful in environments with limited RAM to prevent Out-Of-Memory (OOM) errors.
- Key components modified: The core change involves modifying the `run_sql` methods for all supported database connectors (Snowflake, SQLite, PostgreSQL, MySQL, Oracle, Presto, Hive) in `src/vanna/base/base.py`.
- Cross-component impacts: This change impacts all database connector implementations, introducing a new execution flow control for handling query results. It implicitly requires all `run_sql` implementations to support row limiting.
- Business value alignment: This change aligns with the business value of improving application stability and reliability, especially in resource-constrained environments. It reduces the risk of application crashes due to OOM errors, thus enhancing user experience and reducing potential support costs.
1.2 Technical Architecture
- System design modifications: The PR introduces a new `rows_limit` parameter to the `run_sql_*` functions across the various database connectors.
- Component interaction changes: The interaction between the query execution logic and the database connectors is modified to use `fetchmany()` with a limit instead of `fetchall()` when a limit is provided.
- Integration points impact: This change affects the integration points with all supported databases, ensuring a consistent approach to limiting the number of rows fetched.
- Dependency changes and implications: No new dependencies were added. The existing database connector libraries (e.g., `snowflake.connector`, `psycopg2`, `pymysql`, `oracledb`, `presto`, `hive`) are used with their existing `fetchmany()` methods, but with corrected parameter names where necessary.
2. Critical Findings
2.1 Must Fix (P0🔴)
Issue: Incorrect fetch parameter name in the PostgreSQL connector.
- Impact: This will lead to `TypeError` exceptions at runtime when the PostgreSQL connector is used with a `rows_limit`, causing query execution to fail. This undermines the application's ability to handle large datasets and prevent OOM errors with PostgreSQL.
- Resolution: Change `numRows=rows_limit` to `size=rows_limit` in the `run_sql_postgres` function.
Issue: Unbounded LLM payload in `generate_followup_questions`.
- Impact: Sending the entire DataFrame to the LLM can lead to excessive token usage, potentially causing performance issues, higher costs, and even exceeding LLM context limits. This undermines the PR's goal of memory safety.
- Resolution: Revert to using `df.head(25).to_markdown()` or a configurable constant (e.g., `MAX_LLM_ROWS`) to limit the data sent to the LLM.
Issue: Missing input validation for `rows_limit`.
- Impact: Without input validation, invalid values for `rows_limit` (e.g., negative numbers, non-integers) can be passed to the `run_sql_*` functions, leading to unexpected behavior or errors.
- Resolution: Add input validation to all `run_sql_*` functions to ensure `rows_limit` is either `None` or a positive integer (see the sketch after this item).
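As a rough illustration of that check, a shared helper could be called at the top of every executor. This is a minimal sketch; the name `_validate_rows_limit` is hypothetical and not part of the Vanna codebase:

```python
# Hypothetical helper, not part of the Vanna codebase -- shown only to
# illustrate the validation rule described above.
def _validate_rows_limit(rows_limit) -> None:
    """Raise ValueError unless rows_limit is None or a positive integer."""
    if rows_limit is None:
        return
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(rows_limit, bool) or not isinstance(rows_limit, int) or rows_limit < 1:
        raise ValueError("rows_limit must be a positive integer or None")
```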
2.2 Should Fix (P1🟡)
Issue: Documentation gaps in the `run_sql_*` functions.
- Impact: The new `rows_limit` parameter is not documented in the function docstrings, making it unclear to users how to use the new feature.
- Suggested Solution: Add a description of the `rows_limit` parameter to the docstrings of all `run_sql_*` functions.
Issue: Inconsistent error handling and parameter names across connectors.
- Impact: Inconsistencies in parameter names (e.g., `size` vs. `numRows`) and the lack of uniform validation across the different database connectors can lead to confusion and potential errors.
- Suggested Solution: Ensure all connectors use consistent parameter names (preferring the correct one for each library) and implement the same input validation logic for `rows_limit`.
2.3 Consider (P2🟢)
Area: Centralized Limit Handling.
- Improvement Opportunity: Reduce code duplication and improve maintainability by creating a helper function that handles fetching results with or without a limit, as shown in the sketch below.
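One way such a helper might look, sketched against the DB-API 2.0 cursor interface (`_fetch_results` is an illustrative name, not existing Vanna code):

```python
# Illustrative sketch of centralized limit handling; not existing Vanna code.
def _fetch_results(cursor, rows_limit=None):
    """Fetch all rows from an executed cursor, or at most rows_limit rows.

    Passing the size positionally sidesteps keyword-name differences
    between drivers (e.g. psycopg2's `size` vs. cx_Oracle's `numRows`).
    """
    if rows_limit is not None:
        return cursor.fetchmany(rows_limit)
    return cursor.fetchall()
```

Each `run_sql_*` implementation could then call `_fetch_results(cs, rows_limit)` after `cs.execute(sql)`, keeping the limit logic in one place.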
Area: Configuration Defaults.
- Improvement Opportunity: Allow users to configure a default row limit globally through the `config` object, providing a more convenient way to manage this setting; one possible shape is sketched below.
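A minimal sketch of that idea, assuming a hypothetical `max_rows` config key (neither the key name nor the override behavior is an existing Vanna option):

```python
# Sketch only: "max_rows" is a hypothetical config key, and a per-call
# rows_limit overrides the global default when both are given.
class VannaBase:
    def __init__(self, config=None):
        self.config = config or {}
        self.default_rows_limit = self.config.get("max_rows")  # None = no limit

    def _effective_rows_limit(self, rows_limit=None):
        return rows_limit if rows_limit is not None else self.default_rows_limit
```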
Area: Performance Instrumentation.
- Improvement Opportunity: Add metrics that track query size and whether a limit was applied, giving insight into query performance and the effectiveness of the row limiting feature; an illustrative wrapper follows.
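For instance, a thin wrapper around the fetch step could log row counts and whether truncation occurred; the wrapper name and log fields below are assumptions, not existing instrumentation:

```python
import logging
import time

logger = logging.getLogger("vanna.sql")

# Illustrative instrumentation; the wrapper and its log format are assumptions.
def _fetch_with_metrics(fetch, rows_limit=None):
    """Call fetch(rows_limit), timing it and logging size/limit information."""
    start = time.perf_counter()
    rows = fetch(rows_limit)
    elapsed = time.perf_counter() - start
    truncated = rows_limit is not None and len(rows) == rows_limit
    logger.info(
        "query returned %d rows in %.3fs (rows_limit=%s, truncated=%s)",
        len(rows), elapsed, rows_limit, truncated,
    )
    return rows
```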
2.4 Summary of Action Items
- Immediate (P0): Fix the incorrect parameter name (PostgreSQL), revert the unbounded LLM payload, and add input validation for `rows_limit`.
- Before Merge (P1): Update the documentation and ensure consistent error handling and parameter names across all connectors.
- Future Iteration (P2): Consider centralized limit handling, configuration defaults, and performance instrumentation.
3. Technical Analysis
3.1 Code Logic Analysis
📁 src/vanna/base/base.py - run_sql_postgres
- Submitted PR Code:
```python
def run_sql_postgres(sql: str, rows_limit: int = None) -> Union[pd.DataFrame, None]:
    conn = None
    try:
        conn = connect_to_db()  # Initial connection attempt
        cs = conn.cursor()
        cs.execute(sql)
        if rows_limit != None:
            results = cs.fetchmany(numRows=rows_limit)
        else:
            results = cs.fetchall()
        # Create a pandas dataframe from the results
        df = pd.DataFrame(results, columns=[desc[0] for desc in cs.description])
        return df
```
- Analysis:
  - The code attempts to add row limiting to the PostgreSQL connector using `cs.fetchmany(numRows=rows_limit)`.
  - Issue: `psycopg2`'s `fetchmany` method uses the parameter name `size`, not `numRows`. This will cause a `TypeError`.
  - There is no input validation for `rows_limit`.
- LlamaPReview Suggested Improvements:
```python
def run_sql_postgres(sql: str, rows_limit: int = None) -> Union[pd.DataFrame, None]:
    conn = None
    # Input validation for rows_limit
    if rows_limit is not None:
        if not isinstance(rows_limit, int) or rows_limit < 1:
            raise ValueError("rows_limit must be a positive integer or None")
    try:
        conn = connect_to_db()  # Initial connection attempt
        cs = conn.cursor()
        cs.execute(sql)
        if rows_limit is not None:
            results = cs.fetchmany(size=rows_limit)  # Corrected parameter name
        else:
            results = cs.fetchall()
        # Create a pandas dataframe from the results
        df = pd.DataFrame(results, columns=[desc[0] for desc in cs.description])
        return df
```
- Improvement rationale:
  - Technical benefits: Corrects the parameter name for `fetchmany`, preventing a `TypeError`, and adds input validation to prevent unexpected behavior.
  - Business value: Ensures the function works as intended, preventing application crashes and improving reliability.
  - Risk assessment: Low risk; the changes are localized and address a clear error.
📁 src/vanna/base/base.py - generate_followup_questions
- Submitted PR Code:
```python
message_log = [
    self.system_message(
        f"""You are a helpful data assistant. The user asked the question: '{question}'
The SQL query for this question was: {sql}
The following is a pandas DataFrame with the results of the query:
{df.to_markdown()}
"""
    ),
    self.user_message(
        f"Generate a list of {n_questions} followup questions that the user might ask about this data. Respond with a list of questions, one per line. Do not answer with any explanations -- just the questions. Remember that there should be an unambiguous SQL query that can be generated from the question. Prefer questions that are answerable outside of the context of this conversation. Prefer questions that are slight modifications of the SQL query that was generated that allow digging deeper into the data. Each question will be turned into a button that the user can click to generate a new SQL query so don't use 'example' type questions. Each question must have a one-to-one correspondence with an instantiated SQL query." +
        self._response_language()
    ),
]
```
- Analysis:
  - The code sends the entire DataFrame (`df.to_markdown()`) to the LLM, which can be very large.
  - Issue: This can lead to excessive token usage, high costs, and potential LLM context limit issues.
- LlamaPReview Suggested Improvements:
```python
MAX_LLM_ROWS = 25  # Define a constant for maximum rows

message_log = [
    self.system_message(
        f"""You are a helpful data assistant. The user asked the question: '{question}'
The SQL query for this question was: {sql}
The following is a pandas DataFrame with the results of the query:
{df.head(MAX_LLM_ROWS).to_markdown()}
"""
    ),
    self.user_message(
        f"Generate a list of {n_questions} followup questions that the user might ask about this data. Respond with a list of questions, one per line. Do not answer with any explanations -- just the questions. Remember that there should be an unambiguous SQL query that can be generated from the question. Prefer questions that are answerable outside of the context of this conversation. Prefer questions that are slight modifications of the SQL query that was generated that allow digging deeper into the data. Each question will be turned into a button that the user can click to generate a new SQL query so don't use 'example' type questions. Each question must have a one-to-one correspondence with an instantiated SQL query." +
        self._response_language()
    ),
]
```
- Improvement rationale:
  - Technical benefits: Limits the amount of data sent to the LLM, preventing potential issues with large DataFrames.
  - Business value: Reduces LLM costs and improves performance.
  - Risk assessment: Low risk; this change restores a previous safety measure.
3.2 Key Quality Aspects
- System scalability considerations: The PR improves scalability by allowing the application to handle larger datasets without running out of memory.
- Performance bottlenecks and optimizations: The use of `fetchmany` instead of `fetchall` is a performance optimization for large result sets.
- Testing strategy and coverage: The PR description doesn't mention any specific tests. Unit tests should be added to verify the `rows_limit` functionality for each database connector (a sketch of such a test follows this list).
- Documentation needs: The `rows_limit` parameter needs to be documented in the docstrings of all affected functions.
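A sketch of the kind of unit test meant here, using an in-memory SQLite database (`run_sql_sqlite` below is a local stand-in reduced to the fetch logic under test, not the Vanna implementation):

```python
import sqlite3

import pytest

# Stand-in for the real connector method, reduced to the fetch logic under test.
def run_sql_sqlite(conn, sql, rows_limit=None):
    cs = conn.cursor()
    cs.execute(sql)
    return cs.fetchmany(rows_limit) if rows_limit is not None else cs.fetchall()

@pytest.fixture
def conn():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(100)])
    return conn

def test_rows_limit_caps_result_size(conn):
    assert len(run_sql_sqlite(conn, "SELECT * FROM t", rows_limit=10)) == 10

def test_no_limit_returns_all_rows(conn):
    assert len(run_sql_sqlite(conn, "SELECT * FROM t")) == 100
```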
4. Overall Evaluation
- Technical assessment: The PR introduces a valuable feature for memory safety, but it has some critical issues that need to be addressed before merging. The core logic of limiting rows is sound, but the implementation needs refinement.
- Business impact: The PR positively impacts the business by improving application stability and reducing the risk of OOM errors.
- Risk evaluation: The risk is moderate. The identified issues can cause runtime errors, but they are localized and can be addressed with relatively simple fixes.
- Notable positive aspects and good practices: The PR addresses a real-world problem (OOM errors) and provides a general solution that applies to multiple database connectors.
- Implementation quality: The implementation quality needs improvement. The identified issues with parameter names, missing validation, and unbounded LLM payloads indicate a need for more careful coding and testing.
- Final recommendation: Request Changes. The PR cannot be merged in its current state due to the P0 issues. The author should address the P0 and P1 issues before merging.
fix argument name
Adding a row limit to the SQL query executors - it is handy for environments with less RAM, to avoid OOM errors.
Not sure where else this argument should be added in the main functions, besides the main executors themselves - any help?