read_gbq(table_id, column=[list of columns]) should actually filter the amount of columns downloaded from the API #872

Open
tswast opened this issue Jan 23, 2025 · 0 comments
Labels: api: bigquery (Issues related to the googleapis/python-bigquery-pandas API); type: feature request (‘Nice-to-have’ improvement, new feature or different behavior or design)

Is your feature request related to a problem? Please describe.

Currently, the columns parameter can only be used to reorder the columns, and the list has to match exactly the columns returned by the query or table. See this TODO:

# TODO(kiraksi): allow columns to be a subset of all columns in the table, with follow up PR
if columns is not None:
    if sorted(columns) == sorted(final_df.columns):
        final_df = final_df[columns]
    else:
        raise InvalidColumnOrder("Column order does not match this DataFrame.")
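
As a rough sketch of what relaxing that check to allow a subset could look like: the helper below works on plain lists of column names so it stays self-contained. The name `resolve_columns` is hypothetical, and it raises `ValueError` rather than pandas-gbq's `InvalidColumnOrder`; neither choice reflects the actual pandas-gbq code.

```python
def resolve_columns(requested, available):
    """Return the column names to keep, in the order the caller requested.

    Hypothetical helper: `requested` may be any subset of `available`,
    in any order. `None` means "keep everything".
    """
    if requested is None:
        return list(available)

    available_set = set(available)
    # Collect every unknown name so the error message is complete.
    missing = [name for name in requested if name not in available_set]
    if missing:
        raise ValueError(f"Columns not found in the result: {missing}")

    return list(requested)
```

With something like this, the existing `final_df = final_df[columns]` step would handle both reordering and subsetting, although on its own it would still download every column first.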

Describe the solution you'd like

Only download the selected columns if the user passes a list of columns to read_gbq

For queries:

Maybe the columns still need to match exactly here, since one can already select columns in the SQL itself? I don't see a selected_fields option in https://cloud.google.com/python/docs/reference/bigquery/latest/google.cloud.bigquery.client.Client#google_cloud_bigquery_client_Client_query_and_wait

For table IDs:

Pass the list of columns through as selected_fields to https://cloud.google.com/python/docs/reference/bigquery/latest/google.cloud.bigquery.client.Client#google_cloud_bigquery_client_Client_list_rows

Starting here:

final_df = connector.download_table(
    query_or_table,
    max_results=max_results,
    progress_bar_type=progress_bar_type,
    dtypes=dtypes,
)

going through to

rows_iter = self.client.list_rows(table_ref, max_results=max_results)
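
A minimal sketch of the table ID path, assuming `client` is a `google.cloud.bigquery.Client` (or anything exposing the same `get_table` and `list_rows` methods). `list_rows` takes `selected_fields` as a sequence of `SchemaField` objects rather than names, so the sketch looks the names up in the table schema first. The function name `list_rows_for_columns` is hypothetical and is not part of pandas-gbq.

```python
def list_rows_for_columns(client, table_id, columns=None, max_results=None):
    """List rows from a table, requesting only `columns` when given.

    Hypothetical helper, not the pandas-gbq implementation.
    """
    if columns is None:
        # No selection requested: same call as today.
        return client.list_rows(table_id, max_results=max_results)

    # Map column names to the SchemaField objects that list_rows expects.
    table = client.get_table(table_id)
    fields_by_name = {field.name: field for field in table.schema}

    missing = [name for name in columns if name not in fields_by_name]
    if missing:
        raise ValueError(f"Columns not found in {table_id}: {missing}")

    selected_fields = [fields_by_name[name] for name in columns]
    return client.list_rows(
        table,
        selected_fields=selected_fields,
        max_results=max_results,
    )
```

This costs one extra `get_table` call per read. Whether the returned columns follow the requested order is not something the sketch relies on, so the existing reordering step would probably still be needed afterwards.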


Additional context

Aside: https://googleapis.dev/python/pandas-gbq/latest/reading.html has no mention that a table ID is supported. We should add a sample there.

@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery-pandas API. label Jan 23, 2025
@Linchin Linchin added the type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. label Jan 27, 2025