`read_gbq(table_id, column=[list of columns])` should actually filter the amount of columns downloaded from the API #872

tswast · 2025-01-23T17:59:17Z

Is your feature request related to a problem? Please describe.

Currently, the `columns` parameter is only used to reorder the columns, and it must exactly match the set of columns returned by the query or table. See this TODO:

python-bigquery-pandas/pandas_gbq/gbq.py

Lines 939 to 944 in 912b615

    
    # TODO(kiraksi): allow columns to be a subset of all columns in the table, with follow up PR
    if columns is not None:
        if sorted(columns) == sorted(final_df.columns):
            final_df = final_df[columns]
        else:
            raise InvalidColumnOrder("Column order does not match this DataFrame.")
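For illustration, the relaxed check that the TODO asks for might look roughly like the sketch below. It is written against plain lists of column names so it runs on its own. `select_columns` is a hypothetical helper name, and the `InvalidColumnOrder` class here is a local stand-in for the pandas-gbq exception, not an import of it.

```python
class InvalidColumnOrder(ValueError):
    """Local stand-in for the pandas-gbq exception of the same name."""


def select_columns(requested, available):
    """Return the column names to keep, in the order the caller asked for.

    Hypothetical helper: unlike the current check, `requested` may be a
    subset of `available`; only unknown names are rejected.
    """
    if requested is None:
        # No column list given: keep every column in its existing order.
        return list(available)

    available_set = set(available)
    missing = [name for name in requested if name not in available_set]
    if missing:
        raise InvalidColumnOrder(
            f"Columns not found in the result: {missing}"
        )
    return list(requested)
```

With a DataFrame in hand, the result could then be applied as `final_df = final_df[select_columns(columns, final_df.columns)]`.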

Describe the solution you'd like

Only download the selected columns if the user passes a list of columns to `read_gbq`.

For queries:

Maybe queries should still require the columns to match exactly, since the user can already select columns in the SQL itself? I don't see a `selected_fields` option in https://cloud.google.com/python/docs/reference/bigquery/latest/google.cloud.bigquery.client.Client#google_cloud_bigquery_client_Client_query_and_wait

For table IDs:

Pass the list of columns through as selected_fields to https://cloud.google.com/python/docs/reference/bigquery/latest/google.cloud.bigquery.client.Client#google_cloud_bigquery_client_Client_list_rows

Starting here:

python-bigquery-pandas/pandas_gbq/gbq.py

Lines 914 to 919 in 912b615

    
    final_df = connector.download_table(
        query_or_table,
        max_results=max_results,
        progress_bar_type=progress_bar_type,
        dtypes=dtypes,
    )

going through to

python-bigquery-pandas/pandas_gbq/gbq.py

Line 396 in 912b615

rows_iter = self.client.list_rows(table_ref, max_results=max_results)
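A minimal sketch of how the column list could reach `list_rows`, assuming top-level (non-nested) column names. `Client.list_rows` expects `selected_fields` as a sequence of `SchemaField` objects rather than strings, so this sketch looks the fields up in the table schema via `Client.get_table`, at the cost of one extra API call. `list_rows_for_columns` is a hypothetical helper name, not part of pandas-gbq.

```python
def list_rows_for_columns(client, table_id, columns=None, max_results=None):
    """Fetch rows from `table_id`, downloading only `columns` if given.

    Hypothetical helper. `client` is expected to behave like
    google.cloud.bigquery.Client (get_table and list_rows).
    """
    if columns is None:
        # Same call as today: download every column.
        return client.list_rows(table_id, max_results=max_results)

    # One extra API call, needed to map column names to schema fields.
    table = client.get_table(table_id)
    fields_by_name = {field.name: field for field in table.schema}

    unknown = [name for name in columns if name not in fields_by_name]
    if unknown:
        raise ValueError(f"Columns not found in {table_id}: {unknown}")

    # Preserve the caller's column order in the selected fields.
    selected_fields = [fields_by_name[name] for name in columns]
    return client.list_rows(
        table,
        selected_fields=selected_fields,
        max_results=max_results,
    )
```

Whether the extra `get_table` call is acceptable, and how nested fields should be addressed, are open design questions that this sketch does not settle.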


Additional context

Aside: https://googleapis.dev/python/pandas-gbq/latest/reading.html has no mention that a table ID is supported. We should add a sample there.


product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery-pandas API. label Jan 23, 2025

blunderbuss-gcf bot assigned GaoleMeng Jan 23, 2025

Linchin added the type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. label Jan 27, 2025
