ParquetRecordReader::read_from_row_group - num_records cannot exceed number of rows #7132

Closed
Paul-Folbrecht opened this issue Feb 13, 2025 · 2 comments
Labels
question Further information is requested

Comments

@Paul-Folbrecht

Your sample:

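use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};
use parquet::record::RecordReader;
use parquet_derive::ParquetRecordReader;

#[derive(ParquetRecordReader)]
struct ACompleteRecord {
    pub a_bool: bool,
    pub a_string: String,
}

pub fn read_some_records() -> Vec<ACompleteRecord> {
    let mut samples: Vec<ACompleteRecord> = Vec::new();
    let file = File::open("some_file.parquet").unwrap();

    let reader = SerializedFileReader::new(file).unwrap();
    // Each row group is read separately; this takes the first one.
    let mut row_group = reader.get_row_group(0).unwrap();
    // The second argument is num_records, the number of records to read.
    samples.read_from_row_group(&mut *row_group, 1).unwrap();
    samples
}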

The docs state:

"Read up to num_records records from row_group_reader into self."

But if you pass a num_records greater than the number of rows in the file, the call panics with an error like:

thread 'parquet_reader::test_read_records' panicked at services/src/parquet_reader.rs:10:10:
index out of bounds: the len is 66945 but the index is 66945

Since there's no way to determine the number of rows without iterating once, this is a problem.

@tustvold
Contributor

@tustvold added the question (Further information is requested) label and removed the bug label on Feb 13, 2025
@Paul-Folbrecht
Author

Thanks. It turns out that the Parquet files we're processing have bad metadata, so that value is wrong. But you are correct, there's no bug here.
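For anyone landing here with the same panic: a minimal sketch of capping num_records before the call, assuming "that value" refers to the row count stored in the row-group metadata (in the parquet crate, `row_group.metadata().num_rows()`, which returns an `i64`). The helper name `clamp_num_records` is hypothetical and not part of the crate, and the cap is only as good as the metadata, so it would not have helped with the malformed files described above.

```rust
/// Hypothetical helper (not part of the parquet crate): cap a requested
/// record count at the row count reported by the row-group metadata.
///
/// `rows_in_group` is assumed to come from `row_group.metadata().num_rows()`.
pub fn clamp_num_records(requested: usize, rows_in_group: i64) -> usize {
    // A negative count can only come from corrupt metadata; treat it as empty.
    let available = if rows_in_group < 0 { 0 } else { rows_in_group as u64 };
    // The result never exceeds `requested`, so converting back to usize is lossless.
    (requested as u64).min(available) as usize
}

// Usage sketch against the sample above (requires the parquet crate):
//
//     let rows = row_group.metadata().num_rows();
//     let n = clamp_num_records(100_000, rows);
//     samples.read_from_row_group(&mut *row_group, n).unwrap();
```

Keeping the arithmetic in a small pure function makes the cap easy to unit test without a Parquet file on disk.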
