-
Notifications
You must be signed in to change notification settings - Fork 82
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: DH-18143: Improve handling of sort order for Iceberg tables #6646
base: main
Are you sure you want to change the base?
Changes from all commits
bdb5050
58adc11
54e1f08
b996afd
ec67d7f
1b7fbda
e029f67
b98123e
0e72eab
b9b8d80
b10b685
dcc0b99
af24391
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,83 @@ | ||
// | ||
// Copyright (c) 2016-2025 Deephaven Data Labs and Patent Pending | ||
// | ||
package io.deephaven.iceberg.location; | ||
|
||
import io.deephaven.api.ColumnName; | ||
import io.deephaven.api.SortColumn; | ||
import io.deephaven.engine.table.impl.locations.TableKey; | ||
import io.deephaven.engine.table.impl.locations.TableLocation; | ||
import io.deephaven.iceberg.util.IcebergTableAdapter; | ||
import io.deephaven.parquet.table.ParquetInstructions; | ||
import io.deephaven.parquet.table.location.ParquetTableLocation; | ||
import org.apache.iceberg.DataFile; | ||
import org.apache.iceberg.NullOrder; | ||
import org.apache.iceberg.Schema; | ||
import org.apache.iceberg.SortDirection; | ||
import org.apache.iceberg.SortField; | ||
import org.apache.iceberg.SortOrder; | ||
import org.jetbrains.annotations.NotNull; | ||
import org.jetbrains.annotations.Nullable; | ||
|
||
import java.util.ArrayList; | ||
import java.util.Collections; | ||
import java.util.List; | ||
|
||
public class IcebergTableParquetLocation extends ParquetTableLocation implements TableLocation { | ||
|
||
@Nullable | ||
private final List<SortColumn> sortedColumns; | ||
|
||
public IcebergTableParquetLocation( | ||
@NotNull final IcebergTableAdapter tableAdapter, | ||
@NotNull final TableKey tableKey, | ||
@NotNull final IcebergTableParquetLocationKey tableLocationKey, | ||
@NotNull final ParquetInstructions readInstructions) { | ||
super(tableKey, tableLocationKey, readInstructions); | ||
sortedColumns = computeSortedColumns(tableAdapter, tableLocationKey.dataFile()); | ||
} | ||
|
||
@Override | ||
@NotNull | ||
public List<SortColumn> getSortedColumns() { | ||
return sortedColumns == null ? super.getSortedColumns() : sortedColumns; | ||
} | ||
|
||
@Nullable | ||
private static List<SortColumn> computeSortedColumns( | ||
@NotNull final IcebergTableAdapter tableAdapter, | ||
@NotNull final DataFile dataFile) { | ||
final Integer sortOrderId = dataFile.sortOrderId(); | ||
Comment on lines
+42
to
+50
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Need to think about behavior when unsorted (either b/c null or explicitly set to unsorted)... |
||
// If sort order is missing or unknown, we cannot determine the sorted columns from the metadata and will | ||
// check the underlying parquet file for the sorted columns, when the user asks for them. | ||
if (sortOrderId == null) { | ||
return null; | ||
} | ||
final SortOrder sortOrder = tableAdapter.icebergTable().sortOrders().get(sortOrderId); | ||
if (sortOrder == null) { | ||
return null; | ||
} | ||
if (sortOrder.isUnsorted()) { | ||
return Collections.emptyList(); | ||
} | ||
final Schema schema = sortOrder.schema(); | ||
final List<SortColumn> sortColumns = new ArrayList<>(sortOrder.fields().size()); | ||
for (final SortField field : sortOrder.fields()) { | ||
if (!field.transform().isIdentity()) { | ||
// TODO (DH-18160): Improve support for handling non-identity transforms | ||
break; | ||
} | ||
final ColumnName columnName = ColumnName.of(schema.findColumnName(field.sourceId())); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This might throw an InvalidNameException; we might need to wait for some Resolver work I'm doing in https://deephaven.atlassian.net/browse/DH-18365 to land so we can properly map field ids. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would not mark this as resolved yet |
||
final SortColumn sortColumn; | ||
if (field.nullOrder() == NullOrder.NULLS_FIRST && field.direction() == SortDirection.ASC) { | ||
sortColumn = SortColumn.asc(columnName); | ||
} else if (field.nullOrder() == NullOrder.NULLS_LAST && field.direction() == SortDirection.DESC) { | ||
sortColumn = SortColumn.desc(columnName); | ||
} else { | ||
malhotrashivam marked this conversation as resolved.
Show resolved
Hide resolved
Comment on lines
+72
to
+76
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We should raise the issue of null-first, nulls-last with the engine team. Arguably, this is something we should want to support. Additionally, we may need to hold of on handling any floating point columns.
The In the meantime, I think the strategy of breaking and returning what we have so far should be OK. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. |
||
break; | ||
} | ||
sortColumns.add(sortColumn); | ||
} | ||
return Collections.unmodifiableList(sortColumns); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recently learned that the files themselves may have metadata, ie
org.apache.iceberg.ManifestReader#spec
. It makes me want to add caution extending these helper methods too far. While we aren't passing alongManifestReader#spec
today, we may need to in the future and might need to model it as appropriate.