
Incremental Append Scan #2031


Open: wants to merge 13 commits into main

Conversation

@smaheshwar-pltr (Contributor) commented May 21, 2025

Revival of #533 given inactivity there.

Co (/main 😄)-author: @hililiwei, apologies to them if they were still working on this. Significant credit goes to them - thank you, @hililiwei!

CC some folks I saw reviewing the previous PR / discussions: @Fokko @chinmay-bhat @kevinjqliu.

I think a bit more work is required (possibly more, depending on people's thoughts), but I'm happy for review to start now.

Rationale for this change

PyIceberg lacks incremental read utilities, and I think this has been asked for multiple times. A strength of PyIceberg is that small data (such as just the data appended between two snapshots) can in theory be processed much faster than with e.g. Spark, so IMHO incremental scans are one of the most needed features of PyIceberg right now.

Are these changes tested?

Yes, with unit tests and integration tests.

Are there any user-facing changes?

Comment on lines +454 to +455
from_snapshot_id_exclusive: Optional[int],
to_snapshot_id_inclusive: int,
@smaheshwar-pltr (Contributor Author), May 21, 2025:

I realise these semantics are confusing given that ancestors_between above is inclusive-inclusive

Contributor:

ancestors_between is only used in validation_history so far... do we need both, or can we consolidate to just use snapshot_id instead of Snapshot objects?

Contributor:

I probably should have just used table_metadata.snapshot_by_id now that I am seeing it lol

yield from ancestors_of(table_metadata.snapshot_by_id(to_snapshot_id_inclusive), table_metadata)


def is_ancestor_of(snapshot_id: int, ancestor_snapshot_id: int, table_metadata: TableMetadata) -> bool:

If from_snapshot_id_exclusive is None or no ancestors of the "to" snapshot match it, all ancestors of the "to"
snapshot are returned.
"""
if from_snapshot_id_exclusive is not None:
@smaheshwar-pltr (Contributor Author):

I'm following the structure of ancestors_between above and case-working here, but I don't think it's strictly needed.
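For illustration, a case-free version could look roughly like this (a hedged sketch, not this PR's exact code, reusing ancestors_of and table_metadata.snapshot_by_id from the diff above; the helper name is hypothetical):

```python
from typing import Iterator, Optional

# Assumes the module's existing imports (Snapshot, TableMetadata, ancestors_of).
def ancestors_between_exclusive_inclusive(
    table_metadata: TableMetadata,
    from_snapshot_id_exclusive: Optional[int],
    to_snapshot_id_inclusive: int,
) -> Iterator[Snapshot]:
    # Walk ancestors of the "to" snapshot (inclusive) and stop when the exclusive
    # "from" snapshot is reached; if it is None or never reached, all ancestors
    # are yielded, matching the docstring above.
    for snapshot in ancestors_of(table_metadata.snapshot_by_id(to_snapshot_id_inclusive), table_metadata):
        if snapshot.snapshot_id == from_snapshot_id_exclusive:
            return
        yield snapshot
```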

# https://github.com/apache/iceberg/issues/1092#issuecomment-638432848 / https://github.com/apache/iceberg/issues/3747#issuecomment-1145419407
# REPLACE TABLE requires certain Hive server configuration
if catalog_name != "hive":
# Replace to break snapshot lineage:
@smaheshwar-pltr (Contributor Author):

Strictly speaking, I don't need this case because I can test that broken lineage throws just by inverting the snapshot order. But this feels like a more realistic use case to me. And changing the schema in this way also lets me test that the table's current schema is always used 😄

@@ -717,6 +717,14 @@ def fetch_manifest_entry(self, io: FileIO, discard_deleted: bool = True) -> List
if not discard_deleted or entry.status != ManifestEntryStatus.DELETED
]

def __eq__(self, other: Any) -> bool:
@smaheshwar-pltr (Contributor Author), May 21, 2025:

Changes in this file are from #533.

To elaborate on why they're needed:
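Roughly: incremental planning collects ManifestFiles from several append snapshots into a set, and that deduplication relies on value-based equality/hashing rather than object identity. A hedged sketch (not the exact code in this PR):

```python
# Hedged sketch: the same manifest file can be reachable from more than one of
# the selected append snapshots, so gathering them into a set only deduplicates
# correctly if ManifestFile defines __eq__ (and __hash__) on its values.
append_snapshot_ids = {snapshot.snapshot_id for snapshot in append_snapshots}
manifests = {
    manifest
    for snapshot in append_snapshots
    for manifest in snapshot.manifests(io)
    if manifest.added_snapshot_id in append_snapshot_ids
}
```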

)

# https://github.com/apache/iceberg/issues/1092#issuecomment-638432848 / https://github.com/apache/iceberg/issues/3747#issuecomment-1145419407
# Don't do replace for Hive catalog as REPLACE TABLE requires certain Hive server configuration
@smaheshwar-pltr (Contributor Author), May 22, 2025:

This is probably fixable 🤔 (but I also made use of this here)


spark.sql(
f"""
CREATE OR REPLACE TABLE {catalog_name}.default.test_incremental_read (
@smaheshwar-pltr (Contributor Author):

(Same as in #533)

return iter([])

append_snapshot_ids: Set[int] = {snapshot.snapshot_id for snapshot in append_snapshots}

manifests = {

@@ -1092,6 +1096,61 @@ def scan(
limit=limit,
)

# TODO: Consider more concise name
def incremental_append_scan(
@smaheshwar-pltr (Contributor Author):

Thoughts on this method's name? It's a bit verbose, but I can't think of a better alternative.

Comment on lines +1102 to +1108
row_filter: Union[str, BooleanExpression] = ALWAYS_TRUE,
selected_fields: Tuple[str, ...] = ("*",),
case_sensitive: bool = True,
from_snapshot_id_exclusive: Optional[int] = None,
to_snapshot_id_inclusive: Optional[int] = None,
options: Properties = EMPTY_DICT,
limit: Optional[int] = None,
@smaheshwar-pltr (Contributor Author), May 22, 2025:

For folks reviewing: with the exception of the optional snapshot IDs, these are the same args as the default scan method just above. I think they all make sense for an append scan too. (I also added some tests)
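To make that concrete, a hedged usage sketch (argument names as in this PR; the table and values are illustrative):

```python
# Same knobs as Table.scan, plus the two snapshot IDs.
scan = table.incremental_append_scan(
    from_snapshot_id_exclusive=123,           # illustrative snapshot IDs
    to_snapshot_id_inclusive=456,
    row_filter="category = 'books'",
    selected_fields=("id", "category"),
    case_sensitive=True,
    limit=1_000,
)
arrow_table = scan.to_arrow()
```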

Comment on lines +1125 to +1127
from_snapshot_id_exclusive:
Optional ID of the "from" snapshot, to start the incremental scan from, exclusively. This can be set
on the IncrementalAppendScan object returned, but ultimately must not be None.
@smaheshwar-pltr (Contributor Author):

This is a significant, user-facing change compared to #533. This PR throws if the "from" snapshot ID is not set, whereas that PR defaults to the oldest ancestor of the end snapshot, inclusively. Here is my argument; I'd love to hear what people think:

  • Incremental Append Scan #533 is good in that it sticks very closely to the Java implementation of IncrementalScan - see the docstring here.
  • However, Spark marks start-snapshot-id as non-optional and throws if it's not provided. See the docs.
  • IMO, the Java APIs are not user-facing; they may be written in a generalised way to allow flexibility for the engines that consume them, and it is those engine APIs that are user-facing. The difference is that PyIceberg is itself user-facing, so I claim it should be more engine-inspired.
  • Here, the Spark behaviour makes sense, IMO. The start snapshot is required for an append scan, but not for a changelog scan.
    • If you omit it for a changelog scan, which a user would do to "read from the start of the table", you just need to find the oldest ancestor: that root snapshot was the last replace, so the changelog would begin from there anyway. Looking at the docs, it indeed just says "If not provided, it reads from the table's first snapshot inclusively."
    • But an append scan is documented as reading data from append snapshots alone, ignoring all other snapshot types. So the intuitive meaning of "reading from the first snapshot" doesn't actually hold, because not all appends since the first snapshot would necessarily be relayed (just the ones since the last replace). To avoid user confusion, I think it's best to throw as Spark does here (sketched below).
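A sketch of the proposed behaviour (hedged; the exact exception type and message are illustrative):

```python
# The "from" snapshot may be left unset when the scan is created (it can be set
# on the returned object later), but it must be set by the time files are
# planned; otherwise the scan raises instead of silently defaulting to the
# oldest ancestor, as #533 did.
scan = table.incremental_append_scan(to_snapshot_id_inclusive=456)
scan.to_arrow()  # raises, e.g. ValueError("from_snapshot_id_exclusive must be set")
```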



class TableScan(ABC):
class AbstractTableScan(ABC):
@smaheshwar-pltr (Contributor Author):

GitHub's diff here is messed up.

A goal of this PR is minimising user-facing breaks, which I am guessing was the main concern with #533 (which e.g. removed snapshot_id from TableScan). In this PR I keep that field, which means it doesn't make sense for incremental scans to subclass TableScan, because they have two snapshot IDs.

I therefore introduced this abstract class as a base class for all table scans, non-incremental and incremental.
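Roughly, the hierarchy this introduces (class names from this PR; bodies elided, and the exact bases of DataScan and IncrementalAppendScan are my reading of the diff):

```python
from abc import ABC

class AbstractTableScan(ABC): ...                  # shared base for all table scans

class FileBasedScan(AbstractTableScan, ABC): ...   # scans that plan FileScanTasks

class TableScan(AbstractTableScan, ABC): ...       # non-incremental, single snapshot_id

class DataScan(TableScan, FileBasedScan): ...      # the existing scan, API unchanged

class IncrementalScan(AbstractTableScan, ABC): ... # two snapshot IDs instead of one

class IncrementalAppendScan(IncrementalScan, FileBasedScan): ...  # new in this PR
```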

@Fokko (Contributor), Jun 8, 2025:

Do we need to do this rename? Changing the name would break any other library that relies on this class.

@smaheshwar-pltr (Contributor Author):

See #2031 (comment) - TableScan still exists with the same methods so I'd have thought we'd be fine here, or do I misunderstand?

The purpose of this new class is to be a base class for table scans, including incremental ones, so that logic can be shared. But your point about reconsidering the class hierarchy still holds.

"""Create a copy of this table scan with updated fields."""
return type(self)(**{**self.__dict__, **overrides})

def to_pandas(self, **kwargs: Any) -> pd.DataFrame:
@smaheshwar-pltr (Contributor Author):

A minor user-facing change (IMHO) is that these methods on TableScan, which subclasses this, now have default implementations based on the to_arrow abstract method. This feels OK to me - subclasses can override them as needed - but it should maybe be documented (if we want to go with this approach).
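For instance, a default could be as thin as this (hedged sketch, assuming the module's existing imports of Any and pandas as pd):

```python
def to_pandas(self, **kwargs: Any) -> pd.DataFrame:
    # Default implementation in terms of the abstract to_arrow();
    # subclasses can still override it if they have a more direct path.
    return self.to_arrow().to_pandas(**kwargs)
```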

return result


class FileBasedScan(AbstractTableScan, ABC):
@smaheshwar-pltr (Contributor Author), May 22, 2025:

In light of #533 (comment), I think it makes sense to have some abstraction for scans that return FileScanTasks specifically.

I think we maybe should've been doing some handling before - on main, this line gives me a warning

if task.residual == AlwaysTrue() and len(task.delete_files) == 0:

because

# every task is a FileScanTask
tasks = self.plan_files()

doesn't follow from the typing. But overriding this return type to Iterable[FileScanTask] fixes that, as here.
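i.e. something along these lines (hedged sketch, assuming the module's existing imports):

```python
class FileBasedScan(AbstractTableScan, ABC):
    @abstractmethod
    def plan_files(self) -> Iterable[FileScanTask]:
        # Narrowed return type: every planned task is statically a FileScanTask,
        # so downstream code reading task.residual or task.delete_files
        # type-checks without a cast.
        ...
```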

@smaheshwar-pltr (Contributor Author):

The FileScanTask-based scan abstraction also means we can provide default implementations of methods like to_arrow and the others here, based on FileScanTasks being returned by plan_files. This reduces duplication - we get them on both DataScan and IncrementalAppendScan.

"""A base class for table scans that plan FileScanTasks."""

@cached_property
def _manifest_group_planner(self) -> ManifestGroupPlanner:
@smaheshwar-pltr (Contributor Author):

The motivation for a manifest-based file scan task planner comes from the Java-side https://github.com/apache/iceberg/blob/1911c94ea605a3d3f10a1994b046f00a5e9fdceb/core/src/main/java/org/apache/iceberg/BaseIncrementalAppendScan.java#L76-L97 (class here).

I cache this property on this class because ManifestGroupPlanner has a partition_filters property that is cached on it. The planner, and hence that cache, therefore lives for the FileBasedScan's lifetime, similar to what we had before:

@cached_property
def partition_filters(self) -> KeyDefaultDict[int, BooleanExpression]:
return KeyDefaultDict(self._build_partition_projection)
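So the cached property is plausibly just this (a hedged sketch of the body, which isn't shown in this hunk):

```python
@cached_property
def _manifest_group_planner(self) -> ManifestGroupPlanner:
    # Cached for the scan's lifetime so the planner's own cached state
    # (e.g. its partition_filters) is reused across plan_files() calls.
    return ManifestGroupPlanner(self)
```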

@smaheshwar-pltr (Contributor Author), May 22, 2025:

To me, the abstraction of something that handles planning such tasks from manifests makes sense and naturally reduces duplication with the new append scan type. Other design decisions are possible, like introducing a member for it or rethinking the abstraction, but this felt fine to me (I like keeping this class thin).

Comment on lines +1754 to +1755
class TableScan(AbstractTableScan, ABC):
"""A base class for non-incremental table scans that target a single snapshot."""
@smaheshwar-pltr (Contributor Author):

See https://github.com/apache/iceberg-python/pull/2031/files#r2102683614.

I don't love this - if compatibility weren't a concern, I'd refactor the hierarchy like #533 did. I can't see much nuance this offers as a base class over DataScan, other than that it doesn't extend FileBasedScan while DataScan does.

return self._manifest_group_planner.plan_files(manifests=snapshot.manifests(self.io))

# TODO: Document motivation and un-caching
@property
@smaheshwar-pltr (Contributor Author), May 22, 2025:

This was previously a cached_property, but now that _manifest_group_planner is cached it can just be a property. (I don't think this is a user-facing change.)

@smaheshwar-pltr (Contributor Author), May 22, 2025:

This method isn't used by PyIceberg any more because the logic was moved into ManifestGroupPlanner, which concerns itself with those things. The only reasons I'm keeping it are compatibility, and that it may have been public on DataScan for a reason before, i.e. library users may have been interested in using it.

partition_type = spec.partition_type(self.table_metadata.schema())
partition_schema = Schema(*partition_type.fields)
partition_expr = self.partition_filters[spec_id]
A = TypeVar("A", bound="IncrementalScan", covariant=True)
@smaheshwar-pltr (Contributor Author):

The linter didn't like the name I 😢

def _build_metrics_evaluator(self) -> Callable[[DataFile], bool]:
schema = self.table_metadata.schema()
include_empty_files = strtobool(self.options.get("include_empty_files", "false"))
class IncrementalScan(AbstractTableScan, ABC):
@smaheshwar-pltr (Contributor Author), May 22, 2025:

I'm differing from both #533 and the Java side here by keeping this thin and not performing any snapshot defaults / validation in this abstract super-class.

This is because of my claim in #2031 (comment) (happy to discuss) - IMHO, append and changelog scans should each perform their own, probably different, defaults / validation, because we're designing APIs for user use, not engine use.

cc @chinmay-bhat, would love to hear your thoughts on this and this PR!
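For the append scan, that per-subclass validation could look roughly like this (a hedged sketch; defaulting the "to" snapshot to the table's current snapshot mirrors Spark's end-snapshot-id behaviour and is illustrative rather than exactly what this PR does):

```python
def _validate_and_resolve_snapshots(self) -> Tuple[int, int]:
    # Append-scan-specific rules: "from" is required (see the discussion above);
    # "to" falls back to the table's current snapshot when not provided.
    if self.from_snapshot_id_exclusive is None:
        raise ValueError("from_snapshot_id_exclusive must be set for an incremental append scan")
    to_snapshot_id = (
        self.to_snapshot_id_inclusive
        if self.to_snapshot_id_inclusive is not None
        else self.table_metadata.current_snapshot_id
    )
    if to_snapshot_id is None:
        raise ValueError("Table has no snapshots to scan")
    return self.from_snapshot_id_exclusive, to_snapshot_id
```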

)

def plan_files(self) -> Iterable[FileScanTask]:
"""Plans the relevant files by filtering on the PartitionSpecs.
from_snapshot_id, to_snapshot_id = self._validate_and_resolve_snapshots()

Comment on lines +2057 to +2058
manifest_entry_filter=lambda manifest_entry: manifest_entry.snapshot_id in append_snapshot_ids
and manifest_entry.status == ManifestEntryStatus.ADDED,

self.case_sensitive = scan.case_sensitive
self.options = scan.options

def plan_files(
@smaheshwar-pltr (Contributor Author):

Sorry, the diff here is messed up. This is the same as the relevant body of the previous DataScan method, but we filter on manifest_evaluators within this method itself, on the manifests provided. We also introduce this manifest_entry_filter, which is Java-inspired.
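A hedged sketch of the resulting shape (names taken from the diff; the default value and exact parameter list are illustrative, and the module's existing imports are assumed):

```python
def plan_files(
    self,
    manifests: List[ManifestFile],
    manifest_entry_filter: Callable[[ManifestEntry], bool] = lambda _: True,
) -> Iterable[FileScanTask]:
    # Evaluate manifest_evaluators against the manifests passed in (previously
    # the caller pre-filtered them), then skip entries the per-scan filter
    # rejects, e.g. the append scan keeps only ADDED entries written by the
    # selected append snapshots.
    ...
```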

Comment on lines +2162 to +2163
if not manifest_entry_filter(manifest_entry):
continue


return ray.data.from_arrow(self.to_arrow())
# TODO: Document that this method was made static
@staticmethod
@smaheshwar-pltr (Contributor Author):

I made this method static (the one on DataScan wasn't before).

]


class ManifestGroupPlanner:
@smaheshwar-pltr (Contributor Author):

This class effectively extracts the code for planning based on manifest files that was previously in DataScan. The code is largely the same (differences are pointed out in comments).




@pytest.mark.integration
@pytest.mark.parametrize("catalog", [pytest.lazy_fixture("session_catalog")])
@smaheshwar-pltr (Contributor Author):

(Using just the REST catalog that has the replace here to get a different schema)



@pytest.mark.integration
@pytest.mark.parametrize("catalog", [pytest.lazy_fixture("session_catalog")])
@smaheshwar-pltr (Contributor Author), May 22, 2025:

(Using just the REST catalog that has the replace here)

@smaheshwar-pltr smaheshwar-pltr marked this pull request as ready for review May 22, 2025 15:04
@smaheshwar-pltr (Contributor Author) commented May 28, 2025:

Put up apache/iceberg#13179 regarding an append-only option for the Spark side. Would like to hear people's thoughts there - but I think we can proceed with this PR as it is now

@smaheshwar-pltr (Contributor Author):

@Fokko, could you please review this?

@smaheshwar-pltr (Contributor Author):

Friendly ping, @Fokko @kevinjqliu

@hililiwei (Contributor):

Thanks for picking it up. For various reasons I've been away for quite a while; I'm sorry for not making progress on this work.

@Fokko (Contributor) commented Jun 8, 2025:

@smaheshwar-pltr First of all, sorry for the long wait, and thanks for picking this up. I'll have to look into this in more detail next week. It would be great to break this up into smaller pieces to speed up the reviews. Going over the PR, I'm not sure if we want to copy the whole class hierarchy from Java, as this does not feel very Python in my opinion.

@smaheshwar-pltr (Contributor Author) commented Jun 11, 2025:

Not sure why CI failed when tests pass for me locally - I did see that error on other PRs, so I merged main just now to see if that fixes it. I don't think that failure was related to this PR.

@smaheshwar-pltr (Contributor Author):

Going over the PR, I'm not sure if we want to copy the whole class hierarchy from Java, as this does not feel very Python in my opinion.

Thanks for taking a look, @Fokko - that makes sense.

I think abstract classes, used the way this PR uses them, achieve the append-scan functionality nicely and without duplication (see the added tests in tests/integration/test_reads.py), and in line with existing code: the previous TableScan already has abstract methods. But the hierarchy may indeed be confusing, maybe even more than usual given I've tried not to introduce breaks, and it's true that the rearrangement bloats the diff 😄.

Maybe this PR's logic for planning files for an append scan, its APIs, and its tests can still be reviewed?
