-
Notifications
You must be signed in to change notification settings - Fork 275
feat: Declarative TableMetadata Builder #1362
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: Declarative TableMetadata Builder #1362
Conversation
@liurenjie1024 @Xuanwo Looking forward to your thoughts! |
#[builder(default, setter(transform = |stats: Vec<StatisticsFile>| { | ||
stats.into_iter().map(|s| (s.snapshot_id, s)).collect() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why also default? How is this different from snapshots
?
Accepting Vec<StatisticsFile>
should be enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its not - I missed the default for snapshots
and current_snapshot_id
and a few others.
My basic reasoning is that all fields that must be set in order to produce valid metadata should not have a default, even if the Rust type would allow it. An example for a non-default field would be schemas
, as we require at least one entry in the HashMap to be able to produce valid Metadata (current_schema_id
must be set and be present).
Valid TableMetadata
must not contain snapshots
though, and current_snapshot_id
can be None, so those should have defaults.
.metadata_log(vec![]) | ||
.refs(HashMap::new()) | ||
.build() | ||
.try_normalize() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the test for the declarative builder would be better to avoid unnecessary calls. Do we need metadata.try_normalize
? (the test passes without metadata.try_normalize
call)
OTOH it's somewhat weird that try_normalize
is the only API to finish building.
Could build()
do the try_normalize
implicitly, so that user doesn't have to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer this design as well, however I can't find an easy way to implement a custom build method without implementing it on the very complex generic type that is derived by the macro (see end of this message) and is hard to keep in sync.
typed_builder
s only usable customization of the build method is using Into
to convert the type. There is no native way to inject try_normalize
into the build method. Especially, there is no way to make the generated build method return a Result
.
I see the following options:
- Just build
TableMetadata
without normalization - this would allow to create a potentially invalid TableMetadata, as all other constructors (existingTableMetadataBuilder
&Deserialize
) use normalization. Hence, I discarded it. - Implement a fallible
try_build
(with normalize) method and make the derivedbuild
method private. This would require to implement thetry_build
on the above type, which isn't very obvious, and would need to be kept in sync with fields inTableMetadata
and even worse, with which have a default builder implementation. Due to this complexity, and a high probability of state drift, I discarded this as well. - Don't use typed builder - which requires us to write significantly more code, but it would be a viable option.
- Extend typed builder to support
TryInto
in addition toInto
in the build method. - Use a very small custom type that wraps the notion of unvalidated metadata and offers an easy way to validate it - which is what I did.
If there are any other ideas here or if I missed something, let me know!
Here is the type: Note that default and non-default fields are treated differently:
#[allow(dead_code, non_camel_case_types, missing_docs)]
#[automatically_derived]
impl<
__properties: ::typed_builder::Optional<HashMap<String, String>>,
__current_snapshot_id: ::typed_builder::Optional<Option<i64>>,
__snapshots: ::typed_builder::Optional<HashMap<i64, SnapshotRef>>,
__snapshot_log: ::typed_builder::Optional<Vec<SnapshotLog>>,
__metadata_log: ::typed_builder::Optional<Vec<MetadataLog>>,
__statistics: ::typed_builder::Optional<HashMap<i64, StatisticsFile>>,
__partition_statistics: ::typed_builder::Optional<HashMap<i64, PartitionStatisticsFile>>,
__encryption_keys: ::typed_builder::Optional<HashMap<String, String>>,
>
TableMetadataDeclarativeBuilder<(
(FormatVersion,),
(Uuid,),
(String,),
(i64,),
(i64,),
(i32,),
(HashMap<i32, SchemaRef>,),
(i32,),
(HashMap<i32, PartitionSpecRef>,),
(PartitionSpecRef,),
(StructType,),
(i32,),
__properties,
__current_snapshot_id,
__snapshots,
__snapshot_log,
__metadata_log,
(HashMap<i64, SortOrderRef>,),
(i64,),
(HashMap<String, SnapshotReference>,),
__statistics,
__partition_statistics,
__encryption_keys,
)>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typed_builders only usable customization of the build method is using Into to convert the type. There is no native way to inject try_normalize into the build method. Especially, there is no way to make the generated build method return a Result.
Looks there is draft design for this problem: idanarye/rust-typed-builder#95 . Another option maybe we can try to make the progress of feature. 😄
#[derive(Debug, PartialEq, Deserialize, Eq, Clone)] | ||
#[derive(Debug, PartialEq, Deserialize, Eq, Clone, typed_builder::TypedBuilder)] | ||
#[builder(builder_method(name=declarative_builder))] | ||
#[builder(builder_type(name=TableMetadataDeclarativeBuilder, doc="Build a new [`TableMetadata`] in a declarative way. For imperative operations (e.g. `add_snapshot`) and creating new TableMetadata, use [`TableMetadataBuilder`] instead."))] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not convinced "declarative" vs "imperative" is the defining distinction.
The TableMetadataBuilder
is not "a builder pattern for TableMetadata". It's a facade to perform changes on the TableMetadata to be invoked by higher layer. For that reason, it includes implicit changes that are supposed to be carried on when a metadata change happens, such as updating last_updated_ms when a new snapshot is added
self.last_updated_ms = Some(snapshot.timestamp_ms()); |
The use-case described in the PR description is "the" standard builder pattern. To me, this is the defining difference.
I suggest changing the doc string here and align the type names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I absolutely agree with this from a rust perspective. From an iceberg perspective other languages use the world Builder
for this functionality unfortunately. I am uncertain if renaming the TableMetadataBuilder
to a clearer name is worth the breaking change today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @findepi about the naming. To me this is more like a constructor, but since I'm not a native English speaker, I don't have a good suggestion to the name. But it would be lovely if we have a better name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest, I feel the TypedBuilder
is a little hard to maintain in such a large struct, do you mind to create a standalone struct for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest, I feel the
TypedBuilder
is a little hard to maintain in such a large struct, do you mind to create a standalone struct for this?
I don't quite understand this suggestion - isn't the purpose of this PR to create TableMetadata
? How would implementing it on another struct help?
Regarding names, @liurenjie1024 would you be OK with the breaking change that Piotrs proposal implies? Renaming the existing TableMetadataBuilder
to something else, then naming the new builder TableMetadataBuilder
(which does something completely different and is used by far less people?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @liurenjie1024 and @findepi you mean different things.
@liurenjie1024 you just would like to rename TableMetadataDeclarativeBuilder
to TableMetadataConstructor
- correct? I am onboard with that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@liurenjie1024 you just would like to rename TableMetadataDeclarativeBuilder to TableMetadataConstructor - correct? I am onboard with that.
Yes. Feel free to suggest anything more intuitive.
I don't quite understand this suggestion - isn't the purpose of this PR to create TableMetadata? How would implementing it on another struct help?
I mean not using TypedBuilder
to generate code automatically, rather than manually maintaining the builder/construct data structure. This pr contains many advanced usage of TypeBuilder
, and as a reviewer I find it's sth how difficult to catch up all the usages. I also worry about the maintainceness of using too much auto generated code.
|
||
impl UnnormalizedTableMetadata { | ||
/// Build a new [`UnnormalizedTableMetadata`] using the [`TableMetadataDeclarativeBuilder`]. | ||
pub fn builder() -> TableMetadataDeclarativeBuilder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unused? remove?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a library and the constructor can be used outside of the crate. I like to attach some constructor to a type if available.
As it's an auxiliary type, I am OK with removing it as well.
/// Unnormalized table metadata, used as an intermediate type | ||
/// to build table metadata in a declarative way. | ||
#[derive(Debug, PartialEq, Deserialize, Eq, Clone)] | ||
pub struct UnnormalizedTableMetadata(TableMetadata); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't get why we need this. Why not build TableMeadata from a declarative builder directly?🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mainly due to my point 1) in this comment: #1362 (comment)
I don't think there should be any constructor that allows to create unvalidated / potentially corrupt TableMetadata
.
@findepi @ZENOTME thanks for the Reviews! The most unclear point is probably the return type of the Build method. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @c-thiel for this pr! In general I agree that the requirement is reasonable. But I have concerns with the naming and TypedBuilder
usage, left some comments for discussion.
#[derive(Debug, PartialEq, Deserialize, Eq, Clone)] | ||
#[derive(Debug, PartialEq, Deserialize, Eq, Clone, typed_builder::TypedBuilder)] | ||
#[builder(builder_method(name=declarative_builder))] | ||
#[builder(builder_type(name=TableMetadataDeclarativeBuilder, doc="Build a new [`TableMetadata`] in a declarative way. For imperative operations (e.g. `add_snapshot`) and creating new TableMetadata, use [`TableMetadataBuilder`] instead."))] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @findepi about the naming. To me this is more like a constructor, but since I'm not a native English speaker, I don't have a good suggestion to the name. But it would be lovely if we have a better name.
#[derive(Debug, PartialEq, Deserialize, Eq, Clone)] | ||
#[derive(Debug, PartialEq, Deserialize, Eq, Clone, typed_builder::TypedBuilder)] | ||
#[builder(builder_method(name=declarative_builder))] | ||
#[builder(builder_type(name=TableMetadataDeclarativeBuilder, doc="Build a new [`TableMetadata`] in a declarative way. For imperative operations (e.g. `add_snapshot`) and creating new TableMetadata, use [`TableMetadataBuilder`] instead."))] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest, I feel the TypedBuilder
is a little hard to maintain in such a large struct, do you mind to create a standalone struct for this?
Which issue does this PR close?
Currently we don't have a way to create
TableMetadata
in a declarative way other than deserializing it from JSON.The
TableMetadataBuilder
mostly works imperatively with methods likeadd_schema
andadd_partition_spec
that mutate the state incrementally. While incremental modification is what most users need, it is also helpful to offer a type safe way to create a newTableMetadata
object given all of its attributes, other than deserialization from JSON.My concrete use-case is: Lakekeeper stores TableMetadata in Postgres. We extract data typesafe, so we end up with various objects such as a list of schemas, partition_specs, the last_updated_ms and all other fields required to build a new
TableMetadata
. As we already have rust-types, we don't want to serialize them to JSON first. We also can't use theTableMetadata
builder, as it's incremental nature would result in significant overhead.Design considerations
Instead of using the builder approach, we could also add another method
try_from_parts
toimpl TableMetadata
. We have this alternative approach implemented here: https://github.com/lakekeeper/iceberg-rust/blob/a8b6509775b139c92772a999b0cbca637e274a03/crates/iceberg/src/spec/table_metadata.rs#L676-L740I feel that a builder is less intrusive though. I don't like adding the
UnnormalizedTableMetadata
, but didn't find a way to skip this type without writing significantly more code that needs to be maintained. Because of the derived builder in this PR, the approach is almost maintenance free.What changes are included in this PR?
TableMetadataDeclarativeBuilder
UnnormalizedTableMetadata
as an intermediate type that is.try_normalize()
toTableMetadata
Are these changes tested?
yes