`sdf/concepts/window-processing.mdx`
Now, let's explore these components in more detail.
### Tumbling Window
Tumbling windows are defined by a single parameter: `duration`. This specifies the fixed length of each window. Windows are contiguous and non-overlapping.

#### Example:
```YAML
window:
  tumbling:
    duration: 60s   # illustrative value: each window covers a fixed 60-second interval
```

### Sliding Window

Sliding windows are defined by two parameters: `duration` and `slide`.
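
A minimal sketch of this configuration, using the `window.sliding.*` settings described below (the nesting is inferred from those key paths):

```YAML
window:
  sliding:
    duration: 60s   # each window covers the last 60 seconds of data
    slide: 10s      # a new window starts every 10 seconds
```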
Here, window.sliding.duration: 60s and window.sliding.slide: 10s mean that every 10 seconds, a new window is initiated, covering the last 60 seconds of data.
### `assign-timestamp` Operator

The first crucial step in event-time window processing is to determine the event time of each incoming record. This is handled by the `assign-timestamp` operator. The event timestamp is vital because it dictates which window(s) a particular record belongs to.

You must define a custom function (e.g., in Rust) that takes as input a typed value (matching the type configured for the service) and its system event time (the time the record was received by SDF), and returns the logical event timestamp (as an i64 Unix timestamp in milliseconds) to be used for windowing. The snippet below is a sketch; the function name and input type are illustrative and should match your own service configuration:

```rust
// Illustrative signature: `InputMessage` stands in for your configured input type
fn assign_event_timestamp(value: InputMessage, _event_time: i64) -> Result<i64> {
    // Logic to extract or assign a timestamp from the 'value'
    // For example, if 'value' has a 'timestamp' field:
    Ok(value.timestamp)
}
```

In this example, `InputMessage` is a defined type that includes a timestamp field. The function extracts this field to be used as the event time. If the actual event time from the source system is not present in the message, you might use `_event_time` (SDF's ingestion time) or implement logic to derive it.

### Watermark

Watermarks are a fundamental concept in stream processing, representing the point in event time up to which SDF assumes all data has been received for a given partition. They are crucial for triggering window computations (i.e., deciding when a window is "closed") and handling late-arriving data. SDF advances watermarks based on the timestamps assigned by `assign-timestamp`.

The watermark configuration is nested within the `window` block.

- `grace-period`: The grace period (also known as allowed lateness) defines an additional amount of time after a window's end time (as determined by the watermark) during which late-arriving records are still accepted and processed for that window. Records whose timestamps fall within the window's original range but arrive after the watermark has passed the window's end (but before the grace period expires) can still be included.
- `idleness`: The idleness configuration helps in scenarios where no new events arrive for an extended period of time. If the service is idle for the specified duration (i.e., no new messages arrive), SDF can advance the watermark and therefore close a window, preventing the overall system watermark from being stalled. It is generally recommended that the idleness timeout be larger than the window duration to avoid premature window closure.
#### Example:

```YAML
window:
  tumbling:
    duration: 10s
  watermark:
    idleness: 12s      # Consider a partition idle after 12s of no data
    grace-period: 3s   # Allow data to be 3s late for this window
```

In this setup, for a 10-second tumbling window (e.g., for event times 00:00:00 to 00:00:10), even if the watermark has advanced past 00:00:10, records with timestamps within this [0, 10s) interval can still be included if they arrive before the watermark passes 00:00:13 (window_end + grace_period).
Also, if SDF hasn't received any data for 12 seconds, the watermark is advanced, which triggers the closing of the current window and the creation of a new one.

### Inner Window Operators

Within the window block, several operators define how data is processed once its timestamp is assigned and it's notionally part of one or more window instances:
#### transforms (within window context):
151
-
As shown in the overview , transforms can be defined as a direct child of the window block. These transformations (e.g., map, flat-map, filter) are applied to each record after its timestamp has been assigned by assign-timestamp but before it is passed to the partition operator. This allows for data shaping, enrichment, or filtering specific to the windowing logic before key-based operations commence.
150
+
As shown in the overview , transforms can be defined as a direct child of the window block. These transformations (e.g., map, flat-map, filter, filter-map) are applied to each record after its timestamp has been assigned by assign-timestamp but before it is passed to the partition operator. This allows for data shaping, enrichment, or filtering specific to the windowing logic before key-based operations commence.
#### partition
Windowed operations are typically stateful and performed on a per-key basis. The partition operator defines how incoming records (which have already been timestamped and possibly transformed) are grouped by a key, and how state associated with that key is updated for each record.
- `assign-key`: This sub-operator contains a function that takes a record and returns a key. All records sharing the same key will be processed together within their respective window instances, allowing for stateful computations per key.
- `transforms`: A group of operations to run before the `update-state` step; these can be any of the SDF transform operators (filter, map, filter-map, flat-map).
- `update-state`: This sub-operator contains a function that is executed for each record within its assigned partition (i.e., for its key). It typically interacts with a keyed state (defined in the `states` section of the service) to accumulate information or perform calculations.

### `flush` Operator

The flush operator defines the action to be taken when a window instance closes for a particular key (triggered by the watermark advancing past the window's end plus any grace period). This is typically where final aggregations are computed from the accumulated state, and the results for that window instance are emitted. The flush function has access to the state that was built up by update-state operations for the specific key and window that is closing.
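
Putting these pieces together, the operators described above nest inside the `window` block roughly as sketched here. This outline is illustrative only: the operator bodies are omitted, and the exact schema should be checked against the SDF reference.

```YAML
window:
  tumbling:
    duration: 10s
  watermark:
    idleness: 12s
    grace-period: 3s
  assign-timestamp:
    # function returning the event timestamp (i64, milliseconds)
  transforms:
    # optional map / filter / flat-map / filter-map steps
  partition:
    assign-key:
      # function returning the grouping key
    transforms:
      # optional per-key transforms before update-state
    update-state:
      # function that updates the keyed state for each record
  flush:
    # function that emits results from the accumulated state when the window closes
```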
Stateful Dataflows (SDF) is an intuitive platform designed to help developers build, troubleshoot, and run full-featured event-driven pipelines, also known as dataflows.
### How does SDF relate to Fluvio?
SDF leverages Fluvio topics as the mechanism for receiving input and publishing the output of its data processing pipelines.
### Is prior experience with Fluvio required to use SDF effectively?
No, while SDF utilizes Fluvio for message transport, direct interaction with Fluvio by SDF developers is minimal. You don't need in-depth Fluvio expertise to begin using SDF.
### What programming languages are supported for writing custom logic in SDF?
Currently, our tooling exclusively supports Rust. We may explore support for other languages in the future.
### How does SDF ensure low latency of dataflows?
SDF achieves low latency by leveraging the performance characteristics of Rust and WebAssembly (WASM), the compilation target for custom logic within the dataflows.

### Can I perform HTTP(S) requests from operators?
Yes, operators can perform HTTP(S) requests by utilizing the `sdf-http-guest` crate available at https://github.com/infinyon/sdf-http-guest.
### How does SDF handle data serialization and deserialization?
SDF automatically generates serialization and deserialization logic based on the topic configuration defined for the dataflows. Each topic must specify the associated data type and whether the format is JSON or raw (which is only permitted for native data types).
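
As a rough illustration of the idea (the topic name and type below are hypothetical, and the exact keys may vary across SDF versions):

```YAML
topics:
  sensor-readings:           # hypothetical topic name
    schema:
      value:
        type: sensor-reading # hypothetical user-defined type
        converter: json      # or `raw`, which is only permitted for native data types
```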
### What are SDF packages?
An SDF package is a modular unit used to define reusable data types and/or functions that can then be incorporated into dataflows. Utilizing packages is the recommended and most effective way to add custom logic to your dataflows.
### Can I change a namespace in a package after I have implemented the code?
Yes, you can change the namespace of a package after implementation. However, this is a potentially breaking change. If the namespace is modified, any code that imports types or functions from this package will likely need to be updated to reflect the new namespace in their import statements.
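
For orientation, a package's namespace typically lives in the package's metadata. The sketch below is illustrative only; the file name and field names are assumptions rather than a verified schema:

```YAML
# sdf-package.yaml (illustrative)
meta:
  name: my-package      # hypothetical package name
  version: 0.1.0
  namespace: examples   # changing this breaks existing imports of the package
```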
`sdf/index.mdx`

Fluvio [connectors] serve as the interface to external ingress and egress services. Fluvio publishes a growing library of connectors, such as HTTP, Kafka, NATS, SQL, S3, etc. Connectors are easy to build, test, deploy, and share with the community.
## Next Step

In the [Let's Get Started] section, we'll walk through the steps to get started with Stateful Dataflows.

For upgrading InfinyOn Cloud (remote) workers, please contact [InfinyOn support](#infinyon-support).

### Improvements

- Improve error output for some of the YAML parsing errors.
- Do not require `type-name` in list types.
- Avoid generating invalid WIT names.
- Use the dev runtime by default on `sdf test`.
- Add validation to check whether a package namespace/name conflicts with the dataflow.
- Improve state recovery for window services on restart.

### Changes

- Make types respect the casing defined in YAML for deserialization and serialization when using `apiVersion: 0.6.0`.