Skip to content

tansu-io/example-delta-lake

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Tansu is an Apache Kafka API compatible broker written in async 🚀 Rust 🦀 with PostgreSQL, S3 or memory storage engines.

Topics validated by JSON Schema, Apache Avro or Protocol buffers can be written as Apache Iceberg or Delta Lake tables

This repository showcases examples of structured data published to schema-backed topics, instantly accessible as Delta Lake tables.

Prerequisites:

The justfile contains recipes to create topics, produce and query data.

Once you have the prerequisites installed, clone this repository and start everything up with:

git clone git@github.com:tansu-io/example-delta-duckdb.git
cd example-pyiceberg

Start everything up with:

just up

Should result in:

âś” Network example-delta-duckdb_default    Created
âś” Volume "example-delta-duckdb_minio"     Created
âś” Container example-delta-duckdb-minio-1  Healthy
docker compose exec minio mc ready local
mc: Configuration written to `/tmp/.mc/config.json`. Please update your access credentials.
mc: Successfully created `/tmp/.mc/share`.
mc: Initialized share uploads `/tmp/.mc/share/uploads.json` file.
mc: Initialized share downloads `/tmp/.mc/share/downloads.json` file.
The cluster 'local' is ready
docker compose exec minio mc alias set local http://localhost:9000 minioadmin minioadmin
Added `local` successfully.
docker compose exec minio mc mb local/tansu
Bucket created successfully `local/tansu`.
docker compose exec minio mc mb local/lake
Bucket created successfully `local/lake`.
docker compose up --detach --wait tansu
âś” tansu Pulled
âś” Container example-delta-duckdb-tansu-1  Healthy

The above does the following:

  • starts the MinIO S3 compatible service
  • creates a s3://lake bucket in MinIO, used to store the Delta Lake tables
  • creates a s3://tansu bucket in MinIO, used to store Kafka related data used by tansu
  • runs the tansu broker configured to use MinIO as the storage engine with Delta Lake

Done! You can now run the examples.

Employee

Employee is a protocol buffer backed topic, with the following schema employee.proto:

syntax = 'proto3';

message Key {
  int32 id = 1;
}

message Value {
  string name = 1;
  string email = 2;
}

Sample employee data is in employees.json:

[
  {
    "key": { "id": 12321 },
    "value": { "name": "Bob", "email": "bob@example.com" }
  },
  {
    "key": { "id": 32123 },
    "value": { "name": "Alice", "email": "alice@example.com" }
  }
]

Create the employee topic:

just employee-topic-create

The above command will create an tansu.employee Delta Lake table, that is normalized and partitioned on the meta.year, meta.month, meta.day from the Kafka message:

config value
tansu.lake.partition meta.year,meta.month,meta.day
tansu.lake.normalize true

Publish the sample data onto the employee topic:

just employee-produce

We can view the Delta Lake table created in s3://lake/tansu.employee with:

just minio-mc ls -r local/lake/tansu.employee

Note that the tansu.employee table, is partitioned on the meta.year, meta.month, meta.day:

docker compose exec minio mc ls -r local/lake/tansu.employee
[2025-05-30 06:56:41 UTC] 2.5KiB STANDARD _delta_log/00000000000000000000.json
[2025-05-30 06:56:41 UTC]   998B STANDARD _delta_log/00000000000000000001.json
[2025-05-30 06:56:41 UTC] 1.8KiB STANDARD meta.year=2025/meta.month=5/meta.day=30/part-00000-47182dcc-6071-4836-8233-1ae50678194e-c000.parquet

To view the Delta Lake table in DuckDB:

just employee-duckdb-delta

Giving the following output:

meta.partition meta.timestamp meta.year meta.month meta.day key.id value.name value.email
0 2025-05-30 06:56:41.136 2025 5 30 12321 Bob bob@example.com
0 2025-05-30 06:56:41.136 2025 5 30 32123 Alice alice@example.com

Grade

Grade is a JSON schema backed topic, with the following schema grade.json:

{
  "type": "record",
  "name": "Grade",

  "fields": [
    { "name": "key", "type": "string", "pattern": "^\\d{3}-\\d{2}-\\d{4}$" },
    {
      "name": "value",
      "type": {
        "type": "record",
        "fields": [
          { "name": "first", "type": "string" },
          { "name": "last", "type": "string" },
          { "name": "test1", "type": "double" },
          { "name": "test2", "type": "double" },
          { "name": "test3", "type": "double" },
          { "name": "test4", "type": "double" },
          { "name": "final", "type": "double" },
          { "name": "grade", "type": "string" }
        ]
      }
    }
  ]
}

Sample grade data is in: grades.json:

[
  {
    "key": "123-45-6789",
    "value": {
      "lastName": "Alfalfa",
      "firstName": "Aloysius",
      "test1": 40.0,
      "test2": 90.0,
      "test3": 100.0,
      "test4": 83.0,
      "final": 49.0,
      "grade": "D-"
    }
  },
  ...
]

Create the grade topic:

just grade-topic-create

The above command will create an tansu.grade Delta Lake table, that is normalized, partitioned on meta.year from the Kafka message, the Z Order of the data is value.grade :

config value
tansu.lake.partition meta_year
tansu.lake.normalize true
tansu.lake.normalize.separator _
tansu.lake.z_order value_grade

Tansu will automatically maintain this table compacting small files and applying Z Ordering every 10 minutes or so.

Publish the sample data onto the grade topic:

just grade-produce

We can view the files created by Tansu in s3://lake/tansu.grade with:

just minio-mc ls -r local/lake/tansu.grade

Note that the tansu.grade table, is partitioned on the meta.year:

[2025-05-31 06:57:19 UTC] 3.4KiB STANDARD _delta_log/00000000000000000000.json
[2025-05-31 06:57:19 UTC] 1.4KiB STANDARD _delta_log/00000000000000000001.json
[2025-05-31 06:57:19 UTC] 5.4KiB STANDARD meta.year=2025/part-00000-bd6bce1a-0288-4ab2-9a40-01dc0bff2199-c000.parquet

View the data in DuckDB:

just grade-duckdb-delta

Giving the following output, note that the grade is unordered:

meta_day meta_month meta_partition meta_timestamp meta_year key value_final value_first value_grade value_last value_test1 value_test2 value_test3 value_test4
31 5 0 2025-05-31T08:05:37.956+00:00 2025 345-67-8901 43.0 Cecil F Noshow 45.0 11.0 -1.0 4.0
31 5 0 2025-05-31T08:05:37.956+00:00 2025 123-12-1234 48.0 University D+ Alfred 41.0 97.0 96.0 97.0
31 5 0 2025-05-31T08:05:37.956+00:00 2025 123-45-6789 49.0 Aloysius D- Alfalfa 40.0 90.0 100.0 83.0
31 5 0 2025-05-31T08:05:37.956+00:00 2025 234-56-7890 46.0 Betty C- Rubble 44.0 90.0 80.0 90.0
31 5 0 2025-05-31T08:05:37.956+00:00 2025 567-89-0123 44.0 Gramma C Gerty 41.0 80.0 60.0 40.0
31 5 0 2025-05-31T08:05:37.956+00:00 2025 632-79-9939 50.0 Bif B+ Buff 46.0 20.0 30.0 40.0
31 5 0 2025-05-31T08:05:37.956+00:00 2025 456-78-9012 45.0 Fred A- Bumpkin 43.0 78.0 88.0 77.0
31 5 0 2025-05-31T08:05:37.956+00:00 2025 087-65-4321 47.0 Electric B- Android 42.0 23.0 36.0 45.0
31 5 0 2025-05-31T08:05:37.956+00:00 2025 345-67-3901 4.0 Boy B George 40.0 1.0 11.0 -1.0
31 5 0 2025-05-31T08:05:37.956+00:00 2025 143-12-1234 97.0 Jim A+ Backus 48.0 1.0 97.0 96.0
31 5 0 2025-05-31T08:05:37.956+00:00 2025 565-89-0123 40.0 Art D+ Carnivore 44.0 1.0 80.0 60.0
31 5 0 2025-05-31T08:05:37.956+00:00 2025 632-79-9439 40.0 Harvey C Heffalump 30.0 1.0 20.0 30.0
31 5 0 2025-05-31T08:05:37.956+00:00 2025 234-56-2890 90.0 Benny B- Franklin 50.0 1.0 90.0 80.0
31 5 0 2025-05-31T08:05:37.956+00:00 2025 223-45-6789 83.0 Andrew A Airpump 49.0 1.0 90.0 100.0
31 5 0 2025-05-31T08:05:37.956+00:00 2025 087-75-4321 45.0 Jim C+ Dandy 47.0 1.0 23.0 36.0
31 5 0 2025-05-31T08:05:37.956+00:00 2025 456-71-9012 77.0 Ima B- Elephant 45.0 1.0 78.0 88.0

After maintenance has run the table is ordered by grade:

meta_day meta_month meta_partition meta_timestamp meta_year key value_final value_first value_grade value_last value_test1 value_test2 value_test3 value_test4
31 5 0 2025-05-31T08:31:39.065+00:00 2025 223-45-6789 83.0 Andrew A Airpump 49.0 1.0 90.0 100.0
31 5 0 2025-05-31T08:31:39.065+00:00 2025 143-12-1234 97.0 Jim A+ Backus 48.0 1.0 97.0 96.0
31 5 0 2025-05-31T08:31:39.065+00:00 2025 456-78-9012 45.0 Fred A- Bumpkin 43.0 78.0 88.0 77.0
31 5 0 2025-05-31T08:31:39.065+00:00 2025 345-67-3901 4.0 Boy B George 40.0 1.0 11.0 -1.0
31 5 0 2025-05-31T08:31:39.065+00:00 2025 632-79-9939 50.0 Bif B+ Buff 46.0 20.0 30.0 40.0
31 5 0 2025-05-31T08:31:39.065+00:00 2025 087-65-4321 47.0 Electric B- Android 42.0 23.0 36.0 45.0
31 5 0 2025-05-31T08:31:39.065+00:00 2025 456-71-9012 77.0 Ima B- Elephant 45.0 1.0 78.0 88.0
31 5 0 2025-05-31T08:31:39.065+00:00 2025 234-56-2890 90.0 Benny B- Franklin 50.0 1.0 90.0 80.0
31 5 0 2025-05-31T08:31:39.065+00:00 2025 567-89-0123 44.0 Gramma C Gerty 41.0 80.0 60.0 40.0
31 5 0 2025-05-31T08:31:39.065+00:00 2025 632-79-9439 40.0 Harvey C Heffalump 30.0 1.0 20.0 30.0
31 5 0 2025-05-31T08:31:39.065+00:00 2025 087-75-4321 45.0 Jim C+ Dandy 47.0 1.0 23.0 36.0
31 5 0 2025-05-31T08:31:39.065+00:00 2025 234-56-7890 46.0 Betty C- Rubble 44.0 90.0 80.0 90.0
31 5 0 2025-05-31T08:31:39.065+00:00 2025 123-12-1234 48.0 University D+ Alfred 41.0 97.0 96.0 97.0
31 5 0 2025-05-31T08:31:39.065+00:00 2025 565-89-0123 40.0 Art D+ Carnivore 44.0 1.0 80.0 60.0
31 5 0 2025-05-31T08:31:39.065+00:00 2025 123-45-6789 49.0 Aloysius D- Alfalfa 40.0 90.0 100.0 83.0
31 5 0 2025-05-31T08:31:39.065+00:00 2025 345-67-8901 43.0 Cecil F Noshow 45.0 11.0 -1.0 4.0

Observation

Observation is an Avro backed topic, with the following schema observation.avsc:

{
  "type": "record",
  "name": "observation",
  "fields": [
    { "name": "key", "type": "string", "logicalType": "uuid" },
    {
      "name": "value",
      "type": "record",
      "fields": [
        { "name": "amount", "type": "double" },
        { "name": "unit", "type": "enum", "symbols": ["CELSIUS", "MILLIBAR"] }
      ]
    }
  ]
}

Sample observation data, is in: observations.json:

[
  {
    "key": "1E44D9C2-5E7A-443B-BF10-2B1E5FD72F15",
    "value": { "amount": 23.2, "unit": "CELSIUS" }
  },
  ...
]

Create the observation topic:

just observation-topic-create

Publish the sample data onto the observation topic:

just observation-produce

View the data in DuckDB:

just observation-duckdb-delta

Giving the following output:

key value meta
1e44d9c2-5e7a-443b-bf10-2b1e5fd72f15 {'amount': 23.2, 'unit': CELSIUS} {'partition': 0, 'timestamp': 2025-05-25 14:08:24.539}
1e44d9c2-5e7a-443b-bf10-2b1e5fd72f15 {'amount': 1027.0, 'unit': MILLIBAR} {'partition': 0, 'timestamp': 2025-05-25 14:08:24.539}
1e44d9c2-5e7a-443b-bf10-2b1e5fd72f15 {'amount': 22.8, 'unit': CELSIUS} {'partition': 0, 'timestamp': 2025-05-25 14:08:24.539}
1e44d9c2-5e7a-443b-bf10-2b1e5fd72f15 {'amount': 1023.0, 'unit': MILLIBAR} {'partition': 0, 'timestamp': 2025-05-25 14:08:24.539}
1e44d9c2-5e7a-443b-bf10-2b1e5fd72f15 {'amount': 22.5, 'unit': CELSIUS} {'partition': 0, 'timestamp': 2025-05-25 14:08:24.539}
1e44d9c2-5e7a-443b-bf10-2b1e5fd72f15 {'amount': 1018.0, 'unit': MILLIBAR} {'partition': 0, 'timestamp': 2025-05-25 14:08:24.539}
1e44d9c2-5e7a-443b-bf10-2b1e5fd72f15 {'amount': 23.1, 'unit': CELSIUS} {'partition': 0, 'timestamp': 2025-05-25 14:08:24.539}
1e44d9c2-5e7a-443b-bf10-2b1e5fd72f15 {'amount': 1020.0, 'unit': MILLIBAR} {'partition': 0, 'timestamp': 2025-05-25 14:08:24.539}
1e44d9c2-5e7a-443b-bf10-2b1e5fd72f15 {'amount': 23.4, 'unit': CELSIUS} {'partition': 0, 'timestamp': 2025-05-25 14:08:24.539}
1e44d9c2-5e7a-443b-bf10-2b1e5fd72f15 {'amount': 1025.0, 'unit': MILLIBAR} {'partition': 0, 'timestamp': 2025-05-25 14:08:24.539}

Person

Person is a JSON schema backed topic, with the following schema person.json:

{
  "title": "Person",
  "type": "object",
  "properties": {
    "key": {
      "type": "string",
      "pattern": "^\\d{3}-\\d{2}-\\d{4}$"
    },
    "value": {
      "type": "object",
      "properties": {
        "firstName": {
          "type": "string",
          "description": "The person's first name."
        },
        "lastName": {
          "type": "string",
          "description": "The person's last name."
        },
        "age": {
          "description": "Age in years which must be equal to or greater than zero.",
          "type": "integer",
          "minimum": 0
        }
      }
    }
  }
}

Sample person data, is in persons.json:

[
  {
    "key": "123-45-6789",
    "value": { "lastName": "Alfalfa", "firstName": "Aloysius", "age": 21 }
  },
  ...
]

Create the person topic:

just person-topic-create

Publish the sample data onto the person topic:

just person-produce

View the data in DuckDB:

just person-duckdb-delta

Giving the following output:

key value
123-45-6789 {'age': 21, 'firstName': Aloysius, 'lastName': Alfalfa}
123-12-1234 {'age': 52, 'firstName': University, 'lastName': Alfred}
567-89-0123 {'age': 35, 'firstName': Gamma, 'lastName': Gerty}
087-65-4321 {'age': 23, 'firstName': Electric, 'lastName': Android}
456-78-9012 {'age': 72, 'firstName': Fred, 'lastName': Bumpkin}
234-56-7890 {'age': 44, 'firstName': Betty, 'lastName': Rubble}
345-67-8901 {'age': 67, 'firstName': Cecil, 'lastName': Noshow}
632-79-9939 {'age': 38, 'firstName': Buff, 'lastName': Bif}
223-45-6789 {'age': 42, 'firstName': Andrew, 'lastName': Airpump}
143-12-1234 {'age': 63, 'firstName': Jim, 'lastName': Backus}
565-89-0123 {'age': 29, 'firstName': Art, 'lastName': Carnivore}
087-75-4321 {'age': 56, 'firstName': Jim, 'lastName': Dandy}
456-71-9012 {'age': 45, 'firstName': Ima, 'lastName': Elephant}
234-56-2890 {'age': 54, 'firstName': Benny, 'lastName': Franklin}
345-67-3901 {'age': 91, 'firstName': Boy, 'lastName': George}
632-79-9439 {'age': 17, 'firstName': Harvey, 'lastName': Heffalump}

Search

Search is a protocol buffer backedd topic, with the following schema search.proto:

syntax = 'proto3';

enum Corpus {
  CORPUS_UNSPECIFIED = 0;
  CORPUS_UNIVERSAL = 1;
  CORPUS_WEB = 2;
  CORPUS_IMAGES = 3;
  CORPUS_LOCAL = 4;
  CORPUS_NEWS = 5;
  CORPUS_PRODUCTS = 6;
  CORPUS_VIDEO = 7;
}

message Value {
  string query = 1;
  int32 page_number = 2;
  int32 results_per_page = 3;
  Corpus corpus = 4;
}

Sample search data, is in searches.json:

[
  {
    "value": {
      "query": "abc/def",
      "page_number": 6,
      "results_per_page": 13,
      "corpus": "CORPUS_WEB"
    }
  }
]

Create the search topic:

just search-topic-create

Publish the sample data onto the search topic:

just search-produce

View the data in DuckDB:

just search-duckdb-delta

Giving the following output:

meta value
{'partition': 0, 'timestamp': 2025-05-25 15:06:18.507} {'query': abc/def, 'page_number': 6, 'results_per_page': 13, 'corpus': 2}

Taxi

Taxi is a protocol buffer backed topic, with the following schema taxi.proto:

syntax = 'proto3';

enum Flag {
    N = 0;
    Y = 1;
}

message Value {
  int64 vendor_id = 1;
  int64 trip_id = 2;
  float trip_distance = 3;
  double fare_amount = 4;
  Flag store_and_fwd = 5;
}

Sample trip data, is in trips.json:

[
  {
    "value": {
      "vendor_id": 1,
      "trip_id": 1000371,
      "trip_distance": 1.8,
      "fare_amount": 15.32,
      "store_and_fwd": "N"
    }
  },
  ...
]

Create the taxi topic:

just taxi-topic-create

Publish the sample data onto the taxi topic:

just taxi-produce

View the data in DuckDB:

just taxi-duckdb-delta

Giving the following output:

meta value
{'partition': 0, 'timestamp': 2025-05-25 14:58:11.719} {'vendor_id': 1, 'trip_id': 1000371, 'trip_distance': 1.8, 'fare_amount': 15.32, 'store_and_fwd': 0}
{'partition': 0, 'timestamp': 2025-05-25 14:58:11.719} {'vendor_id': 2, 'trip_id': 1000372, 'trip_distance': 2.5, 'fare_amount': 22.15, 'store_and_fwd': 0}
{'partition': 0, 'timestamp': 2025-05-25 14:58:11.719} {'vendor_id': 2, 'trip_id': 1000373, 'trip_distance': 0.9, 'fare_amount': 9.01, 'store_and_fwd': 0}
{'partition': 0, 'timestamp': 2025-05-25 14:58:11.719} {'vendor_id': 1, 'trip_id': 1000374, 'trip_distance': 8.4, 'fare_amount': 42.13, 'store_and_fwd': 1}

About

Tansu schema-backed topics, instantly accessible as Delta Lake tables

Topics

Resources

Stars

Watchers

Forks

Languages