Implement encode/decode for ID columns + tests, docs
This PR allows users to specify the special value of 'func' for p_epoch
to use custom functions to encode/decode time-ordered integers other than
the classic seconds, ms, us or ns since the UNIX epoch.

Resolves pgpartman#729.
calebj committed Jan 7, 2025
1 parent 949bf56 commit 64940d8
Showing 15 changed files with 2,819 additions and 9 deletions.
8 changes: 4 additions & 4 deletions doc/pg_partman.md
@@ -207,12 +207,12 @@ RETURNS boolean
* An ACCESS EXCLUSIVE lock is taken on the parent table during the running of this function. No data is moved when running this function, so lock should be brief
* A default partition and template table are created by default unless otherwise configured
* `p_parent_table` - the existing parent table. MUST be schema qualified, even if in public schema
* `p_control` - the column that the partitioning will be based on. Must be a time, integer, text or uuid based column. When control is of type text/uuid, p_time_encoder and p_time_decoder must be set.
* `p_control` - the column that the partitioning will be based on. Must be a time, integer, text or UUID based column. When control is of type text/UUID, p_time_encoder and p_time_decoder must be set.
* `p_interval` - the time or integer range interval for each partition. No matter the partitioning type, value must be given as text.
+ *\<interval\>* - Any valid value for the interval data type. Do not type cast the parameter value, just leave as text.
+ *\<integer\>* - For ID based partitions, the integer value range of the ID that should be set per partition. Enter this as an integer in text format ('100' not 100). If the interval is >=2, then the `p_type` must be `range`. If the interval equals 1, then the `p_type` must be `list`. Also note that while numeric values are supported for id-based partitioning, the interval must still be a whole number integer.
* `p_type` - the type of partitioning to be done. Currently only **range** and **list** are supported. See `p_interval` parameter for special conditions concerning type.
* `p_epoch` - tells `pg_partman` that the control column is an integer type, but actually represents and epoch time value. Valid values for this option are: 'seconds', 'milliseconds', 'microseconds', 'nanoseconds', and 'none'. The default is 'none'. All table names will be time-based. In addition to a normal index on the control column, be sure you create a functional, time-based index on the control column (to_timestamp(controlcolumn)) as well so this works efficiently.
* `p_epoch` - tells `pg_partman` that the control column is an integer type, but actually represents an epoch time value or integer containing an encoded timestamp. Valid values for this option are: 'seconds', 'milliseconds', 'microseconds', 'nanoseconds', 'func', and 'none'. The default is 'none'. All table names will be time-based. For 'func', encode/decode functions between the integer type used and `timestamptz` are required. In addition to a normal index on the control column, be sure you create a functional, time-based index on the control column (to_timestamp(controlcolumn)) as well so this works efficiently.
* `p_premake` - is how many additional partitions to always stay ahead of the current partition. Default value is 4. This will keep at minimum 5 partitions made, including the current one. For example, if today was Sept 6th, and `premake` was set to 4 for a daily partition, then partitions would be made for the 6th as well as the 7th, 8th, 9th and 10th. Note some intervals may occasionally cause an extra partition to be premade or one to be missed due to leap years, differing month lengths, etc. This usually won't hurt anything and should self-correct (see **About** section concerning timezones and non-UTC). If partitioning ever falls behind the `premake` value, normal running of `run_maintenance()` and data insertion should automatically catch things up.
* `p_start_partition` - allows the first partition of a set to be specified instead of it being automatically determined. Must be a valid timestamp (for time-based) or positive integer (for id-based) value. Be aware, though, the actual parameter data type is text. For time-based partitioning, all partitions starting with the given timestamp up to CURRENT_TIMESTAMP (plus `premake`) will be created. For id-based partitioning, only the partition starting at the given value (plus `premake`) will be made. Note that for subpartitioning, this only applies during initial setup and not during ongoing maintenance.
* `p_default_table` - boolean flag to determine whether a default table is created. Defaults to true.
@@ -222,8 +222,8 @@ RETURNS boolean
* `p_jobmon` - allow `pg_partman` to use the `pg_jobmon` extension to monitor that partitioning is working correctly. Defaults to TRUE.
* `p_date_trunc_interval` - By default, pg_partman's time-based partitioning will truncate the child table starting values to line up at the beginning of typical boundaries (midnight for daily, day 1 for monthly, Jan 1 for yearly, etc). If a partitioning interval that does not fall on those boundaries is desired, this option may be required to ensure the child table has the expected boundaries (especially if you also set `p_start_partition`). The valid values allowed for this parameter are the interval values accepted by PostgreSQL's built-in `date_trunc()` function (day, week, month, etc). For example, if you set a 9-week interval, by default pg_partman would truncate the tables by month (since the interval is greater than one month but less than 1 year) and unexpectedly start on the first of the month in some cases. Set this parameter value to `week`, so that the child table start values are properly truncated on a weekly basis to line up with the 9-week interval. If you are using a custom time interval, please experiment with this option to get the expected set of child tables you desire or use a more typical partitioning interval to simplify partition management.
* `p_control_not_null` - By default, this value is true and the control column must be set to NOT NULL. Setting this to false allows the control column to be NULL. Allowing this is not advised without very careful review and an explicit use-case defined as it can cause excessive data in the DEFAULT child partition.
* `p_time_encoder` - name of function that encodes a timestamp into a string representing your partition bounds. Setting this implicitly enables time based partitioning and is mandatory for text/uuid control column types. This enables partitioning tables using time based identifiers like uuidv7, ulid, snowflake ids and others. The function must handle NULL input safely. See test-time-daily.sql and test-uuid-daily for usage examples.
* `p_time_decoder` - name of function that decodes a text/uuid control value into a timestamp. Setting this implicitly enables time based partitioning and is mandatory for text/uuid control column types. This enables partitioning tables using time based identifiers like uuidv7, ulid, snowflake ids and others. The function must handle NULL input safely. See test-time-daily.sql and test-uuid-daily for usage examples.
* `p_time_encoder` - name of function that encodes a `timestamptz` into a string or integer representing your partition bounds. Setting this implicitly enables time based partitioning and is mandatory for text/UUID control column types, or an integer control column with `p_epoch` = 'func'. This enables partitioning tables using time based identifiers like UUIDv7, ULID, snowflake ids and others. The function must handle NULL input safely. See test-time-daily.sql and test-uuid-daily for usage examples.
* `p_time_decoder` - name of function that decodes a text/UUID or integer control value into a `timestamptz`. Setting this implicitly enables time based partitioning and is mandatory for text/UUID control column types, or an integer control column with `p_epoch` = 'func'. This enables partitioning tables using time based identifiers like UUIDv7, ULID, snowflake ids and others. The function must handle NULL input safely. See test-time-daily.sql and test-uuid-daily for usage examples.


<a id="create_sub_parent"></a>
102 changes: 102 additions & 0 deletions doc/pg_partman_howto.md
@@ -4,6 +4,7 @@ Example Guide On Setting Up Native Partitioning
- [Simple Time Based: 1 Partition Per Day](#simple-time-based-1-partition-per-day)
- [Simple Time Based with UUIDv7 type: 1 Partition Per Day](#simple-time-based-with-uuidv7-type-1-partition-per-day)
- [Simple Time Based with Text Type: 1 Partition Per Day](#simple-time-based-with-text-type-1-partition-per-day)
- [Simple Time Based with Snowflake IDs: 1 Partition Per Hour](#simple-time-based-with-snowflake-ids-1-partition-per-hour)
- [Simple Serial ID: 1 Partition Per 10 ID Values](#simple-serial-id-1-partition-Per-10-id-values)
- [Partitioning an Existing Table](#partitioning-an-existing-table)
* [Offline Partitioning](#offline-partitioning)
@@ -285,6 +286,107 @@ Indexes:
"time_taptest_table_p20240815_pkey" PRIMARY KEY, btree (col3)
Access method: heap
```

### Simple Time Based with Snowflake IDs: 1 Partition Per Hour
This example demonstrates how to use an integer control column whose values encode a timestamp together with other data.

```sql
CREATE SCHEMA IF NOT EXISTS partman_test;

CREATE TABLE partman_test.time_taptest_table(
col1 BIGINT NOT NULL PRIMARY KEY,
col2 text default 'stuff')
PARTITION BY RANGE (col1);
```

```sql
\d+ partman_test.time_taptest_table
Partitioned table "partman_test.time_taptest_table"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
--------+--------+-----------+----------+---------------+----------+-------------+--------------+-------------
col1 | bigint | | not null | | plain | | |
col2 | text | | | 'stuff'::text | extended | | |
Partition key: RANGE (col1)
Indexes:
"time_taptest_table_pkey" PRIMARY KEY, btree (col1)
Number of partitions: 0
```

Snowflake IDs are used in some distributed systems to generate unique, time-ordered IDs without centralization or coordination between nodes. X, Discord, Mastodon and Instagram are known to use these identifiers, and this example uses [Discord's scheme](https://discord.com/developers/docs/reference#snowflakes). The timestamp, in milliseconds, is encoded in the top 42 bits of a 64-bit integer, and the remaining 22 bits hold worker data and a counter. Discord also measures time from the start of 2015 UTC instead of the UNIX epoch of 1970 UTC, a gap of 1420070400 seconds. Because BIGINT is a signed type, only 63 bits are available for positive values, which leaves 41 bits for the timestamp; that is still sufficient to hold Discord IDs until September 2084.
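
The September 2084 limit can be checked with a little arithmetic. The query below is only a sketch and not part of `pg_partman`: it takes the largest positive BIGINT, shifts away the 22 worker/counter bits, and converts the remaining milliseconds back to a timestamp using the epoch offset quoted above.

```sql
-- 9223372036854775807 is 2^63 - 1, the largest positive BIGINT.
-- Shifting right by 22 bits leaves the millisecond timestamp portion.
SELECT to_timestamp(
         ((9223372036854775807::bigint >> 22) / 1000.0 + 1420070400)::double precision
       ) AT TIME ZONE 'UTC' AS last_representable_utc;
```

The result should fall in September 2084.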

The following functions encode a timestamp as a snowflake ID and decode a snowflake ID back to a timestamp, respectively. Note that when encoding the timestamp, the worker/counter bits are zero, so the returned value is useful as a partition boundary, not as a real ID.

```sql
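-- Encode a timestamp as the lowest snowflake ID for that instant (worker/counter bits are zero).
CREATE FUNCTION public.timestamp_to_snowflake(p_timestamp timestamptz, OUT encoded bigint)
    RETURNS bigint
    LANGUAGE plpgsql IMMUTABLE STRICT PARALLEL SAFE
AS $$
BEGIN
    -- The ::BIGINT cast binds to the seconds value, so the input is rounded to whole seconds
    -- before being converted to milliseconds and shifted past the 22 low bits.
    SELECT 1000*(EXTRACT(epoch FROM p_timestamp) - 1420070400)::BIGINT << 22 INTO encoded;
END
$$;

-- Decode the timestamp portion of a snowflake ID; the worker/counter bits are discarded.
CREATE FUNCTION public.snowflake_to_timestamp(p_snowflake bigint, OUT ts timestamptz)
    RETURNS TIMESTAMPTZ
    LANGUAGE plpgsql IMMUTABLE STRICT PARALLEL SAFE
AS $$
BEGIN
    -- Integer division by 1000 truncates the milliseconds, giving whole-second precision.
    SELECT TO_TIMESTAMP((p_snowflake >> 22)/1000 + 1420070400) INTO ts;
END
$$;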
```
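
Before handing these functions to `pg_partman`, it can help to check them by hand. The queries below are a small sanity check, not part of the extension. The timestamp was picked to match the lower bound of the `p20250107_030000` partition shown in the `\d+` output of this example, and the `12345` added in the second query is an arbitrary stand-in for worker/counter bits.

```sql
-- Encode an hour boundary; the worker/counter bits of the result are zero.
SELECT public.timestamp_to_snowflake('2025-01-07 03:00:00+00');
-- Expected: 1326022498713600000

-- Decode an ID that also carries worker/counter bits; the shift discards them.
SELECT public.snowflake_to_timestamp(1326022498713600000 + 12345)
     = '2025-01-07 03:00:00+00'::timestamptz AS same_instant;
-- Expected: t
```

Because both functions work at whole-second precision, the round trip is exact for partition boundaries but does not preserve the milliseconds of a real ID.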

Now we will instruct partman to use the snowflake encoder and decoder functions with the special value 'func' for `p_epoch`.

```sql
SELECT partman.create_parent('partman_test.time_taptest_table'
, p_control := 'col1'
, p_interval := '1 hour'
, p_epoch := 'func'
, p_time_encoder := 'public.timestamp_to_snowflake'
, p_time_decoder := 'public.snowflake_to_timestamp'
);
create_parent
---------------
t
(1 row)
```
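
To confirm what was stored, the configuration row can be inspected. This is a sketch that assumes the extension lives in the `partman` schema, as in the call above; the column names are the ones read from `part_config` by the functions changed in this commit.

```sql
-- Show the epoch mode and the encoder/decoder registered for the parent table.
SELECT parent_table
     , control
     , partition_interval
     , epoch
     , time_encoder
     , time_decoder
FROM partman.part_config
WHERE parent_table = 'partman_test.time_taptest_table';
```

The `epoch` column should read `func`, and the two function names should match the ones passed to `create_parent()`.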

```sql
\d+ partman_test.time_taptest_table
Partitioned table "partman_test.time_taptest_table"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
--------+--------+-----------+----------+---------------+----------+-------------+--------------+-------------
col1 | bigint | | not null | | plain | | |
col2 | text | | | 'stuff'::text | extended | | |
Partition key: RANGE (col1)
Indexes:
"time_taptest_table_pkey" PRIMARY KEY, btree (col1)
Partitions: partman_test.time_taptest_table_p20250107_030000 FOR VALUES FROM ('1326022498713600000') TO ('1326037598208000000'),
partman_test.time_taptest_table_p20250107_040000 FOR VALUES FROM ('1326037598208000000') TO ('1326052697702400000'),
partman_test.time_taptest_table_p20250107_050000 FOR VALUES FROM ('1326052697702400000') TO ('1326067797196800000'),
partman_test.time_taptest_table_p20250107_060000 FOR VALUES FROM ('1326067797196800000') TO ('1326082896691200000'),
partman_test.time_taptest_table_p20250107_070000 FOR VALUES FROM ('1326082896691200000') TO ('1326097996185600000'),
partman_test.time_taptest_table_p20250107_080000 FOR VALUES FROM ('1326097996185600000') TO ('1326113095680000000'),
partman_test.time_taptest_table_p20250107_090000 FOR VALUES FROM ('1326113095680000000') TO ('1326128195174400000'),
partman_test.time_taptest_table_p20250107_100000 FOR VALUES FROM ('1326128195174400000') TO ('1326143294668800000'),
partman_test.time_taptest_table_p20250107_110000 FOR VALUES FROM ('1326143294668800000') TO ('1326158394163200000'),
partman_test.time_taptest_table_default DEFAULT
```
```sql
\d+ partman_test.time_taptest_table_p20250107_030000
Table "partman_test.time_taptest_table_p20250107_030000"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
--------+--------+-----------+----------+---------------+----------+-------------+--------------+-------------
col1 | bigint | | not null | | plain | | |
col2 | text | | | 'stuff'::text | extended | | |
Partition of: partman_test.time_taptest_table FOR VALUES FROM ('1326022498713600000') TO ('1326037598208000000')
Partition constraint: ((col1 IS NOT NULL) AND (col1 >= '1326022498713600000'::bigint) AND (col1 < '1326037598208000000'::bigint))
Indexes:
"time_taptest_table_p20250107_030000_pkey" PRIMARY KEY, btree (col1)
Access method: heap
```
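
As a quick check that rows are routed by their encoded time, a row can be inserted with an ID built for the current time. This is an illustrative sketch: the low bits are set to `1` only to stand in for the worker/counter data of a real ID.

```sql
-- Build an ID for the current time; the low 22 bits would normally hold worker/counter data.
INSERT INTO partman_test.time_taptest_table (col1)
VALUES (public.timestamp_to_snowflake(now()) | 1::bigint);

-- tableoid::regclass reports which child table actually stores each row.
SELECT tableoid::regclass AS child_table
     , col1
     , public.snowflake_to_timestamp(col1) AS decoded_time
FROM partman_test.time_taptest_table;
```

If a child table covering the current hour exists, the row is stored there; otherwise it is stored in the default partition.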

### Simple Serial ID: 1 Partition Per 10 ID Values
For this use-case, the template table is not created manually before calling `create_parent()`. So it shows that if a primary/unique key is added later, it does not apply to the currently existing child tables. That will have to be done manually.

4 changes: 3 additions & 1 deletion sql/functions/create_parent.sql
@@ -182,7 +182,9 @@ IF v_control_type NOT IN ('time', 'id', 'text', 'uuid') THEN
RAISE EXCEPTION 'Only date/time, text/uuid or integer types are allowed for the control column.';
ELSIF v_control_type IN ('text', 'uuid') AND (p_time_encoder IS NULL OR p_time_decoder IS NULL) THEN
RAISE EXCEPTION 'p_time_encoder and p_time_decoder needs to be set for text/uuid type control column.';
ELSIF v_control_type NOT IN ('text', 'uuid') AND (p_time_encoder IS NOT NULL OR p_time_decoder IS NOT NULL) THEN
ELSIF v_control_type = 'id' AND p_epoch = 'func' AND (p_time_encoder IS NULL OR p_time_decoder IS NULL) THEN
RAISE EXCEPTION 'p_time_encoder and p_time_decoder functions need to be set for p_epoch=func to work.';
ELSIF v_control_type NOT IN ('text', 'uuid', 'id') AND (p_time_encoder IS NOT NULL OR p_time_decoder IS NOT NULL) THEN
RAISE EXCEPTION 'p_time_encoder and p_time_decoder can only be used with text/uuid type control column.';
END IF;

16 changes: 16 additions & 0 deletions sql/functions/create_partition_time.sql
@@ -14,6 +14,7 @@ ex_hint text;
ex_message text;
v_control text;
v_control_type text;
v_time_decoder text;
v_time_encoder text;
v_datetime_string text;
v_epoch text;
@@ -45,6 +46,8 @@ v_sub_timestamp_max timestamptz;
v_sub_timestamp_min timestamptz;
v_template_table text;
v_time timestamptz;
v_partition_id_start bigint;
v_partition_id_end bigint;
v_partition_text_start text;
v_partition_text_end text;

@@ -54,6 +57,7 @@ BEGIN
*/

SELECT control
, time_decoder
, time_encoder
, partition_interval::interval -- this shared field also used in partition_id as bigint
, epoch
@@ -62,6 +66,7 @@ SELECT control
, template_table
, inherit_privileges
INTO v_control
, v_time_decoder
, v_time_encoder
, v_partition_interval
, v_epoch
@@ -123,6 +128,7 @@ v_partition_expression := CASE
WHEN v_epoch = 'milliseconds' THEN format('to_timestamp((%I/1000)::float)', v_control)
WHEN v_epoch = 'microseconds' THEN format('to_timestamp((%I/1000000)::float)', v_control)
WHEN v_epoch = 'nanoseconds' THEN format('to_timestamp((%I/1000000000)::float)', v_control)
WHEN v_epoch = 'func' THEN format('%s(%I)', v_time_decoder, v_control)
ELSE format('%I', v_control)
END;
RAISE DEBUG 'create_partition_time: v_partition_expression: %', v_partition_expression;
@@ -237,7 +243,17 @@ FOREACH v_time IN ARRAY p_partition_times LOOP
, v_partition_text_start
, v_partition_text_end);
END IF;
ELSIF v_epoch = 'func' THEN
EXECUTE format('SELECT %s(%L)', v_time_encoder, v_partition_timestamp_start) INTO v_partition_id_start;
EXECUTE format('SELECT %s(%L)', v_time_encoder, v_partition_timestamp_end) INTO v_partition_id_end;

EXECUTE format('ALTER TABLE %I.%I ATTACH PARTITION %I.%I FOR VALUES FROM (%L) TO (%L)'
, v_parent_schema
, v_parent_tablename
, v_parent_schema
, v_partition_name
, v_partition_id_start
, v_partition_id_end);
ELSE
-- Must attach with integer based values for built-in constraint and epoch
IF v_epoch = 'seconds' THEN
4 changes: 4 additions & 0 deletions sql/functions/partition_data_time.sql
@@ -38,6 +38,7 @@ v_source_schemaname text;
v_source_tablename text;
v_rowcount bigint;
v_start_control timestamptz;
v_time_decoder text;
v_total_rows bigint := 0;

BEGIN
@@ -49,10 +50,12 @@ SELECT partition_interval::interval
, control
, datetime_string
, epoch
, time_decoder
INTO v_partition_interval
, v_control
, v_datetime_string
, v_epoch
, v_time_decoder
FROM @extschema@.part_config
WHERE parent_table = p_parent_table;
IF NOT FOUND THEN
@@ -133,6 +136,7 @@ v_partition_expression := CASE
WHEN v_epoch = 'milliseconds' THEN format('to_timestamp((%I/1000)::float)', v_control)
WHEN v_epoch = 'microseconds' THEN format('to_timestamp((%I/1000000)::float)', v_control)
WHEN v_epoch = 'nanoseconds' THEN format('to_timestamp((%I/1000000000)::float)', v_control)
WHEN v_epoch = 'func' THEN format('%s(%I)', v_time_decoder, v_control)
ELSE format('%I', v_control)
END;
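
As an illustration of the `'func'` branch added to `v_partition_expression` in both `create_partition_time.sql` and `partition_data_time.sql`, the same `format()` call can be run on its own. The decoder name and column below are taken from the how-to example and are not required names. `%s` inserts the stored function name as-is, so a schema-qualified name passes through unchanged, while `%I` quotes the control column as an identifier where needed.

```sql
-- Reproduce the expression built when epoch = 'func'.
SELECT format('%s(%I)', 'public.snowflake_to_timestamp', 'col1') AS partition_expression;
-- Expected: public.snowflake_to_timestamp(col1)
```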
