Skip to content

Canonical Data Representation for Digital Signature

Kim, DoHyung edited this page Dec 8, 2020 · 1 revision

Overview

This document defines a canonical representation of data conforming to a data model that is largely a simplified form of JSON model.

The canonical represenation guarantees that any data with the same content (as defined in this document) reduces to an identical representation.

The purpose of the canonical representation is either streaming comparison of two data items or computation of cryptographic hash of data. So it is very unlikely to store the final canonical representation in its entirety or to decode it back to a data structure, though possible.

Requirements

  1. Should be easy to produce from a series of events or in-memory data structure that is an output of a parser or a decoder
    • Should not be necessary to implement a parser or a decoder just to produce a canonical representation
  2. The canonical representation is not to be decoded once created
    • So it doesn't need to be easy or efficient to decode
  3. The size of the representation don't need to be as compact as possible, though must not be unnecessarily huge
    • It is not for storage but streaming
    • That means no need to employ complex bit fiddling
  4. The representation must be self-contained, meaning no out-of-band data such as schemas must be required for the interpretation of the representation

Data Model

In this section, we define what constitutes content of data.

Primitive Types

  • Boolean
    • true or false
  • Unit
    • A type with a single () value
  • Integer
    • Arbitrarily large
  • Unicode string
  • Bytes
    • A sequence of unsigned bytes

Complex Types

  • Array
    • A sequence of values of possibly different types

Mapping of Typical Data Types

The data model above is quite restricted compared to many data models implied by data formats like JSON and YAML. So here is how typical data types are mapped to the canonical data model. This is a normative part of this specification.

Null

Many data models have null or similar. It is mapped to a unit in the canonical representation.

Map/Struct (a.k.a. Record)

A map or struct is represented with an array of arrays of length 2 representing key and value. The order of the key-value pairs should be consistent across generation of the canonical representation, resulting in the same order given the same set of key-value pairs.

Only maps with integer or string keys are considered valid, and when generating a canonical representation:

  • Integers are sorted in ascending order
  • Strings are sorted in lexicographical order

Enum

Symbols are written out as strings.

Union

Unions are written out as an array of length 2. the first item is a discriminator that is in most cases either an integer or a string, and the second is the value.

Encoding

Once data is prepared against the data model specified above, the canonical encoding of the data is produced as defined in this section.

Base Rules

  • The encoding is byte oriented, meaning the order of byte values is significant while the order of bits in a byte isn't
  • A byte is non-negative integer, unless stated otherwise
  • Fixed size, multibyte items are written out their most signficant byte first (i.e. in big endian)

Arbitrary Length Byte Sequence

  • A sequence is chunked into multiple segments, each of which is at most 255 byte long excluding one byte length preceeding the content of a segment
    • A segment is written out as follows: <length><length bytes>
  • The whole sequence is considered terminated when the last segment is less than 255 byte long
    • If the length of a sequence is multiple of 255, the last segment is of zero length

Primitive Types

  • Boolean
    • false is written as a byte of 0
    • true is written as a byte of 1
  • Unit
    • A byte of 2 is written out
  • Integer
    • Zero
      1. A byte of 3 is written out
    • Positive
      1. A byte of 4 is written out as an indicator of a positive integer value follows
      2. The number of bytes to be written is written as a single byte
      3. The integer is written out from the most significant non-zero byte, according to the encoding of a byte sequence above
    • Negative
      1. A byte of 5 is written out as an indicator that a negative integer value follows
      2. The number of bytes to be written is written as a single byte
      3. The magnitude of the integer is written out from the most significant non-zero byte, according to the encoding of a byte sequence above
  • Unicode string
    1. A byte of 6 is written out as an indicator that a string value follows
    2. A UTF-8 encoding of the string is written out, according to the encoding of a byte sequence above
  • Bytes
    1. A byte of 7 is written out as an indicator that a byte sequence follows
    2. The actual bytes are written out, according to the encoding of a byte sequence above

Complex Types

  • Array
    1. A byte of 8 is written out as an indicator that an array value follows
    2. Each value in the array is written out one by one
    3. After writing out the last item, a byte of 255 is written out to indicate the end of the array (this should be where the type indicator of the next item should have been written out)

Notes

  • Multi-segments based encoding of strings and bytes are employed to permit easy implementation with a small, bounded buffer, rather than to achieve a compact encoding