RFC-83 : Data Product Lineage #83

Open
andrea-gioia opened this issue Nov 13, 2024 · 0 comments
Labels
💡 Proposal (RFC 1) RFC Stage 1 (See CONTRIBUTING.md) 📙 RFC
Data Product Lineage

Champion: @andrea-gioia

Summary

We propose adding a dependsOn attribute to the Input Port Object definition to capture relationships between different data products, enabling data product lineage tracking.

Motivation

To map data lineage between source systems and data products, it is essential to define the external dependencies of each input port. Specifically, an input port can consume data from an output port of another data product or from an external system.

Note: an input port cannot consume from multiple external sources (systems or other output ports). If a data product needs to consume data from multiple external sources, it must declare multiple input ports.

Design and examples

We define a field on input ports that specifies where they consume data from. The component on which the port depends is referenced by its fullyQualifiedName (FQN). The FQN format should make it possible to distinguish an output port of another data product from a generic external system.

We call the field dependsOn because it indicates a dependency between interfaces (i.e., components of the same type) that, if unmet, prevents the creation of the port and, consequently, the entire product. The dependsOn field, therefore, always has the same meaning regardless of the component on which it is defined.
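As a minimal sketch of how a tool might tell the two kinds of dependsOn target apart, the snippet below matches an FQN against two URN layouts. Both layouts are assumptions inferred from the examples in this RFC rather than normative rules: output ports are assumed to use the input port layout with outputports in place of inputports, and external systems are assumed to use urn:dpds:{mesh-namespace}:systems:{system-name}. The helper name classify_dependency is hypothetical.

```python
import re

# Assumed URN layouts, inferred from this RFC's examples (not normative).
OUTPUT_PORT_RE = re.compile(
    r"^urn:dpds:(?P<namespace>[^:]+):dataproducts:(?P<product>[^:]+)"
    r":(?P<major>\d+):outputports:(?P<port>[^:]+)$"
)
SYSTEM_RE = re.compile(
    r"^urn:dpds:(?P<namespace>[^:]+):systems:(?P<system>[^:]+)$"
)


def classify_dependency(fqn: str) -> str:
    """Return 'outputport' or 'system' for a dependsOn entry; raise otherwise."""
    if OUTPUT_PORT_RE.match(fqn):
        return "outputport"
    if SYSTEM_RE.match(fqn):
        return "system"
    raise ValueError(f"unrecognized dependsOn target: {fqn}")


if __name__ == "__main__":
    for target in (
        "urn:dpds:com.company-xyz:dataproducts:upstreamProduct:1:outputports:outputRawData",
        "urn:dpds:com.company-xyz:systems:salesforce",
    ):
        print(target, "->", classify_dependency(target))
```

Rejecting anything that matches neither layout, including the FQN of another input port, keeps the dependency restricted to the two target kinds named in the Motivation section.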

The fields of the Input Port Object are listed below. The dependsOn row describes the new attribute.

| Field Name | Type | Description |
| --- | --- | --- |
| id | string:uuid | (READONLY) A version 5 UUID (see RFC 4122) generated server-side during data product creation as the SHA-1 hash of the port's fullyQualifiedName. It MAY be used to reference the port when calling the API exposed by the data product experience plane. Because the fullyQualifiedName is globally unique, the id is also globally unique. However, when calling APIs other than those exposed by the data product experience plane, the port's fullyQualifiedName MUST always be used. Example: "id": "3235744b-8d2e-57b5-afba-f66862cc6a21" |
| fullyQualifiedName | string:fqn | (READONLY) The universally unique identifier of the port. It MUST be a URN of the form urn:dpds:{mesh-namespace}:dataproducts:{product-name}:{product-major-version}:inputports:{port-name}. Example: "fullyQualifiedName": "urn:dpds:it.quantyca:dataproducts:tripExecution:1:inputports:tmsTripCDC" |
| entityType | string:alphanumeric | (READONLY) The type of the entity. It MUST be a constant value equal to inputport. |
| name | string:name | (REQUIRED) The name of the port. It MUST be unique among the input ports of the same data product. It's RECOMMENDED to use a camel case formatted string. Example: "name": "tmsTripCDC" |
| version | string:version | (REQUIRED) The semantic version number of the data product's port. Every time the major version of the port changes, the major version of the product MUST also be incremented. |
| displayName | string | The human-readable name of the port. It SHOULD be used by the frontend tool to display the port's name in place of the name property. It's RECOMMENDED not to use the same displayName for different input ports belonging to the same data product. |
| description | string | The port description. CommonMark syntax MAY be used for rich text representation. |
| dependsOn | [string:fqn] | The list of output ports or external systems from which this input port receives data. Each input port SHOULD read from only one output port or external system, so this array SHOULD always have a length of 1. Each entry must be the fullyQualifiedName of the output port or external system from which this port receives data. |
| componentGroup | string:name | The name of the group this component belongs to. Grouping different components together is useful for defining sub-modules within a data product. A sub-module can be used as a base for creating reusable templates. |
| promises | Promises Object \| Reference Object | The data product's promises declared over the port. |
| expectations | Expectation Object \| Reference Object | The data product's expectations declared over the port. |
| obligations | Obligations Object \| Reference Object | The data product's obligations declared over the port. |
| tags | [string] | A list of tags associated with the component. Tags can be used for logical grouping of the data product's components. |
| externalDocs | External Resource Object | Additional external documentation. |
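The constraints on dependsOn could be checked along the following lines. This is only a sketch: the function name check_depends_on and the wording of the findings are hypothetical. It treats the field as optional because the table does not mark it REQUIRED, and it reports the single-entry rule as a warning rather than an error because the table phrases it as SHOULD.

```python
from typing import Any, Dict, List


def check_depends_on(port: Dict[str, Any]) -> List[str]:
    """Return human-readable findings for an input port's dependsOn field."""
    depends_on = port.get("dependsOn")
    if depends_on is None:
        # Assumption: the field is optional, since it is not marked REQUIRED.
        return []
    if not isinstance(depends_on, list) or not all(
        isinstance(entry, str) for entry in depends_on
    ):
        return ["error: dependsOn must be a list of fullyQualifiedName strings"]
    findings = []
    if len(depends_on) != 1:
        findings.append(
            f"warning: dependsOn SHOULD have exactly one entry, found {len(depends_on)}"
        )
    return findings


if __name__ == "__main__":
    port = {
        "name": "inputRawData",
        "dependsOn": [
            "urn:dpds:com.company-xyz:systems:salesforce",
            "urn:dpds:com.company-xyz:systems:sap",  # hypothetical second source
        ],
    }
    for finding in check_depends_on(port):
        print(finding)
```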

The following is an example of an input port receiving data from an upstream data product:

{
  "fullyQualifiedName": "urn:dpds:com.company-xyz:dataproducts:downstreamProduct:1:inputports:inputRawData",
  "name": "inputRawData",
  "displayName": "Input Raw Data",
  "description": "The input port that reads raw data exposed by the upstreamProduct",
  "version": "1.2.0",
  "dependsOn": ["urn:dpds:com.company-xyz:dataproducts:upstreamProduct:1:outputports:outputRawData"]
}

The following is an example of an input port receiving data from an upstream external system:

{
  "fullyQualifiedName": "urn:dpds:com.company-xyz:dataproducts:downstreamProduct:1:inputports:inputRawData",
  "name": "inputRawData",
  "displayName": "Input Raw Data",
  "description": "The input port that ingests data from Salesforce",
  "version": "1.2.0",
  "dependsOn": ["urn:dpds:com.company-xyz:systems:salesforce"]
}
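To illustrate how dependsOn enables lineage tracking, the sketch below derives product-level lineage from a flat list of input port objects and walks it upstream. It is one possible approach under stated assumptions, not part of the proposal: the helpers owner, build_lineage and all_upstream are hypothetical, a port's owning product is assumed to be the part of its FQN before :inputports: or :outputports:, and the salesforceIngest port of upstreamProduct is invented for the example.

```python
from collections import defaultdict, deque


def owner(fqn):
    """Collapse a port FQN to its data product prefix; leave other FQNs as-is."""
    for marker in (":inputports:", ":outputports:"):
        if marker in fqn:
            return fqn.split(marker, 1)[0]
    return fqn  # e.g. an external system


def build_lineage(input_ports):
    """Map each data product to the products/systems it directly depends on."""
    upstream = defaultdict(set)
    for port in input_ports:
        for target in port.get("dependsOn", []):
            upstream[owner(port["fullyQualifiedName"])].add(owner(target))
    return upstream


def all_upstream(node, upstream):
    """Breadth-first walk returning every transitive upstream node of `node`."""
    seen, queue = set(), deque([node])
    while queue:
        for parent in upstream.get(queue.popleft(), ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


NS = "urn:dpds:com.company-xyz"
PORTS = [
    {
        "fullyQualifiedName": f"{NS}:dataproducts:downstreamProduct:1:inputports:inputRawData",
        "dependsOn": [f"{NS}:dataproducts:upstreamProduct:1:outputports:outputRawData"],
    },
    {
        # Hypothetical input port of upstreamProduct, added for illustration.
        "fullyQualifiedName": f"{NS}:dataproducts:upstreamProduct:1:inputports:salesforceIngest",
        "dependsOn": [f"{NS}:systems:salesforce"],
    },
]

if __name__ == "__main__":
    lineage = build_lineage(PORTS)
    for node in sorted(all_upstream(f"{NS}:dataproducts:downstreamProduct:1", lineage)):
        print(node)
```

Collapsing ports to their owning product is a design choice made here for brevity; a port-level graph would keep the full FQNs as nodes instead.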

Alternatives

Naming the field consumeTo instead of dependsOn would make little sense, because the name would not fit cases where the source pushes data to the port rather than the port pulling it from the source.

Decision

We have decided to make the modification described in this RFC in version 1.1.0 of the specification.

Consequences

To date, all ports have the same set of attributes. dependsOn would be the first attribute defined specifically for a particular type of port.

References

NA

@andrea-gioia andrea-gioia self-assigned this Nov 13, 2024
@andrea-gioia andrea-gioia changed the title RFC-00000X : Data Product Lineage RFC-000083 : Data Product Lineage Nov 13, 2024
@andrea-gioia andrea-gioia added the 💡 Proposal (RFC 1) RFC Stage 1 (See CONTRIBUTING.md) label Nov 13, 2024
@andrea-gioia andrea-gioia moved this to Todo in DPDS 1.1.0 Nov 13, 2024
@andrea-gioia andrea-gioia added this to the DPDS v1.1.0 milestone Nov 14, 2024
@andrea-gioia andrea-gioia changed the title RFC-000083 : Data Product Lineage RFC-83 : Data Product Lineage Nov 14, 2024