04 ELT Tables

1. IngestDefinition

IngestDefinition is a configuration table that specifies how data is extracted from source systems to the raw zone of the data lake. Most of the columns in this table are populated with user-supplied values, usually through initialization scripts.

| Column | System/User | Description | Default | Example |
|---|---|---|---|---|
| IngestID | S | System-generated sequence number that identifies an ingest configuration | | |
| SourceSystemName | U | An easily recognizable code that identifies a data source | | ERP |
| StreamName | U | A code that identifies a subject area of data within a data source; this could be a table, view, API, file etc. supplied by the data source | | GL |
| SourceSystemDescription | U | Brief description of the data source and stream | | General Ledger |
| Backend | U | Repository type of the data source | | On-Premise SQL / Oracle / HTTP API / Drop File |
| DataFormat | U | Format in which data is available from the data source | | Table / View / CSV / JSON / OpenHub / Stored Procedure |
| EntityName | U | Base table, view or API endpoint | | Finance.GL or an API endpoint |
| WatermarkColName | U | Column of the entity used as the delta/watermark column. Set to NULL if the table needs to be fully ingested every time. This can be a timestamp or a running-number column | | LAST_UPDATED |
| DeltaFormat | U | Format string used to format the DeltaDate/DeltaNumber in the generated query WHERE clause. See https://docs.microsoft.com/en-us/sql/t-sql/functions/format-transact-sql for format patterns. Set to NULL to use the default format | dd MMM yyyy HH:mm | |
| LastDeltaDate | S/U | The last delta timestamp, updated by the pipeline after a successful run. For the first run, the user must set this to an appropriate value such as today or the earliest available timestamp. Only applicable if WatermarkColName is set | | Today |
| LastDeltaNumber | S/U | The last delta number, updated by the pipeline after a successful run. Set to NULL if the table needs to be fully ingested every time or if the delta column is not a running number | | OHREQUID |
| MaxIntervalMinutes | U | Maximum time range in minutes for which data is extracted, i.e. from LastDeltaDate to (LastDeltaDate + MaxIntervalMinutes). If not specified, the range is from LastDeltaDate to now | | 1440 for a daily extract (1 day = 1440 minutes); 60 for an hourly extract (1440/24) |
| MaxIntervalNumber | U | Maximum number range for which data is extracted, i.e. from LastDeltaNumber to (LastDeltaNumber + MaxIntervalNumber). If not specified, the range is from LastDeltaNumber to the current number | | |
| DataMapping | U | Used in two ways: (a) to give meaningful names to source columns, e.g. to rename SAP German column names to English; (b) to supply headers to external files that have none. Follow the Azure Data Factory TabularTranslator column mapping instructions; specified as a JSON array per https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-schema-and-type-mapping#explicit-mapping | | To rename columns: `[{"source":{"name":"UCINSTALLA"},"sink":{"name":"InstallationNumber"}},{"source":{"name":"DIVISION"},"sink":{"name":"Division"}}]`. To supply headers: `[{"source":{"ordinal":1},"sink":{"name":"NMI"}},{"source":{"ordinal":2},"sink":{"name":"SUPPLY_SUBSTATION_FULL_NUMBER"}}]` |
| SourceFileDropFileSystem | U | File system/container where the external data files are uploaded to the data lake | | drop |
| SourceFileDropFolder | U | Folder within the file system/container where the external data files are uploaded | | budget/YYYY/MM/DD |
| SourceFileDropFile | U | External data file name | | budget_YYYYMMDD.csv |
| SourceFileDelimiter | U | Column separator when the external file is CSV, TXT or tab-separated. Not applicable to JSON, Parquet, ORC or XML files | | |
| SourceFileHeaderFlag | U | Flag (1/0) indicating whether the external file has a header. Header=1, no header=0 | | 1 |
| SourceStructure | U | Azure Data Factory parameter that defines exactly which columns to pull from an ADF dataset when you don't want to extract all columns of an entity. Specified as a JSON array per the ADF documentation for alternate schema mapping: https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-schema-and-type-mapping | | `[{"name":"column1_name","type":"column1_datatype"},{"name":"column2_name","type":"column2_type"}]` (format to be confirmed) |
| DestinationRawFileSystem | U | File system/container where the ingested data is written as files in the raw zone of the data lake | | raw |
| DestinationRawFolder | U | Folder where the ingested data is written as files in the raw zone of the data lake | | erp/gl/YYYY/MM |
| DestinationRawFile | U | Name of the ingested file in the raw zone | | erp_gl_YYYYMMDD_HH_GUID.parquet |
| RunSequence | U | User-supplied value that determines the sequence in which pipelines run within a SourceSystemName. If none is specified, the default pushes this workload to the back of the queue | 100 | 1 |
| MaxRetries | U | Maximum number of times a failed pipeline re-runs before giving up | | 3 |
| ActiveFlag | U | Flag to activate/deactivate the pipeline. Active=1, inactive=0 | | 1 |
| L1TransformationReqdFlag | U | Flag indicating whether a Level 1 transformation is required. Required=1, not required=0 | | |
| L2TransformationReqdFlag | U | Flag indicating whether a Level 2 transformation is required. Required=1, not required=0 | | |
| DelayL1TransformationFlag | U | Flag indicating whether the Level 1 transformation runs as part of ingestion or is delayed to another schedule | | |
| DelayL2TransformationFlag | U | Flag indicating whether the Level 2 transformation runs as part of ingestion or is delayed to another schedule | | |
| CreatedBy | S | User who created this record | | LoginID |
| CreatedTimestamp | S | Timestamp of record creation | | Now |
| ModifiedBy | S | User who updated this record | | LoginID |
| ModifiedTimestamp | S | Timestamp of record update | | Now |
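
For illustration, a minimal initialization script for the ERP GL example could look like the following T-SQL sketch. The dbo schema, the identity-generated IngestID and the column subset are assumptions for illustration, not the framework's actual DDL.

```sql
-- Minimal sketch of an initialization script for IngestDefinition.
-- Assumes a dbo schema, an identity-generated IngestID, and shows
-- only a representative subset of columns.
INSERT INTO dbo.IngestDefinition
    (SourceSystemName, StreamName, SourceSystemDescription,
     Backend, DataFormat, EntityName,
     WatermarkColName, DeltaFormat, LastDeltaDate, MaxIntervalMinutes,
     DestinationRawFileSystem, DestinationRawFolder, DestinationRawFile,
     RunSequence, MaxRetries, ActiveFlag,
     L1TransformationReqdFlag, L2TransformationReqdFlag)
VALUES
    ('ERP', 'GL', 'General Ledger',
     'On-Premise SQL', 'Table', 'Finance.GL',
     'LAST_UPDATED', 'dd MMM yyyy HH:mm', GETDATE(), 1440,   -- daily delta window
     'raw', 'erp/gl/YYYY/MM', 'erp_gl_YYYYMMDD_HH_GUID.parquet',
     1, 3, 1,
     1, 0);
```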

2. IngestInstance

The IngestInstance table records each execution of an ingest pipeline. It is populated by the ingestion pipeline and, in addition to providing basic audit logging, supports re-runs.

| Column | System/User | Description | Default | Example |
|---|---|---|---|---|
| IngestInstanceID | S | System-generated running number that identifies each execution of an ingestion | | |
| IngestID | S | System-generated running number that identifies the ingest definition | | |
| SourceFileDropFileSystem | S | File system/container where the external data files are uploaded to the data lake | | drop |
| SourceFileDropFolder | S | Folder within the file system/container where the external data files are uploaded | | budget/2020/11 |
| SourceFileDropFile | S | External data file name | | budget_20190913.csv |
| DestinationRawFileSystem | S | File system/container where the ingested data is written as files in the raw zone of the data lake | | raw |
| DestinationRawFolder | S | Folder where the ingested data is written as files in the raw zone of the data lake | | erp/budget/2020/11 |
| DestinationRawFile | S | Name of the ingested file in the raw zone | | erp_budget_20201105_10_617776a4-e8cf-4b59-a5e5-da772f7283fa.parquet |
| DataFromTimestamp | S | First timestamp of the data that was ingested | | 2019-03-05 11:59:56.0000000 |
| DataToTimestamp | S | Last timestamp of the data that was ingested | | 2019-03-05 12:22:36.0000000 |
| DataFromNumber | S | First running number of the data that was ingested | | 93830 |
| DataToNumber | S | Last running number of the data that was ingested | | 93796 |
| SourceCount | S | Number of records in the source | | 2048079 |
| IngestCount | S | Number of records ingested. Depending on the source, SourceCount and IngestCount may differ; e.g. a JSON file has SourceCount=1 and IngestCount=N after flattening | | 2048079 |
| IngestStartTimestamp | S | Timestamp when the ADF pipeline started execution | | |
| IngestEndTimestamp | S | Timestamp when the ADF pipeline completed execution, successfully or not | | |
| IngestStatus | S | Status of pipeline execution. Allowed values: Running, Success, ReRunSuccess, Failure, ReRunFailure | | |
| RetryCount | S | Counter that tracks how many times the pipeline has been re-run; reset after a successful run. Checked against MaxRetries to determine whether a reload should proceed | | 2 |
| ReloadFlag | S/U | Flag (1/0) indicating whether the ingest instance needs to be reloaded. Manually set by the user (1 = reload) and automatically reset after a successful run | | 0 |
| CreatedBy | S | User who created this record | | LoginID |
| CreatedTimestamp | S | Timestamp of record creation | | Now |
| ModifiedBy | S | User who modified this record | | LoginID |
| ModifiedTimestamp | S | Timestamp of record update | | Now |
| ADFIngestPipelineRunID | S | Data Factory GUID of the ingestion pipeline | | a1373a9d-36eb-4bf1-b8db-c382136f3b36 |
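
To see how the delta columns work together, the sketch below derives one run's extraction window (the values recorded in DataFromTimestamp/DataToTimestamp) from LastDeltaDate, MaxIntervalMinutes and DeltaFormat. The variable names and the boundary operators are illustrative assumptions, not the framework's actual code.

```sql
-- Sketch: derive one run's extraction window and the generated
-- WHERE clause. Variable names are illustrative.
DECLARE @LastDeltaDate      datetime2    = '2019-03-05 11:59:56';
DECLARE @MaxIntervalMinutes int          = 1440;               -- 1 day
DECLARE @DeltaFormat        nvarchar(30) = N'dd MMM yyyy HH:mm';
DECLARE @WatermarkColName   sysname      = N'LAST_UPDATED';

-- Window: LastDeltaDate .. LastDeltaDate + MaxIntervalMinutes,
-- capped at "now" when the interval overshoots the current time.
DECLARE @DataFrom datetime2 = @LastDeltaDate;
DECLARE @DataTo   datetime2 =
    CASE WHEN DATEADD(MINUTE, @MaxIntervalMinutes, @LastDeltaDate) < SYSDATETIME()
         THEN DATEADD(MINUTE, @MaxIntervalMinutes, @LastDeltaDate)
         ELSE SYSDATETIME()
    END;

-- DeltaFormat drives T-SQL FORMAT() when rendering the boundaries.
DECLARE @Where nvarchar(400) =
      N'WHERE ' + @WatermarkColName
    + N' > '''  + FORMAT(@DataFrom, @DeltaFormat)
    + N''' AND ' + @WatermarkColName
    + N' <= ''' + FORMAT(@DataTo, @DeltaFormat) + N'''';

SELECT @DataFrom AS DataFromTimestamp,
       @DataTo   AS DataToTimestamp,
       @Where    AS GeneratedWhereClause;
```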

3. L1TransformDefinition

A Level 1 transformation applies business rules to raw data. L1TransformDefinition configures how data is transformed from the raw zone and written to the Curated1 zone and the DWH.

Typical business rules for L1 Transformations:

  • De-duplication
  • Data cleansing
  • Upserts (see the MERGE sketch after this list)
  • Adding new attributes such as Ring Fence data points
  • Adding audit data points
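
The upsert rule relies on OutputDWStagingTable, LookupColumns and OutputDWTable from the column reference below. Here is a minimal T-SQL sketch of the staged MERGE, using the example names from this page (STG.FIN_GL, FIN.GL, company_code|gl_code|start_date); the non-key columns are illustrative assumptions.

```sql
-- Sketch: staged upsert from STG.FIN_GL into FIN.GL, keyed on the
-- LookupColumns company_code|gl_code|start_date. The non-key columns
-- (amount, last_updated) are illustrative.
MERGE FIN.GL AS tgt
USING STG.FIN_GL AS src
    ON  tgt.company_code = src.company_code
    AND tgt.gl_code      = src.gl_code
    AND tgt.start_date   = src.start_date
WHEN MATCHED THEN
    UPDATE SET tgt.amount       = src.amount,
               tgt.last_updated = src.last_updated
WHEN NOT MATCHED BY TARGET THEN
    INSERT (company_code, gl_code, start_date, amount, last_updated)
    VALUES (src.company_code, src.gl_code, src.start_date,
            src.amount, src.last_updated);
```
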
| Column | System/User | Description | Default | Example |
|---|---|---|---|---|
| L1TransformID | S | System-generated running number that identifies a Level 1 transform configuration | | |
| IngestID | U | Ingestion workload to which this transformation applies. An ingestion workload can be transformed in more than one way | | |
| ComputePath | U | Folder path where the transform compute is located, e.g. the workspace folder containing the Spark notebook | | Shared/Common/C1 |
| ComputeName | U | Name of the transform compute, e.g. the Spark notebook or stored procedure that performs the transform | | C1Transform-Generic |
| InputRawFileSystem | U | File system/container where the raw files are located | | raw |
| InputRawFileFolder | U | Folder where the raw files are located | | erp/gl/YYYY/MM |
| InputRawFile | U | Raw file name | | erp_gl_YYYYMMDD_HH_GUID.parquet |
| InputRawFileDelimiter | U | Column separator of the raw file when it is CSV, TXT or tab-separated. Not applicable to JSON, Parquet, ORC or XML files | | \| |
| InputFileHeaderFlag | U | Flag (1/0) indicating whether the raw file has a header. Header=1, no header=0 | | 1 |
| OutputL1CurateFileSystem | U | File system/container where the transformed files will be written | | trusted |
| OutputL1CuratedFolder | U | Folder where the transformed files will be written | | erp/gl/YYYY/MM |
| OutputL1CuratedFile | U | Transformed file name | | erp_gl_YYYYMMDD_HH_GUID.parquet |
| OutputL1CuratedFileDelimiter | U | Column separator of the transformed file when it is CSV, TXT or tab-separated. Not applicable to JSON, Parquet, ORC or XML files | | \| |
| OutputL1CuratedFileFormat | U | Type of the transformed file | | Parquet / ORC / CSV |
| OutputL1CuratedFileWriteMode | U | Control field that determines how the transformed data is written to the curated file. Valid values: append, overwrite, ignore, error, errorifexists | overwrite | append |
| OutputDWStagingTable | U | Temporary DWH table that stages the transformed data for MERGE updates. Must be populated to use upserts that guarantee unique keys (as defined by LookupColumns). Set to NULL to insert all new records without checking for uniqueness (which may cause duplicate records) | | STG.FIN_GL |
| LookupColumns | U | Unique key columns required for MERGE updates. Separate multiple columns with a pipe (\|) | | company_code\|gl_code\|start_date |
| OutputDWTable | U | DWH table that stores the transformed data | | FIN.GL |
| OutputDWTableWriteMode | U | Control field that determines how the transformed data is written to the DWH table. Valid values: append, overwrite, ignore, error, errorifexists. overwrite truncates all existing records before inserting new ones (use when every load is a drop and full reload); append keeps existing records and inserts/updates new ones (use for delta loads); ignore skips the insert/update if the data already exists; error/errorifexists throws an exception if data already exists. See the write-mode sketch after this table | | append |
| MaxRetries | U | Maximum number of times a failed transform re-runs before giving up | | 3 |
| WatermarkColName | U | Delta column name; same value as the ingest record's WatermarkColName | | last_updated |
| ActiveFlag | U | Flag (1/0) to activate/deactivate the transform. Active=1 | | |
| CreatedBy | S | User who created this record | | LoginID |
| CreatedTimestamp | S | Timestamp of record creation | | Now |
| ModifiedBy | S | User who modified this record | | LoginID |
| ModifiedTimestamp | S | Timestamp of record update | | Now |
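
To make the write-mode semantics concrete, here is a hedged T-SQL sketch of how a generic loader could interpret OutputDWTableWriteMode. The framework applies these modes inside its notebooks or stored procedures, so treat this as an interpretation under assumed names, not the implementation.

```sql
-- Sketch: one way a generic loader could interpret
-- OutputDWTableWriteMode. Table names are this page's examples.
DECLARE @WriteMode varchar(20) = 'append';   -- from the definition row

IF @WriteMode = 'overwrite'
BEGIN
    -- Drop and full reload: clear existing rows first.
    TRUNCATE TABLE FIN.GL;
END
ELSE IF @WriteMode IN ('error', 'errorifexists')
BEGIN
    IF EXISTS (SELECT 1 FROM FIN.GL)
        THROW 50001, 'FIN.GL already contains data.', 1;
END
ELSE IF @WriteMode = 'ignore'
BEGIN
    IF EXISTS (SELECT 1 FROM FIN.GL)
        RETURN;   -- data already present: skip this load
END

-- Load the staged rows. An 'append' delta load that must guarantee
-- unique keys would use the MERGE upsert sketched in section 3
-- instead of a plain INSERT.
INSERT INTO FIN.GL
SELECT * FROM STG.FIN_GL;
```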

4. L1TransformInstance

The L1TransformInstance table records each execution of an L1 transform pipeline. It is populated by the transform pipeline and, in addition to providing basic audit logging and data lineage back to the ingested data, supports re-runs.

| Column | System/User | Description | Default | Example |
|---|---|---|---|---|
| L1TransformInstanceID | S | System-generated running number that identifies an execution of a Level 1 transform | | |
| L1TransformID | S | Level 1 transformation definition ID | | |
| IngestInstanceID | S | Ingest instance that created this transformation record | | |
| IngestID | S | Ingest definition ID | | |
| NotebookName | S | Name of the Databricks transform notebook | | C1Transform-Generic |
| NotebookPath | S | Folder path of the transform notebook in the Databricks workspace | | Shared/Common/C1 |
| InputRawFileSystem | S | File system/container where the raw files are located | | raw |
| InputRawFileFolder | S | Folder where the raw files are located | | erp/gl/2020/11 |
| InputRawFile | S | Raw file name | | erp_gl_20201104_07_b20469a3-d31e-4276-8677-9e6f6279660a.parquet |
| InputRawFileDelimiter | S | Column separator of the raw file when it is CSV, TXT or tab-separated. Not applicable to JSON, Parquet, ORC or XML files | | \| |
| InputFileHeaderFlag | S | Flag (1/0) indicating whether the raw file has a header. Header=1, no header=0 | | 1 |
| OutputL1CurateFileSystem | S | File system/container where the transformed files are written | | trusted |
| OutputL1CuratedFolder | S | Folder where the transformed files are written | | erp/gl/2020/11 |
| OutputL1CuratedFile | S | Transformed file name | | mdw_npd_interval_data_20190912_16_b20469a3-d31e-4276-8677-9e6f6279660a.parquet |
| OutputL1CuratedFileDelimiter | S | Column separator of the transformed file when it is CSV, TXT or tab-separated. Not applicable to JSON, Parquet, ORC or XML files | | \| |
| OutputL1CuratedFileFormat | S | Type of the transformed file | | Parquet / ORC / CSV |
| OutputL1CuratedFileWriteMode | S | Control field that determines how the transformed data is written to the curated file. Valid values: append, overwrite, ignore, error, errorifexists | overwrite | overwrite |
| OutputDWStagingTable | S | Temporary DWH table that stages the transformed data for MERGE updates | | STG.FIN_GL |
| LookupColumns | S | Unique key columns required for MERGE updates. Separate multiple columns with a pipe (\|) | | company_code\|gl_code\|start_dt |
| OutputDWTable | S | DWH table that stores the transformed data | | FIN.GL |
| OutputDWTableWriteMode | S | Control field that determines how the transformed data is written to the DWH table. Valid values: append, overwrite, ignore, error, errorifexists | append | append |
| IngestCount | S | Total number of records extracted from the data source | | |
| L1TransformInsertCount | S | Total number of records inserted by the transformation | | |
| L1TransformUpdateCount | S | Total number of records updated by the transformation | | |
| L1TransformDeleteCount | S | Total number of records deleted by the transformation | | |
| L1TransformStartTimestamp | S | Timestamp when the transform started | | |
| L1TransformEndTimestamp | S | Timestamp when the transform finished, successfully or otherwise | | |
| L1TransformStatus | S | Status of the transformation. Valid values: Running, Success, Failure, ReRunFailure, ReRunSuccess | | |
| RetryCount | S | Count of retries attempted on a failed transform. Once the count reaches MaxRetries, the transform will not attempt to retry. Reset after a successful run | | |
| ActiveFlag | S | Flag (1/0) indicating whether the transform instance is active. Set to active by default and reset after a successful run; a failed run leaves the flag set until MaxRetries is reached | | 1 |
| ReRunL1TransformFlag | S/U | Flag (1/0) to rerun a transform. Set to 0 after a successful run; must be set manually for a re-run | | 0 |
| IngestADFPipelineRunID | S | GUID of the ADF ingestion pipeline that created this record | | 31f5db5f-0b7d-469e-9e7b-6398231eadea |
| L1TransformADFPipelineRunID | S | GUID of the ADF transformation pipeline that executed this transform instance | | bd9abf6f-8309-412d-b583-fc83db0430ea |
| CreatedBy | S | User who created this record | | LoginID |
| CreatedTimestamp | S | Timestamp of record creation | | Now |
| ModifiedBy | S | User who modified this record | | LoginID |
| ModifiedTimestamp | S | Timestamp of record update | | Now |
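
Because the instance table stores both the ingest keys and the ADF run IDs, lineage from a curated file back to its raw and drop files can be read with a simple join. A sketch, assuming the control tables live in the dbo schema:

```sql
-- Sketch: trace a Level 1 transform instance back to the ingest run
-- that produced its input. Schema name (dbo) is assumed.
SELECT  t.L1TransformInstanceID,
        t.OutputL1CuratedFile,
        i.DestinationRawFile,
        i.SourceFileDropFile,
        t.L1TransformADFPipelineRunID,
        i.ADFIngestPipelineRunID,
        t.L1TransformStatus,
        i.IngestStatus
FROM    dbo.L1TransformInstance AS t
JOIN    dbo.IngestInstance      AS i
        ON i.IngestInstanceID = t.IngestInstanceID
WHERE   t.OutputL1CuratedFile =
        'mdw_npd_interval_data_20190912_16_b20469a3-d31e-4276-8677-9e6f6279660a.parquet';
```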

5. L2TransformDefinition

A Level 2 transformation is the second level of transformation applied to the data. L2TransformDefinition configures how data is transformed from the raw zone, Curated1 zone and DWH, and written to the Curated2 zone and DWH.

Typical L2 Transformations:

  • Aggregation
  • Redaction
  • Consolidation
  • Snapshots (see the sketch after this list)
  • Post-processing
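
As an illustration of the snapshot case, the following T-SQL sketch stages a monthly snapshot of FIN.GL for the MERGE into FIN.GL_Snapshot described below; the grouping columns, measure and snapshot_month column are assumed for illustration.

```sql
-- Sketch: a monthly Level 2 snapshot of FIN.GL staged for the MERGE
-- into FIN.GL_Snapshot. Grouping columns and SUM(amount) are
-- illustrative; snapshot_month marks the period being frozen.
INSERT INTO STG.FIN_GLSnapshot (snapshot_month, company_code, gl_code, amount)
SELECT  '2020-11-01'   AS snapshot_month,
        g.company_code,
        g.gl_code,
        SUM(g.amount)  AS amount
FROM    FIN.GL AS g
WHERE   g.start_date >= '2020-11-01'
  AND   g.start_date <  '2020-12-01'
GROUP BY g.company_code, g.gl_code;
```
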
| Column | System/User | Description | Default | Example |
|---|---|---|---|---|
| L2TransformID | S | System-generated sequence number that identifies a Level 2 transform configuration | | |
| IngestID | U | Ingestion workload to which this transformation applies. An ingestion workload can be transformed in more than one way | | |
| L1TransformID | U | Level 1 transformation to which this transformation applies. A Level 1 transformation can be transformed in more than one way | | |
| ComputePath | U | Folder path where the transform compute is located, e.g. the workspace folder containing the Spark notebook | | Shared/Common/C2 |
| ComputeName | U | Name of the transform compute, e.g. the Spark notebook or stored procedure that performs the transform | | C2Transform-FromDWH-ToC2-Generic |
| InputType | U | Type of data source for this transformation. Allowed values: Raw, Curated, Datawarehouse | | |
| InputFileSystem | U | File system/container of the input file, which could be in the raw, trusted or curated zone | | trusted |
| InputFileFolder | U | Folder where the input files are located | | erp/gl/YYYY/MM |
| InputFile | U | Input file name | | erp_gl_YYYYMMDD_HH_GUID.parquet |
| InputFileDelimiter | U | Column separator of the input file when it is CSV, TXT or tab-separated. Not applicable to JSON, Parquet, ORC or XML files | | \| |
| InputFileHeaderFlag | U | Flag (1/0) indicating whether the input file has a header. Header=1, no header=0 | | 1 |
| InputDWTable | U | Input DWH table | | FIN.GL |
| WatermarkColName | U | High-watermark timestamp column name for the input DWH table | | LAST_UPDATED |
| LastDeltaDate | U | High-watermark value for the delta column | | Today |
| LastDeltaNumber | U | High-watermark value when the delta column is a running number | | |
| MaxIntervalMinutes | U | Maximum time range in minutes for which data is extracted, i.e. from LastDeltaDate to (LastDeltaDate + MaxIntervalMinutes). If not specified, the range is from LastDeltaDate to now | | 1440 for a daily extract (1 day = 1440 minutes); 60 for an hourly extract (1440/24) |
| MaxIntervalNumber | U | Maximum number range for which data is extracted, i.e. from LastDeltaNumber to (LastDeltaNumber + MaxIntervalNumber). If not specified, the range is from LastDeltaNumber to the current number | | |
| MaxRetries | U | Maximum number of times a failed transformation re-runs before giving up | | 3 |
| OutputL2CurateFileSystem | U | File system/container where the transformed files will be written | | curated |
| OutputL2CuratedFolder | U | Folder where the transformed files will be written | | erp/gl-snapshot/2020/11 |
| OutputL2CuratedFile | U | Transformed file name | | erp_glsnapshot_202011.parquet |
| OutputL2CuratedFileDelimiter | U | Column separator of the transformed file when it is CSV, TXT or tab-separated. Not applicable to JSON, Parquet, ORC or XML files | | \| |
| OutputL2CuratedFileFormat | U | Type of the transformed file | | Parquet / ORC / CSV |
| OutputL2CuratedFileWriteMode | U | Control field that determines how the transformed data is written to the curated file. Valid values: append, overwrite, ignore, error, errorifexists | overwrite | append |
| OutputDWStagingTable | U | Temporary DWH table that stages the transformed data for MERGE updates | | STG.FIN_GLSnapshot |
| LookupColumns | U | Unique key columns required for MERGE updates. Separate multiple columns with a pipe (\|) | | company_code\|gl_code\|start_date |
| OutputDWTable | U | DWH table that stores the transformed data | | FIN.GL_Snapshot |
| OutputDWTableWriteMode | U | Control field that determines how the transformed data is written to the DWH table. Valid values: append, overwrite, ignore, error, errorifexists | append | append |
| ActiveFlag | U | Flag (1/0) to activate/deactivate the transform. Active=1 | | |
| CreatedBy | S | User who created this record | | LoginID |
| CreatedTimestamp | S | Timestamp of record creation | | Now |
| ModifiedBy | S | User who modified this record | | LoginID |
| ModifiedTimestamp | S | Timestamp of record update | | Now |

6. L2TransformInstance

The L2TransformInstance table records each execution of an L2 transform pipeline. It is populated by the transform pipeline and, in addition to providing basic audit logging and data lineage back to the ingested and Curated1 data, supports re-runs.

| Column | System/User | Description | Default | Example |
|---|---|---|---|---|
| L2TransformInstanceID | S | System-generated running number that identifies an execution of a Level 2 transform | | |
| L2TransformID | S | Level 2 transformation definition ID | | |
| IngestID | S | Ingest definition ID | | |
| L1TransformID | S | Level 1 transformation definition ID | | |
| NotebookPath | S | Folder path of the transform notebook in the Databricks workspace | | Shared/Common/C2 |
| NotebookName | S | Name of the Databricks transform notebook | | C2Transform-FromDWH-ToC2-Generic |
| InputFileSystem | S | File system/container where the input files are located; could be anywhere in the data lake (raw, curated 1 or curated 2 zone) | | trusted |
| InputFileFolder | S | Folder where the input files are located | | erp/gl/2020/11 |
| InputFile | S | Input file name | | erp_gl_20201105_00_b20469a3-d31e-4276-8677-9e6f6279660a.parquet |
| InputFileDelimiter | S | Column separator of the input file when it is CSV, TXT or tab-separated. Not applicable to JSON, Parquet, ORC or XML files | | \| |
| InputFileHeaderFlag | S | Flag (1/0) indicating whether the input file has a header. Header=1, no header=0 | | 1 |
| InputDWTable | S | Name of the DWH table if the input data comes from the DWH | | FIN.GL |
| DeltaName | S | High-watermark column of the DWH table; can be a timestamp or running-number field. If blank, all records from InputDWTable are considered for transformation | | |
| DataFromTimestamp | S | From-timestamp of the input data from the DWH if DeltaName is a timestamp column | | |
| DataToTimestamp | S | To-timestamp of the input data from the DWH if DeltaName is a timestamp column | | |
| DataFromNumber | S | From running number of the input data from the DWH if DeltaName is a running-number column | | |
| DataToNumber | S | To running number of the input data from the DWH if DeltaName is a running-number column | | |
| OutputL2CurateFileSystem | S | File system/container where the transformed files are written | | curated |
| OutputL2CuratedFolder | S | Folder where the transformed files are written | | erp/gl-snapshot/2020/11 |
| OutputL2CuratedFile | S | Transformed file name | | gl_snapshot_202011.parquet |
| OutputL2CuratedFileDelimiter | S | Column separator of the transformed file when it is CSV, TXT or tab-separated. Not applicable to JSON, Parquet, ORC or XML files | | \| |
| OutputL2CuratedFileFormat | S | Format of the transformed file | | parquet/csv/orc |
| OutputL2CuratedFileWriteMode | S | Control field that determines how the transformed data is written to the curated file. Valid values: append, overwrite, ignore, error, errorifexists | overwrite | overwrite |
| OutputDWStagingTable | S | Temporary DWH table that stages the transformed data for MERGE updates | | STG.FIN_GLSnapshot |
| LookupColumns | S | Unique key columns required for MERGE updates. Separate multiple columns with a pipe (\|) | | company_code\|gl_code\|start_date |
| OutputDWTable | S | DWH table that stores the transformed data | | FIN.GLSnapshot |
| OutputDWTableWriteMode | S | Control field that determines how the transformed data is written to the DWH table. Valid values: append, overwrite, ignore, error, errorifexists | append | append |
| InputCount | S | Number of input records | | |
| L2TransformInsertCount | S | Total number of records inserted by the transformation | | |
| L2TransformUpdateCount | S | Total number of records updated by the transformation | | |
| L2TransformDeleteCount | S | Total number of records deleted by the transformation | | |
| L2TransformStartTimestamp | S | Timestamp when the transformation started | | |
| L2TransformEndTimestamp | S | Timestamp when the transformation completed, successfully or otherwise | | |
| L2TransformStatus | S | Status of the transformation. Valid values: Running, Success, Failure, ReRunFailure, ReRunSuccess | | |
| RetryCount | S | Count of retries attempted on a failed transform. Once the count reaches MaxRetries, the transform will not attempt to retry. Reset after a successful run | | |
| ActiveFlag | S | Flag (1/0) indicating whether the transform instance is active. Set to active by default and reset after a successful run; a failed run leaves the flag set until MaxRetries is reached | | 1 |
| ReRunL2TransformFlag | S/U | Flag (1/0) to rerun a transform. Set to 0 after a successful run; must be set manually for a re-run | | 0 |
| IngestADFPipelineRunID | S | GUID of the ADF ingestion pipeline that created this record | | 31f5db5f-0b7d-469e-9e7b-6398231eadea |
| L1TransformADFPipelineRunID | S | GUID of the Level 1 transformation pipeline, if available | | |
| L2TransformADFPipelineRunID | S | GUID of this transformation pipeline | | 357FCF99-8155-4A6F-BF86-DBB6F3528710 |
| CreatedBy | S | User who created this record | | LoginID |
| CreatedTimestamp | S | Timestamp of record creation | | Now |
| ModifiedBy | S | User who modified this record | | LoginID |
| ModifiedTimestamp | S | Timestamp of record update | | Now |
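
Since RetryCount is compared against the definition's MaxRetries, a retry-eligibility check over failed instances can be expressed as the following sketch (dbo schema assumed):

```sql
-- Sketch: failed Level 2 transform instances still eligible for a
-- retry, i.e. RetryCount below the definition's MaxRetries.
SELECT  i.L2TransformInstanceID,
        i.L2TransformStatus,
        i.RetryCount,
        d.MaxRetries
FROM    dbo.L2TransformInstance   AS i
JOIN    dbo.L2TransformDefinition AS d
        ON d.L2TransformID = i.L2TransformID
WHERE   i.L2TransformStatus IN ('Failure', 'ReRunFailure')
  AND   i.ActiveFlag = 1
  AND   i.RetryCount < d.MaxRetries;
```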