04 ELT Tables

1. IngestDefinition

IngestDefinition is a configuration table that specifies how data is extracted from source systems to the raw zone of the data lake. Most of the columns in this table are populated with user-supplied values, usually through initialization scripts.

| Column | System/User | Description | Default | Example |
|---|---|---|---|---|
| IngestID | S | System-generated sequence number that identifies an ingest configuration | | |
| SourceSystemName | U | An easily recognizable code that identifies a data source | | ERP |
| StreamName | U | A code that identifies a subject area of data within a data source; this could be a table, view, API, file etc. supplied by the data source | | GL |
| SourceSystemDescription | U | Brief description of the data source and stream | | General Ledger |
| Backend | U | Repository type of the data source | | On-Premise SQL / Oracle / HTTP API / Drop File |
| DataFormat | U | Format in which data is available from the data source | | Table / View / CSV / JSON / OpenHub / Stored Procedure |
| EntityName | U | Base table, view or API endpoint | | Finance.GL or an API endpoint |
| WatermarkColName | U | Column of the entity used as the delta/watermark column. Set to NULL if the table needs to be fully ingested every time. This can be a timestamp or a running-number column | | LAST_UPDATED |
| DeltaFormat | U | Format string used to format the DeltaDate/DeltaNumber in the generated query WHERE clause. See https://docs.microsoft.com/en-us/sql/t-sql/functions/format-transact-sql for format patterns. Set to NULL to use the default format | dd MMM yyyy HH:mm | |
| LastDeltaDate | S/U | The last delta timestamp, updated by the pipeline after a successful run. For the first run, the user must set this to an appropriate value such as today or the earliest available timestamp. Only applicable if WatermarkColName is set | | Today |
| LastDeltaNumber | S/U | The last delta number, updated by the pipeline after a successful run. Set to NULL if the table needs to be fully ingested every time or if the delta column is not a running number | | OHREQUID |
| MaxIntervalMinutes | U | Maximum time range in minutes for which data is extracted, i.e. from LastDeltaDate to (LastDeltaDate + MaxIntervalMinutes). If not specified, the range is from LastDeltaDate to now | | 1440 for a daily extract (1 day = 1440 minutes); 60 for an hourly extract (1440/24) |
| MaxIntervalNumber | U | Maximum number range for which data is extracted, i.e. from LastDeltaNumber to (LastDeltaNumber + MaxIntervalNumber). If not specified, the range is from LastDeltaNumber to the current number | | |
| DataMapping | U | Used in two ways: (a) to give meaningful names to source columns, e.g. to rename SAP German column names to English; (b) to supply headers to external files that have none. Follow the Azure Data Factory TabularTranslator column mapping instructions; specified as a JSON array per https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-schema-and-type-mapping#explicit-mapping | | To rename columns: `[{"source":{"name":"UCINSTALLA"},"sink":{"name":"InstallationNumber"}},{"source":{"name":"DIVISION"},"sink":{"name":"Division"}}]`. To supply headers: `[{"source":{"ordinal":1},"sink":{"name":"NMI"}},{"source":{"ordinal":2},"sink":{"name":"SUPPLY_SUBSTATION_FULL_NUMBER"}}]` |
| SourceFileDropFileSystem | U | File system/container where the external data files are uploaded to the data lake | | drop |
| SourceFileDropFolder | U | Folder within the file system/container where the external data files are uploaded | | budget/YYYY/MM/DD |
| SourceFileDropFile | U | External data file name | | budget_YYYYMMDD.csv |
| SourceFileDelimiter | U | Column separator when the external file is CSV, TXT or tab-separated. Not applicable to JSON, Parquet, ORC or XML files | | |
| SourceFileHeaderFlag | U | Flag (1/0) indicating whether the external file has a header. Header=1, no header=0 | | 1 |
| SourceStructure | U | Azure Data Factory parameter that defines exactly which columns to pull from an ADF dataset when you don't want to extract all columns of an entity. Specified as a JSON array per the ADF documentation for alternate schema mapping: https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-schema-and-type-mapping | | `[{"name":"column1_name","type":"column1_datatype"},{"name":"column2_name","type":"column2_type"}]` (format to be confirmed) |
| DestinationRawFileSystem | U | File system/container where the ingested data is written as files in the raw zone of the data lake | | raw |
| DestinationRawFolder | U | Folder where the ingested data is written as files in the raw zone of the data lake | | erp/gl/YYYY/MM |
| DestinationRawFile | U | Name of the ingested file in the raw zone | | erp_gl_YYYYMMDD_HH_GUID.parquet |
| RunSequence | U | User-supplied value that determines the sequence in which pipelines run within a SourceSystemName. If none is specified, the default pushes this workload to the back of the queue | 100 | 1 |
| MaxRetries | U | Maximum number of times a failed pipeline re-runs before giving up | | 3 |
| ActiveFlag | U | Flag to activate/deactivate the pipeline. Active=1, inactive=0 | | 1 |
| L1TransformationReqdFlag | U | Flag indicating whether a Level 1 transformation is required. Required=1, not required=0 | | |
| L2TransformationReqdFlag | U | Flag indicating whether a Level 2 transformation is required. Required=1, not required=0 | | |
| DelayL1TransformationFlag | U | Flag indicating whether the Level 1 transformation runs as part of ingestion or is delayed to another schedule | | |
| DelayL2TransformationFlag | U | Flag indicating whether the Level 2 transformation runs as part of ingestion or is delayed to another schedule | | |
| CreatedBy | S | User who created this record | | LoginID |
| CreatedTimestamp | S | Timestamp of record creation | | Now |
| ModifiedBy | S | User who updated this record | | LoginID |
| ModifiedTimestamp | S | Timestamp of record update | | Now |
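
For illustration, a minimal initialization script for the ERP GL example could look like the following T-SQL sketch. The dbo schema, the identity-generated IngestID and the column subset are assumptions for illustration, not the framework's actual DDL.

```sql
-- Minimal sketch of an initialization script for IngestDefinition.
-- Assumes a dbo schema, an identity-generated IngestID, and shows
-- only a representative subset of columns.
INSERT INTO dbo.IngestDefinition
    (SourceSystemName, StreamName, SourceSystemDescription,
     Backend, DataFormat, EntityName,
     WatermarkColName, DeltaFormat, LastDeltaDate, MaxIntervalMinutes,
     DestinationRawFileSystem, DestinationRawFolder, DestinationRawFile,
     RunSequence, MaxRetries, ActiveFlag,
     L1TransformationReqdFlag, L2TransformationReqdFlag)
VALUES
    ('ERP', 'GL', 'General Ledger',
     'On-Premise SQL', 'Table', 'Finance.GL',
     'LAST_UPDATED', 'dd MMM yyyy HH:mm', GETDATE(), 1440,   -- daily delta window
     'raw', 'erp/gl/YYYY/MM', 'erp_gl_YYYYMMDD_HH_GUID.parquet',
     1, 3, 1,
     1, 0);
```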

2. IngestInstance

The IngestInstance table records each execution of an ingest pipeline. It is populated by the ingestion pipeline and, in addition to providing basic audit logging, supports re-runs.

| Column | System/User | Description | Default | Example |
|---|---|---|---|---|
| IngestInstanceID | S | System-generated running number that identifies each execution of an ingestion | | |
| IngestID | S | System-generated running number that identifies the ingest definition | | |
| SourceFileDropFileSystem | S | File system/container where the external data files are uploaded to the data lake | | drop |
| SourceFileDropFolder | S | Folder within the file system/container where the external data files are uploaded | | budget/2020/11 |
| SourceFileDropFile | S | External data file name | | budget_20190913.csv |
| DestinationRawFileSystem | S | File system/container where the ingested data is written as files in the raw zone of the data lake | | raw |
| DestinationRawFolder | S | Folder where the ingested data is written as files in the raw zone of the data lake | | erp/budget/2020/11 |
| DestinationRawFile | S | Name of the ingested file in the raw zone | | erp_budget_20201105_10_617776a4-e8cf-4b59-a5e5-da772f7283fa.parquet |
| DataFromTimestamp | S | First timestamp of the data that was ingested | | 2019-03-05 11:59:56.0000000 |
| DataToTimestamp | S | Last timestamp of the data that was ingested | | 2019-03-05 12:22:36.0000000 |
| DataFromNumber | S | First running number of the data that was ingested | | 93830 |
| DataToNumber | S | Last running number of the data that was ingested | | 93796 |
| SourceCount | S | Number of records in the source | | 2048079 |
| IngestCount | S | Number of records ingested. Depending on the source, SourceCount and IngestCount may differ; e.g. a JSON file has SourceCount=1 and IngestCount=N after flattening | | 2048079 |
| IngestStartTimestamp | S | Timestamp when the ADF pipeline started execution | | |
| IngestEndTimestamp | S | Timestamp when the ADF pipeline completed execution, successfully or not | | |
| IngestStatus | S | Status of pipeline execution. Allowed values: Running, Success, ReRunSuccess, Failure, ReRunFailure | | |
| RetryCount | S | Counter that tracks how many times the pipeline has been re-run; reset after a successful run. Checked against MaxRetries to determine whether a reload should proceed | | 2 |
| ReloadFlag | S/U | Flag (1/0) indicating whether the ingest instance needs to be reloaded. Manually set by the user (1 = reload) and automatically reset after a successful run | | 0 |
| CreatedBy | S | User who created this record | | LoginID |
| CreatedTimestamp | S | Timestamp of record creation | | Now |
| ModifiedBy | S | User who modified this record | | LoginID |
| ModifiedTimestamp | S | Timestamp of record update | | Now |
| ADFIngestPipelineRunID | S | Data Factory GUID of the ingestion pipeline | | a1373a9d-36eb-4bf1-b8db-c382136f3b36 |
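
To see how the delta columns work together, the sketch below derives one run's extraction window (the values recorded in DataFromTimestamp/DataToTimestamp) from LastDeltaDate, MaxIntervalMinutes and DeltaFormat. The variable names and the boundary operators are illustrative assumptions, not the framework's actual code.

```sql
-- Sketch: derive one run's extraction window and the generated
-- WHERE clause. Variable names are illustrative.
DECLARE @LastDeltaDate      datetime2    = '2019-03-05 11:59:56';
DECLARE @MaxIntervalMinutes int          = 1440;               -- 1 day
DECLARE @DeltaFormat        nvarchar(30) = N'dd MMM yyyy HH:mm';
DECLARE @WatermarkColName   sysname      = N'LAST_UPDATED';

-- Window: LastDeltaDate .. LastDeltaDate + MaxIntervalMinutes,
-- capped at "now" when the interval overshoots the current time.
DECLARE @DataFrom datetime2 = @LastDeltaDate;
DECLARE @DataTo   datetime2 =
    CASE WHEN DATEADD(MINUTE, @MaxIntervalMinutes, @LastDeltaDate) < SYSDATETIME()
         THEN DATEADD(MINUTE, @MaxIntervalMinutes, @LastDeltaDate)
         ELSE SYSDATETIME()
    END;

-- DeltaFormat drives T-SQL FORMAT() when rendering the boundaries.
DECLARE @Where nvarchar(400) =
      N'WHERE ' + @WatermarkColName
    + N' > '''  + FORMAT(@DataFrom, @DeltaFormat)
    + N''' AND ' + @WatermarkColName
    + N' <= ''' + FORMAT(@DataTo, @DeltaFormat) + N'''';

SELECT @DataFrom AS DataFromTimestamp,
       @DataTo   AS DataToTimestamp,
       @Where    AS GeneratedWhereClause;
```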

3. L1TransformDefinition

A Level 1 transformation applies business rules to raw data. L1TransformDefinition configures how data is transformed from the raw zone and written to the Curated1 zone and the DWH.

Typical business rules for L1 Transformations:

  • De-duplication
  • Data cleansing
  • Upserts (see the MERGE sketch after this list)
  • Adding new attributes such as Ring Fence data points
  • Adding audit data points
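
The upsert rule relies on OutputDWStagingTable, LookupColumns and OutputDWTable from the column reference below. Here is a minimal T-SQL sketch of the staged MERGE, using the example names from this page (STG.FIN_GL, FIN.GL, company_code|gl_code|start_date); the non-key columns are illustrative assumptions.

```sql
-- Sketch: staged upsert from STG.FIN_GL into FIN.GL, keyed on the
-- LookupColumns company_code|gl_code|start_date. The non-key columns
-- (amount, last_updated) are illustrative.
MERGE FIN.GL AS tgt
USING STG.FIN_GL AS src
    ON  tgt.company_code = src.company_code
    AND tgt.gl_code      = src.gl_code
    AND tgt.start_date   = src.start_date
WHEN MATCHED THEN
    UPDATE SET tgt.amount       = src.amount,
               tgt.last_updated = src.last_updated
WHEN NOT MATCHED BY TARGET THEN
    INSERT (company_code, gl_code, start_date, amount, last_updated)
    VALUES (src.company_code, src.gl_code, src.start_date,
            src.amount, src.last_updated);
```
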
| Column | System/User | Description | Default | Example |
|---|---|---|---|---|
| L1TransformID | S | System-generated running number that identifies a Level 1 transform configuration | | |
| IngestID | U | Ingestion workload to which this transformation applies. An ingestion workload can be transformed in more than one way | | |
| ComputePath | U | Folder path where the transform compute is located, e.g. the workspace folder containing the Spark notebook | | Shared/Common/C1 |
| ComputeName | U | Name of the transform compute, e.g. the Spark notebook or stored procedure that performs the transform | | C1Transform-Generic |
| InputRawFileSystem | U | File system/container where the raw files are located | | raw |
| InputRawFileFolder | U | Folder where the raw files are located | | erp/gl/YYYY/MM |
| InputRawFile | U | Raw file name | | erp_gl_YYYYMMDD_HH_GUID.parquet |
| InputRawFileDelimiter | U | Column separator of the raw file when it is CSV, TXT or tab-separated. Not applicable to JSON, Parquet, ORC or XML files | | \| |
| InputFileHeaderFlag | U | Flag (1/0) indicating whether the raw file has a header. Header=1, no header=0 | | 1 |
| OutputL1CurateFileSystem | U | File system/container where the transformed files will be written | | trusted |
| OutputL1CuratedFolder | U | Folder where the transformed files will be written | | erp/gl/YYYY/MM |
| OutputL1CuratedFile | U | Transformed file name | | erp_gl_YYYYMMDD_HH_GUID.parquet |
| OutputL1CuratedFileDelimiter | U | Column separator of the transformed file when it is CSV, TXT or tab-separated. Not applicable to JSON, Parquet, ORC or XML files | | \| |
| OutputL1CuratedFileFormat | U | Type of the transformed file | | Parquet / ORC / CSV |
| OutputL1CuratedFileWriteMode | U | Control field that determines how the transformed data is written to the curated file. Valid values: append, overwrite, ignore, error, errorifexists | overwrite | append |
| OutputDWStagingTable | U | Temporary DWH table that stages the transformed data for MERGE updates. Must be populated to use upserts that guarantee unique keys (as defined by LookupColumns). Set to NULL to insert all new records without checking for uniqueness (which may cause duplicate records) | | STG.FIN_GL |
| LookupColumns | U | Unique key columns required for MERGE updates. Separate multiple columns with a pipe (\|) | | company_code\|gl_code\|start_date |
| OutputDWTable | U | DWH table that stores the transformed data | | FIN.GL |
| OutputDWTableWriteMode | U | Control field that determines how the transformed data is written to the DWH table. Valid values: append, overwrite, ignore, error, errorifexists. overwrite truncates all existing records before inserting new ones (use when every load is a drop and full reload); append keeps existing records and inserts/updates new ones (use for delta loads); ignore skips the insert/update if the data already exists; error/errorifexists throws an exception if data already exists. See the write-mode sketch after this table | | append |
| MaxRetries | U | Maximum number of times a failed transform re-runs before giving up | | 3 |
| WatermarkColName | U | Delta column name; same value as the ingest record's WatermarkColName | | last_updated |
| ActiveFlag | U | Flag (1/0) to activate/deactivate the transform. Active=1 | | |
| CreatedBy | S | User who created this record | | LoginID |
| CreatedTimestamp | S | Timestamp of record creation | | Now |
| ModifiedBy | S | User who modified this record | | LoginID |
| ModifiedTimestamp | S | Timestamp of record update | | Now |
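
To make the write-mode semantics concrete, here is a hedged T-SQL sketch of how a generic loader could interpret OutputDWTableWriteMode. The framework applies these modes inside its notebooks or stored procedures, so treat this as an interpretation under assumed names, not the implementation.

```sql
-- Sketch: one way a generic loader could interpret
-- OutputDWTableWriteMode. Table names are this page's examples.
DECLARE @WriteMode varchar(20) = 'append';   -- from the definition row

IF @WriteMode = 'overwrite'
BEGIN
    -- Drop and full reload: clear existing rows first.
    TRUNCATE TABLE FIN.GL;
END
ELSE IF @WriteMode IN ('error', 'errorifexists')
BEGIN
    IF EXISTS (SELECT 1 FROM FIN.GL)
        THROW 50001, 'FIN.GL already contains data.', 1;
END
ELSE IF @WriteMode = 'ignore'
BEGIN
    IF EXISTS (SELECT 1 FROM FIN.GL)
        RETURN;   -- data already present: skip this load
END

-- Load the staged rows. An 'append' delta load that must guarantee
-- unique keys would use the MERGE upsert sketched in section 3
-- instead of a plain INSERT.
INSERT INTO FIN.GL
SELECT * FROM STG.FIN_GL;
```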

4. L1TransformInstance

The L1TransformInstance table records each execution of an L1 transform pipeline. It is populated by the transform pipeline and, in addition to providing basic audit logging and data lineage back to the ingested data, supports re-runs.

| Column | System/User | Description | Default | Example |
|---|---|---|---|---|
| L1TransformInstanceID | S | System-generated running number that identifies an execution of a Level 1 transform | | |
| L1TransformID | S | Level 1 transformation definition ID | | |
| IngestInstanceID | S | Ingest instance that created this transformation record | | |
| IngestID | S | Ingest definition ID | | |
| NotebookName | S | Name of the Databricks transform notebook | | C1Transform-Generic |
| NotebookPath | S | Folder path of the transform notebook in the Databricks workspace | | Shared/Common/C1 |
| InputRawFileSystem | S | File system/container where the raw files are located | | raw |
| InputRawFileFolder | S | Folder where the raw files are located | | erp/gl/2020/11 |
| InputRawFile | S | Raw file name | | erp_gl_20201104_07_b20469a3-d31e-4276-8677-9e6f6279660a.parquet |
| InputRawFileDelimiter | S | Column separator of the raw file when it is CSV, TXT or tab-separated. Not applicable to JSON, Parquet, ORC or XML files | | \| |
| InputFileHeaderFlag | S | Flag (1/0) indicating whether the raw file has a header. Header=1, no header=0 | | 1 |
| OutputL1CurateFileSystem | S | File system/container where the transformed files are written | | trusted |
| OutputL1CuratedFolder | S | Folder where the transformed files are written | | erp/gl/2020/11 |
| OutputL1CuratedFile | S | Transformed file name | | mdw_npd_interval_data_20190912_16_b20469a3-d31e-4276-8677-9e6f6279660a.parquet |
| OutputL1CuratedFileDelimiter | S | Column separator of the transformed file when it is CSV, TXT or tab-separated. Not applicable to JSON, Parquet, ORC or XML files | | \| |
| OutputL1CuratedFileFormat | S | Type of the transformed file | | Parquet / ORC / CSV |
| OutputL1CuratedFileWriteMode | S | Control field that determines how the transformed data is written to the curated file. Valid values: append, overwrite, ignore, error, errorifexists | overwrite | overwrite |
| OutputDWStagingTable | S | Temporary DWH table that stages the transformed data for MERGE updates | | STG.FIN_GL |
| LookupColumns | S | Unique key columns required for MERGE updates. Separate multiple columns with a pipe (\|) | | company_code\|gl_code\|start_dt |
| OutputDWTable | S | DWH table that stores the transformed data | | FIN.GL |
| OutputDWTableWriteMode | S | Control field that determines how the transformed data is written to the DWH table. Valid values: append, overwrite, ignore, error, errorifexists | append | append |
| IngestCount | S | Total number of records extracted from the data source | | |
| L1TransformInsertCount | S | Total number of records inserted by the transformation | | |
| L1TransformUpdateCount | S | Total number of records updated by the transformation | | |
| L1TransformDeleteCount | S | Total number of records deleted by the transformation | | |
| L1TransformStartTimestamp | S | Timestamp when the transform started | | |
| L1TransformEndTimestamp | S | Timestamp when the transform finished, successfully or otherwise | | |
| L1TransformStatus | S | Status of the transformation. Valid values: Running, Success, Failure, ReRunFailure, ReRunSuccess | | |
| RetryCount | S | Count of retries attempted on a failed transform. Once the count reaches MaxRetries, the transform will not attempt to retry. Reset after a successful run | | |
| ActiveFlag | S | Flag (1/0) indicating whether the transform instance is active. Set to active by default and reset after a successful run; a failed run leaves the flag set until MaxRetries is reached | | 1 |
| ReRunL1TransformFlag | S/U | Flag (1/0) to rerun a transform. Set to 0 after a successful run; must be set manually for a re-run | | 0 |
| IngestADFPipelineRunID | S | GUID of the ADF ingestion pipeline that created this record | | 31f5db5f-0b7d-469e-9e7b-6398231eadea |
| L1TransformADFPipelineRunID | S | GUID of the ADF transformation pipeline that executed this transform instance | | bd9abf6f-8309-412d-b583-fc83db0430ea |
| CreatedBy | S | User who created this record | | LoginID |
| CreatedTimestamp | S | Timestamp of record creation | | Now |
| ModifiedBy | S | User who modified this record | | LoginID |
| ModifiedTimestamp | S | Timestamp of record update | | Now |
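
Because the instance table stores both the ingest keys and the ADF run IDs, lineage from a curated file back to its raw and drop files can be read with a simple join. A sketch, assuming the control tables live in the dbo schema:

```sql
-- Sketch: trace a Level 1 transform instance back to the ingest run
-- that produced its input. Schema name (dbo) is assumed.
SELECT  t.L1TransformInstanceID,
        t.OutputL1CuratedFile,
        i.DestinationRawFile,
        i.SourceFileDropFile,
        t.L1TransformADFPipelineRunID,
        i.ADFIngestPipelineRunID,
        t.L1TransformStatus,
        i.IngestStatus
FROM    dbo.L1TransformInstance AS t
JOIN    dbo.IngestInstance      AS i
        ON i.IngestInstanceID = t.IngestInstanceID
WHERE   t.OutputL1CuratedFile =
        'mdw_npd_interval_data_20190912_16_b20469a3-d31e-4276-8677-9e6f6279660a.parquet';
```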

5. L2TransformDefinition

A Level 2 transformation is the second level of transformation applied to the data. L2TransformDefinition configures how data is transformed from the raw zone, Curated1 zone and DWH, and written to the Curated2 zone and DWH.

Typical L2 Transformations:

  • Aggregation
  • Redaction
  • Consolidation
  • Snapshots (see the sketch after this list)
  • Post-processing
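
As an illustration of the snapshot case, the following T-SQL sketch stages a monthly snapshot of FIN.GL for the MERGE into FIN.GL_Snapshot described below; the grouping columns, measure and snapshot_month column are assumed for illustration.

```sql
-- Sketch: a monthly Level 2 snapshot of FIN.GL staged for the MERGE
-- into FIN.GL_Snapshot. Grouping columns and SUM(amount) are
-- illustrative; snapshot_month marks the period being frozen.
INSERT INTO STG.FIN_GLSnapshot (snapshot_month, company_code, gl_code, amount)
SELECT  '2020-11-01'   AS snapshot_month,
        g.company_code,
        g.gl_code,
        SUM(g.amount)  AS amount
FROM    FIN.GL AS g
WHERE   g.start_date >= '2020-11-01'
  AND   g.start_date <  '2020-12-01'
GROUP BY g.company_code, g.gl_code;
```
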
| Column | System/User | Description | Default | Example |
|---|---|---|---|---|
| L2TransformID | S | System-generated sequence number that identifies a Level 2 transform configuration | | |
| IngestID | U | Ingestion workload to which this transformation applies. An ingestion workload can be transformed in more than one way | | |
| L1TransformID | U | Level 1 transformation to which this transformation applies. A Level 1 transformation can be transformed in more than one way | | |
| ComputePath | U | Folder path where the transform compute is located, e.g. the workspace folder containing the Spark notebook | | Shared/Common/C2 |
| ComputeName | U | Name of the transform compute, e.g. the Spark notebook or stored procedure that performs the transform | | C2Transform-FromDWH-ToC2-Generic |
| InputType | U | Type of data source for this transformation. Allowed values: Raw, Curated, Datawarehouse | | |
| InputFileSystem | U | File system/container of the input file, which could be in the raw, trusted or curated zone | | trusted |
| InputFileFolder | U | Folder where the input files are located | | erp/gl/YYYY/MM |
| InputFile | U | Input file name | | erp_gl_YYYYMMDD_HH_GUID.parquet |
| InputFileDelimiter | U | Column separator of the input file when it is CSV, TXT or tab-separated. Not applicable to JSON, Parquet, ORC or XML files | | \| |
| InputFileHeaderFlag | U | Flag (1/0) indicating whether the input file has a header. Header=1, no header=0 | | 1 |
| InputDWTable | U | Input DWH table | | FIN.GL |
| WatermarkColName | U | High-watermark timestamp column name for the input DWH table | | LAST_UPDATED |
| LastDeltaDate | U | High-watermark value for the delta column | | Today |
| LastDeltaNumber | U | High-watermark value when the delta column is a running number | | |
| MaxIntervalMinutes | U | Maximum time range in minutes for which data is extracted, i.e. from LastDeltaDate to (LastDeltaDate + MaxIntervalMinutes). If not specified, the range is from LastDeltaDate to now | | 1440 for a daily extract (1 day = 1440 minutes); 60 for an hourly extract (1440/24) |
| MaxIntervalNumber | U | Maximum number range for which data is extracted, i.e. from LastDeltaNumber to (LastDeltaNumber + MaxIntervalNumber). If not specified, the range is from LastDeltaNumber to the current number | | |
| MaxRetries | U | Maximum number of times a failed transformation re-runs before giving up | | 3 |
| OutputL2CurateFileSystem | U | File system/container where the transformed files will be written | | curated |
| OutputL2CuratedFolder | U | Folder where the transformed files will be written | | erp/gl-snapshot/2020/11 |
| OutputL2CuratedFile | U | Transformed file name | | erp_glsnapshot_202011.parquet |
| OutputL2CuratedFileDelimiter | U | Column separator of the transformed file when it is CSV, TXT or tab-separated. Not applicable to JSON, Parquet, ORC or XML files | | \| |
| OutputL2CuratedFileFormat | U | Type of the transformed file | | Parquet / ORC / CSV |
| OutputL2CuratedFileWriteMode | U | Control field that determines how the transformed data is written to the curated file. Valid values: append, overwrite, ignore, error, errorifexists | overwrite | append |
| OutputDWStagingTable | U | Temporary DWH table that stages the transformed data for MERGE updates | | STG.FIN_GLSnapshot |
| LookupColumns | U | Unique key columns required for MERGE updates. Separate multiple columns with a pipe (\|) | | company_code\|gl_code\|start_date |
| OutputDWTable | U | DWH table that stores the transformed data | | FIN.GL_Snapshot |
| OutputDWTableWriteMode | U | Control field that determines how the transformed data is written to the DWH table. Valid values: append, overwrite, ignore, error, errorifexists | append | append |
| ActiveFlag | U | Flag (1/0) to activate/deactivate the transform. Active=1 | | |
| CreatedBy | S | User who created this record | | LoginID |
| CreatedTimestamp | S | Timestamp of record creation | | Now |
| ModifiedBy | S | User who modified this record | | LoginID |
| ModifiedTimestamp | S | Timestamp of record update | | Now |

6. L2TransformInstance

The L2TransformInstance table records each execution of an L2 transform pipeline. It is populated by the transform pipeline and, in addition to providing basic audit logging and data lineage back to the ingested and Curated1 data, supports re-runs.

| Column | System/User | Description | Default | Example |
|---|---|---|---|---|
| L2TransformInstanceID | S | System-generated running number that identifies an execution of a Level 2 transform | | |
| L2TransformID | S | Level 2 transformation definition ID | | |
| IngestID | S | Ingest definition ID | | |
| L1TransformID | S | Level 1 transformation definition ID | | |
| NotebookPath | S | Folder path of the transform notebook in the Databricks workspace | | Shared/Common/C2 |
| NotebookName | S | Name of the Databricks transform notebook | | C2Transform-FromDWH-ToC2-Generic |
| InputFileSystem | S | File system/container where the input files are located; could be anywhere in the data lake (raw, curated 1 or curated 2 zone) | | trusted |
| InputFileFolder | S | Folder where the input files are located | | erp/gl/2020/11 |
| InputFile | S | Input file name | | erp_gl_20201105_00_b20469a3-d31e-4276-8677-9e6f6279660a.parquet |
| InputFileDelimiter | S | Column separator of the input file when it is CSV, TXT or tab-separated. Not applicable to JSON, Parquet, ORC or XML files | | \| |
| InputFileHeaderFlag | S | Flag (1/0) indicating whether the input file has a header. Header=1, no header=0 | | 1 |
| InputDWTable | S | Name of the DWH table if the input data comes from the DWH | | FIN.GL |
| DeltaName | S | High-watermark column of the DWH table; can be a timestamp or running-number field. If blank, all records from InputDWTable are considered for transformation | | |
| DataFromTimestamp | S | From-timestamp of the input data from the DWH if DeltaName is a timestamp column | | |
| DataToTimestamp | S | To-timestamp of the input data from the DWH if DeltaName is a timestamp column | | |
| DataFromNumber | S | From running number of the input data from the DWH if DeltaName is a running-number column | | |
| DataToNumber | S | To running number of the input data from the DWH if DeltaName is a running-number column | | |
| OutputL2CurateFileSystem | S | File system/container where the transformed files are written | | curated |
| OutputL2CuratedFolder | S | Folder where the transformed files are written | | erp/gl-snapshot/2020/11 |
| OutputL2CuratedFile | S | Transformed file name | | gl_snapshot_202011.parquet |
| OutputL2CuratedFileDelimiter | S | Column separator of the transformed file when it is CSV, TXT or tab-separated. Not applicable to JSON, Parquet, ORC or XML files | | \| |
| OutputL2CuratedFileFormat | S | Format of the transformed file | | parquet/csv/orc |
| OutputL2CuratedFileWriteMode | S | Control field that determines how the transformed data is written to the curated file. Valid values: append, overwrite, ignore, error, errorifexists | overwrite | overwrite |
| OutputDWStagingTable | S | Temporary DWH table that stages the transformed data for MERGE updates | | STG.FIN_GLSnapshot |
| LookupColumns | S | Unique key columns required for MERGE updates. Separate multiple columns with a pipe (\|) | | company_code\|gl_code\|start_date |
| OutputDWTable | S | DWH table that stores the transformed data | | FIN.GLSnapshot |
| OutputDWTableWriteMode | S | Control field that determines how the transformed data is written to the DWH table. Valid values: append, overwrite, ignore, error, errorifexists | append | append |
| InputCount | S | Number of input records | | |
| L2TransformInsertCount | S | Total number of records inserted by the transformation | | |
| L2TransformUpdateCount | S | Total number of records updated by the transformation | | |
| L2TransformDeleteCount | S | Total number of records deleted by the transformation | | |
| L2TransformStartTimestamp | S | Timestamp when the transformation started | | |
| L2TransformEndTimestamp | S | Timestamp when the transformation completed, successfully or otherwise | | |
| L2TransformStatus | S | Status of the transformation. Valid values: Running, Success, Failure, ReRunFailure, ReRunSuccess | | |
| RetryCount | S | Count of retries attempted on a failed transform. Once the count reaches MaxRetries, the transform will not attempt to retry. Reset after a successful run | | |
| ActiveFlag | S | Flag (1/0) indicating whether the transform instance is active. Set to active by default and reset after a successful run; a failed run leaves the flag set until MaxRetries is reached | | 1 |
| ReRunL2TransformFlag | S/U | Flag (1/0) to rerun a transform. Set to 0 after a successful run; must be set manually for a re-run | | 0 |
| IngestADFPipelineRunID | S | GUID of the ADF ingestion pipeline that created this record | | 31f5db5f-0b7d-469e-9e7b-6398231eadea |
| L1TransformADFPipelineRunID | S | GUID of the Level 1 transformation pipeline, if available | | |
| L2TransformADFPipelineRunID | S | GUID of this transformation pipeline | | 357FCF99-8155-4A6F-BF86-DBB6F3528710 |
| CreatedBy | S | User who created this record | | LoginID |
| CreatedTimestamp | S | Timestamp of record creation | | Now |
| ModifiedBy | S | User who modified this record | | LoginID |
| ModifiedTimestamp | S | Timestamp of record update | | Now |
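
Since RetryCount is compared against the definition's MaxRetries, a retry-eligibility check over failed instances can be expressed as the following sketch (dbo schema assumed):

```sql
-- Sketch: failed Level 2 transform instances still eligible for a
-- retry, i.e. RetryCount below the definition's MaxRetries.
SELECT  i.L2TransformInstanceID,
        i.L2TransformStatus,
        i.RetryCount,
        d.MaxRetries
FROM    dbo.L2TransformInstance   AS i
JOIN    dbo.L2TransformDefinition AS d
        ON d.L2TransformID = i.L2TransformID
WHERE   i.L2TransformStatus IN ('Failure', 'ReRunFailure')
  AND   i.ActiveFlag = 1
  AND   i.RetryCount < d.MaxRetries;
```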