Organizations ingest a vast number of documents from vendors, departments, and external sources into their internal systems. These documents often carry critical data needed downstream, yet many businesses still rely on manual or semi-automated extraction processes that are inefficient and require significant human intervention.
This solution fully automates the end-to-end data extraction process: Document AI models tailored to different PDF formats extract the data, Snowflake stores the structured output, and metadata tables manage model configurations and confidence scores.
The architecture consists of multiple components working together to efficiently process, extract, and validate data from PDFs:
- 📄 Document AI Models – Custom-trained models to extract structured information from PDFs.
- 🗂 Metadata Management – Centralized tables to manage model configurations, extracted data scoring, and processing rules.
- ✅ Pre-Processing (Gatekeeper Process) – A Python-based validation layer to filter and approve documents before processing.
- 🔄 ETL Pipeline – Handles data extraction, transformation, and validation to ensure high-quality structured output.
- 🖥 Streamlit UI – A web-based interface for managing models, tracking processes, handling exceptions, and enabling reprocessing.
This automated pipeline enhances efficiency, reduces manual workload, and ensures accurate and scalable document processing within Snowflake. 🚀
- Navigate to Snowsight → Document AI.
- Create a new model and name it Invoice_model.
- Upload training PDFs (use a couple of documents from the sample directory in the GitHub repository).
- In real-world scenarios, at least 20 training documents are recommended for better results.
- Define the following values for the model to extract:
- AMOUNT
- DISCOUNT
- INVOICE_DATE
- INV_NUMBER
- ITEM_DESCRIPTIONS
- PO_NUMBER
- QUANTITY
- RATE
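Once the model is trained and published, you can smoke-test it directly in SQL before wiring up the pipeline. A minimal sketch, assuming the Invoice_model build is published as version 1 and that a sample PDF (here the hypothetical sample_invoice.pdf) sits on the @invoice_doc stage described below:

```sql
-- Ask the published model build to extract the defined values from one
-- document; the staged file is passed to the model via a presigned URL.
SELECT Invoice_model!PREDICT(
    GET_PRESIGNED_URL(@invoice_doc, 'sample_invoice.pdf'),
    1  -- model build version
);
```

The result is a JSON object holding each extracted value together with a confidence score; those scores are what the score threshold table described later is checked against.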
- All scripts are available in the GitHub repository; download it so the SQL and sample files referenced below are available locally.
- @invoice_doc – The source stage where incoming PDF documents land.
- @manual_review – Documents that fail pre-filter validation are moved here for manual review.
- @ignored_docs – Documents deemed unfit for Document AI processing are stored here.
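If you create these stages yourself (the setup script in the next step may also handle this), keep two requirements in mind: Document AI reads files through presigned URLs, which requires server-side encryption, and the stream-driven pipeline needs a directory table on each stage. A sketch:

```sql
-- Internal stages for the pipeline. SNOWFLAKE_SSE encryption lets
-- Document AI read the files via presigned URLs; the directory table
-- lets streams detect newly landed documents.
CREATE OR REPLACE STAGE invoice_doc
  DIRECTORY  = (ENABLE = TRUE)
  ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');

CREATE OR REPLACE STAGE manual_review
  DIRECTORY  = (ENABLE = TRUE)
  ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');

CREATE OR REPLACE STAGE ignored_docs
  DIRECTORY  = (ENABLE = TRUE)
  ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');
```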
- Create a warehouse named DS_DEV_WH for processing tasks (a sample statement follows this list).
- Open the DOC_AI_QuickStart.SQL file from the GitHub repository in a Snowsight worksheet.
- Execute the script to create all necessary database objects, including:
- Tables
- Views
- Streams
- Stored Procedures
- Tasks
- This script will also create a database named DS_DEV_DB along with the required schemas.
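If the script does not create the warehouse for you, a minimal statement might look like this (the sizing is an assumption; adjust to your workload):

```sql
-- Warehouse for the extraction tasks; auto-suspend keeps idle cost low.
CREATE OR REPLACE WAREHOUSE DS_DEV_WH
  WAREHOUSE_SIZE      = 'XSMALL'
  AUTO_SUSPEND        = 60
  AUTO_RESUME         = TRUE
  INITIALLY_SUSPENDED = TRUE;
```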
- Insert the values defined during Document AI model training, together with their minimum acceptable confidence scores, into the score threshold table (see the sketch below).
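The actual table and column names come from DOC_AI_QuickStart.SQL; the statement below is an illustration only, assuming a hypothetical SCORE_THRESHOLD table with one row per extracted value and made-up threshold scores:

```sql
-- Illustrative only: the real table layout is defined by the setup
-- script. Each row pairs an extracted value with the minimum confidence
-- score the pipeline will accept for that field.
INSERT INTO SCORE_THRESHOLD (field_name, min_score) VALUES
  ('AMOUNT',            0.80),
  ('DISCOUNT',          0.80),
  ('INVOICE_DATE',      0.85),
  ('INV_NUMBER',        0.90),
  ('ITEM_DESCRIPTIONS', 0.75),
  ('PO_NUMBER',         0.90),
  ('QUANTITY',          0.80),
  ('RATE',              0.80);
```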
- Navigate to DS_DEV_DB → Tasks in Snowsight.
- Resume the Extract_data and Preprocessing tasks.
- The Extract_data task includes a chain of dependent tasks.
- By default, the pre-filter and Extract_data tasks run every minute; modify the schedule as needed (see the example after this list).
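The tasks can also be managed in SQL. A sketch, assuming your session is set to the database and schema where the tasks live:

```sql
-- Resume the standalone pre-processing task.
ALTER TASK PREPROCESSING RESUME;

-- EXTRACT_DATA heads a chain of dependent tasks; this system function
-- resumes the root task and all of its dependents in one call.
SELECT SYSTEM$TASK_DEPENDENTS_ENABLE('EXTRACT_DATA');

-- A task must be suspended before its schedule can be changed.
ALTER TASK EXTRACT_DATA SUSPEND;
ALTER TASK EXTRACT_DATA SET SCHEDULE = '5 MINUTE';
ALTER TASK EXTRACT_DATA RESUME;
```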
- Download sample documents from the sample_docs directory of the repository.
- Upload them to the source stage @invoice_doc for processing (an example PUT command follows).
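From SnowSQL (or any client that supports PUT), the upload can be done like this; the local path is a placeholder for wherever you downloaded the repository:

```sql
-- Upload the sample PDFs uncompressed so Document AI can read them.
PUT file:///path/to/repo/sample_docs/*.pdf @invoice_doc
  AUTO_COMPRESS = FALSE;

-- Internal stages do not auto-refresh their directory tables, so make
-- the new files visible to the stream explicitly.
ALTER STAGE invoice_doc REFRESH;
```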
- In Snowsight, create a new Streamlit application.
- Edit the Streamlit app and copy the code from Streamlit-in-Snowflake.py in the repository.
- Paste the copied code into the Streamlit editor and run it.
- Start the processing tasks to initiate the document flow.
- Use the Streamlit app’s live view to monitor document movement through the pipeline.
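Alongside the Streamlit live view, task history gives a quick SQL-side health check; a sketch using the task names from this guide:

```sql
-- Recent runs of the pipeline tasks, newest first, with any errors.
SELECT name, state, scheduled_time, error_message
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY())
WHERE name IN ('PREPROCESSING', 'EXTRACT_DATA')
ORDER BY scheduled_time DESC
LIMIT 20;
```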