Organizations ingest a vast number of documents from vendors, departments, and external sources into their internal systems. These documents often carry critical data needed downstream, yet many businesses still rely on manual or semi-automated extraction processes that are inefficient and require significant human intervention.
This solution fully automates the end-to-end data extraction process: Document AI models tailored to different PDF formats extract the data, Snowflake stores the structured output, and metadata tables manage model configurations and confidence scores.
The architecture consists of multiple components working together to efficiently process, extract, and validate data from PDFs:
- 📄 Document AI Models – Custom-trained models to extract structured information from PDFs.
- 🗂 Metadata Management – Centralized tables to manage model configurations, extracted data scoring, and processing rules.
- ✅ Pre-Processing (Gatekeeper Process) – A Python-based validation layer to filter and approve documents before processing.
- 🔄 ETL Pipeline – Handles data extraction, transformation, and validation to ensure high-quality structured output.
- 🖥 Streamlit UI – A web-based interface for managing models, tracking processes, handling exceptions, and enabling reprocessing.
This automated pipeline enhances efficiency, reduces manual workload, and ensures accurate and scalable document processing within Snowflake. 🚀
- Navigate to Snowsight → Document AI.
- Create a new model and name it Invoice_model.
- Upload training PDFs (use a couple of documents from the sample directory in the GitHub repository).
- In real-world scenarios, at least 20 training documents are recommended for better results.
- Define the following values for the model to extract:
- AMOUNT
- DISCOUNT
- INVOICE_DATE
- INV_NUMBER
- ITEM_DESCRIPTIONS
- PO_NUMBER
- QUANTITY
- RATE
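Once the model is trained and published, you can smoke-test it directly in SQL before wiring up the pipeline. A minimal sketch, assuming the Invoice_model build is published as version 1 and that a sample PDF (here the hypothetical sample_invoice.pdf) sits on the @invoice_doc stage described below:

```sql
-- Ask the published model build to extract the defined values from one
-- document; the staged file is passed to the model via a presigned URL.
SELECT Invoice_model!PREDICT(
    GET_PRESIGNED_URL(@invoice_doc, 'sample_invoice.pdf'),
    1  -- model build version
);
```

The result is a JSON object holding each extracted value together with a confidence score; those scores are what the score threshold table described later is checked against.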
- All scripts are available in the GitHub repository; download it so the SQL and sample files referenced below are available locally.
- @invoice_doc – The source stage where incoming PDF documents land.
- @manual_review – Documents that fail pre-filter validation are moved here for manual review.
- @ignored_docs – Documents deemed unfit for Document AI processing are stored here.
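If you create these stages yourself (the setup script in the next step may also handle this), keep two requirements in mind: Document AI reads files through presigned URLs, which requires server-side encryption, and the stream-driven pipeline needs a directory table on each stage. A sketch:

```sql
-- Internal stages for the pipeline. SNOWFLAKE_SSE encryption lets
-- Document AI read the files via presigned URLs; the directory table
-- lets streams detect newly landed documents.
CREATE OR REPLACE STAGE invoice_doc
  DIRECTORY  = (ENABLE = TRUE)
  ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');

CREATE OR REPLACE STAGE manual_review
  DIRECTORY  = (ENABLE = TRUE)
  ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');

CREATE OR REPLACE STAGE ignored_docs
  DIRECTORY  = (ENABLE = TRUE)
  ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');
```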
- Create a warehouse named DS_DEV_WH for processing tasks (a sample statement follows this list).
- Open the DOC_AI_QuickStart.SQL file from the GitHub repository in a Snowsight worksheet.
- Execute the script to create all necessary database objects, including:
- Tables
- Views
- Streams
- Stored Procedures
- Tasks
- This script will also create a database named DS_DEV_DB along with the required schemas.
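If the script does not create the warehouse for you, a minimal statement might look like this (the sizing is an assumption; adjust to your workload):

```sql
-- Warehouse for the extraction tasks; auto-suspend keeps idle cost low.
CREATE OR REPLACE WAREHOUSE DS_DEV_WH
  WAREHOUSE_SIZE      = 'XSMALL'
  AUTO_SUSPEND        = 60
  AUTO_RESUME         = TRUE
  INITIALLY_SUSPENDED = TRUE;
```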
- Insert the values defined during Document AI model training, together with their minimum acceptable confidence scores, into the score threshold table (see the sketch below).
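The actual table and column names come from DOC_AI_QuickStart.SQL; the statement below is an illustration only, assuming a hypothetical SCORE_THRESHOLD table with one row per extracted value and made-up threshold scores:

```sql
-- Illustrative only: the real table layout is defined by the setup
-- script. Each row pairs an extracted value with the minimum confidence
-- score the pipeline will accept for that field.
INSERT INTO SCORE_THRESHOLD (field_name, min_score) VALUES
  ('AMOUNT',            0.80),
  ('DISCOUNT',          0.80),
  ('INVOICE_DATE',      0.85),
  ('INV_NUMBER',        0.90),
  ('ITEM_DESCRIPTIONS', 0.75),
  ('PO_NUMBER',         0.90),
  ('QUANTITY',          0.80),
  ('RATE',              0.80);
```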
- Navigate to DS_DEV_DB → Tasks in Snowsight.
- Resume the Extract_data and Preprocessing tasks.
- The Extract_data task includes a chain of dependent tasks.
- By default, the pre-filter and Extract_data tasks run every minute; modify the schedule as needed (see the example after this list).
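The tasks can also be managed in SQL. A sketch, assuming your session is set to the database and schema where the tasks live:

```sql
-- Resume the standalone pre-processing task.
ALTER TASK PREPROCESSING RESUME;

-- EXTRACT_DATA heads a chain of dependent tasks; this system function
-- resumes the root task and all of its dependents in one call.
SELECT SYSTEM$TASK_DEPENDENTS_ENABLE('EXTRACT_DATA');

-- A task must be suspended before its schedule can be changed.
ALTER TASK EXTRACT_DATA SUSPEND;
ALTER TASK EXTRACT_DATA SET SCHEDULE = '5 MINUTE';
ALTER TASK EXTRACT_DATA RESUME;
```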
- Download sample documents from the sample_docs directory of the repository.
- Upload them to the source stage @invoice_doc for processing (an example PUT command follows).
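From SnowSQL (or any client that supports PUT), the upload can be done like this; the local path is a placeholder for wherever you downloaded the repository:

```sql
-- Upload the sample PDFs uncompressed so Document AI can read them.
PUT file:///path/to/repo/sample_docs/*.pdf @invoice_doc
  AUTO_COMPRESS = FALSE;

-- Internal stages do not auto-refresh their directory tables, so make
-- the new files visible to the stream explicitly.
ALTER STAGE invoice_doc REFRESH;
```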
- In Snowsight, create a new Streamlit application.
- Edit the Streamlit app and copy the code from Streamlit-in-Snowflake.py in the repository.
- Paste the copied code into the Streamlit editor and run it.
- Start the processing tasks to initiate the document flow.
- Use the Streamlit app’s live view to monitor document movement through the pipeline.
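Alongside the Streamlit live view, task history gives a quick SQL-side health check; a sketch using the task names from this guide:

```sql
-- Recent runs of the pipeline tasks, newest first, with any errors.
SELECT name, state, scheduled_time, error_message
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY())
WHERE name IN ('PREPROCESSING', 'EXTRACT_DATA')
ORDER BY scheduled_time DESC
LIMIT 20;
```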