An AI Vision Language Model system for extracting structured knowledge graph information (JSON) from images of process diagrams

Zackriya-Solutions/diagram2graph

Diagram2Graph header

Digitizing diagrams using Vision Language Models


Introduction

This project builds a system that converts images of process diagrams into structured, Neo4j-compatible JSON using AI vision models. It starts with an exploration of general-purpose models (from Anthropic), followed by fine-tuning a smaller vision language model to improve performance and reduce latency.

Demo

Objectives

  • Extract diagram information from images
  • Explore a general-purpose model (Claude 3.5 Sonnet)
  • Compare image-to-Cypher and image-to-JSON approaches
  • Develop a workflow for image-to-JSON conversion (JSON that is convertible to a graph)
  • Fine-tune the vision language model Qwen2.5-VL-3B for image-to-JSON
  • Implement a prototype for integration
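To illustrate why JSON output remains "convertible to graph", here is a minimal sketch that turns a nodes/edges dictionary into Cypher statements. The field names (`id`, `label`, `source`, `target`), the `Node` label, and the `CONNECTS_TO` relationship type are assumptions made for illustration; the README does not specify the project's actual schema. Values are rendered with `json.dumps`, on the assumption that JSON string syntax is acceptable as a Cypher literal; a real integration would more likely use parameterized queries.

```python
import json


def graph_to_cypher(graph):
    """Turn a {"nodes": [...], "edges": [...]} dict into a list of Cypher statements.

    The schema used here is hypothetical, not the project's documented format.
    """
    statements = []
    for node in graph["nodes"]:
        statements.append(
            f'MERGE (n:Node {{id: {json.dumps(node["id"])}}}) '
            f'SET n.label = {json.dumps(node["label"])}'
        )
    for edge in graph["edges"]:
        statements.append(
            f'MATCH (a:Node {{id: {json.dumps(edge["source"])}}}), '
            f'(b:Node {{id: {json.dumps(edge["target"])}}}) '
            f'MERGE (a)-[:CONNECTS_TO {{label: {json.dumps(edge.get("label", ""))}}}]->(b)'
        )
    return statements


# Hypothetical two-node diagram
sample_graph = {
    "nodes": [{"id": "1", "label": "Start"}, {"id": "2", "label": "Review"}],
    "edges": [{"source": "1", "target": "2", "label": "submit"}],
}

cypher_statements = graph_to_cypher(sample_graph)
for statement in cypher_statements:
    print(statement)
```

Keeping the model output as JSON and generating Cypher in a separate step like this means the graph can be validated before anything is written to the database.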

🎯 Key Features

✅ Diagram-to-Graph Conversion
Extract nodes, edges, and attributes from images into JSON for Neo4j ingestion

✅ Improved Performance

  • +14 points F1 on node detection and +23 points F1 on edge detection vs. the base model
  • Runs on the 3B-parameter Qwen2.5-VL with LoRA fine-tuning

✅ Privacy-First Design
No API dependencies: diagrams are processed locally

🛠️ How It Works

Workflow Overview

High Level Architecture

So far, the model has been fine-tuned and a performance improvement has been observed. The complete system (frontend, Neo4j integration, etc.) is yet to be developed.

  1. Input: Image of a process/flow diagram
  2. Processing: Vision-Language Model (VLM) extracts nodes/edges
  3. Output: Structured JSON representing the knowledge graph
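As a sketch of what the output step could look like downstream, the snippet below checks that an extracted graph is internally consistent before it is used. The nodes/edges layout and the field names `id`, `source`, and `target` are assumptions for illustration; the README does not document the exact JSON schema.

```python
def validate_graph(graph):
    """Return a list of problems in a nodes/edges dict (an empty list means it looks consistent).

    The expected schema here is hypothetical, not the project's documented format.
    """
    problems = []
    node_ids = [node.get("id") for node in graph.get("nodes", [])]
    if len(node_ids) != len(set(node_ids)):
        problems.append("duplicate node ids")
    known_ids = set(node_ids)
    for index, edge in enumerate(graph.get("edges", [])):
        for end in ("source", "target"):
            if edge.get(end) not in known_ids:
                problems.append(f"edge {index}: unknown {end} {edge.get(end)!r}")
    return problems


# Hypothetical extraction results: one consistent, one with a dangling edge
good_graph = {
    "nodes": [{"id": "1", "label": "Start"}, {"id": "2", "label": "End"}],
    "edges": [{"source": "1", "target": "2"}],
}
bad_graph = {
    "nodes": [{"id": "1", "label": "Start"}],
    "edges": [{"source": "1", "target": "9"}],
}

print(validate_graph(good_graph))
print(validate_graph(bad_graph))
```

A check like this catches edges that point at nodes the model never emitted, which is one plausible failure mode when edge detection is weaker than node detection.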

🔬 Technical Deep Dive

Why Fine-Tuning a Smaller Model?

Proprietary models perform well on general-purpose tasks, but for specific requirements, fine-tuning on relevant data can yield well-optimized results even with a smaller model (here, 3B parameters), which also means lower resource usage.

πŸ’‘ Key Advantages

| Factor | Large Proprietary Models | Fine-Tuned Qwen2.5-VL |
|---|---|---|
| Compute Cost | High (API fees + cloud GPU usage) | Low (local inference) |
| Privacy | Data sent to third-party APIs | On-premise processing |
| Customization | Limited to API constraints | Full control via LoRA |

Solution: Fine-tune Qwen2.5-VL-3B with domain-specific data

Dataset & Training

| Specification | Details |
|---|---|
| Training Data | 200 hand-labeled diagrams |
| Method | LoRA (PEFT) + FP32 precision |
| Epochs | 10 |
| Hardware | 1x GPU (24GB+ VRAM recommended) |

A quantized version can be expected soon.
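For a rough sense of why LoRA keeps fine-tuning feasible on a single GPU: instead of updating a full weight matrix of shape (d_out x d_in), LoRA trains two low-rank factors with rank * (d_in + d_out) parameters in total. The sketch below computes that ratio. The dimensions and rank are illustrative assumptions, not values taken from this project's training configuration.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two LoRA factors: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def full_matrix_params(d_in: int, d_out: int) -> int:
    """Parameters in the frozen weight matrix that the LoRA factors adapt."""
    return d_in * d_out


# Illustrative values only (assumed, not the project's actual settings)
D_IN, D_OUT, RANK = 2048, 2048, 8

lora_count = lora_trainable_params(D_IN, D_OUT, RANK)
full_count = full_matrix_params(D_IN, D_OUT)
trainable_fraction = lora_count / full_count

print(f"LoRA factors: {lora_count:,} trainable parameters")
print(f"Full matrix:  {full_count:,} frozen parameters")
print(f"Trainable fraction for this matrix: {trainable_fraction:.2%}")
```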

Evaluation Results

Significant improvements in structured extraction score:

| Metric | Base Model (F1) | Fine-Tuned (F1) | Improvement |
|---|---|---|---|
| Node Detection | 74.9% | 89.1% | +14.2 points |
| Edge Detection | 46.05% | 69.45% | +23.4 points |
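The README does not describe how predictions are matched against labels, so the following is only a sketch of one common way to compute an F1 score for node and edge detection: treat predictions and references as sets and require exact matches. The matching rule and the sample data are assumptions for illustration.

```python
def precision_recall_f1(predicted, reference):
    """Set-based precision, recall, and F1 using exact-match comparison."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical labels and predictions for a single diagram
reference_nodes = {"Start", "Review", "Approve", "End"}
predicted_nodes = {"Start", "Review", "End", "Reject"}

reference_edges = {("Start", "Review"), ("Review", "Approve"), ("Approve", "End")}
predicted_edges = {("Start", "Review"), ("Review", "End")}

node_scores = precision_recall_f1(predicted_nodes, reference_nodes)
edge_scores = precision_recall_f1(predicted_edges, reference_edges)

print("nodes (precision, recall, F1):", node_scores)
print("edges (precision, recall, F1):", edge_scores)
```

An edge only counts as correct when both of its endpoints are right, which is one reason edge scores tend to sit below node scores, as they do in the table above.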

🚀 How to use

Try it on Google Colab:

Open In Colab

Installation

pip install -q transformers accelerate "qwen-vl-utils[decord]==0.0.8"

Inference Example

from transformers import Qwen2_5_VLForConditionalGeneration, Qwen2_5_VLProcessor
from qwen_vl_utils import process_vision_info  # needed by run_inference
import torch

# Load fine-tuned model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "zackriya/diagram2graph-adapters",
    device_map="auto",
    torch_dtype=torch.bfloat16
)
processor = Qwen2_5_VLProcessor.from_pretrained("zackriya/diagram2graph-adapters")

SYSTEM_MESSAGE = """You are a Vision Language Model specialized in extracting structured data from visual representations of process and flow diagrams.
Your task is to analyze the provided image of a diagram and extract the relevant information into a well-structured JSON format.
The diagram includes details such as nodes and edges. each of them have their own attributes.
Focus on identifying key data fields and ensuring the output adheres to the requested JSON structure.
Provide only the JSON output based on the extracted information. Avoid additional explanations or comments."""

# Process image
def run_inference(image):
    messages= [
        {"role": "system","content": [{"type": "text", "text": SYSTEM_MESSAGE}],},
        {"role": "user",
            "content": [
                {"type": "image","image": image,},
                {"type": "text","text": "Extract data in JSON format, Only give the JSON",},
            ],
        },
    ]

    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, _ = process_vision_info(messages)

    inputs = processor(
        text=[text],
        images=image_inputs,
        return_tensors="pt",
    )
    inputs = inputs.to(model.device)  # follow the device chosen by device_map="auto"

    generated_ids = model.generate(**inputs, max_new_tokens=1024)
    generated_ids_trimmed = [
        out_ids[len(in_ids):]
        for in_ids, out_ids
        in zip(inputs.input_ids, generated_ids)
    ]

    output_text = processor.batch_decode(
        generated_ids_trimmed,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )
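    return output_text

# Usage: load a diagram image (the path below is a placeholder)
from PIL import Image
image = Image.open("path/to/diagram.png")
output = run_inference(image)

# Parse the first (and only) decoded output as JSON
import json
graph = json.loads(output[0])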
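Calling `json.loads` directly raises an exception if the model returns anything other than bare JSON. The helper below is a defensive sketch. It assumes, without confirmation from the README, that the model might occasionally wrap its answer in a Markdown code fence or emit malformed JSON. The function name `parse_model_json` is hypothetical.

```python
import json
import re

# Build the fence marker programmatically so this snippet can itself sit inside a fenced block
FENCE = "`" * 3
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def parse_model_json(raw_text):
    """Parse model output as JSON, tolerating an optional Markdown code fence.

    Returns the parsed object, or None if the text is not valid JSON.
    """
    text = raw_text.strip()
    fenced = FENCED_BLOCK.search(text)
    if fenced:
        text = fenced.group(1)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


# Hypothetical model outputs
plain_output = '{"nodes": [], "edges": []}'
fenced_output = FENCE + 'json\n{"nodes": [{"id": "1"}]}\n' + FENCE

print(parse_model_json(plain_output))
print(parse_model_json(fenced_output))
print(parse_model_json("not json at all"))
```

Returning `None` rather than raising lets a calling pipeline decide whether to retry the generation or skip the image.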

📈 What's Next?

  • Dataset Expansion: More Data
  • Quantized Models: For better resource management
  • Ollama Integration: Simplify local deployment
  • Python Library: Plug-and-Play use
  • Neo4J Integration: Knowledge graph DB integration

πŸ™ Acknowledgments


🤝 Collaborate

Are you interested in fine-tuning your own model for your use case, or would you like to explore how we can help? Let's collaborate.

Zackriya Solutions


🔗 Associated Content


📜 License

Apache 2.0 License | Developed by Zackriya Solutions


Give us a ⭐️ if you like this work!
