ide-cap-chan is a utility for batch captioning images with natural language using various Vision-Language (VL) models.
- High-speed processing: Optimized for rapid batch caption generation with ExLlama2, Qwen2-VL-7B-Instruct, Qwen2-VL-2B-Instruct (Vikhr family included), Idefics3-8B-Llama3, LLaVA-NeXT (LLaVA-1.6), Llama JoyCaption Alpha Two, Molmo-7B-O, Molmo-72B, MiniCPM-o-2_6, and Pixtral models
- Multi-GPU support: Distribute workloads across multiple GPUs
- Efficient quantization: Supports ExLlama2 (exl2), int8, and nf4 quantization for reduced VRAM usage
- Autoload strategies: VRAM-optimized loading
- Model flexibility: Use default or custom models via CLI arguments
- Input flexibility: Supports Hugging Face, local, and external models
- Tag integration: Enhance captions with existing tags/captions
- Process control: Interrupt and resume captioning tasks
- Batch processing: Recursively process subfolders in input directories
- NVIDIA GPU with CUDA support (8 GB VRAM minimum for LLaVA, 12 GB recommended for Qwen2-VL-7B in exl2, 48 GB total VRAM for Molmo-72B)
- Clone the repository:
  ```
  git clone https://github.com/2dameneko/ide-cap-chan
  ```
- Install dependencies:
  - Windows: Run `install.bat`
  - Linux: Create a virtual environment and install the requirements:
    ```
    python -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt
    ```
- Place images and corresponding tag files in the input folder (default: `2tag`)
- Start processing:
  - Windows: Run `batch_processing.bat`
  - Linux: Execute `python ide-cap-chan.py`
- Specify alternative models using CLI arguments (see the options table below)
- Customize prompts in `model_handler.py` (modify `system_prompt` and `user_prompt`); see the sketch after this list
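A minimal sketch of what that customization might look like, assuming `system_prompt` and `user_prompt` are plain string assignments (the actual contents of `model_handler.py` will differ):

```python
# Illustrative excerpt only; the real model_handler.py differs.
# system_prompt sets the captioning persona, user_prompt is the per-image request.
system_prompt = "You are an image captioning assistant. Describe images factually."
user_prompt = "Describe this image in one detailed paragraph."
```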
- To update on Windows, run `update.cmd`
Run without arguments for default behavior. Available CLI options (`python ide-cap-chan.py -h`):
| Argument | Description |
|---|---|
| `--model_path` | Path to model (Hugging Face, local, or external) |
| `--model_type` | Model architecture/loader: `idefics3`, `llava`, `joy-caption`, `molmo`, `qwen2vl`, `molmo72b`, `pixtral`, `exllama2`, `minicpmo`, `generic` (default: `exllama2`) |
| `--input_dir` | Input directory path (default: `2tag`) |
| `--CUDA_VISIBLE_DEVICES` | Comma-separated GPU IDs (default: `0`). Notes: multi-GPU use may strain your PSU; `molmo72b` ignores this argument and auto-splits across GPUs |
| `--caption_suffix` | Caption file extension (default: `.txt`) |
| `--caption_format` | Output format: `json`, `markdown`, `short`, `long`, `bbox` (requires ToriiGate ≥0.4) |
| `--add_tags` | Enhance captions with existing tag files (ToriiGate-family models) (default: `.ttxt`) |
| `--add_chars` | Enhance captions with character information (requires ToriiGate ≥0.4) (default: `.ttxt`) |
| `--add_char_traits` | Enhance captions with character traits (requires ToriiGate ≥0.4) (default: `.ttxt`) |
| `--add_info` | Enhance captions with miscellaneous image info (requires ToriiGate ≥0.4) (default: `.ttxt`) |
| `--no_chars` | Do not add character names (requires ToriiGate ≥0.4) |
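For example, a hypothetical invocation combining several of the options above (the model path and GPU IDs are placeholders):

```
python ide-cap-chan.py --model_type qwen2vl --model_path Qwen/Qwen2-VL-7B-Instruct --input_dir ./my_images --CUDA_VISIBLE_DEVICES 0,1
```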
Supported image formats: `.jpg`, `.png`, `.webp`, `.jpeg`
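As an illustration, an input folder using the default directory and suffixes might be laid out like this (file names are hypothetical; subfolders are processed recursively):

```
2tag/
├── cats/
│   ├── cat1.jpg
│   ├── cat1.ttxt   # existing tags, consumed with --add_tags
│   └── cat1.txt    # generated caption
└── dog.png
```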
- 0.9: Added MiniCPM-o-2_6 loader support, rewrote the code to a modular design, pinned dependency versions
- 0.8: Added ExLlama2 loader support (default), ToriiGate-v0.4 features, Molmo-72B auto-split
- 0.7: Added Molmo/Qwen2VL/Pixtral support, improved multi-GPU quant processing, code refactor
- 0.6: Internal code improvements
- 0.5: Added JoyCaption support, code refactor
- 0.4: Added LLaVA support, updated to PyTorch 2.5.1
- 0.3: Improved argument handling, fixed extension case sensitivity
- 0.2:
- Multi-GPU support with load balancing
- nf4 quantization
- Fixed duplicate file filtering
- Updated environment scripts
- 0.1: Initial release
This project is a proof of concept and not production-ready.
- Idefics3 Architecture: HuggingFaceM4/Idefics3-8B-Llama3
- LLaVA Architecture: Transformers Documentation
- JoyCaption Code: fpgaminer/joycaption
- Qwen2-VL Architecture: Qwen/Qwen2-VL-7B-Instruct
- Qwen2-VL Implementation: MNeMoNiCuZ/qwen2-vl-7b-captioner-relaxed-batch
- Molmo Architecture: AllenAI Collection
- Pixtral Architecture: Pixtral Documentation
- MiniCPM-o-2_6 Architecture: MiniCPM-o-2_6 Documentation
- Vikhr-2-VL: Vikhr-2-VL Documentation
- ExLlamaV2: ExLlamaV2 Documentation
Model Credits
ToriiGate · LLaVA · JoyCaption · Qwen2, Pixtral · Molmo · Molmo72b · MiniCPM-o-2_6 · Vikhr-2-VL-2b-Instruct
Thank you for your interest in ide-cap-chan!