TextCrafter enables precise multi-region visual text rendering, addressing the challenges of long, small-size visual text and varied numbers, symbols, and styles in visual text generation. We illustrate comparisons of TextCrafter with three state-of-the-art models: FLUX, TextDiffuser-2, and 3DIS.
- We propose TextCrafter, a novel training-free framework for rendering multiple texts in complex visual scenes, consisting of three key stages: instance fusion, region insulation, and text focus.
- We construct a new CVTG-2K dataset, which encompasses a diverse range of visual text prompts varying in position, quantity, length, and attributes, serving as a robust benchmark for evaluating generation models on the Complex Visual Text Generation (CVTG) task.
- We conduct extensive quantitative and qualitative experiments on the CVTG-2K benchmark. The results demonstrate that TextCrafter delivers exceptional performance, proving its superior effectiveness and robustness in tackling the CVTG task.
If you find this work useful for your research and applications, please cite us using this BibTeX:
@article{du2025textcrafter,
  title={TextCrafter: Accurately Rendering Multiple Texts in Complex Visual Scenes},
  author={Du, Nikai and Chen, Zhennan and Chen, Zhizhou and Gao, Shan and Chen, Xi and Jiang, Zhengkai and Yang, Jian and Tai, Ying},
  journal={arXiv preprint arXiv:2503.23461},
  year={2025}
}

@article{chen2024region,
  title={Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement},
  author={Chen, Zhennan and Li, Yajie and Wang, Haofan and Chen, Zhibo and Jiang, Zhengkai and Li, Jun and Wang, Qian and Yang, Jian and Tai, Ying},
  journal={arXiv preprint arXiv:2411.06558},
  year={2024}
}
TextCrafter consists of three steps. (a) Instance Fusion: Strengthen the connection between visual text and its corresponding carrier. (b) Region Insulation: Leverage the positional priors of the pre-trained DiT model to initialize the layout information for each text instance while separating and denoising text prompts across different regions. (c) Text Focus: Enhance the attention maps of visual text, refining the fidelity of text rendering.
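To make the Text Focus step concrete, below is a minimal, self-contained sketch of attention reweighting on toy tensors, in the spirit of prompt-to-prompt-style reweighting: boost the attention paid to the prompt tokens that spell the visual text, then renormalize. The function name, tensor shapes, and scale value are illustrative assumptions; TextCrafter's exact rule lives in the repo.

```python
import torch

def reweight_text_attention(attn_probs, text_token_ids, scale=1.5):
    """Boost cross-attention toward the prompt tokens spelling the visual text.

    attn_probs: (batch, heads, image_tokens, text_tokens); rows sum to 1.
    text_token_ids: indices of the tokens inside quotation marks.
    scale: values > 1 amplify focus on those tokens (illustrative default).
    """
    weighted = attn_probs.clone()
    weighted[..., text_token_ids] *= scale            # amplify the quote's tokens
    return weighted / weighted.sum(-1, keepdim=True)  # renormalize each row

# Toy check on random attention maps: mass shifts toward tokens 3-5.
probs = torch.softmax(torch.randn(1, 8, 64, 16), dim=-1)
boosted = reweight_text_attention(probs, text_token_ids=[3, 4, 5])
print(boosted[..., [3, 4, 5]].sum(-1).mean() > probs[..., [3, 4, 5]].sum(-1).mean())
```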
- [2025.04.01]: The dataset is released at link.
- [2025.04.01]: TextCrafter code has been released and supports FLUX and SD3.5!
Clone this repo:
git clone https://github.com/NJU-PCALab/TextCrafter.git
cd TextCrafter
Set up a new environment and install the packages as follows:
conda create -n textcrafter python==3.9 -y
conda activate textcrafter
pip install xformers==0.0.28.post1 diffusers==0.32.2 peft==0.14.0 torchvision==0.19.1 opencv-python==4.10.0.84 sentencepiece==0.2.0 protobuf==5.28.1 scipy==1.13.1 six==1.17.0 gurobipy==12.0.0 matplotlib==3.9.4
Run the simple demo below to perform inference and verify that the environment is correctly installed:
cd TextCrafer_FLUX # or cd TextCrafer_SD3
python demo.py
If you downloaded the FLUX.1 dev or SD3.5 model in advance, please specify the model path in demo.py and pre_generation.py.
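For reference, pointing diffusers at a local checkpoint typically looks like the following; the local path is a placeholder, and the exact pipeline construction in demo.py may differ.

```python
import torch
from diffusers import FluxPipeline  # use StableDiffusion3Pipeline for SD3.5

# Load from a local directory instead of the Hugging Face Hub ID
# ("black-forest-labs/FLUX.1-dev"). Replace the path with your own.
pipe = FluxPipeline.from_pretrained(
    "/path/to/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
).to("cuda")
```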
Parameter Description (a usage sketch follows this list):

- `prompt`: Text description of the entire image.
- `carrier_list`: A list of the carriers/locations of each visual text; a carrier can be a concrete object or an abstract concept, but it must come from the prompt.
- `sentence_list`: A list of individual descriptions for each visual text, which you can extract from the prompt.
- `pre_generation_steps`: The number of inference steps used in the pre-generation process. Only a few steps are needed to obtain a rough layout.
- `insulation_steps`: The number of denoising steps performed with Region Insulation. A higher value controls object positions more accurately, but region boundaries may become more visible.
- `num_inference_steps`: Total number of sampling/inference steps.
- `cross_replace_steps`: In the Text Focus stage, the proportion of steps that apply attention reweighting.
- `min_area`: The minimum area of each visual text's rectangle when optimizing the layout. Set it according to the number of visual texts; recommended values for (1, 2, 3, 4, 5) visual texts are (0.65, 0.3, 0.2, 0.15, 0.12), respectively.
- `addition`: In the Instance Fusion stage, the control coefficient for embedding addition.
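As a hedged illustration of how these parameters fit together, the call below is hypothetical: the actual entry point, argument names, and defaults are defined in demo.py, and the prompt and numeric values here are made up for the example.

```python
# Hypothetical call; consult demo.py for the real entry point and defaults.
image = pipe(
    prompt="A cozy cafe with a chalkboard reading 'Fresh Brew' "
           "and a mug printed with 'Good Morning'",
    carrier_list=["chalkboard", "mug"],       # carriers taken from the prompt
    sentence_list=[
        "a chalkboard reading 'Fresh Brew'",  # one description per visual text
        "a mug printed with 'Good Morning'",
    ],
    pre_generation_steps=5,     # a few steps suffice for a rough layout
    insulation_steps=10,        # higher = tighter placement, more visible borders
    num_inference_steps=50,     # total sampling steps
    cross_replace_steps=0.6,    # fraction of steps with attention reweighting
    min_area=0.3,               # recommended value for 2 visual texts
    addition=0.2,               # Instance Fusion embedding-addition coefficient
).images[0]
image.save("demo.png")
```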
CVTG-2K is a challenging benchmark dataset comprising 2,000 prompts for complex visual text generation tasks. Generated via OpenAI's o1-mini API with Chain-of-Thought prompting, it features diverse scenes including street views, advertisements, and book covers. The dataset contains longer visual texts (averaging 8.10 words and 39.47 characters) and multiple text regions (2 to 5) per prompt. Half of the dataset incorporates stylistic attributes (size, color, font), enhancing its evaluation capabilities. CVTG-2K provides fine-grained information through decoupled prompts and carrier words that express text-position relationships, making it well suited for advancing research in visual text generation and stylization.
After downloading CVTG-2K.zip and extracting it, you will see two folders:
- CVTG: Contains data without attribute annotations
- CVTG-style: Contains data with attribute annotations
Inside each folder, you will find JSON files named with numbers, such as 2.json (with fine-grained annotations) and 2_combined.json (without fine-grained annotations). The numbers in the filenames indicate the quantity of visual text regions, ranging from 2 to 5. A minimal loading sketch follows the directory tree below.
CVTG-2K/
├── CVTG/ # Data without attribute annotations
│ ├── 2.json
│ ├── 2_combined.json
│ ├── 3.json
│ ├── 3_combined.json
│ ├── 4.json
│ ├── 4_combined.json
│ ├── 5.json
│ └── 5_combined.json
└── CVTG-style/ # Data with attribute annotations
├── 2.json
├── 2_combined.json
├── 3.json
├── 3_combined.json
├── 4.json
├── 4_combined.json
├── 5.json
└── 5_combined.json
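A minimal sketch for loading one split is below. The record schema is defined by the JSON files themselves, so inspect one sample rather than relying on any assumed field names.

```python
import json

# Load the split with two visual-text regions (no attribute annotations).
with open("CVTG-2K/CVTG/2.json", encoding="utf-8") as f:
    samples = json.load(f)

print(len(samples))   # number of prompts in this split
print(samples[0])     # inspect one record to discover its fields
```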
We are grateful for the publicly available code and models of diffusers, Gurobi, FLUX, Stable Diffusion, prompt-to-prompt, and RAG-Diffusion.