How should I prepare my own data before using operators such as image_caption_mapper? #600

Open
3 tasks done
Crazy-JY opened this issue Feb 28, 2025 · 7 comments

Comments

@Crazy-JY

Before Asking

  • I have read the README carefully.

  • I have pulled the latest code of main branch to run again and the problem still existed.

Search before asking

  • I have searched the Data-Juicer issues and found no similar questions.

Question

I only have a few images on hand. How should I turn them into a valid input format? Or can I simply set dataset_path in process.yaml to the folder containing the images, or even to a single image path? I have read the introduction to the DJ data format in fmt_conversion/multimodal/, but I am still not sure how to organize these input images.

Additional

No response

@Crazy-JY Crazy-JY added the question Further information is requested label Feb 28, 2025
@HYLcool
Collaborator

HYLcool commented Feb 28, 2025

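@Crazy-JY, thanks for your interest in and use of Data-Juicer!

In short, each sample in the dataset needs to be organized in the format described here.

In your case, if you only need to run image_caption_mapper on a few existing images, then besides the images you also need a dataset file. Taking the jsonl format as an example, you could create a dataset.jsonl file for these images, in which the sample for each image can simply be prepared as: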

{
  "text": "<__dj__image>",
  "images": ["/path/to/img1"]
}

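Since the initial images have no captions, the text field contains only the image special token as a placeholder, indicating that the sample contains one image. The images field is a list holding the path of the image that belongs to this sample.

This dataset can be generated with a short code snippet like this: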

import os
import jsonlines
from data_juicer.utils.mm_utils import SpecialTokens

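image_dir = 'data'  # directory that contains the images
dataset_file = 'dataset.jsonl'  # path of the dataset file to write

with jsonlines.open(dataset_file, 'w') as writer:
    # sorted() keeps the sample order stable across runs
    for fn in sorted(os.listdir(image_dir)):
        writer.write({
            'text': SpecialTokens.image,  # only the image special token
            'images': [os.path.join(image_dir, fn)],  # image path wrapped in a list
        })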

The generated dataset.jsonl file can then be set as dataset_path in the Data-Juicer config file, and you can start processing with the operators you need.
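Before pointing dataset_path at the file, it may help to sanity-check it. The following is a minimal sketch using only the standard library. It assumes one image token per image path, as in the sample above; check_dataset is a hypothetical helper for illustration, not part of Data-Juicer.

```python
import json
import os

IMAGE_TOKEN = "<__dj__image>"  # image placeholder token shown in the sample above


def check_dataset(dataset_file):
    """Return a list of human-readable problems found in a jsonl dataset.

    Hypothetical helper: it only checks the two fields discussed in this thread.
    """
    problems = []
    with open(dataset_file, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            sample = json.loads(line)
            text = sample.get("text", "")
            images = sample.get("images", [])
            # Assumption: every image token in the text matches one image path.
            n_tokens = text.count(IMAGE_TOKEN)
            if n_tokens != len(images):
                problems.append(
                    f"line {lineno}: {n_tokens} image token(s) "
                    f"but {len(images)} image path(s)"
                )
            # Every listed image path should point to an existing file.
            for path in images:
                if not os.path.isfile(path):
                    problems.append(f"line {lineno}: missing file {path}")
    return problems
```

Calling check_dataset('dataset.jsonl') returns an empty list when every sample passes both checks; otherwise each entry names the offending line.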

Give it a try, and feel free to reach out if you have any other questions.

@HYLcool HYLcool self-assigned this Feb 28, 2025
@Crazy-JY
Author

Thanks a lot! I'll give it a try.

@Crazy-JY
Author

Crazy-JY commented Feb 28, 2025

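Hello! Thank you, that solved the data format problem. However, I ran into a new problem when running the image-caption-mapper operator with a local InternVL2_5-2B model. The error roughly says that neither text nor text_target was specified. The run log and error message are as follows: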

2025-02-28 08:38:07 | INFO | data_juicer.core.executor:52 - Using cache compression method: [None]
2025-02-28 08:38:07 | INFO | data_juicer.core.executor:57 - Setting up data formatter...
2025-02-28 08:38:07 | INFO | data_juicer.core.executor:80 - Preparing exporter...
2025-02-28 08:38:07 | INFO | data_juicer.core.executor:160 - Loading dataset from data formatter...
2025-02-28 08:38:08 | INFO | data_juicer.format.formatter:185 - Unifying the input dataset formats...
2025-02-28 08:38:08 | INFO | data_juicer.format.formatter:200 - There are 1 sample(s) in the original dataset.
num_proc must be <= 1. Reducing num_proc to 1 for dataset of size 1.
WARNING:datasets.arrow_dataset:num_proc must be <= 1. Reducing num_proc to 1 for dataset of size 1.
2025-02-28 08:38:08 | INFO | data_juicer.format.formatter:214 - 1 samples left after filtering empty text.
2025-02-28 08:38:08 | INFO | data_juicer.format.formatter:237 - Converting relative paths in the dataset to their absolute version. (Based on the directory of input dataset file)
num_proc must be <= 1. Reducing num_proc to 1 for dataset of size 1.
WARNING:datasets.arrow_dataset:num_proc must be <= 1. Reducing num_proc to 1 for dataset of size 1.
2025-02-28 08:38:08 | INFO | data_juicer.format.mixture_formatter:137 - sampled 1 from 1
2025-02-28 08:38:08 | INFO | data_juicer.format.mixture_formatter:143 - There are 1 in final dataset
2025-02-28 08:38:08 | INFO | data_juicer.core.executor:166 - Preparing process operators...
2025-02-28 08:38:08 | INFO | data_juicer.core.executor:194 - Processing data...
2025-02-28 08:38:08 | WARNING | data_juicer.utils.process_utils:75 - The required cuda memory:20.0GB might be more than the available cuda memory:18.77734375GB.This Op[image_captioning_mapper] might require more resource to run.
num_proc must be <= 1. Reducing num_proc to 1 for dataset of size 1.
WARNING:datasets.arrow_dataset:num_proc must be <= 1. Reducing num_proc to 1 for dataset of size 1.
image_captioning_mapper_process: 0%| | 0/1 [00:00<?, ? examples/s]INFO:transformers_modules.InternVL2_5-2B.configuration_internvl_chat:vision_select_layer: -1
INFO:transformers_modules.InternVL2_5-2B.configuration_internvl_chat:ps_version: v2
INFO:transformers_modules.InternVL2_5-2B.configuration_internvl_chat:min_dynamic_patch: 1
INFO:transformers_modules.InternVL2_5-2B.configuration_internvl_chat:max_dynamic_patch: 12
2025-02-28 08:38:14 | INFO | logging:968 - Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:transformers_modules.InternVL2_5-2B.configuration_internvl_chat:vision_select_layer: -1
INFO:transformers_modules.InternVL2_5-2B.configuration_internvl_chat:ps_version: v2
INFO:transformers_modules.InternVL2_5-2B.configuration_internvl_chat:min_dynamic_patch: 1
INFO:transformers_modules.InternVL2_5-2B.configuration_internvl_chat:max_dynamic_patch: 12
INFO:transformers_modules.InternVL2_5-2B.configuration_internvl_chat:vision_select_layer: -1
INFO:transformers_modules.InternVL2_5-2B.configuration_internvl_chat:ps_version: v2
INFO:transformers_modules.InternVL2_5-2B.configuration_internvl_chat:min_dynamic_patch: 1
INFO:transformers_modules.InternVL2_5-2B.configuration_internvl_chat:max_dynamic_patch: 12
FlashAttention2 is not installed.
INFO:transformers_modules.InternVL2_5-2B.modeling_internvl_chat:num_image_token: 256
INFO:transformers_modules.InternVL2_5-2B.modeling_internvl_chat:ps_version: v2
Warning: Flash attention is not available, using eager attention instead.
2025-02-28 08:39:18 | ERROR | data_juicer.ops.base_op:67 - An error occurred in image_captioning_mapper when processing samples "{'text': ['<__dj__image>'], 'images': [['/home/dj_test_images/7060.png']]}" -- <class 'ValueError'>: You need to specify either text or text_target.
image_captioning_mapper_process: 100%|##########| 1/1 [01:09<00:00, 69.49s/ examples]
2025-02-28 08:39:18 | INFO | data_juicer.core.data:226 - [1/1] OP [image_captioning_mapper] Done in 69.696s. Left 0 samples.
2025-02-28 08:39:20 | INFO | data_juicer.utils.logger_utils:227 - Processing finished with:
Warnings: 1
Errors: 1
╒═════════════════════════╤══════════════════════╤═════════════════════════════════════════════════════╤═══════════════╕
│ OP/Method │ Error Type │ Error Message │ Error Count │
╞═════════════════════════╪══════════════════════╪═════════════════════════════════════════════════════╪═══════════════╡
│ image_captioning_mapper │ <class 'ValueError'> │ You need to specify either text or text_target. │ 1 │
╘═════════════════════════╧══════════════════════╧═════════════════════════════════════════════════════╧═══════════════╛
Error/Warning details can be found in the log file [/data-juicer/outputs/demo-process/log/export_demo-processed.jsonl_time_20250228083755.txt] and its related log files.
2025-02-28 08:39:20 | INFO | data_juicer.core.executor:206 - All OPs are done in 71.412s.
2025-02-28 08:39:20 | INFO | data_juicer.core.executor:209 - Exporting dataset to disk...
2025-02-28 08:39:20 | INFO | data_juicer.core.exporter:111 - Exporting computed stats into a single file...
2025-02-28 08:39:20 | INFO | data_juicer.core.exporter:146 - Export dataset into a single file...
Creating json from Arrow format: 0ba [00:00, ?ba/s]

@Crazy-JY
Author

Crazy-JY commented Feb 28, 2025

Here is my input data:
{"text":"<__dj__image>", "images":["/home/dj_test_images/7060.png"]}

Here is the config file that was in use when the problem above occurred:

# Process config example for dataset

# global parameters
project_name: 'demo-process'
dataset_path: './demos/data/demo-dataset-image.jsonl'  # path to your dataset directory or file
np: 1  # number of subprocess to process your dataset

export_path: './outputs/demo-process/demo-processed.jsonl'
text_keys: 'text'
image_key: 'images'
image_special_token: '<__dj__image>'

# process schedule
# a list of several process operators with their arguments
process:
  - image_captioning_mapper:                             
      hf_img2seq: '/home/InternVL2/InternVL2_5-2B'             
      caption_num: 1                               
      keep_candidate_mode: 'random_any'         
      keep_original_sample: true                            
      prompt: "describe the image"                                        
      prompt_key: null                                       
      mem_required: '16GB'
      trust_remote_code: true

@Crazy-JY
Author

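In addition, with the input unchanged, I often see the following situation.

There are no errors or warnings, but there is also no output in ./outputs/demo-process/demo-processed.jsonl. I am not sure whether the output format or output path is unset or set incorrectly. The log is as follows: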
2025-02-28 09:31:03 | INFO | data_juicer.core.executor:52 - Using cache compression method: [None]
2025-02-28 09:31:03 | INFO | data_juicer.core.executor:57 - Setting up data formatter...
2025-02-28 09:31:03 | INFO | data_juicer.core.executor:80 - Preparing exporter...
2025-02-28 09:31:03 | INFO | data_juicer.core.executor:160 - Loading dataset from data formatter...
2025-02-28 09:31:04 | INFO | data_juicer.format.formatter:185 - Unifying the input dataset formats...
2025-02-28 09:31:04 | INFO | data_juicer.format.formatter:200 - There are 1 sample(s) in the original dataset.
2025-02-28 09:31:04 | INFO | data_juicer.format.formatter:214 - 1 samples left after filtering empty text.
2025-02-28 09:31:04 | INFO | data_juicer.format.formatter:237 - Converting relative paths in the dataset to their absolute version. (Based on the directory of input dataset file)
2025-02-28 09:31:04 | INFO | data_juicer.format.mixture_formatter:137 - sampled 1 from 1
2025-02-28 09:31:04 | INFO | data_juicer.format.mixture_formatter:143 - There are 1 in final dataset
2025-02-28 09:31:04 | INFO | data_juicer.core.executor:166 - Preparing process operators...
2025-02-28 09:31:04 | INFO | data_juicer.core.executor:194 - Processing data...
2025-02-28 09:31:05 | INFO | data_juicer.core.data:226 - [1/1] OP [image_captioning_mapper] Done in 0.817s. Left 0 samples.
2025-02-28 09:31:07 | INFO | data_juicer.utils.logger_utils:227 - Processing finished with:
Warnings: 0
Errors: 0

Error/Warning details can be found in the log file [/data-juicer/outputs/demo-process/log/export_demo-processed.jsonl_time_20250228093051.txt] and its related log files.
2025-02-28 09:31:07 | INFO | data_juicer.core.executor:206 - All OPs are done in 2.469s.
2025-02-28 09:31:07 | INFO | data_juicer.core.executor:209 - Exporting dataset to disk...
2025-02-28 09:31:07 | INFO | data_juicer.core.exporter:111 - Exporting computed stats into a single file...
2025-02-28 09:31:07 | INFO | data_juicer.core.exporter:146 - Export dataset into a single file...
Creating json from Arrow format: 0ba [00:00, ?ba/s]

@HYLcool
Collaborator

HYLcool commented Feb 28, 2025

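By default, the image_captioning_mapper operator supports models like BLIP-2. VLMs such as the InternVL2_5-2B you are using have their own tokenization and their own generate or chat interfaces, so they do not fit this operator's implementation well. I suggest implementing a new operator based on this operator's implementation and the InternVL2 usage examples.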
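As a rough illustration of the per-sample logic such a new operator would need, here is a minimal sketch in plain Python. It is not the Data-Juicer operator API and not the InternVL2 interface: caption_sample, caption_fn and fake_caption_fn are hypothetical names, and the output text layout (image token followed by the caption) is an assumption for illustration only. The real model call would go inside caption_fn, written against the model's own chat or generate interface.

```python
import copy

IMAGE_TOKEN = "<__dj__image>"  # image placeholder token used in this thread


def caption_sample(sample, caption_fn, prompt="describe the image"):
    """Return a new sample whose text pairs each image token with a caption.

    caption_fn(image_path, prompt) -> str is a HYPOTHETICAL callable that you
    would implement around your VLM's own chat/generate interface.
    """
    new_sample = copy.deepcopy(sample)  # leave the original sample untouched
    captions = [caption_fn(path, prompt) for path in sample["images"]]
    # Assumed layout: one "<token> caption" chunk per image, joined by spaces.
    new_sample["text"] = " ".join(
        f"{IMAGE_TOKEN} {caption.strip()}" for caption in captions
    )
    return new_sample


def fake_caption_fn(image_path, prompt):
    # Stand-in for a real model call, used only to show the data flow.
    return f"a caption for {image_path}"
```

With the sample from earlier in the thread, caption_sample(sample, fake_caption_fn) leaves images unchanged and fills text with the token plus the placeholder caption. Wrapping this logic in an actual operator class would follow the existing image_captioning_mapper implementation.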

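The later runs produced no output because they reused the cache from the first, failed run. While testing, you can set use_cache: false in the config file to turn the cache off, and turn it back on for large-scale data processing.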
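To tell an empty export apart from a misconfigured path after each test run, a quick check of the export file can help. This is a minimal sketch using only the standard library; count_exported_samples is a hypothetical helper, and the path you pass in is whatever export_path is set to in your config.

```python
import os


def count_exported_samples(export_path):
    """Count non-empty lines in an exported jsonl file; 0 if the file is missing."""
    if not os.path.isfile(export_path):
        return 0
    with open(export_path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())
```

A result of 0 matches the "Left 0 samples" line in the logs above, which means the operator dropped every sample rather than the exporter writing to the wrong place.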

@Crazy-JY
Author

OK, got it. Thanks a lot!
