Refa: Optimize pptx shape extraction to reduce content loss #6703

Merged: 2 commits into infiniflow:main on Apr 22, 2025

zhudongwork
Contributor

What problem does this PR solve?

When parsing pptx files, accessing the shape_type attribute of some shapes raises an exception (python-pptx raises NotImplementedError for unrecognized shape types). The original code did not handle this case, so content extraction failed. This change adds handling for such anomalous shapes, making extraction safer and more robust.
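As a rough illustration of the kind of guard described above, the sketch below reads `shape_type` defensively and falls back to the shape's text frame. It is not the code from this PR: `UnrecognizedShape`, `TextBox`, `safe_shape_type`, and `extract_text` are hypothetical names, and the stand-in classes only mimic the behaviour visible in the traceback later in this thread so that the example runs without python-pptx installed.

```python
from typing import Optional


class UnrecognizedShape:
    """Hypothetical stand-in for a shape whose type cannot be determined."""

    has_text_frame = True
    text = "text inside an unrecognized shape"

    @property
    def shape_type(self):
        # Same exception and message as in the traceback in this thread.
        raise NotImplementedError("Shape instance of unrecognized shape type")


class TextBox:
    """Hypothetical stand-in for an ordinary shape with a known type."""

    has_text_frame = True
    text = "ordinary text"
    shape_type = 17  # arbitrary example value for this sketch


def safe_shape_type(shape) -> Optional[int]:
    """Return shape.shape_type, or None when it cannot be read."""
    try:
        return shape.shape_type
    except (NotImplementedError, AttributeError):
        return None


def extract_text(shape) -> str:
    """Extract text without letting an unknown shape type abort the run."""
    shape_type = safe_shape_type(shape)
    if shape_type == 19:
        # 19 is the value compared in the traceback; the special handling
        # for that shape type is omitted from this sketch.
        return ""
    if getattr(shape, "has_text_frame", False):
        return shape.text
    return ""


for shape in (TextBox(), UnrecognizedShape()):
    print(repr(extract_text(shape)))
```

With a guard like this, a shape of unrecognized type still contributes its text instead of aborting extraction for the whole slide.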

Type of change

  • [ ] Bug Fix (non-breaking change which fixes an issue)
  • [ ] New Feature (non-breaking change which adds functionality)
  • [ ] Documentation Update
  • [x] Refactoring
  • [x] Performance Improvement
  • [ ] Other (please describe):

@dosubot (bot) added the size:M and ☯️ refactor labels Apr 1, 2025
@BadwomanCraZY
Contributor

@zhudongwork Thanks for raising this issue. Could you please upload a pptx file that reproduces the problem, so we can quickly locate the bug?

@zhudongwork
Contributor Author

demo.pptx

@BadwomanCraZY
Contributor

BadwomanCraZY commented Apr 1, 2025

Thank you for submitting the code. Your change resolves the exception we had before, but the text-parsing results of your version are not as good as those of our original code. We hope you can revise your code further.
The log output has changed from

`Traceback (most recent call last):
  File "/ragflow/deepdoc/parser/ppt_parser.py", line 73, in __call__
    txt = self.__extract(shape)
  File "/ragflow/deepdoc/parser/ppt_parser.py", line 34, in __extract
    if shape.shape_type == 19:
  File "/ragflow/.venv/lib/python3.10/site-packages/pptx/shapes/autoshape.py", line 325, in shape_type
    raise NotImplementedError("Shape instance of unrecognized shape type")
NotImplementedError: Shape instance of unrecognized shape type`

to

`2025-04-01 17:58:59,015 INFO     27 HTTP Request: POST https://open.bigmodel.cn/api/paas/v4/embeddings "HTTP/1.1 200 OK"
2025-04-01 17:58:59,076 INFO     27 HEAD http://es01:9200/ragflow_748ba2da0edc11f0b42b726aca92dd24 [status:200 duration:0.006s]
2025-04-01 17:58:59,341 INFO     27 From minio(0.26492000406142324) demo.pptx/demo.pptx
2025-04-01 17:58:59,652 ERROR    27 Error processing shape: 'Ppt' object has no attribute 'get_bulleted_text'
2025-04-01 17:58:59,653 ERROR    27 Error processing shape: 'Ppt' object has no attribute 'get_bulleted_text'
2025-04-01 17:58:59,653 ERROR    27 Error processing shape: 'Ppt' object has no attribute 'get_bulleted_text'
2025-04-01 17:58:59,665 INFO     27 set_progress(ee4414a60edf11f096d5726aca92dd24), progress: 0.5, progress_msg: 17:58:59 Page(1~100000001): Text extraction finished.
2025-04-01 17:58:59,784 INFO     27 set_progress(ee4414a60edf11f096d5726aca92dd24), progress: 0.9, progress_msg: 17:58:59 Page(1~100000001): Image extraction finished`

nightly
yours

@zhudongwork
Contributor Author

The function name has been corrected (an underscore was missing), and the parser now runs properly.
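The diff itself is not shown in this thread, so the following is only a sketch of how a missing underscore could produce the `'Ppt' object has no attribute 'get_bulleted_text'` errors in the log above, and why they cost content rather than crashing the run. The `Ppt` class here is a heavily simplified, hypothetical stand-in; `_get_bulleted_text`, `extract_broken`, `extract_fixed`, and `parse` are assumed names, and the per-shape `try/except` is inferred from the "Error processing shape" log lines.

```python
import logging

logger = logging.getLogger("ppt_sketch")


class Ppt:
    """Hypothetical, heavily simplified stand-in for the parser class."""

    def _get_bulleted_text(self, paragraph: str) -> str:
        # Helper defined WITH a leading underscore.
        return "- " + paragraph

    def extract_broken(self, paragraph: str) -> str:
        # Bug: the call omits the underscore, so attribute lookup fails.
        return self.get_bulleted_text(paragraph)

    def extract_fixed(self, paragraph: str) -> str:
        # Fix: the call uses the same name as the definition.
        return self._get_bulleted_text(paragraph)


def parse(extract, paragraphs):
    """Run extract on each item, logging and skipping any that fail."""
    texts = []
    for paragraph in paragraphs:
        try:
            texts.append(extract(paragraph))
        except Exception as exc:
            # The failure is logged and the item skipped, so the run still
            # "finishes" while this item's text is silently lost.
            logger.error("Error processing shape: %s", exc)
    return texts


ppt = Ppt()
broken = parse(ppt.extract_broken, ["a", "b", "c"])
fixed = parse(ppt.extract_fixed, ["a", "b", "c"])
print(broken)
print(fixed)
```

In this sketch the broken variant logs one error per item and returns nothing, which matches the pattern of repeated ERROR lines followed by "Text extraction finished" in the log.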

@yingfeng added the ci (Continue Integration) label Apr 2, 2025
@KevinHuSh requested a review from asiroliu April 17, 2025 10:41
@asiroliu
Contributor

@zhudongwork @KevinHuSh
Successfully tested the latest version with the following results:

  • File chunking functionality working as expected
  • No errors detected on backend services

@KevinHuSh merged commit 10432a1 into infiniflow:main Apr 22, 2025
3 checks passed
yongtenglei pushed a commit to yongtenglei/ragflow that referenced this pull request Apr 22, 2025
…ow#6703)

### What problem does this PR solve?

When parsing pptx files, some shapes do not contain the `shape_type`
attribute, which causes the original code to throw an exception during
extraction, leading to failure in content extraction. This optimization
introduces handling logic for such anomalous shapes, providing a safer
and more robust processing mechanism.

### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [x] Refactoring
- [x] Performance Improvement
- [ ] Other (please describe):