Skip to content

Commit

Permalink
Updated README.md
Browse files Browse the repository at this point in the history
  • Loading branch information
mykolamelnykml committed Jan 29, 2025
1 parent cac4f72 commit b078cb3
Showing 1 changed file with 23 additions and 13 deletions.
36 changes: 23 additions & 13 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@
</p>

<p align="center">
<i>An Open-Source Library for Processing Documents in Apache Spark.</i>
<i>An Open-Source Library for Processing Documents using AI/ML in Apache Spark.</i>
</p>

<p align="center">
Expand All @@ -29,37 +29,47 @@

# Welcome to the ScaleDP library

ScaleDP is library allows you to process documents using Apache Spark. Discover pre-trained models for your projects or play with the thousands of machine learning apps hosted on the [Hugging Face Hub](https://huggingface.co/).
ScaleDP is library allows you to process documents using AI/ML capabilities and scale it using Apache Spark.

**LLM** (Large Language Models) and **VLM** (Vision Language Models) models are used to extract data from text and images in combination with **OCR** engines.

Discover pre-trained models for your projects or play with the thousands of models hosted on the [Hugging Face Hub](https://huggingface.co/).

## Key features

### Document processing:
- Load PDF documents/Images to the Spark DataFrame
- Extract text from PDF documents/Images
- Extract images from PDF documents
- Create document processing pipelines
- Extract **structured data** from text/images using LLM and ML models

### OCR:
- OCR Images/PDF documents using various OCR engines
- OCR Images/PDF documents using Vision LLM models

Support various open-source OCR engines:

- [Tesseract OCR](https://github.com/tesseract-ocr/tesseract)
- [Easy OCR](https://github.com/JaidedAI/EasyOCR)
- [Surya OCR](https://github.com/VikParuchuri/surya)
- [DocTR](https://github.com/mindee/doctr)

### CV:
- Object detection on images
- Object detection on images using YOLO models
- Text detection on images

### NLP and LLM:

### LLM:

Support OpenAI compatible API for call LLM/VLM models (GPT, Gemini, GROQ, etc.)

- OCR Images/PDF documents using Vision LLM models
- Extract data from the image using Vision LLM models
- Extract data from the text/images using LLM models
- Extract data from using DSPy framework
- Extract data using DSPy framework
- Extract data from the text/images using NLP models from the Hugging Face Hub
- Visualize results

Support various open-source OCR engines:

- [Tesseract OCR](https://github.com/tesseract-ocr/tesseract)
- [Easy OCR](https://github.com/JaidedAI/EasyOCR)
- [Surya OCR](https://github.com/VikParuchuri/surya)
- [DocTR](https://github.com/mindee/doctr)


## Installation

Expand Down

0 comments on commit b078cb3

Please sign in to comment.