This project is an API written in Python with FastAPI. It returns text extracted from various sources so that the text can be loaded into the knowledge base of LLM applications that use retrieval-augmented generation (RAG).
Currently, the API can:
- Extract and translate subtitles from YouTube videos.
- Transcribe audio files.
- Extract text from web pages.
The API is designed to be efficient and modular, facilitating integrations with other systems and allowing scalability as needed.
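To give a feel for the kind of output the web page feature aims at, here is a minimal sketch of visible-text extraction using only the Python standard library. It is an illustration under stated assumptions, not the project's actual implementation: the names `VisibleTextParser` and `extract_text` are hypothetical, and the real API may rely on a different library or approach.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>.

    Hypothetical helper for illustration; not taken from the project code.
    """

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Collapse runs of whitespace so the output is a clean single line.
        cleaned = " ".join(data.split())
        if self._skip_depth == 0 and cleaned:
            self._chunks.append(cleaned)

    def text(self):
        return " ".join(self._chunks)


def extract_text(html: str) -> str:
    """Return the visible text of an HTML document as one string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling `extract_text` on the HTML of a downloaded page yields plain text that could then be split into chunks and indexed for retrieval.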
Follow the steps below to set up and run the API on your local machine.
Before starting, make sure you have the following installed:
- Python 3.x
- Pip (Python package manager)
- Clone the repository
git clone https://github.com/daniel-trindade/corpusAPI.git
cd corpusAPI
- Create a virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate # Linux/Mac
venv\Scripts\activate # Windows
- Install dependencies
pip install -r requirements.txt
- Run the API
fastapi dev app/main.py
The API will be running at http://127.0.0.1:8000 (or another specified port).
You can view the interactive documentation while the API is running:
Swagger UI (generated automatically by FastAPI and served at http://127.0.0.1:8000/docs by default)
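FastAPI also serves a machine-readable OpenAPI schema at `/openapi.json` by default, which is handy for checking which endpoints a running instance exposes. The sketch below is a hedged example rather than part of the project: `list_routes` and `fetch_schema` are hypothetical helper names, and the base URL assumes the default host and port mentioned above.

```python
import json
from urllib.request import urlopen

# Keys of an OpenAPI path item that denote HTTP operations.
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}


def list_routes(schema: dict) -> list:
    """Return 'METHOD /path' strings for every operation in an OpenAPI schema."""
    routes = []
    for path, path_item in sorted(schema.get("paths", {}).items()):
        for key in sorted(path_item):
            # Path items can also hold non-operation keys such as "parameters".
            if key in HTTP_METHODS:
                routes.append(f"{key.upper()} {path}")
    return routes


def fetch_schema(base_url: str = "http://127.0.0.1:8000") -> dict:
    """Download the OpenAPI schema from a running instance (assumed default URL)."""
    with urlopen(f"{base_url}/openapi.json", timeout=5) as response:
        return json.load(response)
```

With the API running locally, `print("\n".join(list_routes(fetch_schema())))` prints one line per endpoint and method.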
If you wish to contribute to the project, feel free to open issues or pull requests!
This project is licensed under the MIT License.