Real-time feature pipeline that:
- ingests financial news from Alpaca
- transforms the news documents into embeddings in real-time using Bytewax
- stores the embeddings into the Qdrant Vector DB
The best way to ingest real-time knowledge into an LLM without frequently retraining it is retrieval-augmented generation (RAG).
To implement RAG at inference time, you need a vector DB always synced with the latest available data.
The role of this streaming pipeline is to listen 24/7 to available financial news from Alpaca, process the news in real-time using Bytewax, and store the news in the Qdrant Vector DB to make the information available for RAG.
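To make the transform step concrete, here is a minimal sketch of the cleaning and chunking that typically happens before a news document is embedded. The function names, window size, and overlap are illustrative assumptions, not the repository's actual code.

```python
def clean(text: str) -> str:
    """Normalize whitespace in a raw news document."""
    return " ".join(text.split())


def chunk(text: str, max_words: int = 128, overlap: int = 16) -> list[str]:
    """Split a document into overlapping word windows so each chunk
    fits the embedding model's input limit."""
    words = text.split()
    chunks = []
    step = max_words - overlap
    for start in range(0, max(len(words), 1), step):
        piece = words[start : start + max_words]
        if piece:
            chunks.append(" ".join(piece))
        if start + max_words >= len(words):
            break
    return chunks


# Toy document standing in for an Alpaca news article.
doc = "Tesla shares rose after the company reported record deliveries. " * 40
chunks = chunk(clean(doc))
```

Each chunk would then be embedded and upserted into Qdrant with the original text attached as payload, so it can be retrieved verbatim at inference time.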
Main dependencies you have to install yourself:
- Python 3.10
- Poetry 1.5.1
- GNU Make 4.3
- AWS CLI 2.11.22
Installing all the other dependencies is as easy as running:
make install
When developing run:
make install_dev
Prepare credentials:
cp .env.example .env
Then complete the .env file with your credentials. Below we show how to generate the credentials for Alpaca and Qdrant.
All you have to do for Alpaca is create a FREE account and generate the ALPACA_API_KEY and ALPACA_API_SECRET API keys. Afterward, be sure to add them to your .env file.
-> Check out this document for step-by-step instructions.
As with Alpaca, you must create a FREE account in Qdrant and generate the QDRANT_API_KEY and QDRANT_URL values. Afterward, be sure to add them to your .env file.
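Once both accounts are set up, your .env file should contain entries along these lines. The variable names are the ones listed above; the values are placeholders for your own credentials.

```shell
# .env — replace the placeholders with your own credentials
ALPACA_API_KEY=<your Alpaca API key>
ALPACA_API_SECRET=<your Alpaca API secret>
QDRANT_API_KEY=<your Qdrant API key>
QDRANT_URL=<your Qdrant cluster URL>
```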
-> Check out this document to see how.
Optional: only needed if you want to deploy the streaming pipeline to AWS.
First, install AWS CLI 2.11.22.
Second, configure the credentials of your AWS CLI.
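Configuring credentials is done with the standard `aws configure` command, which prompts for the values interactively and writes them to `~/.aws/credentials` and `~/.aws/config` (the region shown is just an example):

```shell
aws configure
# AWS Access Key ID [None]: <your access key id>
# AWS Secret Access Key [None]: <your secret access key>
# Default region name [None]: eu-central-1
# Default output format [None]: json
```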
Run the production streaming pipeline in real-time mode:
make run_real_time
To populate the vector DB with historical data, run the streaming pipeline in batch mode:
make run_batch
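The two modes differ only in the news source: real-time mode listens to Alpaca's live stream, while batch mode replays a historical range. A hypothetical command-line dispatch mirroring the two Makefile targets (the argument names are assumptions, not the repository's actual interface):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical CLI: one flag selects live streaming vs. historical replay.
    parser = argparse.ArgumentParser(description="Run the news feature pipeline")
    parser.add_argument(
        "--mode",
        choices=["real_time", "batch"],
        default="real_time",
        help="listen to the live stream or replay a historical range",
    )
    parser.add_argument("--from_date", help="start of the historical range (batch mode)")
    parser.add_argument("--to_date", help="end of the historical range (batch mode)")
    return parser


args = build_parser().parse_args(
    ["--mode", "batch", "--from_date", "2023-01-01", "--to_date", "2023-06-30"]
)
```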
Run the streaming pipeline in real-time and development modes:
make run_real_time_dev
Run the streaming pipeline in batch and development modes:
make run_batch_dev
Run a query in your vector DB:
make search PARAMS='--query_string "Should I invest in Tesla?"'
You can replace the --query_string value with any question you want.
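Under the hood, a search like this embeds the query string and returns the stored news chunks closest to it by similarity. Here is a self-contained illustration of that ranking step using toy vectors; the real pipeline uses a learned embedding model and Qdrant's nearest-neighbor index, not this stub.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" standing in for stored news chunks.
stored = {
    "Tesla reports record quarterly deliveries": [0.9, 0.1, 0.0],
    "Fed signals possible rate cut": [0.1, 0.9, 0.2],
    "Oil prices dip on supply news": [0.0, 0.2, 0.9],
}

# Stand-in for the embedding of "Should I invest in Tesla?".
query_vec = [0.85, 0.15, 0.05]

# Rank stored chunks by similarity to the query, highest first.
ranked = sorted(stored.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
top_text = ranked[0][0]
```

The retrieved chunks would then be injected into the LLM prompt as context, which is the RAG step described at the top of this README.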
Build the Docker image:
make build
Run the streaming pipeline in real-time mode inside the Docker image:
source .env && make run_docker
First, be sure that your AWS CLI credentials are configured.
Then run the following to deploy the streaming pipeline to an AWS EC2 machine:
make deploy_aws
NOTE: You can log in to the AWS console, go to the EC2 section, and see your machine running.
To check the state of the deployment, run:
make info_aws
To remove the EC2 machine, run:
make undeploy_aws
Check the code for linting issues:
make lint_check
Fix the code for linting issues (note that some issues can't automatically be fixed, so you might need to solve them manually):
make lint_fix
Check the code for formatting issues:
make format_check
Fix the code for formatting issues:
make format_fix