The Datura scraper collects tweets and web results by sending requests to an external API and saving the responses. It processes data in chunks and uses asynchronous operations for efficient data handling.
- Asynchronous HTTP requests using `httpx` and `asyncio`.
- JSON data loading and processing.
- Error handling and logging.
- Results are saved in JSONL format.
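The request/response loop follows a common pattern: read questions, send them to the API in concurrent batches, and append each response as a line of JSONL. Below is a minimal sketch of that pattern; the endpoint URL, payload shape, and function names are assumptions for illustration, not the scraper's actual code.

```python
import asyncio
import json

import httpx

# Illustrative values; the real scraper uses its own API URL and the
# BATCH_SIZE constant described under Configuration.
BATCH_SIZE = 10
API_URL = "https://example.com/api/analyze"  # placeholder endpoint


async def fetch(client: httpx.AsyncClient, question: str) -> dict:
    # Send one question to the API and return the parsed JSON response.
    response = await client.post(API_URL, json={"question": question})
    response.raise_for_status()
    return response.json()


async def run(questions: list[str], out_path: str) -> None:
    async with httpx.AsyncClient(timeout=60.0) as client:
        with open(out_path, "a", encoding="utf-8") as out:
            # Process the questions in fixed-size batches to bound concurrency.
            for i in range(0, len(questions), BATCH_SIZE):
                batch = questions[i : i + BATCH_SIZE]
                results = await asyncio.gather(
                    *(fetch(client, q) for q in batch), return_exceptions=True
                )
                for result in results:
                    if isinstance(result, Exception):
                        continue  # the real scraper logs these errors
                    out.write(json.dumps(result) + "\n")  # one object per line (JSONL)


if __name__ == "__main__":
    asyncio.run(run(["example question"], "out.jsonl"))
```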
- Python 3.10+
- Clone the repository:

  ```bash
  git clone https://github.com/Datura-ai/meta-benchmark.git
  cd meta-benchmark
  ```
- Create and activate a virtual environment:

  ```bash
  python3 -m venv venv
  source venv/bin/activate
  ```
- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```

- Set your validator access key:

  ```bash
  export VALIDATOR_ACCESS_KEY="<your_validator_access_key>"
  ```
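The scraper presumably reads this key from the environment at runtime. A minimal sketch of that lookup (the actual handling in `datura.py` may differ):

```python
import os

# Fail fast if the key is missing (illustrative; the real check may differ).
ACCESS_KEY = os.environ.get("VALIDATOR_ACCESS_KEY")
if not ACCESS_KEY:
    raise RuntimeError("VALIDATOR_ACCESS_KEY is not set")
```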
- Prepare your dataset in the `dataset/data.jsonl` file. Each line should be a valid JSON object containing a `question` field.
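  For example (the question text below is illustrative):

  ```json
  {"question": "What are the latest developments in quantum computing?"}
  ```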
- Run the application:

  ```bash
  cd scrapper/datura
  python3 datura.py
  ```
- The results will be saved in the `results` directory, with a separate file for each execution time.
- The API URL and data path can be configured in the `TweetAnalyzerScrapper` class constructor.
- The batch size (i.e., the number of concurrent requests) can be configured via the `BATCH_SIZE` constant.
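A minimal sketch of these settings; the constructor parameter names, default URL, and dataclass form below are assumptions for illustration, not the actual `TweetAnalyzerScrapper` signature:

```python
from dataclasses import dataclass

# Illustrative batch size; the real constant lives in the scraper module.
BATCH_SIZE = 20  # number of concurrent requests per batch


@dataclass
class TweetAnalyzerScrapper:
    # Hypothetical parameter names; check scrapper/datura/datura.py
    # for the real constructor.
    api_url: str = "https://example.com/api/analyze"  # placeholder endpoint
    data_path: str = "dataset/data.jsonl"             # path from the usage steps


scrapper = TweetAnalyzerScrapper()
print(scrapper.api_url, scrapper.data_path)
```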
- The application uses Python's built-in logging module to log information and errors. Logs are printed to the console.
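For reference, a typical console-logging setup of the kind described (illustrative; the scraper's actual configuration may differ):

```python
import logging

# Log INFO and above to the console with timestamps.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("datura")
logger.info("Starting scrape run")
```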
Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.
This project is licensed under the MIT License. See the LICENSE file for details.
For any questions or issues, please open an issue on the GitHub repository.