A powerful CLI tool for evaluating AI model performance through systematic experiments. Run standardized evaluations of language models and AI agents, collect metrics, and analyze results using Arize's analytics platform.
Current Status: Beta
- Active development with regular updates
- Core features stable and production-ready
- Actively seeking community feedback and contributions
Roadmap:
- Support for additional LLM providers beyond OpenAI and Anthropic
- Enhanced metric collection and visualization
- Custom evaluation pipeline builder
- Support for pluggable evaluators and tasks
- Python 3.10.13 (other versions, including Python 3.11+, are not currently supported)
- Git
- OpenAI API key (for OpenAI-based tasks)
- Anthropic API key (for Claude-based tasks)
- Arize account and API credentials
git clone https://github.com/Arize-ai/arize-experiment.git
cd arize-experiment
# Install pyenv
brew install pyenv
# Configure shell for pyenv (add to ~/.zshrc or ~/.bashrc)
export PYENV_ROOT="$HOME/.pyenv"
command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init -)"
# Install required Python version
pyenv install 3.10.13
pyenv local 3.10.13
# Create virtual environment
python -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Install the arize-experiment package in editable mode
pip install -e .
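To confirm the editable install worked, a quick sanity check (this assumes the console script is named arize-experiment, as used in the examples below; click-based CLIs provide --help automatically):
# Verify the Python version and that the CLI entry point is available
python --version
arize-experiment --help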
Create your environment file:
cp .env.example .env
Required environment variables:
# Arize API Credentials
# Required for all tasks/evaluators
# Obtain these from https://app.arize.com/settings/api
ARIZE_API_KEY=your_arize_api_key
ARIZE_SPACE_KEY=your_arize_space_key
# OpenAI API Configuration
# Required for OpenAI-based tasks/evaluators
OPENAI_API_KEY=your_openai_api_key
# Chatbot Server Configuration
# Required for call_chatbot_server task
CHATBOT_SERVER_URL=http://localhost:8080
To obtain Arize credentials:
- Sign up at Arize AI Platform
- Navigate to Settings → API Keys
- Create a new API key and copy both the API key and space key
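At runtime these values are read from the .env file. As a rough illustration (not the tool's actual loading code), python-dotenv, which is listed in the dependencies, exposes them like this:
# Illustrative only: reading the .env values with python-dotenv
import os
from dotenv import load_dotenv

load_dotenv()  # loads .env from the current working directory
arize_api_key = os.environ["ARIZE_API_KEY"]
arize_space_key = os.environ["ARIZE_SPACE_KEY"]
openai_api_key = os.getenv("OPENAI_API_KEY")  # only needed for OpenAI-based tasks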
- Standardized Evaluation Framework: Run systematic evaluations of AI models with consistent metrics
- Comprehensive Analytics: Track and compare performance across experiments
- Flexible Task System: Support for multiple evaluation tasks and metrics
- Arize Integration: Automatic upload of results to Arize's analytics platform
Currently supported capabilities:
- Tasks
  - Classify Sentiment: Classify input text as positive, negative, or neutral
  - Call Chatbot Server: Make an API request to an instance of a chatbot server
  - Delegate: Make an API request to another service that will handle the task
  - Echo: Return the input as it was received
- Evaluators
  - Chatbot Response is Acceptable: Measure the quality of the response from the chatbot server
  - Sentiment Classification is Accurate: Measure the accuracy of the sentiment classifier task output
Each task type requires a specific dataset format. Here are the requirements for each:
Classify Sentiment dataset columns:
- input: The text content to analyze
Call Chatbot Server dataset columns:
- input: A JSON object containing the conversation history between the user and the chatbot
Example input value for Call Chatbot Server:
[
{
"role": "user",
"content": "Hello"
},
{
"role": "assistant",
"content": "Hi there!"
},
{
"role": "user",
"content": "What's the weather?"
}
]
- Create a CSV file with your dataset
- Use the arize-experiment create-dataset command to upload your dataset to Arize:
arize-experiment create-dataset \
--name <dataset-name> \
--path-to-csv <path-to-csv-file>
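For reference, a classify_sentiment dataset CSV only needs the input column described above; these rows are purely illustrative:
input
"The onboarding flow was smooth and intuitive."
"The app crashes every time I open settings."
"The order arrived on Tuesday."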
arize-experiment run \
--name <experiment-name> \
--dataset <dataset-name> \
--task <task-name> \
--evaluator <evaluator-name>
Analyzes text sentiment using LLMs. Classifies text as positive, negative, or neutral.
# Example: Evaluate sentiment classification
arize-experiment run \
--name my-experiment \
--dataset my-dataset \
--task classify_sentiment \
--evaluator sentiment_classification_is_accurate
Calls a chatbot server by making HTTP requests to a specified endpoint.
# Example: Evaluate chatbot responses
arize-experiment run \
--name my-experiment \
--dataset my-dataset \
--task call_chatbot_server \
--evaluator chatbot_response_is_acceptable
arize_experiment/
├── cli/ # Command-line interface
│ ├── main.py # CLI entry point
│ └── handler.py # Command handlers
├── core/ # Core functionality
│ ├── task.py # Base task class
│ ├── evaluator.py # Base evaluator class
│ └── metrics.py # Metric collection
├── evaluators/ # Evaluation implementations
├── tasks/ # Task implementations
└── __main__.py # Package entry point
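Because the package includes a __main__.py entry point, the CLI can presumably also be invoked as a module from the activated environment (equivalent to the arize-experiment console script used above):
# Module invocation (equivalent to the console script)
python -m arize_experiment run \
--name my-experiment \
--dataset my-dataset \
--task classify_sentiment \
--evaluator sentiment_classification_is_accurate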
The project relies on several key packages:
- Core Dependencies
  - arize[Datasets]>=7.0.0: Arize AI platform integration
  - openai>=1.0.0: OpenAI API client
  - anthropic>=0.19.1: Anthropic API client
  - pandas>=2.0.0: Data manipulation and analysis
  - python-dotenv>=1.0.0: Environment variable management
  - click>=8.0.0: CLI framework
  - urllib3>=2.0.0: HTTP client library
- Development Dependencies
  - pytest>=7.0.0: Testing framework
  - flake8>=7.0.0: Code linting
  - black>=24.1.0: Code formatting
  - mypy>=1.8.0: Static type checking
  - pandas-stubs>=2.2.0: Type stubs for pandas
- Create a new file in the arize_experiment/tasks/ directory:
# arize_experiment/tasks/my_custom_task.py
from arize_experiment.core.task import Task, TaskResult
from arize_experiment.core.schema import DatasetSchema, ColumnSchema, DataType
from arize_experiment.core.task_registry import TaskRegistry
from typing import Dict, Any
@TaskRegistry.register("my_custom_task") # Register the task with the framework
class MyCustomTask(Task):
"""A custom task implementation that processes input text.
This task demonstrates the basic structure required for creating new tasks
in the arize-experiment framework. It inherits from the base Task class
and implements all required abstract methods.
"""
def __init__(self, config: Dict[str, Any]) -> None:
"""Initialize the task with configuration parameters.
Args:
config: Dictionary containing task-specific configuration parameters
that will be used during execution.
"""
super().__init__() # Required call to parent class initializer
self.config = config
@property
def name(self) -> str:
"""Define a unique identifier for this task.
Returns:
str: A lowercase string with underscores that uniquely identifies
this task type in the framework.
"""
return "my_custom_task"
@property
def required_schema(self) -> DatasetSchema:
"""Define the expected structure of input data for this task.
This schema is used to validate input data before execution.
In this example, we require a single 'input' column of type string.
Returns:
DatasetSchema: Schema object describing required input format
"""
return DatasetSchema(
columns={
"input": ColumnSchema(
name="input",
types=[DataType.STRING], # Accepts string data only
required=True # This field must be present
)
}
)
def execute(self, dataset_row: Dict[str, Any]) -> TaskResult:
"""Execute the task's core logic on a single input row.
This method contains the main processing logic for the task.
It handles errors gracefully and returns results in a standardized format.
Args:
dataset_row: A dictionary containing the input data matching the
required_schema structure.
Returns:
TaskResult: Object containing:
- dataset_row: The original input data
- output: The processed result (or None if error)
- metadata: Additional execution information
- error: Error message if processing failed
"""
try:
# Process the input using a helper method (not shown)
result = self._process_input(dataset_row["input"])
# Return successful result with metadata
return TaskResult(
dataset_row=dataset_row,
output=result,
metadata={"config": self.config} # Include config for tracking
)
except Exception as e:
# Return error result while preserving the input
return TaskResult(
dataset_row=dataset_row,
output=None, # No output on error
error=str(e) # Convert exception to string message
)
# Alternative registration method:
# TaskRegistry.register("my_custom_task", MyCustomTask)
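The example above calls a _process_input helper that is intentionally not shown. To make the class runnable end to end, a trivial placeholder (purely hypothetical, not part of the framework) could be added to MyCustomTask:
# Hypothetical helper for MyCustomTask; replace with your real processing logic
def _process_input(self, text: str) -> str:
    """Toy transformation so the example task runs end to end."""
    return text.strip().lower()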
- Create a new file in the arize_experiment/evaluators/ directory:
# arize_experiment/evaluators/my_custom_evaluator.py
from arize_experiment.core.evaluator import BaseEvaluator
from arize_experiment.core.task import TaskResult
from arize_experiment.core.evaluator_registry import EvaluatorRegistry
from arize.experimental.datasets.experiments.types import EvaluationResult
from typing import Dict, Any
@EvaluatorRegistry.register("my_custom_evaluator") # Register the evaluator with the framework
class MyCustomEvaluator(BaseEvaluator):
"""An evaluator that assesses task output quality using a threshold.
This evaluator demonstrates the basic structure required for creating
new evaluators in the arize-experiment framework. It inherits from
BaseEvaluator and implements all required abstract methods.
"""
def __init__(self, threshold: float = 0.8) -> None:
"""Initialize the evaluator with a quality threshold.
Args:
threshold: A float between 0 and 1 representing the minimum
acceptable score for the evaluation to pass.
Defaults to 0.8 (80%).
"""
super().__init__() # Required call to parent class initializer
self.threshold = threshold
@property
def name(self) -> str:
"""Define a unique identifier for this evaluator.
Returns:
str: A lowercase string with underscores that uniquely identifies
this evaluator type in the framework.
"""
return "my_custom_evaluator"
def evaluate(self, task_result: TaskResult) -> EvaluationResult:
"""Evaluate the quality of a task's output.
This method implements the core evaluation logic. It takes a task's
output and returns a standardized evaluation result with a score
and pass/fail determination.
Args:
task_result: The complete result from a task execution, including
input data, output, and any metadata or errors.
Returns:
EvaluationResult: Object containing:
- score: A float between 0 and 1 indicating quality
- passed: Boolean indicating if score meets threshold
- metadata: Additional evaluation context
- explanation: Human-readable description of the result
"""
# Calculate quality score using helper method (not shown)
score = self._calculate_score(task_result)
# Return standardized evaluation result
return EvaluationResult(
score=score, # Quality score between 0 and 1
passed=score >= self.threshold, # Pass if score meets threshold
metadata={"threshold": self.threshold}, # Include config for tracking
explanation=f"Score {score} {'meets' if score >= self.threshold else 'does not meet'} threshold {self.threshold}"
)
# Alternative registration method:
# EvaluatorRegistry.register("my_custom_evaluator", MyCustomEvaluator)
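Similarly, the _calculate_score helper is not shown above; a minimal stand-in (hypothetical) that scores error-free, non-empty outputs could look like:
# Hypothetical helper for MyCustomEvaluator; swap in your real scoring logic
def _calculate_score(self, task_result: TaskResult) -> float:
    """Return 1.0 for an error-free, non-empty output, otherwise 0.0."""
    if task_result.error is not None or not task_result.output:
        return 0.0
    return 1.0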
Note: The framework uses a registry system to manage tasks and evaluators. You can register your implementations either using the decorator syntax shown above or by calling the register method directly. The registration makes your task/evaluator available to the CLI and other framework components.
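Once both are registered, they can be selected by name from the CLI just like the built-in tasks and evaluators (assuming a dataset matching my_custom_task's required schema has already been created):
# Example: Run an experiment with the custom task and evaluator
arize-experiment run \
--name my-custom-experiment \
--dataset my-dataset \
--task my_custom_task \
--evaluator my_custom_evaluator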
We use pytest for testing. All tests should be placed in the tests/ directory.
# tests/test_my_custom_task.py
import pytest
from arize_experiment.tasks.my_custom_task import MyCustomTask
def test_my_custom_task_basic():
task = MyCustomTask({})
input_data = {"test": "data"}
result = task.execute(input_data)
assert result is not None
def test_my_custom_task_validation():
task = MyCustomTask({})
invalid_input = {}
with pytest.raises(ValueError):
task.validate_input(invalid_input)
# Run all tests
pytest
# Run specific test file
pytest tests/test_my_custom_task.py
# Run with coverage (requires the pytest-cov plugin)
pytest --cov=arize_experiment tests/
# Run with verbose output
pytest -v
Please be aware of the rate limits for external services:
- OpenAI API: Varies by tier and model
- Anthropic API: Varies by tier and model
- Arize API: Please refer to your service agreement
- Python Not Found After Installation
  - Rehash pyenv and re-initialize your shell:
    pyenv rehash
    eval "$(pyenv init -)"
- Environment Variables Not Loading
  - Check file permissions: chmod 600 .env
  - Verify file location: the .env file must be in the project root
  - Use printenv to verify variables are set
- Dataset Loading Errors
  - Verify JSON format matches requirements
  - Check file encoding (use UTF-8)
  - Ensure all required fields are present
- Task Execution Failures
  - Check API key validity
  - Verify network connectivity
  - Ensure input data matches schema
  - Check API service status
The project uses several development tools:
- Black: Code formatting
  black arize_experiment/
- Flake8: Code linting
  flake8 arize_experiment/
- MyPy: Type checking
  mypy arize_experiment/
- Pre-commit: Git hooks
  pre-commit install
  pre-commit run --all-files
- Fork the repository
- Create a feature branch
- Implement your changes
- Add tests
- Submit a pull request
- Follow PEP 8 guidelines
- Use type hints
- Document all public methods
- Keep functions focused and small
- Report issues on GitHub
- Contribute to discussions
- Share your use-cases
This project is licensed under the GNU General Public License v3.0 (GPLv3). This means:
- You can use this software for any purpose
- You can modify and distribute this software
- If you distribute modified versions, they must also be under GPLv3
- All changes must be documented and source code must be available
See the LICENSE file for the complete license text.