
LLM Deductive Coding Pipeline

A flexible framework for qualitative text analysis and coding using large language models (LLMs).

Project Overview

This repository contains a reusable framework for applying deductive coding to text data through LLM-based classifiers. The project implements a modular pipeline architecture that supports both single-stage and multi-stage classification approaches, with integrations for multiple LLM backends.

The framework is designed for systematic qualitative analysis of text data where:

  • You have predefined coding schemes or categories
  • You need to process large volumes of text data
  • You want to leverage LLMs for consistent application of coding rules
  • You need robust checkpointing for long-running processes

Core Framework Architecture

The system is designed as a modular pipeline with the following key components:

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│                 │     │                 │     │                 │
│   Data Loading  │────▶│  Classification │────▶│  Analysis and   │
│   & Processing  │     │     Engine      │     │  Reporting      │
│                 │     │                 │     │                 │
└─────────────────┘     └─────────────────┘     └─────────────────┘

Classification Strategies

The pipeline is designed to be flexible enough to support a wide range of classification strategies. Two example classification approaches have been implemented (an illustrative prompt sketch follows the list):

  1. Simple Classifier: Single-stage approach that directly classifies text based on your coding scheme
  2. Two-Stage Classifier:
    • First stage identifies evidence related to your coding categories
    • Second stage scrutinises that evidence for validity
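
In either strategy, the coding scheme is communicated to the model as a system prompt. The snippet below is a purely illustrative sketch of what such a prompt might look like for a binary code; the label set, coding rules, and output format are assumptions for illustration, not the prompts used in this project:

# Hypothetical coding-scheme prompt for a single-stage classifier.
# The code definition and YES/NO output format are illustrative only.
SYSTEM_PROMPT = """You are a qualitative coder applying a predefined coding scheme.
Read the narrative and decide whether it contains evidence of homelessness.

Coding rules:
- Code YES only if the text explicitly describes the person as homeless,
  sleeping rough, or of no fixed abode.
- Code NO if housing status is not mentioned or is ambiguous.

Respond with a single word: YES or NO."""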

Key Features

  • Checkpoint System: Resume interrupted processing from the last saved state (see the sketch after this list)
  • Batch Processing: Efficiently process large datasets in manageable chunks
  • Flexible Pipeline: Configure multi-stage processing with different models
  • Robust Error Handling: Graceful recovery from API failures and timeouts
  • Comprehensive Logging: Track model inputs, outputs, and parameters
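
As a minimal sketch of the checkpoint system, assuming the ClassificationPipeline and PipelineConfig interfaces shown under Installation and Usage below, resuming an interrupted run is just a matter of re-running the pipeline with the same checkpoint directory:

from src.core.pipeline.pipeline import ClassificationPipeline
from src.core.pipeline.types import PipelineConfig

# `classification_step` and `df` are as defined in the Basic
# Classification example further down. Re-creating the pipeline with
# the same checkpoint_dir lets run() resume from the last completed
# batch instead of starting from scratch.
pipeline = ClassificationPipeline(
    steps=[classification_step],
    config=PipelineConfig(
        batch_size=10,
        checkpoint_dir="classification_checkpoints"
    )
)
results = pipeline.run(df, "narrative_column", "id_column")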

Repository Structure (Core Framework)

The core library (src/) provides a flexible and reusable system for LLM-based text classification:

  • src/classifiers/: Classification implementations

    • simple/: Single-stage classifier that directly outputs labels
    • two_step/: Two-stage classifier with evidence gathering and scrutiny phases
  • src/core/: Core infrastructure

    • pipeline/: Pipeline architecture for managing classification workflows
    • checkpointing/: Checkpoint system for resuming interrupted processes
  • src/data/: Data handling

    • Data types for narratives and text segments
    • Loaders for processing input data
  • src/llm_endpoints/: LLM integrations

    • Support for local llama.cpp
    • OpenAI API integration
    • Together AI integration

Research Implementation: Vulnerability Classification in Police Reports

This repository includes the complete codebase for our recent research paper on vulnerability classification in police reports. Our research examines how LLMs can be used to identify vulnerable populations in police records, with careful consideration of potential demographic biases. We applied the framework to classify four key vulnerability factors in police incident narratives:

  • Mental health issues
  • Drug abuse
  • Alcoholism
  • Homelessness

The associated paper compares the performance of the LLM classifiers against human labellers, and explores counterfactual narratives in which only demographic characteristics are varied to test for bias in classification outcomes. This real-world research application demonstrates how the framework can be applied to sensitive text analysis tasks requiring careful scrutiny of evidence and consideration of potential biases.

Research Methods

Key research components include:

  • Comparison of different LLM architectures (Meta-Llama-3.1-8B, Meta-Llama-3.1-70B, GPT-4o)
  • Evaluation of different prompt engineering strategies (custom vs. codebook)
  • Analysis of classification error patterns and demographic biases

Research Findings

Our preliminary analysis has shown:

  • Significant variations in classification performance across different vulnerability types
  • Evidence of demographic biases in some classification contexts
  • Improvement in classification accuracy when using a two-stage approach
  • Variations in performance between different LLM sizes and architectures

For detailed findings, please refer to the paper replication code and our manuscript.

Paper Replication Code

The repository includes code to replicate our research findings:

  • boston_fio_paper/: Scripts related to our analysis of Boston Field Interrogation and Observation (FIO) data

    • analyse_counterfactuals.Rmd: R analysis of counterfactual narratives
    • classify_narratives.ipynb: Classification pipeline execution
    • download_and_preprocess_fio_data.py: Data preparation
    • generate_counterfactuals.ipynb: Generation of counterfactual narratives
  • experiments/: Experimental notebooks

    • early_testing.ipynb: Initial classifier testing
    • simple_classifier.ipynb: Simple classifier implementation
    • two_stage_classifier.ipynb: Two-stage classifier implementation

For detailed information about replicating our paper results, please see the README in the boston_fio_paper/ directory.

Citation

If you use this code in your research, please cite our paper:

@article{relins2023llm,
  title={Using Instruction-Tuned Large Language Models to Identify Indicators of Vulnerability in Police Incident Narratives},
  author={Relins, S. and Birks, D. and Lloyd, C.},
  journal={arXiv preprint},
  year={2023},
  note={Currently under review for the Journal of Quantitative Criminology}
}

Installation and Usage

Installation

  1. Clone the repository:

    git clone https://github.com/samrelins/vulnerability_classifier_pipeline.git
    cd vulnerability_classifier_pipeline
  2. Create and activate a virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:

    pip install -r requirements.txt

Basic Classification

from src.core.pipeline.pipeline import ClassificationPipeline
from src.core.pipeline.types import PipelineConfig, PipelineStep
from src.classifiers.simple.classifier import get_classifications
from src.llm_endpoints.llama_cpp import llama_cpp_endpoint
import pandas as pd

# Load your data
df = pd.read_csv("your_data.csv")

# Define your classification step
classification_step = PipelineStep(
    name="classification",
    processor_fn=get_classifications,
    fn_args={
        'system_prompt': "Your prompt here...",
        'endpoint': llama_cpp_endpoint,
        'n': 3  # Number of classifications per segment
    },
)

# Create and run pipeline
pipeline = ClassificationPipeline(
    steps=[classification_step],
    config=PipelineConfig(
        batch_size=10,
        checkpoint_dir="classification_checkpoints"
    )
)

# Run classification
results = pipeline.run(df, "narrative_column", "id_column")
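
Here pipeline.run receives the DataFrame plus the names of the columns containing the narrative text and a unique identifier, and n=3 requests three classifications per segment, which can serve as a simple consistency check on the model's labels. Assuming the pipeline returns a pandas DataFrame (an assumption, not a documented guarantee), the output can be saved in the usual way:

# Persist the labelled output (assumes `results` is a pandas DataFrame).
results.to_csv("classification_results.csv", index=False)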

Two-Stage Classification

from src.classifiers.two_step.classifier import get_evidence, scrutinise_evidence

# Define evidence gathering step
evidence_step = PipelineStep(
    name="evidence",
    processor_fn=get_evidence,
    fn_args={
        'system_prompt': "Your evidence gathering prompt...",
        'endpoint': llama_cpp_endpoint,
        'n': 10
    },
)

# Define scrutiny step
scrutiny_step = PipelineStep(
    name="scrutiny", 
    processor_fn=scrutinise_evidence,
    fn_args={
        'system_prompt': "Your evidence scrutiny prompt...",
        'endpoint': llama_cpp_endpoint,
        'n': 1
    },
)

# Create and run two-stage pipeline
pipeline = ClassificationPipeline(
    steps=[evidence_step, scrutiny_step],
    config=PipelineConfig(
        batch_size=10,
        checkpoint_dir="two_stage_checkpoints"
    )
)

results = pipeline.run(df, "narrative_column", "id_column")
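
Note the asymmetry in the example settings: the evidence step samples the model ten times per segment (n=10) while the scrutiny step runs once (n=1). One motivation for this kind of configuration is to cast a wide net for candidate evidence first, then apply a single, stricter validity check to whatever is found.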

LLM Integrations

The framework supports multiple LLM backends, each exposed as an endpoint function that can be passed to a pipeline step (see the sketch after this list):

  1. Local Llama.cpp Server: For self-hosted models

    from src.llm_endpoints.llama_cpp import llama_cpp_endpoint
  2. OpenAI API: For accessing GPT models

    from src.llm_endpoints.openai import openai_endpoint
  3. Together AI: For a range of open models

    from src.llm_endpoints.together import together_endpoint
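
Because an endpoint is just a function passed to a step via fn_args, switching backends is a one-line change. A minimal sketch, assuming openai_endpoint is call-compatible with llama_cpp_endpoint:

from src.core.pipeline.types import PipelineStep
from src.classifiers.simple.classifier import get_classifications
from src.llm_endpoints.openai import openai_endpoint

# The same classification step as in Basic Classification, pointed at
# the OpenAI API instead of a local llama.cpp server.
classification_step = PipelineStep(
    name="classification",
    processor_fn=get_classifications,
    fn_args={
        'system_prompt': "Your prompt here...",
        'endpoint': openai_endpoint,  # swapped backend
        'n': 3
    },
)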

License

This project is licensed under the MIT License.

Use and Modification

Feel free to fork this repository and adapt it for your own needs. While we may not actively maintain this as a community project, we welcome researchers to build upon our work for further investigations into LLM-based classification of sensitive text. If you use or modify this codebase for your research, please cite our paper as referenced in the Citation section above.

Contact

For questions about the code or research, please open an issue or contact the authors.
