A flexible framework for qualitative text analysis and coding using large language models (LLMs).
This repository contains a reusable framework for applying deductive coding to text data through LLM-based classifiers. The project implements a modular pipeline architecture that supports both single-stage and multi-stage classification approaches, with integration for multiple LLM backends.
The framework is designed for systematic qualitative analysis of text data where:
- You have predefined coding schemes or categories
- You need to process large volumes of text data
- You want to leverage LLMs for consistent application of coding rules
- You need robust checkpointing for long-running processes
The system is designed as a modular pipeline with the following key components:
```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│                 │     │                 │     │                 │
│  Data Loading   │────▶│ Classification  │────▶│  Analysis and   │
│  & Processing   │     │     Engine      │     │    Reporting    │
│                 │     │                 │     │                 │
└─────────────────┘     └─────────────────┘     └─────────────────┘
```
The pipeline is designed to be flexible and to support most classification strategies. Two example classification approaches have been implemented:
- Simple Classifier: Single-stage approach that directly classifies text based on your coding scheme
- Two-Stage Classifier: Two-pass approach in which:
  - The first stage identifies evidence related to your coding categories
  - The second stage scrutinises that evidence for validity

Beyond the classifiers themselves, key features include:
- Checkpoint System: Resume interrupted processing from the last saved state
- Batch Processing: Efficiently process large datasets in manageable chunks
- Flexible Pipeline: Configure multi-stage processing with different models
- Robust Error Handling: Graceful recovery from API failures and timeouts
- Comprehensive Logging: Track model inputs, outputs, and parameters
The core library (`src/`) provides a flexible and reusable system for LLM-based text classification:
- `src/classifiers/`: Classification implementations
  - `simple/`: Single-stage classifier that directly outputs labels
  - `two_step/`: Two-stage classifier with evidence gathering and scrutiny phases
- `src/core/`: Core infrastructure
  - `pipeline/`: Pipeline architecture for managing classification workflows
  - `checkpointing/`: Checkpoint system for resuming interrupted processes
- `src/data/`: Data handling
  - Data types for narratives and text segments
  - Loaders for processing input data
- `src/llm_endpoints/`: LLM integrations
  - Support for local llama.cpp
  - OpenAI API integration
  - Together AI integration
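As a quick orientation to the data handling layer, here is a minimal sketch of the input shape the pipeline consumes. The column names `narrative_column` and `id_column` are illustrative placeholders taken from the `run()` call in the quick start below, and the sample narratives are invented:

```python
import pandas as pd

# The pipeline operates on a DataFrame with one row per narrative:
# a unique identifier column plus a free-text column, both passed
# by name to pipeline.run() (see the quick start below).
df = pd.DataFrame({
    "id_column": ["fio-001", "fio-002"],
    "narrative_column": [
        "Subject stated he had been sleeping in the park for two weeks.",
        "Officer observed no signs of impairment during the stop.",
    ],
})
```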
This repository includes the complete codebase for our recent research paper on vulnerability classification in police reports. Our research examines how LLMs can be used to identify vulnerable populations in police records, with careful consideration of potential demographic biases. We applied the framework to classify four key vulnerability factors in police incident narratives:
- Mental health issues
- Drug abuse
- Alcoholism
- Homelessness
The associated paper compares LLM classifications with those of human labellers, and explores counterfactual narratives in which only demographic characteristics are varied to test for bias in classification outcomes. This real-world application demonstrates how the framework can be applied to sensitive text analysis tasks that require careful scrutiny of evidence and consideration of potential biases.
Key research components include:
- Comparison of different LLM architectures (Meta-Llama-3.1-8B, Meta-Llama-3.1-70B, GPT-4o)
- Evaluation of different prompt engineering strategies (custom vs. codebook)
- Analysis of classification error patterns and demographic biases
Our preliminary analysis has shown:
- Significant variations in classification performance across different vulnerability types
- Evidence of demographic biases in some classification contexts
- Improvement in classification accuracy when using a two-stage approach
- Variations in performance between different LLM sizes and architectures
For detailed findings, please refer to the paper replication code and our manuscript.
The repository includes code to replicate our research findings:
- `boston_fio_paper/`: Scripts related to our analysis of Boston Field Interrogation and Observation (FIO) data
  - `analyse_counterfactuals.Rmd`: R analysis of counterfactual narratives
  - `classify_narratives.ipynb`: Classification pipeline execution
  - `download_and_preprocess_fio_data.py`: Data preparation
  - `generate_counterfactuals.ipynb`: Generation of counterfactual narratives
- `experiments/`: Experimental notebooks
  - `early_testing.ipynb`: Initial classifier testing
  - `simple_classifier.ipynb`: Simple classifier implementation
  - `two_stage_classifier.ipynb`: Two-stage classifier implementation

For detailed information about replicating our paper results, please see the README in the `boston_fio_paper/` directory.
If you use this code in your research, please cite our paper:
```bibtex
@article{author2023llm,
  title={Using Instruction-Tuned Large Language Models to Identify Indicators of Vulnerability in Police Incident Narratives},
  author={Relins, S. and Birks, D. and Lloyd, C.},
  journal={arXiv preprint},
  year={2023},
  note={Currently under review for the Journal of Quantitative Criminology}
}
```
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/llm-deductive-coding.git
  cd llm-deductive-coding
  ```

- Create and activate a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
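The quick start below uses the local llama.cpp endpoint, which assumes a llama.cpp server is already running on your machine. The exact invocation depends on your llama.cpp build, and the model path below is a placeholder (neither is part of this repository); a typical command looks like:

```bash
# Hypothetical: start a local llama.cpp HTTP server with a GGUF model
# (flag names as in recent llama.cpp builds; adjust paths to your setup).
./llama-server -m models/Meta-Llama-3.1-8B-Instruct.gguf --host 127.0.0.1 --port 8080
```

With a backend available, a minimal single-stage pipeline looks like this: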
```python
from src.core.pipeline.pipeline import ClassificationPipeline
from src.core.pipeline.types import PipelineConfig, PipelineStep
from src.classifiers.simple.classifier import get_classifications
from src.llm_endpoints.llama_cpp import llama_cpp_endpoint
import pandas as pd

# Load your data
df = pd.read_csv("your_data.csv")

# Define your classification step
classification_step = PipelineStep(
    name="classification",
    processor_fn=get_classifications,
    fn_args={
        'system_prompt': "Your prompt here...",
        'endpoint': llama_cpp_endpoint,
        'n': 3  # Number of classifications per segment
    },
)

# Create and run pipeline
pipeline = ClassificationPipeline(
    steps=[classification_step],
    config=PipelineConfig(
        batch_size=10,
        checkpoint_dir="classification_checkpoints"
    )
)

# Run classification
results = pipeline.run(df, "narrative_column", "id_column")
```
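Because each batch is checkpointed as it completes, an interrupted run does not have to start over. The sketch below assumes, based on the checkpoint system described above rather than a documented guarantee, that rebuilding the same pipeline with the same `checkpoint_dir` and calling `run()` again resumes from the last saved state:

```python
# A hedged sketch: reusing the same checkpoint_dir should let the
# checkpoint system skip batches that were already processed.
pipeline = ClassificationPipeline(
    steps=[classification_step],  # identical steps to the interrupted run
    config=PipelineConfig(
        batch_size=10,
        checkpoint_dir="classification_checkpoints"  # same directory as before
    )
)
results = pipeline.run(df, "narrative_column", "id_column")  # resumes rather than restarts
```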
To use the two-stage classifier, chain an evidence-gathering step and a scrutiny step in a single pipeline:

```python
from src.classifiers.two_step.classifier import get_evidence, scrutinise_evidence

# Define evidence gathering step
evidence_step = PipelineStep(
    name="evidence",
    processor_fn=get_evidence,
    fn_args={
        'system_prompt': "Your evidence gathering prompt...",
        'endpoint': llama_cpp_endpoint,
        'n': 10
    },
)

# Define scrutiny step
scrutiny_step = PipelineStep(
    name="scrutiny",
    processor_fn=scrutinise_evidence,
    fn_args={
        'system_prompt': "Your evidence scrutiny prompt...",
        'endpoint': llama_cpp_endpoint,
        'n': 1
    },
)

# Create and run two-stage pipeline
pipeline = ClassificationPipeline(
    steps=[evidence_step, scrutiny_step],
    config=PipelineConfig(
        batch_size=10,
        checkpoint_dir="two_stage_checkpoints"
    )
)

results = pipeline.run(df, "narrative_column", "id_column")
```
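Note the asymmetry in the `n` values: following the role of `n` in the simple classifier example (the number of generations per segment), the evidence step is sampled ten times per narrative while the scrutiny step issues a single verdict over the gathered evidence. Presumably this trades extra evidence-gathering passes for a more stable final judgement; both values can be tuned for your task and budget.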
The framework supports multiple LLM backends:

- Local Llama.cpp Server: For self-hosted models

  ```python
  from src.llm_endpoints.llama_cpp import llama_cpp_endpoint
  ```

- OpenAI API: For accessing GPT models

  ```python
  from src.llm_endpoints.openai import openai_endpoint
  ```

- Together AI: For a range of open models

  ```python
  from src.llm_endpoints.together import together_endpoint
  ```
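Because each step receives its endpoint through `fn_args`, switching backends is a matter of passing a different endpoint function. A minimal sketch, assuming the three endpoints expose the same call interface and that any required API keys are configured in your environment:

```python
from src.llm_endpoints.openai import openai_endpoint

# Same step as the quick start, pointed at the OpenAI backend instead of
# the local llama.cpp server (assumes the endpoints are interchangeable).
classification_step = PipelineStep(
    name="classification",
    processor_fn=get_classifications,
    fn_args={
        'system_prompt': "Your prompt here...",
        'endpoint': openai_endpoint,  # was: llama_cpp_endpoint
        'n': 3
    },
)
```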
This project is licensed under the MIT License.
Feel free to fork this repository and adapt it for your own needs. While we may not actively maintain this as a community project, we encourage researchers to build on our work for further investigations into LLM-based classification of sensitive text. If you use or modify this codebase for your research, please cite our paper as referenced in the Citation section above.
For questions about the code or research, please open an issue or contact the authors.