README for ANCOVA Analysis Script

Overview

This script provides tools for performing ANCOVA (Analysis of Covariance) and related statistical analyses. Its primary function, do_ancova, integrates the individual steps of an ANCOVA and allows flexible customization of inputs and outputs, including graphical representations of the results.

Installation

The package can be installed via:

GitHub

Clone the repository and install it manually:

git clone https://github.com/GERMAN00VP/ANCOVA
cd ./ANCOVA
pip install .

pip

Install it directly from PyPI:

pip install ANCOVA

Requirements

python>=3.10

Dependencies

The script relies on the following Python packages:

numpy
pandas
statsmodels
scipy
seaborn
matplotlib
scikit_posthocs

Install these dependencies using:

pip install numpy pandas statsmodels scipy seaborn matplotlib scikit-posthocs

Key Functionality: `do_ancova`

The do_ancova function performs parametric or non-parametric ANCOVA on a dataset. It accepts a DataFrame containing the dependent variable, the categorical variables, and the covariates, and tests for differences in the dependent variable between groups while adjusting for the covariates.

Features:

Parametric and Non-Parametric ANCOVA:
Automatically switches between parametric and ranked (non-parametric) ANCOVA depending on whether the assumptions of normality and homoscedasticity hold.
Interaction Effects:
Allows inclusion of interactions between variables.
Post-Hoc Analysis:
Automatically performs Tukey or Dunn post-hoc tests when significant differences are found between groups.
Data Visualization:
Generates boxplots and scatterplots with regression lines, including statistical significance indicators.
Customizable Options:
Users can customize interactions, colors, and plot details.
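
The automatic switch described above can be illustrated with a minimal sketch. This is not taken from the package source: the helper name choose_ancova_model, the use of Shapiro-Wilk and Levene tests, the 0.05 threshold, and the choice to rank only the response are all assumptions made for illustration. Column names are kept as simple identifiers so they can be used directly in the formula.

```python
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf


def choose_ancova_model(df, response, group, covariate, alpha=0.05):
    """Hypothetical helper: fit a parametric ANCOVA and fall back to a
    rank-based one if the residuals fail normality or equal-variance checks."""
    formula = f"{response} ~ C({group}) + {covariate}"
    model = smf.ols(formula, data=df).fit()

    # Normality of the residuals (Shapiro-Wilk test)
    _, p_normal = stats.shapiro(model.resid)

    # Homoscedasticity: equal residual variance across groups (Levene test)
    resid_by_group = [model.resid[df[group] == g] for g in df[group].unique()]
    _, p_equal_var = stats.levene(*resid_by_group)

    if p_normal > alpha and p_equal_var > alpha:
        return "parametric", model

    # Assumption: the non-parametric variant refits the same model on the
    # rank-transformed response
    ranked = df.assign(**{response: df[response].rank()})
    return "ranked", smf.ols(formula, data=ranked).fit()


# Small demonstration with strongly skewed noise
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "group": np.repeat(["a", "b", "c"], 50),
    "age": rng.integers(20, 70, size=150),
})
demo["y"] = 1000 - 3 * demo["age"] + 50 * rng.lognormal(0, 1.5, size=150)
kind, fitted = choose_ancova_model(demo, "y", "group", "age")
print(kind)
```

The actual tests and thresholds used by do_ancova may differ; check the source before relying on this behavior.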

Usage: `do_ancova`

Parameters:

data:
A pandas DataFrame containing:
- Column 1: Dependent (response) variable.
- Columns 2 to (1 + categories): Categorical independent variable(s). The first of these is the main category to compare.
- Remaining columns: Continuous covariates.
interactions (Optional):
Specifies interactions between variables:
- "ALL": Includes all interactions.
- list: List of tuples specifying interacting variables.
plot (Default: False):
If True, generates a regression plot and a boxplot.
save_plot (Default: False):
If provided with a file path, saves the generated plots to the specified location.
covariate_to_plot (Optional):
Specifies the covariate to display in plots.
palette (Optional):
A dictionary mapping categorical levels to colors.
categories (Default: 1):
Number of categorical variables.
ax (Optional):
A Matplotlib axis for custom plotting.
y_lab (Optional): Label for the y-axis in the generated plot. Default is False (no label).
x_lab (Optional): Label for the x-axis in the generated plot. Default is False (no label).
sum_of_squares_type (Optional): Specifies the type of sums of squares for ANCOVA. Default is Type 2 (value = 2).
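
To make the positional column layout and the interactions parameter more concrete, here is a minimal sketch of how a model formula could be derived from them. The helper build_formula is hypothetical and not part of the package; it only handles the list-of-tuples form of interactions (not "ALL"), and it assumes a patsy-style formula in which Q("...") quotes column names that contain spaces.

```python
import pandas as pd


def build_formula(data, categories=1, interactions=None):
    """Hypothetical helper: build a patsy-style formula from the column order
    [response, categorical variable(s), continuous covariate(s)]."""
    cols = list(data.columns)
    response = cols[0]
    factors = cols[1:1 + categories]
    covariates = cols[1 + categories:]

    # C(...) marks a term as categorical; Q("...") quotes names with spaces
    terms = [f'C(Q("{c}"))' for c in factors] + [f'Q("{c}")' for c in covariates]
    term_of = dict(zip(factors + covariates, terms))

    # Each tuple in `interactions` adds one a:b interaction term
    for a, b in interactions or []:
        terms.append(f"{term_of[a]}:{term_of[b]}")

    return f'Q("{response}") ~ ' + " + ".join(terms)


layout = pd.DataFrame(columns=["Number of T Cells", "HIV Status", "Sex", "Age"])
formula = build_formula(layout, categories=2, interactions=[("HIV Status", "Age")])
print(formula)
# Q("Number of T Cells") ~ C(Q("HIV Status")) + C(Q("Sex")) + Q("Age") + C(Q("HIV Status")):Q("Age")
```

A formula like this could be fitted with statsmodels.formula.api.ols and summarized with statsmodels.stats.anova.anova_lm(model, typ=2), where typ corresponds to the type of sums of squares.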

Output:

Results:
- A summary data frame with the ANCOVA parameters and outcomes.
- An ANCOVA table with p-values for each effect.
- Post-hoc results (if applicable).
Plots:
- A scatterplot with regression lines for the covariate, plus a boxplot for the main categorical comparisons.
- A Matplotlib axis with a boxplot for the categorical comparisons, returned when ax is provided (allows further customization).
Files (Optional):
Saves plots to the specified file path if save_plot is provided.
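
As a rough idea of what a post-hoc table can look like, the sketch below runs a Tukey HSD test with scipy.stats.tukey_hsd (available in SciPy 1.8 and later) and arranges the pairwise p-values in a DataFrame. The helper tukey_table and the toy data are hypothetical, the comparison is made on the raw response without covariate adjustment, and the format of the post-hoc output returned by do_ancova may differ.

```python
import numpy as np
import pandas as pd
from scipy import stats


def tukey_table(df, response, group):
    """Hypothetical helper: square table of Tukey HSD p-values,
    with one row and one column per group."""
    labels = sorted(df[group].unique())
    samples = [df.loc[df[group] == g, response].to_numpy() for g in labels]
    result = stats.tukey_hsd(*samples)
    return pd.DataFrame(result.pvalue, index=labels, columns=labels)


# Toy data: group C has a clearly lower mean than groups A and B
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "group": np.repeat(["A", "B", "C"], 50),
    "value": np.concatenate([
        rng.normal(1000, 50, 50),
        rng.normal(970, 50, 50),
        rng.normal(800, 50, 50),
    ]),
})
table = tukey_table(toy, "value", "group")
print(table.round(4))
```

For data that fail the parametric assumptions, Dunn's test from scikit_posthocs plays the analogous role.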

Notes

Ensure that your dataset has the shape cases x variables (one row per case, one column per variable).
The script assumes the columns are ordered as follows: [Response variable, Main category to compare, Other categorical covariates (optional), Other continuous covariates].
For multiple categorical variables, specify the number using the categories parameter.

AN EXAMPLE OF USE:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Import the main function from our package
from Ancova_analysis import do_ancova

This invented dataset contains 150 entries with the following columns:

Number of T Cells: The number of T cells, which is affected by the individual's age and HIV status. Individuals with HIV+ (Untreated) have a significant reduction in T cells, while HIV+ (TAR Treatment) individuals have a minimal reduction compared to HIV- individuals.

HIV Status: A categorical variable representing the individual's HIV status. It can take three values:

  -> HIV- (no HIV)

  -> HIV+ (TAR Treatment) (HIV positive, receiving treatment)

  -> HIV+ (Untreated) (HIV positive, not receiving treatment)

Sex: The individual's sex, either Male or Female.
Age: The individual's age, ranging from 20 to 69 years (the upper bound of np.random.randint is exclusive).

The Number of T Cells decreases with age, and the overall reduction is largest for HIV+ (Untreated) individuals.

# Set the seed for reproducibility
np.random.seed(4)

# Number of samples
n = 150

# Categorical variables
sex = np.random.choice(['Male', 'Female'], size=n)
hiv_status = np.random.choice(['HIV-', 'HIV+ (TAR Treatment)', 'HIV+ (Untreated)'], size=n, p=[0.4, 0.3, 0.3])

# Covariate: Age
age = np.random.randint(20, 70, size=n)

# Generate T cell count
t_cells = []
for i in range(n):
    base_t_cells = 1000  # General base for T cells
    age_effect = -3 * (age[i] - 30)  # Mild effect of age
    if hiv_status[i] == 'HIV+ (Untreated)':
        hiv_effect = -200  # Significant reduction for untreated
    elif hiv_status[i] == 'HIV+ (TAR Treatment)':
        hiv_effect = -30  # Minimal reduction for treated
    else:
        hiv_effect = 0  # No effect for HIV-
    noise = np.random.normal(0, 50)  # Random noise
    t_cells.append(base_t_cells + age_effect + hiv_effect + noise)

# Define a palette to set the plotting color of each category; otherwise colors would be assigned randomly
palette = {"HIV-":"skyblue",
           "HIV+ (Untreated)":"salmon",
           "HIV+ (TAR Treatment)":"orange"}


# Create the DataFrame
data_hiv = pd.DataFrame({
    'Number of T Cells': np.round(t_cells).astype(int),
    'HIV Status': hiv_status,
    'Sex': sex,
    'Age': age
})

data_hiv.head()

Let's see whether the ANCOVA analysis is able to capture these differences:

# Run the main function and display the results

df_results, ancova_summary, post_hoc = do_ancova(data=data_hiv,
                                                 palette=palette,
                                                 categories=2,  # HIV Status and Sex
                                                 interactions=[('HIV Status', "Age")],  # Test the significance of the interaction of these variables
                                                 y_lab="CD4 T Cells (count)",  # Set the y-axis label
                                                 plot=True,  # Create the plot
                                                 save_plot="./Images/ANCOVA_Regression_boxplot.png"  # Save the plot to this path
                                                 )

# display() is built into Jupyter/IPython; in a plain script, use print() instead
display(df_results)
display(ancova_summary)
display(post_hoc)

# Create two subplots in a row
fig, axs = plt.subplots(ncols=2,figsize=(12,6))


# When an axis is provided, the boxplot is drawn on it and the axis is returned,
# so it can be combined with other subplots as you wish
df_results, ancova_summary, post_hoc, ax = do_ancova(data=data_hiv,
                                                     palette=palette,
                                                     categories=2,
                                                     y_lab="CD4 T Cells (count)",
                                                     plot=True,
                                                     ax=axs[0]
                                                     )

# Reorder the columns so that Sex becomes the main category to compare
data_hiv_sex = data_hiv[['Number of T Cells', 'Sex', 'HIV Status', 'Age']]

df_results, ancova_summary, post_hoc, ax = do_ancova(data=data_hiv_sex,
                                                     categories=2,
                                                     y_lab="CD4 T Cells (count)",
                                                     plot=True,
                                                     ax=axs[1]  # The other subplot
                                                     )
# Save and show
plt.savefig("./Images/ANCOVA_two_boxplots.png",bbox_inches="tight")
plt.show()
