
Setup CI/CD Pipeline for MSStats #143

Merged · 59 commits · Nov 26, 2024
6874be2
Setup CI/CD Pipeline for MSStats
Nov 3, 2024
7eb8b32
Added SSH private key
Nov 3, 2024
5411a18
Change #3
Nov 3, 2024
d01ece8
Added changes #4
Nov 3, 2024
e1d5963
Added changes #5
Nov 3, 2024
caa633d
Changes #6
Nov 3, 2024
c146ae8
Added changes #7
Nov 3, 2024
508f32f
Added changes #8
Nov 3, 2024
b9d4568
Added changes #8
Nov 3, 2024
bc1bffc
Added changes for cmake issues
Nov 3, 2024
b9cef8e
Added changes #9
Nov 3, 2024
ae465d1
Added changes
Nov 3, 2024
811089f
Added changes
Nov 3, 2024
6725f98
Added changes #10
Nov 3, 2024
0accfc7
Added changes #11
Nov 3, 2024
e19933b
Added changes #12
Nov 3, 2024
5a86ee6
Changes for script run added - monitoring
Nov 3, 2024
a61a9ad
Added changes with diff slurm config
Nov 4, 2024
751efa5
Changes for slurm spec
Nov 4, 2024
e84ac20
Changes for slurm spec #2
Nov 4, 2024
b25cb2e
Changes for slurm spec #3
Nov 4, 2024
c5ae18c
Changes for slurm spec #4
Nov 4, 2024
2a0539d
Added changes for triggering pipeline
Nov 5, 2024
58386d4
Added changes for slurm job #5
Nov 5, 2024
62235e2
Added changes for R version change
Nov 5, 2024
4f0737e
Added changes
Nov 5, 2024
4f00a86
Added changes for slurm job
Nov 5, 2024
8e57590
Added changes for R script
Nov 5, 2024
d1ddb0f
Added changes for slurm job #6
Nov 5, 2024
f7b835d
Added changes for job name
Nov 5, 2024
0f42743
Added changes for getting id of job to be monitored
Nov 5, 2024
b2f427e
Typo in file name
Nov 5, 2024
0e35bc9
Added changes
Nov 5, 2024
8237878
Added changes for R version explicit definition in slurm config
Nov 5, 2024
054ccbb
Feedback changes added
Nov 12, 2024
10a796b
MR Feedbacks
Nov 12, 2024
63469d4
MR Feedbacks #2 - Added FDR
Nov 12, 2024
54c52cd
Added new directory for logs
Nov 12, 2024
0fb0f9c
Changed working directory slurm
Nov 12, 2024
01b8ee8
Revert changes to directory
Nov 12, 2024
120803e
Added changes for working benchmark file
Nov 12, 2024
163dc36
Updated paths
Nov 12, 2024
21b9cb7
Changes added
Nov 12, 2024
ef4d51c
Minor modification for testing in local directory
Nov 13, 2024
78b8358
Added changes to config
Nov 13, 2024
4df5dde
Added FDR changes
Nov 13, 2024
40f37e3
Added changes for threshold
Nov 13, 2024
6a0f9a0
Changes to awk command
Nov 13, 2024
88aa407
FDR failure
Nov 13, 2024
41fd28a
Changed Working Directory
Nov 13, 2024
e671ea7
Final changes to job
Nov 13, 2024
69f7cf3
Changes for slurm
Nov 13, 2024
a319779
Changes to Benchmark folder
Nov 13, 2024
b45643f
changes to fdr
Nov 13, 2024
36061f3
Extract only 4 rows
Nov 13, 2024
f24f7b1
Added changes for installing packages from devel branches
Nov 19, 2024
a5a8dc8
Correct branch of MSstats Convert added
Nov 19, 2024
0ae11fb
Added changes for devel branch change in name
Nov 19, 2024
ad68f1e
Changes for correct boxplots
Nov 19, 2024
56 changes: 56 additions & 0 deletions .github/workflows/benchmark.yml
@@ -0,0 +1,56 @@
name: Run Simple R Script on HPC via Slurm

on:
  push:
    branches:
      - feature/ci-cd-pipeline

jobs:
  test-hpc:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout Repository
        uses: actions/checkout@v3

      - name: Set Up SSH Access
        run: |
          mkdir -p ~/.ssh
          echo "${{ secrets.SSH_PRIVATE_KEY }}" > ~/.ssh/id_rsa
          chmod 600 ~/.ssh/id_rsa
Contributor: I would follow ChatGPT's suggestion here.

Contributor Author: Fixed.
          ssh-keyscan -H login-00.discovery.neu.edu >> ~/.ssh/known_hosts
Contributor (@tonywu1999, Nov 6, 2024): I think we should follow ChatGPT's suggestion here (the error handling one).

Contributor Author: Fixed.

      - name: Transfer Files to HPC
        run: |
          scp benchmark/benchmark.R benchmark/config.slurm raina.ans@login-00.discovery.neu.edu:/home/raina.ans/R

      - name: Submit Slurm Job and Capture Job ID
        id: submit_job
        run: |
          ssh raina.ans@login-00.discovery.neu.edu "cd R && sbatch config.slurm" | tee slurm_job_id.txt
          slurm_job_id=$(grep -oP '\d+' slurm_job_id.txt)
          echo "Slurm Job ID is $slurm_job_id"
          echo "slurm_job_id=$slurm_job_id" >> $GITHUB_ENV

      - name: Monitor Slurm Job
        run: |
          ssh raina.ans@login-00.discovery.neu.edu "
            while squeue -j ${{ env.slurm_job_id }} | grep -q ${{ env.slurm_job_id }}; do
              echo 'Job Id : ${{ env.slurm_job_id }} is still running...'
              sleep 10
            done
            echo 'Job has completed.'
          "

      - name: Fetch Output
        run: |
          scp raina.ans@login-00.discovery.neu.edu:/home/raina.ans/R/job_output.txt job_output.txt
          scp raina.ans@login-00.discovery.neu.edu:/home/raina.ans/R/job_error.txt job_error.txt

      - name: Upload Output as Artifact
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-output
          path: |
            job_output.txt
            job_error.txt
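The job-ID capture in the submit step relies on `sbatch` printing a line like `Submitted batch job <id>`; the parsing can be sanity-checked offline (the job number below is made up):

```shell
# Simulate sbatch's stdout and extract the job id the same way the
# workflow step does (grep -oE is the portable twin of grep -oP here).
echo "Submitted batch job 123456" > slurm_job_id.txt
slurm_job_id=$(grep -oE '[0-9]+' slurm_job_id.txt)
echo "Slurm Job ID is $slurm_job_id"   # prints: Slurm Job ID is 123456
rm slurm_job_id.txt
```

Note this pattern grabs the first run of digits anywhere in the file, so it assumes the `sbatch` confirmation line is the only thing captured by `tee`.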
179 changes: 179 additions & 0 deletions benchmark/benchmark.R
@@ -0,0 +1,179 @@
library(MSstatsConvert)
library(MSstats)
library(ggplot2)
library(dplyr)
library(stringr)
library(parallel)


calculateResult <- function(summarized, label){

  model <- groupComparison("pairwise", summarized)
  comparisonResult <- model$ComparisonResult

  human_comparisonResult <- comparisonResult %>% filter(grepl("_HUMAN$", Protein))
  ecoli_comparisonResult <- comparisonResult %>% filter(grepl("_ECOLI$", Protein))
  yeast_comparisonResult <- comparisonResult %>% filter(grepl("_YEAST$", Protein))

  human_median <- median(human_comparisonResult$log2FC, na.rm = TRUE)
  ecoli_median <- median(ecoli_comparisonResult$log2FC, na.rm = TRUE)
  yeast_median <- median(yeast_comparisonResult$log2FC, na.rm = TRUE)

  cat("Expected Log Change Human:", human_median, "\n")
  cat("Expected Log Change Ecoli:", ecoli_median, "\n")
  cat("Expected Log Change Yeast:", yeast_median, "\n")

  # TODO: calculate SD and mean

  # Kept the code for individual boxplots
  # boxplot(human_comparisonResult$log2FC,
  #         main = "Boxplot of log2FC for Human",
  #         ylab = "log2FC",
  #         col = "lightblue")

  boxplot(ecoli_comparisonResult$log2FC,
          main = "Boxplot of log2FC for E. coli",
          ylab = "log2FC",
          col = "lightgreen")

  # boxplot(yeast_comparisonResult$log2FC,
  #         main = "Boxplot of log2FC for Yeast",
  #         ylab = "log2FC",
  #         col = "lightpink")

  combined_data <- list(
    Human = human_comparisonResult$log2FC,
    Ecoli = ecoli_comparisonResult$log2FC,
    Yeast = yeast_comparisonResult$log2FC
  )

  # Find the proteins expected to change (the spiked-in species) in FragData
  unique_ecoli_proteins <- unique(ecoli_comparisonResult$Protein)
  unique_yeast_proteins <- unique(yeast_comparisonResult$Protein)
  all_proteins <- union(unique_ecoli_proteins, unique_yeast_proteins)

  # Extract the accession (second '|'-delimited field) from each protein ID
  extracted_proteins <- sapply(all_proteins, function(x) {
    split_string <- strsplit(x, "\\|")[[1]]
    if (length(split_string) >= 2) {
      split_string[2]
    } else {
      NA  # No second element to return
    }
  })

  proteins <- unname(unlist(extracted_proteins))
  pattern <- paste(proteins, collapse = "|")

  # Confusion matrix: spiked-in (E. coli / yeast) proteins should be
  # significant, human background proteins should not
  TP <- comparisonResult %>% filter(grepl(pattern, Protein) & adj.pvalue < 0.05) %>% nrow()
  FP <- comparisonResult %>% filter(!grepl(pattern, Protein) & adj.pvalue < 0.05) %>% nrow()
  TN <- comparisonResult %>% filter(!grepl(pattern, Protein) & adj.pvalue >= 0.05) %>% nrow()
  FN <- comparisonResult %>% filter(grepl(pattern, Protein) & adj.pvalue >= 0.05) %>% nrow()

  cat("True Positives (Yeast and Ecoli):", TP, "\n")
  cat("False Positives (Human Samples):", FP, "\n")
  cat("True Negatives:", TN, "\n")
  cat("False Negatives:", FN, "\n")

  # False positive rate, accuracy, and recall
  FPR <- FP / (FP + TN)
  accuracy <- (TP + TN) / (TP + TN + FP + FN)
  recall <- TP / (TP + FN)

  results <- data.frame(
    Label = label,
    TP = TP,
    FP = FP,
    TN = TN,
    FN = FN,
    FPR = FPR,
    Accuracy = accuracy,
    Recall = recall
  )

  return(results)
}
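The FPR, accuracy, and recall computed above follow the standard confusion-matrix definitions (FPR = FP/(FP+TN), accuracy = (TP+TN)/total, recall = TP/(TP+FN)); a throwaway sanity check with invented counts, not real benchmark numbers:

```shell
# Hypothetical counts, purely to exercise the metric formulas used above.
TP=40; FP=5; TN=950; FN=10
awk -v tp="$TP" -v fp="$FP" -v tn="$TN" -v fn="$FN" 'BEGIN {
  printf "FPR=%.4f Accuracy=%.4f Recall=%.4f\n",
         fp / (fp + tn),
         (tp + tn) / (tp + tn + fp + fn),
         tp / (tp + fn)
}'
# prints: FPR=0.0052 Accuracy=0.9851 Recall=0.8000
```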


# Use fread directly to read the CSV
fragpipe_raw = data.table::fread("..//data//FragPipeMsStatsBenchmarking.csv")

head(fragpipe_raw)

fragpipe_raw$Condition = unlist(lapply(fragpipe_raw$Run, function(x){
  paste(str_split(x, "\\_")[[1]][4:5], collapse="_")
}))

fragpipe_raw$BioReplicate = unlist(lapply(fragpipe_raw$Run, function(x){
  paste(str_split(x, "\\_")[[1]][4:7], collapse="_")
}))
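The Condition and BioReplicate columns are built from underscore-delimited run names (fields 4–5 and 4–7 respectively); the same slicing expressed in shell terms, on a made-up run name in the shape the script expects:

```shell
# Hypothetical run name shaped like the FragPipe files the script parses
run="20240101_lab_instrA_condA_rep1_frac1_tech1"
echo "$run" | cut -d'_' -f4-5   # Condition    -> condA_rep1
echo "$run" | cut -d'_' -f4-7   # BioReplicate -> condA_rep1_frac1_tech1
```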

# Convert to MSstats format
msstats_format = MSstatsConvert::FragPipetoMSstatsFormat(fragpipe_raw, use_log_file = FALSE)


# Define the tasks with descriptive labels
data_process_tasks <- list(
  list(
    label = "Data process with Normalized Data",
    result = function() dataProcess(msstats_format, featureSubset = "topN", n_top_feature = 20)
  ),
  list(
    label = "Data process with Normalization and MBImpute False",
    result = function() dataProcess(msstats_format, featureSubset = "topN", n_top_feature = 20, MBimpute = FALSE)
  ),
  list(
    label = "Data process without Normalization",
    result = function() dataProcess(msstats_format, normalization = FALSE, featureSubset = "topN", n_top_feature = 20)
  ),
  list(
    label = "Data process without Normalization with MBImpute False",
    result = function() dataProcess(msstats_format, normalization = FALSE, featureSubset = "topN", n_top_feature = 20, MBimpute = FALSE)
  )
)

# Start the timer
start_time <- Sys.time()

# Use mclapply to run the dataProcess tasks in parallel
num_cores <- detectCores() - 1 # Use one less than the total cores available

# Run data processing tasks in parallel and collect results with labels
summarized_results <- mclapply(data_process_tasks, function(task) {
  list(label = task$label, summarized = task$result())
}, mc.cores = num_cores)

# Run calculateResult on each summarized result in parallel
results_list <- mclapply(summarized_results, function(res) {
  calculateResult(res$summarized, res$label)
}, mc.cores = num_cores)

# Combine all results into a single data frame
final_results <- do.call(rbind, results_list)

# End the timer
end_time <- Sys.time()
total_time <- end_time - start_time

# Display the final results and execution time
print(final_results)
print(paste("Total Execution Time:", total_time))
29 changes: 29 additions & 0 deletions benchmark/config.slurm
Original file line number Diff line number Diff line change
@@ -0,0 +1,29 @@
#!/bin/bash
#SBATCH --job-name=msstats_benchmark_job
#SBATCH --output=job_output.txt
#SBATCH --error=job_error.txt
#SBATCH --time=01:00:00 # Set the maximum run time
#SBATCH --ntasks=1 # Number of tasks (one process)
#SBATCH --cpus-per-task=8 # Use 8 CPU cores for the task
#SBATCH --mem=128G # Request 128GB of memory
Contributor: Comment says 256, but says 128 here.

Contributor Author: Fixed.

#SBATCH --partition=short # Use the 'short' partition (or change as needed)

module load R-geospatial


module load gcc/11.1.0
module load cmake/3.23.2

export LC_ALL=C
export R_LIBS_USER=/home/raina.ans/R/x86_64-pc-linux-gnu-library/4.2-geospatial


mkdir -p $R_LIBS_USER

module load R
Rscript -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager', lib = Sys.getenv('R_LIBS_USER'), repos = 'https://cloud.r-project.org'); \
BiocManager::install('MSstats', lib = Sys.getenv('R_LIBS_USER'), update = FALSE); \
BiocManager::install('MSstatsConvert', lib = Sys.getenv('R_LIBS_USER'), update = FALSE); \
install.packages(c('dplyr', 'stringr', 'ggplot2'), lib = Sys.getenv('R_LIBS_USER'), repos = 'https://cloud.r-project.org')"

Rscript benchmark.R