
vignette(dataProcess): Add tables on dataProcess options and assumptions #150

Merged: 4 commits, Jan 17, 2025

Conversation

@tonywu1999 (Contributor) commented Jan 17, 2025

PR Type

Documentation, Enhancement


Description

  • Added detailed tables on normalization, feature selection, and imputation options in the vignette.

  • Updated dataProcess function documentation to reference vignette for recommendations.

  • Introduced kableExtra dependency for enhanced table rendering in vignettes.

  • Removed redundant configuration file MSstats.Rproj.


Changes walkthrough 📝

Relevant files:

**Dependencies**

- `DESCRIPTION`: Added `kableExtra` dependency for vignette table rendering
  - Added kableExtra to the list of suggested packages. (+2/-1)

**Configuration changes**

- `MSstats.Rproj`: Removed unnecessary `MSstats.Rproj` file
  - Removed the redundant project configuration file. (+0/-17)

**Documentation**

- `R/dataProcess.R`: Updated `dataProcess` function documentation for clarity
  - Updated parameter descriptions to reference vignette recommendations.
  - Clarified documentation for normalization, feature selection, and imputation options. (+6/-3)
- `man/dataProcess.Rd`: Improved manual documentation for the `dataProcess` function
  - Enhanced manual documentation to include vignette references.
  - Improved descriptions for normalization and imputation parameters. (+7/-4)
- `vignettes/MSstatsWorkflow.Rmd`: Added detailed tables and recommendations to vignette
  - Added detailed tables for normalization, feature selection, and imputation options.
  - Included references and assumptions for each processing option.
  - Enhanced visualization with kableExtra for better table formatting. (+90/-1)

    💡 PR-Agent usage: Comment /help "your question" on any pull request to receive relevant information


    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
    🧪 No relevant tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Table Rendering and Dependencies

    The introduction of kableExtra for table rendering in the vignette should be validated for compatibility and proper rendering across different environments. Ensure that the added tables are displayed correctly in the final documentation output.

    ```{r echo=FALSE, message=FALSE}
    library(knitr)
    library(kableExtra)
    
    # Data for the table
    table_data <- data.frame(
      Name = c("Median", "", "Quantile", "", "Global standards", "", "", "None", ""),
      Description = c(
        "Equalize medians of all log feature intensities in each run", "",
        "Equalize the distributions of all log feature intensities in each run", "",
        "Equalize median log-intensities of endogenous or spiked-in reference peptides or proteins. Apply adjustment to the remainder of log feature intensities", "", "",
        "Do not apply any normalization", ""
      ),
      Assumption = c(
        "All steps of data collection and acquisition were randomized",
        "Most of the proteins in the experiment are the same and have the same concentration for all of the runs. The experimental artifacts affect every peptide in a run by the same constant amount",
        "All steps of data collection and acquisition were randomized",
        "Most of the proteins in the experiment are the same and have the same concentration for all of the runs. The experimental artifacts affect every peptide non-linearly, as a function of its log intensity",
        "All steps of data collection and acquisition were randomized",
        "The reference peptides or proteins are present in each run and have the same concentration for all of the runs. All experimental artifacts occur only after standards were added.",
        "The experimental artifacts affect every protein in a run by the same constant amount",
        "All steps of data collection and acquisition were randomized",
        "The experiment has no systematic artifacts or has been normalized in another custom manner"
      ),
      Effect = c(
        "The normalization estimates the artifact deviations in each run with a single quantity, reducing overfitting",
        "The normalization reduces bias and variance of the estimated log fold change",
        "The normalization estimates the artifact deviations in each run with a complex non-linear function, potentially leading to overfitting",
        "The normalization reduces bias and variance of the estimated log fold change but may over-correct",
        "The normalization estimates the artifact deviations in each run with a single quantity, which reduces overfitting",
        "The normalization estimates the artifact deviations from a small number of peptides, which may increase overfitting. The normalization does not eliminate artifacts that occurred before adding spiked references",
        "The normalization reduces bias and variance of the estimated log fold change",
        "All patterns of variation of interest and of nuisance variation are preserved",
        ""
      )
    )
    
    # Create the table
# Create the table
kable(table_data, "html", escape = FALSE, col.names = c("Name", "Description", "Assumption", "Effect")) %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
```

    If the assumptions of the normalization are not verified, the normalization may, in fact, increase bias or variance of the estimated log fold change. For example, if the experiment is not randomized and the experimental artifacts are confounded with the conditions, the median and quantile normalizations will introduce bias.
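The median option described in the table can be sketched in a few lines of base R on simulated data. This toy example (the data and variable names are hypothetical, and it is not MSstats' internal code) shifts each run's log-intensities so that all runs share the same median:

```r
# Toy sketch of median normalization: equalize the per-run medians of
# log-intensities. Illustrative only; dataProcess performs this internally.
set.seed(7)
log_int <- data.frame(
  run       = rep(c("run1", "run2"), each = 50),
  intensity = c(rnorm(50, mean = 20), rnorm(50, mean = 21))  # run2 carries an artifact shift
)
run_medians  <- tapply(log_int$intensity, log_int$run, median)
grand_median <- median(log_int$intensity)
log_int$normalized <- log_int$intensity - run_medians[log_int$run] + grand_median
tapply(log_int$normalized, log_int$run, median)  # both runs now share the grand median
```

Because a single constant is estimated per run, this correction is hard to overfit, which is the "reduces overfitting" effect noted in the table.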

    Feature Selection

    Feature selection is used to determine which protein features should be used to infer the overall protein abundance in a sample. The options here are:

    • Using all features
    • Using the top ‘N’ features
    • Removing uninformative features and outliers

Using all features simply leverages all available information to infer the underlying protein abundance. Top ‘N’ features selects a pre-specified number of features with the highest average intensity across all runs for protein-level inference. This option is useful if you believe that features with lower average intensity are less reliable, or when some proteins have a very large number of features (such as in DIA experiments). For any individual protein, changes in abundance can usually be determined from the highest-intensity peaks; in these cases, using all features adds redundancy while greatly increasing the computational processing time. Finally, removing uninformative features and outliers attempts to select the ‘best’ features by removing features that have too many missing values, are too noisy, or contain outliers.
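The top ‘N’ option can be sketched on a toy feature-by-run matrix (hypothetical data; `dataProcess(featureSubset = "topN", n_top_feature = N)` applies this selection internally):

```r
# Toy sketch of top-N feature selection: keep the N features with the highest
# average log-intensity across runs. Illustrative only, not MSstats' internal code.
set.seed(1)
log_intensity <- matrix(rnorm(20, mean = 20, sd = 2), nrow = 5,
                        dimnames = list(paste0("feature", 1:5), paste0("run", 1:4)))

top_n_features <- function(mat, n) {
  feature_means <- rowMeans(mat, na.rm = TRUE)   # average log-intensity per feature
  names(sort(feature_means, decreasing = TRUE))[seq_len(min(n, nrow(mat)))]
}

top_n_features(log_intensity, 3)  # the 3 most intense features
```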

    Missing Value Imputation

Missing value imputation attempts to infer feature intensities in runs in which they were not measured. MSstats imputes these values using an accelerated failure time model.
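The accelerated failure time idea can be sketched with the `survival` package by treating intensities below a detection limit as left-censored observations. The toy data, detection limit, and `survreg` formulation below are an illustration under that assumption, not MSstats' internal implementation:

```r
# Hedged sketch of accelerated failure time (AFT) modeling of censored
# log-intensities: values below a detection limit are left-censored, and the
# model recovers the underlying mean despite the truncation.
library(survival)

set.seed(42)
true_log_int <- rnorm(200, mean = 20, sd = 2)   # true log-intensities
limit <- 18                                     # hypothetical detection limit
observed <- true_log_int >= limit               # below the limit -> censored
y <- pmax(true_log_int, limit)                  # value actually recorded

# Gaussian AFT fit with left censoring (event = 0 marks a censored value)
fit <- survreg(Surv(y, event = observed, type = "left") ~ 1, dist = "gaussian")
coef(fit)  # estimated mean log-intensity, corrected for censoring
```

The naive mean of `y` is biased upward because the censored values are clamped at the limit; the AFT fit corrects this, which is the bias-removal effect described in the table below.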

```{r echo=FALSE, message=FALSE}
# Table 2 data
imputation_table <- data.frame(
  Name = c("Imputation", "No imputation"),
  Description = c(
    "Infer missing feature intensities by using an accelerated failure time model. It will not impute for runs in which all features are missing",
    "Do not apply imputation"
  ),
  Assumption = c(
    "Features are missing for reasons of low abundance (e.g., features are missing not at random)",
    "Assume no information about reasons for missingness or that features are missing at random"
  ),
  Effect = c(
    "If the assumption is true, imputation will remove bias toward high intensities in the summarization step. Otherwise, bias will be introduced via inaccurate imputation",
    "If the assumption is true, no new bias will be introduced. Otherwise, if features are missing for reasons of low abundance, summarized values will be biased toward high intensities"
  )
)

# Render Table 2
kable(imputation_table, "html", escape = FALSE, col.names = c("Name", "Description", "Assumption", "Effect")) %>%
  kable_styling(full_width = TRUE, bootstrap_options = c("striped", "hover", "condensed")) %>%
  column_spec(2:4, width = "30em")
```

[**Documentation Clarity**](https://github.com/Vitek-Lab/MSstats/pull/150/files#diff-321480d21bdc578dd9994bb29f9dcff991db1d3a4ae029079fc567e5a2eaaabcR11-R50)

The added references to MSstats vignettes for recommendations on normalization, feature selection, and imputation options should be reviewed for clarity and accuracy. Ensure that these references are helpful and align with the intended audience's needs.

```txt
    #' If FALSE, no normalization is performed.  See MSstats vignettes for 
    #' recommendations on which normalization option to use.
    #' @param nameStandards optional vector of global standard peptide names. 
    #' Required only for normalization with global standard peptides.
    #' @param featureSubset "all" (default) uses all features that the data set has. 
    #' "top3" uses top 3 features which have highest average of log-intensity across runs. 
    #' "topN" uses top N features which has highest average of log-intensity across runs. 
    #' It needs the input for n_top_feature option. 
    #' "highQuality" flags uninformative feature and outliers. See MSstats vignettes for 
    #' recommendations on which feature selection option to use.
    #' @param remove_uninformative_feature_outlier optional. Only required if 
    #' featureSubset = "highQuality". TRUE allows to remove 
    #' 1) noisy features (flagged in the column feature_quality with "Uninformative"),
    #' 2) outliers (flagged in the column, is_outlier with TRUE, 
    #' before run-level summarization. FALSE (default) uses all features and intensities 
    #' for run-level summarization.
    #' @param min_feature_count optional. Only required if featureSubset = "highQuality".
    #' Defines a minimum number of informative features a protein needs to be considered
    #' in the feature selection algorithm.
#' @param n_top_feature optional. Only required if featureSubset = 'topN'.  
#' In that case, it specifies the number of top features that will be used.
#' Default is 3, which means to use the top 3 features.
    #' @param summaryMethod "TMP" (default) means Tukey's median polish, 
    #' which is robust estimation method. "linear" uses linear mixed model.
#' @param equalFeatureVar only for summaryMethod = "linear". Default is TRUE. 
#' Logical variable for whether the model should account for heterogeneous variation 
#' among intensities from different features. TRUE (default) assumes equal 
#' variance among intensities from features. FALSE accounts for heterogeneous 
#' variation among intensities from different features.
    #' @param censoredInt Missing values are censored or at random. 
    #' 'NA' (default) assumes that all 'NA's in 'Intensity' column are censored. 
    #' '0' uses zero intensities as censored intensity. 
    #' In this case, NA intensities are missing at random. 
    #' The output from Skyline should use '0'. 
#' NULL assumes that all NA intensities are randomly missing.
#' @param MBimpute only for summaryMethod = "TMP" and censoredInt = 'NA' or '0'. 
#' TRUE (default) imputes missing values with 'NA' or '0' (depending on censoredInt option) 
#' by an accelerated failure time model. FALSE uses the values assigned by cutoffCensored.
#' See MSstats vignettes for recommendations on which imputation option to use.
```
    


    PR Code Suggestions ✨

    Explore these optional code suggestions:

    General
    Verify that the required package for table rendering is loaded to prevent runtime errors

    Ensure that the kableExtra package is properly imported and loaded in the vignette,
    as it is used for table rendering. Without this, the table generation code may fail
    during execution.

    vignettes/MSstatsWorkflow.Rmd [221-222]

```diff
+if (!requireNamespace("kableExtra", quietly = TRUE)) {
+  stop("Package 'kableExtra' is required but not installed.")
+}
 library(kableExtra)
```
Suggestion importance[1-10]: 8

Why: The suggestion ensures that the kableExtra package is loaded properly, which is critical for rendering tables in the vignette. This prevents runtime errors if the package is not installed or loaded, making the code more robust.
    Ensure that the data frame column names align with the expected structure for table rendering

    Validate that the column names in the table_data and imputation_table data frames
    match the expected structure for the kable function to avoid rendering issues.

    vignettes/MSstatsWorkflow.Rmd [258-259]

```diff
+if (!all(c("Name", "Description", "Assumption", "Effect") %in% colnames(table_data))) {
+  stop("Column names in 'table_data' do not match the expected structure.")
+}
 kable(table_data, "html", escape = FALSE, col.names = c("Name", "Description", "Assumption", "Effect")) %>%
 kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
```
Suggestion importance[1-10]: 7

Why: The suggestion adds a validation step to check column names in the table_data data frame before rendering the table. This helps prevent runtime errors due to mismatched column names, improving code reliability.
    Add error handling to ensure table rendering does not fail silently

    Add error handling for the kable and kable_styling functions to gracefully handle
    cases where table rendering fails due to unexpected input or missing dependencies.

    vignettes/MSstatsWorkflow.Rmd [296-298]

```diff
-kable(imputation_table, "html", escape = FALSE, col.names = c("Name", "Description", "Assumption", "Effect")) %>%
-kable_styling(full_width = TRUE, bootstrap_options = c("striped", "hover", "condensed")) %>%
-column_spec(2:4, width = "30em")
+tryCatch({
+  kable(imputation_table, "html", escape = FALSE, col.names = c("Name", "Description", "Assumption", "Effect")) %>%
+  kable_styling(full_width = TRUE, bootstrap_options = c("striped", "hover", "condensed")) %>%
+  column_spec(2:4, width = "30em")
+}, error = function(e) {
+  stop("Error in rendering the table: ", e$message)
+})
```
Suggestion importance[1-10]: 6

Why: Adding error handling for the kable and kable_styling functions ensures that any issues during table rendering are caught and reported. This improves the robustness of the code, although the likelihood of such errors is relatively low.

    @tonywu1999 tonywu1999 requested a review from mstaniak January 17, 2025 15:44
    @tonywu1999 tonywu1999 merged commit 8e5fb46 into devel Jan 17, 2025
    1 check passed