
vignette(dataProcess): Add tables on dataProcess options and assumptions #150

Merged: 4 commits, Jan 17, 2025

Conversation

@tonywu1999 (Contributor) commented Jan 17, 2025

PR Type

Documentation, Enhancement


Description

  • Added detailed tables on normalization, feature selection, and imputation options in the vignette.

  • Updated dataProcess function documentation to reference vignette for recommendations.

  • Introduced kableExtra dependency for enhanced table rendering in vignettes.

  • Removed redundant configuration file MSstats.Rproj.


Changes walkthrough 📝

Relevant files:

**Dependencies**

- `DESCRIPTION`: Added `kableExtra` dependency for vignette table rendering
  - Added kableExtra to the list of suggested packages. (+2/-1)

**Configuration changes**

- `MSstats.Rproj`: Removed unnecessary `MSstats.Rproj` file
  - Removed the redundant project configuration file. (+0/-17)

**Documentation**

- `R/dataProcess.R`: Updated `dataProcess` function documentation for clarity
  - Updated parameter descriptions to reference vignette recommendations.
  - Clarified documentation for normalization, feature selection, and imputation options. (+6/-3)
- `man/dataProcess.Rd`: Improved manual documentation for the `dataProcess` function
  - Enhanced manual documentation to include vignette references.
  - Improved descriptions for normalization and imputation parameters. (+7/-4)
- `vignettes/MSstatsWorkflow.Rmd`: Added detailed tables and recommendations to vignette
  - Added detailed tables for normalization, feature selection, and imputation options.
  - Included references and assumptions for each processing option.
  - Enhanced visualization with kableExtra for better table formatting. (+90/-1)

    💡 PR-Agent usage: Comment /help "your question" on any pull request to receive relevant information


    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
    🧪 No relevant tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Table Rendering and Dependencies

    The introduction of kableExtra for table rendering in the vignette should be validated for compatibility and proper rendering across different environments. Ensure that the added tables are displayed correctly in the final documentation output.

    ```{r echo=FALSE, message=FALSE}
    library(knitr)
    library(kableExtra)
    
    # Data for the table
    table_data <- data.frame(
      Name = c("Median", "", "Quantile", "", "Global standards", "", "", "None", ""),
      Description = c(
        "Equalize medians of all log feature intensities in each run", "",
        "Equalize the distributions of all log feature intensities in each run", "",
        "Equalize median log-intensities of endogenous or spiked-in reference peptides or proteins. Apply adjustment to the remainder of log feature intensities", "", "",
        "Do not apply any normalization", ""
      ),
      Assumption = c(
        "All steps of data collection and acquisition were randomized",
        "Most of the proteins in the experiment are the same and have the same concentration for all of the runs. The experimental artifacts affect every peptide in a run by the same constant amount",
        "All steps of data collection and acquisition were randomized",
        "Most of the proteins in the experiment are the same and have the same concentration for all of the runs. The experimental artifacts affect every peptide non-linearly, as a function of its log intensity",
        "All steps of data collection and acquisition were randomized",
        "The reference peptides or proteins are present in each run and have the same concentration for all of the runs. All experimental artifacts occur only after standards were added.",
        "The experimental artifacts affect every protein in a run by the same constant amount",
        "All steps of data collection and acquisition were randomized",
        "The experiment has no systematic artifacts or has been normalized in another custom manner"
      ),
      Effect = c(
        "The normalization estimates the artifact deviations in each run with a single quantity, reducing overfitting",
        "The normalization reduces bias and variance of the estimated log fold change",
        "The normalization estimates the artifact deviations in each run with a complex non-linear function, potentially leading to overfitting",
        "The normalization reduces bias and variance of the estimated log fold change but may over-correct",
        "The normalization estimates the artifact deviations in each run with a single quantity, which reduces overfitting",
        "The normalization estimates the artifact deviations from a small number of peptides, which may increase overfitting. The normalization does not eliminate artifacts that occurred before adding spiked references",
        "The normalization reduces bias and variance of the estimated log fold change",
        "All patterns of variation of interest and of nuisance variation are preserved",
        ""
      )
    )
    
    # Create the table
# Create the table
kable(table_data, "html", escape = FALSE, col.names = c("Name", "Description", "Assumption", "Effect")) %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
```

    If the assumptions of the normalization are not verified, the normalization may, in fact, increase bias or variance of the estimated log fold change. For example, if the experiment is not randomized and the experimental artifacts are confounded with the conditions, the median and quantile normalizations will introduce bias.
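The median option described in the table can be sketched in a few lines of base R on simulated data. This toy example (the data and variable names are hypothetical, and it is not MSstats' internal code) shifts each run's log-intensities so that all runs share the same median:

```r
# Toy sketch of median normalization: equalize the per-run medians of
# log-intensities. Illustrative only; dataProcess performs this internally.
set.seed(7)
log_int <- data.frame(
  run       = rep(c("run1", "run2"), each = 50),
  intensity = c(rnorm(50, mean = 20), rnorm(50, mean = 21))  # run2 carries an artifact shift
)
run_medians  <- tapply(log_int$intensity, log_int$run, median)
grand_median <- median(log_int$intensity)
log_int$normalized <- log_int$intensity - run_medians[log_int$run] + grand_median
tapply(log_int$normalized, log_int$run, median)  # both runs now share the grand median
```

Because a single constant is estimated per run, this correction is hard to overfit, which is the "reduces overfitting" effect noted in the table.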

    Feature Selection

    Feature selection is used to determine which protein features should be used to infer the overall protein abundance in a sample. The options here are:

    • Using all features
    • Using the top ‘N’ features
    • Removing uninformative features and outliers

Using all features simply leverages all available information to infer the underlying protein abundance. Top ‘N’ features selects a pre-specified number of features with the highest average intensity across all runs for protein-level inference. This option is useful if you believe that features with lower average intensity are less reliable, or when some proteins have a very large number of features (such as in DIA experiments). For any individual protein, changes in abundance can usually be determined from the highest-intensity peaks; in these cases, using all features adds redundancy while greatly increasing the computational processing time. Finally, removing uninformative features and outliers attempts to select the ‘best’ features by removing features that have too many missing values, are too noisy, or contain outliers.
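The top ‘N’ option can be sketched on a toy feature-by-run matrix (hypothetical data; `dataProcess(featureSubset = "topN", n_top_feature = N)` applies this selection internally):

```r
# Toy sketch of top-N feature selection: keep the N features with the highest
# average log-intensity across runs. Illustrative only, not MSstats' internal code.
set.seed(1)
log_intensity <- matrix(rnorm(20, mean = 20, sd = 2), nrow = 5,
                        dimnames = list(paste0("feature", 1:5), paste0("run", 1:4)))

top_n_features <- function(mat, n) {
  feature_means <- rowMeans(mat, na.rm = TRUE)   # average log-intensity per feature
  names(sort(feature_means, decreasing = TRUE))[seq_len(min(n, nrow(mat)))]
}

top_n_features(log_intensity, 3)  # the 3 most intense features
```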

    Missing Value Imputation

Missing value imputation attempts to infer feature intensities in runs in which they were not measured. MSstats imputes these values using an accelerated failure time model.
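The accelerated failure time idea can be sketched with the `survival` package by treating intensities below a detection limit as left-censored observations. The toy data, detection limit, and `survreg` formulation below are an illustration under that assumption, not MSstats' internal implementation:

```r
# Hedged sketch of accelerated failure time (AFT) modeling of censored
# log-intensities: values below a detection limit are left-censored, and the
# model recovers the underlying mean despite the truncation.
library(survival)

set.seed(42)
true_log_int <- rnorm(200, mean = 20, sd = 2)   # true log-intensities
limit <- 18                                     # hypothetical detection limit
observed <- true_log_int >= limit               # below the limit -> censored
y <- pmax(true_log_int, limit)                  # value actually recorded

# Gaussian AFT fit with left censoring (event = 0 marks a censored value)
fit <- survreg(Surv(y, event = observed, type = "left") ~ 1, dist = "gaussian")
coef(fit)  # estimated mean log-intensity, corrected for censoring
```

The naive mean of `y` is biased upward because the censored values are clamped at the limit; the AFT fit corrects this, which is the bias-removal effect described in the table below.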

```{r echo=FALSE, message=FALSE}
# Table 2 data
imputation_table <- data.frame(
  Name = c("Imputation", "No imputation"),
  Description = c(
    "Infer missing feature intensities by using an accelerated failure time model. It will not impute for runs in which all features are missing",
    "Do not apply imputation"
  ),
  Assumption = c(
    "Features are missing for reasons of low abundance (e.g., features are missing not at random)",
    "Assume no information about reasons for missingness or that features are missing at random"
  ),
  Effect = c(
    "If the assumption is true, imputation will remove bias toward high intensities in the summarization step. Otherwise, bias will be introduced via inaccurate imputation",
    "If the assumption is true, no new bias will be introduced. Otherwise, if features are missing for reasons of low abundance, summarized values will be biased toward high intensities"
  )
)

# Render Table 2
kable(imputation_table, "html", escape = FALSE, col.names = c("Name", "Description", "Assumption", "Effect")) %>%
  kable_styling(full_width = TRUE, bootstrap_options = c("striped", "hover", "condensed")) %>%
  column_spec(2:4, width = "30em")
```

[**Documentation Clarity**](https://github.com/Vitek-Lab/MSstats/pull/150/files#diff-321480d21bdc578dd9994bb29f9dcff991db1d3a4ae029079fc567e5a2eaaabcR11-R50)

The added references to MSstats vignettes for recommendations on normalization, feature selection, and imputation options should be reviewed for clarity and accuracy. Ensure that these references are helpful and align with the intended audience's needs.

```txt
    #' If FALSE, no normalization is performed.  See MSstats vignettes for 
    #' recommendations on which normalization option to use.
    #' @param nameStandards optional vector of global standard peptide names. 
    #' Required only for normalization with global standard peptides.
    #' @param featureSubset "all" (default) uses all features that the data set has. 
    #' "top3" uses top 3 features which have highest average of log-intensity across runs. 
    #' "topN" uses top N features which has highest average of log-intensity across runs. 
    #' It needs the input for n_top_feature option. 
    #' "highQuality" flags uninformative feature and outliers. See MSstats vignettes for 
    #' recommendations on which feature selection option to use.
    #' @param remove_uninformative_feature_outlier optional. Only required if 
    #' featureSubset = "highQuality". TRUE allows to remove 
    #' 1) noisy features (flagged in the column feature_quality with "Uninformative"),
    #' 2) outliers (flagged in the column, is_outlier with TRUE, 
    #' before run-level summarization. FALSE (default) uses all features and intensities 
    #' for run-level summarization.
    #' @param min_feature_count optional. Only required if featureSubset = "highQuality".
    #' Defines a minimum number of informative features a protein needs to be considered
    #' in the feature selection algorithm.
#' @param n_top_feature optional. Only required if featureSubset = 'topN'.  
#' In that case, it specifies the number of top features that will be used.
#' Default is 3, which means to use the top 3 features.
    #' @param summaryMethod "TMP" (default) means Tukey's median polish, 
    #' which is robust estimation method. "linear" uses linear mixed model.
#' @param equalFeatureVar only for summaryMethod = "linear". Default is TRUE. 
#' Logical variable for whether the model should account for heterogeneous variation 
#' among intensities from different features. TRUE (default) assumes equal 
#' variance among intensities from features. FALSE accounts for heterogeneous 
#' variation among intensities from different features.
    #' @param censoredInt Missing values are censored or at random. 
    #' 'NA' (default) assumes that all 'NA's in 'Intensity' column are censored. 
    #' '0' uses zero intensities as censored intensity. 
    #' In this case, NA intensities are missing at random. 
    #' The output from Skyline should use '0'. 
#' NULL assumes that all NA intensities are randomly missing.
#' @param MBimpute only for summaryMethod = "TMP" and censoredInt = 'NA' or '0'. 
#' TRUE (default) imputes missing values with 'NA' or '0' (depending on censoredInt option) 
#' by an accelerated failure time model. FALSE uses the values assigned by cutoffCensored.
#' See MSstats vignettes for recommendations on which imputation option to use.
```
    


    PR Code Suggestions ✨

    Explore these optional code suggestions:

    General
    Verify that the required package for table rendering is loaded to prevent runtime errors

    Ensure that the kableExtra package is properly imported and loaded in the vignette,
    as it is used for table rendering. Without this, the table generation code may fail
    during execution.

    vignettes/MSstatsWorkflow.Rmd [221-222]

```diff
+if (!requireNamespace("kableExtra", quietly = TRUE)) {
+  stop("Package 'kableExtra' is required but not installed.")
+}
 library(kableExtra)
```
Suggestion importance[1-10]: 8

Why: The suggestion ensures that the kableExtra package is loaded properly, which is critical for rendering tables in the vignette. This prevents runtime errors if the package is not installed or loaded, making the code more robust.
    Ensure that the data frame column names align with the expected structure for table rendering

    Validate that the column names in the table_data and imputation_table data frames
    match the expected structure for the kable function to avoid rendering issues.

    vignettes/MSstatsWorkflow.Rmd [258-259]

```diff
+if (!all(c("Name", "Description", "Assumption", "Effect") %in% colnames(table_data))) {
+  stop("Column names in 'table_data' do not match the expected structure.")
+}
 kable(table_data, "html", escape = FALSE, col.names = c("Name", "Description", "Assumption", "Effect")) %>%
 kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
```
Suggestion importance[1-10]: 7

Why: The suggestion adds a validation step to check column names in the table_data data frame before rendering the table. This helps prevent runtime errors due to mismatched column names, improving code reliability.
    Add error handling to ensure table rendering does not fail silently

    Add error handling for the kable and kable_styling functions to gracefully handle
    cases where table rendering fails due to unexpected input or missing dependencies.

    vignettes/MSstatsWorkflow.Rmd [296-298]

```diff
-kable(imputation_table, "html", escape = FALSE, col.names = c("Name", "Description", "Assumption", "Effect")) %>%
-kable_styling(full_width = TRUE, bootstrap_options = c("striped", "hover", "condensed")) %>%
-column_spec(2:4, width = "30em")
+tryCatch({
+  kable(imputation_table, "html", escape = FALSE, col.names = c("Name", "Description", "Assumption", "Effect")) %>%
+  kable_styling(full_width = TRUE, bootstrap_options = c("striped", "hover", "condensed")) %>%
+  column_spec(2:4, width = "30em")
+}, error = function(e) {
+  stop("Error in rendering the table: ", e$message)
+})
```
Suggestion importance[1-10]: 6

Why: Adding error handling for the kable and kable_styling functions ensures that any issues during table rendering are caught and reported. This improves the robustness of the code, although the likelihood of such errors is relatively low.

    @tonywu1999 tonywu1999 requested a review from mstaniak January 17, 2025 15:44
    @tonywu1999 tonywu1999 merged commit 8e5fb46 into devel Jan 17, 2025
    1 check passed