Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(getSubnetworkFromIndra): Fix bugs related to multiple matches for an HGNC ID #28

Merged
merged 2 commits into from
Jan 13, 2025

Conversation

tonywu1999
Copy link
Contributor

@tonywu1999 tonywu1999 commented Jan 13, 2025

User description

Motivation and Context

Got this error [Error in vapply ... values must be length 1, but FUN(X[[1]]) result is length 0](https://stackoverflow.com/questions/42426914/error-in-vapply-values-must-be-length-1-but-funx11-result-is-length-0)

Changes

Please provide a detailed bullet point list of your changes.

Testing

Please describe any unit tests you added or modified to verify your changes.

Checklist Before Requesting a Review

  • I have read the MSstats contributing guidelines
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules
  • Ran styler::style_pkg(transformers = styler::tidyverse_style(indent_by = 4))
  • Ran devtools::document()

PR Type

Bug fix, Tests


Description

  • Fixed handling of multiple matches for HGNC IDs in getSubnetworkFromIndra.

  • Improved error handling for invalid UniProt mnemonic ID inputs.

  • Enhanced unit tests to validate new error messages and edge cases.

  • Updated error messages for clarity and user guidance.


Changes walkthrough 📝

Relevant files
Bug fix
utils_annotateProteinInfoFromIndra.R
Enhanced error handling for UniProt mnemonic ID inputs     

R/utils_annotateProteinInfoFromIndra.R

  • Added error handling for converting non-character inputs to character.
  • Improved validation of UniProt mnemonic ID inputs.
  • Introduced tryCatch to handle conversion errors gracefully.
  • +18/-4   
    utils_getSubnetworkFromIndra.R
    Fixed HGNC ID matching and improved error messages             

    R/utils_getSubnetworkFromIndra.R

  • Fixed logic for matching HGNC IDs to ensure unique matches.
  • Added error message for multiple or zero matches in input data.
  • Updated error message for input size exceeding 400 proteins.
  • +19/-3   
    Tests
    test-getSubnetworkFromIndra.R
    Updated and enhanced unit tests for error handling             

    tests/testthat/test-getSubnetworkFromIndra.R

  • Updated test to reflect new error message for input size limit.
  • Enhanced tests to validate error handling for missing columns.
  • +1/-1     

    💡 PR-Agent usage: Comment /help "your question" on any pull request to receive relevant information

    Copy link

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
    🧪 PR contains tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Error Handling

    The new error handling for converting uniprotMnemonicIds to character strings should be validated to ensure it properly handles edge cases and does not introduce unexpected behavior.

    tryCatch({
        # Attempt to convert all elements to character if not already character
        uniprotMnemonicIds <- lapply(uniprotMnemonicIds, function(x) {
            if (!is.character(x)) {
                as.character(x)
            } else {
                x
            }
        })
    
        # Check if conversion was successful
        if (any(!sapply(uniprotMnemonicIds, is.character))) {
            stop("All elements in the list must be character strings representing UniProt mnemonic IDs.")
        }
    }, error = function(e) {
        stop("An error occurred converting uniprot mnemonic IDs to character strings: ", e$message)
    Input Validation

    The new validation for matching HGNC IDs in .addAdditionalMetadataToIndraEdge should be reviewed to confirm it correctly identifies and handles cases with 0 or multiple matches.

    if (nrow(matched_rows_source) != 1 || nrow(matched_rows_target) != 1) {
        stop(paste0(
            "INDRA Exception: Unexpected number of matches for the following HGNC IDs in the input data: ", 
            edge$source_id, 
            " or ", 
            edge$target_id, 
            ". Each ID must match exactly one entry in the input data, but 0 or multiple matches were found. Please check the input data for duplicates or missing entries."
        ))

    Copy link

    PR Code Suggestions ✨

    Explore these optional code suggestions:

    CategorySuggestion                                                                                                                                    Score
    Possible issue
    Validate that the Protein column in the input dataframe contains valid data before processing

    Add a check to ensure that the Protein column in the input dataframe is not empty or
    entirely NA before attempting to match rows, as this could lead to runtime errors or
    incorrect results.

    R/utils_getSubnetworkFromIndra.R [69-70]

    +if (all(is.na(input$Protein) | input$Protein == "")) {
    +    stop("Input data must contain valid entries in the 'Protein' column.")
    +}
     matched_rows_source <- input[input$HgncId == edge$source_id & !is.na(input$Protein), ]
     matched_rows_target <- input[input$HgncId == edge$target_id & !is.na(input$Protein), ]
    Suggestion importance[1-10]: 9

    Why: This suggestion adds a crucial validation step to ensure the Protein column contains valid data before proceeding with matching rows. This prevents runtime errors and ensures the function operates on valid input, significantly improving reliability.

    9
    Improve error handling for non-convertible elements in the input list to avoid unexpected failures

    Ensure that the tryCatch block properly handles cases where uniprotMnemonicIds
    contains non-convertible elements, as the current implementation may not account for
    all edge cases, such as complex objects or NULL values.

    R/utils_annotateProteinInfoFromIndra.R [18-34]

     tryCatch({
         # Attempt to convert all elements to character if not already character
         uniprotMnemonicIds <- lapply(uniprotMnemonicIds, function(x) {
             if (!is.character(x)) {
    -            as.character(x)
    +            tryCatch(as.character(x), error = function(e) NULL)
             } else {
                 x
             }
         })
         
         # Check if conversion was successful
    -    if (any(!sapply(uniprotMnemonicIds, is.character))) {
    +    if (any(is.null(uniprotMnemonicIds) | !sapply(uniprotMnemonicIds, is.character))) {
             stop("All elements in the list must be character strings representing UniProt mnemonic IDs.")
         }
     }, error = function(e) {
         stop("An error occurred converting uniprot mnemonic IDs to character strings: ", e$message)
     })
    Suggestion importance[1-10]: 8

    Why: The suggestion enhances the robustness of the tryCatch block by handling cases where elements in uniprotMnemonicIds cannot be converted to character strings. This improvement addresses potential edge cases and ensures the function behaves predictably, which is critical for input validation.

    8
    General
    Add a test to ensure the function validates the Protein column in the input dataframe

    Add a test case to verify that the function throws an error when the Protein column
    in the input dataframe is entirely NA or empty, ensuring robust validation of input
    data.

    tests/testthat/test-getSubnetworkFromIndra.R [27-37]

    -test_that("Exception is thrown for 400+ proteins in dataframe", {
    -    input_400 <- data.frame(
    -        Protein = paste0("Protein", 1:400),
    -        HgncId = paste0("HGNCID", 1:400),
    +test_that("Exception is thrown for empty or NA Protein column", {
    +    input_invalid <- data.frame(
    +        Protein = rep(NA, 10),
    +        HgncId = paste0("HGNCID", 1:10),
             issue = NA,
             adj.pvalue = 0.05
         )
         expect_error(
    -        getSubnetworkFromIndra(input_400),
    -        "Invalid Input Error: INDRA query must contain less than 400 proteins.  Consider lowering your p-value cutoff"
    +        getSubnetworkFromIndra(input_invalid),
    +        "Input data must contain valid entries in the 'Protein' column."
         )
     })
    Suggestion importance[1-10]: 8

    Why: Adding a test case to validate the Protein column ensures that the function's input validation logic is robust and correctly handles cases where the column is empty or entirely NA. This strengthens the overall reliability of the code.

    8
    Enhance error messages to provide more specific details about problematic rows in the input data

    Refactor the error message in .addAdditionalMetadataToIndraEdge to include more
    specific details about the problematic rows, such as their indices or content, to
    aid debugging.

    R/utils_getSubnetworkFromIndra.R [73-79]

     stop(paste0(
         "INDRA Exception: Unexpected number of matches for the following HGNC IDs in the input data: ", 
         edge$source_id, 
         " or ", 
         edge$target_id, 
    -    ". Each ID must match exactly one entry in the input data, but 0 or multiple matches were found. Please check the input data for duplicates or missing entries."
    +    ". Each ID must match exactly one entry in the input data, but 0 or multiple matches were found. Problematic rows: ", 
    +    paste(which(input$HgncId %in% c(edge$source_id, edge$target_id)), collapse = ", "), 
    +    ". Please check the input data for duplicates or missing entries."
     ))
    Suggestion importance[1-10]: 7

    Why: The suggestion improves the error message by including specific details about problematic rows, aiding debugging and making it easier for users to identify and resolve issues in their input data.

    7

    @codecov-commenter
    Copy link

    Codecov Report

    Attention: Patch coverage is 62.96296% with 10 lines in your changes missing coverage. Please review.

    Project coverage is 86.69%. Comparing base (4c5c730) to head (6504dc4).

    Files with missing lines Patch % Lines
    R/utils_getSubnetworkFromIndra.R 46.15% 7 Missing ⚠️
    R/utils_annotateProteinInfoFromIndra.R 78.57% 3 Missing ⚠️
    Additional details and impacted files
    @@            Coverage Diff             @@
    ##            devel      #28      +/-   ##
    ==========================================
    - Coverage   88.28%   86.69%   -1.59%     
    ==========================================
      Files           7        7              
      Lines         384      406      +22     
    ==========================================
    + Hits          339      352      +13     
    - Misses         45       54       +9     

    ☔ View full report in Codecov by Sentry.
    📢 Have feedback on the report? Share it here.

    @tonywu1999 tonywu1999 merged commit 3a08390 into devel Jan 13, 2025
    4 checks passed
    @tonywu1999 tonywu1999 deleted the bugs branch January 13, 2025 16:41
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Projects
    None yet
    Development

    Successfully merging this pull request may close these issues.

    2 participants