README_MatchPredictionAlgorithm.md

Service for the match prediction capabilities of the Atlas Search Algorithm

Projects

The solution is split across multiple projects:

  • Atlas.MatchPrediction
    • Contains the business logic for the match prediction algorithm
  • Atlas.MatchPrediction.Data
    • Data access layer - manages the data schema (via EF code first), and access of said data via Dapper
  • Atlas.MatchPrediction.Functions
    • WARNING: This functions app does not run match prediction during a search request; that is performed by Atlas.Functions in the durable functions layer
    • This functions app provides import functionality for HF sets
    • It also exposes an HTTP endpoint for manual debugging/support of the match prediction algorithm and its component stages. It is not intended for production use while running searches.
  • Atlas.MatchPrediction.Test
    • Unit tests for the project
  • Atlas.MatchPrediction.Test.Integration
    • Integration tests, including covering a real database layer

Overview

The "Match Prediction Algorithm" is an additional processing stage for every patient/donor pair that is returned by the Matching Algorithm.

The output of the match prediction algorithm is a percentage likelihood of a patient/donor pair having a specific match count.

This percentage is provided for 0, 1, and 2 mismatches across all loci, as well as per-locus values being provided for each of 0/1/2 mismatches.

The algorithm uses reference data known as "Haplotype Frequency Sets" (or HF Sets) to calculate these likelihoods.

Algorithmic Summary

A high level overview of the match prediction algorithm's logic is as follows:

Haplotype Frequency Set Selection

  • For each patient and donor, a suitable HF set is selected
  • Sets are identified by a combination of the ethnicity and registry data for the donor/patient. If a specific set cannot be found for their ethnicity/registry data, a less specific set will be used, ultimately defaulting to the "global" or default set.
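
The fallback described above can be sketched as a lookup that tries progressively less specific keys. This is a minimal Python illustration, not the Atlas implementation: the `HF_SETS` index, its keys, and the exact fallback order (registry and ethnicity, then registry only, then global) are assumptions made for the example.

```python
from typing import Dict, Optional, Tuple

# Hypothetical index of HF sets keyed by (registry, ethnicity).
# None means "not specified"; (None, None) stands for the global set.
HF_SETS: Dict[Tuple[Optional[str], Optional[str]], str] = {
    ("registry-A", "ethnicity-1"): "set-A-1",
    ("registry-A", None): "set-A",
    (None, None): "global",
}


def select_hf_set(registry: Optional[str], ethnicity: Optional[str],
                  sets: Dict[Tuple[Optional[str], Optional[str]], str] = HF_SETS) -> str:
    """Return the most specific HF set available for a subject."""
    # Assumed order: exact match, then registry only, then the global set.
    candidates = [(registry, ethnicity), (registry, None), (None, None)]
    for key in candidates:
        if key in sets:
            return sets[key]
    raise LookupError("no global HF set is configured")
```

With the sample index, a subject from `registry-A` with an unlisted ethnicity falls back to `set-A`, and a subject from an unlisted registry falls back to `global`.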

Genotype Expansion

  • Both patient and donor genotypes must be expanded from potentially ambiguous allele representations of unknown phase to a collection of possible diplotypes (unambiguous typing of known phase).
  • This could be achieved naively by expanding to all possible diplotypes, but this would generate far too many possibilities to run calculations on
  • Instead, the diplotypes are calculated from the chosen HF set - all haplotypes that are permitted by the input HLA are selected, and then all combinations of permitted haplotypes are considered to give us a set of possible diplotypes
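
As a rough illustration of HF-set-driven expansion, the Python sketch below filters the set's haplotypes down to those permitted by the input typing and then pairs them up. The two-locus model, the allele names, and the extra `pair_explains_genotype` check (which discards pairs that cannot jointly account for both typed positions) are assumptions for the example, not a description of the Atlas code.

```python
from itertools import combinations_with_replacement

LOCI = ("A", "B")  # reduced locus list, for illustration only

# Hypothetical HF set: haplotype (one allele per locus) -> frequency.
HF_SET = {
    ("A1", "B1"): 0.4,
    ("A2", "B2"): 0.3,
    ("A1", "B2"): 0.2,
    ("A2", "B1"): 0.1,
    ("A3", "B1"): 0.05,
}

# Unphased input typing: per locus, the set of alleles allowed at each position.
GENOTYPE = {"A": ({"A1"}, {"A2"}), "B": ({"B1"}, {"B2"})}


def haplotype_permitted(haplotype, genotype):
    """A haplotype is permitted if each of its alleles fits either position."""
    return all(
        haplotype[i] in genotype[locus][0] | genotype[locus][1]
        for i, locus in enumerate(LOCI)
    )


def pair_explains_genotype(h1, h2, genotype):
    """Assumed extra check: the pair must cover both positions at every locus."""
    for i, locus in enumerate(LOCI):
        pos1, pos2 = genotype[locus]
        direct = h1[i] in pos1 and h2[i] in pos2
        cross = h1[i] in pos2 and h2[i] in pos1
        if not (direct or cross):
            return False
    return True


def expand_to_diplotypes(genotype, hf_set):
    """Build candidate diplotypes only from haplotypes present in the HF set."""
    permitted = [h for h in hf_set if haplotype_permitted(h, genotype)]
    return [
        (h1, h2)
        for h1, h2 in combinations_with_replacement(permitted, 2)
        if pair_explains_genotype(h1, h2, genotype)
    ]
```

For the sample data, `expand_to_diplotypes(GENOTYPE, HF_SET)` yields the two possible phasings of the typing, and the `("A3", "B1")` haplotype is never considered because the input HLA does not permit it.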

Frequency Identification

  • For each diplotype in the expanded set, a likelihood must be sourced
  • As the diplotypes were built from haplotypes from the chosen HF set, the likelihood of a diplotype can be easily calculated by multiplying the likelihoods of the two haplotypes it consists of.
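
The multiplication described above is small enough to show directly. In this Python sketch the haplotype names and frequencies are invented, and the function applies only the product stated in this section.

```python
# Hypothetical HF set: haplotype -> frequency.
HF_SET = {
    ("A1", "B1"): 0.4,
    ("A2", "B2"): 0.3,
}


def diplotype_likelihood(diplotype, hf_set):
    """Likelihood of a diplotype as the product of its two haplotype frequencies."""
    first, second = diplotype
    return hf_set[first] * hf_set[second]


# Frequencies 0.4 and 0.3 give a likelihood of roughly 0.12.
example = diplotype_likelihood((("A1", "B1"), ("A2", "B2")), HF_SET)
```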

Match Calculation

  • For each patient/donor pair of diplotypes, we calculate the match count at each locus.
  • Match counts are determined by comparing P Group values - identical P groups are considered a match
    • In the case of null expressing alleles (which belong to no P group), the P group of its paired allele is used for this calculation, in keeping with the logic used in the matching algorithm
    • In the case of HF sets typed at a non-P group resolution, the data must first be converted to P groups. Therefore the only typing resolutions permitted for HF set data (P Group, G group, g group) must all be convertible to exactly 1 (or 0) P groups.
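
A per-locus match count based on P group equality might look like the Python sketch below. Representing a null expressing allele as `None` and taking the better of the two possible pairings (direct or crossed) are assumptions of this example. The case where both alleles at a locus are null is not covered by the text above and is not handled here.

```python
def effective_p_groups(pair):
    """Replace a null allele (None) with the P group of its paired allele."""
    first, second = pair
    if first is None:
        first = second
    if second is None:
        second = first
    return first, second


def locus_match_count(patient, donor):
    """Count matching P groups (0, 1 or 2) at one locus.

    Each argument is a pair of P group names, with None for a null allele.
    Both pairings are tried because the two positions are unordered.
    """
    p1, p2 = effective_p_groups(patient)
    d1, d2 = effective_p_groups(donor)
    direct = (p1 == d1) + (p2 == d2)
    cross = (p1 == d2) + (p2 == d1)
    return max(direct, cross)


def mismatch_count(patient_by_locus, donor_by_locus):
    """Total mismatches across all loci present in the patient typing."""
    return sum(
        2 - locus_match_count(patient_by_locus[locus], donor_by_locus[locus])
        for locus in patient_by_locus
    )
```

For example, `locus_match_count(("P1", None), ("P1", "P1"))` treats the patient as `("P1", "P1")` and counts two matches.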

Final Calculation

  • Each percentage result is calculated by dividing the sum of the likelihoods of all patient/donor pairs that meet the result's criteria (e.g. 0 mismatches overall) by the sum of the likelihoods of all patient/donor pairs.
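
The division above can be sketched in a few lines of Python. The input shape (a list of likelihood and mismatch-count tuples, one per patient/donor diplotype pair) and the sample numbers are assumptions for illustration. How each pair's likelihood is derived is left outside the sketch.

```python
def mismatch_probability(pairs, mismatches):
    """Share of total likelihood held by pairs with the given mismatch count.

    `pairs` is a list of (likelihood, mismatch_count) tuples.
    """
    total = sum(likelihood for likelihood, _ in pairs)
    if total == 0:
        raise ValueError("no patient/donor pairs with non-zero likelihood")
    meeting_criteria = sum(
        likelihood for likelihood, count in pairs if count == mismatches
    )
    return meeting_criteria / total


# Invented likelihoods: the 0-mismatch pair holds about 60% of the total.
PAIRS = [(0.06, 0), (0.03, 1), (0.01, 2)]
probabilities = {m: mismatch_probability(PAIRS, m) for m in (0, 1, 2)}
```

Because every result shares the same denominator, the probabilities over all mismatch counts that actually occur sum to 1.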

HLA versioning

  • There are two places in the algorithm where HLA typings have to be converted to a specific HLA category:
  • Atlas allows for HF sets to be encoded to an HLA nomenclature version that is older than the one used by the matching algorithm.
  • The match prediction algorithm first tries to convert HLA typings using the HF set HLA version.
  • If this first attempt fails (e.g., when an allele belongs to a subsequent nomenclature version), it will attempt to convert the typing using the matching algorithm HLA version (as long as it is different to the HF set version).
  • If both attempts fail, there is a significant risk of the subject being deemed "unrepresented", depending on the point at which conversion fails and the overall typing resolution.
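
The two-attempt conversion can be pictured as a simple fallback. In this Python sketch the version labels, the lookup tables, the `A*99:99` allele, and the `lookup` helper are all invented for illustration. Returning `None` stands in for whatever the real algorithm does when both attempts fail.

```python
# Hypothetical per-version conversion tables: typing -> P group.
TABLES = {
    "old": {"A*01:01": "A*01:01P"},
    "new": {"A*01:01": "A*01:01P", "A*99:99": "A*99:99P"},
}


def lookup(typing, version):
    """Toy converter: raises KeyError when the typing is unknown in a version."""
    return TABLES[version][typing]


def convert_with_fallback(typing, hf_set_version, matching_version, convert=lookup):
    """Try the HF set's HLA version first, then the matching algorithm's."""
    try:
        return convert(typing, hf_set_version)
    except LookupError:
        # A second attempt is only worthwhile if the versions differ.
        if matching_version == hf_set_version:
            return None
        try:
            return convert(typing, matching_version)
        except LookupError:
            return None  # both attempts failed
```

Here `A*99:99` is absent from the `old` table, so it is only converted when the matching algorithm version is `new`.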

Match Prediction Requests

  • Match prediction requests (outside of search) can be submitted to the HTTP-triggered function within the match prediction project.
    • The endpoint accepts a single patient along with a set of donors (at least one donor must be submitted).
    • The endpoint will return a unique request ID for each valid donor input in the batch, and will return validation errors for any invalid donor inputs, i.e., those missing required info.
    • The function forwards the batch request onto a dedicated service bus topic; in this way, potentially millions of requests can be made and queued on the topic for gradual processing.
  • A second, service bus-triggered function reads messages in batches off the topic, and runs the requests.
    • Results are uploaded to a subfolder of the match prediction results blob storage container (subfolder name: match-prediction-requests).
      • Each json result file is named after its corresponding match prediction request ID.
      • Note, the file does not contain a patient or donor ID; the consumer should map patient-donor IDs to request ID when initially submitting the request.
    • At this point, if any requests contain invalid properties, such as invalid HLA, these will be individually caught and logged to Application Insights to allow users to correct them and re-submit.
      • Note: No alerts are sent out in such cases; the user should manually monitor the logs, or use Application Insights monitoring.
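
Since result files carry no patient or donor ID, a consumer needs its own bookkeeping. The Python sketch below shows one way to keep that mapping on the client side. The shape of the endpoint's response (modelled here as a donor ID to request ID dictionary), the `.json` file suffix, and the output field names are assumptions, not part of the documented API.

```python
RESULTS_SUBFOLDER = "match-prediction-requests"  # subfolder name given above


def build_request_index(patient_id, request_ids_by_donor):
    """Map each request ID back to its (patient ID, donor ID) pair.

    `request_ids_by_donor` is a hypothetical donor ID -> request ID dict
    assembled by the caller from the endpoint's response.
    """
    return {
        request_id: (patient_id, donor_id)
        for donor_id, request_id in request_ids_by_donor.items()
    }


def result_blob_name(request_id):
    """Assumed blob path: the subfolder plus a file named after the request ID."""
    return f"{RESULTS_SUBFOLDER}/{request_id}.json"


def attach_ids(request_index, request_id, result):
    """Re-associate a downloaded result with the IDs recorded at submission."""
    patient_id, donor_id = request_index[request_id]
    return {"patientId": patient_id, "donorId": donor_id, **result}
```

Building the index at submission time means a result file can be interpreted later without any further call to the service.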