Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test using the sample data from the Leicester OMOP extract #42

Open
4 of 5 tasks
razekmh opened this issue Nov 11, 2024 · 19 comments
Open
4 of 5 tasks

Test using the sample data from the Leicester OMOP extract #42

razekmh opened this issue Nov 11, 2024 · 19 comments
Assignees
Labels
ARC spike Self contained exploratory work

Comments

@razekmh
Copy link

razekmh commented Nov 11, 2024

I want to:

  • Download the files to the DSD
  • Download the concept files from the UCLH SAFEHR repo to the DSD
  • Open in them a version of R
  • Match the concepts in the drug_exposure table to the concepts table
  • Match the concepts to the ATC codes used by ramses in computing the DDD
@razekmh razekmh added spike Self contained exploratory work ARC labels Nov 11, 2024
@razekmh razekmh self-assigned this Nov 11, 2024
@razekmh
Copy link
Author

razekmh commented Nov 11, 2024

Download the files to the DSD

This works fine. I needed to login and find the data on OneDrive.

@razekmh
Copy link
Author

razekmh commented Nov 11, 2024

Download the concept files from the UCLH SAFEHR repo to the DSD

This is a little bit more complicated. I had to download the individual files as mentioned in #41

@razekmh
Copy link
Author

razekmh commented Nov 11, 2024

Open in them a version of R

Parquet files can be parsed using arrow apache package. You will need to install it and load it and then run read_parquet

@razekmh
Copy link
Author

razekmh commented Nov 12, 2024

I read the concept and drug_exposure tables

drug_exposure_df <- read_parquet("data/OMOP/DRUG_EXPOSURE.parquet")
concept_df <- read_parquet("data/OMOP/concept.parquet")

and then merge the two tables on the drug id to find which exposures from the drug_exposure table have a match in the concept table.

drug_names_exposure_df <- merge(x=drug_exposure_df, y=concept_df, by.x = "drug_concept_id", by.y = "concept_id", all.x = T)

@razekmh
Copy link
Author

razekmh commented Nov 12, 2024

I wanted to count the number of matches per drug, so first I count the hits per concept name in the new merged table

drug_names_count <- drug_names_exposure_df |> count(concept_name)

and then sort and display the top three

 head(drug_names_count[order(drug_names_count$n, decreasing = TRUE), ],3 )

I got

                                           concept_name      n
712                                 No matching concept 323452
264 1000 ML Sodium Chloride 9 MG/ML Injectable Solution  82653
232   100 ML Acetaminophen 10 MG/ML Injectable Solution  27254

so we have 323,452 out of 1,190,373 with drugs that are not listed in the concept list. This is clearly a problem but we will try to address it later. Let's press forward.

Now I want to filter our the No matching concept so, we create a new table.

drug_expousre_matching <- drug_names_exposure_df[drug_names_exposure_df$concept_name != "No matching concept",]

@razekmh
Copy link
Author

razekmh commented Nov 12, 2024

Now let's take a look at the match between the drug_strength table and the new drug_expousre_matching table.

First we read in the drug_strength table

drug_strength_df <- read_parquet("data/OMOP/drug_strength.parquet")

Then we merge the two tables together

drug_strength_names_exposure_df <- merge(x=drug_expousre_matching, y=drug_strength_df , by.x = "drug_concept_id", by.y = "drug_concept_id", all.x = T)

@razekmh
Copy link
Author

razekmh commented Nov 12, 2024

Checking the resulting table I can see that the drug strengths are represented in two ways amount and numerator/ denominator . The following table describes the two sets of columns.

Field Required Type Description
amount_value No float The numeric value associated with the amount of active ingredient contained within the product.
amount_unit_concept_id No integer A foreign key to the Concept in the CONCEPT table representing the identifier for the Unit for the absolute amount of active ingredient.
numerator_value No float The numeric value associated with the concentration of the active ingredient contained in the product
numerator_unit_concept_id No integer A foreign key to the Concept in the CONCEPT table representing the identifier for the numerator Unit for the concentration of active ingredient.
denominator_value (V5.0.1) No float The amount of total liquid (or other divisible product, such as ointment, gel, spray, etc.).
denominator_unit_concept_id No integer A foreign key to the Concept in the CONCEPT table representing the identifier for the denominator Unit for the concentration of active ingredient.

Some of the rows have none the fields and will be dropped for now

@razekmh
Copy link
Author

razekmh commented Nov 12, 2024

Remember that our main target; compute_DDDs requires:

  • ATC_code: a character vector of ATC codes (can be easily obtained from the ab_atc() function)
  • ATC_administration: a character vector indicating the ATC route of administration (see Details)
  • dose: a numeric vector indicating the total dose for the drug serving as the ATC DDD basis of strength. For prescriptions, provide the total dose to be given in one day, rather than per administration.
  • unit: a character vector coercible to a units object with as_units().

We will need to convert the strength on each exposure to the dose in units accepted by as_units(). According to the validate article daily_dose = ( strength * daily_frequency )

@razekmh
Copy link
Author

razekmh commented Nov 12, 2024

Dose and Unit seem to be possible to extract from either amount and numerator/ denominator combination.
For a test purpose, I am installing the units library and then setting the value of the active ingredient and then converting it to the unit

x = 22.5/2.5
units(x) <- as_units("mg/mL")

Now testing the value of x

> x
9 [mg/mL]

Next, I would like to try an example of compute_DDDs with a random row from the dataset. However extracting the ATC code is a bit complicated

@razekmh
Copy link
Author

razekmh commented Nov 12, 2024

First step of finding the ATC codes is to examine the drug standard used in the dataset.

unique(drug_expousre_matching$vocabulary_id)
[1] "RxNorm Extension" "RxNorm"  

Matching the RxNorm to ATC is the next task, then

@razekmh
Copy link
Author

razekmh commented Dec 5, 2024

I am back on getting the ATC code out of the OMOP data that we have. The standards that we have are RxNorm Extension and RxNorm. I did some research into this conversion RxNorm -> ATC and found that this seems to be a problem that has not been fully resolved. I found a short tutorial which was useful in understanding the concept_ancestor table, and how it might be useful in solving the mapping.

@razekmh
Copy link
Author

razekmh commented Dec 5, 2024

Let's take a random drug concept

  concept_id concept_name                                 domain_id vocabulary_id    concept_class_id    standard_concept concept_code valid_start_date valid_end_date invalid_reason
       <int> <chr>                                        <chr>     <chr>            <chr>               <chr>            <chr>        <date>           <date>         <chr>         
1   36809522 2.5 ML Albuterol 1 MG/ML Inhalation Solution Drug      RxNorm Extension Quant Clinical Drug S                OMOP4823558  2019-01-28       2099-12-31     NA     

If we filter the concept_ancestor table by descendant_concept_id we will find that there is a list of concepts which are "ancestors" to the drug we selected.

> filter(concept_ancestor_df, concept_ancestor_df$descendant_concept_id == 36809522)
# A tibble: 12 × 4
   ancestor_concept_id descendant_concept_id min_levels_of_separation max_levels_of_separation
                 <int>                 <int>                    <int>                    <int>
 1            36812024              36809522                        0                        0
 2            21603248              36809522                        5                        6
 3            21603255              36809522                        3                        4
 4            21605007              36809522                        6                        7
 5             1154343              36809522                        2                        3
 6             1356108              36809522                        1                        1
 7            36217207              36809522                        3                        3
 8            19093833              36809522                        1                        1
 9            21603249              36809522                        4                        5
10            36215522              36809522                        2                        2
11            36809522              36809522                        0                        0
12            21603256              36809522                        2                        3

@razekmh
Copy link
Author

razekmh commented Dec 5, 2024

Now let's get the info of each of these concepts ancestors from the concept table. We can see that ATC is an ancestor to RxNorm

> filter(concept_df, concept_df$concept_id %in% c(filter(concept_ancestor_df, concept_ancestor_df$descendant_concept_id == 36809522)$ancestor_concept_id) )
# A tibble: 12 × 10
   concept_id concept_name                                 domain_id vocabulary_id    concept_class_id    standard_concept concept_code valid_start_date valid_end_date invalid_reason
        <int> <chr>                                        <chr>     <chr>            <chr>               <chr>            <chr>        <date>           <date>         <chr>         
 1   21605007 RESPIRATORY SYSTEM                           Drug      ATC              ATC 1st             C                R            1970-01-01       2099-12-31     NA            
 2   21603248 DRUGS FOR OBSTRUCTIVE AIRWAY DISEASES        Drug      ATC              ATC 2nd             C                R03          1970-01-01       2099-12-31     NA            
 3   21603249 ADRENERGICS, INHALANTS                       Drug      ATC              ATC 3rd             C                R03A         1970-01-01       2099-12-31     NA            
 4   21603255 Selective beta-2-adrenoreceptor agonists     Drug      ATC              ATC 4th             C                R03AC        1970-01-01       2099-12-31     NA            
 5   21603256 salbutamol; inhalant                         Drug      ATC              ATC 5th             C                R03AC02      1970-01-01       2099-12-31     NA            
 6   36217207 Inhalant Product                             Drug      RxNorm           Dose Form Group     C                1151123      2016-08-01       2099-12-31     NA            
 7   36215522 albuterol Inhalant Product                   Drug      RxNorm           Clinical Dose Group C                1154602      2016-08-01       2099-12-31     NA            
 8    1356108 albuterol Inhalation Solution                Drug      RxNorm           Clinical Drug Form  S                2108226      2019-02-04       2099-12-31     NA            
 9   19093833 albuterol 1 MG/ML                            Drug      RxNorm           Clinical Drug Comp  S                340169       1970-01-01       2099-12-31     NA            
10    1154343 albuterol                                    Drug      RxNorm           Ingredient          S                435          1970-01-01       2099-12-31     NA            
11   36809522 2.5 ML Albuterol 1 MG/ML Inhalation Solution Drug      RxNorm Extension Quant Clinical Drug S                OMOP4823558  2019-01-28       2099-12-31     NA            
12   36812024 Albuterol 1 MG/ML Inhalation Solution        Drug      RxNorm Extension Clinical Drug       S                OMOP4826058  2019-01-28       2099-12-31     NA    

We could get the ATC 5th ancestor of our RxNorm code using

> filter(concept_df, concept_df$concept_id %in% c(filter(concept_ancestor_df, concept_ancestor_df$descendant_concept_id == 36809522)$ancestor_concept_id), concept_df$concept_class_id == 'ATC 5th')
# A tibble: 1 × 10
  concept_id concept_name         domain_id vocabulary_id concept_class_id standard_concept concept_code valid_start_date valid_end_date invalid_reason
       <int> <chr>                <chr>     <chr>         <chr>            <chr>            <chr>        <date>           <date>         <chr>         
1   21603256 salbutamol; inhalant Drug      ATC           ATC 5th          C                R03AC02      1970-01-01       2099-12-31     NA 

@razekmh
Copy link
Author

razekmh commented Dec 5, 2024

Now this means we could do:

  • a function that finds the ATC code for each instance of RxNorm and go through the drug_exposure table per row.
  • merge the tables and filter to get the same result as the first solution.
  • build a mapping table of RxNorm -> ATC 5th

each solution has advantages and disadvantages. For now I will try to build the mapping.

@razekmh
Copy link
Author

razekmh commented Dec 5, 2024

First I will collect all the unique drug_concept_id from the exposure table. The idea is not to spend time on mapping concept ids which are not used within our dataset. The clear downside is that we will need to check any new data for possible new concept ids which are not mapped. Quite similar to building an index in fact.

rxnorm_unique_drug_ids <- map(unique(drug_exposure_df$drug_concept_id), ~ tibble(rxnorm_id = .x)) %>%   bind_rows()

This result in a dataframe of one column with a length of 1,742. Next I will focus on the ATC 5th concepts and filter these out of the concept table.

atc_5th_concepts <- filter(concept_df, concept_df$concept_class_id == 'ATC 5th')

This gives us 5,452 concepts. Now, I will use these ATC 5th concepts to filter the concept_ancestors table to get the relationships which ATC 5th is an ancestor in.

atc_5th_as_ancestors <- filter(concept_ancestor_df, concept_ancestor_df$ancestor_concept_id %in% c(atc_5th_concepts$concept_id), concept_ancestor_df$max_levels_of_separation > 0)

This result in 2,101,373 relationships. Clearly there are a lot of branching. Now let's match the rxnorm concepts with the atc 5th concepts

rxnorm_to_atc_concept_code <- merge(x=rxnorm_unique_drug_ids, y=atc_5th_as_ancestors, by.x='rxnorm_id', by.y='descendant_concept_id', all.x = TRUE) %>% select(rxnorm_id, ancestor_concept_id)

This gives us 2,033 rows. This is coming from 1,742. Strange! Further if we check the most common merges, we find that NA is the most common with 90 hits

> count(rxnorm_to_atc_concept_code, ancestor_concept_id) %>% arrange(desc(n)) %>% head(5)
  ancestor_concept_id  n
1                  NA 90

let's keep going forward to the last step and then check these results after. I will now replace the concept id of the ATC 5th with the concept code from the ATC standard which could be used in the DDD function in Ramses

rxnorm_concept_id_to_atc_code <- merge(x=rxnorm_to_atc_concept_code , y=concept_df, by.x ='ancestor_concept_id', by.y='concept_id', all.x = TRUE) %>% select(rxnorm_id, concept_code)

We finally arrive to table we wanted

> head(rxnorm_concept_id_to_atc_code)
  rxnorm_id concept_code
1  21111050      S01AA28
2  42683461      J05AR24
3  36065445      B01AX07
4  36065920      J05AR25
5  36121030      L04AC18
6   1361296      L02BB06

@razekmh
Copy link
Author

razekmh commented Dec 16, 2024

one note for the future is that the DDD of a drug could change based on the admission route. For example, morphine

ATC code Name DDD U Adm.R
N02AA01 morphine 0.1 g O
30 mg P
30 mg R

@razekmh
Copy link
Author

razekmh commented Dec 16, 2024

The relationship between RxNorm and ATC goes through the following steps. These steps are reflected in the concept_ancestors table

graph BT;
    A("RxNorm: Clinical Drug")-->B("RxNorm: Ingredient");
    B-->C(ATC 5th);
    C-->D(ATC 4th);
    D-->E(ATC 3th);
    E-->F(ATC 2th);
    F-->G(ATC 1th);
Loading

However the connection between RxNorm -> ATC is not perfect. There are three main problems with this link:

1- Missing links:

image

We ca see this clearly in the pervious step where 90 out of 1,742 of the used RxNorm have n equivalent ATC.

> count(rxnorm_to_atc_concept_code, ancestor_concept_id) %>% arrange(desc(n)) %>% head(5)
  ancestor_concept_id  n
1                  NA 90

2- Multiple ancestors Multiple descendants:

The relationship between RxNorm -> ATC is not one -> one but rather many -> many. This means some ATC 5th codes has many RxNorm descendants and some RxNorm codes has many ATC ancestors. For example:

graph BT;
    A("OMOP809928")-->B("L01BA01")
    A-->C("L04AX03")
    D("OMOP2603855")-->B
    D-->C
Loading

This is partially a side effect of the third point.

> filter(concept_df, concept_df$concept_id %in% c(43809444, 41405893))
# A tibble: 2 × 10
  concept_id concept_name                                     domain_id vocabulary_id    concept_class_id    standard_concept concept_code valid_start_date valid_end_date invalid_reason
       <int> <chr>                                            <chr>     <chr>            <chr>               <chr>            <chr>        <date>           <date>         <chr>         
1   41405893 0.3 ML Methotrexate 50 MG/ML Injectable Solution Drug      RxNorm Extension Quant Clinical Drug S                OMOP2603855  2017-08-24       2099-12-31     NA            
2   43809444 50 ML Methotrexate 100 MG/ML Injectable Solution Drug      RxNorm Extension Quant Clinical Drug S                OMOP809928   2017-08-24       2099-12-31     NA     

3- Repeated Concepts:

ATC 5th concepts are not unique. Some medications are mentioned multiple times under different classification with different codes. For example: methotrexate is mentioned under
L01BA01 and L04AX03. However only L04AX03 has recorded DDD

@razekmh
Copy link
Author

razekmh commented Jan 27, 2025

Here is a more serious example of the problem

> filter(filtered_drug_lookup, drug_concept_id == "46276153")
# A tibble: 3 × 7
  drug_concept_name       drug_concept_id drug_concept_class_id ATC_level ATC_concept_name                                      ATC_code ATC_concept_id
  <chr>                             <int> <chr>                 <chr>     <chr>                                                 <chr>             <int>
1 sodium chloride 9 MG/ML        46276153 Clinical Drug Comp    5         cefoperazone and beta-lactamase inhibitor; parenteral J01DD62        21602913
2 sodium chloride 9 MG/ML        46276153 Clinical Drug Comp    5         ceftriaxone, combinations; systemic                   J01DD54        21602912
3 sodium chloride 9 MG/ML        46276153 Clinical Drug Comp    5         imipenem and cilastatin; parenteral                   J01DH51        21602925

the code 46276153 of OMOP RxNorm/RxNorm Extension matches three ATC codes. Further two of the ATC codes have the same route code. Meaning they cannot be filtered based on the route code. The third ATC is missing data

ATC code Name DDD U Adm.R Note
J01DH51 imipenem and cilastatin 2 g P Refers to imipenem
J01DD62 cefoperazone and beta-lactamase inhibitor 4 g P Refers to cefoperazone
J01DD54 ceftriaxone, combinations

@razekmh
Copy link
Author

razekmh commented Jan 27, 2025

one more interesting problem is mapping the route of administration form the OMOP used standard to the ATC one.
OMOP data does not use RxNorm for this. I am not sure if there is RxNorm for the drug administration.

I found a route called body cavity which is not really clear. Let's try to understand what type of drugs uses this route. Please excuse the bad r code

> filter(collected_concept, collected_concept$concept_id %in% filter(omde, omde$route_concept_id == 4222254)$drug_concept_id)
# A tibble: 4 × 10
  concept_id concept_name                  domain_id vocabulary_id concept_class_id standard_concept concept_code valid_start_date valid_end_date invalid_reason
       <int> <chr>                         <chr>     <chr>         <chr>            <chr>            <chr>        <date>           <date>         <chr>         
1          0 No matching concept           Metadata  None          Undefined        NA               No matching1970-01-01       2099-12-31     NA            
2   21078067 5 ML Tranexamic Acid 100 MG/Drug      RxNorm ExtenQuant ClinicalS                OMOP307613   2017-08-24       2099-12-31     NA            
3   42479436 1000 ML Sodium Chloride 9 MGDrug      RxNorm ExtenQuant ClinicalS                OMOP416795   2017-08-24       2099-12-31     NA            
4   42939099 Sodium Chloride 9 MG/ML IrriDrug      RxNorm ExtenClinical Drug    S                OMOP4665764  2018-08-21       2099-12-31     NA            
> filter(collected_concept, collected_concept$concept_id %in% filter(omde, omde$route_concept_id == 4222254)$drug_concept_id) |> select(concept_class_id, concept_name)
# A tibble: 4 × 2
  concept_class_id    concept_name                                       
  <chr>               <chr>                                              
1 Undefined           No matching concept                                
2 Quant Clinical Drug 5 ML Tranexamic Acid 100 MG/ML Prefilled Syringe   
3 Quant Clinical Drug 1000 ML Sodium Chloride 9 MG/ML Injectable Solution
4 Clinical Drug       Sodium Chloride 9 MG/ML Irrigation Solution  

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ARC spike Self contained exploratory work
Projects
None yet
Development

No branches or pull requests

1 participant