strategy_for_multimodal.txt


Summary
=======

- Reserve a set of matched examples (say 20%) that will ONLY be used for MULTIMODE ('image+rna')
    - the 'v' ('divide_cases') option divides matched cases (only) into two classes: DESIGNATED_UNIMODE_CASES and DESIGNATED_MULTIMODE_CASES. It ignores cases (i.e. directories) which do not have both image and rna data
        - the assignment is random. There is always a 1/4 chance that a given case will be assigned to the DESIGNATED_MULTIMODE_CASES class. This means later cases are less likely to become MULTIMODE cases, which would be a problem if the cases were ordered by class (as far as we know, they're not)
          The reason it's done this way is because we're only making one pass through the directory so we need to ensure that the DESIGNATED_MULTIMODE_CASES class is filled with the user requested number of MULTIMODE cases before we've completed the one and only pass.
          If CASES_RESERVED_FOR_IMAGE_RNA is more than 20% of all matched cases, and especially if it is >25% of all matched cases, the DESIGNATED_MULTIMODE_CASES class is unlikely to be filled. So make sure CASES_RESERVED_FOR_IMAGE_RNA is <=20% of all matched cases.
- Perform UNIMODE training on the remainder (say 80%) of the examples for each of the two modes, generating a trained model for each. This corresponds to steps 1 and 3 below
    - the number of examples reserved for testing in unimode can be relatively small, however there must be at least some because the criterion for saving a model is 'lower test score than all earlier training epochs'
    - it's important to use a model that's 'UNIMODE optimised' (for example, making sure not to overfit), so take the same care with hyperparameters as you would were you not performing MULTIMODE
- Push all TRAINING examples through each (now trained) model to generate the two sets of UNIMODE embeddings.  This corresponds to steps 2 and 4 below
    - this seems odd, but recall that we have separately reserved a set of examples for MULTIMODE that are never seen during UNIMODE.
    - it is achieved by saving the dataset indices that were used during training mode and loading these in the subsequent test mode run
    - actually it wouldn't matter (and it would be better) if ALL (training and test) examples were pushed through the now trained model                                 << TODO
- Concatenate the two sets of matched UNIMODE embeddings to establish the training set for image_rna.  This corresponds to the first part of step 5 below
- Use the concatenated embeddings to train the image_rna model. Since they are 1D vectors, we use DENSE to accomplish this. This corresponds to the second part of step 5 below
    - we now have our trained image_rna model
    - ignore the results from this step -- many of the embeddings are already highly trained from the UNIMODE training steps, therefore classification accuracy will be ~100% most of the time
    - SIDE NOTE: 
       - don't push the image_rna test embeddings through the optimised train_rna model, even though it seems like this might be a good idea
         - this is because there's no way to ensure image_rna test embeddings would correspond 1:1 with their associated image embeddings, thus it's very likely that such an image_rna embedding test set would be heavily polluted with image_rna training examples
- What remains is to see whether it worked or not. To do this we must create concatenated embeddings from the reserved MULTIMODE examples, and push these through the image_rna model. This corresponds to steps 6, 7 and 8 below
    - steps 6 and 7 create separate image and rna embeddings using the models established in steps 1 - 4
    - step 8 concatenates the embeddings and pushes them through the trained image_rna model established at step 5
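
A minimal sketch of the single-pass case assignment described above. It is illustrative only: the function name divide_cases(), its parameters and the in-memory list of case names are assumptions, not the project's actual code. Each case has a 1/4 chance of being designated MULTIMODE until the requested number has been reached; every other case becomes UNIMODE.

```python
import random

def divide_cases(matched_cases, n_multimode, p=0.25, seed=None):
    """Hypothetical single-pass split of matched cases into two tranches.

    While fewer than n_multimode cases have been chosen, each case has
    probability p of being designated MULTIMODE; all others are UNIMODE.
    """
    rng = random.Random(seed)
    unimode, multimode = [], []
    for case in matched_cases:
        if len(multimode) < n_multimode and rng.random() < p:
            multimode.append(case)
        else:
            unimode.append(case)
    return unimode, multimode

# Made-up case names, purely for demonstration
cases = [f"case_{i:03d}" for i in range(100)]
unimode, multimode = divide_cases(cases, n_multimode=20, seed=0)
print(len(unimode), len(multimode))
if len(multimode) < 20:
    print("MULTIMODE class not filled: request fewer cases or re-run")
```

Since a pass over N matched cases designates at most about N/4 of them on average, a request approaching 25% of N often cannot be met in one pass, which is why the reserved number should stay at or below 20%.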

shell script: ./do_image_rna.sh

 step | shell script:    dataset:  inputs:       embedding:   tranche:                             | train/test?  |  purpose:                                  |   use this model:                      |  output:
 -----+--------------------------------------------------------------------------------------------+--------------+--------------------------------------------+----------------------------------------+------------------------------------------
   1  | ./do_all.sh     -d stad   -i image                   -c DESIGNATED_UNIMODE_CASES   -v True |  train+test  |  train model on images                     |   trainable VGG11                      |  optimised model (model_image.pt)
   2  | ./just_test.sh  -d stad   -i image      -m generate  -c DESIGNATED_UNIMODE_CASES           |  test        |  generate optimised image embeddings       |   optimised model (model_image.pt)     |  optimised image embeddings 
   3  | ./do_all.sh     -d stad   -i rna                     -c DESIGNATED_UNIMODE_CASES           |  train+test  |  train model on rna                        |   trainable DENSE                      |  optimised model (model_rna.pt)
   4  | ./just_test.sh  -d stad   -i rna        -m generate  -c DESIGNATED_UNIMODE_CASES           |  test        |  generate optimised rna   embeddings       |   optimised model (model_rna.pt)       |  optimised rna embeddings
   5  | ./do_all.sh     -d stad   -i image_rna               -c DESIGNATED_UNIMODE_CASES           |  train+test  |  concatenate the image_rna embeddings &    |   NONE                                 |  concatenated embeddings
                                                                                                   |  train       |  train model on image_rna embeddings       |   trainable DENSE                      |  optimised model (model_image_rna.pt)

   6  | ./just_test.sh  -d stad   -i image      -m use       -c DESIGNATED_MULTIMODE_CASES         |  test        |  generate optimised image     embeddings   |   optimised model (model_image.pt)     |  optimised image embeddings
   7  | ./just_test.sh  -d stad   -i rna        -m use       -c DESIGNATED_MULTIMODE_CASES         |  test        |  generate optimised rna       embeddings   |   optimised model (model_rna.pt)       |  optimised rna embeddings
   8  | ./just_test.sh  -d stad   -i image_rna  -m use       -c DESIGNATED_MULTIMODE_CASES         |  test        |  classify held out image_rna cases         |   optimised model (model_image_rna.pt) |  classifications


Late Realisation
================
- Don't have to use only matched data when coming up with the two optimised models that will be used to generate the embedding set                                       << TODO


Bio-Dataset Management
======================
   I Three aspects:
       1 cases which have matched examples versus those which don't (HAS_MATCHED_IMAGE_RNA_FLAG)
       2 split of cases between training and test mode for TRAINING runs (there are three training runs in total: image, rna-seq and image_rna)
       3 cases which are exclusively reserved for 'image_rna' testing (DESIGNATED_MULTIMODE_CASES) versus other matched cases (DESIGNATED_MULTIMODE_CASE_FLAG)
       
   II   Directories which contain both image and rna-seq data are flagged ('HAS_MATCHED_IMAGE_RNA_FLAG'). These are the only cases that are used when the user option -m image_rna is selected

   III  To firewall MULTIMODE test examples from UNIMODE examples.
         Further, 
           cases with matched image and rna-seq data  have a 'HAS_MATCHED_IMAGE_RNA_FLAG'    file
           cases designated for use only in UNIMODE   have a: DESIGNATED_UNIMODE_CASE_FLAG   flag
           cases designated for use only in MULTIMODE have a: DESIGNATED_MULTIMODE_CASE_FLAG flag
         Therefore,
           in all directory traversals, skip cases lacking a HAS_MATCHED_IMAGE_RNA_FLAG, and ...
             in the four UNIMODE runs,   only use cases which also have the DESIGNATED_UNIMODE_CASE_FLAG
             in the four MULTIMODE runs, only use cases which also have the DESIGNATED_MULTIMODE_CASE_FLAG
             
   IV  To generate the concatenated embeddings
         Note that each dataset directory (case) contains multiple image (tile) files but only a single rna-seq file
           Traverse 'dataset' (os.walk)
           Within each directory (case) look for a file with the name '_image_rna_matched___rna.npy'. This is the rna embedding
           If it exists, then make one concatenated embedding for every image embedding ('_NNNNNNNN_image_rna_matched___image.npy') which exists in the same directory
           Name the resulting  concatenated embeddings ('_NNNNNNNN_image_rna_matched___image_rna.npy') and save back to the same directory
           At the end of this process, each MATCHED, UNIMODE directory will contain as many new concatenated embeddings as there were image embeddings
           These concatenated embeddings will be used to train the MULTIMODE image+rna model 
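
A minimal sketch of the traversal just described, using the file-naming patterns given above. The function name concatenate_embeddings(), the use of NumPy .npy files throughout, and the concatenation order (image first, then rna) are assumptions made for illustration.

```python
import glob
import os
import tempfile

import numpy as np

RNA_NAME = "_image_rna_matched___rna.npy"
IMAGE_SUFFIX = "_image_rna_matched___image.npy"
CONCAT_SUFFIX = "_image_rna_matched___image_rna.npy"

def concatenate_embeddings(data_dir):
    """Hypothetical helper: write one concatenated embedding per image
    embedding in every case directory that also holds an rna embedding."""
    n_written = 0
    for dirpath, _dirnames, filenames in os.walk(data_dir):
        if RNA_NAME not in filenames:
            continue  # no rna embedding, so nothing to concatenate
        rna = np.load(os.path.join(dirpath, RNA_NAME))
        for fname in filenames:
            if not fname.endswith(IMAGE_SUFFIX):
                continue
            image = np.load(os.path.join(dirpath, fname))
            # Assumed order: image embedding first, then rna embedding
            combined = np.concatenate([image.ravel(), rna.ravel()])
            out_name = fname[: -len(IMAGE_SUFFIX)] + CONCAT_SUFFIX
            np.save(os.path.join(dirpath, out_name), combined)
            n_written += 1
    return n_written

# Demonstration with made-up embeddings (image length 4, rna length 2)
with tempfile.TemporaryDirectory() as root:
    matched = os.path.join(root, "case_matched")
    image_only = os.path.join(root, "case_image_only")
    os.makedirs(matched)
    os.makedirs(image_only)
    np.save(os.path.join(matched, RNA_NAME), np.ones(2))
    for i in range(3):
        np.save(os.path.join(matched, f"_{i:08d}{IMAGE_SUFFIX}"), np.zeros(4))
    np.save(os.path.join(image_only, f"_{0:08d}{IMAGE_SUFFIX}"), np.zeros(4))
    n_written = concatenate_embeddings(root)
    pattern = os.path.join(matched, "*" + CONCAT_SUFFIX)
    shapes = [np.load(p).shape for p in sorted(glob.glob(pattern))]

print(n_written, shapes)
```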
   
   V To ensure as many embeddings as possible are generated in UNIMODE for use in MULTIMODE training:
   
       a during UNIMODE   image     training  run, save training and test indices (as 'train_inds_image' and 'test_inds_image' respectively)
       b during UNIMODE   image     test      run, push saved TRAINING examples through the  trained model ('_79636225_image_rna_matched___image.npy')
       c during UNIMODE   rna-seq   training  run, save training and test indices (as 'train_inds_image' and 'test_inds_image' respectively)
       d during UNIMODE   rna-seq   test      run, push saved TRAINING examples through the  trained model ('_image_rna_matched___rna.npy')
       e during MULTIMODE image_rna training  run, use every embedding that was generated at b and d for MULTIMODE training
      (- there is no image_rna test run. There is no easy way to ensure image_rna test embeddings would correspond 1:1 with their associated image embeddings, so it's likely that such a 'test' set would be heavily polluted with image_rna training examples)
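
Steps a to d rely on saving the training indices during a training run and reloading them in the following test run. A rough sketch of that idea is shown here. The use of pickle, the helper names and the shuffling scheme are assumptions for illustration; only the file names 'train_inds_image' and 'test_inds_image' come from the notes above.

```python
import os
import pickle
import random
import tempfile

def split_and_save(n_examples, test_fraction, out_dir, mode, seed=0):
    """Hypothetical helper: shuffle example indices, split them, and save
    each part to a file named '<train|test>_inds_<mode>'."""
    inds = list(range(n_examples))
    random.Random(seed).shuffle(inds)
    # Keep at least one test example: model saving depends on a test score
    n_test = max(1, int(n_examples * test_fraction))
    test_inds, train_inds = inds[:n_test], inds[n_test:]
    for name, values in (("train", train_inds), ("test", test_inds)):
        with open(os.path.join(out_dir, f"{name}_inds_{mode}"), "wb") as f:
            pickle.dump(values, f)
    return train_inds, test_inds

def load_train_inds(out_dir, mode):
    """Reload the indices used for training so the test-mode run can push
    exactly those examples through the trained model."""
    with open(os.path.join(out_dir, f"train_inds_{mode}"), "rb") as f:
        return pickle.load(f)

with tempfile.TemporaryDirectory() as out_dir:
    train_inds, test_inds = split_and_save(10, 0.2, out_dir, mode="image")
    reloaded = load_train_inds(out_dir, mode="image")

print(len(train_inds), len(test_inds))
```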

Implementation Notes
====================

1  Change the Bash scripts to use keyword arguments rather than positional arguments                                                                                                <<< completed 19/10/20
      preparatory, really just to make flag handling easier. should have done this before now TBH

2  Perform single mode training, to generate an optimised model for each of image and rna data modes                                                                                <<< completed 19/10/20

     ./do_all stad image 
     ./do_all stad rna
   
    - change generate() to create distinct model files for each mode: model_image.pt and model_rna.pt
    - test this change thoroughly before doing any further multi-modal enhancements
    
    inputs:  tiles and rna-seq data from matched cases
    outputs: two trained models

3 Use the UNIMODE trained models to generate matched image+rna embeddings
   
     ./just_test -d stad -i image -m image_rna -c DESIGNATED_UNIMODE_CASES
     ./just_test -d stad -i rna   -m image_rna -c DESIGNATED_UNIMODE_CASES
     
     - test mode only
     - repurpose the 'image_rna' flag so that it becomes a trigger to:                                                                                                              <<< completed 24/10/20 (images)
               (i)  only use matched cases and
               (ii) extract and save FC1 embeddings BACK INTO THE CASE DIRECTORIES that the inputs came from
        - during generation(), only use matched cases (perform a spreadsheet lookup and 'skip' if not matched)                                                                      <<< completed 19/10/20
        - insert code into VGG11 to save FC1 embedding back to the corresponding case directory (saved in fnames)                                                                   <<< completed 24/10/20
             - VGGNN to return FC1 embedding (batch) as well as y2_hat                                                                                                              
               - but only during test mode, (using the last model saved during training)
               - this is a variation on the way we currently use test_mode, where we use it to push an entire patch through the optimum model
               - each embedding is the equivalent of a tile, so we will end up with as many embeddings in a case directory as there were tiles chosen from that same directory
               - we need (and have) the 'fnames', because they tell us where (which case directory) to save each embedding to ( <<<<< this also applies to rna-seq )
               - OTOH, can't use the fname to name the embeddings, since these are only unique per sample, not per tile/embedding. And also, we don't currently retain tile names during tile processing.
               - therefore, use random integers in embedding file names, as follows '96369306_image_rna_matched___image.npy' to identify them as image embeddings
               - recall that each batch contains BATCH_SIZE embeddings, each row of which is a distinct embedding to be saved                                                       <<< completed 24/11/20 (images)
        - insert code into DENSE to save FC1 embedding back to the corresponding case directory (saved in fnames)                                                                   <<< completed 20/11/20 (rna-seq)
               - Make and store a  softlink based on an integer reference to the case id for later use so that DENSE will later know where to save the rna-seq embeddings           <<< completed 18/11/20 (rna-seq)
               - insert code into DENSE FC1 to create embeddings (mini-batch at a time)                                                                                             <<< completed 20/11/20 (rna-seq)
               - insert code into trainlenet5 to associate embeddings with corresponding case directory (saved in fnames) and save there                                            <<< completed 20/11/20 (rna-seq)
               - recall that each batch contains BATCH_SIZE embeddings, each row of which is a distinct embedding to be saved                                                       <<< completed 24/11/20 (rna-seq)
               - name rna-seq embedding files as follows '_image_rna_matched___rna.npy' to identify them as rna-seq embeddings
        - modify just_test.sh to delete image embedding files where input is image (since new ones will be created)                                                                 <<< completed 20/11/20 (rna-seq)
        - modify just_test.sh to delete rna   embedding files where input is rna   (since new ones will be created)                                                                 <<< completed 20/11/20 (rna-seq)
        - in the first instance, use ALL matched cases, LATER allow user to specify a particular number of cases 
        - in the first instance, just select one of the FC layer outputs to save for embeddings, LATER allow user to specify the layer to use via an environment variable

    inputs:  the two optimised models saved at step 2
    outputs: embeddings for each of image and rna-seq, saved in the applicable case directories
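
A sketch of how one batch of FC1 image embeddings might be written back to the case directories under random-integer file names, as described in the notes for this step. It is an illustration under assumptions: the helper save_image_embeddings() is hypothetical, the case directory for each row is passed in directly (the real code resolves it via fnames), and the leading-underscore name form '_NNNNNNNN_image_rna_matched___image.npy' is used.

```python
import os
import random
import tempfile

import numpy as np

def save_image_embeddings(batch, case_dirs, rng=None):
    """Hypothetical helper: save each row of `batch` (one embedding per
    tile) into the matching case directory under a random 8-digit name."""
    rng = rng or random.Random()
    paths = []
    for row, case_dir in zip(batch, case_dirs):
        while True:
            name = f"_{rng.randrange(10**8):08d}_image_rna_matched___image.npy"
            path = os.path.join(case_dir, name)
            if not os.path.exists(path):  # retry on the rare name collision
                break
        np.save(path, np.asarray(row))
        paths.append(path)
    return paths

# Demonstration: a made-up batch of 4 embeddings of length 8, two per case
with tempfile.TemporaryDirectory() as root:
    case_a = os.path.join(root, "case_a")
    case_b = os.path.join(root, "case_b")
    os.makedirs(case_a)
    os.makedirs(case_b)
    batch = np.arange(32, dtype=np.float32).reshape(4, 8)
    paths = save_image_embeddings(
        batch, [case_a, case_a, case_b, case_b], random.Random(0)
    )
    saved_shapes = [np.load(p).shape for p in paths]
    files_in_case_a = len(os.listdir(case_a))

print(saved_shapes, files_in_case_a)
```

Each case directory ends up with as many embedding files as tiles that were drawn from it, which is what the later concatenation step expects.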

 
4 Perform MULTIMODE training using the matched image+rna embeddings 
           
      -  shell file modifications
                ./do_all.sh   -d stad -i image_rna   should delete existing concatenated embedding files ( *___image_rna.npy ) & perform generation                                 <<< completed 21/11/20
                ./just_run.sh -d stad -i image_rna   should assume the multimode .pt file already exists & skip generation
                ./only_run.sh -d stad -i image_rna   should assume the multimode .pt file already exists & skip generation

      -  Notes:                                                                                                                                                                     <<< review & planning 21/11/20
           -  never need to tile, since both the image and rna input files take the form of 1-D embedding vectors
           -  use input flag "-i image_rna" ( -i = args.input mode) to indicate that embeddings should be used as inputs rather than tiles or rna
           -  in the first (this) version, use ALL matched cases, LATER allow user to specify a particular number of cases
      -  mods to generate():                                                                                                                                                        <<< completed 24/11/20
         1  create the concatenated embedding vectors
           -  make the fqln links
              - move the rna fqln code to the top of the function and allow it to do BOTH rna and image_rna
           -  work through each case and locate ones that have both image and rna files                                                                                             <<< completed 21/11/20
           -  if found, concatenate and save back to same directory
              - use a file name based on the image embedding files, viz: '96369306_image_rna_matched___image_rna.npy'
              - then continue processing of this case identically to rna processing
         2  create and save the pytorch data file (.pth) in the identical manner as currently used for both image unimode and rna unimode                                           
           -  the image_rna pt file will need just 'new_image_rna', 'new_label', and 'new_fname'
           -  save the concatenated embeddings file to 'dpcca/data/dlbcl_image/train.pth', overwriting any existing pytorch input file
              - should it later prove necessary to keep existing pth files, the new name for the concatenated embeddings file can be 'dpcca/data/dlbcl_image/train_image_rna.pth'   <<< don't implement
      -  mods to trainlenet5():
         -  process identically to rna-seq unimode.  
              -   if ( args.input_mode=='rna') | ( args.input_mode=='image_rna') : 
              -  save the optimised model as model_image_rna.pt, paralleling the names currently used for image and rna unimode model files
      
    inputs:  the embeddings files for each of image and rna-seq, saved in the applicable case directories (i.e. the output of step 3), viz:
                - _<random_integer>_image_rna_matched___image.npy
                -                  _image_rna_matched___rna.npy 
      
    outputs: optimised multimodal model, ' model_image_rna.pt'; tensorflow curves

      
5 Perform dual-mode testing using the optimised model                                                                                                                               <<< notionally completed 25/11/20
                                                                                                                                                                                    <<< finally    completed 08/11/20
     ./just_test stad image_rna
     
     - make sure to avoid flag confusion with 3 above
     - push some or all held-out multimode embeddings through the optimised model
     
    inputs:  optimised multimodal model, ' model_image_rna.pt'
    outputs: classifications
    
6  Regression testing to make sure the other modes still work                                                                                                                       <<< completed 08/11/20