strategy_for_multimodal.txt
Summary
=======
- Reserve a set of matched examples (say 20%) that will ONLY be used for MULTIMODE ('image+rna')
- the 'v' ('divide_cases') option divides matched cases (only) into two classes: DESIGNATED_UNIMODE_CASES and DESIGNATED_MULTIMODE_CASES. It ignores cases (i.e. directories) which do not have both image and rna data
- the assignment is random. While the DESIGNATED_MULTIMODE_CASES class is still unfilled, each case has a 1/4 chance of being assigned to it; once the class is full, no further cases are assigned. This means later cases are less likely to become MULTIMODE cases, which would be a problem if the cases were ordered by class (as far as we know, they're not)
It is done this way because we make only one pass through the directory, so we need to ensure that the DESIGNATED_MULTIMODE_CASES class is filled with the user-requested number of MULTIMODE cases before that one and only pass completes.
If CASES_RESERVED_FOR_IMAGE_RNA is more than 20% of all matched cases, and especially if it is >25% of all matched cases, the DESIGNATED_MULTIMODE_CASES class is unlikely to be filled. So make sure CASES_RESERVED_FOR_IMAGE_RNA is <=20% of all matched cases.
- Perform UNIMODE training on the remainder (say 80%) of the examples for each of the two modes, generating a trained model for each. This corresponds to steps 1 and 3 below
- the number of examples reserved for testing in unimode can be relatively small, however there must be at least some because the criterion for saving a model is 'lower test score than all earlier training epochs'
- it's important to use a model that's 'UNIMODE optimised' (for example, making sure not to overfit), so take the same care with hyperparameters as you would were you not performing MULTIMODE
- Push all TRAINING examples through each (now trained) model to generate the two sets of UNIMODE embeddings. This corresponds to steps 2 and 4 below
- this seems odd, but recall that we have separately reserved a set of examples for MULTIMODE that are never seen during UNIMODE.
- it is achieved by saving the dataset indices that were used during training mode and loading these in the subsequent test mode run
- actually it wouldn't matter (and it would be better) if ALL (training and test) examples were pushed through the now trained model << TODO
- Concatenate the two sets of matched UNIMODE embeddings to establish the training set for image_rna. This corresponds to the first part of step 5 below
- Use the concatenated embeddings to train the image_rna model. Since they are 1D vectors, we use DENSE to accomplish this. This corresponds to the second part of step 5 below
- we now have our trained image_rna model
- ignore the results from this step -- many of the embeddings are already highly trained from the UNIMODE training steps, therefore classification accuracy will be ~100% most of the time
- SIDE NOTE:
- don't push the image_rna test embeddings through the optimised train_rna model, even though it seems like this might be a good idea
- this is because there's no way to ensure image_rna test embeddings would correspond 1:1 with their associated image embeddings, so it is very likely that such an image_rna embedding test set would be heavily polluted with image_rna training examples
- What remains is to see whether it worked or not. To do this we must create concatenated embeddings from the reserved MULTIMODE examples, and push these through the image_rna model. This corresponds to steps 6, 7 and 8 below
- steps 6 and 7 create separate image and rna embeddings using the models established in steps 1 - 4
- step 8 concatenates the embeddings and pushes them through the trained image_rna model established at step 5
shell script: ./do_image_rna.sh
step | shell script:   dataset: inputs:       embedding:   tranche:                              | train/test? | purpose:                               | use this model:                      | output:
-----+-------------------------------------------------------------------------------------------+-------------+----------------------------------------+--------------------------------------+--------------------------------------
1    | ./do_all.sh     -d stad  -i image                   -c DESIGNATED_UNIMODE_CASES   -v True | train+test  | train model on images                  | trainable VGG11                      | optimised model (model_image.pt)
2    | ./just_test.sh  -d stad  -i image      -m generate  -c DESIGNATED_UNIMODE_CASES           | test        | generate optimised image embeddings    | optimised model (model_image.pt)     | optimised image embeddings
3    | ./do_all.sh     -d stad  -i rna                     -c DESIGNATED_UNIMODE_CASES           | train+test  | train model on rna                     | trainable DENSE                      | optimised model (model_rna.pt)
4    | ./just_test.sh  -d stad  -i rna        -m generate  -c DESIGNATED_UNIMODE_CASES           | test        | generate optimised rna embeddings      | optimised model (model_rna.pt)       | optimised rna embeddings
5    | ./do_all.sh     -d stad  -i image_rna               -c DESIGNATED_UNIMODE_CASES           | train+test  | concatenate the image_rna embeddings & | NONE                                 | concatenated embeddings
     |                                                                                           | train       | train model on image_rna embeddings    | trainable DENSE                      | optimised model (model_image_rna.pt)
6    | ./just_test.sh  -d stad  -i image      -m use       -c DESIGNATED_MULTIMODE_CASES         | test        | generate optimised image embeddings    | optimised model (model_image.pt)     | optimised image embeddings
7    | ./just_test.sh  -d stad  -i rna        -m use       -c DESIGNATED_MULTIMODE_CASES         | test        | generate optimised rna embeddings      | optimised model (model_rna.pt)       | optimised rna embeddings
8    | ./just_test.sh  -d stad  -i image_rna  -m use       -c DESIGNATED_MULTIMODE_CASES         | test        | classify held out image_rna cases      | optimised model (model_image_rna.pt) | classifications
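The single-pass case division described at the top of this summary can be sketched roughly as follows. This is an illustration only, not the project's actual 'divide_cases' code: the function name divide_matched_cases, its parameters and the seeded random generator are assumptions made for the example; only the 1/4 chance, the quota (CASES_RESERVED_FOR_IMAGE_RNA) and the two class names come from the notes above.

```python
import random

def divide_matched_cases(matched_cases, cases_reserved_for_image_rna, p=0.25, seed=None):
    """Hypothetical helper: one pass over the matched cases.

    While the MULTIMODE quota is unfilled, each case has probability p of
    being designated MULTIMODE; every other case is designated UNIMODE.
    """
    rng = random.Random(seed)
    unimode, multimode = [], []
    for case in matched_cases:
        quota_open = len(multimode) < cases_reserved_for_image_rna
        if quota_open and rng.random() < p:
            multimode.append(case)   # would receive DESIGNATED_MULTIMODE_CASE_FLAG
        else:
            unimode.append(case)     # would receive DESIGNATED_UNIMODE_CASE_FLAG
    return unimode, multimode

# 100 matched cases, 20 requested for MULTIMODE (i.e. 20% of matched cases)
cases = [f"case_{n:03d}" for n in range(100)]
unimode, multimode = divide_matched_cases(cases, cases_reserved_for_image_rna=20, seed=0)
print(len(unimode), len(multimode))
```

With p = 1/4 the pass selects about 25% of the cases it visits, so a quota of 20% is usually, but not always, filled before the pass ends. A quota above 25% will usually not be filled, which is the reason for the <=20% recommendation.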
Late Realisation
================
- Don't have to use only matched data when coming up with the two optimised models that will be used to generate the embedding set << TODO
Bio-Dataset Management
======================
I Three aspects:
1 cases which have matched examples versus those which don't (HAS_MATCHED_IMAGE_RNA_FLAG)
2 split of cases between training and test mode for TRAINING runs (there are three training runs in total: image, rna-seq and image_rna)
3 cases which are exclusively reserved for 'image_rna' testing (DESIGNATED_MULTIMODE_CASES) versus other matched cases (DESIGNATED_MULTIMODE_CASE_FLAG)
II Directories which contain both image and rna-seq data are flagged ('HAS_MATCHED_IMAGE_RNA_FLAG'). These are the only cases that are used when the user option -m image_rna is selected
III To firewall MULTIMODE test examples from UNIMODE examples.
Further,
cases with matched image and rna-seq data have a 'HAS_MATCHED_IMAGE_RNA_FLAG' file
cases designated for use only in UNIMODE have a: DESIGNATED_UNIMODE_CASE_FLAG flag
cases designated for use only in MULTIMODE have a: DESIGNATED_MULTIMODE_CASE_FLAG flag
Therefore,
in all directory traversals, skip cases lacking a HAS_MATCHED_IMAGE_RNA_FLAG, and ...
in the four UNIMODE runs, only use cases which also have the DESIGNATED_UNIMODE_CASE_FLAG
in the four MULTIMODE runs, only use cases which also have the DESIGNATED_MULTIMODE_CASE_FLAG
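A minimal sketch of the flag-based firewall described in III, assuming each flag is simply an empty file inside the case directory. The helper select_cases and the throwaway directory layout are hypothetical and exist only for this example; the three flag names are the ones given above.

```python
import os
import tempfile

HAS_MATCHED_FLAG = "HAS_MATCHED_IMAGE_RNA_FLAG"
UNIMODE_FLAG = "DESIGNATED_UNIMODE_CASE_FLAG"
MULTIMODE_FLAG = "DESIGNATED_MULTIMODE_CASE_FLAG"

def select_cases(dataset_dir, designation_flag):
    """Hypothetical helper: names of matched cases carrying designation_flag."""
    selected = []
    for dir_path, _subdirs, files in os.walk(dataset_dir):
        if HAS_MATCHED_FLAG not in files:
            continue                      # skip every unmatched case
        if designation_flag in files:
            selected.append(os.path.basename(dir_path))
    return sorted(selected)

# Build a tiny throwaway dataset: one UNIMODE case, one MULTIMODE case, one unmatched case
with tempfile.TemporaryDirectory() as dataset:
    layout = {
        "case_a": [HAS_MATCHED_FLAG, UNIMODE_FLAG],
        "case_b": [HAS_MATCHED_FLAG, MULTIMODE_FLAG],
        "case_c": [],
    }
    for case, flags in layout.items():
        os.makedirs(os.path.join(dataset, case))
        for flag in flags:
            open(os.path.join(dataset, case, flag), "w").close()
    unimode_cases = select_cases(dataset, UNIMODE_FLAG)
    multimode_cases = select_cases(dataset, MULTIMODE_FLAG)

print(unimode_cases, multimode_cases)
```

Because a case directory carries at most one designation flag, the two selections cannot overlap, which is what keeps MULTIMODE test examples out of the UNIMODE runs.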
IV To generate the concatenated embeddings
Note that each dataset directory (case) contains multiple image (tile) files but only a single rna-seq file
Traverse 'dataset' (os.walk)
Within each directory (case) look for a file with the name '_image_rna_matched___rna.npy'. This is the rna embedding
If it exists, then make one concatenated embedding for every image embedding ('_NNNNNNNN_image_rna_matched___image.npy') which exists in the same directory
Name the resulting concatenated embeddings ('_NNNNNNNN_image_rna_matched___image_rna.npy') and save back to the same directory
At the end of this process, each MATCHED, UNIMODE directory will contain as many new concatenated embeddings as there were image embeddings
These concatenated embeddings will be used to train the MULTIMODE image+rna model
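The traversal in IV could look roughly like the sketch below. It assumes the embeddings are 1-D NumPy arrays saved with np.save and that the image embedding is placed before the rna embedding in the concatenated vector; the notes do not state the ordering. The function name concatenate_embeddings is hypothetical; the file name patterns are the ones given above.

```python
import os
import tempfile
import numpy as np

RNA_EMBEDDING = "_image_rna_matched___rna.npy"
IMAGE_SUFFIX = "_image_rna_matched___image.npy"
CONCAT_SUFFIX = "_image_rna_matched___image_rna.npy"

def concatenate_embeddings(dataset_dir):
    """Hypothetical helper: write one concatenated embedding per image embedding."""
    n_written = 0
    for dir_path, _subdirs, files in os.walk(dataset_dir):
        if RNA_EMBEDDING not in files:
            continue                                  # no rna embedding in this case
        rna = np.load(os.path.join(dir_path, RNA_EMBEDDING)).ravel()
        for name in files:
            if not name.endswith(IMAGE_SUFFIX):
                continue
            image = np.load(os.path.join(dir_path, name)).ravel()
            # keep the '_NNNNNNNN' prefix, swap the suffix
            out_name = name[: -len(IMAGE_SUFFIX)] + CONCAT_SUFFIX
            # assumed order: image part first, rna part second
            np.save(os.path.join(dir_path, out_name), np.concatenate([image, rna]))
            n_written += 1
    return n_written

# Throwaway case with one rna embedding (length 3) and two image embeddings (length 4)
with tempfile.TemporaryDirectory() as dataset:
    case_dir = os.path.join(dataset, "case_a")
    os.makedirs(case_dir)
    np.save(os.path.join(case_dir, RNA_EMBEDDING), np.ones(3))
    for prefix in ("_11111111", "_22222222"):
        np.save(os.path.join(case_dir, prefix + IMAGE_SUFFIX), np.zeros(4))
    n_written = concatenate_embeddings(dataset)
    example = np.load(os.path.join(case_dir, "_11111111" + CONCAT_SUFFIX))

print(n_written, example.shape)
```

The single rna embedding is reused for every image embedding in the case, so the number of concatenated files equals the number of image embeddings.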
V To ensure as many embeddings as possible are generated in UNIMODE for use in MULTIMODE training:
a during UNIMODE image training run, save training and test indices (as 'train_inds_image' and 'test_inds_image' respectively)
b during UNIMODE image test run, push saved TRAINING examples through the trained model ('_79636225_image_rna_matched___image.npy')
c during UNIMODE rna-seq training run, save training and test indices (as 'train_inds_image' and 'test_inds_image' respectively)
d during UNIMODE rna-seq test run, push saved TRAINING examples through the trained model ('_image_rna_matched___rna.npy')
e during MULTIMODE image_rna training run, use every embedding that was generated at b and d for MULTIMODE training
(there is no image_rna test run: there is no easy way to ensure image_rna test embeddings would correspond 1:1 with their associated image embeddings, so it's likely that such a 'test' set would be heavily polluted with image_rna training examples)
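Steps a and b above (save the indices during the training run, reload them in the later test run) can be sketched as follows. The use of pickle, the helper names save_split_indices and load_indices, and the shuffling scheme are assumptions made for illustration; the notes only give the names 'train_inds_image' and 'test_inds_image'.

```python
import os
import pickle
import random
import tempfile

def save_split_indices(n_examples, test_fraction, out_dir, seed=0):
    """Hypothetical helper: split dataset indices and save both halves."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(n_examples * test_fraction))   # always keep some test examples
    test_inds, train_inds = indices[:n_test], indices[n_test:]
    for name, inds in (("train_inds_image", train_inds), ("test_inds_image", test_inds)):
        with open(os.path.join(out_dir, name), "wb") as f:
            pickle.dump(inds, f)
    return train_inds, test_inds

def load_indices(out_dir, name):
    """Hypothetical helper: reload indices saved by save_split_indices."""
    with open(os.path.join(out_dir, name), "rb") as f:
        return pickle.load(f)

with tempfile.TemporaryDirectory() as out_dir:
    # training run: decide the split and save it
    train_inds, test_inds = save_split_indices(10, 0.2, out_dir)
    # later test run: reload the TRAINING indices
    reloaded_train = load_indices(out_dir, "train_inds_image")

# only the examples seen in training are pushed through the trained model
examples = [f"tile_{n}" for n in range(10)]
training_examples = [examples[i] for i in reloaded_train]
print(len(training_examples), len(test_inds))
```

Reloading the saved indices, rather than re-splitting, is what guarantees that the test-mode run sees exactly the examples that were used for training.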
Implementation Notes
====================
1 Change the Bash scripts to use keyword arguments rather than positional arguments <<< completed 19/10/20
preparatory, really just to make flag handling easier. should have done this before now TBH
2 Perform single mode training, to generate an optimised model for each of image and rna data modes <<< completed 19/10/20
./do_all stad image
./do_all stad rna
- change generate() to create distinct model files for each mode: model_image.pt and model_rna.pt
- test this change thoroughly before doing any further multi-modal enhancements
inputs: tiles and rna-seq data from matched cases
outputs: two trained models
3 Use the UNIMODE trained models to generate matched image+rna embeddings
./just_test -d stad -i image -m image_rna -c DESIGNATED_UNIMODE_CASES
./just_test -d stad -i rna -m image_rna -c DESIGNATED_UNIMODE_CASES
- test mode only
- repurpose the 'image_rna' flag so that it becomes a trigger to: <<< completed 24/10/20 (images)
(i) only use matched cases and
(ii) extract and save FC1 embeddings BACK INTO THE CASE DIRECTORIES that the inputs came from
- during generation(), only use matched cases (perform a spreadsheet lookup and 'skip' if not matched) <<< completed 19/10/20
- insert code into VGG11 to save FC1 embedding back to the corresponding case directory (saved in fnames) <<< completed 24/10/20
- VGGNN to return FC1 embedding (batch) as well as y2_hat
- but only during test mode, (using the last model saved during training)
- this is a variation on the way we currently use test_mode, where we use it to push an entire patch through the optimum model
- each embedding is the equivalent of a tile, so we will end up with as many embeddings in a case directory as there were tiles chosen from that same directory
- we need (and have) the 'fnames', because they tell us where (which case directory) to save each embedding to ( <<<<< this also applies to rna-seq )
- OTOH, can't use the fname to name the embeddings, since these are only unique per sample, not per tile/embedding. And also, we don't currently retain tile names during tile processing.
- therefore, use random integers in embedding file names, as follows '96369306_image_rna_matched___image.npy' to identify them as image embeddings
- recall that each batch contains BATCH_SIZE embeddings, each row of which is a distinct embedding to be saved <<< completed 24/11/20 (images)
- insert code into DENSE to save FC1 embedding back to the corresponding case directory (saved in fnames) <<< completed 20/11/20 (rna-seq)
- Make and store a softlink based on an integer reference to the case id for later use so that DENSE will later know where to save the rna-seq embeddings <<< completed 18/11/20 (rna-seq)
- insert code into DENSE FC1 to create embeddings (mini-batch at a time) <<< completed 20/11/20 (rna-seq)
- insert code into trainlenet5 to associate embeddings with corresponding case directory (saved in fnames) and save there <<< completed 20/11/20 (rna-seq)
- recall that each batch contains BATCH_SIZE embeddings, each row of which is a distinct embedding to be saved <<< completed 24/11/20 (rna-seq)
- name rna-seq embedding files as follows '_image_rna_matched___rna.npy' to identify them as rna-seq embeddings
- modify just_test.sh to delete image embedding files where input is image (since new ones will be created) <<< completed 20/11/20 (rna-seq)
- modify just_test.sh to delete rna embedding files where input is rna (since new ones will be created) <<< completed 20/11/20 (rna-seq)
- in the first instance, use ALL matched cases, LATER allow user to specify a particular number of cases
- in the first instance, just select one of the FC layer outputs to save for embeddings, LATER allow user to specify the layer to use via an environment variable
inputs: the two optimised models saved at step 2
outputs: embeddings for each of image and rna-seq, saved in the applicable case directories
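The per-row saving of a batch of FC1 image embeddings described in step 3 might look like the sketch below. NumPy arrays stand in for the model output, and a plain list of case directories stands in for what 'fnames' resolves to. The helper save_batch_embeddings, the 8-digit range and the leading underscore in the file name are assumptions; the notes show the name both with and without the leading underscore.

```python
import os
import random
import tempfile
import numpy as np

IMAGE_SUFFIX = "_image_rna_matched___image.npy"

def save_batch_embeddings(batch, case_dirs, rng):
    """Hypothetical helper: save each row of batch into its own case directory.

    batch     : array of shape (BATCH_SIZE, embedding_length)
    case_dirs : one case directory per row (stand-in for the 'fnames' lookup)
    """
    saved = []
    for row, case_dir in zip(batch, case_dirs):
        path = None
        # draw a random integer for the file name; redraw on the rare collision
        while path is None or os.path.exists(path):
            name = f"_{rng.randint(10_000_000, 99_999_999)}{IMAGE_SUFFIX}"
            path = os.path.join(case_dir, name)
        np.save(path, row)
        saved.append(path)
    return saved

with tempfile.TemporaryDirectory() as dataset:
    dirs = [os.path.join(dataset, case) for case in ("case_a", "case_b")]
    for d in dirs:
        os.makedirs(d)
    batch = np.arange(12, dtype=np.float32).reshape(4, 3)    # BATCH_SIZE=4, length-3 embeddings
    case_dirs = [dirs[0], dirs[0], dirs[1], dirs[1]]         # two tiles from each case
    saved_paths = save_batch_embeddings(batch, case_dirs, random.Random(0))
    reloaded = [np.load(p) for p in saved_paths]
    files_per_case = [len(os.listdir(d)) for d in dirs]

print(files_per_case)
```

Each case directory ends up with as many embedding files as tiles were drawn from it, matching the note that each embedding is the equivalent of a tile.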
4 Perform MULTIMODE training using the matched image+rna embeddings
- shell file modifications
./do_all.sh -d stad -i image_rna should delete existing concatenated embedding files ( *___image_rna.npy ) & perform generation <<< completed 21/11/20
./just_run.sh -d stad -i image_rna should assume the multimode .pt file already exists & skip generation
./only_run.sh -d stad -i image_rna should assume the multimode .pt file already exists & skip generation
- Notes: <<< review & planning 21/11/20
- never need to tile, since both the image and rna input files take the form of 1-D embedding vectors
- use input flag "-i image_rna" ( -i = args.input mode) to indicate that embeddings should be used as inputs rather than tiles or rna
- in the first (this) version, use ALL matched cases, LATER allow user to specify a particular number of cases
- mods to generate(): <<< completed 24/11/20
1 create the concatenated embedding vectors
- make the fqln links
- move the rna fqln code to the top of the function and allow it to do BOTH rna and image_rna
- work through each case and locate ones that have both image and rna files <<< completed 21/11/20
- if found, concatenate and save back to same directory
- use a file name based on the image embedding files, viz: '96369306_image_rna_matched___image_rna.npy'
- then continue processing of this case identically to rna processing
2 create and save the pytorch data file (.pth) in the identical manner as currently used for both image unimode and rna unimode
- the image_rna pt file will need just 'new_image_rna', 'new_label', and 'new_fname'
- save the concatenated embeddings file to 'dpcca/data/dlbcl_image/train.pth', overwriting any existing pytorch input file
- should it later prove necessary to keep existing pth files, the new name for the concatenated embeddings file can be 'dpcca/data/dlbcl_image/train_image_rna.pth' <<< don't implement
- mods to trainlenet5():
- process identically to rna-seq unimode.
- if ( args.input_mode=='rna') | ( args.input_mode=='image_rna') :
- save the optimised model as model_image_rna.pt, paralleling the names currently used for image and rna unimode model files
inputs: the embeddings files for each of image and rna-seq, saved in the applicable case directories (i.e. the output of step 3), viz:
- _<random_integer>_image_rna_matched___image.npy
- _image_rna_matched___rna.npy
outputs: optimised multimodal model, 'model_image_rna.pt'; tensorflow curves
5 Perform dual-mode testing using the optimised model <<< notionally completed 25/11/20
<<< finally completed 08/11/20
./just_test stad image_rna
- make sure to avoid flag confusion with 3 above
- push some or all held-out multimode embeddings through the optimised model
inputs: optimised multimodal model, 'model_image_rna.pt'
outputs: classifications
6 Regression testing to make sure the other modes still work <<< completed 08/11/20