"""=============================================================================
Code to set up the master mapping file from TCGA project file containing clinical data which is applicable to the cancer type
For example,the project file for TCGA STAD is "nationwidechildrens.org_clinical_patient_stad.txt"
NOTES:
=====
1 The TCGA project file must be MANUALLY edited as follows prior to running 'create_master_mapping file'. Otherwise nothing mentioned here will work.
a) insert a new column with heading 'type_n'. This column will hold the class labels, which the user will manually enter
b) insert the class for each case (based on the descriptions provided in other columns, e.g. "histologic diagnosis" in the case of STAD)
- classes must be numbers, starting at zero, and without gaps (e.g. 0,3,5,6 is no good)
It's not possible to generate the class labels automatically, because the text descriptions tend to be at least a little ambiguous, and often very ambiguous and overlapping
2 This module (create_master_mapping_file.py) will use the manually edited file described above as it's input. It will not work unless it (i) exists and (ii) has been edited exactly as per 1. above
3 This module (create_master_mapping_file.py) will do the following:
a) delete any existing custom mapping files which may exist in the applicable global data directory (e.g. "stad_global"), since these would otherwise be confusing orhpans
b) convert all text in the 'mapping_file_master' to lower case
c) add two new columns to the immediate right of the "type_n" column, as follows: "have_wsi", "have_rna", with the following meanings
"have_wsi" = the case exists in the master dataset directory (e.g. "stad_global") and it contains at least one Whole Slide Image file
"have_rna" = the case exists in the master dataset directory (e.g. "stad_global") and it contains at least one rna_seq file
d) scan the master dataset directory (i.e. "xxxx_global") and populate each of these two columns with a number specifying the number of WSI/rna-seq files that currently exist in the master dataset directory
e) delete any cases (subdirectories) which may exist in the master dataset directory but which have '0' in both the "have_wsi" column and the "have_rna"
such cases are useless because we don't have class information for them. In theory there shouldn't be any, but in practice TCGA does contain at least some sample data that is not listed in the project spreadsheet, and it's easier to just delete them that to cater for them in downstream code
f) output a file called 'xxxx_master_mapping_file' to the applicable master dataset directory, where xxxx is the TCGA abbreviation for the applicable cancer (e.g. 'stad_master_mapping_file')
4 The output file (e.g. 'stad_master_mapping_file') is the default mapping file used by the expertiment platform for classification experiments. It has no file extension.
  vvvvvvvvvvvvvvvvvvvvvvvvv  NOT IMPLEMENTED FROM HERE ON  vvvvvvvvvvvvvvvvvvvvvvvvvvvvv
5  However, a second module - "customise_mapping_file.py" - can optionally be used to generate a custom mapping file, which may alternatively be used for an experiment job
6  The following kinds of customisation may be used, either alone or in combination, with "customise_mapping_file.py" to define a custom mapping file which:
     a) removes classes which exist in very small numbers (if requested, this is done first)
     b) defines a dataset comprising:
          (i)  ALL, or a specified number of, just image files OR just rna_seq files OR just matched image + rna_seq files, AND
          (ii) optionally specifies that the dataset must be balanced (as defined by applicable user parameters)
7  The output of "customise_mapping_file.py" will be a new file in the master dataset directory with a readable name indicating the nature of the customisation, of the form
     mapping_file_custom_stad_<not_balanced|balanced>_<image|rna|matched>_<nnn|all>, interpretable as per the following examples:
          mapping_file_custom_stad_not_balanced_image_all    --- includes every case which has an image; no attempt to balance classes
          mapping_file_custom_stad_balanced_rna_all          --- includes every rna file, consistent with the classes being balanced
          mapping_file_custom_stad_not_balanced_matched_all  --- includes every case which has matched image and rna_seq data
          mapping_file_custom_stad_balanced_image_100        --- includes 100 image cases, consistent with the classes being balanced  << if there aren't 100, it will give a warning, use the maximum available and name the file accordingly
8  note the following:
     a) neither 'create_master_mapping_file.py' nor 'customise_mapping_file.py' will change the contents of the master dataset directory (i.e. "xxxx_global") other than to delete directories that don't exist in the applicable project file
     b) downstream code should use the contents of the applicable mapping file to generate a pytorch dataset which corresponds to the mapping file
     c) downstream code should not
          (i)  delete cases (subdirectories) from the master dataset directory
          (ii) delete cases (subdirectories) from the working dataset directory (otherwise a new copy would have to be made for every experiment, and the copy is very time-consuming)
9  custom mapping files are used by 'generate()' in the following fashion:
     1) each time generate() traverses a new case (subdirectory), it checks to see if the case is listed in the currently selected custom mapping file
          a) if it is,     it uses the files in the directory, in accordance with the 'INPUT_MODE' flag ('image', 'rna', 'image_rna')
          b) if it is not, it skips the directory
     2) it accomplishes this by asking a helper function, check_mapping_file( <image | rna | image_rna> ), which returns either 'True' (use it) or 'False' (skip it); a hypothetical sketch of such a helper appears just before main() below
     this somewhat convoluted method is used to avoid having to re-generate the working dataset (a time consuming process) each time a different custom mapping file is selected by the user
10 user notes:
     a) if a custom file is to be used, it must (i) exist in the applicable master dataset directory (e.g. "stad_global") and (ii) be specified at MAPPING_FILE_NAME in variables.sh (e.g. mapping_file_custom_stad_not_balanced_image_all)
     b) if MAPPING_FILE_NAME is not specified, the applicable master mapping file will be used (e.g. stad_master_mapping_file). Again:
          (i)  it must exist (and it will only exist if 'create_master_mapping_file.py' has been run)
          (ii) it must be in the applicable master dataset directory (e.g. "stad_global") ('create_master_mapping_file.py' will take care of this)
     c) if a custom mapping file is specified and there is no master mapping file, the job will still run, but this is bad practice as the custom mapping file will have an unknown provenance
============================================================================="""
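
# ------------------------------------------------------------------------------
# Illustrative example only (hypothetical rows, not real TCGA data). Note 1 in
# the docstring above requires a manually added 'type_n' column whose numeric
# class labels start at zero and have no gaps; the 'image' and 'rna_seq' count
# columns shown here are the ones inserted by main() below once it has scanned
# the master dataset directory. Column names other than 'type_n' vary by cancer
# type, so treat this layout as a sketch rather than the exact STAD format:
#
#   bcr_patient_uuid    histologic_diagnosis            type_n   image   rna_seq
#   <case uuid 1>       stomach adenocarcinoma (nos)    0        1       1
#   <case uuid 2>       intestinal adenocarcinoma       1        2       0
#   <case uuid 3>       diffuse adenocarcinoma          2        0       1
# ------------------------------------------------------------------------------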
import os
import re
import sys
import glob
import math
import time
import pprint
import argparse
import numpy as np
import pandas as pd
from tabulate import tabulate
pd.set_option('display.max_colwidth', 50)
#===========================================
np.set_printoptions(edgeitems=500)
np.set_printoptions(linewidth=400)
pd.set_option('display.max_rows', 50 )
pd.set_option('display.max_columns', 13 )
pd.set_option('display.width', 300 )
pd.set_option('display.max_colwidth', 99 )
# ------------------------------------------------------------------------------
from classi.constants import *
DEBUG = 1
# ------------------------------------------------------------------------------
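
# ------------------------------------------------------------------------------
# Hypothetical sketch only - it is not called anywhere in this module. Note 9 in
# the docstring describes a helper, check_mapping_file(), which generate() is
# expected to use to decide whether a case directory should be included for the
# current INPUT_MODE. The minimal version below assumes the mapping file written
# by this module, with the case id in the second column and the 'image' /
# 'rna_seq' count columns inserted by main(); the name, signature and column
# positions are assumptions, not the actual implementation.
def check_mapping_file_sketch( mapping_df, case_id, input_mode ):

    row = mapping_df[ mapping_df.iloc[:, 1] == case_id ]              # locate this case in the mapping file
    if row.empty:
        return False                                                  # case not listed -> skip the directory

    has_image = int( row['image'  ].iloc[0] ) > 0
    has_rna   = int( row['rna_seq'].iloc[0] ) > 0

    if input_mode == 'image':
        return has_image
    if input_mode == 'rna':
        return has_rna
    return has_image and has_rna                                      # 'image_rna' (matched) mode
# ------------------------------------------------------------------------------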

def main(args):

    now = time.localtime(time.time())
    print(time.strftime(f"\nCREATE_MASTER: INFO: {MIKADO}%Y-%m-%d %H:%M:%S %Z{RESET}", now))
    start_time = time.time()

    base_dir                 = args.base_dir
    data_source              = args.data_source
    global_data_dir          = args.global_data_dir
    dataset                  = args.dataset
    case_column              = args.case_column
    class_column             = args.class_column
    tcga_rna_seq_file_suffix = args.tcga_rna_seq_file_suffix

    image_column   = 3                                                # column in which slide (WSI) counts will be recorded
    rna_seq_column = 4                                                # column in which rna-seq counts will be recorded

    # Global settings ----------------------------------------------------------------------------------------------------
    np.set_printoptions(formatter={'float': lambda x: "{:>7.3f}".format(x)})
    # pd.set_option('display.max_columns',    25)
    # pd.set_option('display.max_categories', 24)
    # pd.set_option('precision',              1)
    pd.set_option('display.min_rows', 8)
    pd.set_option('display.float_format', lambda x: '%6.2f' % x)
    np.set_printoptions(formatter={'float': lambda x: "{:>6.2f}".format(x)})

    cancer_class                          = dataset
    class_specific_global_data_location   = f"{global_data_dir}/{dataset}_global"
    class_specific_dataset_files_location = f"{data_source}/{dataset}"

    print(f"CREATE_MASTER: INFO: cancer class (from TCGA master spreadsheet as edited) = {CYAN}{cancer_class}{RESET}")
    print(f"CREATE_MASTER: INFO: class_specific_global_data_location = {CYAN}{class_specific_global_data_location}{RESET}")
    print(f"CREATE_MASTER: INFO: class_specific_dataset_files_location = {CYAN}{class_specific_dataset_files_location}{RESET}")

    if not os.path.isdir(class_specific_global_data_location):
        print(f"{RED}CREATE_MASTER: FATAL: the expected global data sub-directory for cancer project '{MAGENTA}{cancer_class}{RESET}{RED}', namely '{MAGENTA}{class_specific_global_data_location}{RESET}{RED}', does not exist.{RESET}")
        print(f"{RED}CREATE_MASTER: FATAL: remedy: (i) create a directory under '{MAGENTA}pipeline{RESET}{RED}' with name '{MAGENTA}{dataset}_global{RESET}{RED}'{RESET}")
        print(f"{RED}CREATE_MASTER: FATAL: remedy: (ii) place a copy of the TCGA master clinical data spreadsheet applicable to '{MAGENTA}{cancer_class}{RESET}{RED}' in this directory{RESET}")
        print(f"{RED}CREATE_MASTER: FATAL: remedy: the TCGA master clinical spreadsheet for '{MAGENTA}{cancer_class}{RESET}{RED}' will have a filename similar to '{CYAN}nationwidechildrens.org_clinical_patient_{dataset}.csv{RESET}{RED}'{RESET}")
        print(f"{RED}CREATE_MASTER: FATAL: remedy: TCGA master clinical data spreadsheets can be found at the NIH GDC data repository: '{CYAN}https://portal.gdc.cancer.gov/repository{RESET}{RED}' (clickable link){RESET}")
        print(f"{RED}CREATE_MASTER: FATAL: remedy: local 'convenience copies' of TCGA master clinical data spreadsheets are stored in '{CYAN}all_tcga_project_level_files/{RESET}{RED}', however it is preferable to download a fresh copy from the GDC in case there have been changes{RESET}")
        print(f"{RED}CREATE_MASTER: FATAL: remedy: (iii) instructions on how to manually adjust the master spreadsheet can be found in the comments section of this ({MAGENTA}create_master_mapping_file.py{RESET}{RED}) module{RESET}")
        print(f"{RED}CREATE_MASTER: FATAL: remedy: the adjustments are mandatory. Experiments cannot work unless they are made{RESET}")
        print(f"{RED}CREATE_MASTER: FATAL: cannot continue - halting now{RESET}")
        sys.exit(0)

    master_spreadsheet_found = False

    print(f"CREATE_MASTER: INFO: now looking for the {CYAN}{dataset}{RESET} master clinical data spreadsheet, which is assumed to be the only file in '{MAGENTA}{dataset}_global{RESET}' ending with '{MAGENTA}_{dataset}.csv{RESET}'")

    for f in os.listdir(class_specific_global_data_location):
        if f.endswith(f"_{dataset}.csv"):                             # we can't be sure of the exact name, but we know it must end like this
            master_spreadsheet_found = True
            master_spreadsheet       = f
            print(f"CREATE_MASTER: INFO: proceeding with master spreadsheet '{MAGENTA}{master_spreadsheet}{RESET}'")
            break

    if not master_spreadsheet_found:
        print(f"{RED}CREATE_MASTER: FATAL: could not find the '{MAGENTA}{cancer_class}{RESET}{RED}' master spreadsheet in {MAGENTA}{class_specific_global_data_location}{RESET}")
        print(f"{RED}CREATE_MASTER: FATAL: remedy: ensure there's a valid master spreadsheet with the extension {CYAN}.csv{RESET}{RED} in {MAGENTA}{class_specific_global_data_location}{RESET}")
        print(f"{RED}CREATE_MASTER: FATAL: instructions on how to construct a master spreadsheet can be found in the comments of this ({MAGENTA}create_master_mapping_file.py{RESET}{RED}) module{RESET}")
        print(f"{RED}CREATE_MASTER: FATAL: cannot continue - halting now{RESET}")
        sys.exit(0)
    else:
        print(f"CREATE_MASTER: INFO: have now found the {CYAN}{dataset}{RESET} master clinical data spreadsheet, which has the name '{MAGENTA}{master_spreadsheet}{RESET}'")

    fqn = f"{class_specific_global_data_location}/{master_spreadsheet}"

    if DEBUG > 0:
        print(f"CREATE_MASTER: INFO: about to open: '{MAGENTA}{fqn}{RESET}'")

    try:
        df = pd.read_csv(f"{fqn}", sep=',')
    except Exception as e:
        print(f"{RED}CREATE_MASTER: FATAL: '{e}'{RESET}")
        print(f"{RED}CREATE_MASTER: FATAL: explanation: the master spreadsheet {MAGENTA}{master_spreadsheet}{RESET}{RED} could not be read from {MAGENTA}{class_specific_global_data_location}{RESET}")
        print(f"{RED}CREATE_MASTER: FATAL: remedy: ensure there's a valid, comma-separated master spreadsheet ending '{MAGENTA}_{dataset}.csv{RESET}{RED}' in {MAGENTA}{class_specific_global_data_location}{RESET}")
        print(f"{RED}CREATE_MASTER: FATAL: cannot continue - halting now{RESET}")
        sys.exit(0)

    if DEBUG > 0:
        print(f"CREATE_MASTER: INFO: df.shape = {CYAN}{df.shape}{RESET}", flush=True)
    if DEBUG > 9:
        print(f"CREATE_MASTER: INFO: pandas description of df: \n{CYAN}{df.describe()}{RESET}", flush=True)
    if DEBUG > 99:
        print(f"CREATE_MASTER: INFO: start of df: \n{CYAN}{df.iloc[:,1]}{RESET}", flush=True)
    if DEBUG > 99:
        print(tabulate(df, tablefmt='psql'))

    df.insert(loc=3, column='image', value='')                        # insert new column to hold image counts
    df.iloc[0, 3]  = 'image'
    df.iloc[1, 3]  = 'image'
    df.iloc[2:, 3] = 0

    df.insert(loc=4, column='rna_seq', value='')                      # insert new column to hold rna_seq counts
    df.iloc[0, 4]  = 'rna_seq'
    df.iloc[1, 4]  = 'rna_seq'
    df.iloc[2:, 4] = 0

    df = df.fillna('').astype(str).apply(lambda x: x.str.lower())     # convert all text to lower case

    found_cases                   = 0
    found_clone_directories       = 0
    found_non_clone_directories   = 0
    global_found_slide_file       = 0
    global_found_rna_seq_file     = 0
    global_found_file_of_interest = 0
    global_other_files_found      = 0
    matched_cases_count           = 0

    for i in range(2, len(df)):                                       # for each case (row) listed in the master spreadsheet

        case = df.iloc[i, 1]
        fqn  = f"{class_specific_dataset_files_location}/{case}"
        found_cases += 1

        if DEBUG > 99:
            print(fqn)

        matches = glob.glob(f"{fqn}*")                                # picks up the extra directories for the cases where there is more than one slide file. These have the extension "_<n>"

        if len(matches) > 0:

            if DEBUG > 9:
                print(f"{BLEU}{matches}{RESET}")

            clone_found_slide_file       = 0                          # total for all clone directories. this is the value we record in the master spreadsheet
            clone_found_rna_seq_file     = 0                          # total for all clone directories. this is the value we record in the master spreadsheet
            clone_found_file_of_interest = 0                          # total for all clone directories. this is the value we record in the master spreadsheet

            if os.path.isdir(matches[0]):
                found_non_clone_directories += 1

            for j in range(0, len(matches)):

                found_slide_file       = 0                            # count for this clone directory only
                found_rna_seq_file     = 0                            # count for this clone directory only
                found_file_of_interest = 0                            # count for this clone directory only

                if DEBUG > 9:
                    print(f"{ARYLIDE}{matches[j]}{RESET}")

                if os.path.isdir(matches[j]):

                    if DEBUG > 0:
                        print(f"CREATE_MASTER: INFO: {GREEN}directory {CYAN}{matches[j]}{RESET}{GREEN} exists{RESET}")
                    if DEBUG > 9:
                        print(f"CREATE_MASTER: INFO: directory {CYAN}{matches[j]}{RESET}")

                    found_clone_directories += 1

                    for f in os.listdir(matches[j]):                  # for each file in this clone directory

                        if f.endswith(".svs") or f.endswith(".SVS") or f.endswith(".tif") or f.endswith(".TIF"):
                            found_slide_file              += 1
                            clone_found_slide_file        += 1
                            global_found_slide_file       += 1
                            found_file_of_interest        += 1
                            clone_found_file_of_interest  += 1
                            global_found_file_of_interest += 1
                            if DEBUG > 0:
                                print(f"CREATE_MASTER: INFO: in this dir: found slide file {CARRIBEAN_GREEN}{f}{RESET} number found = {CARRIBEAN_GREEN}{found_slide_file}{RESET}")
                            if DEBUG > 11:
                                print(f"CREATE_MASTER: INFO: {CARRIBEAN_GREEN}{df.iloc[i, clone_found_slide_file]}{RESET}")
                            df.iloc[i, image_column] = clone_found_slide_file

                        elif f.endswith(tcga_rna_seq_file_suffix[1:]):
                            found_rna_seq_file            += 1
                            clone_found_rna_seq_file      += 1
                            global_found_rna_seq_file     += 1
                            found_file_of_interest        += 1
                            clone_found_file_of_interest  += 1
                            global_found_file_of_interest += 1
                            if DEBUG > 0:
                                print(f"CREATE_MASTER: INFO: in this dir: found rna-seq file {BITTER_SWEET}{f}{RESET} number found = {BITTER_SWEET}{found_rna_seq_file}{RESET}")
                            if DEBUG > 11:
                                print(f"CREATE_MASTER: INFO: in this dir: {BITTER_SWEET}{df.iloc[i, found_rna_seq_file]}{RESET}")
                            df.iloc[i, rna_seq_column] = clone_found_rna_seq_file

                        else:
                            global_other_files_found += 1

                    if found_slide_file > 0 and found_rna_seq_file > 0:
                        matched_cases_count += 1
                        if DEBUG > 0:
                            print(f"CREATE_MASTER: INFO: {MAGENTA}matched files{RESET}")

                    if DEBUG > 0:
                        if found_slide_file > 1:
                            print(f"CREATE_MASTER: INFO: {BLEU}multiple ({MIKADO}{found_slide_file}{RESET}) slide files exist in directory {CYAN}{matches[j]}{RESET}")
                    if DEBUG > 0:
                        if found_rna_seq_file > 1:
                            print(f"CREATE_MASTER: INFO: {ORANGE}multiple ({MIKADO}{found_rna_seq_file}{RESET}{ORANGE}) rna-seq files exist in directory {CYAN}{matches[j]}{RESET}")
                    if DEBUG > 0:
                        if found_file_of_interest == 0:
                            print(f"CREATE_MASTER: INFO: {MAGENTA}no files of interest in directory {CYAN}{matches[j]}{RESET}")

                    if DEBUG > 9:
                        print(f"CREATE_MASTER: INFO: clone dirs: found slide files {CARRIBEAN_GREEN}{clone_found_slide_file}{RESET}")
                        print(f"CREATE_MASTER: INFO: clone dirs: found rna-seq files {BITTER_SWEET}{clone_found_rna_seq_file}{RESET}")
                    if DEBUG > 11:
                        print(f"CREATE_MASTER: INFO: clone dirs: totals: {BITTER_SWEET}{df.iloc[i, clone_found_rna_seq_file]}{RESET}")

                else:
                    print(f"CREATE_MASTER: INFO: {RED}directory {CYAN}{matches[j]}{RESET}{RED} does not exist{RESET}")

    # (2) Cross check files in dataset against the applicable master spreadsheet

    actual_dirs = -1                                                  # start at -1 so that we don't count the root directory, only subdirectories

    if DEBUG > 0:
        print(f"\nCREATE_MASTER: INFO: about to scan {CYAN}{class_specific_dataset_files_location}{RESET} to ensure all cases stored locally are also listed in the '{MAGENTA}{cancer_class}{RESET}' clinical master spreadsheet ('{CYAN}{master_spreadsheet}{RESET}'){RESET}")

    for _, d, f in os.walk(class_specific_dataset_files_location):

        actual_dirs += 1

        for el in enumerate(d):

            if DEBUG > 0:
                print(f"CREATE_MASTER: INFO: {PINK}now considering directory {CYAN}{el[1]}{RESET}")
            if DEBUG > 100:
                print(f"{PINK}length is {MIKADO}{len(case)}{RESET}{RESET}")

            case_found_in_spreadsheet = False

            if re.search("_[0-9]", el[1]):                            # cases which have more than one RNA-seq example. These have the extension _1, _2 etc. Only cater for up to _9 because we have never seen one with more than two
                if DEBUG > 9:
                    print((el[1])[:-2])
            else:
                pass

            for c in range(2, len(df)):
                case = df.iloc[c, 1]
                if DEBUG > 99:
                    print(f"{BLEU}el {MIKADO}{el[1]}{RESET} {BLEU} against case {CYAN}{case}{RESET}")
                if el[1] == case:
                    case_found_in_spreadsheet = True
                if re.search("_[0-9]", el[1]):
                    if (el[1])[:-2] == case:
                        case_found_in_spreadsheet = True

            if case_found_in_spreadsheet:
                if DEBUG > 0:
                    print(f"CREATE_MASTER: INFO: {GREEN}directory (case) {CYAN}{el[1]}{RESET}{GREEN} \r\033[98C(or its root if applicable) is listed in master clinical spreadsheet{RESET}")
                else:
                    pass
            else:
                if DEBUG > 0:
                    print(f"CREATE_MASTER: INFO: {ORANGE}directory (case) {CYAN}{el[1]}{RESET}{ORANGE}\r\033[98C(or its root if applicable) is not listed in master clinical spreadsheet\r\033[200C <<<<< notional anomaly, but no action will be taken{RESET}")

    # (3) show some useful stats

    offset = 176
    print(f"\n")
    print(f"CREATE_MASTER: INFO: total cases listed in TCGA {CYAN}{cancer_class}_global{RESET} master spreadsheet ('{CYAN}{master_spreadsheet}{RESET}') as edited: \r\033[{offset}Cfound cases = {MIKADO}{found_cases}{RESET}")
    print(f"CREATE_MASTER: INFO: total directories (exc. clones) found in class specific dataset files location '{CYAN}{class_specific_dataset_files_location}{RESET}': \r\033[{offset}Cfound (non_clone) directories = {MIKADO}{found_non_clone_directories}{RESET}")
    print(f"CREATE_MASTER: INFO: {ITALICS}hence{RESET} total cases in master spreadsheet that don't exist in the local dataset: \r\033[{offset}Cfound_cases {BLEU}minus{RESET} found (non_clone) directories = {GREEN if found_cases-found_non_clone_directories==0 else RED}{found_cases-found_non_clone_directories:3d}{RESET}", end="")
    if not found_cases - found_non_clone_directories == 0:
        print(f"\r\033[235C{RED} <<<<< this many cases don't exist in the class specific dataset files location{RESET}")
    else:
        print("")
    print(f"CREATE_MASTER: INFO: total examples (clone directories) found in class specific dataset files location '{CYAN}{class_specific_dataset_files_location}{RESET}': \r\033[{offset}Cfound_clone_directories = {MIKADO}{found_clone_directories}{RESET}")
    print(f"CREATE_MASTER: INFO: total clone directories in class specific dataset files location '{CYAN}{class_specific_dataset_files_location}{RESET}': \r\033[{offset}Cactual_dirs = {MIKADO}{actual_dirs}{RESET}")
    print(f"CREATE_MASTER: INFO: {ITALICS}hence{RESET} directories in class specific dataset files location that don't correspond to a case in the master spreadsheet{RESET}: \r\033[{offset}Cactual_dirs {BLEU}minus{RESET} found_clone_directories = {GREEN if actual_dirs - found_clone_directories==0 else RED}{actual_dirs - found_clone_directories:2d}{RESET}", end="")
    if not actual_dirs - found_clone_directories == 0:
        print(f"\r\033[225C{RED} <<<<< anomaly - not listed in spreadsheet{RESET}")
    else:
        print("")
    print(f"CREATE_MASTER: INFO: total {DIM_WHITE}files of no interest{RESET} actually found = {DIM_WHITE}{global_other_files_found}{RESET}")
    print(f"CREATE_MASTER: INFO: total {DIM_WHITE}files of interest{RESET} actually found = {DIM_WHITE}{global_found_file_of_interest}{RESET}")
    print(f"CREATE_MASTER: INFO: total {CARRIBEAN_GREEN}slide{RESET} files actually found = {CARRIBEAN_GREEN}{global_found_slide_file}{RESET}")
    print(f"CREATE_MASTER: INFO: total {BITTER_SWEET}rna-seq{RESET} files actually found = {BITTER_SWEET}{global_found_rna_seq_file}{RESET}")
    print(f"CREATE_MASTER: INFO: total {BLEU}matched{RESET} cases = {BLEU}{matched_cases_count}{RESET}")

    save_file_name = f"{class_specific_global_data_location}/{dataset}_mapping_file_MASTER.csv"

    if DEBUG > 0:
        print(f"\nCREATE_MASTER: INFO: about to save: {MAGENTA}{save_file_name}{RESET}")

    try:
        df.to_csv(save_file_name, sep=',', index=False)
    except Exception as e:
        print(f"{RED}CREATE_MASTER: FATAL: '{e}'{RESET}")
        print(f"{RED}CREATE_MASTER: FATAL: could not write {MAGENTA}{save_file_name}{RESET}{RED} to the master dataset directory ({MAGENTA}{class_specific_global_data_location}{RESET}{RED}){RESET}")
        print(f"{RED}CREATE_MASTER: FATAL: cannot continue - halting now{RESET}")
        sys.exit(0)

    print(f"\nCREATE_MASTER: INFO: {MIKADO}finished{RESET}")

    hours   = round((time.time() - start_time) / 3600, 1)
    minutes = round((time.time() - start_time) / 60,   1)
    seconds = round((time.time() - start_time),        0)
    # pprint.log_section('Job complete in {:} mins'.format(minutes))

    print(f'CREATE_MASTER: INFO: took {MIKADO}{minutes}{RESET} mins ({MIKADO}{seconds:.1f}{RESET} secs)')

if __name__ == '__main__':

    p = argparse.ArgumentParser()

    p.add_argument('--base_dir',                 type=str, default="/home/peter/git/pipeline")
    p.add_argument('--data_source',              type=str)
    p.add_argument('--data_dir',                 type=str)
    p.add_argument('--global_data_dir',          type=str)
    p.add_argument('--dataset',                  type=str, required=True)
    p.add_argument('--mapping_file_name',        type=str)
    p.add_argument('--case_column',              type=str, default="bcr_patient_uuid")
    p.add_argument('--class_column',             type=str, default="type_n")
    p.add_argument('--tcga_rna_seq_file_suffix', type=str, default='*star_gene_counts.tsv')

    args, _ = p.parse_known_args()

    main(args)
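
# ------------------------------------------------------------------------------
# Example invocation (a sketch: the paths are placeholders, and in practice the
# arguments are normally supplied by the pipeline's shell scripts via
# variables.sh rather than typed by hand):
#
#   python create_master_mapping_file.py                               \
#          --dataset          stad                                     \
#          --data_source      <path to the dataset source directories> \
#          --global_data_dir  <path containing stad_global>
#
# With these arguments the module expects to find the manually edited clinical
# spreadsheet (a file ending '_stad.csv') in <global_data_dir>/stad_global and,
# on success, writes <global_data_dir>/stad_global/stad_mapping_file_MASTER.csv
# ------------------------------------------------------------------------------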