Commit

Merge pull request #36 from AuReMe/mpwt_0.5.3
mpwt 0.5.3
ArnaudBelcour authored Jan 9, 2020
2 parents 5507079 + 9896697 commit e35b2b7
Showing 11 changed files with 365 additions and 132 deletions.
54 changes: 46 additions & 8 deletions README.rst
@@ -151,7 +151,7 @@ PF file example:
INTRON START1-STOP1
//
Look at the `Pathologic format <http://bioinformatics.ai.sri.com/ptools/tpal.pf/>`__ for more informations.
Look at the `Pathologic format <http://bioinformatics.ai.sri.com/ptools/tpal.pf>`__ for more informations.

You have to provide one nucleotide sequence for each pathologic containing one scaffold/contig.

@@ -262,7 +262,7 @@ mpwt can be used with the command line:

.. code:: sh
mpwt -f path/to/folder/input [-o path/to/folder/output] [--patho] [--hf] [--dat] [--md] [--cpu INT] [-r] [--clean] [--log path/to/folder/log] [--ignore-error] [-v]
mpwt -f path/to/folder/input [-o path/to/folder/output] [--patho] [--hf] [--op] [--nc] [--dat] [--md] [--cpu INT] [-r] [--clean] [--log path/to/folder/log] [--ignore-error] [-v]
Optional argument are identified by [].

@@ -279,6 +279,8 @@ mpwt can be used in a python script with an import:
output_folder=folder_output,
patho_inference=optional_boolean,
patho_hole_filler=optional_boolean,
patho_operon_predictor=optional_boolean,
no_download_articles=optional_boolean,
dat_creation=optional_boolean,
dat_extraction=optional_boolean,
size_reduction=optional_boolean,
@@ -291,13 +293,17 @@ mpwt can be used in a python script with an import:
+-------------------------+------------------------------------------------+-------------------------------------------------------------------------+
| Command line argument | Python argument | description |
+=========================+================================================+=========================================================================+
| -f | input_folder(string: folder pathname) | input folder as described in Input data |
| -f | input_folder(string: folder pathname) | Input folder as described in Input data |
+-------------------------+------------------------------------------------+-------------------------------------------------------------------------+
| -o | output_folder(string: folder pathname) | output folder containing PGDB data or dat files (see --dat arguments) |
| -o | output_folder(string: folder pathname) | Output folder containing PGDB data or dat files (see --dat arguments) |
+-------------------------+------------------------------------------------+-------------------------------------------------------------------------+
| --patho | patho_inference(boolean) | launch PathoLogic inference on input folder |
| --patho | patho_inference(boolean) | Launch PathoLogic inference on input folder |
+-------------------------+------------------------------------------------+-------------------------------------------------------------------------+
| --hf | patho_hole_filler(boolean) | launch PathoLogic Hole Filler with Blast |
| --hf | patho_hole_filler(boolean) | Launch PathoLogic Hole Filler with Blast |
+-------------------------+------------------------------------------------+-------------------------------------------------------------------------+
| --op | patho_operon_predictor(boolean) | Launch PathoLogic Operon Predictor |
+-------------------------+------------------------------------------------+-------------------------------------------------------------------------+
| --nc | no_download_articles(boolean) | Launch PathoLogic without loading PubMed citations |
+-------------------------+------------------------------------------------+-------------------------------------------------------------------------+
| --dat | dat_creation(boolean) | Create BioPAX/attribute-value dat files |
+-------------------------+------------------------------------------------+-------------------------------------------------------------------------+
@@ -320,6 +326,21 @@ mpwt can be used in a python script with an import:
| -v | verbose(boolean) | Print some information about the processing of mpwt |
+-------------------------+------------------------------------------------+-------------------------------------------------------------------------+

There is also another argument:

.. code:: sh
mpwt topf -f input_folder -o output_folder -c cpu_number
.. code:: python
import mpwt
mpwt.create_pathologic_file(input_folder, output_folder, cpu_number)
This argument reads the input data inside the input folder. Then it converts Genbank and GFF files into PathoLogic Format files. And if there is already PathoLogic files it copies them.

It can be used to avoid issues with parsing Genbank and GFF files. But it is an early Work in Progress.
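The behaviour described above (convert Genbank and GFF inputs, copy existing PathoLogic files) can be pictured as a per-species dispatch on file type. The following is only a rough sketch of that idea, not mpwt's implementation: the function name `plan_topf`, the action labels, and the `.gbk`/`.gff`/`.pf` extension mapping are assumptions made for illustration.

```python
import os

# Hypothetical extension-to-action mapping (an assumption for this sketch;
# the rules actually used by mpwt may differ).
ACTIONS = {'.gbk': 'convert_genbank', '.gff': 'convert_gff', '.pf': 'copy'}


def plan_topf(input_folder):
    """Map each species subfolder to the set of actions a topf-like step would take."""
    plan = {}
    for species in sorted(os.listdir(input_folder)):
        species_path = os.path.join(input_folder, species)
        # Only subfolders are treated as species; loose files are skipped.
        if not os.path.isdir(species_path):
            continue
        actions = set()
        for filename in os.listdir(species_path):
            extension = os.path.splitext(filename)[1].lower()
            if extension in ACTIONS:
                actions.add(ACTIONS[extension])
        plan[species] = actions
    return plan
```

A species folder holding only unrecognised files ends up with an empty set, which a caller could report before launching the real conversion.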

Examples
~~~~~~~~

@@ -350,20 +371,37 @@ Create PGDBs of studied organisms inside ptools-local:
mpwt.multiprocess_pwt(input_folder='path/to/folder/input',
patho_inference=True)
Create PGDBs of studied organisms inside ptools-local with the Hole-Filler:
Convert Genbank and GFF files into PathoLogic files then create PGDBs of studied organisms inside ptools-local:

..
.. code:: sh
mpwt topf -f path/to/folder/input -o path/to/folder/pf
mpwt -f path/to/folder/pf --patho
.. code:: python
import mpwt
mpwt.create_pathologic_file(input_folder='path/to/folder/input', output_folder='path/to/folder/pf')
mpwt.multiprocess_pwt(input_folder='path/to/folder/pf', patho_inference=True)
Create PGDBs of studied organisms inside ptools-local with Hole Filler, Operon Predictor and without loading PubMed citations (need Pathway Tools 23.5 or higher):

..
.. code:: sh
mpwt -f path/to/folder/input --patho --hf --log path/to/folder/log
mpwt -f path/to/folder/input --patho --hf --op --nc --log path/to/folder/log
.. code:: python
import mpwt
mpwt.multiprocess_pwt(input_folder='path/to/folder/input',
patho_inference=True,
patho_hole_filler=True,
patho_operon_predictor=True,
no_download_articles=True,
patho_log='path/to/folder/log')
Create PGDBs of studied organisms inside ptools-local and create dat files:
2 changes: 1 addition & 1 deletion mpwt/__init__.py
@@ -1,3 +1,3 @@
from mpwt.pwt_wrapper import run_pwt, run_pwt_dat
from mpwt.mpwt_workflow import multiprocess_pwt
from mpwt.utils import cleaning, cleaning_input, find_ptools_path, list_pgdb, remove_pgdbs
from mpwt.utils import cleaning, cleaning_input, create_pathologic_file, find_ptools_path, list_pgdb, pubmed_citations, remove_pgdbs
10 changes: 8 additions & 2 deletions mpwt/__main__.py
@@ -7,7 +7,7 @@
The script takes a folder name as argument.
usage:
mpwt -f=DIR [-o=DIR] [--patho] [--hf] [--dat] [--md] [--cpu=INT] [-r] [-v] [--clean] [--log=FOLDER] [--ignore-error] [--taxon-file]
mpwt -f=DIR [-o=DIR] [--patho] [--hf] [--op] [--nc] [--dat] [--md] [--cpu=INT] [-r] [-v] [--clean] [--log=FOLDER] [--ignore-error] [--taxon-file]
mpwt --dat [-f=DIR] [-o=DIR] [--md] [--cpu=INT] [-v]
mpwt -o=DIR [--md] [--cpu=INT] [-v]
mpwt --clean [--cpu=INT] [-v]
@@ -21,6 +21,8 @@
-o=DIR Output folder path. Will create a output folder in this folder.
--patho Will run an inference of Pathologic on the input files.
--hf Use with --patho. Run the Hole Filler using Blast.
--op Use with --patho. Run the Operon predictor of Pathway-Tools.
--nc Use with --patho. Turn off loading of Pubmed entries.
--dat Will create BioPAX/attribute-value dat files from PGDB.
--md Move only the dat files into the output folder.
--clean Clean ptools-local folder, before any other operations.
@@ -32,7 +34,7 @@
--ignore-error Ignore errors (PathoLogic and dat creation) and continue for successful builds.
--taxon-file For the use of the taxon_id.tsv file to find the taxon ID.
-v Verbose.
topf Will convert Genbank file into PathoLogic Format file.
topf Will convert Genbank and/or GFF files into PathoLogic Format file.
"""

@@ -60,6 +62,8 @@ def run_mpwt():
output_folder = args['-o']
patho_inference = args['--patho']
patho_hole_filler = args['--hf']
patho_operon_predictor = args['--op']
no_download_articles = args['--nc']
dat_creation = args['--dat']
move_dat = args['--md']
size_reduction = args['-r']
@@ -111,6 +115,8 @@ def run_mpwt():
output_folder=output_folder,
patho_inference=patho_inference,
patho_hole_filler=patho_hole_filler,
patho_operon_predictor=patho_operon_predictor,
no_download_articles=no_download_articles,
dat_creation=dat_creation,
dat_extraction=move_dat,
size_reduction=size_reduction,
70 changes: 53 additions & 17 deletions mpwt/mpwt_workflow.py
@@ -5,6 +5,7 @@
-check the results (results_check)
"""

import csv
import logging
import os
import shutil
@@ -14,7 +15,7 @@
from mpwt import utils
from mpwt.pwt_wrapper import run_pwt, run_pwt_dat, run_move_pgdb
from mpwt.results_check import check_dat, check_pwt, permission_change
from mpwt.pathologic_input import check_input_and_existing_pgdb, create_mpwt_input, pwt_input_files, create_only_dat_lisp, create_dat_creation_script
from mpwt.pathologic_input import check_input_and_existing_pgdb, create_mpwt_input, pwt_input_files, create_only_dat_lisp, create_dat_creation_script, read_taxon_id
from multiprocessing import Pool

logging.basicConfig(format='%(message)s', level=logging.CRITICAL)
@@ -23,21 +24,24 @@


def multiprocess_pwt(input_folder=None, output_folder=None, patho_inference=None,
patho_hole_filler=None, dat_creation=None, dat_extraction=None,
size_reduction=None, number_cpu=None, patho_log=None,
ignore_error=None, taxon_file=None, verbose=None):
patho_hole_filler=None, patho_operon_predictor=None, no_download_articles=None,
dat_creation=None, dat_extraction=None, size_reduction=None,
number_cpu=None, patho_log=None, ignore_error=None,
taxon_file=None, verbose=None):
"""
Function managing all the workflow (from the creatin of the input files to the results).
Use it when you import mpwt in a script.
Args:
input_folder (str): pathname to input folder
output_folder (str): pathname to output folder
patho_inference (bool): pathologic boolean (True/False)
patho_hole_filler (bool): pathologic hole filler boolean (True/False)
dat_creation (bool): BioPAX/attributes-values files creation boolean (True/False)
dat_extraction (bool): BioPAX/attributes-values files extraction boolean (True/False)
size_reduction (bool): Delete ptools-local data at the end boolean (True/False)
patho_inference (bool): PathoLogic inference (True/False)
patho_hole_filler (bool): PathoLogic hole filler (True/False)
patho_operon_predictor (bool): PathoLogic operon predictor (True/False)
no_download_articles (bool): turning off loading of PubMed citations (True/False)
dat_creation (bool): BioPAX/attributes-values files creation (True/False)
dat_extraction (bool): BioPAX/attributes-values files extraction (True/False)
size_reduction (bool): delete ptools-local data at the end (True/False)
number_cpu (int): number of CPU used (default=1)
patho_log (str): pathname to mpwt log folder
verbose (bool): verbose argument
@@ -69,8 +73,16 @@ def multiprocess_pwt(input_folder=None, output_folder=None, patho_inference=None
sys.exit('To use --ignore-error/ignore_error, you need to use the --patho/patho_inference argument.')

# Check if taxon_file is used with patho_inference.
if taxon_file and not patho_inference:
sys.exit('To use --taxon-file/taxon_file, you need to use the --patho/patho_inference argument.')
if (taxon_file and not patho_inference) and (taxon_file and not input_folder):
sys.exit('To use --taxon-file/taxon_file, you need to use the --patho/patho_inference argument. Or you can use it with the -f argument to create the taxon file from data.')

#Check if patho_operon_predictor is used with patho_inference.
if patho_operon_predictor and not patho_inference:
sys.exit('To use --op/patho_operon_predictor, you need to use the --patho/patho_inference argument.')

#Check if no_download_articles is used with patho_inference.
if no_download_articles and not patho_inference:
sys.exit('To use --nc/no_download_articles, you need to use the --patho/patho_inference argument.')

# Use the number of cpu given by the user or 1 CPU.
if number_cpu:
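
The checks in this hunk for `ignore_error`, `patho_operon_predictor` and `no_download_articles` follow one pattern: the option is rejected when `patho_inference` is not set. A table-driven sketch of that pattern is given below. The helper `missing_patho_dependency` is hypothetical and not part of mpwt; it returns the message instead of calling `sys.exit`, so it can be exercised without ending the interpreter.

```python
def missing_patho_dependency(patho_inference=False, ignore_error=False,
                             patho_operon_predictor=False,
                             no_download_articles=False):
    """Return the error message for the first option that needs patho_inference, or None."""
    # (label used in the message, value given by the caller); labels mirror the diff.
    dependent_options = [
        ('--ignore-error/ignore_error', ignore_error),
        ('--op/patho_operon_predictor', patho_operon_predictor),
        ('--nc/no_download_articles', no_download_articles),
    ]
    for option_name, option_value in dependent_options:
        if option_value and not patho_inference:
            return ('To use ' + option_name + ', you need to use the '
                    '--patho/patho_inference argument.')
    return None
```

A caller could then write `message = missing_patho_dependency(...)` followed by `if message: sys.exit(message)`, which keeps the exit behaviour shown in the diff.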
@@ -82,6 +94,23 @@ def multiprocess_pwt(input_folder=None, output_folder=None, patho_inference=None
number_cpu_to_use = 1
mpwt_pool = Pool(processes=number_cpu_to_use)

# Create taxon file in the input folder.
if taxon_file and input_folder and not patho_inference:
taxon_file_pathname = input_folder + '/taxon_id.tsv'
if os.path.exists(taxon_file_pathname):
sys.exit('taxon ID file (' + taxon_file_pathname + ') already exists.')
else:
taxon_ids = read_taxon_id(input_folder)
with open(taxon_file_pathname, 'w') as taxon_id_file:
taxon_id_writer = csv.writer(taxon_id_file, delimiter='\t')
taxon_id_writer.writerow(['species', 'taxon_id'])
for species, taxon_id in taxon_ids.items():
taxon_id_writer.writerow([species, taxon_id])

# Turn off loading of pubmed entries.
if no_download_articles:
utils.pubmed_citations(activate_citations=False)

# Check input folder and create input files for PathoLogic.
if input_folder:
run_ids = [folder_id for folder_id in next(os.walk(input_folder))[1]]
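
The taxon file creation added earlier in this hunk can be tried in isolation with a sketch like the one below. It assumes `taxon_ids` is a plain dict mapping species names to taxon IDs, which is what the `.items()` loop in the diff suggests. The function name `write_taxon_file`, the use of `os.path.join`, and raising `FileExistsError` instead of calling `sys.exit` are choices made for this sketch, not what the commit does.

```python
import csv
import os


def write_taxon_file(input_folder, taxon_ids):
    """Write taxon_id.tsv into input_folder and return its path; never overwrite."""
    taxon_file_pathname = os.path.join(input_folder, 'taxon_id.tsv')
    if os.path.exists(taxon_file_pathname):
        raise FileExistsError('taxon ID file (' + taxon_file_pathname + ') already exists.')
    # newline='' is the setting the csv module documentation recommends for writer files.
    with open(taxon_file_pathname, 'w', newline='') as taxon_id_file:
        taxon_id_writer = csv.writer(taxon_id_file, delimiter='\t')
        taxon_id_writer.writerow(['species', 'taxon_id'])
        for species, taxon_id in taxon_ids.items():
            taxon_id_writer.writerow([species, taxon_id])
    return taxon_file_pathname
```

On the command line, the combination this hunk enables would be `mpwt -f path/to/folder/input --taxon-file` without `--patho`, going by the new error message.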
@@ -95,8 +124,9 @@ def multiprocess_pwt(input_folder=None, output_folder=None, patho_inference=None
if run_patho_dat_ids:
# Create the list containing all the data used by the multiprocessing call.
multiprocess_inputs = create_mpwt_input(run_ids=run_patho_dat_ids, input_folder=input_folder, pgdbs_folder_path=pgdbs_folder_path,
patho_hole_filler=patho_hole_filler, dat_extraction=dat_extraction, output_folder=output_folder,
size_reduction=size_reduction, only_dat_creation=None, taxon_file=taxon_file)
patho_hole_filler=patho_hole_filler, patho_operon_predictor=patho_operon_predictor,
dat_extraction=dat_extraction, output_folder=output_folder, size_reduction=size_reduction,
only_dat_creation=None, taxon_file=taxon_file)

logger.info('~~~~~~~~~~Creation of input data from Genbank/GFF/PF~~~~~~~~~~')
mpwt_pool.map(pwt_input_files, multiprocess_inputs)
@@ -140,8 +170,9 @@ def multiprocess_pwt(input_folder=None, output_folder=None, patho_inference=None
dat_run_ids = create_only_dat_lisp(pgdbs_folder_path, tmp_folder)

multiprocess_inputs = create_mpwt_input(run_ids=dat_run_ids, input_folder=tmp_folder, pgdbs_folder_path=pgdbs_folder_path,
patho_hole_filler=patho_hole_filler, dat_extraction=dat_extraction, output_folder=output_folder,
size_reduction=size_reduction, only_dat_creation=only_dat_creation, taxon_file=taxon_file)
patho_hole_filler=patho_hole_filler, patho_operon_predictor=patho_operon_predictor,
dat_extraction=dat_extraction, output_folder=output_folder, size_reduction=size_reduction,
only_dat_creation=only_dat_creation, taxon_file=taxon_file)
# Add species that have data in PGDB but are not present in output folder.
# Or if ignore_error has been used, select only PathoLogic build that have succeed + species in input with PGDB and not in output.
if input_folder:
@@ -154,8 +185,9 @@ def multiprocess_pwt(input_folder=None, output_folder=None, patho_inference=None
for run_dat_id in run_dat_ids:
create_dat_creation_script(run_dat_id, input_folder + "/" + run_dat_id + "/" + "dat_creation.lisp")
multiprocess_dat_inputs = create_mpwt_input(run_ids=run_dat_ids, input_folder=input_folder, pgdbs_folder_path=pgdbs_folder_path,
patho_hole_filler=patho_hole_filler, dat_extraction=dat_extraction, output_folder=output_folder,
size_reduction=size_reduction, only_dat_creation=None, taxon_file=taxon_file)
patho_hole_filler=patho_hole_filler, patho_operon_predictor=patho_operon_predictor,
dat_extraction=dat_extraction, output_folder=output_folder, size_reduction=size_reduction,
only_dat_creation=None, taxon_file=taxon_file)
multiprocess_inputs.extend(multiprocess_dat_inputs)

# Create BioPAX/attributes-values dat files.
@@ -198,6 +230,10 @@ def multiprocess_pwt(input_folder=None, output_folder=None, patho_inference=None
mpwt_pool.close()
mpwt_pool.join()

# Turn on loading of pubmed entries.
if no_download_articles:
utils.pubmed_citations(activate_citations=True)

end_time = time.time()
times.append(end_time)
steps.append('mpwt')
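
This commit turns PubMed citation loading off near the start of `multiprocess_pwt` and back on near the end. If anything raises between those two calls, the second call is skipped and the setting stays off. One way to guard against that is a context manager, sketched below. `citations_disabled` and its `set_citations` parameter are hypothetical; `set_citations` stands in for a one-argument wrapper around `utils.pubmed_citations`.

```python
from contextlib import contextmanager


@contextmanager
def citations_disabled(set_citations):
    """Disable citation loading inside the with-block and re-enable it afterwards."""
    set_citations(False)
    try:
        yield
    finally:
        # Runs on normal completion, on exceptions, and on sys.exit() (SystemExit).
        set_citations(True)
```

With mpwt this might be used as `with citations_disabled(lambda active: utils.pubmed_citations(activate_citations=active)):` around the PathoLogic steps, assuming the keyword shown in the diff.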