Releases: martian-lang/martian

3.2.2 minor release

New features:

- Add a `--dot` option for `mrc`, which outputs a representation of the pipeline in GraphViz dot format to standard out.
- In addition to logging the type of filesystem for the pipestance directory, also log the type of filesystem for the martian `bin` directory (which is often different from the pipestance directory).
- When stages exceed their memory reservation, `mrjob` will now log the process tree for that stage, including memory statistics.
- `mrp`'s debug endpoint now exposes the Golang expvar interface in addition to pprof.

Other changes:

- Update dependency versions for `golang.org/x/sys` and some npm dependencies.
- Fix the build on OSX (though `mrp` still does not function there).
- Update vim syntax highlighting to include support for the `vmem_gb` keyword.
v3.2.1 bugfix release

- Performance fixes for VDR computation in cases where stages have a large number of output files.
- Include involuntary context switches in rusage tracking.
- Fix a crash in cases where the `mrp` binary becomes unavailable on disk during a pipestance run.
- Spelling corrections, mostly in code comments.
Martian 3.2.0
Martian 3.2.0 release.
Major new features:
- The Python stage code adapter now works with Python 3.
- Martian can now account for virtual address space size, in addition to physical memory.
  - Normally, virtual address space (vmem) size is ignored, since modern linux systems have no good reason to restrict it - vmem size is not the same as rss+swap, contrary to inexplicably popular belief.
  - In local mode, a limit may be specified with the `--localvmem` flag.
  - A limit will also be imposed automatically if a virtual size rlimit (e.g. `ulimit -d` or `ulimit -v`) is detected by mrp. SGE's `h_vmem`, `s_vmem`, `h_data`, and `s_data` resource specifiers set these limits.
  - In cluster mode job templates, users may now use `__MRO_VMEM_GB__` and related variables in the same way as the existing `__MRO_MEM_GB__` variables to get the predicted virtual address space (vmem) size rather than the physical memory requirement.
  - The job mode configuration for cluster modes found in `jobmanagers/config.json` may set the `mem_is_vmem` key to `true`, in which case `__MRO_MEM_GB__` and related template variables will also use the virtual address space size, for backwards compatibility with existing user templates (most SGE clusters mistakenly enforce virtual size, if they handle anything like memory reservations at all). This is turned on by default for SGE.
  - Stages may specify a `vmem_gb` requirement in addition to `mem_gb`, through all of the same existing mechanisms (see the sketch after this list):
    - Specifying `using ( vmem_gb = 4, )` in the mro declaration of the stage.
    - Specifying `__vmem_gb` in the chunk or join definitions returned by a split phase.
    - In overrides.json.
  - Stages which do not specify a vmem requirement will be allocated an amount equal to their physical memory requirement plus a constant specified in the `extra_vmem_per_job` key configured in `jobmanagers/config.json`.
  - With `--monitor`, `mrjob` will now restrict stage virtual size as well as physical size, to make sure the requests are being set correctly. It will include its own virtual size in the restriction, but will not include the virtual size of profiling jobs (e.g. `perf record`) which may be running alongside the stage code.
- Update the graph UI page:
  - Reduce the number of excess bytes required to render the page:
    - Inline the 7% of bootstrap.min.css we actually use.
    - Remove the fonts; just use an svg icon instead.
    - Remove the clipboard button, since it hasn't actually worked in a long time.
    - Remove dead js files. These files either were already not being included in the serve package or are no longer required.
    - Concatenate javascript source files together.
  - Remove duplicated DOM element IDs.
  - Get angular and dagre-d3 from npm, as well as support libraries d3 and lodash. This means we're no longer shipping an insecure version of lodash.
  - Pan/zoom now works on the graph page.
- MRO syntax now supports escaping in string literals, using JSON escaping syntax (see the example after this list).
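To make the first of the `vmem_gb` mechanisms above concrete, here is a minimal sketch of a stage declaration reserving virtual address space alongside physical memory; the stage name, argument types, and sizes are invented for illustration:

stage COUNT_READS(
    in  bam  input,
    out json summary,
    src py   "stages/count_reads",
) using (
    # Physical memory reservation, as before.
    mem_gb  = 4,
    # New in 3.2.0: virtual address space reservation.
    vmem_gb = 12,
)

A split phase could set the same reservation per chunk by returning `__vmem_gb` in a chunk definition, and overrides.json can set it per stage without touching the mro source.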
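And a minimal sketch of the new string-literal escaping, using a hypothetical call; JSON-style escapes such as `\"`, `\\`, and `\n` can now appear inside MRO string literals:

call GREET(
    # A string literal using JSON-style escapes.
    message = "a \"quoted\" word,\nand a second line",
)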
Minor improvements:

- mrp now checks for stage completion whenever local-mode jobs complete. Previously it would check every 3 seconds regardless. For very short jobs (such as, frequently, split phases) this results in shorter pipeline wall times. While the impact on large pipelines should be tiny in percentage terms, this significantly accelerates integration tests.
- `make tarball` now produces both `tar.gz` and `tar.xz`.
- Improvements to tests:
  - Integration tests can now run in parallel (`make -j longtests`).
  - Fix some bugs in integration test result validation.
  - More test coverage for both unit and integration tests.
- Pipelines should be more robust against missed or delayed updates from the pipestance journal directory. Rather than timing out, mrp will now check whether the file exists if a notification wasn't seen.
- `mrjob` now includes its own memory usage in the statistics included in the jobinfo, which are used to generate the `_perf` summary.
Bug fixes:

- Fix a potential deadlock when mrp receives a signal (e.g. from `kill`) or a shutdown request over the API while it is in the middle of starting or restarting a pipeline.
- Fix a crash in `mrf --includes` if a stage called by a pipeline was not present in the transitive includes of the file defining the pipeline.
- Fix a bug in `mrf --includes` which resulted in duplicate declarations for existing user-defined file types.
- Updated npm dependencies.
- `mrjob` will now begin waiting on the profiling command (e.g. `perf record`) immediately, rather than waiting until the stage code finishes. This prevents zombie processes lying around if the profiling command finishes before the stage code.
- `mrp` will no longer read `chunk_outs` files if no chunk outputs were expected, e.g. for pre-flight stages. This prevents spurious errors when chunk outputs were not a dictionary object. It also means chunk outputs need to be properly declared if the stage has no outputs.
v3.2.0-pre2
Fix a typo in the limit exceeded message.
v3.2.0-pre1: Martian 3.2.0 release candidate.
Release notes match the final Martian 3.2.0 release above.
Martian 3.1.0
Martian 3.1
This release extensively reworks Volatile Disk Recovery (VDR), adding two new language keywords (thus the bump of the minor version number) - see below for details. A secondary focus on performance significantly reduces the memory footprint of `mrp` (especially important for users in cluster mode, where the submit host often may have more constrained resources than the compute nodes). Improvements to developer tools should make authoring of `mro` source code more convenient, and logging improvements should make debugging easier in the event of failures.
VDR (volatile disk recovery) Changes
VDR has been extensively overhauled. The general changes improve the storage high-water mark for all pipelines without requiring further modifications. Additionally, two new features have been added to further improve storage utilization and streamline development.
General changes
- "rolling" is now the default VDR mode.
- Each stage job (split, chunk, join) now has its own `$TEMPDIR`, which is cleaned up as soon as that stage phase has completed.
- If a volatile stage call's output is bound to the top-level pipeline outputs, rather than preventing VDR on that stage from happening at all, only prevent deletion of the files explicitly mentioned in the bound output.
- When determining whether it is safe to clean up a stage's files, only outputs containing paths to files within the stage's directory hierarchy are considered when looking for downstream stages.
- Stage metadata files are now accounted for in the storage high-water mark calculation.
New feature: strict-mode volatile
Stages may now declare themselves as being "strict-mode volatile" compatible:
stage FOO(
in bam in1,
out bam bamfile,
out bai index,
out json summary,
) using (
volatile = strict,
)
In this mode, the `volatile` modifier on the call to the stage is ignored. In addition, rather than VDR being an all-or-nothing affair that doesn't run until all downstream stages have completed, each file in the stage's outputs is evaluated separately. In the example above, if `STAGE1` takes `bamfile` as input and `STAGE2` takes `summary` as input (see the sketch below), `bamfile` can be deleted as soon as `STAGE1` completes, rather than waiting for `STAGE2` to complete so that both can be deleted. In addition, any files not specified in any of the stage's outputs (most commonly intermediate files generated by the chunks and merged by the join) are deleted as soon as the stage completes. In many cases these changes significantly reduce the storage high-water mark for a pipeline, and obviate the need for some weird hacks such as creating intermediate stages which simply copy selected outputs from another stage in order to allow the earlier stage to be cleaned up.
One important note about this feature is that in many cases stage code may produce files whose existence implies the existence of other files. For example, `filename.bam` often implies the existence of `filename.bai`. If a downstream stage does not bind an output which mentions `filename.bai`, then that file may be deleted by the time that stage runs. As another example, if one of the outputs is a file containing a list of other file names, those other files may also be deleted by VDR when in strict mode. This is why the feature is opt-in on a stage-by-stage basis. Any files produced by the stage which downstream stages are expected to read must be listed in the stage outputs, and those downstream stages must take those outputs as inputs.
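For concreteness, here is a minimal sketch of the scenario described above; the pipeline wrapper and the `STAGE1`/`STAGE2` declarations are hypothetical, with `FOO` declared as in the example:

pipeline EXAMPLE(
    in  bam  in1,
    out json report,
)
{
    call FOO(
        in1 = self.in1,
    )
    # Hypothetical downstream consumer of only the bamfile output.
    call STAGE1(
        bamfile = FOO.bamfile,
    )
    # Hypothetical downstream consumer of only the summary output.
    call STAGE2(
        summary = FOO.summary,
    )
    return (
        report = STAGE2.report,
    )
}

Because `FOO` is strict-mode volatile, `bamfile` becomes eligible for cleanup as soon as `STAGE1` completes and `summary` as soon as `STAGE2` completes, rather than both waiting for the later of the two.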
New feature: "retained" outputs
In some cases, a file produced by a stage isn't part of the formal outputs of a pipeline, but should still not be deleted for other reasons. For example, during debugging one might want to preserve the outputs of one stage in order to have them as an input when re-running a later stage that is being actively developed. As another example, some files may be small enough that the savings involved in deleting them are too small to justify a reduction in the ease of debugging the outputs of a pipeline later. There are two ways to prevent such files from being cleaned up by VDR:
Pipeline retains
pipeline BAR(
in bam input1,
out bam output1,
)
{
call FOO(
input1 = self.input1,
)
call FOO as BAZ(
input1 = FOO.output1,
)
return (
output1 = BAZ.output1,
)
retain (
FOO.output1
)
}
This specifies that `output1` of this pipeline's call to `FOO` should never be deleted, for example if one wants to be able to later re-run `BAZ`. This should be preferred when one wishes to preserve a stage output for development purposes, first of all because it puts the `retain` directive closer to where the output may be reused later, and second because the stage in question might be called in other cases (such as aliased to `BAZ` in this example, or from other pipelines) which do not need to retain the output.
Stage retains
stage FOO(
in int input1,
out bam output1,
out json summary,
) using (
volatile = strict,
) retain (
summary,
)
This specifies that VDR should never delete `summary`. This should be used in the case where a file should always be preserved for potential later inspection.
Runtime improvements
User-facing improvements
- The memory and CPU consumption of `mrp` has been reduced, especially for very large pipelines, and in cases where stages create large output objects.
- There is now a timeout, configurable with the `--retry-wait` command line flag, between when `mrp` observes a potentially-transient failure and when it retries the failure. In many cases (for example cluster-mode jobs running on a remote machine which was taken offline) the failures are clustered, and waiting a short time allows all of the failures to be dealt with at once. The default wait time is 1 second.
- The web UI now dynamically lists top-level metadata files (such as log) rather than having a hard-coded list which included files which are often not present before the pipestance completes.
- The web UI will now show files in the `/extras` directory of the pipestance. This is intended primarily for outputs of on-finish hooks.
Improvements for Pipeline-Developers
- `mrp` can now run stage invocations as well as pipeline invocations. `mrs` now exists only as a symlink to `mrp` for backwards compatibility. This eliminates the feature gap between `mrs` and `mrp`, including for example restartability and user interface, as well as reducing the maintenance overhead involved in having separate binaries.
- Performance profiling modes are now configurable through `jobmanagers/config.json`. Each profiling mode may specify an executable to run to attach to the stage code (such as `perf`) and environment variables (for example `HEAPPROFILE` may be used to enable tcmalloc's heap profiler), in addition to any profiling built into the stage adapter itself.
- The default event collection for `--profile=perf` no longer includes `bpf-output` events. This was not working in most cases.
- Profiling mode may now be specified in the overrides json file (used with the `--overrides` flag) to enable or disable it for an individual stage.
- The `_mrosource` output now includes comments indicating file boundaries from the original source code with merged `@include`s.
- More logging about the environment when `mrp` starts up:
  - Log a few more environment variables (`MALLOC_*` and `RUST_*`).
  - The filesystem type and available space and inode count are now logged on startup.
- When `mrp` is run with `--monitor`, stages with overly-large outputs (512MB, or 1GB/number of chunks for chunk outputs) will now fail rather than potentially causing `mrp` to run out of memory while trying to parse them.
- The stage code parent process, `mrjob`, now polls memory and IO usage much more frequently and efficiently. This provides a more accurate measurement of peak memory usage for multithreaded stages.
- When attempting to reattach to an existing pipestance, verify that the mro source hasn't changed in "significant" ways.
- Stage and pipeline `_invocation` files now only `@include` the mro file defining that stage or pipeline, rather than the complete set of includes for the whole pipeline.
- Stage and pipeline `_invocation` files now properly represent call aliasing (e.g. `call FOO as BAR`).
- The python adapter now exposes two new methods, `get_threads_allocation` and `get_memory_allocation`, for stages to use in determining what they're allowed to use in cases where their request was large or dynamic.
Improvements to the compiler/parser (`mrc`) and formatter (`mrf`)
- The parser has been extensively rewritten to provide more useful (and correct) line number outputs for errors, especially in cases with complicated webs of `@include` directives.
- The parser is now substantially faster, and uses less memory.
- `mrf` has a new flag, `--includes`, which will remove `@include` directives which are not required, and will attempt to add `@include` and `filetype` directives which are missing. It is inspired by the clang iwyu tool.
- `mrf` now transforms call modifiers to the new syntax introduced in Martian 3.0. That is,
call volatile FOO(
...
)
becomes
call FOO(
...
) using (
volatile = true,
)
- `mrf` now sorts keys in `map` literals.
- `mrf` now inserts a trailing comma in map and array literals, e.g.
[
1,
2,
]
Bug fixes
- The web UI now times out stale connections. Previously a buggy or malicious client could introduce a denial of service condition by opening more and more socket connections until the server ran out of file handles. ...
Martian 3.1.0 release candidate 3

Release notes identical to the final Martian 3.1.0 release above.
Martian 3.1.0 release candidate 2

Release notes largely match the final Martian 3.1.0 release above. The one item unique to this candidate (which does not appear in the final notes) is:

- For users who desire more control over perf profile recording with `--profile=perf`, the environment variable `MRO_PERF_ARGS` allows one to specify the command line to `perf record`. This overrides `MRO_PERF_EVENTS`, `MRO_PERF_FREQ`, and `MRO_PERF_DURATION`. The command that runs will be `perf record -p <pid> -o <path/to/job/_perf.data> $MRO_PERF_ARGS`. The same behavior as setting those variables can thus be achieved by setting `MRO_PERF_ARGS="-g -e $MRO_PERF_EVENTS -F $MRO_PERF_FREQ sleep $MRO_PERF_DURATION"`.
Martian 3.1.0 release candidate 1

Release notes largely match release candidate 2 above. The one item unique to this candidate is:

- When running with `--zip --profile=perf`, profiling outputs are no longer included in `_metadata.zip`.
v2.3.3: Fix web UI in Firefox.
This is a manual cherry-pick of d5f6fe982014edf3215f7c07e11d4fa5a773c1c5.