Releases: martian-lang/martian

3.2.2 minor release

08 May 19:41

New features:

  • Add a --dot option for mrc, which outputs a representation of the
    pipeline in GraphViz dot format to standard out.
  • In addition to logging the type of filesystem for the pipestance
    directory, also log the type of filesystem for the martian bin
    directory (which is often different from the pipestance directory).
  • When stages exceed their memory reservation, mrjob will now log the
    process tree for that stage, including memory statistics.
  • mrp's debug endpoint now exposes the Golang expvar interface in
    addition to pprof.
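
For readers unfamiliar with the dot format, the sketch below emits a GraphViz digraph from a toy call graph. The function name, the stage names, and the layout are assumptions made for illustration; the text that mrc --dot actually prints may be structured differently.

```python
def to_dot(name, edges):
    """Render a directed graph as GraphViz dot text.

    `edges` is a list of (upstream, downstream) stage-name pairs.
    """
    lines = ["digraph %s {" % name]
    for src, dst in edges:
        lines.append('    "%s" -> "%s";' % (src, dst))
    lines.append("}")
    return "\n".join(lines)


# Hypothetical pipeline: ALIGN feeds both SORT and SUMMARIZE.
print(to_dot("EXAMPLE", [("ALIGN", "SORT"), ("ALIGN", "SUMMARIZE")]))
```

Text in this format can be rendered to an image by piping it through GraphViz, for example with dot -Tsvg.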

Other changes:

  • Update dependency versions for golang.org/x/sys and some npm
    dependencies.
  • Fix the build on OSX (though mrp still does not function there).
  • Update vim syntax highlighting to include support for vmem_gb keyword.

v3.2.1 bugfix release

07 Mar 18:01
  • Performance fixes for VDR computation in cases where stages have a
    large number of output files.
  • Include involuntary context switches in rusage tracking.
  • Fix a crash in cases where the mrp binary becomes unavailable on
    disk during a pipestance run.
  • Spelling corrections, mostly in code comments.

Martian 3.2.0

14 Jan 18:46

Martian 3.2.0 release.
Major new features:

  • The Python stage code adapter now works with Python 3.
  • Martian can now account for virtual address space size, in addition to
    physical memory.
    • Normally, virtual address space (vmem) size is ignored, since modern
      linux systems have no good reason to restrict it - vmem size is not
      the same as rss+swap, contrary to inexplicably popular belief.
    • In local mode, a limit may be specified with the --localvmem flag.
    • A limit will also be imposed automatically if a virtual size rlimit
      (e.g. ulimit -d or ulimit -v) is detected by mrp. SGE's
      h_vmem, s_vmem, h_data, and s_data resource specifiers set
      these limits.
    • In cluster mode job templates, users may now use __MRO_VMEM_GB__
      and related variables in the same way as the existing
      __MRO_MEM_GB__ variables to get the predicted virtual address
      space (vmem) size rather than the physical memory requirement.
    • The job mode configuration for cluster modes found in
      jobmanagers/config.json may set the mem_is_vmem key to true,
      in which case __MRO_MEM_GB__ and related template variables will
      also use the virtual address space size, for backwards compatibility
      with existing user templates (most SGE clusters mistakenly enforce
      virtual size, if they handle anything like memory reservations at
      all). This is turned on by default for SGE.
    • Stages may specify a vmem_gb requirement in addition to mem_gb,
      through all of the same existing mechanisms:
      • Specifying using ( vmem_gb = 4, ) in the mro declaration of the
        stage.
      • Specifying __vmem_gb in the chunk or join definitions returned
        by a split phase.
      • In overrides.json.
    • Stages which do not specify a vmem requirement will be allocated an
      amount equal to their physical memory requirement plus a constant
      specified in the extra_vmem_per_job key configured in
      jobmanagers/config.json.
    • With --monitor, mrjob will now restrict stage virtual size as
      well as physical size, to make sure the requests are being set
      correctly. It will include its own virtual size in the restriction,
      but will not include the virtual size of profiling jobs (e.g.
      perf record) which may be running alongside the stage code.
  • Update graph UI page
    • Reduce the amount of excess bytes required to render the page.
      • Inline the 7% of bootstrap.min.css we actually use.
      • Remove the fonts, just use an svg icon instead.
      • Remove the clipboard button, since it hasn't actually worked in a
        long time.
    • Remove dead js files. These files either were already not being
      included in the serve package or are no longer required.
    • Concatenate javascript source files together.
    • Remove duplicated DOM element IDs.
    • Get angular, dagre-d3 from npm, as well as support libraries d3 and
      lodash. This means we're no longer shipping an insecure version of
      lodash.
    • Pan/zoom now works on the graph page.
  • MRO syntax now supports escaping for string literals, using json
    escaping syntax.
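
The virtual memory rules above can be summarized in a small model. This is a sketch under stated assumptions, not Martian's implementation: the function names are hypothetical, the default of 3 for extra_vmem_per_job is a placeholder rather than the shipped value, and only two of the template variables are modeled.

```python
def resolve_vmem_gb(mem_gb, vmem_gb=None, extra_vmem_per_job=3):
    """Return the virtual address space request for a stage, in GB.

    An explicit vmem_gb wins. Otherwise the request is the physical
    memory requirement plus the configured constant.
    """
    if vmem_gb is not None:
        return vmem_gb
    return mem_gb + extra_vmem_per_job


def template_vars(mem_gb, vmem_gb=None, extra_vmem_per_job=3,
                  mem_is_vmem=False):
    """Compute the values substituted into a cluster job template.

    With mem_is_vmem set, __MRO_MEM_GB__ reports the virtual size too,
    so existing templates keep working on clusters that enforce vmem.
    """
    vmem = resolve_vmem_gb(mem_gb, vmem_gb, extra_vmem_per_job)
    return {
        "__MRO_MEM_GB__": vmem if mem_is_vmem else mem_gb,
        "__MRO_VMEM_GB__": vmem,
    }


# A stage asking for 4 GB of physical memory and no explicit vmem_gb.
print(template_vars(4))
print(template_vars(4, mem_is_vmem=True))
```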

Minor improvements:

  • mrp now checks for stage completion whenever local-mode jobs complete.
    Previously it would check every 3 seconds regardless. For very short
    jobs (such as, frequently, split phases) this results in shorter
    pipeline wall times. While the impact on large pipelines should be
    tiny in percentage terms, this significantly accelerates integration
    tests.
  • make tarball now produces both tar.gz and tar.xz.
  • Improvements to tests.
    • Integration tests can now run in parallel (make -j longtests)
    • Fix some bugs in integration test result validation.
    • More test coverage for both unit and integration tests.
  • Pipelines should be more robust against missed or delayed updates
    from the pipestance journal directory. Rather than timing out,
    mrp will now check whether the file exists if a notification wasn't
    seen.
  • mrjob now includes its own memory usage in the statistics included
    in the jobinfo, which are used to generate the _perf summary.
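
The journal change above amounts to falling back to a direct check when a notification is missed. The sketch below illustrates that pattern only; the function name and the use of a threading.Event as the notification channel are assumptions, and mrp itself is not written in Python.

```python
import os
import threading


def wait_for_journal_file(path, notified, timeout):
    """Wait for a notification about `path`, then fall back to the disk.

    `notified` is a threading.Event set by whatever watches the journal
    directory. If no notification arrives within `timeout` seconds, the
    file's existence is checked directly instead of reporting a timeout.
    """
    if notified.wait(timeout):
        return True
    return os.path.exists(path)


# No notification is ever delivered here, so the result depends only on
# whether the path exists on disk.
print(wait_for_journal_file(os.getcwd(), threading.Event(), 0.01))
```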

Bug fixes:

  • Fix a potential deadlock when mrp receives a signal (e.g. from kill)
    or a shutdown request over the API while it is in the middle of
    starting or restarting a pipeline.
  • Fix a crash in mrf --includes if a stage called by a pipeline was
    not present in the transitive includes of the file defining the
    pipeline.
  • Fix a bug in mrf --includes which resulted in duplicate declarations
    for existing user-defined file types.
  • Updated npm dependencies.
  • mrjob will now begin waiting on the profiling command (e.g.
    perf record) immediately, rather than waiting until the stage code
    finishes. This prevents zombie processes lying around if the
    profiling command finishes before the stage code.
  • mrp will no longer read chunk _outs files if no chunk outputs
    were expected, e.g. for pre-flight stages. This prevents spurious
    errors when chunk outputs were not a dictionary object. It also
    means chunk outputs need to be properly declared if the stage has
    no outputs.

v3.2.0-pre2

14 Jan 18:49
Fix a typo in limit exceeded message.

v3.2.0-pre1: Martian 3.2.0 release candidate.

14 Jan 18:49
Major new features:
* The Python stage code adapter now works with Python 3.
* Martian can now account for virtual address space size, in addition to
  physical memory.
  * Normally, virtual address space (vmem) size is ignored, since modern
    linux systems have no good reason to restrict it - vmem size is not
    the same as rss+swap, contrary to inexplicably popular belief.
  * In local mode, a limit may be specified with the `--localvmem` flag.
  * A limit will also be imposed automatically if a virtual size rlimit
    (e.g. `ulimit -d` or `ulimit -v`) is detected by mrp.  SGE's
    `h_vmem`, `s_vmem`, `h_data`, and `s_data` resource specifiers set
    these limits.
  * In cluster mode job templates, users may now use `__MRO_VMEM_GB__`
    and related variables in the same way as the existing
    `__MRO_MEM_GB__` variables to get the predicted virtual address
    space (vmem) size rather than the physical memory requirement.
  * The job mode configuration for cluster modes found in
    `jobmanagers/config.json` may set the `mem_is_vmem` key to `true`,
    in which case `__MRO_MEM_GB__` and related template variables will
    also use the virtual address space size, for backwards compatibility
    with existing user templates (most SGE clusters mistakenly enforce
    virtual size, if they handle anything like memory reservations at
    all).
  * Stages may specify a `vmem_gb` requirement in addition to `mem_gb`,
    through all of the same existing mechanisms:
    * Specifying `using ( vmem_gb = 4, )` in the mro declaration of the
      stage.
    * Specifying `__vmem_gb` in the chunk or join definitions returned
      by a split phase.
    * In overrides.json.
  * Stages which do not specify a vmem requirement will be allocated an
    amount equal to their physical memory requirement plus a constant
    specified in the `extra_vmem_per_job` key configured in
    `jobmanagers/config.json`.
  * With `--monitor`, `mrjob` will now restrict stage virtual size as
    well as physical size, to make sure the requests are being set
    correctly.  It will include its own virtual size in the restriction,
    but will not include the virtual size of profiling jobs (e.g.
    `perf record`) which may be running alongside the stage code.

Minor improvements:
* mrp now checks for stage completion whenever local-mode jobs complete.
  Previously it would check every 3 seconds regardless.  For very short
  jobs (such as, frequently, split phases) this results in shorter
  pipeline wall times.  While the impact on large pipelines should be
  tiny in percentage terms, this significantly accelerates integration
  tests.
* `make tarball` now produces both `tar.gz` and `tar.xz`.
* Improvements to tests.
  * Integration tests can now run in parallel (`make -j longtests`)
  * Fix some bugs in integration test result validation.
  * More test coverage for both unit and integration tests.
* Pipelines should be more robust against missed or delayed updates
  from the pipestance journal directory.  Rather than timing out,
  mrp will now check whether the file exists if a notification wasn't
  seen.
* `mrjob` now includes its own memory usage in the statistics included
  in the jobinfo, which are used to generate the `_perf` summary.

Bug fixes:
* Fix a potential deadlock when mrp receives a signal (e.g. from `kill`)
  or a shutdown request over the API while it is in the middle of
  starting or restarting a pipeline.
* Fix a crash in `mrf --includes` if a stage called by a pipeline was
  not present in the transitive includes of the file defining the
  pipeline.
* Fix a bug in `mrf --includes` which resulted in duplicate declarations
  for existing user-defined file types.
* Updated npm dependencies.
* `mrjob` will now begin waiting on the profiling command (e.g.
  `perf record`) immediately, rather than waiting until the stage code
  finishes.  This prevents zombie processes lying around if the
  profiling command finishes before the stage code.

Martian 3.1.0

11 Oct 22:45

Martian 3.1

This release extensively reworks Volatile Disk Recovery (VDR), adding two new language keywords (thus the bump of the minor version number) - see below for details. A secondary focus on performance significantly reduces the memory footprint of mrp (especially important for users in cluster mode, where the submit host often may have more constrained resources than the compute nodes). Improvements to developer tools should make authoring of mro source code more convenient, and logging improvements should make debugging easier in the event of failures.

VDR (volatile disk recovery) Changes

VDR has been extensively overhauled. The general changes improve storage high-water-mark for all pipelines, without further modifications. Additionally, two new features have been added to further improve storage utilization and streamline development.

General changes

  • "rolling" is now the default VDR mode.
  • Each stage job (split, chunk, join) now has its own $TEMPDIR, which is cleaned up as soon as that stage phase has completed.
  • If a volatile stage call's output is bound to the top-level pipeline outputs, rather than preventing VDR on that stage from happening at all, only prevent deletion of the files explicitly mentioned in the bound output.
  • When determining whether it is safe to clean up a stage's files, only outputs containing paths to files within the stage's directory hierarchy are considered when looking for downstream stages.
  • Stage metadata files are now accounted for in the storage high-water mark calculation.

New feature: strict-mode volatile

Stages may now declare themselves as being "strict-mode volatile" compatible:

stage FOO(
    in  bam in1,
    out bam bamfile,
    out bai index,
    out json summary,
) using (
    volatile = strict,
)

In this mode, the volatile modifier on the call to the stage is ignored. In addition, rather than VDR being an all-or-nothing affair that doesn't run until all downstream stages have completed, each file in the stage's outputs is evaluated separately. In the example above, if STAGE1 takes bamfile as input and STAGE2 takes summary as input, bamfile can be deleted as soon as STAGE1 completes, rather than waiting for STAGE2 to complete so that both can be deleted. In addition, any files not specified in any of the stage's outputs (most commonly intermediate files generated by the chunks and merged by the join) are deleted as soon as the stage completes. In many cases these changes significantly reduce the storage high-water mark for a pipeline, and obviate the need for some weird hacks such as creating intermediate stages which simply copy selected outputs from another stage in order to allow the earlier stage to be cleaned up.

One important note about this feature is that in many cases stage code may produce files where their existence implies the existence of other files. For example, filename.bam often implies the existence of filename.bai. If a downstream stage does not bind an output which mentions filename.bai then that file may be deleted by the time that stage runs. As another example, if one of the outputs is a file containing a list of other file names, those other file names may also be deleted by VDR when in strict mode. This is why the feature is opt-in on a stage-by-stage basis. Any files produced by the stage which downstream stages are expected to read must be listed in the stage outputs, and those downstream stages must take those outputs as inputs.
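
The per-output rule can be modeled in a few lines. This is a simplified sketch, not Martian's VDR code: the function name is hypothetical, and it ignores outputs bound to top-level pipeline outputs. An output with no downstream consumer is treated as deletable right away, which is the hazard described above for filename.bai.

```python
def deletable_files(consumers, completed):
    """Return the stage outputs that strict-mode VDR could delete.

    consumers: maps each output name to the set of downstream stages
               that take it as an input.
    completed: set of downstream stages that have finished.
    """
    return sorted(
        name
        for name, stages in consumers.items()
        if stages <= completed  # every consumer has finished
    )


# Outputs of FOO from the example above. STAGE1 and STAGE2 are the
# consumers described in the text; nothing consumes index.
consumers = {
    "bamfile": {"STAGE1"},
    "index": set(),
    "summary": {"STAGE2"},
}
print(deletable_files(consumers, completed={"STAGE1"}))
# ['bamfile', 'index']
```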

New feature: "retained" outputs

In some cases, a file produced by a stage isn't part of the formal outputs of a pipeline, but should still not be deleted for other reasons. For example, during debugging one might want to preserve the outputs of one stage in order to have them as an input when re-running a later stage that is being actively developed. As another example, some files may be small enough that the savings from deleting them are too small to justify a reduction in ease of debugging the outputs of a pipeline later. There are two ways to prevent such files from being cleaned up by VDR:

Pipeline retains

pipeline BAR(
    in  bam  input1,
    out bam  output1,
)
{
    call FOO(
        input1 = self.input1,
    )
    call FOO as BAZ(
        input1 = FOO.output1,
    )
    return (
        output1 = BAZ.output1,
    )
    retain (
        FOO.output1
    )
}

This specifies that output1 of this pipeline's call to FOO should never be deleted, for example if one wants to be able to later re-run BAZ. This should be preferred when one wishes to preserve a stage output for development purposes, first of all because it puts the retain directive closer to where the output may be reused later, and second because the stage in question might be called in other cases (such as aliased to BAZ in this example, or from other pipelines) which do not need to retain the output.

Stage retains

stage FOO(
    in  int  input1,
    out bam  output1,
    out json summary,
) using (
    volatile = strict,
) retain (
    summary,
)

This specifies that VDR should never delete summary. This should be used in the case where a file should always be preserved for potential later inspection.

Runtime improvements

User-facing improvements

  • The memory and CPU consumption of mrp has been reduced, especially for very large pipelines, and in cases where stages create large output objects.
  • There is now a timeout, configurable with the --retry-wait command line flag, between when mrp observes a potentially-transient failure and when it retries the failure. In many cases (for example cluster-mode jobs running on a remote machine which was taken offline) the failures are clustered, and waiting a short time allows all of the failures to be dealt with at once. The default wait time is 1 second.
  • The web UI now dynamically lists top-level metadata files (such as log) rather than having a hard-coded list which included files which are often not present before the pipestance completes.
  • The web UI will now show files in the /extras directory of the pipestance. This is intended primarily for outputs of on-finish hooks.
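
One way to picture the --retry-wait behavior above is as a short batching delay. The sketch below is an illustration under assumptions, not mrp's code: the function name and the use of a queue.Queue of failed job names are hypothetical.

```python
import queue
import time


def collect_failures(failures, retry_wait=1.0):
    """Block for one failure, wait `retry_wait` seconds, then drain the rest.

    `failures` is a queue.Queue of failed job names. Returns every failure
    that arrived before or during the wait, so they can be retried together.
    """
    batch = [failures.get()]
    time.sleep(retry_wait)
    while True:
        try:
            batch.append(failures.get_nowait())
        except queue.Empty:
            return batch


# Two jobs on the same (hypothetical) offline node fail close together.
pending = queue.Queue()
pending.put("CHUNK_0")
pending.put("CHUNK_1")
print(collect_failures(pending, retry_wait=0.05))
```

Without the wait, the first failure would be retried on its own and the second would trigger a separate retry pass shortly afterwards.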

Improvements for Pipeline-Developers

  • mrp can now run stage invocations as well as pipeline invocations. mrs now exists only as a symlink to mrp for backwards compatibility. This eliminates the feature gap between mrs and mrp, including for example restartability and user interface, as well as reducing the maintenance overhead involved in having separate binaries.
  • Performance profiling modes are now configurable through jobmanagers/config.json. Each profiling mode may specify an executable to run to attach to the stage code (such as perf) and environment variables (for example HEAPPROFILE may be used to enable tcmalloc's heap profiler), in addition to any profiling built into the stage adapter itself.
  • The default event collection for --profile=perf no longer includes bpf-output events. This was not working in most cases.
  • Profiling mode may now be specified in the overrides json file (used with the --overrides flag) to enable or disable it for an individual stage.
  • The _mrosource output now includes comments indicating file boundaries from the original source code with merged @includes.
  • More logging about the environment when mrp starts up
    • Log a few more environment variables (MALLOC_* and RUST_*).
    • The filesystem type and available space and inode count are now logged on startup.
  • When mrp is run with --monitor, stages with overly-large outputs (512MB, or 1GB/number of chunks for chunk outputs) will now fail rather than potentially causing mrp to run out of memory while trying to parse them.
  • The stage code parent process, mrjob, now polls memory and IO usage much more frequently and efficiently. This provides a more accurate measurement of peak memory usage for multithreaded stages.
  • When attempting to reattach to an existing pipestance, verify that the mro source hasn't changed in "significant" ways.
  • Stage and pipeline _invocation files now only @include the mro file defining that stage or pipeline, rather than the complete set of includes for the whole pipeline.
  • Stage and pipeline _invocation files now properly represent call aliasing (e.g. call FOO as BAR).
  • The python adapter now exposes two new methods, get_threads_allocation and get_memory_allocation, for stages to use in determining what they're allowed to use in cases where their request was large or dynamic.
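
The output size limit in the --monitor bullet above can be read as the following arithmetic. This is a sketch: the function name is hypothetical, and treating MB and GB as binary units (powers of 1024) is an assumption that the release notes do not confirm.

```python
MB = 1024 * 1024
GB = 1024 * MB


def max_outs_bytes(chunk_count=None):
    """Return the largest outputs object accepted under --monitor.

    With no chunk_count the flat 512MB limit applies. For chunk outputs,
    a 1GB budget is divided evenly between the chunks.
    """
    if chunk_count is None:
        return 512 * MB
    return GB // chunk_count


print(max_outs_bytes())     # flat limit
print(max_outs_bytes(100))  # per-chunk limit for a 100-chunk stage
```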

Improvements to the compiler/parser (mrc) and formatter (mrf)

  • The parser has been extensively rewritten to provide more useful (and correct) line number outputs for errors, especially in cases with complicated webs of @include directives.
  • The parser is now substantially faster, and uses less memory.
  • mrf has a new flag, --includes, which will remove @include directives which are not required, and will attempt to add @include and filetype directives which are missing. It is inspired by the clang iwyu tool.
  • mrf now transforms call modifiers to the new syntax introduced in Martian 3.0. That is,
call volatile FOO(
    ...
)

becomes

call FOO(
    ...
) using (
    volatile = true,
)
  • mrf now sorts keys in map literals.
  • mrf now inserts a trailing comma in map and array literals, e.g.
[
    1,
    2,
]
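
The two literal-formatting rules above (sorted map keys, trailing commas) can be illustrated with a small formatter. This is a sketch of the style only, not mrf's implementation: the function name is hypothetical, and it assumes JSON-like syntax for map keys and scalar values.

```python
import json


def format_literal(value, indent=0):
    """Format nested lists and dicts with sorted keys and trailing commas."""
    pad = " " * indent
    inner = " " * (indent + 4)
    if isinstance(value, dict):
        lines = ["{"]
        for key in sorted(value):
            item = format_literal(value[key], indent + 4)
            lines.append("%s%s: %s," % (inner, json.dumps(key), item))
        lines.append(pad + "}")
        return "\n".join(lines)
    if isinstance(value, list):
        lines = ["["]
        for element in value:
            lines.append("%s%s," % (inner, format_literal(element, indent + 4)))
        lines.append(pad + "]")
        return "\n".join(lines)
    # Scalars are rendered with JSON syntax.
    return json.dumps(value)


print(format_literal({"b": [1, 2], "a": "x"}))
```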

Bug fixes

  • The web UI now times out stale connections. Previously a buggy or malicious client could introduce a denial of service condition by opening more and more socket conn...

Martian 3.1.0 release candidate 3

09 Oct 20:18
Pre-release

Martian 3.1

This release extensively reworks Volatile Disk Recovery (VDR), adding two new language keywords (thus the bump of the minor version number) - see below for details. A secondary focus on performance significantly reduces the memory footprint of mrp (especially important for users in cluster mode, where the submit host often may have more constrained resources than the compute nodes). Improvements to developer tools should make authoring of mro source code more convenient, and logging improvements should make debugging easier in the event of failures.

VDR (volatile disk recovery) Changes

VDR has been extensively overhauled. The general changes improve storage high-water-mark for all pipelines, without further modifications. Additionally, two new features have been added to further improve storage utilization and streamline development.

General changes

  • "rolling" is now the default VDR mode.
  • Each stage job (split, chunk, join) now has its own $TEMPDIR, which is cleaned up as soon as that stage phase has completed.
  • If a volatile stage call's output is bound to the top-level pipeline outputs, rather than preventing VDR on that stage from happening at all, only prevent deletion of the files explicitly mentioned in the bound output.
  • When determining whether it is safe to clean up a stage's files, only outputs containing paths to files within the stage's directory hierarchy are considered when looking for downstream stages.
  • Stage metadata files are now accounted for in the storage high-water mark calculation.

New feature: strict-mode volatile

Stages may now declare themselves as being "strict-mode volatile" compatible:

stage FOO(
    in  bam in1,
    out bam bamfile,
    out bai index,
    out json summary,
) using (
    volatile = strict,
)

In this mode, the volatile modifier on the call to the stage is ignored. In addition, rather than VDR being an all-or-nothing affair that doesn't run until all downstream stages have completed, each file in the stage's outputs is evaluated separately. In the example above, if STAGE1 takes bamfile as input and STAGE2 takes summary as input, bamfile can be deleted as soon as STAGE1 completes, rather than waiting for STAGE2 to complete so that both can be deleted. In addition, any files not specified in any of the stage's outputs (most commonly intermediate files generated by the chunks and merged by the join) are deleted as soon as the stage completes. In many cases these changes significantly reduce the storage high-water mark for a pipeline, and obviate the need for some weird hacks such as creating intermediate stages which simply copy selected outputs from another stage in order to allow the earlier stage to be cleaned up.

One important note about this feature is that in many cases stage code may produce files where their existence implies the existence of other files. For example, filename.bam often implies the existence of filename.bai. If a downstream stage does not bind an output which mentions filename.bai then that file may be deleted by the time that stage runs. As another example, if one of the outputs is a file containing a list of other file names, those other file names may also be deleted by VDR when in strict mode. This is why the feature is opt-in on a stage-by-stage basis. Any files produced by the stage which downstream stages are expected to read must be listed in the stage outputs, and those downstream stages must take those outputs as inputs.

New feature: "retained" outputs

In some cases, a file produced by a stage isn't part of the formal outputs of a pipeline, but should still not be deleted for other reasons. For example, during debugging one might want to preserve the outputs of one stage in order to have them as an input when re-running a later stage that is being actively developed. As another example, some files may be small enough that the savings from deleting them are too small to justify a reduction in ease of debugging the outputs of a pipeline later. There are two ways to prevent such files from being cleaned up by VDR:

Pipeline retains

pipeline BAR(
    in  bam  input1,
    out bam  output1,
)
{
    call FOO(
        input1 = self.input1,
    )
    call FOO as BAZ(
        input1 = FOO.output1,
    )
    return (
        output1 = BAZ.output1,
    )
    retain (
        FOO.output1
    )
}

This specifies that output1 of this pipeline's call to FOO should never be deleted, for example if one wants to be able to later re-run BAZ. This should be preferred when one wishes to preserve a stage output for development purposes, first of all because it puts the retain directive closer to where the output may be reused later, and second because the stage in question might be called in other cases (such as aliased to BAZ in this example, or from other pipelines) which do not need to retain the output.

Stage retains

stage FOO(
    in  int  input1,
    out bam  output1,
    out json summary,
) using (
    volatile = strict,
) retain (
    summary,
)

This specifies that VDR should never delete summary. This should be used in the case where a file should always be preserved for potential later inspection.

Runtime improvements

User-facing improvements

  • The memory and CPU consumption of mrp has been reduced, especially for very large pipelines, and in cases where stages create large output objects.
  • There is now a timeout, configurable with the --retry-wait command line flag, between when mrp observes a potentially-transient failure and when it retries the failure. In many cases (for example cluster-mode jobs running on a remote machine which was taken offline) the failures are clustered, and waiting a short time allows all of the failures to be dealt with at once. The default wait time is 1 second.
  • The web UI now dynamically lists top-level metadata files (such as log) rather than having a hard-coded list which included files which are often not present before the pipestance completes.
  • The web UI will now show files in the /extras directory of the pipestance. This is intended primarily for outputs of on-finish hooks.

Improvements for Pipeline-Developers

  • mrp can now run stage invocations as well as pipeline invocations. mrs now exists only as a symlink to mrp for backwards compatibility. This eliminates the feature gap between mrs and mrp, including for example restartability and user interface, as well as reducing the maintenance overhead involved in having separate binaries.
  • Performance profiling modes are now configurable through jobmanagers/config.json. Each profiling mode may specify an executable to run to attach to the stage code (such as perf) and environment variables (for example HEAPPROFILE may be used to enable tcmalloc's heap profiler), in addition to any profiling built into the stage adapter itself.
  • The default event collection for --profile=perf no longer includes bpf-output events. This was not working in most cases.
  • Profiling mode may now be specified in the overrides json file (used with the --overrides flag) to enable or disable it for an individual stage.
  • The _mrosource output now includes comments indicating file boundaries from the original source code with merged @includes.
  • More logging about the environment when mrp starts up
    • Log a few more environment variables (MALLOC_* and RUST_*).
    • The filesystem type and available space and inode count are now logged on startup.
  • When mrp is run with --monitor, stages with overly-large outputs (512MB, or 1GB/number of chunks for chunk outputs) will now fail rather than potentially causing mrp to run out of memory while trying to parse them.
  • The stage code parent process, mrjob, now polls memory and IO usage much more frequently and efficiently. This provides a more accurate measurement of peak memory usage for multithreaded stages.
  • When attempting to reattach to an existing pipestance, verify that the mro source hasn't changed in "significant" ways.
  • Stage and pipeline _invocation files now only @include the mro file defining that stage or pipeline, rather than the complete set of includes for the whole pipeline.
  • Stage and pipeline _invocation files now properly represent call aliasing (e.g. call FOO as BAR).
  • The python adapter now exposes two new methods, get_threads_allocation and get_memory_allocation, for stages to use in determining what they're allowed to use in cases where their request was large or dynamic.

Improvements to the compiler/parser (mrc) and formatter (mrf)

  • The parser has been extensively rewritten to provide more useful (and correct) line number outputs for errors, especially in cases with complicated webs of @include directives.
  • The parser is now substantially faster, and uses less memory.
  • mrf has a new flag, --includes, which will remove @include directives which are not required, and will attempt to add @include and filetype directives which are missing. It is inspired by the clang iwyu tool.
  • mrf now transforms call modifiers to the new syntax introduced in Martian 3.0. That is,
call volatile FOO(
    ...
)

becomes

call FOO(
    ...
) using (
    volatile = true,
)
  • mrf now sorts keys in map literals.
  • mrf now inserts a trailing comma in map and array literals, e.g.
[
    1,
    2,
]

Bug fixes

  • The web UI now times out stale connections. Previously a buggy or malicious client could introduce a denial of service condition by opening more and more socket conn...

Martian 3.1.0 release candidate 2

06 Sep 02:14
67837c0
Pre-release

Martian 3.1

This release extensively reworks Volatile Disk Recovery (VDR), adding two new language keywords (thus the bump of the minor version number) - see below for details. A secondary focus on performance significantly reduces the memory footprint of mrp (especially important for users in cluster mode, where the submit host often may have more constrained resources than the compute nodes). Improvements to developer tools should make authoring of mro source code more convenient, and logging improvements should make debugging easier in the event of failures.

VDR (volatile disk recovery) Changes

VDR has been extensively overhauled. The general changes improve storage high-water-mark for all pipelines, without further modifications. Additionally, two new features have been added to further improve storage utilization and streamline development.

General changes

  • "rolling" is now the default VDR mode.
  • Each stage job (split, chunk, join) now has its own $TEMPDIR, which is cleaned up as soon as that stage phase has completed.
  • If a volatile stage call's output is bound to the top-level pipeline outputs, rather than preventing VDR on that stage from happening at all, only prevent deletion of the files explicitly mentioned in the bound output.
  • When determining whether it is safe to clean up a stage's files, only outputs containing paths to files within the stage's directory hierarchy are considered when looking for downstream stages.
  • Stage metadata files are now accounted for in the storage high-water mark calculation.

New feature: strict-mode volatile

Stages may now declare themselves as being "strict-mode volatile" compatible:

stage FOO(
    in  bam in1,
    out bam bamfile,
    out bai index,
    out json summary,
) using (
    volatile = strict,
)

In this mode, the volatile modifier on the call to the stage is ignored. In addition, rather than VDR being an all-or-nothing affair that doesn't run until all downstream stages have completed, each file in the stage's outputs is evaluated separately. In the example above, if STAGE1 takes bamfile as input and STAGE2 takes summary as input, bamfile can be deleted as soon as STAGE1 completes, rather than waiting for STAGE2 to complete so that both can be deleted. In addition, any files not specified in any of the stage's outputs (most commonly intermediate files generated by the chunks and merged by the join) are deleted as soon as the stage completes. In many cases these changes significantly reduce the storage high-water mark for a pipeline, and obviate the need for some weird hacks such as creating intermediate stages which simply copy selected outputs from another stage in order to allow the earlier stage to be cleaned up.

One important note about this feature is that in many cases stage code may produce files whose existence implies the existence of other files. For example, filename.bam often implies the existence of filename.bai. If a downstream stage does not bind an output which mentions filename.bai then that file may be deleted by the time that stage runs. As another example, if one of the outputs is a file containing a list of other file names, those other files may also be deleted by VDR when in strict mode. This is why the feature is opt-in on a stage-by-stage basis. Any files produced by the stage which downstream stages are expected to read must be listed in the stage outputs, and those downstream stages must take those outputs as inputs.

New feature: "retained" outputs

In some cases, a file produced by a stage isn't part of the formal outputs of a pipeline, but should still not be deleted for other reasons. For example, during debugging one might want to preserve the outputs of one stage in order to have them as an input when re-running a later stage that is being actively developed. As another example, some files may be small enough that the savings from deleting them are too small to justify a reduction in ease of debugging the outputs of a pipeline later. There are two ways to prevent such files from being cleaned up by VDR:

Pipeline retains

pipeline BAR(
    in  bam  input1,
    out bam  output1,
)
{
    call FOO(
        input1 = self.input1,
    )
    call FOO as BAZ(
        input1 = FOO.output1,
    )
    return (
        output1 = BAZ.output1,
    )
    retain (
        FOO.output1
    )
}

This specifies that output1 of this pipeline's call to FOO should never be deleted, for example if one wants to be able to later re-run BAZ. This should be preferred when one wishes to preserve a stage output for development purposes, first of all because it puts the retain directive closer to where the output may be reused later, and second because the stage in question might be called in other cases (such as aliased to BAZ in this example, or from other pipelines) which do not need to retain the output.

Stage retains

stage FOO(
    in  int  input1,
    out bam  output1,
    out json summary,
) using (
    volatile = strict,
) retain (
    summary,
)

This specifies that VDR should never delete summary. This should be used in the case where a file should always be preserved for potential later inspection.

Runtime improvements

User-facing improvements

  • The memory and CPU consumption of mrp has been reduced, especially for very large pipelines, and in cases where stages create large output objects.
  • There is now a timeout, configurable with the --retry-wait command line flag, between when mrp observes a potentially-transient failure and when it retries the failure. In many cases (for example cluster-mode jobs running on a remote machine which was taken offline) the failures are clustered, and waiting a short time allows all of the failures to be dealt with at once. The default wait time is 1 second.
  • The web UI now dynamically lists top-level metadata files (such as log) rather than having a hard-coded list which included files which are often not present before the pipestance completes.
  • The web UI will now show files in the /extras directory of the pipestance. This is intended primarily for outputs of on-finish hooks.

Improvements for Pipeline-Developers

  • mrp can now run stage invocations as well as pipeline invocations. mrs now exists only as a symlink to mrp for backwards compatibility. This eliminates the feature gap between mrs and mrp, including for example restartability and user interface, as well as reducing the maintenance overhead involved in having separate binaries.
  • The default event collection for --profile=perf no longer includes bpf-output events. This was not working in most cases.
  • For users who desire more control over perf profile recording with --profile=perf, the environment variable MRO_PERF_ARGS allows one to specify the command line to perf record. This overrides MRO_PERF_EVENTS, MRO_PERF_FREQ, and MRO_PERF_DURATION. The command that runs will be perf record -p <pid> -o <path/to/job/_perf.data> $MRO_PERF_ARGS. The same behavior as setting those variables can thus be achieved by setting MRO_PERF_ARGS="-g -e $MRO_PERF_EVENTS -F $MRO_PERF_FREQ sleep $MRO_PERF_DURATION".
  • The _mrosource output now includes comments indicating file boundaries from the original source code with merged @includes.
  • More logging about the environment when mrp starts up:
    • Log a few more environment variables (MALLOC_* and RUST_*).
    • The filesystem type and available space and inode count are now logged on startup.
  • When mrp is run with --monitor, stages with overly-large outputs (512mb, or 1GB/number of chunks for chunk outputs) will now fail rather than potentially causing mrp to run out of memory while trying to parse them.
  • The stage code parent process, mrjob, now polls memory and IO usage much more frequently and efficiently. This provides a more accurate measurement of peak memory usage for multithreaded stages.
  • When attempting to reattach to an existing pipestance, mrp now verifies that the mro source hasn't changed in "significant" ways.
  • Stage and pipeline _invocation files now only @include the mro file defining that stage or pipeline, rather than the complete set of includes for the whole pipeline.
  • Stage and pipeline _invocation files now properly represent call aliasing (e.g. call FOO as BAR).
  • The python adapter now exposes two new methods, get_threads_allocation and get_memory_allocation, for stages to use in determining what they're allowed to use in cases where their request was large or dynamic.
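The relationship between MRO_PERF_ARGS and the three older variables described in the list above can be sketched as follows. This is an illustration of the documented command shape rather than mrp's own code: the function perf_record_command is hypothetical, the example values (cycles, 99, 60) are made up, and mrp's real defaults for the three variables are not modeled, so they must be present in env when MRO_PERF_ARGS is absent.

```python
import shlex


def perf_record_command(pid, out_path, env):
    """Build the `perf record` argument list described in the note above.

    Hypothetical helper for illustration; not part of Martian.
    """
    if "MRO_PERF_ARGS" in env:
        # MRO_PERF_ARGS overrides the three individual variables.
        extra = env["MRO_PERF_ARGS"]
    else:
        # Equivalent arguments assembled from the individual variables.
        extra = "-g -e {} -F {} sleep {}".format(
            env["MRO_PERF_EVENTS"],
            env["MRO_PERF_FREQ"],
            env["MRO_PERF_DURATION"],
        )
    return ["perf", "record", "-p", str(pid), "-o", out_path] + shlex.split(extra)


# Made-up example values, not Martian defaults.
separate = {
    "MRO_PERF_EVENTS": "cycles",
    "MRO_PERF_FREQ": "99",
    "MRO_PERF_DURATION": "60",
}
combined = {"MRO_PERF_ARGS": "-g -e cycles -F 99 sleep 60"}

cmd_separate = perf_record_command(1234, "_perf.data", separate)
cmd_combined = perf_record_command(1234, "_perf.data", combined)
print(" ".join(cmd_combined))
```

Both dictionaries produce the same argument list, which mirrors the equivalence stated in the release note.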

Improvements to the compiler/parser (mrc) and formatter (mrf)

  • The parser has been extensively rewritten to provide more useful (and correct) line number outputs for errors, especially in cases with complicated webs of @include directives.
  • The parser is now substantially faster, and uses less memory.
  • mrf has a new flag, --includes, which will remove @include directives which are not required, and will attempt to add @include and filetype directives which are missing. It is inspired by the clang iwyu tool.
  • mrf now transforms call modifiers to the new syntax introduced in Martian 3.0. That is,
call volatile FOO(
    ...
)

becomes

call FOO(
    ...
) using (
    volatile = true,
)
  • mrf now sorts keys in map literals.
  • mrf now inserts a trailing comma in map and array literals, e.g.
[
    1,
    2,
]

Bug fixes

  • The web UI now times out stale connections. Previously a buggy or malicious client could introduce a denial of service condition by opening more and more socket connections until the server ran out of file handles....

Martian 3.1.0 release candidate 1

13 Aug 20:56
Pre-release

Martian 3.1

This release extensively reworks Volatile Disk Recovery (VDR), adding two new language keywords (thus the bump of the minor version number) - see below for details. A secondary focus on performance significantly reduces the memory footprint of mrp (especially important for users in cluster mode, where the submit host often may have more constrained resources than the compute nodes). Improvements to developer tools should make authoring of mro source code more convenient, and logging improvements should make debugging easier in the event of failures.

VDR (volatile disk recovery) Changes

VDR has been extensively overhauled. The general changes improve storage high-water-mark for all pipelines, without further modifications. Additionally, two new features have been added to further improve storage utilization and streamline development.

General changes

  • "rolling" is now the default VDR mode.
  • Each stage job (split, chunk, join) now has its own $TEMPDIR, which is cleaned up as soon as that stage phase has completed.
  • If a volatile stage call's output is bound to the top-level pipeline outputs, rather than preventing VDR on that stage from happening at all, only prevent deletion of the files explicitly mentioned in the bound output.
  • When determining whether it is safe to clean up a stage's files, only outputs containing paths to files within the stage's directory hierarchy are considered when looking for downstream stages.
  • Stage metadata files are now accounted for in the storage high-water mark calculation.

New feature: strict-mode volatile

Stages may now declare themselves as being "strict-mode volatile" compatible:

stage FOO(
    in  bam in1,
    out bam bamfile,
    out bai index,
    out json summary,
) using (
    volatile = strict,
)

In this mode, the volatile modifier on the call to the stage is ignored. In addition, rather than VDR being an all-or-nothing affair that doesn't run until all downstream stages have completed, each file in the stage's outputs is evaluated separately. In the example above, if STAGE1 takes bamfile as input and STAGE2 takes summary as input, bamfile can be deleted as soon as STAGE1 completes, rather than waiting for STAGE2 to complete so that both can be deleted. In addition, any files not specified in any of the stage's outputs (most commonly intermediate files generated by the chunks and merged by the join) are deleted as soon as the stage completes. In many cases these changes significantly reduce the storage high-water mark for a pipeline, and obviate the need for some weird hacks such as creating intermediate stages which simply copy selected outputs from another stage in order to allow the earlier stage to be cleaned up.

One important note about this feature is that in many cases stage code may produce files whose existence implies the existence of other files. For example, filename.bam often implies the existence of filename.bai. If a downstream stage does not bind an output which mentions filename.bai then that file may be deleted by the time that stage runs. As another example, if one of the outputs is a file containing a list of other file names, those other files may also be deleted by VDR when in strict mode. This is why the feature is opt-in on a stage-by-stage basis. Any files produced by the stage which downstream stages are expected to read must be listed in the stage outputs, and those downstream stages must take those outputs as inputs.

New feature: "retained" outputs

In some cases, a file produced by a stage isn't part of the formal outputs of a pipeline, but should still not be deleted for other reasons. For example, during debugging one might want to preserve the outputs of one stage in order to have them as an input when re-running a later stage that is being actively developed. As another example, some files may be small enough that the savings from deleting them are too small to justify a reduction in ease of debugging the outputs of a pipeline later. There are two ways to prevent such files from being cleaned up by VDR:

Pipeline retains

pipeline BAR(
    in  bam  input1,
    out bam  output1,
)
{
    call FOO(
        input1 = self.input1,
    )
    call FOO as BAZ(
        input1 = FOO.output1,
    )
    return (
        output1 = BAZ.output1,
    )
    retain (
        FOO.output1
    )
}

This specifies that output1 of this pipeline's call to FOO should never be deleted, for example if one wants to be able to later re-run BAZ. This should be preferred when one wishes to preserve a stage output for development purposes, first of all because it puts the retain directive closer to where the output may be reused later, and second because the stage in question might be called in other cases (such as aliased to BAZ in this example, or from other pipelines) which do not need to retain the output.

Stage retains

stage FOO(
    in  int  input1,
    out bam  output1,
    out json summary,
) using (
    volatile = strict,
) retain (
    summary,
)

This specifies that VDR should never delete summary. This should be used in the case where a file should always be preserved for potential later inspection.

Runtime improvements

User-facing improvements

  • The memory and CPU consumption of mrp has been reduced, especially for very large pipelines, and in cases where stages create large output objects.
  • There is now a timeout, configurable with the --retry-wait command line flag, between when mrp observes a potentially-transient failure and when it retries the failure. In many cases (for example cluster-mode jobs running on a remote machine which was taken offline) the failures are clustered, and waiting a short time allows all of the failures to be dealt with at once. The default wait time is 1 second.
  • The web UI now dynamically lists top-level metadata files (such as log) rather than having a hard-coded list which included files which are often not present before the pipestance completes.
  • The web UI will now show files in the /extras directory of the pipestance. This is intended primarily for outputs of on-finish hooks.
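The reasoning behind the retry wait in the list above can be illustrated with a small model of clustered failures. This is a toy sketch, not mrp's retry logic: the function batch_failures and the timestamps are hypothetical, and it only assumes that failures observed within the wait window of the first failure in a group are handled together.

```python
# Toy model only: NOT how mrp schedules retries. Names are hypothetical.
def batch_failures(failure_times, retry_wait=1.0):
    """Group failure timestamps (in seconds) into retry batches.

    A new batch starts at the first failure seen after the previous
    batch's window (first failure time + retry_wait) has closed.
    """
    batches = []
    for t in sorted(failure_times):
        if batches and t <= batches[-1][0] + retry_wait:
            # Still inside the wait window opened by the batch's first failure.
            batches[-1].append(t)
        else:
            batches.append([t])
    return batches


# Three failures from one node going offline, then two unrelated ones later.
batches = batch_failures([0.0, 0.2, 0.9, 5.0, 5.5], retry_wait=1.0)
print(batches)
```

With a wait of zero every failure would form its own batch, whereas a one-second wait collapses these five failures into two rounds of retries.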

Improvements for Pipeline-Developers

  • mrp can now run stage invocations as well as pipeline invocations. mrs now exists only as a symlink to mrp for backwards compatibility. This eliminates the feature gap between mrs and mrp, including for example restartability and user interface, as well as reducing the maintenance overhead involved in having separate binaries.
  • The default event collection for --profile=perf no longer includes bpf-output events. This was not working in most cases.
  • For users who desire more control over perf profile recording with --profile=perf, the environment variable MRO_PERF_ARGS allows one to specify the command line to perf record. This overrides MRO_PERF_EVENTS, MRO_PERF_FREQ, and MRO_PERF_DURATION. The command that runs will be perf record -p <pid> -o <path/to/job/_perf.data> $MRO_PERF_ARGS. The same behavior as setting those variables can thus be achieved by setting MRO_PERF_ARGS="-g -e $MRO_PERF_EVENTS -F $MRO_PERF_FREQ sleep $MRO_PERF_DURATION".
  • When running with --zip --profile=perf, profiling outputs are no longer included in _metadata.zip.
  • The _mrosource output now includes comments indicating file boundaries from the original source code with merged @includes.
  • More logging about the environment when mrp starts up:
    • Log a few more environment variables (MALLOC_* and RUST_*).
    • The filesystem type and available space and inode count are now logged on startup.
  • When mrp is run with --monitor, stages with overly-large outputs (512mb, or 1GB/number of chunks for chunk outputs) will now fail rather than potentially causing mrp to run out of memory while trying to parse them.
  • The stage code parent process, mrjob, now polls memory and IO usage much more frequently and efficiently. This provides a more accurate measurement of peak memory usage for multithreaded stages.
  • When attempting to reattach to an existing pipestance, mrp now verifies that the mro source hasn't changed in "significant" ways.
  • Stage and pipeline _invocation files now only @include the mro file defining that stage or pipeline, rather than the complete set of includes for the whole pipeline.
  • Stage and pipeline _invocation files now properly represent call aliasing (e.g. call FOO as BAR).
  • The python adapter now exposes two new methods, get_threads_allocation and get_memory_allocation, for stages to use in determining what they're allowed to use in cases where their request was large or dynamic.
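The output size limits quoted in the list above (512mb, or 1GB divided by the number of chunks) work out as in the sketch below. The helper max_outs_bytes is hypothetical, and the sketch assumes binary units (1 MB = 1024 * 1024 bytes) and that the chunk limit applies to each chunk individually; the release note does not state either detail.

```python
MIB = 1024 * 1024  # assumption: the note's "mb"/"GB" are binary units


def max_outs_bytes(num_chunks=None):
    """Return the assumed size limit, in bytes, for one outputs object.

    Hypothetical helper for illustration; not part of Martian.
    num_chunks=None means a non-chunk output (flat 512 MB limit);
    otherwise 1 GB is divided evenly across the chunks.
    """
    if num_chunks is None:
        return 512 * MIB
    return (1024 * MIB) // num_chunks


print(max_outs_bytes())    # non-chunk output
print(max_outs_bytes(4))   # each of 4 chunks
print(max_outs_bytes(64))  # each of 64 chunks
```

The practical consequence is that stages with many chunks have a much tighter per-chunk budget, so large results are better written to files than returned inline in the outputs.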

Improvements to the compiler/parser (mrc) and formatter (mrf)

  • The parser has been extensively rewritten to provide more useful (and correct) line number outputs for errors, especially in cases with complicated webs of @include directives.
  • The parser is now substantially faster, and uses less memory.
  • mrf has a new flag, --includes, which will remove @include directives which are not required, and will attempt to add @include and filetype directives which are missing. It is inspired by the clang iwyu tool.
  • mrf now transforms call modifiers to the new syntax introduced in Martian 3.0. That is,
call volatile FOO(
    ...
)

becomes

call FOO(
    ...
) using (
    volatile = true,
)
  • mrf now sorts keys in map literals.
  • mrf now inserts a trailing comma in map and array literals, e.g.
[
    1,
    2,
]
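The two formatting rules above (sorted map keys and trailing commas) can be mimicked with a short sketch. This is not mrf's implementation: the function format_literal is hypothetical, and the exact layout of map literals (quoted keys, four-space indentation, one entry per line) is an assumption extrapolated from the array example.

```python
import json


def format_literal(value, indent=0):
    """Render nested dicts and lists with sorted keys and trailing commas.

    Hypothetical illustration of the style; not mrf's actual algorithm.
    """
    pad = "    " * indent
    inner = "    " * (indent + 1)
    if isinstance(value, dict):
        if not value:
            return "{}"
        lines = [
            f"{inner}{json.dumps(key)}: {format_literal(value[key], indent + 1)},"
            for key in sorted(value)  # keys are emitted in sorted order
        ]
        return "{\n" + "\n".join(lines) + "\n" + pad + "}"
    if isinstance(value, list):
        if not value:
            return "[]"
        # Every element, including the last, is followed by a comma.
        lines = [f"{inner}{format_literal(item, indent + 1)}," for item in value]
        return "[\n" + "\n".join(lines) + "\n" + pad + "]"
    return json.dumps(value)


print(format_literal([1, 2]))
print(format_literal({"b": 1, "a": [2]}))
```

A trailing comma on the last element means that appending an entry touches only one line in a diff, and sorted keys keep the output stable regardless of the order in which the map was written.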

Bug fixes

  • The web UI now times out stale connections. Previously a buggy or malicious client could introduce a denial ...

v2.3.3: Fix web UI in Firefox.

19 Jul 01:44
This is a manual cherry-pick of d5f6fe982014edf3215f7c07e11d4fa5a773c1c5