Martian 3.1.0 release candidate 1

Pre-release
@adam-azarchs adam-azarchs released this 13 Aug 20:56

Martian 3.1

This release extensively reworks Volatile Disk Recovery (VDR), adding two new language keywords (hence the bump of the minor version number); see below for details. A secondary focus on performance significantly reduces the memory footprint of mrp, which is especially important for users in cluster mode, where the submit host often has more constrained resources than the compute nodes. Improvements to developer tools should make authoring mro source code more convenient, and logging improvements should make debugging easier in the event of failures.

VDR (volatile disk recovery) Changes

VDR has been extensively overhauled. The general changes lower the storage high-water mark for all pipelines, with no modifications to pipeline code required. Additionally, two new features have been added to further improve storage utilization and streamline development.

General changes

  • "rolling" is now the default VDR mode.
  • Each stage job (split, chunk, join) now has its own $TEMPDIR, which is cleaned up as soon as that stage phase has completed.
  • If a volatile stage call's output is bound to the top-level pipeline outputs, VDR is no longer disabled for that stage entirely. Instead, only the files explicitly mentioned in the bound output are protected from deletion.
  • When determining whether it is safe to clean up a stage's files, only outputs containing paths to files within the stage's directory hierarchy are considered when looking for downstream stages.
  • Stage metadata files are now accounted for in the storage high-water mark calculation.
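
The "paths within the stage's directory hierarchy" rule can be pictured with a small Python sketch. This is only an illustration of the idea, not mrp's implementation: the function name paths_within and the example directory layout are hypothetical, and the sketch assumes outputs are JSON-like values whose file references are absolute path strings.

```python
import os


def paths_within(value, stage_dir):
    """Collect strings in a nested output value that name paths inside stage_dir.

    Hypothetical helper for illustration; mrp's real check may differ.
    """
    root = os.path.abspath(stage_dir)
    found = []

    def visit(v):
        if isinstance(v, str):
            # In this model only absolute paths can point into the stage directory.
            if os.path.isabs(v):
                p = os.path.normpath(v)
                if os.path.commonpath([root, p]) == root:
                    found.append(p)
        elif isinstance(v, dict):
            for item in v.values():
                visit(item)
        elif isinstance(v, (list, tuple)):
            for item in v:
                visit(item)

    visit(value)
    return found
```

Under this model, an output such as an integer count or a path to a shared reference file contributes nothing, so a downstream stage consuming only those outputs would not hold up cleanup of the stage's files.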

New feature: strict-mode volatile

Stages may now declare themselves as being "strict-mode volatile" compatible:

stage FOO(
    in  bam in1,
    out bam bamfile,
    out bai index,
    out json summary,
) using (
    volatile = strict,
)

In this mode, the volatile modifier on the call to the stage is ignored. In addition, rather than VDR being an all-or-nothing affair that doesn't run until all downstream stages have completed, each file in the stage's outputs is evaluated separately. In the example above, if STAGE1 takes bamfile as input and STAGE2 takes summary as input, bamfile can be deleted as soon as STAGE1 completes, rather than waiting for STAGE2 to complete so that both can be deleted. In addition, any files not specified in any of the stage's outputs (most commonly intermediate files generated by the chunks and merged by the join) are deleted as soon as the stage completes. In many cases these changes significantly reduce the storage high-water mark for a pipeline, and obviate the need for workarounds such as creating intermediate stages which simply copy selected outputs from another stage in order to allow the earlier stage to be cleaned up.
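
The difference between per-output and all-or-nothing cleanup can be sketched as a toy model in Python. This is not how mrp is implemented; the function names are hypothetical, and the model assumes every output has at least one downstream consumer and that stage completion order is known.

```python
def release_order(consumers, completion_order):
    """Map each output to the stage whose completion makes it deletable.

    consumers: {output_name: set of downstream stage names that read it}.
    completion_order: downstream stage names in the order they finish.
    Toy model of per-output (strict-mode) evaluation.
    """
    finished = set()
    released = {}
    for stage in completion_order:
        finished.add(stage)
        for name, stages in consumers.items():
            # An output is deletable once every stage that reads it is done.
            if name not in released and set(stages) <= finished:
                released[name] = stage
    return released


def all_or_nothing_release(consumers, completion_order):
    """Return the stage after which everything is deletable at once."""
    everyone = set().union(*consumers.values())
    finished = set()
    for stage in completion_order:
        finished.add(stage)
        if everyone <= finished:
            return stage
    return None
```

With consumers = {"bamfile": {"STAGE1"}, "summary": {"STAGE2"}} and STAGE1 finishing first, release_order frees bamfile after STAGE1, while the all-or-nothing model holds both outputs until STAGE2 finishes.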

One important note about this feature is that in many cases stage code may produce files where their existence implies the existence of other files. For example, filename.bam often implies the existence of filename.bai. If a downstream stage does not bind an output which mentions filename.bai then that file may be deleted by the time that stage runs. As another example, if one of the outputs is a file containing a list of other file names, those other file names may also be deleted by VDR when in strict mode. This is why the feature is opt-in on a stage-by-stage basis. Any files produced by the stage which downstream stages are expected to read must be listed in the stage outputs, and those downstream stages must take those outputs as inputs.
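
Before opting a stage in to strict mode, it may help to check which files the stage writes that none of its outputs mention. The Python sketch below is a hypothetical developer-side check, not part of Martian: the name unreferenced_files is made up, and it assumes outputs are JSON-like values whose strings are file or directory paths.

```python
import os
import tempfile


def unreferenced_files(files_dir, outs):
    """List files under files_dir that no output value names.

    A file counts as referenced if an output names it directly or names
    one of its parent directories. Hypothetical helper for illustration.
    """
    referenced = set()

    def visit(v):
        if isinstance(v, str):
            referenced.add(os.path.abspath(v))
        elif isinstance(v, dict):
            for item in v.values():
                visit(item)
        elif isinstance(v, (list, tuple)):
            for item in v:
                visit(item)

    visit(outs)

    def is_referenced(path):
        return any(path == r or path.startswith(r + os.sep) for r in referenced)

    missing = []
    for dirpath, _dirs, names in os.walk(files_dir):
        for name in names:
            path = os.path.abspath(os.path.join(dirpath, name))
            if not is_referenced(path):
                missing.append(path)
    return sorted(missing)
```

In the filename.bam example, a stage whose outputs mention only the bam file would have filename.bai reported here, which is a hint that the index should be declared as its own output. The check cannot see files named only inside the contents of another file, such as a list of file names.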

New feature: "retained" outputs

In some cases, a file produced by a stage isn't part of the formal outputs of a pipeline, but should still not be deleted for other reasons. For example, during debugging one might want to preserve the outputs of one stage in order to have them as an input when re-running a later stage that is being actively developed. As another example, some files may be small enough that the savings from deleting them are too small to justify making the outputs of a pipeline harder to debug later. There are two ways to prevent such files from being cleaned up by VDR:

Pipeline retains

pipeline BAR(
    in  bam  input1,
    out bam  output1,
)
{
    call FOO(
        input1 = self.input1,
    )
    call FOO as BAZ(
        input1 = FOO.output1,
    )
    return (
        output1 = BAZ.output1,
    )
    retain (
        FOO.output1
    )
}

This specifies that output1 of this pipeline's call to FOO should never be deleted, for example if one wants to be able to later re-run BAZ. This should be preferred when one wishes to preserve a stage output for development purposes, first because it puts the retain directive closer to where the output may be reused later, and second because the stage in question might be called in other contexts (such as aliased to BAZ in this example, or from other pipelines) which do not need to retain the output.

Stage retains

stage FOO(
    in  int  input1,
    out bam  output1,
    out json summary,
) using (
    volatile = strict,
) retain (
    summary,
)

This specifies that VDR should never delete summary. This should be used in the case where a file should always be preserved for potential later inspection.

Runtime improvements

User-facing improvements

  • The memory and CPU consumption of mrp has been reduced, especially for very large pipelines, and in cases where stages create large output objects.
  • There is now a wait time, configurable with the --retry-wait command line flag, between when mrp observes a potentially-transient failure and when it retries the failed job. In many cases (for example cluster-mode jobs running on a remote machine which was taken offline) the failures are clustered, and waiting a short time allows all of the failures to be dealt with at once. The default wait time is 1 second.
  • The web UI now dynamically lists top-level metadata files (such as log) rather than having a hard-coded list which included files which are often not present before the pipestance completes.
  • The web UI will now show files in the /extras directory of the pipestance. This is intended primarily for outputs of on-finish hooks.

Improvements for Pipeline-Developers

  • mrp can now run stage invocations as well as pipeline invocations. mrs now exists only as a symlink to mrp for backwards compatibility. This eliminates the feature gap between mrs and mrp, including for example restartability and user interface, as well as reducing the maintenance overhead involved in having separate binaries.
  • The default event collection for --profile=perf no longer includes bpf-output events. This was not working in most cases.
  • For users who desire more control over perf profile recording with --profile=perf, the environment variable MRO_PERF_ARGS allows one to specify the command line passed to perf record. This overrides MRO_PERF_EVENTS, MRO_PERF_FREQ, and MRO_PERF_DURATION. The command that runs will be perf record -p <pid> -o <path/to/job/_perf.data> $MRO_PERF_ARGS. The same behavior as setting those variables can thus be achieved by setting MRO_PERF_ARGS="-g -e $MRO_PERF_EVENTS -F $MRO_PERF_FREQ sleep $MRO_PERF_DURATION".
  • When running with --zip --profile=perf, profiling outputs are no longer included in _metadata.zip.
  • The _mrosource output now includes comments indicating file boundaries from the original source code with merged @includes.
  • More logging about the environment when mrp starts up:
    • Log a few more environment variables (MALLOC_* and RUST_*).
    • The filesystem type and available space and inode count are now logged on startup.
  • When mrp is run with --monitor, stages with overly-large outputs (512MB, or 1GB/number of chunks for chunk outputs) will now fail rather than potentially causing mrp to run out of memory while trying to parse them.
  • The stage code parent process, mrjob, now polls memory and IO usage much more frequently and efficiently. This provides a more accurate measurement of peak memory usage for multithreaded stages.
  • When attempting to reattach to an existing pipestance, verify that the mro source hasn't changed in "significant" ways.
  • Stage and pipeline _invocation files now only @include the mro file defining that stage or pipeline, rather than the complete set of includes for the whole pipeline.
  • Stage and pipeline _invocation files now properly represent call aliasing (e.g. call FOO as BAR).
  • The python adapter now exposes two new methods, get_threads_allocation and get_memory_allocation, for stages to use in determining what they're allowed to use in cases where their request was large or dynamic.
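
The relationship between MRO_PERF_ARGS and the three older variables described above can be sketched in Python. This only restates the command line given in these notes; the function name perf_record_command is hypothetical, the example variable values are made up, and shell-style splitting of MRO_PERF_ARGS is an assumption rather than documented behavior.

```python
import shlex


def perf_record_command(pid, data_path, env):
    """Build the perf record argv described in the release notes.

    env is a dict of environment variables. Hypothetical helper; how
    mrp actually tokenizes MRO_PERF_ARGS is an assumption here.
    """
    base = ["perf", "record", "-p", str(pid), "-o", data_path]
    if env.get("MRO_PERF_ARGS"):
        # MRO_PERF_ARGS overrides the three individual variables.
        return base + shlex.split(env["MRO_PERF_ARGS"])
    return base + [
        "-g",
        "-e", env["MRO_PERF_EVENTS"],
        "-F", env["MRO_PERF_FREQ"],
        "sleep", env["MRO_PERF_DURATION"],
    ]
```

For example, with MRO_PERF_EVENTS=cycles, MRO_PERF_FREQ=200 and MRO_PERF_DURATION=60 (illustrative values only), setting MRO_PERF_ARGS="-g -e cycles -F 200 sleep 60" yields the same command as leaving it unset.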

Improvements to the compiler/parser (mrc) and formatter (mrf)

  • The parser has been extensively rewritten to provide more useful (and correct) line number outputs for errors, especially in cases with complicated webs of @include directives.
  • The parser is now substantially faster, and uses less memory.
  • mrf has a new flag, --includes, which will remove @include directives which are not required, and will attempt to add @include and filetype directives which are missing. It is inspired by the clang iwyu tool.
  • mrf now transforms call modifiers to the new syntax introduced in Martian 3.0. That is,
call volatile FOO(
    ...
)

becomes

call FOO(
    ...
) using (
    volatile = true,
)
  • mrf now sorts keys in map literals.
  • mrf now inserts a trailing comma in map and array literals, e.g.
[
    1,
    2,
]
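
The two literal-formatting rules above (sorted map keys, trailing commas) can be mimicked with a short Python sketch. This is not mrf's implementation: the function name format_literal is hypothetical, and details such as four-space indentation, one element per line, and the rendering of empty literals are assumptions based on the array example shown.

```python
import json


def format_literal(value, indent=0):
    """Render a nested list/dict in an mrf-like style.

    Map keys are sorted and every element is followed by a trailing
    comma. Hypothetical helper; mrf's exact layout rules may differ.
    """
    pad = " " * 4 * (indent + 1)
    close = " " * 4 * indent
    if isinstance(value, dict):
        if not value:
            return "{}"
        lines = [
            f"{pad}{json.dumps(k)}: {format_literal(value[k], indent + 1)},"
            for k in sorted(value)
        ]
        return "{\n" + "\n".join(lines) + "\n" + close + "}"
    if isinstance(value, (list, tuple)):
        if not value:
            return "[]"
        lines = [f"{pad}{format_literal(v, indent + 1)}," for v in value]
        return "[\n" + "\n".join(lines) + "\n" + close + "]"
    # Scalars use JSON spelling (true, false, null, quoted strings).
    return json.dumps(value)
```

Calling format_literal([1, 2]) reproduces the array layout shown above, and a map such as {"b": 1, "a": 2} comes out with "a" before "b".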

Bug fixes

  • The web UI now times out stale connections. Previously a buggy or malicious client could introduce a denial of service condition by opening more and more socket connections until the server ran out of file handles.
  • Fix type validation for array outputs containing null elements.
  • Fix a bug which made the --auth-key flag not do anything.
  • When reattaching to a pipestance with a _metadata.zip, ignore failures in unzipping _metadata.zip caused by files already existing.
  • mrc warnings now output to stderr, not stdout.
  • mrc now correctly exits with a non-zero return code when run on individual files.