Replace temp files with python scripts #40

sgsutcliffe · 2025-02-11T15:29:02Z

Many processes perform file manipulation in bash scripts, which creates temporary files along the way. It would be more efficient and scalable to wrap this logic in a Python script.

For example, in the current release of gasnomenclature (0.3.0), APPEND_CLUSTERS() has the following script block:

    script:
    """
    # Function to get the first address line from the files, handling gzipped files
    get_address() {
        if [[ "\${1##*.}" == "gz" ]]; then
            zcat "\$1" | awk 'NR>1 {print \$2}' | head -n 1
        else
            awk 'NR>1 {print \$2}' "\$1" | head -n 1
        fi
    }

    # Check if two files have consistent delimiter splits in the address column
    init_splits=\$(get_address "${initial_clusters}" | awk -F '${params.gm_delimiter}' '{print NF}')
    add_splits=\$(get_address "${additional_clusters}" | awk -F '${params.gm_delimiter}' '{print NF}')

    if [ "\$init_splits" != "\$add_splits" ]; then
        echo "Error: Address levels do not match between initial_clusters and --db_clusters."
        exit 1
    fi

    # Add a "source" column to differentiate the reference profiles and additional profiles
    csvtk mutate2 -t -n source -e " 'ref' " ${initial_clusters} > reference_clusters_source.tsv
    csvtk mutate2 -t -n source -e " 'db' " ${additional_clusters} > additional_clusters_source.tsv

    # Combine profiles from both the reference and database into a single file
    csvtk concat -t reference_clusters_source.tsv additional_clusters_source.tsv | csvtk sort -t -k id > combined_profiles.tsv

    # Calculate the frequency of each sample_id across both sources
    csvtk freq -t -f id combined_profiles.tsv > sample_counts.tsv

    # For any sample_id that appears in both the reference and database, add a 'db_' prefix to the sample_id from the database
    csvtk join -t -f id combined_profiles.tsv sample_counts.tsv | \
    csvtk mutate2 -t -n id -e '(\$source == "db" && \$frequency > 1) ? "db_" + \$id : \$id' | \
    csvtk cut -t -f id,address > reference_clusters.tsv
    """
}
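
As a rough illustration of the proposal, the block above could be collapsed into a single Python function that works in memory and writes only the final output. This is a minimal sketch, not the pipeline's actual code: the function names (`read_clusters`, `address_levels`, `append_clusters`) are hypothetical, and it assumes both inputs are tab-separated files with `id` and `address` header columns, as the `csvtk cut -f id,address` step suggests. The original reads the address from the second column by position, so that detail would need checking.

```python
import csv
import gzip
from collections import Counter


def read_clusters(path):
    """Read a tab-separated cluster file (optionally gzipped) into a list of dicts."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def address_levels(rows, delimiter):
    """Return the number of levels in the first address, or None if there are no rows."""
    # Assumption: the address lives in a column named "address".
    return len(rows[0]["address"].split(delimiter)) if rows else None


def append_clusters(initial_path, additional_path, output_path, delimiter):
    """Merge reference and database clusters without intermediate files."""
    ref_rows = read_clusters(initial_path)
    db_rows = read_clusters(additional_path)

    # Same consistency check as the bash version: compare address depth.
    if address_levels(ref_rows, delimiter) != address_levels(db_rows, delimiter):
        raise ValueError(
            "Address levels do not match between initial_clusters and --db_clusters."
        )

    # Count how often each sample id occurs across both sources.
    counts = Counter(row["id"] for row in ref_rows + db_rows)

    # Tag each row with its source, then sort by the original id.
    # Python's sort is stable, so reference rows stay ahead of database rows.
    tagged = [(row, "ref") for row in ref_rows] + [(row, "db") for row in db_rows]
    tagged.sort(key=lambda item: item[0]["id"])

    with open(output_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["id", "address"])
        for row, source in tagged:
            sample_id = row["id"]
            # Prefix database ids that occur more than once overall.
            if source == "db" and counts[sample_id] > 1:
                sample_id = "db_" + sample_id
            writer.writerow([sample_id, row["address"]])
```

A script like this could live in the pipeline's `bin/` directory, which Nextflow adds to the `PATH` of each task, and be called from the process with `${initial_clusters}`, `${additional_clusters}` and `${params.gm_delimiter}` as arguments. The sketch holds both tables in memory, which is an assumption about file sizes; very large inputs might need a streaming approach instead.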
