Replace temp files with python scripts #40

Open
sgsutcliffe opened this issue Feb 11, 2025 · 0 comments

Many processes perform file manipulation in bash scripts, which creates tmp files along the way. It would be more efficient and scalable to wrap this logic in a Python script.

For example, in the current release (gasnomenclature 0.3.0), APPEND_CLUSTERS() has this script block:

    script:
    """
    # Function to get the first address line from the files, handling gzipped files
    get_address() {
        if [[ "\${1##*.}" == "gz" ]]; then
            zcat "\$1" | awk 'NR>1 {print \$2}' | head -n 1
        else
            awk 'NR>1 {print \$2}' "\$1" | head -n 1
        fi
    }

    # Check if two files have consistent delimiter splits in the address column
    init_splits=\$(get_address "${initial_clusters}" | awk -F '${params.gm_delimiter}' '{print NF}')
    add_splits=\$(get_address "${additional_clusters}" | awk -F '${params.gm_delimiter}' '{print NF}')

    if [ "\$init_splits" != "\$add_splits" ]; then
        echo "Error: Address levels do not match between initial_clusters and --db_clusters."
        exit 1
    fi

    # Add a "source" column to differentiate the reference profiles and additional profiles
    csvtk mutate2 -t -n source -e " 'ref' " ${initial_clusters} > reference_clusters_source.tsv
    csvtk mutate2 -t -n source -e " 'db' " ${additional_clusters} > additional_clusters_source.tsv

    # Combine profiles from both the reference and database into a single file
    csvtk concat -t reference_clusters_source.tsv additional_clusters_source.tsv | csvtk sort -t -k id > combined_profiles.tsv

    # Calculate the frequency of each sample_id across both sources
    csvtk freq -t -f id combined_profiles.tsv > sample_counts.tsv

    # For any sample_id that appears in both the reference and database, add a 'db_' prefix to the sample_id from the database
    csvtk join -t -f id combined_profiles.tsv sample_counts.tsv | \
    csvtk mutate2 -t -n id -e '(\$source == "db" && \$frequency > 1) ? "db_" + \$id : \$id' | \
    csvtk cut -t -f id,address > reference_clusters.tsv
    """
}
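
A rough sketch of what a Python replacement for the block above could look like, using only the standard library and writing a single output file with no intermediates. Assumptions: the input columns are named `id` and `address` (the bash version takes the address by position, as the second column), the function names `read_clusters` and `append_clusters` are hypothetical, and rows with equal ids keep reference rows ahead of database rows. This has not been run against the pipeline.

```python
import csv
import gzip
from collections import Counter


def read_clusters(path):
    """Read a tab-separated cluster file (optionally gzipped) into a list of dicts."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def address_levels(rows, delimiter):
    """Number of delimiter-separated levels in the first row's address, or None if empty."""
    if not rows:
        return None
    return len(rows[0]["address"].split(delimiter))


def append_clusters(initial_path, additional_path, delimiter, out_path):
    """Combine reference and database clusters, prefixing duplicated database ids with 'db_'."""
    ref_rows = read_clusters(initial_path)
    db_rows = read_clusters(additional_path)

    # Same consistency check as the bash version: compare the first address of each file
    if address_levels(ref_rows, delimiter) != address_levels(db_rows, delimiter):
        raise ValueError(
            "Address levels do not match between initial_clusters and --db_clusters."
        )

    # Frequency of each id across both sources (replaces sample_counts.tsv)
    counts = Counter(row["id"] for row in ref_rows + db_rows)

    # Tag each row with its source, then sort by the original id.
    # list.sort is stable, so reference rows stay ahead of database rows.
    tagged = [("ref", row) for row in ref_rows] + [("db", row) for row in db_rows]
    tagged.sort(key=lambda item: item[1]["id"])

    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["id", "address"])
        for source, row in tagged:
            sample_id = row["id"]
            # Only database rows whose id occurs more than once get the prefix
            if source == "db" and counts[sample_id] > 1:
                sample_id = "db_" + sample_id
            writer.writerow([sample_id, row["address"]])
```

In the process, the script block would then reduce to a single call, something like `append_clusters("${initial_clusters}", "${additional_clusters}", "${params.gm_delimiter}", "reference_clusters.tsv")`, wrapped in whatever CLI the script ends up exposing.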