- get_genome_builds() can now be called to quickly get the genome build without running the whole reformatting.
- format_sumstats(compute_n) now has more methods to compute the effective sample size: "ldsc", "sum", "giant" or "metal".
- format_sumstats(convert_ref_genome) is now implemented. It can perform liftover from GRCh37 to GRCh38 and vice versa, enabling better cohesion between different studies' summary statistics.
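As a rough illustration of how the named effective sample size methods differ, the sketch below uses the formulas conventionally associated with "giant", "metal" and "sum". The helper `effective_n` and its arguments are hypothetical, the formulas are assumptions rather than a copy of the package code, and "ldsc" is left out because its calculation is not described here.

```r
# Hypothetical helper: NOT the package implementation.
# Assumed conventions:
#   giant: 2 / (1/N_CAS + 1/N_CON)
#   metal: 4 / (1/N_CAS + 1/N_CON)
#   sum:   N_CAS + N_CON
effective_n <- function(n_cas, n_con, method = c("giant", "metal", "sum")) {
  method <- match.arg(method)
  switch(method,
    giant = 2 / (1 / n_cas + 1 / n_con),
    metal = 4 / (1 / n_cas + 1 / n_con),
    sum   = n_cas + n_con
  )
}

# Toy case/control counts, vectorised over SNPs
n_cas <- c(1000, 2000)
n_con <- c(3000, 2000)
neff_giant <- effective_n(n_cas, n_con, "giant")
neff_metal <- effective_n(n_cas, n_con, "metal")
neff_sum   <- effective_n(n_cas, n_con, "sum")
```

Under these assumed formulas, the "metal" value is always twice the "giant" value, and both equal or fall below the plain sum whenever cases and controls are unbalanced.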
- check_no_rs_snp can now handle extra information after an RS ID. So if you have rs1234:A:G, that will be separated into two columns.
- check_two_step_col and check_four_step_col, the two checks for when multiple columns are in one, have been updated so that cases where not all SNPs have multiple columns, or where some have more than the expected number, can now be handled.
- Extra mappings for the FRQ column have been added to the mapping file.
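The RS ID handling noted for check_no_rs_snp can be pictured with a small base R sketch. It is not the package code: the output column name `SNP_INFO` and the rule "rs followed by digits, then one separator" are assumptions made for illustration.

```r
# Toy SNP IDs, some carrying extra information after the RS ID
ids <- c("rs1234:A:G", "rs5678", "rs42_T_C")

# RS ID: leading "rs" followed by digits
rs_id <- sub("^(rs[0-9]+).*$", "\\1", ids)

# Everything after the RS ID, with one leading separator removed
extra <- sub("^rs[0-9]+[[:punct:]]?", "", ids)
extra[extra == ""] <- NA_character_  # no extra information present

# Two columns: the clean ID and the trailing information
# (SNP_INFO is a hypothetical column name)
split_ids <- data.frame(SNP = rs_id, SNP_INFO = extra,
                        stringsAsFactors = FALSE)
```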
- check_multi_rs_snp can now handle all punctuation with/without spaces. So if a row contains rs1234,rs5678 or rs1234, rs5678, or uses any punctuation character other than ",", these can be handled.
- format_sumstats(path) can now be passed a dataframe/datatable of the summary statistics directly as well as a path to their saved location.
- Input summary statistics with A0/A1 corresponding to ref/alt can now be handled by the mapping file as well as A1/A2 corresponding to ref/alt.
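A minimal sketch of the kind of splitting described for check_multi_rs_snp: any punctuation character, with or without surrounding spaces, is treated as a separator between RS IDs. This is an illustration in base R, not the package implementation, and keeping only the first ID is an arbitrary choice made for the example.

```r
# Toy SNP column: some rows hold more than one RS ID
snp <- c("rs1234,rs5678", "rs1234, rs5678", "rs1;rs2", "rs9")

# Split on any punctuation character, allowing optional spaces around it
parts <- strsplit(snp, "[[:space:]]*[[:punct:]][[:space:]]*")

# Flag rows that contained more than one ID
multi <- lengths(parts) > 1

# Keep the first ID per row (illustrative choice only)
first_id <- vapply(parts, function(p) p[1], character(1))
```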
- import_sumstats reads GWAS sum stats directly from Open GWAS. Now parallelised and reports how long each dataset took to import/format in total.
- find_sumstats searches Open GWAS for datasets.
- compute_z computes Z-score from P.
- compute_n computes N for all SNPs from a user-defined sample size.
- format_sumstats(ldsc_format=TRUE) ensures sum stats can be fed directly into LDSC without any additional munging.
- read_sumstats, write_sumstats, and download_vcf functions now exported.
- format_sumstats(sort_coordinates=TRUE) sorts results by their genomic coordinates.
- format_sumstats(return_data=TRUE) returns data directly to the user. Data can be returned in data.table (default), GRanges or VRanges format using, for example, format_sumstats(return_format="granges").
- format_sumstats(N_dropNA=TRUE) (default) drops rows where N is missing.
- format_sumstats(snp_ids_are_rs_ids=TRUE) (default) controls whether the input SNP IDs are treated as RS IDs or as arbitrary IDs.
- format_sumstats(write_vcf=TRUE) writes a tabix-indexed VCF file instead of tabular format.
- format_sumstats(save_path=...) lets users decide where their results are saved and what they're named.
- When the save_path indicates it's in tempdir(), a message warns users that these files will be deleted when the R session ends.
- Summary of data is given at the beginning and the end of format_sumstats via report_summary().
- Readability of preview_sumstats() messages improved.
- New checks: standard error (SE) must be >0 and BETA (and other effect columns) must not equal 0:
format_sumstats(pos_se=TRUE,effect_columns_nonzero=TRUE)
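The two new checks amount to row filters on the SE and effect columns. The base R sketch below shows the idea on toy data and keeps the dropped rows in a separate table, loosely mirroring the idea of logging removed SNPs. It is an illustration only: the object names are made up and this is not how the package performs the checks internally.

```r
# Toy summary statistics
sumstats <- data.frame(
  SNP  = c("rs1", "rs2", "rs3", "rs4"),
  BETA = c(0.12, 0, -0.05, 0.30),
  SE   = c(0.02, 0.01, 0, -0.04),
  stringsAsFactors = FALSE
)

# Rows failing the "SE must be > 0" check
bad_se <- !(sumstats$SE > 0)

# Rows failing the "effect must not equal 0" check
zero_effect <- sumstats$BETA == 0

# Keep passing rows; store failing rows separately (a stand-in for a log)
removed <- sumstats[bad_se | zero_effect, ]
kept    <- sumstats[!(bad_se | zero_effect), ]
```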
- Log directory containing all removed SNPs is now available and can be
changed to a different directory by setting:
format_sumstats(log_folder_ind=TRUE,log_folder=tempdir())
- All imputed data can now be identified with a column in the output using:
format_sumstats(imputation_ind=TRUE)
- Users can now input their own mapping file to be used for the column header
mapping in place of data(sumstatsColHeaders). See format_sumstats(mapping_file = mapping_file).
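To make the idea of a column header mapping concrete, here is a small base R sketch that renames input headers using a two-column lookup table. The column names `Uncorrected` and `Corrected`, and the specific mappings shown, are assumptions for illustration. Check data(sumstatsColHeaders) for the layout the package actually expects before building a custom mapping file.

```r
# Hypothetical mapping table: input header -> standardised header
mapping_file <- data.frame(
  Uncorrected = c("RSID", "CHROM", "POS", "FREQ"),
  Corrected   = c("SNP", "CHR", "BP", "FRQ"),
  stringsAsFactors = FALSE
)

# Toy summary statistics with non-standard headers
sumstats <- data.frame(rsid = "rs1234", chrom = "chr1", pos = 1000L,
                       freq = 0.3, P = 0.01, stringsAsFactors = FALSE)

# Upper-case the headers, then look each one up in the mapping table
hdr <- toupper(names(sumstats))
idx <- match(hdr, mapping_file$Uncorrected)

# Use the mapped name where one exists, otherwise keep the header as is
names(sumstats) <- ifelse(is.na(idx), hdr, mapping_file$Corrected[idx])
```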
- CHR column now standardised (X and Y caps, no "chr" prefix).
- Allele flipping done on a per-SNP basis (instead of whole-column).
- Allele flipping now includes FRQ column as well as effect columns.
- The effect allele is now interpreted as the A2 allele consistent with IEU GWAS VCF approach. A1 will always be the reference allele.
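The CHR standardisation and per-SNP allele flipping described in these items can be sketched in base R as follows. The reference allele vector `ref` and the toy values are hypothetical, and the package's own logic may differ in detail. The sketch only shows that flipping is decided row by row and applies to both the effect column and FRQ.

```r
# Toy summary statistics (A1 should end up as the reference allele)
sumstats <- data.frame(
  CHR  = c("chr1", "chrx", "Y", "22"),
  A1   = c("A", "G", "C", "T"),
  A2   = c("G", "A", "T", "C"),
  BETA = c(0.1, 0.2, -0.3, 0.4),
  FRQ  = c(0.2, 0.7, 0.5, 0.9),
  stringsAsFactors = FALSE
)
ref <- c("A", "A", "C", "C")  # hypothetical reference alleles per SNP

# Standardise CHR: drop any "chr" prefix, upper-case X and Y
sumstats$CHR <- toupper(sub("^chr", "", sumstats$CHR, ignore.case = TRUE))

# Decide flipping per SNP: A2 matches the reference but A1 does not
flip <- sumstats$A1 != ref & sumstats$A2 == ref

# Swap alleles, negate the effect and complement the frequency
# only for the rows that need it
tmp <- sumstats$A1[flip]
sumstats$A1[flip]   <- sumstats$A2[flip]
sumstats$A2[flip]   <- tmp
sumstats$BETA[flip] <- -sumstats$BETA[flip]
sumstats$FRQ[flip]  <- 1 - sumstats$FRQ[flip]
```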
- read_vcf upgraded to account for more VCF formats.
- check_n_num now accounts for situations where N is a character vector and converts to numeric.
- Preprint publication citation added.
- MungeSumstats released to Bioconductor.