Home
The R data.table package provides an in-memory columnar structure just like base R's data.frame, which has existed since 1997 (the structure is ideal and unchanged), but with the following enhancements:
- fast and friendly file reader: fread. This accepts system commands directly such as grep, gunzip, etc.
- fast and parallelised file writer: fwrite, from v1.9.8
- parallelised row subsets from v1.9.8 - See this benchmark for timings
- fast aggregation of large data; e.g. 100GB in RAM (see benchmarks on up to two billion rows)
- fast add/update/delete columns by reference by group using no copies at all
- fast ordered joins; e.g. rolling forwards, backwards, nearest and limited staleness
- fast overlapping range joins; similar to the findOverlaps function from the IRanges/GenomicRanges Bioconductor packages, but not limited to genomic (integer) intervals
- fast non-equi (or conditional) joins, i.e., joins using the operators >, >=, <, <= as well, available from v1.9.8+
- a fast primary ordered index; e.g. setkey(DT,col1,col2)
- automatic secondary indexing; e.g. DT[col==val,] and DT[col %in% vals,]
- fast and memory efficient combined join and group by; by=.EACHI
- fast reshape2 methods (dcast and melt) without needing reshape2 and its dependency chain installed or loaded
- group summary results may be many rows (e.g. first and last row by group) and each cell value may itself be a vector/object/function (e.g. unique ids by group as a list column of varying length vectors - this is pretty printed with commas)
- automatic row numbers built in and exposed via the symbol .I
- convenience symbol .N for the number of rows (usually by group) without the overhead of a function call
- any R function from any R package can be used in queries, not just the subset of functions made available by a database backend
- has no dependencies at all other than base R itself, for simpler production/maintenance
- the R dependency is kept as old as possible for as long as possible, and we test against that version; e.g. the next release, v1.9.8, will bump the dependency from the 4.5-year-old R 2.14.0 to the 3-year-old R 3.0.0.
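A few of the enhancements listed above can be sketched in a short session. This is a minimal illustration, not a benchmark: the tables `trades` and `quotes` and all their column names are hypothetical, and the non-equi join assumes v1.9.8 or later.

```r
library(data.table)

# Hypothetical toy data: trades and quotes for two ids
trades <- data.table(id = c("A", "A", "B"), time = c(3L, 7L, 5L), qty = c(10L, 20L, 30L))
quotes <- data.table(id = c("A", "A", "B"), time = c(1L, 6L, 2L), price = c(100, 101, 200))

# Add a column by reference, by group (trades is modified in place, not copied)
trades[, total_qty := sum(qty), by = id]

# Primary ordered index on quotes
setkey(quotes, id, time)

# Rolling join: for each trade, take the most recent quote at or before its time
rolled <- quotes[trades, on = .(id, time), roll = TRUE]

# Non-equi join combined with group by: count the quotes strictly
# earlier than each trade, one result row per row of trades
n_earlier <- quotes[trades, on = .(id, time < time), .N, by = .EACHI]

# .N and .I: number of rows per group, and the row number of each group's last row
counts <- trades[, .N, by = id]
last_rows <- trades[, .I[.N], by = id]

# List column: unique quote times per id as vectors of varying length
times_by_id <- quotes[, .(times = list(unique(time))), by = id]
```

Printing `times_by_id` shows the list column with its values separated by commas.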
It has a natural syntax: DT[where, select|update|do, by]
These queries can be chained together just by adding another one on the end: DT[...][...]
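As a small sketch of the where / select / by form and of chaining, assuming a hypothetical table `DT` with columns `grp` and `x`:

```r
library(data.table)

# Hypothetical example table
DT <- data.table(grp = c("a", "a", "b", "b", "b"), x = 1:5)

# where: x > 1; select/do: sum of x; by: grp
sums <- DT[x > 1, .(total = sum(x)), by = grp]

# Chaining: each further [...] queries the result of the previous one.
# Here the grouped sums are ordered by decreasing total, then the first row is kept.
top <- DT[x > 1, .(total = sum(x)), by = grp][order(-total)][1]
```

Each link in the chain returns a data.table, which is why another query can simply be appended.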
See data.table compared to dplyr on Stack Overflow and Quora.
NB: We moved from R-Forge to GitHub in June 2014. Commit and issue history was imported.
Guidelines for filing issues / pull requests: Contribution Guidelines.
As of 11 Mar 2016, data.table continues to be the 2nd largest tag about an R package and the 7th most starred R package on GitHub. It has over 180 CRAN and Bioconductor packages depending on or importing it.