Matt Dowle edited this page Sep 29, 2016 · 80 revisions


The R data.table package provides an in-memory columnar structure, just like base R's data.frame (in use since 1997; the structure is ideal and unchanged), but with the following enhancements:

  • fast and friendly file reader: fread. This accepts system commands directly such as grep, gunzip, etc.
  • fast and parallelised file writer: fwrite, from v1.9.8
  • parallelised row subsets from v1.9.8 - See this benchmark for timings
  • fast aggregation of large data; e.g. 100GB in RAM (see benchmarks on up to two billion rows)
  • fast add/update/delete columns by reference by group using no copies at all
  • fast ordered joins; e.g. rolling forwards, backwards, nearest and limited staleness
  • fast overlapping range joins; similar to findOverlaps function from IRanges/GenomicRanges Bioconductor packages, but not limited to genomic (integer) intervals.
  • fast non-equi (or conditional) joins, i.e., joins using operators >, >=, <, <= as well, available from v1.9.8+
  • a fast primary ordered index; e.g. setkey(DT,col1,col2)
  • automatic secondary indexing; e.g. DT[col==val,] and DT[col %in% vals,]
  • fast and memory efficient combined join and group by; by=.EACHI
  • fast reshape2 methods (dcast and melt) without needing reshape2 and its dependency chain installed or loaded
  • group summary results may be many rows (e.g. first and last row by group) and each cell value may itself be a vector/object/function (e.g. unique ids by group as a list column of varying length vectors - this is pretty printed with commas)
  • automatic row numbers built in and exposed via symbol .I
  • convenience symbol .N for the number of rows (usually by group) without the overhead of a function call
  • any R function from any R package can be used in queries not just the subset of functions made available by a database backend
  • has no dependencies at all other than base R itself, for simpler production/maintenance
  • the R dependency is as old as possible for as long as possible and we test against that version; e.g. the next release, v1.9.8, will bump the dependency up from the 4.5-year-old R 2.14.0 to the 3-year-old R 3.0.0.
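
A few of the features listed above can be tried on a toy table. The following is a minimal sketch, assuming data.table is installed; the tables and column names (`grp`, `val`, `quotes`, `trades`) are hypothetical and chosen only for illustration.

```r
library(data.table)

# Hypothetical toy data; names are illustrative only
DT <- data.table(
  grp = c("x", "x", "y", "y", "y"),
  val = c(10, 20, 1, 2, 3)
)

# Add a column by reference, by group: := modifies DT in place
DT[, grp_mean := mean(val), by = grp]

# .N gives the number of rows in each group
counts <- DT[, .N, by = grp]

# .I gives row numbers in DT; here, the first and last row of each group
idx <- DT[, .I[c(1L, .N)], by = grp]$V1

# A cell may itself hold a vector: unique values per group as a list column
uniq <- DT[, .(vals = list(unique(val))), by = grp]

# Delete the column again, also by reference
DT[, grp_mean := NULL]

# Rolling join: for each trade time, take the most recent earlier quote
quotes <- data.table(time = c(1L, 5L, 10L), price = c(100, 101, 102))
trades <- data.table(time = c(3L, 7L, 12L))
rolled <- quotes[trades, on = "time", roll = TRUE]
```

Here `roll = TRUE` carries the last observed quote forward to each trade time; other values of `roll` select backwards, nearest, or limited-staleness behaviour.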

It has a natural syntax:
DT[where, select|update|do, by]
These queries can be chained together just by adding another one on the end:
DT[...][...].
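
As a small illustration of the general form and of chaining, here is a sketch on a hypothetical table (the columns `id` and `value` are assumptions made up for this example):

```r
library(data.table)

# Hypothetical toy table; the column names are illustrative only
DT <- data.table(
  id    = c("a", "a", "b", "b", "c"),
  value = c(1, 2, 3, 4, 5)
)

# DT[where, select|update|do, by]:
# keep rows with value > 1, then sum value within each id
res <- DT[value > 1, .(total = sum(value)), by = id]

# Chaining: append a second query that orders the grouped result
top <- DT[value > 1, .(total = sum(value)), by = id][order(-total)]
```

The second pair of brackets operates on the result of the first, so no intermediate variable is needed.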
See data.table compared to dplyr on Stack Overflow and Quora.

NB : We moved from R-Forge to GitHub in June 2014. Commit and issue history was imported.
Guidelines for filing issues / pull requests: Contribution Guidelines.

As of 11 Mar 2016, data.table continues to be the 2nd largest tag about an R package and the 7th most starred R package on GitHub. It has over 180 CRAN and Bioconductor packages depending on or importing it.
