data.table {data.table}  R Documentation 
data.table inherits from data.frame. It offers fast subset, fast grouping, fast update, fast ordered joins and list columns in a short and flexible syntax, for faster development. It is inspired by A[B] syntax in R where A is a matrix and B is a 2-column matrix. Since a data.table is a data.frame, it is compatible with R functions and packages that only accept data.frame.
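The A[B] idiom mentioned above can be tried in base R alone. The following is a minimal sketch; the names A and B come from the text, while the values are made up for illustration.

```r
# Base R only: indexing a matrix A by a 2-column matrix B.
# Each row of B is a (row, column) pair, so A[B] returns one element per row of B.
A <- matrix(1:12, nrow = 3, ncol = 4)   # filled column by column
B <- cbind(c(1, 3, 2), c(2, 4, 1))      # the pairs (1,2), (3,4) and (2,1)
res <- A[B]
print(res)
```

Loosely speaking, DT[i] on a keyed data.table plays a similar role: each row of i selects the matching rows of DT.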
The 10 minute quick start guide to data.table may be a good place to start: vignette("datatable-intro"). Or, the first section of FAQs is intended to be read from start to finish and is considered core documentation: vignette("datatable-faq"). If you have read and searched these documents and the help page below, please feel free to ask questions on datatable-help or the Stack Overflow data.table tag. To report a bug please type: bug.report(package="data.table").
Please check the homepage for up to the minute news.
Tip: one of the quickest ways to learn the features is to type example(data.table) and study the output at the prompt.
data.table(..., keep.rownames=FALSE, check.names=FALSE, key=NULL)

## S3 method for class 'data.table'
x[i, j, by, keyby, with=TRUE,
  nomatch = getOption("datatable.nomatch"),   # default: NA_integer_
  mult = "all", roll = FALSE, rolltolast = FALSE,
  which = FALSE, .SDcols,
  verbose=getOption("datatable.verbose"),     # default: FALSE
  drop=NULL]
... 
Just as 
keep.rownames 
If 
check.names 
Just as 
key 
Character vector of one or more column names which is passed to 
x 
A 
i 
Integer, logical or character vector, expression of column names,
integer and logical vectors work the same way they do in [.data.frame.
character is matched to the first column of x's key.
expression is evaluated within the frame of the data.table.
When Advanced: When Advanced: When Advanced: When 
j 
A single column name, single expression of column names,

by 
A single unquoted column name, a The When Advanced: Aggregation for a subset of known groups is particularly efficient when passing those groups in Advanced: When grouping by
Advanced: In the 
keyby 
An ad hoc by just as 
with 
By default 
nomatch 
Same as 
mult 
When multiple rows in 
roll 
Applies to the last join column, generally a date but can be any ordered variable, irregular and including gaps. If 
rolltolast 
Like 
which 

.SDcols 
Advanced. Specifies the columns of 
verbose 

drop 
Never used by 
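The join-related arguments are easiest to see on a tiny keyed table. The sketch below assumes a reasonably recent data.table; the table, its column names (id, day, price) and the lookup table named lookup are hypothetical and made up for illustration.

```r
library(data.table)

# Hypothetical price table keyed by (id, day); the values are made up.
DT <- data.table(id    = c("a", "a", "a", "b"),
                 day   = c(1L, 3L, 6L, 2L),
                 price = c(10, 11, 12, 20))
setkey(DT, id, day)

# Rows to look up: day 3 exists for "a", day 4 does not.
lookup <- data.table(id = "a", day = c(3L, 4L))

DT[lookup]                   # unmatched row is kept, with NA price (the default nomatch)
DT[lookup, nomatch = 0]      # unmatched row is dropped
DT[lookup, roll = TRUE]      # day 4 takes the prevailing (day 3) price
DT[lookup, which = TRUE]     # row numbers of DT rather than data
DT[.("a"), mult = "first"]   # first matching row only

# .SDcols restricts which columns appear in .SD
DT[, lapply(.SD, sum), by = id, .SDcols = "price"]
```

Here roll = TRUE only affects the last join column (day), which is why the id column still has to match exactly.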
data.table builds on base R functionality to reduce 2 types of time :
programming time (easier to write, read, debug and maintain)
compute time
It combines database-like operations such as subset, with and by, and provides joins similar to those of merge, but faster. This is achieved by using R's column-based ordered in-memory data.frame structure, eval within the environment of a list, the [.data.table mechanism to condense the features, and compiled C to make certain operations fast.
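The phrase "eval within the environment of a list" can be illustrated in base R. This is a sketch of the general idea only, not data.table's actual implementation; the names cols and expr are hypothetical.

```r
# Base R only: evaluate an unquoted expression using the elements of a list
# as if they were variables, which is how column names become visible in j.
cols <- list(x = c("a", "b", "b"), v = 1:3)
expr <- quote(sum(v[x == "b"]))
res  <- eval(expr, cols)   # v and x are found inside cols
print(res)
```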
The package can be used just for rapid programming (compact syntax). Largest compute time benefits are on 64-bit platforms with plentiful RAM, or when smaller datasets are repeatedly queried within a loop, or when other methods use so much working memory that they fail with an out-of-memory error.
As with [.data.frame, compound queries can be concatenated on one line; e.g.,

    DT[,sum(v),by=colA][V1<300][tail(order(V1))]
    # sum(v) by colA then return the 6 largest which are under 300

The j expression does not have to return data; e.g.,

    DT[,plot(colB,colC),by=colA]
    # produce a set of plots (likely to pdf) returning no data

Multiple data.tables (e.g. X, Y and Z) can be joined in many ways; e.g.,

    X[Y][Z]
    X[Z][Y]
    X[Y[Z]]
    X[Z[Y]]

A data.table is a list of vectors, just like a data.frame. However :
it never has rownames. Instead it may have one key of one or more columns. This key can be used for row indexing instead of rownames.
it has enhanced functionality in [.data.table for fast joins of keyed tables, fast aggregation, fast last observation carried forward (LOCF) and fast add/modify/delete of columns by reference with no copy at all.
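The two points above can be sketched as follows, assuming a reasonably recent data.table. The table and its column names (id, v, w) are hypothetical.

```r
library(data.table)

DT <- data.table(id = c("b", "a", "c"), v = 1:3)

# A key takes the place of rownames: setkey() sorts DT by id and marks id as the key.
setkey(DT, id)
key(DT)
DT["c"]                # row lookup by key value

# Columns are added, modified and deleted by reference; DT is changed in place.
DT[, w := v * 2L]      # add column w
DT["a", v := 100L]     # modify v only in the rows matching "a"
DT[, w := NULL]        # delete column w
print(DT)
```

Because := changes DT in place, use copy(DT) first if the original table must be preserved.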
Since a list is a vector, data.table columns may be type list. Columns of type list can contain mixed types. Each item in a column of type list may be of a different length. This is true of data.frame, too.
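A minimal sketch of a list column, assuming a reasonably recent data.table; the column names id and payload are hypothetical.

```r
library(data.table)

# A list column: each cell may hold a different type and a different length.
DT <- data.table(id = 1:3,
                 payload = list(1:5, "text", c(TRUE, FALSE)))

types <- sapply(DT$payload, class)     # one class per cell
sizes <- sapply(DT$payload, length)    # one length per cell
print(types)
print(sizes)
```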
Several methods are provided for data.table, including is.na, na.omit, t, rbind, cbind, merge and others.
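A few of these methods in use, as a sketch assuming a reasonably recent data.table. The tables A and B and their columns are hypothetical.

```r
library(data.table)

A <- data.table(k = c("a", "b"), x = c(1, NA))
B <- data.table(k = c("b", "c"), y = 3:4)

is.na(A)                              # logical matrix flagging missing cells
complete <- na.omit(A)                # drops the row where x is NA
stacked  <- rbind(A, A)               # rows of A, twice
joined   <- merge(A, B, by = "k")     # rows whose k appears in both tables
print(joined)
```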
If keep.rownames or check.names are supplied they must be written in full because R does not allow partial argument names after '...'. For example, data.table(DF,keep=TRUE) will create a column called "keep" containing TRUE and this is correct behaviour; data.table(DF,keep.rownames=TRUE) was intended.
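The difference is easy to check by comparing column names. A small sketch, assuming a reasonably recent data.table; the data.frame DF and its rownames are made up.

```r
library(data.table)

DF <- data.frame(a = 1:2, row.names = c("r1", "r2"))

partial <- data.table(DF, keep = TRUE)            # keep is taken as a new column
full    <- data.table(DF, keep.rownames = TRUE)   # rownames kept in a column named rn

names(partial)
names(full)
```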
POSIXlt is not supported as a column type because it uses 40 bytes to store a single datetime. Unexpected errors may occur if you manage to create a column of type POSIXlt. Please see NEWS for 1.6.3, and IDateTime instead. IDateTime has methods to convert to and from POSIXlt.
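As a sketch of the suggested alternative, a timestamp can be stored as a pair of IDate and ITime columns. This assumes a reasonably recent data.table; the timestamp and the column names d and t are made up.

```r
library(data.table)

# Split one timestamp into an integer-based date and an integer-based time of day.
stamp <- as.POSIXct("2012-03-04 05:06:07", tz = "UTC")
DT <- data.table(d = as.IDate(stamp), t = as.ITime(stamp))

class(DT$d)
class(DT$t)
secs <- as.integer(DT$t)   # ITime is stored as seconds since midnight
print(DT)
```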
data.table homepage: http://datatable.r-forge.r-project.org/
User reviews: http://crantastic.org/packages/data-table
http://en.wikipedia.org/wiki/Binary_search
http://en.wikipedia.org/wiki/Radix_sort
data.frame, [.data.frame, as.data.table, setkey, J, SJ, CJ, merge.data.table, tables, test.data.table, IDateTime, unique.data.table, copy, :=, alloc.col, truelength, rbindlist
## Not run: 
example(data.table)  # to run these examples at the prompt
## End(Not run)

DF = data.frame(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
DF
DT
identical(dim(DT),dim(DF)) # TRUE
identical(DF$a, DT$a)      # TRUE
is.list(DF)                # TRUE
is.list(DT)                # TRUE

is.data.frame(DT)          # TRUE

tables()

DT[2]                      # 2nd row
DT[,v]                     # v column (as vector)
DT[,list(v)]               # v column (as data.table)
DT[2:3,sum(v)]             # sum(v) over rows 2 and 3
DT[2:5,cat(v,"\n")]        # just for j's side effect
DT[c(FALSE,TRUE)]          # even rows (usual recycling)

DT[,2,with=FALSE]          # 2nd column
colNum = 2
DT[,colNum,with=FALSE]     # same

setkey(DT,x)               # set a 1-column key. No quotes, for convenience.
setkeyv(DT,"x")            # same (v in setkeyv stands for vector)
v="x"
setkeyv(DT,v)              # same
# key(DT)<-"x"             # copies whole table, please use set* functions instead

DT["a"]                    # binary search (fast)
DT[x=="a"]                 # vector scan (slow)

DT[,sum(v),by=x]           # keyed by
DT[,sum(v),by=key(DT)]     # same
DT[,sum(v),by=y]           # ad hoc by

DT["a",sum(v)]             # j for one group
DT[c("a","b"),sum(v)]      # j for two groups

X = data.table(c("b","c"),foo=c(4,2))
X

DT[X]                      # join
DT[X,sum(v)]               # join and eval j for each row in i
DT[X,mult="first"]         # first row of each group
DT[X,mult="last"]          # last row of each group
DT[X,sum(v)*foo]           # join inherited scope

setkey(DT,x,y)             # 2-column key
setkeyv(DT,c("x","y"))     # same

DT["a"]                    # join to 1st column of key
DT[J("a")]                 # same. J() stands for Join, an alias for list()
DT[list("a")]              # same
DT[.("a")]                 # same. In the style of package plyr.
DT[J("a",3)]               # join to 2 columns
DT[.("a",3)]               # same
DT[J("a",3:6)]             # join 4 rows (2 missing)
DT[J("a",3:6),nomatch=0]   # remove missing
DT[J("a",3:6),roll=TRUE]   # rolling join (locf)

DT[,sum(v),by=list(y%%2)]  # by expression
DT[,.SD[2],by=x]           # 2nd row of each group
DT[,tail(.SD,2),by=x]      # last 2 rows of each group
DT[,lapply(.SD,sum),by=x]  # apply through columns by group

DT[,list(MySum=sum(v),
         MyMin=min(v),
         MyMax=max(v)),
    by=list(x,y%%2)]       # by 2 expressions

DT[,sum(v),x][V1<20]       # compound query
DT[,sum(v),x][order(V1)]   # ordering results

DT[,z:=42L]                # add new column by reference
DT[,z:=NULL]               # remove column by reference
DT["a",v:=42L]             # subassign v by reference
DT[,m:=mean(v),by=x]       # add new column by reference by group

DT[,.SD[which.min(v)],by=x]  # nested query by group

# Follow r-help posting guide, support is here (*not* r-help) :
# datatable-help@lists.r-forge.r-project.org
# or
# http://stackoverflow.com/questions/tagged/data.table

## Not run: 
vignette("datatable-intro")
vignette("datatable-faq")
vignette("datatable-timings")

test.data.table()          # over 700 low level tests

update.packages()          # keep up to date

## End(Not run)