data.table memory leak

Only when using the current R 3.6.0 nightly (development) build, I found what appears to be a memory leak in data.table.

# Minimal reproducible example

require(data.table)
n <- 2e6
df <- data.frame(a=rnorm(n),
                 b=factor(rbinom(n,5,prob=0.5),1:5,letters[1:5]),
                 c=factor(rbinom(n,5,prob=0.5),1:5,letters[1:5]))
dt <- setDT(df)
print(pryr::mem_used())
fff <- function(aref) {
  ff <- lapply(1:5, function(i) {
    dt2 <- dt[,list(sumA=sum(get(aref))),by=b][,c:=letters[i]]
    dt2
  })
  return(rbindlist(ff))
}
for(i in 1:10) {
  f <- fff("a")
  rm("f")
  gc()
  print(pryr::mem_used())
}

I would expect the printed memory usage to stay flat (since f is removed and gc() is called), but it goes up. Note that it always takes a few iterations to start. Adding [,c:=letters[i]] makes the problem appear sooner but is not necessary (without it you may have to increase 10 to 200 or so).

Example output:

66.6 MB
66.6 MB
66.6 MB
66.6 MB
87.3 MB
190 MB

# Output of sessionInfo()

R Under development (unstable) (2018-05-10 r74708)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.11.3

loaded via a namespace (and not attached):
[1] compiler_3.6.0   pryr_0.1.4       magrittr_1.5     tools_3.6.0     
[5] Rcpp_0.12.16     stringi_1.1.7    codetools_0.2-15 stringr_1.3.0  

Also reported on Stack Overflow.

This is a problem on both the CRAN version and the GitHub version of data.table.

37 Answers

✔️Accepted Answer

Markus: I shouldn't have written RAM usage. What I had in mind more specifically was cache, not RAM. My laptop has 32KB of L1d, 256KB of L2 and 6MB of L3. When we use multiple threads, that same cache is shared by all the threads (and the OS and other processes), so divide those cache figures by 4 if data.table is using 4 threads. Cache isn't typically much larger on servers, so divide cache by 32 if there are 32 threads on a server. If we create a new 10,000-item hop vector, that's 78KB of weight sloshing through cache, and each use of that vector will need that element's cache line (64 bytes on x86). However, thinking about it, in data.table's case most often it wouldn't be an extra hop vector with the base SEXP vector sloshing through cache as well; rather, the parallel code would use the hop vector instead of the base SEXP one. That should be fine then, and I agree ncol*8 working bytes is not a concern even for a 1e6-column dt (8MB). It's certainly cleaner and likely faster to use raw C pointers inside parallel regions, so I'm very comfortable with that. Yes, any help is much appreciated. I'm looking too, so we should coordinate.

Thanks Luke - understood, and thanks again. The only place I have deliberately violated the write barrier on STRSXP and VECSXP is in reorder.c. This is a uniquely special operation: a shuffle, such as performed by setkey(). There it is absolutely guaranteed that all the members of the STRSXP are still there afterwards; they just change position. Say there are 8 STRSXP columns and 8 threads. Each thread gets a column and the 8 columns are shuffled in parallel (the pointers are moved within the vector) but SET_STRING_ELT is not used. This is only ok because it's a shuffle. Otherwise, data.table carefully respects the write barrier (at least, it is intended to).

For those reading who are wondering what the 'write barrier' means (it took me a while to understand it too, with help from Luke in the past): roughly speaking, SET_STRING_ELT doesn't just assign the pointer (to the CHARSXP in the global character cache) to the vector; it also increments the reference count of the string being assigned and decrements the reference count of the string being replaced. The reference count is on the single global CHARSXP (i.e. shared between threads). So if two threads do that at the same time, the reference count can get messed up (a race), because an increment (e.g. val++ in C) is actually several operations at machine level: read, add, assign. Two threads could read at the same time and then write the same result back to the variable, causing one increment to be missed and eventually a crash elsewhere. val++ is a write op that is not thread-safe in C if val is shared between threads. Parallel C code in data.table is already aware of things like that and should be safe. What's changed is that R API functions like INTEGER() are no longer thread-safe, where they used to be (we relied on undocumented behaviour). The other thing to be aware of is that data.table defines USE_RINTERNALS, which allows data.table to use R internals that are not part of the R API, e.g. DATAPTR. Luke and R-core would like data.table, eventually, to remove the use of USE_RINTERNALS and use only R's official C API. In the case of INTEGER() no longer being thread-safe, that would not have helped, since INTEGER is part of R's C API; rather, w.r.t. INTEGER(), we relied on undocumented thread-safety of an API function. In general, data.table does make Luke's job of changing R itself harder because of our use of USE_RINTERNALS, behind the C API. Our goal is to remove that too, and CRAN policy may very reasonably be revised in future to not allow USE_RINTERNALS.
It's easily quantifiable by removing the #define in data.table.h and seeing what breaks (currently, lots; e.g. 73 uses of DATAPTR). If we start now, we'll find out whether there's anything that really can't be overcome, and then we can start a discussion with Luke in good time and possibly get our requests in for additions to the official C API, if we can make a compelling case. Originally, many years ago, I was motivated to define USE_RINTERNALS because that changed INTEGER from a function to a macro and was faster. That is no longer true, so we should revisit. The other reason was that DATAPTR was a convenient generic way to access a vector's contents, for code brevity. But DATAPTR can now allocate, so it's no longer just an offset from the SEXP.

Other Answers:

@jangorecki it's a race condition. The risk exists on all platforms. When it bites varies from run to run within a platform; how likely it is to bite varies between platforms with performance differences, in particular in thread implementations. On my Ubuntu system, where getDTthreads() returns 4, this variant shows the problem on most runs:

require(data.table)
n <- 20
df <- data.frame(a=rnorm(n),
                 b=factor(rbinom(n,5,prob=0.5),1:5,letters[1:5]),
                 c=factor(rbinom(n,5,prob=0.5),1:5,letters[1:5]))
dt <- setDT(df)
v <- gc()
print(sum(v[, 2]))
fff <- function(aref) {
  ff <- lapply(1:5, function(i) {
    dt2 <- dt[,list(sumA=sum(get(aref))),by=b][,c:=letters[i]]
    dt2
  })
  return(rbindlist(ff))
}
for(i in 1:100) {
  f <- fff("a")
  rm("f")
  v <- gc()
  print(sum(v[, 2]))
}

I'll add that the call to get appears to be necessary (sum(get(aref))), but it need not be inside a function. So, this works:

# NOTE: sum(get(aref)) now sum(a)
require(data.table)
n <- 2e6
df <- data.frame(a=rnorm(n),
                 b=factor(rbinom(n,5,prob=0.5),1:5,letters[1:5]),
                 c=factor(rbinom(n,5,prob=0.5),1:5,letters[1:5]))
dt <- setDT(df)
print(pryr::mem_used())
aref <- "a"
for(i in 1:10) {
  ff <- lapply(1:5, function(i) {
    dt2 <- dt[,list(sumA=sum(a)),by=b][,c:=letters[i]]
    dt2
  })
  f <- rbindlist(ff)
  rm("f")
  gc()
  print(pryr::mem_used())
}
# no memory leak
#66.6 MB (repeated)

but this causes the memory leak:

# NOTE: no fff function
require(data.table)
n <- 2e6
df <- data.frame(a=rnorm(n),
                 b=factor(rbinom(n,5,prob=0.5),1:5,letters[1:5]),
                 c=factor(rbinom(n,5,prob=0.5),1:5,letters[1:5]))
dt <- setDT(df)
print(pryr::mem_used())
aref <- "a"
for(i in 1:10) {
  ff <- lapply(1:5, function(i) {
    dt2 <- dt[,list(sumA=sum(get(aref))),by=b][,c:=letters[i]]
    dt2
  })
  f <- rbindlist(ff)
  rm("f")
  gc()
  print(pryr::mem_used())
}
# 66.6 MB
# 170 MB
# 273 MB
# 375 MB

Interestingly, if the first code chunk is run after the second, it too shows the memory leak; on its own it just can't get started.

Hint: Running with OMP_NUM_THREADS=1 makes the memory problem go away.

Calling DATAPTR for ALTREP objects temporarily disables the GC. Calling DATAPTR from multiple threads without synchronization creates a race condition that can leave the GC disabled, so memory use grows. It's not a leak: signaling a user interrupt, or anything that triggers a jump to top level, re-enables the GC.

Sure. Hope that edit cab7c11 just now is ok. Many many thanks @nguyentr17 !
