Quantitative finance collector

Handling Large Datasets in R

Posted by abiao at 23:52 | Code » R/Splus
Handling large datasets in R, especially CSV data, was briefly discussed before at Excellent free CSV splitter and Handling Large CSV Files in R. My file at that time was around 2GB, with 30 million rows and 8 columns. Recently I started to collect and analyze US corporate bond tick data from 2002 to 2010, and the CSV file I got is 6.18GB with 40 million rows, even after removing biased records as described in Biases in TRACE Corporate Bond Data.

How to proceed efficiently? Below is an excellent presentation on handling large datasets in R by Ryan Rosario, available at http://www.bytemining.com/2010/08/taking-r-to-the-limit-part-ii-large-datasets-in-r/. A short summary of the presentation:
1, R has a few packages for big-data support. The presentation covers bigmemory and ff, as well as parallel approaches to the same goal using Hadoop and MapReduce;
2, the data used in the presentation is an 11GB comma-separated file with 120 million rows and 29 columns;
3, for datasets up to roughly 10GB, bigmemory and ff handle themselves well (a short sketch follows this list);
4, for larger datasets, use Hadoop;
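
As a rough illustration of point 3, here is a minimal sketch of reading a large CSV with bigmemory and ff; the file name trace_ticks.csv, the all-numeric column assumption and the chunk size are placeholders, not the actual TRACE layout:

library(bigmemory)
# a big.matrix holds a single numeric type, so this only works if every column is numeric;
# the backing/descriptor files keep the data on disk so it can be re-attached later with attach.big.matrix()
ticks <- read.big.matrix("trace_ticks.csv", header = TRUE, sep = ",", type = "double",
                         backingfile = "trace_ticks.bin", descriptorfile = "trace_ticks.desc")

library(ff)
# an ffdf keeps a mixed-type data frame on disk, reading the CSV in chunks of next.rows lines
ticks.ffdf <- read.csv.ffdf(file = "trace_ticks.csv", header = TRUE, next.rows = 500000)
dim(ticks.ffdf)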



BTW, determining the number of rows of a very big file is tricky: you don't have to load the whole dataset and call dim(), which easily runs out of memory. One way is to count lines in chunks with readLines(), for example:
# gzfile() reads gzip-compressed files; for a .zip archive use unz() instead
data <- gzfile("yourdata.csv.gz", open = "r")
MaxRows <- 50000            # read the file in chunks of 50,000 lines
TotalRows <- 0
while ((LeftRow <- length(readLines(data, MaxRows))) > 0) {
  TotalRows <- TotalRows + LeftRow
}
close(data)
TotalRows                   # includes the header line, if any


For a line count, on Unix, simply do

wc -l $filename

or wrap it in an R helper:
countLines <- function(x) as.numeric(system(sprintf("wc -l < %s", x), intern = TRUE))  # "wc -l < file" prints only the count
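
Note that for a gzip-compressed file, wc -l on the compressed file does not give the number of data lines; one option, assuming zcat is available on the system and using a hypothetical helper name, is to decompress on the fly:
countLinesGz <- function(x) as.numeric(system(sprintf("zcat %s | wc -l", x), intern = TRUE))
countLinesGz("yourdata.csv.gz") - 1   # minus 1 if the file has a header row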