Quantitative finance collector
Oct 26

Handling Large Datasets in R

Posted by abiao at 23:52 | Code » R/Splus | Comments(11) | Reads(36860)
Handling large datasets in R, especially CSV data, was briefly discussed before in Excellent free CSV splitter and Handling Large CSV Files in R. My file at that time was around 2GB, with 30 million rows and 8 columns. Recently I started to collect and analyze US corporate bond tick data from 2002 to 2010, and the CSV file I got is 6.18GB with 40 million rows, even after removing biased records as described in Biases in TRACE Corporate Bond Data.

How to proceed efficiently? Below is an excellent presentation on handling large datasets in R by Ryan Rosario at http://www.bytemining.com/2010/08/taking-r-to-the-limit-part-ii-large-datasets-in-r/. A short summary of the presentation:
1, R has a few packages for big data support. The presentation covers bigmemory and ff, as well as the use of parallelism to accomplish the same goal with Hadoop and MapReduce;
2, the data used in the presentation is an 11GB comma-separated file with 120 million rows and 29 columns;
3, for datasets up to around 10GB, bigmemory and ff handle themselves well;
4, for larger datasets, use Hadoop.
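Leaving the package internals aside, the core idea these tools share — never hold more than one chunk of the file in memory — can be sketched in base R. The file, column names, and chunk size below are made up for illustration; here a tiny CSV is generated on the fly and a column sum is accumulated chunk by chunk:

```r
# Chunked column sum: read a CSV piece by piece, keeping at most
# MaxRows rows in memory at any time. (File and columns are
# illustrative only -- substitute your own data.)
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(price = 1:1000, size = rep(2, 1000)),
          tmp, row.names = FALSE)

con <- file(tmp, open = "r")
invisible(readLines(con, n = 1))   # skip the header line
MaxRows <- 100
total <- 0
repeat {
  # read.csv on an open connection continues where it left off;
  # it errors at end of file, which tryCatch turns into NULL
  chunk <- tryCatch(
    read.csv(con, header = FALSE, col.names = c("price", "size"),
             nrows = MaxRows),
    error = function(e) NULL)
  if (is.null(chunk) || nrow(chunk) == 0) break
  total <- total + sum(chunk$price)
}
close(con)
total  # 500500, i.e. sum(1:1000)
```

The same loop body could compute any per-chunk statistic that aggregates cleanly (sums, counts, min/max); statistics that need all rows at once are exactly where bigmemory and ff earn their keep.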

BTW, determining the number of rows of a very big file is tricky: you don't want to load the whole dataset and call dim(), which easily runs out of memory. One way is to count lines in chunks with readLines(), for example:
data <- gzfile("yourdata.zip", open = "r")
MaxRows <- 50000
TotalRows <- 0
while ((LeftRow <- length(readLines(data, MaxRows))) > 0)
  TotalRows <- TotalRows + LeftRow
close(data)
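On a small file the loop is easy to sanity-check; below a plain file() connection and a temporary file of a known length stand in for the gzfile() above:

```r
# Write 123 lines to a temporary file, then count them in
# chunks of 50 with the same readLines() loop as above.
tmp <- tempfile()
writeLines(as.character(1:123), tmp)

data <- file(tmp, open = "r")
MaxRows <- 50
TotalRows <- 0
while ((LeftRow <- length(readLines(data, MaxRows))) > 0)
  TotalRows <- TotalRows + LeftRow
close(data)
TotalRows  # 123
```

Only MaxRows lines are ever held in memory, so the same loop scales to the multi-gigabyte case; just remember to close() the connection when done.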

For a line count, on Unix, simply do

wc -l $filename

or wrap it in R:

countLines <- function(x) as.numeric(system(sprintf("wc -l %s | grep -Eo '[0-9]+'", x), intern=TRUE))
I would like to analyze a large GPS data set which I extracted by tapping into the interface of a GPS dog fence. I can't use simple .csv analysis techniques because the data set is rather large (>10 GB).

Can MaxRows <- 50000 be adjusted to
MaxRows <- 500000 ?
It might help to share some examples of what kind of analysis you want to do and what your data looks like.