Sep 12

Probably all of us have met the issue of **handling missing data**. From basic portfolio correlation matrix estimation to advanced multiple factor analysis, how to **impute missing data** remains a hot topic. Missing data are unavoidable and more pervasive than the term's usual association suggests, and ignoring them will generally lead to biased estimates. The following strategies are often applied to handle the problem:

**1. Simple deletion strategies**: pairwise deletion and listwise deletion. The former may lead to inconsistent results, for example non-positive-definite correlation or covariance matrices, while with the latter the accumulation of deleted cases may be enormous unless very few values are missing.

**2. So-called "working around" strategies**: for example, Full Information Maximum Likelihood (FIML) integrates out the missing data when fitting the desired model.

**3. Imputation strategies**: the most widely used methods in both academia and industry, which replace a missing value with an estimate of the actual value for that case. For instance, "hot-deck" imputation replaces the missing value with the observed value from another, similar case in the same dataset for which that variable was not missing; mean imputation replaces it with the mean of the variable in question; Expectation Maximization (EM) arrives at the best point estimates of the true values, given the model (which is itself estimated on the basis of the imputed missings); regression-mean imputation replaces the missing value with the conditional regression mean; and multiple imputation derives several imputed values, rather than a single one, from a prediction equation.

I came across an easy-to-use **missing data imputation** package named Amelia II, developed by Professor Gary King from Harvard University. As its webpage introduces: Amelia II "multiply imputes" missing data in a single cross-section (such as a survey), from a time series (like variables collected for each year in a country), or from a time-series-cross-sectional data set (such as collected by years for each of several countries). Amelia II implements a bootstrapping-based algorithm that gives essentially the same answers as the standard IP or EMis approaches, is usually considerably faster than existing approaches, and can handle many more variables. Unlike other statistically rigorous imputation software, it virtually never crashes.
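To make two of the simpler strategies in the list concrete, here is a minimal base-R sketch of mean imputation and regression-mean imputation; the data are made up for illustration.

```r
# A hedged sketch of two simple strategies from the list above, in base R:
# mean imputation and regression-mean imputation. The data are made up.
set.seed(7)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
y[sample(100, 10)] <- NA            # knock out 10 values at random

# Mean imputation: replace each NA with the observed mean of y
y_mean <- ifelse(is.na(y), mean(y, na.rm = TRUE), y)

# Regression-mean imputation: predict the NAs from x using complete cases
fit   <- lm(y ~ x)                  # lm() drops the NA rows by default
y_reg <- y
y_reg[is.na(y)] <- predict(fit, newdata = data.frame(x = x[is.na(y)]))
```

Amelia II itself goes further: per its documentation, a single call such as `amelia(data, m = 5)` produces five multiply imputed data sets, with `ts` and `cs` arguments identifying the time-series and cross-section variables.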
Sep 9

I always believe R is better than Matlab in terms of data loading. For example, **read.table**() is able to read a large delimited file (such as a CSV file) easily. You may argue there is also a **csvread**() function in Matlab, but don't forget: **csvread() works only with purely numeric data, not with mixed-format files**, which are common in practice, for instance a file containing both bond coupon rates and bond issuer names.

Let's say you have a CSV file with a structure like below. How do you import it in Matlab? csvread() obviously doesn't work, as this file has mixed formats, and it returns an error:

```
??? Error using ==> dlmread at 145
Mismatch between file and format string.
Trouble reading number from file (row 1, field 1) ==> CUSIP

Error in ==> csvread at 52
m=dlmread(filename, ',', r, c);
```


**dlmread**() doesn't work either. **textscan**() may work, though at the cost of flexibility.
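For comparison, here is how little work a mixed-format file takes in R; the file contents and column names (CUSIP, Coupon) are invented for illustration.

```r
# Write a small mixed-format CSV (made-up character CUSIPs plus numeric coupons)
f <- tempfile(fileext = ".csv")
writeLines(c("CUSIP,Coupon",
             "912828C57,4.25",
             "912810FT0,5.375"), f)

# read.csv (a wrapper around read.table) handles mixed column types directly
bonds <- read.csv(f, stringsAsFactors = FALSE)
str(bonds)  # CUSIP is character, Coupon is numeric
```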
Sep 8

Congratulations to myself that this blog's R category has been indexed by **R-bloggers**, which is absolutely an encouragement to write more quality articles on R. Thank you, Tal, for your permission.

**What is R-Bloggers.com?** R-Bloggers.com is a central hub (i.e., a blog aggregator) of content collected from bloggers who write about R (in English). The site will help R bloggers and users to connect and follow the "R blogosphere".

Interested readers please check http://www.r-bloggers.com/ for more.

Sep 6

Some of you may know the **R reshape package** already; I started to play with it after the post Handling Large CSV Files in R. It is really an excellent package, worth a new post to introduce it formally. **What is the reshape package? reshape: Flexibly reshape data. Reshape lets you flexibly restructure and aggregate data using just two functions: melt and cast.** So basically it allows us to massage and re-organize our data into the hierarchy we need in only two steps: first melt the data into a form suitable for easy casting, then cast the molten data frame into the reshaped or aggregated form you want. Sounds like a tongue twister? A small example will make it clearer.

Suppose you have a matrix of bond data
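Since the original bond data are not reproduced here, the following is a hedged sketch with a made-up bond data frame, showing the two-step melt-then-cast workflow (assuming the reshape package is installed).

```r
# A hedged sketch with a made-up bond data frame, assuming the reshape
# package is installed (its successor reshape2 renames cast() to dcast()).
library(reshape)

bonds <- data.frame(
  issuer = c("A", "A", "B", "B"),
  year   = c(2009, 2010, 2009, 2010),
  coupon = c(4.0, 4.5, 5.0, 5.5)
)

# Step 1: melt into long ("molten") form, keyed by the id variables
molten <- melt(bonds, id = c("issuer", "year"))

# Step 2: cast into the layout we want, e.g. one row per issuer, years as columns
wide <- cast(molten, issuer ~ year)
```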

Sep 2

I was asked how to improve the convergence speed of Greeks calculated with Monte Carlo simulation. Besides variance reduction techniques such as antithetic variates or low-discrepancy random numbers, one efficient way is to use the pathwise derivative instead of finite differences.



**1. Finite difference approximation.** This is the most widely used and most straightforward method. As its name suggests, to estimate dy/dx we bump x by a very small quantity to x1, re-calculate the option value y1, and then estimate the sensitivity as (y1 - y)/(x1 - x). This method therefore requires us to calculate the option value at least twice (three times for the central difference method), which is obviously a big burden when each valuation already needs many simulations.
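As a concrete sketch (my own illustration, not from the post, with made-up Black-Scholes parameters), a finite-difference delta with common random numbers looks like this:

```r
# A minimal sketch: finite-difference delta of a European call under
# Black-Scholes dynamics, with made-up parameters.
set.seed(1)
S0 <- 100; K <- 100; r <- 0.05; sigma <- 0.2; T <- 1
n  <- 1e5
h  <- 0.01 * S0                     # bump size

Z <- rnorm(n)                       # common random numbers reused in both runs

# Monte Carlo price of the call for a given spot
price <- function(S) {
  ST <- S * exp((r - 0.5 * sigma^2) * T + sigma * sqrt(T) * Z)
  mean(exp(-r * T) * pmax(ST - K, 0))
}

# Forward difference: two full valuations are needed
delta_fd <- (price(S0 + h) - price(S0)) / h
```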

**2. Pathwise derivative estimate.** Contrary to the finite difference approximation, the pathwise method estimates the derivative directly, without multiple revaluations. It takes advantage of additional information about the dynamics and parameter dependence of the simulated process. Simply put, by the chain rule dy/dx = (dy/dz)(dz/dx); if we can find another variable z such that both derivatives on the right-hand side have tractable solutions, the pathwise derivative estimator can be applied. In most cases the stock price S(T) for a European option, or S(τ) for an American option (where τ is the optimal exercise time), is an excellent choice of z. Please read Chapter 7 of Monte Carlo Methods in Financial Engineering (Stochastic Modelling and Applied Probability, v. 53) for details.
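For a European call under Black-Scholes with z = S(T), the chain rule gives the well-known pathwise delta e^(-rT) · 1{S(T) > K} · S(T)/S0. Here is a minimal sketch (my own illustration, with made-up parameters) that compares it against the closed-form delta:

```r
# A minimal sketch: pathwise delta of a European call under Black-Scholes,
# with made-up parameters. Only one set of paths is needed.
set.seed(42)
S0 <- 100; K <- 100; r <- 0.05; sigma <- 0.2; T <- 1
n  <- 1e5

Z  <- rnorm(n)
ST <- S0 * exp((r - 0.5 * sigma^2) * T + sigma * sqrt(T) * Z)

# Chain rule with z = S(T): dC/dS0 = E[e^{-rT} * 1{ST > K} * dST/dS0],
# and for geometric Brownian motion dST/dS0 = ST / S0
delta_pw <- mean(exp(-r * T) * (ST > K) * ST / S0)

# Closed-form Black-Scholes delta for comparison
d1 <- (log(S0 / K) + (r + 0.5 * sigma^2) * T) / (sigma * sqrt(T))
delta_bs <- pnorm(d1)
```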