
I am trying to optimize a simple portfolio by generating many random weight vectors and picking the best one. When the number of assets is large, the covariance matrix contains NaN values because some asset pairs have no trading days in common.

How should I treat the NaN values?
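A minimal reproduction of the problem (tickers and dates made up): two series with disjoint trading days give a NaN covariance entry, which then propagates through $\mathbf{w}' \Sigma \mathbf{w}$:

```python
import numpy as np
import pandas as pd

# Two return series with no trading days in common (hypothetical assets).
a = pd.Series([0.01, -0.02, 0.005],
              index=pd.to_datetime(["2017-01-02", "2017-01-03", "2017-01-04"]))
b = pd.Series([0.03, 0.01, -0.01],
              index=pd.to_datetime(["2017-02-01", "2017-02-02", "2017-02-03"]))
returns = pd.DataFrame({"A": a, "B": b})

cov_matrix = returns.cov()     # pairwise computation: off-diagonal is NaN
print(cov_matrix)

weights = np.array([0.5, 0.5])
port_var = weights @ (cov_matrix.values * 252) @ weights
print(port_var)                # nan: the NaN entry poisons the quadratic form
```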

Pedro Rio
    Sounds like the 'covariance matrix' is not positive definite, i.e. not a covariance matrix. Where did cov_matrix come from? – nbbo2 May 11 '17 at 16:30
  • Do you get NaN values for your portfolio weights? Or during the computation of the covariance matrix? I'm confused. I don't see any NaNs in the portfolio weights vector $\mathbf{w}$. Are you saying that $\mathbf{w}' \Sigma \mathbf{w}$ gives you a NaN but that $\Sigma$ doesn't have a NaN and $\mathbf{w}$ doesn't have a NaN? – Matthew Gunn May 11 '17 at 16:52
  • @MatthewGunn Yes, w'Σw gives me a NaN but Σ doesn't have a NaN and w doesn't have a NaN. And it only happens as the number of assets increases. – Pedro Rio May 11 '17 at 17:14
  • @noob2 I had several panda time series of asset returns and computed with cov_matrix = returns.cov() – Pedro Rio May 11 '17 at 17:17
  • @PedroRio Is there an inf anywhere? Absurdly large or small numbers? I'm struggling to see how you can get a NaN from matrix multiplication except through something like (-np.inf) + np.inf. Check w.min(), w.max(). If you don't take the square root, what is the value you get for the portfolio variance? If your covariance matrix is rank deficient and not quite positive semi-definite, perhaps $\mathbf{w}' \Sigma \mathbf{w}$ is somehow negative? Those are some ideas but you're going to have to track down what is going wrong. – Matthew Gunn May 11 '17 at 18:23
  • @MatthewGunn The problem is when I compute np.dot(cov_matrix*252, weights), but when I multiply the minimum value of the cov_matrix and the minimum value from the weights I get a number and not NaN. The same happens when I do that with the maximums. I'm going to investigate more, but thank you for the ideas. – Pedro Rio May 11 '17 at 19:20
  • @MatthewGunn I assumed that the cov_matrix did not have 'nan' values but it actually has 'nan' values. What is the standard procedure for this? If I substitute 'nan' values for 0 I get a result but I do not know if that is the correct choice. – Pedro Rio May 12 '17 at 12:21
  • Definitely not the correct choice to replace a NaN with zero correlation. When computing correlations you should ignore NaNs pairwise. I had a similar question on how to compute correlations ignoring NaNs (I was using Matlab), but it might be helpful for you: http://stackoverflow.com/questions/29615002/correlation-matrix-ignoring-nan – phdstudent May 12 '17 at 16:19

2 Answers

4

This is a common problem in covariance matrix estimation, with several possible solutions. One of the simplest involves two steps:

(1) You compute each element of the covariance matrix on a 'best efforts' basis, meaning you take the covariance of the two time series involved after REMOVING any data pairs containing an N/A value. (Note that this means each element of the matrix will be based on a different number of observations, so the resulting matrix is not a standard covariance matrix; for example, it may not be positive definite.) I assume that for any two time series there are at least a few common observations; otherwise it is an ill-posed problem, as Matthew Gunn pointed out.
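In pandas, this 'best efforts' pairwise computation is what DataFrame.cov already does; its min_periods argument lets you additionally demand a minimum number of common observations per pair (the threshold of 20 below is an arbitrary choice, as are the tickers and gap patterns):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.bdate_range("2016-01-01", periods=250)
returns = pd.DataFrame(rng.normal(0.0, 0.01, (250, 4)),
                       index=dates, columns=["A", "B", "C", "D"])

# Punch holes so different pairs share different numbers of observations.
returns.loc[dates[:60], "B"] = np.nan      # "B" starts trading late
returns.loc[dates[-80:], "C"] = np.nan     # "C" is delisted early

# Pairwise 'best efforts' covariance: each element uses whatever data the
# pair has in common; require at least 20 common days per pair.
cov = returns.cov(min_periods=20)
print(cov)
```

If a pair shares fewer than min_periods days, the corresponding entry comes back as NaN instead of a very noisy estimate, which makes the ill-posed pairs easy to spot.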

(2) You "massage" the resulting matrix to make it positive definite (and thus acceptable for use as a covariance matrix) using the routine nearPD, available in R's Matrix package.
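If you are in Python rather than R, a crude stand-in for nearPD is to clip negative eigenvalues; this is simpler than nearPD's alternating-projections algorithm and does not in general return the nearest matrix in Frobenius norm, but it does produce a usable positive definite matrix:

```python
import numpy as np

def clip_to_psd(cov, eps=1e-10):
    """Crude positive-definite repair: symmetrize, then raise any
    eigenvalue below eps up to eps. Simpler (and less exact) than
    R's nearPD, which iterates alternating projections."""
    sym = (cov + cov.T) / 2
    vals, vecs = np.linalg.eigh(sym)
    vals_clipped = np.clip(vals, eps, None)
    return vecs @ np.diag(vals_clipped) @ vecs.T

# A symmetric matrix with a negative eigenvalue (not a valid covariance):
bad = np.array([[1.0, 0.9, 0.7],
                [0.9, 1.0, 0.3],
                [0.7, 0.3, 1.0]])
print(np.linalg.eigvalsh(bad).min())    # negative
fixed = clip_to_psd(bad)
print(np.linalg.eigvalsh(fixed).min())  # no longer negative
```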

[Even after all this work, a large covariance matrix will be very 'noisy' and of poor quality. You should consider further steps, such as shrinkage (e.g. Ledoit-Wolf), before you use the results to find an optimum portfolio.]
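As a sketch of what shrinkage does, here is linear shrinkage toward a diagonal target with a hand-picked intensity delta (Ledoit-Wolf derive a data-driven optimal delta instead; the sizes and delta below are arbitrary):

```python
import numpy as np

def shrink_covariance(sample_cov, delta):
    """Blend the noisy sample covariance with a diagonal target:
    (1 - delta) * sample + delta * diag(sample), 0 <= delta <= 1.
    delta is hand-picked here; Ledoit-Wolf estimate an optimal value."""
    target = np.diag(np.diag(sample_cov))
    return (1 - delta) * sample_cov + delta * target

rng = np.random.default_rng(1)
X = rng.normal(0.0, 0.01, (60, 50))   # 60 days, 50 assets: very noisy estimate
sample_cov = np.cov(X, rowvar=False)
shrunk = shrink_covariance(sample_cov, 0.5)

# Shrinkage pulls the extreme eigenvalues toward the middle, which
# stabilizes the downstream portfolio optimization.
print(np.linalg.cond(sample_cov), np.linalg.cond(shrunk))
```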

nbbo2
  • nearPD is nice functionality – FinanceGuyThatCantCode May 12 '17 at 18:51
  • 1
    The current version of Pandas.cov already calculates covariance "excluding NA/null values" which is the "best efforts" basis you are referring to. – Matthew Gunn May 12 '17 at 19:20
  • An example of nearPD I did last year: http://stackoverflow.com/questions/36153022/error-in-quadratic-programing-in-r-using-portfolio-optim-function – rbm May 12 '17 at 19:48
  • Most "nearPD" implementations just do it via an eigenvalue decomposition and then removing the negative entries. There are some better ways (IMO) based on vectors on a hypersphere. – will May 12 '17 at 19:51
  • @will - in that case, if I am understanding you properly, it doesn't sound like those "nearPD" methods will necessarily produce the nearest pos def matrix (I assume in Frobenius norm). I imagine finding the actual nearest pos def matrix is a convex optimization problem. I also imagine it may only guarantee returning a positive semi-definite matrix - which may therefore just convert negative eigenvalues to zero in an optimal way. – FinanceGuyThatCantCode May 12 '17 at 20:09
  • The reality is though that the values should never be that negative, so the difference should be small. If it's large then you were putting in garbage... – will May 12 '17 at 20:11
  • @will Is the Kercheval (2009) method mentioned here https://quant.stackexchange.com/questions/2074/what-is-the-best-way-to-fix-a-covariance-matrix-that-is-not-positive-semi-defi what you have in mind? – nbbo2 May 12 '17 at 20:15
  • @noob2 it looks like it yah. – will May 12 '17 at 20:23
  • 2
    While I appreciate your concern for negative eigenvalues and the possibility of portfolio weights that purport to achieve negative volatility, the OP almost certainly has a far more basic problem of having delisted securities in his return matrix! – Matthew Gunn May 12 '17 at 20:56
1
  • Your estimated covariance matrix includes nan entries.
  • The current Pandas.cov function already makes a best effort to estimate covariance based upon available data by ignoring nan/null values.

This implies that to obtain a nan in the estimate of covariance, you must have at least two return series that have ZERO time periods in common!
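You can check directly which pairs have zero periods in common before calling cov (the tickers below are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical returns frame with gaps.
idx = pd.bdate_range("2017-01-02", periods=6)
returns = pd.DataFrame({
    "VOC":  [0.01, 0.02, np.nan, np.nan, np.nan, np.nan],  # delisted early
    "GOOG": [np.nan, np.nan, np.nan, 0.01, -0.02, 0.03],   # listed late
    "MSFT": [0.01, 0.00, 0.02, -0.01, 0.01, 0.00],
}, index=idx)

# Count the observations each pair of assets has in common.
present = returns.notna().astype(int)
overlap = present.T @ present
print(overlap)   # VOC/GOOG overlap is 0, so cov(VOC, GOOG) is NaN
```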

Your question is ill-posed

  • What's the correlation between returns of the Dutch East India Company (1602-1799) and Google (2004-now)? It's an unanswerable and nonsensical question.
  • And if your portfolio optimizer says to put $\frac{1}{2}$ your portfolio in Dutch East India Company and $\frac{1}{2}$ in Google, how are you going to do that?

A direction to move in

If you're going to work with securities that enter and leave your sample, you need to do something more sophisticated than estimating an unconditional covariance matrix with sigma = mydata.cov() and using it to choose portfolio weights.

  • If the point is to come up with portfolio weights for time $t$, it doesn't make sense to include securities which one cannot invest in at time $t$!
  • You need some notion of $\Sigma_t$, an estimate that's designed for time $t$.
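A minimal sketch of such a $\Sigma_t$ (window length, tickers, and gap pattern are all arbitrary choices for illustration): estimate over a trailing window and drop any asset not fully observed, i.e. not investable, over that window:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.bdate_range("2015-01-01", periods=500)
returns = pd.DataFrame(rng.normal(0.0, 0.01, (500, 3)),
                       index=dates, columns=["A", "B", "C"])
returns.loc[dates[:250], "C"] = np.nan    # "C" only starts trading halfway in

t = dates[-1]
window = returns.loc[:t].tail(252)        # trailing one-year window ending at t

# Keep only assets with a full return history over the window.
live = window.dropna(axis=1).columns
sigma_t = window[live].cov() * 252        # annualized Sigma_t for time t
print(sigma_t)
```

The weight vector is then chosen over the live assets only, so the optimizer never allocates to a security that cannot be held at time $t$.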

And replacing NaN with 0 is not a sensible thing to do! The average covariance term is not zero: systematic aggregate risk exists, and it manifests itself in positive covariance terms.

Matthew Gunn