Tuesday, December 16, 2014

Book Review: R Object-oriented Programming

Packt sent me a copy of R Object-oriented Programming by Kelly Black to review, so here we go:

The aim of this book is to provide a basic understanding of R and programming in R. It is suggested that readers are already familiar with programming, but the book does seem to start from square one, and I feel it is probably accessible to most people using R.

The contents are roughly split 50/50 between general R material (working with data, the various data types R has) and OO. The first half provides good coverage of the basics, and I thought it was a good introduction to R. It includes a chapter on working with time variables, which I think would be very useful to people new to R and working with time series data.

The second half is about object-oriented programming. S3 and S4 classes are covered, and some decent examples are provided that go beyond the usual trivial hello world or Shape/Circle/Square type examples, which is nice to see.

Strangely though, no mention is made of reference classes. For a book titled R Object-oriented Programming, making no mention of one of the ways to do OO in R seems like a rather large oversight.

Another nitpick I have is the indent style and formatting. It is mostly Whitesmiths style (which I personally dislike), but beyond that it is used inconsistently throughout the examples. This is generally considered a rather bad practice in programming circles. Pick one style and stick with it. 

It is challenging to present source code and text together, but some of it is laid out very poorly, with lines wrapping in strange places. It doesn’t really affect the content because most people will likely download the code to experiment with it, but it does detract from what is otherwise a good book. 

Arguing about whitespace may seem petty, but the reality is most programmers spend most of their day looking at source code, so it is important. There are plenty of projects that won't even look at your code unless it is properly formatted. Again, it's not the choice of style, it's the inconsistent use and poor formatting that bothers me. There's a lot of terrible R code out there, and a book about programming should do better.

Despite these issues, I think it gives good coverage of relevant topics for those starting to get more into the programming side of R. If you are starting out in statistics and looking for a book on R to get going with programming (versus, say, how to do linear regression) it's a useful one.

You can see some other reviews here, here, and on Amazon.

Sunday, September 21, 2014

HMM example with depmixS4

On a scale of one to straight up voodoo, Hidden Markov Models (HMMs) are definitely up there for me.

They have all sorts of applications, and as the name suggests, they can be very useful when you wish to use a Markovian approach to represent some stochastic process.

In loose terms this just means we wish to represent our process as some set of states and probabilistic transitions between them.

For example, if I am in state 1, there may be an 85% chance of staying in state 1, and a 15% chance of moving to state 2.

To complete this simple two state model, we would also have to define the transitions for state 2, namely what is the probability we will stay in state 2 if we are already in state 2, and what is the probability we will transition from state 2 to state 1.
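To make this concrete, here is how such a two-state chain could be simulated in base R. The first row of the transition matrix is the 85%/15% example above; the second row is a made-up choice for illustration:

```r
set.seed(1)

# Transition matrix: rows are the current state, columns the next state.
# Row 1 is the 85%/15% example above; row 2 is an arbitrary choice.
P <- matrix(c(0.85, 0.15,
              0.30, 0.70), nrow = 2, byrow = TRUE)

n <- 1000
states <- integer(n)
states[1] <- 1
for (t in 2:n) {
  states[t] <- sample(1:2, size = 1, prob = P[states[t - 1], ])
}

table(states)  # time spent in each state
```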

But what if we don’t know this information? What if we just have a set of realizations from some process and a vague idea of the number of states that might be behind it?

Enter HMMs. Among other things, they can be used to determine the state transition probabilities that underlie some process, as well as the distributional parameters for each state.

That is, they can work out all the details that are "hidden" from an observer of a process. All we really need to do is specify how many states we think there are.


To test this out, and to get more familiar with the depmixS4 package, I made a small test program. It creates an observation series with:

a) a known number of states
b) known state transition matrices
c) known underlying distribution parameters

The aim is to use depmixS4 to estimate all this information, which should help get a grip on how to use the package, and also let us see if HMMs are actually any good.

We have two states, let's call them S1 and S2. They were set up with the following properties:

Both states provide realizations from the normal distribution, in this case S1 has mean -1, standard deviation 1, and S2 has mean 1, standard deviation 1. We can also see the transition matrix between S1 and S2 in the last two columns.

A trajectory was sampled, and depmixS4 used to fit an HMM. It provided the following estimates:
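For reference, a minimal sketch of this setup. The simulation is self-contained base R; the fitting step assumes the depmixS4 package is installed, and the parameter values below simply mirror the ones described above:

```r
set.seed(42)

# Simulate a two-state chain with normal emissions:
# S1 ~ N(-1, 1), S2 ~ N(1, 1)
P  <- matrix(c(0.9, 0.1,
               0.1, 0.9), nrow = 2, byrow = TRUE)
mu <- c(-1, 1)
n  <- 500
s  <- integer(n)
s[1] <- 1
for (t in 2:n) s[t] <- sample(1:2, size = 1, prob = P[s[t - 1], ])
obs <- rnorm(n, mean = mu[s], sd = 1)

# Fit a two-state HMM and inspect the estimates
if (requireNamespace("depmixS4", quietly = TRUE)) {
  library(depmixS4)
  mod <- depmix(obs ~ 1, data = data.frame(obs = obs),
                nstates = 2, family = gaussian())
  fm  <- fit(mod, verbose = FALSE)
  summary(fm)           # estimated means, sds and transition matrix
  est <- posterior(fm)  # most likely state sequence
}
```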

You can see it did a pretty good job of recovering the true values.

In the following plot, you can see the sample trajectory, the estimated state from the HMM, and the actual state used to generate the sample. This example worked very well; it's not always the case that things turn out so nicely.

Voodoo I tell you!


There is no guarantee that a fitted HMM will be of any use, and even with this simple example, the state estimates can be wildly inaccurate. You can try this for yourself by using different seed values.

This example is based on one from the book Hidden Markov Models and Dynamical Systems, which I found to be an excellent resource on the topic. It is clearly written, covers the basic theory and some actual applications, along with some very illustrative examples. Source code is provided in Python.

I've read (or at least tried to read) pretty much every book on HMMs I could find, and found this one to be the most useful if you are new to HMMs and are interested in applications.

You may also find these other posts about HMMs useful as well:

Fun With R and HMMs
Getting Started with Hidden Markov Models in R
Hidden Markov Models Examples in R, part 3/4

There is also the classic paper by Rabiner.

Hopefully this has shed a bit of light on HMMs and the depmixS4 package. Code is up here.

Monday, July 14, 2014

Bayesian Naive Bayes for Classification with the Dirichlet Distribution

I have a classification task and was reading up on various approaches. In the specific case where all inputs are categorical, one can use “Bayesian Naïve Bayes” using the Dirichlet distribution.

Poking through the freely available text by Barber, I found a rather detailed discussion in chapters 9 and 10, as well as example MATLAB code for the book, so I took it upon myself to port it to R as a learning exercise.

I was not immediately familiar with the Dirichlet distribution, but in this case it appeals to the intuitive counting approach to discrete event probabilities.

In a nutshell we use the training data to learn the posterior distribution, which turns out to be counts of how often a given event occurs, grouped by class, feature and feature state.

Prediction is a case of counting events in the test vector. The more this count differs from the per-class trained counts, the lower the probability the current candidate class is a match.
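To make the counting idea concrete, here is a tiny sketch with made-up data, using a Dirichlet(1, ..., 1) prior, i.e. add-one smoothing on the counts (the data, feature names and `smoothed` helper are all invented for illustration):

```r
# Toy training data: class label y, two categorical features
train <- data.frame(
  y  = factor(c("a", "a", "b", "b", "b")),
  f1 = factor(c("x", "x", "y", "y", "x")),
  f2 = factor(c("p", "q", "q", "q", "p"))
)

# Posterior counts: observed counts plus the Dirichlet pseudo-count of 1,
# normalized per class
smoothed <- function(feature, y) {
  tab <- table(y, feature) + 1
  tab / rowSums(tab)
}
p_f1  <- smoothed(train$f1, train$y)
p_f2  <- smoothed(train$f2, train$y)
prior <- table(train$y) / nrow(train)

# Score a test vector (f1 = "x", f2 = "p") for each class
score <- log(prior) + log(p_f1[, "x"]) + log(p_f2[, "p"])
names(which.max(score))  # predicted class
```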

Anyway, there are three files. The first is a straightforward port of Barber’s code, but this wasn’t very R-like, and in particular only seemed to handle input features with the same number of states.

I developed my own version that expects everything to be represented as factors. It is all a bit rough and ready but appears to work, and there is a test/example script up here. As a bigger test I ran it on a sample car evaluation data set from here; the confusion matrix is as follows:

testY   acc good unacc vgood
  acc    83    3    29     0
  good   16    5     0     0
  unacc  17    0   346     0
  vgood  13    0     0     6

That’s it for now. Comments/feedback appreciated. You can find me on Twitter here.

Links to files:

Everything in one directory (with data) here

Sunday, June 22, 2014

Trading in a low vol world

I wanted to take a look at what works in low vol environments, such as we are currently experiencing. I am open to the idea that we have entered a period of structurally low volatility due to increased regulatory burden and flow-on effects from the decline of institutional FICC trading. Or it may just be a function of QE, and post-tapering we will see a return to higher levels.

The plan

The main idea is to compare mean reversion (MR) vs. follow through (FT). For simplicity I define mean reversion as an up day being followed by a down day, and a down day being followed by an up day. Conversely, follow through sees an up day followed by another up day, and a down day followed by a down day.

I took a look at the major US equity indices, SPX (GSPC), NDX and RUT.

For each series we calculate daily log returns for the current period and shift them forward to get the return for the next period. Then we calculate realized volatility (RV) and split the data set into "low volatility" and "high volatility" by looking at the median realized vol for the whole series.

Then, for each series, we use bootstrapped samples to simulate a number of trajectories/equity curves for each strategy (MR/FT) under the two classes of RV. Finally we take an average of the total return of each trajectory to get a ballpark idea of how they went.
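The pipeline could be sketched roughly as follows in base R. The price series is synthetic and the realized-vol proxy is deliberately crude, so this illustrates the mechanics rather than reproducing the actual analysis:

```r
set.seed(1)

px  <- 10 + cumsum(rnorm(1000, sd = 0.01))  # synthetic log-price series
ret <- diff(px)                             # current-period log returns
nxt <- c(ret[-1], NA)                       # next-period return

rv   <- abs(ret)                            # crude realized-vol proxy
lowv <- rv <= median(rv)                    # low-vol regime flag

# MR fades today's move, FT follows it
mr <- -sign(ret) * nxt
ft <-  sign(ret) * nxt

# Bootstrap the average total return of each strategy in the low-vol regime
boot_total <- function(x, n = 1000) {
  x <- x[!is.na(x)]
  mean(replicate(n, sum(sample(x, replace = TRUE))))
}
res <- c(MR = boot_total(mr[lowv]), FT = boot_total(ft[lowv]))
res
```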


The data is from the start of 1999 to the present, so roughly 15 years. Each run generates 1000 trajectories with a sample size of roughly 950.

For the low vol case, the results are unfortunately ambiguous. Follow through in a low vol environment seemed to do well for NDX and RUT, but the opposite was the case for SPX.

The TR column is the sum of the series over the whole period for the volatility class (i.e. a simple long only strategy), giving an idea of a directional bias that may be present in the sampling.

In the high vol environment, mean reversion was a clear winner, and consistent over the different underlyings.

The results seem relatively stable across trajectory size/sample size.


I'm not really sure what is going on with SPX. My intuition was that FT would do well in low vol environments, but that doesn't seem to be the case, at least not for SPX.

I was actually getting consistent votes for FT in the low vol case, then restarted R to run with a clean environment and started getting the above instead. You can't spell argh without R it seems.

Source is up here. As always, you can find me on Twitter here. Thanks for stopping by.

Saturday, May 31, 2014

Divergence on NDX

I generally take a dim view of old-timey technical indicators; perhaps they work for some people, but I have found there are much better tools available. One exception is divergence, which in this case is when price makes a new high, but the MACD (or your favourite oscillator) does not.
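A divergence check along these lines can be coded up in a few lines of R. This is only a sketch: the EMA spans are the usual 12/26 MACD defaults, the lookback window is an arbitrary choice, and the price series is synthetic:

```r
# Exponential moving average
ema <- function(x, n) {
  a   <- 2 / (n + 1)
  out <- numeric(length(x))
  out[1] <- x[1]
  for (i in 2:length(x)) out[i] <- a * x[i] + (1 - a) * out[i - 1]
  out
}

set.seed(2)
px   <- 100 + cumsum(rnorm(300, mean = 0.05))
macd <- ema(px, 12) - ema(px, 26)

# Divergence: price makes a new high over the lookback but MACD does not
lookback <- 50
n <- length(px)
new_price_high <- px[n] >= max(px[(n - lookback):(n - 1)])
new_macd_high  <- macd[n] >= max(macd[(n - lookback):(n - 1)])
divergence <- new_price_high && !new_macd_high
```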

There is very nice looking divergence on NDX, and it also shows up in a weaker form on SPX and INDU. I never take it as a trade signal by itself, but it does make me look a little closer. I have marked off some previous occurrences as well. It does not give any indication about when a sell off may occur, or how much of a sell off will eventuate. Pretty useful, isn't it?

Another form of divergence I take note of is the marked failure of RUT to make it back to its recent highs, which differs from NDX/SPX/INDU. 

You can also see in the charts above that volume has been declining, especially over the last 4-5 weeks. 

I do think we are in a long run bull market which still has a few more years to go. In the event of a shorter term sell off I would generally be looking to buy dips. 

A good sign of a bull market is shrugging off negative events. We've had some reasonably serious geopolitical happenings, the invasion in Ukraine, a coup in Thailand, and anti-Chinese riots in Vietnam that produced a number of fatalities. 

Struggling to think what a catalyst might be; perhaps some unpleasant surprise regarding QE tapering, or an unconstrained collapse in the Chinese property market, both of which I think are pretty unlikely.

There's a bunch of macro data out next week, and Apple is having its WWDC. Apple used to make up a very large amount of NDX, something like 24% of the index value was determined by AAPL prices. I know they rebalanced it and am not up to date with where it currently stands.

A quick look at FX realized vol

Much has been said about the decline in volatility. At the moment I am very active in FX spot trading and as a generalization do better the more vol there is.

I wanted to see how things stood on the crosses I am most active in, namely EUR/USD, GBP/USD and USD/JPY.

I took hourly data from FxPro (not my broker, nor an endorsement), calculated volatility as the high minus the low, and summed the total for each day. You can think of it as how many pips were on offer if one could correctly call the high and low of each hour of each day.
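In R the per-day calculation looks something like this; the hourly bars below are simulated stand-ins for the real data:

```r
set.seed(3)

# Fake hourly bars: 5 days x 24 hours for one cross
hours <- data.frame(
  day  = rep(as.Date("2014-05-01") + 0:4, each = 24),
  high = 1.37 + abs(rnorm(120, sd = 5e-4)),
  low  = 1.37 - abs(rnorm(120, sd = 5e-4))
)

hours$range <- hours$high - hours$low   # "pips on offer" per hour
daily <- aggregate(range ~ day, data = hours, FUN = sum)
daily                                   # total range per day
```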

All up there is about 90 days of data, so it covers roughly the last four months. I also took the average of the last five days, which are the red X’s on the box plots. We are here.

As you can see, vol is below average. It was a quiet week overall, with bank holidays in the US and parts of Europe, and only a moderate amount of data coming out. Next week should be a bit busier, I think.

Since I had all the data, I also took a look at the average hourly RV per day.

I don’t want to read too much into this chart, but things have been quiet. I read somewhere else that FX vol is approaching 2007 levels, which was a very quiet time indeed.

Some R code is up here, data is here.

Saturday, May 17, 2014

RcppArmadillo cheatsheet

I have been using RcppArmadillo more and more frequently, so I thought I would make a cheatsheet/cookbook type reference that translates common R operations into equivalent arma code.

I have put them up on a github wiki page here.

The functions are all pretty basic and not particularly robust. In particular they do not do any bounds or sanity checking.

You might also enjoy the arma documentation, in particular the Matlab/Octave syntax conversion example.
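To give a flavour of the kind of translations involved, here are a few common R operations with their arma equivalents as comments (see the wiki page itself for the full list):

```r
A <- matrix(1:4, nrow = 2, byrow = TRUE)
B <- matrix(5:8, nrow = 2, byrow = TRUE)

A %*% B    # arma: A * B    (matrix product)
A * B      # arma: A % B    (element-wise product)
t(A)       # arma: A.t()
solve(A)   # arma: arma::inv(A)
```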

There is also an excellent book, Seamless R and C++ Integration with Rcpp.

Any corrections or additions are most welcome.