So I've been squeezing in what work I can on this project and I think I have something useful, or at least a starting point.
This post is about the HMM (Hidden Markov Model) I've worked up. The other side of this project was a simple probability matrix for sequences of highs of day, lows of day, double tops and double bottoms. That's finished (methodology detailed in my previous post).
I did break my first version. I was sampling some intraday data before it was available: price vs the extreme of price for the day, which is obviously only known once the day is over. So I have recoded the NT strategy that collects the source data so that states and emissions are sampled every 4 hours, at 00:00, 03:30, 07:30, 11:30, 15:30, 19:30.
There are 3 different states:
i) Price is more than 0.1% higher than it was 4 hours ago (state 2)
ii) Price is more than 0.1% lower than it was 4 hours ago (state 1)
iii) Price is between these two boundaries, i.e. roughly where it was 4 hours ago (state 3)
Price is sampled from a 2-period 5-minute EMA in an endeavor to catch the meaningful shifts in price without the spikes.
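For concreteness, the state rule above might look like this in code (a Python sketch for illustration; the 0.1% thresholds are from the post, the function name is mine):

```python
def classify_state(price_now, price_4h_ago):
    """Map the 4-hour price change onto the post's three states."""
    change = (price_now - price_4h_ago) / price_4h_ago
    if change > 0.001:      # up more than 0.1% -> state 2
        return 2
    elif change < -0.001:   # down more than 0.1% -> state 1
        return 1
    else:                   # within the 0.1% band -> state 3 (ranging)
        return 3
```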
There are 6 different emissions:
Emissions are as they were before, 3 for movement below yesterday's mid line and 3 for movement above it:
i) Price is more than 0.8% below yesterday's mid line (emission 1)
ii) Price is between 0.8% and 0.4% below yesterday's mid line (emission 2)
iii) Price is between 0.4% below and yesterday's mid line (emission 3)
(these ranges are mirrored for price above the mid line and make up emissions 4, 5 and 6)
Price is sampled as a 2-period 30-minute EMA.
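The six-way emission rule can be sketched the same way (Python, for illustration; the post doesn't spell out the ordering of emissions 4–6, so the assumption below is that they mirror 1–3 outward from the mid line):

```python
def classify_emission(price, mid):
    """Bin price relative to yesterday's mid line into emissions 1-6.
    Emissions 4-6 assumed to mirror 1-3 above the mid line."""
    d = (price - mid) / mid
    if d < -0.008:   # more than 0.8% below
        return 1
    if d < -0.004:   # 0.8%-0.4% below
        return 2
    if d < 0.0:      # 0.4% below up to the mid line
        return 3
    if d < 0.004:    # mid line up to 0.4% above (assumed)
        return 4
    if d < 0.008:    # 0.4%-0.8% above (assumed)
        return 5
    return 6         # more than 0.8% above (assumed)
```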
The two files are then imported to Matlab and the HMM applied.
What I am trying to explore is whether there is a relationship between current price relative to price 4 hours ago, in light of what has come over the previous n days.
The HMM applies probability algorithms to generate the most probable next states based on the states and corresponding emissions that are fed in. Some of the code comes with the Statistics Toolbox and the rest I have hacked together.
My model does this:
Takes the first 480 emission data points (6 samples per day, so 80 days' worth of data) and applies Matlab's hmmestimate function. There are two outputs: a 3x3 probability matrix giving the maximum likelihood estimate of each state transitioning to any other state, and a 3x6 probability matrix giving the maximum likelihood estimate of each emission being generated by each state.
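The counting behind this step can be sketched in a few lines (a Python stand-in for illustration; Matlab's hmmestimate also offers extras such as pseudocounts):

```python
import numpy as np

def estimate_hmm(states, emissions, n_states=3, n_emissions=6):
    """Maximum-likelihood estimates in the spirit of hmmestimate:
    count transitions and emissions, then normalise each row."""
    trans = np.zeros((n_states, n_states))
    emis = np.zeros((n_states, n_emissions))
    for s_prev, s_next in zip(states[:-1], states[1:]):
        trans[s_prev - 1, s_next - 1] += 1   # states are 1-based in the post
    for s, e in zip(states, emissions):
        emis[s - 1, e - 1] += 1
    # Normalise rows; guard against states never observed in the window.
    trans = trans / np.maximum(trans.sum(axis=1, keepdims=True), 1)
    emis = emis / np.maximum(emis.sum(axis=1, keepdims=True), 1)
    return trans, emis
```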
The next step is to use the hmmtrain function to train on / estimate the transition and emission probabilities for the original 480-point emission sequence. I feed the output from hmmestimate in as the starting point, so the model is training from the most up-to-date real position.
The last step is to run the Viterbi algorithm, which generates the most likely state path through the HMM (using the probability matrices produced in the second step) based on the actual sequence of emissions. The output of this step also predicts the next most probable state.
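The core Viterbi recursion that hmmviterbi performs can be sketched as follows (Python for illustration, in log space to avoid underflow; the uniform start distribution is my assumption):

```python
import numpy as np

def viterbi(emission_seq, trans, emis, start=None):
    """Most likely state path for a 1-based emission sequence,
    analogous in spirit to Matlab's hmmviterbi."""
    n = trans.shape[0]
    start = np.full(n, 1.0 / n) if start is None else start
    with np.errstate(divide="ignore"):   # log(0) -> -inf is fine here
        lt, le, ls = np.log(trans), np.log(emis), np.log(start)
    T = len(emission_seq)
    score = np.zeros((T, n))
    back = np.zeros((T, n), dtype=int)
    score[0] = ls + le[:, emission_seq[0] - 1]
    for t in range(1, T):
        cand = score[t - 1][:, None] + lt        # cand[i, j]: prev i -> next j
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + le[:, emission_seq[t] - 1]
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):                # backtrace
        path.append(int(back[t, path[-1]]))
    return [s + 1 for s in reversed(path)]       # back to 1-based states
```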
The model loops through the entire data set (c.6,000 points covering about 3.5 years) with a moving window of 480 data points, and I harvest the next predicted state each time. The final output is a matrix the same size and shape as the input of original states but shifted back by one step, so that I can compare the actual state at a particular sampling point with the prediction made 4 hours earlier.
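The rolling-window harvest might be structured like this (a deliberately simplified Python stand-in: for brevity it predicts the next state from the window's transition counts alone, where the post's full pipeline runs hmmestimate, then hmmtrain, then Viterbi per window):

```python
import numpy as np

def predict_next_states(states, window=480):
    """Slide a training window over a 1-based state sequence and harvest
    a next-state prediction after each window (simplified stand-in)."""
    preds = []
    for start in range(len(states) - window):
        win = states[start:start + window]
        trans = np.zeros((3, 3))
        for a, b in zip(win[:-1], win[1:]):      # count transitions in-window
            trans[a - 1, b - 1] += 1
        # Predict the most frequent successor of the window's last state.
        preds.append(int(trans[win[-1] - 1].argmax()) + 1)
    return preds
```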
I tested the steps and think it is sound, insofar as the code and model are doing what I intended and not generating erroneous output.
There is more work to do in trialling different window lengths and initial parameters for the state and emission inputs. I have gone with a common-sense approach first to see what popped out. I am also limited by computing power: a full run for a 3-state, 6-emission, 6,000-data-point model takes an hour, and my development machine is no slouch.
I have completed a few runs and had a chance to feed the Matlab output back to an NT strategy, where a state change to 2 goes long and a state change to 1 goes short (state 3, ranging, is ignored). My initial comments on the HMM and the initial back tests are:
- Accuracy of any given data point being correctly predicted is no better (and no worse) than a random 33%.
- The output of states does not exactly resemble the distribution of the actual states; it has a slightly different character. The output states tend to come in waves, where, say, a move from state 3 to 2 will bleed in as a trend up is predicted and then takes hold.
- Based on the feature above, a dominant 'regime' can very often be identified, which could be used as a bias overlay for intraday trading.
- The output of predicted states has a moving-average feel to it, and I will do some comparisons with an MA crossover to see if the HMM model is in fact any better. I have run a back test with the model pulled nearer by a further 4 hours and the results were stellar, which i) gives me some confidence that the model is not just random, but ii) led me to think it may just be another MA.
- Out-of-sample performance matches the in-sample performance.
My aim is not to end up with an automated strategy, just a way to use a bias filter for my few discretionary setups. So the next steps are to get into the detail of the outputs, back test, and find out whether I can incorporate the state predictions into my trading.
Apologies in advance for not posting any code (yet). I am wary that the method/code may be flawed and would be more comfortable working on it a while longer. However, my intention is to post both the Matlab and NT files in due course so others can have a play too.
Hope to have some backtests stats and other findings available over the next few weeks.
I have been working with Accord .NET and might be able to get an online version of your logic running when you are ready. Accord has an HMM built in, so you might be able to run the model within an indicator. Instructions on setting up Accord:
Just taken a quick look at this and it is open-source awesomeness. Great signpost, thanks. If I can prove to myself that my model/concept contributes to profit, this integration via C# with NT is about as smooth as it could get.
I've decided to put this project on hold for now. I can sense there is something in there but winkling it out is taking a lot of time - which I simply do not have spare right now. Plan to tidy up the code for Matlab and NT and post up in the new year to see if anyone wants to take it on.
I understand, I have a shortage of time right now as well. Best wishes, and I look forward to your progress on this topic. I believe there is some hope as well; it just takes time and patience to get through the research side.
Well I was just about to put this project on the shelf, mostly due to lack of time (but I wasn't making the progress I thought I would, so was perhaps looking for an excuse), and I saw through one of the issues I was coming up against - and have identified a way around it.
The idea I am trying to investigate is that there are hidden patterns in price, which HMMs are supposed to be good at uncovering. My initial dead end with my first raft of modelling was that the two variables I used were very closely correlated: both sampled price, simply at different time intervals. As a result my model generated something like a lagging moving average.
What I think I needed were two uncorrelated variables with strong co-integration, i.e. there is a strong relationship between them, but not one that holds step by step. Google co-integration and there is a well-used analogy of a drunk walking home with his dog: they both wander around, circle back, etc., but not in unison. They leave from the same place and end up in the same place, and are closely related throughout their journey, although it looks pretty random at times.
So I have tested out using price as one variable and a bounded indicator (in my case Williams %R) as the other. I chose Williams %R as it's very reactive and as a consequence jumps around a fair bit (stochastics would also be an option for the same reason).
The model is very similar to the previous version, although now I have categorised Williams %R between 0 and 100 into 6 states, and the model aims to predict the state the indicator will be in. I've sampled hourly and run a 40-day moving training window to return the next hour's prediction.
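The indicator binning could be as simple as the sketch below (Python for illustration). The post doesn't give the bin boundaries, so equal-width bins are my assumption; note also that Williams %R is conventionally quoted from -100 to 0, so a reading would first be rescaled to the post's 0–100 range.

```python
def williams_r_state(wr):
    """Bin a 0-100 Williams %R reading into six states (1-6).
    Equal-width bins are assumed; the post doesn't specify boundaries."""
    if not 0 <= wr <= 100:
        raise ValueError("expected a reading rescaled to 0-100")
    # Integer arithmetic keeps the bin edges exact; 100 maps into the top bin.
    return min(int(wr * 6 // 100), 5) + 1
```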
Maybe progress, but have to run tests to see what's what. Feels like I am on a more fruitful path though.
Everything derived from price will have a correlation with it; price and Williams %R are not independently drawn. My suggestion, in that case, is to try, or better, experiment with, something like 100 possible states (the indicator rounded to 0 decimal places) and from that calculate the estimates of the transition and emission probabilities. Better to use RSI, as it is like your price remapped.
I have experienced similar effects when dealing with SVMs. I need more than a few SMAs; I need those "orthogonal" types of parameters. Too much of the same thing doesn't add much more knowledge to the system. You need to look at the problem from different angles/perspectives.
I would say most every indicator is dependent upon price action and/or volume. They all need to be derived from the raw data available, which is Level I/II data. The order book is still closely related to price action; you could consider it the cause (it precedes price). What I have found from machine learning is that it comes down to digesting this limited data into "features" to present to the learning system. I believe this is what happens in the mind of the expert trader: they convert what they see into features and determine the future likelihood of something occurring. I would say most expert traders don't know they are doing this; it just happens, as it is something they have conditioned their mind to do.
The order book only seems to me like... some guys waiting for their limits/stops to be hit. Do you agree?
Still, it's not the cause; price doesn't simply follow the order book.
The best indicator is price action itself, no doubt about that.