Since writing my last post I have done further research into big-data issues. It turns out variable reduction is a big problem, and one that is quite easily solved. I was on the right track reducing my variables based on the correlation matrix, but my methods were crude. A method known as variable clustering provides a much more powerful means of getting the job done.
I have since clustered my variables based on the 4 different correlation metrics and again settled on 15 variables, selected under Hoeffding's D. In a few hours I was able to complete what I had been struggling with for the last two weeks!
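To sketch the variable-clustering idea in Python (this is my own toy illustration, not my actual setup - the data, the 0.5 cut-off and the use of Pearson correlation are all invented for the example): build a distance from the correlation matrix, cluster the variables hierarchically, and keep one representative per cluster.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
base = rng.normal(size=500)
X = np.column_stack([
    base + 0.1 * rng.normal(size=500),   # var 0: noisy copy of a common driver
    base + 0.1 * rng.normal(size=500),   # var 1: another noisy copy (redundant)
    rng.normal(size=500),                # var 2: independent
    rng.normal(size=500),                # var 3: independent
])

corr = np.corrcoef(X, rowvar=False)      # swap in Spearman/Hoeffding here
dist = 1 - np.abs(corr)                  # similar vars -> small distance
np.fill_diagonal(dist, 0.0)

Z = linkage(squareform(dist, checks=False), method="average")
clusters = fcluster(Z, t=0.5, criterion="distance")
# vars 0 and 1 land in one cluster; keeping one variable per cluster
# shrinks the candidate list without losing much information
```

Swapping the correlation measure (Pearson, Spearman, Kendall, Hoeffding) just changes the `corr` matrix; the clustering step is identical.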
Now that I have the vars selected I will be able to move forward onto quotients with price, investigating custom vars through modelling the series itself, calculating and smoothing WOEs, and preparing a datamart.
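For reference, the WOE (weight of evidence) calculation can be sketched like this - a minimal, unsmoothed version with made-up buckets and outcomes; the smoothing step and the actual bucket definitions are left out:

```python
import numpy as np

def woe(bucket_ids, y):
    """Weight of evidence per bucket: ln(share of goods / share of bads)."""
    n_good = (y == 0).sum()
    n_bad = (y == 1).sum()
    out = {}
    for b in np.unique(bucket_ids):
        m = bucket_ids == b
        dist_good = (y[m] == 0).sum() / n_good
        dist_bad = (y[m] == 1).sum() / n_bad
        out[b] = np.log(dist_good / dist_bad)
    return out

buckets = np.array([0, 0, 0, 1, 1, 1])
outcome = np.array([0, 0, 1, 1, 1, 0])
w = woe(buckets, outcome)  # bucket 0 skews good (+WOE), bucket 1 skews bad (-WOE)
```

In practice you would add a small smoothing constant to the counts so an all-good or all-bad bucket does not blow up the log.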
Wow, I've been busy and have neglected my blog. After shortlisting the 15 vars I joined them to my outcome data and ran a preliminary model estimation using the raw vars. Instead of a combination of log-reg models I opted for a single model of an NN type, and the results were very promising.

Following this I began constructing the full datamart. I opted to follow 4 different forex pairs and downloaded the data and indicators for them. I have modified my outcome code and have just completed the join to pull it all together.

Whilst reviewing outcomes I decided to check a less sophisticated outcome based on a fixed duration. If I had a strategy that was 100% accurate I would trade more often and make more money with this alternative outcome, but on a per-trade basis it performs worse than outcome1. When I come to test my model deployment I may revisit this second outcome. I now have to put together the full datamart with outcome combinations.
So at this point I have a basic model which I could begin to deploy. The model consists of 5 vars:
- Envelope on 2hr chart
- OBV on 2hr chart
- ADXW +signal on 2hr chart
- RVI on 2hr chart
- last closing price on 5min chart
I have had limited success optimising a SAS version of the model, so I'm not convinced of its robustness or profitability. I will, however, look to optimise a simple NN using these vars for EURUSD if I can't derive a more robust system. I guess this is my fall-back model.
Currently I am continuing to refine my trend analysis which I hope will allow me to select a better set of 'classes'. Once I have a few potential profitable classes (especially historically profitable for Oct-Jan period) I will begin to look at some further pattern analysis with the aim of creating some custom indicators.
For me, modelling is an iterative process. Every point of analysis brings ever more clarity to the ideas.
I've been doing a lot of research on different classification models and I keep coming back to log-reg.
During my outcome analysis work I have settled on a very promising concept. I will apply a filter to the data which I think is similar to what traders call a zig-zag and engineers call a bandpass filter. By varying the bandwidth of the bandpass filter I can remove varying degrees of noise from the price signal. I then create the empirical distribution of the durations of the resulting trends and settle on the bandwidth that gives me reasonable trend lengths, probably something like 2-4 hrs.
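As a rough sketch of the filtering step (assuming scipy and a Butterworth design - the 2-4 hr band comes from the post, but the filter type, the synthetic price series and the one-sample-per-minute assumption are mine):

```python
import numpy as np
from scipy.signal import butter, filtfilt

# One observation per minute; pass cycles with periods between 2h and 4h.
fs = 1.0                        # samples per minute
low, high = 1 / 240, 1 / 120    # cycles per minute (4h and 2h periods)
b, a = butter(2, [low, high], btype="band", fs=fs)

minutes = np.arange(2000)
rng = np.random.default_rng(1)
price = (1.10                                          # DC level (removed by filter)
         + 0.002 * np.sin(2 * np.pi * minutes / 180)   # ~3h cycle, inside the band
         + 0.001 * rng.normal(size=minutes.size))      # high-frequency noise
filtered = filtfilt(b, a, price)  # zero-phase, so trend turns are not lagged
```

Narrowing or widening `[low, high]` changes how much noise survives, which is exactly the knob used to tune the empirical trend-duration distribution. Note `filtfilt` is non-causal (it uses future data), so it is fine for labelling history but not for live signals.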
I can then class each data observation as belonging to either a long or short trend (I will look to extend this to include sideways at a later date).
Once I have the trends marked out I can take a reasonably sized window across the points where the trends change. I will use 30-60 mins on each side of the inflection points and class the observations within the windows as LS, SL or N, for Long->Short, Short->Long and Neutral respectively.
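The windowed labelling step can be sketched as follows (the function name and the tiny example series are my own illustration; the LS/SL/N scheme is from the post):

```python
import numpy as np

def label_transitions(trend, window):
    """trend: array of +1 (long) / -1 (short) per observation.
    Mark `window` observations either side of each trend flip as LS or SL;
    everything else stays N (neutral)."""
    labels = np.full(len(trend), "N", dtype=object)
    flips = np.where(np.diff(trend) != 0)[0]   # index of last bar before the flip
    for i in flips:
        lo = max(0, i - window + 1)
        hi = min(len(trend), i + 1 + window)
        tag = "LS" if trend[i] > 0 else "SL"   # leaving a long -> LS, else SL
        labels[lo:hi] = tag
    return labels

trend = np.array([1, 1, 1, -1, -1, -1, 1, 1])
labels = label_transitions(trend, window=1)
```

With minute bars, `window=30` to `window=60` reproduces the 30-60 min windows described above.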
I will then model these classifications with a log-reg!
Hopefully I will be more confident with this resulting model than the last one...
As I have refined the outcomes I will be looking to predict, I need to go back to variable selection. I am starting with a basic list of 200 different indicators spanning 1, 5, 10, 15, 30, 60 & 120 minute time horizons. I have then combined these variables across the alternate time horizons within the same variable classes to create 2000 possible indicators. An example is:
5min MA / 1min MA = a new var determining whether the faster MA is diverging from or converging with the slower one.
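The MA-ratio example above can be sketched like this (window lengths and the toy rising price series are mine, purely for illustration):

```python
import numpy as np

def moving_average(x, n):
    """Trailing simple moving average; NaN until n observations are available."""
    valid = np.convolve(x, np.ones(n) / n, mode="valid")
    return np.concatenate([np.full(n - 1, np.nan), valid])

price = np.linspace(100, 110, 50)  # steadily rising toy series
ratio = moving_average(price, 5) / moving_average(price, 25)
# ratio > 1: the fast MA sits above the slow MA (uptrend accelerating);
# ratio < 1: fast below slow. The level itself is scale-free, which is
# what makes these quotient vars comparable across pairs.
```

The same construction applies to any indicator class: divide the short-horizon version by the long-horizon version to get a relative, horizon-spanning feature.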
I will be ranking the value of these vars by multinomial concordance and maximum log-likelihood. This will allow me to focus on a subset of vars to transform from continuous to discrete. I should have the shortlist by the end of the weekend.
I can follow along with your approach so far and I think it looks interesting. One thing you might consider as an alternative or additional step is K-means clustering, an unsupervised learning technique. Unsupervised learning can be a cheaper alternative to a supervised logistic regression, and it could uncover important data relationships automatically. You could also take the K-means results, classify them, and then feed the classification into your logistic regression, and perhaps end up with a better model, or get there quicker than classifying your data by hand.
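The K-means-into-log-reg pipeline being suggested can be sketched in a few lines (assuming scikit-learn; the two-blob toy data is invented just to show the mechanics):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Toy data: two separable blobs standing in for market regimes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, size=(100, 2)),
               rng.normal(2, 1, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Step 1: unsupervised clustering discovers structure without labels.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Step 2: feed the cluster assignment to the log-reg as an extra feature.
X_aug = np.column_stack([X, clusters])
model = LogisticRegression().fit(X_aug, y)
acc = model.score(X_aug, y)
```

In practice the cluster label would be one-hot encoded if `n_clusters > 2`, and the clustering would be fitted on training data only to avoid leakage.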
The following user says Thank You to Strato for this post:
That is an interesting idea. Let's see if I have it right.
1 - classify each possible event using a NN, something like a k-nearest neighbour.
2 - feed that classification into the log-reg?
My work using ordinal log-reg has been very promising so far. I wanted to use a multinomial log-reg straight off the mark but have an issue with calculating the Gini coefficient for it, so I settled for the much simpler ordinal version. I am currently in the process of reducing degrees of freedom by collapsing var buckets, checking that coefficients are ordered correctly, and outright removing vars to maximise Gini. This is a very time-consuming process.
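For the two-class case the Gini falls straight out of the AUC (Gini = 2*AUC - 1); it is the multinomial generalisation that is awkward. A sketch of the binary version via the rank-sum (Mann-Whitney) formulation, with ties ignored for brevity and a made-up four-observation example:

```python
import numpy as np

def gini(y_true, score):
    """Gini = 2*AUC - 1, AUC computed from the rank-sum of positives.
    Assumes no tied scores (a real implementation would average tied ranks)."""
    order = np.argsort(score)
    ranks = np.empty(len(score))
    ranks[order] = np.arange(1, len(score) + 1)
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    auc = (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    return 2 * auc - 1

y = np.array([0, 0, 1, 1])
s = np.array([0.10, 0.40, 0.35, 0.80])
g = gini(y, s)  # 3 of 4 pos/neg pairs ranked correctly -> AUC 0.75, Gini 0.5
```

One common workaround for the multi-class case is to compute this pairwise per class (one-vs-rest) and average, though that is a choice rather than a single agreed definition.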
The following user says Thank You to ajespy for this post: