Intro:
I have decided to document the development and deployment of my first FOREX ATS. Once the system is complete, I hope to review this journal and see where I went right and wrong in the development process.

I welcome the comments along the way...

AIM: I am developing an ATS to trade multiple currency pairs. I hope to use this as a basis for a separate ATS focused on ES contracts. I am looking to complete the FOREX ATS in time to be tested in this year's MQL competition; I hope that the structured competition will provide a good means of testing the strategy's performance.

I will look to update this journal a few times a week at least...

Journal Entry #1:
A bit of background to development as it currently stands: So far I have downloaded and installed MT5 which I have no experience using. I also have no experience trading forex so this will be a challenge.

The first thing I did after downloading MT5 was play around with the application. The strategy optimization/analysis tools look awesome.
I then read a few articles on how to develop different strategies. I have to say I am quite impressed by how the MQL community is set up. Everything seems to promote collaboration, which is a big plus for newbies like me.
Following this I reviewed last year's competition details to see how it fared. The rules are interesting and allowed me to formulate a development strategy. I am a little concerned, as it appears there are 1000s of entries to this competition and a lot of the strategies appear to be VERY basic. These strategies do seem to succeed in the competition, and I ask myself whether it is more a lottery than a test of skill. I am looking to develop a statistically robust strategy, which may not win against some of these 'lucky' strategies. I guess the measure of success will not be winning the competition but testing whether the development framework I employ can produce a consistently profitable strategy. Time will tell if I am successful or not.

My next step was to download a whole bunch of EURUSD 1m data for importing into SAS (my preferred analytics package).

Once I have my historical 1m time series I need to look at creating some outcome flags. To do this I have to define a crude business case based on the questions: What is a profitable trend? What is my entry point?
By conducting some univariate analysis I get an understanding of how the price moves over different time frames. I don't want to trade too quickly, as this is a trend model, and I don't want to exit too quickly when price moves against me either. To capture these characteristics I apply a price filter and some time constraints.

Once I have these answered I can then extend the single 1m observations to a window of observations where I can enter the market and be profitable. Now that I have these trading horizons defined for both long and short trends I can move onto some indicators.
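Sketched in Python (rather than the SAS I actually use), the flagging logic looks roughly like this; the function name, the 120-bar look-ahead and the demo series are illustrative assumptions, not my production code:

```python
import numpy as np

def flag_outcomes(close, price_filter=0.0055, horizon=120):
    """Flag each 1m bar with the trade outcome reachable from it.

    +1 if price rises by at least `price_filter` within the next
    `horizon` bars, -1 if it falls by that much within the window,
    0 otherwise. Long is tested first, so a bar that qualifies both
    ways counts as long.
    """
    n = len(close)
    flags = np.zeros(n, dtype=int)
    for i in range(n - horizon):
        window = close[i + 1 : i + 1 + horizon]
        if window.max() - close[i] >= price_filter:
            flags[i] = 1
        elif close[i] - window.min() >= price_filter:
            flags[i] = -1
    return flags

# demo: a ~100-pip rise followed by a ~150-pip fall
demo = np.concatenate([np.linspace(1.30, 1.31, 200),
                       np.linspace(1.31, 1.295, 200)])
demo_flags = flag_outcomes(demo, price_filter=0.005, horizon=120)
```

Widening the price filter or shortening the horizon is exactly the knob-turning described above: fewer, cleaner flags versus more, noisier ones.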

The modeling session continues, so whilst my code runs it's a good idea to jot down my thoughts...
On further consideration I have decided to continue with investigating the outcomes.
I think my original filter was too light at 0.0035; as a result, my flagged-outcome cut-off was too small at 0.0095, with the horizon cut-off set at 0.0055.
I was capturing too much noise in the data. I have just increased the filter to 0.0055 and the flag cut-off to 0.017, with a horizon limit of 0.01.

On running a random model, the results are promising. The mean GP for both long and short flagged trades has doubled, and trading frequency has halved. I am happy enough with this for now to move on to the indicators.

As I have no idea about what works in FOREX I have opted to start with a top down analysis and will conduct a thorough data mining/dredging plan. I will then focus my attention on any promising indicators to see if a couple of customized indicators can be developed (bottom up) to be used in the ATS.

To complete the data mining I have used an MQL5 script, which I picked up on trial, that allows downloading of the standard MQL historical indicator values. I have 217 indicator and time-frame combinations to check against my outcomes. To complete this analysis I am using basic univariate IV (Information Value).

Another midnight modelling session!
I have begun high-level analysis of 31 different indicators across 7 different time windows: 217 primary combinations, growing to 317 once supplementary variables are counted. Some indicators have multiple variables associated with them (for example ADX_Wilder, which carries ADX_Wilder itself plus +DI and -DI). As this is very broad brush-stroke analysis I will be looking at these supplementary vars also.

The analysis consists of constructing a histogram for the variable, then using those buckets as initial buckets for WOE calcs against my outcomes. I have written a quick piece of code which should allow me to complete the 317 vars in a few days.
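The histogram-then-WOE step can be sketched in Python rather than SAS; equal-width buckets and the 0.5 smoothing constant are my simplifications, not necessarily what my actual code does:

```python
import numpy as np

def woe_iv(values, outcome, n_buckets=10):
    """Weight of Evidence per bucket and total Information Value.

    Buckets come straight from the variable's histogram (equal-width
    here); `outcome` is a 0/1 flag array. A 0.5 smoothing constant
    keeps empty buckets from blowing up the log.
    """
    edges = np.histogram_bin_edges(values, bins=n_buckets)
    bucket = np.clip(np.digitize(values, edges[1:-1]), 0, n_buckets - 1)
    n_good = outcome.sum()
    n_bad = len(outcome) - n_good
    woes, iv = [], 0.0
    for b in range(n_buckets):
        mask = bucket == b
        pct_good = (outcome[mask].sum() + 0.5) / n_good
        pct_bad = ((mask.sum() - outcome[mask].sum()) + 0.5) / n_bad
        w = np.log(pct_good / pct_bad)
        woes.append(w)
        iv += (pct_good - pct_bad) * w
    return np.array(woes), iv

# demo: an informative variable versus a pure-noise outcome
rng = np.random.default_rng(0)
x = rng.normal(size=5000)
informative = (x + rng.normal(scale=0.5, size=5000) > 0).astype(int)
noise = rng.integers(0, 2, size=5000)
_, iv_informative = woe_iv(x, informative)
_, iv_noise = woe_iv(x, noise)
```

An indicator that actually separates the outcomes gets a large IV; a useless one sits near zero, which is what the cull rule below exploits.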

Once I have compiled the rough WOE calcs I will be able to narrow down the var list by discarding the indicators with low-value IVs. I will then look for commonalities between the different vars and see if I can get some initial thoughts on constructing some potent custom vars.

So far I have knocked off 14 primary vars... only 303 to go!

The code works great and I have managed to initially test 105 different vars so far... about a third of the way through. At this rate I may be able to finish the initial testing this evening.

On a different note...
The primary model concept is that at any time the market is operating in one of two states:
1 - trending up, or 2 - trending down.
Based on technical analysis, a probability is assigned to each state.
As I was perusing the web today I came across grid trading systems, which from my brief introduction are used for trading sideways markets... I have not accounted for this 3rd state. As I continue my analysis I will toy with some ideas about how to incorporate a sideways-trend strategy into my model framework.

Developing this beast is going to be interesting...

The following user says Thank You to ajespy for this post:

After a monster effort I have completed a first-pass analysis of 287 different vars. I will now begin culling the vars with the lowest IVs.
If a var has an outright IV of less than 0.01 for either long or short outcomes I will move it to a discard list.
74 vars have been removed by this rule.
I will then focus on what is left and begin to selectively cull the weaker of the remainder. As we are initially focusing on polarised states, I will consider the correlation of the WOE in this cull: weak-IV vars whose WOE is nicely correlated will be retained, whereas vars with stronger IV values but noisier WOE will be discarded. I will, however, try to retain the more promising vars in each group so as to ascertain any commonalities.

The first-pass cull has resulted in a subset of 185 different indicators. The subset is a mixed bag: some vars have weak Information Value but are nicely distributed with respect to the underlying histogram; others have relatively strong IV but a noisy distribution of outcomes. The latter will require a lot of work to develop into anything useful, and that work will come at a significant loss of IV.

The next step in the indicator analysis is to gain some insight into how the variables relate to each other. Each prospective var will capture some level of unique information. To gauge this I will construct a summary matrix of the elements used in the construction of the various indicators. In addition to this high-level 'map' I will construct various correlation matrices.

This analysis will assist me in maximising the amount of information from the various indicators.


So I coded up a SQL join of 185 different variables from 180 different data tables. The join promptly exhausted my computing power.

I split the join down to just the first 10 vars and ran the correlation on that. Doing it this way would take too long, though, and I wouldn't easily get my high-level picture of how the vars fit together.

I wasn't getting anywhere very fast, so I decided to go back and be more aggressive in my cull. I decided that I would only retain vars if they were either the best of similarly distributed indicators or were distributed completely differently within the indicator type. I also would not keep more than 3 of the same indicator (still too soft, I know).

Second cull has resulted in 93 variables being retained. This is a little more manageable. Back to running correlations.

I have just finalised 4 different correlation matrices: Pearson, Spearman, Kendall and Hoeffding. I will use these matrices to continue the cull with the aim of cutting at least 3/4 of the current variables. I will then be able to focus on the remaining variables.

To do this I will group the vars with a correlation of over 0.5, in the order Hoeffding, Kendall, Spearman, then lastly Pearson. I will then refine the WOE buckets for each var and remove the vars with the weakest resultant IV among those with similar information content.
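A Python sketch of that grouping rule (pandas covers Pearson, Spearman and Kendall natively; Hoeffding's D has no pandas/scipy equivalent and would need its own implementation, so it is left out of this illustration):

```python
import numpy as np
import pandas as pd

def correlation_groups(df, threshold=0.5, method="spearman"):
    """Greedily group columns whose pairwise |correlation| beats a cut-off.

    Within each group only the member with the strongest IV would be
    kept; the greedy first-come grouping is a simplification.
    """
    corr = df.corr(method=method).abs()
    groups, assigned = [], set()
    for col in corr.columns:
        if col in assigned:
            continue
        # every still-unassigned column clearing the threshold joins col's group
        members = [c for c in corr.columns
                   if c not in assigned and corr.loc[col, c] > threshold]
        assigned.update(members)
        groups.append(members)
    return groups

# demo: two near-duplicate vars plus one independent var
rng = np.random.default_rng(1)
a = rng.normal(size=500)
demo_df = pd.DataFrame({"a": a,
                        "b": a + rng.normal(scale=0.1, size=500),
                        "c": rng.normal(size=500)})
demo_groups = correlation_groups(demo_df)
```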

The Hoeffding analysis alone resulted in removal of 22 more vars.

OK, so I have finalised filtering variables based on both the Kendall and Hoeffding measures of correlation, as well as reviewing different variable components and Information Value. I managed to chop through half my variables, with the current retained count sitting at 47!

So I began at ~300 and have cut that down to 47, which at this stage is not too bad. I still need to keep working the variables: gleaning as much info from the data as possible is a sure way to increase the likelihood of good trades, which in turn leads to increased expected returns.

My next task is to combine the 47 with the outcome flags into a single datamart. I will then calculate the WOE for the 47 based on the histograms and run that through a step-wise log regression, just to get a feel for what I am dealing with. I still need to cull, because I know that some vars need to be dealt with as a quotient of another var (like MAs, where one crosses above or below another). Forming quotients across all 47 would mean around 47 × 46 / 2 ≈ 1,000 candidate pairs (not the 47! I first guessed on the fly), still far too many to crunch with my computing power. So what I will do is cull the list down a little more with the step-wise, then use a Box-Cox to transform the remainder, get the quotients, and check all of them for some more info.
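The Box-Cox-then-quotients step could look like the following Python sketch; scipy's boxcox requires strictly positive inputs, and the +1 shift on each side of the ratio is my own guard, not part of the original plan:

```python
import numpy as np
import pandas as pd
from itertools import combinations
from scipy import stats

def boxcox_quotients(df):
    """Box-Cox each strictly positive column, then form pairwise quotients.

    k vars give k*(k-1)/2 ratio candidates, so 47 vars would yield
    C(47, 2) = 1081 quotients. The shift keeps both numerator and
    denominator strictly positive after the transform.
    """
    transformed = {c: stats.boxcox(df[c].to_numpy())[0] for c in df.columns}
    quotients = {}
    for a, b in combinations(df.columns, 2):
        num = transformed[a] - transformed[a].min() + 1.0
        den = transformed[b] - transformed[b].min() + 1.0
        quotients[f"{a}/{b}"] = num / den
    return quotients

# demo: four positive (lognormal) indicator series
rng = np.random.default_rng(2)
demo_df = pd.DataFrame({f"v{i}": rng.lognormal(size=300) for i in range(4)})
demo_q = boxcox_quotients(demo_df)
```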

In addition to these vars I think I may model the time series itself, and construct some vars out of the resultant models.

On another note, I further formulated my trading model framework today. I am hoping to batch-optimise 4 log-reg models to capture the correlation. I have no idea how to achieve this in SAS or any other stats package. I think I may estimate the singles, then optimise the combination with an NN. I will then overlay my risk management and money management and optimise that with an NN also.

It will be interesting to see the results.


Since writing my last post I have done further research into big data issues. Turns out variable reduction is a big issue and one that is quite easily solved. I was on the right track with reducing my variables based on the correlation matrix but my methods were crude. It turns out that a method known as variable clustering provides a much more powerful means of getting the job done.

I have since clustered my variables based on the 4 different correlation matrices and settled on 15 variables selected under the Hoeffding measure. In a few hours I was able to complete what I had been struggling to do for the last two weeks!
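For the record, variable clustering can be approximated outside SAS too. This Python stand-in is not PROC VARCLUS (which splits clusters with PCA); it runs hierarchical clustering on the distance 1 - |corr|, which gives comparable groupings for selection purposes:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_variables(corr, n_clusters):
    """Cluster variables from a square correlation matrix.

    Average-linkage hierarchical clustering on 1 - |corr|; returns one
    cluster id per variable. One representative per cluster (e.g. the
    highest-IV member) would then be retained.
    """
    dist = 1.0 - np.abs(np.asarray(corr, dtype=float))
    np.fill_diagonal(dist, 0.0)          # scrub float noise on the diagonal
    tree = linkage(squareform(dist, checks=False), method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust")

# demo: four vars built from two independent drivers
rng = np.random.default_rng(3)
d1, d2 = rng.normal(size=400), rng.normal(size=400)
X = np.column_stack([d1, d1 + rng.normal(scale=0.1, size=400),
                     d2, d2 + rng.normal(scale=0.1, size=400)])
labels = cluster_variables(np.corrcoef(X, rowvar=False), n_clusters=2)
```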

Now that I have the vars selected I will be able to move forward onto quotients with price, investigating custom vars through modeling the series itself, calculating and smoothing WOEs and preparing a datamart.

Wow, I've been busy and have thus neglected my blog. After shortlisting the 15 vars I joined them to my outcome data and ran a prelim model estimation using the raw vars. Instead of a combination of log-reg models I opted for a single NN-type model. The results were very promising. Following this I began constructing the full datamart. I opted to follow 4 different forex pairs and downloaded the data and indicators for them. I have modified my outcome code and have just completed joining it all together. Whilst reviewing outcomes I decided to check a less sophisticated outcome based on a fixed duration. If I had a strategy that was 100% accurate I would trade more often and make more money with this alternative outcome, but on a per-trade basis it performs worse than outcome 1. When I look to test my model deployment I may revisit this second outcome. I now have to put together the full datamart with outcome combinations.

So at this point I have a basic model which I could begin to deploy. The model consists of 5 vars:
- Envelope on 2hr chart
- OBV on 2hr chart
- ADXW +signal on 2hr chart
- RVI on 2hr chart
- last closing price on 5min chart

I have had limited success optimising a SAS version of the model, so I'm not convinced of its robustness or profitability. I will, however, look to optimise a simple NN using these vars for EURUSD if I can't derive a more robust system. I guess this is my fall-back model.

Currently I am continuing to refine my trend analysis, which I hope will allow me to select a better set of 'classes'. Once I have a few potentially profitable classes (especially ones historically profitable for the Oct-Jan period) I will begin some further pattern analysis with the aim of creating custom indicators.

For me, modelling is an iterative process. Every point of analysis brings ever more clarity to the ideas.

I've been doing a lot of research on different classification models and I keep coming back to log-reg.
During my outcome analysis work I have settled on a very promising concept. I will apply a filter to the data; I believe it is similar to what traders call a zig-zag and what engineers call a bandpass filter. By varying the bandwidth of the filter I can remove varying degrees of noise from the price signal. I then create the empirical distribution of the duration of the resulting trends and settle on the bandwidth that gives me reasonable trend lengths. I will settle on something like 2-4hrs.
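A minimal Python sketch of such a zig-zag filter (the band value and demo series are illustrative, not my calibrated settings):

```python
import numpy as np

def zigzag_trends(close, band=0.005):
    """Label every bar +1 (up-leg) or -1 (down-leg) under a zig-zag filter.

    A reversal is confirmed once price retraces more than `band` from
    the running extreme, so widening `band` removes smaller swings from
    the signal, much like narrowing a bandpass filter.
    """
    n = len(close)
    trend = np.zeros(n, dtype=int)
    direction = 1 if close[1] > close[0] else -1
    extreme, extreme_i, last_pivot = close[0], 0, 0
    for i in range(1, n):
        if direction == 1:
            if close[i] > extreme:
                extreme, extreme_i = close[i], i
            elif extreme - close[i] > band:       # up-leg confirmed
                trend[last_pivot:extreme_i + 1] = 1
                last_pivot = extreme_i
                direction, extreme, extreme_i = -1, close[i], i
        else:
            if close[i] < extreme:
                extreme, extreme_i = close[i], i
            elif close[i] - extreme > band:       # down-leg confirmed
                trend[last_pivot:extreme_i + 1] = -1
                last_pivot = extreme_i
                direction, extreme, extreme_i = 1, close[i], i
    trend[last_pivot:] = direction                # the still-open final leg
    return trend

# demo: up 200 pips, down 200, up 200
demo = np.concatenate([np.linspace(1.30, 1.32, 101),
                       np.linspace(1.32, 1.30, 101)[1:],
                       np.linspace(1.30, 1.32, 101)[1:]])
demo_trend = zigzag_trends(demo, band=0.005)
```

The empirical distribution of leg lengths (run lengths of constant sign in the output) is then what guides the choice of band.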

I can then class each data observation as belonging to either a long or short trend (I will look to extend this to include sideways at a later date).

Once I have the trends marked out I can take a reasonable sized window across where the trends change. I will use 30-60mins on each side of the inflection points. I will class the observations within the windows as LS, SL or N for Long->Short, Short->Long and Neutral respectively.
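The windowing step above can be sketched as follows; the 30-bar window and the function name are my illustrative choices:

```python
import numpy as np

def transition_classes(trend, window=30):
    """Class bars near trend inflections as 'LS' or 'SL', everything else 'N'.

    `trend` is the per-bar +1/-1 label from the zig-zag step; every bar
    within `window` observations of a flip takes the transition class
    (Long->Short or Short->Long).
    """
    labels = np.full(len(trend), "N", dtype="U2")
    flips = np.flatnonzero(np.diff(trend) != 0) + 1   # first bar of each new leg
    for f in flips:
        cls = "LS" if trend[f - 1] == 1 else "SL"
        labels[max(0, f - window):min(len(trend), f + window)] = cls
    return labels

# demo: a long leg, a short leg, a long leg of 100 bars each
demo_trend = np.array([1] * 100 + [-1] * 100 + [1] * 100)
demo_labels = transition_classes(demo_trend, window=30)
```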

I will then model these classifications with a log-reg!

Hopefully I will be more confident with this resulting model than the last one...

As I have refined the outcomes I will be looking to predict, I need to go back to variable selection. I am starting with a basic list of 200 different indicators spanning 1, 5, 10, 15, 30, 60 & 120 minute time horizons. I have then combined these variables across the alternate time horizons within the same variable classes to create 2000 possible indicators. An example is:

5min MA / 1min MA = a new var indicating whether the faster MA is diverging from or converging with the slower one.

I will be ranking the value of these vars by multinomial concordance and maximum log-likelihood. This will allow me to focus on a subset of vars to transform from continuous to discrete. I should have the shortlist by the end of the weekend.
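The cross-horizon combination can be sketched in Python; the '<class>_<horizon>' column-naming convention here is my assumption for the illustration:

```python
import pandas as pd
from itertools import combinations

def cross_horizon_ratios(df):
    """Combine the same indicator class across horizons into ratio vars.

    Columns are assumed to be named '<class>_<horizon>' (e.g. 'MA_1m',
    'MA_5m') and listed fast-to-slow. Each pair of horizons within a
    class yields one ratio, e.g. MA_5m / MA_1m, showing whether the
    slower reading is diverging from or converging with the faster.
    """
    by_class = {}
    for col in df.columns:
        by_class.setdefault(col.rsplit("_", 1)[0], []).append(col)
    ratios = pd.DataFrame(index=df.index)
    for cols in by_class.values():
        for fast, slow in combinations(cols, 2):
            ratios[f"{slow}/{fast}"] = df[slow] / df[fast]
    return ratios

# demo: two MA horizons plus a lone RSI column (no pair, so no ratio)
demo_df = pd.DataFrame({"MA_1m": [1.0, 2.0],
                        "MA_5m": [2.0, 2.0],
                        "RSI_1m": [50.0, 60.0]})
demo_ratios = cross_horizon_ratios(demo_df)
```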

I can follow along with your approach so far and I think it looks interesting. One thing you may consider as an alternate or additional step is to use K means clustering which is an unsupervised learning technique. Unsupervised learning can be a cheaper alternative to a supervised logistic regression and it could uncover important data relationships automatically. Plus you could take the results of K means, classify the results, and then feed it into your logistic regression and perhaps have a better model or get there quicker than having to classify your data by hand.
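Strato's two-step suggestion, sketched with scikit-learn on made-up data (the matrix X, the outcome y, and every parameter here are stand-ins, not anyone's actual datamart):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# hypothetical indicator matrix X and outcome flags y
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.3 * rng.normal(size=1000) > 0).astype(int)

# 1) unsupervised step: K-means assigns each observation to a "regime"
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)

# 2) supervised step: one-hot the cluster ids and feed them into the
#    logistic regression alongside the raw vars
cluster_dummies = np.eye(8)[km.labels_]
features = np.hstack([X, cluster_dummies])
model = LogisticRegression(max_iter=1000).fit(features, y)
accuracy = model.score(features, y)   # in-sample fit, for illustration only
```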

The following user says Thank You to Strato for this post:

Hey Strato,
That is an interesting idea. Let's see if I have it right.
1 - classify each possible event using a NN, something like a k nearest neighbour.
2 - feed that classification into the log-reg?

My work using ordinal log-reg has been very promising so far. I wanted to use a multinomial log-reg straight off the mark, but I have an issue with calculating the gini coeff for it, so I settled on the much simpler ordinal. I am currently reducing degrees of freedom by collapsing var buckets, checking that coeffs are ordered correctly, and outright removing vars to maximise gini. This is a very time-consuming process.
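For the binary case the gini coeff reduces to 2*AUC - 1, which is quick to compute from ranks; this Python sketch ignores tied scores (a simplification), and averaging it over one-vs-rest splits is one possible route around the multinomial problem mentioned above:

```python
import numpy as np

def gini(outcome, score):
    """Gini coefficient of a score against a binary outcome.

    Gini = 2*AUC - 1, with the AUC computed as the Mann-Whitney rank
    statistic. Tied scores are not rank-averaged here.
    """
    order = np.argsort(score)
    ranks = np.empty(len(score))
    ranks[order] = np.arange(1, len(score) + 1)
    n_pos = outcome.sum()
    n_neg = len(outcome) - n_pos
    auc = (ranks[outcome == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    return 2.0 * auc - 1.0

# demo: a perfectly ranking score and an uninformative one
y = np.array([0, 0, 1, 1])
gini_perfect = gini(y, np.array([0.1, 0.2, 0.8, 0.9]))
gini_flat = gini(y, np.array([0.9, 0.1, 0.8, 0.2]))
```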
