TensorFlow + NT8 Strategy - Battle of the Bots style


Discussion in NinjaTrader


  #1 (permalink)
 
Jasonnator
Denver, Colorado United States
 
Experience: Intermediate
Platform: NT8 + Custom
Broker: NT Brokerage, Kinetick, IQFeed, Interactive Brokers
Trading: ES
Posts: 159 since Dec 2014
Thanks Given: 40
Thanks Received: 166

Inspired by the Battle of the Bots competition, I am creating a Battle of the Bots-style strategy and incorporating TensorFlow's machine learning. My goal is to take a strategy that trades a few times or more per week and has mediocre performance, and see if I can improve its overall metrics. I intend to change the strategy very little, if at all, so I can evaluate TensorFlow's contribution rather than cherry-picking strategy parameters or features.


I drew inspiration from the Battle of the Bots competition, @NJAMC, @Fat Tails, @kevinkdog, and @quantismo's numerous contributions over the years, as well as @rleplae's Encog and TensorFlow experiments.


This is meant to be an academic exercise showing what may be possible and a general framework for getting TensorFlow incorporated into a strategy. It is not meant to be an all-encompassing, turnkey solution where you can just plug in your strategy and start trading, sorry. I'm pretty certain the strategy will not be profitable in the beginning and will likely only "suck less" once I throw TensorFlow at it.


I have no idea how this will turn out and look forward to the community's contributions in this experiment. Wish me luck!


Goals:
- Don't tinker with the actual strategy
- Decrease MAE
- Increase or hold steady MFE
- Increase Sharpe ratio
- Hopefully instrument agnostic
- Learn something!!!
- Maybe teach something


Tools I used:
- NT8 + Visual Studio 3rd-party DLL dev approach, see my thread here (the video is in depth but fast paced, have coffee ready)
- NinjaTrader 8 + Grpc.Tools
- TensorFlow 2.x, Keras
- Jupyter Notebook
- VMware virtual machine running Ubuntu 20.04 LTS (TF Serving lives here)


Assumptions:
- Ability to install TensorFlow, TensorFlow Serving, NinjaTrader 8.
- Ability to code NT8 indicator and strategy to at least a beginner level (no wizard).
- Simple bar type (1m/5m/15m etc). This helps when generating training samples.
- Ability to create, in python 3, a TensorFlow model.
- Slippage and commissions will not be included, solely to keep comparisons between instruments as apples-to-apples as possible
- Limit entry
- Limit exit


Disclaimer:
I have done a ton of coding on a fully custom framework to get everything working between NT8 and TensorFlow Serving via gRPC. If I had to go back and do it over again, I'd probably just use the REST interface instead. While REST is not as performant, it would have been much faster to get a proof-of-concept pipeline working versus gRPC. This gRPC framework is not something I can put on FIO since it contains and requires compiled DLLs. I will, however, include a proof-of-concept NT8 strategy C# file as well as an example which uses REST and TensorFlow Serving's included half_plus_two toy model (that's the "hello world" for TF). For those determined to use this approach, these two examples should significantly reduce your time to get up and running, but be warned: it's not easy, and the web documentation is mediocre at best.

The attachment _TensorFlowStrategyTest.cs shows how I did initial testing using REST.
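For orientation, here is a minimal sketch of that kind of REST call, shown in Python with the requests library against TF Serving's bundled half_plus_two toy model (the localhost:8501 address is TF Serving's default REST port and is an assumption about the setup):

Code
import json
import requests

# TF Serving's REST API listens on port 8501 by default (assumed local default setup)
url = "http://localhost:8501/v1/models/half_plus_two:predict"

# half_plus_two simply returns x / 2 + 2 for each input value
payload = {"instances": [1.0, 2.0, 5.0]}

response = requests.post(url, data=json.dumps(payload))
print(response.json())  # expected: {"predictions": [2.5, 3.0, 4.5]}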


The NT8 Strategy:
- Super basic, moving average crossover trading pullbacks based on a fib level
- If/Then based logic
- Uses simple bar types
- Backtest results are typical of a 1st-5th iteration of an automated strategy (i.e., it basically sucks)
- Forward testing on out of sample data to corroborate back test results

I have attached the initial version of the strategy. However, any changes will be on my GitLab project located here: GitLab repository. This is a much easier way to manage future versions of the strategy. All code I use in that JDT repository is and always will be 100% freely available.

Snapshot of trade entry:



Initial performance metrics:

As you can see, the strategy is terrible. However, it does trade quite a bit, which gives me a lot of data to train and test on.


The Basics:
- Store NT8 indicator values in a Queue with a specified window size (maybe 50, 100, will test)
- When a trade signal is triggered, send all of the metrics (the queue(s)) to the TF model
- Receive a prediction back on whether this is a good trade to take (like @rleplae's trade filter idea)
- Trade (or don't) accordingly. May flip disagreements...again, have to test


TF Model Architecture:
This is one area where I am really hoping the FIO community can make some recommendations. My C# skills are well beyond my Python abilities, so this, for me, will present the greatest challenge. Hopefully some Python ninjas will chime in. My initial thought is either some sort of CNN or a multi-head attention architecture. I'll probably start with an extremely simple sequential model using 1 or 2 dense layers, then try to move to the more advanced architectures later.
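To make that starting point concrete, here is a minimal sketch of such a simple sequential model, assuming the 15-value indicator window (3 indicators x 5 bars) described later in this thread and a two-class [loser, winner] output; the layer sizes and optimizer settings are placeholders, not recommendations:

Code
import tensorflow as tf
from tensorflow.keras import layers, models

n_features = 15  # assumed: 3 indicators x 5 past values per sample

model = models.Sequential([
    layers.Input(shape=(n_features,)),
    layers.Dense(16, activation="relu"),    # 1-2 small dense layers to start
    layers.Dense(2, activation="softmax"),  # [P(loser), P(winner)]
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="categorical_crossentropy",
              metrics=["accuracy", tf.keras.metrics.AUC(name="auc")])
model.summary()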

I tried attaching the Python Jupyter notebook, but it appears FIO considers .ipynb an invalid file type. I have put the notebook in my repository located here: GitLab repository. My findings will be in a reply post since they will be in depth.


TF Model Training:
It is very important not to give the TF model data which cannot be known at the time the trade is to be taken. This is information leakage and will result in predictions which are essentially useless. Well, not essentially useless; completely useless. This model will use the supervised learning approach, which means it will need labeled data.


Data Processing:
- Make sure data is clean (not trading major news, holidays, etc.)
- Ensure there are numerous samples of chop, uptrends, downtrends
- Ensure there are a relatively equal number of longs and shorts (prevent imbalanced-dataset issues)
- Most high-level white papers recommend normalizing the data
- Test different approaches to normalization (z-score, % change, etc.); a quick sketch follows this list
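As mentioned in the last item, here is a quick sketch of two of those normalization options (z-score and percent change) on a toy series; the numbers are illustrative only:

Code
import numpy as np

x = np.array([2086.58, 2086.57, 2086.56, 2086.52, 2086.45])  # toy price-like window

# z-score: subtract the mean, divide by the standard deviation
z = (x - x.mean()) / x.std()

# percent change: relative move from one value to the next (one fewer element)
pct = np.diff(x) / x[:-1]

print(z)
print(pct)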


Training Data Labels:
- This is the one place where we sort of give the TF model future information but only for training.
- This information is whether the trade was a winner or not (0 - loser, 1 - winner).
- My plan is to do this in the OnExecutionUpdate something like this:

 
Code
protected override void OnExecutionUpdate(Execution execution, string executionId, double price, int quantity, MarketPosition marketPosition, string orderId, DateTime time)
{
    base.OnExecutionUpdate(execution, executionId, price, quantity, marketPosition, orderId, time);

    // generate training samples

    if (execution.IsEntryStrategy)
    {
            // store all known metrics
            // TODO: need a way to identify this specific trade once it closes...(dictionary?)
    }

    if (execution.IsExitStrategy)
    {
            // find the execution
            // store the trade result as winner/loser
            // var trade = SystemPerformance.AllTrades.GetTrades(base.Instrument.FullName, guid.ToString(), 1);
            // write data out as training sample (probably binary for speed)
    }
}
I hope it goes without saying, but I'll say it anyway. Please, please do not take anything created in this thread and trade it live without doing your own due diligence and testing it like crazy.



--------------------------------------------------------------------------------
Thread Table of Contents
--------------------------------------------------------------------------------
I will edit this post with links to significant sections and topics. Hopefully this is a popular topic and the TOC becomes a living, changing part of this thread.


Free code: GitLab repository
Attached Files:
TensorFlowFib50Strat.cs
_TensorFlowStrategyTest.cs

  #2 (permalink)
 
Jasonnator

The Python side of this project has been by far the most difficult.

I have tried the following model architectures:
  1. Simple single fully connected layer, no class weighting
  2. Simple single fully connected layer, with class weighting
  3. Simple with multiple fully connected layers + class weighting
  4. Convolutional Neural Network (CNN) based model + class weighting
  5. Encoder Decoder model + class weighting
  6. Recurrent Neural Network (RNN) with stacked LSTMs + class weighting

The Data:
The dataset is made of 3 indicators:
  1. EMA14
  2. EMA50
  3. ATR14

Each indicator's previous 5 values are stored for a total of 15 inputs per sample. Here is a basic representation of the dataset's structure:
 
Code
ema14_0, ema14_1, ema14_2, ema14_3, ema14_4, ema50_0, ema50_1, ema50_2, ema50_3, ema50_4, atr14_0, atr14_1, atr14_2, atr14_3, atr14_4, labels
I used z-score normalization, which did improve the prediction metrics. scikit-learn calls it StandardScaler, but they're the same thing. For the labels, I used LabelEncoder to encode a loser, 0, as [1, 0] and a winner, 1, as [0, 1].
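A minimal sketch of that preprocessing, assuming X is an (n_samples, 15) feature matrix and y holds integer labels; note that scikit-learn's LabelEncoder itself produces integer class IDs, and the [1, 0] / [0, 1] form is the one-hot step, shown here with keras.utils.to_categorical:

Code
import numpy as np
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.utils import to_categorical

X = np.random.rand(100, 15)            # placeholder features, shape (n_samples, 15)
y = np.random.randint(0, 2, size=100)  # placeholder labels: 0 = loser, 1 = winner

scaler = StandardScaler()              # z-score normalization
X_scaled = scaler.fit_transform(X)

y_onehot = to_categorical(y, num_classes=2)  # 0 -> [1, 0], 1 -> [0, 1]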


The Model Scoring:
In addition to accuracy, I'm also using ROC AUC, which gives a much better indication of a model's true skill.
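For reference, a minimal sketch of computing ROC AUC with scikit-learn, assuming a two-column [P(loser), P(winner)] prediction array like the one shown later in this post; the arrays here are placeholders:

Code
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 1, 1, 0, 1])                        # placeholder labels
y_prob = np.array([[0.7, 0.3], [0.2, 0.8], [0.4, 0.6],
                   [0.6, 0.4], [0.1, 0.9]])               # placeholder model.predict output

# roc_auc_score wants the probability of the positive (winner) class
auc = roc_auc_score(y_true, y_prob[:, 1])
print(f"ROC AUC: {auc:.3f}")  # 0.5 = no skill, 1.0 = perfect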


Results:
Long story short, nothing worked!....yet ("Much to learn, you still have...")

The only noticeable difference came from normalizing the input data versus leaving it raw; normalizing made a considerable improvement. Despite all of the different model architectures, though, nothing was able to predict better than random.

Initially, this was not easy to diagnose. One of the models had 74% accuracy. I felt like my heart literally skipped a beat when I first saw it. However, when I tested it, the trading results were virtually identical. The more I investigated, the more I felt like I was going down the rabbit hole.

That exploration led me to learn about the class_weight parameter of the model.fit method. It is very useful when you have an imbalanced dataset, which I did: mine has roughly a 2.5:1 ratio of losers to winners. Once I started using the class_weight parameter, I got much more representative performance metrics. Before that, I was getting fantastic metrics, but when I actually ran a backtest the results were terrible, usually worse than baseline.
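A minimal sketch of that idea, assuming the roughly 2.5:1 loser-to-winner imbalance; the weights can be set by hand or, as here, with scikit-learn's compute_class_weight helper:

Code
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# illustrative labels: roughly 2.5 losers (0) for every winner (1)
y = np.array([0] * 250 + [1] * 100)

weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
class_weight = {0: weights[0], 1: weights[1]}  # e.g. {0: 0.7, 1: 1.75}

# model.fit(X_train, y_train_onehot, epochs=10, class_weight=class_weight, ...)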


Gotchas:
I ran into so many different problems that I don't even know where to start, so I'll hit the biggest hurdles I think most people will run into if they try to do something similar.

Model always predicts the same class (either 1 or 0, every! single! time!)
This one was probably the most frustrating because no matter what I did with all these super fancy model architectures, nothing changed and nothing worked. Ultimately, I learned that there are a few tell-tale signs that I was doing some basic stuff wrong:
  1. Too high of a learning rate
  2. Too complex of a model
  3. Imbalanced dataset
  4. Model shape parameters
  5. Single test prediction + reshape input

1) Some optimizers like different learning rates. Adam, for example, likes values in the range of 1e-3 to 1e-7. I found that if you're still seeing the same behavior when you're all the way down to 1e-7, you have a different issue.
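For illustration, a minimal sketch of dialing the learning rate down in Keras (the value shown is a placeholder, not a recommendation):

Code
import tensorflow as tf

# if every prediction collapses to one class, a smaller learning rate is one thing to try
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

# model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=["accuracy"])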

2) My initial thought was that if 10 neurons in a Dense layer was good, 500 should be a printing press. Welp, not so much. Not only does this approach absolutely crush your system resources (especially when you start using LSTMs and CNNs), it just flat out doesn't work. Basically, the model becomes so complex that it can simply memorize the entire dataset. My training dataset had somewhere between 60k-70k samples, and some of the beefed-up models just memorized everything.

3) This one completely punched me in the face for weeks. I included a link above to a TensorFlow tutorial which I think you absolutely must understand if you're going to try to apply deep learning to trading.

4) This one is an extremely close 2nd in the frustration it caused me. LSTM and CNN layers require a 3-dimensional input tensor. Even if your input data is 2D, you have to reshape it. I know, it seems to make little sense, but it is what it is. I chose to build the reshape into my model architecture so I can just pass a big 1D array from NT8 to the model sitting on my TensorFlow Serving server.
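A minimal sketch of baking the reshape into the model, assuming the flat vector from NT8 is ordered time-major (all of bar 0's indicator values, then bar 1's, and so on) so a plain reshape to (timesteps, features) lines up for the LSTM; the layer sizes are placeholders:

Code
import tensorflow as tf
from tensorflow.keras import layers, models

timesteps, n_indicators = 5, 3  # 5 bars x 3 indicators = 15 flat inputs (assumed layout)

model = models.Sequential([
    layers.Input(shape=(timesteps * n_indicators,)),
    layers.Reshape((timesteps, n_indicators)),  # (15,) -> (5, 3) for the LSTM
    layers.LSTM(32),
    layers.Dense(2, activation="softmax"),
])
model.summary()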

5) Along the lines of #4, always, always, always do a test prediction on your model before you start training. This will help sniff out any bonehead mistakes before you kick off a massive training job. I didn't understand at first that I needed to add an extra (batch) dimension for a single prediction, but once I understood how TensorFlow works, I realized I have to do it every single time.
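A minimal sketch of that single test prediction: Keras' model.predict expects a batch dimension, so one (15,) sample has to become a (1, 15) batch; the tiny untrained model here is only a stand-in:

Code
import numpy as np
from tensorflow.keras import layers, models

# stand-in model taking the 15-value window (untrained, for shape checking only)
model = models.Sequential([layers.Input(shape=(15,)),
                           layers.Dense(2, activation="softmax")])

sample = np.random.rand(15).astype("float32")  # one feature window (placeholder values)

batch = np.expand_dims(sample, axis=0)         # add the batch dimension: shape (1, 15)
print(model.predict(batch))                    # e.g. [[0.25  0.75]]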

Being able to actually use the prediction
It took me a while to figure out how to actually use the predictions I received. Based on how I created my models, they return an array of probabilities. Here's an example:

[0.25 0.75]
This means the model is giving a 25% probability (confidence) that the trade will be a loser and a 75% probability that it will be a winner. That's great, but how do you actually use that in a strategy's trading logic? Once I got this response from TensorFlow Serving, I had to process it on the C# side.

In python:
On the Python side, numpy has a function called argmax which returns the index of the highest value. For the array above, argmax returns 1 because the array is in [loser, winner] order and the winner probability is larger. If the array were [0.8 0.2], argmax would return 0 because the 0th index holds the highest value.
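As a minimal sketch:

Code
import numpy as np

probs = np.array([0.25, 0.75])           # [P(loser), P(winner)] from the model
print(np.argmax(probs))                  # 1 -> predicted winner, take the trade

print(np.argmax(np.array([0.8, 0.2])))   # 0 -> predicted loser, skip the trade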

In C#:
I feel like C# is just plain easier than Python, but that's probably because I have 10+ years in C# and less than 1 in Python. Anyway, Array.IndexOf(theModelResponseArray, theModelResponseArray.Max()) takes care of it.



So for now, as far as I can tell, EMA14, EMA50, and ATR14 don't have any predictive information in them when looking at the previous 5 values of each.


I tried numerous other indicators and came to basically the same conclusion. I'm open to anyone with domain knowledge making suggestions, which I will test and report back on.

  #3 (permalink)
 
Jasonnator


Since I can't attach a .ipynb file and I forgot to post the actual location of the Jupyter notebook in the repository, it's located here:
https://gitlab.com/jasonnatordaytrader/jdt.nt8/-/blob/master/JDT.NT8/Python/Classification_Testing.ipynb

  #4 (permalink)
 
NJAMC
Atkinson, NH USA
Market Wizard
 
Experience: Intermediate
Platform: NinjaTrader 8/TensorFlow
Broker: NinjaTrader Brokerage
Trading: Futures, CL, ES, ZB
Posts: 1,970 since Dec 2010
Thanks Given: 3,037
Thanks Received: 2,395


Jasonnator View Post
Since I can't attach a .ipynb file and I forgot to post the actual location of the Jupyter notebook in the repository, it's located here:
https://gitlab.com/jasonnatordaytrader/jdt.nt8/-/blob/master/JDT.NT8/Python/Classification_Testing.ipynb

Epoch 1/10
1414/1414 - 4s - loss: 0.5784 - accuracy: 0.7462 - val_loss: 0.5604 - val_accuracy: 0.7525
Epoch 2/10
1414/1414 - 3s - loss: 0.5603 - accuracy: 0.7477 - val_loss: 0.5576 - val_accuracy: 0.7525
Epoch 3/10
1414/1414 - 3s - loss: 0.5590 - accuracy: 0.7478 - val_loss: 0.5565 - val_accuracy: 0.7526
Epoch 4/10
1414/1414 - 3s - loss: 0.5585 - accuracy: 0.7478 - val_loss: 0.5560 - val_accuracy: 0.7526
Epoch 5/10
1414/1414 - 3s - loss: 0.5582 - accuracy: 0.7477 - val_loss: 0.5557 - val_accuracy: 0.7526
Epoch 6/10
1414/1414 - 3s - loss: 0.5580 - accuracy: 0.7478 - val_loss: 0.5555 - val_accuracy: 0.7526
Epoch 7/10
1414/1414 - 3s - loss: 0.5579 - accuracy: 0.7477 - val_loss: 0.5553 - val_accuracy: 0.7526
Epoch 8/10
1414/1414 - 3s - loss: 0.5578 - accuracy: 0.7477 - val_loss: 0.5552 - val_accuracy: 0.7527
Epoch 9/10
1414/1414 - 3s - loss: 0.5577 - accuracy: 0.7476 - val_loss: 0.5551 - val_accuracy: 0.7527
Epoch 10/10
1414/1414 - 3s - loss: 0.5576 - accuracy: 0.7476 - val_loss: 0.5551 - val_accuracy: 0.7527


Jason,

It has been a little while since I have looked at this, but I suspect there is no learning occurring; maybe not enough iterations. I usually had to go hundreds... You can see the accuracy is relatively flat and the validation accuracy is the same. I would expect accuracy to slope up at a minimum; validation accuracy sometimes rises with it, and sometimes breaks and decreases as you start to overfit.

-Greg

  #5 (permalink)
 
Jasonnator

I thought the same and have run it for hours and hundreds of epochs with no difference. I also used the sonar dataset, and virtually all of the models learn very well and get up to 70-80% ROC AUC (0.5 for a classifier means it knows nothing, 1.0 means it's perfect), which tells me the model architectures are capable of classifying at some level. I've also tried several different window sizes, ranging from 5 (which is what this notebook uses) all the way up to 250. The data structure is sound, so I am banging my head against the wall.

  #6 (permalink)
 
NJAMC


Jasonnator View Post
I thought the same and have run it for hours and hundreds of epochs with no difference. I also used the sonar dataset, and virtually all of the models learn very well and get up to 70-80% ROC AUC (0.5 for a classifier means it knows nothing, 1.0 means it's perfect), which tells me the model architectures are capable of classifying at some level. I've also tried several different window sizes, ranging from 5 (which is what this notebook uses) all the way up to 250. The data structure is sound, so I am banging my head against the wall.

Hmmm... I don't know, that doesn't seem right. As for the data structure: EMA does not exhibit stationarity, and that is a big problem for most machine learning techniques. Generally, though, you would see the accuracy increase (or the loss trend down) and the validation get worse at some point. I don't have time to run this test and debug it, but I suspect something isn't right if you are running this for hours with no real change in those numbers. It implies the weights are generally not changing, so they are still pretty much at their random initialization values, and it is statistically unlikely that the random initialization happened to be right... I don't know how many weights are in the model, but it is highly unlikely that they were all randomly selected properly at the initialization phase.

  #7 (permalink)
 
Jasonnator

Thanks Greg. Your insight is always very much appreciated.

  #8 (permalink)
 
NJAMC


Jasonnator View Post
Thanks Greg. Your insight is always very much appreciated.


Code
Sample rows from the dataset (index, 5 EMA14 values, 5 EMA50 values, 5 ATR14 values, label):

0      ema14: 2086.585693, 2086.574463, 2086.564453, 2086.522461, 2086.452881
       ema50: 2086.350342, 2086.356201, 2086.361816, 2086.357422, 2086.343506
       atr14: 0.319288, 0.314339, 0.291886, 0.306751, 0.302698
       label: 1

1      ema14: 2087.169189, 2087.146729, 2087.093750, 2087.047852, 2086.974854
       ema50: 2086.751953, 2086.761719, 2086.761230, 2086.760742, 2086.750488
       atr14: 0.301015, 0.297371, 0.293987, 0.272988, 0.271346
       label: 0

2      ema14: 2086.032715, 2086.028320, 2086.024658, 2086.021240, 2086.051758
       ema50: 2086.240967, 2086.231445, 2086.222412, 2086.213623, 2086.215088
       atr14: 0.211080, 0.196003, 0.217717, 0.237880, 0.238746
       label: 0

...

33756  ema14: 4159.763184, 4159.794434, 4159.755371, 4159.788086, 4159.782715
       ema50: 4157.833008, 4157.917969, 4157.979980, 4158.059082, 4158.125488
       atr14: 1.182963, 1.169894, 1.157759, 1.128633, 1.119445
       label: 0

Just a quick look at the input data: with the EMA (and likely the ATR), you are standardizing the inputs. If you graph the post-processed data, you are likely to see ramps, and all the subtle information is lost, since the EMA starts in the 2000s and by sample 33756 it is in the 4000s. This means you have lost the local information because the series is non-stationary; you have essentially "zoomed out" on the dataset. I am not sure about the ATR, I don't remember how that indicator works at the moment. But try the derivative of the EMAs: the model may not be learning simply because there is no information left in your dataset other than noise, which means the weights will just randomly drift slightly up and down.
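To illustrate that suggestion, a minimal sketch of differencing one EMA window before any scaling, using the values from row 0 of the dataset above; whether to use raw differences or percent changes is left open:

Code
import numpy as np

# one sample's EMA14 window (row 0 of the dataset above)
ema14 = np.array([2086.585693, 2086.574463, 2086.564453, 2086.522461, 2086.452881])

d_ema14 = np.diff(ema14)                 # bar-to-bar change, removes the absolute price level
pct_ema14 = np.diff(ema14) / ema14[:-1]  # or percent change, comparable across instruments

print(d_ema14)
print(pct_ema14)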

  #9 (permalink)
 
NJAMC

Another reason it might not be learning is that the learning rate is too low. I don't see how the .fit method uses the optimizer to set a new rate, so you might want to look at that. If the learning rate is too low, you can end up stuck in tiny local minima, so the best thing to try here is to start cranking up the learning rate (maybe go big and then back it down). When it is too high, the accuracy will jump around; when it is right, it will slowly increase; when it is too low, it will stay flat like I see there...

You do still have a problem with the input dataset, but the model should still "learn" that dataset, even if it doesn't generalize...

  #10 (permalink)
 
Jasonnator



NJAMC View Post
Another reason it might not be learning is that the learning rate is too low. I don't see how the .fit method uses the optimizer to set a new rate, so you might want to look at that. If the learning rate is too low, you can end up stuck in tiny local minima, so the best thing to try here is to start cranking up the learning rate (maybe go big and then back it down). When it is too high, the accuracy will jump around; when it is right, it will slowly increase; when it is too low, it will stay flat like I see there...

You do still have a problem with the input dataset, but the model should still "learn" that dataset, even if it doesn't generalize...

I definitely tried using a "stepped" learning rate with the LearningRateScheduler callback. I actually tested this extensively because I know about getting stuck in local minima. Again, it worked fine on the sonar dataset but absolutely did not with the financial dataset.
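For reference, a stepped schedule of the kind described here can look something like the following sketch (the step interval and decay factor are placeholders):

Code
import tensorflow as tf

def stepped_lr(epoch, lr):
    # drop the learning rate by 10x every 50 epochs (placeholder schedule)
    if epoch > 0 and epoch % 50 == 0:
        return lr * 0.1
    return lr

lr_callback = tf.keras.callbacks.LearningRateScheduler(stepped_lr)

# model.fit(X_train, y_train, epochs=300, callbacks=[lr_callback], ...)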

Any pointers to what that "problem" may be would be very helpful.

I could be completely wrong, but the fact that the different model architectures work with a very similarly structured dataset (sonar) but not with the financial data leads me to believe the indicators just don't have any predictive information contained in them.





Last Updated on May 26, 2021

