Help wanted on statistics / testing approach for prediction
 Started: June 29th, 2014 (06:02 PM) by aquarian1

# Help wanted on statistics / testing approach for prediction

July 3rd, 2014, 07:33 PM
aquarian1
 Thank-you. No, I am not just looking for the set with the highest probability. "Goal: To establish a systematic way of investigating the possible condition sets to find the highest probability combinations for a rule set to predict my outcome, which I can then test by searching my database." If I follow your reply, yes, D is highest alone. I think that by "joint probabilities" in your reply you are referring to "ands". There will be some days when D does not occur; on those days I could have J^G, and knowing the odds of that giving M would be useful. Additionally, if D^G is lower than D alone and I have a day with D^G, I would like to know that M has become less likely. I thought that with multiple rules forming a set one could derive a better trading system.

You're most welcome.

I will hint that two important rules to bear in mind are that (1) correlation does not imply causation, and (2) description is not prescription. Historically, most daily deaths occur on the days I wear black underwear. That's nice to know and very descriptive, but it is probably not a prescriptive relationship.

If you think of it in this way, you don't need formal mathematics to gain an intuition for many statistical problems that you are encountering.

July 4th, 2014, 12:15 AM
artemiso
 Thanks. Actually, my solution is in the general form and holds even for dependent variables. In the special case that {Xi, i is a positive integer} is a collection of pairwise independent variables, you can simplify further: the conditional probability reduces to P(Xj | Xk) = P(Xj) for j ≠ k, and the joint probability factors as P(Xi & Xj) = P(Xi)*P(Xj) for i ≠ j. I think what you mean to say is that it doesn't solve the problem if each of the random variables {Xi, i is a positive integer} is itself a member of some non-stationary stochastic process. Thank you for pointing that out. Well, that's an issue with @aquarian1's methodology...

@artemiso

I was hoping there might be a better methodology/approach and there would be others who understand the problem better. This is why I asked for help and started the thread.

"each of the random variables is itself a member of some non-stationary stochastic process."

1. I do not believe the variables are independent. As stated, they are all based on the same data series, specifically EOD data for the ES. I would expect them to be non-independent.

2. I do understand that correlation is not causation, but I believe I am a long way from there and it is like sinking the boat before I can even find it! One has to start somewhere. I'm still in the water.
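Whether the conditions are in fact dependent can be checked directly from the records. The following is a minimal sketch, assuming each day is stored as a set of condition letters (the function name `independence_gaps` and the toy data are hypothetical). For every pair of letters it compares the observed joint frequency P(a & b) with the product P(a) * P(b); a gap far from zero suggests the pair is not independent.

```python
from itertools import combinations

def independence_gaps(days, labels):
    """For each pair of labels, return P(a & b) - P(a) * P(b)."""
    n = len(days)
    # Marginal frequency of each label across all days.
    p = {x: sum(x in d for d in days) / n for x in labels}
    gaps = {}
    for a, b in combinations(labels, 2):
        p_joint = sum(a in d and b in d for d in days) / n
        gaps[(a, b)] = p_joint - p[a] * p[b]
    return gaps

# Hypothetical toy data: one set of condition letters per day.
days = [{"D", "G"}, {"D", "G"}, {"J"}, {"D", "G", "J"}, set(), {"J"}]
gaps = independence_gaps(days, ["D", "G", "J"])
for pair, gap in gaps.items():
    print(pair, round(gap, 3))
```

With only a few hundred records, small gaps can arise by chance, so a gap on its own is a hint rather than proof of dependence.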

Last edited by aquarian1; July 4th, 2014 at 04:08 PM.

July 4th, 2014, 03:38 PM
aquarian1
 @artemiso Thank-you. I was hoping there might be a better methodology/approach and there would be others who understand the problem better. This is why I asked for help and started the thread. Your reply is too scholarly for me to understand: "each of the random variables is itself a member of some non-stationary stochastic process." 1. I do not believe the variables are independent. As stated, they are all based on the same data series, specifically EOD data for the ES. I would expect them to be non-independent. 2. I do understand that correlation is not causation, but I believe I am a long way from there and it is like sinking the boat before I can even find it! One has to start somewhere. I'm still in the water. Your reply does not seem to offer an alternative approach.

@aquarian1,

I am stuck a bit with the "classifications" of the EOD closes. Is each day assigned one such class or are multiple classes assigned to each day?

So is the following something like the language?
- A = Close Up from previous day
- B = Close Up from previous day by large amount
- C = Close Down from Previous day
- D = Close Down from Previous day by Large amount

Or do the classes reach back further?
- A = Closed up 1 day in a row
- B = Closed up 2 days in a row
- C = Closed up 3 days in a row
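To make the question concrete, here is a minimal sketch of the first labelling scheme, under the assumption that a "large amount" means a move of at least 1% (the threshold `large_move`, the function name, and the sample closes are all hypothetical). It also shows how a single day can carry more than one class at once.

```python
def classify_closes(closes, large_move=0.01):
    """Label each day (after the first) with the classes A-D described above.

    A = close up, B = close up by a large amount,
    C = close down, D = close down by a large amount.
    """
    labels = []
    for prev, cur in zip(closes, closes[1:]):
        change = (cur - prev) / prev  # fractional change from previous close
        day = set()
        if change > 0:
            day.add("A")
            if change >= large_move:
                day.add("B")
        elif change < 0:
            day.add("C")
            if change <= -large_move:
                day.add("D")
        labels.append(day)
    return labels

# Hypothetical closing prices, not real ES data.
closes = [100.0, 100.5, 102.0, 101.9, 100.0]
result = classify_closes(closes)
print(result)
```

Under this scheme a large up day carries both A and B, so multiple classes per day are possible.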

July 4th, 2014, 03:45 PM
aquarian1
 I am looking for help on a statistics / testing approach to search for rule sets that increase the probability of certain outcomes. Situation: I have a database of conditions for EOD results from 1 Feb 2012 forward to 27 June 2014. This equals 605 records. Each condition has a letter associated with it, and these go from A to U. I want to establish a rule set which will give the highest predictive strength for a condition of the next day, my desired predicted outcome (e.g. M). Goal: To establish a systematic way of investigating the possible condition sets to find the highest probability combinations for a rule set to predict my outcome, which I can then test by searching my database. Here is where I am at: [three attachments not shown] My goal would be something like a set of rules such as: 1. If D^G and G^J and d^~J then M will happen 68% of the time. 2. If pair 1 or 2 and not pair N1b, M will happen 50% of the time. Perhaps Venn diagrams would be helpful in determining the best rule sets? I am looking for ideas on an approach to find a solution just as much as a solution. Thanks in advance. ----------- Clarifying notes: 1. The "^" symbol = the AND condition, so D^G is "both D and G occur". 2. In the table of occurrences of individual conditions, D occurred 223 of 390 records, or 57.2% of the time M happened the next day. The percentage on the right, 12.4% = 223 of 1796, is just a relative strength %. 3. The "~" symbol = NOT.
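A rule of this kind can be tested against the database by counting. The sketch below assumes the records are a chronological list with one set of condition letters per day (the function name `rule_hit_rate` and the toy data are hypothetical). A rule is given as letters that must be present (the "^" terms) and letters that must be absent (the "~" terms), and the function estimates how often the outcome occurs on the following day.

```python
def rule_hit_rate(days, required, forbidden, outcome):
    """Estimate P(outcome tomorrow | all required present, no forbidden present today).

    Returns (matches, hits, rate); rate is None when the rule never matches.
    """
    matches = hits = 0
    # Pair each day with the following day; the last day has no "tomorrow".
    for today, tomorrow in zip(days, days[1:]):
        if required <= today and not (forbidden & today):
            matches += 1
            hits += outcome in tomorrow
    rate = hits / matches if matches else None
    return matches, hits, rate

# Hypothetical toy records, not the actual database.
days = [{"D", "G"}, {"M"}, {"D", "G", "J"}, {"A"},
        {"D", "G"}, {"M"}, {"D"}, {"M"}]

d_alone = rule_hit_rate(days, {"D"}, set(), "M")        # D
dg_not_j = rule_hit_rate(days, {"D", "G"}, {"J"}, "M")  # D^G^~J
print(d_alone, dg_not_j)
```

The match count is returned alongside the rate because, with about 605 records, a rule combining several conditions may match only a handful of days, and a percentage from so few matches is unreliable.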

Hummmm....

I am starting to get your approach. I think Rapid Miner may help as I think you have created a class of "things":
A, B, C, D, E,... U

What I think might help here is to develop a fitness function. So create a function that does something like what you have stated, but I think of it this way:
Fitness = k1*A + k2*B + ... + kx*U

You can then use a genetic algorithm to "search" for the coefficients that maximize the fitness function. k1, k2 ... kx each likely take one of 3 values: -1, 0, +1 (NOT, absent, present).

This is the approach I would likely take to solve this, as you have stated you have ~20 possible inputs, which leads to a very large search space.
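As a rough sketch of how such a search could look: with 21 letters (A to U) and three states each, there are 3^21, about 1.05 x 10^10, candidate rules, which is why a heuristic search is attractive. The code below is one possible reading, not the exact formula above. It treats each vector k as a rule (+1 = letter must be present, -1 = letter must be absent, 0 = ignored) and scores it by its next-day hit rate for M. The `min_matches` cutoff, population settings, and synthetic data are all assumptions for illustration.

```python
import random

LETTERS = [chr(c) for c in range(ord("A"), ord("U") + 1)]  # A..U, 21 conditions

def fitness(k, days, outcome="M", min_matches=5):
    """Next-day hit rate of the rule encoded by k (+1 present, -1 NOT, 0 ignored)."""
    required = {l for l, v in zip(LETTERS, k) if v == 1}
    forbidden = {l for l, v in zip(LETTERS, k) if v == -1}
    matches = hits = 0
    for today, tomorrow in zip(days, days[1:]):
        if required <= today and not (forbidden & today):
            matches += 1
            hits += outcome in tomorrow
    # Rules matching too few days score 0 (assumed guard against overfitting).
    return hits / matches if matches >= min_matches else 0.0

def genetic_search(days, pop_size=40, generations=30, mutation_rate=0.05, seed=0):
    rng = random.Random(seed)
    pop = [tuple(rng.choice((-1, 0, 1)) for _ in LETTERS) for _ in range(pop_size)]
    pop[0] = tuple(0 for _ in LETTERS)  # baseline: the rule with no conditions
    for _ in range(generations):
        ranked = sorted(pop, key=lambda k: fitness(k, days), reverse=True)
        survivors = ranked[: pop_size // 2]  # keep the better half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            child = [rng.choice((-1, 0, 1)) if rng.random() < mutation_rate else g
                     for g in child]                          # random mutation
            children.append(tuple(child))
        pop = survivors + children
    best = max(pop, key=lambda k: fitness(k, days))
    return best, fitness(best, days)

# Synthetic data for illustration only: M is made more likely the day after D.
data_rng = random.Random(1)
days = []
for _ in range(300):
    today = {l for l in LETTERS if l != "M" and data_rng.random() < 0.3}
    if days and "D" in days[-1] and data_rng.random() < 0.7:
        today.add("M")
    elif data_rng.random() < 0.2:
        today.add("M")
    days.append(today)

best, score = genetic_search(days)
baseline = fitness(tuple(0 for _ in LETTERS), days)
print(dict(zip(LETTERS, best)), round(score, 3), round(baseline, 3))
```

Because the best rule is tuned to the same records it is scored on, it would be prudent to hold back part of the data and check the winning rule there before trusting its percentage.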

July 4th, 2014, 04:32 PM
NJAMC
 @aquarian1, I am stuck a bit with the "classifications" of the EOD closes. Is each day assigned one such class or are multiple classes assigned to each day? So is the following something like the language? - A = Close Up from previous day - B = Close Up from previous day by large amount - C = Close Down from Previous day - D = Close Down from Previous day by Large amount Or do the classes reach back further? - A = Closed up 1 day in a row - B = Closed up 2 days in a row - C = Closed up 3 days in a row

It is
"specifically EOD data for the ES"
not EOD closes.

So not what you posted, which would certainly be some very good things to consider later.
I have not got that far yet.

July 4th, 2014, 04:49 PM
aquarian1
 It is "specifically EOD data for the ES" not EOD closes. So not what you posted, which would certainly be some very good things to consider later. I have not got that far yet.

Hummmm.... Okay,

So the letters represent "features" that occurred that day?

July 5th, 2014, 02:01 PM
NJAMC
 Hummmm.... I am starting to get your approach. I think Rapid Miner may help as I think you have created a class of "things": A, B, C, D, E,... U What I think might help here is to develop a fitness function. So create a function that does something like what you have stated, but I think of it this way: Fitness = k1*A + k2*B + ... + kx*U You can then use a genetic algorithm to "search" for the coefficients that maximize the fitness function. k1, k2 ... kx each likely take one of 3 values: -1, 0, +1 (NOT, absent, present). This is the approach I would likely take to solve this, as you have stated you have ~20 possible inputs, which leads to a very large search space.

Thought you might like this thread...

