Preamble to The Prediction Formula

Started by Bayes, October 23, 2014, 03:17:22 PM

Bayes

As a preamble to the formula itself, I'd like to say something about the accepted viewpoint regarding games of so-called "pure chance", such as roulette, craps, and baccarat.

Everyone "knows" that, for example, R/B has a 50:50 chance of hitting (ignoring the zero). But it's important to understand that this depends on a certain interpretation of probability. This interpretation is based on relative frequency: that in an infinite series of trials the ratio of reds to blacks will approach 1. In practice, the ratio approximates to 1 in only a few hundred trials or less in most cases, but even so, the definition of probability as a relative frequency means that it makes no sense to predict future outcomes based on past outcomes, because by definition the probability of red is fixed at 1/2, (the ratio of reds to [reds + blacks] in an infinite number of trials) so you can only ever "predict" the past, not the future.
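To make the relative-frequency idea concrete, here's a quick simulation sketch (my own illustration, not from the argument above; the trial counts and function name are arbitrary): the ratio of reds to blacks drifts toward 1 as trials accumulate, but this says nothing about any particular future spin.

```python
import random

# A hypothetical simulation: the ratio of reds to blacks over a growing
# number of even-chance trials, with the zero ignored as in the text.
random.seed(1)

def red_black_ratio(trials):
    """Ratio of reds to blacks in `trials` 50:50 spins."""
    reds = sum(1 for _ in range(trials) if random.random() < 0.5)
    return reds / (trials - reds)

for n in (100, 10_000, 1_000_000):
    print(n, round(red_black_ratio(n), 3))  # ratio approaches 1
```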

Another interpretation of probability involves symmetry. It seems intuitively obvious that if there are X ways we can get a certain outcome (say Red) and Y ways we can get another outcome (Black), and neither outcome seems more likely than the other, then the probability of the outcome is again fixed at 1/2, which is the ratio of the number of ways the event of interest can happen to the total number of possibilities. This is called the Classical interpretation, and is the one most often used in elementary applications. It was the first notion in our understanding of probability (which was developed in the context of games of chance), and is most appropriate in those situations where the symmetry is obvious: each outcome has (so it seems) an equal chance of occurring, so in the case of N equally likely outcomes, each has a probability of 1/N (e.g. each outcome in roulette has a probability of 1/37).

Again, this interpretation does not allow any predictions of future outcomes because the symmetry is "built into" the game (it's assumed), and since trials are independent, on every trial there are just as many ways red can occur as black, so the probability of Red is again fixed in stone at 1/2, and accordingly, it makes no sense to predict the future (you can only predict the past).

Notice that both these interpretations assume that the idea of "randomness" is something built into the system in question. e.g. Roulette outcomes are "random" because of the symmetry of the wheel; dice outcomes are similarly random because (again, by assumption), no side is more likely to land uppermost than any other. Furthermore, we can do some physical experiments to confirm this, so we roll the die a few thousand times and see that indeed, the relative frequency of any side approaches 1/6, just as symmetry considerations suggest that it should.

The key idea I'm trying to get across is that as regards probability, these interpretations (which seem particularly well suited to games like roulette) imply that the probabilities are fixed and somehow objectively a part of the games themselves: the odds of red/black are just 50:50 - end of story. Therefore, bet selection (particularly if based on past results) is meaningless, all systems are useless, and only the casino can win "in the long term".

Now, logically, this is perfectly true, but logic can only tell you what conclusion(s) follow from certain premises, it can say nothing about whether the premises themselves are true.

In fact, there is another interpretation of probability, which includes the others as special cases, and holds that it is subjective, not objective. In a nutshell, probability is in the mind, not something "out there", as an attribute or property of a thing.

That this is obvious doesn't really require much argument. If you have information regarding a certain roulette wheel that I don't have (for example, that it's biased), then your probability of the ball falling into, say, pocket 13 may be different from mine. Is either of us wrong, logically speaking? No, it's just that our conclusions are based on different premises.

It's the same thing with probabilities. Take the action of flipping a coin. Is it part of what it means to be a coin that the probability of it coming up heads if you flip it is 1/2? Not at all; the (normally unexpressed) premise which goes with the statement "the chance of the coin landing heads is 1/2" is that the coin is "fair". But what does "fair" mean? Well, it's assumed that it isn't weighted on one side more than the other (symmetry), but you also have to take into account the physics of the way it's flipped. According to Newtonian mechanics, the result of flipping is perfectly deterministic and depends on the initial velocity and position given to the coin. In theory, a precisely made coin-flipping machine which gave the same initial force to the coin, which was placed in exactly the same way into the machine every time, would always produce the same outcome.

The "Prediction Formula", which I'll be posting soon, is based on the subjective view of probability, and is derived from Bayes' theorem, which is the key equation used to make inferences about data.

The main purpose of statistics is to come to some conclusions about a "population" based on a sample. For example, sticking with roulette, the sample is some number of "trials" or outcomes, and the population is the total number of outcomes which could be sampled. It's best to regard the roulette system (wheel + ball etc.) as an urn full of balls which have the numbers 0-36 written on them. Each trial consists of you taking a ball from the urn, then replacing it. You would like to build up a picture of the "population" - the composition of the balls in the urn - from your trials. Specifically, you want to do this for purposes of prediction. How do you do it?
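The urn picture can be sketched in code (a hypothetical illustration of my own; the seed, trial count, and names are arbitrary choices): each trial is a draw with replacement, and the empirical frequencies are our developing picture of the population.

```python
import random
from collections import Counter

# A sketch of the urn model: draw with replacement from balls numbered
# 0-36 and let the empirical frequencies build a picture of the urn's
# composition. With a fair urn, every frequency settles near 1/37.
random.seed(42)
URN = list(range(37))

def empirical_frequencies(trials):
    counts = Counter(random.choice(URN) for _ in range(trials))
    return {n: counts[n] / trials for n in URN}

freqs = empirical_frequencies(100_000)
print(freqs[13])  # near 1/37 ≈ 0.027 for a fair urn
```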

more to come...

Bayes

Ok, let's take a closer critical look at the relative frequency view of probability. This view is the dominant one in statistics today (although there is a significant move toward the subjective view as the shortcomings of relative frequency become more widely recognized) because it apparently leads to "objective" results, and if there's one thing science strives for, it's objectivity.

However, this objectivity is more apparent than actual. The definition of probability on this view is "The relative frequency of occurrence of an event after an infinite number of similar trials has occurred".

Note the requirement is for similar trials. Obviously, the trials cannot be identical, otherwise the outcome of each trial would be the same, so the trials must be at least slightly different; but just how slightly is left undefined, and so a subjective element enters into this supposedly objective definition.

Suppose you want to determine the probability of some die-rolling event. What exactly defines the roll? What explicit reference do we use so that, if we believe in relative frequency, we can define the limiting sequence? Rolling just this die? Any die? How shall it be rolled? What will be the temperature, wind speed, gravitational field? How much velocity? On what surface? And so on... Every physical thing that happens does so under very specific, unique circumstances. Thus, there are no reference classes and nothing can have a limiting relative frequency.

The second problem with relative frequencies is that they cannot apply to unique events. If an event can only happen once, it makes little sense to enquire about its past history or even an imagined repetition of trials on which it could occur. This applies to horse racing, football matches, and sporting events of all kinds. And yet, it seems reasonable to assign degrees of belief to the outcomes of these events.

Scientific hypotheses also have this characteristic. If you consider them as events, they either occur or they do not (they are either true or not true). So on the relative frequency view, hypotheses are given a probability of either 1 or 0; the frequentist (someone who accepts the view that probability is a relative frequency) statistician can never talk about the probability of a hypothesis, he can only talk about the probability of the data given the truth of it.

Data can be repeated, so do the experiment again, and again, and you should get the same, or nearly the same, data. Relative frequencies make sense in terms of repeated observations in which data can occur, so it makes sense to talk about the probability of data. On the other hand, the subjectivist (or "Bayesian") statistician is willing to attach probabilities to both data and hypotheses (through the medium of Bayes' theorem).

Another problem with frequentism is the fact that, because probability is defined as the limit of a ratio after an infinite (or at least very large) number of trials, it can say nothing about what the probability is in the short term (meaning anything less than infinity). But as the economist John Maynard Keynes famously said: "in the long run we are all dead". A view of probability which admits the validity of its results only after an infinitely long series of trials is not going to help anyone who needs to make an optimum decision now, based on limited data.

At this point there might be an objection that this is all well and good, but in cases involving roulette wheels, dice, and so on, we don't need to get data (at least in simple cases), because we can deduce what the probability is (using the Classical, symmetry interpretation). Thus, given that the roulette wheel has 37 pockets, and knowing nothing beyond the basic mechanics of how roulette works, we can deduce that the probability of any number hitting will be 1/37, AND, if we run a lot of trials and get data (relative frequency), this does confirm that the probability is in fact 1/37.

Yes, but the fact that the relative frequency interpretation and the classical interpretation sometimes agree (not always, by any means, because the classical interpretation is subject to the same kind of criticism as relative frequency; what, for example, is the "reference class" for the axis of symmetry used - there can be many) is what led people to the backward conclusion that probability just is relative frequency.

To be continued...

Bayes

Ok, back to answering the question: how to make your best "educated guess" of a probability, assuming we're signing up to the subjective interpretation of probability?

I should add that the "subjective" part only comes in at the beginning of the process, before we have any data. After that, the data itself will determine what the probability is, and this can change (it is not fixed as it is in the classical and relative frequency views). This will become clearer when I've introduced Bayes' theorem.

Before introducing it, I'll need to define some terms.

P(A) - the probability that event A will happen. The probability is the chance it will happen, from 0 to 1. 0 means it will never happen, and 1 means that it is certain to happen. Most events are somewhere in between.

P(A and B) - The probability that *both* event A and B occur. This is called the joint probability.

P(A|B) - This is the probability that event A occurs given that we already know that B has occurred (the symbol | means "given that"). This is called the conditional probability.

In a sense, all probabilities are conditional, because there are always some "givens" even though they may not be expressed. The subjective (Bayesian) view of probability tries to make all assumptions explicit.

The simplest form of Bayes' theorem is:

P(A|B) = P(A and B) / P(B)

In words: "the probability of A given B is the probability of both A and B divided by the probability of B."

Bayes' theorem provides a way to calculate the probability of an event which isn't possible to observe directly. If event A is what we want to predict, and event B is what we have just observed, then A|B becomes "what we want to predict given what we have just observed". The theorem provides a way of updating the probability of A as new observations come in.
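As a quick sanity check of the formula, here it is in code with made-up numbers (P(B) = 0.4 and P(A and B) = 0.1 are purely illustrative):

```python
def conditional(p_a_and_b, p_b):
    """P(A|B) = P(A and B) / P(B), assuming P(B) > 0."""
    return p_a_and_b / p_b

# Made-up numbers: if P(B) = 0.4 and P(A and B) = 0.1,
# then P(A|B) = 0.1 / 0.4 = 0.25.
print(conditional(0.1, 0.4))  # 0.25
```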

By the way, this isn't THE prediction formula which is the subject of this series of posts, but the formula is derived from Bayes' theorem. I wanted to introduce BT and give a few examples of how it's used, because it's the basis of just about any form of prediction or forecasting procedure, not only in statistics, but in fields like Artificial Intelligence and decision making.

First, I anticipate an objection. BT might be all right for sports betting and some other speculative activities, but surely with casino games, outcomes are independent, aren't they? So by definition past events are "meaningless", therefore "updating" a probability based on past events in roulette, for example, must also be meaningless, correct?

More later...

Bayes

I have two replies to this objection. First, let's assume that the game (roulette or whatever) is "fair". There are two ways of being fair: (1) Outcomes are independent, and (2) Outcomes are unbiased.

Unbiased means that "in the long run" (there's that expression again!) or "on average", no event is favoured more than any other. Unbiased uses the concept of relative frequency, so a definition of unbiased could be:

A chance setup is unbiased if and only if the relative frequency in the long run of each outcome is equal to that of any other.

Notice that, for any given "chance setup", if we have no data or evidence that this is the case, we have to rely on other arguments for lack of bias, such as symmetry (the classical interpretation) and the fact that it would not be in the casino's interest if any of their games were biased.

Outcomes are independent if they have no "memory", so a fair setup does not know, at any trial, what happened on previous trials. Another way of putting it is to say that there is no regularity in the outcomes. The sequence RBRBRBRBRB... is certainly unbiased, because each outcome occurs as often as the other, but of course it would be very easy to beat a wheel if it consistently generated such obvious patterns.

It certainly seems obvious that in a game like roulette, for example, outcomes are independent. But that does not mean that past results are meaningless or that they cannot in some way indicate future events. I use the word "indicate" rather than "influence" or "cause" because these latter two words are used inappropriately in this context too often, in my opinion. Mere data cannot "cause" anything (and neither can a formula). The data is evidence of a cause, not the cause itself. If a roulette wheel happens to be biased, this will be reflected in the data (one sector or number appears more often than it should), and this will be true whether or not you know what the cause of the bias is.

Another point to make regarding independence is that there is a precise, mathematical definition of it, but this definition is purely mathematical and makes no allowances for kinds of dependence which don't involve the "sample space". The sample space is the total number of possible outcomes which could occur in the event of interest, so in the case of roulette, there are 37.

Independence in relation to events means that the occurrence of one does not affect the probability of the other. With regard to the sample space, this means that the sample space is the same for event A, regardless of whether event B has occurred or not. i.e.:

P(A|B) = P(A)   "the probability of A given B is just the probability of A". Thus, the fact that we've seen 10 reds in a row tells us nothing about the next spin, because the sample space is fixed - there are still 18 reds, 18 blacks, and 1 zero.
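This can be checked by simulation (a sketch of my own, assuming a fair single-zero wheel; the seed and spin count are arbitrary): look at the spins that immediately follow a run of 10 reds, and compare their red frequency with the unconditional 18/37.

```python
import random

# An illustrative simulation: with independent spins, the red frequency
# immediately after 10 reds in a row matches the unconditional 18/37,
# because the sample space never changes.
random.seed(7)
P_RED = 18 / 37

spins = [random.random() < P_RED for _ in range(2_000_000)]  # True = red
after_streak = [spins[i] for i in range(10, len(spins))
                if all(spins[i - 10:i])]      # positions after 10 reds
print(len(after_streak), sum(after_streak) / len(after_streak))
```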

This seems intuitively correct. Suppose someone says they have a "trigger" for a roulette system: whenever event B occurs (perhaps 2 hits of a particular number in the last 10 spins), then bet that the number will hit again in the next 10 spins.

So A is the hypothesis: "number x will hit in the next 10 spins".
and B is the data: "number x has hit twice in the last 10 spins".

Therefore P(A|B) means "number x will hit in the next 10 spins given that number x has hit twice in the last 10 spins".

We can now test this hypothesis by looking at some actual data: collect some sequences of 10 spins in which a number hits twice, then look at the following 10 spins and see if the number occurs again. If this happens more often than chance alone would predict, the hypothesis is confirmed; otherwise it's refuted (i.e., P(A|B) = P(A)).
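Here's a sketch of that test run on simulated fair spins (my own illustration; the window sizes are the ones from the example, while the number 13, seed, and spin count are arbitrary choices). On a fair wheel the follow-up hit rate should simply match the unconditional chance of at least one hit in 10 spins, which is 1 - (36/37)^10 ≈ 0.24:

```python
import random

# A sketch of the trigger test on simulated fair spins: find 10-spin
# windows where the number hits exactly twice, then check whether it
# hits at least once in the following 10 spins.
random.seed(3)
spins = [random.randrange(37) for _ in range(500_000)]
NUMBER = 13

hits_after_trigger = triggers = 0
for i in range(0, len(spins) - 20, 20):      # non-overlapping 10+10 windows
    window, follow = spins[i:i + 10], spins[i + 10:i + 20]
    if window.count(NUMBER) == 2:            # trigger: two hits in 10 spins
        triggers += 1
        hits_after_trigger += NUMBER in follow
print(triggers, hits_after_trigger / triggers)
# For a fair wheel this rate should be near 1 - (36/37)**10 ≈ 0.24
```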

But of course, this tells us nothing about events which may affect future outcomes but are not just a matter of counting the outcomes in the sample space - such events might affect the probability distribution over the sample space itself. For example, a change in temperature or humidity, or a change of dealer or ball, can affect the distribution of outcomes. Perhaps the casino is cheating, and a certain number of chips on a number might "trigger" a device which prevents the ball from landing on it, etc.

The advantage of the Bayesian view of probability is that it allows us to incorporate other knowledge we may have which may affect the outcomes, in a way that the relative frequency or classical interpretations cannot. So the probability of an event is proportional to the degree of evidence or data which supports it, and this data may not be apparent or available to everyone; hence the probability is "subjective". Obviously, this makes the Bayesian approach far more flexible and wide-ranging than either of the others, although from the casino's point of view the other interpretations are "good enough", because its bottom line depends on "the long run" or "on average".

More later...

Bayes

My other reply to the objection is that, even if every effort is made to ensure that outcomes are "random" (unpredictable), i.e., independent (in both senses) and unbiased, it seems to be the case that they will, at times, exhibit tendencies which are both non-independent and biased! That is, it's part of what it means to be random that you will sometimes get long sequences which are apparently regular or biased, and if that weren't the case, the outcomes would fail any test for randomness.

There have been quite a few studies on this. For example, tests show that when people are asked to invent what they think would be a "typical" sequence of coin tosses, they always underestimate the length of streaks. They "know" that the probability of heads or tails is 1/2, so they construct sequences which reflect that knowledge; in which the deviation from 50:50 is small, even though experienced roulette or baccarat players know that departures from "balance" can be large, especially in small samples.
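A quick simulation makes the point (my own illustration; the toss and sample counts are arbitrary): the longest run of identical faces in a couple of hundred fair tosses is typically much longer than hand-invented "random" sequences would contain.

```python
import random

# Estimate the average longest run of identical faces (heads or tails)
# in 200 fair coin tosses, over many simulated sequences.
random.seed(0)

def longest_run(n_tosses):
    tosses = [random.random() < 0.5 for _ in range(n_tosses)]
    best = run = 1
    for prev, cur in zip(tosses, tosses[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

avg = sum(longest_run(200) for _ in range(1000)) / 1000
print(avg)  # longer than most people's intuition suggests
```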

This dispersion or variance in relatively small samples is ignored in both the classical and frequentist interpretations of probability. The probability just IS 1/2, period. Actually, the dispersion is taken account of in the variance, which is a separate parameter, but again, this is based on long-term relative frequency.

On the Bayesian view, however, the probability is not "given", it is what we are trying to determine. When samples are relatively small, it's perfectly legitimate to take the current probability (as determined by the incoming data) as a basis for decisions (future events). To sum up the differences between the interpretations, if H = the Hypothesis that an event has a certain probability, and D = the data, then on the frequentist/classical view, we have

P(D|H)  The probability of the data given the hypothesis.

whereas the Bayesian is looking for

P(H|D) The probability of the hypothesis given the data.

In a nutshell, the Bayesian starts with a hypothesis, then updates the probability of it as data comes in. The data is given, or fixed; it is, after all, the only thing you actually have in your hand, so to speak, assuming that you know nothing which would indicate some bias or dependence. After each update (each data item) you have a new value for the probability, on the basis of which you make your decision - the process is empirical and predictive.
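A minimal sketch of that update loop (my own illustration, using a Beta prior on the unknown probability p - a standard conjugate choice, not something specified in these posts; the data sequence is made up):

```python
# One Bayesian update per data item: a hit raises alpha, a miss raises
# beta, and alpha / (alpha + beta) is the current estimate of p on which
# a decision could be based.
def update(alpha, beta, hit):
    return (alpha + 1, beta) if hit else (alpha, beta + 1)

alpha, beta = 1, 1          # uniform prior: no initial opinion about p
data = [True, True, False, True, False, True, True, False, True, True]
for hit in data:
    alpha, beta = update(alpha, beta, hit)

# Posterior mean after 7 hits and 3 misses: (1 + 7) / (2 + 10) = 2/3
print(alpha / (alpha + beta))
```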

On the other hand, the frequentist/classicist starts with the hypothesis, which is fixed at 1/2 or whatever, and then looks to see if the data is "compatible" with the hypothesis. In effect, he asks: "what is the probability of getting this data, given that the hypothesis (that the probability is 1/2) is true?"

Now this is a pretty odd way of going about things, if you think about it. In the first place, it doesn't really tell us what we want to know, which is the probability of the next event. But the probability of the next event is 1/2, as is the probability of the event after that, and the event after that...

Second, why should the given hypothesis have a probability of exactly 1/2? The procedure implicitly relies not on the data you actually have, but on a lot of "imaginary" data garnered from an infinite series of "random" trials using an imaginary setup, which may or may not be "similar" enough to the current setup to warrant its use.

P(D|H) is the probability of the data, given the hypothesis. But there is no "probability" of the data; there is no uncertainty regarding it - it's right there and you have it. What you are uncertain about is the hypothesis (the probability), so it makes more sense to determine P(H|D) rather than P(D|H).

As an example of P(D|H), consider the event of 10 reds in a row. D = 10 reds in a row, and H = 1/2. Now the probability of getting 10 reds in a row, given that the probability of a red is 1/2, is very small, so the gambler may reason: "since this a very small probability, it means that the next outcome is more likely to be black, therefore I'll bet black."
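For the record, the numbers behind "very small" (a straightforward calculation, using 1/2 as in the example and the single-zero value 18/37 for comparison):

```python
# The chance of 10 reds in a row under the fixed hypothesis p = 1/2,
# and under p = 18/37 for comparison.
p_half = 0.5 ** 10
p_roulette = (18 / 37) ** 10
print(p_half)      # 1/1024 ≈ 0.000977
print(p_roulette)  # ≈ 0.00074
```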

This is, of course, classic gambler's fallacy. But the "fallacy" part is not that the gambler is making a decision based on past events, it is simply that he is contradicting himself; for P(D|H) means that the probability is fixed (given), but implicit in his decision to bet on black because it is "more likely, given the data" is that the probability isn't fixed. He has no right to refer to the data as "given"; it's the hypothesis that's given, not the data.

Next, I will present the more commonly used version of Bayes' theorem, which is more complex than the simple formula given above, but is derived from it.