Wikipedia:Reference desk/Archives/Mathematics/2024 January 25

Mathematics desk
Welcome to the Wikipedia Mathematics Reference Desk Archives
The page you are currently viewing is a transcluded archive page. While you can leave answers for any questions shown below, please ask new questions on one of the current reference desk pages.


January 25

Testing a null hypothesis for tossing N coins

A Q from twitter [1], adapting the language slightly.

Suppose we have $N$ coins, and we have a null hypothesis that gives coin $i$ a probability $p_i$ of coming up heads and $1 - p_i$ of tails.

We toss the $N$ coins. From the sequence of results, can we calculate a confidence level for our null hypothesis?

If we were tossing the same coin N times, we could calculate a Binomial distribution for the total number of heads, and read off the probability of being nearer the expectation value than our observed number.

Can we do anything similar (but based on the actual sequence observed, rather than the total number) if the probabilities $p_i$ are not all the same? Jheald (talk) 01:39, 25 January 2024 (UTC)
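For the equal-probability case just described, a minimal sketch of the two-sided binomial test in Python (the counts are made up for illustration; assumes SciPy is available):

from scipy.stats import binom

N, p = 100, 0.5      # hypothetical: 100 tosses of one fair coin
observed = 61        # hypothetical observed number of heads

mean = N * p
dev = abs(observed - mean)
# Two-sided p-value: total probability, under the null, of any count
# at least as far from the expectation as the observed one.
p_value = sum(binom.pmf(k, N, p)
              for k in range(N + 1)
              if abs(k - mean) >= dev)
print(f"two-sided p-value: {p_value:.4f}")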

Let the random variable $X_i$ represent a toss of coin $i.$ Define the random variable $Z_i$ by $Z_i = f(X_i),$ where
$f(\text{heads}) = 1,$
$f(\text{tails}) = 0.$
The expected value and variance of $Z_i$ under the null hypothesis equal
$\mathsf{E}[Z_i] = p_i,$
$\mathsf{Var}[Z_i] = p_i(1 - p_i).$
For a sample obtained by a large number $N$ of independent tosses, the sample mean $\bar{Z} = \frac{1}{N}\sum_{i=1}^{N} Z_i$ should then be approximately normally distributed with $\mu = \frac{1}{N}\sum_{i=1}^{N} p_i$ and $\sigma^2 = \frac{1}{N^2}\sum_{i=1}^{N} p_i(1 - p_i).$ Using a Monte Carlo method, a good approximation of the expected distribution under the null hypothesis for smaller values of $N$ may be obtained, giving a test with more power.
There may be better approaches, for example by letting the aggregate random variable be a weighted sum of the $Z_i$ with weights depending on the probabilities. I have not attempted to investigate this.  --Lambiam 12:44, 25 January 2024 (UTC)
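A minimal sketch of this test in Python, using the equivalent statistic $S = \sum_i Z_i$ (the number of heads); the $p_i$ and the observed toss are made up for illustration, and the Monte Carlo step gives the small-$N$ null distribution mentioned above:

import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.9, 0.7, 0.6, 0.55, 0.5, 0.5, 0.4, 0.3])  # hypothetical p_i
observed = np.array([1, 1, 1, 0, 1, 0, 1, 0])            # hypothetical toss

s_obs = observed.sum()
mu = p.sum()                           # E[S] = sum of p_i
sigma = np.sqrt((p * (1 - p)).sum())   # sd[S] = sqrt(sum of p_i (1 - p_i))
print(f"normal-approximation z-score: {(s_obs - mu) / sigma:.2f}")

# Monte Carlo null distribution of S: simulate many joint tosses under H0
sims = (rng.random((100_000, p.size)) < p).sum(axis=1)
p_two_sided = np.mean(np.abs(sims - mu) >= abs(s_obs - mu))
print(f"Monte Carlo two-sided p-value: {p_two_sided:.3f}")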
Thanks. I think that makes a lot of sense, and I like how it falls back to exactly the test with the binomial distribution that one would do if the $p_i$ are all the same. I've fed it back to the original twitter thread here: [2].
One interesting wrinkle is using as the test statistic $S$ the number of coins whose observed outcome was their higher-probability one, rather than just the number of heads -- as presumably this should have just a slightly smaller variance?
Thank you so much again for thinking about this -- this answer makes a lot of sense to me. Jheald (talk) 14:21, 26 January 2024 (UTC)
Here is another approach. Let us identify heads with $1$ and tails with $0,$ so that each toss corresponds with a vertex of the solid unit hypercube $[0,1]^N.$ The expected value under the null hypothesis of the arithmetical average, taken coordinate-wise, of a number of tosses, is the point $(p_1, \ldots, p_N).$ For a large sample, the probability distribution of this average approximates a multivariate normal distribution in which all $N$ components are independent. The iso-density hypersurfaces of this distribution approximate a hyperellipsoid whose axes are aligned with the hypercube; it can be turned into a hypersphere by scaling the $i$-th coordinate up by a factor of $1/\sqrt{p_i(1-p_i)}.$
Then I expect that a good test statistic is given by the Euclidean distance between the point corresponding to the expected value and the observed arithmetical average of the sample, after scaling. I am confident that the distribution of this statistic is well known and not difficult to find, but I'd have to look it up. As before, this can be replaced by a Monte Carlo approach.  --Lambiam 13:57, 25 January 2024 (UTC)
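Under the normal approximation, the well-known distribution alluded to is chi-squared with $N$ degrees of freedom for the squared scaled distance. A sketch in Python, assuming $n$ independent joint tosses so that a coordinate-wise average exists (all values made up):

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
p = np.array([0.8, 0.6, 0.5, 0.4])   # hypothetical p_i
n = 50                               # hypothetical number of joint tosses

tosses = (rng.random((n, p.size)) < p).astype(float)  # simulated under H0
avg = tosses.mean(axis=0)            # coordinate-wise arithmetical average

# Scale each coordinate so the approximating ellipsoid becomes a sphere;
# the squared Euclidean distance is then approximately chi^2 with N dof.
z = (avg - p) / np.sqrt(p * (1 - p) / n)
d2 = np.sum(z ** 2)
print(f"distance^2 = {d2:.2f}, p-value = {chi2.sf(d2, df=p.size):.3f}")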
@Lambiam: Not quite so sure about this one. When you talk of the "arithmetical average" it sounds like you're thinking of the case where you can perform multiple tosses of each individual coin $i$, and also have enough data to look at conditional dependence, or covariance, between toss $i$ and toss $i'$; whereas my question was more motivated by what you can say when each coin is tossed only once (and whether one can find evidence of skill in such a situation).
I also get nervous about bleaching distributions by applying scalings, and when that does or doesn't make sense; so I would need to think a bit more about that bit. Jheald (talk) 14:21, 26 January 2024 (UTC)
I did not realize there was just a single toss. (I use "toss" to mean a joint toss of all $N$ coins.) Then the "arithmetical average" is just the outcome of that one toss. If $x = (x_1, \ldots, x_N)$ is the outcome of a toss, with heads identified with $1$ and tails with $0,$ you might try the log of the likelihood under the null hypothesis,
$\Lambda(x) = \sum_{i=1}^{N} \bigl( x_i \log p_i + (1 - x_i) \log(1 - p_i) \bigr).$
The distribution of $\Lambda$ under the null hypothesis can be approximated à la Monte Carlo. This won't do anything for coins that are null-hypothesized to be fair, but, clearly, these won't give one any usable information in just one single toss. Unless the $p_i$ tend to be somewhat extreme, $N$ needs to be very large for one to be able to make any plausible determination, no matter how well-crafted and powerful the test statistic.  --Lambiam 15:02, 26 January 2024 (UTC)
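A minimal sketch of $\Lambda$ for a single joint toss (hypothetical $p_i$ and outcome); note how a fair coin contributes the same constant $\log\tfrac{1}{2}$ whatever it shows, illustrating the point above:

import numpy as np

p = np.array([0.9, 0.7, 0.5, 0.3])   # hypothetical p_i; p[2] is a fair coin
x = np.array([1, 1, 0, 0])           # hypothetical single joint toss

# Lambda(x) = log-likelihood of the observed outcome under H0
lam = np.sum(np.where(x == 1, np.log(p), np.log(1 - p)))
print(f"Lambda(x) = {lam:.3f}")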
@Lambiam: Interestingly, FWIW this in fact was also my first knee-jerk response [3]: do the probabilities $p_i$ appear to be well-calibrated, in the sense that they appear to be able to code the results actually obtained with about the expected message-length.
But then I spotted, as you note above, that this approach isn't capable of telling us anything if every $p_i = \tfrac{1}{2}$. Whereas your first approach can, even in that case.
So just the observed message-length coded according to the $p_i$ can't be the whole story. Jheald (talk) 17:49, 26 January 2024 (UTC)
As also noted below, there is the issue of what it is you are testing against, the set of alternative hypotheses. This is already obvious in the distinction between one-sided and two-sided tests. If someone is suspected of faking their experimental data, sleuths might choose to examine if the reported data is too good to be believable. If every coin is supposed to be fair, $p_i = \tfrac{1}{2},$ and the report says exactly $\tfrac{N}{2}$ heads were observed and an equal number of tails, this might be interpreted as confirming the suspicion.  --Lambiam 19:17, 26 January 2024 (UTC)
You need some sort of alternatives to test a hypothesis. If the actual probabilities are limited to p = 0.5 and p = 0.5001, it would take a lot of testing to distinguish the two, and early results indicating p = 0.3 don't provide much support for either actual probability. NadVolum (talk) 23:37, 25 January 2024 (UTC)
@NadVolum: No. Certainly in statistics you can compare the weight of evidence for different hypotheses. But that is not the only thing you can do. You can also ask questions such as: is my data reasonably consistent with what I would expect to see under hypothesis $H_0$? Which is what Lambiam does above.
On twitter Adam Kucharski took a slightly different approach [4], using the additional information that in the null hypothesis the probabilities $p_i$ could be seen as coming from an urn problem, so $p_i = \frac{r_i}{r_i + w_i},$ where $r_i$ and $w_i$ are the number of red balls and the number of white balls in the urn at stage $i$ respectively.
Kucharski suggested looking for evidence of deviations from $H_0$ by considering the alternative model $p_i = \frac{A\,r_i}{A\,r_i + w_i},$ where the skill factor $A$ allows some deviation from $H_0$ (thus introducing the alternatives you were wishing for), and looking at what sort of distribution you get for $A.$ For a given set [5] of quite limited data, he was able to calculate an estimate of $A = 1.3$ with a 95% confidence limit of 0.5 to 1.9 -- which he summarised as "a bit better [than random], but can't be sure". Jheald (talk) 14:21, 26 January 2024 (UTC)
That's assuming a uniform prior distribution and applying Bayes' theorem to test against that. A very good way of doing things, but even which uniform prior to assume can sometimes be a bit contentious, see Bertrand paradox (probability). NadVolum (talk) 14:37, 26 January 2024 (UTC)
Indeed. Always need to think about the effect of priors (sometimes implicit) and whether they are reasonable. If one didn't mind a little more complexity, something might be said for tweaking his $p_i = \frac{A\,r_i}{A\,r_i + w_i}$ to be $p_i = \frac{2^A r_i}{2^A r_i + w_i},$ so that a flat prior on $A$ would correspond to treating equally a priori a down-weighting or an up-weighting of the number of red balls by a factor of 2; with $p_i$ remaining well-defined over the full range of $A$ from $-\infty$ to $+\infty.$ Jheald (talk) 14:52, 26 January 2024 (UTC)
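A sketch of this reparametrised model in Python, computing a grid posterior for $A$ under a flat prior; the urn counts $r_i, w_i$ and the outcomes are entirely made up, so the numbers are illustrative only:

import numpy as np

r = np.array([3, 5, 2, 4, 6])   # hypothetical red-ball counts per stage
w = np.array([7, 5, 8, 6, 4])   # hypothetical white-ball counts per stage
x = np.array([1, 0, 1, 1, 0])   # hypothetical outcomes (1 = red / heads)

A_grid = np.linspace(-4, 4, 801)                   # flat prior on A
skill = 2.0 ** A_grid
p = skill[:, None] * r / (skill[:, None] * r + w)  # p_i(A) on the grid

# Log-likelihood of the data at each grid point, then normalised posterior
loglik = np.sum(np.where(x == 1, np.log(p), np.log(1 - p)), axis=1)
post = np.exp(loglik - loglik.max())
post /= post.sum()

A_mean = np.sum(A_grid * post)
print(f"posterior mean of A: {A_mean:.2f} (weighting factor 2^A = {2 ** A_mean:.2f})")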
You can always calculate a p-value by brute force. Enumerate all of the $2^N$ possible outcomes and calculate the probability for each one by multiplying the appropriate factors of $p_i$ and $1 - p_i.$ Sort the list in order of increasing probability, from least probable to most probable (under the specified null hypothesis). When you toss the coins, look up your observed outcome in the table, and take the sum of the probabilities of all the entries up to that one. That sum is the p-value by definition. Then the math problem is to find a simpler way to compute the same cumulative sum, either exactly, or given some suitable approximation. --Amble (talk) 18:20, 26 January 2024 (UTC)
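A sketch of this brute-force computation in Python, workable only for small $N$ (hypothetical $p_i$ and observed outcome):

import itertools
import numpy as np

p = np.array([0.9, 0.6, 0.5, 0.2])   # hypothetical p_i
observed = (1, 1, 0, 0)              # hypothetical observed outcome

def prob(outcome):
    # Probability of one outcome under H0: multiply p_i or 1 - p_i
    return float(np.prod(np.where(np.array(outcome) == 1, p, 1 - p)))

p_obs = prob(observed)
probs = [prob(o) for o in itertools.product((0, 1), repeat=len(p))]
# p-value: total probability of all outcomes no more probable than observed
p_value = sum(q for q in probs if q <= p_obs)
print(f"exact p-value: {p_value:.4f}")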
The $\Lambda$ statistic I introduced above is the logarithm of this probability. If $N$ is too large to make this brute-force approach feasible (you don't want to sort $2^{60}$ values), and you have decided in advance on the significance level (if only to avoid being seduced to commit post hoc analysis), you can simply estimate $\Pr(\Lambda(X) \le \Lambda(x)),$ in which $x$ is the actual toss outcome and $X$ is a random variable representing a toss under the null hypothesis, as follows. Generate a large number of tosses using a good pseudo-random number generator and count the fraction for which the computed value of $\Lambda$ is at most $\Lambda(x).$ If that fraction is too low, given the chosen significance level, the null hypothesis can be rejected.  --Lambiam 18:55, 26 January 2024 (UTC)
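A sketch of this Monte Carlo estimate in Python (hypothetical inputs as before):

import numpy as np

rng = np.random.default_rng(2)
p = np.array([0.9, 0.6, 0.5, 0.2])   # hypothetical p_i
x = np.array([1, 1, 0, 0])           # hypothetical observed joint toss

def log_likelihood(outcomes):
    # Lambda under H0, vectorised over rows of `outcomes`
    return np.sum(np.where(outcomes == 1, np.log(p), np.log(1 - p)), axis=-1)

lam_obs = log_likelihood(x)
sims = (rng.random((200_000, p.size)) < p).astype(int)
# Fraction of null tosses at most as likely as the observed one
frac = np.mean(log_likelihood(sims) <= lam_obs)
print(f"estimated p-value Pr(Lambda(X) <= Lambda(x)): {frac:.4f}")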
(ec) @Amble: Thanks. Useful perspective.
Equivalently, if we take the logarithm of those probabilities, that essentially takes us to the "message length" discussion above. And we can apply either a one-sided test (what are the chances of an outcome less likely than this / a message-length longer than this), or a two-sided test (what are the chances of an outcome further from the typical, either way, than this).
So yes, that's a very useful observation.
But, as per the "message length" discussion above, it wouldn't give us any help in a case where every $p_i = \tfrac{1}{2}$ -- whereas we should be able to detect deviation in such a case, if 'heads' (or red balls) are coming up more often than expected... Jheald (talk) 18:58, 26 January 2024 (UTC)
If your null model is that every $p_i = \tfrac{1}{2},$ then (under that model) every possible outcome is equally likely, and equally consistent with the model. But that's just the probability (or log-likelihood, p-value, etc.) relative to the null model alone, without considering any specific alternatives. I think that's the original question here, but there are other questions we may want to ask. We're used to thinking of certain types of plausible alternative models like "all the coins are biased and really have some common $p_i \ne \tfrac{1}{2}$." Different outcomes will have different p-values under that alternative model, so the comparison can give you information. And you can construct a random variable like "total number of tails" that is very sensitive to this kind of bias. But it will be completely insensitive to other kinds of alternative models like "even coins always give heads, odd coins always give tails". So a choice between two models given the data is different from a test of the null model that's intended to be independent of any assumption about the possible alternative models. --Amble (talk) 19:39, 26 January 2024 (UTC)
An interesting case to consider is one where the null model is independent, unbiased coins (all $p_i = \tfrac{1}{2}$) and the alternative model is that a human is making up a sequence of H and T while trying to make them look as random as possible. --Amble (talk) 19:54, 26 January 2024 (UTC)
This is very similar to the case of suspicious data above where the outcome is exactly evenly divided between heads and tails. A fraudulent scientist just needs to use a good random generator. The likelihood of alarm being triggered is then just the same as that of false alarm for a scientist who laboriously tosses and records $N$ truly fair coins. And if they know the fraud test to be applied, they can keep generating data sets until one passes the test.  --Lambiam 21:30, 26 January 2024 (UTC)
I mean a task where the human has to make up the sequence themselves, without being able to use an RNG, as in a random item generation test. [6] --Amble (talk) 21:46, 26 January 2024 (UTC)