NeuroTalk Support Groups - A simple test to detect self-evident clusters

This post tries to help answer questions like:

In a town of 10,000 people, 50 are diagnosed with PD. Is this obviously a cluster?

Four people in my football (soccer) team have gone on to develop PD. Is this unusual?

In case the mathematics scares you off, let's get to the results first.

The population of the UK is about 60 million, of whom about 120,000 have been diagnosed with PD. On those figures, to claim that a group of size G contains a PD cluster, we need at least this number of people with Parkinson's (PwP) in the group:

G (group size), minimum #PwP
3, 3
10, 4
100, 6
1000, 11
10000, 39
100000, 244

To see how to read this table, look at the row 100, 6. This means that if you have an unbiased group of size 100, you need at least 6 PwP in the group to claim it is a self-evident cluster.

If anyone knows of any suspected clusters that satisfy the above table, I'd love to hear from you.

Now let's explain where these numbers come from.

Let me say at the outset that, although the test is simple, the collection of data in an unbiased manner is hugely difficult.

As I use the term, a cluster is a group of people, determined by geography, job, age or whatever, that has a significantly higher proportion of people with Parkinson's than would be expected given the statistics for the whole population.

By "self-evident" I mean that the specialness of this group rests on the data from this group alone and does not rely on evidence from neighbouring groups. Another, more subtle, point implied by self-evident is that we have a post hoc test: the question has come out of the data, rather than the other way around. For instance, with the football example above, it would clearly be a stronger result if the question had been asked 30 years ago, before anyone in the team had any signs of PD, rather than now. This is because a similar question would have been asked if it had turned out that four people from the team had developed heart trouble or cancer, or whatever. Another problem is fitting the question to the data. For instance, if one of the football team's substitutes had PD, one might be more inclined to include the substitutes in the analysis.

The data needs to be unbiased, which means accounting for age, gender, ethnicity, etc. Familial correlations need to be accounted for, as do group members having a common doctor, and so on. The status of the observer (being a PwP or not), if the observer is part of the group, also biases the sample.

I've adopted a naive approach to the mathematics, using many heuristics. I'd be grateful to anyone who can suggest ways to improve the quality of my analysis, so that the table above can be revised.

The test requires:
N = the size of the population (e.g. the national population)
D = the number of diagnosed PwP in the population
G = the group size
n = the number of PwP in the group

Then, in the case where all members of a group have Parkinson's (so G = n), the test for a cluster is:

N * (D/N)^n < 1
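As a quick check of the all-members case, here is a minimal Python sketch using the UK figures quoted earlier (N about 60 million, D about 120,000). The function name is made up for illustration and is not part of the original analysis.

```python
def smallest_all_pwp_group(N, D):
    """Smallest n for which N * (D/N)**n < 1 (a group made up entirely of PwP)."""
    if not 0 < D < N:
        raise ValueError("need 0 < D < N")
    p = D / N              # estimated probability that one person has PD
    n = 1
    while N * p**n >= 1:   # keep adding members until the test is passed
        n += 1
    return n

print(smallest_all_pwp_group(60_000_000, 120_000))
```

For n = 1, 2 and 3 the left-hand side is 120,000, 240 and 0.48, so three PwP out of three is the first case that passes, in line with the first row of the table.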

And in the general case:

(N/G) * (1-BINOM.DIST(n-1,G,D/N,TRUE)) < 1
where BINOM.DIST is defined in the same way as the Excel function of that name.
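For anyone without Excel, the general test can be sketched in plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, and BINOM.DIST is replaced by a direct sum of binomial terms (computed through logarithms), which should agree with 1-BINOM.DIST(n-1,G,D/N,TRUE) up to rounding. It assumes 0 < D < N.

```python
import math

def prob_at_least(n, G, p):
    """P(X >= n) for X ~ Binomial(G, p), summed term by term."""
    if n <= 0:
        return 1.0
    total = 0.0
    for k in range(n, G + 1):
        # log of C(G, k) * p^k * (1-p)^(G-k), to avoid overflow for large G
        log_term = (math.lgamma(G + 1) - math.lgamma(k + 1) - math.lgamma(G - k + 1)
                    + k * math.log(p) + (G - k) * math.log1p(-p))
        term = math.exp(log_term)
        total += term
        # past the mean the terms only shrink, so stop once they no longer matter
        if k > G * p and term < total * 1e-17:
            break
    return total

def expected_groups(n, G, N, D):
    """(N/G) * P(at least n PwP in a group of size G): the left-hand side of the test."""
    return (N / G) * prob_at_least(n, G, D / N)

def min_pwp_for_cluster(G, N, D):
    """Smallest n that passes the test, or None if even n = G fails."""
    for n in range(1, G + 1):
        if expected_groups(n, G, N, D) < 1:
            return n
    return None

N, D = 60_000_000, 120_000   # approximate UK figures quoted in the post
for G in (3, 10, 100, 1000, 10000, 100000):
    print(G, min_pwp_for_cluster(G, N, D))
```

Run with the UK figures, the first five group sizes give 3, 4, 6, 11 and 39, matching the table. The same functions answer the opening question about the town: 50 PwP in 10,000 is above the threshold of 39, so it passes the test, subject to all the caveats about unbiased data.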

How is this approach justified?

- D/N estimates the probability of a person having PD.

- Assuming independence, the cumulative binomial distribution can be used to find the probability of at least n PwP in the group.

- If the whole population were divided up into groups of size G, there would be N/G groups.

- The number of groups multiplied by this probability gives the expected number of groups with at least n PwP. Call this E.

- Here I really wave my hands. Given that we're doing a post hoc analysis we want E<1, otherwise a result of this kind would be expected across the whole population anyway. E is similar to a significance level.
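The steps above can be traced for a single row of the table. The sketch below follows them for a group of size 100 with the UK figures; the variable and function names are illustrative only.

```python
from math import comb

N, D, G = 60_000_000, 120_000, 100

p = D / N                         # probability that one person has PD

def prob_at_least(n):
    """Probability of at least n PwP in a group of G, assuming independence."""
    return sum(comb(G, k) * p**k * (1 - p)**(G - k) for k in range(n, G + 1))

groups = N / G                    # number of groups of size G in the population

def expected_groups(n):
    """E: expected number of groups with at least n PwP."""
    return groups * prob_at_least(n)

for n in (5, 6):
    print(n, expected_groups(n))
```

E comes out at roughly 1.2 for five PwP and roughly 0.04 for six, so six is the first count with E<1, which is where the row 100, 6 comes from.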

One feature of this approach is that, although fairly insensitive to it, the result depends on the population size: what would be defined as a self-evident cluster in a small population might not be in a large population.
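To get a feel for this dependence, here is a small sketch that keeps the group size at 100 and the prevalence at 1 in 500 (the UK figure used above) while varying the population size. Holding the prevalence fixed is an assumption made for the illustration, and the function name is hypothetical.

```python
from math import comb

def min_pwp(N, G, p):
    """Smallest n with (N/G) * P(at least n PwP in a group of G) < 1."""
    for n in range(1, G + 1):
        tail = sum(comb(G, k) * p**k * (1 - p)**(G - k) for k in range(n, G + 1))
        if (N / G) * tail < 1:
            return n
    return None

for N in (60_000, 600_000, 6_000_000, 60_000_000):
    print(N, min_pwp(N, 100, 0.002))
```

The threshold rises from 3 to 4 to 5 to 6 as the population grows a thousandfold, so the dependence is real but slow.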

John