The origin of the permutation test

Brendan McKay, Australian National University
Gil Kalai, The Hebrew University

The first public response from Witztum, Rips or Rosenberg (WRR) to our Statistical Science paper was an article on Witztum's web page charging that we had misrepresented the role of statistician Persi Diaconis in choosing the test used by WRR in their 1994 paper.

Witztum's article can be found here. In case it has changed or disappeared since we wrote this reply, the original of it can be found here.

The essence of the account in our paper is that WRR performed a test invented by themselves, and not the one that Professor Diaconis intended them to perform. Witztum's denial of this fact can easily be refuted, as we will now demonstrate.

What we wrote in our paper

From Section 3:

[McKay et al.]
About 1988, a shortened version of WRR's preprint (1987) was submitted to a journal (Proceedings of the National Academy of Sciences of the USA) for possible publication. To correct the error in treating P_1-4 (that is, P₁, P₂, P₃, P₄) as probabilities, Diaconis proposed a method that involved permuting the columns of a 32 x 32 matrix, whose (i,j)th entry was a single value representing some sort of aggregate distance between all the appellations of rabbi i and all the dates of rabbi j. This proposal was apparently first made in a letter of May 1990 to the Academy member handling the paper, Robert Aumann, though a related proposal had been made by Diaconis in 1988. The same design was again described by Diaconis in September (Diaconis, 1990), and there appeared to be an agreement on the matter. However, unnoticed by Diaconis, WRR performed the different permutation test described in Section 2. A request for a third sample, made by Diaconis at the same time, was refused.

Witztum also highlights a remark made by us in Section 10:

[McKay et al.]
Using the permutation test of Diaconis (discussed in Sections 3 and 4) rather than the test invented by WRR, the results are even worse...

Why does it matter?

WRR made no mention of Diaconis in their 1994 paper. However, since the publication of that paper, the role of Diaconis has become important in the public debate about the famous rabbis experiment.

A statistical test has two important aspects: the data and the method of analysing the data. The final result depends on both, so any bias in the selection of either makes the test invalid. Witztum has many times claimed that a particular part of the analysis method, the "permutation test", could not have been selected by him in a biased manner because it was in fact selected by an independent person, Professor Diaconis. Examples of that claim (emphasis ours):

[Witztum and Rips, letter to Robert Aumann, Dec 1996; published on the internet by Witztum in 1998]:
Negotiations were carried out over whether or not to publish this preprint as a paper. It was during the course of these negotiations that Professor Diaconis first proposed, in a letter dated Aug. 3, '88, that we perform a test based on a large number of random permutations. Eventually the details of the test, the number of permutations and the required level of significance were specified by Prof. Diaconis in a letter dated Sept. 5, '90.

[Witztum, internet article "Does Tolstoy really love Brendan McKay?", late 1997]:
The permutations test proposed by Professor Persi Diaconis, as described in the Statistical Science paper...
...
We thought Dr. McKay was wrong, and that Professor Diaconis' permutation test was a good one.

[Witztum, letter to Galileo, Dec. 1997]:
After the great success of the second list's measurement, Prof. Diaconis suggested we use a new method of measurement on the second list. We did so, and the surprising results of our experiment are brought at the beginning of our letter.

[Witztum, article in Jewish Action, 1997]:
After receiving the second preprint, Professor Diaconis had a further request for us. He asked us to employ a new statistical test devised by him on the second list of rabbis to assess the likelihood of our findings appearing at random. He and the other referees were so confident that the new test would completely destroy our results that they required us to sign an agreement that even if the new experiment resulted in failure we would publish the results from the new experiment along with our previous results.
...
Professor Diaconis and his colleagues were shocked when, employing his measures of statistical significance, we once again obtained highly significant results for the list.

[Witztum, internet article, 1998]:
The randomization test of Professor Diaconis was suggested at a later stage.

These are not the only examples, but are representative. This version of events has become part of the standard mythology of the famous rabbis experiment, and has been repeated many times in books and articles.

More recently, after our criticism became known, Witztum has tended to be more careful in his claims. Now he usually says that the permutation test was "agreed to" by Diaconis, rather than "designed" by him. As we shall see, even that weaker version is incorrect.

The documentary record and what it tells us

As Witztum acknowledges, the permutation test grew out of a long correspondence between three mathematics professors. Persi Diaconis was acting as a statistical advisor and, later, was a referee of an early version of WRR's paper. David Kazhdan and Robert Aumann were acting on behalf of WRR. Aumann was also the "communicator" of WRR's paper to the Proceedings of the National Academy of Sciences of the USA.

Recall that the data for the famous rabbis experiment involved 32 rabbis. For each rabbi, there was a list of names and a list of dates. The permutation test works like this:

Calculate some measure of the average closeness of the names to the dates.
Randomly permute the dates (so that rabbis will usually be assigned the dates of someone else) and calculate the average closeness again. Repeat this very many times, and see how often the closeness is as good as or better than the closeness for the correct assignment of dates.
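The two steps above can be sketched in Python. This is only an illustration under stated assumptions, not the program used by WRR or the procedure specified by Diaconis: the function names, the toy data, the convention that a smaller value means "closer", and the choice to count the correct assignment as one of the trials are all ours.

```python
import random

def permutation_p_value(names, dates, closeness, trials=999, seed=0):
    """Estimate how often a random reassignment of dates scores at least
    as well as the correct one. Assumes a SMALLER closeness value is better."""
    rng = random.Random(seed)
    observed = closeness(names, dates)      # step 1: correct assignment
    hits = 0
    for _ in range(trials):                 # step 2: random reassignments
        shuffled = list(dates)
        rng.shuffle(shuffled)               # each rabbi gets someone's dates
        if closeness(names, shuffled) <= observed:
            hits += 1
    # Assumption: the correct assignment is counted as one of the trials.
    return (hits + 1) / (trials + 1)

def toy_closeness(names, dates):
    """Hypothetical stand-in measure: total absolute difference."""
    return sum(abs(n - d) for n, d in zip(names, dates))

# Hypothetical toy data in which every "name" matches its own "date".
toy_names = [1, 2, 3, 4, 5, 6, 7, 8]
toy_dates = [1, 2, 3, 4, 5, 6, 7, 8]
p = permutation_p_value(toy_names, toy_dates, toy_closeness)
```

Everything of interest is hidden inside the `closeness` argument. The permutation loop is the same whichever measure is plugged in, which is why the choice of measure is the crux of the dispute.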

The issue here is the definition of "average closeness". Although the choice may seem unimportant, different but equally feasible definitions can produce results that differ by factors of hundreds or thousands. The documents refer to two types of average closeness measures:

Type A: Take all 300 or so name-date pairs (from all the rabbis), then combine those 300 distances into an overall measure.
Type D: For each rabbi, combine the distances of his name-date pairs into a single number, then combine those 32 numbers into an overall measure.
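The structural difference between the two types can be made concrete with a small Python sketch. It is hypothetical throughout: the arithmetic mean stands in for whatever combining rule is actually used (the statistics in WRR's paper are more elaborate), the numbers are invented, and `pair_dists[i][j]` is our own layout holding the distances between the names of rabbi i and the dates of rabbi j, with at least one distance per entry.

```python
def mean(values):
    return sum(values) / len(values)

def type_a(pair_dists, assignment):
    """Type A: pool every name-date distance from all rabbis, combine once."""
    pooled = []
    for rabbi, date_owner in enumerate(assignment):
        pooled.extend(pair_dists[rabbi][date_owner])
    return mean(pooled)

def type_d(pair_dists, assignment):
    """Type D: reduce each rabbi to one number, then combine those numbers."""
    per_rabbi = [mean(pair_dists[rabbi][date_owner])
                 for rabbi, date_owner in enumerate(assignment)]
    return mean(per_rabbi)

# Invented data for two rabbis: rabbi 0 has two names and two dates,
# rabbi 1 has one name and one date.
pair_dists = [
    [[0.25, 0.25, 0.25, 0.25], [0.5, 0.5]],  # names of rabbi 0
    [[0.5, 0.5], [0.75]],                    # names of rabbi 1
]
correct = [0, 1]   # assignment[i] = whose dates rabbi i is given
a_value = type_a(pair_dists, correct)
d_value = type_d(pair_dists, correct)
```

In this toy case the type A value is about 0.35 and the type D value is 0.5 for the same data and the same assignment, because type A lets a rabbi with many name-date pairs carry more weight while type D gives every rabbi one vote. With a type D measure, the per-rabbi numbers for every possible pairing can also be computed once and stored in a square matrix, which is the form described in the passage quoted from our paper above.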

Type A is what WRR used in their Statistical Science paper. However, type D is what Diaconis actually recommended, as the following documents show. Copies of the letters were kindly provided to us by Eliyahu Rips and Robert Aumann. Our interpretation of the documents was greatly assisted by discussions with Persi Diaconis, Robert Aumann, and Eliyahu Rips.

1. Diaconis to Kazhdan, Dec 30, 1986. scan
Diaconis proposes using a single permutation (a cyclic shift randomly chosen). It is not clear if a type A or type D measure is being referred to, but a later reference to this document in document 3 states it to be type D.
2. Witztum, Rips, and Rosenberg, unpublished preprint, 1987.
This is where the second list of rabbis first appeared. A measure of type A is applied with a single cyclic shift by one position.
3. Diaconis to Aumann, Aug 3, 1988. scan
Diaconis criticises the method used in document 2, and again proposes a type D measure. This time he explicitly demonstrates how to combine the name-date distances for each rabbi into a single number. He further proposes this be done for "a few hundred" permutations.
4. Aumann to Diaconis, Nov 15, 1989. scan
Aumann here gives the first known description of the method finally published by WRR, which he states to have been proposed by WRR. (This reading of his letter, which is in any case completely clear, was confirmed in writing by Aumann on Sep 5, 1999.) In fact, the letter states that WRR had already written a computer program for it. It uses a type A measure with many permutations.
5. Diaconis to Aumann, May 7, 1990. scan
Diaconis suggests using a million permutations with a type D measure, and describes how to do the computation efficiently.
6. Aumann to Diaconis, Jun 19, 1990. scan
Aumann writes that WRR had "apparently reached similar conclusions" about the computational method. From his description we can see that he is still referring to a type A measure.
7. Diaconis to Aumann, Sep 5, 1990. scan
Diaconis reports "we are in agreement" but then explicitly describes the use of a type D measure.
8. Aumann to Diaconis, Sep 7, 1990. scan
Aumann seeks to "clarify the rules" by describing type A again. There is a note written on the bottom that Diaconis "looked it over and approved".

The pattern is very clear. Without exception, Diaconis described type D measures and Aumann described type A measures. This continued even after Diaconis reported "agreement". Neither person acknowledged the difference between them. It was almost as though neither person was reading the letters written by the other.

In summary, Diaconis designed a permutation test using a type D measure, and thought Aumann had agreed to it. Aumann argued for a permutation test that WRR had proposed, using a type A measure, and thought Diaconis had agreed to it. In fact, there was no agreement.

Both Diaconis and Aumann were surprised when one of us (McKay) reported the inconsistencies to them in 1997, and both now agree with us that their "agreement" had been an illusion. In an email of Sep 5, 1999, Aumann wrote "Of course, by now I realize that there was a misunderstanding between Persi and the authors (and me) about the form of the test."

The effect of this misunderstanding was profound. Aumann and Diaconis had agreed that a significance level of 1/1000 was a reasonable criterion for success. When WRR applied the permutation test they had designed themselves, they met that target easily. If they had performed the test that Diaconis wanted, on the same data, they would have missed it. The failure of the experiment to pass the 1/1000 threshold would have greatly reduced its prospects of ever being published in a scientific journal.

Finally, we make two remarks relevant to Witztum's claims.

Contrary to what Witztum claims, Diaconis' permutation test makes perfect sense and very little in his description is incorrect or vague.
The proposals of Diaconis and (via Aumann) WRR were mathematically incompatible, and there is no way to regard one of them as being a "clarification" of the other. This incompatibility has been confirmed by Eliyahu Rips:
[Rips, email to McKay, Apr 4, 1997]:
Dear Brendan,
I already had in mind to draw your attention to the fact that in letter 3 (from May 7, 1990) section 2 [and in letter 5 (from September 5, 1990)] [compare also with letter 1 (from December 30, 1986), the end of page 1] the statistic T has NOTHING TO DO with what we have actually computed. I do not know what is d(x_i,y_i) and why should their sum be taken.

Conclusion

Witztum's claims are without foundation. The account of the permutation test in our paper is supported both by the historical record and by the two major players in it.

Postscript

On Nov 5, 1999, Witztum posted a reply to the article above. We have studied it carefully without finding any successful refutations except of Witztum's straw men. We stand by both our published paper and our article above and do not see the need to revise either.

Creator: Brendan McKay, bdm@cs.anu.edu.au.