The origin of the permutation test
Brendan McKay, Australian National University
Gil Kalai, The Hebrew University
The first public response from Witztum, Rips or Rosenberg (WRR) to our Statistical Science paper was an article on Witztum's web page charging that we had misrepresented the role of statistician Persi Diaconis in choosing the test used by WRR in their 1994 paper.
Witztum's article can be found here. In case it has changed or disappeared since we wrote this reply, the original of it can be found here.
The essence of the account in our paper is that WRR performed a test invented by themselves, and not the one that Professor Diaconis intended them to perform. Witztum's denial of this fact can easily be refuted, as we will now demonstrate.
[McKay et al.]
About 1988, a shortened version of WRR's preprint (1987) was submitted to a journal (Proceedings of the National Academy of Sciences of the USA) for possible publication. To correct the error in treating P1-4 (that is, P1, P2, P3, P4) as probabilities, Diaconis proposed a method that involved permuting the columns of a 32 x 32 matrix, whose (i,j)th entry was a single value representing some sort of aggregate distance between all the appellations of rabbi i and all the dates of rabbi j. This proposal was apparently first made in a letter of May 1990 to the Academy member handling the paper, Robert Aumann, though a related proposal had been made by Diaconis in 1988. The same design was again described by Diaconis in September (Diaconis, 1990), and there appeared to be an agreement on the matter. However, unnoticed by Diaconis, WRR performed the different permutation test described in Section 2. A request for a third sample, made by Diaconis at the same time, was refused.
Witztum also highlights a remark made by us in Section 10:
[McKay et al.]
Using the permutation test of Diaconis (discussed in Sections 3 and 4) rather than the test invented by WRR, the results are even worse...
WRR made no mention of Diaconis in their 1994 paper. However, since the publication of that paper, the role of Diaconis has become important in the public debate about the famous rabbis experiment.
A statistical test has two important aspects: the data and the method of analysing the data. The final result depends on both, so any bias in the selection of either makes the test invalid. Witztum has many times claimed that a particular part of the analysis method, the "permutation test", could not have been selected by him in a biased manner because it was in fact selected by an independent person, Professor Diaconis. Examples of that claim (emphasis ours):
[Witztum and Rips, letter to Robert Aumann, Dec 1996; published on the internet by Witztum in 1998]:
Negotiations were carried out over whether or not to publish this preprint as a paper. It was during the course of these negotiations that Professor Diaconis first proposed, in a letter dated Aug. 3, '88, that we perform a test based on a large number of random permutations. Eventually the details of the test, the number of permutations and the required level of significance were specified by Prof. Diaconis in a letter dated Sept. 5, '90.
[Witztum, internet article "Does Tolstoy really love Brendan McKay?", late 1997]:
The permutations test proposed by Professor Persi Diaconis, as described in the Statistical Science paper...
We thought Dr. McKay was wrong, and that Professor Diaconis' permutation test was a good one.
[Witztum, letter to Galileo, Dec. 1997]:
After the great success of the second list's measurement, Prof. Diaconis suggested we use a new method of measurement on the second list. We did so, and the surprising results of our experiment are brought at the beginning of our letter.
[Witztum, article in Jewish Action, 1997]:
After receiving the second preprint, Professor Diaconis had a further request for us. He asked us to employ a new statistical test devised by him on the second list of rabbis to assess the likelihood of our findings appearing at random. He and the other referees were so confident that the new test would completely destroy our results that they required us to sign an agreement that even if the new experiment resulted in failure we would publish the results from the new experiment along with our previous results.
Professor Diaconis and his colleagues were shocked when, employing his measures of statistical significance, we once again obtained highly significant results for the list.
[Witztum, internet article, 1998]:
The randomization test of Professor Diaconis was suggested at a later stage.
These are not the only examples, but they are representative. This version of events has become part of the standard mythology of the famous rabbis experiment, and has been repeated many times in books and articles.
More recently, after our criticism became known, Witztum has tended to be more careful in his claims. Now he usually says that the permutation test was "agreed to" by Diaconis, instead of that it was "designed" by Diaconis. As we shall see, even that weaker version is incorrect.
As Witztum acknowledges, the permutation test grew out of a long correspondence between three mathematics professors. Persi Diaconis was acting as a statistical advisor and, later, was a referee of an early version of WRR's paper. David Kazhdan and Robert Aumann were acting on behalf of WRR. Aumann was also the "communicator" of WRR's paper to the Proceedings of the National Academy of Sciences of the USA.
Recall that the data for the famous rabbis experiment involved 32 rabbis. For each rabbi, there was a list of names and a list of dates. The permutation test works like this:
The issue here is the definition of "average closeness". While seemingly unimportant, different but equally feasible definitions can produce results that differ by factors of hundreds or thousands. The documents refer to two types of average closeness measures:
Type A is what WRR used in their Statistical Science paper. However, type D is what Diaconis actually recommended, as the following documents show. Copies of the letters were kindly provided to us by Eliyahu Rips and Robert Aumann. Our interpretation of the documents was greatly assisted by discussions with Persi Diaconis, Robert Aumann, and Eliyahu Rips.
The pattern is very clear. Without exception, Diaconis described type D measures and Aumann described type A measures. This continued even after Diaconis reported "agreement". Neither person acknowledged the difference between them. It was almost as though neither person was reading the letters written by the other.
In summary, Diaconis designed a permutation test using a type D measure, and thought Aumann had agreed to it. Aumann argued for a permutation test that WRR had proposed, using a type A measure, and thought Diaconis had agreed to it. In fact, there was no agreement.
Both Diaconis and Aumann were surprised when one of us (McKay) reported the inconsistencies to them in 1997, and both now agree with us that their "agreement" had been an illusion. In an email of Sep 5, 1999, Aumann wrote "Of course, by now I realize that there was a misunderstanding between Persi and the authors (and me) about the form of the test."
The effect of this misunderstanding was profound. Aumann and Diaconis had agreed that a significance level of 1/1000 was a reasonable criterion for success. When WRR applied the permutation test they had designed themselves, they met that target easily. If they had performed the test that Diaconis wanted, on the same data, they would have missed it. The failure of the experiment to pass the 1/1000 threshold would have greatly reduced its prospects of ever being published in a scientific journal.
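As an illustration only, the following Python sketch shows the general shape of a column-permutation test on a 32 x 32 matrix of aggregate distances, as in the design quoted earlier from the paper. The function names, the synthetic matrix, and the use of a plain sum over the matched pairs as the statistic are assumptions made for this example. It is not the code used by WRR or specified by Diaconis, and it does not define either type of average closeness measure.

```python
import random


def diagonal_sum(dist, perm):
    """Total distance when rabbi i's names are paired with rabbi perm[i]'s dates."""
    return sum(dist[i][perm[i]] for i in range(len(perm)))


def permutation_p_value(dist, n_perms=999, seed=0):
    """Estimate how often a random pairing is at least as close as the true one.

    dist is a square matrix (list of lists); dist[i][j] is an assumed single
    aggregate distance between the names of rabbi i and the dates of rabbi j.
    Smaller values mean closer. The true pairing is the identity permutation.
    """
    rng = random.Random(seed)
    n = len(dist)
    identity = list(range(n))
    observed = diagonal_sum(dist, identity)

    # Count the true pairing itself, so the estimate can never be zero.
    at_least_as_close = 1
    for _ in range(n_perms):
        perm = identity[:]
        rng.shuffle(perm)  # a uniformly random reassignment of date lists
        if diagonal_sum(dist, perm) <= observed:
            at_least_as_close += 1
    return at_least_as_close / (n_perms + 1)


if __name__ == "__main__":
    n = 32
    # Synthetic data (hypothetical): correct pairs at distance 0, all others at 1.
    matched = [[0.0 if i == j else 1.0 for j in range(n)] for i in range(n)]
    print(permutation_p_value(matched, n_perms=999))
```

With 999 random permutations the smallest value this estimate can take is 1/1000, so a run of that size can only just reach the significance level mentioned above. The statistic inside `diagonal_sum` is the part where the choice of measure enters; swapping in a different definition of closeness changes the result without changing the permutation machinery around it.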
Finally, we make two remarks relevant to Witztum's claims.
[Rips, email to McKay, Apr 4, 1997]:
I already had in mind to draw your attention to the fact that in letter 3 (from May 7, 1990) section 2 [and in letter 5 (from September 5, 1990)] [compare also with letter 1 (from December 30, 1986), the end of page 1] the statistic T has NOTHING TO DO with what we have actually computed. I do not know what is d(x_i,y_i) and why should their sum be taken.
Witztum's claims are without foundation. The account of the permutation test in our paper is supported both by the historical record and by the two major players in it.
On Nov 5, 1999, Witztum posted a reply to the article above. We have studied it carefully without finding any successful refutations except of Witztum's straw men. We stand by both our published paper and our article above and do not see the need to revise either.
Creator: Brendan McKay, firstname.lastname@example.org.