Brief notes on Gans' Primer
Brendan McKay, Australian National University
Witztum, Rips and Rosenberg (WRR) claimed to prove the existence of a hidden code in the Bible by an experiment that involved the names and appellations of famous rabbis, together with their dates of birth or death. In reply, McKay, Bar-Natan, Bar-Hillel and Kalai (MBBK) published a detailed analysis in the same journal. The conclusions reached by MBBK were that WRR's evidence is fatally defective, and indeed that some of it was most likely obtained by faulty procedures.
A multi-part reply by Witztum has been published on his web page, and we have responded to the parts worth answering. Apart from Witztum's reply, the most comprehensive attack on our work was the "Primer" published on the internet by Harold Gans, a former employee of the United States National Security Agency and a "codes expert" with the pro-codes organization Aish HaTorah.
Many of Gans' arguments have already been answered in the course of replying to Witztum. In this document we make some further replies. To help the reader understand the context, we quote all of the html version of Gans' work other than Haralick's preface and the Appendices. However, we transliterate Hebrew letters into Michigan-Claremont. In the original, all of Gans' text appears in blue and our replies in red; here Gans' text is introduced by "[Gans:]" and our replies by "[Comment:]". Formatting errors in Gans' text are preserved from the original.
It would considerably help in understanding the following to have read our published paper on the codes. Although Gans claims to summarise it, critical parts of our argument are absent or weakened in his summary.
Note that our failure to reply to any particular point does not signal that we accept it or have no reply. In most cases we have simply chosen not to reply, or to defer a reply until time permits.
[Gans:] Introduction
In August 1994 the peer reviewed journal "Statistical Science" published a paper entitled "Equidistant Letter Sequences in the Book of Genesis" by Doron Witztum, Eliyahu Rips, and Yoav Rosenberg. This paper reported the first scientific Torah codes experiment. As is well known, critics have challenged the validity and integrity of the scientific experiments that have shown the Torah codes phenomenon to be real. Some of the critics are technically qualified, and the challenges are often sophisticated. These attacks have been relentless, appearing in the media, and published widely in magazines and on the Internet. In addition, a peer-reviewed article by McKay, Bar-Natan, Bar Hillel, and Kalai (henceforth referred to as "MBBK"), "Solving the Bible Code Puzzle", purporting to prove that the Torah codes experiments are fatally flawed has been published in the May 1999 issue of Statistical Science (printed in September, 1999). We provide a concise summary of the major issues and their resolution in this paper. Note that one need not have any mathematical or scientific background, nor a background in Hebrew to understand what follows. One needs only to be able to follow logical reasoning and have patience.
Basics
Note: Some of the basics described below are themselves part of the controversy. These points will be discussed as appropriate later in this paper.
Doron Witztum and Professor Eliyahu Rips performed the first scientific Torah codes experiment in the 1980's. This experiment has five components, viz.:
1. The Hebrew text of the book of Genesis.
2. A list of famous Rabbinical personalities and their appellations (e.g., "Rambam" is an appellation of Rabbi Moshe Ben Maimon; "Hagaon MeVilna" and "HaGra" are appellations of Rabbi Eliyahu Ben Shlomo Zalman of Vilna, etc.).
3. A matching list of Hebrew dates of birth and death (month and day) paired with each personality (i.e., with each appellation of each personality).
4. A mathematical formula, which provides a measure of proximity between equidistant letter sequence (ELS, plural: ELSs) encodings of the appellations and ELS encodings of their corresponding dates of birth and death.
5. A mathematical technique that calculates the probability of obtaining the set of proximities obtained by the formula in number 4 above just "by chance". As required by one of the peer review referees, Professor Persi Diaconis, a probability against chance of 1/1,000 or smaller is considered a success. That is to say, suppose we hypothesize that the Torah codes do not exist. We then calculate a proximity measure. We now calculate the odds against the proximity measure obtained being at least as strong as it is. If these odds are 1,000 to 1 or greater, we have two possibilities: (a) An event with 1,000 to 1 odds just happened by chance, or (b) our initial hypothesis that the Torah codes do not exist must be wrong. Normal scientific procedure is to reject the possibility of an event with such strong odds happening by chance. Such an event is so highly improbable that declaring it to be nothing more than chance verges on the absurd. In fact, the normal scientific standard for such a rejection is 20 to 1.
[Comment:] The claims here are incorrect, as we show below.
[Gans:] Thus, we pick possibility (b), namely, Torah codes do exist and they cause the observed proximity measure to be as strong as observed. That is, we accept the statistical evidence as demonstrating that the phenomenon is real.
[Comment:] This is bad statistics. Possibility (b), technically known as the "alternative hypothesis", is only accepted in a formal sense. What really happens is that possibility (a) is rejected. In fact we agree in rejecting possibility (a) and only disagree on what the alternative is.
[Gans:] The Torah codes phenomenon, as discovered by Witztum and Rips, can be described as follows. Given an appellation of a famous rabbinical personality and his Hebrew date of birth or death, we search for each as an ELS in the Hebrew text of Genesis. If we find a minimum of one ELS of the appellation and a minimum of one for the date, then (some of) the ELS(s) of the appellation tend to be in closer proximity to (some of) the ELS(s) of the date than expected by chance.
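[Comment:] Since everything that follows turns on it, it may help to see what "searching for a word as an ELS" amounts to. Here is a minimal sketch (ours, purely illustrative; it is not WRR's software): an ELS of a word is an occurrence of its letters at positions n, n+d, n+2d, and so on, for some start n and skip d. WRR also allowed negative skips, and of course searched strings of Hebrew letters; the toy example below uses English.

```python
def find_els(text, word, max_skip):
    """Return (start, skip) pairs where `word` occurs as an equidistant
    letter sequence in `text`, for positive skips up to max_skip."""
    hits = []
    for skip in range(1, max_skip + 1):
        span = (len(word) - 1) * skip          # distance from first to last letter
        for start in range(len(text) - span):
            if all(text[start + i * skip] == word[i] for i in range(len(word))):
                hits.append((start, skip))
    return hits

print(find_els("thexrxaxixnxfell", "rain", 5))   # [(4, 2)]: r,a,i,n at skip 2
```

[Gans:]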
Witztum and Rips prepared a list of 34 personalities, called "List 1", for their first experiment in the mid 1980's. These personalities were selected by a simple criterion, viz.: those that have at least 3 columns of text in their entry in the "Encyclopedia of Great Men in Israel; a Bibliographical Dictionary of Jewish Sages and Scholars from the 9th to the End of the 18th Century" edited by Margalioth and published in 1961. The list was then given to Professor S. Z. Havlin, of the Department of Bibliography and Librarianship at Bar Ilan University. Professor Havlin devised a set of historical / linguistic rules with which to determine the correct appellations associated with each personality in the list. In certain situations, his "professional judgement" was needed in addition to the rules. Professor Havlin thus produced a set of appellations, using his rules, which by his own testimony and that of Witztum and Rips, was produced with absolutely no feedback or coaching from Witztum or Rips. According to each of them, it was totally Havlin's work and completely independent of Witztum and Rips save for the latter providing the original list of personalities.
[Comment:] Of course, this is the "official history" of the experiment for which there is no documentary evidence (or even a claim) dating to the period of the experiment itself. The preprints of WRR's work merely cite Havlin for "valuable advices" and the earlier lecture of Rips described appellation collection differently. As far as we can tell, the first claim that Havlin prepared the data by himself occurred in the draft of WRR's paper written by Robert Aumann.
[Gans:] Witztum and Rips paired each of Havlin's appellations with each of three standard forms of the corresponding dates of birth and death (e.g. for the 13th of Adar the three forms are: YG ADR, YG BADR, BYG ADR) (for the 15th and 16th of the month there are 6 forms). This produced slightly under 300 pairs of appellations and dates. Using software developed by Yoav Rosenberg, they searched for each appellation and date in the list as an ELS code in the standard Koren edition of the book of Genesis. They next applied their proximity formula to those 157 pairs of ELSs found (not all names or dates were found as ELSs in Genesis) and obtained results that appeared significant. (This appearance of significance was based on the erroneous perception that the proximity formula produced true probabilities.) However, at that time, a valid technique for assessing the probability (odds) of the result, the necessary 5th step, had not yet been developed.
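[Comment:] For concreteness, the "three standard forms" are simple to generate. A small sketch (ours; the letters are in Michigan-Claremont transliteration, and the spaces are only for readability, since an ELS search uses the letters alone):

```python
def date_forms(day, month):
    """The three standard date forms, e.g. 13 Adar -> YG ADR, YG BADR, BYG ADR.
    The prefix letter B corresponds to "on the"."""
    return [f"{day} {month}", f"{day} B{month}", f"B{day} {month}"]

print(date_forms("YG", "ADR"))   # ['YG ADR', 'YG BADR', 'BYG ADR']
# The 15th and 16th of a month have two spellings of the day number,
# which is why those dates have six forms rather than three.
```

[Gans:]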
The results were sent to Professor Persi Diaconis (through Professor David Kazhdan) who asked that a new list of personalities, List 2, be prepared from the same encyclopedia and the test redone on the new data.
[Comment:] This is not correct. Diaconis said nothing about using the same encyclopedia. He also specified a different way of testing the data, using a "cyclic shift randomly chosen". WRR's data does not pass that test, but if they ever tried it, they didn't say so.
[Gans:] There were several possible reasons for this request: (a) If the result was valid, then a similar result would be expected on the new set. On the other hand, if the result on List 1 was really not significant but was just an anomaly, there would be no reason to expect to obtain another apparently significant result on List 2. (b) There are many possible variations of the mathematical formula used to measure proximity. The possibility that many formulas were tried could not be ruled out. Clearly, the measure of significance depends strongly on how many formulas were tried; success in obtaining "interesting" results after trying many formulas is not nearly as significant as it would be had the same result been obtained by use of a single formula specified in advance (i.e., "a priori"). In fact, it is even plausible that the formula was purposefully constructed so as to give interesting-appearing results on List 1 (this would be "a posteriori", the opposite of "a priori"). Since the same formula would now be used on List 2 without any modification permitted, a success on List 2 could not possibly be attributed to any "fine-tuning" (or "wiggle room") of the formula to ensure a "success". (c) Although 3 common date forms were used in List 1, there are other (not so common) date forms. There are also several different versions of the Hebrew text of Genesis. Once again, since precisely the same date forms and Hebrew text of Genesis used in List 1 would be used for List 2, a success on List 2 could not possibly be attributed to the flexibility in choosing the date forms or the text. We shall henceforth refer to these properties as "list 1 - fixed". That is, any detail of the experiment on list 1 that was not changed before being applied to list 2 is "list 1 - fixed". A "list 1 - fixed" property or specification is thus not subject to "tuning" or manipulation on list 2 or any subsequent experiment.
List 2 was formed by selecting those personalities that have at least one and a half but less than three columns of text in their entry in the Margalioth encyclopedia. This produced a list of 32 personalities. Once again, Professor Havlin independently produced the corresponding appellations, and the final list consisted of slightly more than 160 appellation - date pairs found as ELSs. The proximity measure was applied to the els pairs found in exactly the same way as had been done with List 1. The results once again appeared significant.
In a September 5, 1990 letter addressed to Professor Robert Aumann, Professor Persi Diaconis, one of the referees for the peer review journal "Proceedings of the National Academy of Sciences", (PNAS), details a mathematical technique for measuring the probability of obtaining the set of proximities just "by chance" - that is, assuming that there is really no Torah code phenomenon. Some details that were not included in this letter (or were ambiguous) were included in a September 7, 1990 response from Professor Aumann. A copy of this letter was "given to Persi by hand in Sequioa hall, September 9, 1990, 2:50 PM. He looked it over and approved" according to a handwritten addendum on a copy of the letter, by Professor Aumann. Copies of these letters are included in appendix C.
[Comment:] Note that we dispute this description; see below.
[Gans:] This was the necessary 5th component of the experiment. Professor Diaconis set the publication threshold for this probability at 1/1,000, well beyond the normal scientific standard of 1/20. That is, if the odds against getting such results (so many close proximities), given no Torah codes phenomenon was equal to or exceeded 1,000 to 1, the referees for PNAS would consider the result valid enough so as to recommend publication. The test was applied (to List 2 only - by Diaconis' request) and the resulting probability obtained was 1/62,500, or more than 62 times better than the agreed upon threshold for publication.
[Comment:] The assertion that 1/20 is the normal scientific standard is quite incorrect. It is a common standard for simple statistical experiments of an everyday nature, but for experiments of great importance (and who could say that discovering hidden messages in the Torah would not be important?), far stronger requirements are set.
An article on this subject can be found in Volume 289 of the journal Science (29 Sep 2000, pages 2260-2262). For example, important discoveries such as that of a planet around a distant star, or a new elementary particle, require a "five-sigma result". This corresponds to roughly 1 in 3 million. More interestingly, the article reports that "an astounding number" of five-sigma results, and even far stronger results, have turned out to be false alarms. This situation does not surprise statisticians, who know that it merely reflects the fact that calculations of probabilities in statistical experiments rely on mathematical models that never correspond exactly with reality. When a strong result is obtained, the truth of the research hypothesis is only one of the possible explanations, and it is incumbent on the scientist to search for alternatives. As stated by Nobel Prize winner Val Fitch, the statistical analysis "is based upon the assumption that you know everything and that everything is behaving as it should. But after everything you think of, there can be things you don't think of. A five-sigma discovery is only five sigma if you properly account for [systematic sources of error]."
[Gans:] In spite of this, the referees decided not to recommend publication of the paper (for reasons related to the "scientific interest" of the result, rather than to the validity of the work).
The paper was subsequently submitted to the prestigious peer reviewed journal "Statistical Science". Several additional tests were required to ensure that the experiment did not produce apparently significant results spuriously. For example, the same test was done using a Hebrew translation of Tolstoy's "War and Peace" (the first 78,064 letters - the same as the length of Genesis). Had "significant" results been obtained here too (that is, a small probability, even if it were not as small as 1/1,000) this would imply that there is something very wrong with the methodology of the experiment. After all, no one claims that there are codes in "War and Peace"! This type of test is called a "control experiment". Other control experiments involved randomly mixing up the verses, or the words within verses, of the Hebrew text of Genesis. After successfully completing several control experiments (i.e., they produced random - looking results as they should), the paper was published by Statistical Science as "Equidistant Letter Sequences in the Book of Genesis" by Doron Witztum, Eliyahu Rips, and Yoav Rosenberg. This paper has come to be known as "WRR". It was August 1994.
Synopsis of the WRR experimental results
We present a few details of the experimental results obtained by WRR on list 1 and on list 2, as these details will be relevant later. An individual proximity measure, c(w, w'), was calculated for each pair of words w and w' in each list. c(w, w') ranges from near 0 (actually, 1/125) to 1, and the smaller it is, the stronger the proximity between the ELSs of the paired words. Thus, 1/125 is the strongest individual proximity measure, and 1 is the weakest. An overall proximity measure can then be calculated for the entire list. For list 1, two such overall proximity measures were calculated. The first of these (called "P1") is given as a "sigma" value. The larger this number, the stronger the overall proximity. This measure is based on how many of the individual proximity measures, c(w, w'), are less than or equal to 0.2 (recall, individual proximity measures are stronger when the c(w, w') values are smaller). Since we will refer to this quantity (here, 0.2) several times in this paper, we shall call it the "P1 bunching threshold". That is, P1 is a function of the number of c(w, w') that bunch below this threshold. The second overall proximity measure, called P2, is based on all the c values, and the smaller it is, the stronger it is (opposite to that of P1). The same two overall proximity measures were used for list 2. In addition, each was modified slightly to produce two more overall proximity measures; the 3rd (P3) being similar to the 1st, and the 4th (P4) being similar to the 2nd. A description of how P3 and P4 are defined will be given later.
Proximity measures
Name                           Description                   Remark
Individual measure c(w, w')    One c value per word pair     Smaller values are stronger
1st Overall measure P1         Based on c values <= 0.2      Bigger values are stronger
2nd Overall measure P2         Based on all c values         Smaller values are stronger
3rd Overall measure P3         Similar to P1                 Bigger values are stronger
4th Overall measure P4         Similar to P2                 Smaller values are stronger
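[Comment:] For orientation, here is a rough sketch (ours, deliberately simplified; the definitions in WRR's paper contain details omitted here) of statistics of this general shape. Under the null hypothesis the c-values are treated as though they were independent uniform variables on [0,1]; a P1-style score measures the excess of c-values below a threshold, and a P2-style score measures how small the product of all the c-values is.

```python
import math

def p1_sigma(cs, threshold=0.2):
    """Sigma score for the count of c-values <= threshold, via the normal
    approximation to Binomial(N, threshold). Bigger is stronger."""
    n = len(cs)
    k = sum(c <= threshold for c in cs)
    return (k - n * threshold) / math.sqrt(n * threshold * (1 - threshold))

def p2(cs):
    """Probability that a product of N independent uniforms is <= prod(cs),
    namely x * sum_{j<N} (-ln x)^j / j! with x = prod(cs). Smaller is stronger."""
    x = math.prod(cs)
    return x * sum((-math.log(x)) ** j / math.factorial(j) for j in range(len(cs)))
```

Note that the bunching threshold (0.2 here) is a free parameter of a P1-style score; its role returns below in the discussion of "tuning". As Gans describes next, the four list 2 probabilities were combined by taking the smallest and multiplying by 4.

[Gans:]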
For list 2, each of the four overall proximity measures is converted into a true probability, the smallest (most significant) being 4/1,000,000. We will describe this process later in the paper. Finally, the four probabilities are converted into one overall probability by simply multiplying the smallest of the four by 4. Thus, the overall probability for list 2 is 4x(4/1,000,000) = 16/1,000,000 = 1/62,500. The following table gives the values obtained for list 1 and for list 2. Note that small numbers are given in scientific notation: 1.29E-9 is the same as 0.00000000129, 1.15E-9 is the same as 0.00000000115, and 7.20E-9 is the same as 0.00000000720. The symbol "s" indicates a "sigma" value.
Experimental results
                         List 1       List 2
1st overall proximity    6.61s        6.15s
2nd overall proximity    1.29E-9      1.15E-9
3rd overall proximity    ---          5.52s
4th overall proximity    ---          7.20E-9
1st probability          ---          453/1,000,000
2nd probability          ---          5/1,000,000
3rd probability          ---          570/1,000,000
4th probability          ---          4/1,000,000
overall probability      ---          16/1,000,000 = 1/62,500
Non scientific challenges
The validity of each of the 5 components of the WRR experiment has been challenged. In addition, questions have been raised concerning a tradition for the existence of ELS codes in the Torah, and the appropriateness of teaching codes, particularly as part of an outreach program. Concerning these non scientific issues, suffice it to say that several leading Gedolim (Torah sages) have given their written approbations stating that there is a tradition of Torah codes, as well as the appropriateness of teaching Torah codes for outreach. In a 1989 psak din (judgement), Rav Shlomo Fisher stated "It is a mitzvah to teach codes". Although some critics claim that in private conversations since then, Rav Fisher has indicated a retraction of that position, there is nothing in writing to lend credence to such a claim. On the contrary, in 1997, after discussing the issues at length with a number of critics, as well as with Doron Witztum, Rav Fisher wrote another letter of support and further indicated that he, together with Doron Witztum, had gone "a number of times in front of Rav [Shlomo Zalman] Auerbach, Zt'l, and he also backed this without reservation". In 1998 more letters of approbation were obtained from Rav Shmuel Deutch, Rav Shlomo Fisher, Rav Shmuel Auerbach (son of Rav Shlomo Zalman, Zt'l), Rav Shlomo Wolbe, and Rav Mattisyahu Salomon. In particular, Rav Wolbe states, "It is known that a way exists to discover hints and matters from the Torah by reading letters at equidistant intervals. This method is found in the commentary of Rabeinu Bachya on the Torah and in the works of Rav Moshe Cordovero". In 1999, Rav Shmuel Kamenetsky wrote a strong letter of support for Torah codes subsequent to his having had an extended conversation with some of the critics followed by a discussion with Harold Gans. Rabbi Moshe Heinemann has also stated that teaching Torah codes is a kiddush Hashem. He has further indicated that anyone who doubts this or challenges it should feel free to call him directly and discuss it. Anyone concerned with these issues should read the letters of these Gedolim. An English translation of the full text of these letters can be found in appendix B. The remainder of this paper will deal with the technical challenges to WRR.
[Comment:] We are only concerned with the question as a scientific one, and will not reply to these comments.
[Gans:]
Challenges to the text of Genesis
The Hebrew text of Genesis used for the WRR experiment (and indeed for all subsequent experiments unless specifically noted otherwise) is the Koren edition. This is the text used by Jewish communities all over the world (except the Yemenite Jews). (Incidentally, the experiment also produces statistically significant results on the Yemenite edition of Genesis, albeit somewhat weaker.) The fact that the worldwide standard text was used rules out any concern that the choice of the text was not a priori, i.e., that many different manuscripts or versions were used and Koren simply worked best. Note too that the text used was "list 1 - fixed" as explained previously, and therefore cannot be an issue for the experiment on list 2. In fact, the challenge comes from a different direction. There are "biblical scholars" who claim that the standard Koren edition of the Torah has a plethora of incorrect or missing letters. Of course, these views are supported by interpretations of statements in the Talmud or elsewhere. Now, since a single letter missed or added to the text will destroy an ELS code, even if there had been ELS codes in the Torah at some point in time, none could possibly have survived to be found in the Torah available to us today. Therefore, claims concerning the discovery of ELS codes in the Torah must be wrong.
Resolution
It must first be noted that the question posed is related to the issue of Torah codes, but cannot be a challenge to them. The evidence for Torah codes is obtained from analysis of the text of Genesis available to us today. No claim is made concerning how the codes got there or what the text looked like in the past. Hence, at most, the "challenge" simply asks the question, "How could there be Torah codes in the text if there is evidence that the text has been corrupted?" This question is an important one but it does not challenge the existence of the codes. One can surely find many possible answers to this question. We will show, however, that the basic premise of the question is likely to be incorrect.
Let us first consider the effect that a dropped or added letter in the text has on an ELS in the text. It is true that a single added or dropped letter will destroy an ELS - provided that the error is somewhere between the first and last letters of that ELS. It does so by making the distance between the two letters flanking the error unequal to the remaining equidistances (+1 for an added letter, -1 for a deleted letter). Clearly, an error that is outside the span of an ELS can have no effect on that ELS. Furthermore, the smaller the distances between the letters of an ELS, the less likely it is that an error will occur in its span - and the WRR experiment assigns more weight to ELSs of relatively small span as opposed to those of larger span.
[Comment:] This is true, but goes against WRR as our paper demonstrates. The first list of rabbis fails to "work" if only short ELSs are used. It relies on ELSs spanning more than 2000 letters, which are just those forming large targets for textual corruption.
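The mechanics are easy to demonstrate. In this toy sketch (ours), inserting one letter shifts every later position by one, destroying an ELS whose span covers the insertion point but leaving an ELS outside that span intact:

```python
def has_els(text, word, start, skip):
    """Check whether `word` occurs as an ELS at the given start and skip."""
    return all(start + i * skip < len(text) and text[start + i * skip] == word[i]
               for i in range(len(word)))

text = "thexrxaxixnxfell"                # "rain" as an ELS: start 4, skip 2
print(has_els(text, "rain", 4, 2))       # True
inside = text[:7] + "Z" + text[7:]       # insert one letter inside the span
print(has_els(inside, "rain", 4, 2))     # False: spacings past index 7 are off by one
outside = text[:13] + "Z" + text[13:]    # insert one letter after the span
print(has_els(outside, "rain", 4, 2))    # True: the ELS survives
```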
[Gans:] Furthermore, several ELSs in the text usually represent each appellation or date. What this means is that a few errors are unlikely to destroy enough ELSs to render the entire WRR experiment no longer significant. Experimentation bears this out. On the other hand, hundreds of errors in the text of Genesis surely would destroy any evidence of codes. We can now rephrase the critic's argument more accurately, as follows: (1) A large number of errors in the text of Genesis would destroy any Torah codes phenomenon. (2) There are authoritative opinions that there are thousands of letters incorrect or missing in the Torah. Therefore, codes cannot exist in the Torah we have today.
[Comment:] This is misleading, since far fewer than "thousands" of letters in error are required to destroy WRR's experiments. In fact, about 10 errors in random places destroy their first experiment and 50 destroy the second.
[Gans:] All we need to do now is to reverse the argument: WRR provides very strong evidence that there are codes in the Torah. Hence, either (1) or (2) above must be incorrect. Since we know (1) to be true, the problem lies with (2). That is, there cannot be large numbers of errors in the text of Genesis. The "authorities" quoted by the critics must be wrong. Since the opinion held by these "authorities" is by no means the only authoritative opinion on that subject, this is certainly a viable and reasonable alternative. MBBK admit that there is a variance of opinion on the accuracy of Genesis: "The amount of variation that has already occurred during the many preceding centuries since Genesis was written is a matter of scholarly speculation..." (MBBK, page 165). In fact, some authorities believe that there is no more than 1 error in the entire Torah.
[Comment:] We should note that the "authorities" Gans disparages with quotation marks include some of the foremost Bible scholars in the world today, and are universally accepted as such by the world of scientific Bible study. They include the chair of the Bible Department at Bar-Ilan University and the Editor-in-Chief of the official Dead Sea Scrolls publication project. They gave their time generously to us and approved what we reported of their opinions. Mr Gans prefers the opinion of a limited number of religious leaders who disagree with the scientists. That is his right, but it might have been a good idea to inform his readers of his preferences. Some scholarly articles on this subject appear here.
Incidentally, we did not write "that has already occurred", but "that had already occurred", referring to the centuries before the first millennium. Scientific opinion about the later period is much less variable.
[Gans:] We certainly cannot resolve this issue here, but the existence of Torah codes implies that those who believe that there are very few errors in Genesis are closer to the truth than those who believe that there are many errors. In any case, one can certainly not disprove Torah codes by assuming a position that is the subject of a heated debate by the experts in that field. For more details on the historical evidence for textual integrity, please refer to the excellent article by Rabbi Dovid Lichtman, included as appendix A.
It is interesting to note that this particular challenge concerning the accuracy of the text of Genesis has implications to another question posed by the critics. They point out that ELSs for some appellations do not appear at all in Genesis. Furthermore, even when they do, they are not always in close proximity to an ELS of the matching date. Sometimes they are even in closer proximity to an ELS of an incorrect date! How valid could the claim of a close proximity Torah code then be?
The question posed reveals a misunderstanding of the WRR experiment; once one understands the experiment the apparent difficulty vanishes. WRR makes no claim that all appellations or dates are found as ELSs in Genesis. Nor is there a claim that every appellation and matching date will have ELSs in close proximity even if they do have ELSs in Genesis.
[Comment:] It is Gans who is misunderstanding here, as we have never made the claims about WRR's hypothesis that he says we are making. Rather, we are taking care to dispel one of the most common myths about WRR's experiment, spread by writers such as Jeffrey Satinover. Indeed, Gans' words are further demonstration of why our observations are necessary. It is not "sometimes" that appellations are closer to an ELS of an incorrect date, it is almost always.
[Gans:] WRR simply makes the following claim: If one searches for ELSs for each appellation in the list, and for the corresponding dates, and measures the proximity between the ELSs found for the appellations, and the ELSs found for the matching dates, then too many of them are in close proximity to each other to ascribe to "chance". This claim is not at all in conflict with the critic's observation that not all ELSs are found, and not all are in close proximity. Consider the following example. Given any English text that contains the words "umbrella" and "rain", we would surely expect these words to appear in close proximity more often than expected for unrelated words. We might even be able to measure this effect statistically. But no one would claim that every appearance of "rain" must be close to an appearance of "umbrella"! Still, one might pose the critic's question as a philosophical question rather than as a challenge. There are several possible answers. Perhaps the answer most likely to be correct is that we have not yet perfected our techniques and understanding of the codes phenomenon to the point where we can be assured of detecting everything that is encoded. The research into the Torah codes phenomenon is in its infancy. It is interesting to note, however, that one possible answer is that some ELSs may have been lost due to a small number of errors in the text. Thus, the critics' question is answered by their own challenge!
Hidden failures
We will now discuss an important concept known as "hidden failures". Suppose one performs an experiment, and one obtains a probability of 1/10,000. This means that the result obtained is unlikely to have happened by chance. Specifically, it means that the expectation is that one would have had to perform 10,000 different experiments before such a result might happen by chance alone. If one performs an experiment and obtains such a significance level, one is justified in concluding that it did not happen by chance. Rather, something caused this unusual result to happen, e.g., the existence of a code. Suppose, however, one did 10,000 experiments and one of them yielded a significance level (probability) of 1/10,000. Such a result is quite expected and we cannot conclude that anything (e.g., a code) more than chance is operating here.
Consider a simple example. Suppose you think of a random number between 1 and 100 and challenge a friend to guess the number. If he succeeds on the first try, then it is startling because by chance alone the probability of him doing so is 1/100. Suppose he guesses the number on the second try. This is still interesting but not quite as startling as guessing it on the first try. We can actually calculate the significance (mathematically, the "expectation") by simply multiplying the probability of success with one guess by the number of guesses. In this case it is 2x(1/100) = 1/50. Suppose he takes 50 different guesses? Then his probability of success is 50x(1/100) = ½ and this is not significant at all! If he takes 100 different guesses he is sure to guess the right number; his probability of success is 100x(1/100) = 1.
We see from this discussion that the number of experiments performed is just as important as the probability of success calculated because the true measure of success is the product of these two numbers. Thus, if one claims to have done an experiment and obtained a probability (of success given that exactly one experiment was performed) of 1/10,000 but has actually done, say, another 499 experiments in secret, then the experimental results have been misrepresented. He should have reported that he did 500 experiments before obtaining the 1/10,000 result, in which case the true significance of what he has found could be calculated as 500x(1/10,000) = 1/20 - still significant by the accepted scientific standard, but not nearly as significant as reported. Such an experiment is a fraud and is said to contain "hidden failures". In this case there were 499 hidden failures.
[Comment:] While Gans' analysis is correct in some strict sense, it does not address the type of prior experimentation that is important for our discussion. Gans considers the case where someone conducts many independent experiments and only reveals one of them, but the more important situation is where secret prior experiments are used to "stack the odds" for a later "official" experiment.
Suppose that there are 10 rabbis for which some choice of appellation is available, and that each choice is worth a factor of 2 in the performance of that rabbi. We can determine how to decide each choice in our favour by doing 10 experiments, one for each rabbi. Later, when we perform the official experiment with a larger list of rabbis, we make sure that the "right" choice is made in the case of the original 10 rabbis. Suppose the official experiment gets an apparent 1/10,000 result. According to Gans' discussion, we need to multiply by the total number of experiments (11) to obtain the real result: about 1/900. However, in fact the real result is close to 1/10 since each of the 10 prior experiments changed the odds by a factor of 2 and 2x2x2x2x2x2x2x2x2x2 = 1024.
We are not claiming that WRR cooked their experiment in precisely the manner just described. Our point is just that a remarkably small amount of hidden experimentation is required in order to explain an apparent very strong result. As illustrations of this process, consider the Hanukah candles fake experiment or the Son of Man fake experiment. In each case we obtained a result near 1 in a million with the help of far fewer than a million hidden experiments (depending on how they are counted, a few dozen and certainly less than 100).
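The contrast between the two accountings is plain arithmetic (illustrative numbers only, no real data):

```python
apparent = 1 / 10_000    # the reported significance level

# Gans' accounting: 10 hidden experiments plus the official one,
# so multiply by 11, as in his guessing example above.
print(apparent * 11)     # 0.0011, about 1/900

# Tuning accounting: ten hidden binary choices, each worth a factor of 2,
# inflate the apparent result by 2**10 = 1024.
print(apparent * 2**10)  # 0.1024, about 1/10
```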
[Gans:] An example of where the charge of hidden failures was made concerns a tape of a lecture given by Professor Rips in the mid 1980's, (around the time when the WRR experiment on list 1 was first performed). The existence of this tape was publicized by McKay in an Internet posting dated December 31, 1997 and entitled "New Historical Evidence Available". In this paper McKay says, " Several aspects are particularly disturbing, as they don't appear to fit in with the 'official history' of the experiment". McKay then proceeds to explain, "The lecture describes an experiment with 20 rabbis, defined by having an entry in Margalioth of at least 6 columns. We find that difficult to reconcile with the statement, repeated many times, that the first test ever done was with 34 rabbis having at least 3 columns". In other words, it would appear from this tape that at least one hidden experiment was performed and never documented. If there is one hidden failure, one may assume that there are many more besides the one that has been fortuitously uncovered.
[Comment:] This is a fair summary of our position except for the last sentence. As explained above, it is not necessary for there have been "many" prior experiments for our case. A handful is more than enough.
From this point on, Gans proceeds to quote arguments distributed by Witztum in a private forum (but leaked to us) without noting that we have already publicly replied to them. As usual, Witztum's claims are either factually false or easily refuted. For the details see here. In our opinion, there is no reason to doubt that Rips was discussing an experiment exactly as he claims to be discussing. Moreover, that experiment cannot be reconciled with WRR's official history. We are not going to reply to the details of Gans' argument because we have already replied to them and once is enough.
[Gans:] Resolution
Note first that this challenge is irrelevant to the experiment on list 2, and the published WRR result is for list 2 only.
[Comment:] In fact it is extremely relevant. WRR's story of how they conducted their experiment is a critical part of their case for the codes, because it has been thoroughly established that there were opportunities for invalid practices. To put it simply, their "success" requires us to believe them. Now there is evidence (we would say "proof") that WRR's story about their first experiment is not true. The relevance of this should be completely obvious.
[Gans:] Nevertheless, we can resolve this issue for list 1 as well. Indeed, if one sets the criterion for personality selection at 6 columns of text or more in Margalioth, one produces 20 personalities. Thus, there is evidence here that another experiment, different, but closely related to the list 1 experiment, was at least contemplated at that point in time. This, by itself, is not unusual. It would be hard to imagine that WRR performed the first experiment that came to mind. The critical question is whether or not the experiment (and perhaps others as well) was actually performed. If we examine the words of Rips on the tape, it certainly appears that such an experiment was performed. For the quotes that follow, we use McKay's posted English transcript of the tape of the original Russian lecture. Rips says, "If an article took up three or more pages, we chose it" (page 13). Note that Rips actually says "3 pages", not "6 columns" as McKay quotes him. They amount to the same, but it is easier to see someone confusing 3 pages with 3 columns than someone confusing 6 columns with 3 columns. Undoubtedly, McKay made this little change without realizing that it made any difference. Rips continues, "We had an objective list which turned out to contain some 19 - 20 names" (page 13). So the list was definitely made. Rips proceeds to talk about using appellations and dates and concludes, "Thus we have obtained a list containing some 150 queries" (page 14). But wait! There are only 102 pairs (queries) for this list of personalities. There are 152 pairs - for list 1. This is the list produced by using entries with 3 columns or more, not 3 pages or more!
[Comment:] As expanded on in our analysis of this issue, these numbers are based on the assumption that Rips had the same appellations as later published. But that is not true; he even mentions some exceptions. Similarly, the numbers below assume the program used at that time was the same as that distributed years later, but that is also demonstrably false.
[Gans:] We thus have a contradiction in the tape itself. Rips talks about 19 to 20 personalities, but the number of pairs obtained corresponds to the 34 personalities of list 1, not the list of 20 personalities. Rips then describes the result of the experiment, "Therefore, on the basis of all this, we have a significant...7 sigmas...or let us say 6 sigmas...never mind...6 sigmas is quite sufficient to state with absolute confidence that what we have here is a real phenomenon..." (page 15). Now the score quoted here, between 6 and 7 sigmas, is P1, the first of the two overall statistics calculated for list 1. The sigma value calculated for list 1 of the WRR experiment is 6.61, while the sigma value for Rips' 20 personalities is 5.74. Thus, not only does the number of pairs quoted by Rips correspond to list 1 of the WRR experiment, so does the result. The evidence thus suggests that although an experiment with 19 - 20 personalities was certainly considered, the experiment actually performed was on the 152 pairs of list 1. It is understandable that Rips might have confused "3 columns" and "3 pages" (but not 6 columns!) while giving a lecture. Incidentally, a P1 score of 5.74 sigma is extremely significant; it is hard to imagine why anyone would change the list in any way having obtained such a strong result. Perhaps it is because of these considerations that MBBK are content to simply state, "However, an early lecture of Rips (1985) described an experiment with a particular subset of '19 or 20' rabbis. Be that as it may...." (MBBK, page 153) without pursuing the implications!
[Comment:] Gans is criticising us for being too generous to WRR. He may be right.
[Gans:] One more point can be made. As indicated earlier, the overall proximity measure of 5.74 sigma is calculated in part by counting how many individual proximity measures, c(w, w'), are less than or equal to 0.2, the "P1 bunching threshold". The value 0.2 appears to be arbitrarily set. Had WRR done just a bit of experimentation, they would have discovered that setting this value to 0.5 rather than 0.2 changes the overall proximity measure from 5.74 sigma to 6.19 sigma.
[Comment:] This is one of the errors Gans has copied from Witztum. In fact 0.2 gives a stronger result than 0.5 for the list of 19-20 rabbis (assuming the appellations in use then were similar to those later published); see our analysis for details. This provides a plausible historical explanation for the choice of 0.2.
[Gans:] In terms of sigma values, 6.19 is considerably stronger than 5.74. The fact that WRR "lost" this opportunity for a more significant result will figure prominently when we discuss the MBBK paper and "tuning". As we shall see there, changing the P1 bunching threshold from 0.2 to 0.5 makes the P1 measure for the entire list 1 7,000 times stronger. Nevertheless, WRR did not avail themselves of this obvious opportunity to "tune" their experiment to increase the apparent significance of their result.
Additional experiments
Before we address other challenges to the WRR experiment, we will discuss some additional Torah codes experiments that were successful. We will only describe four of these experiments, although there are a good deal more, because three have direct implications to the validity of WRR, and the fourth is so remarkable that it would be remiss not to include it.
The cities experiment
We will first describe the "cities experiment" of Gans, and its history. In the late 1980's, Harold Gans, then a Senior Cryptologic Mathematician with the National Security Agency, US Department of Defense, was told about the Great Rabbis Experiment. Being skeptical, he requested that Witztum and Rips provide him with the Book of Genesis on a computer disk so that he could duplicate the experiment. A few months later the data was provided. Gans did not immediately rerun the experiment; he reasoned that the data would never have been provided if the experiment were fraudulent. However, in 1990 Eric Coopersmith, then head of Aish HaTorah in North America, requested that he attempt to duplicate the Great Rabbis Experiment. Gans did so, using his own programs and following the specifications of the experiment in a preprint of the WRR paper. He then conceived of a new experiment: to use the same names and appellations as in WRR's list 1 and list 2 combined, but pair them with the names of the cities of birth and death, as opposed to the dates of birth and death as in WRR. He asked Zvi Inbal, a new acquaintance and a lecturer for Arachim in Israel to provide the list of cities for the new experiment. Inbal obliged, providing Gans with the list, along with an outline of the methodology used to construct the list. The database for the list of cities was the same encyclopedia used by WRR, in addition to the Encyclopedia Hebraica. The text of Genesis, the mathematical formula for proximity, and the method of calculating statistical significance used were precisely the same as in WRR. (Actually, Gans first used a minor modification of WRR's proximity measure, but later abandoned it in favor of WRR's formula.) Gans completed the cities experiment in 1990 and documented his results in a preprint entitled "Coincidence of Equidistant Letter Sequence Pairs in the Book of Genesis". The results were even more significant than that obtained by WRR: 6/1,000,000, or about 1/166,000. The paper was submitted for publication in Statistical Science but was rejected because it was not considered "of interest to the broad audience of Statistical Science" (Letter from the editor of Statistical Science to Gans, July 25, 1995). The editor suggested that a paper "whose focus was a review of the literature on probabilistic numeralogic calculations in Biblical texts" would be of more interest!
[Comment:] The editor was definitely right on this point.
[Gans:] In 1997, critics suggested to Gans that the Inbal list of cities had been contrived to ensure an apparently significant result. They pointed out that many names in the list were spelled differently than in either Encyclopedia and that there were apparent inconsistencies in the spellings. Gans took these criticisms seriously. He announced publicly that he had started an investigation of the validity of the Inbal list and would not take an official position on its validity until the investigation was complete. (Gans continued lecturing on codes during the investigative period, but always pointed out that the experiment was under investigation. He also pointed out that nothing had yet been found to suggest that the experiment was flawed.)
The first step in the investigation was to obtain a detailed explanation of the methodology of obtaining the city names. Upon request, Inbal provided Gans with a detailed explanation of the rules used to generate the list. These rules, now referred to as the "Inbal protocol", form a complete algorithm which can be applied in a purely mechanical way to form the list from the two encyclopedias. The Inbal protocol is quite complex, enabling it to produce linguistically and historically correct data from encyclopedias that are inconsistent in spelling the names of foreign cities transliterated into Hebrew.
[Comment:] In fact the rules are so complex that Gans, presumably exaggerating for dramatic effect, once said that he "almost vomited" on seeing them.
[Gans:] Another important issue (among several) is that many of the cities had specifically Jewish names. Since WRR used Hebrew as opposed to secular dates, consistency demanded that Hebrew names be used for the cities even where the encyclopedias used secular names. Inbal also provided Gans with a detailed explanation of how each name/spelling in the list was obtained using the protocol. This list contained a handful of corrections to the original list.
The task before Gans was twofold: first, to verify that the Inbal protocol, with all its complexity, was not contrived to ensure an experimental "success", but rather was designed solely to ensure linguistic and historical accuracy. Secondly, to verify that each name/spelling on the list was obtained by a purely mechanical application of the protocol. These tasks took Gans two years to complete. First, he extracted all the rules comprising the protocol, and posed these as questions to rabbaim, dayanim (Jewish Judges), roshei kollel (deans of advanced Torah study institutions), and experts in writing gittin (Jewish divorce documents, which must include the name of the venue of the divorce properly spelled in Hebrew or Aramaic) in the US, Israel, and England. He only queried those who had no previous contact with any of the details of any Torah codes experiments. For each question posed, he received one of two answers: either they did not know, or their answer agreed with the Protocol. There was not one instance in which anyone felt that the Inbal rule was wrong! There was also no rule in the protocol that was not verified by some of the experts queried. In addition, no one felt that the protocol was incomplete and should have additional rules, with one exception. One expert suggested that a single alternate form should also be tried: the addition of the digraph "QQ" (an acronym for k'hal kadosh, meaning "holy community") as a prefix to each name. This was tried and failed. The conclusion of this part of the investigation was clear: the Inbal protocol was designed solely to ensure historical and linguistic accuracy. It was certainly not contrived simply to produce an apparent experimental success.
The next task was to verify the Inbal list. For this, Gans obtained the assistance of Nachum Bombach. Together, they applied the Inbal protocol and produced a list that they compared to that of Inbal.
[Comment:] This passage refers to Gans' own "cities" experiment, where the appellations used by WRR were matched against community names provided by a friend of Witztum named Zvi Inbal (a chemist who lectured on the codes for the organization Arachim). He is describing how he went about defending the experiment against criticism.
The amazing thing about Gans' description is how thoroughly he damns himself without any apparent inkling of what is wrong with his procedures. It appears that Mr Gans simply does not understand the issues.
Let us consider the "first task". The critics questioned the objectivity of the Inbal "rules", so Gans went around asking rabbis if those rules were correct. However, this supposes that each rule is either "correct" or "incorrect", when in fact many of them are simply arbitrary. Suppose we ask "Is it right to use only the correct names of cities?". Probably the answer will be "yes". Instead we could ask "Is it right to use common acronyms of city names?", and maybe that would get the answer "yes" also. Obviously both "yes" answers are perfectly reasonable, but a factor of 60 in Gans' result rests on the difference between them. (This factor derives from a single word, which is a good illustration of just how much can be achieved by minor adjustments of the rules.) Another example: we could ask "Is it right to spell the names of cities as they were spelt in the Rabbi's lifetime?", or instead we could ask "Is it right to use the spellings of the cities that are common in rabbinical literature?" Gans apparently got the answer "yes" to the first question, but did he ask the second one? And was he at all concerned that the answer "yes" to the first question contradicts the rule used by WRR for their appellations?
There are other examples like this too. All Gans is doing by approaching experts with Inbal's "rules" is gaining positive support for his position. He is not testing whether the rules were objectively designed. That test would require having an expert design a complete set of rules without any knowledge of Inbal's rules and without any questions about them. If Gans has done that, he hasn't said so.
The "second task" is a similar story. The statement "Together, they applied the Inbal protocol" is incredible. Since both Gans and Bombach had seen the Inbal data, it was impossible for them to compile a new set of data without being influenced by the earlier choices. This says nothing about motives, it is just a plain fact that is known to every scientist. Nor will it do to claim that compilation of the data is objective; it isn't. There are examples where a judgement is needed. Consider this (invented but representative) example: "Rabbi Gans lived in Paris. He died in 1736." Can we take it that Rabbi Gans died in Paris? Did he never travel? Another example (similar to a real one): "Rabbi Gans died near Chicago." Can we take Chicago as the place of death? And what about cases where the place of death is uncertain?
As before, the only correct way to perform this task was to engage a totally independent person to do it blindly.
Note that this correct procedure has since been followed by a committee at the Hebrew University. Two separate sets of experts compiled the data using their own judgement and no trace of codes was found.
[Gans:] They noted several differences. In two instances Inbal used other historical sources rather than the two encyclopedias because in these cases the protocol produced erroneous data. In one case it was due to a typographical error in one of the encyclopedias. In the other case, it was due to two cities having very similar names. Gans and Bombach decided to use the erroneous data rather than violate the protocol. Thus, although there are a few known errors in the list, it is produced solely by mechanical application of the Inbal protocol. There are no exceptions. The final list produced by Gans and Bombach differs from the original Inbal list in only a handful of places. A new experiment was run on this list and produced the same level of statistical significance as had been obtained on the original Inbal list: 6/1,000,000. The investigation was successfully completed, and was announced publicly in May 1999 at the International Torah Codes Society conference in Jerusalem. Gans is currently documenting this work with all its details so that anyone who wishes can easily verify both the Inbal protocol and the list.
[Comment:] Note again that this experiment has now been repeatedly replicated without any codes being seen. More on this below.
[Gans:]
A replication of the famous rabbis experiment
Doron Witztum performed two other successful experiments that we shall describe. The first is entitled "A Replication of the Second Sample of Famous Rabbinical Personalities". The idea that underlies this experiment was suggested by Alex Lubotsky, a critic, in an article entitled "Unraveling the Code" in the Israeli newspaper "Ha' aretz", September 3, 1997. In this experiment, the personalities and dates are exactly the same as in WRR's list 2. Instead of using the appellations of WRR, however, each personality was referred to as "ben name" (son of name), where "name" is the name of the personality's father. The father's names were obtained from the same encyclopedia used for the data in WRR. The spelling rules used were exactly the same as specified in writing before the experiment was performed by Havlin and used for WRR.
[Comment:] Here Gans wrote "the idea that underlies" to avoid repeating Witztum's false claim that Lubotsky actually suggested this form of "appellation", while still managing to give the readers that impression. The form that Lubotsky actually suggested (name ben name) does not appear as an ELS for a single rabbi in the list!
The last two sentences quoted above are highly misleading. The first of them will convince naive readers that the data for this new experiment came from the same place as the data for the old experiment. Wrong! The second is just as bad. Leaving aside the fact that the rules mentioned were not specified as far as we know until Havlin wrote them down 9-10 years later, they are violated anyway. Consider the example "ben Ber" which is the star performer in Witztum's list. The separation of "Ber" from the rest of the father's name is only permitted (Havlin's Rule 2d) if it is "well-known and unambiguous", but "Ber" (the Yiddish form of "Dov") is extremely common and so highly ambiguous.
Of course, these are not really appellations; if they were, they should have been included in WRR's original experiment. Nobody was ever referred to merely as "ben X". What Witztum really did was to test whether one attribute of the rabbis (their dates of birth and death) appeared close to another attribute (their fathers' first names). We also did such experiments years ago. For each of the rabbis we had their dates of birth and death, their cities of birth and death (according to the Gans preprint), the years of their birth and death, and the names of their famous books. This gives six pairs of attributes, which we tested for each of the two lists of rabbis. None of the 12 experiments achieved better than a 1/6 result (totally negative).
We can also question Witztum's integrity in his presentation of this experiment. Why did he not remind us that he had experimented with the forms ben name before? Two examples of this form appear in his book, as well as one example of the form B"R name which his "new" experiment uses in expanded form. In other words, Witztum's presentation of this as an a priori experiment was nothing but a deception.
Anyway, the success of Witztum's ben X experiment vanishes as soon as we compile the data according to Witztum's usual practice (using multiple sources) and correct the dates that are known to be wrong. Also see below for Gans' own unintended argument against this "experiment".
[Gans:] (In fact, the spellings are identical to the Margalioth entries with only two exceptions). No other appellations were used. Once again, the text of Genesis and all the mathematical components of the experiment were precisely the same as in WRR. The significance level obtained was 1/23,800 - an unqualified success. Witztum also tried using appellations of the form "ben rabbi name", but the result was not significant. It is also worth noting that the same two experiments were tried on list 1. The "ben rabbi name" experiment had a significance level of 0.0344. This difference in significance level and form between list 1 and list 2 is an unexplained curiosity, but does not detract from the unambiguous success obtained with list 2. Recall that WRR reported its success on list 2, not list 1.
[Comment:] Recall further that WRR claimed success for both lists but this one only "works" for the list of less famous rabbis. It is a (non-fatal) inconsistency in the nature of the claimed phenomenon. The real reason of course is that there is not enough flexibility in this data to allow Witztum to adjust it for success in both lists simultaneously.
[Gans:]
Personalities of Genesis
The second additional experiment performed by Witztum that we will discuss is entitled "Personalities of Genesis and their Dates of Birth". Witztum found that dates of birth for some personalities in Genesis are given explicitly in several Talmudic period sources. Using the CD-ROM of the Bar Ilan Responsa Database, he searched all the sources to find the one source that gave the largest number. (Recall that data size is often a critical element in trying to detect statistical significance.)
[Comment:] This is just a Witztum-esque excuse for using the source that gave the best results. If Witztum really wanted to follow his earlier procedures, he would have eliminated all the dates which are not given consistently between sources, just as he eliminated dates from the Great Rabbis experiment if they were "subject to dispute".
[Gans:] The Yalkut Shimoni provided the largest sample: 13 personalities (Adam, Yitzchok, and 11 of Yaakov's sons). Upon examining a critical edition of the Yalkut Shimoni, Witztum found that all 12 sons of Yaakov were included, making the final data size 14. An experiment similar to WRR, measuring proximity between personalities and dates of birth as encoded in Genesis was performed. The date forms used were exactly the same as in WRR. The personality names were spelled exactly as found in Genesis. The assessment of statistical significance was also done exactly as in WRR. However, the formula for measuring proximity was slightly different from the WRR technique in one point: the skips of the ELSs of the personalities were restricted to +1 and -1, i.e., strings of letters in the text itself, both forward and backward. The dates were searched for in exactly the same way as in WRR. This approach was not new, as it had been used in an earlier experiment of Witztum and Rips known as the "Nations experiment" and recorded in a preprint in 1995 (we shall describe this experiment shortly). Two versions of the experiment were performed; one produced a significance level of 1/1,960 while the other was 1/21,740. Once again, an unambiguously significant result was obtained. Even if one claimed that the original WRR proximity measure must have been tried as well, this would only multiply the result by 2, giving a significance level of 1/10,870.
A second version of this experiment, "The Tribes of Israel" (also known as "The sons of Yaakov"), was also performed by Rips.
[Comment:] First Witztum did one version of this experiment, making an unknown number of choices (of which the known choices were in his favour, as always). Then he suggested a variation to Rips, who dutifully calculated his "probability" without taking into account that the topic and the source of the data were suggested to him by someone who had studied it already! This is a fundamental and obvious error that invalidates his experiment completely.
To Witztum's "experiment" we have the following reply. Rather than trying the older sources considered by Witztum (which disagree severely with each other), we thought to try modern Hebrew sources. Four such sources were listed by Robert Haralick for his (now defunct?) rabbis experiment. These also disagree with each other, so we applied Witztum's criterion (described above by Gans) of prefering the source that gives the most data. This indictated the encyclopedia Melitzei Esh by A. Stern (NY, 1962). That encyclopedia is organised with a chapter for each day of the year, so we looked up which chapter each of the 12 personalities appeared in. We also made one other change: we wrote the name of the month "Cheshvan" in its original and correct form "Marcheshvan" (see any encyclopedia, or this article). Otherwise our method was exactly the same as Witztum's, yet we obtained a result of 1/17,200 in the random text R that WRR used as a control text in their paper.
So yet again we have demonstrated that a surprisingly small amount of choice is required for apparently strong results to be obtained. We did not even have the great freedom Witztum had in choosing the topic.
[Gans:] Even accounting for the possibility that separate experiments were performed on data taken from each known source, rather than from a single source (Yalkut Shimoni for Witztum, Midrash Tadshe for Rips), the significance level was still 1/28,570! One would have to claim that there were experiments on at least another 200 data sets taken from different (unknown) sources before one could discount the result as insignificant because of the number of experiments tried to achieve a success. No one has even suggested that anywhere near that many additional sources of this information as an explicit list (each of which must be different, of course) exist.
[Comment:] As our previous comment illustrates, the amount of choice hidden from view amongst apparently reasonable decisions can be much greater than could be imagined by anyone who has not learned how to find it. This is the basic secret of the "codes" that Gans will not face. We did not have to try hundreds of sources to achieve our 1/17,200. It was more like a dozen. Calculations like the one Gans is making are simply wrong, as we explained earlier.
[Gans:] It is interesting to contrast the approach of Witztum and that of Rips in dealing with the potential challenge that there were hidden failures in this experiment. There are several different sources for the information, as well as different options in choosing how to spell the names. Witztum's approach is to make a single choice of all the parameters, and explain how he made his choice and why it is logical that only that choice was made. Rips' approach is to tally all the sources and arguable choices that are possible and show that even if all of them were tried, the results are still very significant.
There have been other successful scientific Torah codes experiments performed by other, independent researchers, including Nachum Bombach, Alex Rottenberg, and Leib Schwartzman. Thus, other independent researchers have reproduced the Torah codes phenomenon. We have chosen to describe the three experiments above because they have logical implications to the validity of the WRR experiment. We now describe a remarkable result obtained by an associate of Witztum.
The nations prefix experiment
In 1995, Witztum, Rips, and Rosenberg (WRR II) prepared a paper entitled "Equidistant Letter Sequences in the Book of Genesis II: The Relation to the Text". This experiment has come to be known as the "Nations experiment". Most of the details of this experiment, and the critics' challenges to it are not relevant to the result that we are concerned with here, so we will only describe those details that are needed. Genesis, chapter 10, lists 70 descendants of Noah's three sons. These 70 descendants were the progenitors of the 70 biblical nations that constitute all of humanity. 68 of the 70 names are distinct. For the Nations experiment, each nation name is paired with a defining characteristic of a nation, used as a prefix to that nation's name. Four prefixes were used for what is known as the "regular" component of the experiment: AM (nation), ARZ (country), CPT (language), and KTB (script). Thus, for example, for KNAN (Canaan) the four pairs would be: (KNAN, AM KNAN) (Canaan, nation of Canaan), (KNAN, ARZ KNAN) (Canaan, country of Canaan), (KNAN, CPT KNAN) (Canaan, language of Canaan), and (KNAN, KTB KNAN) (Canaan, script of Canaan). WRR II provide reasons for selecting these four defining characteristics based on a commentary on the Book of Job written by the Vilna Gaon, the Targum Yonatan's names for the nation/country as well as the plural forms. They also provide justifications for the Hebrew words used for these characteristics.
[Comment:] Readers should note Gans' exact words here. The Vilna Gaon, accepted as one of the greatest rabbis of the millennium, used four specific Hebrew words to denote four characteristics of a nation. WRR, without ever checking what the effect on their experiment would be, changed two of those words to other words that just happen to perform much better. All we have to do is believe them!
[Gans:] MBBK (page 162) dispute the a priori selection of these prefixes and argue that many characteristics and words representing these characteristics were tested, and high scoring ones were then chosen as prefixes. An a posteriori "story" was then concocted to "justify" the given selection. Unlike WRR, where the proximity is measured between ELSs of the pairs of words, here the proximity is measured between appearances of the first word of each pair in the text itself, either forward or backward (i.e., skip distances +1 and -1), and ELSs of the second word. A few minor modifications were made in the proximity measure to accommodate this change. The final probability reported for this "regular" component of the experiment as described here was 70/1,000,000,000 for P1 and 5,667/1,000,000,000 for P2.
On March 11, 1998, Bar Natan, McKay, and Sternberg (BMS) posted a paper on the Internet entitled "On the Witztum - Rips - Rosenberg Sample of Nations" (draft). In this paper, BMS construct a "counterfeit" Nations experiment to demonstrate how WRR II could have done the same.
[Comment:] Actually we construct several counterfeit experiments of this nature, all of which get a strong result using War and Peace in place of Genesis, and even one getting a good result in both books. We found that the counterfeiting task is in fact rather easy.
We also examined the rest of the data from WRR's experiment with the help of an expert on Aramaic. We found that many choices had been available (including some actual errors on WRR's part) and that almost all of them had been made in the direction favourable to WRR. Even without our observations on the four prefixes, this observation is enough to suspect the conduct of the experiment. See here for our paper.
[Gans:] Several responses to this paper were posted by Witztum in which he explains all of the details of how the Nations experiment was performed, and refutes the arguments of BMS. We are concerned with an experiment performed by an associate of Witztum, and reported in Witztum's August 29, 1999 paper "The Nations Sample" (Part II). BMS construct their experiment by starting with a list of 136 possible prefixes (BMS, Table 5), including the four used by WRR II. They next rank the 136 prefixes using the P2 proximity measure and select a high scoring set of four. They also construct a story to "justify" their choices. BMS obtain an apparent significance level of 5/100,000,000 using a Hebrew translation of "War and Peace" (the same used as a control by WRR).
BMS point out that there may be mathematical problems with the technique used to measure the probabilities by WRR II (BMS, Appendix; Notes on the Metric). Consequently, Witztum replaced the original measurement scheme with one that does not have any of these potential problems. This technique, called RPWL (Random Permutation of Word Letters), was devised by Witztum in 1994 for "Header Samples" and was described in a letter to Professor Robert Aumann on 22 February 1994.
[Comment:] The RPWL method suffers from mathematical problems of the same nature as WRR's original method does, plus some others. Natural texts, including the Torah, have a non-uniform distribution of letters (see here for detailed experimental evidence). Similarly, there is an obvious non-randomness in the distribution of letters within meaningful words (and this is even more obvious than usual for the words used in this experiment). These two non-randomnesses have an interrelationship (accidental or not) that the RPWL method ignores. Note that randomising the text destroys the first phenomenon, which is why Witztum's "demonstration" of the validity of RPWL using random texts is invalid.
Besides this, inventing new methods of analysing known data is a worthless task. It is always possible to find methods that improve the result and methods that worsen the result. This proves nothing, especially if the methods are defined in full view of the data they will be applied to.
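For readers who have not seen RPWL described: it replaces the permutation of dates against personalities by a random shuffle of the letters inside each word. In outline it is something like the following sketch (our reconstruction from the published descriptions, not Witztum's program; score is a stand-in for the overall proximity computation, with smaller scores taken as "better"):

    import random

    def rpwl_rank(pairs, score, trials=99999, seed=1):
        # pairs: list of (name, date) strings; score: stand-in for the
        # overall proximity computation (in reality WRR's c(w,w') machinery)
        rng = random.Random(seed)
        real = score(pairs)
        at_least_as_good = 1                      # count the real matching itself
        for _ in range(trials):
            shuffled = []
            for name, date in pairs:
                letters = list(date)
                rng.shuffle(letters)              # permute the letters of the date
                shuffled.append((name, "".join(letters)))
            if score(shuffled) <= real:
                at_least_as_good += 1
        return at_least_as_good / (trials + 1)    # estimated rank of the real score

Shuffling the letters of a word produces "words" whose letter statistics are not those of real Hebrew, which is one reason this comparison tests the wrong null hypothesis.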
[Gans:] It was used again for WRR3 in 1996. The RPWL technique was posted on Witztum's Web site in 1997. The paper describing the application of RPWL to the WRR list 2 is entitled "The 'Famous Rabbis' Sample: A New Measurement" and is dated May 10, 1998. Using this measure, the overall significance of the "regular" component of the Nations experiment is 1.4E-7 (i.e., 14/100,000,000) while the BMS "significance level" is 1.22E-4 (i.e., 122/1,000,000). BMS's "significance level" is meaningless since it was obtained by "tuning" (cheating). BMS claim that WRR II did the same.
The experiment performed by Witztum's associate uses the 136 prefixes created by BMS along with the accurate RPWL measurement scheme. Each one of the prefixes is ranked with P1 and P2, both for "War and Peace" as used by BMS, and for Genesis. The top scoring 4 prefixes are then selected for each. For "War and Peace" the score obtained for the combination of the best 4 prefixes is 6.16E-6, while for Genesis it was 4.0E-10. Neither of these is a true significance level (probability) since the four prefixes for each were not chosen a priori, but rather they were chosen to optimize the proximity measure. Yet, they differ by almost 4 orders of magnitude. The real issue is this: given the optimization procedure used, what is the true probability of obtaining a score of 4.0E-10 (or stronger)? Using this procedure, is a score of 6.16E-6 within the range of random expectation? It will be recalled that the individual proximity measures, c(w, w'), range from near 0 to 1 in value. One million simulations of this experiment were performed by simply replacing c(w, w') with a computer generated pseudo-random number with values ranging uniformly from near 0 to 1. The ranking of the "War and Peace" score of 6.16E-6 among these million simulations was 353,949. Thus, the probability of obtaining such a small score (or smaller) by taking the best four out of 136 prefixes is 0.35. This is well within the range of random expectation. The ranking of the score 4.0E-10 obtained on Genesis among the million simulations was 420. Thus, the probability of getting such a small score (or smaller) on Genesis is 0.00042. This probability easily passes Diaconis' threshold of 1/1,000. This result is remarkable because there are no stories to justify, no choices of prefixes to justify, and the list of 136 prefixes was provided as an apriori list by BMS. This result also suggests that there may be more encoded in Genesis concerning the 70 nations than WRR II suspected since the four best prefixes of the 136 in Genesis include two not used by WRR II.
[Comment:] This is a perfect example of an a posteriori experiment. Every aspect of the data was known in advance of the experimental method being devised, which is exactly the opposite of what is required for a valid experiment. Gans has written about these things himself, and it is truly amazing that he can set aside his principles so readily.
The fact is that, given a list of things (in this case attributes), of which four are better than the others in some known way, all one has to do is to search for a measurement method that accentuates the advantage of the given four. Then a simulation like Gans described is likely to give results like he described.
In any case, the assumption of uniformity and independence of c(w,w') values that appears in the experiment is unjustified, in fact false, as explained in our previous comment. So the experimental method is wrong anyway.
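For concreteness, the simulation Gans describes amounts to something like the following sketch (ours, not his; note that it builds in exactly the uniformity and independence of the c(w, w') values that we have just rejected, and that prefix_score is a schematic stand-in for the real P1/P2 computation):

    import math
    import random

    def prefix_score(cs):
        # Stand-in for a P1/P2-style score computed from one prefix's
        # individual proximities (smaller = better); schematic only.
        return math.prod(cs)

    def simulate_once(rng, n_prefixes=136, n_pairs=68):
        # One simulation as Gans describes: every individual proximity
        # c(w,w') is replaced by an independent Uniform(0,1) number.
        scores = sorted(prefix_score([rng.random() for _ in range(n_pairs)])
                        for _ in range(n_prefixes))
        return math.prod(scores[:4])              # keep only the best four prefixes

    def estimated_probability(observed, trials=1000000, seed=0):
        # Gans used 1,000,000 trials; the rank of the observed score among
        # the simulations is the quoted probability, e.g. 420/1,000,000.
        rng = random.Random(seed)
        hits = sum(simulate_once(rng) <= observed for _ in range(trials))
        return hits / trials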
[Gans:]
Challenges to the date forms
WRR used three date forms. For example, for the 13th of Adar the three forms are: YG ADR, BYG ADR, and YG BADR. No one argues that these three forms are incorrect, or unusual, but there are other forms that exist. In particular, the form BYG BADR is not uncommon. In addition, there are several other date forms that can be found in use, e.g., YG LADR, BYG LADR, YG $L ADR, BYG $L ADR but are rare. There are also variations in the names of the months (e.g., Cheshvan vs. Marcheshvan), etc. It would appear that the issue of hidden failures would be relevant here. WRR might have performed many hidden experiments using various combinations of the date forms, and then reported the combination that worked best.
Resolution:
The issue of hidden failures does not apply here at all. Since the date forms were completely specified for list 1, there was no freedom (sometimes called "wiggle room") to change them for list 2. That is, the date forms were "list 1 - fixed". Consequently, although theoretically there might have been wiggle room utilized in choosing the date forms for list 1 (which is one of the reasons that experimental results were not reported for list 1 in WRR), this was logically impossible for list 2.
It is worth noting that the form BYG BADR does well on list 1 - much better than the form YG ADR (MBBK, pg. 156).
[Comment:] We dispute the description "well", but in any case the form YG ADR was forced on WRR because they needed the unusual form BYG ADR that contributes most of their "success".
[Gans:] If wiggle room had been utilized for list 1 then it should have included the form BYG BADR. We conclude that it is logically impossible for any wiggle room with the date forms to have been taken advantage of for list 2, and there is good evidence that it did not take place for list 1.
One last point: the question of date forms does not apply at all to Gans' cities experiment. As for Witztum's two experiments listed above, the date forms used there are identical to the ones used in WRR. As with list 2, there was no wiggle room at all.
Challenges to the proximity formula
The proximity formula is an essential component of any Torah codes experiment. This formula accomplishes the following: Given a list of pairs of words (e.g., names and dates of death), along with the positions in the text where these words are found as ELSs, it produces a small set of numbers. Each of these numbers (P1, P2, etc.) is a slightly different measure of the overall proximity between ELSs of the paired words of the list. Each of the overall proximities of the list is, in turn, composed of the individual proximities ("c(w, w') values") found between ELSs of the words making up each pair. For WRR there were 4 overall proximities calculated; for the cities experiment and the two additional experiments of Witztum described above, only (the last) 2 of the 4 were used.
Each overall proximity obtained must next be input to the process which calculates the probability of obtaining that measure by chance, i.e., assuming that there is no Torah codes phenomenon. The 4 (or 2) probabilities thus obtained may then be combined in a standard way to yield a single probability measure for the entire experiment. For WRR, the four overall proximity measures are (1) 6.15 sigma, (2) 1.15E-9, (3) 5.52 sigma, and (4) 7.20E-9. The probabilities of obtaining each of these proximity measures at random are (1) 453/1,000,000, (2) 5/1,000,000, (3) 570/1,000,000, and (4) 4/1,000,000. (The reason for presenting these probabilities as fractions over 1,000,000 will be made clear later in this paper.) These 4 probabilities are then combined to yield 16/1,000,000, which is equal to 1/62,500.
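[Comment:] For readers who want the combining rule spelled out: as Gans' own parenthetical remark later in this section confirms, the "standard way" amounts to multiplying the smallest of the k probabilities by k (a Bonferroni-style correction for having looked at k statistics). In code:

    def combine(pvals):
        # the rule used here: k times the smallest of the k p-values
        return len(pvals) * min(pvals)

    print(combine([453e-6, 5e-6, 570e-6, 4e-6]))   # 4 * 4e-6 = 16e-6 = 1/62,500

[Gans:]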
The proximity formula used is complex. This implies that there are many parts of it that could be changed to produce different proximity measures. This opens the door to hidden failures. Perhaps a large number of variations of the formula were tried, and only the one that worked best was publicized.
Resolution:
The resolution here is identical to the resolution of the challenge to the date forms since the proximity formula was "list 1 - fixed". The same holds true for the additional experiments (except for a possible factor of 2 in the "personalities" experiment, as described above).
[Comment:] For a description of the evidence against this conclusion, see here.
[Gans:]
Challenges to the process that calculates the probability
In the previous section of this paper, we indicated that the last major step in any Torah codes experiment is to transform the overall proximity measures into probabilities. These probabilities are then combined in a standard, non-controversial way, to give the final probability for the entire experiment. The probability thus calculated is the probability that a single experiment would produce the given overall proximity measures (or better) if there were no codes in the Torah. For example, given the first overall proximity measure of 6.15 sigma for the WRR experiment, the probability of a single experiment producing such a measure (or better) is 453/1,000,000. This means that one would expect to get a proximity measure that strong or stronger approximately 453 times for every 1,000,000 experiments performed, assuming pure chance, i.e., no codes.
The process of calculating the probability associated with an overall proximity measure is simple and straightforward. The overall proximity measure is comprised of all the individual proximity measures between personality appellations and dates of birth and death. Consider the effect of randomly mismatching the personalities and the dates (i.e., randomly permuting the dates vs. the personalities). The appellations are the same, the dates are the same, the text of Genesis is the same, and the proximity measure is the same, but the information is wrong. For example, suppose there were 5 personalities. For simplicity, we assume that each has one appellation and one date. Then we might represent the data as follows:
Personality(1) date(1) proximity(1,1)
Personality(2) date(2) proximity(2,2)
Personality(3) date(3) proximity(3,3)
Personality(4) date(4) proximity(4,4)
Personality(5) date(5) proximity(5,5)
Overall proximity
A "mismatched" or "permuted" experiment (one in which the dates have been randomly permuted with respect to the personalities) might look like this:
Personality(1) date(5) proximity(1,5)
Personality(2) date(3) proximity(2,3)
Personality(3) date(1) proximity(3,1)
Personality(4) date(2) proximity(4,2)
Personality(5) date(4) proximity(5,4)
Overall mismatched proximity
There are many ways of mismatching the personalities and dates, particularly for large numbers of personalities as in WRR. If we perform, say 99 permutations then we have 1 real overall proximity and 99 overall proximities for permuted experiments, or 100 all together. If there are no Torah codes, then there is no meaning to any of these proximities, and the real proximity measure is expected to be similar to the ones for the permuted experiments. It follows that the probability of the real overall proximity being the strongest of the 100 is 1/100 since there are 100 places and only 1 is best. Similarly, if one performs 999 random permutations with corresponding overall proximity calculations, and the real one is best, the probability is 1/1,000. If it is second best, the probability is 2/1,000 or 1/500, and so on. In this way, we can directly compute an estimate of the probability of obtaining a given real overall proximity measure by actually performing many permuted experiments and simply counting how many overall proximities for permuted experiments are better than the real one. As the number of permuted experiments is increased, the accuracy of this estimate is also increased. This process is often called a "randomization". For WRR, Diaconis requested that 999,999 permuted experiments be performed (letter to Prof. Robert Aumann, Sept 5, 1990). Together with the real experiment, this makes 1,000,000 experiments. A probability of 453/1,000,000 for the first overall proximity measure, P1, means that the real measure ranked 453 out of 1,000,000. Similarly, a probability of 4/1,000,000 for the 4th overall proximity measure, P4, means that only 3 overall proximity measures (of the same type, i.e., P4) on permuted experiments out of 999,999 ranked better than P4. (The 1/62,500 = 16/1,000,000 = 4x4/1,000,000 is the probability of obtaining 4 probabilities, the smallest of which is 4/1,000,000.)
This approach is so intuitive and simple, it is hard to see where it might have a problem. Indeed, the challenge does not claim that the technique is totally incorrect, but rather that because of certain circumstances in the WRR experiment, the value that it produces may be inaccurate. One of the circumstances that might affect the accuracy is the fact that some dates and names appear more than once in the list (producing "dependencies"). A second circumstance that might affect the accuracy is the fact that the real experiment happens to have more pairs than about 98% of the permuted experiments and this may give the real experiment an advantage that has nothing to do with the presence of codes.
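[Comment:] We have no quarrel with this description of the mechanics. For the curious, the whole procedure fits in a few lines (a schematic sketch only; overall_proximity is a stand-in for WRR's P1-P4 machinery, and we adopt the convention that larger values are better):

    import random

    def permutation_rank(names, dates, overall_proximity, trials=999999, seed=1):
        # estimate the rank of the true matching among random mismatchings
        rng = random.Random(seed)
        real = overall_proximity(names, dates)
        rank = 1                                  # the real matching itself
        for _ in range(trials):
            permuted = dates[:]
            rng.shuffle(permuted)                 # mismatch dates against names
            if overall_proximity(names, permuted) >= real:
                rank += 1
        return rank / (trials + 1)                # e.g. 453/1,000,000

Our objections, as Gans has just noted, concern not this skeleton but the circumstances in which it was applied.
[Gans:]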
A further challenge to the calculation of the probabilities is the charge that Prof. Diaconis did not suggest the method; rather, it was Witztum and Rips who suggested it through Prof. Aumann. Specifically, MBBK say, "However, unnoticed by Diaconis, WRR performed the different permutation test described in section 2" (MBBK, page 153). Thus, the process for calculating the probability might have been designed by Witztum and Rips to ensure a "success" on list 2 (and presumably on list 1 as well).
Resolution
It is important to note that MBBK do not feel that this challenge is sufficient to reject the WRR experiment. We quote directly from page 154 of their paper: "Serious as these problems might be, we cannot establish that they constitute an adequate 'explanation' of WRR's result". Note: not only can they not establish that this "problem" constitutes a fatal flaw, they do not even claim that the "problems" are serious, only that they "might" be serious! There is good reason for their lack of a forceful challenge. They cannot prove that these "problems" are serious because there is strong evidence that these "problems" are not serious at all. First, we note that if the process were fatally flawed, it is impossible to explain why it should work on so many different experiments (WRR, Gans' cities, the additional Witztum experiments, etc.) but it does not work on any of the control experiments. If the process produces spurious results, why does it favor real experiments as opposed to permuted experiments? After all, whatever repeated words there are in the names and dates (i.e., dependencies) are equally present for all the experiments, real and permuted! Thus, the real experiment can have no advantage over the permuted experiments as a result of dependencies.
[Comment:] These comments would only make sense if the problems with the method were the only problems with the experiment. The problems with the method may not explain the result by themselves, but they might (and probably do) contribute to it. Also, a broken method need not give wrong results all the time. It will simply alter the probabilities so that the result will be wrong significantly often. For these reasons, Gans' comments about the control tests are also incorrect.
[Gans:] Let us now consider the "problem" of data size. The critics point out that the real WRR experiment has more pairs than 98% of the permuted experiments. The critics claim that "the effect of this extremeness is hard to pin down..." (MBBK, page 154). In fact, the effect is not at all hard to pin down because it is easily eliminated. Their hypothesis is that experiments with larger data size (more pairs) have a statistical advantage over experiments with smaller data sizes. This statistical advantage could result in the correctly matched WRR experiment scoring better than (98% of) the permuted experiments even if there are no codes. Note first that no reason is given as to why this should be so. To test if this hypothesis is true, all we need do is restrict the permuted experiments to the 2% that have equal or larger data size than the real experiment. If the rank of the real experiment still yields a probability less than 1/1,000 then we conclude that "the effect of this extremeness" is not responsible for the success of the experiment, and the "problem" is no problem at all.
[Comment:] Gans is making a serious error.
Divide the permutations into three classes:
A : data size less than the unpermuted data
B : data size the same as the unpermuted data
C : data size more than the unpermuted data
Our claim is that B might have some unfair advantage over A. Gans' "refutation" (which he copied from Witztum) is to compare B to C. However, the relationship between B and C has nothing to do with the relationship between B and A. There is no reason the effect of data size need be monotonic.
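In code terms, with the scores and data sizes of the permuted experiments recorded, the three classes are:

    def stratify(scores, sizes, real_size):
        # scores[i], sizes[i]: overall proximity and data size of permutation i
        A = [s for s, n in zip(scores, sizes) if n < real_size]
        B = [s for s, n in zip(scores, sizes) if n == real_size]
        C = [s for s, n in zip(scores, sizes) if n > real_size]
        return A, B, C

Gans' "fix" ranks the real experiment within B and C only; it never examines B against A at all.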
Incidentally, we have no obligation to prove that B does have an unfair advantage over A. Since WRR are claiming a miraculous explanation for their result, it is they who are obliged to eliminate other explanations.
[Gans:] McKay first raised this issue on February 22, 1997 in a paper posted on the Internet and entitled "Equidistant Letter Sequences in Genesis - A Report (DISCLAIMER PRELIMINARY DRAFT ONLY)". In this paper McKay performs an experiment similar to the one suggested in our analysis above. He restricts the permutations to those that have exactly the same size as the real experiment. He reported a P2 rank of 127/1,000,000 (page 4) and notes that this is "24 times worse" than the WRR result. What he failed to point out in that paper is that 127/1,000,000 = 1/7,874, which is still more than 7 times more significant than Diaconis' success threshold of 1/1,000. (In a later draft, this value is no longer quoted and McKay notes that there may have been some errors in the data used for the first draft.) Furthermore, if one uses P4 - the measure that did best in WRR - not P2 as McKay did (why didn't he use P4?), and one requires that all permuted data sizes be greater than or equal to the unpermuted data size, one obtains a probability of 1/1,000,000. If one does 100,000,000 permutations (to obtain more accuracy) then the probability is 28/100,000,000, as opposed to a probability of 72/100,000,000 if one does not constrain the data sizes. We conclude that there is evidence that extremeness in data size, first discovered by McKay, adversely affects the accuracy of the WRR results. When the extremeness is removed, the results are more significant! Apparently, MBBK felt that it was important to raise the question concerning extremeness in data size, but that it was irrelevant to mention the solution which McKay himself had proposed over two years earlier!
For the cities experiment the data sizes of the real and mismatched experiments are much closer than they are in WRR. If one applies the fix described above so that the "problem" does not apply, one obtains the same significance level as before: 6/1,000,000.
There is a further test that was applied to corroborate the accuracy of the process for computing the probability. A set of 30 control experiments (for the cities experiment) was performed by randomly mixing up (permuting) the letters within each word in the list. The permuted version of each word replaced the original word wherever it appeared in the original list. Thus, the number of repeated words (and therefore, dependencies) and the data sizes were maintained exactly. The 30 results were then analyzed to see if they conformed to random expectation. Even a slight bias could be significant if it were present in all or even most of the 30 results. The final probability measuring the significance of any deviation from random expectation in the 30 results was 0.4. This is a totally insignificant probability and is strong evidence that these so-called "problems" are not problems at all.
[Comment:] Amazingly, almost exactly the same experiment appeared in the unpublished report of McKay that Gans quoted above. The main differences were that it used the data of WRR's experiment (not Gans' experiment) and that only the dates were randomly permuted (not both names and dates). In contrast with the 0.4 significance level obtained by Gans, our experiment gave a significance level too low to measure (nominally 0.000000000003). How is it that Gans could fail to mention this problem with his case?
Furthermore, Gans' simulation does not establish what he needs to make his point. He has to investigate how the bias in data size affects the probability of getting a very low p-value, which may (or may not, nobody knows) be a much greater effect than on large p-values. Very many more than 30 trials would be needed to test for this, but, since nobody has done this experiment, this aspect of the correctness of WRR's method remains conjectural.
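For the record, a control of the kind Gans describes has roughly this outline (our sketch; run_experiment and word_pairs are stand-ins for his pipeline and data):

    import random
    from scipy.stats import kstest

    def scramble(word, rng):
        letters = list(word)
        rng.shuffle(letters)                      # permute letters within one word
        return "".join(letters)

    def control_pvalues(word_pairs, run_experiment, n_controls=30, seed=0):
        # run_experiment: stand-in for the whole proximity-plus-permutation
        # pipeline, returning one p-value per scrambled control list
        rng = random.Random(seed)
        return [run_experiment([(scramble(a, rng), scramble(b, rng))
                                for a, b in word_pairs])
                for _ in range(n_controls)]

    # Under the null hypothesis the control p-values are Uniform(0,1), so
    # they can be tested with, e.g., kstest(control_pvalues(...), "uniform").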
[Gans:] On random data, uniformly random results are produced. Significant results are obtained only for real data. We now understand why even the critics did not claim that these problems are "serious", but only that they "might" be serious. It is also clear why the critics admitted, "we cannot establish that they constitute an adequate 'explanation' of WRR's results". One cannot establish as true that which can be empirically demonstrated to be untrue!
Another problem that has been raised (MBBK, page 171) is the following. Some appellations and dates are not found as ELSs in Genesis. Suppose all appellations (or all dates) for a particular personality are not found. Then it would seem that that personality could be removed from the list since it does not contribute any individual proximity measures. WRR, however, kept these entries, allowing the dates (or appellations, if no dates were found) to be paired with other appellations (dates) in the permuted experiments. According to MBBK, this introduces "noise" in the final probabilities.
In this case, MBBK may be correct, and as they show in their paper, removing these personalities from the list improves the final probability slightly. This is another opportunity for an improved experiment that WRR missed. We shall deal with the implications of this observation later.
There is another refutation of MBBK's thesis that various biases skew the accuracy of the randomization process that calculates the probability from the proximity measures. In a May 1998 paper posted on the Internet and entitled "The 'Famous Rabbis' Sample: A New Measurement", Witztum uses a measure first introduced in 1994 to calculate the probabilities in a way that avoids all the biases noted by MBBK.
[Comment:] This "proof" of Witztum was ignored by us due to it being of no value scientifically. There is a mathematical flaw that completely invalidates it. Witztum is capable of inventing junk like this for the rest of his life, but we are entitled to stop paying attention to it.
[Gans:] Part of the process involves partitioning the list of appellation -- date pairs into three sets.
[Comment:] And guess what happens if this brand new method of manipulating the data is not performed. Why isn't Gans asking these questions?
[Gans:] Each set uses only one of the three date forms, thus avoiding each appellation appearing multiple times corresponding to the different date forms. There were also several other changes including a randomization based on random permutations of the letters comprising each date rather than permutations of the dates versus the personalities (the RPWL technique discussed earlier). The final probabilities obtained for each set were: Set 1: 0.000000626, Set 2: 0.0258, and Set 3: 0.012. Thus, if the biases noted by MBBK had any effect on the results it was a detrimental one; it did not enhance the result. It is interesting to note how MBBK deal with this randomization technique - publicized a year before MBBK posted their paper on the internet -- a technique that proves that the WRR result is not due to biases in the methodology. MBBK do not mention it at all! Contrast this with MBBK's assertion that "nothing we have chosen to omit tells a story contrary to the story here" (MBBK, page 152).
The origin of the permutation test
We now address the charge that the permutation test used to obtain the probabilities was not suggested by Diaconis, but rather was invented by Witztum and Rips and suggested to Aumann. Note that in any case, this method was "list 2 - fixed" and hence could not be manipulated for any of the additional experiments. However, we will demonstrate that the charge is untrue.
[Comment:] We could demonstrate blow-by-blow why Gans' lengthy arguments below do not hold water, but we will not do so for this reason: Unlike Gans, we have interviewed both Aumann and Diaconis at length on this issue and both of them agree with us. Professor Aumann (who is well known by WRR as a supporter of their work) has confirmed both in writing and in a tape-recorded interview that the permutation test was invented by WRR, and also that he and Diaconis had a misunderstanding rather than an agreement. Here is an exact quotation from the tape recording:
[Aumann:] He [Diaconis] didn't suggest the permutation method explicitly in letters to me, it's really they [WRR] who suggested it, but it's possible that Persi had this in mind.
[Aumann (a little later):] Now it's true that what Persi wrote in his letter of the 5th of September is not what they did. Ok? That's correct. But Persi was not aware of that.
[Bar-Hillel:] Nobody was 'til Brendan discovered it.
[Aumann:] No. Nobody was.
[Bar-Hillel:] You weren't.
[Aumann:] I wasn't.
As far as we are concerned, that is the end of the argument.
We have analysed this history in detail here and don't need to repeat ourselves.
[Gans:] Appendix C contains copies of several letters between Diaconis and Aumann that conclusively demonstrate that Diaconis suggested the permutation test. The reader is urged to read these letters.
[Comment:] Note that we published these letters first. They support our case completely as our analysis shows.
[Gans:] In the September 5, 1990 letter to Aumann, Diaconis states that they "are in agreement about the appropriate testing procedure for the paper by Rips et al." He then goes on to describe the permutation test. Some of the details are clarified in the September 7, 1990 response from Aumann to Diaconis. In a paper entitled "The origin of the permutation test" by McKay and Kalai (MK) and posted on McKay's Web site shortly after the appearance of the MBBK paper, they state, "For each Rabbi, there were a list of names and dates. The permutation test works like this: calculate some measure of the average closeness of the names and dates. Randomly permute the dates..." Here is the first problem. Nowhere does Diaconis ever suggest an "average". The word is used by MK but not by Diaconis or Aumann. Diaconis does talk about an "additive" test, but Aumann clarifies this in his September 7 letter as follows: "incidentally, 'bunching' or 'twenty percent' might be a more suggestive name for the test you call 'additive' ". "Bunching" and "twenty percent" are clearly referring to the proximity measure P1 which measures the "bunching" of c(w, w') that are less than 20%, or 0.2. MB would have us believe that there are two different tests referred to in these letters, a "Type A" test suggested by Witztum and Rips through Aumann, and a "Type D" test suggested by Diaconis. MB define these tests as follows:
"Type A: Take all 300 or so name - date pairs (from all the rabbis), then combine those 300 distances into an overall measure.
Type D: For each rabbi, combine the distances of his name - date pairs into a single number, then combine those 32 numbers into an overall measure."
Nowhere does Diaconis explicitly call for a "Type D" measure. Diaconis would not call for such a test because "Type A" and "Type D" refer to the proximity measure, not a probability. The proximity measures P1 and P2 were defined by Witztum and Rips and thought, at the time, to be probabilities. Diaconis objected to the use of P1 and P2 as probabilities because it involved an implicit and unproved assumption that the individual proximity measures, c(w, w'), were independent. Therefore Diaconis and Aumann discussed how to measure the significance of P1 and P2 (as well as P3 and P4). The permutation test does this. However, if the individual c(w, w') for all appellations and dates are combined into a single measure for each personality, then the process for calculating the probability is not being applied to P1 and P2, but rather to a new function of the c (w, w')'s. That is, a new proximity measure has been created and the significance of the new measure is then computed. The success or failure of such an experiment is irrelevant to the original question posed: the significance level of Witztum and Rips' P1 and P2. Note too, that P1 and P2 were fixed on list 1 and no modifications were permitted for its application to list 2. This was the reason Diaconis requested the test on list 2!
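[Comment:] Before following Gans into the correspondence, it may help to see the two shapes side by side. Schematically (our sketch; the plain mean stands in for whatever combining statistic is actually used):

    from statistics import mean

    def type_a(distances_per_rabbi):
        # Type A: pool all ~300 name-date distances and combine them directly
        pooled = [d for rabbi in distances_per_rabbi for d in rabbi]
        return mean(pooled)

    def type_d(distances_per_rabbi):
        # Type D: reduce each rabbi to one number first, then combine
        # the ~32 per-rabbi numbers
        return mean(mean(rabbi) for rabbi in distances_per_rabbi)

On our reading of the record, WRR's published test has the Type A shape while Diaconis' letters call for Type D; Gans disputes this below.
[Gans:]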
Let us examine the May 7, 1990 letter written by Diaconis to Aumann. The second sentence of the letter is "I have four points to make on the test proposed by Witztum, Rips, and Rosenberg". Thus, Diaconis is not suggesting anything new; he is commenting on what WRR had proposed. In paragraph 2 (the second of the four points) Diaconis says, "as I understand it, there is a fixed set of matched pairs (x1, y1) (x2, y2),...,(xn, yn) and a statistic T = d (x1, y1) +...+ d (xn, yn)". Thus, Diaconis does refer to a sum, but it is not his suggestion. This is what he understood concerning the proximity measure that Witztum and Rips had already applied to list 1. Diaconis is certainly not saying "as I understand it" with reference to his own suggestion! It would seem that at the time Diaconis thought that there was only one individual proximity measure for each personality, probably because he did not realize that there was more than one appellation and date per rabbi. Note, however, that this was his understanding at the time, not his suggestion. Nowhere does he define "d (xi, yi)" as MK would have us believe - Diaconis assumed that Witztum and Rips had already done so and accepted their definition of proximity, whatever it was.
We see that Diaconis was mistaken concerning the definition of the proximity measures, thinking that the individual proximity measures (his "d (xi, yi)") were summed. He does not suggest that an average be taken to combine the c(w, w') into d (xi, yi) for each personality. This would be redefining the proximity measure - something that neither Aumann nor Diaconis wanted since any change would result in the WRR hypothesis remaining untested. There is another reason Diaconis would not make such a suggestion - it makes no mathematical sense unless the measure is being tuned to guarantee failure. This is because when small values (significant) are averaged with large values (insignificant), the large values dominate the sum and the significance is lost. Consider the following example. Suppose we have two probabilities. One is p1 = 1/1,000,000 while the other is p2 = ½. We wish to calculate a single measure of the significance of obtaining both p1 and p2. Using the same technique as used in WRR, we obtain p = 2x(1/1,000,000) = 1/500,000. On the other hand, if we take the average of the two, we get 0.2500005, which is insignificant. It follows that Diaconis would never have made such a suggestion unless he was a novice in statistics or purposely tuning the measure to fail - hardly a reasonable thesis. Diaconis does refer to (NOT suggest) an "additive" test and a "multiplicative" test. As we have shown, the "additive" test corresponds to the P1 measure while the "multiplicative" test corresponds to the P2 measure.
How do MB maintain a thesis that Diaconis and Aumann were talking about two different ways of calculating the probability when Diaconis and Aumann never say a word about such a disagreement, and actually say several times that they were in agreement? One can understand Diaconis not paying attention to the proximity measure. It was designed by Witztum and Rips and was not to be altered. But as for measuring the probability, this was the central issue that Diaconis was addressing! Here is MB's explanation of this paradox: "Neither person acknowledged the difference between them. It was almost as though neither person was reading the letters written by the other" (MB, page 6). It is important to note that the WRR paper was written before the calculation of the probability on list 2 was done. Of course, the results of the computation were not included, but were represented by question marks. This paper was given to Diaconis by Aumann, and Diaconis approved it (see the appendix for a copy of critical pages from this paper). In a cover letter for the WRR paper, dated Dec 6, 1991 (see appendix), Aumann states "needless to say, the test itself was not changed in any way; it is precisely the one to which we agreed in the summer of 1990." The WRR paper contains an exact and detailed description of the proximity formulas P1, P2, P3, and P4, as well as the permutation test. It was only after Diaconis approved the paper that the calculation of the probability was done. We may assume that Diaconis, one of the referees, noticed that the proximity formula was not what he thought it was back in May of 1990, but it made no difference. The proximity formula was never the subject of discussion between Aumann and Diaconis; the calculation of the probability was. Surely Diaconis would not have approved an unauthorized change in his probability calculation. (There is even another letter of agreement between Diaconis and Aumann dated August 28, 1992 concerning further research and again defining the probability calculation as in WRR.) Finally, when the WRR paper was rejected by Diaconis and the other referees, it was rejected because it was not considered appropriate for PNAS; the results were not considered to be of scientific interest. It was not rejected because the experiment performed did not meet Diaconis' approval or specifications.
It is interesting to note that although Diaconis never mentions what the definition of "d (xi, yi)" is for each personality, McKay does offer a possible definition. On August 9, 1998, in a paper entitled "Revisiting the Permutation Test", McKay defines this quantity as "the average of the logarithms of the defined c(w, w') values". He then writes: "Use of the logarithms gives much stronger prominence to the word pairs with small distances and in my opinion meets the objection of Rips that ordinary mean "averages out" the alleged ELS phenomenon". McKay then presents the results of this experiment: "For the WRR data, this method gives very respectable scores of 125/ million for the first list and 8/ million for the second." Thus, McKay himself provided a possible definition for the quantity that Diaconis never defined and obtained very significant results! Somehow, MBBK "forgot" to mention these results in their paper in spite of their earnest statement that "Nothing we have chosen to omit tells a story contrary to the story here".
[Comment:] This comment is really quite ridiculous. It is quite impossible that Diaconis wanted to use the logarithm without mentioning it in his letters. The issue is of what Diaconis meant, not of what McKay can invent to satisfy Rips' a posteriori requirements. McKay can invent hundreds of methods that give strong results and hundreds of methods that don't. The two possibilities we mentioned in our paper were both specified by Diaconis - the minimum in one of his letters and the average in a telephone conversation.
[Gans:] It is noteworthy that it is possible to combine all the individual proximity measures for each personality and obtain a single true probability for each personality. This was first done with list 2 by Nachum Bombach using a technique called "BST" (Best Star Team) developed by Professor Robert Haralick. These individual probabilities can then be combined using a standard statistical formula (Fisher's Statistic) to obtain the probability associated with the entire list. The probability obtained this way is 0.00000218, very close to the original WRR result.
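[Comment:] For background, Fisher's statistic is the textbook rule for combining independent probabilities into one; a minimal sketch (assuming the per-personality probabilities really are independent, an assumption that is itself open to question here):

    import math
    from scipy.stats import chi2

    def fisher_combine(pvals):
        # Fisher's method: -2 * sum(ln p) is chi-squared with 2k degrees of
        # freedom when the k p-values are independent and uniform under H0
        stat = -2.0 * sum(math.log(p) for p in pvals)
        return chi2.sf(stat, 2 * len(pvals))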
As for the substance: when the original version of Haralick's method was applied to WRR's data, the result was very weak (reported on the TCODE mailing list). It is a matter of supreme inconsequence whether he could tweak it to work better or Bombach could think of a way to apply it differently from what Haralick specified in order to get a smaller "p-value". Why isn't Gans telling us that this is junk statistics? It isn't that he doesn't know.
Furthermore, tweaking the method to work better on one set of data is likely to be of less advantage to its performance on another set of data. Remember that and read on...
[Gans:] On the other hand, if this technique is applied to MBBK's "War and Peace" "experiment", the result is only 0.00225 - it doesn't even pass the 1/1,000 threshold! Thus, this technique appears to distinguish between the real and the counterfeit.
[Comment:] Bingo!
Let's play that game too. WRR use the sum of the squares of three distances to measure the compactness of two ELSs in a letter array. Instead we could use a simpler measure: the square of the greatest distance between a letter of one ELS and a letter of the other. It never varies from WRR's measure by more than a constant factor. (This change to the method is vastly less than Bombach's change.) Now we find that War and Peace performs 400 times better than Genesis! Maybe this technique can distinguish between open fakes and secret fakes!
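Schematically, with d1, d2, d3 the three distances that WRR square and sum, and dmax the greatest letter-to-letter distance between the two ELSs, the swap is just:

    def wrr_compactness(d1, d2, d3):
        # WRR: sum of the squares of the three distances
        return d1 * d1 + d2 * d2 + d3 * d3

    def alt_compactness(dmax):
        # our variant: square of the greatest letter-to-letter distance;
        # it stays within a constant factor of WRR's measure
        return dmax * dmax

That a change this trivial reverses the verdict between Genesis and War and Peace shows how little such "distinguishing" techniques are worth.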
[Gans:] We conclude that either (a) MB have misinterpreted the historical record, or (b):
(i) Diaconis said "as I understand it" in reference to his own suggestions, and
(ii) Diaconis suggested things that make no mathematical sense or were specifically designed to fail, and
(iii) Both Aumann and Diaconis respond to each other's letters over an extended period of time without appearing to read the mail they receive. They say that they are in agreement on the central issue - calculating the probability of the proximity measures P1 through P4 - when in fact they do not even know what the other is talking about! and
(iv) Diaconis approved the WRR paper before the probability calculations were performed without noticing that the calculations described in detail were not what he specified!
We will let the reader judge between these two alternatives. Finally, recall that this issue is irrelevant to the cities experiment and all the other additional experiments since the calculation of the probability was fixed once the experiment on list 2 was documented.
Challenges to the appellations
We now come to the most serious charge: that the appellations used were selected specifically to produce an apparently significant result for list 1 as well as for list 2. We have already seen that virtually all the parameters of the WRR experiment were fixed on list 1 and thus could not be manipulated to produce a "success" on list 2. There is, however, one major component of the experiment that was new for list 2: the appellations. Thus, logically, if any component of the experiment on list 2 was manipulated to effect an apparent success, it has to be the appellations. Furthermore, in order to prove that such manipulation is possible and practical, MBBK produced another set of "appellations", "similar" to those of WRR's list 2, that "succeeds" very well on a Hebrew translation of Tolstoy's "War and Peace" (in fact, the same used as a control experiment by WRR) (MBBK, page 157). Since MBBK were able to select "appellations" to make their experiment look successful, WRR presumably could have done the same to make their experiment look successful on Genesis.
Resolution
Before we resolve this issue, it is worth making a few observations. Witztum and Rips have declared publicly that they did not provide any input at all to the selection of appellations. They say that they did no more than provide Havlin with the list of personalities. Havlin also has "certified explicitly that he had prepared the lists on his own" (MBBK, page 156). It follows that subconscious manipulation of the appellations to effect a "success" is not possible in this case. Witztum and Rips had no control over the choice of appellations, and Havlin had no way to assess which appellations would contribute to the success or failure of the WRR experiment. If a fraud were perpetrated here, it must have been a conscious collusion between Witztum, Rips, and Havlin.
[Comment:] This conclusion does not logically follow. Gans has not distinguished between what was done at the time and what was claimed about it only years later.
[Gans:] This is not impossible, but it is unlikely that a world class academician (Rips) and a renowned rabbi and scholar (Havlin) would risk their professional and personal reputations by perpetrating a hoax which was bound to be revealed eventually. As we shall see, if there is a conspiracy here the number of people necessarily involved in it will stretch the credulity of any reasonable person. It will be recalled that in the above section entitled "non scientific challenges", several Gedolim (leading rabbis and sages) wrote letters of support for WRR. In a testimony co-authored by Rav Shmuel Deutch and Rav Shlomo Fisher, they state, "We checked the rules according to which Professor Havlin formulated his list of names and titles of Torah leaders, and we found that it was commensurate with both professional standards and common sense. The list is in keeping with the principles. We found that all the opponents' individual claims concerning deviations from the principles are false, and are a testimony to their glaring ignorance and unfamiliarity with the subject. In light of the above, we hereby affirm that the work of Rav Doron Witztum, Professor Eliyahu Rips, and Rav Professor Shlomo Zalman Havlin does not contain an iota of fraud or deception and the claims of their opponents are a reprehensible libel". In other words, Havlin's rules and lists, as well as his "professional judgement" were checked by two independent experts, Gedolim, and certified to be correct. They specifically state that there were no deviations from the rules.
[Comment:] To our complete lack of surprise, Gans fails to mention our thorough 83-page rebuttal of Witztum's allegations. See here. We don't see any need to repeat those arguments here.
It is important to realize that very few of the disputes about appellations are disputes over facts. Here is a typical example: Suppose a rabbi spelt his name as X, but other people sometimes spelt it as Y. Should we use X or Y or both? Another one: suppose a name became accepted as the family name of a rabbi's family, but not until several generations later than the rabbi's generation. Should we use it? Questions of similar nature abound. It is obvious that an argument can be written to support any of these alternatives. It is also obvious that there is no academic or rabbinical qualification that enables one to answer such questions with authority. This is why we are less than impressed by those "experts" who claim to have examined Witztum's choices and found them to be correct.
[Gans:] Yet, MBBK claim that list 2 does not follow Havlin's rules completely and use this as justification for breaking Havlin's rules to form their list for "War and Peace" (MBBK, page 157). MBBK had to break Havlin's rules because the wiggle room supposedly introduced by those instances where Havlin used "professional judgement" was insufficient to allow tuning the experiment to the extent required. The only way MBBK could make their experiment "work" on "War and Peace" was to deviate from Havlin's rules. This implies that any wiggle room resulting from Havlin's use of "professional judgement" was also insufficient to tune the WRR experiment to the extent required. It follows that MBBK's demonstration that one can manipulate the appellations to fit an arbitrary text falls apart! Their "demonstration" could only be accomplished by breaking Havlin's rules, whereas WRR did not break the rules. In addition, MBBK claim that since Havlin's rules were only made public after the WRR experiments were complete (MBBK, page 157), they were retrofitted to the lists after the fact. Note that these two claims are contradictory: either the rules existed before the lists were formed or the rules were retrofitted to the lists - but not both!
[Comment:] This argument is completely devoid of logical content. The skeptics claim BOTH that Havlin's "rules" were only written 9-10 years after the data AND that even some of those post-facto justifications were not followed consistently. There is no contradiction between those claims.
In fact, the observation that the data does not always follow the rules is much easier to understand if the rules were written after the data. If the rules came first, they would have been followed except for perhaps a very small number of human errors. On the other hand, if the data came before the rules, the difficulty of retrofitting rules to the data would show itself in repeated violations of the rules. A third possibility is a three-stage process: identify the data that works best, write rules to select data close to the best data, then adjust the data to obey the rules. Our observations suggest the second approach was followed for the rabbis experiment and the third for the cities experiment, but neither can be proved absolutely. As we say in our paper, we believe the data was "tuned" but don't claim to know the exact mechanism for this "tuning".
[Gans:] If the former is true, then MBBK's "War and Peace" list would have to follow pre-existing rules. If the latter is true, MBBK has to retrofit rules to their list. But they have done neither! Given the Gedolim's statement that "...all the opponents' claims concerning deviation from the principles are false...", it is clear that each and every appellation in list 2 (the list that they checked) is consistent with Havlin's rules. There are 102 appellations in list 2. The possibility of constructing rules that are "commensurate with both professional standards and common sense" and that will also fit "all" of 102 appellations, presumably selected to make an experiment look successful, is extremely remote. The ability to do something like that would be a wonder in itself! The critics have never demonstrated that such a thing is possible or even plausible. Unless one includes these Gedolim in the Witztum - Rips - Havlin "conspiracy", the charge of fraudulent appellation selection has been totally refuted.
In an article entitled "My Cities Experiment - Analysis and comments" posted on the Internet (http://wopr.com/biblecodes), Dr. Barry Simon says, "Mr. Gans suggests the right resolution 'would be to accept the challenge issued by Doron Witztum in an article in Galileo some time ago. He suggested that an independent linguistic expert whom everyone concerned agrees is impartial and qualified should be asked to provide a new list of Rabbis and appellations. Such an experiment would test the original Rabbis experiment directly. To date, no one has accepted Mr. Witztum's challenge.' But Mr. Gans fails to note a critical aspect of Mr. Witztum's proposal which is 'to allow an independent authority to prepare a new list of names and appellations for the 32 personalities on the second list, using Prof. Havlin's guidelines.' Namely, after the wiggle room has been favorably frozen by the rules (only stated nine years after the original experiment), there is a proposal to ask an outside expert to come in constrained by these rules. This is hardly a test of the rabbis experiment - it is a charade." It is clear from this statement that the critics acknowledge that if an impartial and independent expert were to reproduce the list of appellations strictly following Havlin's rules, the experiment would succeed! Otherwise, the challenge would have been eagerly accepted. Dr. Simon claims that the "wiggle room has been favorably frozen by the rules" which were constructed to fit the list. Hence, deviation from Havlin's rules is not the issue. The critics are claiming that it is possible to retrofit rules to appellations. This implies that their counterfeit experiment on "War and Peace" completely misses the point. The critics would have to retrofit rules such that (a) Every one of the appellations used in their "War and Peace" "experiment" follows the rules, and (b) the rules can be confirmed as being "commensurate with both professional standards and common sense" by independent and impartial authorities. Until the critics can meet the challenge of accomplishing precisely that which they claim WRR accomplished, their "War and Peace" demonstration is vacuous. This conclusion follows logically from the critic's own statements.
[Comment:] Almost every word in the above argument is illogical nonsense. We can't even find anything worth refuting.
[Gans:] It is of interest to note that the challenge concerning the appellations can also be refuted on logical grounds without being an expert in Hebrew or bibliography. Gans' cities experiment uses precisely the same appellations as in lists 1 and 2, and produces very significant results. If the appellations were selected on the basis of their being in close proximity to the dates, why are they also in close proximity to the cities? Recall that Gans' idea of using city names came years after lists 1 and 2 appeared in preprints. It follows logically that the success of the cities experiment implies that the appellations in lists 1 and 2 were not selected on the basis of close proximity to the dates, i.e., the appellations were chosen honestly, and both the WRR experiment and the cities experiment are valid successes. The obvious retort is that the city names were selected to be in close proximity to the appellations. In this scenario, the conspiracy has grown to include Witztum, Rips, Havlin, Rav Deutch, Rav Fisher, and Zvi Inbal, who provided the list of cities to Gans, as well as Gans and Bombach who verified the authenticity and accuracy of the protocol and list.
[Comment:] The claim that the skeptics believe in a conspiracy involving a large number of people is a very common straw man raised by the codes proponents. In fact we have NEVER claimed that Havlin assisted Witztum in cooking the data. (Quite likely he did exactly what WRR said at the time: provide "valuable advice".) We have also NEVER claimed that Gans cooked his data; we believe Gans when he says that he did not prepare the data himself. If we take into account the very well studied behaviours of naive researchers who do not know how to distinguish between genuine phenomena and the fruits of their own wishful thinking, there is only ONE person who needs to have been involved in knowing fakery, and a handful of his disciples who must be involved in the cover-up (perhaps with good intent).
[Gans:] Recall, however, that every single item in the city list is produced by the Inbal protocol without exception. This is easily verified, and, in fact, is not challenged by the critics in their paper. The protocol itself was made public years ago by Inbal and it, too, has not been challenged. No "counterfeit" cities experiment in "War and Peace" (or anywhere else) has ever been successfully performed by the critics. In order to show that such an experiment could be faked, MBBK would have to produce their own linguistically correct protocol and follow it mechanically to produce an apparent success in "War and Peace". This has never been done. It is instructive to see just what MBBK do say about the cities experiment. On page 163 they say, "The only other significant claim for a positive result is the preprint by Gans (1995), which analyzes data given to him by an associate of Witztum. It was later withdrawn (Gans, 1998), but Gans recently announced a new edition which we have not seen. The original edition raises our concerns regarding the objectivity of the city data, as many choices were available". In other words, there is no direct challenge to the protocol or the list. Rather, the author of the experiment has himself "withdrawn" the experiment, so it surely must be invalid.
Had Gans truly withdrawn his experiment? MBBK refer to a statement made by Gans in 1998. It is instructive to see what Gans actually said: "This unwillingness to speculate on an outcome of an investigation while it is still ongoing has prompted some people to interpret that as evidence that I am no longer convinced that the Torah codes phenomenon, as detailed in WRR, is a real phenomenon or that I no longer believe that the conclusions drawn from my original cities experiment are correct. Let me then state in absolute terms that this is not true. To date, I have not uncovered a single fact or even a hint that the list of cities that I was provided was manipulated in an attempt to make the results of the experiment appear significant when, in fact, they are not significant. I have not uncovered a single fact that causes me to doubt that the conclusions drawn from the original cities experiment were accurate." (Gans, March 23, 1998). Does this sound like a withdrawn experiment?
[Comment:] Note that Gans is claiming we suppressed the facts even though we (not he) made his complete statement available on the internet and gave the address in our paper. As to whether he had "withdrawn" his experiment, we think it is a reasonable thing to say that he did. What does "I will not now claim that I have verified the cities experiment: there are still a few things left to check." mean other than that the results have been withdrawn until an investigation is completed? Furthermore, when Gans later announced that he had finished his investigation, changed the data a little, but refused to show it to us, isn't that reason enough for us to say that he "announced a new edition which we have not yet seen"? In fact, at the time of writing (June 2001), Gans is still refusing to show us his revised data. We are supposed to accept the success of his experiment purely on his say-so! Remember this when reading the next paragraph.
[Gans:] This sordid affair can thus be succinctly summarized as follows. The critics charged that the cities list was not honestly produced by Inbal. Gans responded by announcing that he would investigate their charges, and the critics then claimed that Gans had withdrawn his experiment - in spite of Gans' public declaration to the contrary. Compare MBBK's version of Gans' statement with their statement that "Nothing we have chosen to omit tells a story contrary to the story here" (MBBK, page 152). Furthermore, on May 11, 1999, a full month before the release of the MBBK paper on the Internet, Gans announced at the International Torah Codes Conference in Jerusalem, in the presence of one of the authors (Bar Hillel) that he had completed his investigation and found that the protocol and list were accurate (except for a handful of errors that were corrected), and the results very significant - 6/1,000,000. There was no challenge from Bar Hillel. Thus, Gans confirmed the Inbal protocol and list (except for a handful of errors); he did not announce a "new edition" that MBBK did not see. The protocol and list have been available to the public for years. MBBK claim that "many choices were available", but do not list any! The most telling flaw in MBBK's case against the cities experiment is that they claim many choices were available, presumably to allow the experiment to be tuned to "succeed". Yet MBBK have never succeeded in using these "choices" to tune a counterfeit cities experiment in "War and Peace". MBBK produce no challenge to the protocol or the list, but base their rejection of the experiment on the claim that the author himself has withdrawn the experiment, along with some vague and unsubstantiated notion of "many choices" being available. MBBK have been challenged publicly on several occasions to prove that the cities experiment could have been produced by taking advantage of wiggle room, by using the purported wiggle room to produce a counterfeit cities experiment in "War and Peace". They have so far refused the challenge.
The experiment "A Replication of the Second Sample of Famous Rabbinical Personalities" by Witztum described in the earlier section "Additional experiments" provides a second line of independent proof that the WRR experiment is not a hoax. In this case, the appellations were replaced by the simplest appellations possible: "ben name" ("son of name") where "name" is the name of the personalities' father as obtained from the Margalioth encyclopedia. No other appellations were used. The spelling rules used were exactly the same as specified in writing by Havlin and used for WRR. (In fact, the spellings are identical to the Margalioth entries with only two exceptions.) Every other component of the experiment is exactly the same as in WRR. Thus, there is no wiggle room in any of the components of the experiment. The probability obtained was 1/23,800.
The experiment "Personalities of Genesis and their Dates of Birth" by Witztum (and the version "Tribes of Israel" by Rips) provides a third line of independent proof that names and dates of birth are encoded in Genesis. The names are spelled exactly as found in the book of Genesis, and all other components are exactly the same as in WRR (except for one component taken from the "Nations" experiment of Witztum and Rips. As described earlier, this introduces at most a factor of 2 in the final results). The probability obtained here is at least 1/10,870 (including the factor of 2). It follows that none of the challenges raised applies to this experiment. The date forms, the proximity formula (with the factor of 2), the calculation of the probability, and the text of Genesis are all fixed from WRR and there are no appellations.
How does MBBK deal with these two experiments and their direct implication to the validity of WRR? They just ignore them! There is no mention of either of these experiments anywhere in their paper, even though both experiments were made public months before the MBBK paper appeared. In fact, MBBK, after discussing several older experiments, state, "The only other significant claim for a positive result is the preprint of Gans (1995)..." (MBBK, page 163). Given three independent experiments with no wiggle room that prove that the WRR experiment could not possibly be a hoax, MBBK deal with them by claiming that one was withdrawn by its author, and ignoring the existence of the other two! Nothing could be a stronger testimonial to their inability to find a flaw in any of these experiments.
[Comment:] We won't repeat what we already said about these things above.
[Gans:] The absence of a "counterfeit" to the cities experiment reveals that the critics have not demonstrated the possibility of selecting city names and spellings to insure an experimental "success" while strictly adhering to a protocol. We conclude that Rav Deutch and Rav Fisher were absolutely correct when they declared that the rules and lists of Havlin "do not contain an iota of fraud or deception".
Let us briefly summarize what we have determined. Since the text of Genesis, the date forms and the proximity formula are "list 1 - fixed", none could have been manipulated in any way to affect the outcome of the experiment on list 2 or any of the additional experiments described (cities, personalities in Genesis, replication of the list 2 sample, the nations prefix experiment). Since Professor Diaconis specified the method of calculating the probability, it too was not manipulated to affect the result. The method of calculating the probability is also "list 2 - fixed" and could not have been manipulated to affect the results of the additional experiments. As for the appellations, the following points are relevant:
1. The appellations were "list 1 and list 2 - fixed" and so could not have been manipulated for the cities experiment.
2. The cities names/spellings were produced by a strict application of the Inbal protocol without exception. The protocol and list are not challenged by MBBK. The critics have never demonstrated the feasibility of producing a counterfeit cities experiment.
3. The "replication" experiment uses no appellations at all - just the true names of the fathers of the personalities with the prefix "ben". This experiment directly confirms the results on list 2. This experiment is not challenged by MBBK.
4. The "personalities in Genesis" experiment uses no appellations - just the spellings of the names as found in the text of Genesis itself. This experiment provides a success similar in form to WRR. It is not challenged by MBBK.
5. Several Gedolim have verified both Havlin's rules as well as the appellations used in list 2, derived through the application of those rules.
6. The claim that the critics have "done the same" in "War and Peace" is false. They have admitted explicitly that their list is not consistent with Havlin's rules. Hence, their experiment bears only a superficial resemblance to the WRR experiment. Furthermore, the critics claim that the rules were retrofitted to the list (which is why they will not agree to an independent trusted expert forming a new list using Havlin's rules). Since they did not retrofit any rules to the list used in their "War and Peace" experiment, their experiment bears only a superficial resemblance to their view of the WRR experiment and proves nothing.
7. In the nations prefix experiment, the appellations and dates are replaced by a list provided by McKay et al. and hence were not subject to any manipulation by WRR.
[Comment:] Our reply to the above 7 points in summary:
1. Agreed.
2. We do challenge the protocol (as being suspiciously artificial and complicated) and we do challenge the assertion that it specifies the list exactly. We have no obligation to produce a counterfeit cities experiment as it is enough to have demonstrated (repeatedly) that fake experiments are not hard to construct. (However, the main reason we never tried to do this is that it would be a lot of work and we are sick of debunking the codes over and over and over.) The cities experiment is dead anyway because it has proved impossible to replicate.
3. Gans thinks that because we can't be bothered replying to some Witztum nonsense it means we can't see anything wrong with it! More than enough reply is given above.
4. See 3.
5. When we asked an expert on rabbinical history to compile appellations for WRR's second list on the basis of the style of the first list, the result failed to show any codes. For all his bluster about us failing to mention things, you would think that Gans might have mentioned this.
6. In fact, our War and Peace list obeys even the retrofitted rules of Havlin just as well as WRR's list does. Gans may have a different opinion of that, but he should not claim that we agree with him.
7. This is disproved above.
[Gans:] The McKay et al. paper
The paper "Solving the Bible Code Puzzle" by Brendan McKay, Dror Bar-Natan, Maya Bar-Hillel, and Gil Kalai (MBBK) was posted on the Internet in June 1999 and was accepted for publication by Statistical Science for the May 1999 issue. This paper addresses two main questions concerning the WRR experiment. In their words, "In precise terms, we ask two questions: Was there enough freedom available in the conduct of the experiment that a small significance level could have been obtained merely by exploiting it? Is there any evidence for that exploitation?" (MBBK, page 151). It is clear that the answer to these questions is "no", based on the analysis presented in this paper up to this point. We have shown clearly that (1) for the most part, there is no wiggle room, and (2) even in those cases where there theoretically is wiggle room (as for list 1, or for the appellations in lists 1 and 2) it was not exploited. Therefore any "evidence" for such exploitation must necessarily be inconclusive. One cannot have conclusive evidence for something that is not true. Yet, MBBK do present evidence, and it is worthwhile understanding why that evidence is not conclusive. In addition, the MBBK paper does raise a few issues that we have not yet discussed.
The tool used by MBBK to provide evidence that "wiggle room" was exploited to produce apparently significant statistical results in WRR is called "the study of variations". On page 158 they say, "...there is significant circumstantial evidence that WRR's data is indeed selectively biased toward a positive result. We will present this evidence without speculating here about the nature of the process which lead to this biasing. Since we have to call this unknown process something, we will call it tuning. Our method is to study variations on WRR's experiment. We consider many choices made by WRR when they did their experiment, most of them seemingly arbitrary (by which we mean that there was no clear reason under WRR's research hypothesis that they should be made in the particular way they chose to) and see how often these decisions turned out to be favourable to WRR". In other words, since these choices are "arbitrary", by chance alone one would expect about half to be favorable to WRR and the other half to be unfavorable. If most of the choices turn out to be unfavorable to WRR, this is evidence that the WRR parameters were chosen because they were known to be favorable, i.e., better than many other choices of parameters. Thus, there must have been many "hidden" experiments used to effect the "tuning" of these apparently arbitrary choices.
The first flaw in MBBK's case is in their very words: "...there was no clear reason under the research hypothesis that they should be made in the particular way they chose to...". As an example of the problem, let us apply this argument to the proximity formula (which MBBK do on page 169). The idea that all the choices in the proximity formula (and there are many) were made blindly is absurd. We fully expect that Witztum and Rips made several observations of the phenomenon before constructing a hypothesis to test in an experiment. Thus, we would expect that any decent mathematician - and Rips is world class - would construct a proximity formula that truly measures the Torah codes phenomena that they observed, providing the phenomena is real. If the phenomenon is not real, then the proximity formula is meaningless and it is not expected to work better than other formulas on new data in an experiment. It follows that the formula is expected to look "tuned" if the Torah codes phenomenon is real because it was tuned - on earlier observations of different data. This is perfectly honest, and is, in fact, the normative scientific procedure.
[Comment:] A big problem with Gans' speculative explanation is that the proximity measure is far too complex for all its details to be matched by just observing a few examples. It would require an extensive investigation, of which there is no record. Furthermore, WRR have never claimed that they tuned the measure in detail (or even at all), despite the many things they have written on this subject. What they in fact wrote in Statistical Science was:
We stress that our definition of distance is not unique. Although there are certain general principles (like minimizing the skip d) some of the details can be carried out in other ways. We feel that varying these details is unlikely to affect the results substantially. Be that as it may, we chose one particular definition, and have, throughout, used only it ...
In other places, they even suggested specific variations without eliminating them on the grounds Gans speculates. To investigate this issue further, we asked Yoav Rosenberg (who wrote the computer programs) if the details of the success measure had been fine-tuned before the Rabbis data had been collected. He replied (in writing, twice) that he did not know of any such tuning.
Beyond this, it has been necessary for Witztum to adopt many different methods of analysis in order to give his "experiments" the success he desires. The idea that one method is "the right one" is therefore refuted by Witztum himself.
It is of course possible that WRR simply hit upon the "right" measure to use for famous rabbis and their dates, by accident. That would explain the results of our analysis of variations. This lucky accident seems to be at least as unlikely as the significance level WRR achieved, but if they wish to ascribe their result to luck that is fine with us.
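To make the arithmetic of the variations study concrete: if each arbitrary choice really were a fair coin (favourable to WRR or not), the count of favourable variations would be binomial. Here is a minimal sketch in Python, with purely illustrative numbers; as our paper stresses, the variations overlap and are not independent, which is one reason we offered no quantitative significance level.

    # Illustrative sign test for a tally of "favourable" variations.
    # Assumes independence, which the real variations do not have.
    from math import comb

    def sign_test_p(n_variations: int, n_favourable: int) -> float:
        """One-sided chance of at most n_favourable favourable outcomes
        among n_variations independent fair coin flips."""
        total = sum(comb(n_variations, k) for k in range(n_favourable + 1))
        return total / 2 ** n_variations

    # E.g., if only 20 of 67 variations favoured WRR (counts discussed
    # later in this document), independence would give roughly 0.0007.
    print(sign_test_p(67, 20))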
[Gans:] In scientific research, one makes observations, forms a hypothesis based on those observations, and then tests the hypothesis on data that is disjoint from the original observations. Hence, if one finds evidence of tuning, there are two possibilities. (1) Torah codes exist and Witztum and Rips correctly constructed the mathematical parameters of the WRR experiment based on prior observations of other anecdotal examples. (2) Torah codes do not exist and therefore the mathematical parameters of the WRR experiment should be totally arbitrary.
[Comment:] So far Gans is correct, but then he loses track of the subject. Our argument considers possibility (1); it does not rely on assuming (2). The issue is how WRR could have gotten almost all the parameters exactly right based only on anecdotal examples. Even they admit that some of these parameters are arbitrary.
[Gans:] If (2) is true and evidence for tuning exists, this implies that these parameters were tuned to produce an apparently significant result on the WRR experiment. Consequently, if one assumes that there are no Torah codes (case 2, above), then evidence of tuning is evidence of a fraudulent experiment. We see then that the use of evidence of tuning to discredit WRR works only if one assumes first that there are no Torah codes (case 2 as opposed to case 1). This is known as "circular reasoning". What we have shown here is that even if there is evidence for tuning, this is not evidence that the WRR experiment was not done honestly - unless one first assumes that there are no codes in the Torah. The apparent tuning may simply be the result of Rips and Witztum constructing an appropriate measuring tool to detect the phenomenon based on previously observed examples of codes. In fact, we can go further. Given the evidence that we have presented that tuning could not have taken place, evidence for tuning is actually evidence for the existence of codes. As we pointed out, if no tuning took place and Torah codes do not exist, then there can be no evidence of tuning. This logical argument effectively destroys MBBK's entire "study of variations" on WRR. We shall see shortly, that in any case the evidence for tuning is non-existent.
At this point, the reader may wonder why MBBK discuss tuning of the proximity measure and the date forms when we have already shown that these components could not possibly have been tuned. They were fixed for the experiment on list 1 and thus could not be changed for list 2. MBBK ask this very question and provide two answers. On page 158 they say, "This naturally raises the question of what insight we could possibly gain by testing the effect of variations which WRR did not actually try. There are two answers. First, if these variations turn out to be overwhelmingly unfavourable to WRR, in the sense that they make WRR's result weaker, the robustness of WRR's conclusions is put into question whether or not we are able to discover the mechanism by which this imbalance arose. Second, and more interestingly, the apparent tuning of one experimental parameter may in fact be a side-effect of the active tuning of another parameter or parameters." The first answer given by MBBK is of no relevance to the question of whether the experiment on list 2 was honestly done or not. If anything, a lack of robustness implies that only a narrow range of parameters uncovers the Torah code phenomenon. This is expected if the phenomenon is real - if one tries to detect the code the wrong way, it is not detectable. Concerning the second answer, MBBK admit that for list 2 almost nothing could have been tuned except the appellations (MBBK, page 159). However, they have no way to prove that the appellations were tuned. Instead, they attempt to bring evidence that the proximity formula and the date forms look tuned and then rely on the possibility that this "may" be a side effect of active tuning elsewhere. Nowhere do MBBK prove or provide empirical evidence that this statement is true. Their entire thesis concerning active tuning of the experiment on list 2 is based on a "may be"! Until this presumed link between active tuning of one parameter and apparent tuning of another parameter is proven, the strength of their argument reduces to "may be"; a logical argument is no stronger than its weakest link.
[Comment:] We had to write "may" rather than "will" because the effect is a stochastic rather than a deterministic one. "If you fall from a 7-story building, you MAY die". It is wrong to say "WILL die" because some people survive such events. It is normal to use English in this precise (if not pedantic) way in mathematical publications.
Gans then goes on to claim that we give no evidence for this effect. In fact, the very next paragraph in our paper is devoted to explaining why this effect should be expected. Perhaps he cannot understand our explanation, but it is simply dishonest to claim that we gave none. Our explanation is in fact precise enough that any competent statistician could turn it into a mathematical theorem.
This statistical process has also been verified by extensive computer simulations.
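For the curious reader, here is a minimal simulation sketch of the mechanism (our illustrative model, not the simulation code referred to above): select data for how well it scores under one fixed setting of a parameter, and that setting will afterwards look optimal even though the parameter itself was never varied.

    # Hypothetical model: items (e.g., candidate data choices) are scored
    # under several settings of a fixed parameter; the scores share a
    # common component, so they are correlated across settings.
    import random

    random.seed(1)
    N_ITEMS, N_SETTINGS = 200, 5

    scores = []
    for _ in range(N_ITEMS):
        common = random.gauss(0, 1)
        scores.append([common + random.gauss(0, 1) for _ in range(N_SETTINGS)])

    # "Tuning": keep only the half of the items that do best under setting 0.
    kept = sorted(scores, key=lambda s: s[0], reverse=True)[:N_ITEMS // 2]

    # Setting 0 now scores clearly best on the kept data, so varying the
    # never-touched parameter is mostly "unfavourable": apparent tuning.
    for p in range(N_SETTINGS):
        print(p, sum(s[p] for s in kept) / len(kept))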
[Gans:] It is essential to note that even if their "may be" turns out to be correct, the implication goes the wrong way. They claim that active tuning of a parameter (i.e., the appellations) may cause an apparent tuning in another parameter (e.g., the proximity formula). That is to say, statement A: "Active tuning of parameter X1 (may) imply apparent tuning of parameter X2". MBBK then claim to demonstrate that parameter X2 appears tuned. From this they wish to conclude that X1 must have been actively tuned. This conclusion, however, is only valid if statement B: "Apparent tuning of parameter X2 implies active tuning of parameter X1" is true. But statement B is the converse of statement A and logically the truth of a statement does not imply the truth of its converse. This is a well-known fallacy in formal logic.
[Comment:] Our first draft of the present document stated "Incredibly, Gans does not understand the difference between logical deduction and the scientific method.", but then we found that he understands it perfectly well:
Louis Pasteur observed that the blood of animals afflicted with anthrax all seemed to have microscopic squigilies in it. He formed the hypothesis that these squigilies (bacteria) cause anthrax (as opposed, for example, to the possibility that the squigilies were caused by the anthrax). This was a posteriori reasoning. But it was next necessary to verify the hypothesis by testing it. Pasteur made a prediction: if the blood of sick animals is injected into healthy animals then the healthy animals will get anthrax, but if the bacteria is first removed from the blood and then injected, the animals will not get sick even though it has the blood of a sick animal in it. When this prediction is confirmed, it lends strong support to Pasteur's hypothesis. This is a priori reasoning and helps verify the truth of Pasteur's hypothesis.
[previous article of Gans]
So how can it be that Gans has forgotten his own words? It is not necessary that the hypothesis follow logically from the predictions. On the contrary, it is necessary that the predictions follow logically from the hypothesis. (This is something known to every scientist, and is not in the slightest bit contentious.)
[Gans:] We have refuted the MBBK thesis of tuning thrice over on purely logical grounds. First, we have shown that even if there were evidence for tuning, this would not imply that the experiment on list 2 was a hoax unless one first assumes that the Torah codes do not exist. This is circular reasoning. Second, their evidence rests on an assumption for which they provide no proof or evidence. Third, their argument rests on the false logical assumption that the truth of a statement implies the truth of its converse. Note, too, that the additional experiments of Witztum and Gans could not be tuned at all; MBBK deal with those experiments by ignoring them. There is, however, still another problem with MBBK's evidence. How can we be certain that the choices of parameters of WRR made by MBBK for documentation of tuning were not themselves tuned? Since tuning is an "unknown process" (MBBK, page 158), perhaps it infects the choices made by MBBK? Clearly, a bias in selecting which parameters to examine for "tuning" will result in a biased conclusion. As in MBBK, we will not speculate on the causes of such tuning except to note that they do say on page 152, "Nothing we have chosen to omit tells a story contrary to the story here" and on page 161, "Our selection of variations was in all cases as objective as we could manage; we did not select variations according to how they behaved". Nevertheless, their study of variation seems to manifest tuning. For example, we showed in the section above on "hidden failures" that changing the number of personalities used on list 1 (to 20) and changing "bunching threshold" in the calculation of P1, whose value can be arbitrarily specified, from 0.2 to 0.5 makes the measure 16 times stronger. For the entire list 1, this single change makes the P1 measure about 7,000 times stronger! Let us examine how MBBK present this fact. On page 171 they state, "Table 10 shows.... The same table shows the effect of changing the cut-off 0.2 used to compute P1 and P3. Values greater than 0.2 have a dramatic effect on P1, reducing it by a large factor (especially for the first list). However, the result of the permutation test on P1 does not improve so much, and for the second list it is never better than for P4". Now here we have definite evidence that WRR did not tune list 1, since at the time that the WRR paper was first submitted for peer review, Diaconis had not yet suggested the permutation test. Nor had list 2 been requested. Hence, the only measure of success at the time was P1. If we next examine MBBK's table 10 numbers for cut-off values greater than 0.2 we find the following entries:
Cut-off defining P1 | P2 (list 1) | min. rank (list 1) | P4 (list 2) | min. rank (list 2)
0.25 | 1 | 0.8 | 1 | 1.1
0.33 | 1 | 1.0 | 1 | 1.0
0.4 | 1 | 1.0 | 1 | 1.0
0.5 | 1 | 0.4 | 1 | 1.0
(Each entry is the value under the variation divided by the value for WRR's original choice; the column meanings are explained below.)
Values smaller than 1 represent improvement, e.g., 0.5 would be twice as strong. An entry of exactly 1 means that the variation does not apply to that measure, so the value cannot change. Four score changes are presented for each variation. The greatest change is for a P1 cut-off of 0.5, and it is only 0.4 - a bit more than twice as strong. This certainly does not seem to be very "dramatic" and it is a far cry from 7,000 times better! How do we understand this? The reason is quite simple. If we go back to MBBK's page 160 we find that the four values listed in the table are the P2 value for list 1, the least permutation rank for list 1, the value of P4 for list 2, and the least permutation rank for list 2 (each divided by a constant). P1 is not shown in Table 10 even though it purports to examine "Cut-off defining P1"! In this way, the "dramatic" improvement that MBBK mention in the text of their paper is nicely hidden from view when examining their table.
[Comment:] Our paper presents four success measures consistently in the tables, and the choice is justified at length. We had to maintain that consistency or our analysis would have been broken, but still we made sure to raise the issue in our text despite that not being strictly necessary. The expression we used: "has a dramatic effect on P1, reducing it by a large factor" is perfectly adequate. Also note that P1 is not suitable for the analysis we were performing, for reasons given in our paper but ignored by both Witztum and Gans.
Incidentally, it is interesting to note that cutoff 0.2 apparently does better than 0.5 for the early rabbis experiment described by Rips in his lecture, although we can't be sure of this without knowing what appellations were used. Perhaps the cut-off was adjusted for that prior experiment and not tested again later. In any case, if we are on trial for robbing 10 banks, we can't get off by proving there was an 11th bank we didn't rob.
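For readers unfamiliar with the permutation test that recurs throughout this exchange, here is a generic sketch (not WRR's program or ours). The significance level is the rank of the true pairing's score among the scores of many random re-pairings; WRR's published result was obtained this way from a million permutations.

    import random

    def permutation_rank(score, names, dates, n_perms=10_000, seed=0):
        """Fraction of pairings (random re-pairings plus the true one) that
        score at least as well as the true pairing; smaller score = better."""
        rng = random.Random(seed)
        true_score = score(names, dates)
        count = 1  # the true pairing counts itself
        for _ in range(n_perms):
            shuffled = dates[:]
            rng.shuffle(shuffled)
            if score(names, shuffled) <= true_score:
                count += 1
        return count / (n_perms + 1)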
[Gans:] Effectively MBBK's notation, which controls what is revealed and what is hidden, has been tuned! Specifically, for list 1 it hides what is essential, P1. We shall see more examples of this phenomenon shortly. Note, too, that in studying different date forms (MBBK, page 156), the date forms used by WRR were not optimized (for list 1!). Furthermore, all of the non-standard date forms did poorly - a result that might be expected if the Torah codes are real.
[Comment:] There are no non-standard date forms mentioned in our paper, so this statement is vacuous. Also, Gans ignores the fact that P1 is not suitable for this variational analysis for the reasons explained in our paper (as we said above).
[Gans:] Let us now turn to the presentation of McKay's evidence of tuning. On page 169 they provide a table of different proximity measure variations and how they affect the strength of the WRR proximity measures and probabilities. There are 67 variations considered of which only 20 show at least one of the measures improving. Bear in mind that as a result of their choice of notation, one can only see the improvements that MBBK want revealed. For example, it is conceivable that P1 improves for all 67 variations but their table would not show any trace of this improvement. Consider the following example. Suppose we wish to study 4 variations. We calculate the results of variations 1 through 4, and then add in the results of variation 4 combined with the first 3 variations. We thus produce 7 values for only 4 variations. If variation 4 and each of variations 4 combined with the first three variations produces weaker scores does this imply that 4 variations are weaker, or does it mean that variation 4 is weaker and it does not matter what other score it is combined with? MBBK actually do precisely this! They list 34 variations (33 in the first column and 1 at the top of the second column) and the 34th variation combined with the other 33. This 34th variation produces very weak scores - on the average it makes the four proximity measures 95 times weaker. When it is combined with the other 33 variations, not a single one makes the WRR measures stronger. On the other hand, if we look at the 34 variations by themselves, 20 of them have at least 1 improved measure. It is also interesting to note that MBBK do not present a table of values for multiple changes which include changing the P1 bunching threshold from 0.2 to 0.5; presumably there would be too many improvements to support their hypothesis! We see, then, that the presentation of the "evidence" for tuning has itself been tuned!
[Comment:] Gans is correct that the variations we present are not independent of each other. This is clearly stated in our paper and is one reason we did not attempt to compute a significance level for our observations. Gans' criticism is essentially that we gave too much information to the reader. Table 5 can be interpreted as showing one parameter with 34 values and another with 2 values (giving 68 combinations), or as showing one parameter with 68 values. It doesn't make any difference. In fact, Table 5 can be taken as an illustration of the total modularity property (see page 159 of our paper).
[Gans:] For the WRR experiment, we have shown that tuning is either impossible (the date forms and the proximity formula), or we can present strong evidence that there was no tuning (the appellations and the testimonies of Rav Deutch and Rav Fisher, and the additional experiments of Witztum and Gans). This is not possible for MBBK's presentation of evidence since we cannot know how many variations not presented would show evidence against their thesis. In fact, their thesis must be taken on faith. Perhaps it is for all these reasons that MBBK state, "...we are not going to attempt a quantitative assessment of our evidence. We merely state our case that the evidence is strong and leave it to the reader to judge" (MBBK, page 159). In other words, all this "evidence" rests on a "may be", has two fatal logical flaws, shows evidence of itself being tuned, and cannot be quantified. The "evidence" is left to the individual reader's judgement!
On page 161, MBBK raise another objection to the WRR results. They compare the P2 proximity measures obtained for lists 1 and 2, viz.: 1.29E-9 and 1.15E-9, and note that they are unexpectedly close. The claim is that they are closer than expected even if the WRR hypothesis were true. They conclude that this is evidence that list 2 was tuned to give a proximity measure close to list 1 because of "naive statistical expectation" that the two lists should give close proximity measures. MBBK estimate the probability of the two measures being so close as less significant than 1/100 and therefore admit that one cannot "conclude too much from" this example alone (MBBK, page 161). They next proceed to bring another example to bolster their theory - and herein lies the downfall of their thesis. They note that if one partitions the Gans city list into its list 1 and list 2 components, the probabilities obtained on the two parts are again very close. They estimate the probability of this closeness is less than 1/500 - clearly too small to be ascribed to chance alone. Before we deal with this "closeness" phenomenon, note a subtle change in the way MBBK report their results. For the WRR experiment, they measure the closeness of proximity measures, while for the Gans experiment they measure the closeness of probabilities. This is like mixing apples and oranges. In looking at two samples that supposedly manifest the same phenomenon, the measures used on both samples should be the same. The fact that MBBK chose to switch measures suggests that several different things were measured, and whatever happened to support their hypothesis better was quoted in each case. Thus, the presentation of the MBBK data once again shows evidence of itself being tuned.
[Comment:] In each case, the measure compared was the measure used by the experimenters. The only p-value given explicitly by WRR in their preprints was P2, so we used that. The only p-value used by Gans was the permutation test applied to (a variant of) P4, so we used that. In both cases, we used the only measure which could have tested our hypothesis. (Gans knows this; his article has deteriorated to near-Witztum quality.)
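As a back-of-envelope illustration of the closeness calculation: the assumed spread of the log-scores (one order of magnitude) is an arbitrary choice of ours for this sketch, not the distribution used in our paper, so only the order of magnitude of the answer means anything.

    import math
    import random

    random.seed(1)
    # Observed gap between the two measures on a log10 scale: about 0.05.
    obs_gap = abs(math.log10(1.29e-9) - math.log10(1.15e-9))

    # How often do two independent log10-scores, each with a spread of one
    # order of magnitude, land within that gap of each other?
    trials = 100_000
    hits = sum(abs(random.gauss(0, 1) - random.gauss(0, 1)) <= obs_gap
               for _ in range(trials))
    print(obs_gap, hits / trials)  # a few in a hundred under this assumption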
[Gans:] Even if we assume that their observations are valid, recall that every single city name in lists 1 and 2 was generated mechanically using the Inbal protocol. There is not one exception. Therefore, it is absolutely impossible that the list 2 component of the cities list could have been tuned to match the list 1 component! We have thus proven that this "closeness phenomenon" between lists 1 and 2 does not imply that list 2 was tuned to match list 1. The "closeness phenomenon", if it is not simply a creation of MBBK's tuning, shows that there is something present in these lists which is not random - and not a result of tuning. But then, this observation is consistent with the WRR hypothesis that there is something very non-random in these lists. The specific cause of this "closeness", if it really exists, remains to be explained.
We have disposed of all the arguments and evidence presented in the MBBK paper save one. We have left what they consider to be their strongest argument for last. In their conclusions, on page 167, they say, "Be that as it may, our most telling evidence against codes is that we cannot find them. All of our many earnest experiments produced results in line with random chance." Note the use of the word "earnest". This is important because if their experiments were not earnest, they cannot be expected to succeed, even if there are codes in the Torah. In fact, a necessary (but not necessarily sufficient) condition for an experimental success is that the experiment be carefully designed.
MBBK claim to have performed "many real experiments" (MBBK, pages 163 and 165) of which they document a handful. Let us examine some of the experiments that they do document. On page 164 they report on a "cities" experiment of Barry Simon in which he "uses the names of all cities mentioned in each rabbi's entry in Margalioth's encyclopedia as places of birth, death, living, working or studying, without any modification of spelling or addition of prefixes." First, note what has been added to the experiment. Gans only used cities of birth and death, paralleling WRR's use of dates of birth and death. The fact that 3 new categories of cities were added to the experiment means that the experiment has wiggle room, that is, it could possibly be tuned to fail.
[Comment:] The very same page of Simon shows the results with only the cities of birth and death, and that those results are also abject failures. Let us be generous and suppose that Gans merely failed to notice.
[Gans:] Even more wiggle room is introduced by the ambiguity of terms such as "living". Thus, for example, Simon lists "Jerusalem" and "Hebron" as places of living for the Rambam even though the Rambam only visited these locations. Does "visit" qualify as "living"?
Let us now examine a more fundamental problem with this experiment. Consider the conclusions to be drawn from a hypothetical experiment in which we use WRR's list 1 or 2, but use English dates rather than Hebrew dates. The failure of such an experiment would prove nothing. In fact, the success of WRR's experiment with Hebrew dates provides a strong expectation that an experiment with English dates would fail; if the dates are known to be encoded in Hebrew, why should they also be encoded in English? The same is true for the cities. There are 129 distinct city name/spellings on the list, of which 64 have Jewish names in addition to their secular names. (For the remainder, we assume that if there was no specifically Jewish name for a city, then the Jews must have referred to that city by its secular name. Thus, the Jewish name and the secular name are the same). The Margalioth encyclopedia often uses the secular names of the cities; a comprehensive list of the Jewish names is obtained from the articles on the cities (not the article on the rabbis) and the index in the Encyclopedia Hebraica. By using the entries from Margalioth without any modification of spelling, they have effectively replaced many of the Hebrew names with secular names, or left out many Jewish names. Thus, for example, the birthplace of Rabbi Yehuda HaChasid is given in Margalioth as "$PYRA", which is a Jewish name. However, the additional Jewish names/spellings for this city, "$PYYRA", "A$PYRA", "$PYY@R, and $PYR are obtained from the Encyclopedia Hebraica. This experiment is designed in such a way that much of the data is expected to fail and is expected to dilute any remaining statistical significance to within the range of random expectation. It is quite clear that this experiment, besides having been constructed so as to include wiggle room, was either very poorly designed, or tuned to fail.
[Comment:] Suppose it is true that half the cities have special Jewish names, and that half have equal Jewish and non-Jewish names. Then the data of Simon will still have about half of the cities that Gans claims to be encoded by God in Genesis. This would mean that Simon should have still obtained a positive result, albeit a much weaker one. Instead of 6/1,000,000, one could hope for a few in a thousand, or at worst a few in a hundred. What happens? Gans' success measure (P4) gets 1/3. Not even the minutest sign of non-random behaviour. This shows that Gans' explanation of why Simon's experiment failed is wrong.
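The arithmetic behind "a few in a thousand" is easy to check under the standard assumption that the overall z-score scales with the square root of the quantity of genuinely encoded data (a sketch of ours, not a calculation from either paper):

    import math
    from statistics import NormalDist

    nd = NormalDist()
    z_full = nd.inv_cdf(1 - 6e-6)   # z-score matching Gans' 6/1,000,000: ~4.38
    z_half = z_full / math.sqrt(2)  # only half the data genuinely encoded: ~3.10
    p_half = 1 - nd.cdf(z_half)     # ~0.001, i.e. "a few in a thousand"
    print(z_full, z_half, p_half)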
From the examples Gans gives, it is clear that the question of Jewish versus non-Jewish name is only one of the points where Gans and Simon differ. Looking at the data itself shows that a much larger source of difference is spelling variation. In this respect, it is worth noting that Gans often uses spellings that violate the rules allegedly used for spelling appellations. Moreover, sometimes his spellings depend on the accident of whether a city has its own entry in the encyclopedia. In other words, the appellations follow one set of spelling rules and the cities follow another. One wonders if this aspect of the rules was certified as the correct way by all those notable people he consulted.
Note that the new replications of the experiment also show no trace of codes. Even if only a part of their data is correct (despite the expertise of their compilers) at least a weak positive result should have been obtained. The fact that this didn't happen is proof that the success of Gans' experiment is entirely due to the oddities of its protocol and the choices made in its implementation.
[Gans:] MBBK also mention experiments which pair the rabbis of list 2 with their years of birth and with the names of the books that they wrote. The details of these experiments were first made public in May 1997. Here are a few observations on their data, taken from a letter to "Jewish Action", vol. 59, No. 2 (page 90) by Doron Witztum: (a) the books of the Vilna Gaon were represented by a single book on geometry. (b) Many books were listed incorrectly. For example, they listed "YD @ZQH" instead of "YD H@ZQH" and "HHRQBH" instead of "HBPRHH". (c) Of the 66 dates given, at least 11 are incorrect. For example, R' Avraham the son of the Rambam was assigned a year of death 48 years before he was born!
[Comment:] Witztum knew long before his Jewish Action article that:
(a) "The books of the Vilna Gaon were represented by a single book on geometry" only because the other famous books were eliminated by a rule that Witztum himself invented. That rule eliminates names more than 8 letters long. Putting them back in despite Witztum's rule would not help either, as none of the longer titles has an ELS.
(b) Correcting all the erroneous book names did not make any difference. The experiment still fails miserably.
(c) Correcting all the erroneous dates also leaves the experiment an abject failure.
We expect Witztum to suppress such facts. What is Gans' excuse?
See here for more information on these experiments. Notice how Rips attempted to manipulate the rules of our experiment after seeing the results, openly and with conviction, yet Gans and others stubbornly deny the possibility of the same thing having happened even in secret with any of WRR's experiments.
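Incidentally, the phrase "has an ELS", used above, is simple to state in code. A generic sketch (our illustration only; the real programs do much more, for example taking account of how close each skip is to the minimal one, and also allowing the word to run backwards):

    def has_els(text: str, word: str, max_skip: int) -> bool:
        """Does `word` occur in `text` as equally spaced letters with some
        skip d, 1 <= d <= max_skip? A skip of 1 is an ordinary occurrence."""
        n, m = len(text), len(word)
        for d in range(1, max_skip + 1):
            for start in range(n - (m - 1) * d):
                if all(text[start + i * d] == word[i] for i in range(m)):
                    return True
        return False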
[Gans:] Let us examine still another example of one of MBBK's "earnest" experiments. It will be recalled that in WRR's experiment on list 2, 4 overall proximity measures were calculated. We indicated that the 3rd and 4th measures are minor modifications of the 1st and 2nd measures respectively. We shall explain what this modification consists of. For technical reasons, the process used to calculate the proximity measures can only do so for words that are between 5 and 8 letters long, inclusive. Many of the appellations provided by Havlin have the prefix "rabbi" (in Hebrew, RBY, a 3-letter word). As a result, there are only 5 letters left for the name. In such a case, only the personal name of the personality was used, not the surname. An appellation without the prefix "RBY" has much more chance of including the surname, and thus being distinct. A simple example will illustrate this principle. The first 3 personalities on list 1 and the first 4 personalities on list 2 all have the same appellation, "Rabbi Avraham" (RBY ABRHM). Thus, there are 7 personalities with the same appellation. On the other hand, there are 13 remaining appellations for 6 of these 7 personalities, and each is uniquely associated with a single personality. It follows that if one wishes to maximize use of appellations that correspond to a unique personality, one must exclude appellations with the prefix "rabbi".
[Comment:] This is a very good example of why we should not take Gans too seriously. Here he is criticising our use of appellations of the form "Rabbi X" because they don't uniquely identify the rabbis. A little while ago (see above) he reported how Witztum used appellations of the form "ben Y" (where Y is the father's name), and Gans thought that was just fine. But "ben Y" no more identifies the rabbi than "Rabbi X" does! There are four "ben Avraham"s, four "ben Moshe"s, three "ben Yakov"s, two "ben Yitzhak"s and two "ben Shmuel"s in Witztum's list. We see that Gans is just using whatever argument he fancies in each instance without a care about consistency.
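Incidentally, the 5-8 letter constraint that Gans describes is easy to state precisely. A hypothetical helper (our code, using the Michigan-Clairmont transliteration of this document) shows why the 3-letter prefix RBY leaves no room for a surname:

    def usable(appellation: str) -> bool:
        """True if the appellation, ignoring spaces, is 5 to 8 letters long,
        the range the proximity computation can handle."""
        letters = appellation.replace(" ", "")
        return 5 <= len(letters) <= 8

    print(usable("RBY ABRHM"))     # True: RBY (3) + ABRHM (5) = 8 letters
    print(usable("RBY ABRHM BN"))  # False: even two more letters exceed 8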
[Gans:] The 3rd and 4th overall proximity measures are exactly the same as the 1st and 2nd measures respectively, except they are applied to this subset of appellations from the original list. We thus have 3 sets of appellations associated with each list. (1) The entire list. (2) The subset scored by the 3rd and 4th overall proximity measures, in which appellations tend to be uniquely associated with personalities, and (3) the difference between (1) and (2). The Gans city experiment was scored with the 3rd and 4th overall proximity measures only because it is clearly advantageous to have a unique association between personalities and appellations, given the hypothesis of codes. (In fact, there is a down side to scoring set (2) only: the data size is smaller. Perhaps this is why for list 2, the 4th probability is best, but the 2nd is close behind). No one has ever suggested using set (3) above for an experiment because it makes no sense. Why score only non-unique data that is desirable to discard? Yet, this is exactly what MBBK describe on page 165. Not only is the use of this data questionable, the data size has been reduced from a total of 188 appellations to 55. Cutting the data size down to less than a third its original size, and retaining only the "undesirable" data is a sure way of insuring that any statistical significance in the original data will be destroyed. We can appreciate just how serious this reduction of data size is by computing what effect it has on the two proximity measures, even if we ignore the questionable quality of the data that has been retained. Just reducing the data size as indicated is expected to reduce the first proximity measure from a sigma value of 6.42 to 3.47, a dramatic weakening of the result. As for the second proximity measure, it is expected to change from 5.2E-9 to 5.7E-3, or 1,100,000 times weaker. We again have an experiment that is either poorly designed or expertly designed to fail.
[Comment:] Here Gans proves himself wrong. He writes that the first measure should change from 6.42 to 3.47 sigma (corresponding to about 0.00026) but it actually is 0.079 (300 times worse). Similarly the second measure is predicted to give 0.0057 but it is actually 0.434 (76 times worse). It is completely obvious that a value like 0.434 is incompatible with Gans' claim that we are just seeing the effect of reducing the data size.
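Both halves of this exchange are easy to reproduce numerically (a sketch; the square-root scaling is the standard sample-size adjustment Gans invokes, and the tails are normal):

    import math
    from statistics import NormalDist

    nd = NormalDist()
    sigma_reduced = 6.42 * math.sqrt(55 / 188)  # Gans' prediction: ~3.47
    p_predicted = 1 - nd.cdf(sigma_reduced)     # ~0.00026
    print(sigma_reduced, p_predicted)
    print(0.079 / p_predicted)                  # the observed 0.079 is ~300x worse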
[Gans:] It is important to note that the objections raised to this experiment are specific to this data set and may not apply to other data sets. Thus, for example, in Witztum's "replication" experiment, both "ben name" and "ben rabbi name" were used. However, in that case the number of unique names of personalities is the same for both forms, and correct forms are not excluded. There are also legitimate experiments done on small data sizes, e.g., "personalities in Genesis". The problem is not in a data size being small a priori (although, from a statistical point of view, bigger is usually better), but rather in the a posteriori reduction of the size of a set, whose score is known, to a point where the expected value of that score on the reduced data is within random expectation.
If these experiments are typical of MBBK's "many earnest" experiments, then their "most telling" evidence falls, along with all the other "evidence" reported in their paper. Finally, note that if MBBK feel that their inability to perform a successful Torah codes experiment is the strongest argument against codes, then surely the success of Witztum and his associate, Rips, Gans, Rottenberg, Bombach, and Schwartzman in performing new, mathematically verifiable, and wiggle-free codes experiments is the strongest argument for codes.
[Comment:] We have not seen any experiments matching that description.
Incidentally, note how Gans fails to mention our detailed replication of the great rabbis experiment using an independent expert to prepare the data. It isn't that he didn't see it, it is that he knows it is too damaging to be mentioned.
[Gans:] At this point, the reader may wonder why Witztum and Rips did not submit a rebuttal of the MBBK paper to Statistical Science. The reason is straightforward. On June 23, 1998, Professor Rips sent an email to the editor of Statistical Science stating, "I would appreciate very much if you [could] possibly let me know whether the journal 'Statistical Science' has received papers or comments or remarks or other material related to our paper 'Equidistant Letter Sequences in the Book of Genesis' (by Witztum, Rips and Rosenberg) which appear in the issue of Aug 94. I would be most grateful if you would send such material to us, in order that we will be able to respond to it and explain our point of view." The response from the editor of Statistical Science came the same day: "We have received papers related to your August 1994 Statistical Science article. These papers are currently undergoing review, and until a decision is made whether or not to publish them, we are bound by confidentiality not to disclose their contents or their authors. If a decision is made to publish any of these papers, you will be sent copies and invited to respond."
In May 1999, an Internet posting indicated that Statistical Science had decided to publish the paper by MBBK in that month's issue. Neither Rips nor Witztum had been informed of the publication, nor were they "invited to respond". On May 7, 1999, Witztum sent the following email to the editor of Statistical Science. "It was brought to our attention that you have decided to publish a paper by MBBK related to our Statistical Science article. If I may, I would like to remind you of your letter dated 30 June 1998 to Prof. Rips. If there is any misunderstanding on my behalf, please let me know." The response came the same day: "There is no misunderstanding on your part concerning my letter of June, 1998, but there has been a change of policy based on advice that I have received.... I have decided to publish the MBBK article without commentary of any sort... I regret having to retract my invitation to Professor Rips to respond to the MBBK article. I remind you, and him, that your original article also was not discussed..."
No commentary is needed; the above letters speak for themselves.
[Comment:] Gans omitted to mention that Rips was invited to submit a reply for processing in the normal fashion. What he was denied was the right to bypass the scrutiny of the journal's referees. Whether this was the correct decision can be debated, but it was common practice and entirely within the editors' accepted discretion. Note that WRR's original paper was also published without commentary, even though a major critic (Diaconis) offered to write one.
Incidentally, MBBK were not opposed to allowing WRR to reply and said so to the editors. A published reply would have allowed us to demonstrate that WRR had no real defence against our findings.
[Gans:] Conclusions
There is a popular saying, "Where there is smoke, there is fire". One must wonder how Torah codes could be real given that so many arguments and pieces of evidence have been brought against them. Nevertheless, we have systematically refuted all of the critics' arguments. We have detailed new experiments that are "wiggle-proof" and that even MBBK do not attempt to refute. One of these experiments uses a word list provided by McKay and his associates! (This was the Nations prefix sample). We have even seen evidence that the very "tuning" that they accuse WRR of infects the critics' evidence and experiments. We have seen numerous instances where obvious opportunities for tuning by WRR were passed up. We have documented a plethora of logical flaws in the critics' reasoning.
Still, the MBBK paper passed peer review for a highly respected journal. This is not easily dismissed. It is, perhaps apropos to quote from the directions for submissions to another highly respected journal: ECONOMETRICA. "If you plan to submit a comment on an article that has appeared in ECONOMETRICA, please follow these guidelines: First send a copy to the author and explain that you are considering submitting the comment to ECONOMETRICA. Second, when you submit your comment, please include any response that you have received from the author. ...Authors will be invited to submit for consideration a reply to any accepted comment of this kind." It is abundantly clear that if a party refuses to hear one side of a dispute, that refusal is itself evidence of bias. Furthermore, without input from both sides, the peer reviewers cannot be unbiased even if they wish to be. They simply do not have all the information needed to reach a fair conclusion. This same bias also manifested itself when Witztum submitted his papers "Personalities of Genesis and their Dates of Birth" and "A Replication of the Second Sample of Famous Rabbinical Personalities" to Statistical Science. These two papers report the results of the "additional experiments" described earlier in this paper. In a May 12, 1999 response to this submission, the editor states: "Neither paper offers anything new or interesting in terms of statistical theory or methodology. Given Statistical Science's declared policy of publishing papers with high statistical content, these papers are inappropriate for Statistical Science". This response was given in the same month that the MBBK paper was to be published by Statistical Science. Apparently, this "declared policy" of Statistical Science applies to papers that establish the veracity of the Torah codes, but not to those that challenge it!
[Comment:] On the contrary, it implies that MBBK's paper met the "declared policy" and Witztum's papers did not. If the submitted papers were similar to what Witztum has published on the internet, we can add that their presentation was of inferior quality and they hardly looked like scientific papers. It is obvious to anyone familiar with the preprints that WRR's published paper only had a chance of acceptance because it had been completely rewritten by Robert Aumann in a (superficially) scientific format.
[Gans:] Incidentally, this letter shows that the editors of Statistical Science were well aware of these two experiments performed by Witztum, yet condoned their omission from the MBBK paper. They also appear not to have had any objection to MBBK's statement that "Nothing we have chosen to omit tells a story contrary to the story here".
[Comment:] The timing of our paper meant that those "experiments" could only have been mentioned at a stage after the "final copy" was in the hands of the journal. Nevertheless, we examined them, saw more of the same nonsense, and reported as much to the editors of the journal. That is why the extra experiments were not mentioned. In fact the decision worked in WRR's favour: had we included the experiments in our paper, we would have ripped them apart.
[Gans:] In response to this unabashed bias of Statistical Science, four independent mathematicians wrote a letter to the editor of Statistical Science in which they say, "It seems elementary that Witztum et al. should have been asked for their response BEFORE McKay et al. was sent to referees". They go on to say, "The secrecy under which the whole process took place up to now is not worthy of a top flight journal like Statistical Science".
[Comment:] There was no secrecy. Everyone knew for years that we were writing a paper, and WRR were aware of most of our arguments; we had even published a summary of them earlier in Chance magazine.
[Gans:] These mathematicians summarize their position as follows: "Allow us to emphasize very strongly that we are not taking any position on the substance of the accusations. We also agree that accusations of fraud may in principle be published, and indeed should be published - if correct. All we say is that if you do publish them, you should be certain that they are indeed correct AND that you have followed equitable and reasonable procedures". This letter was dated July 5, 1999, about two months before the May issue of Statistical Science containing the MBBK paper appeared in print. The letter made no difference at all!
[Comment:] Obviously Gans knows nothing about the process of printing an academic journal and how long it takes. Two months before distribution is too late for anything but an emergency stopping of the presses. Nevertheless, our information is that the letter (and the accompanying threats of personal legal action against dozens of people associated with the journal) was seriously considered and dismissed.
[Gans:] The critics have raised many objections to the Torah codes, yet, one irrefutable argument would have sufficed. It is reminiscent of Hamlet's remark that "The lady doth protest too much". Hopefully, this paper will aid the truth seeker in seeing through the smoke screen and obfuscation to the truth.
Some useful ideas were contributed by Avi Norowitz.