ABSTRACT. It has been noted that when the Book of Genesis is written as two-dimensional arrays, equidistant letter sequences spelling words often appear in close proximity with portions of the text which have related meaning. Quantitative tools for measuring this phenomenon are developed. Randomization analysis is done for three samples. For one of them the effect is significant at the level of .000000004.
Key words and phrases. Genesis, Equidistant letter sequences, Strings of letters, Cylindrical representations, Statistical analysis.
In a previous paper (Witztum et al., 1994), we developed a methodology for systematic and rigorous studies of the same nature; namely, for attempts to show objectively the existence of the "hidden text" in the Hebrew Pentateuch. This methodology was applied to study the "hidden text" of the Book of Genesis.
The approach we have taken in our research can be illustrated by the following example. Suppose we have a text written in a foreign language that we do not understand. We are asked whether the text is meaningful (in that foreign language) or meaningless. Of course, it is very difficult to decide between these possibilities, since we do not understand the language. Suppose now that we are equipped with a very partial dictionary, which enables us to recognise a small portion of the words in the text: "hammer" here and "chair" there, and maybe even "umbrella" elsewhere. Can we now decide between the two possibilities?
Not yet. But suppose now that, aided with the partial dictionary, we can recognise in the text a pair of conceptually related words, like "hammer" and "anvil". We check if there is a tendency of their appearances in the text to be in "close proximity". If the text is meaningless, we do not expect to see such a tendency, since there is no reason for it to occur. Next, we widen our check; we may identify some other pairs of conceptually related words: like "chair" and "table", or "rain" and "umbrella". Thus we have a sample of such pairs, and we check the tendency of each pair to appear in close proximity in the text. If the text is meaningless, there is no reason to expect such a tendency. However, a strong tendency of such pairs to appear in close proximity indicates that the text might be meaningful.
Note that even in an absolutely meaningful text we do not expect that, deterministically, every such pair will show such tendency. Note also, that we did not decode the foreign language of the text yet: we do not recognise its syntax and we cannot read the text.
In our research we consider the set of all ELS's spelling out words or phrases in the language of the text. The approach described in this example suggests the two following lines of investigation:
A) A study of the mutual location of ELS's spelling out conceptually related words or expressions.
B) A study of the mutual location of ELS's spelling out words or expressions with conceptually related portions of the text.
Suppose we are given a text, such as Genesis (G). Define an ELS (equidistant letter sequence) as a sequence of letters in the text whose positions, not counting spaces, form an arithmetic progression; that is, the letters are found at the positions n, n + d, n + 2d, ..., n + (k - 1)d.
We call d the skip, n the start, and k the length of the ELS. These three parameters uniquely identify the ELS, which is denoted (n, d, k).
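The definition of an ELS can be made concrete with a short sketch. The function names `els_positions` and `find_els`, the 0-based positions and the Latin-letter toy text are assumptions made for illustration only; the study itself works with the Hebrew text of Genesis.

```python
def els_positions(n, d, k):
    """Positions n, n+d, ..., n+(k-1)d of the ELS (n, d, k).

    Positions are 0-based here, which is an illustrative assumption.
    """
    return [n + i * d for i in range(k)]


def find_els(text, word, max_skip=None):
    """Return every (n, d, k) with 1 <= |d| <= max_skip that spells `word`.

    Spaces are removed first, because positions do not count spaces.
    """
    letters = text.replace(" ", "")
    length, k = len(letters), len(word)
    if k < 2:
        raise ValueError("an ELS needs at least two letters")
    if max_skip is None:
        max_skip = (length - 1) // (k - 1)  # largest skip that still fits
    hits = []
    for skip in range(1, max_skip + 1):
        for d in (skip, -skip):  # forward and backward progressions
            for n in range(length):
                last = n + (k - 1) * d
                if not 0 <= last < length:
                    continue  # the progression would leave the text
                positions = els_positions(n, d, k)
                if all(letters[p] == ch for p, ch in zip(positions, word)):
                    hits.append((n, d, k))
    return hits
```

In this sketch a hit with d = 1 corresponds to what the paper calls an SL (a string of letters in the text), while hits with larger |d| are the ELS's proper.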
Let us write the text as a two-dimensional array -- i.e., on a single large page -- with rows of equal length, except perhaps for the last row. Usually, then, an ELS appears as a set of points on a straight line. The exceptional cases are those where the ELS "crosses" one of the vertical edges of the array and reappears on the opposite edge. To include these cases in our framework, we may think of the two vertical edges of the array as pasted together, with the end of the first line pasted to the beginning of the second, the end of the second to the beginning of the third, and so on. We thus get a cylinder on which the text spirals down in one long line.
It has been noted that when Genesis is written in this way, and the distance between ELS's is defined according to the ordinary two-dimensional Euclidean metric -- ELS's spelling words with related meaning often appear in close proximity. It has also been noted that ELS's spelling words often appear in close proximity with portions of the text which have related meaning.
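Thus, our research focuses on two phenomena, corresponding to the two lines of investigation above: Phenomenon A, the tendency of ELS's spelling words with related meaning to appear in close proximity; and Phenomenon B, the tendency of ELS's spelling words to appear in close proximity to portions of the text which have related meaning.
Our paper (Witztum et al., 1994) deals with Phenomenon A. There we developed a method for testing the significance of the phenomenon according to accepted statistical principles. After making certain choices of words to compare and ways to measure proximity, we performed a randomization test and obtained a very small p-value, i.e. we found the results highly statistically significant.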
Each ELS determines a series of tables with row lengths h= h1, h2,..., where hi is the integer nearest to |d|/i (1/2 is rounded up).
The rows in our table have 428 letters (only 21 of them are shown in Figure 1). The number 428 is the nearest integer to 855/2, so the word (private) appears every second row.
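The rounding rule for the row lengths (nearest integer to |d|/i, with 1/2 rounded up) can be checked with integer arithmetic. The sketch below is illustrative; the names `row_length` and `row_lengths` are ours, and the default of ten row lengths is an assumption taken from the "first ten hi" used in the Appendix. Python's built-in `round` is avoided on purpose, since it rounds exact halves to the nearest even integer rather than up.

```python
def row_length(d, i):
    """Integer nearest to |d|/i, with exact halves rounded up."""
    if i < 1:
        raise ValueError("i must be a positive integer")
    # floor(|d|/i + 1/2) written without floating point
    return (2 * abs(d) + i) // (2 * i)


def row_lengths(d, count=10):
    """The row lengths h1, h2, ..., h_count determined by the skip d."""
    return [row_length(d, i) for i in range(1, count + 1)]
```

For the skip 855 this gives h1 = 855 and h2 = 428 (855/2 = 427.5, rounded up), in agreement with the table described above.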
The table is determined by this ELS of the word (private), which is minimal in a section of the text comprising 76% of G. The word (names) appears in the text of G seven times as SL's. Note that G contains 78,064 letters. More examples are given in Appendix A.3.1.
The measuring scheme for Phenomenon A (see Witztum et al., 1994) is applicable, with minor changes, to study Phenomenon B.
In this paper we make certain choices of words to compare and perform similar randomization tests. We obtain very small p-values; that is, we find that the results are highly statistically significant.
We test the significance of the phenomenon on samples of pairs of related words. To do this we must do the following:
Task (i) has several components. First, we must define the notion of "distance" between an ELS and an SL of the text in a given array; for this we use a convenient variant of the ordinary Euclidean distance. Second, there are many ways of writing a text as a two-dimensional array, depending on the row length; we must select one or more of these arrays, and somehow amalgamate the results (of course, the selection and/or amalgamation must be carried out according to clearly stated, systematic rules). Third, a given word may occur many times as an ELS in a text; here again, a selection and amalgamation process is called for. Fourth, we must correct for factors such as word length and composition. All this is done in detail in Sections A.1 and A.2 of the Appendix.
Next, we have task (ii), measuring the overall proximity of pairs of words in the sample as a whole. For this, we used two different statistics, P1 and P2, which are defined and motivated in the Appendix (Section A.6). Intuitively, each measures overall proximity in a different way. In each case, a small value of Pi indicates that the words in the sample pairs are, on the whole, close to each other.
To accomplish task (iii) we composed three samples (Sample B1, Sample B2 and Sample B3) of pairs of expressions (w, w'), where w's are words appearing as ELS's, and w' 's are words appearing as SL's (i.e. with d' =1).
A preliminary test was done for each sample, in order to check how the subject of the sample appears as ELS's and SL's in Genesis, and consequently to decide whether to test the sample itself. For details see Appendix, Section A.3.1.
Sample B1 is built on the basis of the Hebrew alphabet. For every letter 'x' of the Hebrew alphabet we consider pairs (w,w'), where w' 's are found as SL's and have the meaning "a name beginning with 'x' " or "names beginning with 'x' ", while w's are names beginning with 'x' taken from "A Treasury of Men's Names" (which is included as an appendix in Even-Shoshan's famous Hebrew dictionary (Even-Shoshan, 1989)). For a detailed definition of Sample B1 see Appendix (Section A.3.2).
Sample B2 is built exactly in the same way as Sample B1, except for the fact that the names are taken from "A Treasury of Women's Names" from the same dictionary.
Sample B3 is built on the basis of the list of the seventy descendants of Noah's sons: the Semites, the Hamites, and the Japhetites, found in Genesis Chapter 10. Jewish tradition teaches that these seventy descendants became the Seventy Nations which constitute Humanity. This concept is well known, and is usually found in biblical encyclopedias under the title "The Table of Nations" (see for instance Encyclopedia Biblica, 1962). Sample B3 consists of pairs (w, w') where w' 's are names from this list, and w's are expressions from a fixed set of expressions describing basic aspects of nationality (such as name, country, language etc.). For details see Appendix, Section A.3.3 and Table 8.
Finally, we come to task (iv), the significance test itself. We apply the same procedure for all three samples. For Sample B3 we describe it here; for the other two samples the (similar) details are given in Appendix A.4.
The list of Seventy Nations consists of 68 different names (in two cases nations have the same name). For each of the 68! permutations $\pi$ of these names, we define the statistic $P_1^{\pi}$ obtained by permuting the names in accordance with $\pi$, so that Name i is matched with the set of expressions defined for Name $\pi(i)$. The 68! numbers $P_1^{\pi}$ are ordered, with possible ties, according to the usual order of the real numbers. If the phenomenon under study were due to chance, it would be just as likely that P1 occupies any one of the 68! places in this order as any other. Similarly for P2. This is our null hypothesis.
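The matching under a permutation can be sketched as follows. The names `permuted_pairs`, `random_permutation` and `expressions_for` are hypothetical, the toy data stand in for the Hebrew names and expressions, and Python's built-in generator is only a stand-in for whatever generator was actually used.

```python
import random


def permuted_pairs(names, expressions_for, perm):
    """Pair names[i] with the expressions defined for names[perm[i]].

    Each pair is (w, w'), with w an expression and w' a name.
    The identity permutation reproduces the original sample.
    """
    return [
        (w, names[i])
        for i in range(len(names))
        for w in expressions_for[names[perm[i]]]
    ]


def random_permutation(size, rng):
    """A uniformly random permutation of 0..size-1 drawn from `rng`."""
    perm = list(range(size))
    rng.shuffle(perm)
    return perm
```

A statistic such as P1 would then be recomputed on `permuted_pairs(names, expressions_for, perm)` for each random `perm`.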
To calculate significance levels, we chose 999,999 random permutations $\pi$ of the 68 names; the precise way in which this was done is explained in the Appendix (Section A.7). Each of these permutations determines a statistic $P_1^{\pi}$; together with P1, we thus have 1,000,000 numbers. Define the rank order of P1 among these 1,000,000 numbers as the number of $P_1^{\pi}$ not exceeding P1; if P1 is tied with other $P_1^{\pi}$, half of these others are considered to "exceed" P1. Let $\rho_1$ be the rank order of P1, divided by 1,000,000; under the null hypothesis, $\rho_1$ is the probability that P1 would rank as low as it does. Define $\rho_2$ similarly (using the same 999,999 permutations in each case).
For Sample B3 we performed an additional test with 999,999,999 random permutations. In this case only the statistic P2 was calculated. The time needed to compute both statistics for 999,999,999 random permutations is at present beyond our reach.
After calculating the probabilities $\rho_1$ and $\rho_2$, we must make an overall decision, for each sample, to accept or reject the null hypothesis. Since the smaller of the two probabilities is used, it is doubled to allow for the choice between the two statistics. Thus the overall significance level (or p-value) for each sample, using the two statistics, is $\rho_0 := 2 \min(\rho_1, \rho_2)$.
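The rank-order rule, including the treatment of ties, might be sketched as below. The function names are ours, and counting the observed value itself among the numbers "not exceeding" it is our reading of the definition (it is consistent with dividing by 1,000,000 when 999,999 permutations are drawn).

```python
def rank_order(observed, permuted_values):
    """Rank of `observed` among itself and the permuted statistics.

    Counts the observed value, every permuted value strictly below it,
    and half of the permuted values tied with it.
    """
    below = sum(1 for v in permuted_values if v < observed)
    ties = sum(1 for v in permuted_values if v == observed)
    return 1 + below + ties / 2


def rank_probability(observed, permuted_values):
    """Rank order divided by the total count (permutations plus the original)."""
    return rank_order(observed, permuted_values) / (len(permuted_values) + 1)
```

With this convention a rank order of 4 among 1,000,000,000 numbers corresponds to a probability of 4/1,000,000,000.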
We conclude that for Sample B1 the null hypothesis is rejected with significance level (p-value) .000884.
The same calculations, using the same 999,999 random permutations, were performed for a control text V (see Witztum et al., 1994). The text V was obtained from G by permuting the verses of G randomly. (For details, see Appendix, Section A.7.)
Table 1 gives the results of these calculations too. In the case of V, $\min(\rho_1, \rho_2)$ is approximately .169, which is non-significant.
Table 2 shows the results for Sample B2 for G. The results are non-significant. We saw no reason to perform any further tests for this sample.
Table 3 shows the results for Sample B3. Part A summarizes the results for G, as well as for the control text V, for 999,999 random permutations. In the case of V, $\min(\rho_1, \rho_2)$ is approximately .559.
In part B, the results for G for 999,999,999 random permutations are given. Notice that only the rank order of P2 was calculated. It turned out to be 4.
We conclude that for Sample B3 the null hypothesis is rejected with significance level (p-value) .000000004.
In Section A.1 of this Appendix, a "raw" measure of distance between words is defined. Section A.2 explains how we normalize this raw measure to correct for factors like the length of a word and its composition (the relative frequency of the letters occurring in it). Section A.3 explains how the three samples B1, B2 and B3 are constructed. Section A.5 identifies the precise text of Genesis that we used. In Section A.6, we define and motivate the statistics P1 and P2. The details of task (iv) are described in Section A.4. Finally, Section A.7 provides the details of the randomization.
As indicated in Section 1, we think of an array as one long line that spirals down on a cylinder; its row length h is the number of vertical columns. To define the distance between two letters x and x', cut the cylinder along a vertical line between two columns. In the resulting plane each of x and x' has two integer coordinates, and we compute the distance between them as usual, using these coordinates. In general, there are two possible values for this distance, depending on the vertical line that was chosen for cutting the cylinder; if the two values are different, we use the smaller one.
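One way to compute this letter-to-letter distance is sketched below. It rests on our own working-out of the two cuts: if the positions differ by rows * h + cols, one cut gives the planar displacement (rows, cols) and the other gives (rows + 1, cols - h). The function names and the use of text positions as input are assumptions for illustration.

```python
import math


def letter_distance_sq(p, q, h):
    """Squared distance between text positions p and q on a cylinder of row length h."""
    if h < 1:
        raise ValueError("row length must be positive")
    rows, cols = divmod(abs(q - p), h)
    # First cut: displacement (rows, cols).
    # Second cut: one row further down and h columns back.
    return min(rows * rows + cols * cols, (rows + 1) ** 2 + (cols - h) ** 2)


def letter_distance(p, q, h):
    """Distance between two letters; the smaller of the two possible values."""
    return math.sqrt(letter_distance_sq(p, q, h))
```

For example, with h = 10 the letters at positions 0 and 9 sit at opposite ends of the first row of the flat page, but on the cylinder they are diagonal neighbours, so the squared distance is 2 rather than 81.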
Next, we define the distance between a fixed ELS e and a fixed SL e' in a fixed cylindrical array. Set
f := the distance between consecutive letters of e,
f' := the distance between consecutive letters of e' = 1,
l := the minimal distance between a letter of e and one of e',
Now there are many ways of writing Genesis as a cylindrical array, depending on the row length h. Denote by $\delta_h(e, e')$ the distance $\delta(e, e')$ in the array determined by h, and set $\mu_h(e, e') := 1/\delta_h(e, e')$; the larger $\mu_h(e, e')$ is, the more compact is the configuration consisting of e and e' in the array with row length h. Set e = (n, d, k) (recall that d is the skip). Of particular interest are the row lengths h = h1, h2, ..., where hi is the integer nearest to |d|/i (1/2 is rounded up). Thus when h = h1 = |d|, e appears as a column of adjacent letters, and when h = h2, e appears either as a column that skips alternate rows or as a straight line of knight's moves. In general, the arrays in which e appears relatively compactly are those with row length hi with i "not too large."
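The shapes mentioned here (a column of adjacent letters, a column skipping alternate rows, a line of knight's moves) can be checked numerically. The sketch below computes the (row, column) step between consecutive letters of an ELS in an array of row length h; the function name is hypothetical, and choosing the shorter of the two planar representations is our assumption about how the picture is drawn.

```python
def step_in_array(d, h):
    """(rows, columns) step between consecutive ELS letters for skip d and row length h.

    The sign of d only reverses the direction of travel, so |d| is used.
    """
    rows, cols = divmod(abs(d), h)
    candidates = [(rows, cols), (rows + 1, cols - h)]
    return min(candidates, key=lambda s: s[0] ** 2 + s[1] ** 2)
```

With d = 10, the array with h1 = 10 gives the step (1, 0), a column of adjacent letters, and h2 = 5 gives (2, 0), a column that skips alternate rows. With d = 11, h2 = 6 gives (2, -1), a knight's move.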
The above discussion indicates that if there is an array in which the configuration (e,e') is unusually compact, it is likely to be among those whose row length is one of the first ten hi. (Here and in the sequel 10 is an arbitrarily selected "moderate" number). So setting
$\sigma(e, e') := \sum_{i=1}^{10} \mu_{h_i}(e, e')$,
we conclude that $\sigma(e, e')$ is a reasonable measure of the maximal "compactness" of the configuration (e, e') in any array. Equivalently, it is an inverse measure of the minimum distance between e and e'.
Next, given a word w, we look for the most "noteworthy" occurrence or occurrences of w as an ELS in G. For this, we chose ELS's e = (n,d,k) with $|d| \ge 2$ that spell out w for which |d| is minimal over all of G, or at least over large portions of it. Specifically, define the domain of minimality of e as the maximal segment $T_e$ of G that includes e and does not include any other ELS for w whose skip is smaller than |d| in absolute value. The length of $T_e$, relative to the whole of G, is the "weight" we assign to e. Thus we define $\omega(e) := \lambda(T_e)/\lambda(G)$, where $\lambda(T_e)$ is the length of $T_e$, and $\lambda(G)$ is the length of G. For any two words w and w', we set
where the sum is over all ELS's e spelling out w and over all SL's e' spelling out w'. Roughly, $\Omega(w, w')$ measures the maximum closeness of the more noteworthy appearances of w as ELS's and w' as SL's in Genesis: the closer they are, the larger is $\Omega(w, w')$.
When actually computing $\Omega(w, w')$, the size of the list of ELS's for w may be impractically large (especially for short words). It is clear from the definition of the domain of minimality that ELS's for w with relatively large skips will contribute very little to the value of $\Omega(w, w')$ due to their small weight. Hence, in order to cut the amount of computation we restrict beforehand the range of the skip to $|d| \le D(w)$ for w, so that the expected number of ELS's for w will be 10. This expected number equals the product of the relative frequencies (within Genesis) of the letters constituting w multiplied by the total number of all equidistant letter sequences with $2 \le |d| \le D$. (The latter is given by the formula (D - 1)(2L - (k - 1)(D + 2)), where L is the length of the text and k is the number of letters in w.) Abusing our notation somewhat, we continue to denote this modified function by $\Omega(w, w')$.
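The counting formula and the choice of the skip bound can be illustrated as follows. The function names are ours, and taking D(w) to be the smallest D at which the expected number reaches the target is an assumption; the text only states that the expected number should be 10.

```python
from collections import Counter


def total_els_count(text_length, k, max_skip):
    """Number of ELS's of length k with 2 <= |d| <= D: (D - 1)(2L - (k - 1)(D + 2))."""
    return (max_skip - 1) * (2 * text_length - (k - 1) * (max_skip + 2))


def expected_els(text, word, max_skip):
    """Product of the letters' relative frequencies times the total ELS count."""
    counts = Counter(text)
    probability = 1.0
    for ch in word:
        probability *= counts[ch] / len(text)
    return probability * total_els_count(len(text), len(word), max_skip)


def skip_bound(text, word, target=10.0):
    """Smallest D whose expected count reaches `target` (assumed reading),
    capped at the largest skip for which an ELS of this length still fits."""
    largest = (len(text) - 1) // (len(word) - 1)
    for max_skip in range(2, largest + 1):
        if expected_els(text, word, max_skip) >= target:
            return max_skip
    return largest
```

The closed formula agrees with a direct count: for each skip d > 0 there are L - (k - 1)d starting positions, and the factor 2 accounts for negative skips.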
The idea is to use perturbations of the arithmetic progressions that define the notion of an ELS. Specifically, start by fixing a triple (x,y,z) of integers in the range {-r,...,0,...,r}; there are $(2r + 1)^3$ such triples. In Witztum et al. (1994) and also here we put r = 2, which gives us 125 triples. Next, rather than looking for ordinary ELS's (n,d,k), look for "(x,y,z)-perturbed ELS's" $(n,d,k)^{(x,y,z)}$, obtained by taking the positions
$n,\; n + d,\; \ldots,\; n + (k-4)d,\; n + (k-3)d + x,\; n + (k-2)d + x + y,\; n + (k-1)d + x + y + z$
instead of the positions n, n + d, n + 2d, ..., n + (k - 1)d. Note that in a word of length k, k-2 intervals could be perturbed. However, we preferred to perturb only the 3 last ones, for technical programming reasons.
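A sketch of the perturbed positions is given below. It assumes that x, y and z are added to the last three intervals in turn, so that the offsets accumulate along the sequence; the function names are hypothetical.

```python
from itertools import product


def perturbation_triples(r=2):
    """All triples (x, y, z) with entries in {-r, ..., r}; 125 of them for r = 2."""
    return list(product(range(-r, r + 1), repeat=3))


def perturbed_positions(n, d, k, triple):
    """Positions of the (x, y, z)-perturbed ELS (assumed cumulative offsets)."""
    if k < 5:
        raise ValueError("the perturbation scheme is applied to words of 5+ letters")
    x, y, z = triple
    positions = [n + i * d for i in range(k)]
    positions[k - 3] += x          # third-to-last interval becomes d + x
    positions[k - 2] += x + y      # second-to-last interval becomes d + y
    positions[k - 1] += x + y + z  # last interval becomes d + z
    return positions
```

The triple (0, 0, 0) returns the ordinary ELS, and the first interval is never altered, so the distance between the first two letters is unaffected.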
The distance between the (x,y,z)-perturbed ELS $(n, d, k)^{(x,y,z)}$ and the SL (n', 1, k') is defined by the same formulae as in the non-perturbed case, where f is taken to be the distance between the first two letters of the (x,y,z)-perturbed e.
We may now calculate the "(x,y,z)-proximity" of two words w and w' in a manner exactly analogous to that used for calculating the "ordinary" proximity $\Omega(w, w')$. This yields 125 numbers $\Omega^{(x,y,z)}(w, w')$, of which $\Omega(w, w') = \Omega^{(0,0,0)}(w, w')$ is one. We are interested in only some of these 125 numbers; namely, those corresponding to triples (x,y,z) for which there actually exist some (x,y,z)-perturbed ELS's in Genesis for w (the other $\Omega^{(x,y,z)}(w, w')$ vanish). Denote by M(w, w') the set of all such triples, and by m(w, w') the number of its elements.
Suppose (0,0,0) is in M(w, w'), i.e., w actually appears as an ordinary ELS (i.e., with x = y = z = 0) in the text. Denote by v(w,w') the number of triples (x,y,z) in M(w,w') for which $\Omega^{(x,y,z)}(w, w') \ge \Omega(w, w')$. If $m(w, w') \ge 10$ (again, 10 is an arbitrarily selected "moderate" number), we set $c(w, w') := v(w, w')/m(w, w')$.
If (0, 0,0) is not in M(w, w'), or if m(w, w') < 10 (in which case we consider the accuracy of the method as insufficient), we do not define c(w,w').
In words, the corrected distance c(w,w') is simply the rank order of the proximity $\Omega(w, w')$ among all the "perturbed proximities" $\Omega^{(x,y,z)}(w, w')$; if $\Omega(w, w')$ is tied with other $\Omega^{(x,y,z)}(w, w')$, half of these others are considered to "exceed" $\Omega(w, w')$. We normalize it so that the maximum distance is 1. A large corrected distance means that ELS's representing w are far away from the SL's representing w', on a scale determined by how far the perturbed ELS's for w are from the SL's for w'.
Our method of rank ordering of ELS's based on (x,y,z)-perturbations requires that words have at least 5 letters to apply the perturbations. In addition, we found that for words with more than 8 letters, the number of (x,y,z)-perturbed ELS's which actually exist for such words was too small to satisfy our criteria for applying the corrected distance. Thus the words in our list are restricted in length to the range 5-8, exactly as in Witztum et al. (1994). However, there is no restriction on the words or expressions appearing as SL's.
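The rank-order description of the corrected distance might be sketched as follows. The input is a hypothetical dictionary mapping each triple in M(w, w') to its perturbed proximity; the function name, the dictionary layout, and dividing the rank by m(w, w') are our reading of the description, so this is an illustration rather than the program used in the study.

```python
def corrected_distance(proximities, min_triples=10):
    """Rank of the unperturbed proximity among all perturbed ones, scaled to (0, 1].

    Returns None when (0, 0, 0) is absent or fewer than `min_triples`
    triples are available, mirroring the cases where c is left undefined.
    """
    base = proximities.get((0, 0, 0))
    m = len(proximities)
    if base is None or m < min_triples:
        return None
    others = [v for t, v in proximities.items() if t != (0, 0, 0)]
    greater = sum(1 for v in others if v > base)
    ties = sum(1 for v in others if v == base)
    # The unperturbed value counts itself; half of the tied others "exceed" it.
    rank = 1 + greater + ties / 2
    return rank / m
```

A small value means that few perturbed configurations are at least as compact as the unperturbed one.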
A.3.1 Preliminary Tests.
Sample B1 deals with men's names. Appendix B in Even-Shoshan's Hebrew dictionary (Even-Shoshan, 1989) is "A Treasury of Private Names". It is divided into two parts: "A Treasury of Men's Names" and "A Treasury of Women's Names". Originally, we intended to check a sample based on the (much bigger) Treasury of men's names. The Treasury contains names from the Hebrew Bible and from various periods of the Hebrew language. We decided to include in our sample only names taken from the Hebrew Bible (they are indicated as such in this Treasury). These are original Hebrew names, or names which are etymologically Semitic or Hebrew. (See the foreword to the Treasury).
We describe the subject of the sample by the following set of pairs of expressions:
Tables 4A and 4B give the values of c(w,w') for each pair from the above lists (where w' is (names), and w is the corresponding expression). Recall that c(w, w') is defined only when w appears as an ELS, and that w is restricted in length to the range 5-8.
The total for the 11 pairs is: P1 = 0.0000042, P2 = 0.000397 (for the definition of P1 and P2 see section A.6).
A randomization test suited to this type of sample, where the same word is "paired" with a list of expressions, is the subject of our next paper.
Sample B2 deals with women's names. We describe this fact exactly as we had done with men's names; i.e. by the following pairs:
Table 5 gives the values of c(w,w') for each pair from the above list (where w' is (names), and w is the corresponding expression).
To do a preliminary test for Sample B3, we proceed in a similar way and check the following expressions with the same appearance of (names) as above:
Seeing these results as encouraging, we proceed to check even more closely the subject of Sample B3.
The expression (the descendants of Noah) exists in G as SL's five times. So we can look directly for meetings between this expression and the word (seventy) as ELS's:
In Fig. 4 we see the pairs 1), 2) and 5) appearing together. This table is determined by an ELS for (The Seventy Descendants of Noah) with skip -100, each row containing 25 = 100/4 letters. The ELS for (seventy) is the same as shown in Fig. 3. The ELS's for (The Seventy Descendants of Noah) and for (the nations) are minimal over the whole text of G.
Table 7 gives the values of c(w,w') for each pair from the above list (where w' is (the descendants of Noah), and w is the corresponding expression).
The total for the 15 pairs is: P1 = 0.0000779, P2 = 0.0000377.
"A Treasury of Men's Names" is included as an appendix in Even-Shoshan's Hebrew dictionary (Even-Shoshan, 1989). From this appendix we chose all the names which are mentioned in the Hebrew Bible (Pentateuch, Prophets and Writings) as proper names - as indicated in the Treasury itself - and contain 5 to 8 letters.
This collection of names is designated as S. Let HA := Hebrew Alphabet (which consists of 22 letters).
For every $x \in HA$ we define
We define 22 sets A(x), $x \in HA$, where the elements of each set are the following three expressions:
Then we define
Now we define
and, for every $x, y \in HA'$
For instance, ('' , ) Pair B1(,).
The sample is defined by:
For Genesis it contains 457 pairs of expressions.
Sample B2 is defined precisely in the same way as Sample B1, except that we used "A Treasury of Women's Names" from the same dictionary. For Genesis it contains 38 pairs of expressions.
A.3.3 The Sample B3.
Chapter 10 of Genesis contains the names of the seventy descendants of Noah's sons: the Semites, the Hamites and the Japhetites (see Table 8). Jewish tradition teaches that these seventy descendants became the Seventy Nations, which constitute Humanity. The source for this tradition is found in Targum Yonathan (Jonathan's Translation) to Deuteronomy 32:8 (see Kasher, 1983, p. 294). This concept is well known, and is found in biblical encyclopedias under the title "The Table of Nations" (see, for instance, Encyclopedia Biblica, 1962).
Jewish tradition further tells us (Hagra, 1905), that a nation has the following four characteristics:
For every $x \in TN$ we define below a set List B3(x) of expressions.
For every $x, y \in TN$ we define
The resulting sample is:
Now let us explain how List B3(x) is constructed. Take, for example, one of the names in the list-- (Canaan - item 18 in Table 8):
The same procedure is applied to each name from the Table of Nations.
For every $x \in TN$, List B3(x) consists of the following expressions:
Remark 1. The expressions 'x ', 'x ' and 'x ' were chosen for categories (1), (2) and (3) because at least for some $x \in TN$ they appear as Biblical expressions (see the above example of Canaan). For category (4) we used 'x '; although it does not appear for any $x \in TN$, it is the only Biblical formation suitable for this category.
Remark 2. Concerning the expressions '-x ', we are aware of the fact that only valid linguistic formations should be used; therefore we use these expressions if, at least, their singular form '-x' appears in the Hebrew Bible. For instance, for (item 1) we do not take , since the expression does not appear in the Hebrew Bible; so we cannot know whether it has any meaning at all. The check was done using Even-Shoshan's New Concordance of the Bible (Even-Shoshan, 1981).
Remark 3. In Talmudic Literature we find various new names for nations and countries, and we cannot decide in favor of one against the other. We preferred to use a single source. The Aramaic translation of Genesis that gives the largest number of such new names is Targum Yonathan. We used Targum Yonathan as printed in Torah Shelemah (Kasher, 1929) (with Torah Shelemah's corrections according to the Ginzburger manuscript and others).
Targum Yonathan identifies the new name for the country for items 1 to 15, 17, and 19 to 23 in Table 8; and the new name for the nation for items 24 to 33, and 41 to 44.
Targum Yonathan is a translation into Aramaic. We need the identification of the names, but we do not intend to check whether the Aramaic appears non-randomly as ELS's. Thus, where the Hebrew formation exists, we used it alone.
Remark 4. Here, as in Witztum et al. (1994), we used the grammatical orthography - ktiv dikduki.
A.4.1 Samples B1 and B2.
Let $\pi$ be a permutation on the set HA'. Then we define the permuted sample
The set HA' consists of 12 elements, thus there are 12! different permutations. To calculate significance levels, we chose 999,999 random permutations as described in Section A.7.
For Sample B2 we proceed similarly.
A.4.2 Sample B3.
For a permutation $\pi$ on the set TN we define the permuted sample
The set TN consists of 68 different elements (the names and appear twice), thus there are 68! different permutations. To calculate significance levels, we first chose 999,999 random permutations as described in Section A.7. Then we did a new measurement with 999,999,999 random permutations, only for statistic P2.
To understand this definition, note that if the c(w, w') were independent random variables that are uniformly distributed over [0,1], then P1 would be the probability that at least k out of N of them are $\le 0.2$, that is, the binomial tail $\sum_{j=k}^{N} \binom{N}{j} (0.2)^j (0.8)^{N-j}$. However, we do not make or use any such assumptions about uniformity and independence. Thus P1, though calibrated in probability terms, is simply an ordinal index that measures the number of word pairs in a given sample whose words are "pretty close" to each other (i.e. $c(w, w') \le 1/5$), taking into account the size of the whole sample. It enables us to compare the overall proximity of the word pairs in different samples; specifically, in the samples arising from the different permutations $\pi$.
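The binomial description of P1 translates directly into a few lines of code. The sketch assumes that N is the number of pairs for which c(w, w') is defined and k the number of those with c(w, w') at most 0.2, as the description suggests; the function name is ours.

```python
from math import comb


def p1_statistic(distances, threshold=0.2):
    """Binomial tail: probability of at least k successes in N trials.

    `distances` holds the defined corrected distances c(w, w');
    a "success" is a distance not exceeding `threshold`.
    """
    n = len(distances)
    k = sum(1 for c in distances if c <= threshold)
    return sum(
        comb(n, j) * threshold ** j * (1 - threshold) ** (n - j)
        for j in range(k, n + 1)
    )
```

For four pairs of which two are within the threshold, this gives 0.1536 + 0.0256 + 0.0016 = 0.1808.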
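The statistic P1 ignores all distances c(w,w') greater than 0.2, and gives equal weight to all distances less than 0.2. For a measure that is sensitive to the actual size of the distances, we calculate the product $\prod c(w, w')$ over all word pairs (w,w') in the sample. We then define
$P_2 := F_N\left(\prod c(w, w')\right)$,
with N as above, and
$F_N(X) := X \left(1 - \ln X + \frac{(-\ln X)^2}{2!} + \cdots + \frac{(-\ln X)^{N-1}}{(N-1)!}\right)$.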
To understand this definition, note first that if $x_1, x_2, \ldots, x_N$ are independent random variables that are uniformly distributed over [0, 1], then the distribution of their product $X := x_1 x_2 \cdots x_N$ is given by $\mathrm{Prob}(X \le X_0) = F_N(X_0)$; this follows from (3.5) in Feller (1966), since the $-\ln x_i$ are distributed exponentially, and $-\ln X = \sum_i (-\ln x_i)$. The intuition for P2 is then analogous to that for P1: If the c(w, w') were independent random variables that are uniformly distributed over [0,1], then P2 would be the probability that the product $\prod c(w, w')$ is as small as it is, or smaller. But as before, we do not use any such uniformity or independence assumptions. Like P1, the statistic P2 is calibrated in probability terms; but rather than thinking of it as a probability, one should think of it simply as an ordinal index that enables us to compare the proximity of the words in word pairs arising from different permutations.
The control text V was constructed by permuting the verses of G with a single random permutation, generated as in the previous paragraph. In this case, the seed was picked arbitrarily to be the decimal integer 10 (i.e., the binary integer 1010).
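Constructing a control text from a seeded permutation of verses might look like the sketch below. Python's built-in generator is used purely as a stand-in, since the generator actually used is not described in this section; the seed value 10 mirrors the text, but this code would not reproduce V itself. The function name is hypothetical.

```python
import random


def shuffle_verses(verses, seed=10):
    """Return the verses in a pseudo-random order determined by `seed`.

    The input list is left unchanged; the same seed always yields the same order.
    """
    rng = random.Random(seed)  # stand-in generator (an assumption)
    order = list(range(len(verses)))
    rng.shuffle(order)
    return [verses[i] for i in order]
```

Fixing the seed makes the control text reproducible, which is what allows the same V to be used across all samples.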
We thank Mr. Bernard Goldstein of London and Mr. Yaakov Aharonov of Bnei Brak for their help with computer facilities.
We express our sincere gratitude to Mishmereth Stam (Bnei Brak) for the text of the Book of Genesis on a disc.
We wish to express our thanks to Dr. Shalom Srebrenik for helpful discussions and valuable suggestions.
We thank Yaakov Orbach for help on linguistic matters.