Archive for April, 2010

On the Beale Cipher, Part II (and Other Book Ciphers)

April 28th, 2010

Last time I talked about extending the usual frequency analysis on the (first) Beale cipher to augment our understanding of the composition of the individual letters the numbers may represent.  I have said before that Markov chains are immensely applicable everywhere; the Beale cipher seems no exception.  The idea came to me last month, a couple of years after reading Singh's book, while I was happily lazing about, watching a couple of seagulls fish in Manzanillo's ocean.  I had also read Snell's book, and a particular description of how Markov himself thought about Markov chains intrigued me: apparently he counted the transitions from vowels to consonants and from consonants to vowels in a book, worked out his theory, and showed that the long-term fractions of vowels and consonants predicted by the chain stabilized to the actual ratio in the book.

This same idea can, I think, be applied to the Beale cipher.  Suppose we know a priori how the encoding letters of the key transition among one another; we surmise that the encoding letter is the first letter of each word.  Say, for example, the key is the Declaration of Independence (which in fact is the key for Beale cipher 2), as

"When in the Course of human events it becomes necessary..." and each first letter encodes a particular number.   We can see that W transitions to I, I to T, T to C, and so on:

W->I->T->C->O->H->E->  etc.  By finding the proportion of times any letter, say W, transitions to each of the other letters, we've got ourselves a transition probability matrix.  In this abbreviated example, W transitions to I one hundred percent of the time; its transition probability vector would be represented by a 1 in the position of the letter I and 0 elsewhere.  Thus, in effect, we are assuming that a random variable can take states represented by the letters of the alphabet, and that it can transition to any other letter, or stay where it is, with a given probability.
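This counting is easy to automate.  Below is a minimal sketch in Python (my actual scripts are Octave .m files; this is only an illustration) that builds the 26x26 matrix from the first letters of a key's words, filling rows that see no transitions with a uniform distribution:

```python
def first_letter_tpm(key_text):
    """Build a 26x26 transition probability matrix from the first
    letters of consecutive words in a key text (a sketch of the idea)."""
    letters = [w[0].upper() for w in key_text.split() if w[0].isalpha()]
    counts = [[0.0] * 26 for _ in range(26)]
    for a, b in zip(letters, letters[1:]):
        counts[ord(a) - 65][ord(b) - 65] += 1
    for row in counts:
        total = sum(row)
        if total == 0:
            # letters with no observed transitions jump anywhere uniformly
            row[:] = [1.0 / 26] * 26
        else:
            row[:] = [c / total for c in row]
    return counts

tpm = first_letter_tpm("When in the Course of human events it becomes necessary")
# in this tiny key, W transitions to I every time
print(tpm[ord('W') - 65][ord('I') - 65])
```

With the abbreviated key above, the W row puts all its mass on I, while the I row splits evenly between T ("in the") and B ("it becomes").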

Of course, the longer the encoding key, the better: most rows of the 26x26 transition probability matrix can then be filled without being all zeros.  A zero row is in effect what happens with the letter A above, since A never appears in the abbreviated example and so we count no letter toward which it transitions.  Counting the whole of the transitions in the Declaration of Independence, the only letters that don't transition are X, Y, and Z; to bypass the difficulty of a transition probability matrix whose rows do not sum to 1, I have made X, Y, and Z's rows sum to 1 by assuming that those states transition to every letter of the alphabet in equal proportion.

Written thusly, the Declaration of Independence transition probability matrix becomes a regular transition probability matrix with a stable nth power.  In other words, we can raise this matrix to an arbitrary power, say 3000 or 5000, without fear of it diverging or giving weird numbers.  In fact, the Declaration of Independence transition probability matrix (DOITPM for short) stabilizes to the fourth decimal digit after about 15 powers.  The reason we care about the powers of the TPM is that the nth power represents the probability of transitioning to a particular letter after n transitions!  Thus, where the DOITPM alone represents the probability of, say, W transitioning to I at the first step (the next cipher number down), the DOITPM to the 3rd power represents the probability of W jumping to another letter after three steps (the next three cipher numbers down).
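The stabilization can be checked by comparing consecutive powers entry by entry.  A sketch, with a hypothetical 3-state matrix standing in for the 26x26 DOITPM:

```python
import numpy as np

# a hypothetical 3-state regular TPM standing in for the DOITPM
P = np.array([[0.2, 0.5, 0.3],
              [0.4, 0.4, 0.2],
              [0.3, 0.3, 0.4]])

def power_until_stable(P, tol=1e-4, max_steps=1000):
    """Raise P to successive powers until consecutive powers agree
    to within tol in every entry."""
    Q = P.copy()
    for n in range(2, max_steps + 1):
        Q_next = Q @ P
        if np.max(np.abs(Q_next - Q)) < tol:
            return Q_next, n
        Q = Q_next
    return Q, max_steps

stable, steps = power_until_stable(P)
print(steps)        # number of powers needed to stabilize
print(stable[0])    # every row approaches the same long-run distribution
```

For a regular chain, every row of the stabilized power is the same: the long-run proportions, regardless of the starting letter.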

Of course, calculating the DOITPM to any power is taxing or impossible if done by hand: I downloaded Octave (which I earnestly recommend if, like me, you don't have the 5000+ dollars for a Matlab license) and built a few script files for the purpose of doing all this.

The very cool thing about this approach is that virtually all of points 1-6 from my last post are taken care of.  In particular, if one has a prior belief about the probability that a given number in the cipher is a particular letter (a probability vector), that belief can be propagated forward by using the TPM and its powers.  As an example, we may have the prior belief that the cipher's 1 is a first letter, with probability vector equal to the frequencies of first letters in the English language.  Such a vector (P) times the DOITPM gives us a probability vector for the cipher's 2 (one step forward), P*DOITPM^2 gives the probability vector of the cipher's 3 (two steps forward), etc.
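The propagation itself is just a vector-matrix product.  Continuing the sketch (again with a hypothetical 3-state matrix in place of the real DOITPM):

```python
import numpy as np

# hypothetical 3-state stand-in for the DOITPM
T = np.array([[0.2, 0.5, 0.3],
              [0.4, 0.4, 0.2],
              [0.3, 0.3, 0.4]])

# prior belief about the letter behind the cipher's first number
prior = np.array([0.7, 0.2, 0.1])

one_step = prior @ T                              # belief about the next number
two_steps = prior @ np.linalg.matrix_power(T, 2)  # belief two numbers down

print(one_step)   # still a probability vector: entries sum to 1
```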

On the other hand, we may surmise that, say, the cipher's 64 has the standard frequencies of letters of the English language.  We can propagate such belief forward to 65, 66, 67, etc., until about 15 letters after our original (since the DOITPM stabilizes and all steps after about the fifteenth are the same).

We need not use the above frequencies, although they do seem reasonable "first-guess" beliefs.  If we suspect for any reason that a particular letter, say T, is the cipher's 1, we can propagate that belief forward and get good results for the cipher's 2, 3, and even 4.  After 4, things begin to stabilize toward the long-term proportions, which is only natural, since our certainty about letters farther down the chain decreases.  If we are good at crosswords and a little bit lucky, we can determine that 2 is a particular letter.  We can then modify its probability vector and propagate that belief forward... and so on, until the whole text is deciphered.

We can also propagate partial beliefs: if we suspect the cipher's 1 is either a T or a W with equal probability, but allow a slight chance it could be any other letter, our probability vector for that number could assign 0.45 each to T and W and spread the remaining 0.1 evenly (about 0.004) across the other 24 letters.  As always, this belief can be propagated forward, and with luck we can determine more letters based on frequency guesses.
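Such a partial-belief vector is easy to construct so that it sums exactly to 1:

```python
# T or W with probability 0.45 each; the remaining 0.1 spread evenly
# over the other 24 letters (letters indexed A=0 .. Z=25)
belief = [0.1 / 24] * 26
belief[ord('T') - 65] = 0.45
belief[ord('W') - 65] = 0.45

print(sum(belief))  # sums to 1, ready to be propagated through the TPM
```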

Computationally, with the very inefficient script files I have built (I freely admit I am no programmer, and I wrote them somewhat quickly so that the ideas would not die before seeing the light of day, so there are a lot of unnecessary steps and redundancies) and my 2-core machine, it takes something like 20 minutes to process the whole cipher and propagate every number's initial probability vector by proceeding in layers.  One example of an inefficiency: despite the fact that the DOITPM stabilizes at about the 15th power, I got ambitious to squeeze out even the slightest changes, and am sometimes calculating it up to the 2000th-plus power, rather than caching the 15th power and using that.  Thus, every time I want to modify a probability vector for a particular number in the cipher, I have to wait 20+ minutes for it to finish recalculating.  It would be nice to have a gazillion computers working in a grid, though.

Nevertheless, as I have mentioned, one has to proceed in layers; by this I mean the following.  Say you have an initial belief about the cipher's 1.  We can propagate this belief all the way up to the cipher's 2000.  But say you also have an initial probability vector for the cipher's 6.  At 6, the probability "waves" begin to collide.  It would be ideal if there were a way to combine the (discrete) probability waves to obtain a more certain probability vector (kind of like a Kalman filter for discrete distributions).  I could not think of one (although I want this to be the subject of a future post), or at least not one that would give me a more certain wave, so I opted to choose the wave that was entropically closer to 0.  Since we have 26 "boxes" with different proportions of "balls" in them (the respective probabilities of being a particular letter), we can use information entropy to pick the vector that carries the most "certain" belief: the one with the least entropy.  Thus, if 6's prior contains less entropy than 1's propagated vector, I stick with 6's vector, and vice versa.  I do this with all vectors at every single position (which is also part of why it takes about 20 minutes to process).
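The entropy comparison itself is simple to implement.  A sketch (with short 4-entry vectors for illustration rather than full 26-letter ones):

```python
import math

def entropy(p):
    """Shannon entropy (in bits) of a discrete probability vector."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def keep_more_certain(v1, v2):
    """Of two colliding belief 'waves', keep the one with less entropy."""
    return v1 if entropy(v1) <= entropy(v2) else v2

prior = [0.0, 0.9, 0.1, 0.0]           # a fairly certain belief
propagated = [0.25, 0.25, 0.25, 0.25]  # maximal uncertainty

print(keep_more_certain(prior, propagated))  # keeps the more certain one
```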

Of course, if Beale cipher number 1 was encoded with a key whose transition probabilities are similar to the DOI's, a reasonable assumption if we think Beale encoded the cipher using a text from his time (similar stylistically, etc.), then we can use the DOITPM to attempt a decoding of it.  If instead we believe that he used a text he himself wrote, an analysis of Beale cipher number 2 could yield a TPM that could crack it.  Lastly, if we believe that Beale cipher number 1 was encoded with a key very much like any other book in the world, we could again examine such a TPM and attempt a decipherment.

In my next post, or possibly upon revision of this one, I will post an .xls file containing the DOITPM and the probability vectors of each number in Beale cipher 1, assuming as prior beliefs that the cipher's 1 and 71 have the frequencies associated with first letters in English, and that all the rest have the standard frequencies of English letters.  I also intend to post another .xls file containing the TPM of a typical book.

The new probability vectors obtained in this fashion for each number of the cipher differ markedly from the typical letter frequencies.  This makes sense, because the frequencies now depend on the transitions of the letters in the key itself.  This may be why some cryptographers have disputed (on statistical grounds) that the cipher is written in English; I think it is written in English, only the proportions of the letters in the cipher change because of the particular proportions of the first letters of words in the key.

If anyone is interested in the script files, I may post them too: they are .m files, so you can run them on either Octave or Matlab.  Leave a message below.

Otherwise if anyone is interested in hooking up computers for the processing of this (or donating money so I can buy several :)), also leave a message below or contact me using the contact form.  Thanks!

On the Beale Cipher, Part I

April 27th, 2010

Ugh, so right now there is construction going on behind my house; an abandoned house was torn down, and I can only imagine Telecable extending its dominions over the lot.  The street is zoned residential, not commercial, but the owner seems to be politically well-positioned: he gets to do whatever he wants.  In Mexico, political power is concentrated among a few, and not necessarily the smartest (quite the contrary, in fact; the majority seem pretty dumb).  I have often wondered why those whose ambition is a political post (in the legislative, judicial, or executive branches) are not required to take an IQ or aptitude test (...and the smartest selected): I'd much rather have smart crooks than stupid ones.

At any rate, with all that pounding in the background, I thought I'd talk a little bit about my thoughts on the Beale cipher.  I started studying it (at the beginning of April, actually) enthusiastic about the prospect of deciphering it completely, but I have become convinced this is not a task I can accomplish easily and without any support -- in other words, it's difficult to do on my own.  Besides, I am terrible at crossword puzzles, a skill that seems necessary for guessing certain words.  I can't help but remember my friend Seth K, who could complete the NY Times Sunday crossword in something like five minutes.  Anyway: my motivator was the challenge, or the mathematical aspect of it, not the 40 million US dollars the first cipher is purported to be about.  For some background and history, I earnestly recommend Simon Singh's book "The Code Book."  It is extremely entertaining -- and it quickly became one of my favorite books.  Of course there's also the Wikipedia article, which is much more succinct.

When I first looked at it, in the chapter of Singh's book titled "Le Chiffre Indechiffrable," I suspected, as I do now, that it is not entirely undecipherable.  The reasons for this:

*1. The Beale cipher is not a one-time pad, so it may be "breakable" without specifically requiring the key.

*2.  The fact that certain numbers repeat themselves (the number 18 appears some 8 times; others appear only once) suggests that the Beale cipher is susceptible to a form of frequency analysis.

*3.  Certain sequences, for example, 64 following 18 in two different parts (first Beale cipher letters 124-125 and 187-188) may be important.

*4.  The distance between numbers may be significant.  It would seem that the smaller the distance, the stronger the relationship between the letters; thus, the closer they are together, the more information we could potentially squeeze out of them.  Sequence repetitions may also help a lot (64 following 18 twice, for example), and they add credence to points 1 and 2.

5.  The letter encoded by 1 has a different probability distribution of being a particular letter than do the others (see the Wikipedia article).  The same goes for 71, which is the first number of the first Beale cipher.  This may be true of other positions too, but it's not exactly clear which, because we don't know where one word ends and another begins.  We can surely use this to augment our frequency analysis.

6.  It cannot escape us, then, that if the Beale cipher is a book cipher, each number in the code likely represents the first letter of a word from the key.  Merely knowing the first-letter frequencies of the key, however, is not likely to be much help.  But there does seem to be another bit of information in the key that would be extremely useful (and that we can somewhat extract), which I will explain in my next post.

Later, I will describe how I have incorporated all these pieces of information into an analysis of frequencies, or an augmented form of the usual frequency analysis.  For now, suffice it to say that the starred statements I realized fairly early; the unstarred ones I noticed only after developing the analysis a bit.

On revolutionizing the whole of Linear Algebra, Markov Chain Theory, Group Theory... and all of Mathematics

April 22nd, 2010

I have been so remiss about writing here lately!  I'm so sorry!  There are several good reasons for this, believe me.  Among them: (1) I have been enthralled with deciphering a two-hundred-year-old code, Beale cipher part I, with no substantial results except several good ideas that I may yet pursue and expound on here soon.  But this post is not intended to be about that.  (2) My computer died around December, and when I got a new one I hadn't downloaded TeX; I used this as an excuse not to write up proofs from chapter 1 of Munkres's Topology, and so I have added none.  I slap myself for this (some of the problems are really boring, although I have to admit they are enlightening in some ways, part of the reason I began doing them in the first place).  (3) The drudgery of day-to-day work, which is so utterly boring that it leaves me little time for "fun," or math stuff, with my attention constantly hogged by every possible distraction at home, etc.  Anyway.

For a few months now I have been reading a lot about Markov chains, because they have captured my fancy (they are so cool), and in fact they tie in to a couple of projects I've been working on or thinking about.  I even wrote J. Laurie Snell because a chapter in his book (the one on Markov chains) was excellent, with plenty of amazing exercises that I really enjoyed.  Looking over that book and a Schaum's outline, a couple of questions came into my head, and I just couldn't let go of them; I even sort of had to invent a concept, which I want to describe here.

So here is my interpretation of what a Markov chain is, with zero rigor: suppose you have  n < \infty states, and position yourself at state  i .  In the next time period, you are allowed to change state, and you will jump to another state  j (possibly  i itself) with probability  p_{ij} .  These probabilities can be neatly summarized in a finite  n \times n matrix, each row being a discrete distribution of your jumping probabilities, so that each row sums to 1.  I think it was Kolmogorov who extended the idea to an infinite matrix, but we must be careful with the word "infinite": the number of states is still countable, so they are summarized by an  \infty \times \infty countably infinite matrix.  Keen as you are, dear reader, you know I'm setting a question up: what would an uncountably infinite transition probability matrix look like?  No one seems to be thinking about this, or at least I couldn't find any literature on the subject.  So here are my thoughts:

The easiest answer is to consider a state  i to be any of the real numbers in an interval, say  [0,1] , and to imagine that such a state can change to any other state in that interval (which is isomorphic to any other closed interval of the same type, as we know from analysis).  This is summarized by a continuous probability distribution on  [0,1] whose integral is again 1; a good candidate is the beta density with parameters (2,2), namely  6x(1-x) .  I think we can "collect" such probability distributions continuously on  [0,1] \times [0,1] : a transition probability patch, as I've been calling it.  It turns out to be important, if patches are going to be of any use in the theory, to be able to raise a patch to powers (akin to raising matrices to powers), to multiply patches by (function) vectors and other tensors, and to extend the common matrix algebra to conform to patches; but this is merely a mechanical problem, as I describe in the following pdf.  (Comments are very welcome, preferably here on the site!)


As you may be able to tell, I've managed to go quite a long way with this, so that patches conform reasonably to a number of discrete Markov chain concepts, including a patch version of the Chapman-Kolmogorov equations.  But having created patches, there is no reason why we cannot extend the idea to "patchixes," or continuous matrices on  [0,1] \times [0,1] , without the restriction that each row cross-section sum to 1; in fact it seems possible to define identity patchixes (patches), and, in further work (hopefully I'll be involved in it), kernels, images, eigenvalues and eigenvectors of patchixes, commuting patchixes, commutator patchixes, and a slew of group-theoretical concepts.

Having defined a patchix, if we think of its values as coefficients in front of, say, a polynomial, can we not imagine a new "polynomial" object that runs through exponents of  x continuously over  [0,1] , with each term being "added" to the next?  (Consider, for example, something like  \int_0^1 g(i) x^i \, di , a continuous analogue of  \sum_i g(i) x^i .)  I think these are questions worth asking, even if they are a little bit crazy, and I do intend to explore them some, even if it later turns out to be a waste of time.