Saturday, June 28, 2008

Information Entropy

A while back I brought up the infinite monkey theorem.

we seemed to hit a paradox at some point.


Suppose that you have a message of N characters which we will call M.

how much information does such a message contain? assume spaces or punctuation don't count.


one way to measure the amount of information that a message contains is to measure the amount of information that it would take you to store such message.

you could have N number of boxes with C number of subchambers one for each possible subunit
like a four dial lock with the digits 0-9 in each dial as a possible choice.
this message could be stored in four boxes with ten subcompartments by naming the boxes ABC & D and the sub compartments 0-10. therefore with four pebbles you can mark down the message.
to universalize this have a strip of paper with 40 circles in a row and a hole punched in the first 10 then another in the second ten and so on. it takes 40 bits to store the password of this lock on this mechanism.

however this information is compressible note the fact that 9999 the highest possible password is the binary number 10011100001111 that's 14 bins (digits are for decimals) with 2 possibilities thats only 14 circles on a strip of paper that are either punched or unpunched.
and we just compressed our information from 40 bits to 14 without compromising any part of the information.

If you have a gigantic keyboard with all the alphabets in the multiverse (i.e. hebrew, latin, cyrillic, Babylonian Cuneiform, Vogon, Martian and Alphacentaurian) then you would have a lot of characters to pick from and it would not likely be very efficient to store messages.

One pretty cool danish ( if Kenedy is "ein Berliner" then this guy can be a pastry too.) Piet Hein wrote

Problems worthy of attack prove their worth by hitting back
is 26^50= 5*10^70 bits of information in characters
but there are no Zs Xs Qs Js or Ds.
so 21^50= 1*10^66.
but noting that those letters are missing also takes information.

if we use words instead of characters it would make sense to think that this would reduce the amount of information needed since we can't ever make a combination like stakfl ssd kkjkjjjkjkjjksdjsdsdsdsda fammvodfdfd. because at least all the words would make sense, it would take less storage space.
86000^10=2*10^49
for long enough messages. if synonyms are eliminated imagine the possibilities.
its a tradeoff of universality and for efficiency and a more complex coding mechanism.


http://www.edict.com.hk/textanalyser/wordlists.htm

the link directs you to a place where you can see what the frequency of usage of each word
apply it to this formula.

therefore in a N words message the amount of disorder or entropy contained in a message is equal to:

where p(xi) is the probability of word number i in a message

lets find the entropy of a message whose language uses all words with equal frequency and has M words so the probability of each word is 1/M and n is the number of words then the entropy of this message would be n/M*log2(M)
for Piet Hein's message

Word Probability
Problems-----------0.000243
worthy-------------0.000028
of-------------------0.035839
attack--------------0.000103
prove--------------0.000052
their---------------0.002628
worth--------------0.000093
by------------------0.005224
fighting-------------0.000071
back----------------0.000952

will have to be calculated by Derrida (you) as homework.


0 comments: