David Brown
8/5/2015 9:08:00 AM
On 31/07/15 13:24, Richard Heathfield wrote:
> On 31/07/15 12:07, Paul N wrote:
>
> <snip>
>
>> If the data were truly random, you'd expect about
>> the same amount of each byte, so if there is a big
>> difference between the most and least frequent values
>> that would suggest that there's *something* going on.
>
> That's right. Thing is, "about the same amount of each byte" is a bit
> nebulous, isn't it? I'm not quite sure how to express this, but I want
> to be able to decide, in a binary (yes/no) fashion, whether there's
> *something* going on, in a programmatic or at least numerical way. How
> much something does there need to be (how much departure from a flat
> distribution) before it's *something* rather than just something?
>
Of course, even with different frequencies for different bytes, it could
be a random process (think of rolling two dice and adding them, for a
simple example).
The first step, I think, would be to draw two graphs - one ordered by
frequency, and one ordered by byte. That might give you some inspiration.
But without any more information, or a clear outstanding pattern, it's a
lost cause. Given only what you have, there is no way to distinguish line
noise from a gzip'ed Shakespeare play.
>> Beyond that, you could compare with the frequencies
>> of letters, to see if it might be a simple substitution
>> cypher (each letter replaced with a particular different one).
>
> Er, yeah, that's relevant too, I guess, but the day I have trouble
> cracking a simple substitution cipher is the day I hang up my keyboard. :-)
>