24

If we have to determine whether a file is encrypted, can we use Shannon entropy computed over the file's contents?

As discussed here, entropy (in bits per byte) closer to 0 indicates more orderly data, and entropy closer to 8 indicates more random data.
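For concreteness, the bits-per-byte figure can be estimated from the byte histogram of the file. The following is a minimal Python sketch; the name `shannon_entropy` is illustrative, and the function only measures the empirical distribution of single byte values, ignoring any ordering or structure between bytes.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Empirical entropy of the byte histogram, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    n = len(data)
    counts = Counter(data)  # maps each byte value to its number of occurrences
    # Sum of p * log2(1/p) over the byte values that actually occur.
    return sum((c / n) * math.log2(n / c) for c in counts.values())
```

A file made of one repeated byte scores 0, and a file in which all 256 byte values occur equally often scores 8.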

Intuitively, this seems likely to generate a lot of false positives. Can anyone explain whether this approach is appropriate for determining whether an arbitrary file is encrypted?

otus
sashank
  • 1
    Pure random files will have the most, followed by encrypted, then compressed. Audio and image files have less, text even less, HTML/XML far less... See https://www.youtube.com/watch?v=T0MVe4aud30. The real smoking gun for encrypted files is a uniform distribution of bit values; see around 15m into the video specifically. – dandavis Jun 29 '16 at 06:32
  • 1
    Shannon entropy is defined for a process that generates a random output, not for strings of data. Kolmogorov complexity is uncomputable (and would only be minimally bigger for an encrypted file than for the equivalent plaintext). So you're pretty much left with trying to compress the file (e.g. using gzip/deflate) and checking if it gets smaller. – CodesInChaos Jun 29 '16 at 15:05
  • I'm pretty sure /etc/entropy.bin is not encrypted and it has no meaningful headers. – Joshua Jun 30 '16 at 02:02
  • Good encryption does NOT necessarily lead to high entropy! Imagine you encrypt the cipher stream of a stream cipher with the same cipher. This leads to a lot of zeroes, but the encryption is still strong. Similar things can be achieved with block ciphers. I can't verify now, but I guess this video contains examples: https://media.ccc.de/v/31c3_-5930-en-saal_6-201412291400-funky_file_formats-_ange_albertini – bot47 Jun 30 '16 at 09:29
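The compress-and-compare check suggested in the comments above can be sketched as follows. This is only an illustration: it uses Python's `zlib` (deflate), and the helper name `compression_ratio` is made up for this example.

```python
import os
import zlib


def compression_ratio(data: bytes, level: int = 9) -> float:
    """Compressed size divided by original size.

    Values well below 1.0 mean deflate found redundancy; values at or
    slightly above 1.0 mean the data looks incompressible.
    """
    if not data:
        return 1.0
    return len(zlib.compress(data, level)) / len(data)


if __name__ == "__main__":
    print(compression_ratio(b"hello world " * 1000))  # highly redundant text
    print(compression_ratio(os.urandom(65536)))       # stand-in for ciphertext
```

An incompressible result is still ambiguous: already-compressed data, random data, and ciphertext all fail to shrink.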

2 Answers

34

You are likely going to have both false positives and false negatives if you try to use Shannon entropy for this.

  • Many compressed files would have close to 8 bits of entropy per byte, resulting in false positives.
  • Any encrypted file that has some non-binary encoding (like a file containing an ASCII-armored PGP message, or just a low entropy header) could have a lower entropy, resulting in false negatives.
  • If format-preserving encryption is used, this might result in false negatives too.

It may work as a heuristic, but you should not rely on the results being correct.
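Both failure modes are easy to reproduce. The sketch below rests on stated assumptions: `os.urandom` output stands in for real ciphertext, the "plaintext" is synthetic text built from a seeded random vocabulary, and `byte_entropy` is an illustrative helper that measures the byte histogram only.

```python
import base64
import math
import os
import random
import zlib
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Empirical entropy of the byte histogram, in bits per byte."""
    n = len(data)
    if n == 0:
        return 0.0
    return sum((c / n) * math.log2(n / c) for c in Counter(data).values())


# Assumption: random bytes stand in for the output of a sound cipher.
ciphertext_like = os.urandom(49152)

# False negative: the same bytes in an ASCII armor use only 64 symbols,
# so the histogram entropy cannot exceed 6 bits per byte.
armored = base64.b64encode(ciphertext_like)

# False positive: unencrypted text that has merely been compressed.
rng = random.Random(0)
letters = "abcdefghijklmnopqrstuvwxyz"
words = ["".join(rng.choice(letters) for _ in range(rng.randint(2, 9)))
         for _ in range(2000)]
text = " ".join(rng.choice(words) for _ in range(50000)).encode()
compressed = zlib.compress(text, 9)

if __name__ == "__main__":
    for label, blob in [("ciphertext-like", ciphertext_like),
                        ("base64-armored", armored),
                        ("plain text", text),
                        ("compressed text", compressed)]:
        print(f"{label:16s} {byte_entropy(blob):.3f} bits/byte")
```

The armored data scores about as low as some binary formats, while the compressed text scores close to the ciphertext stand-in.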

sashank
otus
  • 2
    You could, however, check the first 4-8 bytes of the file and see whether the magic number is part of a list of known compressed or binary file formats. Having PNG, PK, MZ, GIFa, PDF or similar as the first bytes will tell you that the file is not encrypted. This requires a massive lookup list for detection. – Ismael Miguel Jun 29 '16 at 14:49
  • 2
    That is a good point. But your point is slightly moot when you know the structure of these files. Yes, the content could be encrypted, but it would be a corrupted file (I would be surprised if I saw a valid encrypted PNG) or it would say somewhere in its structure that "this is encrypted". I agree with the premise that you can't do this reliably, but you can rule out some "points" by knowing the file structure. And to answer your question, you can't know for sure whether a file composed of scrambled bytes is just garbage (e.g. random mojibake) or really encrypted. – Ismael Miguel Jun 29 '16 at 15:04
  • 1
    @IsmaelMiguel Perhaps not PNG, but borrowing from my comment to Will's answer: I can use a perfectly valid TIFF container to hold a binary PGP-encrypted message, which the 'file' command will almost certainly quite happily call a TIFF image and move on with life. (I'm even quite sure I could make one that holds both a PGP-encrypted message and a real image, which opens and displays fine!) – user Jun 29 '16 at 15:11
  • @MichaelKjörling Won't that have somewhere in the file saying it has a key? – Ismael Miguel Jun 29 '16 at 15:18
  • @IsmaelMiguel Surprise! http://www.cis.upenn.edu/~cis110/12fa/hw/hw05/TOYXpipe.png – OrangeDog Jun 29 '16 at 15:25
  • @OrangeDog You're saying that, opening the file with Notepad, anything in the IDAT chunk is the result of encrypting the content of an image? And that it happens to be a valid PNG? Or is it the encryption output represented as pixels? The latter is easy to do. Here I have the jQuery library, represented as pixels: http://i.imgur.com/PGlFaT1.png – Ismael Miguel Jun 29 '16 at 17:04
  • 2
    @IsmaelMiguel that is a PNG that is both valid and encrypted, as you said you would be surprised to see. It's not difficult to find more examples, or indeed to generate them. – OrangeDog Jun 29 '16 at 17:19
  • @OrangeDog If the IDAT chunk has nothing but encrypted data (that is, it isn't pixel data), then I am surprised. – Ismael Miguel Jun 29 '16 at 17:52
  • 1
    A compressed file is encrypted. I know you don't think of it that way, but it's an encoded mapping of the original data. The fact that you have the decoder handy (7-zip or whatever) doesn't change this. – Carl Witthoft Jun 29 '16 at 19:40
  • @IsmaelMiguel when you don't have the key, what's the difference? – OrangeDog Jun 29 '16 at 20:21
  • @CarlWitthoft I can't see how. – Ismael Miguel Jun 30 '16 at 00:02
  • @OrangeDog The difference is that one is just the data, drawn as pixels in the image. The other is the data in the IDAT chunk being encrypted and placed back, while still showing a valid PNG image. That's the difference. – Ismael Miguel Jun 30 '16 at 00:04
  • 3
    Conversely, I can run an encrypted file through uuencode/base64/ascii85 and produce a file with correspondingly less entropy. – Aron Jun 30 '16 at 01:05
  • 1
    @IsmaelMiguel you seem to think "the data" must be something in particular. – OrangeDog Jun 30 '16 at 08:07
  • @OrangeDog Sorta. When I said "the data", I meant "the given result after encrypting some plaintext". An example of "data" could be a tiny text file that was encrypted. – Ismael Miguel Jun 30 '16 at 09:12
  • @IsmaelMiguel Images can be plaintext too. – OrangeDog Jun 30 '16 at 09:33
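The magic-number lookup proposed in the comments above might look like the sketch below. The table is a tiny illustrative subset, and `SIGNATURES` and `sniff` are names invented for this example. As the thread points out, a match only says the file starts like that format; it does not prove the payload is unencrypted.

```python
from typing import Optional

# A few well-known leading byte sequences; a real tool needs a far larger list.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"PK\x03\x04": "ZIP archive",
    b"MZ": "DOS/Windows executable",
    b"GIF87a": "GIF image",
    b"GIF89a": "GIF image",
    b"%PDF-": "PDF document",
    b"\x1f\x8b": "gzip stream",
}


def sniff(header: bytes) -> Optional[str]:
    """Return a format name if the first bytes match a known signature, else None."""
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return None


if __name__ == "__main__":
    print(sniff(b"%PDF-1.7\n"))           # matches the PDF signature
    print(sniff(b"\x00\x01\x02\x03"))     # no match
```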
6

Yes, it's a good indicator, and no, there won't be many false positives.

A high-entropy file indicates that a file is either well-encrypted, well-compressed or just contains truly random bytes.

Most compression formats have recognizable headers and similar markers, so these can be easily distinguished.

Most people do not have files of random bytes lying around - why would they?

Strong cryptography strives for ciphertext indistinguishability, which is a necessary property for security but also makes the ciphertext stand out.

So imagine you are the police or border agent interrogating a suspect. There is a file on their computer that is seemingly random. You will conclude that it must be encrypted, and demand the suspect hand over the keys.

People also strive to find "distinguishers" for standard ciphers and encryption formats, for example TrueCrypt.

Hiding a message, beyond just encrypting it, is called steganography.
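The screening procedure this answer describes (high entropy, then rule out recognizable compressed formats) can be sketched roughly as follows. The 7.5 bits-per-byte threshold, the short signature list, and the names `byte_entropy` and `triage` are all assumptions made for illustration, not established values.

```python
import gzip
import math
import os
from collections import Counter

# Leading bytes of a few compressed formats: gzip, ZIP, bzip2, xz, 7z.
COMPRESSED_MAGIC = (
    b"\x1f\x8b",
    b"PK\x03\x04",
    b"BZh",
    b"\xfd7zXZ\x00",
    b"7z\xbc\xaf\x27\x1c",
)


def byte_entropy(data: bytes) -> float:
    """Empirical entropy of the byte histogram, in bits per byte."""
    n = len(data)
    if n == 0:
        return 0.0
    return sum((c / n) * math.log2(n / c) for c in Counter(data).values())


def triage(data: bytes, threshold: float = 7.5) -> str:
    """Coarse screening label; the threshold is an arbitrary illustrative choice."""
    if byte_entropy(data) < threshold:
        return "low entropy"
    if data.startswith(COMPRESSED_MAGIC):
        return "looks compressed"
    return "candidate: encrypted or random"


if __name__ == "__main__":
    print(triage(b"plain text " * 1000))
    print(triage(gzip.compress(os.urandom(65536))))
    print(triage(os.urandom(65536)))
```

The final label is only a candidate for closer inspection; random data and ciphertext hidden inside a valid container are not separated by this check.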

Will
  • 4
    Not "well" encrypted. Even very broken encryption algorithms can be random enough that Shannon entropy is close to maximum. (E.g. nonce reuse with a stream cipher.) – otus Jun 29 '16 at 06:49
  • Entropy calculation won't really help identify and classify. Just one of many situations where your approach will fail: take a base64-encoded version of a file and you won't be able to tell whether it is encrypted, compressed, simply base64 encoded, or a file containing data that just happens to include chars within the base64 range. If you think that's uncommon, look at armored PGP messages and ask yourself if an entropy scan of Snowden's laptop would have helped any border agent to pinpoint encrypted files among regular text and binary files. TL;DR: entropy calculation won't really help classify. – e-sushi Jun 29 '16 at 09:14
  • @e-sushi all the low-entropy files can also be classified, which is not incompatible with my stance. Imagine you are a Russian border agent scanning a laptop. You can run the 'file' command over everything. Any files that are high entropy that 'file' cannot recognise are likely to be encrypted. – Will Jun 29 '16 at 09:21
  • 3
    ... or compressed, or binary (just do an entropy calculation of your MP3 or video collection). Entropy calculation by itself won't cut it. You said it yourself by pointing to compressed files and the need to check their headers, which goes beyond entropy calculation and fixes the false-positive issue @otus pointed at in his answer. Note that the question didn't ask how to differentiate files, but whether entropy calculation suffices to identify them. It obviously doesn't. – e-sushi Jun 29 '16 at 09:38
  • @e-sushi headers and structure decrease entropy. Either a file is low entropy, i.e. it can be recognised, or it is not trivially distinguishable from noise. If the latter, then the question is whether it is random, compressed or encrypted. I address this in my answer. – Will Jun 29 '16 at 09:45
  • 1
    ... right. Now think about that for a minute. – e-sushi Jun 29 '16 at 09:46
  • 2
    I'm pretty sure that, with for example a reasonable-length MP3 or a reasonable-size JPEG photo, the well-known headers are going to be completely dwarfed by the payload data. Hence, the headers won't significantly affect the overall "entropy" of that file. And I can use a perfectly valid TIFF container to hold a binary PGP-encrypted message, which the 'file' command will almost certainly quite happily call a TIFF image and move on with life. – user Jun 29 '16 at 14:18
  • What if it's encrypted, and then compressed? – OrangeDog Jun 29 '16 at 15:26
  • 2
    I think commenters may be overly fixated on the word "determine" in the question. This answer does not take that literally and considers this a more practical than academic question. I give +1 here because I expect this technique should work well in practice, quickly and easily paring down a large hard drive into relatively few candidates which can then be examined more carefully. As an extra step, any compressed archives found should be exploded and the contents scanned in the same way. – wberry Jun 29 '16 at 15:44
  • 2
    RE base64 encoding, if you take XML plaintext and base64 encode that, you can visually see the patterns in the encoded text. With practice one could learn to recognize certain plaintexts in the encoded form. base64 is not a scramble and will not greatly affect the entropy measure. – wberry Jun 29 '16 at 15:45
  • "Most people do not have files of random bytes lying around - why would they?" To provide plausible deniability when someone claims that some file is encrypted. Why do so many people and web sites use SSL for data that's not sensitive? – David Schwartz Jun 30 '16 at 02:04
  • @DavidSchwartz because it's easier to config, and to prevent mixed content warnings? :-D – John Dvorak Jun 30 '16 at 04:55
  • Plausible deniability for large files of "random" data: an app that needs large amounts of truly random data that can't be reused, but has no access to a TRNG dongle and only intermittent access to the Internet. The solution would be to hold a large buffer that is refilled from a trusted web service once the machine connects. You could say you're a big fan of VLT poker and you strive for utmost accuracy. Why, yes, here is the poker app I've been talking about, if you don't mind going one directory level up. Let me run it and give me a Wifi connection and you can watch it connect to Random.org. – John Dvorak Jun 30 '16 at 05:23
  • @JanDvorak and now when the goon googles this excuse they see, right at the top of the listings, this comment. Cover blown! Goon says "that is a classic excuse, Sir! Right, pills and a $5 wrench it is then!" – Will Jun 30 '16 at 06:05
  • I thought I was defending against cops, not mafia. For mafia, I guess the strategy would be either to destroy the disk (if the data absolutely can't go to enemy hands) or to defect my allies (if I prefer keeping my hands over keeping my dignity). Cyanide pill time in 3... – John Dvorak Jun 30 '16 at 06:09