9

Is there a tool available that will generate a grammar from a corpus of sample inputs, similar to what HotFuzz does for network protocols?

For example, given a collection of MP3 files, I'm looking for a tool that would generate a BNF grammar to describe the format of MP3.

nneonneo
  • 1,283
  • 10
  • 15
mrduclaw
  • 4,066
  • 8
  • 27
  • 40

1 Answers1

3

I haven't used either, but Peach Fuzzer, on which HotFuzz is based, has a "Peach Fuzz Bang" GUI for fuzzing files.

Keep in mind, though, that fuzzers try to generate invalid inputs that crash a program, not determine the exact grammar that describes all valid input.

Furthermore, strictly speaking, it isn't mathematically possible to do what you're asking. If a computer could comprehensively learn a language merely by reading text in that language, then machine translation would be a solved problem. (This is a slightly poor analogy since not all human languages are context-free, but the idea is clear.)

  • I'm probably showing my ignorance, but it doesn't seem like it should be that impossible of a problem, right? Given a bunch of sample data, I should be able to diff and determine the general structure (e.g. 4-byte header that stays the same, followed by a 4-byte int, maybe a NULL-terminated string follows, etc). Why would doing that be impossible? – mrduclaw Apr 02 '13 at 03:14
  • @mrduclaw: It's just the title of your question that's the impossibly general part, and is the subject of research. If you just want to figure out the common header structure of a set of binary files, see this SO question, which has some good answers. – Daniel W. Steinbrook Apr 02 '13 at 03:29
  • Great paper suggestion, seems far from impossible, thanks! – mrduclaw Apr 02 '13 at 03:36
  • Angluin's paper on L* gives a proof that grammar inference can be achieved in polynomial time if an oracle is available to answer membership queries. This paper includes descriptions and discussion of performance of other algorithms for grammar inference. An approach as used by Polyglot might scale better, though. – Mathew Hall Apr 02 '13 at 09:22
  • Here are some papers about a technique to derive a context free grammar based on the binary analysis of the polymorphic engine of a virus [1,2,3]. I am not sure it is really relevant here... But, maybe you can find some use of it. – perror Apr 02 '13 at 17:40