I want to parse a big list of variable and method names from code and recognise whether they are real words (or are made up of real words). Essentially, I want to recognise whether they are obfuscated or not.
For example, the idea is to build something that understands that the following are non-obfuscated names:
- VeryLongVarName
- counter
- my_iterator
but the following are obfuscated:
- aaabbcc
- zAxX
- FDSLFUnmfs
- a
This also needs to be language-independent: I am not focusing on recognising English (names could be French or anything).
My idea is to extract some kind of metric from each name and, if it falls within certain ranges, treat the name as obfuscated. What do you think?
What sort of metric could estimate this? For example, the randomness of the characters?
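To make the question concrete, here is a rough sketch of the kind of per-name metrics I have in mind. Everything in it is my own assumption rather than an established method: the helper names (`letters_of`, `char_entropy`, `vowel_ratio`, `longest_repeat`) are made up, the names are assumed to use the Latin alphabet, and `y` is counted as a vowel. It only computes the numbers; which ranges would mean "obfuscated" is exactly what I don't know.

```python
import math
from collections import Counter

VOWELS = set("aeiouy")  # assumption: Latin-alphabet names, 'y' counted as a vowel


def letters_of(name):
    """Lowercase letters of an identifier, dropping '_' and digits."""
    return [c for c in name.lower() if c.isalpha()]


def char_entropy(name):
    """Shannon entropy (bits per letter) of the name's letter distribution."""
    letters = letters_of(name)
    if not letters:
        return 0.0
    n = len(letters)
    return sum((k / n) * math.log2(n / k) for k in Counter(letters).values())


def vowel_ratio(name):
    """Fraction of letters that are vowels (0.0 for a name with no letters)."""
    letters = letters_of(name)
    if not letters:
        return 0.0
    return sum(c in VOWELS for c in letters) / len(letters)


def longest_repeat(name):
    """Length of the longest run of one repeated letter, ignoring case."""
    best = run = 0
    prev = None
    for c in letters_of(name):
        run = run + 1 if c == prev else 1
        best = max(best, run)
        prev = c
    return best


samples = ["VeryLongVarName", "counter", "my_iterator",
           "aaabbcc", "zAxX", "FDSLFUnmfs", "a"]
for name in samples:
    print(f"{name:16} entropy={char_entropy(name):.2f} "
          f"vowels={vowel_ratio(name):.2f} repeat={longest_repeat(name)}")
```

One thing I already notice: entropy on its own looks weak, because a real word like counter has seven distinct letters and so scores as high as any random seven-letter string. That is why I suspect I need something beyond plain character randomness.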
The whole thing doesn't need to be terribly accurate. If, for a given code base, it tells me that 20 out of 100 names tested seem to be obfuscated, then that's good enough for me.
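The report I am after would look something like the sketch below. `obfuscated_share` and `crude_check` are hypothetical names, and `crude_check` is only a placeholder rule (fewer than three letters, or no vowels at all) standing in for whatever metric turns out to work.

```python
def obfuscated_share(names, is_obfuscated):
    """Fraction of names (0.0 to 1.0) that the given check flags."""
    names = list(names)
    if not names:
        return 0.0
    flagged = sum(1 for name in names if is_obfuscated(name))
    return flagged / len(names)


def crude_check(name):
    """Placeholder rule, not a real detector: very short or vowel-free names."""
    letters = [c for c in name.lower() if c.isalpha()]
    vowels = sum(c in "aeiouy" for c in letters)
    return len(letters) < 3 or vowels == 0


names = ["VeryLongVarName", "counter", "my_iterator", "a", "zzz"]
share = obfuscated_share(names, crude_check)
print(f"{share:.0%} of {len(names)} names look obfuscated")
```

The placeholder rule is obviously too weak (it would not flag zAxX, for instance), but the output shape is what I need: a single percentage per code base.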
(I'm a complete newbie to linguistics!)