
I want to parse a big list of variable and method names in code and recognise whether they are (or are made up of) real words - essentially, to recognise whether they are obfuscated or not.

For example, the idea is to build something that understands that the following are non-obfuscated names:

  • VeryLongVarName
  • counter
  • my_iterator

but the following are obfuscated:

  • aaabbcc
  • zAxX
  • FDSLFUnmfs
  • a

This also needs to be language independent - I am not focusing on recognising English (names could be French or anything).

My idea is to extract some kind of metric from each name, and if it falls within certain ranges then the name is considered obfuscated. What do you think?

What sort of metric could estimate this - for example randomness of characters?

The whole thing doesn't need to be terribly accurate. If, for a given code base, it tells me that out of 100 names tested about 20% seem to be obfuscated, then it's good enough for me.

(completely newbie to linguistics!)

John
  • There are language and encoding recognizers that you could modify; one is part of the Mozilla codebase. I would expect there are other free ones available. You probably don't need the encoding part. They are built mostly from studying n-gram frequencies plus a few heuristics. They might need some tweaking to account for things such as underscores and camelCaps. You will never get 100% accuracy - you will just be able to tweak thresholds until you get something you find satisfactory for your use case. – hippietrail Jan 14 '14 at 04:49

1 Answer

Seems more a programming problem than a linguistics problem to me.

I'd use a fuzzy parser (probably done in Perl) that attempts to divide such variable names into smaller chunks using either underscores or changes in capitalization, then checks all resulting chunks (including original) vs. a dictionary of words in the target language - taking into account either the Levenshtein distance or the longest common substring.

You'd get a number of potential matches, assign a score to each (higher when fewer changes are required) and, voilà! If a score is high enough, you can be fairly confident it's a match, and thus a word in that language.
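To make that a bit more concrete, here is a minimal sketch in Python rather than Perl. Everything in it is an assumption made for illustration: the function names, the tiny `WORDS` set standing in for a real dictionary, the ASCII-only splitting regex, and the scoring formula (1 minus edit distance divided by chunk length). Treat it as a starting point to tweak, not a finished tool.

```python
import re

# Toy word list; in practice you'd load a real dictionary for each target language.
WORDS = {"very", "long", "var", "name", "counter", "my", "iterator"}

def split_identifier(name):
    """Split on underscores/digits and on changes in capitalization (ASCII only)."""
    chunks = []
    for part in re.split(r"[_\d]+", name):
        # Either a run of capitals not followed by lowercase (acronym-like),
        # or an optional capital followed by lowercase letters.
        chunks.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", part))
    return [c.lower() for c in chunks]

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def chunk_score(chunk, words):
    """1.0 for an exact dictionary hit, falling to 0.0 as more edits are needed."""
    if len(chunk) < 2:  # single letters are treated as non-words here
        return 0.0
    best = min(levenshtein(chunk, w) for w in words)
    return max(0.0, 1.0 - best / len(chunk))

def name_score(name, words=WORDS):
    """Best of: the whole name as one word, or the average over its chunks."""
    chunks = split_identifier(name)
    if not chunks:
        return 0.0
    average = sum(chunk_score(c, words) for c in chunks) / len(chunks)
    return max(average, chunk_score(name.lower(), words))

def looks_obfuscated(name, threshold=0.5):
    """The threshold is arbitrary; tune it against names you have labelled by hand."""
    return name_score(name) < threshold
```

With the toy word list, `looks_obfuscated("my_iterator")` is false and `looks_obfuscated("zAxX")` is true. Scanning the whole dictionary for every chunk is slow on large word lists, so a real version would probably want an index or a cheaper pre-filter first.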

I don't think there's such a thing as a universal discriminator to check whether a word exists in a language or not - some languages (both natural and constructed) have very unusual capitalization patterns, even within a word (e.g. Irish or Klingon), or can have long sequences of just consonants which are nonetheless valid words (Serbian and Berber). Worse yet, someone could be typing Chinese characters using the Tsang-chieh method: a seemingly random "JWJ" actually encodes the character 車 "car, vehicle"! Ditto for the "4 corners" method: an initial letter plus a series of numbers could be a valid word.

Joe Pineda