
GMail has a feature where it warns you if you try to send an email that it thinks was meant to have an attachment but doesn't.

Did you mean to attach files?

Because GMail detected the string "see the attached" in the email, but no actual attachment, it warns me with an OK / Cancel dialog when I click the Send button.

We have a related problem on Stack Overflow. That is, when a user enters a post like this one:

my problem is I need to change the database but I don't won't to create 
a new connection. example:

DataSet dsMasterInfo = new DataSet();
Database db = DatabaseFactory.CreateDatabase("ConnectionString");
DbCommand dbCommand = db.GetStoredProcCommand("uspGetMasterName");

This user did not format their code as code!

That is, they didn't indent by 4 spaces per Markdown, or use the code button (or the keyboard shortcut ctrl+k) which does that for them.

Thus, our system is accepting a lot of edits where people have to go in and manually format code for people that are somehow unable to figure this out. This leads to a lot of bellyaching. We've improved the editor help several times, but short of driving over to the user's house and pressing the correct buttons on their keyboard for them, we're at a loss to see what to do next.

That's why we are considering a GMail-style warning:

Did you mean to post code?

You wrote stuff that we think looks like code, but you didn't format it as code by indenting 4 spaces, using the toolbar code button or the ctrl+k code formatting command.

However, presenting this warning requires us to detect the presence of what we think is unformatted code in a question. What is a simple, semi-reliable way of doing this?

  • Per Markdown, code is always indented by 4 spaces or within backticks, so anything correctly formatted can be discarded from the check immediately.
  • This is only a warning and it will only apply to low-reputation users asking their first questions (or providing their first answers), so some false positives are OK, so long as they are about 5% or less.
  • Questions on Stack Overflow can be in any language, though we can realistically limit our check to, say, the "big ten" languages. Per the tags page that would be C#, Java, PHP, JavaScript, Objective-C, C, C++, Python, Ruby.
  • Use the Stack Overflow creative commons data dump to audit your potential solution (or just pick a few questions in the top 10 tags on Stack Overflow) and see how it does.
  • Pseudocode is fine, but we use C# if you want to be extra friendly.
  • The simpler the better (so long as it works). KISS! If your solution requires us to attempt to compile posts in 10 different compilers, or an army of people to manually train a Bayesian inference engine, that's ... not exactly what we had in mind.
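
As a rough illustration of the first bullet, here is a minimal Python sketch that discards correctly formatted code before any detection runs. The function name `strip_formatted_code` is made up for this example, and the rules are deliberately simplified: any line starting with 4 spaces or a tab is treated as code, and only single-backtick spans are handled. Real Markdown has more cases (blank-line requirements, nested lists), so treat this as an assumption-laden sketch, not a parser.

```python
import re

# Inline code spans delimited by single backticks (simplified assumption:
# multi-backtick spans are not handled).
INLINE_CODE = re.compile(r"`[^`\n]+`")


def strip_formatted_code(markdown_text: str) -> str:
    """Return the post with indented code lines and inline code spans removed."""
    kept = []
    for line in markdown_text.splitlines():
        # Simplification: any line indented by 4 spaces or a tab counts as code.
        if line.startswith("    ") or line.startswith("\t"):
            continue
        kept.append(INLINE_CODE.sub("", line))
    return "\n".join(kept)
```

Whatever is left after this step is the only text the "did you mean to post code?" check would need to look at.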
Jeff Atwood
  • I would simply read each line and check whether it contains reserved words or other common words from the common languages. Each occurrence would increase a weight, and you decide what weight triggers the message. There are many tools that can analyze source code and provide you with a list of the most used words. –  Jun 28 '11 at 08:17
  • 34
    I think if you just always display the warning if there is no indentation present, you will be way below the 5% error limit. This is only half meant as a joke. – Konrad Rudolph Jun 28 '11 at 08:47
  • 59
    @Konrad This would work even better if the message would be: 'Either your question is missing code samples that would help others to understand it or you forgot to indent them properly'. This should cover 99% of all cases. – thorsten müller Jun 28 '11 at 09:09
  • Did you data mine the database to find all the edits that were "just add block syntax" (that is, 4 spaces before lines) and try to derive the rule from them? – Display Name Jun 28 '11 at 09:37
  • 3
    This is a GOOD question but I feel it doesn't have an answer. You show me an idiot-proof system and I will show you a better idiot. Even if this problem could be addressed by CODE, perhaps it shouldn't? It is these ignorant people who can't be bothered to ask a PROPER QUESTION that are RUINING this site for people like me who ask proper questions AND contribute proper answers IMHO. – maple_shaft Jun 28 '11 at 11:26
  • 2
    A common pattern I've seen is a block of code that was properly indented in itself, but where the first and last lines (usually only those two, sometimes more when showing multiple functions, for example) aren't labeled as code. This probably should be detected too. – 3Doubloons Jun 28 '11 at 15:09
  • 1
    Surely, if they're posting to StackOverflow, 99% of the time they should be posting code. I've lost count of the times I've posted a comment asking for the actual code before I can answer the question. Seems like you should show that message whenever you don't find a four-character-indented block. – Daniel Roseman Jun 28 '11 at 15:34
  • @maple_shaft: so, you're claiming you know all markdown syntax the first day you used them? – Lie Ryan Jun 28 '11 at 15:37
  • @Lie, Of course not, but before I post on any site I figure out how to markup code. If I can't figure this out then I state that I couldn't figure out how to do this and hopefully a moderator will come around and help me and then it won't happen again. Any other excuse by a poster is a lack of thought, lack of respect, or complete laziness. – maple_shaft Jun 28 '11 at 15:54
  • 3
    On a side note, the GMail confirmation text is rather confusing. If your answer on the first question is 'yes' then the answer on the second question is 'no'... – pimvdb Jun 28 '11 at 15:55
  • I suggest you change the text in the code button from '{}' to the word 'code'. It would require widening it quite a bit, but it would be much clearer for new users. I actually went through a stage on SO in which I knew I should format code, but couldn't figure out how. – Emilio M Bumachar Jun 28 '11 at 20:09
  • 2
    You mention that you don't want to spend ages training an inference engine, but surely you already have the training material: correctly marked-up code from the SO database dump. – Scott Jun 29 '11 at 12:21
  • You should probably also take a look at which languages are most commonly formatted incorrectly -- I bring this up because of the lisp comments. I bet lisp question askers are far more sophisticated than the javascript ones (in the aggregate, not a slight against javascript) – Jiaaro Jul 08 '11 at 17:55
  • 1
    Funny, I was expecting something entirely different after "short of driving over to the user's house and". It involved baseball bats. – Benjol Jul 15 '11 at 05:41
  • @benjol http://i.stack.imgur.com/8SAGb.png – Jeff Atwood Jul 15 '11 at 06:29
  • GitHub just published their library they use to detect programming languages. Maybe that might be of help: https://github.com/blog/881-linguist Unfortunately, it seems to be in Ruby, no idea how well that integrates with .NET. – Henrik Paul Jun 28 '11 at 09:00
  • The SO highlighter automatically does this already. The problem is that English is not among the languages that Github's Linguist library can detect. – Kevin Vermeer Jun 29 '11 at 03:43
  • isn't this just a syntax highlighter? Not really what was asked in the question. And it can't distinguish between English and code. – Jeff Atwood Jun 29 '11 at 11:49
  • A Bloom Filter may work for this. You can train it with all the existing code blocks. – gnibbler Jun 28 '11 at 09:59
  • How about reminding the new user to apply tags to their questions and then narrowing down the possible languages to detect based on the tags? – Kal Jun 28 '11 at 08:36
  • The research in this paper is pertinent: http://www.cs.mcgill.ca/~martin/papers/icse2013.pdf – james.garriss Oct 03 '14 at 12:39
  • Is it so bad if code is not formatted as code? – Florian F Oct 03 '14 at 12:47

14 Answers


A proper solution would probably be some learned/statistical model, but here are some fun ideas:

  1. Semi-colons at the end of a line. This alone would catch a whole bunch of languages.
  2. Parentheses directly following text with no space to separate it: myFunc()
  3. A dot or arrow between two words: foo.bar = ptr->val
  4. Presence of curly braces, brackets: while (true) { bar[i]; }
  5. Presence of "comment" syntax (/*, //, etc): /* multi-line comment */
  6. Uncommon characters/operators: +, *, &, &&, |, ||, <, >, ==, !=, >=, <=, >>, <<, ::, __
  7. Run your syntax highlighter on the text. If it ends up highlighting some high percentage of it, it's probably code.
  8. camelCase text in the post.
  9. Nested parentheses, braces, and/or brackets.

One could keep track of the number of times each of these appears, and these could be used as features in a machine-learning algorithm like perceptron, the way SpamAssassin does.
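
To make the feature-counting idea concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the regular expressions are one possible reading of the list above (the syntax-highlighter check is left out, and the single-character operators are skipped because they are noisy), and the weights and threshold are invented, not learned. A perceptron or similar model would learn the weights from data; the fixed numbers here only show the shape of the scoring.

```python
import re

# Hypothetical patterns, roughly one per heuristic in the list above.
FEATURES = {
    "semicolon_eol": re.compile(r";[ \t]*$", re.MULTILINE),
    "call_parens": re.compile(r"\b\w+\("),
    "dot_or_arrow": re.compile(r"\b\w+(?:\.|->)\w+\b"),
    "braces_brackets": re.compile(r"[{}\[\]]"),
    # The lookbehind skips the "//" in URLs such as http://...
    "comment_syntax": re.compile(r"/\*|\*/|(?<!:)//"),
    "operators": re.compile(r"&&|\|\||==|!=|>=|<=|>>|<<|::|__"),
    "camel_case": re.compile(r"\b[a-z]+(?:[A-Z][a-z0-9]+)+\b"),
    "nested_parens": re.compile(r"\([^()]*\("),
}

# Made-up weights; a trained model would supply these instead.
WEIGHTS = {
    "semicolon_eol": 1.0,
    "call_parens": 1.0,
    "dot_or_arrow": 0.5,  # dots between words are often typos or URLs
    "braces_brackets": 1.0,
    "comment_syntax": 1.0,
    "operators": 1.0,
    "camel_case": 0.5,
    "nested_parens": 1.0,
}
THRESHOLD = 3.0  # arbitrary cut-off, to be tuned against real posts


def count_features(text: str) -> dict:
    """Count how many times each pattern matches in the text."""
    return {name: len(pattern.findall(text)) for name, pattern in FEATURES.items()}


def looks_like_code(text: str) -> bool:
    """Weighted sum of capped feature counts compared against the threshold."""
    counts = count_features(text)
    # Cap each count at 3 so one repeated feature cannot dominate the score.
    score = sum(WEIGHTS[name] * min(n, 3) for name, n in counts.items())
    return score >= THRESHOLD
```

Run on the unformatted example post from the question, this sketch finds three end-of-line semicolons and three call-style parentheses, which is already enough to cross the invented threshold. The threshold and weights would need auditing against the data dump to stay under the 5% false-positive target.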

Ken Bloom
  • I thought of the syntax highlighter too. This would work even better if it were limited to consecutive keywords: for example, more than 5 keywords in a row, with almost all other words followed by () (= functions) or = (= variable assignments), or coming after type identifiers (declarations). – thorsten müller Jun 28 '11 at 09:13
  • 25
    Tips: 3 has a very low weight, because a dot between words can be the result of a typo. 5 should not match URLs. For 6, the ampersand is also frequently used outside the code context, so you might also weight that character less. Double check if the highlighter works, because it can highlight non-code text, as I sometimes see in Notepad++. – Tamara Wijsman Jun 28 '11 at 10:27
  • 8
    re the . as a typo - there would be no harm in flagging that as the author ought to edit anyway. – mmmmmm Jun 28 '11 at 10:29
  • 1
    You could also try to detect camel- or pascal- case words, since those do not appear in regular language, except in typos. Also words such as a_variable_name for instance. – Nobody Jun 28 '11 at 10:44
  • 4
    additionally, specific keywords that many languages have could help: WHILE, ELSE, IF, LOOP, BREAK, etc. – JoséNunoFerreira Jun 28 '11 at 10:45
  • 1
    and now i remembered this: you could add up the number of "rules" it triggers, to get it more accurate. basically, if you spot "()", a "WHILE" and a "!=", that's probably not three typos, it's code. – JoséNunoFerreira Jun 28 '11 at 10:48
  • 6
    Add "Usage of $ before non numeric words: $var is common in Perl and PHP (and Ruby?)." – PhiLho Jun 28 '11 at 11:37
  • 2
    Also think about catching multiple-words-with-more-than-one-hyphen; that would help identify Lispy languages that might not otherwise be caught by these rules. – JasonFruit Jun 28 '11 at 11:43
  • 1. Indeed, even if some languages are proud to allow omitting them (JavaScript, Lua, Scala...) or use a different symbol (or none at all, like Python). 2. Some people like to write func (x) (or func( x ) and other variants). But sure, the majority omits the space. 3. As pointed out, the dot doesn't work well in some cases (URLs, IP addresses, typos). 5. Perhaps more reliable if detected at the start of a (typed) line. Also lines starting with # or two dashes. 6. + and & aren't so uncommon as abbreviations, I think. -- Overall, a good set of suggestions, I just wanted to help to refine them. – PhiLho Jun 28 '11 at 11:45
  • 1
    Also, you could implement HINTS in the textbox's margin whenever code is detected. Many code editors (Visual Studio w/ ReSharper, DreamWeaver, etc) do something similar when they find errors/warnings/suggestions in code. – David Murdoch Jun 28 '11 at 12:35
  • I'd start off by looking at the tags to see what languages the question is about, to limit what processing is needed. Then use clues like this to form an approximation of where code is situated. Say, if a line has a clue that says it's a piece of code, then backtrack to attempt to find the first line of code. Then step after line 3 like debugger and repeat the process, this time looking forwards until there's some indication that we've reached the end of the code snippet. This way you could handle cases where lines of code are separated by plain text. – James P. Jun 28 '11 at 14:24
  • @TomWIJ: You can keep track of the frequency of each of these features, and consider some tests a potential false positive if they only happen once. A SpamAssassin-like approach might work. – Ken Bloom Jun 28 '11 at 15:46
  • None of these heuristics will identify LISP code, but I think we should be editing this list to add other features to the list. – Ken Bloom Jun 28 '11 at 16:13
  • @Ken Bloom that is why there is a requirement to only parse the "Big Ten" by tags, LISP is on page 16 of tags – Scott Chamberlain Jun 28 '11 at 17:36
  • Just for the sake of completeness: in addition to camelCase, you should check for underscored_lowercase_names. – Ziv Jun 28 '11 at 18:05
  • 4
    You won't detect my SELECT DISTINCT name FROM people WHERE id IS NOT NULL. – Benoit Jun 29 '11 at 04:57
  • 2
    for what it's worth, you can generally refine your algorithm on a language-by-language basis -- detecting bash code may be very different than detecting erlang code, but the author will almost always tell you what language he's using by his choice of tags. – tylerl Jun 29 '11 at 05:50
  • 1
    I think you should add another item for Line length. Code tends to have short (<80 chars) lines separated by line breaks, which is a departure from most written language. – Jiaaro Jul 08 '11 at 17:53
  • Another pattern: () – Loren Pechtel Oct 03 '14 at 21:39
  • Nested parenthesis – jkd Jun 08 '15 at 14:48
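
Several comments above propose extra signals beyond the answer's list: $-prefixed names, underscored names, multi-hyphen Lisp-style names, SQL keywords, and short lines. As a hedged sketch of how those might be counted, here is some Python. The pattern names, the keyword list, and the 80-character limit are assumptions chosen for illustration, and each pattern has known false positives (for example, "state-of-the-art" matches the hyphen rule).

```python
import re

# Hypothetical patterns for the signals suggested in the comments.
EXTRA_FEATURES = {
    # $var style names (Perl, PHP); "$5" is not matched.
    "dollar_var": re.compile(r"\$[A-Za-z_]\w*"),
    # underscored_lowercase_names
    "snake_case": re.compile(r"\b[a-z0-9]+(?:_[a-z0-9]+)+\b"),
    # words-with-more-than-one-hyphen, as seen in Lisp-like code
    "multi_hyphen": re.compile(r"\b\w+(?:-\w+){2,}\b"),
    # Upper-case SQL keywords only, to avoid matching ordinary prose.
    "sql_keywords": re.compile(r"\b(?:SELECT|FROM|WHERE|INSERT|UPDATE|DELETE)\b"),
}


def count_extra_features(text: str) -> dict:
    """Count matches for each comment-suggested pattern."""
    return {name: len(p.findall(text)) for name, p in EXTRA_FEATURES.items()}


def short_line_ratio(text: str, limit: int = 80) -> float:
    """Fraction of non-blank lines shorter than `limit` characters."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    return sum(len(ln) < limit for ln in lines) / len(lines)
```

These counts could be fed into the same kind of weighted score as the other features, and the tag-based suggestion could be used to switch individual patterns on or off per language.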