Simple method for reliably detecting code in text?

Question

GMail has this feature where it will warn you if you try to send an email that it thinks might have an attachment.

Did you mean to attach files?

Because GMail detected the string "see the attached" in the email, but no actual attachment, it warns me with an OK / Cancel dialog when I click the Send button.

We have a related problem on Stack Overflow. That is, when a user enters a post like this one:

my problem is I need to change the database but I don't won't to create 
a new connection. example:

DataSet dsMasterInfo = new DataSet();
Database db = DatabaseFactory.CreateDatabase("ConnectionString");
DbCommand dbCommand = db.GetStoredProcCommand("uspGetMasterName");

This user did not format their code as code!

That is, they didn't indent by 4 spaces per Markdown, or use the code button (or the keyboard shortcut ctrl+k) which does that for them.

Thus, our system is accepting a lot of edits where people have to go in and manually format code for people that are somehow unable to figure this out. This leads to a lot of bellyaching. We've improved the editor help several times, but short of driving over to the user's house and pressing the correct buttons on their keyboard for them, we're at a loss to see what to do next.

That's why we are considering a Google GMail style warning:

Did you mean to post code?

You wrote stuff that we think looks like code, but you didn't format it as code by indenting 4 spaces, using the toolbar code button or the ctrl+k code formatting command.

However, presenting this warning requires us to detect the presence of what we think is unformatted code in a question. What is a simple, semi-reliable way of doing this?

- Per Markdown, code is always indented by 4 spaces or within backticks, so anything correctly formatted can be discarded from the check immediately.
- This is only a warning, and it will only apply to low-reputation users asking their first questions (or providing their first answers), so some false positives are OK, so long as they are about 5% or less.
- Questions on Stack Overflow can be in any language, though we can realistically limit our check to, say, the "big ten" languages. Per the tags page that would be C#, Java, PHP, JavaScript, Objective-C, C, C++, Python, Ruby.
- Use the Stack Overflow creative commons data dump to audit your potential solution (or just pick a few questions in the top 10 tags on Stack Overflow) and see how it does.
- Pseudocode is fine, but we use C# if you want to be extra friendly.
- The simpler the better (so long as it works). KISS! If your solution requires us to attempt to compile posts in 10 different compilers, or an army of people to manually train a Bayesian inference engine, that's ... not exactly what we had in mind.

I would simply read each line and detect whether it contains reserved words or other common words from common languages. If that happens many times, the weight increases. You decide what weight will trigger the message. There are many tools that can analyze source code and provide you with a list of the most used words. — , Jun 28 '11 at 08:17
I think if you just always display the warning if there is no indentation present, you will be way below the 5% error limit. This is only half meant as a joke. — Konrad Rudolph, Jun 28 '11 at 08:47
@Konrad This would work even better if the message would be: 'Either your question is missing code samples that would help others to understand it or you forgot to indent them properly'. This should cover 99% of all cases. — thorsten müller, Jun 28 '11 at 09:09
Did you data mine the database to find all edits that were "just add block syntax", that is, 4 spaces before lines, and try to derive the rule from them? — Display Name, Jun 28 '11 at 09:37
This is a GOOD question but I feel it doesn't have an answer. You show me an idiot-proof system and I will show you a better idiot. Even if this problem could be addressed by CODE, perhaps it shouldn't? It is these ignorant people who can't be bothered to ask a PROPER QUESTION that are RUINING this site for people like me who ask proper questions AND contribute proper answers IMHO. — maple_shaft, Jun 28 '11 at 11:26
A common pattern I've seen is a block of code that was properly indented in itself, but where the first and last lines (usually only those two, sometimes more when showing multiple functions, for example) aren't labeled as code. This probably should be detected too. — 3Doubloons, Jun 28 '11 at 15:09
Surely, if they're posting to StackOverflow, 99% of the time they should be posting code. I've lost count of the times I've posted a comment asking for the actual code before I can answer the question. Seems like you should show that message whenever you don't find a four-character-indented block. — Daniel Roseman, Jun 28 '11 at 15:34
@maple_shaft: so, you're claiming you know all markdown syntax the first day you used them? — Lie Ryan, Jun 28 '11 at 15:37
@Lie, Of course not, but before I post on any site I figure out how to markup code. If I can't figure this out then I state that I couldn't figure out how to do this and hopefully a moderator will come around and help me and then it won't happen again. Any other excuse by a poster is a lack of thought, lack of respect, or complete laziness. — maple_shaft, Jun 28 '11 at 15:54
On a side note, the GMail confirmation text is rather confusing. If your answer on the first question is 'yes' then the answer on the second question is 'no'... — pimvdb, Jun 28 '11 at 15:55
I suggest you change the text in the code button from '{}' to the word 'code'. It would require widening it quite a bit, but it would be much clearer for new users. I actually went through a stage on SO in which I knew I should format code, but couldn't figure out how. — Emilio M Bumachar, Jun 28 '11 at 20:09
You mention that you don't want to spend ages training an inference engine, but surely you already have the training material: correctly marked-up code from the SO database dump. — Scott, Jun 29 '11 at 12:21
You should probably also take a look at which languages are most commonly formatted incorrectly -- I bring this up because of the lisp comments. I bet lisp question askers are far more sophisticated than the javascript ones (in the aggregate, not a slight against javascript) — Jiaaro, Jul 08 '11 at 17:55
Funny, I was expecting something entirely different after "short of driving over to the user's house and". It involved baseball bats. — Benjol, Jul 15 '11 at 05:41
GitHub just published their library they use to detect programming languages. Maybe that might be of help: https://github.com/blog/881-linguist Unfortunately, it seems to be in Ruby, no idea how well that integrates with .NET. — Henrik Paul, Jun 28 '11 at 09:00
The SO highlighter automatically does this already. The problem is that English is not among the languages that Github's Linguist library can detect. — Kevin Vermeer, Jun 29 '11 at 03:43
isn't this just a syntax highlighter? Not really what was asked in the question. And it can't distinguish between English and code. — Jeff Atwood, Jun 29 '11 at 11:49
A Bloom Filter may work for this. You can train it with all the existing code blocks. — gnibbler, Jun 28 '11 at 09:59
How about reminding the new user to apply tags to their questions and then narrowing down the possible languages to detect based on the tags? — Kal, Jun 28 '11 at 08:36
The research in this paper is pertinent: http://www.cs.mcgill.ca/~martin/papers/icse2013.pdf — james.garriss, Oct 03 '14 at 12:39

score 147 · Accepted Answer · edited Jun 28 '11 at 16:20


A proper solution would probably be some learned/statistical model, but here are some fun ideas:

1. Semi-colons at the end of a line. This alone would catch a whole bunch of languages.
2. Parentheses directly following text with no space to separate it: myFunc()
3. A dot or arrow between two words: foo.bar = ptr->val
4. Presence of curly braces, brackets: while (true) { bar[i]; }
5. Presence of "comment" syntax (/*, //, etc): /* multi-line comment */
6. Uncommon characters/operators: +, *, &, &&, |, ||, <, >, ==, !=, >=, <=, >>, <<, ::, __
7. Run your syntax highlighter on the text. If it ends up highlighting some high percentage of it, it's probably code.
8. camelCase text in the post.
9. Nested parentheses, braces, and/or brackets.

One could keep track of the number of times each of these appears, and these could be used as features in a machine-learning algorithm like perceptron, the way SpamAssassin does.
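As a rough illustration of the counting idea, here is a minimal Python sketch. The feature names, the regular expressions and the `min_features` threshold are all assumptions made up for this example; they are not tuned against real posts and cover only some of the items in the list.

```python
import re

# Hypothetical feature set; each pattern mirrors one idea from the list above.
FEATURES = {
    "semicolon_eol": re.compile(r";\s*$", re.MULTILINE),
    "call_parens": re.compile(r"\w\("),
    "dot_or_arrow": re.compile(r"\w(?:\.|->)\w"),
    "braces_brackets": re.compile(r"[{}\[\]]"),
    # The lookbehind skips the "//" in URLs such as http://...
    "comment_syntax": re.compile(r"/\*|\*/|(?<!:)//"),
    "operators": re.compile(r"&&|\|\||==|!=|>=|<=|>>|<<|::"),
    "camel_case": re.compile(r"\b[a-z]+[A-Z]\w*"),
}


def count_features(text):
    """Return how many times each heuristic fires in the text."""
    return {name: len(rx.findall(text)) for name, rx in FEATURES.items()}


def looks_like_code(text, min_features=3):
    """Flag the text when at least `min_features` distinct heuristics fire."""
    counts = count_features(text)
    return sum(1 for n in counts.values() if n > 0) >= min_features
```

Requiring several distinct features, rather than many hits of one, is one way to keep a stray semicolon or URL from triggering the warning. The raw counts could just as well be fed to a learned model as features.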

edited Jun 28 '11 at 16:20

Ken Bloom


answered Jun 28 '11 at 08:31

Yevgeniy Brikman


I thought of the syntax highlighter too. This would work even better if it were limited to consecutive keywords: for example, more than 5 keywords in a row, with almost all other words followed by () (= functions) or = (= variable assignments), or coming after type identifiers (declarations). – thorsten müller Jun 28 '11 at 09:13

Tips: 3 should have a very low weight, because a dot between words can be the result of a typo. 5 should not match URLs. For 6, the ampersand is also frequently used outside the code context, so you might weight that character less too. Double check that the highlighter works, because it can highlight non-code text, as I sometimes see in Notepad++. – Tamara Wijsman Jun 28 '11 at 10:27

re the . as a typo - there would be no harm in flagging that as the author ought to edit anyway. – mmmmmm Jun 28 '11 at 10:29

You could also try to detect camel- or pascal- case words, since those do not appear in regular language, except in typos. Also words such as a_variable_name for instance. – Nobody Jun 28 '11 at 10:44

additionally, specific keywords that many languages have could help: WHILE, ELSE, IF, LOOP, BREAK, etc. – JoséNunoFerreira Jun 28 '11 at 10:45

and now i remembered this: you could add up the number of "rules" it triggers, to get it more accurate. basically, if you spot "()", a "WHILE" and a "!=", that's probably not three typos, it's code. – JoséNunoFerreira Jun 28 '11 at 10:48

Add "Usage of $ before non numeric words: $var is common in Perl and PHP (and Ruby?)." – PhiLho Jun 28 '11 at 11:37

Also think about catching multiple-words-with-more-than-one-hyphen; that would help identify Lispy languages that might not otherwise be caught by these rules. – JasonFruit Jun 28 '11 at 11:43
1. Indeed, even if some languages are proud to allow omitting them (JavaScript, Lua, Scala...) or use a different symbol (or none at all, like Python). 2. Some people like to write func (x) (or func( x ) and other variants). But sure, the majority omits the space. 3. As pointed out, the dot doesn't work well in some cases (URLs, IP addresses, typos). 5. Perhaps more reliable if detected at the start of a (typed) line. Also lines starting with # or two dashes. 6. + and & aren't so uncommon as abbreviations, I think. -- Overall, a good set of suggestions, I just wanted to help to refine them. – PhiLho Jun 28 '11 at 11:45

Also, you could implement HINTS in the textbox's margin whenever code is detected. Many code editors (Visual Studio w/ ReSharper, DreamWeaver, etc) do something similar when they find errors/warnings/suggestions in code. – David Murdoch Jun 28 '11 at 12:35

I'd start off by looking at the tags to see which languages the question is about, to limit what processing is needed. Then use clues like this to form an approximation of where code is situated. Say, if a line has a clue that says it's a piece of code, then backtrack to attempt to find the first line of code. Then step after line 3 like a debugger and repeat the process, this time looking forwards until there's some indication that we've reached the end of the code snippet. This way you could handle cases where lines of code are separated by plain text. – James P. Jun 28 '11 at 14:24

@TomWIJ: You can keep track of the frequency of each of these features, and consider some tests a potential false positive if they only happen once. A SpamAssassin-like approach might work. – Ken Bloom Jun 28 '11 at 15:46

None of these heuristics will identify LISP code, but I think we should be editing this list to add other features to the list. – Ken Bloom Jun 28 '11 at 16:13

@Ken Bloom that is why there is a requirement to only parse the "Big Ten" by tags, LISP is on page 16 of tags – Scott Chamberlain Jun 28 '11 at 17:36

Just for the sake of completeness: in addition to camelCase, you should check for underscored_lowercase_names. – Ziv Jun 28 '11 at 18:05


You won't detect my SELECT DISTINCT name FROM people WHERE id IS NOT NULL. – Benoit Jun 29 '11 at 04:57


for what it's worth, you can generally refine your algorithm on a language-by-language basis -- detecting bash code may be very different than detecting erlang code, but the author will almost always tell you what language he's using by his choice of tags. – tylerl Jun 29 '11 at 05:50


I think you should add another item for Line length. Code tends to have short (<80 chars) lines separated by line breaks, which is a departure from most written language. – Jiaaro Jul 08 '11 at 17:53

Another pattern: () – Loren Pechtel Oct 03 '14 at 21:39

Nested parenthesis – jkd Jun 08 '15 at 14:48

score 54 · Answer 2 · edited May 23 '17 at 12:40

I would be curious to see what are the average metrics of written English on one side, and code on the other side.

- length of paragraphs
- length of lines
- size of words
- chars used
- ratio between alphabetic, numeric and other symbol characters
- number of symbols per word
- etc.

Maybe that alone could discriminate already between code and the rest. At least I believe code, regardless of language, would show some noticeably different metrics in many cases.

The good news is: you already have plenty of data to build your statistics upon.

Ok I'm back with some data to back my assumptions up. :-)

I did a quick and dirty test on your own post and on the first post I found on StackOverflow, with a pretty advanced tool: wc.

Here is what I had after running wc on the text part and on the code part of those two examples:

First, let's look at the English part:

The English part of your post (2635 chars, 468 words, 32 lines)
- 5 chars/word, 82 chars/line, 14 words/line
The English part of the other post (1499 chars, 237 words, 12 lines)
- 6 chars/word, 124 chars/line, 19 words/line

Pretty similar, don't you think?

Now let's take a look at the code part!

The code part of your post (174 chars, 13 words, 3 lines)
- 13 chars/word, 58 chars/line, 4 words/line
The code part of the other post (4181 chars, 287 words, 151 lines)
- 14 chars/word, 27 chars/line, 2 words/line

See how those metrics are not so different from each other but, more importantly, how different they are from the English metrics? And this is just using a limited tool. I am now sure you can get something really accurate by measuring more metrics (I'm thinking in particular of character statistics).
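For anyone who wants to reproduce this kind of measurement without wc, here is a small Python sketch. The function name and the exact counting rules (blank lines skipped, newlines not counted as characters) are assumptions for this example, so the numbers will differ slightly from what wc reports.

```python
def text_metrics(text):
    """Return average chars/word, chars/line and words/line for a block of text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]  # ignore blank lines
    words = text.split()
    chars = sum(len(ln) for ln in lines)  # newlines are not counted
    return {
        "chars_per_word": chars / max(len(words), 1),
        "chars_per_line": chars / max(len(lines), 1),
        "words_per_line": len(words) / max(len(lines), 1),
    }
```

Running it separately on the prose and on the suspected code of a post gives the two sets of figures to compare; where to put the cut-off is something you would have to settle by looking at real data.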

I can haz cookie?

Line length, particularly if you exclude bullet points and look for clustered lines of less than a particular length containing specific punctuation would seem to be a good measure. — Jon Hopkins, Jun 28 '11 at 09:30
This would work for blocks of code, but it'd seem a whole lot harder to look for inline code. Not sure how much that matters, though -- the bigger problem is big blocks of unformatted code anyway. — cHao, Jul 02 '11 at 17:42
@james.garriss: the Internet stole my cookie jar. :( Thank you for the notice though. — Julien Guertault, Oct 08 '14 at 14:27

Matthew Rodatus · Answer 3 · 2011-06-28T11:28:14.977

Typically, Markov chains are used to generate text, but they can also be used to predict the similarity of text (per C.E. Shannon 1950) to a trained model. I recommend multiple Markov chains.

For each prevalent language, train a Markov chain on a large, representative sample of code in the language. Then, for a Stack Overflow post for which you want to detect code, do the following for each of the chains:

Loop through the lines in the post.
- Declare two variables: ACTUAL=1.0 and HIGHEST=1.0
- Loop through each character in the line.
  - For each character, find the probability in the Markov chain that the current character is the one following the previous N characters. Set ACTUAL = ACTUAL * PROB₁. If the current character is not present in the chain, then use a tiny value for PROB₁, like 0.000001.
  - Now, find the character most likely (i.e. the highest probability) to follow the previous N characters. Set HIGHEST = HIGHEST * PROB₂.
  - Obviously, PROB₂ >= PROB₁

For each line, you should have an ACTUAL and a HIGHEST value. Divide ACTUAL by HIGHEST. That will give you the fitness score as to whether a particular line is source code. That would associate a number with each of the lines in the example you gave:

my problem is I need to change the database but I don't won't to create // 0.0032
a new connection. example: // 0.0023

DataSet dsMasterInfo = new DataSet(); // 0.04
Database db = DatabaseFactory.CreateDatabase("ConnectionString");   // 0.05
DbCommand dbCommand = db.GetStoredProcCommand("uspGetMasterName");  // 0.04

Finally, you'll need to select a threshold to determine when there is code in the post. This could simply be a number selected by observation that yields high performance. It could also take into account the number of lines with a high score.
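The per-line scoring described above could be sketched in Python roughly as follows. The model layout (a dict from an N-character context to a dict of follower probabilities), the names and the handling of contexts that were never seen are assumptions for this example. It also works in log space rather than multiplying raw probabilities, so that long lines do not underflow to zero.

```python
import math

UNSEEN = 1e-6  # tiny probability for a transition the chain has never seen


def line_fitness(line, model, n=2):
    """Return ACTUAL / HIGHEST for one line, computed in log space.

    `model` maps an n-character context to {next_char: probability}.
    Lines shorter than n + 1 characters score 1.0, so callers may want
    to skip them.
    """
    log_actual = 0.0
    log_highest = 0.0
    for i in range(n, len(line)):
        context, char = line[i - n:i], line[i]
        followers = model.get(context, {})
        # PROB1: probability of the character that is actually there.
        log_actual += math.log(followers.get(char, UNSEEN))
        # PROB2: probability of the most likely follower. Assumption: for a
        # context missing from the chain, treat it as 1.0 so the line is penalized.
        log_highest += math.log(max(followers.values(), default=1.0))
    return math.exp(log_actual - log_highest)
```

With a toy model such as `{"()": {";": 0.5, ".": 0.2, "{": 0.3}}`, the line `();` scores 1.0 because it always takes the most likely transition, while `().` scores 0.4.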

Training

To train, procure a large, representative sample of code in the language. Write a program to loop over the code text and associate each N-gram in the file (the range for N should be parameterized) with the statistical frequency of the subsequent character. For each N-gram, this yields the possible following characters, each associated with a probability. For example, the bigram "()" could have some following character probabilities of:

"()" 0.5-> ";"
"()" 0.2-> "."
"()" 0.3-> "{"

The first should be read, for example as "The probability that a semicolon follows an empty parenthetical is 0.5."
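A table like the one above can be built with a few lines of Python. This is only a sketch: the function name and the dict-of-dicts layout are assumptions for this example, and a real run would use a large code sample rather than a one-line string.

```python
from collections import Counter, defaultdict


def train_chain(text, n=2):
    """Map each n-gram in `text` to the probabilities of the character following it."""
    counts = defaultdict(Counter)
    for i in range(n, len(text)):
        counts[text[i - n:i]][text[i]] += 1
    return {
        context: {char: c / sum(followers.values()) for char, c in followers.items()}
        for context, followers in counts.items()
    }
```

For instance, `train_chain("f();g();h().x")["()"]` gives a probability of 2/3 for `;` and 1/3 for `.`, since the bigram `()` is followed twice by a semicolon and once by a dot in that sample.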

For training, I recommend N-grams of size two through five. Back when I did some research on this, we found that N-grams of size two through five worked well for English. Since much of source code is English-like, I'd suggest starting with that range and then adjusting to find the optimal parameter values as you find what works.

A caveat: the model is going to be affected by identifiers, method names, whitespace, etc. However, you can tune the training to omit certain features of the training sample. For example, you could collapse all unnecessary whitespace. The presence of whitespace in the input (the Stack Overflow post) can be ignored as well. You could also ignore alphabetical case, which would be more resilient in the face of varying identifier naming conventions.

During my research, we found that our methods worked well for Spanish as well as English. I don't see why this wouldn't also work well for source code. Source code is even more structured and predictable than human language.

The only problem I foresee is that the probabilities will be vastly smaller than in your toy example. Given numerical instability, this means that soon all probabilities are 0. Using log odds solves this though. Furthermore, I’d use larger tokens (i.e. not characters but words/punctuation). — Konrad Rudolph, Jun 28 '11 at 15:20
@Konrad: the idea here isn't to test absolute probabilities: it's to test relative probabilities. For each line, is the text of that line more likely to have been generated by an English language model, or by a code language model. — Ken Bloom, Jun 28 '11 at 15:48
You can train this model on existing SO posts (particularly because you may need to account for Markdown syntax). If you assume that most posts are formatted correctly (or you pick through a large number of posts, on the order of tens of thousands, to remove posts that are not formatted correctly), then you assume that stuff that's not code formatted is English text, and stuff that is code formatted is code, you can train from actual SO answers. — Ken Bloom, Jun 28 '11 at 15:51
It might be sufficient to train a single Markov model from the "golden" SO answer code blocks, rather than a model for each language. — Matthew Rodatus, Jun 28 '11 at 16:04
A tutorial about how to do this (using LingPipe in Java) is available from LingPipe's website. At the end of the tutorial, there are a number of papers on techniques to tackle this problem. I suggest reading them. — Ken Bloom, Jun 28 '11 at 16:04
@Ken I know, but that doesn’t change my caveat. Especially when using single character tokens, you will get ridiculously small probabilities/odds at the end of each line. — Konrad Rudolph, Jun 28 '11 at 16:15
It’s interesting to see that the state of the art solution has only a very low vote count, and rates vastly less than all those ad-hoc solutions which, admittedly, might just be good enough but rely a lot on special-casing and are inherently prone to overfitting. — Konrad Rudolph, Jul 03 '11 at 10:56
@Konrad Jeff did specify "simple, semi-reliable" and "The simpler the better (so long as it works)." But, this isn't hard, and I do think it would perform better without much more effort. @Ken pointed out that they could train it on existing SO posts, which is a great idea. — Matthew Rodatus, Jul 05 '11 at 11:22
@Matthew My point exactly. Nothing in your answer goes beyond simple statistics and CS 102 algorithms, and what’s more, most (all?) of this probably already exists in library form. — Konrad Rudolph, Jul 05 '11 at 11:39

mac · Answer 4 · 2011-06-28T19:09:29.527

13

May I suggest a radically different approach? On SO the only human language allowed is English, so anything that is non-English has a 99.9% chance of being a code snippet.

So my solution would be: use one of the many English language checkers out there (just make sure it also signals, besides misspellings, syntax mistakes like double dots, or non-language symbols like # or ~). Then any line/paragraph that throws a large number of errors and warnings should trigger the "is this code?" question.
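To make the idea concrete, here is a toy Python sketch. The tiny `ENGLISH_WORDS` set is a hypothetical stand-in for a real dictionary or spell checker, and the punctuation stripping is deliberately crude; neither is meant as a recommendation for a particular tool.

```python
# Hypothetical stand-in for a real English word list or spell checker.
ENGLISH_WORDS = {
    "my", "problem", "is", "i", "need", "to", "change", "the", "database",
    "but", "a", "new", "connection", "example", "create", "don't", "want",
}


def non_english_ratio(line, dictionary=ENGLISH_WORDS):
    """Return the share of whitespace-separated tokens that are not dictionary words."""
    tokens = line.split()
    if not tokens:
        return 0.0

    def is_word(token):
        # Strip surrounding punctuation before the lookup.
        return token.strip(".,:;!?\"'()").lower() in dictionary

    return sum(1 for t in tokens if not is_word(t)) / len(tokens)
```

A line whose ratio is close to 1.0 would be a candidate for the "is this code?" question; the exact cut-off is left open here.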

This approach can also be adapted for those StackExchange sites using other languages than English, of course.

Just my 2¢...

edited Jun 28 '11 at 19:09

answered Jun 28 '11 at 13:02

mac


The problem is that a lot of the incoming questions aren't English either (although they resemble it). – Brendan Long Jun 28 '11 at 19:07

@Brendan - Added advantage of this proposal then: underline (or highlight) the mistakes in the probably-intended-to-be-English parts of the post and help the writer to write... in English! ;) – mac Jun 28 '11 at 19:12

I'm Dutch and everything I code is in English, but comments are not (depending on the project). So "non-English must be code" would not suffice. That, or you mean that broken English must be code. – Ivo Limmen Jun 28 '11 at 20:14
@Ivo - My comment was jokingly addressed to the broken English issue! ;) However I would say that with my proposal comments in another language would just work fine... OTOH block comments in English won't trigger the "is this code?" question, but that's just fine because the code for which the comment has been written would already have triggered it... – mac Jun 28 '11 at 20:22

Ivo Limmen · Answer 5 · 2012-02-08T05:53:32.823

11

Pseudocode would pose a real challenge, because all programming languages depend on special characters like '[]', ';', '()', etc. Simply count the occurrences of these special characters, just like you would detect a binary file (more than 5% of a sample contains byte value 0).
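A minimal Python sketch of that counting approach might look like this. The character set and the 5% threshold (borrowed from the binary-file analogy) are assumptions for this example and have not been checked against real posts.

```python
# Hypothetical set of "code-like" characters.
SPECIAL_CHARS = set("[]{}();=<>")


def special_char_ratio(text):
    """Return the fraction of non-whitespace characters that are code-like symbols."""
    visible = [c for c in text if not c.isspace()]
    if not visible:
        return 0.0
    return sum(1 for c in visible if c in SPECIAL_CHARS) / len(visible)


def probably_code(text, threshold=0.05):
    """Flag text whose special-character ratio exceeds the threshold."""
    return special_char_ratio(text) > threshold
```

For the line `DataSet dsMasterInfo = new DataSet();` the ratio is 4 out of 33 visible characters, which is well above the threshold, whereas plain prose without brackets or semicolons stays at zero.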

edited Feb 08 '12 at 05:53

answered Jun 28 '11 at 08:17

Ivo Limmen


I would improve this as much as having groups of these special chars like [] () ; {} =. Each line that has more than 2-3 of these groups contained is a line of code. – Honza Jun 28 '11 at 08:19
...and also look for common strings in the most common languages, e.g. " = someword();" for most curly bracket languages, XML-like syntax like "" and "ab:cde", and other common strings in other languages. I do believe some sort of lookup table of common syntax would be a good solution, as you can expand it when you find new languages to implement. – Arve Systad Jun 28 '11 at 08:27
You should probably drop pseudo code. Some people like to write it as a C-style language but other people will use plain english with something that looks closer to VB6 – James P. Jun 28 '11 at 14:25

score 11 · Answer 6 · answered Jun 28 '11 at 11:37


I'm probably going to get a few down votes for this but I think you are approaching this from the wrong angle.

This line got me:

people have to go in and manually format code for people that are somehow unable to figure this out

IMO that standpoint is kind of arrogant. I find this a lot in software design where programmers and designers get annoyed with users who can't figure out how to use the software properly, when the problem isn't the user but the software itself - or the UI at least.

The root cause of this problem isn't the user but the fact that it isn't obvious to them that they can do this.

How about a change in UI to make this more obvious? Surely this will be:

more obvious to new users exactly what they need to do
easier for you to build rather than writing complex algorithms to detect code logic of a multitude of languages

Example:

enter image description here

answered Jun 28 '11 at 11:37

matt_asbury


Actually this IMO enforces poor questions like "I have a problem please help me, the code is below" - quite rarely code needs to be separated from the question. Best questions go like this "I want to achieve this and wrote these two lines of code, but the effect is the following, what's the problem" - there's very little code heavily interleaved with plain language. – sharptooth Jun 28 '11 at 12:50
I agree that the UI might not be the best at showing the user how to post code, but I also agree with sharptooth that having a separate area for code is not the way to go. Some other UI change might be good, however. – NickAldwin Jun 28 '11 at 14:10

Your root observation is correct but your diagnosis is still wrong: in fact, Jeff is trying to improve the user interface via this approach. Furthermore, the current UI has already gone through several cycles and while I don’t doubt that it could be improved (drastically), I doubt that this would help against lazy idiots. Neither would your proposed solution. @sharptooth has this covered. – Konrad Rudolph Jun 28 '11 at 15:24

I would +1 for thinking out the box but I disagree with the specific suggestion, since posting "supporting code" forces a question flow that may be unnatural. I have never just dumped in code at the bottom of my question. I almost always post an intro, the sample code, then the actual question. If you accept this premise that inline code is essential, then some type of formatting is required -- formatting which must be entered by the user or recommended by the system. And that's the exact thing that Jeff is asking about doing. – Nicole Jun 28 '11 at 16:23
While I disagree with the posted example (picture), this is an excellent suggestion. Never underestimate the power of carefully designed UI. You have 300000+ pixels at your disposal; the sky is the limit! – Roy Tinker Jun 28 '11 at 16:42
Thanks for your additions to my answer. The example I gave was merely just that, mocked up in all of 2 mins. It was just a way of illustrating that whilst we can go through these very technical resolutions for a variety of different problems, the easiest one may be staring us in the face. – matt_asbury Jun 28 '11 at 20:26

@Konrad: In addition to my above comment and in response to yours, I don't believe Jeff is improving the UI by taking this path, but merely treating the symptoms of an underlying problem. If the UI was improved so that the mistake couldn't be made, then the solution of alerting the user would not be necessary. I am under no illusion that my example is the final solution, but some thought needs to go into the question "are we presenting this in the best way possible?". – matt_asbury Jun 28 '11 at 20:29

The simple sentence please mark code using the {} button around the text box could be enough. – Paŭlo Ebermann Jun 28 '11 at 23:55

"The root cause of this problem isn't the user but the fact that it isn't obvious to them that they can do this." The root problem is that they don't care. – Jeff Atwood Jun 29 '11 at 11:54

jk. · Answer 7 · 2011-06-28T08:19:37.560

4

I think you may need to target this at specific languages only. In general this problem is likely intractable, as some languages are pretty similar to English (e.g. Inform 7). Luckily, the most used ones could be covered fairly easily.

My first cut would be to look for the sequence ";\n", which is really simple and would get you a good match for C, C++, Java, C# and any other language that uses similar syntax. It is also less likely to occur in English than a ; without a newline.

edited Jun 28 '11 at 08:19

answered Jun 28 '11 at 08:14

jk.


plus maybe an abundance of curly braces ;p – Marc Gravell Jun 28 '11 at 08:15

As Jeff says in his post, they would probably only target the main languages. And in any case, I suspect that new users (for whom this functionality is intended) will be more likely to post C# or Javascript than, say, INTERCAL ;-) – Ben Jun 28 '11 at 08:28
Yes but this would not work with the programming language BRAINFUCK or BLANK. ;-) – Ivo Limmen Sep 03 '12 at 18:01

score 4 · Answer 8 · answered Jun 28 '11 at 09:43


Someone mentioned looking at the tags and then looking for syntax for that but that was shot down because this is aimed at new users.

A possible better solution would be to look for language names in the body of the question, then apply the same strategy. If I mention "Javascript", "Java" or "C#" then chances are that is what the question is about, and code in the question is likely to be in that language.
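Looking for language names in the body could be sketched like this in Python. The `LANGUAGE_NAMES` list is a hypothetical example, and plain "C" is left out on purpose because a single letter would produce far too many false matches.

```python
import re

# Hypothetical name list; extend it with whatever languages matter to you.
LANGUAGE_NAMES = ["C#", "C++", "Objective-C", "JavaScript", "Java", "PHP", "Python", "Ruby"]


def mentioned_languages(body):
    """Return the language names that appear as whole tokens in the post body."""
    found = []
    for name in LANGUAGE_NAMES:
        # The lookarounds stop "Java" from matching inside "JavaScript".
        pattern = r"(?<![\w+#-])" + re.escape(name) + r"(?![\w+#-])"
        if re.search(pattern, body, re.IGNORECASE):
            found.append(name)
    return found
```

The returned names could then select which language-specific patterns to run on the rest of the post.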

answered Jun 28 '11 at 09:43

Omar Kooheji


Especially if the title is something like "vb c# .net dot net help me help me!!!" – NickAldwin Jun 28 '11 at 14:12

vartec · Answer 9 · 2011-06-28T08:26:04.743

1

First, run it through a spell checker: it will find very few proper English words, but there should be a lot of words that the spell checker will suggest splitting.

Then there are punctuation/special characters not typical for plain English, typical for code:

something(); just cannot be plain English;
$something where something is not all numeric;
-> between words w/o spaces;
. between words w/o space;

Of course, to have it working well, you might want to have a Bayesian classifier built on top of these characteristics.
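One way such a classifier could be put together is a small naive Bayes model over yes/no features, sketched here in Python. The patterns, the function names and the training format (a list of `(line, is_code)` pairs) are assumptions for this example; it also assumes the training data contains at least one line of each class.

```python
import math
import re

# Hypothetical binary features based on the characteristics listed above.
PATTERNS = {
    "call": re.compile(r"\w\(\);?"),
    "dollar_var": re.compile(r"\$[A-Za-z_]\w*"),
    "arrow": re.compile(r"\w->\w"),
    "dot": re.compile(r"\w\.\w"),
}


def features(line):
    """Return which of the patterns occur in the line."""
    return {name: bool(rx.search(line)) for name, rx in PATTERNS.items()}


def train(samples):
    """Estimate class priors and per-feature rates from (line, is_code) pairs."""
    model = {}
    for label in (True, False):
        rows = [features(text) for text, flag in samples if flag == label]
        model[label] = {
            "prior": len(rows) / len(samples),
            # Laplace smoothing keeps every rate strictly between 0 and 1.
            "rates": {
                name: (sum(r[name] for r in rows) + 1) / (len(rows) + 2)
                for name in PATTERNS
            },
        }
    return model


def classify(line, model):
    """Return True when the 'code' class has the higher log-posterior."""
    observed = features(line)
    scores = {}
    for label, params in model.items():
        score = math.log(params["prior"])
        for name, present in observed.items():
            rate = params["rates"][name]
            score += math.log(rate if present else 1 - rate)
        scores[label] = score
    return scores[True] > scores[False]
```

Correctly formatted posts could serve as labelled training lines, with indented blocks as code and everything else as text.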

edited Jun 28 '11 at 08:26

answered Jun 28 '11 at 08:19

vartec


Detecting a non indented line containing (); would be a good reason to suggest the message. – Jun 28 '11 at 08:22
What spell checker won't choke prior to the code being pasted? – Jun 28 '11 at 09:09
With some messages written by non-native English writers, the spellchecker will choke on every other word... – PhiLho Jun 28 '11 at 11:33
@Ph: these question/answers are not accepted on SO anyway. – vartec Jun 28 '11 at 11:40

score 1 · Answer 10 · answered Jun 28 '11 at 09:13

There are several sets of languages which share similar syntax. Most languages were influenced by a few others: the languages [AMPL, AWK, csh, C++, C--, C#, Objective-C, BitC, D, Go, Java, JavaScript, Limbo, LPC, Perl, PHP, Pike, Processing] were all influenced by C, so if you detect C you will probably detect all of these languages. So you only have to write a simple pattern for detecting each of these language sets.

I would also split the text into blocks, because most code will be separated from the other text blocks in the post by two newlines or similar.

This can easily be done with JavaScript (a super-simple, incomplete sample for the C family):

var txt = "my problem is I need to change the database but I don't won't to create a new connection. example:\n\nDataSet dsMasterInfo = new DataSet();Database db = DatabaseFactory.CreateDatabase(\"ConnectionString\");DbCommand dbCommand = db.GetStoredProcCommand(\"uspGetMasterName\");";
// Blocks of text are separated by a blank line.
var blocks = txt.split(/\n\n/);
console.dir(blocks);
var i = blocks.length;
// No "g" flag here: a global regex keeps lastIndex between test() calls and can skip matches.
var cReg = /if\s*\(.+?\)|.*(?:int|char|string|short|long).*?=.+|while\s*\(.+?\)/i;

while ( i-- ){
   var current = blocks[i];
   if ( cReg.test( current ) ){
      console.log("found code in block[" +  i + "]");
   }
}

score 0 · Answer 11 · answered Jun 28 '11 at 10:36

Simply count words / punctuation characters for each line. English will tend to have 4 or more, code less than 2.

The paragraph above has 18 words, and 4 punctuation characters, for example. This paragraph has 19 words and 4 punctuation, so within expectations.

Of course, this would need to be tested against questions from new users with poor English, and it may be that in those cases the statistics are skewed.

I expect that [non-whitespace].[whitespace or newline] is very rare in code, but common in English, so this could be counted as words, not punctuation.
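A minimal sketch of that ratio, assuming a "word" is a run of letters, digits or underscores and "punctuation" is any other non-whitespace character. The function names and the default threshold of 2 are illustrative, and sentence-ending periods are simply ignored here rather than counted as words.

```javascript
// Words per punctuation character for one line of text.
function wordsPerPunctuation(line) {
  // Drop a period that follows non-whitespace and precedes whitespace or end of line:
  // that pattern is common in English and rare in code.
  const stripped = line.replace(/(\S)\.(?=\s|$)/g, "$1");
  const words = (stripped.match(/[A-Za-z0-9_]+/g) || []).length;
  const punctuation = (stripped.match(/[^\sA-Za-z0-9_]/g) || []).length;
  // Avoid dividing by zero on lines with no punctuation at all.
  return words / Math.max(punctuation, 1);
}

// A non-empty line counts as code-like when the ratio falls below the threshold.
function looksLikeCodeLine(line, threshold = 2) {
  return line.trim() !== "" && wordsPerPunctuation(line) < threshold;
}

console.log(wordsPerPunctuation("DataSet dsMasterInfo = new DataSet();"));
```

The threshold would need tuning against real posts. Lines that mix English and inline code will land somewhere in between.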

I think the biggest problem will be inline code, where someone asks a question like:

If I say for (i=0; i>100; i++) {} what does that mean?

That's code and English, and should be marked up with back-ticks:

If I say `for (i=0; i>100; i++) {}` what does that mean?

score 0 · Answer 12 · answered Jun 28 '11 at 10:43

I think you should first make a distinction between (sufficiently) formatted code that only needs to be designated as such, and (too) poorly formatted code, which needs manual formatting anyway.

Formatted code has line breaks and indentation. That is: if a line is preceded by a single line break, you have a good candidate. If it also has leading whitespace, you have a very good candidate.

Normal text uses two line breaks, or two spaces and a line break, for formatting, so there's a clear criterion for distinction.

In LISP code you will not find semicolons, in Ruby code you may not find parentheses, and in pseudocode you might not find much at all. But in any (non-esoteric) language you will find decent code to be formatted with line breaks and indentation. There's nothing as universal as that, because in the end code is written to be read by humans.

So first, search for potential lines of code. Also, lines of code usually come in groups. If you have one, there's a good chance that the one above or below is a line of code as well.

Once you have singled out potential lines of code, you can check them against quantifiable criteria and pick some threshold:

frequency of non-word characters
frequency of identifiers: very short words or very long words with CamelCase or under_score style
repetition of uncommon words
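The two steps above (find candidate lines, then score them) could be sketched roughly like this. All function names are hypothetical, only the CamelCase/under_score part of the identifier criterion is covered, and the thresholds are left to whoever tunes it.

```javascript
// Split a post into blocks at blank lines and keep blocks whose lines are
// joined by single line breaks (the "good candidates" described above).
function candidateBlocks(text) {
  return text
    .split(/\n\s*\n/)
    .map(block => block.split("\n").filter(line => line.trim() !== ""))
    .filter(lines => lines.length >= 2)
    // Two trailing spaces before a line break is a Markdown hard break, i.e. prose.
    .filter(lines => !lines.slice(0, -1).some(line => / {2}$/.test(line)))
    .map(lines => ({
      lines,
      // Leading whitespace makes the block a "very good candidate".
      indented: lines.some(line => /^\s+\S/.test(line))
    }));
}

// CamelCase (lowercase or digit followed by uppercase) or under_score style.
function looksLikeIdentifier(token) {
  return /[a-z0-9][A-Z]/.test(token) || /[A-Za-z0-9]_[A-Za-z0-9]/.test(token);
}

// Fraction of word tokens that look like identifiers.
function identifierFrequency(lines) {
  const tokens = lines.join(" ").match(/[A-Za-z_][A-Za-z0-9_]*/g) || [];
  if (tokens.length === 0) return 0;
  return tokens.filter(looksLikeIdentifier).length / tokens.length;
}

// Fraction of non-whitespace characters that are not word characters.
function nonWordFrequency(lines) {
  const chars = lines.join("").replace(/\s/g, "");
  if (chars.length === 0) return 0;
  return (chars.match(/\W/g) || []).length / chars.length;
}
```

A block would then be flagged when it is a candidate and one or more of the frequencies exceeds a chosen threshold.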

Also, now that there are programmers and cs, Stack Overflow's scope is clearly narrowed down. One might consider marking all language tags as languages. Then, when posting, you'd be asked to either pick at least one language tag, pick the language-agnostic tag, or explicitly omit it.

In the first case you know which languages to look for, in the second case, you might want to look for pseudo-code and in the last case, there probably won't be any code, because it's a question related to some technology or framework or such.

score 0 · Answer 13 · answered Jun 28 '11 at 11:52

You could create a parser for each language you want to detect (language definitions for ANTLR are usually easy to find), then run each line of the question through each parser. If any line parses correctly, you probably have code.

The problem with this is that some English (natural language) sentences may parse as code, so you may want to include some of the other ideas as well, or you could report a positive result only when more than one or two consecutive lines parse correctly with the same language parser.

The other potential issue is that this will probably not pick up pseudocode, but that may be OK.
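To illustrate the "consecutive lines" rule without pulling in ANTLR, here is a sketch that swaps in the JavaScript engine itself as the only parser: `new Function(line)` compiles its argument without running it and throws a `SyntaxError` if the line is not valid JavaScript. The function names and the `minRun` default are assumptions. A real version would plug in one grammar per language.

```javascript
// Stand-in for a real grammar: does this single line parse as JavaScript?
function parsesAsJavaScript(line) {
  try {
    new Function(line); // compiled only, never called
    return true;
  } catch (e) {
    return false; // SyntaxError: not valid JavaScript
  }
}

// Length of the longest run of consecutive non-empty lines that parse.
function longestParsingRun(text, parses = parsesAsJavaScript) {
  let best = 0;
  let run = 0;
  for (const line of text.split("\n")) {
    if (line.trim() !== "" && parses(line)) {
      run += 1;
      best = Math.max(best, run);
    } else {
      run = 0;
    }
  }
  return best;
}

// Flag a post only when at least minRun consecutive lines parse.
function probablyContainsCode(text, minRun = 2) {
  return longestParsingRun(text) >= minRun;
}
```

A lone word such as `Thanks` parses as a valid expression statement, which is why a single parsing line is not enough. Checking line by line also rejects lines that only open a block, such as `if (x) {`, so a real implementation would probably try multi-line windows as well.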

Often people have syntax errors in their code (and are asking about this). — Paŭlo Ebermann, Jun 29 '11 at 00:02

Abbafei · Answer 14 · 2011-06-28T21:49:53.287

What may be the most future-proof approach, and the one requiring the least manual adjustment in the long run as other languages (which look somewhat different from the programming languages used most now) become more popular and the currently used languages become less popular, is to do something like what Google Translate does (see the paragraph titled "How does it work?"), instead of looking for specific things like a.b and a().

In other words, instead of manually thinking of patterns found in code to look for, the computer can figure it out by itself. This can be done by having

lots of code in many different programming languages
- Suggestion: automatically take code samples from web-based source code repositories like Google Code or GitHub, or even from things on Stack Overflow already marked as code
- Note: it may be a good idea to parse out code comments
lots of English text taken from articles on the web
- although not from articles about programming (otherwise they may have code in them and mix the system up :-) )

and having some kind of algorithm automatically find patterns in the code that are not in the English text, and vice versa, and then using those patterns to detect what is code and what is not by running the algorithm on posts.

(However, I am not sure how such an algorithm would work. Other answers to the current question may have useful information for that.)

Then the system can re-scan the code every once in a while to account for changes in the way code looks at that point in time.
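One candidate for such an algorithm is a naive Bayes classifier over tokens, sketched below. This is an assumption on my part rather than something the answer specifies, and the four training samples are toy placeholders standing in for the large corpora described above.

```javascript
// Words and individual punctuation marks both count as tokens.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9_]+|[^\sa-z0-9_]/g) || [];
}

// samples: array of { label, text }. Returns per-label token counts.
function train(samples) {
  const model = { labels: new Map(), vocab: new Set(), docs: 0 };
  for (const { label, text } of samples) {
    if (!model.labels.has(label)) {
      model.labels.set(label, { docs: 0, total: 0, counts: new Map() });
    }
    const stats = model.labels.get(label);
    stats.docs += 1;
    model.docs += 1;
    for (const token of tokenize(text)) {
      stats.counts.set(token, (stats.counts.get(token) || 0) + 1);
      stats.total += 1;
      model.vocab.add(token);
    }
  }
  return model;
}

// Picks the label with the highest log-probability (add-one smoothing).
function classify(model, text) {
  const tokens = tokenize(text);
  let best = null;
  let bestScore = -Infinity;
  for (const [label, stats] of model.labels) {
    let score = Math.log(stats.docs / model.docs);
    for (const token of tokens) {
      const count = stats.counts.get(token) || 0;
      score += Math.log((count + 1) / (stats.total + model.vocab.size));
    }
    if (score > bestScore) {
      bestScore = score;
      best = label;
    }
  }
  return best;
}

// Toy training data, for illustration only.
const model = train([
  { label: "code", text: "for (i = 0; i < n; i++) { sum += a[i]; }" },
  { label: "code", text: "var db = factory.create(); db.open();" },
  { label: "english", text: "I need to change the database but I do not want to create a new connection." },
  { label: "english", text: "Can anyone explain what this means, please?" }
]);

console.log(classify(model, "x = foo(); y = bar();"));
```

Re-scanning the corpora, as suggested above, would amount to calling `train` again on fresh samples.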
