
My question is very similar to this one:

How to tell if a string contains valid Python code

The only difference is that, instead of the entire program being given at once, I am interested in a single line of code at a time.

Formally, we say a line of Python is "syntactically valid" if there exists some syntactically valid Python program that uses that particular line.

For instance, I would like to identify these as syntactically valid lines:

for i in range(10):

x = 1

Because one can use these lines in some syntactically valid Python programs.

I would like to identify these lines as syntactically invalid lines:

for j in range(10 in range(10(

x =++-+ 1+-

Because no syntactically correct Python program could ever use these lines.

The check does not need to be too strict; it just needs to be good enough to filter out obviously bogus statements (like the ones shown above). The line is given as a string, of course.

Evan Pu
  • 2
    FYI, `x =+ 1` is syntactically valid. It assigns `+1` to `x`. – John Kugelman May 03 '16 at 19:39
  • 2
    What about implicit line concatenation (which would make `for j in range(10` also possibly syntactically valid)? – mgilson May 03 '16 at 19:39
  • 3
    `for j in range(10` is also valid if the next line continues with something like `):`, and `if x < 3` could be part of a multi-line expression as well. Almost anything could be part of a multi-line string, too. – user2357112 May 03 '16 at 19:40
  • 3
    I think the question that you need to answer is *why* you need/want to do this – OneCricketeer May 03 '16 at 19:41
  • I see. I'll make the incorrect statements more outrageous. @cricket_007 I'm training a neural network to generate statements – Evan Pu May 03 '16 at 19:43
  • 2
    The `for` is still syntactically valid. The assignment isn't quite valid any more unless, say, it's part of a triple-quoted string or a line-continued comment. I don't think you quite understand what you're trying to do. – user2357112 May 03 '16 at 19:47
  • The accepted answer in the SO question you linked should work, no? – RobertR May 03 '16 at 19:49
  • @RobertR it won't quite work because something like "for x in range(10):" should be valid, but on the linked question it returns False, because that particular statement alone does not parse to an AST as it's still missing some pieces – Evan Pu May 03 '16 at 19:50
  • @user2357112 Those syntactically invalid lines he provided, whether they truly are syntactically invalid or not, have nothing to do with the actual question. – RobertR May 03 '16 at 19:56
  • @RobertR: On the contrary, it's important to resolve this, because appropriate solutions to the problem will vary depending on whether the questioner *wants* to exclude these lines, or whether he only wants to exclude lines that are actually syntactically invalid under his definition, or whether he wants to take an entirely different approach to training his neural network, or whatever else the resolution ends up being. – user2357112 May 03 '16 at 20:01
  • 1
    @user2357112 just choose either interpretation, I don't really care. I don't care about completion (the way you put it) in particular – Evan Pu May 03 '16 at 20:08
  • 1
    A formal description of the full grammar is [available](https://docs.python.org/3/reference/grammar.html) on the official Python site. – Jongware May 03 '16 at 20:17
  • I think if your goal is to write an ML model, you should limit your input to basic, well-formed Python. Or use entire programs. – erip May 03 '16 at 20:20
  • The question could stand some improvement in precise wording, since even the example invalid lines which "correct python programs could never use" are valid if the lines before and after, for example, were """ (triple quotes). Given the nature of the question, I don't think this distinction is mere pedantry either. – Peter Hansen May 06 '16 at 21:00
  • @PeterHansen I would love to, can you word it for me? I tried to edit it but I don't quite know how to properly phrase it. In my world, triple quotes don't exist, and continued expressions such as x + y +\n z don't exist – Evan Pu May 11 '16 at 05:13
  • @EvanPu I'm not sure how to improve it without changing the whole approach. Few languages are really "line-based", but your model seems to be built on generating single line statements. I tried an approach like that years ago and decided, were I to continue with it, that I'd switch to generating parts of an AST directly. Alternatively, you might just clarify that your definition of "line" isn't strict and includes certain types of multi-line statement, such as triple-quoted strings or those where all but the last end with a trailing backslash. – Peter Hansen May 11 '16 at 12:54

2 Answers

16

This uses `codeop.compile_command` to attempt to compile the code. This is the same logic that the `code` module uses to decide whether to ask for another line or to fail immediately with a syntax error.

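import codeop

def is_valid_code(line):
    try:
        # Returns a code object for a complete statement,
        # or None if more input is expected.
        codeop.compile_command(line)
    except (SyntaxError, ValueError, OverflowError):
        # ValueError and OverflowError are raised for invalid literals.
        return False
    else:
        return True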

It can be used as follows:

>>> is_valid_code('for i in range(10):')
True
>>> is_valid_code('')
True
>>> is_valid_code('x = 1')
True
>>> is_valid_code('for j in range(10 in range(10(')
True
>>> is_valid_code('x = ++-+ 1+-')
False

I'm sure at this point you're saying, "What gives? `for j in range(10 in range(10(` was supposed to be invalid!" The problem with this line is that `10()` is technically syntactically valid, at least according to the Python interpreter. In the REPL, you get this:

>>> 10()
Traceback (most recent call last):
  File "<pyshell#22>", line 1, in <module>
    10()
TypeError: 'int' object is not callable

Notice how this is a `TypeError`, not a `SyntaxError`. `ast.parse` accepts it as well, and simply treats it as a call whose function is a numeric literal node (`ast.Num` at the time of writing; `ast.Constant` in Python 3.8 and later).
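As a rough illustration of the point above, the following sketch parses `10()` with `ast.parse` and inspects the result. The class of the literal node depends on the Python version, so the snippet prints its name instead of assuming one.

```python
import ast

# Parsing succeeds because "10()" is grammatically a call expression.
tree = ast.parse("10()")
call = tree.body[0].value

print(type(call).__name__)          # node class of the expression itself
print(type(call.func).__name__)     # node class of the literal being "called"
print(ast.literal_eval(call.func))  # the literal's value
```

Nothing is executed here, so no `TypeError` is raised; the error only appears when the call actually runs.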

These kinds of things can't easily be caught until the code actually runs. If some kind of monster managed to modify the cached `10` object (which would technically be possible), you might even be able to call `10()`. Either way, it's still allowed by the syntax.

What about the unbalanced parentheses? This is the same situation as `for i in range(10):`. The line is not a complete statement on its own, but it may be the first line of a multi-line expression. For example, see the following:

>>> is_valid_code('if x ==')
False
>>> is_valid_code('if (x ==')
True

The second line is True because the expression could continue like this:

if (x ==
    3):
    print('x is 3!')

and the expression would be complete. In fact, `codeop.compile_command` distinguishes between these situations: it returns a code object if the line is valid and self-contained, returns `None` if the line is expected to continue, and raises a `SyntaxError` if the line is invalid.
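A minimal sketch of how those three outcomes could be told apart. The function name `classify_line` and the three labels are made up for illustration; only `codeop.compile_command` comes from the standard library.

```python
import codeop

def classify_line(line):
    """Classify a line as 'complete', 'incomplete', or 'invalid' (hypothetical helper)."""
    try:
        code = codeop.compile_command(line)
    except (SyntaxError, ValueError, OverflowError):
        # compile_command raises for syntax errors and invalid literals.
        return "invalid"
    # None means the input could still be continued on a following line.
    return "incomplete" if code is None else "complete"

for sample in ["x = 1", "if (x ==", "if x =="]:
    print(repr(sample), "->", classify_line(sample))
```

If only self-contained lines should count as valid, the "incomplete" case can be treated as a failure instead.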

However, you can also get into a much more complicated problem than initially stated. For example, consider the line `)`. If it's at the start of the module, or the previous line is `{`, then it's invalid. However, if the previous line is `(1,2,`, it's completely valid.

The solution given here will work if you only work forward and append previous lines as context, which is what the `code` module does for an interactive session. Creating something that can always accurately identify whether a single line could possibly exist in a Python file, without considering the surrounding lines, is going to be extremely difficult, because the Python grammar interacts with newlines in non-trivial ways. The function in this answer only reports whether a given line could appear at the beginning of a module, possibly continuing onto the next line, without failing.
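The "work forward with context" idea might look roughly like this. The helper name `is_valid_with_context` is hypothetical, and the sketch assumes `previous_lines` holds only the unfinished part of the current statement, since `codeop.compile_command` compiles a single interactive statement by default.

```python
import codeop

def is_valid_with_context(previous_lines, line):
    """Check `line` after joining it to the unfinished lines before it (hypothetical helper)."""
    source = "\n".join(list(previous_lines) + [line])
    try:
        codeop.compile_command(source)
    except (SyntaxError, ValueError, OverflowError):
        return False
    return True

print(is_valid_with_context([], ")"))         # ")" at the start of a module
print(is_valid_with_context(["(1,2,"], ")"))  # ")" closing an open tuple
print(is_valid_with_context(["{"], ")"))      # ")" after a mismatched "{"
```

The three calls correspond to the `)` cases described above: the line is rejected on its own or after `{`, and accepted after `(1,2,`.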

It would be better to work out why you need to recognize single lines in the first place and solve that underlying problem another way, rather than trying to handle every possible case.

Alyssa Haroldsen
-1

I am just suggesting, and I'm not sure if it's going to work... But maybe something with exec and try-except?

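# Append a "pass" body so that block openers such as "for ...:" can compile.
code_line += "\n" + ("\t" if code_line.rstrip().endswith(":") else "") + "pass"
try:
    exec(code_line)  # WARNING: this actually runs the line
except SyntaxError:
    print("Oops! Wrong syntax...")
except Exception:
    # Any other error was raised at run time, so the line parsed fine.
    print("Syntax all right")
else:
    print("Syntax all right")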

Simple lines should produce the appropriate message.
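A possible variant, not part of the original answer: as the comments on this answer point out, the built-in `compile` checks syntax without running anything. The sketch below keeps the same "append `pass`" trick; the function name `compiles_with_pass` is made up for illustration.

```python
def compiles_with_pass(code_line):
    """Return True if the line compiles, adding a 'pass' body after a trailing ':'."""
    source = code_line.rstrip()
    if source.endswith(":"):
        # Give block openers such as "for ...:" a minimal body.
        source += "\n    pass"
    try:
        # compile() only parses and compiles; the code is never executed.
        compile(source + "\n", "<line>", "exec")
    except (SyntaxError, ValueError):
        return False
    return True

print(compiles_with_pass("while True:"))   # safe: the loop is never run
print(compiles_with_pass("x = ++-+ 1+-"))
```

Unlike `codeop.compile_command`, this rejects lines with unclosed brackets, because `compile` requires a complete program. A line whose trailing comment ends in a colon would be misjudged by the simple `endswith(":")` test.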

Neo
  • I was just about to suggest the `+= "pass"` approach. You might want to `.rstrip` the line though. Also, you don't need the new line and the indentation. – Jared Goguen May 03 '16 at 20:24
  • 3
    Executing lines is opening Pandora's box. Let's see if `while True:` is syntactically valid. How about `import os; os.system('rm -rf /')`? – John Kugelman May 03 '16 at 20:27
  • @JohnKugelman Right, but this is going to need to be sand-boxed anyways. If OP is randomly generating programs, some of them may not halt, and some of them affect the environment. – Jared Goguen May 03 '16 at 20:29
  • @JohnKugelman You're right... But I can't think of any way to do it without simply making python interpreter... Does someone know of a way to execute python code without really executing it? Stupid question I know but can help much with this – Neo May 03 '16 at 20:32
  • this is similar to what my labmate and I discussed just now. I will try this approach and report back in an hour. thanks! – Evan Pu May 03 '16 at 20:36
  • @Neo This is what [`compile`](https://docs.python.org/3/library/functions.html#compile) is designed to do. – Alyssa Haroldsen May 03 '16 at 21:12