7

This is a sister question to: Is it bad to use Unicode characters in variable names?

As is my wont, I'm working on a language project. The thought came to me that allowing multi-token identifiers might improve both readability and writability:

primary controller = new Data Interaction Controller();

# vs.

primary_controller = new DataInteractionController();

And whether or not you think that's a good idea*, it got me musing about how permissive a language ought to be about identifiers, and how much value there is in being so.
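
To make the lexing idea concrete, here is a rough, purely illustrative sketch (Python; the keyword set and the merge rule are assumptions for the example, not a spec): runs of adjacent words that aren't keywords get folded into a single identifier, and keywords/operators act as the delimiters.

import re

KEYWORDS = {"new", "if", "return", "do"}  # hypothetical keyword set

def lex(source):
    # Split into words and single symbols, then merge adjacent
    # non-keyword words into one multi-word identifier.
    raw = re.findall(r"[A-Za-z_]\w*|\S", source)
    tokens, words = [], []
    for tok in raw:
        if re.fullmatch(r"[A-Za-z_]\w*", tok) and tok not in KEYWORDS:
            words.append(tok)
        else:
            if words:
                tokens.append(" ".join(words))  # one identifier, single spaces
                words = []
            tokens.append(tok)
    if words:
        tokens.append(" ".join(words))
    return tokens

print(lex("primary controller = new Data Interaction Controller();"))
# ['primary controller', '=', 'new', 'Data Interaction Controller', '(', ')', ';']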

It's obvious that allowing characters outside the usual [0-9A-Za-z_] has some advantages in terms of writability, readability, and proximity to the domain, but also that it can create maintenance nightmares. There seems to be a consensus (or at least a trend) that English is the language of programming. Does a Chinese programmer really need to be writing 电子邮件地址 when email_address is the international preference?

I hate to be Anglocentric when it comes to Unicode, or a stickler when it comes to other identifier restrictions, but is it really worth it to allow crazy variable names?

tl;dr: Is the cost of laxity higher than the potential benefit?

Why or why not? What experiences and evidence can you share in favour of or opposed to relaxed restrictions? Where on the continuum do you think the ideal lies?

* My argument in favour of allowing multi-token identifiers is that it introduces saner points at which to break long lines of code, while still allowing names to be descriptive, and avoiding ExcessiveCamelCase and a_whole_lot_of_underscores, both of which are detrimental to readability.

Jon Purdy
  • 20,547
  • 1
    Dear close-voter: I've done my best to improve the question; in the future please explain your close-vote and offer some pointers for improvement. – Jon Purdy May 01 '11 at 16:28
  • One language that is very permissive is Agda. It allows anything except Unicode whitespace and a handful of reserved operators. – Mechanical snail Dec 01 '12 at 01:24
  • I wish programming languages would require that every use of an identifier match the declaration precisely, but also forbid the use of identifiers which differ only in case, accents, etc. Such a rule would IMHO combine the advantages of both case-sensitive and non-case-sensitive languages, but I don't know any language that works that way. – supercat Jan 28 '14 at 22:11

6 Answers

10

I once worked with USL, which allowed a space as part of a name. The combinatorial possibilities became a nightmare. Is "LAST LEFT TURN" one identifier? Two ("LAST LEFT" and "TURN")? Two ("LAST" and "LEFT TURN")? Or three? And is "RIGHT TURN" (one blank) the same as "RIGHT  TURN" (two blanks), even though a text editor won't match them? No, don't ever accept blanks in names.

For similar reasons, never accept special characters that already mean something in the language. Is "ALPHA-BETA" a variable name or a subtraction?

Normally identifiers must start with a letter. Are you going to extend that to other scripts in Unicode? How will you know what a letter is in Arabic?

I fear that you are opening a gigantic can of worms.
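
Just to make the blow-up concrete, here is a tiny counting sketch (Python, with a hypothetical symbol table; USL itself of course worked nothing like this) that enumerates the ways a run of words can be carved into known names:

def segmentations(words, known):
    # Count the ways a run of words can be split into known identifiers.
    if not words:
        return 1
    total = 0
    for i in range(1, len(words) + 1):
        if " ".join(words[:i]) in known:
            total += segmentations(words[i:], known)
    return total

known = {"LAST", "LEFT", "TURN", "LAST LEFT", "LEFT TURN", "LAST LEFT TURN"}
print(segmentations("LAST LEFT TURN".split(), known))  # 4 readings

Three words already admit four readings, and in the worst case the count doubles with every extra word.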

Jon Purdy
  • 20,547
  • 2
    1) I think we've been over this before: it's not combinatorial, it's exponential. 2) That problem doesn't come up in my project because two randomly adjacent identifiers don't mean anything; LAST LEFT TURN is always a single identifier. 3) Interpretation of otherwise-significant characters depends on your approach to lexing: many Lisps have very few identifier restrictions at the cost of a certain amount of required whitespace and parens. 4) Arabic is not a good example, but that's a good point; luckily Unicode is organised well, and it's possible to allow/exclude vast ranges quite sanely. – Jon Purdy May 01 '11 at 00:52
  • Actually, coming back to re-read this, I think your concepts of what constitutes an identifier, and of the state of lexing and parsing in general, are rather outdated. I'd retract my upvote, but you'd have to make a (possibly null) edit. – Jon Purdy May 01 '11 at 01:48
  • 1
    I took a class that was done entirely in Lisp, which allows alpha-beta as an identifier, and actually found it very reasonable. In fact, getting back to other languages was annoying afterwards. – Tikhon Jelvis May 01 '11 at 09:02
  • 2
    @Tikhon Jelvis: Yeah, the uniformity is very logical: - is an identifier just as alpha-beta is an identifier, and the lexer just relies on whitespace and parens to separate tokens. It's really quite elegant in its (almost naive) simplicity. – Jon Purdy May 01 '11 at 16:35
  • @Jon Purdy: I'm still not sure how you solve the ambiguity. Say your parser has already encountered identifiers alpha, beta, and alpha-beta. How do you separate alpha-beta from alpha - beta in, say, i = alpha-beta;? – n1ckp May 01 '11 at 19:12
  • @n1ck: In Lisp, at least, symbols are separated by spaces and surrounded by parentheses. Also, everything is in prefix order. So alpha - beta would be (- alpha beta) and i = alpha-beta would either be (set! i alpha-beta) or (set! i (- alpha beta)) depending on which one you meant. (When I say I used Lisp, I really mean Scheme, a type of Lisp, but the point still stands.) – Tikhon Jelvis May 01 '11 at 19:58
  • @Tikhon: Yeah, I forgot that we were talking about Lisp. But what about whitespace (as in the question)? In something like your example (set! i alpha-beta), if you allow whitespace in identifiers, two successive variables can become ambiguous (e.g. (set! alpha beta beta), given a certain combination of values). I guess there are possibly ways to solve it (with a certain ordering?), but I was a little curious, since it does seem to complicate things a lot. – n1ckp May 01 '11 at 21:03
  • @n1ck: Lisp isn't a good example here. In my language, if my flag return my var is unambiguous because if and return are keywords, so my flag and my var are ordinary identifiers, whereas in Lisp, no name is special, so there would be ambiguity. My language does naturally have some separation requirements for the otherwise rather lax control structures: if foo bar() defaults to if (foo bar()) ..., and you would have to explicate—with if (foo) bar(), if foo { bar() }, if (foo) { bar() }, or even if foo do bar()—to have foo and bar treated separately. – Jon Purdy May 01 '11 at 22:20
  • @Jon Purdy: nice, my main question was in fact how you handle the ambiguity in your language, as your second comment on this answer seemed to imply that you used some secret parsing technique to resolve it, and I was curious what it was. It seems a little clearer now. Another possibility, I guess, could be to raise an error on ambiguity, similar to when two overloads conflict through possible conversions in C++, for example (this is not my area of expertise and I haven't thought much about it, so don't take this too seriously). – n1ckp May 01 '11 at 22:56
  • @n1ck: It's nothing special. I just have a consistent set of rules, and if applying those rules results in a parse error, at least it fails in the immediate vicinity of the problem, and with a reasonably helpful error message. – Jon Purdy May 01 '11 at 23:27
  • @Jon Purdy: well, to talk more about your actual question, I do think it would not be that great an idea to allow whitespace in identifiers, at least in its freer form (vs. something like sparkie's answer). How about having a character to split the identifier, like C's \ for ending a macro line, or a similar rule for breaking it across a newline (which seems to be your rationale for whitespace in identifiers)? Again, it's not my area, but I feel like more restrictions would put some limit on the freedom and, with it, on the possible ambiguities in the grammar. – n1ckp May 02 '11 at 01:03
  • @n1ck: My system is more restricted: any intervening whitespace characters between identifier parts are treated as a single space (sketched after these comments). You could do something similar with a quoting system, of course, but I'm fairly certain the F# spec is more liberal, treating ``a b`` and ``a  b`` (one space vs. two, enclosed in backticks) as distinct, even though newlines and tabs are disallowed. And getting around the issue with a line-joining character tarnishes the goal of readability. – Jon Purdy May 02 '11 at 01:40
  • Even if you can reliably parse the situation, the human shouldn't have to be trying to. Anything that makes the code harder to read is bad. Use_underscores or CamelCaps. Also, Unicode in identifiers is a bad idea--someone not used to generating those characters (likely in another locale) will eventually need to do something with them. – Loren Pechtel Dec 03 '12 at 04:57
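
For what it's worth, the whitespace-collapsing rule Jon Purdy describes in his last comment above can be sketched in a couple of lines (Python, illustrative only; the function name is made up):

import re

def canonical(name):
    # Collapse any run of internal whitespace to a single space, so
    # "RIGHT TURN" and "RIGHT  TURN" (or a version broken across a
    # line) compare as the same name.
    return re.sub(r"\s+", " ", name.strip())

print(canonical("RIGHT  TURN") == canonical("RIGHT TURN"))  # True
print(canonical("RIGHT\n    TURN"))                          # RIGHT TURN

Under a rule like that, the one-blank-versus-two-blanks worry from the answer goes away, though it says nothing about how the words get grouped into identifiers in the first place.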