7

This is a sister question to: Is it bad to use Unicode characters in variable names?

As is my wont, I'm working on a language project. The thought came to me that allowing multi-token identifiers might improve both readability and writability:

primary controller = new Data Interaction Controller();

# vs.

primary_controller = new DataInteractionController();

And whether or not you think that's a good idea*, it got me musing about how permissive a language ought to be about identifiers, and how much value there is in being so.
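
To make the lexing idea concrete, here is a rough, purely illustrative sketch (Python; the keyword set and the merge rule are assumptions for the example, not a spec): runs of adjacent words that aren't keywords get folded into a single identifier, and keywords/operators act as the delimiters.

import re

KEYWORDS = {"new", "if", "return", "do"}  # hypothetical keyword set

def lex(source):
    # Split into words and single symbols, then merge adjacent
    # non-keyword words into one multi-word identifier.
    raw = re.findall(r"[A-Za-z_]\w*|\S", source)
    tokens, words = [], []
    for tok in raw:
        if re.fullmatch(r"[A-Za-z_]\w*", tok) and tok not in KEYWORDS:
            words.append(tok)
        else:
            if words:
                tokens.append(" ".join(words))  # one identifier, single spaces
                words = []
            tokens.append(tok)
    if words:
        tokens.append(" ".join(words))
    return tokens

print(lex("primary controller = new Data Interaction Controller();"))
# ['primary controller', '=', 'new', 'Data Interaction Controller', '(', ')', ';']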

It's obvious that allowing characters outside the usual [0-9A-Za-z_] has some advantages in terms of writability, readability, and proximity to the domain, but also that it can create maintenance nightmares. There seems to be a consensus (or at least a trend) that English is the language of programming. Does a Chinese programmer really need to be writing 电子邮件地址 when email_address is the international preference?

I hate to be Anglocentric when it comes to Unicode, or a stickler when it comes to other identifier restrictions, but is it really worth it to allow crazy variable names?

tl;dr: Is the cost of laxity higher than the potential benefit?

Why or why not? What experiences and evidence can you share in favour of or opposed to relaxed restrictions? Where on the continuum do you think the ideal lies?

* My argument in favour of allowing multi-token identifiers is that it introduces saner points at which to break long lines of code, while still allowing names to be descriptive, and avoiding ExcessiveCamelCase and a_whole_lot_of_underscores, both of which are detrimental to readability.

Jon Purdy
  • 20,547
  • 1
    Dear close-voter: I've done my best to improve the question; in the future please explain your close-vote and offer some pointers for improvement. – Jon Purdy May 01 '11 at 16:28
  • One language that is very permissive is Agda. It allows anything except Unicode whitespace and a handful of reserved operators. – Mechanical snail Dec 01 '12 at 01:24
  • I wish programming languages would require that every use of an identifier match the declaration precisely, but also forbid the use of identifiers which differ only in case, accents, etc. Such a rule would IMHO combine the advantages of both case-sensitive and non-case-sensitive languages, but I don't know any language that works that way. – supercat Jan 28 '14 at 22:11

6 Answers

10

I once worked with USL, which allowed a space as part of a name. The combinatorial possibilities became a nightmare. Is "LAST LEFT TURN" one identifier? Two ("LAST LEFT" and "TURN")? Two ("LAST" and "LEFT TURN")? Or three? And is "RIGHT TURN" (one blank) the same as "RIGHT  TURN" (two blanks), even though a text editor won't match them? No, don't ever accept blanks in names.

For similar reasons, never accept special characters that already mean something in the language. Is "ALPHA-BETA" a variable name or a subtraction?

Normally identifiers must start with a letter. Are you going to extend that to other scripts in Unicode? How will you know what a letter is in Arabic?

I fear that you are opening a gigantic can of worms.
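
Just to make the blow-up concrete, here is a tiny counting sketch (Python, with a hypothetical symbol table; USL itself of course worked nothing like this) that enumerates the ways a run of words can be carved into known names:

def segmentations(words, known):
    # Count the ways a run of words can be split into known identifiers.
    if not words:
        return 1
    total = 0
    for i in range(1, len(words) + 1):
        if " ".join(words[:i]) in known:
            total += segmentations(words[i:], known)
    return total

known = {"LAST", "LEFT", "TURN", "LAST LEFT", "LEFT TURN", "LAST LEFT TURN"}
print(segmentations("LAST LEFT TURN".split(), known))  # 4 readings

Three words already admit four readings, and in the worst case the count doubles with every extra word.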

Jon Purdy
  • 20,547
  • 2
    1) I think we've been over this before: it's not combinatorial, it's exponential. 2) That problem doesn't come up in my project because two randomly adjacent identifiers don't mean anything; LAST LEFT TURN is always a single identifier. 3) Interpretation of otherwise-significant characters depends on your approach to lexing: many Lisps have very few identifier restrictions at the cost of a certain amount of required whitespace and parens. 4) Arabic is not a good example, but that's a good point; luckily Unicode is organised well, and it's possible to allow/exclude vast ranges quite sanely. – Jon Purdy May 01 '11 at 00:52
  • Actually, coming back to re-read this, I think your concepts of what constitutes an identifier, and of the state of lexing and parsing in general, are rather outdated. I'd retract my upvote, but you'd have to make a (possibly null) edit. – Jon Purdy May 01 '11 at 01:48
  • 1
    I took a class that was done entirely in Lisp, which allows alpha-beta as an identifier, and actually found it very reasonable. In fact, getting back to other languages was annoying afterwards. – Tikhon Jelvis May 01 '11 at 09:02
  • 2
    @Tikhon Jelvis: Yeah, the uniformity is very logical: - is an identifier just as alpha-beta is an identifier, and the lexer just relies on whitespace and parens to separate tokens. It's really quite elegant in its (almost naive) simplicity. – Jon Purdy May 01 '11 at 16:35
  • @Jon Purdy: I'm still not sure how you solve the ambiguity. Say your parser has already encountered identifiers alpha, beta, and alpha-beta. How do you separate alpha-beta from alpha - beta in, say, i = alpha-beta;? – n1ckp May 01 '11 at 19:12
  • @n1ck: In Lisp, at least, symbols are separated by spaces and surrounded by parentheses. Also, everything is in prefix order. So alpha - beta would be (- alpha beta) and i = alpha-beta would either be (set! i alpha-beta) or (set! i (- alpha beta)) depending on which one you meant. (When I say I used Lisp, I really mean Scheme, a type of Lisp, but the point still stands.) – Tikhon Jelvis May 01 '11 at 19:58
  • @Tikhon: Yeah, I forgot that we were talking about Lisp. But what about whitespace (as in the question)? In something like your example (set! i alpha-beta), if you allow whitespace in identifiers, two successive variables can become ambiguous (e.g. (set! alpha beta beta), given a certain combination of values). I guess there are possibly ways to solve it (with a certain ordering?), but I was a little curious, since it does seem to complicate things a lot. – n1ckp May 01 '11 at 21:03
  • @n1ck: Lisp isn't a good example here. In my language, if my flag return my var is unambiguous because if and return are keywords, so my flag and my var are ordinary identifiers, whereas in Lisp, no name is special, so there would be ambiguity. My language does naturally have some separation requirements for the otherwise rather lax control structures: if foo bar() defaults to if (foo bar()) ..., and you would have to explicate—with if (foo) bar(), if foo { bar() }, if (foo) { bar() }, or even if foo do bar()—to have foo and bar treated separately. – Jon Purdy May 01 '11 at 22:20
  • @Jon Purdy: nice, my main question was in fact how you handle the ambiguity in your language, as your second comment on this answer seemed to imply that you used some secret parsing technique to resolve it, and I was curious what it was. It seems a little clearer now. Another possibility, I guess, could be to raise an error on ambiguity, similar to when two overloads conflict through possible conversions in C++, for example (this is not my area of expertise and I haven't thought much about it, so don't take this too seriously). – n1ckp May 01 '11 at 22:56
  • @n1ck: It's nothing special. I just have a consistent set of rules, and if applying those rules results in a parse error, at least it fails in the immediate vicinity of the problem, and with a reasonably helpful error message. – Jon Purdy May 01 '11 at 23:27
  • @Jon Purdy: well, to talk more about your actual question, I do think it would not be that great an idea to allow whitespace in identifiers, at least in its freer form (vs. something like sparkie's answer). How about having a character to split the identifier, like C's \ for ending a macro line, or a similar rule for breaking it across a newline (which seems to be your rationale for whitespace in identifiers)? Again, it's not my area, but I feel like more restrictions would put some limit on the freedom and, with it, on the possible ambiguities in the grammar. – n1ckp May 02 '11 at 01:03
  • @n1ck: My system is more restricted: any intervening whitespace characters between identifier parts are treated as a single space (sketched after these comments). You could do something similar with a quoting system, of course, but I'm fairly certain the F# spec is more liberal, treating ``a b`` and ``a  b`` (one space vs. two, enclosed in backticks) as distinct, even though newlines and tabs are disallowed. And getting around the issue with a line-joining character tarnishes the goal of readability. – Jon Purdy May 02 '11 at 01:40
  • Even if you can reliably parse the situation, the human shouldn't have to be trying to. Anything that makes the code harder to read is bad. Use_underscores or CamelCaps. Also, Unicode in identifiers is a bad idea--someone not used to generating those characters (likely in another locale) will eventually need to do something with them. – Loren Pechtel Dec 03 '12 at 04:57
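
For what it's worth, the whitespace-collapsing rule Jon Purdy describes in his last comment above can be sketched in a couple of lines (Python, illustrative only; the function name is made up):

import re

def canonical(name):
    # Collapse any run of internal whitespace to a single space, so
    # "RIGHT TURN" and "RIGHT  TURN" (or a version broken across a
    # line) compare as the same name.
    return re.sub(r"\s+", " ", name.strip())

print(canonical("RIGHT  TURN") == canonical("RIGHT TURN"))  # True
print(canonical("RIGHT\n    TURN"))                          # RIGHT TURN

Under a rule like that, the one-blank-versus-two-blanks worry from the answer goes away, though it says nothing about how the words get grouped into identifiers in the first place.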