In Python, `[0xfor x in (1, 2, 3)]` returns `[15]` (twitter.com/nedbat)
121 points by obi1kenobi on April 13, 2021 | hide | past | favorite | 95 comments



Trying to clarify it a bit: it gets interpreted as "[0xf or (x in (1,2,3))]", and the "or" short-circuits, never evaluating the second part (which would otherwise raise a NameError, since x is undefined). The expression therefore evaluates to [0xf], or just [15] once you convert the hex notation to an integer.
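You can see the short-circuit in action with the spelled-out form (a sketch; x is deliberately left undefined):

```python
# Tokenized as `0xf`, `or`, `x`, `in`, `(1, 2, 3)`, i.e.
# [0xf or (x in (1, 2, 3))]. Since 0xf (15) is truthy, `or`
# short-circuits and the undefined name x is never looked up.
result = [0xf or (x in (1, 2, 3))]
print(result)  # [15]
```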

If you drop the outer list, which is only there to fool you into thinking this has something to do with list comprehensions (same as the x), and if you don't use hex to make it seem like there is a "for" in the middle, it boils down to something like "15or whatever()". That doesn't seem all that confusing, even if "whatever" uses an undefined name like x, or is itself undefined, because we're in Python and the right-hand side is only evaluated when it runs. Then the 'confusion' boils down to a Wat about why it's legal to write "15or 3" without a space before the "or", and especially why "0xfor 3" works. This is documented, as other comments mention, and is due to the way the tokenization works.
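The reduced form really does run. A sketch (note that recent CPython versions emit a SyntaxWarning for a number immediately followed by a keyword, hence the warning filter):

```python
import warnings

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # newer CPythons warn about `15or`
    # `15or 3` tokenizes as `15`, `or`, `3`; 15 is truthy, so it wins.
    value = eval("15or 3")
print(value)  # 15
```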

I don't know if the parser could be changed to require a leading space before the or operator, but it's pretty clear to me that this is only confusing if you intentionally do your very best to add confusing structure around it in an attempt to fool the reader into thinking something very different and weird is going on.


> it's pretty clear to me that this is only confusing if you intentionally do your very best to try to add confusing structure around it in an attempt to fool the reader into thinking something very different and weird is going on.

This has come up from a Core Developer before [1] so it's not just code golfers having a laugh.

Yes, the docs note the tokenization behavior [2], but Guido's response today in the above mail thread is also pretty unambiguous:

> I would totally make that a SyntaxError, and backwards compatibility be damned.

1: https://mail.python.org/archives/list/python-dev@python.org/...

2: https://docs.python.org/3/reference/lexical_analysis.html#wh...


> it's pretty clear to me that this is only confusing if you intentionally do your very best to try to add confusing structure around it in an attempt to fool the reader into thinking something very different and weird is going on.

Sure, but that's half the fun :-) I frequently see Ned's name pop up with interesting things like this.


Yeah, this also evaluates to [15] and makes it easier to see what's going on:

    [15or x in (1, 2, 3)]
As another person pointed out, Python's lexer sees 0xfor (or 15or) and splits it into "0xf" and "or" ("15" and "or" for the other case), then the parser processes it as usual.


Reminds me of the "goes to"[0] operator in C (and C++)

    int x = 10;
    while (x --> 0) {   /* actually parsed as (x--) > 0 */
        printf("%d\n", x);
    }

[0] https://stackoverflow.com/questions/1642028/what-is-the-oper...


> Whitespace is needed between two tokens only if their concatenation could otherwise be interpreted as a different token (e.g., ab is one token, but a b is two tokens).

https://docs.python.org/3/reference/lexical_analysis.html#wh...


I find that to be a really odd choice when OTOH tabs/indents are required for block structure.


Python: Semantic whitespace is the one true way

Also Python: Meh, you can leave out the whitespace, I'll figure it out


Similarly, Python is supposed to be a language that's easy to learn and easy to read code in. Pythonistas take pride in that, as IMO they should. With that in mind, preserving a counterintuitive behavior that exists solely because of how the lexer was implemented seems inconsistent at best. A language with a more relaxed attitude toward inscrutable brevity - e.g. Perl - can use that excuse. I don't think Python should.


I first learned Python in the 2.6 days, and have been developing in it fairly regularly over the 15 or so years since.

I've never encountered this issue before.

The reason I've never encountered this issue, or even needed to know it was an issue, is that I use good development tools. 1) PyCharm highlighted the or in orange, making it clear that it was being interpreted as a keyword, 2) I use a linter (as we all should), which explicitly flags the lack of whitespace around the token as an issue (PEP 8: E225), and 3) I use a code formatter (which we all should), which, again, highlights this statement and requests that I fix it.

Python has a lot of flaws. This one is down near the bottom of ones that are even really worth talking about.


Started with 1.5 myself. I've never encountered this issue, nor would I expect to. It might be down near the bottom of the list, but it's still a wart and it showed up here for us to talk about. You seemed to think it was worth talking about at even greater length than I did. So do you just believe it's not worth other people expressing opinions?


This showed up on the front page because it's a novelty that even people who have been using the language for over a decade would never encounter.

You're taking my statement way too personally and out of context. You and everyone else can express your opinions all you want.

As a topic in the list of "flaws with Python," I don't think it's a very important or insightful topic because it's not something people really run into. This just doesn't fit there.

This is a novelty, pure and simple. Filed under "weird programming hacks you wouldn't expect," this would be fine. It's like `([]+![])[+!+[]+!+[]+!+[]][([]+{})]` in JavaScript. Weird, kind of interesting, but it's not really a "flaw," at least not in the sense of something that will burn you unexpectedly.


This is amazing to me and I would never have suspected such a thing.

Reminds me of Fortran.

https://arstechnica.com/civis/viewtopic.php?t=862715


Yeah, FORTRAN made some bad choices. In its defense, it first appeared in 1957. There wasn't much language design or compiler writing knowledge back in those days.

When we were in college learning FORTRAN, students would ask for help from the teaching assistants at the computer center. One big problem was FORTRAN allowed horrible spaghetti code because GOTO statements could be used anywhere. It was easy to jump into and out from loops. The TAs had a tough job.

It didn't help when people would deliberately mess with the TAs by asking about code with language features similar to those in the article you linked. Something like this, IIRC after 47 years:

      DO 15 I = (1, 100)
         purported loop stuff goes here
   15    final statement of loop
That is also not a loop control statement. Since FORTRAN ignores spaces, it is instead an assignment of the complex constant (1, 100) to a variable named DO15I.


But why "0xf or" rather than "0 xfor" (where xfor is a valid variable name) as long as we only care about splitting between valid tokens?


Because the hexadecimal parsing rule [0] happens before [1] the rule that parses names [2]:

    % cat 0xfor.py
    0xfor 1
    % python -m tokenize -e '0xfor.py'
    0,0-0,0:            ENCODING       'utf-8'
    1,0-1,3:            NUMBER         '0xf'
    1,3-1,5:            NAME           'or'
    1,6-1,7:            NUMBER         '1'
    1,7-1,8:            NEWLINE        '\n'
    2,0-2,0:            ENDMARKER      ''
Also, literals have to be evaluated before names, otherwise you could overwrite them:

    >>> 0xzzz = 1
    File "<stdin>", line 1
        0xzzz = 1
        ^
    SyntaxError: invalid hexadecimal literal
    >>> 0xf = 1
    File "<stdin>", line 1
        0xf = 1
        ^
    SyntaxError: cannot assign to literal
[0]: https://docs.python.org/3/library/stdtypes.html#float.fromhe...

[1]: https://github.com/python/cpython/blob/5ce227f3a767e6e44e7c4...

[2]: https://github.com/python/cpython/blob/5ce227f3a767e6e44e7c4...


You seem to suggest some sort of difference in rule priorities. I don't think they are prioritized. It's just that Python reads left to right, and the first thing it sees looks like the start of a number, so it starts parsing a number. It doesn't reason like "I have to parse like this, otherwise you could overwrite literals".


I’m suggesting that it has to happen in a specific order, which it does. Regardless of the direction the tokenization process reads, the rules can’t be evaluated concurrently.

> the first thing it sees looks like the start of a number

Yes, because it checks if the token is a number before checking if it is a name.


Names cannot start with a decimal digit, so 0xfor can never be a name. It doesn't matter in which order the rules are checked.


> 0xfor can never be a name

But to know if a token is a name, it has to check, and that check never happens because the tokenizer yields when the number case hits. You can attach a debugger and see for yourself.

> It doesn't matter in which order the rules are checked.

Nowhere did I claim otherwise. The rules _do_ happen in a deterministic order, which I posted the source code for.


>> It doesn't matter in which order the rules are checked.

> Nowhere did I claim otherwise.

You claimed otherwise when you wrote "Because the hexadecimal parsing rule [0] happens before [1] the rule that parses names [2]", and when you wrote "literals have to be evaluated before names", and when you wrote "the rules can’t be evaluated concurrently". That is, you claimed otherwise in every single one of your posts in this subthread.


The check whether a token is a number happens before the check whether a token is a name, as evinced by both the source code and a trivial debugger session.

You are conflating the fact that there is an order to the rules with a strawman that the rules _must_ be evaluated in a specific order. All of the quotes you pulled are facts (Python checks if a token is a number before it checks if it is a name, Python checks if a token is a literal before it checks if it is a name, and the rules are not evaluated concurrently), but your strawman is making a different claim.


> a strawman that the rules _must_ be evaluated in a specific order.

I'm not the one who wrote "literals have to be evaluated before names".

> rules are not evaluated concurrently

Strawman yourself. I didn't say they were evaluated concurrently, I said that your claim that they can't be evaluated concurrently was wrong. They could be, because they are disjoint, and regardless of order only one of them can match at input starting with a '0' character.

Anyway, I'm done here.


Because `0 xfor` is a syntax error regardless -- nowhere in the grammar are you allowed to separate two non-keywords with a space. 0xf is a number, 0xfo is not, so the lexer breaks the token before the o.

This isn't a bug. It's a slightly loose grammar spec that can surprise python users who don't play golf.


The lexer reads the longest tokens it can (the "maximum munch" rule mentioned in another answer). Once it has read the 0, it has decided to read a number. It will then proceed to read the longest possible number at this position, which is 0xf. This is simpler to specify and implement than the alternative that would require looking ahead to see where to split.


Because of maximal munch.


Any way to lint this pattern away? Say what you will about the language, lexer complexity, and backwards compatibility; this is a code smell if I have ever seen one.


Enforce consistent formatting, e.g. with black (https://github.com/psf/black) this gets formatted to [0xF or x in (1, 2, 3)].


This issue is caught by a linter and reports, "PEP 8: E225 missing whitespace around operator".


Some cases are, but there are still plenty of patterns that (currently) are not. e.g. `1if 1else 2`

See also: https://github.com/PyCQA/pycodestyle/issues/371
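That `1if 1else 2` case evaluates just fine, which is what makes it hard to lint. A sketch (with the SyntaxWarning newer CPythons emit suppressed):

```python
import warnings

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # newer CPythons warn here too
    # Tokenized as `1`, `if`, `1`, `else`, `2`: a conditional expression.
    value = eval("1if 1else 2")
print(value)  # 1
```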


Interpreted as: [0xf or x in (1,2,3)]

0xf = Hex for 15

The `or` short-circuits on the truthy 0xf, yielding 15 (not a boolean)


Here's the AST:

    >>> ast.dump(ast.parse('[0xfor x in (1,2,3)]'))
    "Module(body=[Expr(value=List(elts=[BoolOp(op=Or(), values=[Num(n=15), Compare(left=Name(id='x', ctx=Load()), ops=[In()], comparators=[Tuple(elts=[Num(n=1), Num(n=2), Num(n=3)], ctx=Load())])])], ctx=Load()))])"


Shouldn't this be considered a bug in Python? Why does it even try to evaluate 0xfor without the space? Trying a few other things...

* 0xfor1 evaluates.

* 1or 2 evaluates.

* 1or2 doesn't.

* ''or'foo' evaluates.

This is gross.


That’s the normal way lexers work, given “tight” token definitions. They continue adding to the current token until an invalid (for the current token type) character is reached, and then begin parsing a new token starting with the “invalid” (but now valid for the next token) character (or the next non-whitespace character).

“1or2” is lexed into “1” (integer) followed by “or2” (identifier), which is valid on the lexer level but then fails on the grammar level.


The lexer, unfortunately, is a greedy token matcher. As soon as 0xf "made sense" to it and 0xfo did not, it did the same thing it would do in the case of something like 0xf+3, except the + was an `or` in this case, which is kosher. There is an idempotent step you can take where extra spaces are added before the AST is formed, to make this sort of thing easier to spot. The good news is that with a decent lint/format flow, these sorts of things are easy to catch.


Probably a lexer bug. "foo"or should never be processed as "foo" and token OR


Why not? "or" is an operator, like "+". "foo"+"bar" should be valid, so why have a special, inconsistent case for "or"?


You know what? You’re right.

I guess it’s an operator token after all.


Not by design, fortunately: https://bugs.python.org/issue43833


That hasn't been confirmed. Are we sure that it's not an inherent ambiguity in the grammar?


That's not totally clear. A bug being filed doesn't mean it's accepted. And this has been (ab)used for quite some time in various python codegolf. See https://codegolf.stackexchange.com/a/56 from 2011.


It is 100% by design. It's even documented.

https://docs.python.org/3/reference/lexical_analysis.html#wh...

This is not a bug.


The cited documentation says:

> Whitespace is needed between two tokens only if their concatenation could otherwise be interpreted as a different token (e.g., ab is one token, but a b is two tokens).

The two tokens in this case are "0xf" and "or". Their concatenation cannot be interpreted as a different token, because "0xfor" is not a valid token. Therefore, if I'm reading the rule correctly, whitespace is needed in this case.

"0xffor" is another interesting case. It's also not a valid token, but it could be interpreted as two tokens in two different ways: "0xf" "for" or "0xff" "or". (Python does the latter. I presume it uses something like C's "maximal munch" rule.)
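You can watch the maximal-munch choice with the stdlib tokenize module (a sketch):

```python
import io
import tokenize

# Tokenize "0xffor 1": the lexer munches the longest valid number, 0xff,
# then restarts and reads `or` as a name/keyword.
toks = [
    (tokenize.tok_name[tok.type], tok.string)
    for tok in tokenize.generate_tokens(io.StringIO("0xffor 1\n").readline)
]
print(toks[:3])  # [('NUMBER', '0xff'), ('NAME', 'or'), ('NUMBER', '1')]
```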


Your conclusion is the exact opposite of what the documentation explicitly states.

> Whitespace is needed between two tokens only if their concatenation could otherwise be interpreted as a different token (e.g., ab is one token, but a b is two tokens).

Because the concatenation of "0xf" and "or" can't be interpreted as a different token, the whitespace is not needed.


You're right, and I was wrong.

I dislike the rule, and I strongly think that "0xfor" should require whitespace between "0xf" and "or" (I'm sure that influenced my reading), but you're right about what the rule says.

(Apparently I can't edit my previous comment.)


It is not strictly speaking a bug, since it works as intended. But it is clearly a counter-intuitive behavior and could be improved. Making 0xfor a syntax error would definitely be an improvement.

But requiring whitespace between all tokens is not an acceptable solution, since "2+2" should work. Always requiring whitespace between alphanumeric characters in different tokens would make sense.


Why is it parsed like that and not as an invalid integer literal "0xfor"?

edit: 'cause it's a bug! https://bugs.python.org/issue43833


For pretty much the same reason that "1+2" is not parsed as an invalid integer.


"1+2" is a valid expression consisting of the three tokens "1", "+", and "2". It's permitted to omit whitespace between those tokens.

The question is whether "0xfor" is a valid way to represent the two tokens "0xf", "or". My reading of the rules is that it isn't, and that whitespace is required.


I read the rules incorrectly. It appears that "0xfor" is a valid way to represent the two tokens "0xf" and "or".

The rule, which I initially misinterpreted, is:

> Whitespace is needed between two tokens only if their concatenation could otherwise be interpreted as a different token (e.g., ab is one token, but a b is two tokens).

Since "0xfor" cannot be interpreted as a different (single) token, whitespace is not needed.

(I'm not a fan of this particular consequence of that rule, and I suspect it was not intended, but it says what it says.)


This illustrates why lexer tokens should preferably not be defined as exactly what the language allows. Instead the internal lexer definition should include invalid tokens (like “0xfor”) that are only sorted out in a later step (in this case, when actually converting it to an integer value).


Yes, that is basically how C works, though the C standard describes it in a weird way. There is the notion of a “preprocessor number” which starts with a digit and allows any characters that can appear in a number or identifier. At a later step in parsing it is converted to a strict number, which would fail in a case like 0xfor.


Just to elaborate, it short circuits, so x need not be defined.

    [sielicki@dogfruit ~]$ python3.10 -c 'print(0xf or some_undefined_variable in some_undefined_set)'
    15


This is the kinda thing that helps illustrate just how powerfully dynamic Python is. Almost nothing is statically bound. Super cool if you want to write code that inspects some dynamic state and has special behavior. But the cost is that there's very little static checking capability.

This code "should" result in a NameError or something similar. But resolving that identifier to a name won't happen until we decide to evaluate the right-hand side of the `or`.

Thankfully, most python static checkers make some mostly-sane assumptions and will flag this code as an error, despite the subtle possibility of legally creating this name using some other Python/CPython features.

Type annotations are a great idea but in practice I don't see them used much. Should be great for dev teams working on bigger Python projects.
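A minimal sketch of that late resolution: defining a function that references a nonexistent name is fine; only calling it fails.

```python
def f():
    # `some_undefined_name` is looked up only when this line executes.
    return some_undefined_name

# The definition above raises nothing; the call is what fails.
try:
    f()
except NameError as exc:
    print("NameError:", exc)
```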


A bit of a miss for a language that puts whitespace central in the syntax.


"2+2" and "2 + 2" are the same in Python, so whitespace is not significant between otherwise unambiguous tokens.

Whitespace in Python is significant for indicating statements and blocks.


I initially assumed the missing space was a typo in the headline, so I tried [0x for x in (1,2,3)] and got "invalid hexadecimal literal" as I would have expected. Took me a few minutes to realize the typo was intentional.

I still can't figure out what the author was trying to do when he stumbled onto that, though...


This is the kind of thing you could end up with from an automated test case reducer. Start with a typo that causes odd behavior (which might be interpreted as a CPython 'bug') and use something like creduce to whittle it down to the minimal case.


My guess is that they were trying to do something like [x for x in some_list], but had some unfortunate typos.


Yeah maybe he was converting a tuple to a list, but that still seems like a weird way to do it.


Think of it as a simplified version to demonstrate the issue.


Anyone know why it's legal to use logic operators like that in Python?

That is, 0or, 0xor, 0and, etc. In my view, this should clearly be a syntax error.

edit: as noted by the parent comment, 0xor != 0 xor, but 0x or. But this seems to hold true for all data types.


Other languages don't enforce whitespace around logical operators either. For example, `0xf||1+2` is perfectly valid JavaScript.

I can see how this could slip by; it's possible that the original Python used non-alphabetic tokens for the logical operators, which were swapped out at some later time for "or" and "and".


There's no ambiguity though, right? [0x for x in (1,2,3)] is nonsense (one typo off from something legitimate, but nonsense regardless). I can't think of another way someone might expect this to parse.


If you want confusion with a valid expression:

    [0x1for x in (1,2,3)]
looks like it should parse as

    [0x1 for x in (1,2,3)]
but instead parses as

    [0x1f or x in (1,2,3)]
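And indeed it evaluates accordingly (a sketch, with the SyntaxWarning newer CPythons emit suppressed; x is never defined):

```python
import warnings

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    # 0x1f is 31 and truthy, so `or` short-circuits; x is never looked up.
    result = eval("[0x1for x in (1,2,3)]")
print(result)  # [31]
```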


This is better than the original post.


I would have expected the english boolean operators to require whitespace to parse correctly, personally. "0xf||x" makes sense. "0xforx" makes no sense.


"0xforx" would be parsed as "0xf orx" which would be a syntax error.


Yeah, I realized after my edit window had expired. Should have been "0xf|| x" and "0xfor x".


Simpler example of the issue:

    >>> 1or 2
    1


Looks like the parser knows numbers end when the digits end, so it splits them into two tokens. (Ignoring the optional [lL] long suffix before Python 3.)

One of those things that looks helpful at first glance, but is problematic in the long run. Throwing a syntax error immediately would lead to more robust/maintainable code.


You have to recognize the specific case to throw the syntax error. Sounds like they're going to fix that.

I guess people don't remember PL/I very much: PL/I keywords are not reserved words, so it is possible to use them in a program in other than their keyword context. DO DO=1 TO TO BY BY;END=END+1;END; is a valid program.


For those wondering, the idea behind that choice is that you can't expect anybody to know all the keywords. PL/I had many built-in functions that all were keywords, so it had many keywords (https://www.cs.vu.nl/grammarware/browsable/os-pli-v2r3/#Keyw... lists hundreds)

and even if you do, future versions of the language may include more keywords.


Though you don't want to throw a syntax error on "1+2"


Yes, the difference between a letter character and punctuation/operator character.

There's an argument to be made that disallowing it could help in the long run as well, but Python has never been so strict. So I'd stop short of recommending it here. Perhaps in a bondage and discipline language. :-D


1+2 is a valid expression consisting of 3 tokens, "1", "+", and "2".

In most languages, numeric literals, identifiers, and keywords cannot be adjacent, but any of them can be adjacent to an operator. The odd thing here is "0xfor" being tokenized as "0xf" and "or". In C, for example, "0xfor" is an invalid token.


> In most languages, numeric literals, identifiers, and keywords cannot be adjacent, but any of them can be adjacent to an operator. The odd thing here is "0xfor" being tokenized as "0xf" and "or"

0xf is a numeric literal, and or is an operator, so as you said, they can be adjacent. or isn't an operator in C.

of course, if it were 0xfand ... you have a fun problem, because it could be 0xfa nd, but nd isn't an operator (that I know of), or 0xf and. Ideally, syntax shouldn't be ambiguous.


I should have stated that more clearly. In Python, "or" is both an operator and a keyword.

In most languages, numeric literals, identifiers, and keywords cannot be adjacent, but any of them can be adjacent to a token that consists of punctuation characters. (In some languages, most or all operators are spelled using punctuation characters, which is what I had in mind when I wrote my previous comment.)


or is an operator in C++ and if you #include <iso646.h> in C


> or is an operator in C++

Yes.

> and if you #include <iso646.h> in C

Sort of. As a preprocessor token, it's an identifier that happens to be the name of a macro. As a token (after preprocessing), it becomes the "||" operator.


My preference would be that the syntax requires spaces in that case. Oh well, I've always got the option of writing my own programming language.


There's the additional thing of x not being defined, yet never being evaluated, which I found surprising at least. So perhaps:

    >>> 1or x  # without x existing
    1


Clearly a lexer error.


Why is it "clearly" an error? This syntax has been around forever and is widely used by code golfers, so if it were unintentional, I'm sure the Python devs would've fixed it by now.


It no longer matters whether it was originally unintentional. It's in active use now, so even if it was originally an accident, it cannot be fixed without breaking backwards compatibility.


Eh, python breaks backwards compatibility between versions all the time. And I don't know if "it's used by codegolfers" is a legitimate reason to keep a language feature.

That said, I have a really hard time imagining this issue coming up in regular usage of the language so I don't really have much of an opinion should they choose to keep or drop it.


Corresponding bug report: https://bugs.python.org/issue43833


This isn't a bug. In fact, it's documented behavior:

https://docs.python.org/3/reference/lexical_analysis.html#wh...


For Christ's sake this isn't a bug.


could be much worse

  # launch debugger with the program provided inline, consisting of the statement `1`.
  # this program is ignored and we just mess with some variables.
  % perl -de1  

  Loading DB routines from perl5db.pl version 1.57
  Editor support available.
  
  Enter h or 'h h' for help, or 'man perldebug' for more help.
  
  main::(-e:1): 1
    DB<1> @a = (1..4)
    DB<2> x \@a
  0  ARRAY(0x7faf890418f8)
     0  1
     1  2
     2  3
     3  4
    DB<3> x scalar @a
  0  4
    DB<4> sub foo { return (1..4) }
    DB<5> @b = foo()
    DB<6> x scalar @b
  0  4
    DB<7> x scalar foo()
  0  ''


yes, that's the empty string, aka false, because the .. operator returns entirely different things in scalar context so you can write code like

  while (<STDIN>) {
    next if /^BEGIN$/ .. /^END$/;
    # ...
  }


`0xbor(11)` is 11

`11-0xbor-11` is -11

and `11-0xbor-0xbor-0xbor-11` is -11, too
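Those check out if you trace the tokens (a sketch; SyntaxWarnings from newer CPythons suppressed):

```python
import warnings

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    # `0xbor(11)`   -> 0xb or (11)       -> 11  (0xb is 11 and truthy)
    # `11-0xbor-11` -> (11 - 0xb) or -11 -> 0 or -11 -> -11
    a = eval("0xbor(11)")
    b = eval("11-0xbor-11")
print(a, b)  # 11 -11
```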


Reverse the order and you get:

    >>> [x in (1,2,3) or 0xf]
    [True]


No, that would raise a NameError unless x is defined earlier


  >>> [0xfor d or cambridge]
  [15]


    >>> 0xfor whatever
    15



