Undefined behavior and other delights of (bad) C programming
It started a couple of weeks ago, when I received an email from Sandeep Vasant from Ahmedabad University in India. For reasons that he has yet to reveal, he was having trouble with some code like this:
int a=10, b=20, c=0;
c = a++ + a++ + b++ + b++ + ++a + ++b;
He tried this with one compiler and the resulting values of a, b and c were 13, 23 and 96 respectively. He was satisfied with this result. Then he tried a different compiler, which yielded a final value for c of 98, which he found confusing.
I started looking into this, certain that the explanation was simple …
As I work for a company that sells embedded software development tools, I have a natural interest in programming languages and their quirks – even if the code is not specifically embedded. I believe that embedded developers are often more interested in what is going on “behind the scenes” than their desktop computer software development counterparts.
My first thought was that, although the precedence of the + and ++ operators is clear and the functionality of the pre-increment and post-increment versions of ++ is unambiguous, the order of evaluation of the operators is not defined. Maybe they are evaluated left to right or perhaps right to left. So, instead of using a compiler, I decided to work it out by hand. Going left to right, I got 10 + 11 + 20 + 21 + 13 + 23 = 98; going the other way, starting with the rightmost term, I got 21 + 11 + 21 + 22 + 11 + 12 = 98. So it made no difference. In both cases I got the result that Sandeep had been confused by.
Now it was time to try a compiler [I used CodePad]. I wrote this:
int a, b, c;
a=10, b=20, c=0; // original code
c = a++ + a++ + b++ + b++ + ++a + ++b;
printf("%d %d %d\n", a, b, c);
a=10, b=20, c=0; // sequence of sub-expressions reversed
c = ++b + ++a + b++ + b++ + a++ + a++;
printf("%d %d %d\n", a, b, c);
The results were:
13 23 92
13 23 96
Now I was confused, as neither result seemed correct. I rewrote the code:
a=10, b=20, c=0; // explicit left to right evaluation
c = a++; c += a++; c += b++; c += b++; c += ++a; c += ++b;
printf("%d %d %d\n", a, b, c);
a=10, b=20, c=0; // explicit right to left evaluation
c = ++b; c += ++a; c += b++; c += b++; c += a++; c += a++;
printf("%d %d %d\n", a, b, c);
I was much happier with the results this time:
13 23 98
13 23 98
So what is happening here? I consulted my colleague Jon Roelofs, who provided a straightforward explanation: the order of evaluation is unspecified. It could be right to left or left to right, but it could also be any other order that the compiler felt was appropriate. When the sub-expressions have side effects on the same variable [like the increment operators here], the result is undefined. Needless to say, coding an algorithm which has an undefined result is rather pointless. Some compilers give a warning or an error in this situation.
Undefined behavior of this kind arises only when a single expression modifies the same variable more than once, or modifies a variable and also reads it for some other purpose, with no sequence point in between. For example:
a = b++ + c++ + d++;
does not exhibit undefined behavior.
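A minimal sketch of why that expression is safe; the function and struct names are hypothetical, chosen only for illustration. Each of b, c and d is modified exactly once and a is only written, so whatever order the compiler picks for the three increments, the values it ends up with are the same.

```c
/* Hypothetical container so the final values can be inspected. */
struct result { int a, b, c, d; };

struct result distinct_operands(void)
{
    int a, b = 1, c = 2, d = 3;

    /* Three different variables, each incremented once: the operands
       are 1, 2 and 3 in whatever order they happen to be evaluated. */
    a = b++ + c++ + d++;

    struct result r = { a, b, c, d };
    return r;
}
```

With these starting values a ends up as 6 and b, c, d as 2, 3, 4. Replace c and d with b in that expression and the guarantee is gone.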
Out of interest Jon suggested that I try this code:
#include <stdio.h>
int foo(void) { printf("foo\n"); return 0; }
int bar(void) { printf("bar\n"); return 1; }
int baz(void) { printf("baz\n"); return 2; }
int main(void) { printf("%d %d %d\n", foo(), bar(), baz()); return 0; }
The result I got was unsurprising:
baz
bar
foo
0 1 2
He pointed out that the three lines of text could have come out in any order; the numeric data will always be displayed last and in the correct order.
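If the order of the calls matters, one common approach is to give each call its own statement and pass the stored results on. The sketch below assumes that is acceptable; the log buffer and the ordered_calls helper are hypothetical names, and the functions append to a string instead of printing so that the order can be checked.

```c
#include <stdio.h>
#include <string.h>

static char call_log[32];   /* records the order in which the calls happen */

static int foo(void) { strcat(call_log, "foo "); return 0; }
static int bar(void) { strcat(call_log, "bar "); return 1; }
static int baz(void) { strcat(call_log, "baz "); return 2; }

/* Writes the call order followed by the three results into out. */
void ordered_calls(char *out, size_t n)
{
    call_log[0] = '\0';

    /* Each statement completes before the next one starts, so the
       calls are made in exactly this order. */
    int x = foo();
    int y = bar();
    int z = baz();

    snprintf(out, n, "%s%d %d %d", call_log, x, y, z);
}
```

Here the string produced is always "foo bar baz 0 1 2", at the cost of three extra local variables.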
There are numerous examples of this kind of challenge. Again, Jon suggested that I try this:
c = a+++b;
Is this treated as c = a++ + b; or c = a + ++b; ? It must be the former. Even though it looks as if it could have gone either way, the C language standard nails it. So this is not undefined behavior.
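The rule behind this is that the compiler forms tokens by taking the longest run of characters that can make a valid token, so +++ is read as ++ followed by +. A small sketch that makes the effect visible, with a hypothetical function and struct name:

```c
/* Hypothetical container so the final values can be inspected. */
struct triple { int a, b, c; };

struct triple munch(void)
{
    int a = 1, b = 2, c;

    c = a+++b;   /* tokenized as a ++ + b, that is (a++) + b */

    struct triple t = { a, b, c };
    return t;
}
```

After the assignment c is 3, a has become 2 and b is still 2, which is what the first reading predicts; the second reading would have left a at 1 and given c the value 4.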
Without wishing to sound superior, I think that it would be very unlikely that I would encounter this problem in “real” code that I had written. This is because, as I started out writing assembly language, I am naturally inclined to keep my statements in C very simple and, hence, do not introduce such complexity.
Comments
There are three things to consider here:
(1) “Order of evaluation”, which applies to sub-expressions and is unspecified in C (and C++), except in a few specific circumstances.
(2) “Precedence”, which applies to operators and is rigorously defined in the language specifications.
(3) “Associativity”, also rigorously defined, which also applies to operators and acts as a tie-breaker when precedence alone is not enough.
The difference between (1) and the combination of (2) and (3) takes some getting to grips with, but is essential for a proper understanding of either language.
I agree with you, Colin, about sticking to the KISS principle. The most prevalent C “sin” in the examples in your article is the (ab)use of embedded assignments in expressions. I *never* permit myself the dubious luxury of *any* embedded assignments in my code, even though the ++ operators were specifically invented for that kind of use.
Interesting way to put it Peter – thinking of a ++ as being a bit like a =. If I saw a = on the RHS of an assignment, even if valid, I would be wary and want to simplify the code.
Some high-level languages define the evaluation order; in Lisp, for example, it is always left to right, so such ambiguities cannot arise. C cannot do this, since its goal is to squeeze out the last CPU cycle. Whether this still makes sense today is questionable, considering that some mass-market operating systems are becoming more and more bloated on purpose, in order to sell more powerful hardware. And they are still coded in C.
You make a good point Antonio.
Yes! 🙂 ‘Clever’ code isn’t clever. Keep it simple, clear and easy to understand. Nothing else will do. 😉