Tuesday, August 15, 2017

A Subtle Difference between C and C++: String Literals

For the most part, C++ is a superset of C.  You can write C code, rename the file with a .cpp extension, and the compiler will compile it in C++ mode, generating essentially the same code.  In fact, the very first C++ compilers were actually C compilers with extra pre-processing.  But there are some subtle differences, and I recently ran into one that has some important implications.

In C, string literals are not constant, but in C++ they are.


For most practical purposes, string literals are constants in C.  The compiler puts them into a data segment that the code can't modify.  All the libc string functions take (const char *) parameters.  But if you assign a (char *) pointer to the start of a string literal, the compiler won't warn you about dropping the 'const' from the type, and you can write code that modifies the string (only to have it fault at run time).  In fact, the warning about dropping the 'const' is probably the reason that newer versions of C haven't changed string literals to be constant.

Why does anyone care?

I've found two specific instances where code works when compiled as C++, but not as C because of this.

You can't use string literals as the targets of case statements.


This may seem obvious, as a case statement takes an integer type, not a string.  But suppose you want to do a switch based on the first four characters of a string (after ensuring that it's at least four characters long).  Imagine the following macro:

#define STR2INT(s) ( ((s[0]) << 24) | ((s[1]) << 16) | ((s[2]) << 8) | (s[3]) )

Now you could write code like:

   switch(STR2INT(option)) {
      case STR2INT("help"):
         ...
      case STR2INT("read"):
         ...
   }

That works in C++.  But in C, the compiler complains that the case statements aren't constant expressions.  To make it work in C, you have to have a much uglier version of the macro for the case statements:

#define CHARS2INT(w, x, y, z) (((w) << 24) | ((x) << 16) | ((y) << 8) | (z))

Then the code looks like:

   switch(STR2INT(option)) {
      case CHARS2INT('h','e','l','p'):
         ...
      case CHARS2INT('r','e','a','d'):
         ...
   }

That works in both C and C++, but is a pain to write.  At least the STR2INT macro works fine in other situations where the compiler insists on constant values.

You can't write asserts based on macro names.


In large software projects, it's not unusual to have sets of macros for specific purposes.  These macros are by convention supposed to follow some project-specific format.  There even may be a defined correlation between the name of the macro and the value.  It would be nice to be able to write asserts based on the macro name to enforce those conventions.

A quick aside on asserts:

Both C and C++ now support compile-time asserts.  It used to be that you would write code that would generate a negative shift if the expression wasn't true or something like that.  When the assert failed, you would get a compile-time error that was rather confusing until you looked at the offending line.  With the new mechanism, the compiler displays whatever message you tell it.  You use static_assert(expression,"message");  In C, you have to include <assert.h> or use _Static_assert.  This was added in C11 and C++11.

So for a trivial example, suppose we have macros like:

#define VAL_7B 0x7b

Now somewhere we use those macros:

   process_value(VAL_7B);

Obviously real code would have other parameters, but this is enough for our purposes.

To have asserts based on the macro name, what appears to be a function call must also be a macro; presumably a macro wrapper around the real function call.  Consider this definition:

#define process_value(v) \
   do { \
      _process_value(v); \
   } while(0)

That's a basic wrapper, forcing a semicolon at the end of the do-while loop.  This lets us add in asserts using the '#' preprocessor operator to stringify the input parameter:

#define CHAR2HEX(c) ( (c) <= '9' ? (c) - '0' : (c) - 'A' + 10 ) // Assumes uppercase
#define process_value(v) \
   do { \
      static_assert( (#v)[0]=='V' && (#v)[1]=='A' && (#v)[2]=='L' && (#v)[3]=='_', "Must use a 'VAL_xx' macro here" ); \
      static_assert( CHAR2HEX((#v)[4]) == ((v)>>4)  , "'VAL_xx' macro doesn't match defined value" ); \
      static_assert( CHAR2HEX((#v)[5]) == ((v)&0x0f), "'VAL_xx' macro doesn't match defined value" ); \
      static_assert( (#v)[06]==0, "'VAL_xx' macro format wrong" ); \
      _process_value(v); \
   } while(0)

In C++, that works great.  In C, you just can't do that.

And here's something interesting:  Why not change the above example to look like:

#define CHAR2HEX(c) ( (c) <= '9' ? (c) - '0' : (c) - 'A' + 10 ) // Assumes uppercase
#define process_value(v) \
   do { \
      static_assert( (#v)[0]=='V' && (#v)[1]=='A' && (#v)[2]=='L' && (#v)[3]=='_', "Must use a 'VAL_xx' macro here" ); \
      static_assert( CHAR2HEX((#v)[4]) <= 0xf  , "'VAL_xx' macro with bad hex value" ); \
      static_assert( CHAR2HEX((#v)[5]) <= 0xf  , "'VAL_xx' macro with bad hex value" ); \
      static_assert( (#v)[06]==0, "'VAL_xx' macro format wrong" ); \
      _process_value( CHAR2HEX((#v)[4])<<4 | CHAR2HEX((#v)[5]) ); \
   } while(0)

This uses the value directly out of the macro name, so you can leave the value off entirely when defining the macro, right?  Yes.  But it goes further than that.  Since the above code only uses the stringification of the parameter, it never expands it.  That means it's perfectly happy if you never define the VAL_XX macros at all, which is probably not what you want.  Be sure that the wrapper macro expands the macro somewhere if you want to be sure it's actually a defined macro.

Conclusion


So if you've followed my other writing up to this point, you're probably expecting some clever hack to make this work in C.  Sorry, but not this time.  It would probably be relatively simple to add a compiler option or #pragma directive to make string constants literal in C, but gcc doesn't have this, and I'm not aware of any other compiler that does.  (Please comment if you know otherwise.)  There are plenty of tricks you could do if you're willing to use additional tools in your build process, like an additional step between the regular preprocessor and the compiler to look for extracting characters from string literals and convert them into character constants (and you could tell the compiler to use a wrapper script to do that as the preprocessor), but that's not likely to be an acceptable option.

You just can't do that in C.

Resources

This is the test file I used to be sure my above examples were correct: c_vs_cpp_example.c