Tag Archives: c++

String literals with embedded nulls in Boost

Boost’s string library makes global string replace in C++ easy:

std::string str = "$greeting, world!";
boost::replace_all(str, "$greeting", "Hello");
std::cout << str << "\n"; // prints "Hello, world!"

Suppose, however, that you want the search string to contain a null character. Why would you want such a thing? Consider escaping strings for safe inclusion in some contexts:

std::string str2 = get_string_from_remote_source();
boost::replace_all(str2, "\0", "(nul)");
// do something with str2 that depends on it not to contain null chars

(Remember that, unlike C strings, it’s perfectly valid for C++ standard library strings to contain NUL characters.)

Alas, the code above doesn’t work; the call to replace_all() doesn’t do anything. It turns out that when you give Boost a string literal, it uses strlen() to get the string’s length. Since strlen() works on C-style null-terminated strings, it stops at the first null character it sees, so the search string "\0" is treated as empty.
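To make the strlen() behavior concrete, here is a small standard-library sketch. It is not Boost's code; the helper names (c_string_length, from_c_string, from_buffer) are made up for illustration. It only shows how much of a literal survives when the length is measured with strlen() versus passed explicitly.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Length as strlen() sees it: counting stops at the first NUL.
inline std::size_t c_string_length(const char* s) {
    return std::strlen(s);
}

// std::string's const char* constructor also measures the length by
// scanning for the first NUL, so anything after an embedded NUL is lost.
inline std::string from_c_string(const char* s) {
    return std::string(s);
}

// With an explicit length, embedded NULs are kept.
inline std::string from_buffer(const char* s, std::size_t n) {
    return std::string(s, n);
}
```

With these, c_string_length("\0") is 0, which is why a search for "\0" degenerates into a search for nothing.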

Why did I expect Boost to behave differently? In C++, string literals are arrays of chars. With the help of some template magic, the Boost library could know the string’s length at compile time. It wouldn’t need to rely on functions like strlen() to compute the string’s length, so it could handle arbitrary string literals, including ones with embedded nulls.
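The "template magic" I had in mind looks roughly like the sketch below. The function names literal_length and literal_to_string are hypothetical, and this is what I expected a library could do, not what Boost actually does.

```cpp
#include <cstddef>
#include <string>

// N is deduced at compile time from the type of the array argument.
// For a string literal, N includes the implicit terminating NUL, so
// N - 1 is the number of characters written between the quotes.
template <std::size_t N>
std::size_t literal_length(const char (&)[N]) {
    return N - 1;
}

// Builds a std::string from a literal without calling strlen(),
// so embedded NULs are preserved.
template <std::size_t N>
std::string literal_to_string(const char (&literal)[N]) {
    return std::string(literal, N - 1);
}
```

Here literal_length("\0") is 1 and literal_to_string("a\0b") has three characters, the middle one being the null.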

After some thinking and googling, it became clear why Boost doesn’t work this way, or at least why this isn’t the only reasonable way. The reason is that Boost cannot tell the difference between string literals and other character arrays. Consider this case:

char search[80];
strcpy(search, "foo");
boost::replace_all(str, search, "bar");

We probably wouldn’t want replace_all() to look for the whole 80 character long string which the input array happens to contain, but only for the part initialized with a null-terminated string. Actually, this seems to be some sort of gray area. When a zero character appears inside a string literal, it certainly means that the programmer intended the character to be a part of the string. But when it appears inside another character array, it may or may not mark the end of the string.
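The ambiguity is easy to see in a sketch. The type system reports the same kind of thing for a literal and for a partially filled buffer: an array of N chars. The names array_extent, Lengths and measure_buffer below are hypothetical helpers for illustration.

```cpp
#include <cstddef>
#include <cstring>

// Size of the array as the type system sees it, terminator included.
template <std::size_t N>
std::size_t array_extent(const char (&)[N]) {
    return N;
}

struct Lengths {
    std::size_t extent;    // size of the whole array
    std::size_t c_length;  // length up to the first NUL
};

// Fills an 80-char buffer with a short string and reports both sizes.
inline Lengths measure_buffer() {
    char search[80];
    std::strcpy(search, "foo");
    Lengths result = { array_extent(search), std::strlen(search) };
    return result;
}
```

For the buffer, the extent is 80 while the C-string length is 3. A template that trusted the extent would search for 80 characters, most of them uninitialized.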

We need a way to tell Boost to treat a char array as an array instead of as a null-terminated string. To do this, wrap the array in a call to boost::as_array. For example:

char nullchar[] = {'\0'};
boost::replace_all(str2, boost::as_array(nullchar), "(nul)");

In fact, you can also pass a string literal to as_array, but remember that the corresponding array contains an (additional) terminating null character. So, returning to the original problem, to search for a single null character, use boost::as_array(""). Don’t use boost::as_array("\0"), as the latter array contains two null characters.
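The off-by-one between "" and "\0" can be checked without Boost. The sketch below is a standard-library stand-in, not Boost's implementation: whole_array mimics the idea of using every element of the array (terminator included), and replace_every is a plain find-and-replace loop. Both names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Treats the entire array, including the terminating NUL, as the string.
template <std::size_t N>
std::string whole_array(const char (&arr)[N]) {
    return std::string(arr, N);
}

// Replaces every occurrence of `search` in `text` and returns the result.
inline std::string replace_every(std::string text,
                                 const std::string& search,
                                 const std::string& replacement) {
    if (search.empty()) {
        return text;  // nothing to look for
    }
    std::size_t pos = 0;
    while ((pos = text.find(search, pos)) != std::string::npos) {
        text.replace(pos, search.size(), replacement);
        pos += replacement.size();  // continue after the inserted text
    }
    return text;
}
```

whole_array("") is one null character, so it matches each null in the input. whole_array("\0") is two nulls in a row, which matches only where the input happens to contain two consecutive nulls.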

CA2W found

Here’s a problem that was driving me crazy for a while. I got the following error trying to use ATL’s CA2W class in a C++ program:

error C3861: 'CA2W': identifier not found

Usually problems like this occur when you don’t include the correct header file. But in this case, I did include it (atlconv.h), exactly like the documentation says.

I made sure that CA2W is really defined in the header. I checked other things, but the result was always that the compiler should have seen the definition. Still the nefarious error message appeared. So why did the compiler pretend not to know this symbol? Finally I saw this line at the beginning of atlconv.h:

namespace ATL
{

D’oh!

So CA2W did exist, but only in the ATL namespace. I never noticed this before; it turns out that by default projects created with Visual C++ include atlbase.h which does “using namespace ATL”. Including this file solved the problem.
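If you would rather not pull in a whole namespace, the general C++ remedies are to qualify the name or to import just that one name. The sketch below is portable and uses a made-up namespace fake_atl with a made-up class Widen as a stand-in; with the real header the qualified name would presumably be ATL::CA2W, but I have not shown ATL code here.

```cpp
#include <string>

namespace fake_atl {
// Hypothetical stand-in for a conversion class declared inside a namespace.
class Widen {
public:
    explicit Widen(const char* s) : text_(s ? s : "") {}
    // Naive char-to-wchar_t widening; good enough for plain ASCII.
    std::wstring str() const {
        return std::wstring(text_.begin(), text_.end());
    }
private:
    std::string text_;
};
}  // namespace fake_atl

// Option 1: qualify the name explicitly.
inline std::wstring widen_qualified(const char* s) {
    return fake_atl::Widen(s).str();
}

// Option 2: import only the one name with a using-declaration.
inline std::wstring widen_with_using(const char* s) {
    using fake_atl::Widen;
    return Widen(s).str();
}
```

Writing plain Widen(s) outside the namespace without either option gives the same kind of "identifier not found" error as above.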

The documentation says nothing about this, of course. I couldn’t find any mention of the word “namespace” in there.

Incidentally, it looks like we’ll be seeing this problem a lot in the future, since the default has changed in Visual Studio 2010.

std::string is contiguous

You can safely assume that the memory buffer used by std::string is contiguous. Specifically, the address of the string’s first character can be used as the address for the whole string, just like a C-style char array:

std::string str = "foo";
strncpy(&str[0], "bar", 3); // str now contains "bar".

Why is this safe? The current C++ standard (C++03) doesn’t explicitly guarantee that the string is stored contiguously, but it is in all known implementations. Additionally, the next C++ standard (C++0x, since published as C++11) makes this guarantee. So the above usage is valid on all present and future C++ implementations.
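A quick way to convince yourself on a given implementation is to compare element addresses. This is a sketch with hypothetical helper names (is_contiguous, overwrite_prefix), not a proof of anything beyond the compiler you run it on.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Returns true if character i of s lives at address &s[0] + i for all i.
inline bool is_contiguous(const std::string& s) {
    if (s.empty()) {
        return true;  // nothing to check
    }
    const char* base = &s[0];
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (&s[i] != base + i) {
            return false;
        }
    }
    return true;
}

// Overwrites the first n characters of s through a raw pointer.
// The caller must make sure src points to at least n readable chars.
inline std::string overwrite_prefix(std::string s, const char* src,
                                    std::size_t n) {
    if (n > s.size()) {
        n = s.size();  // never write past the string's own storage
    }
    if (n != 0) {
        std::memcpy(&s[0], src, n);
    }
    return s;
}
```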

Why is this important? It’s common for functions, especially in the Windows API, to “return” strings by copying them into a buffer passed to the function. Since the memory buffer used in std::string is contiguous you can safely pass it to the function, after resizing the string to the correct size.

A typical usage for Windows API functions:

// get required buffer size
DWORD bufSize = 0;
GetComputerNameA(NULL, &bufSize);
if (bufSize == 0 || GetLastError() != ERROR_BUFFER_OVERFLOW) {
  throw std::runtime_error("GetComputerNameA failed");
}
// bufSize now contains required size of buffer, including null terminator
std::string buf(bufSize, '\0');
if (!GetComputerNameA(&buf[0], &bufSize)) {
  throw std::runtime_error("GetComputerNameA failed");
}
// bufSize now contains the number of characters copied, not including the null terminator
buf.resize(bufSize);
// now use buf as a regular std::string

This is cumbersome but actually easier than plain C code, since you don’t have to manage the memory yourself.
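The same "ask for the size, then fill the buffer" pattern can be tried on any platform with std::snprintf, which needs C++11. This is a portable stand-in for the Windows call, not a replacement for it; format_int is a hypothetical example function.

```cpp
#include <cstddef>
#include <cstdio>
#include <stdexcept>
#include <string>

inline std::string format_int(int value) {
    // First call: a null buffer of size 0 only reports the required
    // length, not counting the null terminator.
    const int needed = std::snprintf(nullptr, 0, "value=%d", value);
    if (needed < 0) {
        throw std::runtime_error("snprintf failed");
    }

    // Allocate room for the characters plus the terminator.
    std::string buf(static_cast<std::size_t>(needed) + 1, '\0');

    // Second call: write straight into the string's storage.
    const int written =
        std::snprintf(&buf[0], buf.size(), "value=%d", value);
    if (written != needed) {
        throw std::runtime_error("snprintf failed");
    }

    // Drop the terminator so size() matches the visible text.
    buf.resize(static_cast<std::size_t>(written));
    return buf;
}
```

The string is never empty when &buf[0] is taken, because it always has room for at least the terminator.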

Note that the expression &str[0] is valid only if str isn’t empty. Everything I’ve said also applies to std::wstring, the wide-character version of std::string.
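One way to keep the empty-string case from slipping through is a tiny helper that works for both std::string and std::wstring. This is a sketch; writable_buffer is a hypothetical name, and returning a null pointer for an empty string is just one possible convention.

```cpp
#include <string>

// Returns a pointer to the string's first character, or a null pointer
// if the string is empty and there is no character to point at.
template <typename Ch>
Ch* writable_buffer(std::basic_string<Ch>& s) {
    return s.empty() ? nullptr : &s[0];
}
```

Callers then check for a null pointer (or resize the string first) before handing the buffer to a C-style function.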
