CString is a class that we use to guarantee having a null character at the end, and to simplify memory management of C strings.
In the past, we had multiple bugs caused by its confusing handling of embedded null characters. CString doesn't enforce the no nulls policy, but some of its methods, and also some of its callers don't work with embedded nulls.
One example is bug 144257, another is bug 55412, and I remember something very similar to the latter occurring on Windows.
I think that conceptually, CString should not allow null bytes, and when we need actual data arrays, we should use Vector<uint8_t>. It does not make sense to use a class that ensures a trailing null with data that has embedded nulls.
However, I'm not sure how we would use CString in conjunction with String, as Strings can have null bytes. But then again, we probably have a serious bug in every place where we convert a String to a CString using e.g. latin1() or utf8(), so revisiting all these places would make sense.
Darin noted this in bug 144257:
CString functions that work properly with embedded null characters:
== (when both sides are CString)
Functions that do not:
== (when one side is a CString and the other is a const char*)
It’s clear that we should either forbid the embedded null characters entirely, or make the functions all support them.
One clarification: In bug 144257, the issue is only that there is an assertion that there are not null characters. The code both before and after that bug fix properly handles strings with null characters. The bug was simply that the assertion fired.
Thank you for opening a new bug and great analysis!
So what direction do you prefer? In both directions, I think the assertion in the original issue should be removed because
1. If CString should not contain a null character, this should be asserted in CString side instead of SHA1::addBytes.
2. If CString can contain a null character, this assertion becomes incorrect.
Yes, I agree with you about the assertion, which is why I pushed you to remove the it.
I think Alexey would prefer that we move away from CString for any case where we want to allow null characters and use CString only for cases where we do not want to allow them. I think I agree; it’s messy to null-terminate a string if we know that there might be a null character inside it!
I’m not sure how we are going to find all those call sites that need to handle null characters before we remove support from CString. I’m also not sure that I prefer we use Vector<char>, Vector<uint8_t> or something else that doesn’t waste extra memory on capacity, since we won’t be resizing these once they are created. Also note that CString implements a shared handle to a string. Good because you can share without copying. Bad because we always pay the overhead of a pointer, extra memory block, and reference counting. I could even imagine creating a class that shares CStringBuffer with CString, but doesn’t do the null termination thing rather than using Vector. Really not sure.
I think WTF::String::utf8 should fail when there is a null character, returning a null CString as it does when it fails due to other kinds of conversion errors.
The WTF::String::latin1 and WTF::String::ascii functions are already OK in this respect. They may have other problems, but they never produce a string with a null character in it.
I believe TextCodec::encode may produce null characters, so it is probably one of the call sites that needs to change.