Bug 92872
| Summary: | A smart util class that grabs latin1 string from a WTF::String | ||
|---|---|---|---|
| Product: | WebKit | Reporter: | Yong Li <yong.li.webkit> |
| Component: | Web Template Framework | Assignee: | Nobody <webkit-unassigned> |
| Status: | RESOLVED LATER | ||
| Severity: | Normal | CC: | ap, jkjiang, msaboff |
| Priority: | P3 | ||
| Version: | 528+ (Nightly build) | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
Yong Li
In many cases, we know a WTF::String is 99% a 8bit string internally, but to avoid crash in the 1%, we have to do this for best performance:
str.is8Bit() ? str.characters8() : static_cast<LChar>(str.latin1.data())
Not convenient at all. And most developer will just use str.latin1.data() when they need to get a const char*.
So how about adding a util class that either has a valid CString object or uses String's 8bit data?
class Latin1StringHolder {
public:
explicit Latin1StringHolder (String string)
{
if (string.is8Bit())
m_string = string;
else
m_latin1 = string.latin1();
}
const LChar* data() const { return m_string.isNull() ? static_cast<LChar>(m_latin1.data()) : m_string.characters8; }
size_t length() const { return m_string.isNull() ? m_latin1.length() : m_string.lenght(); }
private:
CString m_latin1;
String m_string;
};
| Attachments | ||
|---|---|---|
| Add attachment proposed patch, testcase, etc. |
Alexey Proskuryakov
Can CString be changed to use String's internal 8-bit buffer when possible without copying? Introducing a new class seems a little heavy-handed.
Michael Saboff
It seems to me that the only case you always want a const char* that is Latin 1 is for debugging or logging. If the string is a 16 bit string, then there is loss of data calling String::latin1().
For the performant cases of handling strings, you need to do the is8Bit() check and then handle the 8 and 16 bit cases separately.
Therefore use str.latin1.data() for those debug / logging cases.
Yong Li
(In reply to comment #2)
> It seems to me that the only case you always want a const char* that is Latin 1 is for debugging or logging. If the string is a 16 bit string, then there is loss of data calling String::latin1().
>
> For the performant cases of handling strings, you need to do the is8Bit() check and then handle the 8 and 16 bit cases separately.
>
> Therefore use str.latin1.data() for those debug / logging cases.
We also want to convert WTF::String to const char* when communicating with 3rd party libraries or port-specific API. For example, when forwarding a network request to libcurl, we have to convert a url from KURL to const char*. In most cases KURL has a 8-bit string internally, but it is not safe to assume that's always true.
(In reply to comment #1)
> Can CString be changed to use String's internal 8-bit buffer when possible without copying? Introducing a new class seems a little heavy-handed.
I think it is possible: let CString have a RefPtr<StringImpl> as a member, and it also has to make sure the 8-bit buffer has a null terminator...
Another idea: Add a non-const method ensure8Bit() to String, so developer can get a copy of the string, call this method, and then directly use characters8() without worrying about crashes.
Yong Li
>
> Another idea: Add a non-const method ensure8Bit() to String, so developer can get a copy of the string, call this method, and then directly use characters8() without worrying about crashes.
Actually it can be just one call: characters8MustBeThere(). the same way as we use String::charactersWithNullTermination().
Alexey Proskuryakov
URLs are one of the very few cases where you have strings that are guaranteed to fit in Latin-1 without data loss.
I agree with Michael that there doesn't seem to be a problem to address here. It's just one line of code in URL loading system, no need to over-engineer it.
Yong Li
OK. closing it for now until there is more reason to do so.
Yong Li
(In reply to comment #5)
> URLs are one of the very few cases where you have strings that are guaranteed to fit in Latin-1 without data loss.
>
> I agree with Michael that there doesn't seem to be a problem to address here. It's just one line of code in URL loading system, no need to over-engineer it.
How about making KURL always use CString internally?
Alexey Proskuryakov
A CString cannot hold invalid URLs, because those are not necessarily ASCII or even valid UTF-16. Not sure exactly how much fidelity we need for those, but it's tricky.