92872 – A smart util class that grabs latin1 string from a WTF::String

RESOLVED LATER 92872

A smart util class that grabs latin1 string from a WTF::String

https://bugs.webkit.org/show_bug.cgi?id=92872

Summary A smart util class that grabs latin1 string from a WTF::String

Yong Li

Reported 2012-08-01 07:30:16 PDT

In many cases, we know a WTF::String is 99% a 8bit string internally, but to avoid crash in the 1%, we have to do this for best performance: str.is8Bit() ? str.characters8() : static_cast<LChar>(str.latin1.data()) Not convenient at all. And most developer will just use str.latin1.data() when they need to get a const char*. So how about adding a util class that either has a valid CString object or uses String's 8bit data? class Latin1StringHolder { public: explicit Latin1StringHolder (String string) { if (string.is8Bit()) m_string = string; else m_latin1 = string.latin1(); } const LChar* data() const { return m_string.isNull() ? static_cast<LChar>(m_latin1.data()) : m_string.characters8; } size_t length() const { return m_string.isNull() ? m_latin1.length() : m_string.lenght(); } private: CString m_latin1; String m_string; };

Attachments
Add attachment proposed patch, testcase, etc.

Alexey Proskuryakov

Comment 1 2012-08-01 10:56:09 PDT

Can CString be changed to use String's internal 8-bit buffer when possible without copying? Introducing a new class seems a little heavy-handed.

Michael Saboff

Comment 2 2012-08-01 11:06:39 PDT

It seems to me that the only case you always want a const char* that is Latin 1 is for debugging or logging. If the string is a 16 bit string, then there is loss of data calling String::latin1(). For the performant cases of handling strings, you need to do the is8Bit() check and then handle the 8 and 16 bit cases separately. Therefore use str.latin1.data() for those debug / logging cases.

Yong Li

Comment 3 2012-08-01 11:35:20 PDT

(In reply to comment #2) > It seems to me that the only case you always want a const char* that is Latin 1 is for debugging or logging. If the string is a 16 bit string, then there is loss of data calling String::latin1(). > > For the performant cases of handling strings, you need to do the is8Bit() check and then handle the 8 and 16 bit cases separately. > > Therefore use str.latin1.data() for those debug / logging cases. We also want to convert WTF::String to const char* when communicating with 3rd party libraries or port-specific API. For example, when forwarding a network request to libcurl, we have to convert a url from KURL to const char*. In most cases KURL has a 8-bit string internally, but it is not safe to assume that's always true. (In reply to comment #1) > Can CString be changed to use String's internal 8-bit buffer when possible without copying? Introducing a new class seems a little heavy-handed. I think it is possible: let CString have a RefPtr<StringImpl> as a member, and it also has to make sure the 8-bit buffer has a null terminator... Another idea: Add a non-const method ensure8Bit() to String, so developer can get a copy of the string, call this method, and then directly use characters8() without worrying about crashes.

Yong Li

Comment 4 2012-08-01 12:00:59 PDT

> > Another idea: Add a non-const method ensure8Bit() to String, so developer can get a copy of the string, call this method, and then directly use characters8() without worrying about crashes. Actually it can be just one call: characters8MustBeThere(). the same way as we use String::charactersWithNullTermination().

Alexey Proskuryakov

Comment 5 2012-08-01 12:41:51 PDT

URLs are one of the very few cases where you have strings that are guaranteed to fit in Latin-1 without data loss. I agree with Michael that there doesn't seem to be a problem to address here. It's just one line of code in URL loading system, no need to over-engineer it.

Yong Li

Comment 6 2012-08-01 12:44:24 PDT

OK. closing it for now until there is more reason to do so.

Yong Li

Comment 7 2012-08-03 07:07:35 PDT

(In reply to comment #5) > URLs are one of the very few cases where you have strings that are guaranteed to fit in Latin-1 without data loss. > > I agree with Michael that there doesn't seem to be a problem to address here. It's just one line of code in URL loading system, no need to over-engineer it. How about making KURL always use CString internally?

Alexey Proskuryakov

Comment 8 2012-08-03 10:31:37 PDT

A CString cannot hold invalid URLs, because those are not necessarily ASCII or even valid UTF-16. Not sure exactly how much fidelity we need for those, but it's tricky.

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution LATER

Priority P3

Severity Normal

Classification Unclassified

Version 528+ (Nightly build)

Hardware Unspecified

OS Unspecified

Product WebKit

Component Web Template Framework

Assignee

Nobody

Reported

2012-08-01 07:30 PDT

Modified

2012-08-03 10:31 PDT History

CC List

3 users Show

URL

Keywords

Depends on

Blocks