Bug 15953

Summary:

Add UTF-8 encoding/decoding to WTF

Product:

WebKit

Reporter:

Alexey Proskuryakov <ap>

Component:

Web Template Framework

Assignee:

Alexey Proskuryakov <ap>

Status:

RESOLVED FIXED

Severity:

Normal

CC:

darin

Priority:

Version:

528+ (Nightly build)

Hardware:

Mac

OS:

OS X 10.4

Attachments:

Description	Flags
proposed patch	darin: review+

Alexey Proskuryakov

Reported 2007-11-12 04:59:29 PST

We have to handle UTF-8 in several places in JavaScriptCore; wtf/unicode looks like a good place to keep this code. Patch forthcoming.

Attachments
proposed patch (35.06 KB, patch) 2007-11-12 05:13 PST, Alexey Proskuryakov	darin: review+

Alexey Proskuryakov

Comment 1 2007-11-12 05:13:13 PST

Created attachment 17201 proposed patch

Darin Adler

Comment 2 2007-11-12 21:25:03 PST

Comment on attachment 17201 proposed patch

What exactly does JSStringCreateWithUTF8CString do if passed invalid UTF-8? What should it do?

> 268 * This method should only be used for *debugging* purposes as it
> 269 * is not Unicode safe.

Perhaps that's overstating the case -- might there be some circumstances where you know it's an all-ASCII UString?

Should we add a UTF-8 text decoder to WebCore that uses this? Maybe if we did we could get rid of the simple/complex system for ICU, since we could handle the most common encodings without creating much of a text encoding registry at all. And perhaps we could change functions that convert to UTF-8 in WebCore::String to not use the registry.

Looks good, r=me

Alexey Proskuryakov

Comment 3 2007-11-12 23:13:51 PST

Committed revision 27746.

(In reply to comment #2)
> (From update of attachment 17201)
> What exactly does JSStringCreateWithUTF8CString do if passed invalid UTF-8?

I have now copied a comment describing strict/lenient modes to UTF8.h. For lenient mode as used by JSStringCreateWithUTF8CString(), it's:
- both irregular sequences and isolated surrogates are converted;
- illegal sequences will cause an error, and the result will be truncated to the first error position (this includes overlong forms);
- characters over 0x10FFFF are converted to replacement character.

> What should it do?

This may or may not be what it should do, I'm not sure.

> 268 * This method should only be used for *debugging* purposes as it
> 269 * is not Unicode safe.
>
> Perhaps that's overstating the case -- might there be some circumstances where
> you know it's an all-ASCII UString?

I have re-worded the comment.
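For readers following along, the three lenient-mode rules listed in this comment can be sketched roughly as below. This is not the code from UTF8.h: the names `LenientResult` and `decodeUTF8LenientSketch` are hypothetical, and treating lead bytes 0xF8 and above as illegal is an assumption of the sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical result type: the real API reports errors differently.
struct LenientResult {
    std::u16string text;      // UTF-16 code units decoded before the first error
    bool ok = true;           // false once an illegal sequence is found
    std::size_t errorPos = 0; // byte offset of the illegal sequence (valid when !ok)
};

LenientResult decodeUTF8LenientSketch(const std::string& in)
{
    LenientResult r;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;        // 0 means "illegal lead byte"
        std::uint32_t cp = 0;
        std::uint32_t minValue = 0; // smallest value that legitimately needs `len` bytes
        if (lead < 0x80)                { len = 1; cp = lead;        minValue = 0; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; minValue = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; minValue = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; minValue = 0x10000; }

        bool illegal = (len == 0) || (i + len > in.size());
        for (std::size_t k = 1; !illegal && k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                illegal = true;     // not a continuation byte
            else
                cp = (cp << 6) | (b & 0x3F);
        }
        // Rule 2: illegal sequences (including overlong forms) stop decoding;
        // the output keeps only what preceded the error.
        if (illegal || cp < minValue) {
            r.ok = false;
            r.errorPos = i;
            return r;
        }
        if (cp > 0x10FFFF) {
            // Rule 3: out-of-range values become U+FFFD.
            r.text.push_back(static_cast<char16_t>(0xFFFD));
        } else if (cp >= 0x10000) {
            const std::uint32_t v = cp - 0x10000;
            r.text.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
            r.text.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
        } else {
            // Rule 1: surrogate values U+D800..U+DFFF pass through unchanged,
            // so isolated surrogates and six-byte pairs are both accepted.
            r.text.push_back(static_cast<char16_t>(cp));
        }
        i += len;
    }
    return r;
}
```

With this sketch, the input `"a\xC0\xAF"` (an overlong form) yields the text `a` with `ok == false` and `errorPos == 1`, which matches the "truncated to the first error position" behavior described above.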

Geoffrey Garen

Comment 4 2007-11-13 00:03:55 PST

Yay!

Darin Adler

Comment 5 2007-11-13 09:10:50 PST

(In reply to comment #3)
> - both irregular sequences and isolated surrogates are converted;

What are irregular sequences? You mean things like the sequence for U+FFFE?

Seems OK for isolated surrogates, but also not necessarily useful. Perhaps they should be treated as illegal sequences.

> - illegal sequences will cause an error, and the result will be truncated to the first error position (this includes overlong forms);

I think it's a problem that this error condition can't be detected by the caller to JSStringCreateWithUTF8CString; I think it would be better to return 0 when there's an error. For APIs where there's both an error indication and a string, I think returning the truncated string arguably might be useful. For some clients, it would be good to be able to report where the error was (say, on the JavaScript console).

> - characters over 0x10FFFF are converted to replacement character.

I would prefer that characters over 0x10FFFF be handled the same way as other illegal sequences.

Alexey Proskuryakov

Comment 6 2007-11-13 10:11:10 PST

(In reply to comment #5)
> > - both irregular sequences and isolated surrogates are converted;
>
> What are irregular sequences? You mean things like the sequence for U+FFFE?

An irregular UTF-8 code unit sequence is a six-byte sequence where the first three bytes correspond to a high surrogate, and the next three bytes correspond to a low surrogate.

We can change the function to use strict decoding, and to return 0 if an error is detected.
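To make the definition of an irregular sequence concrete, here is a small sketch comparing the regular four-byte encoding of a supplementary character with the six-byte surrogate-pair form. The helper names (`encodeRegular`, `encodeIrregular`, `hasEncodedSurrogate`) are hypothetical and do not come from the patch; the last one only illustrates the kind of check a strict decoder might apply.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Regular UTF-8: one four-byte sequence for a code point in U+10000..U+10FFFF.
std::string encodeRegular(std::uint32_t cp)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    std::string out;
    out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    return out;
}

// Irregular form: split into a UTF-16 surrogate pair, then encode each
// surrogate as if it were an ordinary three-byte code point.
std::string encodeIrregular(std::uint32_t cp)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    const std::uint32_t v = cp - 0x10000;
    const std::uint32_t units[2] = { 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF) };
    std::string out;
    for (std::uint32_t u : units) {
        out.push_back(static_cast<char>(0xE0 | (u >> 12)));
        out.push_back(static_cast<char>(0x80 | ((u >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (u & 0x3F)));
    }
    return out;
}

// A lead byte 0xED followed by 0xA0..0xBF encodes a value in U+D800..U+DFFF.
// A strict decoder would reject this; a lenient one accepts it.
bool hasEncodedSurrogate(const std::string& s)
{
    for (std::size_t i = 0; i + 1 < s.size(); ++i) {
        const unsigned char a = static_cast<unsigned char>(s[i]);
        const unsigned char b = static_cast<unsigned char>(s[i + 1]);
        if (a == 0xED && b >= 0xA0 && b <= 0xBF)
            return true;
    }
    return false;
}
```

For U+1F600, `encodeRegular` produces the bytes F0 9F 98 80, while `encodeIrregular` produces ED A0 BD ED B8 80. Only the second form trips `hasEncodedSurrogate`.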

Alexey Proskuryakov

Comment 7 2007-11-14 00:54:02 PST

Filed bug 15982.
