15953 – Add UTF-8 encoding/decoding to WTF

https://bugs.webkit.org/show_bug.cgi?id=15953

Alexey Proskuryakov

Reported 2007-11-12 04:59:29 PST

We have to handle UTF-8 in several places in JavaScriptCore, and wtf/unicode looks like a good place to keep this code. Patch forthcoming.

Attachments
proposed patch (35.06 KB, patch), 2007-11-12 05:13 PST, Alexey Proskuryakov, darin: review+

Alexey Proskuryakov

Comment 1 2007-11-12 05:13:13 PST

Created attachment 17201: proposed patch

Darin Adler

Comment 2 2007-11-12 21:25:03 PST

Comment on attachment 17201: proposed patch

What exactly does JSStringCreateWithUTF8CString do if passed invalid UTF-8? What should it do?

> 268 * This method should only be used for *debugging* purposes as it
> 269 * is not Unicode safe.

Perhaps that's overstating the case -- might there be some circumstances where you know it's an all-ASCII UString?

Should we add a UTF-8 text decoder to WebCore that uses this? Maybe if we did we could get rid of the simple/complex system for ICU, since we could handle the most common encodings without creating much of a text encoding registry at all. And perhaps we could change functions that convert to UTF-8 in WebCore::String to not use the registry.

Looks good, r=me

Alexey Proskuryakov

Comment 3 2007-11-12 23:13:51 PST

Committed revision 27746.

(In reply to comment #2)
> (From update of attachment 17201)
> What exactly does JSStringCreateWithUTF8CString do if passed invalid UTF-8?

I have now copied a comment describing strict/lenient modes to UTF8.h. For lenient mode, as used by JSStringCreateWithUTF8CString(), it's:
- both irregular sequences and isolated surrogates are converted;
- illegal sequences will cause an error, and the result will be truncated to the first error position (this includes overlong forms);
- characters over 0x10FFFF are converted to replacement character.

> What should it do?

This may or may not be what it should do, I'm not sure.

> 268 * This method should only be used for *debugging* purposes as it
> 269 * is not Unicode safe.
>
> Perhaps that's overstating the case -- might there be some circumstances where
> you know it's an all-ASCII UString?

I have re-worded the comment.
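The lenient-mode categories described in comment 3 can be illustrated with a small classifier for a single UTF-8 sequence. This is only a sketch under stated assumptions, not the code in UTF8.h: `Utf8Kind` and `classifyUtf8` are hypothetical names, and five- and six-byte lead bytes (0xF8 and above) are simply treated as malformed here.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical categories, named after the terms used in the discussion.
enum class Utf8Kind { Valid, Overlong, Surrogate, OutOfRange, Malformed };

// Classifies the single UTF-8 sequence that starts at s (len bytes available).
Utf8Kind classifyUtf8(const unsigned char* s, std::size_t len)
{
    if (len == 0)
        return Utf8Kind::Malformed;
    unsigned char b0 = s[0];
    if (b0 < 0x80)
        return Utf8Kind::Valid; // plain ASCII

    std::size_t n = 0;      // total length of the sequence
    std::uint32_t cp = 0;   // code point accumulator
    std::uint32_t min = 0;  // smallest code point that needs n bytes
    if ((b0 & 0xE0) == 0xC0) { n = 2; cp = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { n = 3; cp = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { n = 4; cp = b0 & 0x07; min = 0x10000; }
    else return Utf8Kind::Malformed; // stray continuation byte or 0xF8..0xFF

    if (len < n)
        return Utf8Kind::Malformed; // truncated sequence
    for (std::size_t i = 1; i < n; ++i) {
        if ((s[i] & 0xC0) != 0x80)
            return Utf8Kind::Malformed; // not a continuation byte
        cp = (cp << 6) | (s[i] & 0x3Fu);
    }

    if (cp < min)
        return Utf8Kind::Overlong;   // e.g. C0 80 for U+0000
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return Utf8Kind::Surrogate;  // isolated surrogate half
    if (cp > 0x10FFFF)
        return Utf8Kind::OutOfRange; // beyond the Unicode range
    return Utf8Kind::Valid;
}
```

Under the lenient rules quoted above, `Overlong` and `Malformed` would stop the conversion at that position, `Surrogate` would be converted as-is, and `OutOfRange` would become the replacement character U+FFFD.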

Geoffrey Garen

Comment 4 2007-11-13 00:03:55 PST

Yay!

Darin Adler

Comment 5 2007-11-13 09:10:50 PST

(In reply to comment #3)
> - both irregular sequences and isolated surrogates are converted;

What are irregular sequences? You mean things like the sequence for U+FFFE?

Seems OK for isolated surrogates, but also not necessarily useful. Perhaps they should be treated as illegal sequences.

> - illegal sequences will cause an error, and the result will be truncated to the first error position (this includes overlong forms);

I think it's a problem that this error condition can't be detected by the caller to JSStringCreateWithUTF8CString; I think it would be better to return 0 when there's an error.

For APIs where there's both an error indication and a string, I think returning the truncated string arguably might be useful. For some clients, it would be good to be able to report where the error was (say, on the JavaScript console).

> - characters over 0x10FFFF are converted to replacement character.

I would prefer that characters over 0x10FFFF be handled the same way as other illegal sequences.
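The interface shape suggested in comment 5 (an error indication, the truncated string, and the position of the error) might look roughly like the sketch below. `DecodeResult` and `decodeUtf8Strict` are hypothetical names rather than WTF or JavaScriptCore API, and `std::string` / `std::u16string` stand in for WebKit's own string types. The sketch treats surrogates and values over 0x10FFFF as illegal, following the preference stated above.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical result type: error flag, error offset, and decoded prefix.
struct DecodeResult {
    bool ok;               // false if an illegal sequence was found
    std::size_t errorPos;  // byte offset of the first error (meaningful only if !ok)
    std::u16string text;   // decoded UTF-16, truncated at the error when !ok
};

DecodeResult decodeUtf8Strict(const std::string& in)
{
    DecodeResult r{true, 0, {}};
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::size_t n = 0;              // sequence length; 0 means bad lead byte
        std::uint32_t cp = 0, min = 0;  // code point and minimum for this length
        if (b0 < 0x80) { n = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { n = 2; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { n = 3; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { n = 4; cp = b0 & 0x07; min = 0x10000; }

        bool bad = (n == 0) || (i + n > in.size());
        for (std::size_t k = 1; !bad && k < n; ++k) {
            unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                bad = true;
            else
                cp = (cp << 6) | (b & 0x3Fu);
        }
        // Strict mode: overlong forms, surrogates and values past U+10FFFF all fail.
        if (bad || cp < min || (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) {
            r.ok = false;
            r.errorPos = i;
            return r;
        }
        if (cp < 0x10000) {
            r.text.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000; // split into a UTF-16 surrogate pair
            r.text.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            r.text.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += n;
    }
    return r;
}
```

A caller that only wants a yes/no answer can check `ok` and discard the rest, while a client such as a console could report `errorPos` alongside the truncated `text`.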

Alexey Proskuryakov

Comment 6 2007-11-13 10:11:10 PST

(In reply to comment #5)
> > - both irregular sequences and isolated surrogates are converted;
>
> What are irregular sequences? You mean things like the sequence for U+FFFE?

An irregular UTF-8 code unit sequence is a six-byte sequence where the first three bytes correspond to a high surrogate, and the next three bytes correspond to a low surrogate.

We can change the function to use strict decoding, and to return 0 if an error is detected.
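To make the definition of an irregular sequence concrete, the sketch below combines such a six-byte sequence into the supplementary code point it stands for. The helper names `decode3` and `decodeIrregular` are hypothetical, and the code assumes the input already consists of two structurally valid three-byte sequences.

```cpp
#include <cstdint>

// Decodes one three-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx)
// without validating the lead or continuation bytes.
std::uint32_t decode3(const unsigned char* s)
{
    return ((s[0] & 0x0Fu) << 12) | ((s[1] & 0x3Fu) << 6) | (s[2] & 0x3Fu);
}

// Returns the code point encoded by a six-byte irregular sequence,
// or 0 if the two halves are not a high surrogate followed by a low one.
std::uint32_t decodeIrregular(const unsigned char* s)
{
    std::uint32_t hi = decode3(s);
    std::uint32_t lo = decode3(s + 3);
    if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF)
        return 0;
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
}
```

For example, U+1F600 is regularly encoded as the four bytes F0 9F 98 80, while the irregular form is ED A0 BD ED B8 80 (the surrogates U+D83D and U+DE00 encoded separately). A strict decoder would reject the second form.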

Alexey Proskuryakov

Comment 7 2007-11-14 00:54:02 PST

Filed bug 15982.

Status: RESOLVED
Resolution: FIXED
Priority: P2
Severity: Normal
Classification: Unclassified
Version: 528+ (Nightly build)
Hardware: Mac
OS: OS X 10.4
Product: WebKit
Component: Web Template Framework
Assignee: Alexey Proskuryakov
Reported: 2007-11-12 04:59 PST
Modified: 2007-11-14 00:54 PST
