RESOLVED FIXED 15953
Add UTF-8 encoding/decoding to WTF
https://bugs.webkit.org/show_bug.cgi?id=15953
Alexey Proskuryakov
Reported 2007-11-12 04:59:29 PST
We have to handle UTF-8 in several places in JavaScriptCore, and wtf/unicode looks like a good place to keep this code. Patch forthcoming.
Attachments
proposed patch (35.06 KB, patch)
2007-11-12 05:13 PST, Alexey Proskuryakov
darin: review+
Alexey Proskuryakov
Comment 1 2007-11-12 05:13:13 PST
Created attachment 17201: proposed patch
Darin Adler
Comment 2 2007-11-12 21:25:03 PST
Comment on attachment 17201: proposed patch

What exactly does JSStringCreateWithUTF8CString do if passed invalid UTF-8? What should it do?

> 268 * This method should only be used for *debugging* purposes as it
> 269 * is not Unicode safe.

Perhaps that's overstating the case -- might there be some circumstances where you know it's an all-ASCII UString?

Should we add a UTF-8 text decoder to WebCore that uses this? Maybe if we did we could get rid of the simple/complex system for ICU, since we could handle the most common encodings without creating much of a text encoding registry at all. And perhaps we could change functions that convert to UTF-8 in WebCore::String to not use the registry.

Looks good, r=me
Alexey Proskuryakov
Comment 3 2007-11-12 23:13:51 PST
Committed revision 27746.

(In reply to comment #2)
> (From update of attachment 17201)
> What exactly does JSStringCreateWithUTF8CString do if passed invalid UTF-8?

I have now copied a comment describing strict/lenient modes to UTF8.h. For lenient mode as used by JSStringCreateWithUTF8CString(), it's:
- both irregular sequences and isolated surrogates are converted;
- illegal sequences will cause an error, and the result will be truncated to the first error position (this includes overlong forms);
- characters over 0x10FFFF are converted to replacement character.

> What should it do?

This may or may not be what it should do, I'm not sure.

> 268 * This method should only be used for *debugging* purposes as it
> 269 * is not Unicode safe.
>
> Perhaps that's overstating the case -- might there be some circumstances where
> you know it's an all-ASCII UString?

I have re-worded the comment.
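For readers following along, the three lenient-mode rules listed in comment 3 could be sketched roughly as below. This is only an illustration written from that description, not the code that landed in UTF8.h; the names `LenientResult` and `decodeLenient` are hypothetical, and the sketch assumes a NUL-terminated input and standard C++11.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical result type for this sketch (not part of WTF).
struct LenientResult {
    std::u16string text;     // everything decoded before the first error
    std::size_t errorOffset; // byte offset of the first illegal sequence
    bool ok;                 // false if an illegal sequence was found
};

// Decodes a NUL-terminated UTF-8 string using the lenient rules from
// comment 3. An illustration only, not the WTF implementation.
LenientResult decodeLenient(const char* utf8)
{
    assert(utf8);
    const std::size_t size = std::strlen(utf8);
    LenientResult result{ std::u16string(), size, true };

    std::size_t i = 0;
    while (i < size) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::size_t length = 0;     // stays 0 for an illegal lead byte
        std::uint32_t codePoint = 0;
        std::uint32_t smallest = 0; // smallest value this length may encode

        if (lead < 0x80) {
            length = 1; codePoint = lead;
        } else if ((lead & 0xE0) == 0xC0) {
            length = 2; codePoint = lead & 0x1F; smallest = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            length = 3; codePoint = lead & 0x0F; smallest = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            length = 4; codePoint = lead & 0x07; smallest = 0x10000;
        }

        // Illegal: stray continuation byte, 5/6-byte lead, truncated input,
        // or a trailing byte that is not a continuation byte.
        bool legal = length != 0 && size - i >= length;
        for (std::size_t k = 1; legal && k < length; ++k) {
            const unsigned char next = static_cast<unsigned char>(utf8[i + k]);
            legal = (next & 0xC0) == 0x80;
            codePoint = (codePoint << 6) | (next & 0x3F);
        }
        // Illegal: overlong form.
        if (legal && codePoint < smallest)
            legal = false;
        if (!legal) {
            result.ok = false;
            result.errorOffset = i;
            return result; // output is truncated at the first error
        }

        if (codePoint > 0x10FFFF) {
            // Out-of-range values become the replacement character.
            result.text.push_back(static_cast<char16_t>(0xFFFD));
        } else if (codePoint >= 0x10000) {
            codePoint -= 0x10000;
            result.text.push_back(static_cast<char16_t>(0xD800 + (codePoint >> 10)));
            result.text.push_back(static_cast<char16_t>(0xDC00 + (codePoint & 0x3FF)));
        } else {
            // Includes U+D800..U+DFFF: isolated surrogates and the halves of
            // an irregular sequence pass through unchanged.
            result.text.push_back(static_cast<char16_t>(codePoint));
        }
        i += length;
    }
    return result;
}
```

With this sketch, `decodeLenient("ab\xC0\x80" "cd")` stops at the overlong form and keeps only `ab`, which is the truncation behavior discussed in the following comments.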
Geoffrey Garen
Comment 4 2007-11-13 00:03:55 PST
Yay!
Darin Adler
Comment 5 2007-11-13 09:10:50 PST
(In reply to comment #3)
> - both irregular sequences and isolated surrogates are converted;

What are irregular sequences? You mean things like the sequence for U+FFFE? Seems OK for isolated surrogates, but also not necessarily useful. Perhaps they should be treated as illegal sequences.

> - illegal sequences will cause an error, and the result will be truncated to the first error position (this includes overlong forms);

I think it's a problem that this error condition can't be detected by the caller to JSStringCreateWithUTF8CString; I think it would be better to return 0 when there's an error. For APIs where there's both an error indication and a string, I think returning the truncated string arguably might be useful. For some clients, it would be good to be able to report where the error was (say, on the JavaScript console).

> - characters over 0x10FFFF are converted to replacement character.

I would prefer that characters over 0x10FFFF be handled the same way as other illegal sequences.
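The behavior proposed in comment 5 (a null result on any error, plus a way to report where the error was) might look something like the sketch below. The names `StrictResult` and `createStringStrict` are hypothetical and are not the JavaScriptCore API; the sketch treats surrogates and values above 0x10FFFF as errors, as suggested above, and assumes a NUL-terminated input.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <string>
#include <utility>

// Hypothetical result type: a null string signals an error, and errorOffset
// tells the caller where it occurred (for example, to log it on a console).
struct StrictResult {
    std::unique_ptr<std::u16string> string; // null if the input was invalid
    std::size_t errorOffset;                // input length when there is no error
};

// Strict decoding sketch: any invalid input yields a null string.
StrictResult createStringStrict(const char* utf8)
{
    assert(utf8);
    const std::size_t size = std::strlen(utf8);
    std::unique_ptr<std::u16string> text(new std::u16string);

    for (std::size_t i = 0; i < size;) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::size_t length = 0;
        std::uint32_t codePoint = 0;
        std::uint32_t smallest = 0; // smallest value this length may encode

        if (lead < 0x80) {
            length = 1; codePoint = lead;
        } else if ((lead & 0xE0) == 0xC0) {
            length = 2; codePoint = lead & 0x1F; smallest = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            length = 3; codePoint = lead & 0x0F; smallest = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            length = 4; codePoint = lead & 0x07; smallest = 0x10000;
        }

        bool valid = length != 0 && size - i >= length;
        for (std::size_t k = 1; valid && k < length; ++k) {
            const unsigned char next = static_cast<unsigned char>(utf8[i + k]);
            valid = (next & 0xC0) == 0x80;
            codePoint = (codePoint << 6) | (next & 0x3F);
        }
        // Strict mode: overlong forms, surrogates, and values above U+10FFFF
        // are all rejected in the same way.
        const bool isSurrogate = codePoint >= 0xD800 && codePoint <= 0xDFFF;
        if (valid && (codePoint < smallest || codePoint > 0x10FFFF || isSurrogate))
            valid = false;
        if (!valid)
            return StrictResult{ nullptr, i };

        if (codePoint >= 0x10000) {
            codePoint -= 0x10000;
            text->push_back(static_cast<char16_t>(0xD800 + (codePoint >> 10)));
            text->push_back(static_cast<char16_t>(0xDC00 + (codePoint & 0x3FF)));
        } else {
            text->push_back(static_cast<char16_t>(codePoint));
        }
        i += length;
    }
    return StrictResult{ std::move(text), size };
}
```

A caller would check `result.string` for null before using it and could include `result.errorOffset` in a diagnostic message. Whether the real API should also expose the truncated string is left open here, as it is in the comment.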
Alexey Proskuryakov
Comment 6 2007-11-13 10:11:10 PST
(In reply to comment #5)
> > - both irregular sequences and isolated surrogates are converted;
>
> What are irregular sequences? You mean things like the sequence for U+FFFE?

An irregular UTF-8 code unit sequence is a six-byte sequence where the first three bytes correspond to a high surrogate, and the next three bytes correspond to a low surrogate.

We can change the function to use strict decoding, and to return 0 if an error is detected.
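To make the definition of an irregular sequence in comment 6 concrete, here is a small sketch that recognizes one and computes the code point it stands for. The function names are hypothetical and chosen for this illustration only. The example input `ED A0 B4 ED B4 9E` is the surrogate pair D834 DD1E written as two three-byte sequences; the regular UTF-8 form of the same character (U+1D11E) is the four bytes `F0 9D 84 9E`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// True if the three bytes at `offset` have the shape 1110xxxx 10xxxxxx 10xxxxxx.
// The caller must ensure that three bytes are available.
bool isThreeByteSequence(const std::string& bytes, std::size_t offset)
{
    const unsigned char b0 = static_cast<unsigned char>(bytes[offset]);
    const unsigned char b1 = static_cast<unsigned char>(bytes[offset + 1]);
    const unsigned char b2 = static_cast<unsigned char>(bytes[offset + 2]);
    return (b0 & 0xF0) == 0xE0 && (b1 & 0xC0) == 0x80 && (b2 & 0xC0) == 0x80;
}

// Extracts the 16-bit value carried by a three-byte sequence at `offset`.
std::uint32_t valueOfThreeBytes(const std::string& bytes, std::size_t offset)
{
    const std::uint32_t b0 = static_cast<unsigned char>(bytes[offset]);
    const std::uint32_t b1 = static_cast<unsigned char>(bytes[offset + 1]);
    const std::uint32_t b2 = static_cast<unsigned char>(bytes[offset + 2]);
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);
}

// True if `bytes` begins with six bytes that encode a high surrogate
// followed by a low surrogate, each as its own three-byte sequence.
bool startsWithIrregularSequence(const std::string& bytes)
{
    if (bytes.size() < 6 || !isThreeByteSequence(bytes, 0) || !isThreeByteSequence(bytes, 3))
        return false;
    const std::uint32_t high = valueOfThreeBytes(bytes, 0);
    const std::uint32_t low = valueOfThreeBytes(bytes, 3);
    return high >= 0xD800 && high <= 0xDBFF && low >= 0xDC00 && low <= 0xDFFF;
}

// Combines the two surrogates of an irregular sequence into one code point.
std::uint32_t codePointOfIrregularSequence(const std::string& bytes)
{
    assert(startsWithIrregularSequence(bytes));
    const std::uint32_t high = valueOfThreeBytes(bytes, 0);
    const std::uint32_t low = valueOfThreeBytes(bytes, 3);
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}
```

A lenient decoder accepts such input and ends up with a valid UTF-16 surrogate pair, while a strict decoder rejects it because each three-byte half encodes a surrogate on its own.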
Alexey Proskuryakov
Comment 7 2007-11-14 00:54:02 PST
Filed bug 15982.