Summary: | Add UTF-8 encoding/decoding to WTF
---|---|---|---
Product: | WebKit | Reporter: | Alexey Proskuryakov <ap>
Component: | Web Template Framework | Assignee: | Alexey Proskuryakov <ap>
Status: | RESOLVED FIXED
Severity: | Normal | CC: | darin
Priority: | P2
Version: | 528+ (Nightly build)
Hardware: | Mac
OS: | OS X 10.4
Description
Alexey Proskuryakov
2007-11-12 04:59:29 PST
Created attachment 17201
proposed patch

Comment on attachment 17201
proposed patch
What exactly does JSStringCreateWithUTF8CString do if passed invalid UTF-8? What should it do?
268 * This method should only be used for *debugging* purposes as it
269 * is not Unicode safe.
Perhaps that's overstating the case -- might there be some circumstances where you know it's an all-ASCII UString?
Should we add a UTF-8 text decoder to WebCore that uses this? Maybe if we did we could get rid of the simple/complex system for ICU, since we could handle the most common encodings without creating much of a text encoding registry at all. And perhaps we could change functions that convert to UTF-8 in WebCore::String to not use the registry.
Looks good, r=me
Committed revision 27746.

(In reply to comment #2)
> (From update of attachment 17201)
> What exactly does JSStringCreateWithUTF8CString do if passed invalid UTF-8?

I have now copied a comment describing strict/lenient modes to UTF8.h. For lenient mode as used by JSStringCreateWithUTF8CString(), it's:
- both irregular sequences and isolated surrogates are converted;
- illegal sequences will cause an error, and the result will be truncated to the first error position (this includes overlong forms);
- characters over 0x10FFFF are converted to replacement character.

> What should it do?

This may or may not be what it should do, I'm not sure.

> 268 * This method should only be used for *debugging* purposes as it
> 269 * is not Unicode safe.
>
> Perhaps that's overstating the case -- might there be some circumstances where
> you know it's an all-ASCII UString?

I have re-worded the comment.

Yay!

(In reply to comment #3)
> - both irregular sequences and isolated surrogates are converted;

What are irregular sequences? You mean things like the sequence for U+FFFE? Seems OK for isolated surrogates, but also not necessarily useful. Perhaps they should be treated as illegal sequences.

> - illegal sequences will cause an error, and the result will be truncated to the first error position (this includes overlong forms);

I think it's a problem that this error condition can't be detected by the caller to JSStringCreateWithUTF8CString; I think it would be better to return 0 when there's an error. For APIs where there's both an error indication and a string, I think returning the truncated string arguably might be useful. For some clients, it would be good to be able to report where the error was (say, on the JavaScript console).

> - characters over 0x10FFFF are converted to replacement character.

I would prefer that characters over 0x10FFFF be handled the same way as other illegal sequences.
(In reply to comment #5)
> > - both irregular sequences and isolated surrogates are converted;
>
> What are irregular sequences? You mean things like the sequence for U+FFFE?

An irregular UTF-8 code unit sequence is a six-byte sequence where the first three bytes correspond to a high surrogate, and the next three bytes correspond to a low surrogate.

We can change the function to use strict decoding, and to return 0 if an error is detected.
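
The strict behavior discussed in this thread (reject irregular sequences, isolated surrogates, overlong forms, and values over 0x10FFFF, and report the error instead of returning a truncated string) might look roughly like the sketch below. It is only an illustration, not the code in UTF8.h: the name `decodeUTF8Strict` and its signature are hypothetical, and it returns `false` where the API function would return 0.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not the WTF API): strictly decodes UTF-8 into UTF-16.
// Returns false on any ill-formed input and leaves `out` empty, so the
// caller never sees a silently truncated result.
bool decodeUTF8Strict(const std::string& input, std::u16string& out)
{
    out.clear();
    std::u16string result;
    std::size_t i = 0;
    while (i < input.size()) {
        const std::uint8_t lead = static_cast<std::uint8_t>(input[i]);
        std::uint32_t codePoint = 0;
        std::size_t length = 0;
        std::uint32_t minimum = 0; // smallest value allowed for this length

        if (lead < 0x80) {
            codePoint = lead; length = 1; minimum = 0;
        } else if ((lead & 0xE0) == 0xC0) {
            codePoint = lead & 0x1F; length = 2; minimum = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            codePoint = lead & 0x0F; length = 3; minimum = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            codePoint = lead & 0x07; length = 4; minimum = 0x10000;
        } else {
            return false; // stray continuation byte or invalid lead byte
        }

        if (input.size() - i < length)
            return false; // sequence is cut off at the end of the input

        for (std::size_t k = 1; k < length; ++k) {
            const std::uint8_t c = static_cast<std::uint8_t>(input[i + k]);
            if ((c & 0xC0) != 0x80)
                return false; // expected a continuation byte
            codePoint = (codePoint << 6) | (c & 0x3F);
        }

        if (codePoint < minimum)
            return false; // overlong form
        if (codePoint >= 0xD800 && codePoint <= 0xDFFF)
            return false; // surrogate, isolated or part of an irregular pair
        if (codePoint > 0x10FFFF)
            return false; // beyond the Unicode range

        if (codePoint < 0x10000) {
            result.push_back(static_cast<char16_t>(codePoint));
        } else {
            // Encode as a UTF-16 surrogate pair.
            const std::uint32_t v = codePoint - 0x10000;
            result.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
            result.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
        }
        i += length;
    }
    out.swap(result);
    return true;
}
```

With this sketch, U+1F600 in its well-formed encoding F0 9F 98 80 is accepted, while the irregular six-byte form ED A0 BD ED B8 80 (each surrogate encoded separately) is rejected, as are the overlong C0 80 and the out-of-range F4 90 80 80.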