WebKit Bugzilla
RESOLVED FIXED
15953
Add UTF-8 encoding/decoding to WTF
https://bugs.webkit.org/show_bug.cgi?id=15953
Reported by Alexey Proskuryakov, 2007-11-12 04:59:29 PST
We have to handle UTF-8 in several places in JavaScriptCore, and wtf/unicode looks like a good place to keep this code. Patch forthcoming.
Attachments
proposed patch (35.06 KB, patch), 2007-11-12 05:13 PST, Alexey Proskuryakov, darin: review+
Alexey Proskuryakov
Comment 1
2007-11-12 05:13:13 PST
Created attachment 17201: proposed patch
Darin Adler
Comment 2
2007-11-12 21:25:03 PST
Comment on attachment 17201: proposed patch
What exactly does JSStringCreateWithUTF8CString do if passed invalid UTF-8? What should it do?
> 268 * This method should only be used for *debugging* purposes as it
> 269 * is not Unicode safe.
Perhaps that's overstating the case -- might there be some circumstances where you know it's an all-ASCII UString?
Should we add a UTF-8 text decoder to WebCore that uses this? Maybe if we did we could get rid of the simple/complex system for ICU, since we could handle the most common encodings without creating much of a text encoding registry at all. And perhaps we could change functions that convert to UTF-8 in WebCore::String to not use the registry.
Looks good, r=me
Alexey Proskuryakov
Comment 3
2007-11-12 23:13:51 PST
Committed revision 27746.
(In reply to comment #2)
> (From update of attachment 17201)
> What exactly does JSStringCreateWithUTF8CString do if passed invalid UTF-8?
I have now copied a comment describing strict/lenient modes to UTF8.h. For lenient mode as used by JSStringCreateWithUTF8CString(), it's:
- both irregular sequences and isolated surrogates are converted;
- illegal sequences will cause an error, and the result will be truncated to the first error position (this includes overlong forms);
- characters over 0x10FFFF are converted to replacement character.
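The three lenient-mode rules above can be sketched as follows. This is a minimal illustration, not the code from the patch or from UTF8.h: `LenientResult`, `sequenceLength`, and `decodeLenient` are hypothetical names, and the sketch assumes that lead bytes 0xF0 through 0xF7 start four-byte sequences (so values above 0x10FFFF can be represented) while 0xF8 and above are treated as illegal.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical result type for illustration; not part of WTF.
struct LenientResult {
    std::u16string text; // UTF-16 output, truncated at the first illegal sequence
    bool hadError;       // true if an illegal sequence stopped the conversion
};

// Sequence length implied by a lead byte, or 0 if the byte cannot start a sequence.
static size_t sequenceLength(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if (lead >= 0xC0 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF7) return 4;
    return 0; // stray trail byte (0x80-0xBF) or 0xF8-0xFF
}

LenientResult decodeLenient(const std::string& bytes)
{
    // Smallest value that legitimately needs a sequence of each length;
    // anything smaller is an overlong form.
    static const uint32_t minimumValue[5] = { 0, 0, 0x80, 0x800, 0x10000 };

    LenientResult result { std::u16string(), false };
    size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        size_t length = sequenceLength(lead);
        bool legal = length && i + length <= bytes.size();
        uint32_t c = 0;
        if (legal) {
            assert(length >= 1 && length <= 4);
            // Keep only the payload bits of the lead byte.
            c = length == 1 ? lead : lead & (0xFF >> (length + 1));
            for (size_t k = 1; legal && k < length; ++k) {
                unsigned char trail = static_cast<unsigned char>(bytes[i + k]);
                legal = (trail & 0xC0) == 0x80;
                c = (c << 6) | (trail & 0x3F);
            }
            legal = legal && c >= minimumValue[length];
        }
        if (!legal) {
            // Rule 2: illegal sequence (including overlong forms); stop here.
            result.hadError = true;
            break;
        }
        if (c > 0x10FFFF)
            c = 0xFFFD; // Rule 3: out of range becomes the replacement character.
        if (c >= 0x10000) {
            c -= 0x10000;
            result.text.push_back(static_cast<char16_t>(0xD800 + (c >> 10)));
            result.text.push_back(static_cast<char16_t>(0xDC00 + (c & 0x3FF)));
        } else {
            // Rule 1: surrogate values pass through unchanged, so an irregular
            // six-byte sequence yields an ordinary UTF-16 surrogate pair.
            result.text.push_back(static_cast<char16_t>(c));
        }
        i += length;
    }
    return result;
}
```

Note that in this shape the caller can see `hadError`, whereas a function that returns only the string cannot signal the truncation.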
> What should it do?
This may or may not be what it should do, I'm not sure.
> 268 * This method should only be used for *debugging* purposes as it
> 269 * is not Unicode safe.
>
> Perhaps that's overstating the case -- might there be some circumstances where
> you know it's an all-ASCII UString?
I have re-worded the comment.
Geoffrey Garen
Comment 4
2007-11-13 00:03:55 PST
Yay!
Darin Adler
Comment 5
2007-11-13 09:10:50 PST
(In reply to comment #3)
> - both irregular sequences and isolated surrogates are converted;
What are irregular sequences? You mean things like the sequence for U+FFFE? Seems OK for isolated surrogates, but also not necessarily useful. Perhaps they should be treated as illegal sequences.
> - illegal sequences will cause an error, and the result will be truncated to the first error position (this includes overlong forms);
I think it's a problem that this error condition can't be detected by the caller to JSStringCreateWithUTF8CString; I think it would be better to return 0 when there's an error. For APIs where there's both an error indication and a string, I think returning the truncated string arguably might be useful. For some clients, it would be good to be able to report where the error was (say, on the JavaScript console).
> - characters over 0x10FFFF are converted to replacement character.
I would prefer that characters over 0x10FFFF be handled the same way as other illegal sequences.
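The behavior proposed in this comment (a detectable failure, an error position the client can report, and isolated surrogates and values over 0x10FFFF treated like any other illegal sequence) could look roughly like the sketch below. It is only an illustration under those assumptions: `StrictResult` and `decodeStrict` are hypothetical names, not JavaScriptCore or WTF API, and `ok == false` stands in for "return 0".

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical result type for illustration; not a JavaScriptCore or WTF API.
struct StrictResult {
    bool ok;             // false plays the role of "return 0" in the discussion
    size_t errorOffset;  // byte offset of the first illegal sequence; meaningful only if !ok
    std::u32string text; // decoded code points (the prefix before the error if !ok)
};

StrictResult decodeStrict(const std::string& bytes)
{
    StrictResult result { true, 0, std::u32string() };
    size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        size_t length = 0;    // 0 means "this byte cannot start a sequence"
        uint32_t c = 0;
        uint32_t minimum = 0; // smallest value that needs this sequence length
        if (lead < 0x80) { length = 1; c = lead; }
        else if ((lead & 0xE0) == 0xC0) { length = 2; c = lead & 0x1F; minimum = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { length = 3; c = lead & 0x0F; minimum = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { length = 4; c = lead & 0x07; minimum = 0x10000; }

        bool valid = length && i + length <= bytes.size();
        for (size_t k = 1; valid && k < length; ++k) {
            unsigned char trail = static_cast<unsigned char>(bytes[i + k]);
            valid = (trail & 0xC0) == 0x80;
            c = (c << 6) | (trail & 0x3F);
        }
        // Strict mode: overlong forms, surrogates, and values over 0x10FFFF
        // are all rejected in the same way.
        if (valid)
            valid = c >= minimum && c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);

        if (!valid) {
            result.ok = false;
            result.errorOffset = i; // lets a client report where decoding failed
            return result;
        }
        result.text.push_back(static_cast<char32_t>(c));
        i += length;
    }
    return result;
}
```

A caller that only needs success or failure checks `ok`; one that wants to log the problem (say, to a console) can also use `errorOffset` and the decoded prefix in `text`.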
Alexey Proskuryakov
Comment 6
2007-11-13 10:11:10 PST
(In reply to comment #5)
> > - both irregular sequences and isolated surrogates are converted;
>
> What are irregular sequences? You mean things like the sequence for U+FFFE?
An irregular UTF-8 code unit sequence is a six-byte sequence where the first three bytes correspond to a high surrogate, and the next three bytes correspond to a low surrogate.
We can change the function to use strict decoding, and to return 0 if an error is detected.
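To make the definition of an irregular sequence concrete, here is a small sketch that recognizes the six-byte form and computes the supplementary code point it stands for. The helper names `decodeThreeBytes` and `decodeIrregularSequence` are hypothetical and chosen for illustration; they are not taken from the patch.

```cpp
#include <cstdint>

// Decodes one three-byte UTF-8 form without rejecting surrogate values.
// Returns -1 if the bytes do not have the 1110xxxx 10xxxxxx 10xxxxxx shape.
static int32_t decodeThreeBytes(const unsigned char* p)
{
    if ((p[0] & 0xF0) != 0xE0 || (p[1] & 0xC0) != 0x80 || (p[2] & 0xC0) != 0x80)
        return -1;
    return ((p[0] & 0x0F) << 12) | ((p[1] & 0x3F) << 6) | (p[2] & 0x3F);
}

// Hypothetical helper: if the six bytes at p form an irregular sequence
// (high surrogate followed by low surrogate, each in three-byte form),
// returns the code point the pair represents; otherwise returns -1.
// The caller must guarantee that six bytes are readable at p.
int32_t decodeIrregularSequence(const unsigned char* p)
{
    int32_t high = decodeThreeBytes(p);
    int32_t low = decodeThreeBytes(p + 3);
    if (high < 0xD800 || high > 0xDBFF || low < 0xDC00 || low > 0xDFFF)
        return -1;
    // Standard UTF-16 surrogate pair arithmetic.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}
```

For example, the bytes ED A0 BD ED B8 80 encode the surrogates U+D83D and U+DE00, which together represent U+1F600; regular UTF-8 would encode that code point in four bytes as F0 9F 98 80.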
Alexey Proskuryakov
Comment 7
2007-11-14 00:54:02 PST
Filed bug 15982.