Bug 15953 - Add UTF-8 encoding/decoding to WTF
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Web Template Framework
Version: 528+ (Nightly build)
Hardware: Macintosh OS X 10.4
Importance: P2 Normal
Assignee: Alexey Proskuryakov
Reported: 2007-11-12 04:59 PST by Alexey Proskuryakov
Modified: 2007-11-14 00:54 PST

Attachments
proposed patch (35.06 KB, patch)
2007-11-12 05:13 PST, Alexey Proskuryakov
darin: review+

Description Alexey Proskuryakov 2007-11-12 04:59:29 PST
We have to handle UTF-8 in several places in JavaScriptCore, and wtf/unicode looks like a good place to keep this code.

Patch forthcoming.
Comment 1 Alexey Proskuryakov 2007-11-12 05:13:13 PST
Created attachment 17201
proposed patch
Comment 2 Darin Adler 2007-11-12 21:25:03 PST
Comment on attachment 17201
proposed patch

What exactly does JSStringCreateWithUTF8CString do if passed invalid UTF-8? What should it do?

     * This method should only be used for *debugging* purposes as it
     * is not Unicode safe.

Perhaps that's overstating the case -- might there be some circumstances where you know it's an all-ASCII UString?

Should we add a UTF-8 text decoder to WebCore that uses this? Maybe if we did we could get rid of the simple/complex system for ICU, since we could handle the most common encodings without creating much of a text encoding registry at all. And perhaps we could change functions that convert to UTF-8 in WebCore::String to not use the registry.

Looks good, r=me
Comment 3 Alexey Proskuryakov 2007-11-12 23:13:51 PST
Committed revision 27746.

(In reply to comment #2)
> (From update of attachment 17201)
> What exactly does JSStringCreateWithUTF8CString do if passed invalid UTF-8?

I have now copied a comment describing strict/lenient modes to UTF8.h. In lenient mode, which JSStringCreateWithUTF8CString() uses, the behavior is:
- both irregular sequences and isolated surrogates are converted;
- illegal sequences will cause an error, and the result will be truncated to the first error position (this includes overlong forms);
- characters over 0x10FFFF are converted to replacement character.
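
As a rough illustration of the three lenient-mode rules listed above, here is a minimal sketch. It is not the code from the patch or from UTF8.h: the names LenientResult and decodeUTF8Lenient are made up for this example, the output is assumed to be UTF-16, and lead bytes 0xF8 and above are assumed to count as illegal.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch; not a declaration from UTF8.h.
struct LenientResult {
    std::u16string text; // everything decoded before the first error, if any
    bool ok;             // false if an illegal sequence stopped decoding
};

// Sketch of lenient decoding under the assumptions stated above.
LenientResult decodeUTF8Lenient(const unsigned char* bytes, std::size_t length)
{
    assert(bytes || !length);
    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes.
    static const char32_t minForLength[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    LenientResult result { {}, true };
    std::size_t i = 0;
    while (i < length) {
        unsigned char lead = bytes[i];
        std::size_t count;
        char32_t cp;
        if (lead < 0x80) { count = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { count = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { count = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { count = 4; cp = lead & 0x07; }
        else { result.ok = false; break; } // stray continuation or invalid lead byte

        if (count > length - i) { result.ok = false; break; } // truncated sequence

        bool bad = false;
        for (std::size_t k = 1; k < count; ++k) {
            unsigned char c = bytes[i + k];
            if ((c & 0xC0) != 0x80) { bad = true; break; }
            cp = (cp << 6) | (c & 0x3F);
        }
        // Illegal continuation bytes and overlong forms stop decoding;
        // the text gathered so far is kept (truncation at the error).
        if (bad || cp < minForLength[count]) { result.ok = false; break; }

        if (cp > 0x10FFFF) {
            result.text.push_back(0xFFFD); // replacement character
        } else if (cp >= 0x10000) {
            cp -= 0x10000;
            result.text.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            result.text.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            // Surrogates are passed through as-is, so isolated surrogates
            // and six-byte irregular sequences are both converted.
            result.text.push_back(static_cast<char16_t>(cp));
        }
        i += count;
    }
    return result;
}
```

Note that in this sketch the caller only learns about an error through the ok flag; a function that returns just the string would hide it.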

> What should it do?

This may or may not be what it should do; I'm not sure.

>      * This method should only be used for *debugging* purposes as it
>      * is not Unicode safe.
> 
> Perhaps that's overstating the case -- might there be some circumstances where
> you know it's an all-ASCII UString?

I have re-worded the comment.
Comment 4 Geoffrey Garen 2007-11-13 00:03:55 PST
Yay!
Comment 5 Darin Adler 2007-11-13 09:10:50 PST
(In reply to comment #3)
> - both irregular sequences and isolated surrogates are converted;

What are irregular sequences? You mean things like the sequence for U+FFFE?

Seems OK for isolated surrogates, but also not necessarily useful. Perhaps they should be treated as illegal sequences.

> - illegal sequences will cause an error, and the result will be truncated to the first error position (this includes overlong forms);

I think it's a problem that this error condition can't be detected by the caller to JSStringCreateWithUTF8CString; I think it would be better to return 0 when there's an error. For APIs where there's both an error indication and a string, I think returning the truncated string arguably might be useful. For some clients, it would be good to be able to report where the error was (say, on the JavaScript console).

> - characters over 0x10FFFF are converted to replacement character.

I would prefer that characters over 0x10FFFF be handled the same way as other illegal sequences.
Comment 6 Alexey Proskuryakov 2007-11-13 10:11:10 PST
(In reply to comment #5)
> > - both irregular sequences and isolated surrogates are converted;
> 
> What are irregular sequences? You mean things like the sequence for U+FFFE?

An irregular UTF-8 code unit sequence is a six-byte sequence where the first three bytes correspond to a high surrogate, and the next three bytes correspond to a low surrogate.
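
To make the definition above concrete, here is a small sketch using U+1F600 as an arbitrary example character. Its UTF-16 form is the surrogate pair D83D DE00, and encoding each surrogate separately as three bytes gives the irregular six-byte sequence ED A0 BD ED B8 80 (the regular form is the four bytes F0 9F 98 80). The helper names are hypothetical and are not taken from WTF.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Encode one UTF-16 code unit in the range U+0800..U+FFFF as three bytes,
// using the ordinary three-byte UTF-8 bit layout. Hypothetical helper.
std::array<std::uint8_t, 3> encodeThreeBytes(char16_t unit)
{
    assert(unit >= 0x800);
    return {{
        static_cast<std::uint8_t>(0xE0 | (unit >> 12)),
        static_cast<std::uint8_t>(0x80 | ((unit >> 6) & 0x3F)),
        static_cast<std::uint8_t>(0x80 | (unit & 0x3F))
    }};
}

// Recombine a high/low surrogate pair into the code point it represents.
char32_t combineSurrogates(char16_t high, char16_t low)
{
    assert(high >= 0xD800 && high <= 0xDBFF);
    assert(low >= 0xDC00 && low <= 0xDFFF);
    return 0x10000
        + ((static_cast<char32_t>(high) - 0xD800) << 10)
        + (static_cast<char32_t>(low) - 0xDC00);
}
```

A lenient decoder that accepts each three-byte half and emits the two surrogates into a UTF-16 string ends up with a valid pair, which is why such sequences can be "converted" rather than rejected.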

We can change the function to use strict decoding, and to return 0 if an error is detected.
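
For illustration, a strict check along those lines could look like the sketch below. It is an assumption about what "strict" would mean here (overlong forms, surrogates, and code points above 0x10FFFF are all rejected) and not the change that was eventually made; validateUTF8Strict and its errorOffset parameter are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of WTF. Returns true if the input is strictly
// valid UTF-8. On failure, *errorOffset receives the index of the lead byte
// of the first bad sequence, so a caller could report where the error was.
bool validateUTF8Strict(const unsigned char* bytes, std::size_t length, std::size_t* errorOffset)
{
    assert(errorOffset);
    std::size_t i = 0;
    while (i < length) {
        unsigned char lead = bytes[i];
        std::size_t count = 0; // 0 means the lead byte itself is invalid
        char32_t cp = 0;
        char32_t min = 0;      // smallest code point allowed for this length
        if (lead < 0x80) { count = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { count = 2; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { count = 3; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { count = 4; cp = lead & 0x07; min = 0x10000; }

        bool valid = count && count <= length - i;
        for (std::size_t k = 1; valid && k < count; ++k) {
            unsigned char c = bytes[i + k];
            valid = (c & 0xC0) == 0x80;
            cp = (cp << 6) | (c & 0x3F);
        }
        // Strict mode: no overlong forms, no surrogates, nothing above U+10FFFF.
        if (valid)
            valid = cp >= min && cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);

        if (!valid) {
            *errorOffset = i;
            return false;
        }
        i += count;
    }
    return true;
}
```

A string-creating function built on a check like this could return 0 whenever it reports false, while other callers could use the offset to point at the error.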
Comment 7 Alexey Proskuryakov 2007-11-14 00:54:02 PST
Filed bug 15982.