Bug 38890 - Multi-byte issue in JavaScriptCore formatLocaleDate
Summary: Multi-byte issue in JavaScriptCore formatLocaleDate
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore (show other bugs)
Version: 528+ (Nightly build)
Hardware: PC Linux
: P2 Normal
Assignee: Nobody
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-05-11 00:26 PDT by Leo Yang
Modified: 2010-05-15 11:39 PDT (History)
5 users (show)

See Also:


Attachments
patch (2.40 KB, patch)
2010-05-11 00:41 PDT, Leo Yang
no flags Details | Formatted Diff | Diff
patch (2.42 KB, patch)
2010-05-11 01:01 PDT, Leo Yang
darin: review-
Details | Formatted Diff | Diff
patch (1.97 KB, patch)
2010-05-12 23:47 PDT, Leo Yang
no flags Details | Formatted Diff | Diff
patch (2.47 KB, patch)
2010-05-13 19:25 PDT, Leo Yang
no flags Details | Formatted Diff | Diff
patch (2.48 KB, patch)
2010-05-13 19:40 PDT, Leo Yang
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Leo Yang 2010-05-11 00:26:19 PDT
formatLocaleDate (non-Mac platform implement) in JavaScriptCore/runtime/DatePrototype.cpp doesn't take multi-byte into account.
Considering the follow html code for LC_TIME="zh_CN.UTF-8", formatLocaleDate behaves incorrect (junk result)

<html>
<head>
<meta http-equiv="Content-Type" Content="text/html; Charset=utf-8">
</head>
<body>
    <script>
       var now = new Date();
       document.write(now.toLocaleString());
    </script>
</body>
</html>
Comment 1 Leo Yang 2010-05-11 00:41:34 PDT
Created attachment 55672 [details]
patch
Comment 2 Leo Yang 2010-05-11 01:01:37 PDT
Created attachment 55674 [details]
patch
Comment 3 Darin Adler 2010-05-11 10:51:43 PDT
Comment on attachment 55674 [details]
patch

What's the guarantee that wide characters are UTF-16? This whole patch seems to be based on that assumption. There needs to be at least a comment indicating why that's so.

> +    // Convert multibyte result to wide char

Sentence comments should use whole words and periods. So please call this "wide characters" and use a period.

> +    UChar buffer[128];

Would be better to the value bufsize that is already here instead of repeating the 128 here.

> +    size_t len;

We don't abbreviate variable names like this.

> +    // Because sizeof can't be used during preprocessing, we do runtime check.
> +    if (sizeof(UChar) == sizeof(wchar_t))
> +        len = mbstowcs((wchar_t*)buffer, timebuffer, sizeof(buffer) / sizeof(buffer[0]) - 1);

We don't use C-style casts, so this should be reinterpret_cast.

Do we really need the optimized case? It seems fine to always copy from a wchar_t buffer into a UChar buffer, so I suggest removing the special case loop.

> +        wchar_t tempbuffer[128];

Would be better to use bufsize instead of repeating the 128 here. We normally don't ram words together. The existing name "timebuffer" is not our WebKit style. I'd call this "wideCharacterBuffer".

> +        len = mbstowcs(tempbuffer, timebuffer, sizeof(tempbuffer) / sizeof(tempbuffer[0]) -1);

There should be a space after the "-" here.

> +        for (size_t i = 0; i < len && len != (size_t)(-1); ++i)
> +            buffer[i] = (UChar)tempbuffer[i];

Checking that length is not -1 every time through the loop doesn't make sense. That should be a separate if statement.

We don't use C-style casts, so this should be no cast at all, or a cast to UChar if the cast is needed for some reason.

> +    if (UNLIKELY(len == (size_t)(-1)))

This should be static_cast.

There's no need to use UNLIKELY for this, because it's not so performance-critical that we need to optimize that much. The work here dwarfs that single patch.

review- because of these minor issues, but there also is the issue that wchar_t is not always UTF-16, which seems like a major one.
Comment 4 Yong Li 2010-05-11 11:33:30 PDT
mbstowcs: it's a standard c function, but I think it would be nice if we use WTF::Unicode layer for this, as other Unicode related functions are also gathered there.
Comment 5 Leo Yang 2010-05-11 19:42:58 PDT
(In reply to comment #4)
> mbstowcs: it's a standard c function, but I think it would be nice if we use WTF::Unicode layer for this, as other Unicode related functions are also gathered there.

Yes. It's better to do so. If we do it, we need to add interface and ask each platform to implement. We need also to propagate this bug to every platform. And most platform implementation will just call mbstowcs. Is it necessary to make things complex?

If a platform has no mbstowcs, just implement it, we can treat mbstowcs as an interface :)
Comment 6 Leo Yang 2010-05-11 20:38:09 PDT
(In reply to comment #3)
> (From update of attachment 55674 [details])
> What's the guarantee that wide characters are UTF-16? This whole patch seems to be based on that assumption. There needs to be at least a comment indicating why that's so.
> 
> > +    // Convert multibyte result to wide char
> 
> Sentence comments should use whole words and periods. So please call this "wide characters" and use a period.
> 
> > +    UChar buffer[128];
> 
> Would be better to the value bufsize that is already here instead of repeating the 128 here.
> 
> > +    size_t len;
> 
> We don't abbreviate variable names like this.
> 
> > +    // Because sizeof can't be used during preprocessing, we do runtime check.
> > +    if (sizeof(UChar) == sizeof(wchar_t))
> > +        len = mbstowcs((wchar_t*)buffer, timebuffer, sizeof(buffer) / sizeof(buffer[0]) - 1);
> 
> We don't use C-style casts, so this should be reinterpret_cast.
> 
> Do we really need the optimized case? It seems fine to always copy from a wchar_t buffer into a UChar buffer, so I suggest removing the special case loop.
> 
> > +        wchar_t tempbuffer[128];
> 
> Would be better to use bufsize instead of repeating the 128 here. We normally don't ram words together. The existing name "timebuffer" is not our WebKit style. I'd call this "wideCharacterBuffer".
> 
> > +        len = mbstowcs(tempbuffer, timebuffer, sizeof(tempbuffer) / sizeof(tempbuffer[0]) -1);
> 
> There should be a space after the "-" here.
> 
> > +        for (size_t i = 0; i < len && len != (size_t)(-1); ++i)
> > +            buffer[i] = (UChar)tempbuffer[i];
> 
> Checking that length is not -1 every time through the loop doesn't make sense. That should be a separate if statement.
> 
> We don't use C-style casts, so this should be no cast at all, or a cast to UChar if the cast is needed for some reason.
> 
> > +    if (UNLIKELY(len == (size_t)(-1)))
> 
> This should be static_cast.
> 
> There's no need to use UNLIKELY for this, because it's not so performance-critical that we need to optimize that much. The work here dwarfs that single patch.
> 
> review- because of these minor issues, but there also is the issue that wchar_t is not always UTF-16, which seems like a major one.
Will follow all above minor issues. 
Yes, wchar_t is not always UTF-16, so I copy the whart_t result to UTF-16 buffer in case of that. In theory, there might be data loss, but datetime related words in different languages should be aligned in to UTF-16; if not, can we handle it by JSString?
Comment 7 Leo Yang 2010-05-11 20:40:05 PDT
(In reply to comment #6)
> (In reply to comment #3)
> > (From update of attachment 55674 [details] [details])
> > What's the guarantee that wide characters are UTF-16? This whole patch seems to be based on that assumption. There needs to be at least a comment indicating why that's so.
> > 
> > > +    // Convert multibyte result to wide char
> > 
> > Sentence comments should use whole words and periods. So please call this "wide characters" and use a period.
> > 
> > > +    UChar buffer[128];
> > 
> > Would be better to the value bufsize that is already here instead of repeating the 128 here.
> > 
> > > +    size_t len;
> > 
> > We don't abbreviate variable names like this.
> > 
> > > +    // Because sizeof can't be used during preprocessing, we do runtime check.
> > > +    if (sizeof(UChar) == sizeof(wchar_t))
> > > +        len = mbstowcs((wchar_t*)buffer, timebuffer, sizeof(buffer) / sizeof(buffer[0]) - 1);
> > 
> > We don't use C-style casts, so this should be reinterpret_cast.
> > 
> > Do we really need the optimized case? It seems fine to always copy from a wchar_t buffer into a UChar buffer, so I suggest removing the special case loop.
> > 
> > > +        wchar_t tempbuffer[128];
> > 
> > Would be better to use bufsize instead of repeating the 128 here. We normally don't ram words together. The existing name "timebuffer" is not our WebKit style. I'd call this "wideCharacterBuffer".
> > 
> > > +        len = mbstowcs(tempbuffer, timebuffer, sizeof(tempbuffer) / sizeof(tempbuffer[0]) -1);
> > 
> > There should be a space after the "-" here.
> > 
> > > +        for (size_t i = 0; i < len && len != (size_t)(-1); ++i)
> > > +            buffer[i] = (UChar)tempbuffer[i];
> > 
> > Checking that length is not -1 every time through the loop doesn't make sense. That should be a separate if statement.
> > 
> > We don't use C-style casts, so this should be no cast at all, or a cast to UChar if the cast is needed for some reason.
> > 
> > > +    if (UNLIKELY(len == (size_t)(-1)))
> > 
> > This should be static_cast.
> > 
> > There's no need to use UNLIKELY for this, because it's not so performance-critical that we need to optimize that much. The work here dwarfs that single patch.
> > 
> > review- because of these minor issues, but there also is the issue that wchar_t is not always UTF-16, which seems like a major one.
Will follow comments about all above minor issues. 
Yes, wchar_t is not always UTF-16, so I copy the whart_t result to UTF-16 buffer in case of that. In theory, there might be data loss, but datetime related words in different languages should be aligned in to UTF-16; if not, can we handle it by JSString?
Comment 8 Alexey Proskuryakov 2010-05-12 14:41:13 PDT
> Yes, wchar_t is not always UTF-16, so I copy the whart_t result to UTF-16 buffer in case of that.

It may be UTF-32, for example. Or it may be something entirely unrelated to Unicode, even if its size is 16 bit.
Comment 9 Leo Yang 2010-05-12 18:48:40 PDT
(In reply to comment #8)
> > Yes, wchar_t is not always UTF-16, so I copy the whart_t result to UTF-16 buffer in case of that.
> 
> It may be UTF-32, for example. Or it may be something entirely unrelated to Unicode, even if its size is 16 bit.
If it's UTF-32, we could only static_cast to UTF-16, and may loss data, can we handle UTF-32 using JSString?

For wchar_t itself, it can be everything, but as result of mbstowcs, can it be everything instead of UTF-16/UTF-32?
Comment 10 Leo Yang 2010-05-12 23:47:19 PDT
Created attachment 55949 [details]
patch

Follows review comments to address the following issues.

1. Minor style issue
2. wchar_t to unicode issue
Comment 11 Darin Adler 2010-05-13 13:21:09 PDT
Comment on attachment 55949 [details]
patch

Looking better!

> +#ifdef __STDC_ISO_10646__

This seems like an unusual way to detect the presence of the mbstowcs function along with the fact that wide characters are Unicode. Is this really defined consistently everywhere we need to use mbstowcs and nowhere we don't?

> +    // Convert multi-byte result to unicode.

Unicode should be capitalized, since it's a proper noun.

> +        for (size_t i = 0; i < length; ++i)
> +            buffer[i] = static_cast<UChar>(tempbuffer[i]);

This needs a comment saying this assumes the characters are all representable by a single UTF-16 code point. That is not a safe assumption in general but we are probably OK for all the cases we will encounter in practice. If any character is a non-BMP one, we'll give the wrong result.

I suggest you also add a comment explaining that if mbstowcs fails we just fall back on using the 8-bit characters as-is. And explain why and when that is ever the right thing to do. I think it's never helpful, myself, and it seems a little strange to leave that code path in there.

I'll say r=me but I'd appreciate follow-up on the issues I mentioned here.
Comment 12 Leo Yang 2010-05-13 19:25:33 PDT
Created attachment 56043 [details]
patch

Follow review comments to add more code comments.
Comment 13 Leo Yang 2010-05-13 19:40:56 PDT
Created attachment 56046 [details]
patch

Follow review comments
Comment 14 Adam Barth 2010-05-15 00:05:23 PDT
Comment on attachment 56046 [details]
patch

In the future, it would be helpful if you included more information in your ChangeLog.
Comment 15 WebKit Commit Bot 2010-05-15 11:38:56 PDT
Comment on attachment 56046 [details]
patch

Clearing flags on attachment: 56046

Committed r59545: <http://trac.webkit.org/changeset/59545>
Comment 16 WebKit Commit Bot 2010-05-15 11:39:02 PDT
All reviewed patches have been landed.  Closing bug.