38890 – Multi-byte issue in JavaScriptCore formatLocaleDate

RESOLVED FIXED 38890

Multi-byte issue in JavaScriptCore formatLocaleDate

https://bugs.webkit.org/show_bug.cgi?id=38890

Summary Multi-byte issue in JavaScriptCore formatLocaleDate

Leo Yang

Reported 2010-05-11 00:26:19 PDT

formatLocaleDate (non-Mac platform implement) in JavaScriptCore/runtime/DatePrototype.cpp doesn't take multi-byte into account. Considering the follow html code for LC_TIME="zh_CN.UTF-8", formatLocaleDate behaves incorrect (junk result) <html> <head> <meta http-equiv="Content-Type" Content="text/html; Charset=utf-8"> </head> <body> <script> var now = new Date(); document.write(now.toLocaleString()); </script> </body> </html>

Attachments
patch (2.40 KB, patch) 2010-05-11 00:41 PDT, Leo Yang	no flags	Details Formatted Diff Diff
patch (2.42 KB, patch) 2010-05-11 01:01 PDT, Leo Yang	darin: review-	Details Formatted Diff Diff
patch (1.97 KB, patch) 2010-05-12 23:47 PDT, Leo Yang	no flags	Details Formatted Diff Diff
patch (2.47 KB, patch) 2010-05-13 19:25 PDT, Leo Yang	no flags	Details Formatted Diff Diff
patch (2.48 KB, patch) 2010-05-13 19:40 PDT, Leo Yang	no flags	Details Formatted Diff Diff
Show Obsolete (4) View All Add attachment proposed patch, testcase, etc.

Leo Yang

Comment 1 2010-05-11 00:41:34 PDT

Created attachment 55672 [details] patch

Leo Yang

Comment 2 2010-05-11 01:01:37 PDT

Created attachment 55674 [details] patch

Darin Adler

Comment 3 2010-05-11 10:51:43 PDT

Comment on attachment 55674 [details] patch What's the guarantee that wide characters are UTF-16? This whole patch seems to be based on that assumption. There needs to be at least a comment indicating why that's so. > + // Convert multibyte result to wide char Sentence comments should use whole words and periods. So please call this "wide characters" and use a period. > + UChar buffer[128]; Would be better to the value bufsize that is already here instead of repeating the 128 here. > + size_t len; We don't abbreviate variable names like this. > + // Because sizeof can't be used during preprocessing, we do runtime check. > + if (sizeof(UChar) == sizeof(wchar_t)) > + len = mbstowcs((wchar_t*)buffer, timebuffer, sizeof(buffer) / sizeof(buffer[0]) - 1); We don't use C-style casts, so this should be reinterpret_cast. Do we really need the optimized case? It seems fine to always copy from a wchar_t buffer into a UChar buffer, so I suggest removing the special case loop. > + wchar_t tempbuffer[128]; Would be better to use bufsize instead of repeating the 128 here. We normally don't ram words together. The existing name "timebuffer" is not our WebKit style. I'd call this "wideCharacterBuffer". > + len = mbstowcs(tempbuffer, timebuffer, sizeof(tempbuffer) / sizeof(tempbuffer[0]) -1); There should be a space after the "-" here. > + for (size_t i = 0; i < len && len != (size_t)(-1); ++i) > + buffer[i] = (UChar)tempbuffer[i]; Checking that length is not -1 every time through the loop doesn't make sense. That should be a separate if statement. We don't use C-style casts, so this should be no cast at all, or a cast to UChar if the cast is needed for some reason. > + if (UNLIKELY(len == (size_t)(-1))) This should be static_cast. There's no need to use UNLIKELY for this, because it's not so performance-critical that we need to optimize that much. The work here dwarfs that single patch. review- because of these minor issues, but there also is the issue that wchar_t is not always UTF-16, which seems like a major one.

Yong Li

Comment 4 2010-05-11 11:33:30 PDT

mbstowcs: it's a standard c function, but I think it would be nice if we use WTF::Unicode layer for this, as other Unicode related functions are also gathered there.

Leo Yang

Comment 5 2010-05-11 19:42:58 PDT

(In reply to comment #4) > mbstowcs: it's a standard c function, but I think it would be nice if we use WTF::Unicode layer for this, as other Unicode related functions are also gathered there. Yes. It's better to do so. If we do it, we need to add interface and ask each platform to implement. We need also to propagate this bug to every platform. And most platform implementation will just call mbstowcs. Is it necessary to make things complex? If a platform has no mbstowcs, just implement it, we can treat mbstowcs as an interface :)

Leo Yang

Comment 6 2010-05-11 20:38:09 PDT

(In reply to comment #3) > (From update of attachment 55674 [details]) > What's the guarantee that wide characters are UTF-16? This whole patch seems to be based on that assumption. There needs to be at least a comment indicating why that's so. > > > + // Convert multibyte result to wide char > > Sentence comments should use whole words and periods. So please call this "wide characters" and use a period. > > > + UChar buffer[128]; > > Would be better to the value bufsize that is already here instead of repeating the 128 here. > > > + size_t len; > > We don't abbreviate variable names like this. > > > + // Because sizeof can't be used during preprocessing, we do runtime check. > > + if (sizeof(UChar) == sizeof(wchar_t)) > > + len = mbstowcs((wchar_t*)buffer, timebuffer, sizeof(buffer) / sizeof(buffer[0]) - 1); > > We don't use C-style casts, so this should be reinterpret_cast. > > Do we really need the optimized case? It seems fine to always copy from a wchar_t buffer into a UChar buffer, so I suggest removing the special case loop. > > > + wchar_t tempbuffer[128]; > > Would be better to use bufsize instead of repeating the 128 here. We normally don't ram words together. The existing name "timebuffer" is not our WebKit style. I'd call this "wideCharacterBuffer". > > > + len = mbstowcs(tempbuffer, timebuffer, sizeof(tempbuffer) / sizeof(tempbuffer[0]) -1); > > There should be a space after the "-" here. > > > + for (size_t i = 0; i < len && len != (size_t)(-1); ++i) > > + buffer[i] = (UChar)tempbuffer[i]; > > Checking that length is not -1 every time through the loop doesn't make sense. That should be a separate if statement. > > We don't use C-style casts, so this should be no cast at all, or a cast to UChar if the cast is needed for some reason. > > > + if (UNLIKELY(len == (size_t)(-1))) > > This should be static_cast. > > There's no need to use UNLIKELY for this, because it's not so performance-critical that we need to optimize that much. The work here dwarfs that single patch. > > review- because of these minor issues, but there also is the issue that wchar_t is not always UTF-16, which seems like a major one. Will follow all above minor issues. Yes, wchar_t is not always UTF-16, so I copy the whart_t result to UTF-16 buffer in case of that. In theory, there might be data loss, but datetime related words in different languages should be aligned in to UTF-16; if not, can we handle it by JSString?

Leo Yang

Comment 7 2010-05-11 20:40:05 PDT

(In reply to comment #6) > (In reply to comment #3) > > (From update of attachment 55674 [details] [details]) > > What's the guarantee that wide characters are UTF-16? This whole patch seems to be based on that assumption. There needs to be at least a comment indicating why that's so. > > > > > + // Convert multibyte result to wide char > > > > Sentence comments should use whole words and periods. So please call this "wide characters" and use a period. > > > > > + UChar buffer[128]; > > > > Would be better to the value bufsize that is already here instead of repeating the 128 here. > > > > > + size_t len; > > > > We don't abbreviate variable names like this. > > > > > + // Because sizeof can't be used during preprocessing, we do runtime check. > > > + if (sizeof(UChar) == sizeof(wchar_t)) > > > + len = mbstowcs((wchar_t*)buffer, timebuffer, sizeof(buffer) / sizeof(buffer[0]) - 1); > > > > We don't use C-style casts, so this should be reinterpret_cast. > > > > Do we really need the optimized case? It seems fine to always copy from a wchar_t buffer into a UChar buffer, so I suggest removing the special case loop. > > > > > + wchar_t tempbuffer[128]; > > > > Would be better to use bufsize instead of repeating the 128 here. We normally don't ram words together. The existing name "timebuffer" is not our WebKit style. I'd call this "wideCharacterBuffer". > > > > > + len = mbstowcs(tempbuffer, timebuffer, sizeof(tempbuffer) / sizeof(tempbuffer[0]) -1); > > > > There should be a space after the "-" here. > > > > > + for (size_t i = 0; i < len && len != (size_t)(-1); ++i) > > > + buffer[i] = (UChar)tempbuffer[i]; > > > > Checking that length is not -1 every time through the loop doesn't make sense. That should be a separate if statement. > > > > We don't use C-style casts, so this should be no cast at all, or a cast to UChar if the cast is needed for some reason. > > > > > + if (UNLIKELY(len == (size_t)(-1))) > > > > This should be static_cast. > > > > There's no need to use UNLIKELY for this, because it's not so performance-critical that we need to optimize that much. The work here dwarfs that single patch. > > > > review- because of these minor issues, but there also is the issue that wchar_t is not always UTF-16, which seems like a major one. Will follow comments about all above minor issues. Yes, wchar_t is not always UTF-16, so I copy the whart_t result to UTF-16 buffer in case of that. In theory, there might be data loss, but datetime related words in different languages should be aligned in to UTF-16; if not, can we handle it by JSString?

Alexey Proskuryakov

Comment 8 2010-05-12 14:41:13 PDT

> Yes, wchar_t is not always UTF-16, so I copy the whart_t result to UTF-16 buffer in case of that. It may be UTF-32, for example. Or it may be something entirely unrelated to Unicode, even if its size is 16 bit.

Leo Yang

Comment 9 2010-05-12 18:48:40 PDT

(In reply to comment #8) > > Yes, wchar_t is not always UTF-16, so I copy the whart_t result to UTF-16 buffer in case of that. > > It may be UTF-32, for example. Or it may be something entirely unrelated to Unicode, even if its size is 16 bit. If it's UTF-32, we could only static_cast to UTF-16, and may loss data, can we handle UTF-32 using JSString? For wchar_t itself, it can be everything, but as result of mbstowcs, can it be everything instead of UTF-16/UTF-32?

Leo Yang

Comment 10 2010-05-12 23:47:19 PDT

Created attachment 55949 [details] patch Follows review comments to address the following issues. 1. Minor style issue 2. wchar_t to unicode issue

Darin Adler

Comment 11 2010-05-13 13:21:09 PDT

Comment on attachment 55949 [details] patch Looking better! > +#ifdef __STDC_ISO_10646__ This seems like an unusual way to detect the presence of the mbstowcs function along with the fact that wide characters are Unicode. Is this really defined consistently everywhere we need to use mbstowcs and nowhere we don't? > + // Convert multi-byte result to unicode. Unicode should be capitalized, since it's a proper noun. > + for (size_t i = 0; i < length; ++i) > + buffer[i] = static_cast<UChar>(tempbuffer[i]); This needs a comment saying this assumes the characters are all representable by a single UTF-16 code point. That is not a safe assumption in general but we are probably OK for all the cases we will encounter in practice. If any character is a non-BMP one, we'll give the wrong result. I suggest you also add a comment explaining that if mbstowcs fails we just fall back on using the 8-bit characters as-is. And explain why and when that is ever the right thing to do. I think it's never helpful, myself, and it seems a little strange to leave that code path in there. I'll say r=me but I'd appreciate follow-up on the issues I mentioned here.

Leo Yang

Comment 12 2010-05-13 19:25:33 PDT

Created attachment 56043 [details] patch Follow review comments to add more code comments.

Leo Yang

Comment 13 2010-05-13 19:40:56 PDT

Created attachment 56046 [details] patch Follow review comments

Adam Barth

Comment 14 2010-05-15 00:05:23 PDT

Comment on attachment 56046 [details] patch In the future, it would be helpful if you included more information in your ChangeLog.

WebKit Commit Bot

Comment 15 2010-05-15 11:38:56 PDT

Comment on attachment 56046 [details] patch Clearing flags on attachment: 56046 Committed r59545: <http://trac.webkit.org/changeset/59545>

WebKit Commit Bot

Comment 16 2010-05-15 11:39:02 PDT

All reviewed patches have been landed. Closing bug.

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version 528+ (Nightly build)

Hardware PC

OS Linux

Product WebKit

Component JavaScriptCore

Assignee

Nobody

Reported

2010-05-11 00:26 PDT

Modified

2010-05-15 11:39 PDT History

CC List

5 users Show

URL

Keywords

Depends on

Blocks