Bug 38890

Summary: Multi-byte issue in JavaScriptCore formatLocaleDate
Product: WebKit Reporter: Leo Yang <leo.yang>
Component: JavaScriptCoreAssignee: Nobody <webkit-unassigned>
Status: RESOLVED FIXED    
Severity: Normal CC: ap, charles.wei, commit-queue, staikos, yong.li.webkit
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: PC   
OS: Linux   
Attachments:
Description Flags
patch
none
patch
darin: review-
patch
none
patch
none
patch none

Leo Yang
Reported 2010-05-11 00:26:19 PDT
formatLocaleDate (non-Mac platform implement) in JavaScriptCore/runtime/DatePrototype.cpp doesn't take multi-byte into account. Considering the follow html code for LC_TIME="zh_CN.UTF-8", formatLocaleDate behaves incorrect (junk result) <html> <head> <meta http-equiv="Content-Type" Content="text/html; Charset=utf-8"> </head> <body> <script> var now = new Date(); document.write(now.toLocaleString()); </script> </body> </html>
Attachments
patch (2.40 KB, patch)
2010-05-11 00:41 PDT, Leo Yang
no flags
patch (2.42 KB, patch)
2010-05-11 01:01 PDT, Leo Yang
darin: review-
patch (1.97 KB, patch)
2010-05-12 23:47 PDT, Leo Yang
no flags
patch (2.47 KB, patch)
2010-05-13 19:25 PDT, Leo Yang
no flags
patch (2.48 KB, patch)
2010-05-13 19:40 PDT, Leo Yang
no flags
Leo Yang
Comment 1 2010-05-11 00:41:34 PDT
Leo Yang
Comment 2 2010-05-11 01:01:37 PDT
Darin Adler
Comment 3 2010-05-11 10:51:43 PDT
Comment on attachment 55674 [details] patch What's the guarantee that wide characters are UTF-16? This whole patch seems to be based on that assumption. There needs to be at least a comment indicating why that's so. > + // Convert multibyte result to wide char Sentence comments should use whole words and periods. So please call this "wide characters" and use a period. > + UChar buffer[128]; Would be better to the value bufsize that is already here instead of repeating the 128 here. > + size_t len; We don't abbreviate variable names like this. > + // Because sizeof can't be used during preprocessing, we do runtime check. > + if (sizeof(UChar) == sizeof(wchar_t)) > + len = mbstowcs((wchar_t*)buffer, timebuffer, sizeof(buffer) / sizeof(buffer[0]) - 1); We don't use C-style casts, so this should be reinterpret_cast. Do we really need the optimized case? It seems fine to always copy from a wchar_t buffer into a UChar buffer, so I suggest removing the special case loop. > + wchar_t tempbuffer[128]; Would be better to use bufsize instead of repeating the 128 here. We normally don't ram words together. The existing name "timebuffer" is not our WebKit style. I'd call this "wideCharacterBuffer". > + len = mbstowcs(tempbuffer, timebuffer, sizeof(tempbuffer) / sizeof(tempbuffer[0]) -1); There should be a space after the "-" here. > + for (size_t i = 0; i < len && len != (size_t)(-1); ++i) > + buffer[i] = (UChar)tempbuffer[i]; Checking that length is not -1 every time through the loop doesn't make sense. That should be a separate if statement. We don't use C-style casts, so this should be no cast at all, or a cast to UChar if the cast is needed for some reason. > + if (UNLIKELY(len == (size_t)(-1))) This should be static_cast. There's no need to use UNLIKELY for this, because it's not so performance-critical that we need to optimize that much. The work here dwarfs that single patch. review- because of these minor issues, but there also is the issue that wchar_t is not always UTF-16, which seems like a major one.
Yong Li
Comment 4 2010-05-11 11:33:30 PDT
mbstowcs: it's a standard c function, but I think it would be nice if we use WTF::Unicode layer for this, as other Unicode related functions are also gathered there.
Leo Yang
Comment 5 2010-05-11 19:42:58 PDT
(In reply to comment #4) > mbstowcs: it's a standard c function, but I think it would be nice if we use WTF::Unicode layer for this, as other Unicode related functions are also gathered there. Yes. It's better to do so. If we do it, we need to add interface and ask each platform to implement. We need also to propagate this bug to every platform. And most platform implementation will just call mbstowcs. Is it necessary to make things complex? If a platform has no mbstowcs, just implement it, we can treat mbstowcs as an interface :)
Leo Yang
Comment 6 2010-05-11 20:38:09 PDT
(In reply to comment #3) > (From update of attachment 55674 [details]) > What's the guarantee that wide characters are UTF-16? This whole patch seems to be based on that assumption. There needs to be at least a comment indicating why that's so. > > > + // Convert multibyte result to wide char > > Sentence comments should use whole words and periods. So please call this "wide characters" and use a period. > > > + UChar buffer[128]; > > Would be better to the value bufsize that is already here instead of repeating the 128 here. > > > + size_t len; > > We don't abbreviate variable names like this. > > > + // Because sizeof can't be used during preprocessing, we do runtime check. > > + if (sizeof(UChar) == sizeof(wchar_t)) > > + len = mbstowcs((wchar_t*)buffer, timebuffer, sizeof(buffer) / sizeof(buffer[0]) - 1); > > We don't use C-style casts, so this should be reinterpret_cast. > > Do we really need the optimized case? It seems fine to always copy from a wchar_t buffer into a UChar buffer, so I suggest removing the special case loop. > > > + wchar_t tempbuffer[128]; > > Would be better to use bufsize instead of repeating the 128 here. We normally don't ram words together. The existing name "timebuffer" is not our WebKit style. I'd call this "wideCharacterBuffer". > > > + len = mbstowcs(tempbuffer, timebuffer, sizeof(tempbuffer) / sizeof(tempbuffer[0]) -1); > > There should be a space after the "-" here. > > > + for (size_t i = 0; i < len && len != (size_t)(-1); ++i) > > + buffer[i] = (UChar)tempbuffer[i]; > > Checking that length is not -1 every time through the loop doesn't make sense. That should be a separate if statement. > > We don't use C-style casts, so this should be no cast at all, or a cast to UChar if the cast is needed for some reason. > > > + if (UNLIKELY(len == (size_t)(-1))) > > This should be static_cast. > > There's no need to use UNLIKELY for this, because it's not so performance-critical that we need to optimize that much. The work here dwarfs that single patch. > > review- because of these minor issues, but there also is the issue that wchar_t is not always UTF-16, which seems like a major one. Will follow all above minor issues. Yes, wchar_t is not always UTF-16, so I copy the whart_t result to UTF-16 buffer in case of that. In theory, there might be data loss, but datetime related words in different languages should be aligned in to UTF-16; if not, can we handle it by JSString?
Leo Yang
Comment 7 2010-05-11 20:40:05 PDT
(In reply to comment #6) > (In reply to comment #3) > > (From update of attachment 55674 [details] [details]) > > What's the guarantee that wide characters are UTF-16? This whole patch seems to be based on that assumption. There needs to be at least a comment indicating why that's so. > > > > > + // Convert multibyte result to wide char > > > > Sentence comments should use whole words and periods. So please call this "wide characters" and use a period. > > > > > + UChar buffer[128]; > > > > Would be better to the value bufsize that is already here instead of repeating the 128 here. > > > > > + size_t len; > > > > We don't abbreviate variable names like this. > > > > > + // Because sizeof can't be used during preprocessing, we do runtime check. > > > + if (sizeof(UChar) == sizeof(wchar_t)) > > > + len = mbstowcs((wchar_t*)buffer, timebuffer, sizeof(buffer) / sizeof(buffer[0]) - 1); > > > > We don't use C-style casts, so this should be reinterpret_cast. > > > > Do we really need the optimized case? It seems fine to always copy from a wchar_t buffer into a UChar buffer, so I suggest removing the special case loop. > > > > > + wchar_t tempbuffer[128]; > > > > Would be better to use bufsize instead of repeating the 128 here. We normally don't ram words together. The existing name "timebuffer" is not our WebKit style. I'd call this "wideCharacterBuffer". > > > > > + len = mbstowcs(tempbuffer, timebuffer, sizeof(tempbuffer) / sizeof(tempbuffer[0]) -1); > > > > There should be a space after the "-" here. > > > > > + for (size_t i = 0; i < len && len != (size_t)(-1); ++i) > > > + buffer[i] = (UChar)tempbuffer[i]; > > > > Checking that length is not -1 every time through the loop doesn't make sense. That should be a separate if statement. > > > > We don't use C-style casts, so this should be no cast at all, or a cast to UChar if the cast is needed for some reason. > > > > > + if (UNLIKELY(len == (size_t)(-1))) > > > > This should be static_cast. > > > > There's no need to use UNLIKELY for this, because it's not so performance-critical that we need to optimize that much. The work here dwarfs that single patch. > > > > review- because of these minor issues, but there also is the issue that wchar_t is not always UTF-16, which seems like a major one. Will follow comments about all above minor issues. Yes, wchar_t is not always UTF-16, so I copy the whart_t result to UTF-16 buffer in case of that. In theory, there might be data loss, but datetime related words in different languages should be aligned in to UTF-16; if not, can we handle it by JSString?
Alexey Proskuryakov
Comment 8 2010-05-12 14:41:13 PDT
> Yes, wchar_t is not always UTF-16, so I copy the whart_t result to UTF-16 buffer in case of that. It may be UTF-32, for example. Or it may be something entirely unrelated to Unicode, even if its size is 16 bit.
Leo Yang
Comment 9 2010-05-12 18:48:40 PDT
(In reply to comment #8) > > Yes, wchar_t is not always UTF-16, so I copy the whart_t result to UTF-16 buffer in case of that. > > It may be UTF-32, for example. Or it may be something entirely unrelated to Unicode, even if its size is 16 bit. If it's UTF-32, we could only static_cast to UTF-16, and may loss data, can we handle UTF-32 using JSString? For wchar_t itself, it can be everything, but as result of mbstowcs, can it be everything instead of UTF-16/UTF-32?
Leo Yang
Comment 10 2010-05-12 23:47:19 PDT
Created attachment 55949 [details] patch Follows review comments to address the following issues. 1. Minor style issue 2. wchar_t to unicode issue
Darin Adler
Comment 11 2010-05-13 13:21:09 PDT
Comment on attachment 55949 [details] patch Looking better! > +#ifdef __STDC_ISO_10646__ This seems like an unusual way to detect the presence of the mbstowcs function along with the fact that wide characters are Unicode. Is this really defined consistently everywhere we need to use mbstowcs and nowhere we don't? > + // Convert multi-byte result to unicode. Unicode should be capitalized, since it's a proper noun. > + for (size_t i = 0; i < length; ++i) > + buffer[i] = static_cast<UChar>(tempbuffer[i]); This needs a comment saying this assumes the characters are all representable by a single UTF-16 code point. That is not a safe assumption in general but we are probably OK for all the cases we will encounter in practice. If any character is a non-BMP one, we'll give the wrong result. I suggest you also add a comment explaining that if mbstowcs fails we just fall back on using the 8-bit characters as-is. And explain why and when that is ever the right thing to do. I think it's never helpful, myself, and it seems a little strange to leave that code path in there. I'll say r=me but I'd appreciate follow-up on the issues I mentioned here.
Leo Yang
Comment 12 2010-05-13 19:25:33 PDT
Created attachment 56043 [details] patch Follow review comments to add more code comments.
Leo Yang
Comment 13 2010-05-13 19:40:56 PDT
Created attachment 56046 [details] patch Follow review comments
Adam Barth
Comment 14 2010-05-15 00:05:23 PDT
Comment on attachment 56046 [details] patch In the future, it would be helpful if you included more information in your ChangeLog.
WebKit Commit Bot
Comment 15 2010-05-15 11:38:56 PDT
Comment on attachment 56046 [details] patch Clearing flags on attachment: 56046 Committed r59545: <http://trac.webkit.org/changeset/59545>
WebKit Commit Bot
Comment 16 2010-05-15 11:39:02 PDT
All reviewed patches have been landed. Closing bug.
Note You need to log in before you can comment on or make changes to this bug.