Bug 26375

Summary: VC++ 2005 Express failed to build WebKit due to raw UTF-8 string in WebKit/win/WebCoreLocalizedStrings.cpp
Product: WebKit Reporter: Takeshi Yoshino <tyoshino>
Component: New BugsAssignee: David Levin <levin>
Status: RESOLVED FIXED    
Severity: Normal CC: abarth, hbono, levin
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: PC   
OS: OS X 10.5   
Bug Depends on:    
Bug Blocks: 27416    
Attachments:
Description Flags
Proposed fix for 26375
abarth: review-
Proposed fix for 26375 (rev 2)
none
Proposed fix for 26375 (rev 3)
eric: review-
Proposed fix for 26375 (rev 4) darin: review+

Description Takeshi Yoshino 2009-06-13 12:09:18 PDT
VC++ 2005 Express failed to build WebKit due to raw UTF-8 string in WebKit/win/WebCoreLocalizedStrings.cpp . It shows warning that the file contains some non ASCII characters (in my Japanese environment, it said it found some characters not in Windows-932 codepage). By default VC++ treats warnings as errors and stops building.

Actually, WebKit/win/WebCoreLocalizedStrings.cpp contains a multiplication sign (U+00D7) encoded in UTF-8 (0xC3 0x97).

Checkout is ToT r44649
Comment 1 Takeshi Yoshino 2009-06-13 12:10:54 PDT
Created attachment 31242 [details]
Proposed fix for 26375
Comment 2 Adam Barth 2009-06-13 23:08:42 PDT
Comment on attachment 31242 [details]
Proposed fix for 26375

LGTM.  Thanks.
Comment 3 Darin Adler 2009-06-14 11:21:44 PDT
Comment on attachment 31242 [details]
Proposed fix for 26375

This change is incompatible with the scripts that we uses to maintain the localized string files, so while it fixes compilation for the people using these development tools, it breaks WebKit development work flow for anyone using the localization scripts, unless we make some changes to the script used to extract localizable strings.
Comment 4 Adam Barth 2009-06-14 11:27:49 PDT
Comment on attachment 31242 [details]
Proposed fix for 26375

Marking as R- based on comment #3.  I'm surprised we've been able to get away with non-ASCII source files thus far.  Maybe we should change the scripts?
Comment 5 Darin Adler 2009-06-14 11:48:53 PDT
(In reply to comment #4)
> Marking as R- based on comment #3. I'm surprised we've been able to get away
> with non-ASCII source files thus far. Maybe we should change the scripts?

Until this point in the project, all the compilers we support were consistent in that in C strings (non-wide) they allowed non-ASCII characters, including the same raw bytes in the executable that were in the source file. This meant we were free to use UTF-8 for all strings, and not do any special escaping.

I'm disappointed to find we need to support compilers incompatible with this technique.

If we decide that this is no longer allowed, then we need to teach the scripts to deal with some sort of all-ASCII syntax. I think actual UTF-8 sequences with \x for each byte are a really ugly choice; it would be nice to have something that was aesthetically pleasing. I don't have any immediate good ideas at the moment about how to resolve this.
Comment 6 Darin Adler 2009-06-14 11:49:47 PDT
Perhaps the issue is entirely with warnings. Maybe we can fix this by turning off the warning. Or is the compiler actually trying to transcode the string?
Comment 7 Adam Barth 2009-06-14 13:26:08 PDT
Other projects I've been involved with have used XML as the source representations of localized strings (which, of course, is natively UTF-8).
Comment 8 Takeshi Yoshino 2009-06-14 23:30:07 PDT
I see. So, for now, may I put some comments into the code to let the Windows-based developer know the workaround, e.g. URL to this bug entry? I'll send a new patch to do so, later.
Comment 9 Takeshi Yoshino 2009-06-17 01:27:01 PDT
Created attachment 31405 [details]
Proposed fix for 26375 (rev 2)
Comment 10 David Levin 2009-06-18 16:00:12 PDT
*** Bug 21849 has been marked as a duplicate of this bug. ***
Comment 11 Eric Seidel (no email) 2009-06-18 18:23:55 PDT
Could you please post the warning number, and the description (even if it's in Japanese).
Comment 12 Takeshi Yoshino 2009-06-18 19:32:07 PDT
(In reply to comment #11)
> Could you please post the warning number, and the description (even if it's in
> Japanese).
> 

Here's the warning.

5>WebCoreLocalizedStrings.cpp
5>..\WebCoreLocalizedStrings.cpp : error C2220: warning treated as error - no 'object' file generated
5>..\WebCoreLocalizedStrings.cpp : warning C4819: The file contains a character that cannot be represented in the current code page (932). S
ave the file in Unicode format to prevent data loss
5>Project : warning PRJ0018 : The following environment variables were not found:
5>$(PRODUCTION)
5>Build log was saved at "file://E:\wk1_svn_cygwin\WebKitBuild\obj\WebKit\Debug\BuildLog.htm"
5>WebKit - 1 error(s), 1 warning(s)
========== Build: 4 succeeded, 1 failed, 13 up-to-date, 0 skipped ==========
Comment 13 Takeshi Yoshino 2009-06-18 19:36:15 PDT
Created attachment 31531 [details]
Proposed fix for 26375 (rev 3)

Included the warning messages to the patch.
Comment 14 Eric Seidel (no email) 2009-06-21 23:51:55 PDT
What happens if we "Save the file in Unicode format"?  Does it just add a UTF8 BOM to the top of the file?
Comment 15 Takeshi Yoshino 2009-06-22 21:12:51 PDT
(In reply to comment #14)
> What happens if we "Save the file in Unicode format"?  Does it just add a UTF8
> BOM to the top of the file?
> 

I couldn't find what the message "Save the file in Unicode format" in C4819 (http://msdn.microsoft.com/en-us/library/ms173715.aspx) exactly means. However, Visual Studio 2005 Professional has a "Advanced Save Options" for its editor. I saw five encoding options in the dialog with "Unicode" in them.

So, I've tested building WebKit by saving WebCoreLocalizedString.cpp with each of them. Whether a BOM is added or not is also described.

Build mode used: Release (because run-safari in debug mode on Windows fails too frequently)
Check: Navigate to some image (http://www.google.co.jp/intl/ja_jp/images/logo.gif) and check the title

Save mode: Unicode (UTF-8 with signature) - Codepage 65001
The source was encoded by UTF-8 w/ BOM
Build: Successful
Result: localized string not found

Save mode: Unicode - Codepage 1200
The source was encoded by UTF-16 Little Endian w/o BOM
Build: Successful
Result: localized string not found

Save mode: Unicode (Big-Endian) - Codepage 1201
The source was encoded by UTF-16 Big-Endian w/o BOM
Build: Successful
Result: localized string not found

Save mode: Unicode (UTF-7) - Codepage 65000
The source was encoded by UTF-7
Build: Failure
4>WebCoreLocalizedStrings.cpp
4>..\WebCoreLocalizedStrings.cpp(1) : error C2059: syntax error : '/'
4>..\WebCoreLocalizedStrings.cpp(7) : error C2059: syntax error : '+'
4>..\WebCoreLocalizedStrings.cpp(13) : error C2137: empty character constant
4>..\WebCoreLocalizedStrings.cpp(35) : error C2653: 'WebCore' : is not a class or namespace name

Save mode: Unicode (UTF-8 without signature) - Codepage 65001
The source was encoded by UTF-8 w/o BOM
Build: Failure. We need to remove /Wx option (take warnings as errors)
Build w/o /Wx: Successful
Result: localized string not found

None of them fixed the problem. I saw "logo.gif 276×110 pixels image" on the title bar only when I replace the character with escape sequences (\xC3\x97).

Thanks,
Comment 16 Eric Seidel (no email) 2009-06-23 22:57:06 PDT
Comment on attachment 31531 [details]
Proposed fix for 26375 (rev 3)

Wow!  Thank you for the awesome investigation.

Unfortunately the current patch doesn't seem to actually have the code change, just a comment?

This is the only log message you need in the ChangeLog:
WebCoreLocalizedStrings.cpp : warning C4819: The file contains a character that cannot be represented in the current code page (932). Save the file in Unicode format to prevent data loss

The rest of the build output you can remove from the ChangeLog.

r- for lack of code change.  Otherwise I'd r+ this.  Thanks again for the investigation!
Comment 17 Eric Seidel (no email) 2009-06-23 22:57:35 PDT
Comment on attachment 31531 [details]
Proposed fix for 26375 (rev 3)

oops, meant to r- it until a new patch is posted with the code change (and adjusted ChangeLog)
Comment 18 Takeshi Yoshino 2009-06-23 23:17:35 PDT
(In reply to comment #16)
> (From update of attachment 31531 [details] [review])
> Wow!  Thank you for the awesome investigation.
> 
> Unfortunately the current patch doesn't seem to actually have the code change,
> just a comment?
> 
> This is the only log message you need in the ChangeLog:
> WebCoreLocalizedStrings.cpp : warning C4819: The file contains a character that
> cannot be represented in the current code page (932). Save the file in Unicode
> format to prevent data loss
> 
> The rest of the build output you can remove from the ChangeLog.
> 

OK. Thanks.

> r- for lack of code change.  Otherwise I'd r+ this.  Thanks again for the
> investigation!
> 

As Darin said, if I make code change on WebCoreLocalizedStrings.cpp without making change on extract-localized-strings script, I'll break Localized.strings generation process. Do you mean I should make change on extract-localized-strings instead of putting a comment?

Thanks,
Comment 19 Takeshi Yoshino 2009-07-20 21:53:21 PDT
Created attachment 33143 [details]
Proposed fix for 26375 (rev 4)

Wrote a change on WebKitTools/Scripts/extract-localizable-strings so
that replacing a raw string in UI_STRING with hexadecimal escape
sequences won't break localized string generation process.
Comment 20 David Levin 2009-07-22 14:23:19 PDT
Assign to levin for landing.
Comment 21 David Levin 2009-07-22 15:11:53 PDT
Committed as http://trac.webkit.org/changeset/46245
Comment 22 Takeshi Yoshino 2009-08-06 10:55:06 PDT
(In reply to comment #21)
> Committed as http://trac.webkit.org/changeset/46245

Thank you!