Bug 113001

Summary: Some Hebrew diacritics get messed up on form submission
Product: WebKit
Reporter: Konstantin <ikn>
Component: Forms
Assignee: Nobody <webkit-unassigned>
Status: RESOLVED INVALID
Severity: Normal
CC: ap, rniwa
Priority: P3    
Version: 528+ (Nightly build)   
Hardware: All   
OS: All   
URL: http://zapad.org/~ignatiev/temp/w4.php
Attachments:
Description: Source of PHP script to reproduce the problem
Flags: none

Description Konstantin 2013-03-21 21:40:48 PDT
Created attachment 194439: Source of PHP script to reproduce the problem

When I submit any form with a text field containing the Hebrew diacritics U+05BC ("dagesh") and U+05B6 ("segol"), in this order, they are submitted to the server in the *opposite* order: U+05B6, U+05BC. While the Hebrew word looks the same visually, this "fixed" order is invalid (or at least non-standard), and regardless, a browser obviously shouldn't change data entered into a form on its own, under any circumstances.

To demonstrate this issue, I wrote a simple PHP script (attached, and available online at http://zapad.org/~ignatiev/temp/w4.php) that lets the user fill in a text field and then, upon form submission, compares the user's input with what was actually submitted (via a simple JavaScript hash sum implementation). You can play with it and see that it works fine for almost any text in any language you can enter.

If, however, you use the "initialize" button, the script will initialize the text field to the string '\u05d1\u05bc\u05b5' (bet-dagesh-segol), and upon form submission the comparison test will FAIL; the value submitted will be '\u05d1\u05b5\u05bc' (bet-segol-dagesh).
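
For readers who want to see the difference between the two strings without running the PHP page, here is a minimal Python sketch that lists the code points on each side. The variable and function names (`entered`, `received`, `describe`) are made up for illustration and are not taken from the attached script. Note that the Unicode character database names U+05B5 as HEBREW POINT TSERE; HEBREW POINT SEGOL is U+05B6.

```python
import unicodedata

# The two strings quoted in the report (names are illustrative only).
entered = "\u05d1\u05bc\u05b5"   # what the "initialize" button puts in the field
received = "\u05d1\u05b5\u05bc"  # what the server reportedly gets back

def describe(text):
    """Return (code point, Unicode name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

for label, value in (("entered", entered), ("received", received)):
    print(label)
    for code_point, name in describe(value):
        print(f"  {code_point}  {name}")

# Same characters, different order: only the two marks are swapped.
same_characters = sorted(entered) == sorted(received)
print("same characters:", same_characters, "| identical:", entered == received)
```

Any byte-level or hash-based comparison of the two values will therefore fail, even though they contain the same three characters.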

This problem is reproducible in every WebKit-based browser I tried (Chrome on Windows/Mac, Safari on Mac/Windows/iPhone, the Debian 6 "Web browser", and the latest nightly build compiled from source on Linux/GTK), while it works fine in IE, Firefox, and (Presto-based) Opera.
Comment 1 Alexey Proskuryakov 2013-03-26 11:58:08 PDT
> this "fixed" order is invalid (or at least non-standard) 

In fact, '\u05d1\u05bc\u05b5' is not properly normalized - both NFC and NFD forms for this string are '\u05d1\u05b5\u05bc'. Please see <http://unicode.org/reports/tr15/> for discussion of Unicode normalization forms.
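
The normalization claim can be checked outside the browser. The sketch below uses Python's standard `unicodedata` module (not WebKit code) to print the canonical combining class of each character and to normalize the string from the report. Canonical ordering sorts adjacent combining marks by ascending class, which is why the mark with class 15 ends up before the dagesh (class 21).

```python
import unicodedata

original = "\u05d1\u05bc\u05b5"  # string from the report

# Canonical combining class (ccc) of each character; 0 means a base character.
for ch in original:
    print(f"U+{ord(ch):04X} ccc={unicodedata.combining(ch)}")
# U+05D1 ccc=0
# U+05BC ccc=21
# U+05B5 ccc=15

nfc = unicodedata.normalize("NFC", original)
nfd = unicodedata.normalize("NFD", original)

def escaped(text):
    """Render a string as \\uXXXX escapes for easy comparison with the report."""
    return "".join(f"\\u{ord(ch):04x}" for ch in text)

print("NFC:", escaped(nfc))
print("NFD:", escaped(nfd))
print("already normalized:", original == nfc)
```

Both forms come out as '\u05d1\u05b5\u05bc', matching the value the reporter saw on the server.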

Overall, this is expected behavior.

The reason why we normalize to NFC when sending form text is compatibility - since Windows uses NFC everywhere, there can be subtle errors when the text sent from WebKit gets processed by systems that don't work with decomposed text well.
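
As a rough model of the behavior described here (an illustration only, not WebKit's implementation), the following Python sketch normalizes field values to NFC before URL-encoding them as UTF-8, then decodes the body the way a server might. The helper `encode_form` and the field name `text` are hypothetical.

```python
import unicodedata
from urllib.parse import urlencode, parse_qs

def encode_form(fields, normalize=True):
    """URL-encode form fields as UTF-8, optionally normalizing values to NFC first.

    Hypothetical helper: models the described behavior, not actual browser code.
    """
    if normalize:
        fields = {name: unicodedata.normalize("NFC", value)
                  for name, value in fields.items()}
    return urlencode(fields)

typed = "\u05d1\u05bc\u05b5"  # value in the text field

normalized_body = encode_form({"text": typed})
raw_body = encode_form({"text": typed}, normalize=False)
print(normalized_body)  # text=%D7%91%D6%B5%D6%BC
print(raw_body)         # text=%D7%91%D6%BC%D6%B5

# What a server would decode from the normalized body.
received = parse_qs(normalized_body)["text"][0]
print("round-trips unchanged:", received == typed)
```

With normalization on, the last two encoded characters are swapped relative to what was typed, which is the mismatch the test page reports.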

I can see how in this specific case WebKit becomes an outlier, but this is the cost of being like other browsers in more common cases.
Comment 2 Alexey Proskuryakov 2013-07-31 09:46:15 PDT
*** Bug 119320 has been marked as a duplicate of this bug. ***