Bug 12506

Summary: REGRESSION: Safari doesn't display hebrew text on a web page, displayed correctly on Tiger
Product: WebKit Reporter: Darin Adler <darin>
Component: TextAssignee: Alexey Proskuryakov <ap>
Status: RESOLVED FIXED    
Severity: Normal CC: ap
Priority: P1 Keywords: InRadar, Regression
Version: 420+   
Hardware: Mac   
OS: OS X 10.4   
Attachments:
Description Flags
proposed fix
none
fix the fix darin: review+

Darin Adler
Reported 2007-01-31 09:48:47 PST
<GMT14-Nov-2006 21:30:26GMT> Dary Mihova: * SUMMARY: The Hebrew text on the web page http://www.icellcom.co.il/javagames/ is displayed unreadable from Leopard 9a302 Safari. The same web page is displayed without any problems on Tiger Safari 2.0.4 (419.3). * STEPS TO REPRODUCE: Install Tiger 10.48. Open Safari and go to the above URL => the Hebrew text is displayed correctly (the left side of the attached screenshot) Install Leopard 9a302 Open Safari and go to the above URL => the Hebrew text is broken (the right side of the attached screenshot) <GMT16-Nov-2006 17:34:00GMT> Dave Wiley: FYI <GMT16-Nov-2006 20:14:27GMT> Ping Huey: Active on 9A305. <GMT27-Nov-2006 21:49:19GMT> Stephanie Lewis: Safari BRB Reviewed <GMT29-Nov-2006 22:44:18GMT> Oliver Hunt: This looks like it could be us incorrectly assuming ASCII over UTF-8, which seems odd <GMT01-Dec-2006 21:24:35GMT> Jill Surdzial: Intl BRB: Setting Intl Blocker/P1. <GMT15-Jan-2007 22:18:27GMT> Alice Liu: Safari beta blocker reviewed <GMT26-Jan-2007 01:56:51GMT> Darin Adler: I don't think it's right to say that this is blocked by bug 4892428. This specific bug is happening because we're confused by text in the a <meta> tag that looks like a tag to us. We could easily fix this specific problem and I'd like to see why this is a regression from the old code. Bug 4892428 is a far broader one and not a regression. <GMT26-Jan-2007 01:57:18GMT> Darin Adler: The key question here is: why did the Tiger version of the charset sniffer work better on this page? <rdar://problem/4836738>
Attachments
proposed fix (4.57 KB, patch)
2007-01-31 11:32 PST, Alexey Proskuryakov
no flags
fix the fix (2.43 KB, patch)
2007-02-01 11:16 PST, Alexey Proskuryakov
darin: review+
Alexey Proskuryakov
Comment 1 2007-01-31 10:15:48 PST
(In reply to comment #0) > The key question here is: why did the Tiger version of the charset sniffer work > better on this page? In Tiger, all unknown tags were allowed in HEAD: int id = khtml::getTagID(tmp, len); if(end) id += ID_CLOSE_TAG; ... case (ID_META+ID_CLOSE_TAG): case ID_SCRIPT: case (ID_SCRIPT+ID_CLOSE_TAG): ...other tags explicitly allowed in HEAD... case ID_HTML: case ID_HEAD: case 0: case (0 + ID_CLOSE_TAG ):     break; default:     body = true; I don't think we want to re-add this behavior.
Alexey Proskuryakov
Comment 2 2007-01-31 10:24:46 PST
(In reply to comment #1) > I don't think we want to re-add this behavior. At the same time, I suspect the change was non-intentional (likely made at the time when we switched from tag IDs to AtomicStrings).
Alexey Proskuryakov
Comment 3 2007-01-31 11:32:33 PST
Created attachment 12830 [details] proposed fix This patch fixes more than the regression - known tags hidden in attributes don't fool the sniffer anymore, too.
Darin Adler
Comment 4 2007-01-31 12:12:31 PST
Comment on attachment 12830 [details] proposed fix I believe that the HTML parser is tolerant of spurious quote marks in tags attribute lists -- see line 927 of HTMLTokenizer.cpp, the part that says "ignore any quotes we encounter and treat them like spaces". This code is strict about that. I'd like to see some more test cases that cover these strange quoting cases to see if there are any repercussions of that. r=me
Sam Weinig
Comment 5 2007-01-31 20:36:06 PST
Landed in r19324.
Alexey Proskuryakov
Comment 6 2007-01-31 21:21:43 PST
I'm going to look into how the real parser handles quote marks; will re-open if the fix needs to be corrected.
Darin Adler
Comment 7 2007-02-01 09:37:14 PST
Turns out the only reason the test case was succeeding was a bug in the code! The patch did not fix the bug, as demonstrated by the layout test.
Darin Adler
Comment 8 2007-02-01 09:38:44 PST
Comment on attachment 12830 [details] proposed fix Clearing review flag on this, since it was already landed.
Alexey Proskuryakov
Comment 9 2007-02-01 10:03:16 PST
(In reply to comment #7) > Turns out the only reason the test case was succeeding was a bug in the code! > The patch did not fix the bug, as demonstrated by the layout test. I have a fix for this; looking into whether the issue with spurious quote marks can be fixed without copying most of HTMLTokenizer.
Alexey Proskuryakov
Comment 10 2007-02-01 11:16:12 PST
Created attachment 12857 [details] fix the fix The spurious quotes quirk is very context-dependent, I don't see a way to support it without stateful code similar to that found in HTMLTokenizer. Maybe we should find a way to re-use that.
Darin Adler
Comment 11 2007-02-01 18:14:16 PST
Comment on attachment 12857 [details] fix the fix r=me
Alexey Proskuryakov
Comment 12 2007-02-01 22:03:23 PST
Committed revision 19348.
Note You need to log in before you can comment on or make changes to this bug.