Bug 12506

Summary:

REGRESSION: Safari doesn't display hebrew text on a web page, displayed correctly on Tiger

Product:

WebKit

Reporter:

Darin Adler <darin>

Component:

Text

Assignee:

Alexey Proskuryakov <ap>

Status:

RESOLVED FIXED

Severity:

Normal

CC:

Priority:

Keywords:

InRadar, Regression

Version:

420+

Hardware:

Mac

OS:

OS X 10.4

Attachments:

Description	Flags
proposed fix	none
fix the fix	darin: review+

Darin Adler

Reported 2007-01-31 09:48:47 PST

<GMT14-Nov-2006 21:30:26GMT> Dary Mihova: * SUMMARY: The Hebrew text on the web page http://www.icellcom.co.il/javagames/ is displayed unreadable from Leopard 9a302 Safari. The same web page is displayed without any problems on Tiger Safari 2.0.4 (419.3). * STEPS TO REPRODUCE: Install Tiger 10.48. Open Safari and go to the above URL => the Hebrew text is displayed correctly (the left side of the attached screenshot) Install Leopard 9a302 Open Safari and go to the above URL => the Hebrew text is broken (the right side of the attached screenshot) <GMT16-Nov-2006 17:34:00GMT> Dave Wiley: FYI <GMT16-Nov-2006 20:14:27GMT> Ping Huey: Active on 9A305. <GMT27-Nov-2006 21:49:19GMT> Stephanie Lewis: Safari BRB Reviewed <GMT29-Nov-2006 22:44:18GMT> Oliver Hunt: This looks like it could be us incorrectly assuming ASCII over UTF-8, which seems odd <GMT01-Dec-2006 21:24:35GMT> Jill Surdzial: Intl BRB: Setting Intl Blocker/P1. <GMT15-Jan-2007 22:18:27GMT> Alice Liu: Safari beta blocker reviewed <GMT26-Jan-2007 01:56:51GMT> Darin Adler: I don't think it's right to say that this is blocked by bug 4892428. This specific bug is happening because we're confused by text in the a <meta> tag that looks like a tag to us. We could easily fix this specific problem and I'd like to see why this is a regression from the old code. Bug 4892428 is a far broader one and not a regression. <GMT26-Jan-2007 01:57:18GMT> Darin Adler: The key question here is: why did the Tiger version of the charset sniffer work better on this page? <rdar://problem/4836738>

Attachments
proposed fix (4.57 KB, patch) 2007-01-31 11:32 PST, Alexey Proskuryakov	no flags	Details Formatted Diff Diff
fix the fix (2.43 KB, patch) 2007-02-01 11:16 PST, Alexey Proskuryakov	darin: review+	Details Formatted Diff Diff
View All Add attachment proposed patch, testcase, etc.

Alexey Proskuryakov

Comment 1 2007-01-31 10:15:48 PST

(In reply to comment #0) > The key question here is: why did the Tiger version of the charset sniffer work > better on this page? In Tiger, all unknown tags were allowed in HEAD: int id = khtml::getTagID(tmp, len); if(end) id += ID_CLOSE_TAG; ... case (ID_META+ID_CLOSE_TAG): case ID_SCRIPT: case (ID_SCRIPT+ID_CLOSE_TAG): ...other tags explicitly allowed in HEAD... case ID_HTML: case ID_HEAD: case 0: case (0 + ID_CLOSE_TAG ): break; default: body = true; I don't think we want to re-add this behavior.

Alexey Proskuryakov

Comment 2 2007-01-31 10:24:46 PST

(In reply to comment #1) > I don't think we want to re-add this behavior. At the same time, I suspect the change was non-intentional (likely made at the time when we switched from tag IDs to AtomicStrings).

Alexey Proskuryakov

Comment 3 2007-01-31 11:32:33 PST

Created attachment 12830 [details] proposed fix This patch fixes more than the regression - known tags hidden in attributes don't fool the sniffer anymore, too.

Darin Adler

Comment 4 2007-01-31 12:12:31 PST

Comment on attachment 12830 [details] proposed fix I believe that the HTML parser is tolerant of spurious quote marks in tags attribute lists -- see line 927 of HTMLTokenizer.cpp, the part that says "ignore any quotes we encounter and treat them like spaces". This code is strict about that. I'd like to see some more test cases that cover these strange quoting cases to see if there are any repercussions of that. r=me

Sam Weinig

Comment 5 2007-01-31 20:36:06 PST

Landed in r19324.

Alexey Proskuryakov

Comment 6 2007-01-31 21:21:43 PST

I'm going to look into how the real parser handles quote marks; will re-open if the fix needs to be corrected.

Darin Adler

Comment 7 2007-02-01 09:37:14 PST

Turns out the only reason the test case was succeeding was a bug in the code! The patch did not fix the bug, as demonstrated by the layout test.

Darin Adler

Comment 8 2007-02-01 09:38:44 PST

Comment on attachment 12830 [details] proposed fix Clearing review flag on this, since it was already landed.

Alexey Proskuryakov

Comment 9 2007-02-01 10:03:16 PST

(In reply to comment #7) > Turns out the only reason the test case was succeeding was a bug in the code! > The patch did not fix the bug, as demonstrated by the layout test. I have a fix for this; looking into whether the issue with spurious quote marks can be fixed without copying most of HTMLTokenizer.

Alexey Proskuryakov

Comment 10 2007-02-01 11:16:12 PST

Created attachment 12857 [details] fix the fix The spurious quotes quirk is very context-dependent, I don't see a way to support it without stateful code similar to that found in HTMLTokenizer. Maybe we should find a way to re-use that.

Darin Adler

Comment 11 2007-02-01 18:14:16 PST

Comment on attachment 12857 [details] fix the fix r=me

Alexey Proskuryakov

Comment 12 2007-02-01 22:03:23 PST

Committed revision 19348.

Note You need to log in before you can comment on or make changes to this bug.