Bug 12506 - REGRESSION: Safari doesn't display hebrew text on a web page, displayed correctly on Tiger
Summary: REGRESSION: Safari doesn't display hebrew text on a web page, displayed corre...
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Text (show other bugs)
Version: 420+
Hardware: Mac OS X 10.4
: P1 Normal
Assignee: Alexey Proskuryakov
URL:
Keywords: InRadar, Regression
Depends on:
Blocks:
 
Reported: 2007-01-31 09:48 PST by Darin Adler
Modified: 2007-02-01 22:03 PST (History)
1 user (show)

See Also:


Attachments
proposed fix (4.57 KB, patch)
2007-01-31 11:32 PST, Alexey Proskuryakov
no flags Details | Formatted Diff | Diff
fix the fix (2.43 KB, patch)
2007-02-01 11:16 PST, Alexey Proskuryakov
darin: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Darin Adler 2007-01-31 09:48:47 PST
<GMT14-Nov-2006 21:30:26GMT> Dary Mihova:
* SUMMARY:
The Hebrew text on the web page 
http://www.icellcom.co.il/javagames/
is displayed unreadable from Leopard 9a302 Safari.
The same web page is displayed without any problems on Tiger Safari 2.0.4 (419.3).

* STEPS TO REPRODUCE:
Install Tiger 10.48.
Open Safari and go to the above URL => the Hebrew text is displayed correctly (the left side of the attached screenshot)

Install Leopard 9a302
Open Safari and go to the above URL => the Hebrew text is broken (the right side of the attached screenshot)

<GMT16-Nov-2006 17:34:00GMT> Dave Wiley:
FYI

<GMT16-Nov-2006 20:14:27GMT> Ping Huey:
Active on 9A305.

<GMT27-Nov-2006 21:49:19GMT> Stephanie Lewis:
Safari BRB Reviewed

<GMT29-Nov-2006 22:44:18GMT> Oliver Hunt:
This looks like it could be us incorrectly assuming ASCII over UTF-8, which seems odd

<GMT01-Dec-2006 21:24:35GMT> Jill Surdzial:
Intl BRB:  Setting Intl Blocker/P1.

<GMT15-Jan-2007 22:18:27GMT> Alice Liu:
Safari beta blocker reviewed

<GMT26-Jan-2007 01:56:51GMT> Darin Adler:
I don't think it's right to say that this is blocked by bug 4892428.

This specific bug is happening because we're confused by text in the a <meta> tag that looks like a tag to us. We could easily fix this specific problem and I'd like to see why this is a regression from the old code. Bug 4892428 is a far broader one and not a regression.

<GMT26-Jan-2007 01:57:18GMT> Darin Adler:
The key question here is: why did the Tiger version of the charset sniffer work better on this page?

<rdar://problem/4836738>
Comment 1 Alexey Proskuryakov 2007-01-31 10:15:48 PST
(In reply to comment #0)
> The key question here is: why did the Tiger version of the charset sniffer work
> better on this page?

In Tiger, all unknown tags were allowed in HEAD:

  int id = khtml::getTagID(tmp, len);
  if(end) id += ID_CLOSE_TAG;
...
  case (ID_META+ID_CLOSE_TAG):
  case ID_SCRIPT:
  case (ID_SCRIPT+ID_CLOSE_TAG):
...other tags explicitly allowed in HEAD...
  case ID_HTML:
  case ID_HEAD:
  case 0:
  case (0 + ID_CLOSE_TAG ):
    break;
  default:
    body = true;

I don't think we want to re-add this behavior.
Comment 2 Alexey Proskuryakov 2007-01-31 10:24:46 PST
(In reply to comment #1)
> I don't think we want to re-add this behavior.

At the same time, I suspect the change was non-intentional (likely made at the time when we switched from tag IDs to AtomicStrings).
Comment 3 Alexey Proskuryakov 2007-01-31 11:32:33 PST
Created attachment 12830 [details]
proposed fix

This patch fixes more than the regression - known tags hidden in attributes don't fool the sniffer anymore, too.
Comment 4 Darin Adler 2007-01-31 12:12:31 PST
Comment on attachment 12830 [details]
proposed fix

I believe that the HTML parser is tolerant of spurious quote marks in tags attribute lists -- see line 927 of HTMLTokenizer.cpp, the part that says "ignore any quotes we encounter and treat them like spaces". This code is strict about that. I'd like to see some more test cases that cover these strange quoting cases to see if there are any repercussions of that.

r=me
Comment 5 Sam Weinig 2007-01-31 20:36:06 PST
Landed in r19324.
Comment 6 Alexey Proskuryakov 2007-01-31 21:21:43 PST
I'm going to look into how the real parser handles quote marks; will re-open if the fix needs to be corrected.
Comment 7 Darin Adler 2007-02-01 09:37:14 PST
Turns out the only reason the test case was succeeding was a bug in the code! The patch did not fix the bug, as demonstrated by the layout test.
Comment 8 Darin Adler 2007-02-01 09:38:44 PST
Comment on attachment 12830 [details]
proposed fix

Clearing review flag on this, since it was already landed.
Comment 9 Alexey Proskuryakov 2007-02-01 10:03:16 PST
(In reply to comment #7)
> Turns out the only reason the test case was succeeding was a bug in the code!
> The patch did not fix the bug, as demonstrated by the layout test.

I have a fix for this; looking into whether the issue with spurious quote marks can be fixed without copying most of HTMLTokenizer.
Comment 10 Alexey Proskuryakov 2007-02-01 11:16:12 PST
Created attachment 12857 [details]
fix the fix

The spurious quotes quirk is very context-dependent, I don't see a way to support it without stateful code similar to that found in HTMLTokenizer. Maybe we should find a way to re-use that.
Comment 11 Darin Adler 2007-02-01 18:14:16 PST
Comment on attachment 12857 [details]
fix the fix

r=me
Comment 12 Alexey Proskuryakov 2007-02-01 22:03:23 PST
Committed revision 19348.