14608 – Please add UTF-8 support to Japanese encoding auto-detection

Status:	RESOLVED WONTFIX

Alias:	None

Product:	WebKit
Classification:	Unclassified
Component:	Page Loading
Version:	523.x (Safari 3)
Hardware:	All All

Importance:	P3 Enhancement
Assignee:	Nobody

URL:
Keywords:

Depends on:
Blocks:

Reported:	2007-07-13 10:48 PDT by 808caaa4.8ce9.9cd6c799e9f6
Modified:	2022-09-27 07:47 PDT
CC List:	5 users

See Also:	245305

Attachments
reproducing material about Hypothesis#3 (6.87 KB, image/png) 2007-07-13 17:53 PDT, 808caaa4.8ce9.9cd6c799e9f6	no flags	Details

Description 808caaa4.8ce9.9cd6c799e9f6 2007-07-13 10:48:58 PDT

Hypotheses:

1. Many pages use UTF-8 but do not include a BOM.
2. Some pages have non-<meta> tags such as <div> at the top, so checkForHeadCharset() ignores the charset specified in <meta>.
3. KanjiCode::judge is linked in but does not seem to be called, because encoding().isJapanese() seems to almost always return false.
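
For reference on hypothesis 1, the UTF-8 BOM is the three-byte prefix EF BB BF. The check below is a minimal sketch, not WebKit code; the function name hasUtf8Bom is made up for illustration.

```cpp
#include <cstddef>

// Hypothetical helper (not a WebKit function): returns true if the buffer
// starts with the UTF-8 byte order mark EF BB BF.
bool hasUtf8Bom(const char* data, std::size_t size)
{
    if (size < 3)
        return false;
    // Compare as unsigned bytes so the result does not depend on char signedness.
    const unsigned char* s = reinterpret_cast<const unsigned char*>(data);
    return s[0] == 0xEF && s[1] == 0xBB && s[2] == 0xBF;
}
```

A page without this prefix gives the loader no byte-level hint that it is UTF-8, which is the situation hypothesis 1 describes.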

BTW, KanjiCode::judge cannot detect UTF-8.
Japanese UTF-8 text is almost always detected as Shift_JIS by KanjiCode::judge, so we can distinguish between SJIS and UTF-8 after calling KanjiCode::judge.

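KanjiCode::Type judge_with_utf8_ja(const char* str, int size)
{
	// Japanese UTF-8 text is currently reported as Shift_JIS by KanjiCode::judge.
	KanjiCode::Type r = KanjiCode::judge(str, size);
	// Is "SJIS" really SJIS?
	if (r == KanjiCode::SJIS && size > 3) {
		// Compare as unsigned bytes so the test does not depend on char signedness.
		const unsigned char* s = reinterpret_cast<const unsigned char*>(str);
		int r80DF = 0; // runs whose first byte is in 0x80..0xDF
		int rE0FF = 0; // runs whose first byte is in 0xE0..0xFF
		for (int i = 0; i < size - 3; i++) {
			// Three non-ASCII bytes followed by a non-NUL ASCII byte.
			if (s[i] >= 0x80 && s[i + 1] >= 0x80 && s[i + 2] >= 0x80 && s[i + 3] > 0 && s[i + 3] < 0x80) {
				if (s[i] < 0xE0) r80DF++; else rE0FF++;
			}
		}
		// In most cases, SJIS gives rE0FF == 0 and UTF-8 gives r80DF == 0.
		if (rE0FF > r80DF) r = KanjiCode::UTF8;
	}
	return r;
}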
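
To illustrate why Japanese UTF-8 text can pass for Shift_JIS, here is a rough structural check. It is not KanjiCode::judge; it only assumes the commonly documented Shift_JIS byte ranges (as in the CP932 variant): lead bytes 0x81-0x9F and 0xE0-0xFC, trail bytes 0x40-0x7E and 0x80-0xFC, and single-byte half-width katakana 0xA1-0xDF. The name hasShiftJisStructure is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if every byte fits the Shift_JIS single-byte or
// lead/trail pattern. It checks structure only, not whether each pair maps
// to an assigned character.
bool hasShiftJisStructure(const std::string& bytes)
{
    const unsigned char* s = reinterpret_cast<const unsigned char*>(bytes.data());
    const std::size_t n = bytes.size();
    for (std::size_t i = 0; i < n;) {
        const unsigned char b = s[i];
        // ASCII or half-width katakana: one byte.
        if (b < 0x80 || (b >= 0xA1 && b <= 0xDF)) {
            ++i;
            continue;
        }
        const bool isLead = (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
        if (!isLead || i + 1 >= n)
            return false; // invalid byte, or lead byte with no trail byte
        const unsigned char t = s[i + 1];
        const bool isTrail = (t >= 0x40 && t <= 0x7E) || (t >= 0x80 && t <= 0xFC);
        if (!isTrail)
            return false;
        i += 2;
    }
    return true;
}
```

The UTF-8 bytes E6 97 A5 E6 9C AC (the two kanji of "Nihon") satisfy this check: E6 97 and E6 9C look like lead/trail pairs, and A5 and AC fall in the half-width katakana range. That overlap is why an extra step is needed to separate UTF-8 from SJIS.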

Comment 1 Alexey Proskuryakov 2007-07-13 11:56:08 PDT

Could you please provide examples of pages that are decoded incorrectly because of these issues?

Comment 2 808caaa4.8ce9.9cd6c799e9f6 2007-07-13 17:53:56 PDT

Created attachment 15508
reproducing material about Hypothesis#3 

An example where encoding detection is bypassed is attached (Hypothesis #3).
93 FA ... 82 E6 is Shift_JIS, and KanjiCode::judge() successfully detects it as such.

In other words, the heuristic functionality is "sleeping."

I think the first thing to do is to fix/improve isJapanese()
(and add support for a Japanese UTF-8 heuristic).
There is no need for extended support for 'broken markup' for now, I think.
With a hot patch that bypasses encoding().isJapanese(), this page:
 http://www.bandai.co.jp/releases/J2006120401.html
shown in bug#12526 renders correctly (not garbled)
WITHOUT altering the current checkForHeadCharset().

Comment 3 808caaa4.8ce9.9cd6c799e9f6 2007-07-13 18:46:41 PDT

The current isJapanese() implementation in platform/TextEncoding.cpp is quoted below.
I don't know enough about how m_name behaves, but
I wonder whether m_name is stored BEFORE detectJapaneseEncoding() is called.

Under ntsd, isJapanese() actually returns false almost always.

---------------
bool TextEncoding::isJapanese() const
{
    if (noExtendedTextEncodingNameUsed())
        return false;

    static HashSet<const char*> set;
    if (set.isEmpty()) {
        addEncodingName(set, "x-mac-japanese");
        addEncodingName(set, "cp932");
        // ...
        addEncodingName(set, "Shift_JIS");
    }
    return m_name && set.contains(m_name);
}
---------------
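
The effect of this check on detection can be sketched as a simple gate. The code below is a standalone illustration, not WebKit code: the names isJapaneseEncodingName and shouldRunJapaneseGuesser are hypothetical, the name list contains only the three entries visible in the quote above (the elided entries are left out), and the case-insensitive lookup is an assumption.

```cpp
#include <algorithm>
#include <cctype>
#include <set>
#include <string>

// Lower-cases an ASCII encoding label so that lookups ignore case (assumption).
static std::string toLowerAscii(std::string name)
{
    std::transform(name.begin(), name.end(), name.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return name;
}

// Hypothetical stand-in for TextEncoding::isJapanese(): set membership by name.
bool isJapaneseEncodingName(const std::string& name)
{
    // Partial list: only the names visible in the quoted source.
    static const std::set<std::string> japaneseNames = {
        "x-mac-japanese", "cp932", "shift_jis"
    };
    return japaneseNames.count(toLowerAscii(name)) > 0;
}

// Hypothetical gate: the byte-level guesser runs only when the encoding
// already in effect (declared or default) is a Japanese one.
bool shouldRunJapaneseGuesser(const std::string& currentEncodingName)
{
    return isJapaneseEncodingName(currentEncodingName);
}
```

With a default such as ISO-8859-1 the gate returns false, so a guesser like KanjiCode::judge would never see the bytes, which matches the behaviour observed in the debugger.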

Comment 4 808caaa4.8ce9.9cd6c799e9f6 2007-07-13 22:33:39 PDT

Additional info.
An anonymous reporter (2ch.net, poster ID:94CW6sQg0) said that
users should set DefaultEncoding to one of the Japanese encodings,
which will then invoke the encoding judge.

Indeed, if I set it to Shift_JIS or EUC beforehand and then load the page, isJapanese() returns true.
The URL shown in bug#12526 then loads with no trouble.

I don't know whether selecting a specific encoding is intended to mean automatic detection.


The lack of UTF-8 autodetection remains.

Comment 5 Alexey Proskuryakov 2007-07-14 00:13:09 PDT

Yes, the design of Japanese encoding guessing is that it only helps to fix an incorrect _Japanese_ encoding specified. I.e. if the page specifies EUC, but uses Shift-JIS, it will help. For pages that rely on default browser encoding (such as <http://www.bandai.co.jp/releases/J2006120401.html>), the user is supposed to have a Japanese encoding as a default for this to work.

This is necessary for performance reasons, and also because we absolutely don't want other languages to be accidentally auto-detected as Japanese. Please also note that we may want to implement encoding auto-detection for more languages in the future, as Japanese isn't the only language with multiple encodings in active use.

I think that adding UTF-8 support to KanjiCode::judge is a good idea - but please give examples of real life sites that would benefit from this (as you mentioned in the bug description, there are many).

I'm retitling this bug to track this single issue to avoid confusion. Please file new bugs if necessary!
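
One conservative way to approach a UTF-8 check of the kind discussed here is to test whether the whole buffer is well-formed UTF-8. This is a different technique from the lead-byte counting heuristic in the description, and it is only a sketch: the function name isWellFormedUtf8 is hypothetical, and nothing here is taken from KanjiCode.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if `bytes` is well-formed UTF-8. Rejects stray
// continuation bytes, truncated sequences, overlong forms, UTF-16 surrogates,
// and code points above U+10FFFF.
bool isWellFormedUtf8(const std::string& bytes)
{
    const unsigned char* s = reinterpret_cast<const unsigned char*>(bytes.data());
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b = s[i];
        std::size_t len = 0;
        unsigned long cp = 0;
        if (b < 0x80) { ++i; continue; }                        // ASCII
        else if (b >= 0xC2 && b <= 0xDF) { len = 2; cp = b & 0x1F; }
        else if (b >= 0xE0 && b <= 0xEF) { len = 3; cp = b & 0x0F; }
        else if (b >= 0xF0 && b <= 0xF4) { len = 4; cp = b & 0x07; }
        else return false;                                      // not a valid lead byte
        if (n - i < len)
            return false;                                       // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            if ((s[i + k] & 0xC0) != 0x80)
                return false;                                   // not a continuation byte
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (len == 3 && cp < 0x800) return false;               // overlong 3-byte form
        if (len == 4 && cp < 0x10000) return false;             // overlong 4-byte form
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;         // surrogate
        if (cp > 0x10FFFF) return false;                        // out of Unicode range
        i += len;
    }
    return true;
}
```

The Shift_JIS bytes 93 FA fail immediately because 0x93 is a continuation byte with no lead byte. Pure ASCII input also passes, so a real detector would additionally need to require at least one multi-byte sequence before concluding the text is UTF-8.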

Comment 6 808caaa4.8ce9.9cd6c799e9f6 2007-07-16 16:02:39 PDT

Sorry for the delayed response.

Sites with UTF8/ja and broken tags

Comment 7 808caaa4.8ce9.9cd6c799e9f6 2007-07-16 16:14:51 PDT

// repost.
Sorry for the delayed response.

Sites with UTF8/ja and broken tags are mostly end-user sites,
and I don't want to pillory them....

I think the most important reason to support auto-detection of UTF8/ja is
casual filters/Greasemonkey-style scripts, which may later be implemented for WebKit.
They may strip out <meta>s and pad something at the top.
That is at their own risk... but supporting UTF8/ja would be kind, I think.



Additional question.

While I was collecting examples, an anonymous reporter (2ch.net, poster ID:xmYP4i2q0) jokingly offered
this URL:

http://developer.apple.com/jp/

Kidding!

(Currently) this URL has this sort of 'broken tag':

> <meta http-equiv="Content-Type" content="text/html; charset="utf-8">

In this case, detectJapaneseEncoding() seems not to be called (for a different reason)....
Because of the incorrectly paired \x22, checkForHeadCharset() loses quote sync,
consumes the whole content, and returns false
(at 'if(ptr == pEnd) return false;' line 588).

On almost all websites, tag content does not contain line feeds.
I think that bailing out successfully when a line feed occurs while scanning for the closing quote
is realistic.

Should I post this issue as a new bug, or wait?

My experimental code.
-----
                        while (ptr != pEnd && *ptr != quoteMark) {
                            if (*ptr == '\r' || *ptr == '\n') {
                                // The quoted value runs past the end of the line: quote sync may be lost.
                                // Bail out and report success.
                                m_checkedForHeadCharset = true;
                                return true;
                            }
                            ++ptr;
                        }
-----
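
The same idea can be tried outside WebKit with a small standalone scanner. This is a sketch under assumptions: QuoteScan and scanQuotedValue are hypothetical names, and the function only models the quote-scanning loop, not the rest of checkForHeadCharset().

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type for the sketch below.
enum class QuoteScan { Closed, BailedAtLineBreak, RanOut };

// Scans `html` from `pos` (the first character after an opening quote) for the
// matching `quoteMark`. Stops early at CR or LF, on the assumption that a
// quoted attribute value rarely spans lines. `end` receives the stop index.
QuoteScan scanQuotedValue(const std::string& html, std::size_t pos, char quoteMark, std::size_t& end)
{
    for (std::size_t i = pos; i < html.size(); ++i) {
        if (html[i] == quoteMark) {
            end = i;
            return QuoteScan::Closed;
        }
        if (html[i] == '\r' || html[i] == '\n') {
            end = i; // probably lost quote sync; stop here
            return QuoteScan::BailedAtLineBreak;
        }
    }
    end = html.size();
    return QuoteScan::RanOut; // consumed everything without finding the quote
}
```

For the broken tag quoted above, scanning from the stray quote after utf-8 stops at the end of that line instead of pairing with a quote in some later tag.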

Comment 8 Alexey Proskuryakov 2007-07-16 21:28:32 PDT

> http://developer.apple.com/jp/

Wow. Yes, let's track this as a separate bug, please.

Comment 9 David Kilzer (:ddkilzer) 2007-07-17 08:09:34 PDT

(In reply to comment #8)
> > http://developer.apple.com/jp/
> 
> Wow. Yes, let's track this as a separate bug, please.

Bug 14643.

Comment 10 David Kilzer (:ddkilzer) 2007-07-17 08:31:10 PDT

(In reply to comment #9)
> (In reply to comment #8)
> > > http://developer.apple.com/jp/
> > Wow. Yes, let's track this as a separate bug, please.
> Bug 14643.

Oops--Bug 14636 had already been filed!

Comment 11 David Kilzer (:ddkilzer) 2007-07-17 08:38:43 PDT

(In reply to comment #7)
> Sites with UTF8/ja and broken tags mostly occur in end user sites,
> I want not to bring pillory to them....

While I understand that you don't wish to publicly embarrass specific web sites, the web is truly driven by web sites and how they use HTML, JavaScript and CSS.  If there is a significant number of web sites that do something in a way that breaks a web browser, we'd like to fix it, but we need to know which sites are causing the issues, or we need reduced test cases from the web sites so that the issue may be fixed.

If you could provide examples of HTML pages which do not work with Safari 3.0 (stripped of any identifiable content), that would help a great deal!  Please attach them to this bug as HTML files.  Thanks!

Comment 12 808caaa4.8ce9.9cd6c799e9f6 2007-07-17 20:16:52 PDT

Found one example: a simulation of a broken-tags/non-standard page.

[This URL is OK] https://www.google.com/accounts/Login?hl=ja

Again, this URL is OK.
This page uses UTF8/ja without specifying a charset in the HTML.
The charset is specified in the HTTP response header, so there is no problem.

If you save the page locally and open it, the page is garbled.

Comment 13 Alexey Proskuryakov 2007-07-22 02:56:43 PDT

While better supporting locally saved pages is nice, this isn't a very compelling example, as there are lots of non-Japanese pages out there that do not have correct METAs, but are served with a correct Content-Type, and we cannot detect all existing encodings.

We really need many examples of real life pages that would benefit from this to have a chance to fix this for Safari 3 (and even then, it may not be possible due to time constraints).

Comment 14 Sam Sneddon [:gsnedders] 2022-09-16 14:54:48 PDT

See also https://hsivonen.fi/utf-8-detection/ with regards to UTF-8 detection, though Chrome _does_ detect UTF-8. It seems unlikely we'll ever get consensus with regards to introducing UTF-8 sniffing cross-browser, which to me suggests we should be very conservative about introducing it ourselves at this point.

Comment 15 Anne van Kesteren 2022-09-27 07:47:22 PDT

If this is standardized first it's something we could consider, but I'd rather not do this unilaterally.