61561 – add support for "on demand" webfonts

Brian Stell

Reported 2011-05-26 14:24:08 PDT

Add support for "on demand" webfonts that download only the data needed for the current page that has not already been downloaded.

Brian Stell

Comment 1 2011-05-26 14:58:14 PDT

I'm developing an experimental prototype to speed up webfonts by loading only the minimally needed parts of a webfont. This is beneficial where most of a font is not needed; e.g., short runs of a unique font such as blog titles, or CJK fonts where any given page uses only a small subset. When this type of webfont is accessed for the very first time, a stripped version (base) of the font is requested. On all page loads we record which characters we do not yet have data for. Once the page is loaded we request any needed data. When this data arrives we merge it with the already downloaded data, cache it for future reuse, and redraw.

Brian Stell

Comment 2 2011-05-27 12:14:40 PDT

WebKit has many layers, so I would appreciate some recommendations and guidance on where in the code the pieces would best go.

I need to know if this font is a "load on demand" font. Would it be okay to do this by the extension? Should I create a different type of font object; e.g., a new subclass? Can I get a recommendation where this should go in the code?

I will need to download the font 'base' (font with removable parts removed) if nothing is already downloaded. I see what looks like font loading code in CSSFontSelector::addFontFaceRule(). Can I just assume this code is enough? Or do I need to do something like change the font object type?

As the page is loaded I need to know this is a "load on demand" font (perhaps by object type). The font object will need to have info on which characters are supported in this font and which characters have already been downloaded. Obviously this info will also need to be in the font file.

The font object will need to be able to record the accessed characters. Where would this best go? Perhaps some of the layout code such as RenderBlock::LineBreaker::nextLineBreak()?

When the base font has been received I need to not request a reflow. Recommendations on where this logic should go? Perhaps CSSFontSelector::dispatchInvalidationCallbacks()?

When the page has finished loading I need to request the needed data. Where would this go? Perhaps MainResourceLoader::didFinishLoading()?

mitz

Comment 3 2011-06-07 16:47:10 PDT

You should look at the discussion and the patch at bug 42154.

Brian Stell

Comment 4 2011-06-13 17:45:45 PDT

My goal is to speed up CJK webfonts. Yuzo's goal in bug 42154 is to speed up CJK fonts. We are both working toward a similar goal but we are going about it in very different ways.

I talked with Yuzo and I believe the plan in bug 42154 is to break up big fonts into subsets of smaller Unicode ranges. These subsets would then be listed in a CSS font list which the browser would logically reassemble. With his patch only the parts actually used would be downloaded.

For a font composed of scripts that are efficiently ordered in Unicode (e.g., Latin, Cyrillic, etc.) subsetting by Unicode range makes good sense. Each subset can efficiently cover a script and the browser would only need to download the parts actually used.

However, the characters in the Unicode Han (CJK) unification range are not efficiently ordered for Chinese, Japanese, or Korean. The Han unification is ordered in radical-stroke order, not by script or popularity. Popular characters for a given language are somewhat randomly scattered through the whole range. Each subset in a Unicode range includes many low popularity characters along with a few high popularity characters. Loading a page of popular characters would download most of the subsets, effectively defeating the goal of a smaller download.

For bug 42154 to be effective we would need to:
1. introduce a concept of 'popular' character ranges
2. add these 'popular' character ranges to the CSS spec
3. subset by 'popular' characters
4. list these in the CSS font family (or make the browser smart enough to request the appropriate subsets).
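
To make the scattering argument concrete, here is a rough sketch that counts how many fixed-size Unicode-range subsets a short run of Han characters touches. The subset size of 256 code points and the function name are assumptions for illustration only; bug 42154 does not specify them.

```python
# Hypothetical illustration: which fixed-size Unicode-range subsets does a
# piece of text need? The block size is an assumption, not a value taken
# from bug 42154.
CJK_START, CJK_END = 0x4E00, 0x9FFF  # CJK Unified Ideographs block


def subsets_touched(text, block_size=256):
    """Return the indices of the subsets needed for the Han characters in text."""
    return {
        (ord(ch) - CJK_START) // block_size
        for ch in text
        if CJK_START <= ord(ch) <= CJK_END
    }


sample = "\u65e5\u672c\u8a9e"  # three common characters: U+65E5, U+672C, U+8A9E
touched = subsets_touched(sample)
print(f"{len(set(sample))} characters touch {len(touched)} subsets")
```

In this sample each of the three characters lands in a different subset, so the browser would fetch three subsets of 256 code points to draw three glyphs.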

Brian Stell

Comment 5 2011-06-13 18:06:11 PDT

Here's what I'm trying to do: just ship the minimum needed by a page and cache it for reuse:
* Create a 'base' font by stripping the font of all the character specific data. For Latin fonts the character specific data is about 66% of the font size. For CJK fonts character data is 85-95% of the font size.
* The first time the webfont is used, download this 'base' (one time only).
* During layout, record all the characters actually used that are not already downloaded.
* Request the data for these new characters.
* Merge the data into the base and redraw.

The initial download will (almost always) be significantly smaller. Subsequent uses will (almost always) be even smaller. There does not need to be a concept of 'popular' characters for a given script. The only 'manual' operation is to generate the (page agnostic) base font and character data. Everything else happens automatically.

This is a big win for CJK but it also speeds up 'small uses' like fancy blog titles which only use a few characters from a font.
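
The bookkeeping behind these steps could look roughly like the sketch below. It is only a model of the record/request/merge cycle; the class and method names are hypothetical and do not correspond to anything in WebKit, and real glyph data is replaced by sets of code points.

```python
# Hypothetical sketch of the client-side bookkeeping. None of these names
# exist in WebKit; glyph data is modelled as a set of code points.
class OnDemandFont:
    def __init__(self):
        self.have_base = False
        self.downloaded = set()  # code points whose data is already cached
        self.pending = set()     # code points seen in layout, not yet downloaded

    def record_use(self, text):
        """During layout: remember characters we have no data for yet."""
        self.pending |= {ord(ch) for ch in text} - self.downloaded

    def next_request(self):
        """After page load: describe what still has to be fetched."""
        return {"base": not self.have_base, "codepoints": sorted(self.pending)}

    def merge(self, codepoints):
        """When data arrives: merge it into the cached font (then redraw)."""
        self.have_base = True
        self.downloaded |= set(codepoints)
        self.pending -= set(codepoints)


font = OnDemandFont()
font.record_use("abc")
first = font.next_request()      # base plus three code points
font.merge(first["codepoints"])
font.record_use("abcd")
second = font.next_request()     # only the new character, no base
```

The second request is smaller than the first because the base and the data for "abc" are already cached, which is the "subsequent uses will be even smaller" effect described above.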

Dave Hyatt

Comment 6 2011-06-14 13:20:52 PDT

So rather than inventing a new way for client/server to communicate, I'd try to latch onto existing concepts. We know a font can be broken up into multiple files server-side and you can serve up pieces based off unicode ranges. If unicode range does a poor job of addressing specific sets of characters that might be very scattered, then maybe what's needed is a set of keywords for unicode range that could represent those sets (and avoid the author having to define some giant list of single characters in the unicode range).

Dimitri Glazkov (Google)

Comment 7 2011-06-14 13:40:33 PDT

(In reply to comment #6)
> So rather than inventing a new way for client/server to communicate, I'd try to latch onto existing concepts. We know a font can be broken up into multiple files server-side and you can serve up pieces based off unicode ranges.
>
> If unicode range does a poor job of addressing specific sets of characters that might be very scattered, then maybe what's needed is a set of keywords for unicode range that could represent those sets (and avoid the author having to define some giant list of single characters in the unicode range).

Before jumping to a working prototype, I think we should make a good assessment of how feasible it would be to standardize something like this. It sounds like as-is, the idea is bound to remain a one-off WebKit oddity, which to me smells like death.

Eric Seidel (no email)

Comment 8 2011-06-14 16:28:33 PDT

What happens if you specify: font-family: url(1.ttf), url(2.ttf), url(3.ttf),...; Will that correctly load the fallback fonts when the earlier ones do not contain the necessary characters? Would that be a working solution in today's browsers?

Brian Stell

Comment 9 2011-06-22 17:05:27 PDT

Issue 1: Right now WebKit loads all listed webfonts even if they are not used. Yuzo's bug 42154 is trying to download a webfont only if it is actually used.

Issue 2: There is overhead in each subset. In a private conversation I asked Raph Levien of Google Webfonts and he estimates the subset overhead to be around 10%. I plan to look at making that smaller, but it won't go away entirely, so lots of subsets would negate any advantage.

Issue 3: If we use big subsets then the size works as a penalty when multiple fonts are used for short runs. If the goal was only one font then this is not much of a problem. However, if web apps are to compete with desktop apps then people should be able to produce rich typography; e.g., use of semi-bold, semi-condensed, and style variations. Big subsets make these costly.

Brian Stell

Comment 10 2011-08-08 13:43:25 PDT

It has been a while (it has taken me some time to get meaningful mapreduces).

> So rather than inventing a new way for client/server to communicate,
> I'd try to latch onto existing concepts. We know a font can be broken
> up into multiple files server-side and you can serve up pieces based
> off unicode ranges.

For many scripts, using Unicode ranges is likely to give a good win. For example, Droid Sans has ~2500 characters, but fewer than 100 are needed for US English docs, fewer than 256 for most European docs, and fewer than 100 for Hebrew. Splitting by Unicode range would reduce the webfont size significantly, and in general most web pages using one of these scripts know which script they are using. Can I get some help prototyping code to have WebKit request this?

> If unicode range does a poor job of addressing specific sets of
> characters that might be very scattered, then maybe what's needed is
> a set of keywords for unicode range that could represent those sets
> (and avoid the author having to define some giant list of single
> characters in the unicode range).

While some scripts will benefit from Unicode ranges, CJK is a very different matter. The Unicode Han unification scatters the interesting 7K Chinese (combined Simplified/Traditional) and 4K Japanese characters over 20K code positions in a pattern that has nothing to do with popularity (KangXi radical-stroke ordering).

We've done mapreduces to figure out the popular characters in CJK pages on the web. It takes around 4K characters to cover 75% of Japanese web pages. It takes about 4K to cover 75% of Korean characters. It appears that Chinese may be in the same range, but we need to redo the mapreduces separating Simplified and Traditional Chinese. (Above the 75% popularity range there is uncertainty since the results include some questionable characters; hence we are redoing the mapreduces.)

A subset of 4K chars is about 25% of a CJK font, so splitting the font based on popularity to hit 75% of web pages is only a modest win, since 25% of all CJK web pages would require additional subsets. The mapreduce results for per-document subsetting show that 90% of CJK docs need only 600 characters. Relative to the 20K characters in a font this is a big win.
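
As a rough check on these proportions, the character counts quoted above can be compared directly. This sketch uses character count as a stand-in for download size, which is an assumption: glyph sizes vary and each subset carries overhead, so the "about 25%" size figure above will not match a pure count ratio exactly.

```python
# Figures taken from the comment above; character count is only a rough
# proxy for bytes (an assumption of this sketch).
FONT_CHARS = 20_000      # "20K characters in a font"
POPULAR_SUBSET = 4_000   # characters covering about 75% of pages
PER_DOC_CHARS = 600      # characters needed by 90% of CJK documents


def share(chars, total=FONT_CHARS):
    """Fraction of the font's characters that a subset represents."""
    return chars / total


print(f"popularity subset:   {share(POPULAR_SUBSET):.0%} of the characters")
print(f"per-document subset: {share(PER_DOC_CHARS):.0%} of the characters")
```

By count, the per-document subset is several times smaller than the popularity subset, which is the gap the comparison above is pointing at.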

Ryosuke Niwa

Comment 11 2011-08-09 14:06:34 PDT

I'm very excited about this bug! However, I strongly believe we should have this discussion on an open mailing list where folks from Mozilla, Opera, etc. can participate.

(In reply to comment #10)
> > So rather than inventing a new way for client/server to communicate,
> > I'd try to latch onto existing concepts. We know a font can be broken
> > up into multiple files server-side and you can serve up pieces based
> > off unicode ranges.
>
> For many scripts using Unicode ranges is likely to give a good win. For example, Droid Sans has ~ 2500 characters but less than 100 are needed for US English docs, less than 256 for most European docs, less than 100 for Hebrew. Splitting by Unicode range would reduce the webfont size significantly and in general most web pages using one of these scripts know which script they are using.

Hebrew and Arabic are very good examples here since they use only a very small subset of Unicode. I'm not sure how much of a win we have for CJK, since even a very small blog article may contain 500-1,000 unique characters (though I guess that is still much better than loading all 20,000 characters' worth of glyphs?).

> We've done mapreduces to figure out the popular characters in CJK pages on the web. It takes around 4K characters to cover 75% of Japanese web pages. It takes about 4K to cover 75% of Korean characters. It appears that Chinese may be in the same range but we need to redo the mapreduces separating Simplified and Traditional Chinese (Over the 75% popularity range there is uncertainty since the results include some questionable characters. Hence we are redoing the mapreduces.). A subset of 4K chars is about 25% of a CJK font so splitting the font based on popularity to hit 75% of webpages is a modest win since 25% of all CJK webpages would require additional subsets.

It appears that what we need is a generic way of pulling multiple Unicode code point ranges.

Eric Seidel (no email)

Comment 12 2012-10-27 01:46:59 PDT

This seems like a discussion for the spec community, more than our bug tracker. Given this has been quiet for over a year, I think it's safe to close for now. It's a cool feature though. :)