Bug 22339

Summary: Support Indian web sites with EOT by on-the-fly transcoding to Unicode
Product: WebKit Reporter: Jungshik Shin <jshin>
Component: Layout and RenderingAssignee: Nobody <webkit-unassigned>
Status: NEW ---    
Severity: Normal CC: ap, darin, eric, hamaji, ian, jdaggett, mike, mjs, nickshanks, prunthaban
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: All   
OS: All   
Bug Depends on:    
Bug Blocks: 14637, 24906    
Attachments:
Description Flags
Patch to support on-the-fly transcoding for one font
darin: review-
Patch to support on-the-fly transcoding for one font (new version) darin: review-

Description Jungshik Shin 2008-11-18 11:34:32 PST
In a mail thread http://lists.macosforge.org/pipermail/webkit-dev/2008-October/005493.html

we decided not to support EOT in webkit. Instead, Maciej came up with a brilliant idea of using the facility for 'text-transform' to convert the content of a text-node (font-encoded) to Unicode on the fly. This transcoding will be triggered by a pre-configured list of (web site, font) pairs. Which transcoder to use will be keyed by (web site, font) as well. 

Prunthaban has looked into implementing transcoders to webkit.
Comment 1 Prunthaban 2009-01-02 05:55:01 PST
Created attachment 26367 [details]
Patch to support on-the-fly transcoding for one font

This patch gives the whole framework to support transcoding on-the-fly when a page is rendered. All transcoder related code is protected by ENABLE(TRANSCODER) condition. The different classes used in this patch are to facilitate transcoding. There are 2 abstract classes which needs to be inherited to add support for new fonts/languages.
Encoder - This class is responsible for transforming the ASCII content into an intermediate form.
Converter - This class converts the intermediate form into the target language.

This patch at present supports one type of font encoding (TSCII - http://www.tscii.org/ A standardized Tamil font). The site I have used to test the code is 'www.kalkiweekly.com' a weekly tamil magazine. 
At present the Singleton class Transformer stores the allowed list of sites and fonts. The domain-font list is hardcoded now. I think this should be fine because we do not have any plans to extend this framework to support all dynamic fonts (which will be equal to supporting EOT). Each encoder needs one class (which peroforms look-up and additional processing when required). The size of the look-up table is on average not more than 200-300 entries. 
The entire logic behind this transcoding has been taken from padma.mozdev.org an open source Javascript based transcoder. I have done some automation to convert the tables in Padma to look-up tables in C++. So once we decide what fonts and domains we are going to allow, it should not take much time to develop the corresponding Converters & Encoders.
Comment 2 Darin Adler 2009-01-02 10:16:53 PST
Comment on attachment 26367 [details]
Patch to support on-the-fly transcoding for one font

Thanks for taking this on! I think this feature will be good for WebKit.

I'm a little disappointed at how much code you're putting up for review all at once. This includes both the "transcode based on font" infrastructure, but also some new codecs for Tamil and TSCII.

I don't think it's good to have the transcoding be based on the origin of the web page. We don't want the same content to work differently when it's on different servers. We should use the font family names alone, and not involve the security origin!

I think the concept we want here is that certain font names require re-decoding the characters. So the interface to that would be a single function, something like this:

    TextEncoding encodingForFont(const AtomicString& familyName);

Or perhaps like this:

    TextEncoding encodingForFont(const FontFamily&);

That function can go in platform/text or possibly in the rendering directory.

You'd give that map a font family name and ask it if there's a special encoding. If there is, the encoding would be represented by a TextEncoding object. To transcode, we'd first re-encode by using TextEncoding::encode on the text, using the encoding of the web page the text was found in, document()->decoder()->encoding(), to convert the text back from UTF-16 into the original bytes. Then we'd re-decode those original bytes by using TextEncoding::decode on the bytes, using the text encoding returned by encodingForFont. This would give us new, corrected UTF-16. Something like this:

    CString reencodedBytes = document()->decoder()->encoding().encode(characters, length, QuestionMarksForUnencodables);
    bool sawError;
    String transcodedText = fontBasedEncoding.decode(reencodedBytes.data(), reencodedBytes.length(), false, sawError);

If we do not already have suitable codecs for TSCII and Tamil -- and I'm surprised that ICU does not include these -- then the right way to add those is to add them via the TextCodec machinery, in the platform/text directory along with the other codecs. TextEncodingRegistry.cpp is the starting point where the various codecs register the encoding names they can handle.

We definitely don't need a new directory for transcoding functions in the render tree code!

> +2009-01-02  prunthaban  <prunthaban@google.com>

This should have your full name, not just your login name.

> +        Test: fast/css/font-face-transcoder.html

The ChangeLog entry needs to describe what you've changed. It should include the URL of the bug, title of the bug, and a description of the changes.

Ideally, each file and function should have comments about what changed. Although for new files it's better to omit the list of functions, even though the prepare-ChangeLog script generates it. So, for example, you'd want a comment about RenderText::styleDidChange explaining the change to that function, but you wouldn't want Encoder::maximumLookupLength.

> +#if ENABLE(TRANSCODER)

The global setting is called TRANSCODER, but that seems like far too vague a name. Transcoding could mean many different things, and we need something that's more specific. That's a problem with many of the names in your patch. You're sharing the namespace with the rest of WebCore, so you need specific enough names.

For example, you have Transcoder and Transformer. Not a clear distinction between the two. All these classes need more specific names. But also I don't think they need to exist (see my suggested design above).

In general, this design uses too many separate objects that each have to be allocated separately and too many different function calls. I believe it's going to be unnecessarily inefficient because of that.

> +    if (isTranscoderText()) {
> +        m_text = transcode(String(m_text->characters(), m_text->length()));
> +    }

Single line if statements should not have braces around them in the WebKit coding style.

This code is going to run on every style change and risks slowing down browsing on sites that don't involve the specially-named fonts. We have to design this isTranscoderText function and how it's used in the render tree so that it is very fast. I think that means that we need a design where the font family makes the decision about whether transcoding is involved and that gets stored rather than being recomputed each time styleDidChange on each RenderText object. I think the right place to store this is probably the WebCore::Font object.

> +    String origin = document()->securityOrigin()->domain();
> +    Transformer* trans = transformer();

There's no need to abbreviate the name of the local variable like this, and we try to avoid that. Our preferred style is to either use the transformer() function directly without a local variable at all, or if there is a real efficiency reason to use a local variable, we do this:

    Transformer* transformer = this->transformer();

> +PassRefPtr<StringImpl> RenderText::transcode(String text)

Functions should almost never take a String argument, because that unnecessarily churns the reference count of the string. Instead the argument should be const String&, or in some cases it should be a character pointer and a length.

> +    String family = style()->font().family().family().string();
> +    String transformed = transformer()->convert(text, family);
> +    return StringImpl::create(transformed.characters(),transformed.length());

Need a space after the comma.

This is an unnecessarily inefficient idiom here. We already have the string in a StringImpl, and then you create a new one by calling StringImpl::create; that copies all the characters for no good reason. I think the return value should just be a String, but if you really find it useful to have the return value be a PassRefPtr<StringImpl>, you should return transformed.impl(). WebCore strings are immutable, so there's almost never a need to copy them.

> +#if ENABLE(TRANSCODER)
> +    bool isTranscoderText() const;
> +    PassRefPtr<StringImpl> transcode(String);
> +#endif

These functions should be private, not public.

The function name isTranscoderText is not good. The text isn't "transcoder text", so "is transcoder text" is not the right name for it. Perhaps textNeedsTranscoding() is a good name. 

> +class Converter {
> +public:
> +    virtual String transcode(UChar) = 0;
> +};

This is much too vague a name for a class. A "converter" could convert anything into anything else. It needs a more specific name, one that mentions transcoding I think. But we shouldn't have this at all, so it's not worth commenting on it in too much detail.

The interface to this class is unnecessarily inefficient. Typically we don't want to make a separate function call for each character. Functions normally work on runs of characters for efficiency's sake. Further, returning a String for each character means doing multiple memory allocations per character. You need an API that takes efficiency into account. For example, you could have an API that takes both an input and output UChar buffer with some rules about the size of the buffer that allow it to be allocated on the stack.

> +#include "config.h"
> +#include "DynamicParser.h"
> +#include "Syllable.h"
> +#include "Transcoder.h"
> +#include <wtf/PassRefPtr.h>
> +#include "StringImpl.h"

There should be a blank line after DynamicParser.h, the main header of the file. Then the includes should be sorted with the "sort" tool. Thus, StringImpl.h should be higher up the list.

> +DynamicParser::DynamicParser(String input, Encoder* encoder)

Should be const String&.

> +String DynamicParser::removeRedundantSymbols()
> +{
> +    String output = "";
> +    for (int i = 0; i < m_text.length(); i++) {
> +        UChar c = m_text[i];
> +        if (!m_encoder->isRedundant(c))
> +            output.append(c);
> +    }
> +    return output;
> +}

No function should append characters to a String object one character at a time. That's extremely inefficient. Instead you should use a Vector<UChar> and then convert the vector to a string at the end using the String::adopt function.

> +        }
> +        else {

The close brace goes on the same line as the else in the WebKit coding style.

> \ No newline at end of file

Please put newlines at the end of the file.

I'm going to stop commenting on the details of the specific transcoder you wrote at this point. The more important point is that since this is a translation from one encoding to another, it should use the TextCodec API and be in the platform/text directory, not in the WebCore/rendering directory. There probably doesn't need to be a WebCore/rendering/transcoder directory at all.
> +TSCIIEncoder::TSCIIEncoder()

Is TSCII an encoding that's not supported by ICU? Do we really need a custom codec in WebKit for this? The heavy use of strings in these encoders is going to make it unnecessarily slow.

review- for now; lets try a new patch that either adds the new codecs or adds the new machinery for special decoding, but not both. We can test the decoding machinery using some codec that's already supported and a special font name.

The main tricky part here is adding this machinery without slowing down browsing that doesn't involve the special fonts.
Comment 3 Jungshik Shin 2009-01-05 17:06:57 PST
Great that this is coming along ! Thank you, Prunthaban !

I think TSCII support should be in a separate bug. Actually, I'd rather not support it at all.  Well, I added it to Gecko a long time ago, but in 2009, I'd rather not.  Anyway, that should be separate from this bug and has to be implemented by adding a converter for TSCII (as Darin wrote). 

To Darin: no, TSCII is not a part of the ICU.  

As for converting back to the byte streama(using windows-1252 - the claimed encoding of Indian pages) and then back again to Unicode (using a font-specific encoding), I'm afraid it's too a roundabout way.  Maciej's original idea in the webkit list was to treat the transcoding as a form of 'text-transform'. That means working entirely in the Unicode space.  

IMHO, ICU's Transformer APIs [1] are good for that kind (m-to-n) of transformation (Unicode to Unicode). The downside is that there are webkit ports that do not use ICU.  For that, I suggested (offline?) that we add a helper class/function that will be implemented using ICU (for ports that use ICU) and implemented with something else for ports that do not use ICU. The advantage is that the amount of the code to implement that in ports using ICU would be much smaller than what 's in the patch. There may be an issue with loading/storing the transformation rules for font-based-encodings. (we don't want to build all those rules - a few dozens of them - as a part of webkit binary but rather want to load if/when necessary).



[1] http://www.icu-project.org/userguide/Transform.html
    http://www.icu-project.org/userguide/TransformRule.html



Comment 4 Prunthaban 2009-01-08 03:37:41 PST
Thanks for the review comments.
Let me explain why I have chosen this approach. 
There are few dozen Indic fonts and each of them has their own glyf-to-ascii mapping. Now if we really want to write transformation rules for them from scratch, I am afraid it cannot be done by one person. It requires good understanding of how conjuncts are handled in that particular language. Considering these limitations, we needed an alternate method where we can develop these codecs with least effort. Padma plug-in ( http://padma.mozdev.org ) seems to be the right choice because
1) It has an active developer community which is constantly improving it.
2) It supports almost all major fonts in all major indic languages.
3) Some of the Padma developers showed interest in helping to port the Javascript code to C++.

The logic behind the code (present in this patch) were directly taken from Padma. The main transcoder code base (Syllable and Parser classes) are written to handle all types of fonts. They transform the ASCII characters into an intermediate form before transforming them again into Unicode. They find this approach easy and flexible. There were some re-arranging of characters done to simplify the look-up process.

If we take the ICU path, then we need to write all the font transformation rules and we need to have people having knowledge of each language and font (I presume that is the case - I have not worked with ICU in the past. If I am wrong, let me know).

Regarding simplifying the look-up process and minimize the number of classes, I am not sure to what extent we can simplify it because the code depends on Padma. By taking the Padma-way, it should be easier to maintain and update (whenever Padma code-base is updated).

Let me know your thoughts.
Comment 5 Darin Adler 2009-01-08 09:10:17 PST
(In reply to comment #3)
> As for converting back to the byte streama(using windows-1252 - the claimed
> encoding of Indian pages) and then back again to Unicode (using a font-specific
> encoding), I'm afraid it's too a roundabout way.  Maciej's original idea in the
> webkit list was to treat the transcoding as a form of 'text-transform'. That
> means working entirely in the Unicode space.

I don't agree. I believe the point here is that we have decoded the byte stream incorrectly because we didn't realize the font requires a particular encoding. We need to undo that and then re-decode.

You've stated that this should be done entirely in the Unicode space, but I don't agree. This is working around incorrect decoding, so what it should be doing is new decoding. I agree that it's transcoding, but the cleanest model is to use re-decoding.
Comment 6 Alexey Proskuryakov 2009-01-08 11:29:41 PST
For the record (and as discussed previously offline), I think that we should consider using the existing Padma JavaScript code without adaptation. The cases where transcoding is needed are becoming more and more rare, so the expediency of re-using the whole code base may outweigh performance and even architecture concerns.
Comment 7 Prunthaban 2009-01-14 21:29:54 PST
Hi,

Before submitting a new patch, I would like to get clarification on certain issues.
The current encoding machinery I use is from Padma and the Padma architecture is different from the current TextEncoding machinery of Webkit.
(i) Padma uses a generic code base to transcode any font. These generic routines do bulk of the pre-processing. When you need to add a new font, it is enough to add two look-up tables and pass the corresponding objects to this transcoder.
(ii) Padma converts the ASCII characters into an intermediate form before converting them into unicode characters of a particular language (So there is an additional level in the transcoding).

Now, if we want to use the TextEncoding architecture of Webkit, (if my understanding is correct) we need to have one TextEncoding class for each font we add. But as I stated earlier, Padma transcoders share ~80% code across fonts & languages and the only way we can achieve this is to make those TextEncoding objects call the corresponding routines in the Padma implementation. So effectively we will still have all the Padma architecture hidden behind an additional level of TextEncoding objects.

If we need to get rid of the Padma code-base, then we need to write encoders for each font from scratch which is a big effort.

The main classes used in this implementation are,
1) Syllable
2) Parser
3) DynamicParser
4) Converter
5) Encoder

The first 3 classes are the core of the transcoding API. Additionally each font requires one Encoder and each language requires a Converter. We need to pass the encoder and converter to the API to perform transcoding on any font (Other than this there is one Singleton and another static helper class with helper methods).

So after looking into the TextEncoding API of WebKit, I am not sure how we can simplify the existing Padma architecture by using TextEncoding.
What are your thoughts on this?

Regards,
Prunthaban
Comment 8 Maciej Stachowiak 2009-01-15 22:44:25 PST
I agree with Alexey and Prunthaban that the ability to use a bunch of pre-written custom transcoders is a huge benefit. And since this is a legacy compatibility feature, I would say that outweighs the desire to take the cleaner approach of undecoding and then re-decoding to unicode.

I also think the origin restrictions are ok, based on what we have heard; the idea is to limit further propagation of mis-encoded text using custom fonts. However, having such restrictions may make it hard to create layout tests.
Comment 9 Prunthaban 2009-01-21 03:57:39 PST
Created attachment 26892 [details]
Patch to support on-the-fly transcoding for one font (new version)

Here is a summary of changes made in this new patch:

* The naming of classes has been changed. Earlier I had Transformer and Transcoder. One of the classes were removed (the functionality has been moved to a common class because they are very similar). Now the class is called FontTranscoder. Other classes has been renamed as well as per Darin's review comments.
* The global setting is renamed to TEXT_TRANSCODER. I think this should be fine.
* The efficiency of the transcoding process has been improved in this new patch. I have made the converters to act on a range of UChars in one shot. Also all String building methods now use Vector<UChar> to build the string and finally use String::adopt().
* The classes has been moved under platform/text. But I am still keeping them under a special directory because we need to add a dozen more Encoders and Converters. So I felt it will be good to keep them separate. If it is ok to have all of them directly under platform/text then we can move it there. Let me know your thoughts on this.
* All the review comments by Darin has been taken care of except using the TextCodec machinery of WebKit. As I mentioned in my previous mail, this implementation is based on Padma. The Padma architecture allows us to add new encoders (for each font. Based on our requirement we might add a dozen fonts initially) and converters (one for each language)  easily. They just plug-in to the existing code base. Instead, if we use one TextCodec for each font, then these codecs should either duplicate the logic or rely on external (Padma based) classes to do the transcoding. In the latter case the TextCodec will simply add one new layer to the transcoding process. Also the current implementation does not exactly fit into a code-decode process. The current implementation converts text (in a particular indic font) into an intermediate form and this intermediate form (there is no decoding logic from intermediate form to the initial font) will get converted into the target language. Considering the difference in architecture, I am still keeping the transcoding machinery separate.
* The TSCII Encoder I have added is just one encoder. There will be atleast a dozen more (once we figure out what are the major sites we want to support).
* Site based restriction for transcoding has been removed. As mjs suggested, if we want to bring it back we can do so., Let me know whether we need this.
Comment 10 Prunthaban 2009-01-28 19:20:48 PST
When can I expect a review for this? 
Comment 11 Eric Seidel (no email) 2009-02-16 00:00:37 PST
Comment on attachment 26892 [details]
Patch to support on-the-fly transcoding for one font (new version)

Wow, if the world ever needed a for loop, looks like this might be it!

   m_lookupTable.set(String(in + 0, 1), String(out + 0, 1));
 125     m_lookupTable.set(String(in + 1, 1), String(out + 1, 1));
 126
Comment 12 Prunthaban 2009-02-16 00:47:46 PST
(In reply to comment #11)
> (From update of attachment 26892 [details] [review])
> Wow, if the world ever needed a for loop, looks like this might be it!
> 
>    m_lookupTable.set(String(in + 0, 1), String(out + 0, 1));
>  125     m_lookupTable.set(String(in + 1, 1), String(out + 1, 1));
>  126 
> 
The problem is the Strings are not always of length 1 and they do not follow an order. I should have still had the length in an array and used a for loop. But then we will ultimately end up having two extra arrays with numbers filled in and a loop accumulating the start index. This part of the code was generated from Padma JS file using an automated program I have written (so that it will be easier to do the same for other fonts) and it was easier to keep it this way. If for readability, the loop based approach is preferred, I will change it.
Comment 13 Darin Adler 2009-02-23 16:37:01 PST
Comment on attachment 26892 [details]
Patch to support on-the-fly transcoding for one font (new version)

> + virtual Vector<UChar> transcode(const UChar* characters, int length) = 0;

Returning a Vector<UChar> is inefficient, because it requires copying the entire vector; it's not reference counted or anything like that. Instead the transcoder should take a Vector<UChar>& as an "out" parameter and put the result into the vector.

> +#if ENABLE(TEXT_TRANSCODER)
> +    , m_isTranscodableFont(transcoder()->isSupportedFont(family().family().string()))
> +#endif

How many different Font objects are created when browsing? Will this additional function call and hash table overhead have measurable performance impact? If not, why not?

> +#include "StringImpl.h"

This include should not be needed. Please double check there are no extra includes.

> +namespace WebCore
> +{

The brace should go on the same line.

> +DynamicParser::DynamicParser(const String& input, FontEncoder* encoder)
> +    : Parser(input, encoder)

Ownership is unclear to me. What guarantees the lifetime of the FontEncoder?

> +    if (encoder->isPreprocessRequired())

Name should be isPreprocessingRequired.

> +    m_length = m_text.length();

Why store a separate m_length if it's always the same as m_text.length()?

> +    m_hasSuffixes = m_encoder->hasSuffixes();

Why store m_hasSuffixes rather than calling m_encoder->hasSuffixes() each time?

> +String DynamicParser::removeRedundantSymbols()
> +{
> +    Vector<UChar> response;
> +    for (int i = 0; i < m_text.length(); i++) {
> +        UChar c = m_text[i];
> +        if (!m_encoder->isRedundant(c))
> +            response.append(c);
> +    }
> +    return String::adopt(response);
> +}

For better performance, it would be best to just return m_text as-is if there are no redundant symbols rather than constructing a new String object. You should do a reserveCapacity on the vector since it typically will be about the same length as m_text. The index should be size_t, not int.

> +                String lookupKey = m_text.substring(m_index, i);

The substring operation requires memory allocation so it is expensive. Is there a way to do this without allocating new strings all the time?

> +    String gunintam = syllable.gunintam();
> +    if (gunintam.length() == 2)
> +        syllable.setGunintam(m_encoder->handleTwoPartVowelSigns(gunintam[0], gunintam[1]));

If guintam is always two characters, then I suggest returning two characters rather than allocating a string.

> +#include "FontEncoder.h"
> +#include "Parser.h"
> +#include "PlatformString.h"
> +#include "Syllable.h"

You need to include "Parser.h" here, but all of the other includes seem unnecssary.

> +    DynamicParser(const String& input, FontEncoder* encoder);

In WebKit code we leave out argument names that are clear based on only the type. For example "encoder" would be omitted here.

> +    virtual bool handleConsonantTermination(Syllable& syllable);

And "syllable" here.

> +    virtual bool isCurrentSyllableComplete(SyllableType type);

And "type" here.

> +#include "PlatformString.h"

This file doesn't use String, it uses Vector and UChar, so it should include the headers for those, not "PlatformString.h".

> +namespace WebCore
> +{

Brace goes up on previous line.

> +    virtual String preprocessMessage(const String&);
> +    virtual bool isPreprocessRequired();
> +    virtual bool hasSuffixes();
> +    virtual bool isOverloaded(const String&);
> +    virtual bool isSuffixSymbol(const String&);
> +    virtual bool isPrefixSymbol(const String&);
> +
> +    virtual String lookup(const String& key) = 0;
> +    virtual String handleTwoPartVowelSigns(UChar a, UChar b) = 0;

The names "a" and "b" should be omitted here.

The interface here is heavy on use of String; this will cause the text to have to be put into lots of separate String objects. Could these functions instead work some other way so they won't require so much memory allocation?

> Property changes on: WebCore/platform/text/transcoder/FontEncoder.h
> ___________________________________________________________________
> Name: svn:executable
>    + *

Please do not submit patches with executable source files.

> +const SyllableType FontTranscoder::type[] = {TypeAccuMod, TypeAccuMod, 
> +    TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, 
> +    TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, 
> +    TypeUnknown, TypeHalluMod, TypeUnknown, TypeUnknown, TypeDigit, TypeDigit, 
> +    TypeDigit, TypeDigit, TypeDigit, TypeDigit, TypeDigit, TypeDigit, TypeDigit, 
> +    TypeDigit, TypeGunintam, TypeGunintam, TypeGunintam, TypeGunintam, 
> +    TypeUnknown, TypeUnknown, TypeAccu, TypeAccu, TypeAccu, TypeAccu, TypeAccu, 
> +    TypeAccu, TypeAccu, TypeAccu, TypeAccu, TypeAccu, TypeAccu, TypeAccu, 
> +    TypeAccu, TypeAccu, TypeAccu, TypeAccu, TypeAccu, TypeAccu, TypeAccu, 
> +    TypeHallu, TypeHallu, TypeHallu, TypeHallu, TypeHallu, TypeHallu, TypeHallu, 
> +    TypeHallu, TypeHallu, TypeHallu, TypeHallu, TypeHallu, TypeHallu, TypeHallu, 
> +    TypeHallu, TypeHallu, TypeHallu, TypeHallu, TypeHallu, TypeHallu, TypeHallu, 
> +    TypeHallu, TypeHallu, TypeHallu, TypeHallu, TypeHallu, TypeHallu, TypeHallu, 
> +    TypeHallu, TypeHallu, TypeHallu, TypeHallu, TypeHallu, TypeHallu, TypeHallu, 
> +    TypeHallu, TypeHallu, TypeHallu, TypeHallu, TypeHallu, TypeHallu, TypeHallu, 
> +    TypeHallu, TypeHallu, TypeHallu, TypeHallu, TypeHallu, TypeHallu, TypeHallu, 
> +    TypeHallu, TypeHallu, TypeHallu, TypeHallu, TypeHallu, TypeUnknown, 
> +    TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, 
> +    TypeUnknown, TypeDigit, TypeDigit, TypeDigit, TypeUnknown, TypeUnknown, 
> +    TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, 
> +    TypeUnknown, TypeHalluMod, TypeHalluMod, TypeAccuMod, TypeUnknown, 
> +    TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, 
> +    TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, 
> +    TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, 
> +    TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, 
> +    TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, 
> +    TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, 
> +    TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, 
> +    TypeGunintam, TypeGunintam, TypeGunintam, TypeGunintam, TypeGunintam, 
> +    TypeGunintam, TypeGunintam, TypeGunintam, TypeGunintam, TypeGunintam, 
> +    TypeGunintam, TypeGunintam, TypeGunintam, TypeGunintam, TypeGunintam, 
> +    TypeGunintam, TypeGunintam, TypeVattu, TypeHallu, TypeVattu, TypeHallu, 
> +    TypeVattu, TypeVattu, TypeHallu, TypeVattu, TypeVattu, TypeVattu, TypeVattu, 
> +    TypeHallu, TypeVattu, TypeVattu, TypeVattu, TypeVattu, TypeVattu, TypeHallu, 
> +    TypeVattu, TypeHallu, TypeVattu, TypeVattu, TypeVattu, TypeVattu, TypeVattu, 
> +    TypeVattu, TypeVattu, TypeVattu, TypeHallu, TypeVattu, TypeVattu, TypeVattu, 
> +    TypeVattu, TypeVattu, TypeHallu, TypeVattu, TypeVattu, TypeVattu, TypeVattu, 
> +    TypeVattu, TypeVattu, TypeVattu, TypeVattu, TypeVattu, TypeVattu, TypeVattu, 
> +    TypeVattu, TypeVattu, TypeVattu, TypeVattu, TypeVattu, TypeVattu, TypeVattu, 
> +    TypeVattu, TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, 
> +    TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, 
> +    TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, 
> +    TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, 
> +    TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, 
> +    TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, 
> +    TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, 
> +    TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, 
> +    TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, 
> +    TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, 
> +    TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, 
> +    TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, 
> +    TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, 
> +    TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, 
> +    TypeUnknown, TypeUnknown, TypeUnknown, TypeUnknown, TypeHalfForm, 
> +    TypeHalfForm, TypeHalfForm, TypeHalfForm, TypeHalfForm, TypeHalfForm, 
> +    TypeHalfForm, TypeHalfForm, TypeHalfForm, TypeHalfForm, TypeHalfForm, 
> +    TypeHalfForm, TypeHalfForm, TypeHalfForm, TypeHalfForm, TypeHalfForm, 
> +    TypeHalfForm, TypeHalfForm, TypeHalfForm, TypeHalfForm, TypeHalfForm, 
> +    TypeHalfForm, TypeHalfForm, TypeHalfForm, TypeHalfForm, TypeHalfForm, 
> +    TypeHalfForm, TypeHalfForm, TypeHalfForm, TypeHalfForm, TypeHalfForm, 
> +    TypeHalfForm, TypeHalfForm, TypeHalfForm, TypeHalfForm, TypeHalfForm, 
> +    TypeHalfForm, TypeHalfForm, TypeHalfForm, TypeHalfForm, TypeHalfForm, 
> +    TypeHalfForm, TypeHalfForm, TypeHalfForm, TypeHalfForm, TypeHalfForm, 
> +    TypeHalfForm, TypeHalfForm, TypeHalfForm, TypeHalfForm, TypeHalfForm, 
> +    TypeHalfForm, TypeHalfForm, TypeHalfForm, TypeUnknown };

Formatting here doesn't put braces where we normally would.

> +    // Assign index to fonts
> +    m_encoderIndex.set("InaiMathiNewTSC",0);
> +
> +    m_converterIndex.set("InaiMathiNewTSC",0);

Why not put the actual pointers into these maps? I don't see a reason to have them go indirectly through array indices. And why not have a single map with a struct that holds both the encoder and the converter?

> +    m_transcodableSites.add("www.kalkiweekly.com");

Requiring a particular host name seems like a bad idea. I think the font name should be enough. Did we already discuss this?

> +bool FontTranscoder::isTranscodableSite(const String& origin) const
> +{
> +    return m_transcodableSites.contains(origin);
> +}

Why no check for empty string here?

> +String FontTranscoder::convert(const String& text, const String& font)
> +{
> +    if (!m_encoderIndex.contains(font) || !m_converterIndex.contains(font))
> +        return text;
> +    int encoderIndex = m_encoderIndex.get(font);
> +    int converterIndex = m_converterIndex.get(font);

This is a poor idiom because it does extra hash table lookups. With a find call you can both check if something is in the map and get at the value that's in there. That's twice as fast as doing a contains followed by a get. By making this a single map you can make this four times faster.

> +    Parser* parser = new DynamicParser(text, m_encoders[encoderIndex]);

You should use OwnPtr here rather than an explicit call to delete.

> +class FontTranscoder {
> +public:
> +    static const int baseIndex = 0xEC00;
> +    static const int baseStart = 0xEC22;
> +    static const int baseEnd   = 0xEC68;
> +    static const int dependentStart = 0xECA2;
> +    static const int dependentEnd   = 0xECE8;
> +    static const int halfStart = 0xED33;
> +    static const int halfEnd   = 0xED68;
> +    static const int vattuStart = 0xECB3;
> +    static const int vattuEnd   = 0xECE8;
> +
> +    // Constants definitions
> +    static const UChar anusvara = 0xEC00;
> +    static const UChar visarga = 0xEC01;
> +    static const UChar pollu = 0xEC02;
> +    static const UChar chillu = 0xEC0D;
> +    static const UChar syllbreak = 0xEC7B;
> +    static const UChar nukta = 0xEC7C;
> +    static const UChar ardhavisarga = 0xEC7D;
> +    static const UChar tippi = 0xEC7E;
> +    static const UChar addak = 0xEC7F;
> +    static const UChar ekonkar = 0xEC80;
> +    static const UChar virama = 0xEC02;
> +    static const UChar chandrakkala = 0xEC02;
> +    static const UChar pulli = 0xEC02;
> +    static const UChar halant = 0xEC02;
> +    static const UChar hasant = 0xEC02;
> +    static const UChar candrabindu = 0xEC03;
> +    static const UChar avagraha = 0xEC04;
> +    static const UChar yati = 0xEC05;
> +    static const UChar udAtta = 0xEC06;
> +    static const UChar anudAtta = 0xEC07;
> +    static const UChar svarita = 0xEC08;
> +    static const UChar guru = 0xEC09;
> +    static const UChar laghu = 0xEC0A;
> +    static const UChar danda = 0xEC0B;
> +    static const UChar ddanda = 0xEC0C;
> +    static const UChar abbrev = 0xEC0E;
> +    static const UChar om = 0xEC0F;
> +    static const UChar digitZERO = 0xEC10;
> +    static const UChar digitONE = 0xEC11;
> +    static const UChar digitTWO = 0xEC12;
> +    static const UChar digitTHREE = 0xEC13;
> +    static const UChar digitFOUR = 0xEC14;
> +    static const UChar digitFIVE = 0xEC15;
> +    static const UChar digitSIX = 0xEC16;
> +    static const UChar digitSEVEN = 0xEC17;
> +    static const UChar digitEIGHT = 0xEC18;
> +    static const UChar digitNINE = 0xEC19;
> +    static const UChar vowelA = 0xEC20;
> +    static const UChar vowelSHTA = 0xEC21;
> +    static const UChar vowelAA = 0xEC22;
> +    static const UChar vowelI = 0xEC23;
> +    static const UChar vowelII = 0xEC24;
> +    static const UChar vowelU = 0xEC25;
> +    static const UChar vowelUU = 0xEC26;
> +    static const UChar vowelR = 0xEC27;
> +    static const UChar vowelRR = 0xEC28;
> +    static const UChar vowelL = 0xEC29;
> +    static const UChar vowelLL = 0xEC2A;
> +    static const UChar vowelCDRE = 0xEC2B;
> +    static const UChar vowelE = 0xEC2C;
> +    static const UChar vowelEE = 0xEC2D;
> +    static const UChar vowelAI = 0xEC2E;
> +    static const UChar vowelCDRO = 0xEC2F;
> +    static const UChar vowelO = 0xEC30;
> +    static const UChar vowelOO = 0xEC31;
> +    static const UChar vowelAU = 0xEC32;
> +    static const UChar consntKA = 0xEC33;
> +    static const UChar consntQA = 0xEC34;
> +    static const UChar consntKHA = 0xEC35;
> +    static const UChar consntKHHA = 0xEC36;
> +    static const UChar consntGA = 0xEC37;
> +    static const UChar consntGHA = 0xEC38;
> +    static const UChar consntGHHA = 0xEC39;
> +    static const UChar consntNGA = 0xEC3A;
> +    static const UChar consntCA = 0xEC3B;
> +    static const UChar consntCHA = 0xEC3C;
> +    static const UChar consntJA = 0xEC3D;
> +    static const UChar consntZA = 0xEC3E;
> +    static const UChar consntJHA = 0xEC3F;
> +    static const UChar consntNYA = 0xEC40;
> +    static const UChar consntTTA = 0xEC41;
> +    static const UChar consntTTHA = 0xEC42;
> +    static const UChar consntDDA = 0xEC43;
> +    static const UChar consntDDDHA = 0xEC44;
> +    static const UChar consntDDHA = 0xEC45;
> +    static const UChar consntRHA = 0xEC46;
> +    static const UChar consntNNA = 0xEC47;
> +    static const UChar consntTA = 0xEC48;
> +    static const UChar consntTHA = 0xEC49;
> +    static const UChar consntDA = 0xEC4A;
> +    static const UChar consntDHA = 0xEC4B;
> +    static const UChar consntNA = 0xEC4C;
> +    static const UChar consntNNNA = 0xEC4D;
> +    static const UChar consntPA = 0xEC4E;
> +    static const UChar consntFA = 0xEC4F;
> +    static const UChar consntPHA = 0xEC50;
> +    static const UChar consntBA = 0xEC51;
> +    static const UChar consntBHA = 0xEC52;
> +    static const UChar consntMA = 0xEC53;
> +    static const UChar consntYA = 0xEC54;
> +    static const UChar consntYYA = 0xEC55;
> +    static const UChar consntRA = 0xEC56;
> +    static const UChar consntRRA = 0xEC57;
> +    static const UChar consntLA = 0xEC58;
> +    static const UChar consntLLA = 0xEC59;
> +    static const UChar consntZHA = 0xEC5A;
> +    static const UChar consntVA = 0xEC5B;
> +    static const UChar consntSHA = 0xEC5C;
> +    static const UChar consntSSA = 0xEC5D;
> +    static const UChar consntSA = 0xEC5E;
> +    static const UChar consntHA = 0xEC5F;
> +    static const UChar consntTSA = 0xEC60;
> +    static const UChar consntDJA = 0xEC61;
> +    static const UChar consntGGA = 0xEC62;
> +    static const UChar consntJJA = 0xEC63;
> +    static const UChar consntDDDA = 0xEC64;
> +    static const UChar consntBBA = 0xEC65;
> +    static const UChar consntRAMD = 0xEC66;
> +    static const UChar consntRALD = 0xEC67;
> +    static const UChar consntKHANDATA = 0xEC68;
> +    static const UChar vowelsnAA = 0xECA2;
> +    static const UChar vowelsnI = 0xECA3;
> +    static const UChar vowelsnII = 0xECA4;
> +    static const UChar vowelsnU = 0xECA5;
> +    static const UChar vowelsnUU = 0xECA6;
> +    static const UChar vowelsnR = 0xECA7;
> +    static const UChar vowelsnRR = 0xECA8;
> +    static const UChar vowelsnL = 0xECA9;
> +    static const UChar vowelsnLL = 0xECAA;
> +    static const UChar vowelsnCDRE = 0xECAB;
> +    static const UChar vowelsnE = 0xECAC;
> +    static const UChar vowelsnEE = 0xECAD;
> +    static const UChar vowelsnAI = 0xECAE;
> +    static const UChar vowelsnCDRO = 0xECAF;
> +    static const UChar vowelsnO = 0xECB0;
> +    static const UChar vowelsnOO = 0xECB1;
> +    static const UChar vowelsnAU = 0xECB2;
> +    static const UChar vowelsnEELEN = 0xEC1A;
> +    static const UChar vowelsnAILEN = 0xEC1B;
> +    static const UChar vowelsnAULEN = 0xEC1C;
> +    static const UChar vowelsnIILEN = 0xEC1D;
> +    static const UChar vattuKA = 0xECB3;
> +    static const UChar vattuQA = 0xECB4;
> +    static const UChar vattuKHA = 0xECB5;
> +    static const UChar vattuKHHA = 0xECB6;
> +    static const UChar vattuGA = 0xECB7;
> +    static const UChar vattuGHA = 0xECB8;
> +    static const UChar vattuGHHA = 0xECB9;
> +    static const UChar vattuNGA = 0xECBA;
> +    static const UChar vattuCA = 0xECBB;
> +    static const UChar vattuCHA = 0xECBC;
> +    static const UChar vattuJA = 0xECBD;
> +    static const UChar vattuZA = 0xECBE;
> +    static const UChar vattuJHA = 0xECBF;
> +    static const UChar vattuNYA = 0xECC0;
> +    static const UChar vattuTTA = 0xECC1;
> +    static const UChar vattuTTHA = 0xECC2;
> +    static const UChar vattuDDA = 0xECC3;
> +    static const UChar vattuDDDHA = 0xECC4;
> +    static const UChar vattuDDHA = 0xECC5;
> +    static const UChar vattuRHA = 0xECC6;
> +    static const UChar vattuNNA = 0xECC7;
> +    static const UChar vattuTA = 0xECC8;
> +    static const UChar vattuTHA = 0xECC9;
> +    static const UChar vattuDA = 0xECCA;
> +    static const UChar vattuDHA = 0xECCB;
> +    static const UChar vattuNA = 0xECCC;
> +    static const UChar vattuNNNA = 0xECCD;
> +    static const UChar vattuPA = 0xECCE;
> +    static const UChar vattuFA = 0xECCF;
> +    static const UChar vattuPHA = 0xECD0;
> +    static const UChar vattuBA = 0xECD1;
> +    static const UChar vattuBHA = 0xECD2;
> +    static const UChar vattuMA = 0xECD3;
> +    static const UChar vattuYA = 0xECD4;
> +    static const UChar vattuYYA = 0xECD5;
> +    static const UChar vattuRA = 0xECD6;
> +    static const UChar vattuRRA = 0xECD7;
> +    static const UChar vattuLA = 0xECD8;
> +    static const UChar vattuLLA = 0xECD9;
> +    static const UChar vattuZHA = 0xECDA;
> +    static const UChar vattuVA = 0xECDB;
> +    static const UChar vattuSHA = 0xECDC;
> +    static const UChar vattuSSA = 0xECDD;
> +    static const UChar vattuSA = 0xECDE;
> +    static const UChar vattuHA = 0xECDF;
> +    static const UChar vattuTSA = 0xECE0;
> +    static const UChar vattuDJA = 0xECE1;
> +    static const UChar vattuGGA = 0xECE2;
> +    static const UChar vattuJJA = 0xECE3;
> +    static const UChar vattuDDDA = 0xECE4;
> +    static const UChar vattuBBA = 0xECE5;
> +    static const UChar vattuRAMD = 0xECE6;
> +    static const UChar vattuRALD = 0xECE7;
> +    static const UChar vattuKHANDATA = 0xECE8;
> +    static const UChar halffmKA = 0xED33;
> +    static const UChar halffmQA = 0xED34;
> +    static const UChar halffmKHA = 0xED35;
> +    static const UChar halffmKHHA = 0xED36;
> +    static const UChar halffmGA = 0xED37;
> +    static const UChar halffmGHA = 0xED38;
> +    static const UChar halffmGHHA = 0xED39;
> +    static const UChar halffmNGA = 0xED3A;
> +    static const UChar halffmCA = 0xED3B;
> +    static const UChar halffmCHA = 0xED3C;
> +    static const UChar halffmJA = 0xED3D;
> +    static const UChar halffmZA = 0xED3E;
> +    static const UChar halffmJHA = 0xED3F;
> +    static const UChar halffmNYA = 0xED40;
> +    static const UChar halffmTTA = 0xED41;
> +    static const UChar halffmTTHA = 0xED42;
> +    static const UChar halffmDDA = 0xED43;
> +    static const UChar halffmDDDHA = 0xED44;
> +    static const UChar halffmDDHA = 0xED45;
> +    static const UChar halffmRHA = 0xED46;
> +    static const UChar halffmNNA = 0xED47;
> +    static const UChar halffmTA = 0xED48;
> +    static const UChar halffmTHA = 0xED49;
> +    static const UChar halffmDA = 0xED4A;
> +    static const UChar halffmDHA = 0xED4B;
> +    static const UChar halffmNA = 0xED4C;
> +    static const UChar halffmNNNA = 0xED4D;
> +    static const UChar halffmPA = 0xED4E;
> +    static const UChar halffmFA = 0xED4F;
> +    static const UChar halffmPHA = 0xED50;
> +    static const UChar halffmBA = 0xED51;
> +    static const UChar halffmBHA = 0xED52;
> +    static const UChar halffmMA = 0xED53;
> +    static const UChar halffmYA = 0xED54;
> +    static const UChar halffmYYA = 0xED55;
> +    static const UChar halffmRA = 0xED56;
> +    static const UChar halffmRRA = 0xED57;
> +    static const UChar halffmLA = 0xED58;
> +    static const UChar halffmLLA = 0xED59;
> +    static const UChar halffmZHA = 0xED5A;
> +    static const UChar halffmVA = 0xED5B;
> +    static const UChar halffmSHA = 0xED5C;
> +    static const UChar halffmSSA = 0xED5D;
> +    static const UChar halffmSA = 0xED5E;
> +    static const UChar halffmHA = 0xED5F;
> +    static const UChar halffmTSA = 0xED60;
> +    static const UChar halffmDJA = 0xED61;
> +    static const UChar halffmGGA = 0xED62;
> +    static const UChar halffmJJA = 0xED63;
> +    static const UChar halffmDDDA = 0xED64;
> +    static const UChar halffmBBA = 0xED65;
> +    static const UChar halffmRAMD = 0xED66;
> +    static const UChar halffmRALD = 0xED67;
> +    static const UChar halffmKHANDATA = 0xED68;
> +    static const UChar digitTEN = 0xEC70;
> +    static const UChar digitHUNDRED = 0xEC71;
> +    static const UChar digitTHOUSAND = 0xEC72;
> +    static const UChar signDAY = 0xEC73;
> +    static const UChar signMONTH = 0xEC74;
> +    static const UChar signYEAR = 0xEC75;
> +    static const UChar signDEBIT = 0xEC76;
> +    static const UChar signCREDIT = 0xEC77;
> +    static const UChar signASABOVE = 0xEC78;
> +    static const UChar signRUPEE = 0xEC79;
> +    static const UChar signNUMBER = 0xEC7A;
> +
> +    static const UChar UniCodeSharedNUKTA    = 0x093C;
> +    static const UChar UniCodeSharedAVAGRAHA = 0x093D;
> +    static const UChar UniCodeSharedOM       = 0x0950;
> +    static const UChar UniCodeSharedUDATTA   = 0x0951;
> +    static const UChar UniCodeSharedANUDATTA = 0x0952;
> +    static const UChar UniCodeSharedSVARITA  = 0x0951;
> +    static const UChar UniCodeSharedDANDA    = 0x0964;
> +    static const UChar UniCodeSharedDDANDA   = 0x0965;
> +    static const UChar UniCodeSharedabbrev   = 0x0970;
> +    static const UChar UniCodeSharedZWNJ     = 0x200C;
> +    static const UChar UniCodeSharedZWJ      = 0x200D;

Unicode doesn't have a capital "C".

I don't understand why these constants are part of the public interface to FontTranscoder. Can we put them somewhere else?

> +    static String fastVattu(UChar code)
> +    {
> +        String response = "";
> +        response.append(static_cast<UChar>(code + 0x80));
> +        return response;
> +    }

To construct a single-character string, you should initialize a character and then:

    return String(&character, 1);

But don't do it. Don't create interfaces that require allocating a string (two calls to the memory allocator, and reference counting overhead) to hold a single character!

> +    static String baseForm(const String& input) 
> +    {
> +        Vector<UChar> response;
> +        for (int i = 0; i < input.length(); i++) {
> +            UChar code = input[i];
> +            if (code >= dependentStart && code <= dependentEnd)
> +                response.append(static_cast<UChar>(code - 0x80));
> +            if (code >= halfStart && code <= halfEnd)
> +                response.append(static_cast<UChar>(code - 0x100));
> +        }
> +        return String::adopt(response);
> +    }

I don't understand why these helper functions are public members of the FontTranscoder class. There's no need for clients using this class to see these. Lets find somewhere else to put these. These functions use strings way too much -- it's going to make the code slow!

> +                if(!value.isNull())
> +                    break;

Need a space after the "if" here.

> +void Parser::process(const String& value, const String& key, Syllable& current
> +                     ,bool prefix, bool suffix)

Comma goes at the end of the previous line, not the start of the next line.

Also, we frown on the use of bools for things like "prefix" and "suffix". Makes the code hard to understand at the call site.

> +typedef enum 
> +{
> +    TransformerUnicode,
> +    TransformerISCII,
> +    TransformerITrans,
> +    TransformerTSCII,
> +    TransformerTAB,
> +    TransformerTAM,
> +    TransformerDynamicFonts
> +} TransformerType;

This is the C language way to do an enum. In C++ you can just do:

    enum TransformerType {
    };

> +    void process(const String& value
> +            ,const String& key
> +            ,Syllable& syllable
> +            ,bool prefix = false
> +            ,bool suffix = false);

Commas in the wrong place here.

> +    for (int i = 0; i < input.length(); i++) {
> +        if (i != 0)
> +            type = FontTranscoder::getType(input[i]);
> +        if (type == TypeAccu || type == TypeDigit || type == TypeHallu 
> +            || type == TypeUnknown)
> +            m_body.append(input[i]);
> +        else if (type == TypeHalfForm) {
> +            if (suffix)
> +                m_body.insert(input.characters() + i,1,0);
> +            else
> +                m_body.append(input[i]);
> +        } else if (type == TypeGunintam) {
> +            if (prefix)
> +                m_prefixGunintam.append(input[i]);
> +            else
> +                m_gunintam.append(input[i]);
> +        } else if (type == TypeVattu) {
> +            if (prefix)
> +                m_prefixVattu.append(input[i]);
> +            else
> +                m_body.append(input[i]);
> +        } else if (type == TypeAccuMod)
> +            m_vowelModifier.append(input[i]);
> +        else if (type == TypeHalluMod)
> +            m_consonantModifier.append(input[i]);
> +    }

Building a String object a character at a time is going to be excruciatingly slow. You need a better idiom for this, perhaps using Vector<UChar> instead.

> +public:
> +    static const UChar visarga = 0x00B7;
> +    static const UChar vowelA = 0x00AB;
> +    static const UChar vowelAA = 0x00AC;
> +    static const UChar vowelI1 = 0x00AD;
> +    static const UChar vowelI2 = 0x00FE;
> +    static const UChar vowelII = 0x00AE;
> +    static const UChar vowelU = 0x00AF;
> +    static const UChar vowelUU = 0x00B0;
> +    static const UChar vowelE = 0x00B1;
> +    static const UChar vowelEE = 0x00B2;
> +    static const UChar vowelAI = 0x00B3;
> +    static const UChar vowelO = 0x00B4;
> +    static const UChar vowelOO = 0x00B5;
> +    static const UChar vowelAU = 0x00B6;
> +    static const UChar consntKA = 0x00B8;
> +    static const UChar consntNGA = 0x00B9;
> +    static const UChar consntCA = 0x00BA;
> +    static const UChar consntJA = 0x0192;
> +    static const UChar consntNYA = 0x00BB;
> +    static const UChar consntTTA = 0x00BC;
> +    static const UChar consntNNA = 0x00BD;
> +    static const UChar consntTA = 0x00BE;
> +    static const UChar consntNA = 0x00BF;
> +    static const UChar consntNNNA = 0x00C9;
> +    static const UChar consntPA = 0x00C0;
> +    static const UChar consntMA = 0x00C1;
> +    static const UChar consntYA = 0x00C2;
> +    static const UChar consntRA = 0x00C3;
> +    static const UChar consntLA = 0x00C4;
> +    static const UChar consntVA = 0x00C5;
> +    static const UChar consntZHA = 0x00C6;
> +    static const UChar consntLLA = 0x00C7;
> +    static const UChar consntRRA = 0x00C8;
> +    static const UChar consntSSA = 0x201E;
> +    static const UChar consntSA = 0x2026;
> +    static const UChar consntHA = 0x2020;
> +    static const UChar conjctKSH = 0x2021;
> +    static const UChar conjctSRII = 0x201A;
> +    static const UChar vowelsnAA = 0x00A1;
> +    static const UChar vowelsnI = 0x00A2;
> +    static const UChar vowelsnII = 0x00A3;
> +    static const UChar vowelsnU = 0x00A4;
> +    static const UChar vowelsnUU = 0x00A5;
> +    static const UChar vowelsnE = 0x00A6;
> +    static const UChar vowelsnEE = 0x00A7;
> +    static const UChar vowelsnAI = 0x00A8;
> +    static const UChar vowelsnAULEN = 0x00AA;
> +    static const UChar comboKPULLI = 0x00EC;
> +    static const UChar comboNGPULLI = 0x00ED;
> +    static const UChar comboCPULLI = 0x00EE;
> +    static const UChar comboJPULLI = 0x02C6;
> +    static const UChar comboNYPULLI = 0x00EF;
> +    static const UChar comboTTPULLI = 0x00F0;
> +    static const UChar comboNNPULLI = 0x00F1;
> +    static const UChar comboTPULLI = 0x00F2;
> +    static const UChar comboNPULLI = 0x00F3;
> +    static const UChar comboNNNPULLI = 0x00FD;
> +    static const UChar comboPPULLI = 0x00F4;
> +    static const UChar comboMPULLI = 0x00F5;
> +    static const UChar comboYPULLI = 0x00F6;
> +    static const UChar comboRPULLI = 0x00F7;
> +    static const UChar comboLPULLI = 0x00F8;
> +    static const UChar comboVPULLI = 0x00F9;
> +    static const UChar comboZHPULLI = 0x00FA;
> +    static const UChar comboLLPULLI = 0x00FB;
> +    static const UChar comboSSPULLI = 0x2030;
> +    static const UChar comboSPULLI = 0x0160;
> +    static const UChar comboHPULLI = 0x2039;
> +    static const UChar comboRRPULLI = 0x00FC;
> +    static const UChar comboKSHPULLI = 0x0152;
> +    static const UChar comboTTI = 0x00CA;
> +    static const UChar comboTTII = 0x00CB;
> +    static const UChar comboKU = 0x00CC;
> +    static const UChar comboKUU = 0x00DC;
> +    static const UChar comboNGU = 0x2122;
> +    static const UChar comboNGUU = 0x203A;
> +    static const UChar comboCU = 0x00CD;
> +    static const UChar comboCUU = 0x00DD;
> +    static const UChar comboNYU = 0x0161;
> +    static const UChar comboNYUU = 0x0153;
> +    static const UChar comboTTU = 0x00CE;
> +    static const UChar comboTTUU = 0x00DE;
> +    static const UChar comboNNU = 0x00CF;
> +    static const UChar comboNNUU = 0x00DF;
> +    static const UChar comboTU = 0x00D0;
> +    static const UChar comboTUU = 0x00E0;
> +    static const UChar comboNU = 0x00D1;
> +    static const UChar comboNUU = 0x00E1;
> +    static const UChar comboNNNU = 0x00DB;
> +    static const UChar comboNNNUU = 0x00EB;
> +    static const UChar comboPU = 0x00D2;
> +    static const UChar comboPUU = 0x00E2;
> +    static const UChar comboMU = 0x00D3;
> +    static const UChar comboMUU = 0x00E3;
> +    static const UChar comboYU = 0x00D4;
> +    static const UChar comboYUU = 0x00E4;
> +    static const UChar comboRU = 0x00D5;
> +    static const UChar comboRUU = 0x00E5;
> +    static const UChar comboLU = 0x00D6;
> +    static const UChar comboLUU = 0x00E6;
> +    static const UChar comboVU = 0x00D7;
> +    static const UChar comboVUU = 0x00E7;
> +    static const UChar comboZHU = 0x00D8;
> +    static const UChar comboZHUU = 0x00E8;
> +    static const UChar comboLLU = 0x00D9;
> +    static const UChar comboLLUU = 0x00E9;
> +    static const UChar comboRRU = 0x00DA;
> +    static const UChar comboRRUU = 0x00EA;
> +    static const UChar digitZERO = 0x20AC;
> +    static const UChar digitONE = 0x0081;
> +    static const UChar digitTWO = 0x008D;
> +    static const UChar digitTHREE = 0x017D;
> +    static const UChar digitFOUR = 0x008F;
> +    static const UChar digitFIVE = 0x0090;
> +    static const UChar digitSIX = 0x2022;
> +    static const UChar digitSEVEN = 0x2013;
> +    static const UChar digitEIGHT = 0x2014;
> +    static const UChar digitNINE = 0x02DC;
> +    static const UChar digitTEN = 0x009D;
> +    static const UChar digitHUNDRED = 0x017E;
> +    static const UChar digitTHOUSAND = 0x0178;
> +    static const UChar LQTSINGLE = 0x2018;
> +    static const UChar RQTSINGLE = 0x2019;
> +    static const UChar LQTDOUBLE = 0x201C;
> +    static const UChar RQTDOUBLE = 0x201D;
> +    static const UChar COPYRIGGHT = 0x00A9;

These should be private unless they are part of the public interface to this class. Ideally they shouldn't be class members at all. They could just go inside the .cpp file.

In general, names for Unicode characters should go into the CharacterNames.h header and should be the names from the Unicode specification, without any runs of all capital letters. For example: copyrightSign.

> -    if (oldTransform != style()->textTransform() || oldSecurity != style()->textSecurity()) {
> +    if (oldTransform != style()->textTransform() || oldSecurity != style()->textSecurity()
> +#if ENABLE(TEXT_TRANSCODER)
> +        || style()->font().isTranscodableFont()
> +#endif

This check seems wrong. It doesn't even check if the font changed. If it's the same font, we have no reason to re-transcode.

> +#if ENABLE(TEXT_TRANSCODER)
> +    if (style()->font().isTranscodableFont())
> +        m_text = transcode(String(m_text->characters(), m_text->length()));
> +#endif

Should just be String(m_text) here -- this code unnecessarily makes a deep copy.

>  #include "RenderObject.h"
> +#include <wtf/HashSet.h>
> +#include <wtf/HashMap.h>

Why did you add these includes?

>  
>  namespace WebCore {
>  
> @@ -147,6 +151,9 @@ private:
>      bool containsOnlyWhitespace(unsigned from, unsigned len) const;
>      int widthFromCache(const Font&, int start, int len, int xPos) const;
>      bool isAllASCII() const { return m_isAllASCII; }
> +#if ENABLE(TEXT_TRANSCODER)
> +    PassRefPtr<StringImpl> transcode(String text);
> +#endif

The argument type should be "const String&", not "String". The return type should probably just be String rather than PassRefPtr<StringImpl>.
Comment 14 Jungshik Shin 2009-02-25 14:09:59 PST
> In general, names for Unicode characters should go into the CharacterNames.h
> header and should be the names from the Unicode specification, without any runs
> of all capital letters. For example: copyrightSign.

Some of them in this patch are real Unicode character names while others are not. For instance, the following line :

   static const UChar comboZHUU = 0x00E8;

does not mean that U+00E8 is 'comboZHUU' in a Tamil font but means that 'U+00E8' (a Unicode character we have after converting to Unicode assuming that the text encoding is Windows-1252) actually represents 'comboZHUU' in a font-encoded font in effect. 

> Requiring a particular host name seems like a bad idea. I think the font name
> should be enough. Did we already discuss this?

We have discussed this before. In a sense, it's a policy issue. 

I believe font names are sufficiently unique to identify corresponding font encodings without a host name. For instance, when fontFoo was converted to a real Unicode-based font, its name is usually changed, too (like fontFoo_U or something like that). And, I haven't seen two Indic font-encoded fonts with the same name but with different 'font encodings' (but, that does not mean that there are none).

On the other hand, we do not want this convention (of using 'font-encoded fonts') to spread further than where they're used now. Using (font, host) pairs as keys rather than font alone would limit our support to a finite set of sites. 

Comment 15 Darin Adler 2009-02-25 14:39:57 PST
(In reply to comment #14)
> > In general, names for Unicode characters should go into the CharacterNames.h
> > header and should be the names from the Unicode specification, without any runs
> > of all capital letters. For example: copyrightSign.
> 
> Some of them in this patch are real Unicode character names while others are
> not. For instance, the following line :
> 
>    static const UChar comboZHUU = 0x00E8;

Yes, those should not go in CharacterNames.h.

> > Requiring a particular host name seems like a bad idea. I think the font name
> > should be enough. Did we already discuss this?
> 
> We have discussed this before. In a sense, it's a policy issue.
> 
> I believe font names are sufficiently unique to identify corresponding font
> encodings without a host name. For instance, when fontFoo was converted to a
> real Unicode-based font, its name is usually changed, too (like fontFoo_U or
> something like that). And, I haven't seen two Indic font-encoded fonts with the
> same name but with different 'font encodings' (but, that does not mean that
> there are none).
> 
> On the other hand, we do not want this convention (of using 'font-encoded
> fonts') to spread further than where they're used now. Using (font, host) pairs
> as keys rather than font alone would limit our support to a finite set of
> sites.

Given this justification, I'd prefer to omit the host name rule. I don't think that our browsers' behavior will be sufficient to affect the website developers behavior unless we had much larger market share.
Comment 16 Jungshik Shin 2009-02-25 16:47:20 PST
BTW, I'm afraid Tamil script is not such a good target script for  a proof-of-concept implementation because it does not have some of complexities other Indic scripts have. Devanagari script (for Hindi) would be better, IMHO. I guess the interface defined for FontEncoder is not sufficient for scripts more complex than Tamil. 

Well, I realize that exactly the same argument can be used to advocate for Tamil as the target for this pilot implementation. (it's simpler and easier to implement). And, because the code was ported from Padma which can handle other Indic scripts, perhaps, I should not be worried. :-)

As for TSCII (comment #3 and comment #4), it's different from other Indic font encodings in that it does not 'infringe' upon the ASCII range. In a TSCII-encoded html file,  bytes for '<', '>', 'd', 'i', 'v' have the same meaning as in ASCII. So, we can convert the html document in TSCII as **a whole** to Unicode. And, Tamil being simpler than other Indic scripts, this may be the case of other Tamil font encodings as well. 

In most other font encodings, the document-wide conversion does not work because what 0x61, 0x62, 0x41 mean varies within a single document depending on what font is in effect. When they're in text nodes that are enclosed by '<font face="fontHindiFoo">' or styled with 'font-family: fontHindiFoo', they represent some fragments of Indic syllables/grapheme clusters. So, transcoding has to be limited to those text nodes. 

In practice, I admit that this distinction is not that important except that having TSCII text decoder would speed up the rendering of TSCII-encoded documents because with that, TSCII-encoded documents do not have to go through what's being added in this bug. 

In conclusion, I'm taking back what I wrote about the need to implement TSCII in a separate bug. 

Comment 17 Jungshik Shin 2009-02-25 16:49:51 PST
(In reply to comment #15)

> Given this justification, I'd prefer to omit the host name rule. I don't think
> that our browsers' behavior will be sufficient to affect the website developers
> behavior unless we had much larger market share.

I wish we had larger influence :-). Anyway, I'm fine with that.  I wonder what Ian and Mike think of that. 


Comment 18 Nicholas Shanks 2009-09-10 01:37:00 PDT
For people doing Unicode evangelism, there is now an AAT font available for Malayalam:
http://sites.google.com/site/macmalayalam/

This is in addition to the excellent Bengali support provided by Ekushey:
http://ekushey.org/?page/osx

And the three fonts I converted, for Kannada, Telugu and Sinhala:
http://web.nickshanks.com/fonts/
Comment 19 Darin Adler 2010-01-12 09:28:56 PST
I think we should do this for the Japanese yen sign first to get the scaffolding in place. It's a much simpler case and lets us work out the machinery without involving the more-complex transcoding required for the Indian-script cases.
Comment 20 Jungshik Shin 2010-07-07 13:53:04 PDT
(In reply to comment #19)
> I think we should do this for the Japanese yen sign first to get the scaffolding in place. It's a much simpler case and lets us work out the machinery without involving the more-complex transcoding required for the Indian-script cases.

What do you think of doing something similar to handle better 'MS Office-exported html files"?  Those html files have :

<font face="webdings">abc 123</font> : to get some dingbats   
<font face="Symbol">S</font>  :  to get Greek Capital Sigma   

Some webkit ports pass a test (for the strict W3C CHARMOD interpretation) at  

http://www.hixie.ch/tests/adhoc/css/fonts/family/006.html

while others don't. 

I'm not so fond of bending rules so much, but ....   
(BTW, Chromium has a bug filed to pass the test : http://crbug.com/3766 )
Comment 21 Alexey Proskuryakov 2010-07-07 14:39:28 PDT
Yes, I think that would be a nice thing to do. We track Symbol font support as bug 14637, which is currently marked as blocked by transcoding bug.
Comment 22 Eric Seidel (no email) 2012-10-27 12:48:41 PDT
I'm curious if there is any updated context for this bug? or if it's just as important as it sounded in '09, just 3 years later? :)
Comment 23 Nicholas Shanks 2012-11-16 03:32:11 PST
(In reply to comment #14)
> I haven't seen two Indic font-encoded fonts with the same name but with different 'font encodings' (but, that does not mean that there are none).

I am responsible for one such font. There once was a custom-encoded 8-bit font called "Kedage", intended for Windows users. An OpenType, Unicode version of it was made, and christened "Kedage Unicode". I ported this Unicode font to AAT and released it as "Kedage". I chose not to append Unicode because there was no history an an existing font with that name on the platform, so deemed the differentiator unnecessary.
Comment 24 Jungshik Shin 2013-02-11 14:12:24 PST
(In reply to comment #22)
> I'm curious if there is any updated context for this bug? or if it's just as important as it sounded in '09, just 3 years later? :)

The Unicode is much more widely used than before, but still there are quite a lot of pages in one of Indian languages using this 'font hack encoding'. We still have a continuous stream of 'cryptic bug reports' from India saying that 'Tamil (put any Indic languages)' web page is rendered with gibberish. Most of them, if not all, refers to font-encoded web pages. 


In case of Chrome (as with Firefox), we implemented a transcoding extension, Padma ( https://chrome.google.com/webstore/detail/padma/ngifghlmhidnielinpjdkkiadocdffbi?hl=en ) to work around this problem. 

Unfortunately, this does not help Chrome on Android because Chrome on Android does not support an extension, yet. 

Recently, we realized that there are two 'pseudo-Unicode/font-based encodings' widely used for Burmese/Myanmar. [1]  In a sense, it's even worse because the pages are apparently in UTF-8 but Unicode code points are interpreted differently than what's specified by Unicode. Nonetheless, they can be handled in a similar way as font-encoded Indic pages. 


[1] They're Zawgyi and San Myanmar. Even though Unicode-based Burmese fonts have been freely available (e.g. Padauk and Mynamar 3), Zawgyi and 'San Myanmar' were used apparently for the easier input.  BTW, Windows 8 is the first version of Windows with a Burmese/Myanmar font included.