Sending a %u-escaped string (eg as generated by escape() for unicode data) to XMLHttpRequest::open() will cause any %u's to be escaped an extra time (ie the webserver will receive %25u2014 instead of %u2014). The likely culprit seems to be WebCore/kwq/KWQKURL.mm::appendEscapingBadChars(), which doesn't know about %u-style escaping. A patch similar to the following ought to fix the problem: --- KWQKURL.mm.orig Mon Jan 9 17:43:02 2006 +++ KWQKURL.mm Mon Jan 9 17:46:04 2006 @@ -981,6 +981,16 @@ *p++ = c; *p++ = *str++; *p++ = *str++; + } else if (c == '%' && strEnd - str >= 5 && + (str[0] == 'u' || str[0] == 'U') && + isHexDigit(str[0]) && isHexDigit(str[1]) && + isHexDigit(str[2]) && isHexDigit(str[3])) { + *p++ = c; + *p++ = *str++; + *p++ = *str++; + *p++ = *str++; + *p++ = *str++; + *p++ = *str++; } else if (c == '?') { *p++ = c; } else {
Do you have a test case that behaves differently in Safari and Firefox (or Internet Explorer)? The problem is not entirely clear to me from the description, sorry.
I've verified that my javascript code works as expected in Firefox 1.5. It's hard to do a portable test case, since the typical use is in AJAX code, but here's an extract: index.js: var isSafari = navigator.appVersion.match("AppleWebKit"); function xmlhttpRequest(filepath, callback) { var xmlhttp; if(window.XMLHttpRequest) { // Mozilla, Safari, ... xmlhttp = new XMLHttpRequest(); } else if (window.ActiveXObject) { // IE xmlhttp = new ActiveXObject("Microsoft.XMLHTTP"); } xmlhttp.open("GET", filepath, true); xmlhttp.onreadystatechange = function() { var value = callback(xmlhttp); return value; }; xmlhttp.send(null) } function getResponseXML(xmlhttp) { var doc = document.importNode(xmlhttp.responseXML.documentElement, true); if(isSafari) { // Safari doesn't initialise all node attributes (name, class etc.) // properly without this workaround. doc.innerHTML = doc.innerHTML; } return doc; } function editGetFields(tag_list, data) { for(var i = 0; i < tag_list.length; i++) { var name = tag_list.item(i).getAttribute("name"); if(name) { data[name] = tag_list.item(i).value; } } } function editTextSave(button, close) { var div_edit = document.getElementById("document-editor"); var document_id = div_edit.getAttribute("name"); var article_id = getArticleId(div_edit); var article_tr = document.getElementById("article-open-" + article_id); var data = new Object; editGetFields(div_edit.getElementsByTagName("input"), data); editGetFields(div_edit.getElementsByTagName("textarea"), data); var callback = function(xmlhttp) { if(xmlhttp.readyState != 4) return; var table_db = getResponseXML(xmlhttp); var tr_db = table_db.getElementsByTagName("tr").item(0); article_tr.parentNode.replaceChild(tr_db, article_tr); if(close) { displayArticleList(); } } xmlhttpRequest("actions/edit-text-save.xml?" + "article_id="+escape(article_id) + "&" + "document_id="+escape(document_id) + "&" + "finished="+escape(data.finished) + "&" + "headline="+escape(data.headline) + "&" + "subheadline="+escape(data.subheadline) + "&" + "keywords="+escape(data.keywords) + "&" + "blurb="+escape(data.blurb) + "&" + "byline="+escape(data.byline) + "&" + "notes="+escape(data.notes) + "&" + "story="+escape(data.story), callback); return false; } If one of data.x contains a unicode character outside ISO-8859-1 (like eg \u2014 (EM_DASH)), escape() above will encode it with %u-style encoding (eg %u2014). This is as expected. However when the resulting request is received by the http server the DWIM in XMLHttpRequest::open() will have quoted it an extra time (to eg %25u2014) when sent by Safari: ceylon.roxen.com - - [09/Jan/2006:17:15:44 +0100] "GET /es/actions/edit-text-sav e.xml?article_id=441&document_id=324&finished=on&headline=Katastrofhj%E4lp%20_%2 0ny%20exportsatsning&subheadline=&keywords=&blurb=&byline=¬es=&story=%E5%E4%F 6%20%25uACA1%25uAC94%25uAC84%0A%0AKatastrofhj%E4lp%20%25u2014%20ny%20exportsatsn ing%20%E5%E4%F6%20%BFSpanska%20fr%E5gor%3F%2075%20%25u2031%20%25u260E%0A%0AStora %20katastrofer%20skapar%20lidande%20och%20n%F6d%20%25u2014%20men%20ocks%E5%20exp [...] Note that there are several cases of double-escaped unicode characters above. But not with Firefox: shipon.roxen.com - - [09/Jan/2006:16:23:33 +0100] "GET /es/actions/edit-text-sav e.xml?article_id=441&document_id=324&finished=on&headline=Katastrofhj%E4lp%20_%2 0ny%20exportsatsning&subheadline=&keywords=&blurb=&byline=¬es=&story=%u6CE8%u 6587%u65B9%u6CD5%u304C%u3072%u3068%u76EE%u3067%u308F%u304B%u308B%u30AC%u30A4%u30 C9%u30C4%u30A2%u30FC%u306F%u3053%u3061%u3089%u3002%0A%u260E%20%E5%E4%F6%2075%20% [...] Above unicode characters have been escaped a single time (as expected).
Yes, I can see this with a simple test html and tcpflow. Would you like to submit the patch for review (by adding ChangeLog notes, attaching the patch as a file and setting a review flag)?
Created attachment 5599 [details] Suggested patch for KWQKURL.mm::appendEscapingBadChars()
Created attachment 5600 [details] Now with ChangeLog entry.
My test case: <http://nypop.com/~ap/webkit/xmlhttp-url-encoding.html> (too lazy to make one that doesn't require tcpflow, and no idea about how to make an automated one).
Comment on attachment 5599 [details] Suggested patch for KWQKURL.mm::appendEscapingBadChars() First, thanks for submitting this patch. It seems quite important to fix this bug! But I don't think this fix is correct. I think we should either be escaping all % characters in the string or not escaping any. It doesn't seem right to decide whether to escape the % character based on whether it's followed by u201F (for example) or not. If we shouldn't be escaping "%" characters in this case, then we need to change KURL behavior, perhaps adding a parameter if we don't always want that behavior.
Comment on attachment 5600 [details] Now with ChangeLog entry. OK, but my review comment stands (see comment 7 above). I don't think this is the correct fix.
I agree that having DWIM's like appendEscapingBadChars() is generally a bad idea to begin with, but now that it's there, it should either be fixed or removed entirely (which might break code that doesn't perform escaping properly, and thus relies on it being there). The alternative of escaping all "%" characters would break compatibility with eg Firefox even more.
I have updated the test case a bit to show more different cases. In the test, both MSIE and Firefox never escape % characters (MSIE doesn't even escape non-ASCII characters, actually). Looks like the DWIM code first appeared in a massive KURL rewrite between revisions 1905 and 1906; so it's not entirely obvious if it serves any purpose.
Created attachment 6486 [details] don't escape % characters I don't really see how to check all code paths that end up in appendEscapingBadChars()... There is a regression test included in the patch, and "everything else" still seems to work fine - which is admittedly less than a perfect evidence.
Comment on attachment 6486 [details] don't escape % characters Based on your test case I think never escaping % is the right thing. I think I am the one who did the rewrite and I didn't have a specific reason for escaping %. r=me
Note that IE7 changes this behavior. It rejects all invalid escape sequences and will not load the URL. This includes the %uXXXX form.
Mass moving XML DOM bugs to the "DOM" Component.