Bug 9697 - parseInt results may be inaccurate for numbers greater than 2^53
Summary: parseInt results may be inaccurate for numbers greater than 2^53
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore (show other bugs)
Version: 420+
Hardware: PC All
: P2 Normal
Assignee: Nobody
URL: javascript:alert(parseInt('0x10000000...
Keywords:
Depends on:
Blocks:
 
Reported: 2006-07-02 09:50 PDT by Björn Graf (boki)
Modified: 2007-07-17 19:28 PDT (History)
2 users (show)

See Also:


Attachments
Adds checking and reparsing for numbers that may overflow the double range (14.11 KB, patch)
2006-07-03 15:07 PDT, Björn Graf (boki)
ggaren: review-
Details | Formatted Diff | Diff
Proposed patch (76.69 KB, patch)
2007-07-15 23:01 PDT, Cameron Zwarich (cpst)
mjs: review-
Details | Formatted Diff | Diff
Revised proposed patch (74.34 KB, patch)
2007-07-16 03:22 PDT, Cameron Zwarich (cpst)
darin: review-
Details | Formatted Diff | Diff
Revised proposed patch (80.85 KB, patch)
2007-07-17 02:43 PDT, Cameron Zwarich (cpst)
darin: review+
Details | Formatted Diff | Diff
Revised proposed patch (80.97 KB, patch)
2007-07-17 10:01 PDT, Cameron Zwarich (cpst)
darin: review+
Details | Formatted Diff | Diff
Revised proposed patch (80.89 KB, patch)
2007-07-17 18:58 PDT, Cameron Zwarich (cpst)
darin: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Björn Graf (boki) 2006-07-02 09:50:41 PDT
The result of parseInt may be inaccurate if the input number exceeds the 52 bits of the doubles mantissa. This slightly modified comment from Mozillas js_strtointeger explains it pretty well:

"This happens if the addition in number * radix + digit causes the result to be rounded down to an even least significant mantissa bit when the first dropped bit is a one. If any of the following digits in the number (which haven't been added in yet) are nonzero, then the correct action would have been to round up instead of down. An example occurs when reading the number 0x1000000000000081, which rounds to 0x1000000000000000 instead of 0x1000000000000100."

The code from js_strtointeger could be adapted to fix parseInt() in kjs/function.cpp. 

Failing tests:
ecma/GlobalObject/15.1.2.2-2.js
ecma/LexicalConventions/7.7.3-1.js
ecma/TypeConversion/9.3.1-3.js
Comment 1 Mark Rowe (bdash) 2006-07-02 15:59:15 PDT
I can confirm this bug with WebKit 418.8 and r15138.  I tested with the following:

javascript:alert(parseInt('0x1000000000000081', 16).toString(16))

With 418.8 the alert shows 0, while r15138 shows 1000000000000000.  Both of these are incorrect as the original report mentions.
Comment 2 Björn Graf (boki) 2006-07-03 15:07:03 PDT
Created attachment 9178 [details]
Adds checking and reparsing for numbers that may overflow the double range
Comment 3 Geoffrey Garen 2006-07-03 15:49:37 PDT
I'm not sure if the Mozilla license is compatible with the WebKit license. We've had conversations about this before, the outcome of which has been, "We don't know, but we don't think so."

Also, it looks like a lot of the patch you've submitted is just whitespace changes. If so, that can really make things difficult for integration.
Comment 4 Geoffrey Garen 2006-07-03 15:51:01 PDT
Comment on attachment 9178 [details]
Adds checking and reparsing for numbers that may overflow the double range

I guess I have to r- this over the license issue.
Comment 5 Björn Graf (boki) 2006-07-03 18:42:23 PDT
(In reply to comment #4)
> (From update of attachment 9178 [details] [edit])
> I guess I have to r- this over the license issue.
> 

Is cairo different for license issues then? And the last paragraph of the license states that it may be used under the LGPL so I thought it was compatible - not that I am a lawyer or anything close to that. Anyway, I can rewrite this stuff .

For the whitespace changes: should the new code match the old style or the one lined out in the guidelines?
Comment 6 Cameron Zwarich (cpst) 2007-07-14 09:48:40 PDT
I have a fairly simple fix for this problem. It occurs because numbers are read from the most significant digit downwards. To get correct rounding, one should go from the least significant digit upwards. However, this incurs one extra multiplication per digit. To get a fix that doesn't slow down the usual case, simply check if you have overloaded the mantissa, and if you have, go back and do it in the correct order.

Some of the tests are failing because of parseInt, and others are failing because this same bug is duplicated in the lexer. Should I make one integer parsing function that both parseInt and the lexer can use or duplicate the code?
Comment 7 David Kilzer (:ddkilzer) 2007-07-14 16:46:53 PDT
(In reply to comment #6)
> Some of the tests are failing because of parseInt, and others are failing
> because this same bug is duplicated in the lexer. Should I make one integer
> parsing function that both parseInt and the lexer can use or duplicate the
> code?

Create a single function--avoid duplicate code if possible!

Comment 8 Cameron Zwarich (cpst) 2007-07-15 23:01:22 PDT
Created attachment 15527 [details]
Proposed patch

Here is a patch that fixes these problems, as well as some other problems related to numeric conversions. It fixes all of the JS Core tests related to numeric conversion except three in ecma/TypeConversion/9.3.1-3.js. The first two are actually incorrect tests:

--> - "-0x123456789abcde8" = NaN FAILED! expected: 81985529216486880
--> - "-0x123456789abcde8" = NaN FAILED! expected: 81985529216486880

The ECMA spec (9.3.1) does not allow signs in hexadecimal numbers coerced by ToNumber(). The last is a specific instance of bug 4885:

--> -"\u20001234\u2001" = NaN FAILED! expected: -1234

It is difficult to make all of the conversion happen in one place, due to different char sizes and slightly different parsing requirements. Maybe I will do it at a later time when everything is using wide chars.
Comment 9 Maciej Stachowiak 2007-07-15 23:48:10 PDT
Comment on attachment 15527 [details]
Proposed patch

The substance of this fix looks great! However, I have a number of coding style comment.s

1) You added an awful lot of code inline to parseInt, UString::toDouble and the lexer. I would suggest factoring this out into separate functions. If you name it properly you may need less in the way of comments, and it may make it easier to refactor the code later to share more code.

2) 9007199254740992.0 is a mysterious magic number to have in the code. I would suggest using a named constant (static const double). If you name it well, you might not need quite so much in the way of comments.

3) Too much comment volume. We usually stick to comments that explain *why* the code is doing something, not *what* it is doing. The code should be written so that *what* it is doing is clear, through appropriate use of variable names and through factoring things out into functions. An explanation can be made for particularly subtle algorithms. Probably the most helpful comment would be that the "different method to ensure accuracy" is to calculate starting from the least significant digit, instead of from the most significant digit.

4) You say "If the radix is a power of 2, we can use a simpler algorithm that works because the radix has an exact floating-point representation". But I don't see any special casing for the radix being a power of two. Did I miss something?

5) It seems that numbers really have to be all-ASCII, perhaps it would be possible to share more code by coding the slower large number fallback in ASCII-only terms, and convert when it is needed, since it is an unusual case and already going to be slower.

6) (*c & 0xdf) -- this is a tricky way to do case folding, maybe such trickiness is not needed on the slow fallback path and you could just use toupper.

7) Perhaps better to use strncasecmp here than to optimize this rare case.
+          if ((c[0] == 'I' || c[0] == 'i') && (c[1] == 'N' || c[1] == 'n') && (c[2] == 'F' || c[2] == 'f'))

r- based on the multiplicity of style comments, but I'd love to have this patch in once they are addressed. Fixing those JS tests is good stuff.
Comment 10 Adam Roben (:aroben) 2007-07-15 23:59:12 PDT
Comment on attachment 15527 [details]
Proposed patch

+            if (s[p] == 'e') {
+                char shortenedString[p];
+
+                memcpy(shortenedString, s.ascii(), p);
+                shortenedString[p] = '0';
+
+                number = kjs_strtod(shortenedString + firstDigitPosition, 0);

I see two errors here:

1) shortenedString[p] = '0' should be shortenedString[p] = '\0'

2) The declaration of shortenedString won't compile with compilers other than GCC since it relies on a GCC extension.

You can actually avoid allocating a new string entirely by doing this:

+            if (s[p] == 'e') {
+                s[p] = '\0';
+
+                number = kjs_strtod(s + firstDigitPosition, 0);
+                s[p] = 'e';
Comment 11 Cameron Zwarich (cpst) 2007-07-16 03:22:20 PDT
Created attachment 15528 [details]
Revised proposed patch

Thanks for your prompt comments. I (hopefully!) fixed the style issues. I bit the ASCII bug and moved everything into one function, and I made a named constant for the overflow check. Adam pointed out one problem, which I realized after I submitted the patch, and there was another small problem of not checking for the decimal point in addition to the exponent symbol before calling strtod(). I also removed the usage of the GCC extension and the pointless copying of the string.
Comment 12 Darin Adler 2007-07-16 22:17:27 PDT
Comment on attachment 15528 [details]
Revised proposed patch

Looking great! I love that this fixes so many things in existing tests. I still see a few problems.

The variable "p" should be declared as a const char* -- then you wouldn't need the typecast on this line:

+    for (p = ((char*) s) + length - 1; p >= s; p--) {

I don't see any use of p that would require a non-const pointer.

I know Adam suggested this:

+                s[p] = '\0';

but it won't work -- you can't modify a const UString&. In fact, if it compiles, we should probably fix this line from the ustring.h header:

    UChar operator[](int pos) const;

If we change it to return a const UChar then you'd see a compiler error.

The right want to do this is simply to call s.substr(firstDigitPosition, p - firstDigitPosition).ascii(). The UString class has an appropriate optimization, so this won't have to allocate an entirely new string buffer.

In fact, the other callers should also call s.substr(firstDigitPosition).ascii(), so the buffer created by the ascii() function can be smaller.

Are you sure that HUGE_VAL works on all the platforms we need to compile on? I ask because we have KJS::Inf in value.h and we wouldn't need it if we could just use HUGE_VAL.

+          if (strncasecmp(c, "inf", 3) == 0)
+            return NaN;

I don't understand the code above at all, and I'd like to see the test cases that demonstrate we need it and why it's OK to just look at the first three characters.

Because this code was written incorrectly but the tests still passed, I think we need to devise at least one more test to ensure that everything is working correctly. If you can't find a test that fails due to the s[p] mistake, then I suspect that you don't need to do the truncation at all -- I suspect kjs_strtod will stop just fine when it encounters an "e" or a "." and if not we need a test to demonstrate it working. In general we need a test that exercises every branch in the new code, and that's not true at the moment.
Comment 13 Cameron Zwarich (cpst) 2007-07-17 00:05:54 PDT
(In reply to comment #12)
> The variable "p" should be declared as a const char* -- then you wouldn't need
> the typecast on this line:
>
> +    for (p = ((char*) s) + length - 1; p >= s; p--) {
>
> I don't see any use of p that would require a non-const pointer.

Oops. Fixed!

> Are you sure that HUGE_VAL works on all the platforms we need to compile on? I
> ask because we have KJS::Inf in value.h and we wouldn't need it if we could
> just use HUGE_VAL.

I took the usage of HUGE_VAL from existing code, so I figured it was okay. I'll change it to Inf.

> +          if (strncasecmp(c, "inf", 3) == 0)
> +            return NaN;
> 
> I don't understand the code above at all, and I'd like to see the test cases
> that demonstrate we need it and why it's OK to just look at the first three
> characters.

The code is there because we are passing off the conversion to strtod(). However, strtod() will interpret anything like INF, inf, etc. as IEEE infinity. However, the ECMA spec says that only "Infinity" is infinity, so we if strtod() returns infinity we need to check that it was not for the wrong reason, as if it was "inf" we should return NaN as per the spec.

> Because this code was written incorrectly but the tests still passed, I think
> we need to devise at least one more test to ensure that everything is working
> correctly. If you can't find a test that fails due to the s[p] mistake, then I
> suspect that you don't need to do the truncation at all -- I suspect kjs_strtod
> will stop just fine when it encounters an "e" or a "." and if not we need a
> test to demonstrate it working. In general we need a test that exercises every
> branch in the new code, and that's not true at the moment.

The s[p] check (in the original version where I copied the string) actually works and fixes a case. Without it,

parseInt("9007199254740992e2000")

returns Infinity. However, the check for the decimal point will never actually affect anything besides possibly the running time of strtod(), because if we've gone past the 53-bit range of the mantissa before the decimal point then nothing after it will affect the value. I figured I might as well add it in since I am checking for 'e'.

I'll add a LayoutTest with examples for all of the branches in my code beyond the cases in the Mozilla tests and then upload a new diff.
Comment 14 Cameron Zwarich (cpst) 2007-07-17 02:43:49 PDT
Created attachment 15541 [details]
Revised proposed patch

I fixed the problems mentioned in Darin's comments and added some test cases. Here are some explanations of the cases that are maybe not as obvious:

(- 'infinity').toString()

This should be NaN, but will be Infinity if we don't check for the first three letters. Note, this doesn't happen in the current version of toDouble(), but that is because the current code will also miss things like "1e2000", which should be Infinity but instead are NaN.

parseInt('0x100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000', 16)

This test and the one below demonstrate the necessity of the test for radixMultiplier being infinite in the loop in parseIntOverflow(). If the radixMultiplier becomes infinite, then the arithmetic below will make number become NaN if any digit is zero. Hence we explicitly test for this in the code. I added the decimal case as well for posterity, even though that is now handled by strtod().

parseInt('9007199254740992e2000')
"parseInt('9007199254740992.0e2000')

These will give Infinity unless we check the input of strtod(). The second demonstrates the necessity of checking for the decimal point as well.

The only test that will not fail no matter what changes to the code take place between the current code and the code in my patch is the one testing the value of

Number(0x1000000000000081).toString(16)

The hexadecimal conversion code in toDouble() is slightly different than elsewhere, so it rounds differently and converts this correctly, but it fails for other cases in the JS Core tests that are now fixed. I figured I'd add the test in anyways since I  added it for the lexer and parseInt().

I believe that accounts for every branch.
Comment 15 Darin Adler 2007-07-17 08:18:58 PDT
Comment on attachment 15541 [details]
Revised proposed patch

r=me

+    const char* p;
+    int digit;

I would have declared these where they are used rather than at the top of the function.

+            number = parseIntOverflow(s.substr(firstDigitPosition).ascii(), p - firstDigitPosition, radix);

Could pass p - firstDigitPosition to substr too to make a smaller buffer.

+          if (strncasecmp(c, "inf", 3) == 0)

This could use a comment like the one you wrote in this bug -- we shouldn't leave something unclear and non-obvious like this behind for the future.
Comment 16 Cameron Zwarich (cpst) 2007-07-17 10:01:50 PDT
Created attachment 15544 [details]
Revised proposed patch

Here is another slightly revised version of the patch. I took into account your comments about initialization and added the comment. I also removed the check of s[p] being 'e' or '.', because no matter what we can simply pass that substring to strtod().
Comment 17 Darin Adler 2007-07-17 10:37:07 PDT
Comment on attachment 15544 [details]
Revised proposed patch

Some nice improvements.

r=me

I still don't understand why you need to check for "inf". Wouldn't a check for "i" be good enough?

No need for else after return and you could probably simplify the logic and just have one return NaN.

+            number = parseIntOverflow(s.substr(firstDigitPosition).ascii(), p - firstDigitPosition, radix);

Could use substr(firstDigitPosition, p - firstDigitPosition) here -- more consistent with the expression above.
Comment 18 Cameron Zwarich (cpst) 2007-07-17 18:58:02 PDT
Created attachment 15557 [details]
Revised proposed patch

I made both of those suggested changes.
Comment 19 Darin Adler 2007-07-17 19:23:35 PDT
Comment on attachment 15557 [details]
Revised proposed patch

r=me
Comment 20 Darin Adler 2007-07-17 19:28:16 PDT
Committed revision 24394.