144263 – Streaming bytes of source code to SHA1 to calculate CodeBlockHash

NEW 144263

Streaming bytes of source code to SHA1 to calculate CodeBlockHash

https://bugs.webkit.org/show_bug.cgi?id=144263

Summary Streaming bytes of source code to SHA1 to calculate CodeBlockHash

Yusuke Suzuki

Reported 2015-04-27 10:07:01 PDT

Currently, CodeBlockHash once converts the entire source code string to the UTF-8 string and pass it to SHA1 to calculate the hash value. However, it's memory consuming because this process allocates a large buffer to store the entire UTF-8 string. Instead of the current process (alloc, convert, and use), we should implement the streaming decoder decoding UTF-16 into UTF-8 and stream the bytes into SHA1 iteratively.

Attachments
Add attachment proposed patch, testcase, etc.

Darin Adler

Comment 1 2015-04-28 08:11:54 PDT

I’m also not exactly sure why we SHA1 the UTF-8 representation. Seems like we could do it on the UTF-16 representation, although that’s not necessarily better. For non-Latin-1 strings we already store the UTF-16 representation. Also not sure why we chose SHA1 rather than the efficient hash function we use for our hash tables. I guess a 32-bit hash is just too small.

Yusuke Suzuki

Comment 2 2015-04-28 08:33:27 PDT

(In reply to comment #1) > I’m also not exactly sure why we SHA1 the UTF-8 representation. Seems like > we could do it on the UTF-16 representation, although that’s not necessarily > better. For non-Latin-1 strings we already store the UTF-16 representation. > > Also not sure why we chose SHA1 rather than the efficient hash function we > use for our hash tables. I guess a 32-bit hash is just too small. Hm, but seeing the m_hash which may be used as hash value for CodeBlockHash, it's unsigned. So the current hash value is 32-bit long. CodeBlockHash is introduced in the following bug. https://bugs.webkit.org/show_bug.cgi?id=103623 Seeing this issue, CodeBlockHash is introduced for debugging purpose (not for HashTable); To add a roughly unique name to CodeBlock to print it rather than using the pointer value of CodeBlock. For this purpse, the important thing is strong distribution. Even if it costs long time to calcuate that, we need to get a well-distributed hash value. I think this is why SHA1 is chosen here rather than StringHasher.

Darin Adler

Comment 3 2015-04-28 08:59:30 PDT

I’d like to understand better how much better SHA1 is than the fast hash in StringHasher and strongly distributing the bits of the data. I’m not surprised it’s better, but I’d be surprised if it’s a lot better.

Note You need to log in before you can comment on or make changes to this bug.

Status NEW

Resolution

Priority P2

Severity Normal

Classification Unclassified

Version 528+ (Nightly build)

Hardware Unspecified

OS Unspecified

Product WebKit

Component JavaScriptCore

Assignee

Yusuke Suzuki

Reported

2015-04-27 10:07 PDT

Modified

2015-04-28 08:59 PDT History

CC List

1 user Show

URL

Keywords

Depends on

144257

Blocks

Dependencies

tree graph