Bug 201734 - results.webkit.org: Shard result archives
Summary: results.webkit.org: Shard result archives
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Tools / Tests
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
Importance: P2 Normal
Assignee: Jonathan Bedard
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2019-09-12 13:00 PDT by Jonathan Bedard
Modified: 2019-09-16 17:15 PDT
CC List: 9 users

See Also:


Attachments
Patch (12.64 KB, patch), 2019-09-12 13:32 PDT, Jonathan Bedard
Patch (12.76 KB, patch), 2019-09-13 17:05 PDT, Jonathan Bedard
Patch (12.84 KB, patch), 2019-09-16 15:51 PDT, Jonathan Bedard
Patch for landing (12.87 KB, patch), 2019-09-16 15:59 PDT, Jonathan Bedard
Patch for landing (12.87 KB, patch), 2019-09-16 16:11 PDT, Jonathan Bedard

Description Jonathan Bedard 2019-09-12 13:00:34 PDT
We start having issues as our result archives approach 50 MB.

Note that with this change, we will lose existing result archives, but we're still storing them on Buildbot, so that's not a big deal.
Comment 1 Jonathan Bedard 2019-09-12 13:32:48 PDT
Created attachment 378672 [details]
Patch
Comment 2 Aakash Jain 2019-09-12 14:37:24 PDT
(In reply to Jonathan Bedard from comment #0)
> We start having issues as our result archives approach 50 MB.
What kind of issues? Can you elaborate on the problem and the fix?
Comment 3 Jonathan Bedard 2019-09-12 15:39:43 PDT
(In reply to Aakash Jain from comment #2)
> (In reply to Jonathan Bedard from comment #0)
> > We start having issues as our result archives approach 50 MB.
> What kind of issues? Can you elaborate on the problem and the fix?

'Issues' meaning the writes would fail. The solution I've implemented here is basically the one suggested in https://cwiki.apache.org/confluence/display/CASSANDRA2/CassandraLimitations. Instead of placing the entirety of the archive in a single row, we split the archive into multiple rows, each with a blob of no more than 10 MB.

As an added bonus, if we want to change the size of the blob shards in the future, we can just leave the existing data as-is.
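
For illustration, a minimal sketch of the sharding described above, assuming the archive is an in-memory byte string; the function and names here are hypothetical stand-ins, not the patch's actual code:

```python
import hashlib

SHARD_SIZE = 10 * 1000 * 1000  # keep each blob comfortably inside Cassandra's limits

def shard_archive(archive_bytes):
    # The whole archive's digest serves as the partition key for its shards.
    digest = hashlib.md5(archive_bytes).hexdigest()
    # Split the archive into consecutive shards of at most SHARD_SIZE bytes.
    shards = [
        (index, archive_bytes[offset:offset + SHARD_SIZE])
        for index, offset in enumerate(range(0, len(archive_bytes), SHARD_SIZE))
    ]
    return digest, len(archive_bytes), shards
```

Each (index, bytes) pair becomes one row; the metadata row records the digest and total size so the shards can be located and reassembled later.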
Comment 4 Alexey Proskuryakov 2019-09-12 17:38:14 PDT
Have you considered storing these in S3?
Comment 5 Jonathan Bedard 2019-09-13 07:32:54 PDT
(In reply to Alexey Proskuryakov from comment #4)
> Have you considered storing these in S3?

I have not.

It's probably the case that S3 is better suited for this sort of thing, but the results database doesn't have an S3 dependency at the moment, so I'm a bit reluctant to add one. If we were storing these in S3, I think we would also have to solve the key-rotation problem for test archives originating from Apple's internal infrastructure, which we know is not a trivial problem to solve.
Comment 6 Aakash Jain 2019-09-13 12:21:29 PDT
(In reply to Jonathan Bedard from comment #5)
> If we were storing these in S3, I think we would also have to solve the key-rotation problem for test archives originating from Apple's internal infrastructure, which we know is not a trivial problem to solve.
I believe we have already solved that problem with EWS and build.webkit.org. The archive is uploaded by the bots to the build-master, and the build-master uploads it to S3. Only the build-master has the required authentication to upload to S3.
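
For context, a rough sketch of that flow on the build-master side, assuming boto3; the bucket and key names are placeholders, and this is not the actual EWS/build.webkit.org code:

```python
import boto3  # assumed dependency; only the build-master holds S3 credentials

def upload_archive(archive_path, bucket='example-archives', key='archive.zip'):
    # Workers hand the archive to the build-master; the master alone authenticates.
    s3 = boto3.client('s3')  # credentials come from the master's environment or IAM role
    with open(archive_path, 'rb') as archive:
        s3.upload_fileobj(archive, bucket, key)
```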
Comment 7 Jonathan Bedard 2019-09-13 12:31:50 PDT
(In reply to Aakash Jain from comment #6)
> (In reply to Jonathan Bedard from comment #5)
> > If we were storing these in S3, I think we would also have to solve the key-rotation problem for test archives originating from Apple's internal infrastructure, which we know is not a trivial problem to solve.
> I believe we have already solved that problem with EWS and build.webkit.org.
> The archive is uploaded by the bots to the build-master, and build-master
> uploads to S3. Only build-master has the required authentication to upload
> to S3.

That solves the authentication problem; we would also need in-place encryption with rotating keys.
Comment 8 Alexey Proskuryakov 2019-09-13 13:44:48 PDT
Why do we need encryption?
Comment 9 Jonathan Bedard 2019-09-13 13:51:00 PDT
(In reply to Alexey Proskuryakov from comment #8)
> Why do we need encryption?

We would need in-place encryption to use S3 for storing Apple Internal test results, at least according to the guidance we have recently been given for storing similar archives.
Comment 10 Aakash Jain 2019-09-13 15:20:50 PDT
Comment on attachment 378672 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=378672&action=review

> Tools/resultsdbpy/resultsdbpy/model/archive_context.py:51
> +    SHARD_SIZE = 10 * 1000 * 1000  # Shards of 10 MB

Would be good to elaborate in the comment here that this is because of a Cassandra limitation.

> Tools/resultsdbpy/resultsdbpy/model/archive_context.py:52
> +    MEMORY_CAP = 2 * 1000 * 1000 * 1000  # Don't allow more than 2 gigs of archives in memory at one time

MAX_MEMORY or MEMORY_LIMIT might be clearer.

> Tools/resultsdbpy/resultsdbpy/model/archive_context.py:54
> +    class ArchiveMetaByCommit(ClusteredByConfiguration):

please rename 'meta' to 'metadata' everywhere.

> Tools/resultsdbpy/resultsdbpy/model/archive_context.py:61
> +        meta = columns.Text(required=True)

Ditto, 'meta' => 'metadata'.

I am not sure it's efficient to combine two columns (hash and size) into one (as JSON) here. Two separate columns (hash and size) would probably be clearer and more efficient.
For a relational database this would violate first normal form (https://en.wikipedia.org/wiki/First_normal_form); that probably doesn't matter much in Cassandra, though.

> Tools/resultsdbpy/resultsdbpy/model/archive_context.py:72
> +    class ArchiveShardByCommit(Model):

What do you think about naming it ArchiveDataByCommit? I'm not sure the word 'shard' is immediately obvious to everyone. Or how about 'chunk' instead of 'shard'?

> Tools/resultsdbpy/resultsdbpy/model/archive_context.py:74
> +        digest = columns.Text(partition_key=True, required=True)

please rename 'digest' to 'hash' everywhere.

> Tools/resultsdbpy/resultsdbpy/model/archive_context.py:76
> +        shard = columns.Blob(required=True)

Consider renaming 'shard' to 'chunk' or 'data'.

> Tools/resultsdbpy/resultsdbpy/model/archive_context.py:124
> +                index = 0

Would be nice to add a comment noting that we are sharding the archive here.

> Tools/resultsdbpy/resultsdbpy/model/archive_context.py:126
> +                digest = hashlib.md5()

'digest' => 'hash'

> Tools/resultsdbpy/resultsdbpy/model/archive_context.py:132
> +                        digest=digest.hexdigest(), index=index,

please do .hexdigest() above in line 127.

> Tools/resultsdbpy/resultsdbpy/model/archive_context.py:179
> +                            raise RuntimeError('Archive metadata does not contain a key')

key or hash?
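
To make the naming discussion concrete, a rough sketch of the chunk table with the suggested renames applied, written as a cassandra-cqlengine model; the layout is inferred from the quoted patch lines, and the metadata table (which derives from a resultsdbpy base class) is omitted:

```python
from cassandra.cqlengine import columns
from cassandra.cqlengine.models import Model

class ArchiveChunkByCommit(Model):
    __table_name__ = 'archive_chunks_by_commit'
    # Digest of the whole archive; all of an archive's chunks share one partition.
    hash = columns.Text(partition_key=True, required=True)
    # Position of the chunk within the archive, used to reassemble it in order.
    index = columns.Integer(primary_key=True, clustering_order='ASC')
    # At most CHUNK_SIZE (10 MB) of archive data.
    chunk = columns.Blob(required=True)
```

With `index` as a clustering column, Cassandra returns a partition's rows already sorted by it.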
Comment 11 Aakash Jain 2019-09-13 15:42:42 PDT
I am ok with this fix. However, I would encourage you to explore the S3 approach as well. The solution we use here could also be used for EWS and build.webkit.org test archives.
Comment 12 Jonathan Bedard 2019-09-13 17:05:57 PDT
Created attachment 378764 [details]
Patch
Comment 13 dewei_zhu 2019-09-16 10:39:47 PDT
Comment on attachment 378764 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=378764&action=review

> Tools/resultsdbpy/resultsdbpy/model/archive_context.py:127
> +                index = 0

Is 'index' the primary key within each partition?

> Tools/resultsdbpy/resultsdbpy/model/archive_context.py:129
> +                digest = hashlib.md5(archive.getvalue()).hexdigest()

Is MD5 good enough?

> Tools/resultsdbpy/resultsdbpy/model/archive_context.py:180
> +                            raise RuntimeError(MEMORY_ERROR)

Can we sum the sizes to ensure we won't hit the memory limit before doing any query against ArchiveChunks? Maybe we can do all the sanity checks (including the digest existence check) ahead of time?
Also, do you think it's necessary to add a unit test that the memory limit gets hit, by temporarily lowering the limit?

> Tools/resultsdbpy/resultsdbpy/model/archive_context.py:193
> +                        rows = sorted(rows, key=lambda row: row.index)

Can we get the rows ordered when we query them?

> Tools/resultsdbpy/resultsdbpy/model/archive_context.py:208
> +                        if not result.get(config):
> +                            result[config] = []

Maybe cleaner if we use setdefault here?

> Tools/resultsdbpy/resultsdbpy/model/archive_context_unittest.py:106
> +        print(len(buff.getvalue()))

Do we need `print` here, or is this just for debugging purposes?
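
For illustration, a sketch of the restructuring these comments suggest, with all names as stand-ins: do the cheap checks and the size budgeting before querying any chunks, rely on the clustering order rather than a client-side sort, and use setdefault:

```python
MEMORY_LIMIT = 2 * 1000 * 1000 * 1000  # cap on archive bytes held in memory at once

def fetch_archives(metadata_rows, fetch_chunks):
    # metadata_rows: iterable of (config, archive_hash, size) tuples.
    # fetch_chunks(archive_hash): returns chunk rows already ordered by the
    # clustering key, so no client-side sort is needed.
    rows = [row for row in metadata_rows if row[1]]  # drop entries missing a hash (defensive)

    # Budget the total size up front, before any blob is queried.
    if sum(size for _, _, size in rows) > MEMORY_LIMIT:
        raise RuntimeError('Requested archives exceed the in-memory limit')

    result = {}
    for config, archive_hash, _ in rows:
        archive = b''.join(row.chunk for row in fetch_chunks(archive_hash))
        result.setdefault(config, []).append(archive)  # replaces the get()-then-assign pattern
    return result
```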
Comment 14 Jonathan Bedard 2019-09-16 13:15:36 PDT
(In reply to dewei_zhu from comment #13)
> Comment on attachment 378764 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=378764&action=review
> 
> > Tools/resultsdbpy/resultsdbpy/model/archive_context.py:127
> > +                index = 0
> 
> Is 'index' the primary key within each partition?

Yes. It's basically the shard's order in the archive.

> 
> > Tools/resultsdbpy/resultsdbpy/model/archive_context.py:129
> > +                digest = hashlib.md5(archive.getvalue()).hexdigest()
> 
> Is MD5 good enough?

Yes, Aakash and I talked about this. We could change it to a secure hash, but engineering a hash collision doesn't get you anything, and you would need the API key to upload an archive anyway. Engineering a hash collision would just replace the existing archive with that hash.

> 
> > Tools/resultsdbpy/resultsdbpy/model/archive_context.py:180
> > +                            raise RuntimeError(MEMORY_ERROR)
> 
> Can we sum the sizes to ensure we won't hit the memory limit before doing
> any query against ArchiveChunks? Maybe we can do all the sanity checks
> (including the digest existence check) ahead of time?
> Also, do you think it's necessary to add a unit test that the memory limit
> gets hit, by temporarily lowering the limit?
> 
> > Tools/resultsdbpy/resultsdbpy/model/archive_context.py:193
> > +                        rows = sorted(rows, key=lambda row: row.index)
> 
> Can we get the rows ordered when we query them?

I need to verify, but they are probably already sorted, so we might not need that.

> 
> > Tools/resultsdbpy/resultsdbpy/model/archive_context.py:208
> > +                        if not result.get(config):
> > +                            result[config] = []
> 
> Maybe cleaner if we use setdefault here?
> 
> > Tools/resultsdbpy/resultsdbpy/model/archive_context_unittest.py:106
> > +        print(len(buff.getvalue()))
> 
> Do we need `print` here, or is this just for debugging purposes?
Comment 15 Jonathan Bedard 2019-09-16 15:51:52 PDT
Created attachment 378905 [details]
Patch
Comment 16 dewei_zhu 2019-09-16 15:55:10 PDT
Comment on attachment 378905 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=378905&action=review

> Tools/resultsdbpy/resultsdbpy/model/archive_context.py:181
> +                    if not value.get('digest'):
> +                        raise RuntimeError('Archive metadata does not contain a hash')

We can also move this above the for loop so that we fail early with a lightweight check.
Comment 17 Jonathan Bedard 2019-09-16 15:59:17 PDT
(In reply to dewei_zhu from comment #16)
> Comment on attachment 378905 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=378905&action=review
> 
> > Tools/resultsdbpy/resultsdbpy/model/archive_context.py:181
> > +                    if not value.get('digest'):
> > +                        raise RuntimeError('Archive metadata does not contain a hash')
> 
> We can also move this above the for loop so that we fail early with a
> lightweight check.

Actually, we should just ignore it if we don't have a digest. (Note: that shouldn't be possible anyway; this is just being defensive.)
Comment 18 Jonathan Bedard 2019-09-16 15:59:44 PDT
Created attachment 378907 [details]
Patch for landing
Comment 19 Russell Epstein 2019-09-16 16:10:36 PDT
Comment on attachment 378907 [details]
Patch for landing

View in context: https://bugs.webkit.org/attachment.cgi?id=378907&action=review

> Tools/resultsdbpy/resultsdbpy/model/archive_context.py:51
> +    CHUNK_SIZE = 10 * 1000 * 1000  # Cassandra doesn't do well with data blobs of more than 10 MB

Shouldn't this be 1024 instead?

> Tools/resultsdbpy/resultsdbpy/model/archive_context.py:53
>  

Ditto
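
For reference, the distinction Russell is raising is decimal megabytes versus binary mebibytes; either definition keeps chunks within the limit discussed above:

```python
CHUNK_SIZE_DECIMAL = 10 * 1000 * 1000  # 10 MB  = 10,000,000 bytes
CHUNK_SIZE_BINARY = 10 * 1024 * 1024   # 10 MiB = 10,485,760 bytes
```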
Comment 20 Jonathan Bedard 2019-09-16 16:11:10 PDT
Created attachment 378908 [details]
Patch for landing
Comment 21 WebKit Commit Bot 2019-09-16 17:13:22 PDT
Comment on attachment 378908 [details]
Patch for landing

Clearing flags on attachment: 378908

Committed r249932: <https://trac.webkit.org/changeset/249932>
Comment 22 WebKit Commit Bot 2019-09-16 17:13:24 PDT
All reviewed patches have been landed.  Closing bug.
Comment 23 Radar WebKit Bug Importer 2019-09-16 17:15:27 PDT
<rdar://problem/55421685>