We start having issues as our result archives approach 50 MB. Note that with this change, we will lose existing result archives, but we're still storing them on Buildbot, so that's not a big deal.
Created attachment 378672 [details] Patch
(In reply to Jonathan Bedard from comment #0)
> We start having issues as our result archives approach 50 MB.

What kind of issues? Can you elaborate more on the problem and the fix?
(In reply to Aakash Jain from comment #2)
> (In reply to Jonathan Bedard from comment #0)
> > We start having issues as our result archives approach 50 MB.
> What kind of issues? Can you elaborate more on the problem and the fix?

'Issues' meaning the writes would fail. The solution I've implemented here is basically the one suggested in https://cwiki.apache.org/confluence/display/CASSANDRA2/CassandraLimitations: instead of placing the entirety of the archive in a single row, we split the archive into multiple rows, each with a blob of no more than 10 MB. As an added bonus, if we want to change the size of the blob shards in the future, we can just leave the existing data as-is.
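[Editor's note: for illustration, a minimal sketch of the splitting approach described above. The helper name `split_archive` and the constant are illustrative, not the identifiers in the patch.]

```python
import io

CHUNK_SIZE = 10 * 1000 * 1000  # keep each Cassandra blob under ~10 MB


def split_archive(archive: io.BytesIO):
    """Yield (index, chunk) pairs, each chunk at most CHUNK_SIZE bytes."""
    archive.seek(0)
    index = 0
    while True:
        chunk = archive.read(CHUNK_SIZE)
        if not chunk:
            break
        yield index, chunk
        index += 1
```

Because each row records its own index and the chunk size is not baked into the data, the size constant can be changed later without rewriting rows that were stored under the old size.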
Have you considered storing these in S3?
(In reply to Alexey Proskuryakov from comment #4)
> Have you considered storing these in S3?

I have not. It's probably the case that S3 is better suited for this sort of thing, but the results database doesn't have an S3 dependency at the moment, so I'm a bit reluctant to add one. If we were storing these in S3, I think we would also have to solve the key-rotation problem for test archives originating from Apple's internal infrastructure, which we know is not a trivial problem to solve.
(In reply to Jonathan Bedard from comment #5)
> If we were storing these in S3, I think we would also have to solve the key-rotation problem for test archives originating from Apple's internal infrastructure, which we know is not a trivial problem to solve.

I believe we have already solved that problem with EWS and build.webkit.org. The archive is uploaded by the bots to the build-master, and the build-master uploads it to S3. Only the build-master has the required authentication to upload to S3.
(In reply to Aakash Jain from comment #6)
> (In reply to Jonathan Bedard from comment #5)
> > If we were storing these in S3, I think we would also have to solve the key-rotation problem for test archives originating from Apple's internal infrastructure, which we know is not a trivial problem to solve.
> I believe we have already solved that problem with EWS and build.webkit.org. The archive is uploaded by the bots to the build-master, and the build-master uploads it to S3. Only the build-master has the required authentication to upload to S3.

That solves the authentication problem; we would also need in-place encryption with rotating keys.
Why do we need encryption?
(In reply to Alexey Proskuryakov from comment #8)
> Why do we need encryption?

We would need in-place encryption to use S3 for storing Apple internal test results, at least according to the guidance we have recently been given for storing similar archives.
Comment on attachment 378672 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=378672&action=review

> Tools/resultsdbpy/resultsdbpy/model/archive_context.py:51
> + SHARD_SIZE = 10 * 1000 * 1000 # Shards of 10 MB

Would be good to elaborate in the comment here that this is because of a Cassandra limitation.

> Tools/resultsdbpy/resultsdbpy/model/archive_context.py:52
> + MEMORY_CAP = 2 * 1000 * 1000 * 1000 # Don't allow more than 2 gigs of archives in memory at one time

MAX_MEMORY or MEMORY_LIMIT might be clearer.

> Tools/resultsdbpy/resultsdbpy/model/archive_context.py:54
> + class ArchiveMetaByCommit(ClusteredByConfiguration):

Please rename 'meta' to 'metadata' everywhere.

> Tools/resultsdbpy/resultsdbpy/model/archive_context.py:61
> + meta = columns.Text(required=True)

Ditto, 'meta' => 'metadata'. I am not sure it's efficient to combine two columns (hash and size) into one (as JSON) here. Two separate columns (hash and size) would probably be clearer and more efficient. For a relational database this violates first normal form (https://en.wikipedia.org/wiki/First_normal_form); it probably doesn't matter much in Cassandra, though.

> Tools/resultsdbpy/resultsdbpy/model/archive_context.py:72
> + class ArchiveShardByCommit(Model):

What do you think about naming it ArchiveDataByCommit? Not sure the word 'shard' is immediately obvious to everyone. Or how about 'chunk' instead of 'shard'?

> Tools/resultsdbpy/resultsdbpy/model/archive_context.py:74
> + digest = columns.Text(partition_key=True, required=True)

Please rename 'digest' to 'hash' everywhere.

> Tools/resultsdbpy/resultsdbpy/model/archive_context.py:76
> + shard = columns.Blob(required=True)

Consider renaming 'shard' to 'chunk' or 'data'.

> Tools/resultsdbpy/resultsdbpy/model/archive_context.py:124
> + index = 0

Would be nice to add a comment here that we are sharding the archive.

> Tools/resultsdbpy/resultsdbpy/model/archive_context.py:126
> + digest = hashlib.md5()

'digest' => 'hash'

> Tools/resultsdbpy/resultsdbpy/model/archive_context.py:132
> + digest=digest.hexdigest(), index=index,

Please do .hexdigest() above, on line 127.

> Tools/resultsdbpy/resultsdbpy/model/archive_context.py:179
> + raise RuntimeError('Archive metadata does not contain a key')

Key or hash?
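[Editor's note: a rough sketch of what the two tables under review might look like in cqlengine, with the renames suggested above applied. The real models derive from a project-specific base class (ClusteredByConfiguration) and carry more key columns, so this is a simplification under assumed names.]

```python
from cassandra.cqlengine import columns
from cassandra.cqlengine.models import Model


class ArchiveMetadataByCommit(Model):
    # One small row per archive: which suite/commit it belongs to, plus a JSON
    # blob recording the archive's hash and total size.
    suite = columns.Text(partition_key=True, required=True)
    uuid = columns.BigInt(primary_key=True, required=True)
    metadata = columns.Text(required=True)  # e.g. '{"hash": "...", "size": 12345}'


class ArchiveChunkByCommit(Model):
    # The archive body, split across rows within a partition keyed by the
    # archive's hash; 'index' orders the chunks for reassembly.
    hash = columns.Text(partition_key=True, required=True)
    index = columns.Integer(primary_key=True, clustering_order='ASC', required=True)
    chunk = columns.Blob(required=True)
```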
I am ok with this fix. However, I would encourage you to explore the S3 approach as well. The solution we use here could also be used for EWS and build.webkit.org test archives.
Created attachment 378764 [details] Patch
Comment on attachment 378764 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=378764&action=review

> Tools/resultsdbpy/resultsdbpy/model/archive_context.py:127
> + index = 0

Is 'index' a primary key per partition?

> Tools/resultsdbpy/resultsdbpy/model/archive_context.py:129
> + digest = hashlib.md5(archive.getvalue()).hexdigest()

Is MD5 good enough?

> Tools/resultsdbpy/resultsdbpy/model/archive_context.py:180
> + raise RuntimeError(MEMORY_ERROR)

Can we sum the sizes to ensure we won't hit the memory limit before doing any query against ArchiveChunks? Maybe we can do all the sanity checks (including the digest existence check) ahead of time? Also, do you think it's necessary to add a unit test where the memory limit gets hit, by temporarily modifying the limit?

> Tools/resultsdbpy/resultsdbpy/model/archive_context.py:193
> + rows = sorted(rows, key=lambda row: row.index)

Can we get the rows ordered when we query them?

> Tools/resultsdbpy/resultsdbpy/model/archive_context.py:208
> + if not result.get(config):
> +     result[config] = []

Maybe cleaner to use setdefault here?

> Tools/resultsdbpy/resultsdbpy/model/archive_context_unittest.py:106
> + print(len(buff.getvalue()))

Do we need `print` here, or is this just for debugging purposes?
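[Editor's note: an illustration of the size pre-check suggested above, assuming the metadata column holds JSON with a 'size' field as discussed in the earlier review. Names are hypothetical.]

```python
import json

MAX_MEMORY = 2 * 1000 * 1000 * 1000  # cap on archive bytes held in memory at once


def assert_query_fits_in_memory(metadata_rows):
    # Sum the sizes recorded in the (small) metadata rows before querying the
    # (large) chunk table, so an oversized request fails fast and cheaply.
    total = sum(json.loads(row.metadata).get('size', 0) for row in metadata_rows)
    if total > MAX_MEMORY:
        raise RuntimeError('Archives in query exceed the memory limit')
```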
(In reply to dewei_zhu from comment #13)
> Comment on attachment 378764 [details]
> Patch
>
> View in context: https://bugs.webkit.org/attachment.cgi?id=378764&action=review
>
> > Tools/resultsdbpy/resultsdbpy/model/archive_context.py:127
> > + index = 0
>
> Is 'index' a primary key per partition?

Yes...it's basically a shard's order within the archive.

> > Tools/resultsdbpy/resultsdbpy/model/archive_context.py:129
> > + digest = hashlib.md5(archive.getvalue()).hexdigest()
>
> Is MD5 good enough?

Yes, Aakash and I talked about this. We can change it to a secure hash, but engineering a hash collision doesn't get you anything, and you would have to have the API key to upload an archive anyway. Engineering a hash collision would just replace the existing archive with that hash.

> > Tools/resultsdbpy/resultsdbpy/model/archive_context.py:193
> > + rows = sorted(rows, key=lambda row: row.index)
>
> Can we get the rows ordered when we query them?

I need to verify, but they probably already are sorted, so we might not need that.
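[Editor's note: a sketch of the ordered read, assuming the hypothetical ArchiveChunkByCommit model sketched earlier in this thread. If 'index' is declared as an ASC clustering key, Cassandra returns a partition's rows already ordered, so the client-side sorted() would indeed be unnecessary.]

```python
def fetch_archive(digest):
    # Rows within the 'hash' partition come back in clustering-key order
    # (ascending 'index'), so the chunks can be concatenated directly.
    rows = ArchiveChunkByCommit.objects.filter(hash=digest)
    return b''.join(row.chunk for row in rows)
```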
Created attachment 378905 [details] Patch
Comment on attachment 378905 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=378905&action=review

> Tools/resultsdbpy/resultsdbpy/model/archive_context.py:181
> + if not value.get('digest'):
> +     raise RuntimeError('Archive metadata does not contain a hash')

We could also move this above the for loop so that we fail early with a lightweight check.
(In reply to dewei_zhu from comment #16)
> > Tools/resultsdbpy/resultsdbpy/model/archive_context.py:181
> > + if not value.get('digest'):
> > +     raise RuntimeError('Archive metadata does not contain a hash')
>
> We could also move this above the for loop so that we fail early with a lightweight check.

Actually, we should just ignore the entry if we don't have a digest. (Note: that shouldn't be possible anyway; I'm just being defensive.)
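[Editor's note: a sketch of the defensive behavior described above, skipping malformed metadata entries rather than raising; names are hypothetical.]

```python
def digests_to_fetch(metadata_entries):
    # Defensive: metadata without a digest shouldn't exist, but if it does,
    # skip that entry instead of failing the whole query.
    for value in metadata_entries:
        digest = value.get('digest')
        if not digest:
            continue
        yield digest
```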
Created attachment 378907 [details] Patch for landing
Comment on attachment 378907 [details]
Patch for landing

View in context: https://bugs.webkit.org/attachment.cgi?id=378907&action=review

> Tools/resultsdbpy/resultsdbpy/model/archive_context.py:51
> + CHUNK_SIZE = 10 * 1000 * 1000 # Cassandra doesn't do well with data blobs of more than 10 MB

Shouldn't this be 1024 instead?

> Tools/resultsdbpy/resultsdbpy/model/archive_context.py:53

Ditto
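[Editor's note: the question contrasts decimal and binary units. 10 * 1000 * 1000 is 10 MB (decimal megabytes, matching the '10 MB' comment), while a 1024-based constant would be 10 MiB, about 4.9% larger:]

```python
CHUNK_SIZE_MB = 10 * 1000 * 1000   # 10 MB  (decimal; matches the comment as written)
CHUNK_SIZE_MIB = 10 * 1024 * 1024  # 10 MiB (binary; 10485760 bytes)
```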
Created attachment 378908 [details] Patch for landing
Comment on attachment 378908 [details]
Patch for landing

Clearing flags on attachment: 378908

Committed r249932: <https://trac.webkit.org/changeset/249932>
All reviewed patches have been landed. Closing bug.
<rdar://problem/55421685>