Bug 97643 - Flakiness Dashboard server OOMs when the results.json gets too large
Summary: Flakiness Dashboard server OOMs when the results.json gets too large
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: Tools / Tests
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
Importance: P2 Normal
Assignee: Nobody
URL:
Keywords:
Duplicates: 75499
Depends on:
Blocks:
 
Reported: 2012-09-26 01:18 PDT by Dominik Röttsches (drott)
Modified: 2017-07-18 08:27 PDT

See Also:


Attachments

Description Dominik Röttsches (drott) 2012-09-26 01:18:51 PDT
The flakiness dashboard does not accept results from our WebKit 2 EFL bot:
http://build.webkit.org/builders/EFL%20Linux%2064-bit%20Debug%20WK2

At the end of each build, in the uploading step, we see:
00:44:46.921 6045 Uploading JSON files for builder: EFL Linux 64-bit Debug WK2
00:45:37.191 6045 Received HTTP status 500 loading "http://test-results.appspot.com/testfile/upload".  Retrying in 10 seconds...
Comment 1 Ojan Vafai 2012-09-26 11:55:35 PDT
I've fixed the glitch.

I deleted the "show all runs" data for this bot. Deleting the data for the bot isn't a big deal since we only keep the last 500 runs anyway; it's just a temporary data loss.

This is a long-standing bug: when the accumulated data in the results.json gets too large, the Python server runs out of memory trying to parse it. We delete runs older than 500 and we delete entries that have only passed or been skipped in the past 500 runs. So the results.json is usually self-pruning and we don't hit this. But if too many different tests fail in the past 500 runs, we get stuck here.
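
For illustration only, the pruning rule described above could be sketched roughly as follows. This is not the dashboard server's actual code: the data layout (a dict mapping test names to a newest-first list of result strings), the "PASS"/"SKIP" labels, and the prune() helper are all assumptions made for the example.

```python
MAX_RUNS = 500  # retention window mentioned in the comment above
BORING = {"PASS", "SKIP"}  # hypothetical result labels, not the real encoding


def prune(results, max_runs=MAX_RUNS):
    """Trim each history to max_runs and drop tests that only passed or skipped.

    `results` is assumed to map test name -> list of results, newest first.
    """
    pruned = {}
    for test, history in results.items():
        recent = history[:max_runs]  # discard runs older than the window
        if any(result not in BORING for result in recent):
            pruned[test] = recent  # keep only tests with an interesting result
    return pruned
```

Under this rule the file only stays small while few tests fail: every test with at least one failure in the window keeps its full 500-entry history, which is why a bot with many distinct failures grows past what the server can parse.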

There are a couple of proposed solutions, but no one has had the time to implement them:
1. Move over to using AppEngine Backend servers: https://developers.google.com/appengine/docs/python/backends/overview
2. Use a TaskQueue to do the JSON merging: https://developers.google.com/appengine/docs/python/taskqueue/overview
3. Chunk the JSON we store every 100 runs and make the dashboard UI load 100-run chunks at a time. This would solve the memory problem and would have the added benefit that we wouldn't have to delete data older than 500 runs.
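
A minimal sketch of what option 3 might look like, assuming the run history can be treated as a flat list of per-run records. The function names, the chunk index scheme, and the in-memory dict standing in for storage are hypothetical; none of this is the existing server code.

```python
import json

CHUNK_SIZE = 100  # runs per stored chunk, as proposed in option 3


def split_into_chunks(runs, chunk_size=CHUNK_SIZE):
    """Yield (chunk_index, serialized_json) for consecutive slices of `runs`."""
    for start in range(0, len(runs), chunk_size):
        yield start // chunk_size, json.dumps(runs[start:start + chunk_size])


def load_chunk(chunks, index):
    """Parse a single stored chunk; only that chunk's runs are held in memory."""
    return json.loads(chunks[index])
```

With this layout the server would only ever parse one 100-run chunk per request, so peak memory depends on the chunk size rather than on the total history length.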

Due to http://code.google.com/p/googleappengine/issues/detail?id=7973 we can't get the error logs to show us which builders are having this problem. :(
Comment 2 Ojan Vafai 2012-09-26 13:17:16 PDT
This is also affecting the Content Shell Chromium bots. I haven't deleted their results.json files since there are so many failures it will just start happening again.

Peter, also a heads up in case you start seeing this with the Android bots.
Comment 3 jochen 2012-09-27 00:37:48 PDT
(In reply to comment #2)
> This is also affecting the Content Shell Chromium bots. I haven't deleted their results.json files since there are so many failures it will just start happening again.

Of course, the real fix is to make the bots not fail that much.

> 
> Peter, also a heads up in case you start seeing this with the Android bots.
Comment 4 Dominik Röttsches (drott) 2012-10-01 01:43:09 PDT
(In reply to comment #1)
> I've fixed the glitch.
> I deleted the "show all runs" data for this bot. 

Thanks a lot!
Comment 5 Ojan Vafai 2012-10-12 11:32:29 PDT
*** Bug 75499 has been marked as a duplicate of this bug. ***