<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://bugs.webkit.org/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4.1"
          urlbase="https://bugs.webkit.org/"
          
          maintainer="admin@webkit.org"
>

    <bug>
          <bug_id>203858</bug_id>
          
          <creation_ts>2019-11-05 11:27:53 -0800</creation_ts>
          <short_desc>EWS should retry build in case of kill-old-processes failure</short_desc>
          <delta_ts>2020-02-20 08:06:56 -0800</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>WebKit</product>
          <component>Tools / Tests</component>
          <version>Other</version>
          <rep_platform>Unspecified</rep_platform>
          <op_sys>Unspecified</op_sys>
          <bug_status>RESOLVED</bug_status>
          <resolution>FIXED</resolution>
          
          <see_also>https://bugs.webkit.org/show_bug.cgi?id=208003</see_also>
          <bug_file_loc></bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords>InRadar</keywords>
          <priority>P2</priority>
          <bug_severity>Normal</bug_severity>
          <target_milestone>---</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Aakash Jain">aakash_jain</reporter>
          <assigned_to name="Aakash Jain">aakash_jain</assigned_to>
          <cc>aakash_jain</cc>
          <cc>ap</cc>
          <cc>commit-queue</cc>
          <cc>jbedard</cc>
          <cc>webkit-bug-importer</cc>
          

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>1587479</commentid>
    <comment_count>0</comment_count>
    <who name="Aakash Jain">aakash_jain</who>
    <bug_when>2019-11-05 11:27:53 -0800</bug_when>
    <thetext>Sometimes a bot gets into a bad state and kill-old-processes starts failing. When that happens, the bot processes (and fails) each build very quickly. Because of this, all the pending builds end up failing.

For example, two weeks ago ews121 went into a bad state and many builds failed, e.g.:

https://ews-build.webkit.org/#/builders/24/builds/2671
https://ews-build.webkit.org/#/builders/24/builds/2672
https://ews-build.webkit.org/#/builders/24/builds/2673
https://ews-build.webkit.org/#/builders/24/builds/2674
https://ews-build.webkit.org/#/builders/24/builds/2675
https://ews-build.webkit.org/#/builders/24/builds/2677
https://ews-build.webkit.org/#/builders/24/builds/2680
https://ews-build.webkit.org/#/builders/24/builds/2684
https://ews-build.webkit.org/#/builders/24/builds/2690
https://ews-build.webkit.org/#/builders/24/builds/2691
https://ews-build.webkit.org/#/builders/24/builds/2693
https://ews-build.webkit.org/#/builders/24/builds/2694
https://ews-build.webkit.org/#/builders/24/builds/2697

We should retry the build when kill-old-processes fails, so that the bad bot does not burn through all the pending builds. The build will keep retrying until a different bot picks it up. This will make EWS robust against this kind of infrastructure failure.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>1587481</commentid>
    <comment_count>1</comment_count>
      <attachid>382837</attachid>
    <who name="Aakash Jain">aakash_jain</who>
    <bug_when>2019-11-05 11:33:54 -0800</bug_when>
    <thetext>Created attachment 382837
Patch</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>1587482</commentid>
    <comment_count>2</comment_count>
    <who name="Aakash Jain">aakash_jain</who>
    <bug_when>2019-11-05 11:34:57 -0800</bug_when>
    <thetext>Sample run: https://ews-build.webkit-uat.org/#/builders/3/builds/227</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>1587918</commentid>
    <comment_count>3</comment_count>
    <who name="Jonathan Bedard">jbedard</who>
    <bug_when>2019-11-06 14:23:07 -0800</bug_when>
    <thetext>...if kill-old-processes fails, we should force-reboot the machine. Do we have any evidence that a retry will actually help?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>1589096</commentid>
    <comment_count>4</comment_count>
    <who name="Aakash Jain">aakash_jain</who>
    <bug_when>2019-11-10 05:29:40 -0800</bug_when>
    <thetext>It just happened again in https://ews-build.webkit.org/#/builders/24/builds/4562 and https://ews-build.webkit.org/#/builders/24/builds/4566

&gt; ...if kill-old-processes fails, we should force-reboot the machine.
Yes, rebooting the machine is a better idea. However, it will take me a while to implement and test that. Meanwhile, can we land this (maybe with a FIXME), since it is clearly an improvement?

&gt; Do we have any evidence that a retry will actually help?
Yes, we have already seen it work many times for RETRY on checkout failure (https://trac.webkit.org/changeset/247364/webkit). For example, when bot igalia1-gtk-wk2-ews ran out of space in https://ews-build.webkit.org/#/builders/4/builds/6684, instead of simply failing, the build was retried, picked up by a different bot, and passed in https://ews-build.webkit.org/#/builders/4/builds/6685</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>1589192</commentid>
    <comment_count>5</comment_count>
    <who name="Jonathan Bedard">jbedard</who>
    <bug_when>2019-11-11 07:41:21 -0800</bug_when>
    <thetext>(In reply to Aakash Jain from comment #4)
&gt; It just happened again in
&gt; https://ews-build.webkit.org/#/builders/24/builds/4562 and
&gt; https://ews-build.webkit.org/#/builders/24/builds/4566
&gt; 
&gt; &gt; ...if kill-old-processes fails, we should force-reboot the machine.
&gt; Yes, rebooting the machine is a better idea. However, it will take me a
&gt; while to implement and test that. Meanwhile can we land this (maybe with a
&gt; FIXME), since this is clearly an improvement.

You&apos;ve convinced me this is an improvement often enough to be worth landing, although I remain somewhat skeptical of our ability to trust machines that fail kill-old-processes.

I&apos;m actually not sure that rebooting takes much testing or additional code. I don&apos;t think we need to put effort into being delicate: I don&apos;t see a world where machines fail kill-old-processes frequently enough to find themselves in a crash loop. If we&apos;re really worried about that, we can just refuse to reboot unless a bot has been up longer than some amount of time (I&apos;d say an hour, but that&apos;s sort of arbitrary).

&gt; 
&gt; ...</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>1589201</commentid>
    <comment_count>6</comment_count>
      <attachid>382837</attachid>
    <who name="WebKit Commit Bot">commit-queue</who>
    <bug_when>2019-11-11 08:23:33 -0800</bug_when>
    <thetext>Comment on attachment 382837
Patch

Clearing flags on attachment: 382837

Committed r252324: &lt;https://trac.webkit.org/changeset/252324&gt;</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>1589202</commentid>
    <comment_count>7</comment_count>
    <who name="WebKit Commit Bot">commit-queue</who>
    <bug_when>2019-11-11 08:23:34 -0800</bug_when>
    <thetext>All reviewed patches have been landed.  Closing bug.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>1589377</commentid>
    <comment_count>8</comment_count>
    <who name="Radar WebKit Bug Importer">webkit-bug-importer</who>
    <bug_when>2019-11-11 16:57:21 -0800</bug_when>
    <thetext>&lt;rdar://problem/57099318&gt;</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>1621178</commentid>
    <comment_count>9</comment_count>
    <who name="Aakash Jain">aakash_jain</who>
    <bug_when>2020-02-20 06:03:42 -0800</bug_when>
    <thetext>This change seems to be working fine. A few examples where it helped:

https://ews-build.webkit.org/#/builders/22/builds/10002
https://ews-build.webkit.org/#/builders/3/builds/16167
https://ews-build.webkit.org/#/builders/3/builds/16230
https://ews-build.webkit.org/#/builders/9/builds/17233</thetext>
  </long_desc>
      
          <attachment
              isobsolete="0"
              ispatch="1"
              isprivate="0"
          >
            <attachid>382837</attachid>
            <date>2019-11-05 11:33:54 -0800</date>
            <delta_ts>2019-11-11 08:23:33 -0800</delta_ts>
            <desc>Patch</desc>
            <filename>bug-203858-20191105143353.patch</filename>
            <type>text/plain</type>
            <size>2386</size>
            <attacher name="Aakash Jain">aakash_jain</attacher>
            
              <data encoding="base64">SW5kZXg6IFRvb2xzL0NoYW5nZUxvZwo9PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09
PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09Ci0tLSBUb29scy9DaGFuZ2VMb2cJKHJl
dmlzaW9uIDI1MjA2NCkKKysrIFRvb2xzL0NoYW5nZUxvZwkod29ya2luZyBjb3B5KQpAQCAtMSwz
ICsxLDE1IEBACisyMDE5LTExLTA1ICBBYWthc2ggSmFpbiAgPGFha2FzaF9qYWluQGFwcGxlLmNv
bT4KKworICAgICAgICBFV1Mgc2hvdWxkIHJldHJ5IGJ1aWxkIGluIGNhc2Ugb2Yga2lsbC1vbGQt
cHJvY2Vzc2VzIGZhaWx1cmUKKyAgICAgICAgaHR0cHM6Ly9idWdzLndlYmtpdC5vcmcvc2hvd19i
dWcuY2dpP2lkPTIwMzg1OAorCisgICAgICAgIFJldmlld2VkIGJ5IE5PQk9EWSAoT09QUyEpLgor
CisgICAgICAgICogQnVpbGRTbGF2ZVN1cHBvcnQvZXdzLWJ1aWxkL3N0ZXBzLnB5OgorICAgICAg
ICAoS2lsbE9sZFByb2Nlc3Nlcy5ldmFsdWF0ZUNvbW1hbmQpOiBSZXRyeSB0aGUgYnVpbGQgaW4g
Y2FzZSBvZiBmYWlsdXJlLgorICAgICAgICAoS2lsbE9sZFByb2Nlc3Nlcy5nZXRSZXN1bHRTdW1t
YXJ5KTogVXBkYXRlIHRoZSBidWlsZC1zdGVwIHN1bW1hcnkgc3RyaW5nLgorICAgICAgICAqIEJ1
aWxkU2xhdmVTdXBwb3J0L2V3cy1idWlsZC9zdGVwc191bml0dGVzdC5weTogVXBkYXRlZCB1bml0
LXRlc3RzLgorCiAyMDE5LTExLTA1ICBXZW5zb24gSHNpZWggIDx3ZW5zb25faHNpZWhAYXBwbGUu
Y29tPgogCiAgICAgICAgIE5hdGl2ZSB0ZXh0IHN1YnN0aXR1dGlvbnMgaW50ZXJmZXJlIHdpdGgg
SFRNTCA8ZGF0YWxpc3Q+IG9wdGlvbnMgcmVzdWx0aW5nIGluIGNyYXNoCkluZGV4OiBUb29scy9C
dWlsZFNsYXZlU3VwcG9ydC9ld3MtYnVpbGQvc3RlcHMucHkKPT09PT09PT09PT09PT09PT09PT09
PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PQotLS0gVG9vbHMv
QnVpbGRTbGF2ZVN1cHBvcnQvZXdzLWJ1aWxkL3N0ZXBzLnB5CShyZXZpc2lvbiAyNTIwNjQpCisr
KyBUb29scy9CdWlsZFNsYXZlU3VwcG9ydC9ld3MtYnVpbGQvc3RlcHMucHkJKHdvcmtpbmcgY29w
eSkKQEAgLTk4NCw2ICs5ODQsMTYgQEAgY2xhc3MgS2lsbE9sZFByb2Nlc3NlcyhzaGVsbC5Db21w
aWxlKToKICAgICBkZWYgX19pbml0X18oc2VsZiwgKiprd2FyZ3MpOgogICAgICAgICBzdXBlcihL
aWxsT2xkUHJvY2Vzc2VzLCBzZWxmKS5fX2luaXRfXyh0aW1lb3V0PTYwLCBsb2dFbnZpcm9uPUZh
bHNlLCAqKmt3YXJncykKIAorICAgIGRlZiBldmFsdWF0ZUNvbW1hbmQoc2VsZiwgY21kKToKKyAg
ICAgICAgaWYgY21kLmRpZEZhaWwoKToKKyAgICAgICAgICAgIHNlbGYuYnVpbGQuYnVpbGRGaW5p
c2hlZChbJ0ZhaWxlZCB0byBraWxsIG9sZCBwcm9jZXNzZXMsIHJldHJ5aW5nIGJ1aWxkJ10sIFJF
VFJZKQorICAgICAgICByZXR1cm4gc2hlbGwuQ29tcGlsZS5ldmFsdWF0ZUNvbW1hbmQoc2VsZiwg
Y21kKQorCisgICAgZGVmIGdldFJlc3VsdFN1bW1hcnkoc2VsZik6CisgICAgICAgIGlmIHNlbGYu
cmVzdWx0cyA9PSBGQUlMVVJFOgorICAgICAgICAgICAgcmV0dXJuIHt1J3N0ZXAnOiB1J0ZhaWxl
ZCB0byBraWxsIG9sZCBwcm9jZXNzZXMnfQorICAgICAgICByZXR1cm4gc2hlbGwuQ29tcGlsZS5n
ZXRSZXN1bHRTdW1tYXJ5KHNlbGYpCisKIAogY2xhc3MgUnVuV2ViS2l0VGVzdHMoc2hlbGwuVGVz
dCk6CiAgICAgbmFtZSA9ICdsYXlvdXQtdGVzdHMnCkluZGV4OiBUb29scy9CdWlsZFNsYXZlU3Vw
cG9ydC9ld3MtYnVpbGQvc3RlcHNfdW5pdHRlc3QucHkKPT09PT09PT09PT09PT09PT09PT09PT09
PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PQotLS0gVG9vbHMvQnVp
bGRTbGF2ZVN1cHBvcnQvZXdzLWJ1aWxkL3N0ZXBzX3VuaXR0ZXN0LnB5CShyZXZpc2lvbiAyNTIw
NjQpCisrKyBUb29scy9CdWlsZFNsYXZlU3VwcG9ydC9ld3MtYnVpbGQvc3RlcHNfdW5pdHRlc3Qu
cHkJKHdvcmtpbmcgY29weSkKQEAgLTU5MSw3ICs1OTEsNyBAQCBjbGFzcyBUZXN0S2lsbE9sZFBy
b2Nlc3NlcyhCdWlsZFN0ZXBNaXhpCiAgICAgICAgICAgICArIEV4cGVjdFNoZWxsLmxvZygnc3Rk
aW8nLCBzdGRvdXQ9J1VuZXhwZWN0ZWQgZXJyb3IuJykKICAgICAgICAgICAgICsgMiwKICAgICAg
ICAgKQotICAgICAgICBzZWxmLmV4cGVjdE91dGNvbWUocmVzdWx0PUZBSUxVUkUsIHN0YXRlX3N0
cmluZz0nS2lsbGVkIG9sZCBwcm9jZXNzZXMgKGZhaWx1cmUpJykKKyAgICAgICAgc2VsZi5leHBl
Y3RPdXRjb21lKHJlc3VsdD1GQUlMVVJFLCBzdGF0ZV9zdHJpbmc9J0ZhaWxlZCB0byBraWxsIG9s
ZCBwcm9jZXNzZXMnKQogICAgICAgICByZXR1cm4gc2VsZi5ydW5TdGVwKCkKIAogCg==
</data>

          </attachment>
      

    </bug>

</bugzilla>