FIRECLOUD | Doc #11944 | Handle benign failure messages more appropriately

Handle benign failure messages more appropriately
Feature Requests | Created 2018-05-09 | Last updated 2018-07-25



I (re-)ran our CGA somatic variant calling pipeline on 402 TCGA THCA pairs. In addition to the expected failures due to congestion in Rawls (see https://gatkforums.broadinstitute.org/firecloud/discussion/11860/rawls-failure-in-10-of-402-workflows-launched-in-single-submission#latest ), four workflows failed mid-run, all apparently in connection with container creation. I reran these four workflows and they all ran to completion successfully. The error messages for the four failed workflows were:

message: Workflow failed
causedBy: 
message: Task Clinical_Workflow.Mutect2_Task:6:1 failed. Job exited without an error, exit code 0. PAPI error code 10. Message: 15: Gsutil failed: Could not capture docker logs: Unable to capture docker logs exit status 1
message: Cromwell server was restarted while this workflow was running. As part of the restart process, Cromwell attempted to reconnect to this job, however it was never started in the first place. This is a benign failure and not the cause of failure for this workflow, it can be safely ignored.

message: Workflow failed
causedBy: 
message: Task Clinical_Workflow.Mutect1_Task:5:1 failed. The job was stopped before the command finished. PAPI error code 10. Message: 11: Docker run failed: command failed: docker: error during connect: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.37/containers/create: read unix @->/var/run/docker.sock: read: connection reset by peer. See 'docker run --help'. . See logs at gs://fc-ce9e4f8c-2c1f-4d67-94e7-4170daa0c81d/5e9c7d0c-ae1d-4213-9cdb-b4ef91c25f9f/Clinical_Workflow/fcb66940-3deb-4cef-9439-a4bcf800d6d2/call-Mutect1_Task/shard-5/

message: Workflow failed
causedBy: 
message: Task Clinical_Workflow.Mutect1_Task:5:1 failed. The job was stopped before the command finished. PAPI error code 10. Message: 11: Docker run failed: command failed: docker: error during connect: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.37/containers/create: read unix @->/var/run/docker.sock: read: connection reset by peer. See 'docker run --help'. . See logs at gs://fc-ce9e4f8c-2c1f-4d67-94e7-4170daa0c81d/5e9c7d0c-ae1d-4213-9cdb-b4ef91c25f9f/Clinical_Workflow/546b7656-1e1d-4547-bc6c-1e9c22dc2526/call-Mutect1_Task/shard-5/
message: Cromwell server was restarted while this workflow was running. As part of the restart process, Cromwell attempted to reconnect to this job, however it was never started in the first place. This is a benign failure and not the cause of failure for this workflow, it can be safely ignored.

message: Workflow failed
causedBy: 
message: Cromwell server was restarted while this workflow was running. As part of the restart process, Cromwell attempted to reconnect to this job, however it was never started in the first place. This is a benign failure and not the cause of failure for this workflow, it can be safely ignored.
(11 copies of the same message)
message: Cromwell server was restarted while this workflow was running. As part of the restart process, Cromwell attempted to reconnect to this job, however it was never started in the first place. This is a benign failure and not the cause of failure for this workflow, it can be safely ignored.
message: Task Clinical_Workflow.normalMM_Task:NA:1 failed. The job was stopped before the command finished. PAPI error code 10. Message: 11: Docker run failed: command failed: docker: error during connect: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.37/containers/create: read unix @->/var/run/docker.sock: read: connection reset by peer. See 'docker run --help'. . See logs at gs://fc-ce9e4f8c-2c1f-4d67-94e7-4170daa0c81d/5e9c7d0c-ae1d-4213-9cdb-b4ef91c25f9f/Clinical_Workflow/3004d191-8ea2-4a29-b934-4e71ac7f9a42/call-normalMM_Task/
message: Cromwell server was restarted while this workflow was running. As part of the restart process, Cromwell attempted to reconnect to this job, however it was never started in the first place. This is a benign failure and not the cause of failure for this workflow, it can be safely ignored.
(8 copies of the same message)
message: Cromwell server was restarted while this workflow was running. As part of the restart process, Cromwell attempted to reconnect to this job, however it was never started in the first place. This is a benign failure and not the cause of failure for this workflow, it can be safely ignored.
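
To make the request in the title concrete: a failure report could set aside the messages Cromwell itself labels as benign and collapse repeats, so that the one message naming the real cause is not buried under twenty copies of the restart notice. Below is a minimal sketch of that idea, assuming the failure messages are available as a plain list of strings. The function name and the choice of marker substring are illustrative assumptions, not part of FireCloud or Cromwell.

```python
from collections import Counter

# Marker copied from the Cromwell message quoted above. Using this substring
# as the only test for "benign" is an assumption made for this sketch.
BENIGN_MARKER = "This is a benign failure"


def summarize_failures(messages):
    """Separate actionable failure messages from self-declared benign ones.

    Returns (actionable, benign_count), where actionable is a list of
    (message, occurrences) pairs in first-seen order.
    """
    counts = Counter()
    benign_count = 0
    for msg in messages:
        if BENIGN_MARKER in msg:
            benign_count += 1
        else:
            counts[msg] += 1
    # Counter keeps insertion order (Python 3.7+), so first-seen order holds.
    return list(counts.items()), benign_count
```

Applied to the fourth workflow above, this would report the single normalMM_Task message as actionable and return the restart notices only as a count.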

As mentioned, I reran these four failed workflows and they completed without problems. Given how adamant the system was that the Cromwell restart did not cause the workflow failures, I am inclined to believe the restart did play a role in them. Regardless, I'd like to understand the source of these intermittent failures.
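
Since all four workflows succeeded on rerun, one possible stopgap is to flag failures whose messages match the container-related errors seen here and queue only those for resubmission. The sketch below assumes the failures are held in a dict mapping workflow ID to a list of message strings. The pattern list contains only substrings from the messages quoted in this post, and treating them as signs of a transient problem is an assumption based on this one submission. The function names are hypothetical.

```python
import re

# Substrings copied from the failed workflows quoted in this post. That they
# indicate a transient problem is an assumption: they are simply the errors
# that went away on rerun here.
TRANSIENT_PATTERNS = [
    re.compile(r"connection reset by peer"),
    re.compile(r"Could not capture docker logs"),
]


def looks_transient(message):
    """Return True if the message matches any known transient pattern."""
    return any(p.search(message) for p in TRANSIENT_PATTERNS)


def workflows_to_rerun(failures):
    """Pick workflow IDs with at least one transient-looking message.

    failures: dict mapping workflow ID -> list of failure message strings.
    Returns the matching IDs in the dict's iteration order.
    """
    return [
        wf_id
        for wf_id, msgs in failures.items()
        if any(looks_transient(m) for m in msgs)
    ]
```

A workflow whose only messages are the benign restart notice would not be selected by this check, so it would still need a manual look.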

Here is information on the failures:

Google Project: cloud-resource-miscellaneous
Workspace: CBB_20180405_TCGA_THCA_ControlledAccess_V1-0_DATA
Submission ID: 5e9c7d0c-ae1d-4213-9cdb-b4ef91c25f9f
Workflow IDs: a8b9ae04-52bf-476f-a4cc-bee63f5aa013, fcb66940-3deb-4cef-9439-a4bcf800d6d2, 546b7656-1e1d-4547-bc6c-1e9c22dc2526, 3004d191-8ea2-4a29-b934-4e71ac7f9a42

