status | title | creation-date | last-updated | authors | see-also |
---|---|---|---|---|---|
proposed | Refine Retries for TaskRuns and CustomRuns | 2022-09-08 | 2022-10-20 | | |
Two distinct imperfections on `Retries` we'd like to address in this TEP:

- `Retries` on `Timeout` is designed inconsistently between TaskRun and CustomRun.
  - For CustomRun, the document instructs developers to set `Timeout` for all retry attempts, while in the actual implementation it is set for each retry attempt. See the ref.
  - For a TaskRun created for a PipelineTask, the `Timeout` is set for each retry attempt.
  - For a standalone TaskRun, there's no `Retries` implemented.
- Both the `PipelineRun` reconciler and the `TaskRun`|`CustomRun` reconciler are partially responsible for implementing `Retries` as of today. See tektoncd/pipeline#5248.
- `Timeout` must be set for each retry attempt in the four runtime objects (independent TaskRun, TaskRun part of a Pipeline, independent CustomRun, CustomRun part of a Pipeline) that support `Retries`, including no Timeout (Timeout set to 0).
- The TaskRun reconciler, which is part of the Tekton Pipeline Controller, implements `retries` for two runtime objects (independent TaskRun and TaskRun part of a Pipeline).
- Define retries behavior for PipelineRuns.
- The collective timeout for `tasks`, the collective timeout for `finally` tasks, and the `timeout` at the `pipeline` level do not change.
Aligning the behavior improves the user experience. Consider the following example:
```yaml
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: custom-task-pipeline
spec:
  tasks:
    - name: task-run-example
      taskRef:
        name: task-run-example
      retries: 1
      timeout: "10s"
    - name: custom-run-example
      taskRef:
        apiVersion: example.dev/v1alpha1
        kind: Example
      retries: 1
      timeout: "10s"
```
Say customers define two child resources within a PipelineRun:

- `task-run-example`
- `custom-run-example`

They set both `retries` and `timeout` for the two resources. Under the current implementation, the two runtime objects behave differently, which is not intuitive:

- `task-run-example` will be retried once after 10s.
- `custom-run-example` will be timed out after 10s. But if the corresponding CustomRun controller applies the timeout to each attempt, like in TaskRuns, instead of to all attempts per the documented guidance, then `custom-run-example` would be retried once after 10s, working similarly to `task-run-example`.
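To make the difference concrete, the following minimal sketch computes the worst-case wall-clock time of a run under both interpretations of the timeout. The function names are hypothetical and not part of Tekton; the sketch assumes a positive timeout and ignores scheduling overhead between attempts.

```go
package main

import (
	"fmt"
	"time"
)

// worstCasePerAttempt returns the longest a run can take when the timeout
// applies to every attempt separately: the first attempt and each retry may
// each consume the full timeout.
func worstCasePerAttempt(timeout time.Duration, retries int) time.Duration {
	return time.Duration(retries+1) * timeout
}

// worstCaseAllAttempts returns the longest a run can take when a single
// timeout budget is shared by all attempts, so the retry count has no effect.
func worstCaseAllAttempts(timeout time.Duration, retries int) time.Duration {
	return timeout
}

func main() {
	// Values taken from the Pipeline example: timeout "10s", retries 1.
	timeout, retries := 10*time.Second, 1
	fmt.Println("per attempt: ", worstCasePerAttempt(timeout, retries))
	fmt.Println("all attempts:", worstCaseAllAttempts(timeout, retries))
}
```

With the values from the example, the per-attempt interpretation allows up to 20s in total, while the all-attempts interpretation caps the run at 10s, which leaves no time for a retry once the first attempt has timed out.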
As a standalone runtime object, TaskRuns can be used independently (outside of a PipelineRun) in production environments. Here are several use cases:
- https://github.com/tektoncd/catalog/tree/main/task/send-to-webhook-slack/0.1 which is used in Tekton CI
- https://github.com/tektoncd/catalog/tree/main/task/sendmail/0.1
- Tekton CD: cleanup runs.
Transient errors are everywhere, especially in cloud environments: a service can be down for a short period of time, making the entire TaskRun fail. https://learn.microsoft.com/en-us/azure/architecture/best-practices/transient-faults#why-do-transient-faults-occur-in-the-cloud explains how common transient errors are in the cloud.
With retries supported, customers are able to write robust TaskRuns to support such use cases.
In this section, we compare the general retry strategies in the CI/CD industry, and in particular whether they retry on timeout (where CustomRun and TaskRun deviate), so that we can decide whether the timeout should apply to all retry attempts or to each individual attempt in both `CustomRun` and `TaskRun`.
Typically, a retry strategy includes:
- When to retry
- The number of attempts
- Actions to take after a failed attempt
- Timeout of each attempt
- Retry until a certain condition is met
Feature | Retry Action in GitHub Actions | GitLab Job | Ansible Task | Concourse Step |
---|---|---|---|---|
When to retry | on failure | configurable | always retry, conditional stop | configurable |
Number of attempts | supported | supported | supported | supported |
Timeout for each attempt | supported | supported | supported | supported |
Timeout for all attempts | supported | - | - | - |
Several observations regarding the feature table above:

- We can configure the timeout duration per attempt in all CI systems that support the `retry` functionality.
- GitHub Actions doesn't support retry natively, but because of the flexibility of customized actions, some users write their own `retry` action to make it work, and those customized actions even support what to do before retrying a failed attempt.
- Concourse mentioned the reason it retries per attempt is somewhat arbitrary.
No matter how we implement the retry functionality, we propose to set `Timeout` for each retry attempt. This proposal is based on the existing behavior and on the investigation of other CI/CD systems; see related work.
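As an illustration of a per-attempt timeout, here is a minimal, generic sketch in plain Go. It is not the Tekton reconciler: `runWithRetries`, `flakyStep`, and `demo` are hypothetical names, the 20ms timeout is arbitrary, and the sketch assumes `retries` is not negative. Each attempt receives its own deadline, and a timeout of 0 is treated as "no timeout".

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runWithRetries calls attempt up to retries+1 times. Every call gets its own
// deadline of perAttempt; a perAttempt of 0 means the attempt has no timeout.
// It returns the number of attempts made and the error of the last attempt.
func runWithRetries(ctx context.Context, retries int, perAttempt time.Duration,
	attempt func(context.Context) error) (int, error) {
	var err error
	for n := 0; n <= retries; n++ {
		attemptCtx, cancel := ctx, context.CancelFunc(func() {})
		if perAttempt > 0 {
			attemptCtx, cancel = context.WithTimeout(ctx, perAttempt)
		}
		err = attempt(attemptCtx)
		cancel() // release the timer before the next attempt starts
		if err == nil {
			return n + 1, nil
		}
	}
	return retries + 1, err
}

// flakyStep returns a hypothetical step that hangs until its deadline for the
// first `failures` calls and succeeds afterwards.
func flakyStep(failures int) func(context.Context) error {
	calls := 0
	return func(ctx context.Context) error {
		calls++
		if calls <= failures {
			<-ctx.Done()
			return ctx.Err()
		}
		return nil
	}
}

// demo runs flakyStep with a 20ms timeout per attempt.
func demo(retries, failures int) (int, error) {
	return runWithRetries(context.Background(), retries, 20*time.Millisecond, flakyStep(failures))
}

func main() {
	fmt.Println(demo(1, 1)) // the first attempt times out, the retry succeeds
	fmt.Println(demo(1, 5)) // both attempts time out
}
```

Because the deadline is recreated inside the loop, an attempt that times out does not consume the time budget of the attempts that follow it.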
- Stop relying on `len(retriesStatus)` to determine whether a TaskRun or CustomRun finishes; use `ConditionSucceeded` & `ConditionFalse` & Reason=="TimedOut" instead.
- `Retries` and `Timeout` are passed from `PipelineTask` to `TaskRunSpec` and `CustomRunSpec`.
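The condition-based check in the first item could look roughly like the sketch below. `Condition` is a simplified, hypothetical stand-in for the condition type carried in a TaskRun or CustomRun status, and the string values follow the wording of this TEP rather than any existing constant.

```go
package main

import "fmt"

// Condition is a simplified, hypothetical stand-in for the condition stored
// in the status of a TaskRun or CustomRun.
type Condition struct {
	Type   string
	Status string
	Reason string
}

// timedOut reports whether a run finished because it timed out. It looks only
// at the Succeeded condition and never at the length of retriesStatus.
func timedOut(conds []Condition) bool {
	for _, c := range conds {
		if c.Type == "Succeeded" {
			return c.Status == "False" && c.Reason == "TimedOut"
		}
	}
	return false
}

func main() {
	running := []Condition{{Type: "Succeeded", Status: "Unknown", Reason: "Running"}}
	failed := []Condition{{Type: "Succeeded", Status: "False", Reason: "TimedOut"}}
	fmt.Println(timedOut(running), timedOut(failed))
}
```

A PipelineRun reconciler that uses a check like this no longer needs to know how many attempts the child object has made.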
Three sub-options about the way to implement `retriesStatus`:

- 1.a: Update `retriesStatus` for each retry attempt for `TaskRun`, keep `retriesStatus` for `CustomRun`
  - No API change
  - Need to implement a strategy for clients to get the previous pod and read its logs.
- 1.b: Update `retriesStatus` for each retry attempt for `TaskRun`, deprecate `retriesStatus` for `CustomRun`
  - No implementation restrictions of `retriesStatus` for `CustomRun`
  - Need to implement a strategy for clients to get the previous pod and read its logs.
- 1.c: Deprecate `retriesStatus` for both `TaskRun` and `CustomRun`, create a new `TaskRun` for each retry attempt, add a new field `RetryAttempts` in `TaskRunStatusFields` to record names of all retry attempts.
  - Easier to retrieve logs from retried TaskRuns.
  - See Appendix - I for more implementation details.
Benefits:
- Improve `Retries` implementation separation by making it only a TaskRun concern
- Consistent interface for retries.
- Consistent termination condition.
- No changes to CustomRun API.
- Standalone TaskRun can retry on its own.
Concerns:
- Dashboard and CLI may need extra work if we remove `retriesStatus`.
- If a CustomRun controller doesn't support retries, it results in a poor user experience since the PipelineRun controller passes retries directly to the CustomRun and expects the CustomRun controller to implement it.
- Make `retries` a `PipelineRun` concern
- Remove `retries` from `CustomRun` spec
- Move logic for `retries` to PipelineRun reconciler and create new `TaskRun`s and `Run`s at each attempt.
- Remove `retriesStatus` from TaskRun & CustomRun
Benefits:
- Consistent interface for `retries`
- Custom task controller developers get a default implementation of retries for free (by embedding in a pipeline)
- "Pipelines in pipeline" can be retried the same as the other resources
- Improve the retries of TaskRuns created from PipelineTasks by using separate TaskRuns for each retry
- No changes to the PipelineRun API (not in the spec at least)
- No changes to the TaskRun API (not in the spec at least)
Concerns:
- API Change for `Run` and `CustomRun` (need to remove `retries` & `retriesStatus`)
  - We are moving Custom Task Run from alpha (Run) to beta (CustomRun) (see TEP-0114), which is a good time to remove fields from `Run`.
- Dashboard and CLI may need extra work if we remove `retriesStatus`
- Standalone `TaskRun` can't retry on its own.
- It's not quite user-friendly if a CustomRun controller implements its own retry strategy, for example:
```yaml
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: pr-custom-task-
spec:
  pipelineSpec:
    tasks:
      - name: wait
        timeout: "1s"
        retries: 1 # The common retries field in the PipelineTask
        taskSpec:
          specialized-retries: 5 # Specialized retries field in Custom Task Spec.
          other-spec-fields: foobar
```
The custom task users would be confused about which retries field to use in order to retry a Run.
Retrying pipeline-in-pipeline involves a lot of uncertainty, so we'd like to settle it in a separate TEP.
One consideration we may want to revisit when designing retries for pipeline-in-pipeline: we may want to focus on retrying the PipelineRun as a whole, rather than retrying some failed child tasks, because the child tasks are already retriable as part of a PipelineRun.
If a CustomRun controller doesn't implement retries (such as the wait task under the experimental folder), this results in a poor user experience since the PipelineRun controller passes retries directly to the CustomRun and expects the CustomRun controller to implement it.
We've had some discussions in the API WG. We agreed that we expect all CustomRun controllers to implement retries. However, whether they implement it or not is out of our control.
- New `Retries` field in `TaskRunSpec`

```go
type TaskRunSpec struct {
	// Retries represents how many times this task should be retried in case of task failure: ConditionSucceeded set to False
	// +optional
	Retries int
}
```

- New `RetryAttempts` field in `TaskRunStatus`

```go
type TaskRunStatusFields struct {
	// RetryAttempts records the names of the TaskRuns which are created for retry
	// +optional
	RetryAttempts []string
}
```
Label `tekton.dev/retry-count: <retry number>` is attached to every TaskRun. For a TaskRun that's not a retry, the `retry number` will be set as `0`.

We'll use this label to decide the value of `context.task.retry-count` (instead of using `len(tr.Status.RetriesStatus)` as in the current implementation).

Label `tekton.dev/retry-parent: <parent taskrun name>` is attached to each retry TaskRun.
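Reading the retry number back from the label could be sketched as follows. `retryCount` is a hypothetical helper, not an existing Tekton function; treating a missing label as 0 is an assumption made for this sketch, since the text above states that every TaskRun carries the label.

```go
package main

import (
	"fmt"
	"strconv"
)

const retryCountLabel = "tekton.dev/retry-count"

// retryCount reads the retry number from a TaskRun's labels. A missing label
// is treated as 0 (an assumption of this sketch); a value that is not a
// non-negative integer is reported as an error.
func retryCount(labels map[string]string) (int, error) {
	raw, ok := labels[retryCountLabel]
	if !ok {
		return 0, nil
	}
	n, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("label %s=%q is not an integer: %w", retryCountLabel, raw, err)
	}
	if n < 0 {
		return 0, fmt.Errorf("label %s=%q must not be negative", retryCountLabel, raw)
	}
	return n, nil
}

func main() {
	n, err := retryCount(map[string]string{retryCountLabel: "1"})
	fmt.Println(n, err)
}
```

Label values are always strings in Kubernetes, so the number has to be parsed before it can be compared with the integer retries value or substituted into `context.task.retry-count`.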
Say we submit the following TaskRun:
```yaml
apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  name: tr
  labels:
    tekton.dev/retry-count: "0"
spec:
  timeout: 1s
  retries: 1
  # ...
status:
  conditions:
    - status: "True"
      reason: Unknown
  retryAttempts:
```
After 1 second elapses, the TaskRun reconciler needs to retry the TaskRun `tr`:

- Create a new TaskRun `tr-attempt-1`
- Attach the following labels to the new TaskRun
  - `tekton.dev/retry-count: 1`
  - `tekton.dev/retry-parent: tr`
- Add the new TaskRun name to `status.retryAttempts` of its parent TaskRun.
- Update the Reason of the Condition to `Retrying`, keep Status as True.
Now we have two TaskRuns:
```yaml
apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  name: tr
  labels:
    tekton.dev/retry-count: "0"
spec:
  timeout: 1s
  retries: 1
  # ...
status:
  conditions:
    - status: "True"
      reason: Retrying
  retryAttempts:
    - tr-attempt-1
---
apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  name: tr-attempt-1
  labels:
    tekton.dev/retry-count: "1"
    tekton.dev/retry-parent: tr
spec:
  timeout: 1s
  retries: 1
  # ...
status:
  conditions:
    - status: "True"
      reason: Unknown
  retryAttempts:
```
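The retry step that leads from one TaskRun to the two TaskRuns above can be sketched as follows. The `TaskRun` struct here is a pared-down, hypothetical model that holds only the fields this step touches, and `newRetryAttempt` is an illustrative name rather than an existing function; a real reconciler would create the object through the Kubernetes API.

```go
package main

import "fmt"

// TaskRun is a pared-down, hypothetical model of the fields used by the retry step.
type TaskRun struct {
	Name          string
	Labels        map[string]string
	Timeout       string
	Retries       int
	Reason        string   // reason of the Succeeded condition
	RetryAttempts []string // names of the TaskRuns created for retry
}

// newRetryAttempt creates the next attempt for parent, labels it with its
// retry number and its parent, and records it on the parent's status.
func newRetryAttempt(parent *TaskRun) *TaskRun {
	n := len(parent.RetryAttempts) + 1
	attempt := &TaskRun{
		Name: fmt.Sprintf("%s-attempt-%d", parent.Name, n),
		Labels: map[string]string{
			"tekton.dev/retry-count":  fmt.Sprint(n),
			"tekton.dev/retry-parent": parent.Name,
		},
		Timeout: parent.Timeout,
		Retries: parent.Retries,
		Reason:  "Unknown",
	}
	parent.RetryAttempts = append(parent.RetryAttempts, attempt.Name)
	parent.Reason = "Retrying" // the condition status itself is left untouched
	return attempt
}

func main() {
	tr := &TaskRun{
		Name:    "tr",
		Labels:  map[string]string{"tekton.dev/retry-count": "0"},
		Timeout: "1s",
		Retries: 1,
		Reason:  "Unknown",
	}
	attempt := newRetryAttempt(tr)
	fmt.Println(attempt.Name, attempt.Labels)
	fmt.Println(tr.Reason, tr.RetryAttempts)
}
```

Deriving the attempt number from the length of `RetryAttempts` keeps the parent as the single source of truth for how many retries have been started.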
After another second elapses, `tr-attempt-1` times out.

In the reconciliation loop of `tr-attempt-1`, the reconciler checks that the value of `tekton.dev/retry-count` is equal to `Spec.Retries`, so it updates the Condition of `tr-attempt-1` to `Status=False, Reason=TimedOut`.

Then in the reconciliation loop of `tr`, the reconciler checks that the last attempt in `retryAttempts` is `tr-attempt-1` and that it has already failed on TimedOut, so it updates the condition of `tr` to `Status=False, Reason=TimedOut`.
```yaml
apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  name: tr
  labels:
    tekton.dev/retry-count: "0"
spec:
  timeout: 1s
  retries: 1
  # ...
status:
  conditions:
    - status: "False"
      reason: TimedOut
  retryAttempts:
    - tr-attempt-1
---
apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  name: tr-attempt-1
  labels:
    tekton.dev/retry-count: "1"
    tekton.dev/retry-parent: tr
spec:
  timeout: 1s
  retries: 1
  # ...
status:
  conditions:
    - status: "False"
      reason: TimedOut
  retryAttempts:
```
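The two reconciliation checks that produce this final state can be sketched as below. `Attempt`, `retriesExhausted`, and `parentOutcome` are hypothetical names used for illustration only. The sketch compares with `>=` instead of strict equality as a defensive choice, and it assumes the parent already knows the condition of its attempts.

```go
package main

import "fmt"

// Attempt is a hypothetical summary of one retry TaskRun as seen by its parent.
type Attempt struct {
	RetryCount int    // value of the tekton.dev/retry-count label
	Status     string // status of the Succeeded condition
	Reason     string // reason of the Succeeded condition
}

// retriesExhausted reports whether an attempt is the last one allowed, that
// is, its retry count has reached the retries value from the spec.
func retriesExhausted(retryCount, specRetries int) bool {
	return retryCount >= specRetries
}

// parentOutcome derives the parent's final condition from its attempts. Once
// the last recorded attempt has failed with TimedOut and no retries remain,
// the parent fails with TimedOut as well. ok is false when the parent's
// condition should be left unchanged.
func parentOutcome(specRetries int, attempts []Attempt) (status, reason string, ok bool) {
	if len(attempts) == 0 {
		return "", "", false
	}
	last := attempts[len(attempts)-1]
	if last.Status == "False" && last.Reason == "TimedOut" &&
		retriesExhausted(last.RetryCount, specRetries) {
		return "False", "TimedOut", true
	}
	return "", "", false
}

func main() {
	attempts := []Attempt{{RetryCount: 1, Status: "False", Reason: "TimedOut"}}
	fmt.Println(parentOutcome(1, attempts))
}
```

Keeping the decision in a pure function like this makes it easy to test the termination logic without a running cluster.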
The relationship between the original TaskRun and the TaskRuns created for retry is:

```
              originalTaskRun
             /               \
taskRun-attempt-1    ...    taskRun-attempt-n
```
- TEP-0002: Custom Tasks
- TEP-0069: Custom Tasks Retries
- TEP-0100: Slim down PipelineRunStatus
- Issue #5248: Decouple Retries implementation between TaskRun reconciler and PipelineRun reconciler
- PR #5393: Clarify the behavior of CustomRun retries