| status | proposed |
|---|---|
| title | Refine Retries for TaskRuns and CustomRuns |
| creation-date | 2022-09-08 |
| last-updated | 2022-10-20 |
| authors | |
| see-also | |
There are two distinct imperfections in Retries that we'd like to address in this TEP:
- `Retries` on `Timeout` is designed inconsistently between TaskRun and CustomRun.
  - For CustomRun, the documentation instructs developers to set `Timeout` for all retry attempts, while in the actual implementation it is set for each retry attempt. See the ref.
  - For a TaskRun created from a PipelineTask, the `Timeout` is set for each retry attempt.
  - For a standalone TaskRun, `Retries` is not implemented.
- Both the `PipelineRun` reconciler and the `TaskRun|CustomRun` reconciler are partially responsible for implementing `Retries` as of today. See tektoncd/pipeline#5248.

Goals:
- `Timeout` must be set for each retry attempt in the four runtime objects that support `Retries` (independent TaskRun, TaskRun part of a Pipeline, independent CustomRun, CustomRun part of a Pipeline), including the no-timeout case (Timeout set to 0).
- The TaskRun reconciler, which is part of the Tekton Pipeline Controller, implements `retries` for two runtime objects (independent TaskRun and TaskRun part of a Pipeline).

Non-goals:
- Define retries behavior for PipelineRuns.
- The collective timeout for `tasks`, the collective timeout for `finally` tasks, and the `timeout` at the `pipeline` level do not change.

The behavior alignment improves UX. Consider the following example:

```yaml
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: custom-task-pipeline
spec:
  tasks:
    - name: task-run-example
      taskRef:
        name: task-run-example
      retries: 1
      timeout: "10s"
    - name: custom-run-example
      taskRef:
        apiVersion: example.dev/v1alpha1
        kind: Example
      retries: 1
      timeout: "10s"
```

Say customers define two child resources within a PipelineRun:
- `task-run-example`
- `custom-run-example`

They set both `retries` and `timeout` for the two resources. Under the current implementation, the two runtime objects behave differently, which is not intuitive:
- `task-run-example` will be retried once after 10s.
- `custom-run-example` will be timed out after 10s. But if the corresponding CustomRun controller applies the timeout to each attempt, as TaskRuns do, instead of to all attempts per the documented guidance, then `custom-run-example` would be retried once after 10s, working similarly to `task-run-example`.
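
To make the difference between the two interpretations concrete, here is a minimal sketch that computes the worst-case time before a run is finally marked as timed out. The `TimeoutScope` type and `WorstCaseDuration` function are hypothetical names used only for illustration; they are not part of the Tekton API, and the bound ignores scheduling overhead.

```go
package main

import (
	"fmt"
	"time"
)

// TimeoutScope is a hypothetical enum naming the two interpretations of
// `timeout` discussed above. It is not part of the Tekton API.
type TimeoutScope int

const (
	PerAttempt  TimeoutScope = iota // timeout applies to each attempt separately
	AllAttempts                     // timeout is one budget shared by all attempts
)

// WorstCaseDuration returns an upper bound on the time until the run is
// finally marked as timed out. A timeout of 0 means "no timeout", so there
// is no finite bound and ok is false.
func WorstCaseDuration(timeout time.Duration, retries int, scope TimeoutScope) (bound time.Duration, ok bool) {
	if timeout == 0 {
		return 0, false
	}
	if scope == PerAttempt {
		// One initial attempt plus `retries` further attempts.
		return timeout * time.Duration(retries+1), true
	}
	return timeout, true
}

func main() {
	perAttempt, _ := WorstCaseDuration(10*time.Second, 1, PerAttempt)
	allAttempts, _ := WorstCaseDuration(10*time.Second, 1, AllAttempts)
	fmt.Println(perAttempt, allAttempts) // prints "20s 10s"
}
```

With `retries: 1` and `timeout: "10s"`, the per-attempt reading allows up to 20s in total, while the all-attempts reading stops at 10s, which is the gap described in the example above.
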
As standalone runtime objects, TaskRuns can be used independently (outside of a PipelineRun) in production environments. Here are several use cases:
- https://github.com/tektoncd/catalog/tree/main/task/send-to-webhook-slack/0.1 which is used in Tekton CI
- https://github.com/tektoncd/catalog/tree/main/task/sendmail/0.1
- Tekton CD: cleanup runs.
Transient errors are everywhere, especially in cloud environments: services can be down for a short period of time, making the entire TaskRun fail. https://learn.microsoft.com/en-us/azure/architecture/best-practices/transient-faults#why-do-transient-faults-occur-in-the-cloud explains how common transient errors are in the cloud.
With retries supported, customers are able to write robust TaskRuns to support such use cases.
In this section, we compare the general retry strategies used in the CI/CD industry and, in particular, whether they retry on timeout (which is where CustomRun and TaskRun deviate), so that we can decide whether the timeout should apply to all retry attempts or to each individual attempt in both CustomRun and TaskRun.
Typically, a retry strategy includes:
- When to retry
- The number of attempts
- Actions to take after a failed attempt
- Timeout of each attempt
- Retry until a certain condition is met
| | Retry Action in GitHub Actions | GitLab Job | Ansible Task | Concourse Step |
|---|---|---|---|---|
| When to retry | on failure | configurable | always retry, conditional stop | configurable |
| Number of attempts | supported | supported | supported | supported |
| Timeout for each attempt | supported | supported | supported | supported |
| Timeout for all attempts | supported | - | - | - |
Several observations regarding the feature table above:
- We can configure the timeout duration per attempt in all CI systems that support the `retry` functionality.
- GitHub Actions doesn't support retry natively, but because of the flexibility of customized actions, some users write their own `retry` action to make it work, and those customized actions even support specifying what to do before retrying a failed attempt.
- Concourse mentioned that the reason it retries per attempt is somewhat arbitrary.
No matter how we implement the retry functionality, we propose to set Timeout for each retry attempt. This proposal is based on the existing behavior and on the investigation of other CI/CD systems; see related work.
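
As an illustration of "Timeout for each retry attempt", the following sketch gives every attempt its own fresh deadline. `RunWithRetries` and `demo` are hypothetical helpers written for this example, not Tekton code; the real reconcilers work on Kubernetes objects rather than in-process function calls.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// RunWithRetries is a hypothetical helper, not Tekton code. It calls attempt
// up to retries+1 times and gives every attempt a fresh deadline of timeout.
// A timeout of 0 means "no timeout", matching the convention in the text.
// It returns the number of attempts made and the last error.
func RunWithRetries(ctx context.Context, retries int, timeout time.Duration,
	attempt func(context.Context) error) (int, error) {
	if retries < 0 {
		retries = 0
	}
	var err error
	for n := 1; n <= retries+1; n++ {
		attemptCtx, cancel := ctx, context.CancelFunc(func() {})
		if timeout > 0 {
			attemptCtx, cancel = context.WithTimeout(ctx, timeout)
		}
		err = attempt(attemptCtx)
		cancel() // release the timer before the next attempt
		if err == nil {
			return n, nil
		}
	}
	return retries + 1, err
}

// demo runs a task that never finishes by itself, so each attempt times out.
func demo() (attempts int, timedOut bool) {
	hang := func(ctx context.Context) error {
		<-ctx.Done()
		return ctx.Err()
	}
	n, err := RunWithRetries(context.Background(), 1, 10*time.Millisecond, hang)
	return n, errors.Is(err, context.DeadlineExceeded)
}

func main() {
	attempts, timedOut := demo()
	fmt.Println(attempts, timedOut) // prints "2 true"
}
```

Because the deadline is created inside the loop, a slow first attempt cannot eat into the time available to the second one.
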
- Stop relying on `len(retriesStatus)` to determine whether a TaskRun or CustomRun finishes; use `ConditionSucceeded` & `ConditionFalse` & `Reason == "TimedOut"` instead.
- `Retries` and `Timeout` are passed from `PipelineTask` to `TaskRunSpec` and `CustomRunSpec`.
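
The condition-based check in the first bullet could look roughly like the sketch below. The `Condition` struct is a simplified, hypothetical stand-in for the Knative condition type, and the literal strings `"Succeeded"`, `"False"` and `"TimedOut"` follow the wording of this TEP; the reason string used by a real controller may differ.

```go
package main

import "fmt"

// Condition is a simplified, hypothetical stand-in for the condition type
// carried in TaskRun and CustomRun status. Only the fields used here exist.
type Condition struct {
	Type   string
	Status string
	Reason string
}

// FinishedOnTimeout reports whether a run has terminally failed on timeout:
// its Succeeded condition is False and the reason is "TimedOut".
// It does not look at retriesStatus at all.
func FinishedOnTimeout(conds []Condition) bool {
	for _, c := range conds {
		if c.Type == "Succeeded" {
			return c.Status == "False" && c.Reason == "TimedOut"
		}
	}
	return false
}

func main() {
	running := []Condition{{Type: "Succeeded", Status: "Unknown", Reason: "Running"}}
	timedOut := []Condition{{Type: "Succeeded", Status: "False", Reason: "TimedOut"}}
	fmt.Println(FinishedOnTimeout(running), FinishedOnTimeout(timedOut)) // prints "false true"
}
```
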
Three sub-options for how to implement `retriesStatus`:
- 1.a: Update `retriesStatus` for each retry attempt for `TaskRun`, keep `retriesStatus` for `CustomRun`
  - No API change
  - Need to implement a strategy for clients to get the previous pod and read its logs.
- 1.b: Update `retriesStatus` for each retry attempt for `TaskRun`, deprecate `retriesStatus` for `CustomRun`
  - No implementation restrictions of `retriesStatus` for `CustomRun`
  - Need to implement a strategy for clients to get the previous pod and read its logs.
- 1.c: Deprecate `retriesStatus` for both `TaskRun` and `CustomRun`, create a new `TaskRun` for each retry attempt, add a new field `RetryAttempts` in `TaskRunStatusFields` to record names of all retry attempts.
  - Easier to retrieve logs from retried TaskRuns.
  - See Appendix - I for more implementation details.
Benefits:
- Improve `Retries` implementation separation by making it only a TaskRun concern
- Consistent interface for retries.
- Consistent termination condition.
- No changes to CustomRun API.
- Standalone TaskRun can retry on its own.
Concerns:
- Dashboard and CLI may need extra work if we remove `retriesStatus`.
- If a CustomRun controller doesn't support retries, it results in a poor user experience, since the PipelineRun controller passes retries directly to the CustomRun and expects the CustomRun controller to implement it.
- Make `retries` a `PipelineRun` concern
- Remove `retries` from `CustomRun` spec
- Move logic for `retries` to PipelineRun reconciler and create new `TaskRun`s and `Run`s at each attempt.
- Remove `retriesStatus` from TaskRun & CustomRun
Benefits:
- Consistent interface for `retries`
- Custom task controller developers get a default implementation of retries for free (by embedding in a pipeline)
- "Pipelines in pipeline" can be retried the same as the other resources
- Improve the retries of TaskRuns created from PipelineTasks by using separate TaskRuns for each retry
- No changes to the PipelineRun API (not in the spec at least)
- No changes to the TaskRun API (not in the spec at least)
Concerns:
- API change for `Run` and `CustomRun` (need to remove `retries` & `retriesStatus`)
  - We are moving Custom Task Run from alpha (Run) to beta (CustomRun) (see TEP-0114), which is good timing for us to remove fields from `Run`.
- Dashboard and CLI may need extra work if we remove `retriesStatus`
- Standalone `TaskRun` can't retry on its own.
- It's not quite user-friendly if a CustomRun controller implements its own retry strategy, for example:

```yaml
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: pr-custom-task-
spec:
  pipelineSpec:
    tasks:
      - name: wait
        timeout: "1s"
        retries: 1 # The common retries field in the PipelineTask
        taskSpec:
          specialized-retries: 5 # Specialized retries field in Custom Task Spec.
          other-spec-fields: foobar
```

Custom task users would be confused about which retries field to use in order to retry a Run.
Retrying pipeline-in-pipeline has a lot of uncertainty, so we'd like to use another TEP to settle it.
One consideration we may want to revisit when designing retries for pipeline-in-pipeline: we may want to focus on retrying the PipelineRun as a whole, rather than retrying some failed child tasks, because the child tasks are already retriable as part of a PipelineRun.
If a CustomRun controller doesn't implement retries (such as the wait task under the experimental folder), this results in a poor user experience, since the PipelineRun controller passes retries directly to the CustomRun and expects the CustomRun controller to implement it.
We've had some discussions in the API WG. We agreed that we expect all CustomRun controllers to implement retries. However, whether they implement it or not is out of our control.
- New `Retries` field in `TaskRunSpec`

```go
type TaskRunSpec struct {
	// Retries represents how many times this task should be retried in case of task failure: ConditionSucceeded set to False
	// +optional
	Retries int
}
```

- New `RetryAttempts` field in `TaskRunStatus`

```go
type TaskRunStatusFields struct {
	// RetryAttempts records the names of TaskRuns which are created for retry
	// +optional
	RetryAttempts []string
}
```

Label `tekton.dev/retry-count: <retry number>` is attached to every TaskRun. For a TaskRun that's not a retry, the retry number will be set to 0.
We'll use this label to decide the value of `context.task.retry-count` (instead of using `len(tr.Status.RetriesStatus)` as in the current implementation).
Label `tekton.dev/retry-parent: <parent taskrun name>` is attached to each retry TaskRun.
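
Deriving `context.task.retry-count` from the label could be sketched as follows. `RetryCount` is a hypothetical helper; treating a missing label as 0 is an assumption made here for robustness, since the text says every TaskRun carries the label.

```go
package main

import (
	"fmt"
	"strconv"
)

const retryCountLabel = "tekton.dev/retry-count"

// RetryCount reads the retry number from a TaskRun's labels.
// Assumption: a missing label is treated as 0, i.e. "not a retry".
func RetryCount(labels map[string]string) (int, error) {
	v, ok := labels[retryCountLabel]
	if !ok {
		return 0, nil
	}
	n, err := strconv.Atoi(v)
	if err != nil || n < 0 {
		return 0, fmt.Errorf("invalid %s label %q", retryCountLabel, v)
	}
	return n, nil
}

func main() {
	n, err := RetryCount(map[string]string{retryCountLabel: "1"})
	fmt.Println(n, err) // prints "1 <nil>"
}
```

Kubernetes label values are strings, so the count has to be parsed rather than read as a number.
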
Say we submit the following TaskRun:

```yaml
apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  name: tr
  labels:
    tekton.dev/retry-count: "0"
spec:
  timeout: 1s
  retries: 1
  # ...
status:
  conditions:
    - status: "True"
      reason: Unknown
  retryAttempts:
```

After 1 second elapses, the TaskRun times out and the TaskRun reconciler needs to retry the TaskRun `tr`:
- Create a new TaskRun `tr-attempt-1`
- Attach the following labels to the new TaskRun
  - `tekton.dev/retry-count: 1`
  - `tekton.dev/retry-parent: tr`
- Add the new TaskRun name to `status.retryAttempts` of its parent TaskRun.
- Update the Reason of the Condition to `Retrying`, keep Status as True.
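
The steps above can be sketched in Go as follows. The `TaskRun` struct is a pared-down, hypothetical stand-in for the real type, and the `<parent>-attempt-<n>` naming pattern is inferred from the `tr-attempt-1` example rather than specified by this TEP.

```go
package main

import (
	"fmt"
	"strconv"
)

// TaskRun is a pared-down, hypothetical stand-in for the real TaskRun type,
// holding only what the steps above touch.
type TaskRun struct {
	Name          string
	Labels        map[string]string
	Reason        string   // reason of the Succeeded condition
	RetryAttempts []string // mirrors status.retryAttempts
}

// NewRetryAttempt creates the next retry TaskRun for parent, labels it,
// records it in the parent's RetryAttempts and marks the parent as Retrying.
func NewRetryAttempt(parent *TaskRun) *TaskRun {
	n := len(parent.RetryAttempts) + 1
	attempt := &TaskRun{
		// Assumed naming pattern, inferred from the example name tr-attempt-1.
		Name: fmt.Sprintf("%s-attempt-%d", parent.Name, n),
		Labels: map[string]string{
			"tekton.dev/retry-count":  strconv.Itoa(n),
			"tekton.dev/retry-parent": parent.Name,
		},
		Reason: "Unknown",
	}
	parent.RetryAttempts = append(parent.RetryAttempts, attempt.Name)
	parent.Reason = "Retrying"
	return attempt
}

func main() {
	tr := &TaskRun{
		Name:   "tr",
		Labels: map[string]string{"tekton.dev/retry-count": "0"},
		Reason: "Unknown",
	}
	a := NewRetryAttempt(tr)
	fmt.Println(a.Name, a.Labels["tekton.dev/retry-count"], tr.Reason, tr.RetryAttempts)
	// prints "tr-attempt-1 1 Retrying [tr-attempt-1]"
}
```
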
Now we have two TaskRuns:

```yaml
apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  name: tr
  labels:
    tekton.dev/retry-count: "0"
spec:
  timeout: 1s
  retries: 1
  # ...
status:
  conditions:
    - status: "True"
      reason: Retrying
  retryAttempts:
    - tr-attempt-1
---
apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  name: tr-attempt-1
  labels:
    tekton.dev/retry-count: "1"
    tekton.dev/retry-parent: tr
spec:
  timeout: 1s
  retries: 1
  # ...
status:
  conditions:
    - status: "True"
      reason: Unknown
  retryAttempts:
```

After another second elapses, `tr-attempt-1` times out.
In the reconciliation loop of `tr-attempt-1`, the reconciler finds that the value of `tekton.dev/retry-count` equals `Spec.Retries`, so it updates the Condition of `tr-attempt-1` to `Status=False`, `Reason=TimedOut`.
Then, in the reconciliation loop of `tr`, the reconciler finds that the last attempt in `retryAttempts` is `tr-attempt-1` and that it has already failed with `TimedOut`, so it updates the condition of `tr` to `Status=False`, `Reason=TimedOut`.

```yaml
apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  name: tr
  labels:
    tekton.dev/retry-count: "0"
spec:
  timeout: 1s
  retries: 1
  # ...
status:
  conditions:
    - status: "False"
      reason: TimedOut
  retryAttempts:
    - tr-attempt-1
---
apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  name: tr-attempt-1
  labels:
    tekton.dev/retry-count: "1"
    tekton.dev/retry-parent: tr
spec:
  timeout: 1s
  retries: 1
  # ...
status:
  conditions:
    - status: "False"
      reason: TimedOut
  retryAttempts:
```

The relationship between the original TaskRun and the TaskRuns created for retry is:

```
          originalTaskRun
           /           \
taskRun-attempt-1  ...  taskRun-attempt-n
```

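
The termination check from the walkthrough above, where the parent fails once its last attempt has failed and no retries remain, might be sketched like this. `Attempt` and `ParentFailed` are hypothetical names; the text compares the retry count to `Spec.Retries` for equality, and the sketch uses `>=` as a defensive assumption. How a successful attempt propagates to the parent is not described above and is left out.

```go
package main

import "fmt"

// Attempt is a hypothetical, minimal view of one retry TaskRun.
type Attempt struct {
	RetryCount int    // value of the tekton.dev/retry-count label
	Status     string // status of the Succeeded condition
	Reason     string // reason of the Succeeded condition
}

// ParentFailed reports whether the parent TaskRun should be marked as failed,
// and with which reason: the last attempt has failed and its retry count has
// reached maxRetries, so no further attempt will be created.
func ParentFailed(attempts []Attempt, maxRetries int) (reason string, failed bool) {
	if len(attempts) == 0 {
		return "", false
	}
	last := attempts[len(attempts)-1]
	// Assumption: >= instead of the equality check in the text, to be defensive.
	if last.Status == "False" && last.RetryCount >= maxRetries {
		return last.Reason, true
	}
	return "", false
}

func main() {
	attempts := []Attempt{{RetryCount: 1, Status: "False", Reason: "TimedOut"}}
	reason, failed := ParentFailed(attempts, 1)
	fmt.Println(reason, failed) // prints "TimedOut true"
}
```
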
- TEP-0002: Custom Tasks
- TEP-0069: Custom Tasks Retries
- TEP-0100: Slim down PipelineRunStatus
- Issue #5248: Decouple Retries implementation between TaskRun reconciler and PipelineRun reconciler
- PR #5393: Clarify the behavior of CustomRun retries