Chelsea and I have been looking at adding end-to-end OpenTelemetry tracing support to Kata Containers.
With Kata 2.x's simpler architecture, we only need tracing support in:
- hypervisor (ideally)
- runtime
- agent
Note: Not considering hypervisor tracing at the present time.
Good progress, but we still need a set of consistent and standardised tracing span tags.
"Beset by problems" (mostly resolved).
- Rebase hell
- Fast pace of change (good and bad).
- Testing tracing changes is slow.
- Lots and lots and lots and lots of rebasing...
- Rewrites for API changes and async agent changes.
- Conversion of VSOCK exporter and trace forwarder to async (done)
- Agent shutdown (done)
- This is only required for tracing.
- Required a lot of refactoring and we hit a number of bugs on the way.
- However, the shutdown PR landed and the code is now much cleaner.
- ... and the agent can be shut down!
- Agent shutdown needs a test (99% done)
- Debugging tracing problems is very hard.
- Need to know the agent is ending gracefully.
- Has to be fairly elaborate.
- Took a long time to write and test.
- PR raised.
- Still not passing in CI env ;(
- Issues with Rust tracing crates (done)
- A world of pain :-)
- The Rust tracing landscape is still in a state of flux.
- Problems with span hierarchy (in progress)
- Some code cannot be traced without major invasive surgery.
- Therefore, we register a global tracer.
- BUT, this means the span hierarchy is difficult to get right.
- This is not fully resolved yet.
- Raise a foundational agent tracing PR this week
- Code is not that useful currently (due to the span hierarchy issue).
- However, worth landing this now: by landing the basics, we can ratchet up the support iteratively (aka I can avoid wasting lots of time constantly rebasing and re-testing!)
- Send tracing summary to the mailing list.
- "Call to arms"
- Developing and testing tracing is very time consuming.
- Need help from community to speed up progress.
- Volunteer now! ;)
- Ideally, we could use "follow the sun" to speed up landing this feature.
See the new GitHub project.
Presented today at the Kata Containers Architecture Committee meeting.
To view as HTML, I ran this: