Yes, absolutely—you can get full tracing, spans, and rich data in Datadog even if everything currently just logs → CloudWatch → Datadog.
But there’s an important distinction:
• What you have now:
  • Logs from Lambda and Step Functions go to CloudWatch, then to Datadog Logs.
  • This is logging only (unless you embed trace IDs yourself).
• What you want:
  • APM tracing + spans (end-to-end per transcript / per Lambda execution).
  • Log ↔ trace correlation, service maps, latency breakdown, etc.
That is possible—you just need to add Datadog’s tracing + Lambda integration on top of the existing log pipeline.
1. Logs vs Traces (why logs-only isn’t enough)
Right now, you likely have:
• CloudWatch log group → Datadog forwarder → Datadog Logs.
• You see Lambda logs, maybe Step Functions state logs.
• But:
  • No traces (no trace_id / span_id dimension).
  • No breakdown like llm_score vs preprocess vs db_load.
  • No request-level timelines.
Datadog tracing adds another dimension:
• Each Lambda invocation becomes a trace with one or more spans.
• You can add spans around:
  • LLM calls
  • Preprocessing
  • DB writes
• Logs can be automatically correlated to the trace (so from a log you can jump into the trace view).
2. How to get tracing for Python Lambdas
You can keep your CloudWatch → Datadog Logs setup and add APM:
Step 1 – Use the Datadog Lambda Layer / Extension
For each Lambda (Python):
• Add the Datadog Lambda layer for your region/runtime.
• Set env vars like the following (a configuration sketch follows this list):
  • DD_API_KEY, or use the Forwarder/Extension
  • DD_SITE (e.g., datadoghq.com)
  • DD_SERVICE=your-pipeline
  • DD_ENV=prod (or staging)
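For example, if you happen to define these functions with the AWS CDK in Python, the layer and environment variables might be wired up roughly like this. This is a minimal sketch: the layer ARN, secret ARN, stack/function names, and handler path are placeholders, not values from your setup.

from aws_cdk import Stack, aws_lambda as _lambda
from constructs import Construct

class PipelineStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Placeholder ARN: look up the current Datadog Python layer for your region/runtime.
        dd_layer = _lambda.LayerVersion.from_layer_version_arn(
            self,
            "DatadogLayer",
            "arn:aws:lambda:<region>:464622532012:layer:Datadog-Python311:<version>",
        )

        _lambda.Function(
            self,
            "LlmScoringFn",  # hypothetical function name
            runtime=_lambda.Runtime.PYTHON_3_11,
            handler="handler.handler",
            code=_lambda.Code.from_asset("src"),
            layers=[dd_layer],
            environment={
                "DD_SITE": "datadoghq.com",
                "DD_SERVICE": "your-pipeline",
                "DD_ENV": "prod",
                # Either DD_API_KEY directly, a Secrets Manager ARN, or rely on the Forwarder.
                "DD_API_KEY_SECRET_ARN": "<secrets-manager-arn>",
            },
        )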
Step 2 – Wrap your handler
Instead of:
def handler(event, context):
    ...
You do:
from datadog_lambda.wrapper import datadog_lambda_wrapper
from datadog_lambda.metric import lambda_metric  # only needed for custom metrics (example below)

@datadog_lambda_wrapper
def handler(event, context):
    # your code
    ...
This gives you:
• A root span per Lambda invocation
• Cold start tagging
• Auto-instrumentation for some libraries (depending on setup)
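Side note: the lambda_metric import above is only needed if you also want to submit custom metrics from the handler. A minimal sketch (the metric name and tags are made up for illustration):

from datadog_lambda.metric import lambda_metric

# Hypothetical custom metric: one count per transcript scored, tagged by env/service.
lambda_metric(
    "pipeline.transcripts_scored",  # illustrative metric name
    1,
    tags=["env:prod", "service:your-pipeline"],
)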
Step 3 – Add manual spans around key stages
To get rich detail, wrap your pipeline steps:
from ddtrace import tracer

def process_transcript(call_id):
    with tracer.trace("pipeline.fetch_transcript") as span:
        span.set_tag("call_id", call_id)
        transcript = fetch_transcript(call_id)

    with tracer.trace("pipeline.preprocess") as span:
        span.set_tag("call_id", call_id)
        cleaned = preprocess_transcript(transcript)

    with tracer.trace("pipeline.llm_score") as span:
        span.set_tag("call_id", call_id)
        scores = score_with_llm(cleaned)

    with tracer.trace("pipeline.db_load") as span:
        span.set_tag("call_id", call_id)
        load_scores(call_id, scores)

    return scores
Now in Datadog APM you’ll see:
• The Lambda span at the top.
• Child spans:
  • pipeline.fetch_transcript
  • pipeline.preprocess
  • pipeline.llm_score
  • pipeline.db_load
You can quantify exactly where time and errors happen.
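Errors surface the same way: if an exception escapes a tracer.trace(...) block, ddtrace records it on that span (error flag, message, traceback) before finishing it. A sketch, assuming the score_with_llm function above and a TimeoutError from your LLM client:

from ddtrace import tracer

def score_with_span(call_id, cleaned):
    with tracer.trace("pipeline.llm_score") as span:
        span.set_tag("call_id", call_id)
        try:
            return score_with_llm(cleaned)
        except TimeoutError:
            # Add extra context, then re-raise; ddtrace marks the span as
            # errored when the exception leaves the `with` block.
            span.set_tag("llm.timeout", True)
            raise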
Step 4 – Correlate logs with traces
If you use Datadog’s logging integration (or enable DD_LOGS_INJECTION=true), Datadog will:
• Inject trace_id and span_id into your logs automatically.
• When logs reach Datadog (via the CloudWatch → Datadog path you already have), the UI can link:
  • log → trace
  • trace → logs
This is what gets you the “rich” experience: from one failed LLM call log, jump into the complete per-transcript trace.
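A sketch of what the application side can look like with the standard library logger, assuming DD_LOGS_INJECTION=true so ddtrace adds the dd.* attributes to every log record (the format keys follow Datadog's documented pattern; the message and call_id are illustrative):

import logging

FORMAT = ("%(asctime)s %(levelname)s [%(name)s] "
          "[dd.service=%(dd.service)s dd.trace_id=%(dd.trace_id)s dd.span_id=%(dd.span_id)s] "
          "%(message)s")

logging.basicConfig(format=FORMAT, level=logging.INFO)
log = logging.getLogger(__name__)

# Inside a traced handler, this line carries the active trace/span IDs, so once it
# reaches Datadog via CloudWatch it can be linked to the corresponding APM trace.
log.info("scored transcript call_id=abc-123")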
3. What about Step Functions?
Step Functions themselves don’t run your code, but:
• They emit execution and state-transition logs to CloudWatch.
• Datadog has a Step Functions integration that:
  • Pulls execution metrics and events.
  • Can visualize Step Functions as part of your service map.
For tracing across Step Functions and Lambdas:
• Each Lambda invoked by Step Functions can be traced using the Lambda Layer + wrapper as above.
• You can pass correlation IDs (like call_id, execution_arn), as sketched after this section, via:
  • Input/output payloads, and
  • Tags on spans/logs.
Then in Datadog, you can:
• See metrics per Step Function execution (success/failure, duration).
• Drill into the individual Lambda traces for each step.
Even if you don’t get a fully contiguous “single trace across all steps”, you do get:
• Traces for each Lambda.
• Logs for the Step Function with execution ARN.
• Shared tags (call_id, step_function_name) that let you pivot between them.
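For the correlation-ID part, a minimal sketch: assume the state machine forwards the execution ARN into each Lambda's input (e.g. via a Parameters mapping like "execution_arn.$": "$$.Execution.Id"), and the handler tags its spans with it. The event keys here are assumptions about your payload shape, not your actual schema:

from datadog_lambda.wrapper import datadog_lambda_wrapper
from ddtrace import tracer

@datadog_lambda_wrapper
def handler(event, context):
    # Assumed payload shape: the state machine passes these fields through.
    call_id = event.get("call_id")
    execution_arn = event.get("execution_arn")

    with tracer.trace("pipeline.step") as span:
        span.set_tag("call_id", call_id)
        span.set_tag("step_function.execution_arn", execution_arn)
        # ... do the work for this step ...

    return {"call_id": call_id, "status": "ok"}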
4. Does this work with container-image Lambdas?
Yes:
• If you package Lambdas as Docker images, you just:
  • Add the ddtrace and datadog-lambda libraries via requirements.txt.
  • Set up the Datadog Lambda Extension inside the image (container-image functions can't use layers, so the extension is copied into the Dockerfile).
  • Wrap the handler the same way (@datadog_lambda_wrapper).
• Logs still go to CloudWatch → Datadog Logs.
• Spans/traces are sent to Datadog via the extension/agent.
5. What “rich data” you can expect
With this setup you can get:
• Per-transcript view:
  • Total Lambda execution time.
  • Breakdown by stage (fetch → preprocess → LLM → load).
  • Errors localized to a span (e.g., “LLM timeout”).
• Per-step metrics:
  • Count, p95 latency, and error rate for each span name.
• Cross-service map:
  • Step Functions (as a service)
  • Lambdas (as services)
  • LLM gateway (if instrumented with ddtrace too)
  • Postgres (if using supported client libs)
• Log ↔ trace correlation:
  • From a log line with a specific call_id and trace_id, jump to the exact trace in APM.
6. Direct answer to your question
Is it possible to set up tracing and spans and have rich data in Datadog with logs from Lambda → CloudWatch → Datadog, and Step Functions too?
Yes.
• Your existing log shipping path (Lambda/Step Functions → CloudWatch → Datadog) is fine and can stay.
• To get tracing + spans + rich data, you add:
  • Datadog Lambda Layer / Extension.
  • The Python wrapper/decorator for Lambdas.
  • Optional manual spans & tags in your pipeline code.
  • (And optionally a Step Functions integration for execution metrics.)
If you’d like, I can write a concrete example for one of your real Lambdas (e.g., the LLM scoring Lambda) showing:
• Updated handler code with Datadog wrapper.
• Manual spans for scoring.
• Example of a metric (e.g., tokens processed) and tags to add.