Troubleshooting
This page is part of the Temporal Knowledge Hub.
Document the path for application teams to escalate issues to the platform team.
This article documents how to observe and troubleshoot Temporal Workflows and Workers across environments (e.g. dev, prd).
Detection
The first step in troubleshooting is to collect Temporal Workflow telemetry and understand the issue.
Provide a monitoring dashboard for your application teams to troubleshoot Temporal applications.
At ABC Financial, the following observability tools are supported for Temporal Cloud:
| Tool | Purpose | What it answers |
|---|---|---|
| Temporal Cloud UI | Source of truth for Temporal Workflow Event History, status, and traces. | What happened to the Workflow? What is the current Workflow status? |
| go/temporal-dashboard | Provides single-pane-of-glass monitoring for logs, metrics, and traces across ABC Financial applications. | Are the Workers healthy and sufficiently scaled? What happened to the upstream and downstream services? |
Gather context
Before troubleshooting, collect the following information (the sketch after this list shows one way to pull Workflow status programmatically):
- Namespace: Which Temporal Cloud namespace?
- Workflow ID: Specific Workflow instance(s) affected
- Time window: When did the issue start? Is it ongoing or intermittent?
- Recent changes: Any recent deployments or configuration updates?
- Impact scope: Single Workflow, specific Workflow Type, or entire Namespace?
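If it helps, the Workflow's current status, Task Queue, and start time can also be pulled programmatically. Below is a minimal sketch using the Temporal Go SDK; the endpoint, Namespace, and Workflow ID are hypothetical placeholders, and credentials are omitted (reuse your team's existing Client configuration):

```go
package main

import (
	"context"
	"fmt"
	"log"

	"go.temporal.io/sdk/client"
)

func main() {
	// Placeholder connection values for illustration only; substitute your own
	// Temporal Cloud endpoint, Namespace, and credentials (mTLS or API key).
	c, err := client.Dial(client.Options{
		HostPort:  "my-namespace.a1b2c.tmprl.cloud:7233",
		Namespace: "my-namespace.a1b2c",
	})
	if err != nil {
		log.Fatalf("unable to create Temporal client: %v", err)
	}
	defer c.Close()

	// Describe the affected Workflow to capture its status, Task Queue, and start time.
	resp, err := c.DescribeWorkflowExecution(context.Background(), "order-12345", "")
	if err != nil {
		log.Fatalf("describe failed: %v", err)
	}
	info := resp.GetWorkflowExecutionInfo()
	fmt.Println("Status:    ", info.GetStatus())
	fmt.Println("Task Queue:", info.GetTaskQueue())
	fmt.Println("Started:   ", info.GetStartTime())
}
```

The same details are visible on the Workflow's page in the Temporal Cloud UI, which remains the source of truth for Event History.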
Quick health checks
Perform these checks before detailed investigation:
- Is Temporal Cloud healthy?
  - Check status.temporal.io.
- Are Workers healthy? (a poller-check sketch follows this list)
  - go/temporal-dashboard → Infrastructure → Filter by `service:temporal`
- Are there recent deployments?
  - Check the relevant Slack channel.
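To confirm that Workers are actually polling, a quick check is to list the pollers on the Task Queue. This sketch assumes the Go SDK, a hypothetical Task Queue name, and an already-connected Client `c` (as in the sketch above):

```go
import (
	"context"
	"fmt"
	"log"

	enumspb "go.temporal.io/api/enums/v1"
	"go.temporal.io/sdk/client"
)

// checkPollers prints the Worker identities currently polling a Task Queue.
// An empty poller list usually means no healthy Worker is connected for that queue.
func checkPollers(ctx context.Context, c client.Client, taskQueue string) error {
	resp, err := c.DescribeTaskQueue(ctx, taskQueue, enumspb.TASK_QUEUE_TYPE_WORKFLOW)
	if err != nil {
		return err
	}
	if len(resp.GetPollers()) == 0 {
		log.Printf("no Workflow Task pollers found on %q", taskQueue)
	}
	for _, p := range resp.GetPollers() {
		fmt.Printf("poller %s, last access %v\n", p.GetIdentity(), p.GetLastAccessTime())
	}
	return nil
}
```

The Temporal Cloud UI's Task Queue view and the `temporal task-queue describe` CLI command expose the same poller information.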
Respond
Include runbooks for common Temporal issues.
Common issues and troubleshooting steps
1. Workflow not starting
Symptoms: The Workflow appears as Running in the Temporal Cloud UI, but it makes no progress.
Troubleshooting:
- Check Worker Registration
  - Datadog → Logs → Filter: `service:temporal "Registered workflow"`
  - Verify your Workflow Type appears in Worker startup logs (see the Worker sketch after this list)
- Verify Task Queue
  - Temporal UI → Search for Workflows on your Task Queue
  - Confirm the Task Queue name matches exactly (case-sensitive) between the Temporal Client and the Worker
- Check Client Connection
  - Datadog → Filter by your application service name
  - Search for: `"Temporal" AND "connection" OR "authentication"`
  - Look for API key or connection errors (a connection sketch follows the Fix list below)
Fix:
- Redeploy the Worker if the Workflow Type is not registered.
- Correct any Task Queue name mismatch in code.
- Contact the Temporal platform team for API key issues.
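Connection and authentication failures usually surface when the Client is created. Below is a rough sketch of a Temporal Cloud connection using an API key with the Go SDK; the regional endpoint, Namespace, and environment variable name are assumptions, and your application may use mTLS certificates instead:

```go
import (
	"crypto/tls"
	"log"
	"os"

	"go.temporal.io/sdk/client"
)

// newCloudClient dials Temporal Cloud with an API key. Connection or
// authentication problems surface here as Dial errors, which is what the
// Datadog log search above is looking for.
func newCloudClient() (client.Client, error) {
	apiKey := os.Getenv("TEMPORAL_API_KEY") // hypothetical variable name
	c, err := client.Dial(client.Options{
		HostPort:    "us-east-1.aws.api.temporal.io:7233", // hypothetical regional endpoint
		Namespace:   "my-namespace.a1b2c",                 // hypothetical Namespace
		Credentials: client.NewAPIKeyStaticCredentials(apiKey),
		ConnectionOptions: client.ConnectionOptions{
			TLS: &tls.Config{},
		},
	})
	if err != nil {
		log.Printf("Temporal connection failed: %v", err)
		return nil, err
	}
	return c, nil
}
```

If the error mentions an invalid or expired API key, raise it with the Temporal platform team as described above.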
Escalation
Escalate to the Temporal platform team when the issue persists after following the troubleshooting steps above.
Include the following information in your request:
1. Temporal Cloud Namespace
2. Workflow ID(s) and time window
3. Description of the issue
4. Context collected (from the Detection section)
5. Troubleshooting steps already attempted
6. Other helpful information (e.g. screenshots)
Response time SLA
- P1 (Production outage): 30 minutes
- P2 (Degraded performance): 4 hours
- P3 (Non-urgent issues): 1 business day
Alerts
It is the application team's responsibility to detect Temporal issues, so create appropriate alerts to catch problems early.
Include relevant alert definitions for your engineering teams.
Here are some example alerts (a metrics-wiring sketch follows the table):
| Alert name | Metric | Condition | Channel |
|---|---|---|---|
| High Workflow failure rate | temporal.workflow.failed | > 10% failure rate over 10 minutes | Page |
| High Activity Schedule-to-Start latency | temporal.activity.schedule_to_start_latency (p95) | > 30 seconds for 15 minutes | Slack |
| High Worker CPU utilization | kubernetes.cpu.usage.pct | > 80% for 10 minutes | Slack |
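These alerts assume the SDK metrics are actually being exported from your Workers. The following is a minimal sketch, adapted from the public Temporal Go SDK samples, of wiring the SDK's metrics handler to a Prometheus scrape endpoint; the port is an assumption, and at ABC Financial the metrics may instead be shipped through the Datadog Agent:

```go
import (
	"log"
	"time"

	prom "github.com/prometheus/client_golang/prometheus"
	"github.com/uber-go/tally/v4"
	"github.com/uber-go/tally/v4/prometheus"
	"go.temporal.io/sdk/client"
	sdktally "go.temporal.io/sdk/contrib/tally"
)

// newPrometheusScope exposes the SDK metrics referenced in the alert table above
// on a local /metrics endpoint for scraping.
func newPrometheusScope() tally.Scope {
	cfg := prometheus.Configuration{
		ListenAddress: "0.0.0.0:9090", // hypothetical metrics port
		TimerType:     "histogram",
	}
	reporter, err := cfg.NewReporter(prometheus.ConfigurationOptions{
		Registry: prom.NewRegistry(),
		OnError:  func(err error) { log.Println("prometheus reporter error:", err) },
	})
	if err != nil {
		log.Fatalln("error creating prometheus reporter:", err)
	}
	scope, _ := tally.NewRootScope(tally.ScopeOptions{
		CachedReporter:  reporter,
		Separator:       prometheus.DefaultSeparator,
		SanitizeOptions: &sdktally.PrometheusSanitizeOptions,
	}, time.Second)
	return sdktally.NewPrometheusNamingScope(scope)
}

// Pass the handler when creating the Client so Worker, Workflow, and Activity
// metrics are emitted by the SDK.
func metricsClientOptions() client.Options {
	return client.Options{
		MetricsHandler: sdktally.NewMetricsHandler(newPrometheusScope()),
	}
}
```

Whatever the transport, verify in go/temporal-dashboard that the metrics listed in the table above are arriving before relying on these alerts.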
Need help?
- Learn how the Temporal platform can support you.
- Reach out to the Temporal platform team via the #temporal-support Slack channel.