Troubleshooting
This page is part of the Temporal Knowledge Hub.
Document the path for application teams to escalate issues to the platform team.
This article documents how to observe and troubleshoot Temporal Workflows and Workers across environments (e.g. dev, prd).
Detection
The first step in troubleshooting is to collect Temporal Workflow telemetry and understand the issue.
Provide a monitoring dashboard for your application teams to troubleshoot Temporal applications.
At ABC Financial, the following observability tools are supported for Temporal Cloud:
| Tool | Purpose | What it answers |
|---|---|---|
| Temporal Cloud UI | Source of truth for Temporal Workflow Event History, status, and traces. | What happened to the Workflow? What is the current Workflow status? |
| go/temporal-dashboard | Provides single-pane-of-glass monitoring for logs, metrics, and traces across ABC Financial applications. | Are the Workers healthy and sufficiently scaled? What happened to the upstream and downstream services? |
Gather context
Before troubleshooting, collect the following information (the sketch after this list shows one way to pull Workflow status programmatically):
- Namespace: Which Temporal Cloud namespace?
- Workflow ID: Specific Workflow instance(s) affected
- Time window: When did the issue start? Is it ongoing or intermittent?
- Recent changes: Any recent deployments or configuration updates?
- Impact scope: Single Workflow, specific Workflow Type, or entire Namespace?
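If it helps, the Workflow's current status, Task Queue, and start time can also be pulled programmatically. Below is a minimal sketch using the Temporal Go SDK; the endpoint, Namespace, and Workflow ID are hypothetical placeholders, and credentials are omitted (reuse your team's existing Client configuration):

```go
package main

import (
	"context"
	"fmt"
	"log"

	"go.temporal.io/sdk/client"
)

func main() {
	// Placeholder connection values for illustration only; substitute your own
	// Temporal Cloud endpoint, Namespace, and credentials (mTLS or API key).
	c, err := client.Dial(client.Options{
		HostPort:  "my-namespace.a1b2c.tmprl.cloud:7233",
		Namespace: "my-namespace.a1b2c",
	})
	if err != nil {
		log.Fatalf("unable to create Temporal client: %v", err)
	}
	defer c.Close()

	// Describe the affected Workflow to capture its status, Task Queue, and start time.
	resp, err := c.DescribeWorkflowExecution(context.Background(), "order-12345", "")
	if err != nil {
		log.Fatalf("describe failed: %v", err)
	}
	info := resp.GetWorkflowExecutionInfo()
	fmt.Println("Status:    ", info.GetStatus())
	fmt.Println("Task Queue:", info.GetTaskQueue())
	fmt.Println("Started:   ", info.GetStartTime())
}
```

The same details are visible on the Workflow's page in the Temporal Cloud UI, which remains the source of truth for Event History.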
Quick health checks
Perform these checks before detailed investigation:
- Is Temporal Cloud healthy?
  - Check status.temporal.io.
- Are Workers healthy? (a poller-check sketch follows this list)
  - go/temporal-dashboard → Infrastructure → Filter by `service:temporal`
- Are there recent deployments?
  - Check the relevant Slack channel.
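To confirm that Workers are actually polling, a quick check is to list the pollers on the Task Queue. This sketch assumes the Go SDK, a hypothetical Task Queue name, and an already-connected Client `c` (as in the sketch above):

```go
import (
	"context"
	"fmt"
	"log"

	enumspb "go.temporal.io/api/enums/v1"
	"go.temporal.io/sdk/client"
)

// checkPollers prints the Worker identities currently polling a Task Queue.
// An empty poller list usually means no healthy Worker is connected for that queue.
func checkPollers(ctx context.Context, c client.Client, taskQueue string) error {
	resp, err := c.DescribeTaskQueue(ctx, taskQueue, enumspb.TASK_QUEUE_TYPE_WORKFLOW)
	if err != nil {
		return err
	}
	if len(resp.GetPollers()) == 0 {
		log.Printf("no Workflow Task pollers found on %q", taskQueue)
	}
	for _, p := range resp.GetPollers() {
		fmt.Printf("poller %s, last access %v\n", p.GetIdentity(), p.GetLastAccessTime())
	}
	return nil
}
```

The Temporal Cloud UI's Task Queue view and the `temporal task-queue describe` CLI command expose the same poller information.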
Respond
Include runbooks for common Temporal issues.
Common issues and troubleshooting steps
1. Workflow not starting
Symptoms: The Workflow appears as Running in the Temporal Cloud UI, but it makes no progress.
Troubleshooting:
- Check Worker Registration
  - Datadog → Logs → Filter: `service:temporal "Registered workflow"`
  - Verify your Workflow Type appears in Worker startup logs (see the Worker sketch after this list)
- Verify Task Queue
  - Temporal UI → Search for Workflows on your Task Queue
  - Confirm the Task Queue name matches exactly (case-sensitive) between the Temporal Client and the Worker
- Check Client Connection
  - Datadog → Filter by your application service name
  - Search for: `"Temporal" AND "connection" OR "authentication"`
  - Look for API key or connection errors (a connection sketch follows the Fix list below)
Fix:
- Redeploy the Worker if the Workflow Type is not registered.
- Correct any Task Queue name mismatch in code.
- Contact the Temporal platform team for API key issues.
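Connection and authentication failures usually surface when the Client is created. Below is a rough sketch of a Temporal Cloud connection using an API key with the Go SDK; the regional endpoint, Namespace, and environment variable name are assumptions, and your application may use mTLS certificates instead:

```go
import (
	"crypto/tls"
	"log"
	"os"

	"go.temporal.io/sdk/client"
)

// newCloudClient dials Temporal Cloud with an API key. Connection or
// authentication problems surface here as Dial errors, which is what the
// Datadog log search above is looking for.
func newCloudClient() (client.Client, error) {
	apiKey := os.Getenv("TEMPORAL_API_KEY") // hypothetical variable name
	c, err := client.Dial(client.Options{
		HostPort:    "us-east-1.aws.api.temporal.io:7233", // hypothetical regional endpoint
		Namespace:   "my-namespace.a1b2c",                 // hypothetical Namespace
		Credentials: client.NewAPIKeyStaticCredentials(apiKey),
		ConnectionOptions: client.ConnectionOptions{
			TLS: &tls.Config{},
		},
	})
	if err != nil {
		log.Printf("Temporal connection failed: %v", err)
		return nil, err
	}
	return c, nil
}
```

If the error mentions an invalid or expired API key, raise it with the Temporal platform team as described above.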
Escalation
Escalate to the Temporal platform team when the issue persists after following the troubleshooting steps above.
Include the following information in your request:
1. Temporal Cloud Namespace
2. Workflow ID(s) and time window
3. Description of the issue
4. Context collected (from the Detection section)
5. Troubleshooting steps already attempted
6. Other helpful information (e.g. screenshots)
Response time SLA
- P1 (Production outage): 30 minutes
- P2 (Degraded performance): 4 hours
- P3 (Non-urgent issues): 1 business day
Alerts
It is the application team's responsibility to detect Temporal issues, so create appropriate alerts to catch problems early.
Include relevant alert definitions for your engineering teams.
Here are some example alerts (a metrics-wiring sketch follows the table):
| Alert name | Metric | Condition | Channel |
|---|---|---|---|
| High Workflow failure rate | temporal.workflow.failed | > 10% failure rate over 10 minutes | Page |
| High Activity Schedule-to-Start latency | temporal.activity.schedule_to_start_latency (p95) | > 30 seconds for 15 minutes | Slack |
| High Worker CPU utilization | kubernetes.cpu.usage.pct | > 80% for 10 minutes | Slack |
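These alerts assume the SDK metrics are actually being exported from your Workers. The following is a minimal sketch, adapted from the public Temporal Go SDK samples, of wiring the SDK's metrics handler to a Prometheus scrape endpoint; the port is an assumption, and at ABC Financial the metrics may instead be shipped through the Datadog Agent:

```go
import (
	"log"
	"time"

	prom "github.com/prometheus/client_golang/prometheus"
	"github.com/uber-go/tally/v4"
	"github.com/uber-go/tally/v4/prometheus"
	"go.temporal.io/sdk/client"
	sdktally "go.temporal.io/sdk/contrib/tally"
)

// newPrometheusScope exposes the SDK metrics referenced in the alert table above
// on a local /metrics endpoint for scraping.
func newPrometheusScope() tally.Scope {
	cfg := prometheus.Configuration{
		ListenAddress: "0.0.0.0:9090", // hypothetical metrics port
		TimerType:     "histogram",
	}
	reporter, err := cfg.NewReporter(prometheus.ConfigurationOptions{
		Registry: prom.NewRegistry(),
		OnError:  func(err error) { log.Println("prometheus reporter error:", err) },
	})
	if err != nil {
		log.Fatalln("error creating prometheus reporter:", err)
	}
	scope, _ := tally.NewRootScope(tally.ScopeOptions{
		CachedReporter:  reporter,
		Separator:       prometheus.DefaultSeparator,
		SanitizeOptions: &sdktally.PrometheusSanitizeOptions,
	}, time.Second)
	return sdktally.NewPrometheusNamingScope(scope)
}

// Pass the handler when creating the Client so Worker, Workflow, and Activity
// metrics are emitted by the SDK.
func metricsClientOptions() client.Options {
	return client.Options{
		MetricsHandler: sdktally.NewMetricsHandler(newPrometheusScope()),
	}
}
```

Whatever the transport, verify in go/temporal-dashboard that the metrics listed in the table above are arriving before relying on these alerts.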
Need help?
- Learn how the Temporal platform can support you.
- Reach out to the Temporal platform team via the #temporal-support Slack channel.