AWS Step Functions: Orchestrating Distributed Workflows
Step Functions lets you coordinate Lambda functions, ECS tasks, DynamoDB operations, and external services into durable state machines, with built-in error handling, retries, parallel execution, and audit history. This article covers state machine types (Standard vs Express), state types (Task, Choice, Parallel, Map, Wait), error handling patterns, the integration catalog, and operational considerations for production workflows.

Distributed workflows are hard. When a multi-step process spans Lambda functions, database writes, external API calls, and human approval gates, you need something to track state, handle failures, and retry individual steps without re-running the entire workflow from scratch. That something is Step Functions.
A Step Functions state machine defines your workflow as a JSON or YAML document (Amazon States Language). Each state is an explicit step — a Lambda invocation, an SDK call, a wait, a conditional branch. When a step fails, Step Functions retries it with configurable backoff. If it fails permanently, it routes to an error handler. Every execution is logged with the full input, output, and history of each state transition.
Standard vs Express Workflows
Step Functions has two execution models with fundamentally different guarantees:
| | Standard | Express |
|---|---|---|
| Execution duration | Up to 1 year | Up to 5 minutes |
| Execution semantics | Exactly-once | At-least-once |
| Execution history | Full in console + CloudWatch | CloudWatch only |
| Pricing | Per state transition ($0.025/1,000) | Per execution + duration |
| Throughput | 2,000 executions/second | 100,000+ executions/second |
| Use case | Long-running, auditable business processes | High-volume, short-duration workflows |
Standard workflows are the right default for most business processes: order fulfillment, user onboarding flows, data pipelines, approval workflows. The exactly-once guarantee means Step Functions itself never runs a Task state more than once per execution; a step runs again only when a Retry rule you defined asks for it.
Express workflows are right when you need high throughput at low cost and can tolerate at-least-once semantics — event processing pipelines, IoT data transformation, real-time stream processing. For Express, your Lambda functions must be idempotent since they may execute more than once.
```bash
# Create a Standard workflow
aws stepfunctions create-state-machine \
  --name order-fulfillment \
  --type STANDARD \
  --definition file://order-fulfillment.asl.json \
  --role-arn arn:aws:iam::012345678901:role/StepFunctionsRole

# Create an Express workflow
aws stepfunctions create-state-machine \
  --name event-processor \
  --type EXPRESS \
  --definition file://event-processor.asl.json \
  --role-arn arn:aws:iam::012345678901:role/StepFunctionsRole \
  --logging-configuration '{
    "level": "ALL",
    "includeExecutionData": true,
    "destinations": [{"cloudWatchLogsLogGroup": {"logGroupArn": "arn:aws:logs:us-east-1:012345678901:log-group:/aws/states/event-processor:*"}}]
  }'
```

Express workflows require explicit CloudWatch Logs configuration: there is no built-in execution history.
State Types
Task State
A Task state invokes a resource — a Lambda function, an SDK API call, or an Activity (for long-running work polled by an external worker).
```json
{
  "ChargePayment": {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:012345678901:function:charge-payment",
    "Parameters": {
      "orderId.$": "$.orderId",
      "amount.$": "$.total"
    },
    "ResultPath": "$.paymentResult",
    "Next": "FulfillOrder"
  }
}
```

Parameters constructs the input to the resource. Fields ending in .$ are JSONPath expressions evaluated against the current state input. ResultPath controls where the resource's output is written in the state data: $.paymentResult merges it into the existing input rather than replacing it entirely.
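To make the ResultPath behavior concrete, here is a small local simulation. It is only an approximation written for illustration (the function name `apply_result_path` is hypothetical) and handles just three cases: `"$"`, JSON null (Python `None`), and simple dotted paths such as `$.paymentResult`.

```python
import copy


def apply_result_path(state_input, result, result_path):
    """Approximate how ResultPath combines a task result with the state input."""
    if result_path is None:
        # ResultPath: null discards the result and keeps the input.
        return copy.deepcopy(state_input)
    if result_path == "$":
        # The result replaces the whole input.
        return result
    output = copy.deepcopy(state_input)
    keys = result_path[2:].split(".")  # strip the leading "$."
    node = output
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = result
    return output


order = {"orderId": "o-1", "total": 40}
merged = apply_result_path(order, {"chargeId": "ch-9"}, "$.paymentResult")
```

Here `merged` keeps `orderId` and `total` and gains a `paymentResult` key holding the task result, while `order` itself is left unchanged.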
SDK integrations let you call AWS service APIs directly without a Lambda shim:
```json
{
  "SaveOrder": {
    "Type": "Task",
    "Resource": "arn:aws:states:::dynamodb:putItem",
    "Parameters": {
      "TableName": "orders",
      "Item": {
        "orderId": {"S.$": "$.orderId"},
        "status": {"S": "PENDING"},
        "createdAt": {"S.$": "$$.Execution.StartTime"}
      }
    },
    "ResultPath": null,
    "Next": "ChargePayment"
  }
}
```

$$.Execution.StartTime uses the context object ($$): Step Functions provides execution metadata (execution name, start time, state machine ARN) available at runtime.
Choice State
A Choice state branches based on conditions evaluated against the current state data:
```json
{
  "CheckInventory": {
    "Type": "Choice",
    "Choices": [
      {
        "Variable": "$.inventoryCount",
        "NumericGreaterThan": 0,
        "Next": "ReserveInventory"
      },
      {
        "Variable": "$.backorderAllowed",
        "BooleanEquals": true,
        "Next": "CreateBackorder"
      }
    ],
    "Default": "NotifyOutOfStock"
  }
}
```

Choice states have no Next at the top level: each choice rule has its own Next. The Default handles all unmatched cases. Choice states don't support Retry or Catch directly, so handle errors in the states they branch to.
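The rule ordering matters: rules are checked top to bottom and the first match wins. The sketch below mimics that for the CheckInventory state above. It is a simplified local model, not the service's evaluator: `evaluate_choice` is a hypothetical name, and only top-level `$.field` variables and the two operators used above are supported.

```python
CHECK_INVENTORY = {
    "Type": "Choice",
    "Choices": [
        {"Variable": "$.inventoryCount", "NumericGreaterThan": 0, "Next": "ReserveInventory"},
        {"Variable": "$.backorderAllowed", "BooleanEquals": True, "Next": "CreateBackorder"},
    ],
    "Default": "NotifyOutOfStock",
}

# Comparison operators supported by this simplified model.
OPERATORS = {
    "NumericGreaterThan": lambda actual, expected: actual > expected,
    "BooleanEquals": lambda actual, expected: actual is expected,
}


def evaluate_choice(state, data):
    """Return the next state name: first matching rule wins, else Default."""
    for rule in state["Choices"]:
        field = rule["Variable"][2:]  # "$.inventoryCount" -> "inventoryCount"
        actual = data[field]  # a missing field raises KeyError in this sketch
        for name, compare in OPERATORS.items():
            if name in rule and compare(actual, rule[name]):
                return rule["Next"]
    return state["Default"]
```

With stock on hand the first rule matches even if backorders are allowed; the second rule is only reached when `inventoryCount` is zero.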
Parallel State
A Parallel state executes multiple branches simultaneously. All branches must complete before the state machine moves to Next.
```json
{
  "NotifyAll": {
    "Type": "Parallel",
    "Branches": [
      {
        "StartAt": "SendEmail",
        "States": {
          "SendEmail": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:012345678901:function:send-email",
            "End": true
          }
        }
      },
      {
        "StartAt": "SendSMS",
        "States": {
          "SendSMS": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:012345678901:function:send-sms",
            "End": true
          }
        }
      },
      {
        "StartAt": "UpdateAnalytics",
        "States": {
          "UpdateAnalytics": {
            "Type": "Task",
            "Resource": "arn:aws:states:::dynamodb:updateItem",
            "Parameters": {
              "TableName": "order-analytics",
              "Key": {"date": {"S.$": "$.orderDate"}},
              "UpdateExpression": "ADD orderCount :one",
              "ExpressionAttributeValues": {":one": {"N": "1"}}
            },
            "End": true
          }
        }
      }
    ],
    "ResultPath": "$.notificationResults",
    "Next": "OrderComplete"
  }
}
```

The output of a Parallel state is an array with one element per branch, in branch declaration order. If any branch fails and has no Catch, the entire Parallel state fails.
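The ordering guarantee can be illustrated with a local analogy using threads. This is only a sketch of the semantics described above (same input to every branch, results in declaration order, one failure fails the whole call); the branch functions are hypothetical, and unlike the real service it does not cancel the remaining branches when one fails.

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel(branches, state_input):
    """Run every branch with the same input; return results in branch order."""
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = [pool.submit(branch, state_input) for branch in branches]
        # result() re-raises a branch's exception, so one failure fails the call.
        return [future.result() for future in futures]


def send_email(order):
    return {"channel": "email", "orderId": order["orderId"]}


def send_sms(order):
    return {"channel": "sms", "orderId": order["orderId"]}


results = run_parallel([send_email, send_sms], {"orderId": "o-1"})
```

Whichever branch finishes first, `results[0]` always belongs to `send_email` and `results[1]` to `send_sms`.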
Map State
A Map state iterates over an array in the state data, applying the same workflow to each element — in parallel, up to a configurable MaxConcurrency.
```json
{
  "ProcessLineItems": {
    "Type": "Map",
    "ItemsPath": "$.lineItems",
    "ItemSelector": {
      "item.$": "$$.Map.Item.Value",
      "orderId.$": "$.orderId"
    },
    "MaxConcurrency": 10,
    "ItemProcessor": {
      "StartAt": "ProcessItem",
      "States": {
        "ProcessItem": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:us-east-1:012345678901:function:process-line-item",
          "End": true
        }
      }
    },
    "ResultPath": "$.itemResults",
    "Next": "SummarizeResults"
  }
}
```

ItemsPath specifies which array to iterate. ItemSelector (formerly Parameters in older ASL versions) shapes each iteration's input, and ItemProcessor (formerly Iterator) holds the states that run for each element. MaxConcurrency: 0 places no explicit limit on concurrency. $$.Map.Item.Value and $$.Map.Item.Index give access to the current element and its position in the context object.
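As a rough picture of what each iteration receives, the sketch below builds the per-item inputs locally in the way the ItemSelector above describes. The function name and the extra `index` field are illustrative assumptions; the real service resolves `$$.Map.Item.Value` and `$$.Map.Item.Index` from the context object.

```python
def build_iteration_inputs(state_input, items_key, shared_fields):
    """Build one input document per array element.

    items_key: top-level key holding the array (ItemsPath "$.lineItems").
    shared_fields: top-level input keys copied into every iteration.
    """
    inputs = []
    for index, value in enumerate(state_input[items_key]):
        iteration = {"index": index, "item": value}  # Map.Item.Index / Map.Item.Value
        for field in shared_fields:
            iteration[field] = state_input[field]
        inputs.append(iteration)
    return inputs


order = {"orderId": "o-1", "lineItems": [{"sku": "A"}, {"sku": "B"}]}
iteration_inputs = build_iteration_inputs(order, "lineItems", ["orderId"])
```

Each element of `iteration_inputs` carries one line item plus the shared `orderId`, which is why the worker Lambda never needs to see the full order document.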
Wait State
A Wait state pauses execution for a duration or until a timestamp — without consuming Lambda execution time:
```json
{
  "WaitForProcessing": {
    "Type": "Wait",
    "Seconds": 300,
    "Next": "CheckStatus"
  }
}
```

Or wait until a specific timestamp:

```json
{
  "ScheduleReminder": {
    "Type": "Wait",
    "TimestampPath": "$.reminderAt",
    "Next": "SendReminder"
  }
}
```

reminderAt must be an ISO 8601 timestamp string (e.g., 2026-06-01T10:00:00Z). Wait states are useful for polling patterns: wait, check status, branch on result, repeat.
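If an upstream step computes the timestamp, it needs to emit it in that format. A small sketch using only the standard library follows; `reminder_timestamp` is a hypothetical helper and it expects a timezone-aware `datetime`.

```python
from datetime import datetime, timedelta, timezone


def reminder_timestamp(now: datetime, delay: timedelta) -> str:
    """Format now + delay as an ISO 8601 UTC string, e.g. 2026-06-01T10:00:00Z.

    `now` must be timezone-aware so the conversion to UTC is unambiguous.
    """
    target = (now + delay).astimezone(timezone.utc)
    return target.strftime("%Y-%m-%dT%H:%M:%SZ")


start = datetime(2026, 5, 31, 10, 0, 0, tzinfo=timezone.utc)
reminder_at = reminder_timestamp(start, timedelta(days=1))
```

The resulting string can be placed in the state input under `reminderAt` for the Wait state to pick up.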
Error Handling
Retry
Retry defines exponential backoff on error:
```json
{
  "ChargePayment": {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:012345678901:function:charge-payment",
    "Retry": [
      {
        "ErrorEquals": ["Lambda.ServiceException", "Lambda.AWSLambdaException", "Lambda.SdkClientException", "Lambda.TooManyRequestsException"],
        "IntervalSeconds": 2,
        "MaxAttempts": 3,
        "BackoffRate": 2,
        "JitterStrategy": "FULL"
      },
      {
        "ErrorEquals": ["PaymentGateway.RateLimitError"],
        "IntervalSeconds": 30,
        "MaxAttempts": 5,
        "BackoffRate": 1.5
      }
    ],
    "Catch": [
      {
        "ErrorEquals": ["PaymentGateway.CardDeclined"],
        "ResultPath": "$.error",
        "Next": "NotifyPaymentFailed"
      },
      {
        "ErrorEquals": ["States.ALL"],
        "ResultPath": "$.error",
        "Next": "HandleUnexpectedError"
      }
    ],
    "Next": "FulfillOrder"
  }
}
```

ErrorEquals can reference Lambda error names (thrown as named exceptions), States.TaskFailed, States.Timeout, States.HeartbeatTimeout, or States.ALL (catch-all). The four Lambda transient error codes (Lambda.ServiceException, Lambda.AWSLambdaException, Lambda.SdkClientException, Lambda.TooManyRequestsException) should be included in every Lambda Task Retry block for resilience against transient Lambda service issues.
JitterStrategy: "FULL" adds randomized jitter to prevent thundering-herd retries when multiple executions fail simultaneously.
BackoffRate multiplies the interval on each attempt: with IntervalSeconds: 2 and BackoffRate: 2, the delays are 2s, 4s, 8s.
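The delay arithmetic can be checked with a few lines of Python. This is a sketch of the formula as described above, not the service's code; `retry_delays` is a hypothetical helper, and the jitter branch assumes that FULL jitter picks a delay uniformly between zero and the computed backoff.

```python
import random


def retry_delays(interval_seconds, max_attempts, backoff_rate, full_jitter=False, rng=random):
    """Return the wait before each retry attempt, in seconds."""
    delays = []
    for attempt in range(max_attempts):
        delay = interval_seconds * backoff_rate ** attempt
        if full_jitter:
            # Assumption: FULL jitter draws uniformly from [0, computed delay].
            delay = rng.uniform(0, delay)
        delays.append(delay)
    return delays


lambda_retry = retry_delays(2, 3, 2)        # first retrier above, without jitter
gateway_retry = retry_delays(30, 5, 1.5)    # second retrier above
```

For the second retrier the waits grow from 30 seconds to just over two and a half minutes, so five attempts can hold a state for roughly six and a half minutes in total before the Catch rules apply.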
Catch
Catch routes to a different state on unrecoverable failure. ResultPath: "$.error" merges the error cause into the existing input — without it, the error replaces the state data. ResultPath: null discards the error output and passes the original input to the catch state.
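The following sketch models how a failed task is routed through a Catch list and how the error lands in the state data. It is a simplified illustration: `catch_error` is a hypothetical name, only top-level ResultPath values are supported, and States.ALL is treated as matching every error name, which glosses over the few error types the service does not let you catch.

```python
def catch_error(error_name, cause, state_input, catchers):
    """Return (next_state, output) for the first matching catcher, or None."""
    for catcher in catchers:
        names = catcher["ErrorEquals"]
        if error_name not in names and "States.ALL" not in names:
            continue
        error_output = {"Error": error_name, "Cause": cause}
        path = catcher.get("ResultPath", "$")  # default: error replaces the input
        if path is None:
            output = dict(state_input)          # discard the error
        elif path == "$":
            output = error_output
        else:
            output = dict(state_input)
            output[path[2:]] = error_output     # "$.error" -> key "error"
        return catcher["Next"], output
    return None  # nothing matched: the execution fails


CATCHERS = [
    {"ErrorEquals": ["PaymentGateway.CardDeclined"], "ResultPath": "$.error", "Next": "NotifyPaymentFailed"},
    {"ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "HandleUnexpectedError"},
]
next_state, output = catch_error(
    "PaymentGateway.CardDeclined", "card expired", {"orderId": "o-1"}, CATCHERS
)
```

Because catchers are evaluated in order, the specific CardDeclined rule has to come before the States.ALL rule, otherwise the catch-all would swallow it.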
Heartbeat
For long-running Lambda functions or Activities, set HeartbeatSeconds to detect stuck executions:
```json
{
  "RunETLJob": {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
    "Parameters": {
      "FunctionName": "arn:aws:lambda:us-east-1:012345678901:function:start-etl-job",
      "Payload": {
        "taskToken.$": "$$.Task.Token",
        "jobConfig.$": "$.jobConfig"
      }
    },
    "HeartbeatSeconds": 60,
    "TimeoutSeconds": 3600,
    "Next": "ProcessResults"
  }
}
```

waitForTaskToken pauses the state machine until the Lambda (or external system) calls SendTaskSuccess or SendTaskFailure with the token. Because HeartbeatSeconds is set to 60, the worker must also call SendTaskHeartbeat at least once every 60 seconds, or the task fails with States.HeartbeatTimeout. This is the pattern for human approval gates, long-running batch jobs, and external system integrations.
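When HeartbeatSeconds is set, the worker holding the token is expected to report progress through SendTaskHeartbeat before the interval elapses. One way to structure that is sketched below, under the assumption that the work can be split into steps that each finish well inside the heartbeat window. `run_with_heartbeats` and `steps` are hypothetical; `sfn_client` is expected to be a boto3 Step Functions client, or any object exposing the same `send_task_heartbeat` and `send_task_success` methods.

```python
import json


def run_with_heartbeats(sfn_client, task_token, steps):
    """Run each step in order, sending a heartbeat after every one.

    steps: callables that each complete within HeartbeatSeconds.
    """
    results = []
    for step in steps:
        results.append(step())
        # Tell Step Functions the worker is still alive.
        sfn_client.send_task_heartbeat(taskToken=task_token)
    # Report completion; the output must be a JSON string.
    sfn_client.send_task_success(taskToken=task_token, output=json.dumps(results))
    return results
```

Passing the client in as a parameter keeps the function easy to exercise with a stub in unit tests, without touching AWS.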
Callback Pattern (waitForTaskToken)
The callback pattern decouples Step Functions from long-running external work:
```python
import boto3
import json

sfn = boto3.client('stepfunctions')

# launch_etl_job, store_token, and retrieve_token are application-specific
# placeholders: supply your own job launcher and a durable token store.


def start_etl_job(event, context):
    """Lambda handler invoked by the waitForTaskToken Task state."""
    task_token = event['taskToken']
    job_config = event['jobConfig']

    # Start the job (async: don't wait for it here)
    job_id = launch_etl_job(job_config)

    # Store the token so the job can report back when done
    store_token(job_id, task_token)

    return {'jobId': job_id}


def on_job_completion(job_id: str, success: bool, result: dict):
    """Called by the external system when the job finishes."""
    task_token = retrieve_token(job_id)

    if success:
        sfn.send_task_success(
            taskToken=task_token,
            output=json.dumps(result)
        )
    else:
        sfn.send_task_failure(
            taskToken=task_token,
            error='ETLJobFailed',
            cause=json.dumps(result.get('error', {}))
        )
```

The state machine waits for send_task_success or send_task_failure until TimeoutSeconds elapses, at which point the task fails with States.Timeout. This is how you integrate Step Functions with systems that don't have synchronous completion: Glue jobs, ECS batch tasks, third-party APIs with webhooks.
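The handler above relies on a token store that the article leaves undefined. For local experiments, an in-memory stand-in is enough; the sketch below is exactly that and nothing more. In a deployed system the token would need to live in a durable store shared by the Lambda and the completion handler, since the two run in different processes.

```python
# In-memory stand-in for the token store (for local testing only).
_tokens: dict = {}


def store_token(job_id: str, task_token: str) -> None:
    """Remember which task token belongs to which job."""
    _tokens[job_id] = task_token


def retrieve_token(job_id: str) -> str:
    """Return and remove the token, so completion is reported only once."""
    return _tokens.pop(job_id)
```

Removing the token on retrieval is a deliberate choice here: a second completion event for the same job raises `KeyError` instead of silently calling Step Functions again with a token that has already been used.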
SDK Integration Patterns
Besides the callback pattern above, Step Functions has two integration patterns for SDK calls:
Request-Response (default): calls the API, moves to Next immediately with the API's synchronous response. Right for fast operations (DynamoDB PutItem, SNS Publish, SQS SendMessage).
Synchronous (.sync): waits for the job to complete before moving on. Right for batch operations.

```json
{
  "StartGlueJob": {
    "Type": "Task",
    "Resource": "arn:aws:states:::glue:startJobRun.sync",
    "Parameters": {
      "JobName": "transform-orders",
      "Arguments": {
        "--date.$": "$.processingDate"
      }
    },
    "Next": "ProcessResults"
  }
}
```

The .sync suffix tells Step Functions to track the Glue job until completion (using internal polling, no Lambda needed). Supported services for sync integration include ECS (RunTask), Glue, EMR, Batch, Athena, and others. The .sync:2 variant applies to nested Step Functions executions (states:startExecution.sync:2), where it returns the child execution's output as parsed JSON rather than as a string.
Full Example: Order Fulfillment Workflow
```json
{
  "Comment": "Order fulfillment workflow",
  "StartAt": "ValidateOrder",
  "States": {
    "ValidateOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:012345678901:function:validate-order",
      "Retry": [
        {
          "ErrorEquals": ["Lambda.ServiceException", "Lambda.AWSLambdaException", "Lambda.SdkClientException", "Lambda.TooManyRequestsException"],
          "IntervalSeconds": 2,
          "MaxAttempts": 3,
          "BackoffRate": 2
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["OrderValidationError"],
          "ResultPath": "$.error",
          "Next": "RejectOrder"
        }
      ],
      "Next": "CheckInventory"
    },
    "CheckInventory": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.inventoryAvailable",
          "BooleanEquals": true,
          "Next": "ReserveAndCharge"
        }
      ],
      "Default": "NotifyBackorder"
    },
    "ReserveAndCharge": {
      "Type": "Parallel",
      "Branches": [
        {
          "StartAt": "ReserveInventory",
          "States": {
            "ReserveInventory": {
              "Type": "Task",
              "Resource": "arn:aws:states:::dynamodb:updateItem",
              "Parameters": {
                "TableName": "inventory",
                "Key": {"productId": {"S.$": "$.productId"}},
                "UpdateExpression": "SET reserved = reserved + :qty",
                "ExpressionAttributeValues": {":qty": {"N.$": "States.Format('{}', $.quantity)"}}
              },
              "End": true
            }
          }
        },
        {
          "StartAt": "ChargePayment",
          "States": {
            "ChargePayment": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-east-1:012345678901:function:charge-payment",
              "Retry": [
                {
                  "ErrorEquals": ["Lambda.ServiceException", "Lambda.TooManyRequestsException"],
                  "IntervalSeconds": 5,
                  "MaxAttempts": 3,
                  "BackoffRate": 2
                }
              ],
              "Catch": [
                {
                  "ErrorEquals": ["PaymentDeclinedError"],
                  "ResultPath": "$.paymentError",
                  "Next": "PaymentDeclined"
                }
              ],
              "End": true
            },
            "PaymentDeclined": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-east-1:012345678901:function:handle-declined-payment",
              "End": true
            }
          }
        }
      ],
      "ResultPath": "$.fulfillmentResults",
      "Next": "ShipOrder"
    },
    "ShipOrder": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
      "Parameters": {
        "FunctionName": "arn:aws:lambda:us-east-1:012345678901:function:initiate-shipment",
        "Payload": {
          "taskToken.$": "$$.Task.Token",
          "orderId.$": "$.orderId"
        }
      },
      "TimeoutSeconds": 86400,
      "Next": "NotifyShipped"
    },
    "NotifyShipped": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:us-east-1:012345678901:order-notifications",
        "Message.$": "States.Format('Order {} has shipped', $.orderId)"
      },
      "End": true
    },
    "RejectOrder": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:us-east-1:012345678901:order-notifications",
        "Message.$": "States.Format('Order {} rejected: {}', $.orderId, $.error.Cause)"
      },
      "End": true
    },
    "NotifyBackorder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:012345678901:function:handle-backorder",
      "End": true
    }
  }
}
```

IAM Roles
Step Functions needs an execution role with permissions to call every resource referenced in the state machine:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["lambda:InvokeFunction"],
      "Resource": [
        "arn:aws:lambda:us-east-1:012345678901:function:validate-order",
        "arn:aws:lambda:us-east-1:012345678901:function:charge-payment",
        "arn:aws:lambda:us-east-1:012345678901:function:handle-declined-payment",
        "arn:aws:lambda:us-east-1:012345678901:function:initiate-shipment",
        "arn:aws:lambda:us-east-1:012345678901:function:handle-backorder"
      ]
    },
    {
      "Effect": "Allow",
      "Action": ["dynamodb:UpdateItem"],
      "Resource": "arn:aws:dynamodb:us-east-1:012345678901:table/inventory"
    },
    {
      "Effect": "Allow",
      "Action": ["sns:Publish"],
      "Resource": "arn:aws:sns:us-east-1:012345678901:order-notifications"
    },
    {
      "Effect": "Allow",
      "Action": ["logs:CreateLogDelivery", "logs:PutLogEvents", "logs:DescribeLogGroups"],
      "Resource": "*"
    }
  ]
}
```

Use the least-privilege principle: grant access to specific function ARNs and table ARNs, not *. The logging statement is the exception, because the log delivery actions (such as logs:CreateLogDelivery) do not support resource-level permissions. The console's "auto-generate IAM role" feature generates overly broad permissions, so always review and tighten.
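A policy that lists function ARNs by hand drifts easily from the state machine definition. One way to catch that is to walk the definition and compare the Lambda ARNs it references with those the policy allows. The sketch below does this for a definition loaded as a Python dict; the function names are hypothetical, and it only looks at `Resource` and `FunctionName` values that are literal Lambda ARNs.

```python
def collect_lambda_arns(node, found=None):
    """Recursively collect Lambda function ARNs from a state machine definition."""
    if found is None:
        found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            is_arn = isinstance(value, str) and value.startswith("arn:aws:lambda:")
            if key in ("Resource", "FunctionName") and is_arn:
                found.add(value)
            else:
                collect_lambda_arns(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_lambda_arns(item, found)
    return found


def missing_from_policy(definition, allowed_arns):
    """Return the Lambda ARNs the definition uses but the policy does not allow."""
    return collect_lambda_arns(definition) - set(allowed_arns)
```

Running `missing_from_policy` in CI against the definition and the policy's Resource list flags any function that was added to a workflow, including inside Parallel or Map branches, without a matching permission.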
Monitoring and Observability
CloudWatch Metrics
Step Functions publishes execution metrics per state machine:
| Metric | What it tells you |
|---|---|
| ExecutionsStarted | Invocation volume |
| ExecutionsFailed | Unhandled failures |
| ExecutionsTimedOut | Executions exceeding TimeoutSeconds |
| ExecutionThrottled | Start rate exceeded account limit |
| ExecutionTime | End-to-end duration |
```bash
# Alert on execution failures
aws cloudwatch put-metric-alarm \
  --alarm-name sfn-order-fulfillment-failures \
  --namespace AWS/States \
  --metric-name ExecutionsFailed \
  --dimensions Name=StateMachineArn,Value=arn:aws:states:us-east-1:012345678901:stateMachine:order-fulfillment \
  --statistic Sum \
  --period 60 \
  --evaluation-periods 1 \
  --threshold 0 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:012345678901:platform-alerts
```

X-Ray Tracing
Enable X-Ray to trace execution across Step Functions and Lambda:
```bash
aws stepfunctions update-state-machine \
  --state-machine-arn arn:aws:states:us-east-1:012345678901:stateMachine:order-fulfillment \
  --tracing-configuration enabled=true
```

With X-Ray, each state becomes a segment and Lambda invocations become subsegments. You can see exactly where time is spent across a multi-step execution.
Frequently Asked Questions
When should I use Step Functions instead of a Lambda orchestrating other Lambdas?
A Lambda that calls other Lambdas (Lambda-calls-Lambda) is an anti-pattern: the outer Lambda runs (and is billed) for the full duration of every inner call, you get no visibility into sub-steps, and a crash loses all state. Step Functions offloads state management and orchestration to the service, bills per state transition (not per duration waiting), and provides execution history for every run. Use Step Functions whenever a workflow has more than one step, needs retries, or needs to survive failures gracefully.
How do I pass data between states?
State data flows as a JSON document through the execution. Each state receives the current document as input and produces a JSON value as output. Use ResultPath to merge output into the existing document instead of replacing it. Use Parameters to reshape input before sending to a resource. Use OutputPath to filter what gets passed to the next state. JSONPath expressions (fields ending in .$) let you reference any part of the current state data.
What's the cost model?
Standard: $0.025 per 1,000 state transitions. A 10-state workflow executing 100,000 times/month = 1,000,000 transitions = $25/month. No charge for time spent waiting in Wait states or waiting on waitForTaskToken callbacks.
Express: $1.00 per million requests ($0.000001 per execution) plus a duration charge starting at $0.00001667 per GB-second; exact rates vary by region and usage tier. For high-volume, short workflows (thousands of executions/minute), Express is significantly cheaper.
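The Standard arithmetic is easy to script when comparing workflow designs. The sketch below uses the $0.025 per 1,000 transitions rate quoted above as a constant; check the current price list for your region before relying on it. Retried steps count as additional transitions, so a workflow that retries often costs more than its state count suggests.

```python
# USD per state transition, from the Standard rate quoted above (assumed current).
STANDARD_PRICE_PER_TRANSITION = 0.025 / 1000


def standard_monthly_cost(transitions_per_execution, executions_per_month):
    """Estimate the monthly Standard workflow cost in USD."""
    transitions = transitions_per_execution * executions_per_month
    return transitions * STANDARD_PRICE_PER_TRANSITION


estimate = standard_monthly_cost(10, 100_000)  # the 10-state example above
```

For the 10-state workflow running 100,000 times a month, `estimate` comes to about 25 dollars, matching the figure in the text.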
Can Step Functions execute other Step Functions?
Yes. Use the states:startExecution.sync:2 resource to invoke a child state machine and wait for it to complete:
```json
{
  "RunSubWorkflow": {
    "Type": "Task",
    "Resource": "arn:aws:states:::states:startExecution.sync:2",
    "Parameters": {
      "StateMachineArn": "arn:aws:states:us-east-1:012345678901:stateMachine:sub-workflow",
      "Input.$": "$.subWorkflowInput"
    },
    "Next": "ProcessSubResult"
  }
}
```

This is useful for building modular workflow libraries: a root workflow orchestrates several sub-workflows, each independently testable and versioned.
How do I test state machines locally?
The Step Functions Local Docker container lets you run state machines locally:
```bash
docker run -p 8083:8083 amazon/aws-stepfunctions-local

# Create state machine against local endpoint
aws stepfunctions create-state-machine \
  --endpoint-url http://localhost:8083 \
  --name test-workflow \
  --definition file://workflow.asl.json \
  --role-arn arn:aws:iam::012345678901:role/DummyRole

# Start execution locally
aws stepfunctions start-execution \
  --endpoint-url http://localhost:8083 \
  --state-machine-arn arn:aws:states:us-east-1:012345678901:stateMachine:test-workflow \
  --input '{"orderId": "test-001"}'
```

Step Functions Local mocks Lambda invocations using configurable mock responses, so you can test state machine logic without deploying Lambda functions.
For the Lambda functions that Step Functions orchestrates, see AWS Lambda: Functions, Event Sources, Layers, and Serverless Patterns. For DynamoDB as the state persistence layer for workflow data, see AWS DynamoDB: Data Modeling, Capacity, Indexes, and Streams. For EventBridge as the trigger that starts state machine executions on schedule or event, see AWS SQS, SNS, and EventBridge: Messaging and Event-Driven Architecture.


