Introduction to
AWS Step Functions
About St. Louis Serverless
About Jack Frosch
1st Career
About Jack Frosch
2nd Career
- October 1st
- St. Louis Serverless Meetup
- Deep Dive into AWS Cloud Development Kit (CDK)
- October 10th
- AWS GameDay St. Louis 2019
- Hosted by Object Computing
- November 5th
- St. Louis Serverless Meetup
- Lessons Learned with Lambda (Guest Speaker)
- December 3rd
- St. Louis Serverless Meetup
- Serverless on Azure (Guest Speaker)
- What is AWS Step Functions
- Why it's important
- Step Functions By The Numbers
- Core Concepts
- Step Functions Example
- Service Integrations
- Developing Step Functions
What is AWS Step Functions?
AWS Step Functions
- An Amazon hosted service for orchestrating workflows
- Highly available
- Security using Identity and Access Management (IAM)
- Works with other AWS services as well as on-premises services
- Built-in error and retry handling
- Execution event history
- Use via console, CLI, API, and the CDK
- It's a serverless offering
- No servers to configure and maintain
- Automatically scales up and down, all the way to zero
- Pay nothing when not in use
AWS Step Functions are based on concepts of tasks and state machines
State Machine
- An abstract machine
- Can be in exactly one of a finite number of states
- State transitions occur as a result of some event
- Internal event; e.g. task completion
- External event; e.g. message received
- Great for modeling workflows and processes
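As a rough illustration of the bullets above, a state machine can be modeled as a lookup table from (state, event) pairs to next states. The state and event names below are hypothetical and are not Step Functions identifiers.

```python
# Hypothetical batch-job workflow: (current state, event) -> next state.
TRANSITIONS = {
    ("SUBMITTED", "job_started"): "RUNNING",      # external event
    ("RUNNING", "task_completed"): "SUCCEEDED",   # internal event
    ("RUNNING", "task_failed"): "FAILED",
}


class StateMachine:
    """An abstract machine that is in exactly one state at a time."""

    def __init__(self, start, transitions):
        self.state = start
        self.transitions = transitions

    def handle(self, event):
        """Apply an event and return the new state; reject unknown transitions."""
        key = (self.state, event)
        if key not in self.transitions:
            raise ValueError(f"no transition from {self.state!r} on {event!r}")
        self.state = self.transitions[key]
        return self.state
```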
Example - Batch Job Workflow
Why it's important
Meet the Monolith
- Already exists
- Easy to debug from beginning to end
- One thing to manage
- Works most of the time
- Hard to fully understand
- Increasing accidental complexity
- High coupling / low cohesion
- To scale one, must scale all
- Expensive regression testing
- Bugs harder to find
- Old technology lock-in
- Inflexible architecture
- Developer recruitment and retention
- Slower release cadence
- Desire for risk avoidance overcomes desire to innovate
- Unconstrained competitors
We want to spend our time solving business problems, but we write code to
- Handle service invocations
- Handle scheduling
- Handle branching
- Handle errors
- Handle retries
- Handle failures
- Chain to other workflows
This leads to
- Increased complexity
- Increased coupling
We need to externalize the workflow orchestration.
Step f() By The Numbers
- 4,000 state transitions per month free
- $0.025 per 1,000 state transitions
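A back-of-the-envelope cost estimate follows directly from the two numbers above. This sketch assumes the free allowance is simply subtracted from the monthly total and that no other charges apply; the function name is made up for illustration.

```python
from decimal import Decimal

FREE_TRANSITIONS = 4_000            # free state transitions per month (from the slide)
PRICE_PER_1000 = Decimal("0.025")   # USD per 1,000 state transitions (from the slide)


def monthly_cost(transitions: int) -> Decimal:
    """Estimate the monthly charge in USD for a number of state transitions."""
    billable = max(0, transitions - FREE_TRANSITIONS)
    return PRICE_PER_1000 * billable / 1000
```

For example, `monthly_cost(104_000)` bills 100,000 transitions, which comes to 2.50 USD.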
State Machine Execution Limits
- 1,000,000 open executions
- 1 year max execution length / idle time
- 25,000 execution events in history
- 90 days execution history retention
Task Execution Limits
- 1,000 pollers calling GetActivityTask
- 32,768 characters input/output
Account Limits
- 10,000 registered activities
- 10,000 registered state machines
- 1MB per API request
Core Concepts
- Amazon States Language
- Defining a State Machine
- States
- State Transitions
- Data handling in states
- Handling Errors
- State Types
Amazon States Language
- A DSL for specifying a state machine in JSON format
- Copyright
- Code examples in DSL are Apache 2.0
Defining a State Machine
- Must have a "States" field
- Must have one and only one "StartAt" field referencing one of the states
- May have a "Comment" field
- May have a "Version" field specifying the States Language version (if omitted, 1.0 is assumed)
{ "Comment": "My Batch Job workflow", "StartAt": "Submit Job", "States": { "Submit Job": {
- States are defined in the top-level "States" object
- States describe tasks ("units of work") or flow control
- State name must be unique in scope of state machine
- State name must be <= 128 Unicode characters
- Each state must have a "Type" field
- Each state may have a "Comment" field
- Most state types have additional requirements
- Any state other than types Choice, Succeed, and Fail may have a boolean field named "End"
- Terminal states have {"End": true} or are of type Succeed or Fail
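The structural rules above lend themselves to a small checker. The following is only a sketch covering a few of the listed rules, not a replacement for the service's own validation; the function name and messages are invented for illustration.

```python
# State types that may not carry an "End" field, per the rules above.
NO_END_TYPES = {"Choice", "Succeed", "Fail"}


def validate(machine: dict) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    states = machine.get("States")
    if not isinstance(states, dict):
        return ['missing top-level "States" object']
    problems = []
    if machine.get("StartAt") not in states:
        problems.append('"StartAt" must reference one of the states')
    for name, state in states.items():
        if len(name) > 128:
            problems.append(f"{name[:20]}...: name longer than 128 characters")
        if "Type" not in state:
            problems.append(f'{name}: missing "Type" field')
        if "End" in state and state.get("Type") in NO_END_TYPES:
            problems.append(f'{name}: "End" is not allowed on {state["Type"]} states')
    return problems
```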
Simple State Example
"HelloWorld": {
  "Type": "Task",
  "Resource": "arn:aws:lambda:us-east-1:123456789012:function:HelloWorld",
  "Next": "NextState",
  "Comment": "Executes the HelloWorld Lambda function"
}
State Transitions
- After executing the action in a non-terminal state, the state machine transitions to the state specified in the "Next" field
- The state transitioned to from a Choice state type is determined by the logic in the Choice state
- A state can have multiple incoming transitions from other states
States Data
- Interpreter passes data between states
- All data must be in JSON format
- Initial data may be provided to the start state
- If no data is provided, an empty JSON object is passed; i.e. { }
- A state can create output data which must be JSON
- Numbers generally conform to JavaScript double-precision (IEEE 754) values
- Strings, booleans, and numbers are valid JSON texts
States Data - Path Expressions
- When states need to access specific fields, they can use JsonPath expressions
- A Path expression starts with a $
- A path expression starting with $$ is evaluated against the context object instead of the state input
- A Reference Path is a Path that resolves to a single node in the JSON data
- Reference Paths do not support the operators "@", ",", ":", and "?"
States Data - Timestamps
- Timestamps used in data must conform to RFC3339 profile of ISO 8601
- A capital 'T' character must be used to separate date and time
- If a numeric timezone offset is not used, a capital 'Z' must terminate the string; e.g. 2016-03-14T01:59:00Z
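The timestamp rules above can be checked with a short helper. This is a minimal sketch using only the standard library; it accepts optional fractional seconds, which is an assumption about what a caller might want rather than something stated on the slide.

```python
import re
from datetime import datetime

# Capital 'T' between date and time; capital 'Z' or a numeric offset at the end.
_PATTERN = re.compile(
    r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})$"
)


def is_valid_timestamp(text: str) -> bool:
    """Return True if text looks like an acceptable States Language timestamp."""
    if not _PATTERN.match(text):
        return False
    try:
        # Reject impossible dates and times such as month 13.
        datetime.strptime(text[:19], "%Y-%m-%dT%H:%M:%S")
    except ValueError:
        return False
    return True
```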
States Data - Example
{
  "foo": 123,
  "bar": ["a", "b", "c"],
  "car": {
    "cdr": true
  }
}
$.foo => 123
$.bar => ["a", "b", "c"]
$.car.cdr => true
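To make the lookups above concrete, here is a minimal resolver for the simplest path forms only: `$`, `.name`, and `[index]`. It is a sketch, not a JSONPath implementation, and does not handle escapes, quoted names, or slices.

```python
import re

# One path step: either ".name" or "[index]".
_TOKEN = re.compile(r"\.([A-Za-z_][A-Za-z0-9_]*)|\[(\d+)\]")


def resolve(path: str, data):
    """Follow a simple path such as "$.car.cdr" or "$.bar[1]" through JSON data."""
    if not path.startswith("$"):
        raise ValueError("a path must start with $")
    node, pos = data, 1
    while pos < len(path):
        match = _TOKEN.match(path, pos)
        if match is None:
            raise ValueError(f"unsupported syntax at offset {pos} in {path!r}")
        name, index = match.groups()
        node = node[name] if name is not None else node[int(index)]
        pos = match.end()
    return node
```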
More Valid Examples
$
$.store\.book
$.\stor\\k
$
$.foo.\.bar
$.foo\@bar.baz\[\[.\?pretty
$.&Ж中.\uD800\uDF46
$.ledgers.branch[0].pending.count
$.ledgers.branch[0]
$.ledgers[0][22][315].foo
$['store']['book']
$['store'][0]['book']
States Data - Input/Output
- By default, a state receives its full input, and its result replaces that input as the output
- However, a state may only be interested in a subset of the input data, or even restructure it differently than the input
- Four optional fields exist for this...
- InputPath
- A path selecting some or all of the state's input
- Default is $
- Parameters
- Any value constructed from the input
- Becomes the effective input
- Has no default
States Data - Input/Output (cont'd)
- ResultPath
- Input data + state result
- Must be a reference path
- Default is $ (overwrites and replaces input)
- OutputPath
- A path applied to the ResultPath
- Yields the effective output that is input to next state
- Defaults to $; i.e. the ResultPath
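The order in which InputPath, ResultPath, and OutputPath apply can be simulated in a few lines. This sketch supports only `$` and top-level `$.field` paths and leaves out Parameters; `run_task` and `select` are hypothetical helper names, and `task` stands in for whatever the Resource does.

```python
import copy


def select(path, data):
    """Resolve "$" or a top-level "$.field" path (all this sketch supports)."""
    if path == "$":
        return data
    return data[path[2:]]


def run_task(state, raw_input, task):
    """Apply InputPath, run the task, then apply ResultPath and OutputPath."""
    effective_input = select(state.get("InputPath", "$"), raw_input)
    result = task(effective_input)
    result_path = state.get("ResultPath", "$")
    if result_path == "$":
        combined = result                      # result replaces the input
    else:
        combined = copy.deepcopy(raw_input)    # result is added to the input
        combined[result_path[2:]] = result
    return select(state.get("OutputPath", "$"), combined)
```

With `{"InputPath": "$.numbers", "ResultPath": "$.sum"}` and a task that adds `val1` and `val2`, the input `{"title": "Numbers to add", "numbers": {"val1": 3, "val2": 4}}` comes back with an added `"sum": 7` field.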
States Data - Input/Output (cont'd)
States Data - Input/Output Example 1
{
  "StartAt": "Add",
  "States": {
    "Add": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:Add",
      "InputPath": "$.numbers",
      "ResultPath": "$.sum",
      "End": true
    }
  }
}
States Data - Input/Output Example 1
Input:
{
  "title": "Numbers to add",
  "numbers": { "val1": 3, "val2": 4 }
}
Output:
{
  "title": "Numbers to add",
  "numbers": { "val1": 3, "val2": 4 },
  "sum": 7
}
States Data - Input/Output Example 2
"Prepare Data": {
  "Type": "Task",
  "Resource": "arn:aws:swf:us-east-1:123456789012:task:X",
  "Next": "Y",
  "Parameters": {
    "flagged": true,
    "parts": {
      "first.$": "$.vals[0]",
      "last3.$": "$.vals[3:]"
    }
  }
}
States Data - Input/Output Example 2
Input:
{
  "flagged": 7,
  "vals": [0, 10, 20, 30, 40, 50]
}
Effective input after Parameters:
{
  "flagged": true,
  "parts": {
    "first": 0,
    "last3": [30, 40, 50]
  }
}
Handling Errors
- Runtime errors are identified by case-sensitive strings, called Error Names
- The States Language has some reserved Error Names that all begin with "States."
- Custom error names must not start with "States."
- Error handling has two flavors, which can be used together
- Retriers
- Catchers
States - Predefined Errors
States - Built-in Error Names
States - Retriers
- Task and Parallel State types may include a "Retry" field that defines an array of Retriers
- Required Retrier fields:
- "ErrorEquals" array - specifies error names handled
- Optional Retrier fields...
States - Optional Retrier Fields
- "MaxAttempts"
- Non-negative integer: maximum number of retries (0 means never retry)
- Default = 3
- "IntervalSeconds"
- Positive integer: seconds to wait after the error before the first retry
- Default = 1
- "BackoffRate"
- Multiplier applied to the retry interval after every retry
- Default = 2.0
States - Retrier Example
"Retry" : [ {
"ErrorEquals": [ "States.Timeout" ],
"IntervalSeconds": 3,
"MaxAttempts": 3,
"BackoffRate": 2.0
} ]
- If a timeout error occurs, wait 3 seconds, then retry
- If a timeout occurs again, wait 3.0s x 2.0 (6.0 seconds), then retry
- If a timeout occurs again, wait 6.0s x 2.0 (12.0 seconds), then retry
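The wait times in the walkthrough above follow a simple geometric pattern: interval times backoff rate raised to the retry number. A minimal sketch, with a made-up function name and defaults taken from the Retrier field defaults listed earlier:

```python
def retry_delays(interval_seconds=1, max_attempts=3, backoff_rate=2.0):
    """Return the wait in seconds before each retry, in order."""
    return [interval_seconds * backoff_rate ** attempt
            for attempt in range(max_attempts)]
```

Calling `retry_delays(3, 3, 2.0)` gives the 3, 6, and 12 second waits described above; after the last retry fails, the error is no longer retried.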
State Errors - Catchers
- Task States and Parallel States may define a "Catch" field
- The "Catch" field defines an array of Catchers
- A Catcher must specify an "ErrorEquals" field specifying an array of error names
- A Catcher must specify a "Next" field specifying an existing State name
- Retriers, if specified, execute first
- If error still exists, Catchers are evaluated and processed
State Errors - Catchers (cont'd)
- When a Catcher causes a transition to the "Next" state, the output must contain a string field named "Error" containing the error name
- The error output should contain a string field named "Cause" containing human-readable error information
- A Catcher may have a "ResultPath" field so the error output is appended to the input data
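Catcher selection can be pictured as a first-match scan over the "Catch" array once retries are exhausted. The sketch below assumes "States.ALL" matches any error name and ignores service-specific exceptions to that rule; `next_state_for` is a hypothetical helper.

```python
def next_state_for(error_name, catchers):
    """Return the "Next" state of the first Catcher that matches, else None."""
    for catcher in catchers:
        names = catcher["ErrorEquals"]
        if error_name in names or "States.ALL" in names:
            return catcher["Next"]
    return None  # uncaught: the execution fails
```

Because the scan stops at the first match, a "States.ALL" Catcher is best placed last in the array.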
State Errors - Catchers Example
"Catch": [
  {
    "ErrorEquals": [ "java.lang.Exception" ],
    "ResultPath": "$.error-info",
    "Next": "RecoveryState"
  },
  {
    "ErrorEquals": [ "States.ALL" ],
    "Next": "EndMachine"
  }
]
State Types and Fields
State Type: Pass
- A no-op state used for development and testing
- Merely passes the input data to the output
- Optional "Result" field representing the stub data
State Type: Pass Example
"No-op": {
  "Type": "Pass",
  "Result": {
    "x-datum": 0.381018,
    "y-datum": 622.226993
  },
  "ResultPath": "$.coords",
  "Next": "End"
}
Input:
{ "geo-ref": "Home" }
Output:
{
  "geo-ref": "Home",
  "coords": {
    "x-datum": 0.381018,
    "y-datum": 622.226993
  }
}
State Type: Task
- A state that does work through a specified resource
- The "Resource" field specifying a URI is required
- The States Language does not restrict the URI, but in AWS the value will be an Amazon Resource Name (ARN)
- Optional timeouts, in positive integer seconds
- "TimeoutSeconds" (default = 60)
- "HeartbeatSeconds"
- Must be smaller than TimeoutSeconds
- Workers can call back via SendTaskHeartbeat
- If either timeout is exceeded, a "States.Timeout" error occurs
State Type: Task Example
"TaskState": {
  "Comment": "Task State example",
  "Type": "Task",
  "Resource": "arn:aws:swf:us-east-1:123456789012:task:HelloWorld",
  "Next": "NextState",
  "TimeoutSeconds": 300,
  "HeartbeatSeconds": 60
}
State Type: Choice
- A state that adds branching logic to a state machine
- Must have a "Choices" field with non-empty array
- Each element of the array is called a Choice Rule
- A Choice Rule has
- A comparison operation
- A "Next" field which must match an existing state
- The first Choice Rule in the array with an exact match is the choice taken
- Choice states may have a "Default" field used if no Choice Rule matches
Choice State Comparison Operators
- StringEquals
- StringLessThan
- StringGreaterThan
- StringLessThanEquals
- StringGreaterThanEquals
- NumericEquals
- NumericLessThan
- NumericGreaterThan
- NumericLessThanEquals
- NumericGreaterThanEquals
- BooleanEquals
- TimestampEquals
- TimestampLessThan
- TimestampGreaterThan
- TimestampLessThanEquals
- TimestampGreaterThanEquals
- And
- Or
- Not
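How a Choice state picks its next state can be sketched as a recursive rule evaluator. Only three comparison operators plus And/Or/Not are covered here, and variables are limited to top-level `$.field` paths; the function names are hypothetical.

```python
def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def matches(rule, data):
    """Evaluate one Choice Rule (or nested rule) against the state input."""
    if "And" in rule:
        return all(matches(r, data) for r in rule["And"])
    if "Or" in rule:
        return any(matches(r, data) for r in rule["Or"])
    if "Not" in rule:
        return not matches(rule["Not"], data)
    value = data.get(rule["Variable"][2:])  # supports only "$.field"
    if "StringEquals" in rule:
        return isinstance(value, str) and value == rule["StringEquals"]
    if "NumericGreaterThanEquals" in rule:
        return _is_number(value) and value >= rule["NumericGreaterThanEquals"]
    if "NumericLessThan" in rule:
        return _is_number(value) and value < rule["NumericLessThan"]
    raise ValueError("operator not covered by this sketch")


def choose(state, data):
    """Return the "Next" of the first matching rule, else the "Default"."""
    for rule in state["Choices"]:
        if matches(rule, data):
            return rule["Next"]
    return state.get("Default")
```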
State Type: Choice Example
"ChoiceStateX": {
  "Type": "Choice",
  "Choices": [
    {
      "Not": {
        "Variable": "$.type",
        "StringEquals": "Private"
      },
      "Next": "Public"
    },
    {
      "And": [
        {
          "Variable": "$.value",
          "NumericGreaterThanEquals": 20
        },
        {
          "Variable": "$.value",
          "NumericLessThan": 30
        }
      ],
      "Next": "ValueInTwenties"
    }
  ],
  "Default": "DefaultState"
},
"Public": {
  "Type": "Task",
  "Next": "NextState"
},
"ValueInTwenties": {
  "Type": "Task",
  "Next": "NextState"
},
"DefaultState": {
  "Type": "Fail",
  "Cause": "No Matches!"
}
State Type: Wait
- A state that adds a delay to a state machine
- The delay time can be specified using different fields
- "Seconds" - wait duration (in seconds)
- "SecondsPath" - Delay seconds from input path
- "Timestamp" - Datetime expiry value in ISO-8601 format
- "TimestampPath" - Datetime expiry value from input
State Type: Wait Examples
"wait_ten_seconds": {
  "Type": "Wait",
  "Seconds": 10,
  "Next": "NextState"
}
"wait_until": {
  "Type": "Wait",
  "Timestamp": "2016-03-14T01:59:00Z",
  "Next": "NextState"
}
"wait_some_seconds": {
  "Type": "Wait",
  "SecondsPath": "$.secondsDelay",
  "Next": "NextState"
}
"wait_until": {
  "Type": "Wait",
  "TimestampPath": "$.expirydate",
  "Next": "NextState"
}
State Type: Parallel
- A state that causes parallel execution of "branches"
- Must have a "Branches" field that is an array of branch objects
- Each branch must have a "StartAt" field
- Each branch must have a "States" field
- A branch's State may have a "Next" field pointing to a State in that branch's States array, but not outside it
- The Parallel State's "Next" field is not processed until all branches complete
State Type: Parallel (cont'd)
- Any uncaught error or transitioning to a Fail state in a branch fails the whole Parallel State
- If the Parallel State does not handle the error, the entire state machine will be marked as failed
- Parallel State passes the input (or InputPath) to each branch's "StartAt" state
- Parallel State aggregates each branch's output into an output array; branch outputs need not have the same shape
Parallel State Example 1
"LookupCustomerInfo": {
  "Type": "Parallel",
  "Branches": [
    {
      "StartAt": "LookupAddress",
      "States": {
        "LookupAddress": {
          "Type": "Task",
          "End": true
        }
      }
    },
    {
      "StartAt": "LookupPhone",
      "States": {
        "LookupPhone": {
          "Type": "Task",
          "End": true
        }
      }
    }
  ],
  "Next": "NextState"
}
Parallel State Example 2
"FunWithMath": {
  "Type": "Parallel",
  "Branches": [
    {
      "StartAt": "Add",
      "States": {
        "Add": {
          "Type": "Task",
          "Resource": "arn:aws:swf:::task:Add",
          "End": true
        }
      }
    },
    {
      "StartAt": "Subtract",
      "States": {
        "Subtract": {
          "Type": "Task",
          "Resource": "arn:aws:swf:::task:Subtract",
          "End": true
        }
      }
    }
  ],
  "Next": "NextState"
}
Input: [3, 2]
Output: [5, 1]
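The fan-out and aggregation behavior of the FunWithMath example can be mimicked locally with a thread pool. This is only a sketch of the data flow: each branch is reduced to a plain Python callable, which is an assumption made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel(branches, state_input):
    """Give every branch the same input; collect outputs in branch order."""
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = [pool.submit(branch, state_input) for branch in branches]
        # result() re-raises a branch's exception, so one failing branch
        # fails the whole call, much like an uncaught error in a Parallel state.
        return [future.result() for future in futures]
```

With an add branch and a subtract branch, the input `[3, 2]` yields `[5, 1]`, one element per branch.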
State Type: Succeed
- A state that terminates a state machine successfully
- A terminal state; i.e. has no "Next" field
- Useful as candidates for "Next" states in Choice States
"Completed": { "Type": "Succeed" }
State Type: Fail
- A state that terminates a state machine as failed
- A terminal state; i.e. has no "Next" field
- Must have a field named "Error" that specifies error name
- Must have a field named "Cause" used to provide human readable message
State Type: Fail Example
"FailState": { "Type": "Fail", "Error": "States.Timeout", "Cause": "Report generation timed out" }
A Batch Job Example
State machine example
{
  "Comment": "An example of the Amazon States Language.",
  "StartAt": "Submit Job",
  "States": {
    "Submit Job": {
      "Type": "Task",
      "ResultPath": "$.guid",
      "Next": "Wait X Seconds",
      "Retry": [
        {
          "ErrorEquals": ["States.ALL"],
          "IntervalSeconds": 1,
          "MaxAttempts": 3,
          "BackoffRate": 2
        }
      ]
    },
    "Wait X Seconds": {
      "Type": "Wait",
      "SecondsPath": "$.wait_time",
      "Next": "Get Job Status"
    },
State machine example cont'd
    "Get Job Status": {
      "Type": "Task",
      "Next": "Job Complete?",
      "InputPath": "$.guid",
      "ResultPath": "$.status",
      "Retry": [
        {
          "ErrorEquals": ["States.ALL"],
          "IntervalSeconds": 1,
          "MaxAttempts": 3,
          "BackoffRate": 2
        }
      ]
    },
    "Job Complete?": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.status",
          "StringEquals": "FAILED",
          "Next": "Job Failed"
        },
        {
          "Variable": "$.status",
          "StringEquals": "SUCCEEDED",
          "Next": "Get Final Job Status"
        }
      ],
      "Default": "Wait X Seconds"
    },
State machine example cont'd
    "Job Failed": {
      "Type": "Fail",
      "Cause": "AWS Batch Job Failed",
      "Error": "DescribeJob returned FAILED"
    },
    "Get Final Job Status": {
      "Type": "Task",
      "InputPath": "$.guid",
      "End": true,
      "Retry": [
        {
          "ErrorEquals": ["States.ALL"],
          "IntervalSeconds": 1,
          "MaxAttempts": 3,
          "BackoffRate": 2
        }
      ]
    }
  }
}
A Simple State Machine
A Simple State Machine
{
  "Comment": "A simple AWS Batch workflow",
  "StartAt": "Submit Job",
  "States": {
    "Submit Job": {
      "Type": "Task",
      "ResultPath": "$.guid",
      "Next": "Wait X Seconds",
      "Retry": [
        {
          "ErrorEquals": ["States.ALL"],
          "IntervalSeconds": 1,
          "MaxAttempts": 3,
          "BackoffRate": 2
        }
      ]
    },
A Simple State Machine
    "Wait X Seconds": {
      "Type": "Wait",
      "SecondsPath": "$.wait_time",
      "Next": "Get Job Status"
    },
A Simple State Machine
    "Get Job Status": {
      "Type": "Task",
      "Next": "Job Complete?",
      "InputPath": "$.guid",
      "ResultPath": "$.status",
      "Retry": [
        {
          "ErrorEquals": ["States.ALL"],
          "IntervalSeconds": 1,
          "MaxAttempts": 3,
          "BackoffRate": 2
        }
      ]
    },
A Simple State Machine
    "Job Complete?": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.status",
          "StringEquals": "FAILED",
          "Next": "Job Failed"
        },
        {
          "Variable": "$.status",
          "StringEquals": "SUCCEEDED",
          "Next": "Get Final Job Status"
        }
      ],
      "Default": "Wait X Seconds"
    },
A Simple State Machine
    "Get Final Job Status": {
      "Type": "Task",
      "InputPath": "$.guid",
      "End": true,
      "Retry": [
        {
          "ErrorEquals": ["States.ALL"],
          "IntervalSeconds": 1,
          "MaxAttempts": 3,
          "BackoffRate": 2
        }
      ]
    },
A Simple State Machine
    "Job Failed": {
      "Type": "Fail",
      "Cause": "AWS Batch Job Failed",
      "Error": "DescribeJob returned FAILED"
    }
  }
}
Service Integrations
Step Functions works directly with some AWS Services
No Lambda required!
Launch a Batch Job and consume its results
Insert or get a record from DynamoDB
Publish to SNS topic or to a SQS queue
Even launch another Step Functions State Machine
Service Integration Patterns
- Request / Response
- Run a job
- Wait for a callback with the Task Token
Request / Response Pattern
- Request sent to resource
- When HTTP response received, transition to next state
- Will not wait for job to complete
"Send message to SNS":{
"Message":"Hello from Step Functions!"
"Run a Job" Pattern
- Request sent to resource identified by ARN with .sync
- Wait for job to complete
"Manage Batch task": {
  "Type": "Task",
  "Resource": "arn:aws:states:::batch:submitJob.sync",
  "Parameters": {
    "JobDefinition": "arn:aws:batch:us-east-2:123456789012:job-definition/testJobDefinition",
    "JobName": "testJob",
    "JobQueue": "arn:aws:batch:us-east-2:123456789012:job-queue/testQueue"
  },
  "Next": "NEXT_STATE"
}
"Wait for callback with Task Token" Pattern
- Request sent to resource with ARN ending in .waitForTaskToken
- Pass a TaskToken parameter
- Wait for a callback to the SendTaskSuccess or SendTaskFailure API endpoint
- The callback must contain the TaskToken
"Wait for callback with Task Token" Pattern Example
Post Message w/ TT
Pull Message
Step F()
Callback w/ TT
"Wait for callback with Task Token" Pattern Example
"Send message to SQS": {
  "Type": "Task",
  "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
  "Parameters": {
    "QueueUrl": "",
    "MessageBody": {
      "Message": "Hello from Step Functions!",
      "TaskToken.$": "$$.Task.Token"
    }
  },
  "Next": "NEXT_STATE"
}
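On the receiving side, a worker that pulls the SQS message has to hand the task token back. The sketch below assumes the message body arrives as JSON with "Message" and "TaskToken" fields, as in the example above, and that `sfn_client` behaves like a boto3 Step Functions client (in real use, `boto3.client("stepfunctions")`); `handle_message` and `do_work` are hypothetical names.

```python
import json


def handle_message(message_body: str, sfn_client, do_work):
    """Process one queued message and report the outcome using its task token."""
    body = json.loads(message_body)
    token = body["TaskToken"]
    try:
        result = do_work(body["Message"])
    except Exception as exc:
        # Failure callback: the waiting state fails with this error and cause.
        sfn_client.send_task_failure(
            taskToken=token, error=type(exc).__name__, cause=str(exc)
        )
    else:
        # Success callback: output must be a JSON string.
        sfn_client.send_task_success(taskToken=token, output=json.dumps(result))
```

Passing the client in as a parameter keeps the function easy to exercise with a stub, without AWS credentials.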
Service Integrations
Developing Step Functions
- AWS Console (Demo)
- Step Functions Local
- Cloud Development Kit (CDK)
Console Demos
Step Functions Local
Download and run a Java JAR
- java -jar StepFunctionsLocal.jar -v
Download and Run a Docker image
- docker pull amazon/aws-stepfunctions-local
- docker run -p 8083:8083 amazon/aws-stepfunctions-local
Step Functions Local - JAR
Step Functions Local - Docker
docker run -p 8083:8083 \
  --env-file aws-stepfunctions-local-credentials.txt \
  amazon/aws-stepfunctions-local
AWS Cloud Development Kit (CDK) - (Deep Dive in October!)
import sfn = require('@aws-cdk/aws-stepfunctions');
import tasks = require('@aws-cdk/aws-stepfunctions-tasks');
// The next two imports assume CDK v1 package names
import lambda = require('@aws-cdk/aws-lambda');
import { Duration } from '@aws-cdk/core';

const submitLambda = new lambda.Function(this, 'SubmitLambda', { ... });
const getStatusLambda = new lambda.Function(this, 'CheckLambda', { ... });

const submitJob = new sfn.Task(this, 'Submit Job', {
  task: new tasks.InvokeFunction(submitLambda),
  // Put Lambda's result here in the execution's state object
  resultPath: '$.guid',
});

const waitX = new sfn.Wait(this, 'Wait X Seconds', {
  duration: sfn.WaitDuration.secondsPath('$.wait_time'),
});

const getStatus = new sfn.Task(this, 'Get Job Status', {
  task: new tasks.InvokeFunction(getStatusLambda),
  // Pass just the field named "guid" into the Lambda, put the
  // Lambda's result in a field called "status"
  inputPath: '$.guid',
  resultPath: '$.status',
});

const jobFailed = new sfn.Fail(this, 'Job Failed', {
  cause: 'AWS Batch Job Failed',
  error: 'DescribeJob returned FAILED',
});

const finalStatus = new sfn.Task(this, 'Get Final Job Status', {
  task: new tasks.InvokeFunction(getStatusLambda),
  // Use "guid" field as input, output of the Lambda becomes the
  // entire state machine output.
  inputPath: '$.guid',
});

const definition = submitJob
  .next(waitX)
  .next(getStatus)
  .next(new sfn.Choice(this, 'Job Complete?')
    // Look at the "status" field
    .when(sfn.Condition.stringEquals('$.status', 'FAILED'), jobFailed)
    .when(sfn.Condition.stringEquals('$.status', 'SUCCEEDED'), finalStatus)
    // Otherwise keep polling, like "Default": "Wait X Seconds" in the JSON version
    .otherwise(waitX));

new sfn.StateMachine(this, 'StateMachine', {
  definition,
  timeout: Duration.minutes(5)
});
- AWS Step Functions represent a shared canvas for discussions with business
- They help us decouple workflow management from core business logic
- They have robust error and retry support
- They have very limited programming logic support
- Home page
- Pricing
- Limits
- Service Integration Details
Introduction to AWS Step Functions
By Jack Frosch