Stack splitting with AWS SAM: Escaping the resource limit for API functions

Serverless changes the game of building applications, making it easier for developers to develop APIs with isolated, single unit-of-work functions that are locked down and scoped to just the resources they need. But as these applications grow, they can become a victim of their own success, and brush up against AWS resource limits.

When you have a CloudFormation stack approaching the resource limit, you may find you can no longer add resources to the stack when you deploy your next change. AWS CloudFormation imposes a maximum of (a rather generous) 500 logical resources per stack.

It's probably fair to say that to reach this point, you would be deploying dozens of APIs, so this can be a fairly niche issue (especially compared to the previous 200-resource limit). However, if you're nearing the stack resource limit, it's essential to address it so you can continue adding to your application and deploying it.

The main approach used to overcome this limitation is to split your existing stack into multiple smaller stacks. This approach is known as stack splitting or stack partitioning. By grouping your resources into separate stacks, you can avoid hitting this limit.

You also gain the ability to selectively switch off parts of your application when deploying it in test or review environments. This can reduce stack deployment time and the number of resources used in the AWS account (for which AWS also imposes limits), even though most serverless resources cost nothing to deploy and run if they are unused.
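As a sketch, one way to toggle a part of the application per environment is a condition on a child stack (the DeployReporting parameter and ReportingStack resource here are hypothetical names, not from the original application):

```yaml
Parameters:
  DeployReporting:
    Type: String
    AllowedValues: ['true', 'false']
    Default: 'true'

Conditions:
  ReportingEnabled: !Equals [!Ref DeployReporting, 'true']

Resources:
  # The reporting child stack is only created when the condition holds;
  # pass DeployReporting=false when deploying review environments
  ReportingStack:
    Type: AWS::Serverless::Application
    Condition: ReportingEnabled
    Properties:
      Location: ./reporting/template.yaml
```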

Stack splitting with AWS SAM

The process of splitting up a SAM stack with multiple APIs on an API Gateway is a bit more involved than simply moving the resources to child stacks or separate stacks.

Firstly, SAM imposes another limitation: functions declared with AWS::Serverless::Function that register API Gateway endpoints with an Events source handler of type Api cannot be moved to other stacks.

(I speculate that this limitation exists because SAM requires knowledge of each of your Lambda functions and their API endpoints to know when it needs to update the associated AWS::ApiGateway::Deployment resource on each deployment, which is an optimisation.)

This is obviously a problem if your API has grown and you are bumping up against the resource limit. You can move non-function related resources as a stop-gap, but eventually you will want to split up your API functions too.

It doesn't help that each function resource connected to an API Gateway actually consumes multiple resources, bringing you close to the limit with far fewer than 200 API functions:

  • Function (AWS::Lambda::Function): the underlying lambda function resource
  • Log Group (AWS::Logs::LogGroup): logging resource
  • IAM Role (AWS::IAM::Role): grants your function permission to access other AWS resources
  • Function Permission (AWS::Lambda::Permission): grants access to API Gateway to call the function
  • Function Version (AWS::Lambda::Version): only when using function versioning, which is needed for provisioned concurrency

One solution is to create multiple API Gateways across multiple stacks, but this is a more significant refactor of your application. It also requires your API consumers to be updated to connect to the new API endpoints - a huge hurdle to clear if your API is used by external consumers, or if you want to avoid downtime or running multiple versions of your API for a while.

Secondly, we have to replace some of the "magic" (or underlying resource generation) that SAM provides with manual resource declaration. This includes declaring our API endpoints in an OpenAPI specification and creating the Lambda Permission resources manually.

Finally, we need to consider some general issues when moving things between stacks, which I'll describe next.

A high-level process for stack splitting

  1. Identify logical separations: analyze your existing stack and identify logical groupings of resources that can be split into separate stacks. This is typically based on functionality or system domains (to borrow terminology from Domain Driven Design).

  2. Define stack dependencies: determine the dependencies between the resources in your existing stack.

    It will be easier to group resources together that are related and have the same dependencies. Failing to do this well means you will have many cross-stack dependencies, or end up with cycles (which have to be broken); a poor grouping can also be complex to manage and will increase your application's deployment time.

  3. Create new stacks: once you have identified the logical separations and dependencies, create new CloudFormation stacks for each group of resources. Each new stack will represent a subset of the original stack's resources. These could be sub-stacks or separately deployed stacks (each comes with its own advantages and challenges).

  4. Update templates and parameters: modify the CloudFormation templates for each new stack to reflect the resources it will include. Update any relevant parameters or resource references to accommodate the separation.

  5. Update stack creation process (separate stacks only): it's important that the stack deployment order reflects their dependencies, as otherwise your exported stack outputs will not be available to dependent stacks.

  6. Refactor stack outputs and references: when your original stack's outputs were being used by other systems or CloudFormation stacks, ensure that these outputs are still accessible by updating the references in the consuming resources or stacks to point to the corresponding outputs in the new stacks.

    (This can be a delicate process, because you won't be able to simply move outputs on existing stacks in production environments if they are in use by other stacks - you have to generate new output names and use those in your refactored stacks while keeping the existing ones in place)

  7. Test and validate: test the new stack deployments thoroughly to ensure that all resources are created and function correctly. Validate the inter-stack communication and any cross-stack dependencies to guarantee the system's integrity. Validate that you can update your stack with your changes, and not merely deploy them from scratch.
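To illustrate step 6, here is a sketch of how an in-use output can be preserved while a new export is introduced for the refactored stacks (the output and export names are hypothetical):

```yaml
Outputs:
  # Existing export, still imported by other stacks - leave it untouched
  TodoPostFunctionArn:
    Value: !GetAtt TodoPostFunction.Arn
    Export:
      Name: TodoPostFunctionArn
  # New export for the refactored stacks to import instead
  TodoPostFunctionArnV2:
    Value: !GetAtt TodoPostFunction.Arn
    Export:
      Name: !Sub '${AWS::StackName}-TodoPostFunctionArn'
```

Once no other stack imports the old export, it can be removed in a later deployment (CloudFormation will block the removal of an export while it is still in use).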

Technical implementation

In the sections below, I go into more detail about how to implement stack splitting with SAM-based templates.

Defining a DefinitionBody on your API

The first step is defining the DefinitionBody attribute of your AWS::Serverless::Api resource with your function endpoints. This is an extended version of the OpenAPI specification with Amazon API Gateway extensions. (If you already specify an OpenAPI definition body, you can just extend it as described below.)

For each API endpoint, you need to define the x-amazon-apigateway-integration field. It is a matter of translating the Type: Api event handler for each function.

For example, imagine you had a POST /v1/todos endpoint defined on your function like so:

TodoPostFunction:
  Type: AWS::Serverless::Function
  Properties:
    Events:
      ApiEvent:
        Type: Api
        Properties:
          Path: /v1/todos
          Method: POST

You would then update your API resource DefinitionBody to contain the endpoint as follows:

Api:
  Type: AWS::Serverless::Api
  Properties:
    # ...
    DefinitionBody:
      openapi: 3.0.0
      info:
        version: '1.0'
        title: My Api
      paths:
        '/v1/todos': # API path
          post: # API Method
            x-amazon-apigateway-integration:
              # This is always POST - it is the
              # AWS API call that is referenced
              httpMethod: POST
              type: aws_proxy
              uri: !Sub >-
                arn:aws:apigateway:${AWS::Region}:lambda:path/2015-03-31/functions/${TodoPostFunction.Arn}/invocations
            responses: {}

If you are translating a lot of APIs, remember to deploy regularly to validate that your definitions are interpreted correctly.

Removing Events handler and creating Lambda Permission

The next step is to remove the Type: Api event handler (and the whole Events block, if the function only has one handler) for each function. When you do this, SAM will stop automatically declaring the AWS::Lambda::Permission resource for your function.

Thankfully, these are mostly boilerplate and can be easily added as below.

TodoPostFunctionPermission:
  Type: AWS::Lambda::Permission
  Properties:
    Action: lambda:InvokeFunction
    FunctionName: !Ref TodoPostFunction
    Principal: apigateway.amazonaws.com
    SourceAccount: !Ref AWS::AccountId
    SourceArn: !Sub arn:aws:execute-api:${AWS::Region}:${AWS::AccountId}:*

(One thing to note about this method: it does relax the SourceArn beyond what SAM normally generates. SAM limits the invocation to just your API endpoint, but this change permits any API Gateway endpoint in your account to invoke the function. You should make sure this is permissible within your security requirements.)
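If that relaxation is not acceptable, one way to scope the permission back down is to pass the API's ID into the stack that declares the permission (the ApiId parameter here is an assumption, not something SAM provides automatically):

```yaml
Parameters:
  ApiId:
    Type: String  # e.g. passed as !Ref Api from the stack declaring the API

# ... then, on the permission resource:
    # Restrict invocation to one method and path on one API,
    # across any stage (the * segment)
    SourceArn: !Sub arn:aws:execute-api:${AWS::Region}:${AWS::AccountId}:${ApiId}/*/POST/v1/todos
```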

You should deploy a test stack at this stage and validate that everything is working - it is easy to accidentally forget to define an AWS::Lambda::Permission or to specify the wrong endpoint in the DefinitionBody!

Moving functions

Once you're satisfied with your migration away from the Type: Api event source handler, you can begin moving your functions to substacks.

For most functions, this is a matter of:

  1. Moving the AWS::Serverless::Function and AWS::Lambda::Permission implementation to a child stack or another stack

  2. Updating the parameters of the child stack or other stack to match those needed by the function declaration, and making sure they are passed in via their parent stack and/or via the deployment process

  3. Updating the references to your function:

    a. child stacks: add an Output declaration to your child stack exporting your function ARN

    Outputs:
      TodoPostFunctionArn:
        Value: !GetAtt TodoPostFunction.Arn
    

    then, replace the function reference in your main stack:

    uri: !Sub >-
      arn:aws:apigateway:${AWS::Region}:lambda:path/2015-03-31/functions/${MyChildStack.Outputs.TodoPostFunctionArn}/invocations
    

    b. separate stacks: add an Output declaration with an Export name to your separate stack, exporting the function ARN:

    Outputs:
      TodoPostFunctionArn:
        Value: !GetAtt TodoPostFunction.Arn
        Export:
          # We've substituted the stack name into the export name,
          # but you don't need to do this
          Name: !Sub '${AWS::StackName}-TodoPostFunctionArn'

    then, replace the function reference in your API stack:

    uri:
      Fn::Sub:
        - arn:aws:apigateway:${AWS::Region}:lambda:path/2015-03-31/functions/${TodoPostFunctionArn}/invocations
        - TodoPostFunctionArn: !ImportValue 'MySeparateStackName-TodoPostFunctionArn'
    

    You will also need to make sure that your separate stack can be deployed before your API stack (you will not be able to have mutual references).
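If you go the child-stack route, the MyChildStack referenced above can be declared in the parent template with an AWS::Serverless::Application resource (the file path and the Environment parameter here are assumptions):

```yaml
MyChildStack:
  Type: AWS::Serverless::Application
  Properties:
    # sam build/package resolves this local template into a nested
    # AWS::CloudFormation::Stack resource at deploy time
    Location: ./child-stack/template.yaml
    Parameters:
      Environment: !Ref Environment
```

The child stack's outputs are then available in the parent as !GetAtt MyChildStack.Outputs.TodoPostFunctionArn, as shown above.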

IMPORTANT: if you have specified the FunctionName property of your functions (i.e. you have stopped CloudFormation from generating a name), you will need to rename it in the child stack.

This is necessary because CloudFormation cannot tell that you have moved the resource between stacks; it will attempt to create a resource with the same name in the new stack before it removes the old resource from the old stack, generating a name conflict.

(This is unnecessary if you left the property blank, because CloudFormation generates a unique name for your function.)
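As a sketch, if the original declared an explicit name (todo-post here is a hypothetical name, as are the handler and runtime), the copy moved to the child stack needs a different one:

```yaml
# Original (main stack) declared: FunctionName: todo-post
# Moved copy (child stack) - renamed to avoid the create-time conflict:
TodoPostFunction:
  Type: AWS::Serverless::Function
  Properties:
    FunctionName: todo-post-api
    Handler: post.handler
    Runtime: nodejs18.x
    CodeUri: src/todos/
```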

Validation

Lastly, it's important to validate that your change can be deployed safely without breaking your stacks or causing a rollback. The best way to do this is with a test version of your stack: deploy the old version first, then deploy your changes as an update.

If you just deploy your changes as a separate stack without doing an update on the old version of your stack, you may miss potential issues that can arise (especially with resource renaming) only when performing updates.