Learning DynamoDB is hard work! I hear from many developers who dive in because they have heard about its performance, its scalability and its simple API model, but who back away when it stops working for them: they have tried to use it like a SQL database, found it too difficult to evolve as their understanding of requirements changed, and simply not invested the time to learn how to use it well.
There are now numerous courses and tutorials covering DynamoDB, whose popularity has grown off the back of increasing use of serverless architectures (where DynamoDB is a great fit) and the interesting engineering challenge that comes from using single-table design (popularised by Rick Houlihan’s AWS re:Invent talks).
That said, I’ve gathered here my own recommendations for learning about DynamoDB and strategies I’ve found useful for using it efficiently and practically.
Spending the time to understand DynamoDB’s capabilities and limits will serve you well when you start modelling a new or existing application’s data structures. This will make it easier to decide how to store certain types of data to optimise its access. Areas worth covering include:
- the difference between partition and sort keys and how they work together to index an item
- what secondary indexes are and how they are used to enable predictable, fast access to items without their primary key
- the different ways to read data
- knowing about the existence of change streams and transactions and what use cases they solve
- having some idea of operational capabilities, such as backups, point-in-time-recovery, encryption, VPC endpoints, etc. which you may never need but are good to know about when your boss inevitably asks you about them.
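To make the first few points a little more concrete, here is a minimal sketch of how partition and sort keys work together when reading data. It only builds the parameters of a Query request as plain objects, so nothing is sent to AWS. The table name, index name and attribute names (`pk`, `sk`, `gsi1pk`) are assumptions made up for the example; the field names follow the shape of the DynamoDB Query API when used with plain (document-style) values.

```typescript
// Subset of the Query request parameters used in this sketch.
interface QueryParams {
  TableName: string;
  IndexName?: string;
  KeyConditionExpression: string;
  ExpressionAttributeNames: { [placeholder: string]: string };
  ExpressionAttributeValues: { [placeholder: string]: string };
}

// Read every item under one partition key whose sort key starts with a
// prefix, for example all orders that belong to a single customer.
function queryByPrefix(table: string, partition: string, sortPrefix: string): QueryParams {
  return {
    TableName: table,
    // The partition key must be matched exactly; the sort key can be
    // narrowed with a range-style condition such as begins_with.
    KeyConditionExpression: "#pk = :pk AND begins_with(#sk, :prefix)",
    ExpressionAttributeNames: { "#pk": "pk", "#sk": "sk" },
    ExpressionAttributeValues: { ":pk": partition, ":prefix": sortPrefix },
  };
}

// The same kind of read through a secondary index, which has its own
// partition key. The "<index>pk" attribute naming is an assumption.
function queryIndex(table: string, index: string, partition: string): QueryParams {
  return {
    TableName: table,
    IndexName: index,
    KeyConditionExpression: "#pk = :pk",
    ExpressionAttributeNames: { "#pk": index + "pk" },
    ExpressionAttributeValues: { ":pk": partition },
  };
}

console.log(queryByPrefix("app-table", "CUSTOMER#42", "ORDER#"));
console.log(queryIndex("app-table", "gsi1", "EMAIL#jo@example.com"));
```

Both functions describe a Query, which reads from a single partition. A Scan, by contrast, reads every item in the table, which is why it pays to know which of the two a given access pattern will need.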
AWS provides a list of excellent resources for learning DynamoDB in their documentation, but curiously neglects this excellent book on DynamoDB modelling, which covers everything above from beginner to professional level.
Single-table modelling, even if you still decide to use multiple tables, is necessary to know in order to make the most of DynamoDB and its performance capabilities. It’s necessary to be able to efficiently implement some one-to-many and many-to-many access patterns, and structuring your data so that your tables can store disparate data types will make it much easier to evolve your application with new data types as the need arises.
Alex DeBrie gives a good overview of the what and when of single-table modelling. Rick Houlihan’s various AWS re:Invent talks on advanced modelling are a good way to get a sense of the possibilities while scaring yourself silly (replay them and pause every so often). The best way to learn, though, is probably Alex DeBrie’s book or one of the many articles available online (such as this one by Forrest Brazeal or this deep-dive video series from AWS).
You probably won’t use every single-table modelling technique from the get-go - the ones I’ve found most useful to be across are:
- Understanding how multiple collections of data can be stored in one table
- Denormalisation techniques for one-to-many relationships
- Storing data and indexes separately by duplicating indexed fields into separate index properties instead of trying to share them (a management layer like dynaglue does this automatically; see the next section)
- Strategies for many-to-many relationships, including how to keep join-collections up to date
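A minimal sketch of the first three techniques follows, assuming a hypothetical table that holds customers and their orders. The key formats (`CUSTOMER#...`, `ORDER#...`), the attribute names (`pk`, `sk`, `gsi1pk`, `gsi1sk`) and the in-memory `queryPartition` helper are all invented for illustration; a real table would be read with a Query request instead.

```typescript
interface Customer { customerId: string; email: string; }
interface Order { orderId: string; customerId: string; placedAt: string; total: number; }

// A stored item: the domain fields plus separate key and index attributes.
type TableItem = {
  pk: string;
  sk: string;
  gsi1pk?: string;
  gsi1sk?: string;
  [field: string]: unknown;
};

function customerItem(c: Customer): TableItem {
  return {
    ...c,
    pk: "CUSTOMER#" + c.customerId,
    sk: "PROFILE",
    // Index attributes are copies of the data fields rather than the
    // fields themselves, so the data can change shape without breaking
    // the index.
    gsi1pk: "EMAIL#" + c.email,
    gsi1sk: "CUSTOMER#" + c.customerId,
  };
}

function orderItem(o: Order): TableItem {
  return {
    ...o,
    // Orders share the customer's partition, which models the
    // one-to-many relationship as a single item collection.
    pk: "CUSTOMER#" + o.customerId,
    sk: "ORDER#" + o.placedAt + "#" + o.orderId,
  };
}

// In-memory stand-in for a Query on the primary key, for illustration only.
function queryPartition(items: TableItem[], pk: string, skPrefix: string): TableItem[] {
  return items
    .filter((i) => i.pk === pk && i.sk.indexOf(skPrefix) === 0)
    .sort((a, b) => (a.sk < b.sk ? -1 : a.sk > b.sk ? 1 : 0));
}

const exampleTable: TableItem[] = [
  customerItem({ customerId: "42", email: "jo@example.com" }),
  orderItem({ orderId: "o-1", customerId: "42", placedAt: "2023-02-01T09:30:00Z", total: 30 }),
  orderItem({ orderId: "o-2", customerId: "42", placedAt: "2023-01-05T10:00:00Z", total: 12 }),
];

// One partition serves "all orders for a customer" (sorted by time) as well
// as "the customer and everything they own".
console.log(queryPartition(exampleTable, "CUSTOMER#42", "ORDER#"));
console.log(queryPartition(exampleTable, "CUSTOMER#42", ""));
```

Because the timestamp leads the order's sort key, the orders come back in creation order without any extra index.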
Object-relational mapping (ORM) frameworks have a bad reputation for increasing complexity without improving the usability of SQL databases from application code, and rightly so. At the same time, they’ve “poisoned the well” (so to speak) for the advocacy of tools that abstract away from the database layer.
Instead, what has emerged is either techniques for building directly against the application layer, or the adaptation of database technology to accommodate the desires of application developers, such as NoSQL databases, the ability to store unstructured JSON data in database columns and query it with SQL, JSON Schema support in the database layer, etc.
With DynamoDB, most new features have been enrichments to existing behaviour, not new ways of querying data that make it easier to query your data model (beyond things like PartiQL which give a familiar SQL interface to DynamoDB but also hide non-performant queries instead of making them obvious to the developer).
I’ve found using a lighter abstraction layer very helpful with DynamoDB, especially when dealing with single-table designs, as it’s made it much quicker and less error-prone to create my queries and to ensure that my indexes are populated correctly.
It’s also helped remove a lot of the complexity and issues of coding directly against the DynamoDB API, but without hiding any performance problems (which is easy to do with PartiQL).
As a result, I developed dynaglue for this purpose, and highly recommend the use of something like it when developing with DynamoDB (alternatives I’ve seen include DynamoDB Toolbox, TypeORM and DynamoDB OneTable).
You should know what you are writing to and reading back from your tables, and I would argue that using a type-aware language is not enough. This is especially true if you are just casting between a JSON structure and an internal type in TypeScript, which gives you no protection against discrepancies when different pieces of code access the same data, or when you need to perform transformations before you read or write data.
Just as you would declare a schema for your SQL tables using DDL, maintaining a JSON Schema and validating against it before writing to the database helps to maintain data integrity. It makes sure that you do not accidentally introduce new properties without validating their type first, or change their definitions inadvertently between code versions. The structure of your data becomes explicit and visible during code changes.
(On an additional note, although it would seem convenient, I would caution against re-using schema declarations you may have created for your API in your database schema validation (or vice-versa): sharing schemas for two different purposes makes it hard to evolve them separately, and introduces the risk that you change one or the other in incompatible ways. Always maintain separate schemas for your data storage and API layers - future you will be thankful!)
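As a rough sketch of validating before writing, the example below checks a value against a storage schema and refuses to continue if it does not match. To stay dependency-free it uses a tiny hand-rolled check that understands only a small subset of JSON Schema (typed properties, `required`, and no additional properties); it is a stand-in, and in practice a full JSON Schema validator library such as Ajv would take its place. The `orderStorageSchema` and `prepareForWrite` names are hypothetical.

```typescript
// A tiny subset of JSON Schema: an object with typed properties.
interface ObjectSchema {
  type: "object";
  required: string[];
  additionalProperties: false;
  properties: { [name: string]: { type: "string" | "number" | "boolean" } };
}

// Schema for what is stored, kept separate from any API schema.
const orderStorageSchema: ObjectSchema = {
  type: "object",
  required: ["orderId", "total"],
  additionalProperties: false,
  properties: {
    orderId: { type: "string" },
    total: { type: "number" },
    giftWrapped: { type: "boolean" },
  },
};

// Returns a list of problems; an empty list means the value is acceptable.
function validate(schema: ObjectSchema, value: { [name: string]: unknown }): string[] {
  const errors: string[] = [];
  for (const key of schema.required) {
    if (!(key in value)) errors.push("missing required property: " + key);
  }
  for (const key of Object.keys(value)) {
    const rule = schema.properties[key];
    if (!rule) {
      // Catches properties introduced by accident.
      errors.push("unexpected property: " + key);
    } else if (typeof value[key] !== rule.type) {
      errors.push("wrong type for " + key + ": expected " + rule.type);
    }
  }
  return errors;
}

// Hypothetical write path: validate first, and refuse to store bad data.
function prepareForWrite(value: { [name: string]: unknown }): { [name: string]: unknown } {
  const errors = validate(orderStorageSchema, value);
  if (errors.length > 0) {
    throw new Error("refusing to write: " + errors.join("; "));
  }
  return value;
}

console.log(validate(orderStorageSchema, { orderId: "o-1", total: "12.50", coupon: "X" }));
```

A plain TypeScript cast would have accepted the last value without complaint, which is the gap the runtime check is meant to close.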
One of the ways you help maintain a performant DynamoDB table is by ensuring that your partition keys are as unique as possible, so that DynamoDB can distribute your documents across multiple partitions as your data grows. This is obviously hard with time-series data (the DynamoDB manual has specific advice for handling time-series access patterns), but it is relatively straightforward for typical transaction records.
The ideal document identifier can be generated in a distributed fashion without hash collisions, and is random enough that it is hard to guess and will spread evenly across DynamoDB partitions at high data volumes. Even if you are not planning to build tables that run to hundreds of gigabytes, you should still ensure that your keys are random enough to avoid hash collisions.
Another desirable property of identifiers is being naturally sorted based on creation order, which can be achieved with identifiers that are auto-incrementing or encode a timestamp into them.
In SQL databases, an incrementing counter is often used, but this is hard with DynamoDB, especially with distributed clients, because DynamoDB does not have native functionality to produce these values, and generating them in a distributed fashion is difficult. SQL databases can do this efficiently because there is typically one server and these types of operations are done internally (as with all other query operations, such as access-planning, which is why SQL has more trouble scaling in a predictable manner, as heterogeneous workloads are handled on the same compute layer).
Using timestamps is a poor alternative as well, as even with millisecond resolution, the chance of hash collisions is quite high, especially if you are writing records at a rapid rate, and the records will be grouped close together. I’ve seen this pattern before, and the result is ugly - it’s very easy to overwrite existing records without realising, and to generate bizarre long-lived behaviour in your system caused by the hash collision.
UUIDs are a step up, but can be a bit cumbersome to use in their typical string representation as they are quite long being encoded in hex. They also don’t have any natural sort ordering, but they are a reasonable start to generating unique identifiers in a safe way.
MongoDB uses BSON-IDs for its records, which give pseudo-random values that are ordered by creation time. These are great for single-table designs in serverless architectures as they can be generated in a distributed fashion, and when stored in a sort key give a natural index ordered by document creation.
If time-series ordering is not important to you, functions like nanoid can achieve much of the same outcomes without needing to worry about hash collisions (and while being URL-friendly too).
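To illustrate the idea of a time-ordered, pseudo-random identifier, here is a minimal sketch. It is not MongoDB’s actual algorithm: it simply puts a 4-byte creation time (in seconds) in front of 8 random bytes and hex-encodes the result. The `timeOrderedId` name is hypothetical, and `Math.random` is used only to keep the sketch dependency-free; an implementation you rely on should draw from a cryptographically secure random source or use an established library.

```typescript
// Builds a 24-character hex identifier: 8 characters of creation time
// (seconds since the epoch) followed by 16 characters of randomness.
// The clock and random source are parameters so they can be swapped out.
function timeOrderedId(
  nowMs: number = Date.now(),
  random: () => number = Math.random
): string {
  const seconds = Math.floor(nowMs / 1000);
  // Fixed-width hex, so that string order matches numeric order.
  const timePart = ("00000000" + seconds.toString(16)).slice(-8);
  let randomPart = "";
  for (let i = 0; i < 8; i++) {
    const byte = Math.floor(random() * 256);
    randomPart += ("0" + byte.toString(16)).slice(-2);
  }
  return timePart + randomPart;
}

console.log(timeOrderedId());
```

Because the time prefix has a fixed width, identifiers created in different seconds sort in creation order when compared as strings, which is what makes them useful in a sort key. Identifiers created within the same second are ordered only by their random part.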
It can be hard with agile-development methodologies to get into the habit of planning architecturally ahead, but single-table design with DynamoDB typically assumes you know all of your access patterns upfront in order to structure your data optimally.
In reality, you don’t need to know everything upfront, but it is ideal to have some idea about how your data will be accessed in the short to medium term. The advantages of doing so include:
- indexing fields beforehand - this prevents needing to back-fill later (especially while your data set is small),
- pre-allocating indexes before they are used - even if you don’t fill them, CloudFormation only permits deploying one new Global Secondary Index (GSI) at a time, so it can be good to look ahead and add unused indexes before you need them so you aren’t coordinating multiple deploys to get them in place before you deliver code that requires them
- managing document size - not only thinking about how your data is accessed, but how it will grow, can help you make earlier decisions about splitting data into multiple collections or simply storing it in your original document. While it is always easy to add new collections later, finding that your documents are growing in size as time goes on means that you still need to perform a painful data migration step when you want to refactor it
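The second point can be sketched as a table definition that is generated with a few spare indexes from the start. The function below returns a plain object using the property names shared by the DynamoDB CreateTable API and the CloudFormation `AWS::DynamoDB::Table` resource; the `tableWithSpareIndexes` helper and the `gsiNpk`/`gsiNsk` attribute naming are assumptions for the example.

```typescript
interface AttributeDefinition { AttributeName: string; AttributeType: "S"; }
interface KeyElement { AttributeName: string; KeyType: "HASH" | "RANGE"; }
interface GsiDefinition {
  IndexName: string;
  KeySchema: KeyElement[];
  Projection: { ProjectionType: "ALL" };
}
interface TableDefinition {
  TableName: string;
  BillingMode: "PAY_PER_REQUEST";
  AttributeDefinitions: AttributeDefinition[];
  KeySchema: KeyElement[];
  GlobalSecondaryIndexes: GsiDefinition[];
}

// Defines a table with generic string keys and a number of GSIs that can
// sit empty until an access pattern needs them. Expects spareIndexes >= 1.
function tableWithSpareIndexes(tableName: string, spareIndexes: number): TableDefinition {
  const attributes: AttributeDefinition[] = [
    { AttributeName: "pk", AttributeType: "S" },
    { AttributeName: "sk", AttributeType: "S" },
  ];
  const indexes: GsiDefinition[] = [];
  for (let i = 1; i <= spareIndexes; i++) {
    const hashKey = "gsi" + i + "pk";
    const rangeKey = "gsi" + i + "sk";
    // Every attribute used in a key schema has to be declared.
    attributes.push(
      { AttributeName: hashKey, AttributeType: "S" },
      { AttributeName: rangeKey, AttributeType: "S" }
    );
    indexes.push({
      IndexName: "gsi" + i,
      KeySchema: [
        { AttributeName: hashKey, KeyType: "HASH" },
        { AttributeName: rangeKey, KeyType: "RANGE" },
      ],
      Projection: { ProjectionType: "ALL" },
    });
  }
  return {
    TableName: tableName,
    BillingMode: "PAY_PER_REQUEST",
    AttributeDefinitions: attributes,
    KeySchema: [
      { AttributeName: "pk", KeyType: "HASH" },
      { AttributeName: "sk", KeyType: "RANGE" },
    ],
    GlobalSecondaryIndexes: indexes,
  };
}

console.log(JSON.stringify(tableWithSpareIndexes("app-table", 2), null, 2));
```

An item only appears in an index when it carries that index’s key attributes, so an index nobody writes to yet stays empty until the first item populates `gsi2pk` and `gsi2sk`.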
DynamoDB is deliberately structured with a limited set of data management features, which makes it hard to translate relational data modelling with normalisation to its model.
Learning DynamoDB means letting go of a lot of those techniques and learning new techniques to achieve the same principles you require for data integrity, performant access and ease-of-access.
For example, this means using streams to propagate changes to other records or other systems, identifying when you really need full read consistency, and not being afraid to duplicate data between indexes and even other documents when it makes sense for your data patterns.
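As one possible illustration of using streams to keep duplicated data in step, the sketch below inspects a stream event and works out which customer names have changed, so that copies of the name held on other items could then be refreshed. The types are minimal local versions of the stream record format, and the `pk`/`sk`/`name` attributes, the `PROFILE` sort key and the `nameChanges` function are assumptions for the example. It also assumes the stream is configured to include both old and new images.

```typescript
// Minimal local types mirroring the parts of a stream event used here.
interface AttributeMap { [attribute: string]: { S?: string; N?: string } }
interface StreamRecord {
  eventName: "INSERT" | "MODIFY" | "REMOVE";
  dynamodb: { Keys: AttributeMap; NewImage?: AttributeMap; OldImage?: AttributeMap };
}
interface StreamEvent { Records: StreamRecord[] }

interface NameUpdate { customerPk: string; newName: string }

// Finds customer profile items whose name changed. Applying the updates to
// the items that hold a copy of the name is left out of this sketch.
function nameChanges(event: StreamEvent): NameUpdate[] {
  const updates: NameUpdate[] = [];
  for (const record of event.Records) {
    if (record.eventName !== "MODIFY") continue;
    const before = record.dynamodb.OldImage;
    const after = record.dynamodb.NewImage;
    if (!before || !after) continue;

    const pk = after.pk && after.pk.S;
    const sk = after.sk && after.sk.S;
    const oldName = before.name && before.name.S;
    const newName = after.name && after.name.S;

    if (pk && sk === "PROFILE" && newName && newName !== oldName) {
      updates.push({ customerPk: pk, newName });
    }
  }
  return updates;
}

const exampleEvent: StreamEvent = {
  Records: [
    {
      eventName: "MODIFY",
      dynamodb: {
        Keys: { pk: { S: "CUSTOMER#42" }, sk: { S: "PROFILE" } },
        OldImage: { pk: { S: "CUSTOMER#42" }, sk: { S: "PROFILE" }, name: { S: "Jo" } },
        NewImage: { pk: { S: "CUSTOMER#42" }, sk: { S: "PROFILE" }, name: { S: "Joanna" } },
      },
    },
  ],
};

console.log(nameChanges(exampleEvent));
```

Keeping the decision logic in a pure function like this makes it easy to test without a live stream; the part that actually writes the updates can then be a thin wrapper around it.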
This even means realising when DynamoDB is a good fit for your use cases, and when something else may be complementary, like Elasticsearch for text-based search, or exporting to S3 and scanning your data with Athena for analytics use cases.
The key part of this is learning about all its features (not just its API operations, even though you should be familiar with every part of them to use them successfully) and when to apply them.