DynamoDB Data Modeling Series
- What really matters in DynamoDB data modeling?
- Which flavor of DynamoDB secondary index should you pick?
- Maximize cost savings and scalability with an optimized DynamoDB secondary index
- Single table design for DynamoDB: The reality (YOU ARE HERE)
You’ve probably heard about “single table design.” It has become a contentious topic, and (as usual) there are two sides to the story. In this fourth (and final) article in a series about DynamoDB data modeling, I’ll talk about the history of “single table design”. Where did it come from? How did it go wrong? And how can you make smart choices about which data belongs in the same table versus separate tables?
If you’ve been following along with the series I’d recommend skimming the prior articles to refresh your memory on some of the core concepts like item collections, secondary index projection, and metering. The context will be important to get a full understanding of the discussion below.
For the impatient, here is the TLDR: the “single table design” techniques and example models used to push an ideal of exactly one table are contrived - not representative of the reality for DynamoDB customers operating critical loads at scale. The details often do not stand up to the scrutiny of service experts, and focusing too heavily on reducing the number of tables is likely to leave you with a costly, scale-constrained, underperforming design. As engineers, we have learned to take a skeptical view of seasonal bandwagons - and “single table design” is no different. You should consider the many downsides before jumping on board and committing your developers to a DynamoDB experience that’s much bumpier than it needs to be.
The optimal number of tables for a particular DynamoDB data model might be one, but often it will be more.
Oh, you’re in for the whole story? Great - let’s start at the beginning.
Is there something to this “single table design” thing?
Yes! There is definitely something important to understand from the early learnings that first prompted the “single table design” phrase. When Amazon.com teams were trying to wrap their heads around DynamoDB as part of their database modernization goals, they tended to bring their third normal form modeling approach along from their relational database experience: they’d create the same number of DynamoDB tables as they would have done in their relational model. Unknowingly, they were building JOINs into their application code instead of making use of two very powerful concepts in DynamoDB: item collections, and schema flexibility. They were missing out on some core DynamoDB benefits and trying to build part of a database engine into their application code without reason!
Getting the best results from DynamoDB requires that you identify opportunities to denormalize - store fully-built results for easy retrieval with no additional work necessary. You can do this using an item or an item collection - not to point out the obvious, but data in the same item or in the same item collection is also (by necessity) in the same table. DynamoDB’s greatest strength is in scaling via storage rather than compute - so denormalize and take advantage! You might find that some post-processing in your application code is necessary to meet advanced requirements - but the good news is that this scales better than doing all that computation on a centralized database node.
Denormalization as a technique is certainly not unique to DynamoDB, but it is of particular importance in DynamoDB. Why? Because if you don’t store information together where it can be accessed in a minimum of requests, you may see higher failure rates and increased tail latency. This is very different from your familiar RDBMS of old, where the data is all available on the same node that’s processing your request - DynamoDB is a distributed system (in the extreme). If you store the data for a particular data operation across a number of different keys, you’ll need to retrieve them via separate client requests. The more separate requests you make the higher the chance that one will encounter a transient problem - they’re likely to live on different partitions (which reside on different nodes, in different racks, and maybe different AZs) - and they’ll go through a different set of network links, routers, switches etc.
Storing data in the same table does not help to address this - except perhaps for extremely small tables when making a Scan call.
What does help is storing related information in the same item or the same item collection (that is, the same partition key value).
Denormalizing in DynamoDB data modeling produced the bonus effect of having less total tables than a standard third normal form pattern. Back when single table design was first being marketed, there was no auto scaling, no on-demand, no adaptive capacity - each table meant more operational burden. So fewer tables held some attraction - however, I argue (and did so from the beginning) that if availability is a high priority, the “fewer tables is always better” claim is bogus. DynamoDB has come a long way, and you definitely want the option to tailor things like capacity mode, auto scaling policy, backups, storage class etc according to the requirements of particular subsets of your data. When you force all your data (even completely unrelated) into one table, you are stuck with just one configuration for all of it.
What went wrong?
The “single table design” phrase was originally intended to encourage Amazon.com teams to take a fresh look at data modeling possibilities when migrating to DynamoDB. In retrospect, it was a choice of words that was too easily misunderstood.
The goal was to convince those Amazon.com teams to denormalize and store related data in an already-JOINed form using schema flexibility and item collections for efficiency. Despite some internal concerns about potential for confusion in the “single table” message, the terminology was presented externally by some as a cure-all. With an AWS marketing assist and the power of social media, it took on a life of its own. Some practitioners took things too literally - making exactly one table (and even “overloaded” global secondary indexes (GSIs) - BAD IDEA) the goal in the designs they evangelized. For promotional purposes, the needless complexity was characterized as wizardry - and everybody on Twitter wanted to say they’d fully grasped it as that was the price of admission to the cool kids club! In the confusion, things like scalability, operability, and complexity were quickly waved away and glossed over with (false) claims of best practice and performance improvement. Blindly applying all of the techniques for pushing all data into just one table for every data model is something that you may get away with in very low throughput use cases - but when scaled out in production it can quickly produce some serious operational regrets.
I’ve worked with hundreds (thousands?) of large scale customers of DynamoDB to review their data models so they reach the fullest levels of success using this powerful database - this includes many AWS service teams who are building critical dependencies for efficiency and scale around DynamoDB. I have seen it work incredibly well, but I’ve also seen all the various modalities of burn when it is misused.
Based on my experience, I would strongly recommend treating the number of tables as an outcome of a good DynamoDB data modeling approach.
Denormalize when possible and advantageous - beyond that think about your operating requirements and standards. Do NOT start with exactly one table as a design constraint or you’ll end up working backwards through a lot of complicated and costly compromises to make it happen. There’s a very strong likelihood that you’d regret those choices in time.
A wise man (hi Krog!) once said, ”just because you can, doesn’t mean you should”. It has always been possible to place any two items in the same DynamoDB table - all they need is the same data type for the primary key and unique values of the key attribute(s). It’s easy to meet these requirements (for a price). But should you choose to? Well, only if it makes it easier and more efficient to manage and make use of that data. Many of the techniques discussed as the “single table design” story gained momentum were aimed at finding ways to cram completely unrelated data (which will never be stored together in an item collection or retrieved together using Query) into the same table.
Reasons to store data in the same table…
- It’s one “record” - that is, it is denormalized into a single item, or into the same item collection (in the table itself or in a secondary index) - this is so that it can be efficiently updated and retrieved - the data is definitely related and you can’t help but keep it in the same table!
- It’s multiple records, but of the same type and authoritatively “owned” by the same service. This is analogous to S3 buckets - if you were running the “invoice” service, you’d probably create a separate S3 bucket for the invoices, right? Why? Isolation reduces risk and you want the flexibility to manage all the data the same way without forcing the same configuration on other data.
The price you pay for “exactly one table”...
- You make adjustments to ensure key uniqueness - for example, adding a text prefix to every partition key value. This increases the size of your items (recall that item size is the most crucial factor in cost efficiency and smooth scalability). It also adds developer complexity and reduces readability. And you are not getting anything in return if neither of the above reasons to store data in the same table are true.
- All the data must have the same Streams configuration, the same backup configuration, the same capacity mode and auto scaling policy, the same TTL attribute name, same table class, same global table replication configuration, same table-level metrics (you lose potentially crucial per-data-type observability). Different types of data often have different operating patterns: different timing of load variations and more or less spikiness; and/or a different balance of reads/writes/storage. You are limiting flexibility in a big way - for what?
- You can’t limit a Scan by type of data and you can’t limit an export to S3 either. These are both very common and useful paths for ETL, and also for occasional bulk updates. As an example, imagine you’ve got an avatar management site with 10k user accounts (1KB items), and 2M 5kB image items. You bundled all those records into the same table (social media told you to do it). Now you want to revise all the account creation timestamps in the account records to store as a number attribute (epoch seconds) instead of a string (ISO 8601). To find those account records, you have to scan or export the entire table, 99.9% of the effort being wasted on records which are irrelevant to your goal.
- Increased blast radius. Even if you’re following the guidance to isolate data stores by service (shared access only via APIs), mixing unrelated types of data within a service in the same table doesn’t make sense. In the avatar management scenario above, let’s imagine that you made a mistake in your bulk update and you pummel one account item to the point that the partition throttles. Yes, DynamoDB will try to split things out to isolate the hot key (this takes time), but in the meantime you’re impacting a portion of your users whose account items live on the same partition - they may have trouble with the account preferences management part of the site. BUT, you’ll also be affecting the experience of a different bunch of users whose avatar access is also being impacted. Wouldn’t it be nice if the dependency upon these unrelated data types was independent? How about if you make a big mistake with your accounts management workflow and you want to restore those account records using point-in-time recovery? Now you’re dealing with a much bigger volume to be recovered and a longer time to recover. When you have unrelated data supporting separate functionalities, “exactly one table” is not your friend.
- Complicated access control. For unrelated data, it’s easier to manage policy at a table level - that means less risk of error and better security.
But I saw something in a video that said…
- “It’s more performant/efficient to get multiple items if they’re in the same table”. This is only true if the items are in the same item collection so you can retrieve them using Query. Otherwise, DynamoDB doesn’t care at all - if you BatchGetItem 10 items from 10 tables, the read unit consumption and latency will be the same as if they were all stored in the same table.
- “Because of some DynamoDB internal stuff, magical things will happen if I just force all my data into one DynamoDB table”. Nope. Does not work that way in real life, sorry.
- “Multiple data types sharing the same auto-scaled capacity in a table will allow for sharing of underutilized provisioned capacity and I can save money”. There might be rare cases where this could save you a tiny margin on your DynamoDB bill, but I haven’t seen it happen. Given that “exactly one table” strategy typically involves paying for larger items, over-projection to secondary indexes and wasted read/write units, this feels quite dubious - and are you willing to put your future operational success at risk to roll the dice? That would not be my recommendation.
The straight scoop
In full disclosure, my work history at Amazon gives me a strong DynamoDB bias. I’m working on different serverless infrastructure services at Momento these days - but I still have deep affection for DynamoDB. It’s an incredible database when applied effectively for the right purpose - and we’re very lucky to have so many choices in the database space these days! The engineering experts who build and operate DynamoDB are phenomenal folks and they have over 10 years of experience continually refining the product for customers. They want you to have success when building around DynamoDB, and I do too.
Unfortunately, the misinterpretation of “single table design” took DynamoDB and some of its customers on a painful tangent. I spoke with countless folks who were about to give up on DynamoDB because the (very reasonable) design they’d come up with had more than one table, and merging them made no sense. I was able to help many of them move past the confusion, but I often wonder how many others have abandoned their interest in a DynamoDB-backed solution because of misguided “single table design” pressure in social media. Until recently, the DynamoDB documentation did not mention “single table design”. Sadly, it was recently updated to include information about the (apparently binary) choice between “single table design” (one heavily misunderstood phrase) and “multi table design” (a truly strange and misguided concept) - so, now it’s clear as mud, right?!
Here’s my advice to you, dear reader: set aside the whole “single table design” debacle. There’s no free lunch in databases and DynamoDB isn’t as different or complicated as some would have you believe. When building your DynamoDB data model, start with your access patterns, recognize that DynamoDB tables allow for schema flexibility, keep items small, and build item collections where appropriate to optimize your read and write unit consumption. Carry the same methodologies through to your secondary indexes. Factor in the operational needs you have. You’ll wind up with N tables. It’s okay - go ahead and prosper.