Prime Video, serverless, monoliths, bad blog posts, and the importance of definitions
I’ve been seeing the Prime Video “Microservices to Monoliths” post ever since it came out, because some of my (so-called) friends have been sharing another, highly biased post that references it. That post is by someone very influential in tech whom I don’t want to boost (initials D and H and H).
The original post is here:
I very much agree with Adrian Cockcroft and his post So many bad takes — What is there to learn from the Prime Video microservices to monolith story.
What the Prime Video team did was quite simple: they built a “serverless first” initial version, and then modified it to use containers when they needed to optimise elements of the system. With more experience, they could have built a better system initially, but that’s what learning is about.
The improved version is still serverless in my view.
It’s not a monolith.
The idea that serverless systems must only contain Lambda functions for compute, or must always use specific technologies and never use others, is wrong.
These words and concepts are badly defined for most people in tech, and most people never interrogate their own bias.
In the same way, we rarely stop to consider what words like “microservices”, “server”, “data centre”, “instance” and “container” actually mean.
This Prime Video tech post became such a lightning rod simply because it was badly written. Its terminology muddles so many terms that, unless you read the whole thing and already understand what serverless is, it is impossible to work out what they are trying to say.
Or, the short version: it’s a really badly written blog post.
What the Prime Video team actually did
In effect, all the team did was replace a Step Functions workload with an ECS workload, and for very good technical reasons.
In essence it is the same system if you look at it as a black box, because the SNS topic and S3 bucket are the outputs for both, and the inputs are events from the customer.
Calling this a “monolith” is not right in my view. It’s event driven, it’s triggered via a Lambda function, it scales using ECS rather than Step Functions, and crucially the team realised something else that makes the ECS version better.
The ECS version is better than the Step Functions version because the task that they were attempting to do — analysing video streams — didn’t work so well within Step Functions, but worked a lot better when the task was run on a single container.
This makes sense, and is explained well in the article. Passing lots of data around a distributed system is slower and more expensive than keeping it in a container and orchestrating it in the container. That’s obvious if you think about it as well.
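To make that concrete, here is a toy Python model of the two shapes. It is only a sketch: every name and number in it (the Workload fields, the per-transition price, the storage round-trip time) is a made-up assumption for illustration, not a figure from the Prime Video post or from AWS pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical per-stream workload; all values are illustrative."""
    frames: int            # frames analysed for one stream
    steps_per_frame: int   # processing stages each frame passes through


def distributed_overhead(w: Workload, price_per_transition: float,
                         store_round_trip_s: float) -> tuple[int, float, float]:
    """Every stage is a separately orchestrated step, and every frame is
    written to and read back from shared storage between stages."""
    transitions = w.frames * w.steps_per_frame
    return (transitions,
            transitions * price_per_transition,
            transitions * store_round_trip_s)


def in_process_overhead(w: Workload, price_per_transition: float,
                        store_round_trip_s: float) -> tuple[int, float, float]:
    """All stages run inside one container, so frames stay in memory.
    Only the single hand-off into the container is counted."""
    transitions = 1
    return (transitions,
            transitions * price_per_transition,
            transitions * store_round_trip_s)


if __name__ == "__main__":
    stream = Workload(frames=1000, steps_per_frame=3)
    print("distributed:", distributed_overhead(stream, 0.001, 0.05))
    print("in-process: ", in_process_overhead(stream, 0.001, 0.05))
```

With these invented inputs, the distributed shape pays the per-transition overhead 3,000 times and the single-container shape pays it once. The real numbers will differ, but the overhead grows with frames multiplied by steps in one case and stays flat in the other.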
Now, Prime Video also hit scaling limits. Not everyone has Prime Video’s problems; most people don’t. But hitting those limits is one of the reasons you would look at moving certain very high-load distributed compute workloads away from AWS Lambda.
Side note: I am going to stress very strongly at this point that very few well-designed workloads actually hit scaling limits with serverless, so scaling limits are almost never a reason to dismiss AWS Lambda or serverless at the design stage. I have never found it an issue with any of the solutions I’ve designed.
If you need to move compute away from AWS Lambda, you move it to autoscaling ECS tasks triggered by AWS Lambda. This is pretty much still serverless, but instead of pay-per-use compute, you are recognising that your workload is likely never going to scale to zero (which is a useful thing to know), so at that point you can optimise for the scale you have by using containers.
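For anyone who hasn’t seen this pattern, here is a minimal Python sketch of a Lambda function that starts an ECS task for each incoming event. This is one simple reading of “triggered by AWS Lambda”, not what the Prime Video team built; an autoscaling ECS service fed by a queue would be another. The cluster, task definition, container and subnet names are hypothetical placeholders, and the Fargate launch type is my assumption. The only AWS call used is the boto3 ECS client’s run_task.

```python
import json
import os


def build_run_task_request(event: dict) -> dict:
    """Translate an incoming event into keyword arguments for ECS RunTask.

    All resource names are hypothetical and come from environment
    variables, with placeholder defaults.
    """
    return {
        "cluster": os.environ.get("CLUSTER", "video-analysis"),
        "taskDefinition": os.environ.get("TASK_DEFINITION", "analyse-stream"),
        "launchType": "FARGATE",  # assumption: not stated in the original post
        "count": 1,
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": os.environ.get("SUBNETS", "subnet-placeholder").split(","),
                "assignPublicIp": "DISABLED",
            }
        },
        "overrides": {
            "containerOverrides": [
                {
                    "name": os.environ.get("CONTAINER", "analyser"),
                    # Hand the triggering event to the container as JSON.
                    "environment": [{"name": "EVENT", "value": json.dumps(event)}],
                }
            ]
        },
    }


def handler(event, context, ecs_client=None):
    """Lambda entry point: start one ECS task for the incoming event."""
    if ecs_client is None:
        import boto3  # imported lazily so the sketch can be exercised without AWS
        ecs_client = boto3.client("ecs")
    response = ecs_client.run_task(**build_run_task_request(event))
    # Return the ARNs of the tasks that were started.
    return [task["taskArn"] for task in response.get("tasks", [])]
```

The Lambda function stays tiny and pay-per-use, while the heavy lifting happens in the container. The ecs_client parameter is only there so that the handler can be tried out with a stand-in client.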
It’s still serverless.
It’s a very good example of refining a serverless solution in my view.
But the Prime Video blog post is still really badly written
What went wrong?
The penultimate part of the post is called “From distributed microservices to a monolith application”.
The final section is called “Results and Takeaways”
“Serverless” is used three times in the blog post. Twice as “serverless components” and once as “serverless architecture”. There is a lot of discussion in the post about a “distributed” approach, separately from the word “serverless”.
Then there are the uses of the words “microservices”, “services” and “distributed systems”.
They describe their initial solution as a “distributed system using serverless components”, say that they then “migrated to a monolith”, and yet also say that “Conceptually, the high-level architecture remained the same”.
And then, they “cloned the service multiple times” (“service” here refers to the container based solution they had built to replace the Step Functions based part of the solution) and then built a higher level orchestration layer to distribute requests.
In other words, they built something that looks a lot like… Step Functions.
It’s all about the words
I hope that you can see that the blog post is incredibly confusing from a terminology point of view.
What is a “serverless component”?
What is a “distributed component”?
What is a “service” or “microservice”?
For that matter what is a “distributed system” as opposed to a “serverless architecture”?
You might think I’m nitpicking.
I am definitely nitpicking.
If you’ve not come across me before, I was the Senior Developer Advocate for Serverless at AWS from 2017–18. I wrote some blog posts in that time. We had rules we had to follow, and the blog posts had to pass some quite serious review procedures (they were called “Chris Munns”).
This blog post wouldn’t have got past first review and would have been politely sent back with many revisions or, more likely, “please review the policies and rewrite”.
It has a mishmash of ill-defined terms, and it’s unclear what the purpose of the article is.
Do you know what the purpose of the article is? If the purpose was to explain how to modify a serverless architecture when you hit scale limits, which is what it should have been about, then the article failed spectacularly.
And worst of all, it can easily be used to criticise AWS, AWS Lambda and serverless, and to undermine the very hard work of another part of Amazon.
And it has done significant damage to the work of the serverless community.
This blog post has been thrown at me by multiple people saying “what do you think of that?” and telling me that even Amazon don’t think “serverless” can do “scale”.
It’s just more ammunition for people who don’t know what they’re talking about to dismiss “serverless” as an option.
It sucks because those of us who have been doing serverless incredibly successfully for years have had to put up with FUD like this from uninformed people for so long. For this blog post to have delivered this kind of destructive ammunition to the tech world from an AMAZON company is just an absolute nightmare.
Prime Video is part of Amazon, but it’s not part of AWS. For some reason many people think that “Amazon” is a single entity and that it’s a big office somewhere in a building in Seattle. I know it’s hard to understand but it’s not that hard. And when people see Amazon, they think that it carries the weight of AWS as well. That’s why it matters.
I don’t think that the blog is under the purview of AWS (pretty sure it’s not) and so AWS likely don’t have editorial control. But I suspect that there has been a very strongly worded statement from AWS to all areas of Amazon that will essentially say “don’t you dare do something like this again”.
AWS is probably hoping that the tech world ignores it, but the fact is that it keeps getting brought up to me.
Why write this blog?
I’m going to point people to this blog post as my response if anyone else asks me about that post or the D and H and H one.
I’ve already got annoyed in other social media forums, but that rarely stays visible.
While I do agree with Adrian, the thing that annoys me about this whole thing is how this all could have been avoided.
When you work at a company like Amazon, your words have consequences. It is one of the reasons that I actually ended up not liking working there. I knew how important the words I wrote and said were, and it caused me stress and anxiety. (I now know this was partly a result of having undiagnosed ADHD.)
That Prime Video blog post feels like someone didn’t understand how powerful words can be when they are linked to a company like Amazon. It also feels like someone didn’t understand how the words would hit in other communities.
It is very hard to write really good content in tech.
This should be a case study in how not to do it.
Because some of us have worked very hard to get it right.