I’ve been pleasantly surprised at the response to my recent posts on Serverless architecture. It’s been exciting to see both validation (thanks) and real questions (again, thanks) about the approach. It’s challenging me to think.
One of the things that has come up is people saying it seems too perfect. Where are the problems? Is it really the holy grail for us programmers to go forward with?
There are both good and bad things with serverless…
My plan is to write some posts on what I’ve learned from a practical point of view. There are issues to resolve and there are things that need to be worked out so that we have a (hopefully) common approach to fixing them for the future.
Learning 1: No servers is a real joy
It’s all about the function. The function is everything. Of course, you have to consider use of technical resources (bloated functions are worse than doing it the old way) but essentially, for each function you create, you don’t have to panic about anything else.
Think of all the things you do with your servers. You set up monitors to tell you if the server is down. You set up notifications for when something big happens on your server (it crashes/fails/poops itself). You consider hardware upgrades over time (if you manage your own). Every public holiday is a potential day for stuff hitting fans. Every holiday you ever take, you need a laptop.
In short, it’s a nightmare.
Imagine a world where you genuinely don’t have to care.
The servers are somebody else’s problem.
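To make this concrete, here’s a minimal sketch of what “just a function” looks like: a Node.js Lambda-style handler. The event shape and greeting logic are invented for illustration; real event shapes depend on the trigger (API Gateway, S3, a queue, and so on).

```javascript
// A minimal Lambda-style handler: the function is the entire deployment unit.
// No server setup, no process management — you write this and hand it over.
const handler = async (event) => {
  // Hypothetical event shape, purely for illustration.
  const name = (event && event.name) || "world";
  return {
    statusCode: 200,
    body: JSON.stringify({ message: `Hello, ${name}` }),
  };
};

exports.handler = handler;
```

Everything that used to be the server — patching, monitoring uptime, capacity — lives outside this file, in somebody else’s datacentre.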
Learning 2: Testing is a pain
We all love unit tests. And TDD. And BDD. And everything else that comes out of the “perfect world of tech” bucket. At least, we love the idea of them and tell everyone we use them all the time. Then we find out that most people don’t have 100% test coverage, and realise we live in an imperfect world.
You’d think that in a single-function world, it would be remarkably simple to test that function. And you’d be right, if the function didn’t have to interact with other stuff, like shared databases and third party services.
Unfortunately, functions are almost never completely isolated. And when they are, there is usually less that needs testing anyway.
At the moment, we are limited in our ability to provide coherent testing frameworks around serverless purely because third party services expect us to trust them and don’t necessarily provide a “test” scenario for us.
It’s fine if you are using a PostgreSQL database and a node.js function that you control, because you can test that. But it’s not quite so straightforward when you use a shared database, because if you test one function you have to test them all. Does my change to one function break another?
Not only that but what if you’re using a third party service that provides you with complex and proprietary data? What if that service does not have a “test” scenario for you to provide test cases against? You are essentially “trusting” the third party to have done your testing for you.
And if you get an SDK from a third party, what happens if that SDK is broken/incomplete/partly untested? We have had more than one scenario where a third party SDK turned out to have a bug (albeit a minor one) that caused a failure in our service. Whose responsibility is it then to fix it?
At present, we’re in a world where serverless is more about trusting than testing. It will change over time, but this is an issue that needs resolution pretty soon.
Testing is harder. Trust the right third parties and you should be fine.
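One pattern that helps with the parts you do control: inject the third party client into your function instead of reaching for it directly, so unit tests can pass in a stub. This is a sketch with invented names (`makeChargeHandler`, `paymentClient`), not a prescription — it tests your logic, but it still can’t tell you whether the real third party behaves as advertised.

```javascript
// Sketch: isolate the function's logic from a third party client so the
// logic is testable without the real service. All names are illustrative.
function makeChargeHandler(paymentClient) {
  return async function chargeHandler(event) {
    if (!event.amount || event.amount <= 0) {
      return { statusCode: 400, body: "Invalid amount" };
    }
    const result = await paymentClient.charge(event.amount);
    return { statusCode: 200, body: JSON.stringify(result) };
  };
}

// In tests, pass a stub instead of the real client:
const stubClient = {
  charge: async (amount) => ({ id: "test-123", amount, status: "ok" }),
};
const handler = makeChargeHandler(stubClient);
```

In production you’d build the handler with the real client; in tests, with the stub. The gap this leaves — whether the real service honours its contract — is exactly the “trusting over testing” problem above.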
Learning 3: Logging, analytics and metrics are quite different
This might sound odd, but an established framework comes with a well-understood set of metrics and logging tools baked into how it is put together. People build metrics dashboards, admin panels and a whole bunch of other tooling for these frameworks.
Serverless is more bespoke in approach. It is a trade off. It’s not necessarily better or worse.
As a crude example, the HTTP request log hasn’t changed much in many years.
However, you don’t have an HTTP request log with serverless. You can’t do
tail -f error_log
and watch what is happening in quite the same way. You lose some of the “hands on” tools that sys admins have used for years. You are at least one step removed from the action.
Without a framework or established patterns, you have to consider your own metrics and analytics more carefully, and log things more deliberately. A big risk is logging too much: if you have to search through masses of “perfectly good” info logs whenever there is a problem, you are wasting time.
It’s not quite so simple to build your admin and identify your metrics either. It’s easier to store data and post-process, but again, your post processing is more bespoke.
The key metrics are things like average speed of a function (GB-seconds are how Lambda is billed) rather than anything else. You definitely have improved granularity in terms of knowing where your issues occur (“function x is running 25% slower than normal”), but actually identifying the problems can be more difficult without appropriate logging.
I will say here that AWS CloudWatch is very helpful in providing some of these types of metrics at a glance. You can see invocations of AWS Lambda functions and their average response times, and that is genuinely most of what you need.
Identify your key metrics and build up from there.
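As an illustration of “logging appropriately”, one common approach is single-line structured JSON logs, which log tooling such as CloudWatch Logs can filter on by field rather than by grepping free text. The helper and field names below are assumptions, just to show the shape:

```javascript
// Sketch: structured, single-line JSON logging so searches can filter on
// fields instead of free text. Field names here are illustrative.
function logMetric(fn, level, message, extra = {}) {
  const entry = {
    ts: new Date().toISOString(),
    fn,      // which function emitted this
    level,   // "info" | "warn" | "error"
    message,
    ...extra,
  };
  console.log(JSON.stringify(entry));
  return entry;
}

// Log what you'd want to search for later, not everything that happens:
logMetric("resizeImage", "error", "upstream timeout", { durationMs: 2170 });
```

The discipline matters more than the helper: decide your key fields (function name, duration, outcome) up front, and resist logging “perfectly good” noise around them.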
Learning 4: Choose your third party APIs wisely!
If your app is erroring because of a third party, then unless they have provided decent error messages, you are in a world of pain.
Imagine getting a response saying “Code: 388” from a third party. For one thing, it’s not obviously an error. Only after a few minutes (or hours… or days) do you work out that it is one of their error codes. Then you go to the third party’s documentation, search 14 different documents that mention it, and finally find a forum post telling you that you should have included a property called mindless_property.
Now imagine the error saying “Code: 388. Your request was missing the property mindless_property. See the documentation and search for shortcut ‘error388’.” You go to their documentation, put in error388, and voila.
Good third parties will know this already. Their SDK and API documentation will identify some of these issues and have good searches. Bad third parties will be virtually impossible to figure out.
However, because we’re one step removed from the actual invocation, you are sometimes left in the dark about the problem for longer than you’d like.
Rule of thumb: More mature third party providers produce better logging and error messages. Bad third parties mean you have to ring them up to find out why something has failed.
Pick your third party APIs wisely!
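One defensive habit that helps while you’re stuck with an opaque provider: each time you decode an error code the hard way, record it in a small lookup so the next failure explains itself. The code table and function below are hypothetical, built around the “Code: 388” example above:

```javascript
// Sketch: wrap opaque third party error codes with context you've earned.
// The table is hypothetical — you grow it as you decode each error the hard way.
const KNOWN_ERRORS = {
  388: "Request was missing a required property (see provider docs, 'error388').",
};

function explainThirdPartyError(response) {
  const hint = KNOWN_ERRORS[response.code];
  return {
    code: response.code,
    hint: hint || "Unknown provider error; check their documentation/support.",
  };
}
```

It doesn’t fix the provider, but it turns a day of document-searching into a one-off cost per error code.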
Learning 5: Scaling is a dream
Simply put, I am no longer responsible for my application coping with scale. Above and beyond writing code that works within the resources I’m given (i.e. it doesn’t use 47GB of RAM when only 512MB is allocated) and is appropriately coded, I don’t have to worry about this.
Think about it.
You can have a set of functions built to handle 200 users or 200 million.
Of course, there are scaling issues you have to consider, but then that is more around how you architect the application in the first place.
But the advantage of serverless is that you can replace one function for another with relative ease.
So if you find there is a problem or bottleneck, you can easily either rewrite or repurpose or refactor or find a third party that does it better/faster/easier or whatever.
We built a serverless solution that scaled to several tens of thousands of users per day from a standing start without having to change any code.
And I didn’t have to do anything either. No provisioning. No load balancing. No discussion with anyone.
Next time someone asks you if your service is scalable, if you built it serverless and architected it well, it is remarkably simple to say yes.
Oh, and if you find it doesn’t scale (DB bottleneck, slow third party, etc.), I can guarantee that making it more scalable is easier than with a large framework sitting on a bunch of servers.
Agility of response to issues is improved as well.
Serverless… it can easily be highly scalable from day 1.
I’m stopping here… until part 2
If you want me to discuss something specific let me know.
I’m writing this partly for my benefit to help me think about what I’ve achieved, but also for others to make an informed decision about serverless as a route forwards.
Part 2 coming soon!