On Data Lakes and Data Flows — Serverless Best Practices Expanded
In my recent blog post on Serverless Best Practices I wrote that one of the best practices was “Data Flows not Data Lakes”, and it is one that people have asked me about quite a bit.
It is probably the most problematic of the best practices in terms of developing an application, because it catches the most people out. It seems an obvious thing to say, but it’s not an obvious thing to explain, and a couple of paragraphs didn’t seem to do it justice.
So let’s dive into a lot more depth.
Data in motion and the “data layer”
We have the concept of data in motion (or in flight) vs data at rest in modern application development. It usually comes up as part of a conversation about encryption or security. Data in motion is moving between parts of your application, from one server to another or from one microservice to another, or even to or from an external service or user. It’s vital to the working of your application. Data at rest is sitting in a store such as a database, a blob store or a file.
When you build a serverless application, the most difficult thing to keep in mind is that your application is a distributed application from the very beginning. Often that distribution is via multiple functions in a FaaS (Function as a Service e.g. AWS Lambda) or it is a number of distributed services that are connected up through a triggering mechanism. However it is put together, the upshot is that there is a lot of data in motion throughout the system all the time. In fact, it’s rarely at rest at all.
The problem with data being in motion all of the time is that the concept of a data layer (that exists in more traditional architectures) is actually a lot more complex. Where is the data layer? Does it really exist? What does it look like now?
Flexibility of Data Structures
Changing from the concept of a data layer to having data in motion all of the time has other impacts on the way we approach the application. Within a traditional application it is often necessary to standardise a data schema or model across multiple different parts of an application. This ensures that a team knows what type of data they are working with and whether they are receiving the right data.
When your data is always in motion, the data flowing between the elements of your application can take a different shape at each step. If that is the case, then you don’t have a standardised schema across the application, although you can create one if you wish.
If you have no need of a standardised schema then you have a very flexible data structure to work with (in fact, you don’t have one). The data can flow and change through your application in whatever way you choose. This is hugely useful, especially if you wish to change your application in the future.
If you push your application into a standardised schema (e.g. an RDBMS) then you are automatically imposing an inflexibility on the data. That inflexibility may not be immediately obvious, but it becomes blindingly obvious after your application has been running for a period of time.
There are times when you want your application to be flexible, and times when you want it to be inflexible. However, Functions are the most common way we build serverless applications, and the most important thing to consider when you’re building with Functions is speed.
Data and Functions
Functions are meant to be stateless, which means that there shouldn’t be any data stored in the function itself.
This stateless element is quite important when it comes to flexibility. If you have to validate against a standardised schema, it tends to add an overhead over time. Once you add storing to and retrieving from a data store, more schema validation is required and the overhead grows with it. If you have an index or two, or validation across other sources, you end up with a scenario where the overhead can become onerous.
Functions are stateless for a reason. They are meant to be light and fast. The data that comes in is meant to be validated for that function only, transformed for whatever purpose the function has been developed, and then passed on to the next phase of the journey.
It is a flow through a set of functions and queues.
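To make that concrete, here is a minimal Python sketch of one step in such a flow. Everything named here is an assumption for illustration: the event fields (order_id, amount_pence), the function name, and the in-memory next_queue that stands in for a real queue service.

```python
from collections import deque

# Hypothetical in-memory stand-in for a real queue service.
next_queue = deque()


def handle_order_event(event: dict) -> dict:
    """Stateless step: validate only what this function needs, transform, pass on."""
    # Validate just the fields this step depends on.
    # Any other fields flow through untouched and unvalidated.
    if not isinstance(event.get("order_id"), str):
        raise ValueError("order_id must be a string")
    if not isinstance(event.get("amount_pence"), int):
        raise ValueError("amount_pence must be an integer")

    # Transform: add the one value this step is responsible for.
    enriched = {**event, "amount_pounds": event["amount_pence"] / 100}

    # Pass the result on to the next stage of the flow.
    next_queue.append(enriched)

    # Returned only so the step is easy to inspect in a test.
    return enriched
```

Nothing is kept inside the function between calls, and there is no application-wide schema: the step checks the two fields it cares about and leaves the rest of the event alone.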
Sometimes you need data in your functions
Sometimes you do need to get structured data into your function. And sometimes you really need to get a specific complex dataset from somewhere.
This problem comes back to the data “layer” issue again. When building an application the traditional thing to do is “I need X so go to the database and get it”. It’s a SQL query or a search or something similar.
Again, the key is speed.
Make your function fast by ensuring that it does the least amount of work and waits for as little as possible.
Make it one simple call and make it as quick as possible.
The way to achieve this is to ensure that the data has already been created and stored in the form that the function needs.
Making tables or blob objects, or even regularly redeploying the function with updated data, is usually far faster than keeping the data in a large data store and querying it with SQL when the function runs.
Often the way to achieve this is to generate the data before it is needed, essentially creating some form of data cache where appropriate. Again, we are moving away from the slower data layer.
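As a rough sketch of generating the data before it is needed, the Python below keeps a per-customer summary up to date as events flow through, so that the reading function only has to make one simple lookup. The table, the event fields and both function names are hypothetical, and a plain dict stands in for whatever key-value table or blob store you would actually use.

```python
# Hypothetical key-value table standing in for a real store.
customer_summary_table = {}


def on_order_placed(event: dict) -> None:
    """Write side: update the pre-built summary as each event flows through."""
    summary = customer_summary_table.setdefault(
        event["customer_id"], {"order_count": 0, "total_pence": 0}
    )
    summary["order_count"] += 1
    summary["total_pence"] += event["amount_pence"]


def get_customer_summary(customer_id: str) -> dict:
    """Read side: one key lookup, no joins or aggregation at request time."""
    return customer_summary_table.get(
        customer_id, {"order_count": 0, "total_pence": 0}
    )
```

The aggregation cost is paid once per event on the way in, not on every read.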
Data Lakes are for Analytics
So when do you need data lakes?
Well, there are times when you really need analytics. And analytics are generally not needed at speed.
And that’s the thing.
Data flows are all about speed and data lakes are all about data connections.
How fast do you want what you are looking for?
There will be people who will respond that RDBMS are very fast (and they are), but they generally don’t scale in the same way that FaaS solutions do, and neither do many other services.
When you introduce an RDBMS, what you tend to bring in over time is inflexibility. Initially this isn’t the case, but as you build upon the solution, the data schema will become stale and the application will become tied to it.
The interesting thing is that data within an application will often end up in a data lake. It is relatively easy to see when a data flow has started and ended because you start with an input of some sort and it ends in a data lake. This is how it should be.
Redirecting Data Flows
One of the big advantages of using data flows within a serverless application is that you can easily redirect an output from a part of your application.
Why is this important?
Redirecting flows allows you to easily build replacement data flows without needing to impact schemas.
Redirecting flows allows you to replace elements of an application without having to replace the rest of an application.
Redirecting flows allows you to separate the business logic from the data schema.
And when you run an application for a period of time, you will want to redirect flows. Maybe a business process changes, or you change a service that you’re using for analytics or something similar. Redirecting a flow is relatively simple. Damming a lake is not.
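One way to picture a redirect is as a change to a routing table that the producer never sees. The Python below is only an in-memory sketch under that assumption: routes, subscribe, redirect and publish are hypothetical names, and in a real serverless application the routing would live in your queue or event service configuration.

```python
# Hypothetical routing table: stream name -> list of consumer callables.
routes = {}


def subscribe(stream, consumer):
    """Attach a consumer to a named stream."""
    routes.setdefault(stream, []).append(consumer)


def redirect(stream, old_consumer, new_consumer):
    """Swap one consumer for another without changing the producer."""
    consumers = routes[stream]
    consumers[consumers.index(old_consumer)] = new_consumer


def publish(stream, event):
    """Producer side: send an event to whoever is currently subscribed."""
    for consumer in list(routes.get(stream, [])):
        consumer(event)
```

The code that calls publish stays exactly the same before and after a redirect, which is what keeps the business logic separate from wherever the data ends up.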
Unidirectional Data Flows and Circuit Breakers
One other aspect which is useful to talk about is unidirectional data flows or, to put it another way, ensuring that data only flows one way through your application.
One common way for applications to be built is for functions to do something and return data back up the chain. This is especially common in web applications that are built in a “serverless” way.
I would strongly suggest looking at building unidirectional flows and at CQRS (Command Query Responsibility Segregation), simply because of the ability to detach data flows later down the line.
Data flowing down and back up a chain tends to make an application inflexible. Again, this may not be an issue early on, but over time it causes problems due to rigidity within the functions and queues.
Part of the downside of bidirectional data flows is that it makes a function or service do more than one thing and that tends to make bigger elements. However that’s not the biggest issue.
Having unidirectional data flows allows you to have circuit breakers.
Within a serverless application it is really useful to catch errors in a relatively standard way. Circuit breakers are a particularly useful way of doing this. By using queues to route data through an application, the circuit breakers can have a Dead Letter Queue (DLQ) and a notifier for when an error occurs.
If that is your standard application error pattern throughout an application, then your error handling becomes significantly easier.
Your application becomes more self-healing by moving issues into DLQs, and by having a standard approach, your team will standardise on how to resolve the issues.
And if your application is built in this way, then the application as a whole won’t fail. Your “server” doesn’t go down and the likelihood of being able to fix it quickly is significantly increased.
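Here is a minimal Python sketch of that pattern, with a lot assumed. The CircuitBreaker class, its max_failures threshold and the notify callback are hypothetical, the DLQ is just an in-memory deque, and there is no reset or half-open state, which a real breaker would normally need.

```python
from collections import deque


class CircuitBreaker:
    """Wraps a handler; failed messages go to a DLQ and trigger a notification."""

    def __init__(self, handler, notify, max_failures=3):
        self.handler = handler
        self.notify = notify
        self.max_failures = max_failures
        self.failures = 0            # consecutive failures seen so far
        self.dead_letters = deque()  # in-memory stand-in for a real DLQ

    @property
    def is_open(self):
        return self.failures >= self.max_failures

    def process(self, message):
        # Breaker open: stop calling the handler and park the message.
        if self.is_open:
            self.dead_letters.append(message)
            return False
        try:
            self.handler(message)
        except Exception as exc:
            # Standard error path: count it, park the message, tell someone.
            self.failures += 1
            self.dead_letters.append(message)
            self.notify(f"handler failed: {exc}")
            return False
        # Success resets the consecutive failure count.
        self.failures = 0
        return True
```

Because every step fails in the same way, the messages in the DLQ can be inspected and replayed once the underlying problem is fixed, and the rest of the application keeps running in the meantime.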
Data Flows not Data Lakes
Changing the paradigm from the application layer and data layer of traditional architectures to a more distributed, serverless application architecture is not straightforward, but it has huge rewards.
Building an application in this way is not initially very easy, but once you start down this route, it becomes relatively easy to see individual data flows through the system.
Work out how data flows through your business logic, and work out your transformations at each point.
Once you have your data flows, it’s relatively easy to work out whether there is any complex data you need at each point that isn’t easily retrieved. That sometimes takes a bit of thought, but it’s where you build the equivalent of a view or cache on the data.
When you have all that, push your analytics data into some form of data lake.
Data that doesn’t need to be performant can go in there.
That isn’t to say you should never use an RDBMS — sometimes you should.
Go make your data flow.
Go analyse your data lakes.
Go build serverless.
What do I mean by “Data Lake”?
(I’m adding this section after a tweet asking for clarity)
So, a Data Lake is a term that covers something that gathers all data into itself, structured and unstructured. In the context of this article the idea is that a Serverless application simply does not structure its data throughout the application. The data stays relatively free of schema and structure, except at the boundary of each specific Function or resource. So all data flows through the application until it reaches the lake, like a river or stream flowing into a lake.
The point of the lake is not to be a repository of information for the application itself, but to be the final endpoint for the data and useful for learning and analytics.
The application may have small datasets that it uses inside the application itself e.g. a table, or a search or a graph db, but they are optimised for that specific use, and not a “one size fits all” type approach.
So, when I use the term Data Lake, I am thinking about the type of architectural decision where an application has one large data store which is used for most/all data storage within an application. This “one size fits all” data storage approach was efficient when servers were scarce, but now that we have FaaS we can highly optimise each part of an application, including data storage.
We still need the concept of a data lake, but when you are building for speed and scale, you really need to think about data flows, not data lakes.