Handling large amounts of data is a prerequisite of digital transformation, and key to this are the concepts of data lakes, data warehouses, data hubs, and data marts.
In this article, we’ll start at the top of that hierarchy and look at data lakes. As organizations try to get a grip on their data and wring as much value from it as possible, the data lake is a core concept.
It’s an area of data management and analysis that depends on storage – sometimes lots of it – and it’s an activity ripe for a move to the cloud but can also be handled on-premise.
Data lake vs. data warehouse
The data lake is conceived of as the first place an organization’s data flows to. It is the repository for all data collected from the organization’s operations, where it will reside in a more or less raw format.
Perhaps there will be some metadata tagging to facilitate searches of data elements. Still, the data lake is intended to be accessed by specialists such as data scientists and those who develop touchpoints downstream of the lake.
Downstream is an apt term because the data lake is seen, like an actual lake, as something into which all data sources flow – and those sources are many, varied, and unprocessed. Data would flow downstream from the lake to the data warehouse, implying something more processed, packaged, and ready for consumption.
While the data lake contains multiple stores of data, in formats not easily accessible or readable by the vast majority of employees – unstructured, semi-structured, and structured – the data warehouse is made up of structured data in databases to which applications and employees are afforded access. A data mart or hub may allow for data that is even more easily consumed by departments.
So, a data lake holds large quantities of data in its original form. Unlike queries against the data warehouse or mart, interrogating the data lake requires a schema-on-read approach, in which structure is applied to the data only at the moment it is read.
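Schema-on-read can be sketched in a few lines. The following is a minimal, illustrative Python example (the field names and the `apply_schema` helper are hypothetical, not from any particular tool): raw records land in the lake as untyped text, and a schema is imposed only when they are read.

```python
import csv
import io
import json

# Raw, schema-less records as they might land in a data lake:
# a CSV export and a JSON log line, both stored as plain text.
raw_csv = "user_id,amount\n42,19.99\n7,5.00\n"
raw_json = '{"user_id": "42", "amount": "19.99"}'

# Schema-on-read: types are declared by the reader, not the writer.
# This hypothetical schema maps each field name to a Python type.
schema = {"user_id": int, "amount": float}

def apply_schema(record, schema):
    """Coerce a raw string-valued record to typed values at read time."""
    return {field: cast(record[field]) for field, cast in schema.items()}

# The same schema is applied to two differently formatted raw sources.
csv_rows = [apply_schema(row, schema) for row in csv.DictReader(io.StringIO(raw_csv))]
json_row = apply_schema(json.loads(raw_json), schema)

print(csv_rows[0])  # {'user_id': 42, 'amount': 19.99}
print(json_row)     # {'user_id': 42, 'amount': 19.99}
```

Contrast this with schema-on-write in a data warehouse, where the table definition enforces types before any row is stored.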
Data lake: Data types and access methods
Data sources in a data lake will include all data from an organization or one of its divisions. It might consist of structured data from relational databases, semi-structured data such as CSV and log files, data in XML and JSON formats, unstructured data like emails, documents, and PDFs, and binary data such as images, audio, and video.
In terms of storage protocol, it will need to store data that originated in file, block, and object storage. But, of those, object storage is a common choice of protocol for the data lake itself. Don’t forget, access will not be to the data itself, but to the metadata headers that describe the data, which could be attached to anything from a database to a photo. Complex querying of the data often happens elsewhere, not in the data lake.
Object storage is very well-suited to storing vast amounts of unstructured data. That is, you can’t query it like you can a database on block storage, but you can store multiple object types in a large flat namespace and find out what’s there.
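The idea of finding out what’s there via metadata headers, rather than reading the objects themselves, can be illustrated with a small sketch. Here a plain dictionary stands in for an object store (the object names and header fields are invented for illustration; a real store such as S3 exposes similar headers per object):

```python
# A flat object namespace: keys are object names, values are the
# metadata headers that describe each object, not the object data itself.
lake = {
    "sales/2023/orders.csv": {"content_type": "text/csv", "source": "erp"},
    "hr/scans/contract.pdf": {"content_type": "application/pdf", "source": "docs"},
    "web/clickstream.json":  {"content_type": "application/json", "source": "web"},
}

def find_objects(store, **wanted):
    """Return names of objects whose metadata matches all requested headers."""
    return [name for name, meta in store.items()
            if all(meta.get(key) == value for key, value in wanted.items())]

# Discovery happens against metadata only; no object body is read.
print(find_objects(lake, content_type="text/csv"))  # ['sales/2023/orders.csv']
print(find_objects(lake, source="web"))             # ['web/clickstream.json']
```

This is why a photo, a log file, and a database export can sit side by side in the same lake: each is findable by its headers even though their contents are queried by very different downstream tools.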
Object storage is generally not designed for high performance, but that’s fine for data lake use cases, where queries are more complex to construct and process than in a relational database in a data warehouse. Much of the querying done at the data lake stage serves to produce more easily queryable data stores for the downstream data warehouse.
Data lake on-prem vs. cloud
All the usual on-premise vs. cloud arguments apply to data lake operations. On-prem data lake deployment has to take account of space and power requirements, design, hardware and software procurement, management, the skills to run it, and ongoing costs in all these areas.
Outsourcing the data lake to the cloud has the advantage of converting the capital expenditure (capex) of infrastructure into operational expenditure (opex) in the form of payments to the cloud provider. That, however, could result in unexpected costs as data volumes scale, and data flowing to and from the cloud will also incur charges.
So, a careful analysis of the benefits and drawbacks of each is needed. That could also take into account issues such as compliance and connectivity that go beyond just storage and data lake architecting.