How To Tell If You Need To Build Your Own Data Lake
In the current digital economy, data is the new oil, so an organisation can’t have too much of it. Today’s company executives are turning to data when making crucial business decisions in sales/marketing, operations, human resources and just about every aspect of the enterprise.
More data is becoming available in large volumes and from various sources, but turning plain data into usable information (facts, trends and statistics) can be a complicated and tedious process. So, businesses have a choice. They could continue to acquire whatever data they can and simply discard what they can’t store, or they could invest in data warehouse technologies to process data faster and get the right information more quickly.
The solution could lie somewhere between these options: a data lake.
But, first things first:
What is a data lake?
A data lake is a unified, centralised data repository where data is ingested and stored in its original form. Large quantities of structured, semi-structured and unstructured data can be stored in the lake, which makes it easy to immediately push data from different systems. Thus, another way to view a data lake is to see it as a big data architecture that uses the straightforward approach of store now, analyse later.
This is why data lakes offer a significant advantage to businesses that don’t yet have the tools to process huge volumes of data. Companies can still capture all the data from both internal and external sources, and do so cost-effectively, because data lakes typically cost far less than a traditional data warehouse. They can then store the data in the lake as-is until it can be processed with analytics and business intelligence technologies.
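To make the store now, analyse later idea concrete, here is a minimal Python sketch. It assumes a local directory stands in for the lake and that events arrive as JSON lines; the function names (ingest_raw, count_events_by_type) and the source/date folder layout are hypothetical choices for illustration, not part of any particular data lake product.

```python
import json
import tempfile
from datetime import date
from pathlib import Path


def ingest_raw(lake_root: Path, source: str, day: date, name: str, payload: bytes) -> Path:
    """Store a payload exactly as received, under <source>/<YYYY-MM-DD>/<name>."""
    target_dir = lake_root / source / day.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / name
    target.write_bytes(payload)  # no parsing and no transformation at ingest time
    return target


def count_events_by_type(lake_root: Path, source: str) -> dict:
    """Analyse later: structure is applied only when the data is read."""
    counts = {}
    for path in sorted((lake_root / source).rglob("*.jsonl")):
        for line in path.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            event = json.loads(line)
            kind = event.get("type", "unknown")
            counts[kind] = counts.get(kind, 0) + 1
    return counts


# A temporary directory plays the role of the lake (an assumption for this sketch).
lake = Path(tempfile.mkdtemp(prefix="lake_"))
raw = (
    b'{"type": "click", "page": "/home"}\n'
    b'{"type": "view", "page": "/pricing"}\n'
    b'{"type": "click", "page": "/pricing"}\n'
)
stored = ingest_raw(lake, "web", date(2024, 1, 15), "events.jsonl", raw)
summary = count_events_by_type(lake, "web")
print(summary)
```

The point of the design is that ingest_raw never needs to understand the payload, so new sources can be added without touching the analysis code, which can be written whenever a use case appears.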
It’s time to build your own data lake when…
Equipped with this basic understanding of a data lake, creating one for your business may seem like the obvious answer. After all, data is always valuable, whether you are able to mine useful information now or in the future.
That said, having one’s own data lake may not be the right path for every company. How do you determine this? Well, it’s time to create your own data lake if most or all of the following scenarios are true for your organisation:
You’re dealing with different types of data.
Typically, data lakes are used to store data generated by constant streams from high-velocity, high-volume sources. Sources of streaming data include IoT sensors, security logs, product logs, web interactions, app activity and others. Costs can easily escalate when you attempt to load these types of data into structured databases as quickly as you can, and the information you get may not even be what you need.
Having a data lake allows you to store the data in the meantime, and then apply the appropriate processing tools later so you can mine the data for specific purposes. This means, however, that if you’re working with data that’s already in standard, tabular formats, such as those generated by financial systems or CRM databases, then sticking with a conventional data warehouse might be the more fitting option for you.
Your data-acquisition process has become very resource-intensive.
Many companies are already using a data warehouse, but business directions can change and you may need to acquire more data from different sources. The process of ingesting new data sets, particularly if these are only semi-structured, could become very expensive.
That’s not even considering how your ETL (extract, transform and load) tool could easily become overloaded from having to transform additional data in new formats. In this case, building a data lake would allow you to ingest all the new data you want without requiring your current tool to process it all.
You want to keep all your data in one place.
Data lakes allow enterprises to keep their data all in one place. Having your data distributed in different locations works for some companies, but for those who need to have it in a single repository, a data lake can serve that purpose.
The benefits of having corporate data in one place include standardised access control and security policies, minimised storage costs, less administrative overhead, and the ability to use a single service to search, filter and transform the data.
You don’t have immediate plans for all your data yet.
As mentioned previously, the data lake approach can be summed up as: store now, analyse later. This gives the business the flexibility to simply let all raw data (along with all the associated metadata) flow into the lake just to have it on hand, or with a future use case in mind. With a data warehouse, the data typically needs to be stored in a specific format because there’s already a particular use case for it.
It’s important to remember, however, that you will eventually have to determine exactly what you want from your data. Otherwise, you risk turning your data lake into a data swamp: a goldmine of data that never produces anything useful.
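One possible way to keep raw data findable is to record its associated metadata at ingest time. The Python sketch below assumes a local directory as the lake and writes a small JSON "sidecar" file next to each object; the helper names (write_with_metadata, list_catalog), the metadata fields and the .meta.json suffix are all hypothetical conventions chosen for illustration.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

SIDECAR_SUFFIX = ".meta.json"  # assumed naming convention for this sketch


def write_with_metadata(path: Path, payload: bytes, source: str, fmt: str) -> dict:
    """Store the raw payload unchanged and describe it in a sidecar file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(payload)
    meta = {
        "source": source,
        "format": fmt,
        "size_bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    sidecar = path.with_name(path.name + SIDECAR_SUFFIX)
    sidecar.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return meta


def list_catalog(lake_root: Path) -> list:
    """Build a simple catalogue by reading every sidecar in the lake."""
    entries = []
    for sidecar in sorted(lake_root.rglob("*" + SIDECAR_SUFFIX)):
        meta = json.loads(sidecar.read_text(encoding="utf-8"))
        meta["object"] = sidecar.name[: -len(SIDECAR_SUFFIX)]
        entries.append(meta)
    return entries


# A temporary directory plays the role of the lake (an assumption for this sketch).
lake = Path(tempfile.mkdtemp(prefix="lake_"))
write_with_metadata(lake / "iot" / "sensor-01.csv", b"ts,temp\n1,21.5\n", "iot-gateway", "csv")
write_with_metadata(lake / "logs" / "auth.log", b"login ok user=42\n", "security", "text")

catalog = list_catalog(lake)
for entry in catalog:
    print(entry["object"], entry["source"], entry["format"], entry["size_bytes"])
```

With a catalogue like this, someone deciding what to analyse can at least see what is in the lake, where it came from and in what format, without opening every file.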
Once you know what insights you need, you can assess if you are collecting enough data, and then invest in data scientists, ETL operations and BI or data visualisation tools to generate the information that will help drive your business decisions.