
Data Lakes for Non-Techies

By Joan Fabregat-Serra

IT has weaponized jargon since the very beginning of the industry. Jargon was a key element in creating vendor lock-in during the ’90s and early 2000s. Moreover, deliberately complex usability helped build a network of certified (aka expensive and lucrative) consultants.

IT has recently experienced a massive shift towards openness and standardization. Yet the jargon lock-in problem has mutated into confusing buzzwords, used to sound like an insider or tech-savvy.

With data lakes, we encounter the same confusion created by this mutating nomenclature. There are many different names for data lakes’ key components and concepts, and data lake strategies can be overwhelming because of it.

This article intends to clarify basic data lake jargon and discuss the basic roles that interact with a data lake throughout its life cycle. We hope that data-centric organizations find it useful.

Data lakes are cloud storage systems, like S3 and Azure Blobs, that store data in its raw format (e.g., JSON, TXT, CSV, HTML, logs, binaries, etc.). Hadoop introduced this approach as a big data solution in the mid-2000s, but the innate complexity of Hadoop made it a niche solution.

Nowadays, many data lake alternatives can query raw data and process it as you would in a data warehouse, which has made data lakes very popular. With this evolution, a confusing buzzword appeared: the data lakehouse, or virtual data warehouse. Nonetheless, data lakehouses are nothing but an upgrade towards better usability.
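The idea of querying raw files in place, without first loading them into a warehouse, can be illustrated with a minimal Python sketch. This stands in for what lakehouse engines do at scale over cloud storage; the file, columns, and aggregation below are illustrative examples, not from the article.

```python
import csv
import io

# A raw CSV file as it might sit in a data lake's storage.
# There is no warehouse load step -- the file itself is the "table".
raw_csv = io.StringIO(
    "user_id,country,amount\n"
    "1,ES,10.0\n"
    "2,ES,5.5\n"
    "3,FR,7.0\n"
)

# "Query" the raw data in place: total amount per country,
# roughly SELECT country, SUM(amount) ... GROUP BY country.
totals = {}
for row in csv.DictReader(raw_csv):
    totals[row["country"]] = totals.get(row["country"], 0.0) + float(row["amount"])

print(totals)  # {'ES': 15.5, 'FR': 7.0}
```

A real lakehouse engine pushes this same pattern down to files in object storage and exposes it through SQL, but the principle is identical: the raw file is queried where it lives.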

As with ETL, data lakes are usually structured into three differentiated areas. Yet there is no consensus whatsoever on the naming:

  • First zone: landing, raw, bronze, pond, or swamp
  • Second zone: exploration, silver, refinery, or sandbox
  • Third zone: operational, consumer, gold, refined, or lagoon

Let’s see what the purpose of each zone is and assign them basic responsibilities.

The Landing Zone

The landing zone is usually IT’s kingdom. IT is responsible for maintaining all the processes that populate this zone with raw data.

Since data is in its raw format, the landing zone is accessible to analysts only on a few occasions. Those exceptions are often situations where analysts need to explore “dark data.” If such work with raw data provides recurrent value to the company, then IT needs to develop a formal process through the other two stages. On such occasions, IT may remove access to the landing zone once the job is done.
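An ingestion job feeding the landing zone can be sketched as follows. Local folders stand in for cloud storage prefixes (e.g., an S3 bucket), and the source name and payload are hypothetical; the point is that IT stores the data exactly as received.

```python
import json
import pathlib
import tempfile

# Local directories stand in for cloud storage; "landing/crm" is an
# illustrative prefix, not a prescribed layout.
lake = pathlib.Path(tempfile.mkdtemp())
landing = lake / "landing" / "crm"
landing.mkdir(parents=True)

# An ingestion process owned by IT: the payload is written exactly as
# received -- no cleansing, no schema enforcement. That comes later,
# in the exploration and operational zones.
payload = {"id": 42, "email": " ALICE@EXAMPLE.COM ", "signup": "2021-03-01"}
(landing / "2021-03-01.json").write_text(json.dumps(payload))

print(sorted(p.name for p in landing.iterdir()))  # ['2021-03-01.json']
```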

The Exploration Zone

The exploration zone is where data is transformed into usable tables and basic cleansing is performed. Although IT is in charge of maintaining all data pipelines, the data owner plays a major role in defining and structuring the data. Moreover, data owners make sure that the data is well understood by IT. Exploration tables should be used for exploratory analytics and any sporadic usage.
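The kind of basic cleansing done on the way into the exploration zone can be sketched like this. The field names and rules are hypothetical examples of what a data owner might define; a real pipeline would also validate types and handle missing values.

```python
def cleanse(record):
    """Basic cleansing for the exploration zone: trim and normalize
    fields, cast types, and keep only the columns the data owner has
    defined. A minimal sketch, not a full pipeline."""
    return {
        "id": int(record["id"]),
        "email": record["email"].strip().lower(),
        "signup": record["signup"],
    }

# A raw record as landed, including a stray column that is dropped.
raw = {"id": "42", "email": " ALICE@EXAMPLE.COM ", "signup": "2021-03-01", "debug": "x"}
print(cleanse(raw))  # {'id': 42, 'email': 'alice@example.com', 'signup': '2021-03-01'}
```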

The Operational Zone

The operational zone is where the end tables are published, with strict quality and regulatory rules applied. Data is processed in “harmony” with other data entities to ensure that the resulting products are usable. The resulting tables have to be optimized for concrete use cases.
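A quality gate in front of the operational zone can be sketched as below: rows that fail the rules are quarantined instead of published. The specific rules (a well-formed email, a non-negative amount) are illustrative examples, not from the article.

```python
def publish(rows):
    """Apply quality rules before a table reaches the operational zone.
    Rows that pass are published; rows that fail are quarantined for
    review. A minimal sketch with two example rules."""
    published, quarantined = [], []
    for row in rows:
        ok = row.get("email", "").count("@") == 1 and row.get("amount", 0) >= 0
        (published if ok else quarantined).append(row)
    return published, quarantined

rows = [
    {"email": "alice@example.com", "amount": 15.5},
    {"email": "not-an-email", "amount": -3.0},  # fails both rules
]
good, bad = publish(rows)
print(len(good), len(bad))  # 1 1
```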

Hopefully, this short article helps clarify the basic concepts and roles involved in data lakes. For clarifications and follow-ups, please use the comment section below.
