Data Catalog, Semantic Layer, and Data Warehouse: The Three Key Pillars of Enterprise Analytics

*Read more about Prashanth Southekal and Inna Tokarev Sela.*

At its core, analytics is the use of data to derive insights for measuring and improving business performance [1]. To manage, govern, and use data and analytics effectively, a growing number of enterprises are deploying a data catalog, a semantic layer, and a data warehouse. But what exactly are these data and analytics tools, and what value do they offer in improving the business performance of the firm?

What Is a Data Warehouse?

A data warehouse, or enterprise data warehouse (EDW), is a system that aggregates data from different source systems into a single, central, consistent data store to support data analytics and artificial intelligence (AI). Data warehouses are primarily used for combining data from one or more sources, reducing load on operational systems, tracking historical changes in data, and providing a single source of truth.

Historically, the data in an EDW has been organized using the star schema and the snowflake schema data structures.

A star schema consists of one central fact table, which can be joined to several denormalized dimension tables. It is considered the simplest and most common type of schema, and its users benefit from faster query speeds.
In a snowflake schema, the fact table is connected to several normalized dimension tables, and these dimension tables have child tables. Users of a snowflake schema benefit from its low levels of data redundancy, but this comes at a cost to query performance.
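
To make the star schema concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names (fact_sales, dim_product, dim_date), columns, and sample rows are hypothetical and chosen only for illustration; they do not come from any particular warehouse.

```python
import sqlite3

# Hypothetical star schema: one central fact table joined to
# denormalized dimension tables. All names and rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    amount REAL
);
INSERT INTO dim_product VALUES (1, 'Laptop', 'Hardware'), (2, 'Antivirus', 'Software');
INSERT INTO dim_date VALUES (10, 2023, 1), (11, 2023, 2);
INSERT INTO fact_sales VALUES (1, 10, 1200.0), (2, 10, 50.0), (1, 11, 800.0);
""")


def sales_by_category(connection):
    """Total sales per product category: a single join from fact to dimension."""
    rows = connection.execute("""
        SELECT p.category, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        GROUP BY p.category
        ORDER BY p.category
    """).fetchall()
    return dict(rows)


print(sales_by_category(conn))
```

In a snowflake variant of this sketch, category would move out of dim_product into its own child table, which removes the repeated category text but adds one more join to the same query.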

However, neither the star schema nor the snowflake schema is as relevant today, due to some fundamental shifts happening in the world of data warehousing.

The star and snowflake schemas were relevant when storage and compute were expensive and limited. Today, with the advent of cloud computing, cloud data warehouses (such as Amazon Redshift, Google BigQuery, Microsoft Azure Synapse, Snowflake, Databricks, and more) offer cheaper, faster, and practically unlimited storage and compute.
The star and snowflake schemas were relevant when businesses changed relatively slowly. Today, businesses are very dynamic. We live in a VUCA (volatility, uncertainty, complexity, and ambiguity) world where companies need the ability to forecast with greater speed, accuracy, and efficiency. Building data models on the star schema and the snowflake schema is time-consuming. Gartner reported that 50% of BI projects will not meet business expectations at the time of going live. While 98% of BI projects are declared successful in week 1 after going live, only 50% remain successful by week 10 [2]. So, businesses need data warehouse models that are scalable, flexible, fast, and cost-effective.
With in-memory and MPP (massively parallel processing) capabilities, organizations are increasingly leveraging embedded analytics, i.e., the convergence of OLTP (online transactional processing) and OLAP (online analytical processing) solutions. Today, we are increasingly moving toward one integrated system for business processes and analytics. This convergence eliminates data aggregates, reduces the dependency on batch processing, and lessens the effort associated with data integration and cleansing. Ultimately, this results in faster consumption of insights into the business processes.

What Is a Data Catalog?

A data catalog is an organized inventory of data. The data inventory in a data catalog typically includes the original data source location, schema, lineage, and more. A data catalog uses metadata (“data about data”) to monitor data, derive usage context, and even derive business descriptions for the data that comes from the data warehouse or the source systems.

Basically, a data catalog is a collection of metadata, combined with data management and search tools, that helps analysts and other data users find the data they need, serves as an inventory of all available data, and provides information to evaluate the fitness of data for intended uses [3]. A well-organized data catalog serves two key purposes:

It provides quick access to data, since the data is categorized and easy to find across the entire data landscape. A data catalog is a key tool for governing data across the entire data lifecycle. A data catalog is like an index or catalog in a library. If the librarian doesn’t have a well-maintained catalog of books, it might take the librarian hours to find a particular book. Similarly, without a data catalog, a data analyst might take hours to find, obtain, and evaluate the necessary data, or might find it extremely difficult to do so at all.
In addition, the data catalog has another important focus: providing a “Google search” for the enterprise data assets used in reports, dashboards, data warehouse tables, and more.
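
The inventory and search ideas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the CatalogEntry fields mirror the metadata named in this section (source location, schema, lineage), while the class name, the search function, and the sample entries are hypothetical and do not reflect any specific catalog product.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical catalog record holding metadata about one data asset."""
    name: str
    source_location: str
    schema: dict                      # column name -> data type
    lineage: list = field(default_factory=list)  # upstream assets
    description: str = ""


def search(catalog, keyword):
    """Return names of assets whose name, description, or columns match."""
    kw = keyword.lower()
    return [
        entry.name
        for entry in catalog
        if kw in entry.name.lower()
        or kw in entry.description.lower()
        or any(kw in column.lower() for column in entry.schema)
    ]


# Illustrative entries only.
catalog = [
    CatalogEntry(
        name="fact_sales",
        source_location="warehouse.analytics.fact_sales",
        schema={"product_id": "INTEGER", "amount": "REAL"},
        lineage=["erp.orders"],
        description="Sales order amounts by product and date",
    ),
    CatalogEntry(
        name="crm_leads",
        source_location="crm.public.leads",
        schema={"lead_id": "INTEGER", "company": "TEXT"},
        description="Marketing prospects captured from campaigns",
    ),
]

print(search(catalog, "sales"))
print(search(catalog, "company"))
```

A real catalog would add ranking, access control, and automated metadata harvesting; the sketch only shows why searchable metadata shortens the time needed to find a data asset.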

What Is a Semantic Layer?

A semantic layer simplifies and translates technical data into a language the business can understand. It works by converting the metadata from the data sources and the applications into a cross-organization semantic knowledge graph. The semantic layer sits between the data sources (source systems) and the analytics/AI tools, making it easier for people to access and analyze data without needing to understand the technical details. A semantic layer is like a translator that bridges the gap between business language and data language by providing consistent and aligned business data definitions.

However, for the semantic layer to work well, it requires robust, enterprise-wide metadata. Metadata plays a crucial role not only in the formation of the semantic layer, but also in the monitoring of changes over time, which are reflected immediately in all semantic layer-based applications. In fact, a data catalog is a data-layer-bound manifestation of a semantic layer, connected to data and serving the analytics and application layers. A data catalog is valuable for the semantic layer because it has the:

Data source location, schema, lineage, and quality metrics of the data assets. In the context of the semantic layer, a data catalog helps to define the various data sources and the mappings between them, which can be used to create a unified and consistent view of the data in the entire enterprise. This part of the semantic layer is usually referred to as a data catalog, and an automated semantic layer leads to an automated data catalog.
Data dictionary, which defines the data structure, data element names, data types, and data definitions, and anything to do with the technical aspects of data to foster data sharing.
Business glossary, the business logic cousin of the data dictionary. It is a repository of metadata that defines the business terms, metrics, concepts, and rules used within an organization. In the context of the semantic layer, a business glossary creates a common business vocabulary to ensure that business terms are used consistently across different reports and dashboards.
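
As a rough sketch of how the data dictionary and business glossary described above could work together, the Python below maps a plain business term to a governed SQL expression. The dictionary contents, the term names, and the to_sql helper are hypothetical assumptions made for illustration, not the API of any semantic layer product.

```python
# Hypothetical data dictionary: technical metadata for each column.
DATA_DICTIONARY = {
    "fact_sales.amount": {"type": "REAL", "definition": "Order line amount"},
    "fact_sales.date_id": {"type": "INTEGER", "definition": "Key into dim_date"},
}

# Hypothetical business glossary: business terms mapped to one agreed
# expression, so every report computes the metric the same way.
BUSINESS_GLOSSARY = {
    "Total Revenue": {"expression": "SUM(fact_sales.amount)", "table": "fact_sales"},
    "Order Count": {"expression": "COUNT(*)", "table": "fact_sales"},
}


def to_sql(term):
    """Translate a business term into SQL using the glossary definition."""
    entry = BUSINESS_GLOSSARY[term]
    return f'SELECT {entry["expression"]} AS "{term}" FROM {entry["table"]}'


print(to_sql("Total Revenue"))
```

The business user asks for “Total Revenue” and never needs to know which table or column backs it; changing the governed definition in one place changes it for every consumer.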

To summarize, the data catalog helps the semantic layer to provide a unified view of data across different data sources while ensuring that data is used consistently. A data catalog focuses on the inventory of the data assets and their technical attributes (metadata), while the semantic layer is a virtual layer of business logic over data mapping.

*Figure 1: Relationship between data catalog, semantic layer, and data warehouse*

When Are the Data Catalog, Semantic Layer, and Data Warehouse Required?

Here are the situations where companies might need the data catalog, semantic layer, and data warehouse.

Data warehouses combine data from one or more sources, reducing the load on operational systems, tracking historical changes in data, and providing a single source of truth for deriving insights.
If there are multiple business definitions (in big organizations, there most certainly are), you need standardized business terms and metrics. The power of a semantic layer lies in its ability to unify various siloed data and application components within an organization under standardized business terms and metrics. Basically, the semantic layer serves as a “co-pilot” to the business users by supporting ad-hoc queries created by “super users” who don’t have much formal training in data and SQL. They need more support and guidance than a skilled BI/data developer. A semantic layer ensures users conform to standard and consistent data definitions in plain business language.
The conventional ways to locate and utilize data are ineffective, laborious, fragmented, and only available to the technical group. If the business data is constantly growing and business users need faster and more reliable insights, we need to promote data democratization (while balancing it with data protection). The data catalog, semantic layer, and data warehouse are data democratization enablers that empower businesses with analytics.
Today, data is present everywhere: in local systems, cloud-based systems, SQL databases, or NoSQL stores, and in a variety of forms: structured, unstructured, and semi-structured. This is mainly due to the decentralization of business models, which has resulted in increased complexities such as multiple data definitions, data formats, types, and more to meet the needs of various stakeholders. This scenario is often seen in large global corporations that have a diverse and distributed workforce. Say marketing calls a business a prospect while managing the leads in Salesforce CRM. Sales might call this same business a client as orders are managed in SAP ERP, and finance could call the same business entity a counterparty as the invoicing process is managed in Oracle EBS. So, in this complex environment, how do you get a report that aligns all three data elements into one? In a siloed data landscape, it is extremely difficult to get a single “Lead to Cash” report. Functions such as finance need enterprise-wide reporting to meet IFRS/GAAP compliance needs. So, if your reporting needs to be enterprise-wide while still addressing inconsistent data definitions and data silos, then the semantic layer is the option.
Organizations that aspire to become AI-ready need the data catalog, semantic layer, and data warehouse. The generative AI rush is everywhere, and executive management is excited about the opportunity. Data teams must ensure data quality, including semantically uniform definitions of structured data.
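
The prospect/client/counterparty scenario in the list above can be sketched as a simple term mapping. This is a toy illustration, assuming a hypothetical canonical term (business_partner), a made-up company name, and hand-written records; a real semantic layer would derive such mappings from metadata rather than a hard-coded dictionary.

```python
# Hypothetical mapping from each system's local term to one canonical term.
SEMANTIC_MAP = {
    ("Salesforce CRM", "prospect"): "business_partner",
    ("SAP ERP", "client"): "business_partner",
    ("Oracle EBS", "counterparty"): "business_partner",
}

# Illustrative records for the same business entity in three silos.
records = [
    {"system": "Salesforce CRM", "term": "prospect", "name": "Acme Corp", "stage": "lead"},
    {"system": "SAP ERP", "term": "client", "name": "Acme Corp", "stage": "order"},
    {"system": "Oracle EBS", "term": "counterparty", "name": "Acme Corp", "stage": "invoice"},
]


def lead_to_cash(rows):
    """Group stages from all systems under one canonical entity."""
    report = {}
    for row in rows:
        canonical = SEMANTIC_MAP[(row["system"], row["term"])]
        report.setdefault((canonical, row["name"]), []).append(row["stage"])
    return report


print(lead_to_cash(records))
```

Once the three local terms resolve to one canonical entity, the stages line up into a single “Lead to Cash” view instead of three disconnected reports.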

When Are a Data Catalog, Semantic Layer, and Data Warehouse Not Required?

Here are the situations where companies might not need the data catalog, semantic layer, and data warehouse.

1. A semantic layer (and data catalog and data warehouse) is recommended for:

Enterprise-wide reporting when there are data silos with inconsistent and multiple data definitions, and
To serve many diverse analytics users.

If all data for reporting and analytics is sourced from one system, the data will likely be consistent and in one format. In that case, you don’t need a semantic layer (and data catalog and data warehouse). In other words, if the data culture of the company is to foster a single version of truth (SoT) with a single data source, then the semantic layer is not required. However, this scenario is very rare in today’s decentralized and distributed businesses.

2. A semantic layer is practically a metadata layer. It doesn’t store any data. The data is stored in the data warehouse or the source transactional systems, and the semantic layer accesses the data with the right data mapping. Sophisticated power users (such as senior business analysts, data scientists, etc.) who typically query a range of databases, including external sources, are also usually limited to the specific datasets highly curated by data engineers. In this scenario, they can identify the right data via the semantic layer, but will still be dependent on data engineers to create their datasets.

3. A semantic layer facilitates ad-hoc query and reporting, and the people who need ad-hoc query and reporting tools are usually the super users. On the other hand, if you are delivering BI and analytics to the masses of sophisticated users, then a semantic layer is highly recommended.

To summarize, the data catalog, semantic layer, and data warehouse foster data centralization for driving and consuming insights quickly. These components are key to becoming a data-driven enterprise. A report from MIT says digitally mature firms are 26% more profitable than their peers [4]. McKinsey indicates that data-driven organizations are 23 times more likely to acquire customers [5]. Industry analyst firm Forrester found that organizations that use data to derive insights for decision-making are almost three times more likely to achieve double-digit growth [6]. Overall, if organizations want to stay ahead of the game and become data-driven enterprises, the data catalog, semantic layer, and data warehouse are the key elements in that architecture.

References

[1] Southekal, Prashanth, “Analytics Best Practices”, Technics Publications, 2020
[2] Gukeria, Hari, “BI Valuenomics: BI Business Value”, Authorhouse, 2010
[3] alation.com/blog/what-is-a-data-catalog/
[4] ide.mit.edu/insights/digitally-mature-firms-are-26-more-profitable-than-their-peers/
[5] mckinsey.com/capabilities/growth-marketing-and-sales/our-insights/five-facts-how-customer-analytics-boosts-corporate-performance
[6] forrester.com/blogs/data-analytics-and-insights-investments-produce-tangible-benefits-yes-they-do/
