It’s All About Relations!

*Read more about author Thomas Frisendal.*

The new ISO 39075 Graph Query Language Standard is to hit the data streets in late 2023 (?). Then what?

If graph databases are standardized pretty soon, what will happen to SQL? SQL databases will very likely stay around for a long time, not simply because legacy SQL has tremendous inertia, but because the relational database paradigm is actually good for some things. Note that I shifted terms from SQL to relational. Not everything that Dr. Codd (the father of the relational model) had hoped for made it into the commercial SQL implementations, at least not in the first 20-30 years (the relational model was published in 1970 and the SQL standard was first published in 1986).

Dr. Codd surely wanted one thing to be of high importance: relations.

But, wait a minute, a relational relation is modeled as a table in SQL? Yes, that is true. But the data bank (Codd’s initial term) should impose no restrictions on the accessibility of attributes across relations (under the umbrella of data independence). The DBMSs of that era had all kinds of restrictions coming from implementation techniques such as tree structures or pointer chains. Modern SQL systems have very sophisticated query optimizers, which work fine, provided that the semantic quality of the data is OK and that functional dependencies are completely understood and adhered to in the data models. (And that is not always easy.)

So, from that perspective SQL sets a standard for data independence. Dr. Codd phrased it like this:

“It provides a means of describing data with its natural structure only-that is, without superimposing any additional structure for machine representation purposes. Accordingly, it provides a basis for a high level data language which will yield maximal independence between programs on the one hand and machine representation and organization of data on the other.” (From his seminal 1970 paper, “A Relational Model of Data for Large Shared Data Banks”)

The challenging part of this – even today – is the performance in massively multi-join data models.

What Should We Expect from GQL Databases?

GQL (its DDL, its metadata graph, and so forth) should be open and flexible. Developers of today (including data engineers and data scientists) want modern data stacks that offer flexibility, mix and match, and plug and play. So, while SHACL integration, for example, might be good for use cases with heavy constraint handling, it should not be the only choice. A developer would want to plug it in if necessary, and otherwise use basic GQL constraints or something else, as they fit. Development platforms such as GitHub also fit into this picture (text files, which are versioned).

GQL will exist in many use case scenarios having diverse data stack architectures. This means that the core metadata graph of GQL should be robust enough to meet many diverse integrations and mappings.

Even in a pure property graph configuration (think of a graph like a third normal form data model), there is a need for a canonical metadata graph that maps to different aggregation strategies for distributing properties across the nodes/vertices and edges/relationships.

And in situations with diverse graph paradigms, the canonical level is the focal point for mapping to and from. Already today there are commercial products implementing RDF/SPARQL (from the W3C) + openCypher (the major predecessor to GQL) and also Gremlin (from Apache) + openCypher. Amazon Neptune supports all three graph languages today.

The use cases and requirements for graph databases mostly focus on complex data models with high levels of connectivity, which translates into lots of relations and sophisticated query handling combined with sophisticated persistence strategies.

But let us begin with the basics.

Introduction to Relationships and Graphs

In mathematics, graph theory is “the study of graphs, which are mathematical structures used to model pairwise relations between objects” (text from Wikipedia on graph theory, accessed Oct. 11 2022), such as in this visualization:

There are many types of graphs, but almost all are based on pairwise relations between objects. Relations are semantic in the sense that they convey verbal/logical information from some business domain(s), including “is a” and “has,” but also more implicative relationships such as “identified by” or “purchased at.” Besides graph databases, relations are found in different, widely used paradigms, some of which are listed here:

The ISO 24707 Common Logic standard with its conceptual graphs built from concepts and relations
“Fact statements” (conceptual modeling and object-role modeling, ORM)
Triples (RDF, semantics, ontologies, etc.)
Relationships/edges (various kinds of property graphs)
Functional dependencies (between and inside relations) in relational theory, as discussed above

All of these kinds of relations share the semantic pattern “subject – predicate – object,” as it is called in the case of the RDF / semantic web family of standards from the W3C.
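The shared pattern can be sketched in a few lines of Python. This is only an illustration: the `Relation` type, the helper functions, and the sample concept names are assumptions made for this sketch and do not come from any of the standards mentioned above.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    """One pairwise relation in subject - predicate - object form."""
    subject: str
    predicate: str
    obj: str


# Hypothetical sample relations, using predicates mentioned in the text.
relations = [
    Relation("Customer", "identified by", "CustomerId"),
    Relation("Product", "purchased at", "Store"),
    Relation("Customer", "has", "Address"),
]


def as_sentence(relation: Relation) -> str:
    """Read a relation back as a small verbal statement."""
    return f"{relation.subject} {relation.predicate} {relation.obj}"


def predicates_of(subject: str, rels: list[Relation]) -> list[str]:
    """List the distinct predicates attached to one subject, sorted."""
    return sorted({r.predicate for r in rels if r.subject == subject})


for r in relations:
    print(as_sentence(r))
```

Whether a statement is stored as an RDF triple, an ORM fact, or a property graph edge, it can be read back in this sentence-like form.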

NB: Concepts are called not only “concepts,” but also object (types), entity (types) et al.

In classic mathematical graph theory, the terms used are nodes / vertices / points and edges / links / lines. Relations may be directed, having start points and end points. Hyper-relations may have multiple start / end point types.

Extending Graph Complexity

The various types of graph paradigms include more constructs, such as properties (attributes), directionality, cardinality, uniqueness, labels on graph elements, and more.

GQL is a declarative language supporting directed, labeled property graphs, which may contain cycles. Properties may reside on nodes/vertices and/or edges/relationships, and there are no implicit rules for normalization, redundancy, and so forth. This is a very versatile paradigm for many use cases, both simple and complex, spanning operational applications, analytics, and special graph algorithms such as centrality, community detection, machine learning, and many more.

There are many similarities between GQL and the graph pattern matching facilities of SQL Property Graph Queries (SQL/PGQ), defined in ISO/IEC DIS 9075-16, Information technology – Database languages SQL – Part 16: Property Graph Queries. However, GQL is a pure and comprehensive graph database language that does not require the presence of SQL.

Canonical Graph Representation

As can be seen from the above, most graph paradigms share a basic, canonical form consisting of nodes/vertices, representing concepts, as well as edges/relationships connecting the nodes/vertices to express the semantics of the concept model, including the dependencies between graph elements. This is what I called Graph Normal Form in my July 2022 blog post.

Here is a canonical form of a (fictitious) webshop example:

The (meta) graph visualization above is created (by plantuml.com) from this script:

    package "Webshop example" {
    (Sale) -- (TotalDiscount) : may have
    (Sale) -- (ShoppingCartId) : identified by
    (Sale) -- (OrderDate) : effective at
    (Sale) -- (TotalPrice) : committed
    (Sale) --> (CartItem) : contains
    (CartItem) <-- (Product) : relates to
    (CartItem) -- (Item#) : identified by
    (CartItem) -- (ItemQuantity) : quantity
    (CartItem) -- (ItemPrice) : confirmed
    top to bottom direction
    (Product) -- (SKUNumber) : identified by
    (Product) -- (ItemDescription) : described as
    (Product) -- (ListPrice) : advertised
    (Customer) --> (Sale) : committed
    (Customer) -- (CustomerId) : identified by
    (Customer) -- (CustomerName) : registered as
    (Customer) -- (CustomerEmail) : confirmation to
    }

This is basically a list of “Subject – object : predicate.” Notice that all nodes can be named, and, equally so, all relations may be annotated with a text (i.e., a name) that enhances the readers’ understanding of the semantics of graph relations.
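To make the list-like nature of the script concrete, here is a minimal Python sketch that reads relation lines in that style into (subject, predicate, object) triples. It assumes plain ASCII connectors (`--`, `-->`, `<--`) and covers only this small subset of the syntax. The `Triple` type and `parse_line` function are names invented for this sketch and are not part of PlantUML.

```python
import re
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str
    directed: bool  # True when the script line used an arrow


# "(Left) connector (Right) : label" -- only the subset used in the example.
LINE = re.compile(
    r"^\((?P<left>[^)]+)\)\s*(?P<arrow><--|-->|--)\s*"
    r"\((?P<right>[^)]+)\)\s*:\s*(?P<label>.+)$"
)


def parse_line(line: str) -> Optional[Triple]:
    """Parse one relation line; return None for anything else."""
    match = LINE.match(line.strip())
    if match is None:
        return None
    left, right = match["left"], match["right"]
    label = match["label"].strip()
    if match["arrow"] == "<--":
        # The arrow points at the left node, so the right node is the subject.
        return Triple(right, label, left, True)
    return Triple(left, label, right, match["arrow"] == "-->")


sample = """
(Sale) -- (OrderDate) : effective at
(Sale) --> (CartItem) : contains
(CartItem) <-- (Product) : relates to
top to bottom direction
"""

triples = []
for raw in sample.splitlines():
    parsed = parse_line(raw)
    if parsed is not None:
        triples.append(parsed)

for t in triples:
    print(t.subject, "|", t.predicate, "|", t.obj)
```

Lines that are not relations, such as the layout directive, are simply skipped, and the `<--` case is flipped so that the subject always comes first.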

Graphs at this level are designated as being in “graph normal form” (in formal graph theory). Most graphs may be decomposed to this level, and, when supplemented with rich annotations, such graphs are also called semantic networks.

NB: Future extensions of GQL in specific areas will rely on the graph normal form metadata paradigm to include new/extended descriptors, which participate in the canonical representation of the graph content. Many advanced features will require metadata at the lowest level (property level) of the affected parts of the graph.

Constructing Property Graphs from Graph Normal Form

GQL is a standard query language for property graphs, and the main extension of the canonical graph form is the concept of properties (which also have GQL descriptors). A property graph data model representing the sample graph above could be visualized like this:

Property graphs can be seen as materializations (logical or physical) of the decomposed graph normal form representations of some semantic data models, where some properties are aggregated to become attributes of different node/vertex types and/or (in GQL et al.) of different edge/relationship types. (Properties on relationships are not shown in the sample diagram above.)
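One possible aggregation strategy can be sketched in Python as follows. The rule used here is an assumption made purely for illustration: undirected relations in the canonical form are folded into properties of their subject, while directed relations become edge types between node types. Other strategies are equally valid, and `to_property_graph` is a hypothetical helper, not part of GQL.

```python
# Canonical-form triples: (subject, predicate, object, directed).
# A subset of the webshop example, written out by hand.
canonical = [
    ("Customer", "identified by", "CustomerId", False),
    ("Customer", "registered as", "CustomerName", False),
    ("Customer", "committed", "Sale", True),
    ("Sale", "effective at", "OrderDate", False),
    ("Sale", "committed", "TotalPrice", False),
    ("Sale", "contains", "CartItem", True),
    ("CartItem", "quantity", "ItemQuantity", False),
    ("Product", "relates to", "CartItem", True),
    ("Product", "identified by", "SKUNumber", False),
]


def to_property_graph(triples):
    """Fold undirected relations into node properties, keep directed ones as edges."""
    node_types = {}  # node type -> list of property names
    edge_types = []  # (source node type, edge label, target node type)
    for subject, predicate, obj, directed in triples:
        node_types.setdefault(subject, [])
        if directed:
            node_types.setdefault(obj, [])
            edge_types.append((subject, predicate, obj))
        else:
            node_types[subject].append(obj)
    return node_types, edge_types


node_types, edge_types = to_property_graph(canonical)

for name, properties in node_types.items():
    print(name, properties)
for source, label, target in edge_types:
    print(f"({source})-[{label}]->({target})")
```

Note that the predicates of the folded relations ("identified by", "registered as") are lost in this simple sketch; a richer metadata graph would keep them as descriptors on the properties.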

Conclusions about Relations and Graphs

If a canonical form is not available, dependencies might have to be inferred from the graph query pattern and possibly the data content at query execution time (similar to the elaborate query optimization in SQL).

An explicit, canonical form (graph normal form / conceptual graph):

Can be inferred from the data
Can accumulate business information model metadata over time
Will most likely be much richer than a SQL model (many more named relations)
Can more effectively drive an unrestricted graph query pattern across large subgraphs, built on data originating in SQL
Can map effectively to other technologies
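As a rough illustration of the first point, the following Python sketch checks whether one column appears to determine another in a set of sample rows, which is the kind of evidence that could suggest a candidate relation for the canonical form. The function name, the column names, and the data are made up for this sketch. A dependency that holds in a sample is only a candidate: data can refute a dependency, but it cannot prove one.

```python
def appears_to_determine(rows, determinant, dependent):
    """Return True if each determinant value maps to exactly one dependent value."""
    seen = {}
    for row in rows:
        key = row[determinant]
        value = row[dependent]
        # setdefault stores the first value seen for a key and returns it afterwards.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical order lines, flattened as they might arrive from a SQL table.
rows = [
    {"SKUNumber": "A1", "ListPrice": 10, "ItemQuantity": 2},
    {"SKUNumber": "A1", "ListPrice": 10, "ItemQuantity": 5},
    {"SKUNumber": "B2", "ListPrice": 25, "ItemQuantity": 1},
]

print(appears_to_determine(rows, "SKUNumber", "ListPrice"))
print(appears_to_determine(rows, "SKUNumber", "ItemQuantity"))
```

In this sample, the SKU number appears to determine the list price, which would suggest a "Product advertised ListPrice" relation, whereas the item quantity varies per row and belongs elsewhere.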

Relations are at the core of the challenge and at the heart of the solution! Decompose them, and you can automate more metadata discovery and more complex query strategies! The result is a knowledge graph that evolves over time.

Acknowledgement: This post is inspired by a great keynote speech:

From the Modern Data Stack to Knowledge Graphs

by Bob Muglia, board member at Relational.ai and former CEO of Snowflake Inc., held at the Knowledge Graph Conference in New York in May 2022. You can see his presentation on YouTube. Thank you, Bob!

NB: The work on V1 of the new GQL standard is planned to be finalized in late 2023.
