What is a graph database?

A graph database is one that stores data in terms of entities and the relationships between those entities. A variant on this theme is the RDF (Resource Description Framework) database, which stores data as subject-predicate-object statements, each of which is known as a triple.
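
To make the subject-predicate-object format concrete, the following is a minimal sketch in Python. It is not how any particular RDF product stores data: the `Triple` type, the `match` helper and the sample facts are all hypothetical, chosen only to show that every statement has the same three-part shape and that a query is a pattern with wildcards.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One subject-predicate-object statement (hypothetical type)."""
    subject: str
    predicate: str
    obj: str


# Hypothetical sample data, for illustration only.
triples = [
    Triple("alice", "knows", "bob"),
    Triple("bob", "knows", "carol"),
    Triple("alice", "worksFor", "acme"),
]


def match(store, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in store
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Everything stated about "alice".
alice_facts = match(triples, subject="alice")

# Every (subject, object) pair linked by "knows".
knows_pairs = [(t.subject, t.obj) for t in match(triples, predicate="knows")]
```

Because every fact has the same shape, new kinds of statement can be added without changing the structure of the store.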

There are three types of graph database: true graph databases, triple stores and conventional databases that provide some graph capabilities. Triple stores are often referred to as RDF or semantic databases. The difference between a true graph product and a triple store is that the former supports index-free adjacency (which means you can traverse a graph without needing an index) and the latter does not. True graph products are designed to support property graphs (graphs where properties may be assigned to entities, to their relationships, or to both), though recently some triple stores have added this capability.
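
The two ideas in the paragraph above, property graphs and index-free adjacency, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any vendor's storage engine: the `Node` and `Relationship` classes and the `connect` and `neighbours` helpers are hypothetical. Each node holds direct references to its own relationships, so a traversal follows those references rather than looking anything up in a global index, and both nodes and relationships carry properties.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    properties: dict = field(default_factory=dict)
    # Direct references to this node's outgoing relationships.
    outgoing: list = field(default_factory=list)


@dataclass
class Relationship:
    kind: str
    target: Node
    properties: dict = field(default_factory=dict)


def connect(source, kind, target, **properties):
    """Attach a relationship, with optional properties, to the source node."""
    rel = Relationship(kind, target, properties)
    source.outgoing.append(rel)
    return rel


def neighbours(node, kind):
    """Follow the node's own references; no global index is consulted."""
    return [rel.target for rel in node.outgoing if rel.kind == kind]


# Hypothetical sample data.
alice = Node("Person", {"name": "Alice"})
bob = Node("Person", {"name": "Bob"})
carol = Node("Person", {"name": "Carol"})
connect(alice, "KNOWS", bob, since=2015)
connect(bob, "KNOWS", carol, since=2019)

# Two-hop traversal: the friends of Alice's friends.
friends_of_friends = [
    fof.properties["name"]
    for friend in neighbours(alice, "KNOWS")
    for fof in neighbours(friend, "KNOWS")
]
```

The cost of each hop here depends on how many relationships the current node has, not on the total size of the graph, which is the practical point of index-free adjacency.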

Both graph and RDF databases may be native products or they may be built on top of other database types. Most commonly, other database types are forms of NoSQL database though there are some relational implementations.

RDF databases target semantic processing, often with the ability to combine information across structured and unstructured data. Both graph and RDF databases may be ACID compliant and both are frequently targeted at transactional environments. All graph products target analytics but different products are targeted at operational analytics (those suitable for transactional environments) as opposed to data warehousing analytics. In this last category there is also a distinction between vendors targeting known-known problems as opposed to those that also cover known-unknowns and those tackling unknown-unknowns: the most intractable of all.

Given that both graph and RDF databases target transactional environments and also have query processing capabilities, they are obvious candidates for supporting so-called HTAP (hybrid transactional/analytical processing), whereby the database is used for both transactional/operational processing and real-time analytics. Compared to some other approaches to HTAP this has the major advantage that the data only needs to be stored once. Both concurrent analytics (where the analytics is separate from operational processes, for example in supporting real-time dashboards) and in-process analytics (where the analytics are embedded in real-time operational processing) may be supported. In the latter case, vendors support a variety of graph algorithms that may be used for machine learning purposes.

Graph databases handle a class of issues that are too structured for NoSQL and too diverse for relational technologies. In the latter case, relational databases are inherently limited to one-to-one, many-to-one and one-to-many relationships. They do not cater well for problems that are many-to-many, of which bill of materials is a classic case. For these types of requirements graph databases not only perform far better than relational databases but also allow some types of query that are simply not possible otherwise. Semantic query support tends to be particularly strong in triple stores.
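
The bill of materials case can be sketched as a small graph in Python. The parts, quantities and function names below are hypothetical. The relationship is many-to-many because an assembly contains many components and a component can appear in many assemblies, so the two natural questions ("what do I need to build this?" and "where is this part used?") are both traversals of the same graph. The sketch assumes the structure contains no cycles.

```python
from collections import defaultdict

# Hypothetical bill of materials: assembly -> [(component, quantity per assembly)].
bom = {
    "bicycle": [("wheel", 2), ("frame", 1)],
    "tricycle": [("wheel", 3), ("frame", 1)],
    "wheel": [("spoke", 32), ("rim", 1)],
    "frame": [("tube", 3)],
}


def explode(part, quantity=1, totals=None):
    """Total the basic parts needed to build `quantity` units of `part`."""
    if totals is None:
        totals = defaultdict(int)
    components = bom.get(part)
    if not components:
        # A part with no components of its own is a basic part.
        totals[part] += quantity
        return totals
    for child, per_unit in components:
        explode(child, quantity * per_unit, totals)
    return totals


def where_used(component):
    """List the assemblies that directly contain `component`."""
    return sorted(
        assembly
        for assembly, parts in bom.items()
        if any(child == component for child, _ in parts)
    )


bicycle_parts = dict(explode("bicycle"))
wheel_usage = where_used("wheel")
```

The depth of the parts hierarchy is not fixed in advance, which is why this kind of question is expressed naturally as a traversal of arbitrary length.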

Another major point is that research suggests that graph visualisations are very easy and intuitive for users. It is also worth noting that many (not all) graph products are schema-free. This means that if you want to change the structure of the environment you simply add a new entity or relationship as required and do not have to explicitly implement a schema change. This is a major advantage over relational databases.
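
As a rough illustration of what schema-free means in practice, the sketch below keeps nodes and relationships in plain Python containers. The `add_node` and `add_edge` helpers and all of the data are hypothetical, and real products differ in how much structure they enforce. The point is only that a new kind of entity or relationship is introduced by adding data, with no separate schema change.

```python
nodes = {}   # node id -> dictionary of properties
edges = []   # (source id, relationship type, target id, properties)


def add_node(node_id, **properties):
    nodes[node_id] = properties


def add_edge(source, rel_type, target, **properties):
    edges.append((source, rel_type, target, properties))


# Initial data: people and companies.
add_node("alice", kind="Person", name="Alice")
add_node("acme", kind="Company", name="Acme")
add_edge("alice", "WORKS_FOR", "acme")

# Later: a new kind of entity and a new kind of relationship.
# Nothing has to be declared or migrated first.
add_node("berlin", kind="City", name="Berlin")
add_edge("acme", "LOCATED_IN", "berlin", since=2010)

relationship_types = sorted({rel_type for _, rel_type, _, _ in edges})
```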

This market is emerging and there are many open source projects and vendors, many of which will not ultimately survive. Nevertheless there are still new products coming on to the marketplace. Conversely, there are companies that have been in this space for more than a decade, so the technology is not entirely new. One noticeable trend is for triple store vendors to add support for property graphs.

Another trend is towards multi-model implementations. This is where the database supports graph technology as just one of possibly several views into the data. A major consideration with such offerings is the extent to which these different representations can work together. Some vendors, for example, require a different API to be used for each model type supported, whereas others have integrated their environment so that the different models are effectively transparent to one another.

One major issue that has yet to be settled is language support. SPARQL (SPARQL Protocol and RDF Query Language) is a W3C standard and a declarative language, but by no means all vendors support it. In general, RDF vendors support SPARQL but property graph vendors do not, though there are exceptions. In the property graph space the Gremlin graph traversal language is part of the Apache TinkerPop project and is supported by some vendors, while other suppliers have adopted their own “SQL-like” languages. Also with significant traction is openCypher, which is a declarative language (Gremlin is only partially so). ANSI has a working party to define SQL extensions to support graph processing, and there is also an initiative to create a standardised GQL (graph query language). It is also worth noting that GraphQL, which is an open source project, is gaining traction as a graph API to replace REST.

Finally, while it is too early to call this a trend, one vendor has introduced a graph capability based on adjacency matrices rather than adjacency lists. If this proves successful, and early results suggest that it will, then we are likely to see the approach more widely adopted, as it promises much better performance.
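
The difference between the two representations can be sketched in Python as follows. This is not the vendor's implementation, and the four-vertex graph and the `step` function are hypothetical. With an adjacency matrix, one hop of a traversal becomes a boolean vector-matrix product, which is the kind of operation that sparse linear algebra routines are built to run quickly. This pure-Python version shows only the idea, not the performance.

```python
# Hypothetical directed graph with vertices 0..3, as an adjacency list.
adjacency_list = {0: [1, 2], 1: [3], 2: [3], 3: []}

# The same graph as an adjacency matrix:
# matrix[i][j] == 1 when there is an edge from vertex i to vertex j.
n = len(adjacency_list)
matrix = [[0] * n for _ in range(n)]
for source, targets in adjacency_list.items():
    for target in targets:
        matrix[source][target] = 1


def step(frontier, matrix):
    """One traversal hop as a boolean vector-matrix product."""
    size = len(matrix)
    return [
        int(any(frontier[i] and matrix[i][j] for i in range(size)))
        for j in range(size)
    ]


start = [1, 0, 0, 0]               # begin at vertex 0
one_hop = step(start, matrix)      # vertices reachable in exactly one hop
two_hops = step(one_hop, matrix)   # vertices reachable in exactly two hops
```

In the list form a traversal visits each vertex's neighbours one at a time. In the matrix form the whole frontier advances in a single algebraic step.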

There have been a lot of new entrants to the market and changes amongst those within the market. As far as new products are concerned these include RedisGraph, TigerGraph, MemGraph, Trovares, and Microsoft Cosmos DB. Cambridge Semantics has unbundled AnzoGraph from its Anzo Data Lake offering and both Amazon and SAP have also entered the market. In the case of Amazon Neptune, it is an open secret that this is based on BlazeGraph while SAP has acquired OrientDB. Or, rather, it acquired the company that had acquired OrientDB, though we think that OrientDB was probably incidental as far as this acquisition was concerned. Nevertheless, SAP appears to be moving ahead with the product though there is the perennial danger that it may increasingly be targeted at the SAP user base rather than the wider community.

The one company that has withdrawn from the graph space is IBM. However, the company continues to work on the development of JanusGraph (effectively a replacement for Titan).

Despite all of the above, the market leaders in this space continue to be Neo4j and Ontotext (GraphDB), which are graph and RDF database providers respectively. These are the longest established vendors in this space (both founded in 2000), so they have a longevity and experience that other suppliers cannot yet match. How long this will remain the case remains to be seen.
