This article is an application of the paper “Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering” by Belkin and Niyogi.
Graphs can be represented via their adjacency matrix, and from there one can draw on the well-developed field of algebraic graph theory. We show in simple steps how this representation can be used to perform node attribute inference on the Cora citation network.

    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE
    from sklearn.decomposition import PCA
    import os
    import networkx as nx
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegressionCV
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import f1_score
    %matplotlib inline

Dataset

The Cora dataset is the hello-world dataset of graph learning. We have described it in detail in a separate article and will not repeat that here; there you can also find a direct link to download the data.
The construction below recreates the steps outlined in that article.

    data_dir = os.path.expanduser("~/data/cora")
    cora_location = os.path.expanduser(os.path.join(data_dir, "cora.cites"))
    g_nx = nx.read_edgelist(path=cora_location)
    cora_data_location = os.path.expanduser(os.path.join(data_dir, "cora.content"))
    node_attr = pd.read_csv(cora_data_location, sep='\t', header=None)
    values = { str(row.tolist()[0]): row.tolist()[-1] for _, row in node_attr.iterrows()}
    nx.set_node_attributes(g_nx, values, 'subject')
    feature_names = ["w_{}".format(ii) for ii in range(1433)]
    column_names =  feature_names + ["subject"]
    node_data = pd.read_table(os.path.join(data_dir, "cora.content"), header=None, names=column_names)
    g_nx_ccs = (g_nx.subgraph(c).copy() for c in nx.connected_components(g_nx))
    g_nx = max(g_nx_ccs, key=len)
    node_ids = list(g_nx.nodes())
    print("Largest subgraph statistics: {} nodes, {} edges".format(
        g_nx.number_of_nodes(), g_nx.number_of_edges()))
    node_targets = [g_nx.nodes[node_id]['subject'] for node_id in node_ids]
    print(f"There are {len(np.unique(node_targets))} unique labels on the nodes.")
    print(f"There are {len(g_nx.nodes())} nodes in the network.")
Largest subgraph statistics: 2485 nodes, 5069 edges
There are 7 unique labels on the nodes.
There are 2485 nodes in the network.

We map the subject to a color for rendering purposes.

    colors = {'Case_Based': 'black',
              'Genetic_Algorithms': 'red',
              'Neural_Networks': 'blue',
              'Probabilistic_Methods': 'green',
              'Reinforcement_Learning': 'aqua',
              'Rule_Learning': 'purple',
              'Theory': 'yellow'}

The graph Laplacian

There are at least three graph Laplacians in common use. These are called the unnormalized, random-walk and normalized graph Laplacian, and with $A$ the adjacency matrix and $D$ the diagonal degree matrix they are defined as follows:

  • Unnormalized: $L = D - A$
  • Random walk: $L_{rw} = D^{-1}L = I - D^{-1}A$
  • Normalized: $L_{sym} = D^{-1/2}LD^{-1/2} = I - D^{-1/2}AD^{-1/2}$

We’ll use the unnormalized graph Laplacian from here on.
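
For reference, here is a minimal numpy sketch of all three on a toy adjacency matrix (the toy matrix and the variable names are mine, purely for illustration):

    import numpy as np
    # toy adjacency matrix of the path graph 0-1-2
    A_toy = np.array([[0., 1., 0.],
                      [1., 0., 1.],
                      [0., 1., 0.]])
    deg = A_toy.sum(axis=1)
    D_toy = np.diag(deg)
    L_un = D_toy - A_toy                     # unnormalized: D - A
    L_rw = np.diag(1.0 / deg) @ L_un         # random walk: D^{-1} L
    D_half = np.diag(1.0 / np.sqrt(deg))
    L_sym = D_half @ L_un @ D_half           # normalized: D^{-1/2} L D^{-1/2}
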
The adjacency matrix of the graph in numpy format:

    A = nx.to_numpy_array(g_nx)

and the degree matrix from this:

    D = np.diag(A.sum(axis=1))
    print(D)
[[168.   0.   0. ...   0.   0.   0.]
 [  0.   5.   0. ...   0.   0.   0.]
 [  0.   0.   6. ...   0.   0.   0.]
 ...
 [  0.   0.   0. ...   4.   0.   0.]
 [  0.   0.   0. ...   0.   4.   0.]
 [  0.   0.   0. ...   0.   0.   2.]]

So, the unnormalized Laplacian is

    L = D-A
    print(L)
[[168.  -1.  -1. ...   0.   0.   0.]
 [ -1.   5.   0. ...   0.   0.   0.]
 [ -1.   0.   6. ...   0.   0.   0.]
 ...
 [  0.   0.   0. ...   4.  -1.  -1.]
 [  0.   0.   0. ...  -1.   4.   0.]
 [  0.   0.   0. ...  -1.   0.   2.]]

Eigenvectors and eigenvalues of the Laplacian

Numpy can directly give you all you need in one line:

    eigenvalues, eigenvectors = np.linalg.eig(L)

In general, eigenvalues can be complex; only special classes of matrices are guaranteed to have real ones. The unnormalized Laplacian is symmetric, so its eigenvalues are real in theory, but np.linalg.eig does not exploit the symmetry and can return tiny spurious imaginary parts. So we take the real parts only and assume that the dropped imaginary parts do not contain significant information.

    eigenvalues = np.real(eigenvalues)
    eigenvectors = np.real(eigenvectors)
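
Alternatively, since the unnormalized Laplacian is symmetric, np.linalg.eigh is a better fit: it guarantees real eigenvalues and returns them already sorted in ascending order:

    # eigh exploits symmetry: real eigenvalues, in ascending order
    eigenvalues, eigenvectors = np.linalg.eigh(L)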

Let’s also order the eigenvalues from small to large:

    order = np.argsort(eigenvalues)
    eigenvalues = eigenvalues[order]

For example, the first eigenvalues are:

    eigenvalues[0:10]
array([3.33303173e-15, 1.48014820e-02, 2.36128446e-02, 3.03008575e-02,
       4.06458495e-02, 4.72354991e-02, 5.65503673e-02, 6.00350936e-02,
       7.24399539e-02, 7.45956530e-02])

The first eigenvalue is as good as zero, and this is a general fact: the smallest eigenvalue of the graph Laplacian is always zero, with the constant vector as its eigenvector. The reason it's not exactly zero above is finite computational accuracy.
So we will omit the first eigenvector, since it does not convey any information.
We also take a 32-dimensional subspace of the full vector space:

    embedding_size = 32
    v_0 = eigenvectors[:, order[0]]
    v = eigenvectors[:, order[1:(embedding_size+1)]]
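
As a quick sanity check (my addition), the eigenvector belonging to the zero eigenvalue should be numerically constant, so its standard deviation should be close to zero:

    # the null eigenvector of a connected graph is constant, up to numerical noise
    print(np.std(v_0))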

A plot of the eigenvalues looks like the following:

    plt.plot(eigenvalues)

Let’s use t-SNE to visualize our 32-dimensional embedding:

    tsne = TSNE()
    v_pr = tsne.fit_transform(v)
    alpha=0.7
    label_map = { l: i for i, l in enumerate(np.unique(node_targets))}
    node_colours = [ label_map[target] for target in node_targets]
    fig = plt.figure(figsize=(10,8))
    plt.scatter(v_pr[:,0],
                v_pr[:,1],
                c=node_colours, cmap="jet", alpha=alpha)

We see that the eigenvectors of the Laplacian form clusters corresponding to the target labels. This means that in principle we can train a model on the eigenvectors and make predictions about an unseen graph. Simply put: given an unlabelled graph, we can extract its Laplacian, feed it to the model and get labels for the nodes.
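
PCA was imported at the top but not used so far; as a quick, linear alternative to t-SNE you can project the same 32 dimensions down to two (a sketch, reusing the variables defined above):

    pca = PCA(n_components=2)
    v_pca = pca.fit_transform(v)
    plt.figure(figsize=(10, 8))
    plt.scatter(v_pca[:, 0], v_pca[:, 1], c=node_colours, cmap="jet", alpha=alpha)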

Training a classifier

From the point of view of machine learning, the eigenvectors are just ordinary feature vectors. Taking a training and test set in this case really means taking a subset of the nodes (a subgraph), even though on the code level it's just an ordinary classifier.

    X = v
    Y = np.array(node_targets)
    clf = RandomForestClassifier(n_estimators=10, min_samples_leaf=4)
    X_train, X_test, y_train, y_test = train_test_split(X, Y, train_size=140, random_state=42)
    clf.fit(X_train, y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=4, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
    print("score on X_train {}".format(clf.score(X_train, y_train)))
    print("score on X_test {}".format(clf.score(X_test, y_test)))
score on X_train 0.8571428571428571
score on X_test 0.7292110874200426
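
The LogisticRegressionCV and f1_score imported at the top can be used in exactly the same way; a minimal sketch:

    lr = LogisticRegressionCV(cv=5, max_iter=1000)
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_test)
    print("logistic regression macro-F1 on X_test {}".format(f1_score(y_test, y_pred, average="macro")))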

From here on you can experiment with different embedding dimensions, different Laplacians or any other matrix representing the structure of the graph.

To understand communities, it's best to start from small datasets rather than big data, and NetworkX is the ideal toolkit for that journey.

NetworkX

NetworkX is a graph analysis library for Python. It has become the standard library for anything graph-related in Python, and it's the basis for most libraries dealing with graph machine learning; StellarGraph in particular requires an understanding of NetworkX to construct graphs.
Below is an overview of the most important API methods. The official documentation is extensive, but it often remains confusing when you just want to get things done. Simple but important questions (adding arrows, attaching data…) are usually answered on StackOverflow, so the guide below collects them.

General remarks

The library is flexible but these are my golden rules:

  • do not use objects to define nodes; rather use integers and set data on the node. The layout algorithms have issues with objects.
  • the API changed a lot over the versions; when you find an answer somewhere, make sure it matches your version. Methods and answers often do not apply simply because they relate to an older version.

    import networkx as nx
    import matplotlib.pyplot as plt
    from faker import Faker
    faker = Faker()
    %matplotlib inline
    

Creating graphs

There are various constructors to create graphs, among others:

    # default
    G = nx.Graph()
    # an empty graph on 100 nodes (no edges)
    EG = nx.empty_graph(100)
    # a directed graph
    DG = nx.DiGraph()
    # a multi-directed graph
    MDG = nx.MultiDiGraph()
    # a complete graph
    CG = nx.complete_graph(10)
    # a path graph
    PG = nx.path_graph(5)
    # a complete bipartite graph
    CBG = nx.complete_bipartite_graph(5,3)
    # a grid graph
    GG = nx.grid_graph([2, 3, 5, 2])

Graph generators

Graph generators produce random graphs with particular properties which are of interest for the statistics of graphs. The best-known phenomenon is six degrees of separation, which shows up on the internet, in our brains, in our social networks and whatnot.

Erdos-Renyi

The Erdos-Renyi model is related to percolation and phase transitions and is the most generic random graph model.
The first parameter is the number of nodes and the second the probability that any two nodes are connected.

    er = nx.erdos_renyi_graph(50, 0.15)
    nx.draw(er, edge_color='silver')


Watts-Strogatz

The Watts-Strogatz model produces graphs with small-world properties.
The first parameter is the number of nodes, then follows the initial degree of each node, and thereafter the probability of rewiring an edge. So the rewiring probability acts like a mutation of an otherwise fixed-degree graph.

    ws = nx.watts_strogatz_graph(30, 2, 0.32)
    nx.draw(ws)


Barabasi-Albert

The Barabasi-Albert model produces random scale-free graphs, a structure found in citation networks, the internet and pretty much everywhere in nature.

    ba = nx.barabasi_albert_graph(50, 5)
    nx.draw(ba)

You can easily extract the power-law distribution of degrees:

    g = nx.barabasi_albert_graph(2500, 5)
    degrees = list(nx.degree(g))
    l = [d[1] for d in degrees]
    plt.hist(l)
    plt.show()

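Because this distribution is heavy-tailed, the plain histogram is dominated by small degrees; a log-log plot of the degree counts (a sketch, reusing the list l from above) makes the power law easier to see:

    from collections import Counter
    counts = Counter(l)
    plt.loglog(list(counts.keys()), list(counts.values()), 'o')
    plt.xlabel("degree")
    plt.ylabel("count")
    plt.show()
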

Drawing graphs

The draw method without additional arguments renders the graph with the spring-layout algorithm.

    nx.draw(PG)

There are of course tons of settings and features, and a good result really depends on your graph and what you're looking for. If we take the bipartite graph, for example, it would be nice to see the two sets of nodes in different colors:

    from networkx.algorithms import bipartite
    X, Y = bipartite.sets(CBG)
    cols = ["red" if i in X else "blue" for i in CBG.nodes() ]
    nx.draw(CBG, with_labels=True, node_color= cols)

The grid graph on the other hand is better drawn with the Kamada-Kawai layout in order to see the grid structure:

    nx.draw_kamada_kawai(GG)


Nodes

If you start from scratch the easiest way to define a graph is via the add_edges_from method as shown here:

    G = nx.Graph()
    G.add_node("time")
    G.add_edges_from([
        ("time","space"),
        ("gravitation","curvature"),
        ("gravitation","space"),
        ("time","curvature"),
        ])
    labels = {}
    pos = nx.layout.kamada_kawai_layout(G)
    nx.draw(G, pos=pos, with_labels= True)
    nx.draw_networkx_edge_labels(G,pos,
        {
            ("time","space"): "interacts with",
            ("gravitation","curvature"): "is"
        },
        label_pos=0.4
    )
{('time',
  'space'): Text(0.39999997946599436, 0.6000000143526016, 'interacts with'),
 ('gravitation',
  'curvature'): Text(-0.3999999793235416, -0.6000000147443909, 'is')}

The nodes can however be arbitrary objects:

    class Person:
        def __init__(self, name):
            self.name = name
        @staticmethod
        def random():
            return Person(faker.name())
    g = nx.Graph()
    a = Person.random()
    b = Person.random()
    c = Person.random()
    g.add_edges_from([(a,b), (b,c), (c,a)])
    # to show the names you need to pass the labels
    nx.draw(g, labels = {n:n.name for n in g.nodes()}, with_labels=True)

As mentioned earlier, it’s better to use numbers for the nodes and set the data via the set_node_attributes method, as shown below.

Edges

Arrows can only be shown if the graph is directed. NetworkX is essentially a graph analysis library and much less a graph visualization toolbox.

    G=nx.DiGraph()
    G.add_edge(1,2)
    pos = nx.circular_layout(G)
    nx.draw(G, pos, with_labels = True , arrowsize=25)
    plt.show()

Data can be assigned to an edge on creation:

    G=nx.DiGraph()
    a = Person.random()
    b = Person.random()
    G.add_node(0, data = a)
    G.add_node(1, data = b)
    G.add_edge(0, 1, label="knows")
    labelDic = {n:G.nodes[n]["data"].name for n in G.nodes()}
    edgeDic = {e:G.get_edge_data(*e)["label"] for e in G.edges}
    kpos = nx.layout.kamada_kawai_layout(G)
    nx.draw(G,kpos,  labels = labelDic, with_labels=True, arrowsize=25)
    nx.draw_networkx_edge_labels(G, kpos, edge_labels= edgeDic, label_pos=0.4)
{(0, 1): Text(-0.19999999999999996, -8.74227765734758e-09, 'knows')}


Analysis

There are many analysis-oriented methods in NetworkX; below are just a few hints to get you started.
Let's assemble a little network to demonstrate them.

    gr = nx.DiGraph()
    gr.add_node(1, data = {'label': 'Space' })
    gr.add_node(2, data = {'label': 'Time' })
    gr.add_node(3, data = {'label': 'Gravitation' })
    gr.add_node(4, data = {'label': 'Geometry' })
    gr.add_node(5, data = {'label': 'SU(2)' })
    gr.add_node(6, data = {'label': 'Spin' })
    gr.add_node(7, data = {'label': 'GL(n)' })
    edge_array = [(1,2), (2,3), (3,1), (3,4), (2,5), (5,6), (1,7)]
    gr.add_edges_from(edge_array)
    import random
    for e in edge_array:
        # assign a random weight to each edge
        # alternatively: nx.set_edge_attributes(gr, {e: {'weight': round(random.random(), 2)}})
        gr.add_edge(*e, weight=round(random.random(), 2))
    labelDic = {n: gr.nodes[n]["data"]["label"] for n in gr.nodes()}
    edgeDic = {e: gr.edges[e]["weight"] for e in gr.edges}
    kpos = nx.layout.kamada_kawai_layout(gr)
    nx.draw(gr, kpos, labels=labelDic, with_labels=True, arrowsize=25)
    o = nx.draw_networkx_edge_labels(gr, kpos, edge_labels=edgeDic, label_pos=0.4)

Getting the adjacency matrix gives a sparse matrix; you need the todense method to see the dense matrix. There is also the to_numpy_array function used earlier, which makes it easy to integrate with numpy mechanics.

    nx.adjacency_matrix(gr).todense()
matrix([[0, 1, 0, 0, 0, 0, 1],
        [0, 0, 1, 0, 1, 0, 0],
        [1, 0, 0, 1, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 1, 0],
        [0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0]], dtype=int64)

The spectrum of this adjacency matrix can be directly obtained via the adjacency_spectrum method:

    nx.adjacency_spectrum(gr)
array([-0.5+0.8660254j, -0.5-0.8660254j,  1. +0.j       ,  0. +0.j       ,
        0. +0.j       ,  0. +0.j       ,  0. +0.j       ])

The Laplacian matrix (defined above) is only defined for undirected graphs, but is just a method away:

    nx.laplacian_matrix(gr.to_undirected()).todense()
matrix([[ 3, -1, -1,  0,  0,  0, -1],
        [-1,  3, -1,  0, -1,  0,  0],
        [-1, -1,  3, -1,  0,  0,  0],
        [ 0,  0, -1,  1,  0,  0,  0],
        [ 0, -1,  0,  0,  2, -1,  0],
        [ 0,  0,  0,  0, -1,  1,  0],
        [-1,  0,  0,  0,  0,  0,  1]], dtype=int64)

If you need to use the edge data in the adjacency matrix this goes via the attr_matrix:

    nx.attr_matrix(gr, edge_attr="weight")
(matrix([[0.  , 0.75, 0.  , 0.  , 0.  , 0.  , 0.96],
         [0.  , 0.  , 0.76, 0.  , 0.53, 0.  , 0.  ],
         [0.35, 0.  , 0.  , 0.13, 0.  , 0.  , 0.  ],
         [0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ],
         [0.  , 0.  , 0.  , 0.  , 0.  , 0.06, 0.  ],
         [0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ],
         [0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ]]), [1, 2, 3, 4, 5, 6, 7])

Simple things like degrees are simple to access:

    list(gr.degree)
[(1, 3), (2, 3), (3, 3), (4, 1), (5, 2), (6, 1), (7, 1)]

The shortest path between two vertices is just as simple to obtain, but note that there are dozens of variations in the library:

    nx.shortest_path(gr, 1, 6, weight="weight")
[1, 2, 5, 6]
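
For example, the corresponding weighted path length is available via shortest_path_length:

    nx.shortest_path_length(gr, 1, 6, weight="weight")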

Things like the radius of a graph are only defined for undirected graphs:

    nx.radius(gr.to_undirected())
2
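
The find_cores method returns the core number of each node, i.e. the largest k such that the node belongs to a k-core:
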
    nx.find_cores(gr)
{1: 2, 2: 2, 3: 2, 4: 1, 5: 1, 6: 1, 7: 1}

Centrality is also a whole world on its own. If you wish to visualize the betweenness centrality you can use something like:

    cent = nx.centrality.betweenness_centrality(gr)
    nx.draw(gr, node_size=[v * 1500 for v in cent.values()], edge_color='silver')

Getting the connected components of a graph:

    nx.is_connected(gr.to_undirected())
    comps = nx.components.connected_components(gr.to_undirected())
    for c in comps:
        print(c)
{1, 2, 3, 4, 5, 6, 7}

Clique

A clique is a complete subgraph; a k-clique is one on k vertices. Large, dense subgraphs are useful for example in the analysis of protein-protein interaction graphs, specifically in the prediction of protein complexes.

    def show_clique(graph, k = 4):
        '''
        Draws the first clique of the specified size.
        '''
        cliques = list(nx.algorithms.find_cliques(graph))
        kclique = [clq for clq in cliques if len(clq) == k]
        if len(kclique)>0:
            print(kclique[0])
            cols = ["red" if i in kclique[0] else "white" for i in graph.nodes() ]
            nx.draw(graph, with_labels=True, node_color= cols, edge_color="silver")
            return nx.subgraph(graph, kclique[0])
        else:
            print("No clique of size %s."%k)
            return nx.Graph()

Taking the Barabasi-Albert graph above and checking for isomorphism with the complete graph of the same size, we can verify that the result found is indeed a clique of the requested size.

    subg = show_clique(ba,5)
    nx.is_isomorphic(subg, nx.complete_graph(5))
[5, 1, 7, 9, 6]
True


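A few more graphs to play with: a random lobster, the Petersen graph, and the karate-club network sized by betweenness centrality.
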
    red = nx.random_lobster(100, 0.9, 0.9)
    nx.draw(red)


    petersen = nx.petersen_graph()
    nx.draw(petersen)


    G=nx.karate_club_graph()
    cent = nx.centrality.betweenness_centrality(G)
    nx.draw(G, node_size=[v * 1500 for v in cent.values()], edge_color='silver')


Graph visualization

As described above, if you want pretty images you should use packages outside NetworkX. The dot and GraphML formats are fairly standard, and saving a graph to a particular format is really easy.
For example, here we save the karate club to GraphML and then use yEd for the layout, together with a centrality-based resizing of the nodes.

    G=nx.karate_club_graph()
    nx.write_graphml(G, "./karate.graphml")

For Gephi you can use the GML format:

    nx.write_gml(G, "./karate.gml")

Get and set data

In the context of machine learning and real-world data graphs it's important that nodes and edges carry data. The way this works in NetworkX can be a bit tricky, so let's spell out how it functions.

Node get/set goes like this:

    G = nx.Graph()
    G.add_node(12, payload = {'id': 44, 'name': 'Swa' })
    print(G.nodes[12]['payload'])
    print(G.nodes[12]['payload']['name'])
{'id': 44, 'name': 'Swa'}
Swa

One can also set the data after the node is added:

    G = nx.Graph()
    G.add_node(12)
    nx.set_node_attributes(G, {12:{'payload':{'id': 44, 'name': 'Swa' }}})
    print(G.nodes[12]['payload'])
    print(G.nodes[12]['payload']['name'])
{'id': 44, 'name': 'Swa'}
Swa

Edge get/set is like so:

    G = nx.Graph()
    G.add_edge(12,15, payload={'label': 'stuff'})
    print(G.get_edge_data(12,15))
    print(G.get_edge_data(12,15)['payload']['label'])
{'payload': {'label': 'stuff'}}
stuff

One can also set the data after the edge is added:

    G = nx.Graph()
    G.add_edge(12,15)
    nx.set_edge_attributes(G, {(12,15): {'payload':{'label': 'stuff'}}})
    print(G.get_edge_data(12,15))
    print(G.get_edge_data(12,15)['payload']['label'])
{'payload': {'label': 'stuff'}}
stuff

Pandas

The library has support for import/export from/to Pandas dataframes. This exchange, however, applies to edges and not to nodes: the rows of a frame are used to define edges, and if you want to use a frame for nodes, or for both, you are on your own. It's not difficult though; let's take a graph and turn it into a frame.

    g = nx.barabasi_albert_graph(50, 5)
    # set a weight on the edges
    for e in g.edges:
        nx.set_edge_attributes(g, {e: {'weight':faker.random.random()}})
    for n in g.nodes:
        nx.set_node_attributes(g, {n: {"feature": {"firstName": faker.first_name(), "lastName": faker.last_name()}}})

You can now use the to_pandas_edgelist method, but this will only output the weights besides the edge definitions:

    nx.to_pandas_edgelist(g)
     source  target    weight
0         0       5  0.021629
1         0       6  0.294875
2         0       7  0.967585
3         0       8  0.553814
4         0       9  0.531532
..      ...     ...       ...
220      40      43  0.313282
221      41      42  0.930995
222      42      48  0.380596
223      43      46  0.236164
224      45      47  0.213095

225 rows × 3 columns

    import pandas as pd
    import copy
    node_dic = {id: g.nodes[id]["feature"] for id in g.nodes}  # easy access to the nodes
    rows = [] # the array we'll give to Pandas
    for e in g.edges:
        row = copy.copy(node_dic[e[0]])
        row["sourceId"] = e[0]
        row["targetId"] = e[1]
        row["weight"] = g.edges[e]["weight"]
        rows.append(row)
    df = pd.DataFrame(rows)
    df
    firstName   lastName  sourceId  targetId    weight
0     Rebecca    Griffin         0         5  0.021629
1     Rebecca    Griffin         0         6  0.294875
2     Rebecca    Griffin         0         7  0.967585
3     Rebecca    Griffin         0         8  0.553814
4     Rebecca    Griffin         0         9  0.531532
..        ...        ...       ...       ...       ...
220     Tyler     Morris        40        43  0.313282
221      Mary     Bolton        41        42  0.930995
222    Colton  Hernandez        42        48  0.380596
223   Michael     Moreno        43        46  0.236164
224      Mary     Morris        45        47  0.213095

225 rows × 5 columns

Note that you need this denormalization of the node data because a normalized description of a graph actually takes two datasets: one for the nodes and one for the edges.
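
Going the other way, an edge frame can be turned back into a graph via from_pandas_edgelist; a minimal sketch using the frame above (node attributes again have to be handled separately):

    g2 = nx.from_pandas_edgelist(df, source="sourceId", target="targetId", edge_attr="weight")
    print(g2.number_of_nodes(), g2.number_of_edges())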

StellarGraph

The StellarGraph library can import directly from NetworkX via the static StellarGraph.from_networkx method (and from Pandas via its constructor).
One important thing to note here is that the features on a node as defined above will not work, because the framework expects numbers rather than strings or dictionaries. If you do take care of this (one-hot encoding and all that), then the following will do:

    from stellargraph import StellarGraph
    gs = StellarGraph.from_networkx(g, edge_type_default = "relation", node_type_default="person", node_features = "feature", edge_weight_attr = "weight")
    print(gs.info())
StellarGraph: Undirected multigraph
 Nodes: 50, Edges: 225
 Node types:
  default: [50]
    Features: none
    Edge types: default-default->default
 Edge types:
    default-default->default: [225]
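
For instance, a quick (and purely illustrative) way to obtain numeric features is to overwrite the name dictionary with a small vector before converting; a sketch:

    # a hypothetical numeric feature: the node degree as a one-element vector
    for n in g.nodes:
        nx.set_node_attributes(g, {n: {"feature": [float(g.degree[n])]}})
    gs = StellarGraph.from_networkx(g, node_features="feature")
    print(gs.info())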

Graph databases

A graph database is one that stores data in terms of entities and the relationships between entities. A variant on this theme are RDF (resource description framework) databases which store data in the format subject-predicate-object, which is known as a triple.

There are three types of graph database: true graph databases, triple stores and conventional databases that provide some graphical capabilities. Triple stores are often referred to as RDF or semantic databases. The difference between a true graph product and a triple store is that the former supports index free adjacency (which means you can traverse a graph without needing an index) and the latter doesn’t. The former are designed to support property graphs (graphs where properties may be assigned to either entities or their relationships, or both) but recently some triple stores have added this capability.

Both graph and RDF databases may be native products or they may be built on top of other database types. Most commonly, other database types are forms of NoSQL database though there are some relational implementations.

RDF databases target semantic processing, often with the ability to combine information across structured and unstructured data. Both graph and RDF databases may be ACID compliant and both are frequently targeted at transactional environments. All graph products target analytics but different products are targeted at operational analytics (those suitable for transactional environments) as opposed to data warehousing analytics. In this last category there is also a distinction between vendors targeting known-known problems as opposed to those that also cover known-unknowns and those tackling unknown-unknowns: the most intractable of all.

Given that both graph and RDF databases target both transactional environments and have query processing capabilities, these are an obvious candidate for supporting so-called HTAP processing whereby the database is used for both transactional/operational processing and real-time analytics. Compared to some other approaches to HTAP this has the major advantage that the data only needs to be stored once. Both concurrent analytics (where the analytics is separate from operational processes, for example in supporting real-time dashboards) and in-process analytics (where the analytics are embedded in real-time operational processing) may be supported. In the latter case, there are a variety of graph algorithms supported by vendors that may be implemented for machine learning purposes.

Graph databases handle a class of issues that are too structured for NoSQL and too diverse for relational technologies. In the latter case, relational databases are inherently limited to one-to-one, many-to-one and one-to-many relationships. They do not cater well for problems that are many-to-many (such as bill of materials – a classic case). For these types of requirements graph databases not only perform far better than relational databases but they allow some types of query that are simply not possible otherwise. Semantic query support tends to be particularly strong in triple stores.

Another major point is that research suggests that graph visualisations are very easy and intuitive for users. It is also worth noting that many (not all) graph products are schema-free. This means that if you want to change the structure of the environment you simply add a new entity or relationship as required and do not have to explicitly implement a schema change. This is a major advantage over relational databases.

This market is emerging and there are many open source projects and vendors, many of which will not ultimately survive. Nevertheless there are still new products coming on to the marketplace. Conversely, there are companies that have been in this space for more than a decade, so the technology is not entirely new. One noticeable trend is for triple store vendors to add support for property graphs.

Another trend is towards multi-model implementations. This is where the database supports graph technology as just one of possibly several views into the data. A major consideration with such offerings is the extent to which these different representations can work together. Some vendors, for example, require a different API to be used for each model type supported, whereas others have integrated their environment so that the different models are effectively transparent to one another.

One major issue that has yet to be finalised is with respect to language support. SPARQL (SPARQL protocol and RDF query language) is a W3C standard and is a declarative language, but by no means all vendors support it. In general, RDF vendors support SPARQL but property graph vendors do not, though there are exceptions to this. In the property graph space the Gremlin graph traversal language is part of the Apache Tinkerpop project and is supported by some vendors, while other suppliers have adopted their own “SQL-like” languages. Also with significant traction is OpenCypher, which is a declarative language (Gremlin is only partially so). ANSI has a working party to define SQL extensions to support graph processing, while there is also an initiative to create a standardised GQL (graph query language). It is also worth noting that GraphQL, which is an open source project, is gaining traction as a graph API to replace REST.

Finally, while it is too early to call it a trend, one vendor has introduced a graph capability based on adjacency matrices rather than adjacency lists. If this proves successful, and early results suggest that it will be, then we are likely to see this more widely adopted, as it promises much better performance.

There have been a lot of new entrants to the market and changes amongst those within the market. As far as new products are concerned these include RedisGraph, TigerGraph, MemGraph, Trovares, and Microsoft Cosmos DB. Cambridge Semantics has unbundled AnzoGraph from its Anzo Data Lake offering and both Amazon and SAP have also entered the market. In the case of Amazon Neptune, it is an open secret that this is based on BlazeGraph while SAP has acquired OrientDB. Or, rather, it acquired the company that had acquired OrientDB, though we think that OrientDB was probably incidental as far as this acquisition was concerned. Nevertheless, SAP appears to be moving ahead with the product though there is the perennial danger that it may increasingly be targeted at the SAP user base rather than the wider community.

The one company that has withdrawn from the graph space is IBM. However, the company continues to work on the development of JanusGraph (effectively a replacement to Titan).

Despite all of the above, the market leaders in this space continue to be Neo4J and OntoText (GraphDB), which are graph and RDF database providers respectively. These are the longest established vendors in this space (both founded in 2000) so they have a longevity and experience that other suppliers cannot yet match. How long this will remain the case remains to be seen.