A graph is an abstract representation of some objects where some pairs of objects are linked in some way. The objects are called nodes, vertices, points and the pairs of objects being linked are called connections, links, edges, lines or even arcs. The variety of names results from the variety of contexts in which graphs are used as well as whether one focuses on the mathematical aspect or the visual aspect of a graph. We will use node and connection to designated the elements of a graph below.
From a purely mathematical point of view you can idealize nodes and connections but from a more practical point of view it’s useful to associate to a node some payload or data:
- A unique identifier in the graph or sometimes a unique identifier on a more semantic level. For example, it makes sense to assign the unique Twitter name (the screen_name) to a node or, if the node represents an entity in a database, to take the primary key of the record.
- Some concrete data which adds a dimension to the node related to the business context. For example, you could attach Twitter user information or a geographic tag or even an image.
We will define in F# a node as a combination of these two elements:
type Node<'TDaton> = int (* node identifier *) * 'TDaton (* node data *)
The payload is a generic parameter and depends on your concrete context; it could be a complex, composite entity on its own. We have chosen the identifier to be an integer type merely for convenience, it could be a string type or Guid, if you wish, and it would not affect much the structures or algorithms.
A connection, defined as a pair or nodes, is mathematically speaking uniquely defined as the combination of the node identifiers. From a practical perspective it makes sense however to enlarge the concept with additional bits:
- Some concrete data or payload, with the same argumentation as above.
- Some weight factor or additional numerical value which represents the relative importance of the connection.
These additional connection properties add a semantic layer to the graph but also
- Allow you to define multiple connections between nodes. The tuple of identifiers only would indeed not be sufficient to enable multiple connections between nodes.
- Turn the graph into a so-called weighted graph. These graphs are essential when studying data networks, extremum problems represented by graphs (for example, the fastest route between two cities) and more.
- Allow you to attach computations and values to graphs or define the state of a graph.
One end of a connection is called source, begin, origin or start(point) while the complementary end is called sink, target or end(point). We will use source and sink to designate where the connection starts, respectively where the connection ends. If the order is of no importance the connection is said to be undirected, otherwise it’s called a directed connection. On a graph level, if the connections are directed then the graph is called a directed graph (often abbreviated as digraph) otherwise we speak of an undirected graph. Sometimes an undirected connection is considered as a connection representing a bi-directional connection. Equally well, the direction of a connection is sometimes redundant in certain algorithms. This means that considering only digraphs covers both directed and undirected graphs, if one takes care in concrete algorithms to intentionally ‘overlook’ the direction if desired.
As it stands, the Connection type allows one to define loops or reflexive connections; these are connections with the same source and sink. In some situations this is a desired feature, in other contexts this is a pathology one has to take care of on the public API level. To keep things simple we will allow loops but you should evaluate in your application domain whether this is necessary and eventually take carefully analyze the graph algorithms for potential issues due to loops.
There are various ways one can now combine the connections and nodes in a graph type. You can find in the literature and textbooks quite a bit about the pro and contra of the diverse approaches, but let me highlight here the two most common approaches:
- Adjacency matrix. This entails the construction of a matrix of size N corresponding to the amount of nodes in the graph and boolean entries where entry [i,j] is true if there is a connection between i and j. One can replace the boolean type by the payload of the connection or use the unique identifier of the node instead of the row (column) number but it wouldn’t alter much the essence of the approach. The matrix will use O(N*N) memory to store the connections and it allows a fast lookup but a full loop over all connection will be slow. Especially if the graph is spare (many nodes and few connections) this is an inefficient implementation. The main advantage is however the fact that by using matrices you have at your disposal a whole lot of matrix-oriented techniques and algorithms.
- Adjacency list.
This approach saves memory and is the right implementation if the graph is not altered too often or too much. It’s more difficult to traverse and does not allow matrix computations, at least not without conversion. For each vertex you keep track (in a dictionary) of the vertices connected. There is a priori in this situation no entry if there is no link for a given vertex but often an empty list is kept so the isolated vertex is still represented. Without it you need a separate structure for singletons.
In addition to this I should add that there is an wide margin in defining a data structure for a graph and that much can be said about the pro and contra of a certain choices. For example, one could in the definition of a node above have incorporated the identifier in the payload or imposing an interface constraint on the generic parameter. Similarly, we can absorb the weight of a connection into the payload. The correct choice is really a compromise between various poles:
- The concrete business domain you try to model using graphs. For instance, in some database related scenarios it might be enough to store only a primary key in the payload and fetch the concrete data when results are displayed. In other situations the payload might contain information relevant to the graph algorithms.
- The ease of access to properties and how you want your graph API to be. Sometimes you have to sacrifice a technically pleasing implementation in favor of something which makes the public API more enjoyable to the end-user.
- Whether you want compatibility with other .Net languages or wish to expose your API through an Odata service to web clients. This defines the data types and the structure of your implementation to some extend.
- The necessary mathematical and technical properties you need to have in order to use certain algorithms. For example, when dealing with graph layout algorithms you need to store some runtime properties.
- Performance considerations and whether your typical usage of the API induces lots of updates or mostly consists of reading values (which is the case if you focus on visualizing graphs).
The most common graph implementation is based on adjacency lists due to the fact that real-world graphs are typically not dense (less connections than nodes) and that list oriented methods have better support in programming languages than matrix-oriented methods. This is also the case in F# where matrices (trace, transpose, inverse…) are somewhat lesser citizens than sequences. Corresponding to the enlarged connection we defined above we thus define the following implementation for a connection:
type Connection<'TDaton> = int (* source Id *) * int (* sink Id *) * int (* weight *) * 'TDaton (* connection data *)
We will define in addition an Adjacency type as an alias for a list of connections and another alias Atom which represents a single node with its connections
type Adjacency<'TStructon> = Connection<'TStructon> list (* a list of connections to other nodes*) type Atom<'TNodeDaton, 'TConnectionDaton> = Node<'TNodeDaton> * Adjacency<'TConnectionDaton>
The collection of connections having the same sink is called the incoming connections of the common sink node. The complementary collection of connections with a common source is called the outgoing connections. The amount of incoming and outgoing connections of a node is the degree (or valency) of a node. Note that it’s very easy in our implementation to get the collection of outgoing connections (since it’s the second part of an Adjacency instance) but it requires a bit more work to collect the incoming connections. Corresponding to the incoming (outgoing) connections one also calls the complementary nodes the incoming (outgoing) nodes. Furthermore, the union of incoming and outgoing nodes gives the neighborhood (neighbors) of a node.
Finally, we can define a graph as a list of Atoms like so
type Graph<'TNodeDaton, 'TConnectionDaton> = Atom<'TNodeDaton, 'TConnectionDaton> list
To fully benefit from the Graph structure as defined above we need to furnish our API with methods which allow us to easily access, edit and delete the individual properties and elements. These so-called CRUD methods (Create, Read, Update and Delete) can be implemented by navigating through the Graph-Adjacency-Atom hierarchy (object model) and keeping the integrity of the Graph structure intact:
- When deleting a node all attached connections should be deleted as well
- When dealing with an undirected graph; if a connection from node a to node b is deleted then the adjacency entry linking node b to node a should also be deleted
- Node identifiers should be unique within the scope of the graph. So, if these integers are generated as part of an AddNode method there should be an internal check which ensures this uniqueness.
Let’s take for example a method which takes one or more objects (payload) to create corresponding nodes:
let GetAtomId (atom:Atom<_,_>) :int = atom |> fst |> fst let GetAtom (id:int) (graph:Graph<_,_>) : Atom<_,_> option = graph |> List.tryFind (fun atom -> (GetAtomId atom) = id) let GetNode (id:int) (graph:Graph<_,_>) : Node option = match (GetAtom id graph) with | Some(a) -> Some(fst a) | None -> None let AddNode (node:Node<'TNodeData>) (graph:Graph<'TNodeData, _>) : Graph<'TNodeData, _> = let id = GetNodeId node match (GetNode id graph) with |None -> let newAdjacencyList : Adjacency = [] let newAtom : Atom<'TNodeData,_> = (node,newAdjacencyList) graph@[newAtom] | _ -> graph (*let's be kind and not raise an error but in some situation you should do so*) let AddNodeData ([] items: 'TDaton array) (graph:Graph<'TDaton, _>) : Graph<'TDaton, _> = let max: int ref = {contents = if (graph.Length=0) then -1 else graph |> List.maxBy (fun atom -> (GetAtomId atom)) |> GetAtomId } Seq.fold (fun acc data -> max := !max+1 let newNode:Node<'TDaton> = (!max, data) (AddNode newNode acc) ) [] items
The GetAtomId returns the identifier contained in the Atom by drilling down twice in the composite tuple. The GetAtom will return the Atom with the specified identifier. The List.tryFind method here ensure that we get None if the identifier is not present in the graph. Based on this method, the AddNode will attempt to insert a new node with the specified identifier but will simply ignore the insertion if the identifier is already present. Whether you raise an exception here or not depends on your context and how the rest of the API is using the method.
Finally, the AddNodeData method will first find the maximum of the identifiers in order to assign a new unique identifier whereafter the AddNode is used to insert a new node.
The removal of a node is somewhat more complex
let PointsTo (id:int) (atom:Atom<_,_>) = atom |> snd |> List.exists (fun (_,y,_,_)-> y=id) let GetConnectionSinkId ((_,sinkId,_,_):Connection) = sinkId let private FilterHasNotSinkId (id:int)= fun x -> ((GetConnectionSinkId x) <> id) let RemoveNode (id:int) (graph:Graph<_, _>) = graph |> ( [] |> List.fold (fun acc (n,a) -> if (fst n)=id then acc else if (PointsTo id (n,a)) then let newAd = a |> List.filter (FilterHasNotSinkId id) let newAtom = (n,newAd) acc@[newAtom] else acc@[(n,a)] ) )
The PointsTo helper method tells you whether one of the connections contained in the given Atom points to (has a sink) a specified identifier. In essence, this method helps to find the incoming nodes of a node with the specified identifier. The GetConnectionSinkId is a simple filering method which picks out the sink identifier of the given connection. The complementary FilterHasNoSinkId method is used as a filtering method in RemoveNode to filter out those Atoms which do not point to the identifier we wish to removed from the graph. The folding is a loop over the Atoms which removes the connections pointing to the identifier we need to remove. The reason for this is to ensure that no connection point anymore to an absent node of course (see the integrity checks above).
For testing purposes and rapid prototyping it’s useful to increment the standard CRUD API with a few methods which help to quickly create graphs with certain properties. For example, the following compact input would be more handy than an explicit series of AddNode and AddConnection calls:
(0, [1;2;3]); (1, [2;3]); (2, [3]); (3, [1;2]);
This representation is in fact a reduction of our Graph structure where the payload has been stripped away. It focuses on the purely relational structure independent of the semantic meaning of the diagram.
The parsing of this input (graph DSL) is fairly simple:
let ParseList (list : (int* int list) list) : Graph