The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words.

This dataset is the MNIST of graph learning. We explore it in some detail here because so many other articles use it again and again as a testbed.

► Direct link to download the Cora dataset

Download and unzip, say in ~/data/cora/.

    import os
    import networkx as nx
    import pandas as pd
    data_dir = os.path.expanduser("~/data/cora")

Import the edges via pandas:

    edgelist = pd.read_csv(os.path.join(data_dir, "cora.cites"), sep='\t', header=None, names=["target", "source"])
    edgelist["label"] = "cites"

The edgelist is a simple table with the source citing the target. All edges have the same label:

    edgelist.sample(frac=1).head(5)
             target   source  label
    3581      72908    93923  cites
    5303     656231   103531  cites
    2005      14531   592830  cites
    987        4330    37884  cites
    1695      10183  1120713  cites

Creating a graph from this is easy:

    Gnx = nx.from_pandas_edgelist(edgelist, edge_attr="label")
    nx.set_node_attributes(Gnx, "paper", "label")

A node then looks like this:

    Gnx.nodes[1103985]

{'label': 'paper'}
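Note that `nx.from_pandas_edgelist` builds an undirected `nx.Graph` by default, so the direction of each citation is dropped. If the direction matters for your analysis, a minimal sketch is to pass `create_using=nx.DiGraph`. The toy edge list below uses invented IDs, not real Cora papers, so that it runs without the data files.

```python
import networkx as nx
import pandas as pd

# Toy edge list with the same columns as above; the IDs are invented.
toy_edges = pd.DataFrame({"target": [1, 1, 2], "source": [2, 3, 3]})
toy_edges["label"] = "cites"

# Default: undirected graph, the citation direction is dropped.
undirected = nx.from_pandas_edgelist(toy_edges, edge_attr="label")

# Directed variant: each edge points from the citing paper to the cited one.
directed = nx.from_pandas_edgelist(
    toy_edges,
    source="source",
    target="target",
    edge_attr="label",
    create_using=nx.DiGraph,
)

print(undirected.is_directed(), directed.is_directed())
print(sorted(directed.edges()))
```

With the real data you would pass `edgelist` instead of `toy_edges`.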

The data attached to the nodes consists of flags indicating whether a word in a 1433-long dictionary is present or not:

    feature_names = ["w_{}".format(ii) for ii in range(1433)]
    column_names =  feature_names + ["subject"]
    node_data = pd.read_csv(os.path.join(data_dir, "cora.content"), sep='\t', header=None, names=column_names)
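Note that the file has one more column than the list of names: each row of cora.content holds the paper ID, then the 1433 flags, then the subject. Since only 1434 names are given, pandas uses the leading unnamed column (the paper ID) as the index. The row layout can be sketched with the standard library alone; the sample row below is made up and shortened to five flags, and `parse_content_row` is a hypothetical helper, not part of any package.

```python
import csv
import io

# One made-up row in the cora.content layout, shortened to 5 word flags:
# <paper id> TAB <flag> ... <flag> TAB <subject>
sample = "31336\t0\t1\t0\t0\t1\tNeural_Networks\n"


def parse_content_row(fields):
    """Split one row into (paper_id, word_flags, subject)."""
    paper_id = int(fields[0])
    word_flags = [int(flag) for flag in fields[1:-1]]
    subject = fields[-1]
    return paper_id, word_flags, subject


reader = csv.reader(io.StringIO(sample), delimiter="\t")
rows = [parse_content_row(fields) for fields in reader]
paper_id, word_flags, subject = rows[0]
print(paper_id, word_flags, subject)
```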

Each node has a subject and 1433 flags corresponding to word occurrence:

    node_data.head(5)
              w_0  w_1  w_2  w_3  w_4  w_5  w_6  w_7  w_8  w_9  ...  w_1424  w_1425  w_1426  w_1427  w_1428  w_1429  w_1430  w_1431  w_1432                 subject
    31336       0    0    0    0    0    0    0    0    0    0  ...       0       0       1       0       0       0       0       0       0         Neural_Networks
    1061127     0    0    0    0    0    0    0    0    0    0  ...       0       1       0       0       0       0       0       0       0           Rule_Learning
    1106406     0    0    0    0    0    0    0    0    0    0  ...       0       0       0       0       0       0       0       0       0  Reinforcement_Learning
    13195       0    0    0    0    0    0    0    0    0    0  ...       0       0       0       0       0       0       0       0       0  Reinforcement_Learning
    37879       0    0    0    0    0    0    0    0    0    0  ...       0       0       0       0       0       0       0       0       0   Probabilistic_Methods

5 rows × 1434 columns

The different subjects are:

    set(node_data["subject"])

{'Case_Based',
 'Genetic_Algorithms',
 'Neural_Networks',
 'Probabilistic_Methods',
 'Reinforcement_Learning',
 'Rule_Learning',
 'Theory'}

Typical ML challenges with this dataset in mind:

  • label prediction: predict the subject of a paper (node) on the basis of the surrounding node data and the structure of the graph
  • edge prediction: given node data, can one predict the papers that should be cited?
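To make the label-prediction task concrete, here is a deliberately naive baseline. It is not a method used in this article: it predicts the subject of a paper as the most common subject among its labelled neighbors and ignores the word flags entirely. The toy graph, the labels and the helper `predict_by_neighbors` are all hypothetical.

```python
from collections import Counter

# Toy citation graph as an adjacency dict; IDs and labels are invented.
neighbors = {
    "a": ["b", "c", "d"],
    "b": ["a"],
    "c": ["a"],
    "d": ["a"],
}
known_subjects = {"b": "Theory", "c": "Theory", "d": "Rule_Learning"}


def predict_by_neighbors(node, neighbors, known_subjects):
    """Return the most frequent subject among the labelled neighbors of node,
    or None when no neighbor has a known subject."""
    votes = Counter(
        known_subjects[n] for n in neighbors.get(node, []) if n in known_subjects
    )
    if not votes:
        return None
    return votes.most_common(1)[0][0]


print(predict_by_neighbors("a", neighbors, known_subjects))
```

Graph learning models go further by combining this neighborhood signal with the node features.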

You will find plenty of articles on this site that are based on the Cora dataset.

For your information, the visualization above was created by exporting the Cora network to GML (Graph Modelling Language), importing it into yEd and applying a balloon layout. It shows some interesting characteristics which are best analyzed via centrality measures.
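The export and a first centrality check can be sketched with networkx. Whether the export was originally done with networkx's GML writer is an assumption, and the small graph below is a stand-in so that the snippet runs without the data files.

```python
import networkx as nx

# Small stand-in graph; with the real data you would use Gnx instead.
toy = nx.Graph()
toy.add_edges_from([(1, 2), (1, 3), (1, 4), (2, 3)])

# Serialize to GML text; nx.write_gml(toy, path) writes the same content to a file.
gml_text = "\n".join(nx.generate_gml(toy))

# Degree centrality: the fraction of the other nodes each node is connected to.
centrality = nx.degree_centrality(toy)
most_central = max(centrality, key=centrality.get)
print(most_central, centrality[most_central])
```

Degree centrality is only the simplest choice; networkx also offers betweenness and eigenvector centrality with the same calling pattern.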

Loading the dataset in Neo4j

Certain packages, like StellarGraph, can learn from graphs stored in a database. This opens up all sorts of possibilities, especially in the context of knowledge graphs, fraud detection and more.

The methods below help to transfer the Cora data to Neo4j, the de facto graph store these days. The technique is straightforward, but note that the rather large 1433-dimensional vector describing the content of a paper breaks the Neo4j browser: the network visualizer attempts to load these vectors along with the network structure, and this fails even for a single node.

The py2neo package lets you connect to Neo4j from Python. Simply pip-install the package and connect to the store via something like

    import py2neo

    graph = py2neo.Graph(host="localhost", port=7687, user="neo4j", password="neo4j")

To start with an empty database you can truncate everything with

    empty_db_query = """
        MATCH (n)
        DETACH DELETE n
        """

    tx = graph.begin(autocommit=True)
    tx.evaluate(empty_db_query)

If you get an error at this point it’s likely because of the port or the password.

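To load all the nodes, collect the ID, subject and feature vector of each paper into one table and insert it in batches:

    # Assumed construction of node_list (it is not defined elsewhere in this
    # article): one row per paper with its id, subject and list of word flags.
    node_list = pd.DataFrame({
        "id": node_data.index,
        "subject": node_data["subject"].values,
        "features": node_data[feature_names].values.tolist(),
    })

    loading_node_query = """
        UNWIND $node_list AS node
        CREATE (e:paper {
            ID: toInteger(node.id),
            subject: node.subject,
            features: node.features
        })
        """

    batch_len = 500

    for batch_start in range(0, len(node_list), batch_len):
        batch_end = batch_start + batch_len
        # turn this slice of the node dataframe into a list of records
        records = node_list.iloc[batch_start:batch_end].to_dict("records")
        tx = graph.begin(autocommit=True)
        tx.evaluate(loading_node_query, parameters={"node_list": records})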
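The batching pattern used in the loop above can be checked in isolation, without a database. This is a small sketch with a hypothetical helper `iter_batches` and stand-in records; it only shows that stepping through the range in strides of `batch_len` covers every record once, including the shorter final batch.

```python
def iter_batches(records, batch_len):
    """Yield consecutive slices of at most batch_len records."""
    for batch_start in range(0, len(records), batch_len):
        yield records[batch_start:batch_start + batch_len]


# Stand-in for the 2708 node records; only the count matters here.
fake_records = [{"id": i} for i in range(2708)]
batch_sizes = [len(batch) for batch in iter_batches(fake_records, 500)]
print(batch_sizes)
```

Slicing past the end of a list (or of a dataframe with `iloc`) is safe in Python, which is why the loop needs no special case for the last batch.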

Similarly, for the edges:

    loading_edge_query = """
        UNWIND $edge_list AS edge

        MATCH (source:paper {ID: toInteger(edge.source)})
        MATCH (target:paper {ID: toInteger(edge.target)})

        MERGE (source)-[r:cites]->(target)
        """

    batch_len = 500

    # edgelist is the dataframe read from cora.cites at the top of this article
    for batch_start in range(0, len(edgelist), batch_len):
        batch_end = batch_start + batch_len
        # turn this slice of the edge dataframe into a list of records
        records = edgelist.iloc[batch_start:batch_end].to_dict("records")
        tx = graph.begin(autocommit=True)
        tx.evaluate(loading_edge_query, parameters={"edge_list": records})

If you want to use this graph database with StellarGraph, see the documentation of the stellargraph.connector.neo4j connector.