AN INTRODUCTION TO KNOWLEDGE REPRESENTATION
(This is a six-part series on semantics and reasoning.)
The code below assumes a semantic store endpoint is available at localhost:3030, which corresponds to the default of a local Fuseki server. The code is, however, independent of Fuseki and based on plain SPARQL 1.1; you could just as well run an AllegroGraph service in a Docker container or point to a remote Stardog service.
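Before running anything, a quick way to check that something is listening; a minimal sketch, assuming the dataset name Test used throughout this post:
import urllib.request
import urllib.parse

# Minimal connectivity check: ask the endpoint whether it holds any triple at all.
ENDPOINT = "http://localhost:3030/Test"
params = urllib.parse.urlencode({"query": "ASK { ?s ?p ?o }"})
with urllib.request.urlopen(f"{ENDPOINT}/sparql?{params}") as response:
    print(response.status)  # 200 means the SPARQL endpoint answers queries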
If not installed, rdflib is just a pip away:
pip install rdflib
as is sparqlwrapper:
pip install sparqlwrapper
which is used to connect to the service.
The query and update endpoints are separate but you can unify them in one object like so:
import rdflib
from rdflib.plugins.stores.sparqlstore import SPARQLUpdateStore

ENDPOINT = "http://localhost:3030/Test"

# Recent rdflib releases use the query_endpoint/update_endpoint keywords
# (older ones spelled it queryEndpoint) and parse the JSON results
# themselves, so no SPARQLWrapper setReturnFormat call is needed.
store = SPARQLUpdateStore(
    query_endpoint=f"{ENDPOINT}/sparql",
    update_endpoint=f"{ENDPOINT}/update",
    context_aware=False)
This allows you to insert data straight away:
store.update('INSERT DATA { }')  # a valid, if empty, update
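A slightly more concrete insert (the home: namespace IRI here is an assumption, chosen to match the query that follows):
# NOTE: the home: IRI is assumed, not part of any standard vocabulary
store.update("""
PREFIX home: <http://www.orbifold.net/home/>
INSERT DATA { home:swa home:label "SWA" . }
""")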
and to query it one would use something like:
q = """
# same assumed home: namespace as in the insert above
PREFIX home: <http://www.orbifold.net/home/>
SELECT ?p
WHERE { home:swa ?p ?o }
"""
results = store.query(q)
bindings = results.bindings
if bindings is None or len(bindings) == 0:
    print("No results")
    bindings = []
for r in bindings:
    print(r[rdflib.Variable('p')].toPython())
Note that each result row is a dictionary keyed by rdflib.Variable, not by plain strings. The query can also be copy/pasted into YASGUI, provided you point the endpoint dropdown at the appropriate endpoint.
q = """
SELECT (count(*) as ?c)
WHERE { ?s ?p ?o }
"""
results = store.query(q)
bindings = results.bindings
if bindings is None or len(bindings) == 0:
print("No results")
print(bindings[0][rdflib.Variable('c')].toPython())
The toPython method is a utility that uniformly converts URIs and literals to native Python values for you.
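For instance, a small standalone illustration, not tied to the store:
from rdflib import Literal, URIRef
from rdflib.namespace import XSD

print(Literal("5.1", datatype=XSD.double).toPython())  # 5.1, a Python float
print(Literal("Iris-setosa").toPython())               # 'Iris-setosa', a str
print(URIRef("http://www.orbifold.net/iris/data").toPython())  # the IRI as a str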
Let’s add some data to our store so we can demo some more techniques with it. The iris dataset is the hello-world of data science and is useful for some of the ideas described later on:
import pandas
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv(url, names=names)
dataset.head()
| | sepal-length | sepal-width | petal-length | petal-width | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
How do you push this typical tabular data into a semantic store? There are various ways; below is the fairly standard star-graph type of storage, with one central node (iris:data) linked to every record.
for i in dataset.index:
    row = dataset.iloc[i]
    q = f"""
    PREFIX iris: <http://www.orbifold.net/iris/>
    PREFIX item: <http://www.orbifold.net/iris/item/>
    INSERT DATA {{
        iris:data iris:contains item:{i} .
        item:{i} iris:sepal_length "{row['sepal-length']}" .
        item:{i} iris:sepal_width "{row['sepal-width']}" .
        item:{i} iris:petal_length "{row['petal-length']}" .
        item:{i} iris:petal_width "{row['petal-width']}" .
        item:{i} iris:class "{row['class']}" .
    }}
    """
    store.update(q)
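As inserted above, the measurements are plain string literals. If you want numeric comparisons in SPARQL without casting, a variant with typed literals would look like this (a sketch for a single value, same namespaces as above):
i = 0
row = dataset.iloc[i]
q = f"""
PREFIX iris: <http://www.orbifold.net/iris/>
PREFIX item: <http://www.orbifold.net/iris/item/>
INSERT DATA {{
    # the ^^ suffix makes the literal a typed xsd:double rather than a string
    item:{i} iris:sepal_length "{row['sepal-length']}"^^<http://www.w3.org/2001/XMLSchema#double> .
}}
"""
store.update(q)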
To extract these records back via SPARQL you need to use a subquery, something like this:
q = """
PREFIX iris: <http://www.orbifold.net/iris/>
PREFIX item: <http://www.orbifold.net/iris/item/>
SELECT ?s ?k ?m
WHERE {
    ?s ?k ?m
    {
        # the inner select restricts ?s to the items hanging off iris:data
        SELECT ?s
        {
            iris:data iris:contains ?s
        }
    }
}
LIMIT 2
"""
results = store.query(q)
bindings = results.bindings
if bindings is None or len(bindings) == 0:
    print("No results")
    bindings = []
for r in bindings:
    print(r.values())
What about graphing these results? This can easily be done with NetworkX but requires a bit of stripping first. The URIs are too verbose for a graph, so let’s reduce things a bit:
class Iris:
    def __init__(self):
        self.id = -1
        self.petal_length = -1
        self.petal_width = -1
        self.sepal_length = -1
        self.sepal_width = -1
        self.type = None

q = """
PREFIX iris: <http://www.orbifold.net/iris/>
PREFIX item: <http://www.orbifold.net/iris/item/>
SELECT ?s ?k ?m
WHERE {
    ?s ?k ?m
    {
        SELECT ?s
        {
            iris:data iris:contains ?s
        }
    }
}
LIMIT 1000
"""
results = store.query(q)
bindings = results.bindings
flowers = {}
if bindings is None or len(bindings) == 0:
    print("No results")
    bindings = []
for r in bindings:
    # strip the namespace so only the integer id of the item remains
    id = int(r[rdflib.Variable('s')].toPython().replace("http://www.orbifold.net/iris/item/", ""))
    if id in flowers:
        flower = flowers[id]
    else:
        flower = Iris()
        flower.id = id
        flowers[id] = flower
    prop_name = r[rdflib.Variable('k')].toPython().replace("http://www.orbifold.net/iris/", "")
    # the property name 'class' is a reserved word in Python
    if prop_name == "class":
        prop_name = "type"
    prop_value = r[rdflib.Variable('m')].toPython()
    setattr(flower, prop_name, prop_value)
Note that this technique shows you:
- how to convert triples to standard record sets
- how to strip noise away
- how semantic data can be converted to tabular data ready for machine learning (see the sketch after this list)
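A minimal sketch of that last point, turning the reconstructed Iris objects straight back into a pandas frame:
import pandas

# Each Iris instance is a plain bag of attributes, so vars() turns it into
# a record and the list of records into a DataFrame ready for scikit-learn.
df = pandas.DataFrame([vars(f) for f in flowers.values()])
print(df.head())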
Now, with some NetworkX API you can easily draw the flower network (or part of it):
import networkx as nx
%matplotlib inline
G = nx.Graph()
rec = []
for i in range(30):
v = vars(flowers[i])
for key in v.keys():
rec.append((f"f{i}", v[key]))
G.add_edges_from(rec)
# labels = {}
cols = ["red" if str(n)[0]=="f" else "green" for n in G.nodes() ]
pos = nx.layout.kamada_kawai_layout(G)
nx.draw(G, pos=pos, with_labels= True, node_color= cols, edge_color="silver")
There are of course many ways to approach the data: with H2O Flow, with Zeppelin, and so on.
Another way to visualize things is with YASGUI. For example, after having pushed the iris data above, you can chart the distribution of a feature straight from a query result.
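A minimal sketch of such a query (reusing the iris: namespace from above); pasted into YASGUI, its result table can be rendered as a chart:
q = """
PREFIX iris: <http://www.orbifold.net/iris/>
SELECT ?class (COUNT(?s) AS ?n)
WHERE { ?s iris:class ?class }
GROUP BY ?class
"""
results = store.query(q)
for r in results.bindings:
    # prints each class next to how many records carry it
    print(r[rdflib.Variable('class')].toPython(), r[rdflib.Variable('n')].toPython())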
How many records do we have in the triple store?
q = """
SELECT (COUNT(*) AS ?count) WHERE {
    # one iris:class triple per record
    ?s <http://www.orbifold.net/iris/class> ?o .
}
"""
results = store.query(q)
print(int(results.bindings[0][rdflib.Variable('count')].value))
150
How many have the word ‘setosa’?
q = """
SELECT (COUNT(*) AS ?count) WHERE {
    ?s <http://www.orbifold.net/iris/class> ?o .
    FILTER (CONTAINS(STR(?o), "setosa"))
}
"""
results = store.query(q)
print(int(results.bindings[0][rdflib.Variable('count')].value))
50
There are plenty of other fun things you can do with triples and Python, but I hope this gets you started.