Apache Jena disaster

If you want to play with SPARQL and triples you will find that you end up with a few options: Apache Jena (Fuseki), Stardog and BrightstarDB. It seems that BrighstarDB is not active anymore and the free version of Stardog is limited to millions of triples. In one of our projects we used Fuseki for a POC and found it to be OK until we hit various incomprehensible problems and loss of data. To be clear: do not use Apache Jena beyond a POC and simple setups. The problem is that you will not find any other SPARQL open source alternative and, hence, forced to buy expensive licenses.

Considering that Jena is relatively old and widely known one would think that it’s battle resistant and though it does not boast enterprise features (e.g. clustering) it should be good to go. Not. For instance:

we found that concurrent reading and writing of triples would bring the service to a halt with as little as three users
reading after writing sometimes requires one to insert a delay in order to let Jena digest the changes
deleted databases persist on disk while not visible in Jena
data sometimes disappears for no reason

and one cannot but systematically distrust the service. Some of the issues likely have their origin in Fuseki (the REST service on top of Jena) but others definitely are deep inside Jena. This situation adds to the problematic acceptance of semantic thinking in the industry. Ontologies and related tools (like Stanford’s Protégé) still very much feel like academic inventions and although there is a growing interest it’s still far away from the relational standard. A bit like the R language for statistical analysis; massive value but quirky at its core.

So, using open source triple stores for real-world applications? Not yet. If your triple count is in the trillions you will have to go for MarkLogic and alike. Which is really a shame since there are so many great open source or free solutions in the relational world.

Journal