Support Triplestore Database Backend
Specific discussion from one of the tracks opened in #367
Django is designed so that if you need to support a new database, you create a new database backend for it. That isn't straightforward at the best of times, and for us it's especially awkward because Django has been designed so heavily around SQL. A lot of the previous efforts seem to have failed either because they ended up hacking around the SQL, or because they hard-forked Django and found the fork too difficult to maintain.
I don't like the approach of hacking around the SQL. I think the maintenance work would still be high, and at that point we'd be better off using SQL and hacking around RDF.
Overview
Django's ORM tree, down to the database backend, looks like this:
Evidently this is tightly coupled to SQL. A more "Conventional" alternative might look like this:
I liked the gist of the approach I read here https://vsoch.github.io/2016/cogat-neo4j/, and it's got me thinking about solutions which involve forking Django, simulating the same ORM as Django but cutting the tree off at the point where it reaches down to the SQL stuff and then relying on something like RDFLib:
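To make the "cut the tree off above the SQL" idea a bit more concrete, here is a minimal sketch of a QuerySet look-alike that compiles chained filters to a SPARQL string instead of SQL. Everything in it is an assumption for illustration: `SparqlQuerySet`, its `predicates` mapping and `to_sparql()` are hypothetical names, not part of Django, RDFLib or any of the libraries discussed here, and it only handles equality filters.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


def _literal(value) -> str:
    """Render a Python value as a SPARQL literal (strings are quoted and escaped)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


@dataclass(frozen=True)
class SparqlQuerySet:
    """Hypothetical QuerySet look-alike that compiles to SPARQL instead of SQL."""

    rdf_type: str                # IRI of the class being queried
    predicates: Dict[str, str]   # field name -> predicate IRI
    filters: Tuple[Tuple[str, object], ...] = ()

    def filter(self, **kwargs) -> "SparqlQuerySet":
        # Like Django, return a new object so that calls can be chained.
        unknown = set(kwargs) - set(self.predicates)
        if unknown:
            raise KeyError(f"no predicate mapped for: {sorted(unknown)}")
        return SparqlQuerySet(
            self.rdf_type, self.predicates, self.filters + tuple(kwargs.items())
        )

    def to_sparql(self) -> str:
        # One triple pattern for the type, then one per equality filter.
        lines = [f"?s a <{self.rdf_type}> ."]
        for name, value in self.filters:
            lines.append(f"?s <{self.predicates[name]}> {_literal(value)} .")
        body = "\n".join("  " + line for line in lines)
        return "SELECT ?s WHERE {\n" + body + "\n}"


# Example usage with FOAF terms; the field-to-predicate mapping is an assumption.
people = SparqlQuerySet(
    rdf_type="http://xmlns.com/foaf/0.1/Person",
    predicates={"name": "http://xmlns.com/foaf/0.1/name"},
)
print(people.filter(name="Alice").to_sparql())
```

The resulting string could then be handed to whatever executes SPARQL underneath (RDFLib or a remote endpoint). The application-facing API stays Django-like while nothing below it knows about SQL.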
Django codebases
I started by having a look at previous efforts' internal workings and the Django codebase to see how this might fit
- django-semantic includes work on a SPARQLCompiler. Their logic looks closely tied to SQL logic though, so I stopped pursuing it as a source
- django-rdf-io sits between the Django ORM and RDF, mapping one to the other so that a non-RDF Django model can be used to push/pull graph data. They were considering supporting an LDP API and were in fact looking at DjangoLDP
- Neo4Django took a similar approach to ours, providing custom NodeModel definitions, but they also override Django fields and foreign keys so that they correspond to graph equivalents (e.g. #351). For queries they cut the ORM in a similar way to what I describe above, mimicking a QuerySet but translating to the Cypher query language (https://neo4django.readthedocs.io/en/latest/querying.html)
Neo4J
- Neo4J is a NoSQL graph storage with a large community
- the graph is fast (TODO: validating this claim)
- the neosemantics extension allows the input/output of RDF data, including:
  - model validation using SHACL shapes, but not ShEx
  - importing OWL data
- It's not a semantic store underneath; it relies on defined mappings between Neo4J nodes and edges and the URLs of resources and properties
- However neosemantics claims that importing/exporting RDF data is lossless, so I would assume that this holds "even if it's not in the Neo4J schema"
Using Neo4J with Django
Neo4J provides drivers for Python, JavaScript, Go and many other languages
There's an OGM called neomodel and it's supported in Django via the library django-neomodel
Note that it requires hefty redefinition of the model fields, as with Neo4Django:
from neomodel import StructuredNode, StringProperty, UniqueIdProperty, RelationshipTo

class Person(StructuredNode):
    uid = UniqueIdProperty()
    name = StringProperty(unique_index=True)
    friends = RelationshipTo('Person', 'FRIEND')
These fields support Neo4J's graph modelling but not the semantic data, so annoyingly we'd be turning graph storage into Django storage only to convert it again to RDF. Nor would we magically be supporting the Open World Assumption, since we'd still be mapping to statically defined Django models
Once the models are defined, the OGM looks similar to the ORM, but you can see that it's been adapted for exploring graph data:
Person.nodes.all()
person.city.connect(paris)
person.city.is_connected(paris) # True
I think the OGM looks like it would be as pleasant to use as the Django ORM, but the differences between the two are extensive enough that the application logic has to pick one or the other. There's no "my application works for SQL storage and for graph storage", unless it transforms the SQL data into graph data
Since this is a bridge from Neo4J to Django we'd still need to provide a bridge from the OGM to the semantic web
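To illustrate what that extra bridge involves, and why static models don't get us the Open World Assumption, here is a minimal sketch that turns an instance of a statically defined model into N-Triples lines. The `Person` dataclass is only a stand-in for a Django model or neomodel node, and `PREDICATES`, `RDF_TYPE` and `to_ntriples` are hypothetical names; the field-to-predicate mapping is an assumption.

```python
from dataclasses import dataclass, fields

FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical mapping from model field names to predicate IRIs.
PREDICATES = {"name": FOAF + "name", "friend": FOAF + "knows"}


@dataclass
class Person:
    """Stand-in for a statically defined model (Django model or neomodel node)."""
    iri: str
    name: str
    friend: str = ""  # IRI of another Person; empty means no link


def to_ntriples(obj) -> list:
    """Emit one N-Triples line for the type, plus one per mapped, non-empty field."""
    lines = [f"<{obj.iri}> <{RDF_TYPE}> <{FOAF}Person> ."]
    for f in fields(obj):
        value = getattr(obj, f.name)
        if f.name not in PREDICATES or not value:
            continue  # unmapped fields (such as iri itself) are skipped
        # Treat values that look like IRIs as resources, everything else as a plain literal.
        if value.startswith(("http://", "https://")):
            term = f"<{value}>"
        else:
            term = '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'
        lines.append(f"<{obj.iri}> <{PREDICATES[f.name]}> {term} .")
    return lines


alice = Person("http://example.org/alice", "Alice", "http://example.org/bob")
for line in to_ntriples(alice):
    print(line)
```

The export direction is easy. The problem is the reverse: any incoming triple about the same resource whose predicate isn't in the mapping has no field to land in, which is exactly the closed-world limitation described above.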
RDFLib
- Functions for managing graph data which look and feel like Jena's Model. Useful for managing graphs, but not a Django ORM
- Provides parsers and serializers for RDF content
- There's a SPARQL wrapper
- By default RDFLib stores triples in memory, which can be serialized into various RDF formats and written to disk. It runs into problems with large datasets, which can exceed the amount of RAM available (!)
Stores
Bizarrely the RDFLib homepage mentions 2 key-value stores which haven't been updated in 7 years, but it doesn't mention the RDFLib core's SPARQL Store and Sleepycat Triplestore. I was quite relieved when I eventually found these, though
The HDT Store (while being 14.5 times more efficient with storage than NTriples) is only an optimisation for reading, and doesn't support writing
Finally there is SQLAlchemy for SQL storage (#367)
Performance
The first benchmark I looked at is not kind to the Berkeley DB key-value store, and it finds that the PostgreSQL store is 3 times faster (I assume via SQLAlchemy). It doesn't compare the SPARQL or triple stores from the core
In the dozens of community posts I've seen discussing performance, it's normally stated that RDFLib isn't built to be fast for large datasets and doesn't intend to be
Different Approaches
- I saw this comment suggesting the use of Apache Jena Fuseki with RDFLib for better performance
- In the same thread the same author referenced a 2014 blog post about plugging RDFLib into Stardog. I'd never heard of Stardog, but it is used by a host of large organisations
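Both suggestions amount to keeping the triples in an external store and talking to it over SPARQL. As a rough sketch of what that looks like on the wire, the following builds a SPARQL 1.1 Protocol query request using only the standard library. It deliberately does not use RDFLib, and the endpoint URL and dataset name `ds` are assumptions about a local Fuseki install, not something taken from the linked posts.

```python
import json
import urllib.parse
import urllib.request

# Assumed local Fuseki endpoint; the dataset name "ds" is hypothetical.
ENDPOINT = "http://localhost:3030/ds/sparql"


def build_select_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build a SPARQL 1.1 Protocol query request (URL-encoded POST)."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/sparql-results+json",
        },
        method="POST",
    )


def run_select(endpoint: str, query: str) -> list:
    """Send the query and return the result bindings (needs a running server)."""
    with urllib.request.urlopen(build_select_request(endpoint, query)) as resp:
        return json.load(resp)["results"]["bindings"]


# Building the request does not touch the network; run_select() would.
request = build_select_request(ENDPOINT, "SELECT ?s WHERE { ?s ?p ?o } LIMIT 5")
print(request.get_method(), request.full_url)
```

With this split the heavy lifting (indexing, query planning, persistence) happens in the external store, and the Python side only ships queries and reads back results, which is presumably where the performance gain would come from.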
Interfacing RDFLib with Django
TODO.. next :)
Next steps
TODO
Things to look out for (going forward)
- In Django you can bypass the ORM and run raw SQL if you wish. If we provide a triplestore ORM, any library that does this will break
References
On writing custom database backends
- https://simpleisbetterthancomplex.com/media/2016/11/db.pdf
- Django (1.8) in depth https://www.youtube.com/watch?v=tkwZ1jG3XgA
- Django docs
- Django repository
On (pre-)existing Graph backends
Django repositories
- https://djangopackages.org/grids/g/nosql/
- https://github.com/scholrly/neo4django/
- https://github.com/rob-metalinkage/django-rdf-io
- https://github.com/rfloriano/semantic-django
On Neo4J
- https://vsoch.github.io/2016/cogat-neo4j/
- https://medium.com/swlh/create-rest-api-with-django-and-neo4j-database-using-django-nemodel-1290da717df9
On Neo4J neosemantics
- examples starting at around 11:00 https://www.youtube.com/watch?v=5wluUfomasg
- user manual https://neo4j.com/labs/neosemantics/4.0/
On RDFLib
- Performance comparisons of database storages: https://docs.ropensci.org/rdflib/articles/articles/storage.html
- RDFLib repository & issues
- RDFLib 5.0 docs
- RDFLib website
- Various stores repositories, documentation and posts