Storing triple data: overview
This has come up in a couple of recent issues: #156 and djangoldp-account#72
The issue is a work-in-progress, but please feel free to chime in with any research
Why NoSQL?
I think it'd be very challenging to efficiently model RDF's "infinite variety of fields" (the Open World Assumption) in SQL storage. Supporting that variety is useful for standards compliance and for making applications interoperable with third parties
TODO: comparison of triplestore with other NoSQL options
Why SQL?
- It's widely used and already supported in Django, so it was the default option
- Using RDF views on relational data enables you to integrate data available from different sources
- Users with existing SQL databases and SQL logic can keep using them
- anything else?
Apache Jena SDB
Apache Jena used to provide an SQL-backed store, SDB, but they recently discontinued this effort. This thread explains why they discontinued it
Its models were created using RDF and were also stored as RDF, in SQL tables with a custom database schema. It wasn't a general SQL-to-RDF mapping layer like DjangoLDP
Apache Jena TDB
They also provide TDB, a triplestore-backed database system which is widely described as fast, though I haven't seen any benchmark comparisons. Extending Django with a new triplestore database provider is possible, and it wouldn't necessarily prevent users from using SQL-backed databases as well or instead
How would we do it?
Triplestore Database Backend
We'd need to write a custom database backend for Django that uses a triplestore under the hood. Presumably it would rely on RDFLib
This is a low-level feature in Django, and the existing tooling is evidently built with relational database backends in mind, since those are the ones supported
Useful resources
- This presentation: https://simpleisbetterthancomplex.com/media/2016/11/db.pdf
- (video) Django in Depth: https://www.youtube.com/watch?v=tkwZ1jG3XgA
- This paper discusses extending an existing Object-Relational Database Management System (ORDBMS) into an RDF Triple Store: http://vos.openlinksw.com/owiki/wiki/VOS/VOSRDFWP
Similar efforts
- Apache Jena TDB: https://github.com/apache/jena/tree/main/jena-tdb
- Django-RDF-io: https://github.com/rob-metalinkage/django-rdf-io
- Unfinished/unmaintained effort to create a Django backend for OpenLink Virtuoso triplestore: https://github.com/rfloriano/semantic-django
Storing Triples in an SQL Relational Database
Useful resources
- This paper discusses extending an existing Object-Relational Database Management System (ORDBMS) into an RDF Triple Store: http://vos.openlinksw.com/owiki/wiki/VOS/VOSRDFWP
In Semantic Web parlance, this implies representing non-RDF data as RDF data by way of Ontology mapping. Transparently representing non-RDF data as RDF data via Ontology mapping entails either storing application triples in an RDF Triple Store or converting SPARQL queries to SQL (called SQL rewriting) and reformatting the results back into an RDF form
They use one table (RDF_QUAD) to store all quads, with integer IDs for the primary keys, and other tables to convert these internal IDs into external ones. They also support an ANY datatype in SQL (the object column is ANY, to allow literals)
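To make the single-quad-table layout above concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not Virtuoso's actual schema: the table names (term, quad), the helper functions and the example IRIs are all assumptions made for illustration, and every term is stored as plain text, so it does not reproduce the ANY datatype or distinguish literals from IRIs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Lookup table (hypothetical name): maps each external term to an integer ID.
    CREATE TABLE term (
        id    INTEGER PRIMARY KEY,
        value TEXT NOT NULL UNIQUE
    );
    -- Single quad table (hypothetical name): graph, subject, predicate, object as IDs.
    CREATE TABLE quad (
        g INTEGER NOT NULL REFERENCES term(id),
        s INTEGER NOT NULL REFERENCES term(id),
        p INTEGER NOT NULL REFERENCES term(id),
        o INTEGER NOT NULL REFERENCES term(id),
        PRIMARY KEY (g, s, p, o)
    );
""")


def term_id(value):
    """Return the internal integer ID for a term, creating it on first use."""
    conn.execute("INSERT OR IGNORE INTO term (value) VALUES (?)", (value,))
    row = conn.execute("SELECT id FROM term WHERE value = ?", (value,)).fetchone()
    return row[0]


def add_quad(g, s, p, o):
    """Store one quad, converting each external term to its internal ID."""
    ids = tuple(term_id(t) for t in (g, s, p, o))
    conn.execute("INSERT OR IGNORE INTO quad (g, s, p, o) VALUES (?, ?, ?, ?)", ids)


def objects(s, p):
    """Return all object values for a subject/predicate pair, in any graph."""
    rows = conn.execute(
        """
        SELECT o_term.value
        FROM quad
        JOIN term AS s_term ON s_term.id = quad.s
        JOIN term AS p_term ON p_term.id = quad.p
        JOIN term AS o_term ON o_term.id = quad.o
        WHERE s_term.value = ? AND p_term.value = ?
        ORDER BY o_term.value
        """,
        (s, p),
    )
    return [row[0] for row in rows]


# A predicate that no model schema knows about needs no schema change.
add_quad("urn:graph:default", "urn:user:1", "http://xmlns.com/foaf/0.1/name", "Alice")
add_quad("urn:graph:default", "urn:user:1", "urn:example:favouriteColour", "green")
print(objects("urn:user:1", "urn:example:favouriteColour"))  # ['green']
```

The point of the sketch is that adding a previously unseen predicate is just another row, which is what makes this layout a candidate for supporting the Open World Assumption on top of SQL. The cost is that every lookup needs joins against the term table.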
A SQLAlchemy-backed, formula-aware RDFLib Store. It stores its triples in the following partitions:
- Asserted non rdf:type statements.
- Asserted rdf:type statements (in a table which models Class membership). The motivation for this partition is primarily query speed and scalability as most graphs will always have more rdf:type statements than others.
- All Quoted statements.
- some performance results of these kinds of solutions on HUGE datasets: https://www.w3.org/wiki/LargeTripleStores (industry-leading results https://download.oracle.com/otndocs/tech/semantic_web/pdf/OracleSpatialGraph_RDFgraph_1_trillion_Benchmark.pdf)
Mapping SQL to RDF
- Overview: https://www.w3.org/wiki/RdfAndSql
Unlike much of the current work on SQL RDF data stores, this approach attempts to provide RDF access to data in specialized SQL data stores rather than provide an SQL data store for RDF https://www.w3.org/2002/05/24-RDF-SQL/
I think that any effort to store "outside" fields in a new table per field (generated on the fly?) would be horrendously slow. However, it may be possible to store unknown fields in a generic quad store to provide RDF's Open World Assumption (like the Virtuoso solution); see #230. It's worth asking "why?" about this suggestion, though: we may not really want to hack the Open World Assumption into SQL in this way. If an application wants it, wouldn't it migrate its data to a triplestore or to a quadstore SQL schema as above?
- ancient history: https://www.w3.org/DesignIssues/RDB-RDF.html
- Example code in Python (from 2006): http://dig.csail.mit.edu/2006/dbview/dbview.py
TODO: Relational Database Schema -> Ontology
- R2O (2004) (http://www.cs.man.ac.uk/~ocorcho/documents/SWDB2004_BarrasaEtAl.pdf)
- what's happened since this paper?
Oracle
TODO
Provide a system of "RDF Views" which take SQL data and output it in RDF. This is similar to what we're doing by serializing users' relational data
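As a rough illustration of the "RDF view" idea (relational rows exposed as triples without changing how they are stored), the sketch below maps the columns of one table to predicates and emits N-Triples. It is not Oracle's, Virtuoso's or DjangoLDP's implementation: the person table, the BASE and EX IRIs, the mapping dictionary and both function names are assumptions for illustration, and every value is emitted as a plain string literal.

```python
import sqlite3

BASE = "http://example.org/"            # hypothetical base IRI for subjects
FOAF = "http://xmlns.com/foaf/0.1/"
EX = "http://example.org/vocab#"        # hypothetical vocabulary

# Hypothetical mapping for one table: column name -> predicate IRI.
PERSON_MAPPING = {"name": FOAF + "name", "city": EX + "city"}


def rows_as_triples(conn, table, mapping):
    """Yield (subject, predicate, object) for every mapped, non-NULL column.

    The table and column names are interpolated into the SQL, so they must
    come from trusted configuration, never from user input.
    """
    columns = ", ".join(mapping)
    for row in conn.execute(f"SELECT id, {columns} FROM {table} ORDER BY id"):
        subject = f"{BASE}{table}/{row[0]}"
        for predicate, value in zip(mapping.values(), row[1:]):
            if value is not None:
                yield subject, predicate, str(value)


def to_ntriples(triple):
    """Serialize one triple as an N-Triples line with a plain string literal."""
    s, p, o = triple
    escaped = (
        o.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'<{s}> <{p}> "{escaped}" .'


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO person (id, name, city) VALUES (?, ?, ?)",
    [(1, "Alice", "Paris"), (2, "Bob", None)],
)

triples = list(rows_as_triples(conn, "person", PERSON_MAPPING))
for triple in triples:
    print(to_ntriples(triple))
```

The relational data stays the source of truth and the RDF is derived on read. That is also why this approach alone cannot hold fields the schema doesn't know about.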
Alternative approaches
Leveraging an existing solution
- Virtuoso provides both SQL rewriting and triplestore based solutions. https://www.w3.org/wiki/VirtuosoUniversalServer
- TODO: further research (https://www.w3.org/wiki/SemanticWebTools)
Implementing a patch
- I'm wondering if we could adapt a JSONField to hold "other fields", storing in that column a document containing all of the fields not covered by the model schema. Querying this data might cause problems and require us to patch parts of Django. Accessing my_obj.other_fields['my_field'] on the object breaks with standard JSON-LD access, and making the access integrate seamlessly (i.e. using my_obj.my_field) would also be tricky
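One possible way to get the seamless my_obj.my_field access is Python's __getattr__ hook, which is only called when normal attribute lookup fails. The sketch below uses a plain Python class as a stand-in for a model instance; the class name OpenWorldRecord is hypothetical and other_fields is an ordinary dict here, so it says nothing about how a real JSONField would behave when querying or saving.

```python
class OpenWorldRecord:
    """Stand-in for a model instance with one known field and a catch-all dict.

    This is not a Django model; other_fields plays the role of the JSON column.
    """

    def __init__(self, name, other_fields=None):
        self.name = name                              # field known to the schema
        self.other_fields = dict(other_fields or {})  # everything else

    def __getattr__(self, attr):
        # Only reached when normal lookup fails, so known fields such as
        # `name` never pass through here. Going via __dict__ avoids infinite
        # recursion if other_fields has not been set yet.
        try:
            return self.__dict__["other_fields"][attr]
        except KeyError:
            raise AttributeError(attr) from None


obj = OpenWorldRecord("Alice", {"favourite_colour": "green"})
print(obj.name)              # Alice
print(obj.favourite_colour)  # green
```

This only covers reads. Writes to unknown attributes, ORM filters on them and serialization would each need their own handling, which is where patching Django is likely to come in.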
Using MongoDB
MongoDB is a fast and popular document store. Storing JSON means that we can store JSON-LD in the database. It's possible to use MongoDB with Django. Using PyMongo makes it truly schema-less, but you lose the Django ORM and instead write Mongo queries using Python dictionaries. Clients should be able to use this now with DjangoLDP to recover the Open World Assumption as they need it. This requires testing with viewsets, models and serializers to see how djangoldp handles it in practice