Storing triple data: overview

This has come up in a couple of recent issues: #156 and djangoldp-account#72

The issue is a work-in-progress, but please feel free to chime in with any research

Why NoSQL?

I think it would be very challenging to efficiently model RDF's "infinite variety of fields" (the Open World Assumption) in SQL storage. Supporting the Open World Assumption is useful for standards compliance and for making applications interoperable with third parties

TODO: comparison of triplestore with other NoSQL options

Why SQL?

It's widely used and already supported in Django, which is why it was the default option
Using RDF views on relational data enables you to integrate data available from different sources
Users with existing SQL databases and SQL logic can keep using them
Anything else?

Apache Jena SDB

Apache Jena used to provide an SQL-backed store (SDB), but they recently discontinued this effort. This thread explains why they discontinued it

Their models were created using RDF and stored as RDF in an SQL database using a custom schema. It wasn't a general SQL-to-RDF mapping layer like DjangoLDP

Apache Jena TDB

They provide a triplestore-backed database system which is widely described as fast, but I haven't seen any benchmark comparisons. Extending Django with a new triplestore database provider is possible, and it wouldn't necessarily prevent users from using SQL-backed databases as well or instead

How would we do it?

Triplestore Database Backend

We'd need to write a custom database backend for Django using a triplestore under the hood. Presumably it would rely on RDFLib

This is a low-level feature in Django, and the existing backends and tooling are evidently built with relational databases in mind, since these are the ones supported

Useful resources

This presentation: https://simpleisbetterthancomplex.com/media/2016/11/db.pdf
(video) Django in Depth: https://www.youtube.com/watch?v=tkwZ1jG3XgA
This paper discusses extending an existing Object-Relational Database Management System (ORDBMS) into an RDF Triple Store: http://vos.openlinksw.com/owiki/wiki/VOS/VOSRDFWP

Similar efforts

Apache Jena TDB: https://github.com/apache/jena/tree/main/jena-tdb
Django-RDF-io: https://github.com/rob-metalinkage/django-rdf-io
Unfinished/unmaintained effort to create a Django backend for OpenLink Virtuoso triplestore: https://github.com/rfloriano/semantic-django

Storing Triples in an SQL Relational Database

Useful resources

This paper discusses extending an existing Object-Relational Database Management System (ORDBMS) into an RDF Triple Store: http://vos.openlinksw.com/owiki/wiki/VOS/VOSRDFWP

In Semantic Web parlance, this implies representing non-RDF data as RDF data by way of Ontology mapping. Transparently representing non-RDF data as RDF data via Ontology mapping entails either storing application triples in an RDF Triple Store or converting SPARQL queries to SQL (called SQL rewriting) and reformatting the results back into an RDF form

They use one table to store all quads (RDF_QUAD), with integer IDs for the primary keys, and then use other tables to convert these internal IDs into external ones. They provide support for an ANY datatype in SQL (Object is ANY, to allow literals)
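To make the single-quad-table idea concrete, here is a minimal sketch using SQLite from the Python standard library. It is not Virtuoso's schema: the table and function names (`rdf_term`, `rdf_quad`, `term_id`, `add_quad`, `objects`) are made up for illustration, and every term is stored as plain text rather than through an ANY datatype.

```python
import sqlite3


def open_quad_store():
    """Create an in-memory store: one lookup table for terms, one table for quads."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE rdf_term (
            id INTEGER PRIMARY KEY,
            value TEXT NOT NULL UNIQUE
        );
        CREATE TABLE rdf_quad (
            g INTEGER NOT NULL REFERENCES rdf_term(id),
            s INTEGER NOT NULL REFERENCES rdf_term(id),
            p INTEGER NOT NULL REFERENCES rdf_term(id),
            o INTEGER NOT NULL REFERENCES rdf_term(id),
            PRIMARY KEY (g, s, p, o)
        );
    """)
    return conn


def term_id(conn, value):
    """Return the internal integer ID for an external term, creating it if needed."""
    conn.execute("INSERT OR IGNORE INTO rdf_term (value) VALUES (?)", (value,))
    row = conn.execute(
        "SELECT id FROM rdf_term WHERE value = ?", (value,)
    ).fetchone()
    return row[0]


def add_quad(conn, g, s, p, o):
    """Store one quad as four integer IDs; duplicates are silently ignored."""
    ids = tuple(term_id(conn, term) for term in (g, s, p, o))
    conn.execute(
        "INSERT OR IGNORE INTO rdf_quad (g, s, p, o) VALUES (?, ?, ?, ?)", ids
    )


def objects(conn, s, p):
    """Return all object values for a subject and predicate, in any graph."""
    rows = conn.execute(
        """
        SELECT o_term.value
        FROM rdf_quad q
        JOIN rdf_term s_term ON s_term.id = q.s
        JOIN rdf_term p_term ON p_term.id = q.p
        JOIN rdf_term o_term ON o_term.id = q.o
        WHERE s_term.value = ? AND p_term.value = ?
        ORDER BY o_term.value
        """,
        (s, p),
    )
    return [row[0] for row in rows]
```

Every lookup has to join back to the term table to turn internal IDs into external values, which is the cost of keeping the quad table compact.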

RDFLib: https://github.com/RDFLib/rdflib-sqlalchemy

A SQLAlchemy-backed, formula-aware RDFLib Store. It stores its triples in the following partitions:

Asserted non rdf:type statements.

Asserted rdf:type statements (in a table which models Class membership). The motivation for this partition is primarily query speed and scalability as most graphs will always have more rdf:type statements than others.

All Quoted statements.
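The partitioning described above can be sketched as a simple routing step. This is only an illustration of the idea, not rdflib-sqlalchemy's actual code: `partition_statements` and the `(subject, predicate, object, quoted)` tuple layout are assumptions made for the example.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def partition_statements(statements):
    """Route (subject, predicate, object, quoted) tuples into three partitions."""
    partitions = {"asserted": [], "type": [], "quoted": []}
    for subject, predicate, obj, quoted in statements:
        if quoted:
            # Quoted statements are kept apart from asserted ones.
            partitions["quoted"].append((subject, predicate, obj))
        elif predicate == RDF_TYPE:
            # Class membership: the predicate is implied, so only store two columns.
            partitions["type"].append((subject, obj))
        else:
            partitions["asserted"].append((subject, predicate, obj))
    return partitions
```

In a real store each partition would be its own table, so a query for class membership only touches the narrower type table.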

Some performance results for these kinds of solutions on huge datasets: https://www.w3.org/wiki/LargeTripleStores (industry-leading results: https://download.oracle.com/otndocs/tech/semantic_web/pdf/OracleSpatialGraph_RDFgraph_1_trillion_Benchmark.pdf)

Mapping SQL to RDF

Overview: https://www.w3.org/wiki/RdfAndSql

Unlike much of the current work on SQL RDF data stores, this approach attempts to provide RDF access to data in specialized SQL data stores rather than provide an SQL data store for RDF https://www.w3.org/2002/05/24-RDF-SQL/

I think that any effort to store "outside" fields by creating a new table per field (generated on the fly?) would be horrendously slow. However, it may be possible to store unknown fields in a generic quad store to provide the Open World Assumption principle of RDF (like the Virtuoso solution) -- #230. It's worth asking "why?" with regard to this suggestion, though, because it may be that we don't really want to hack the Open World Assumption into SQL in this way (if an application wants it, wouldn't they migrate their data to a triplestore or a quadstore SQL schema as above?)

Ancient history: https://www.w3.org/DesignIssues/RDB-RDF.html
Example code in Python (from 2006): http://dig.csail.mit.edu/2006/dbview/dbview.py

TODO: Relational Database Schema -> Ontology

R2O (2004) (http://www.cs.man.ac.uk/~ocorcho/documents/SWDB2004_BarrasaEtAl.pdf)
What's happened since this paper?

Oracle

TODO

https://docs.oracle.com/database/121/RDFRM/rdf-views.htm#RDFRM556

Oracle provides a system of "RDF Views" which take SQL data and output it as RDF. This is similar to what we're doing by serializing users' relational data
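As a rough illustration of the "view relational rows as RDF" idea (not Oracle's RDF Views, and not how DjangoLDP serializes models), the sketch below maps each row to a subject IRI and each mapped column to a predicate. The function name `rows_to_triples`, the base IRI and the column-to-predicate mapping are all hypothetical.

```python
import sqlite3


def rows_to_triples(conn, table, key_column, base_iri, column_predicates):
    """Expose the rows of one table as (subject, predicate, object) triples.

    column_predicates maps a column name to the predicate IRI used for it.
    Table and column names cannot be bound as SQL parameters, so they are
    quoted as identifiers and must come from trusted configuration.
    """
    columns = [key_column, *column_predicates]
    sql = 'SELECT {} FROM "{}" ORDER BY "{}"'.format(
        ", ".join(f'"{column}"' for column in columns), table, key_column
    )
    triples = []
    for row in conn.execute(sql):
        subject = f"{base_iri}{row[0]}"
        for column, value in zip(column_predicates, row[1:]):
            if value is None:
                # A NULL column simply produces no triple.
                continue
            triples.append((subject, column_predicates[column], value))
    return triples
```

The data stays in its relational table; the triples are produced on read, which is the opposite direction from storing RDF in SQL.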

Alternative approaches

Leveraging an existing solution

Virtuoso provides both SQL rewriting and triplestore-based solutions: https://www.w3.org/wiki/VirtuosoUniversalServer
TODO: further research (https://www.w3.org/wiki/SemanticWebTools)

Implementing a patch

I'm wondering if we could adapt a JSONField to hold "other fields", storing in the column a document containing all of the fields not covered by the model schema. Querying this data might cause some problems and require us to patch parts of Django. Accessing my_obj.other_fields['my_field'] on the object departs from standard JSON-LD access, and it would be tricky to make the access integrate seamlessly, i.e. using my_obj.my_field
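One possible way to get the my_obj.my_field style of access is an attribute fallback. The sketch below uses a plain Python class so it runs without Django; `OpenWorldMixin` and `Profile` are hypothetical names, and in a real model `other_fields` would presumably be a `models.JSONField(default=dict)`. Whether this plays well with Django's own attribute handling and with DjangoLDP's serializers is untested.

```python
class OpenWorldMixin:
    """Hypothetical mixin: unknown attributes fall back to the other_fields dict."""

    def __getattr__(self, name):
        # __getattr__ only runs after normal attribute lookup has failed.
        # Guard against recursion if other_fields itself is not set yet.
        if name.startswith("_") or name == "other_fields":
            raise AttributeError(name)
        try:
            return self.other_fields[name]
        except KeyError:
            raise AttributeError(name) from None


class Profile(OpenWorldMixin):
    """Stand-in for a model with one schema field and an open-ended document."""

    def __init__(self, name, other_fields=None):
        self.name = name
        self.other_fields = dict(other_fields or {})


profile = Profile("Alice", {"nickname": "Al"})
print(profile.name)      # schema field, found by normal lookup
print(profile.nickname)  # unknown field, served from other_fields
```

This only covers reads on an instance. Writes (`my_obj.my_field = ...`) and queryset filtering would still need their own handling, although Django's JSONField does support key lookups such as `other_fields__nickname`.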

Using MongoDB

MongoDB is a fast and popular document store. Storing JSON means that we can store JSON-LD in the database. It's possible to use MongoDB with Django. Using PyMongo makes it truly schema-less, but you lose the Django ORM and instead write Mongo queries using Python dictionaries. Clients should be able to use this now with DjangoLDP to recover the Open World Assumption as they need it. This requires testing with viewsets, models and serializers to see how DjangoLDP handles it in practice

Edited Feb 21, 2022 by Calum Mackervoy