Server-side Performance Epic: Updated for June 2022
Overview
I'm creating this issue to provide a state of play on the performance investigations in DjangoLDP to date: how things have improved, which findings are still relevant, and which areas we know still need improving. As we develop this track, we can keep it up to date with new findings and improvements.
References
- This previous performance epic
- This report on server-side performance dating from 08/2020
- Spreadsheet of results which accompany the report
- A number of performance tests can be found spread across various issues in various repos, where discussions with clients were happening on how to improve performance. Unfortunately we didn't centralise where these were stored, but I'll try to find what I can
Test Methodologies
- We set up local and staging servers to use fixtures and generated test data. We wrote this test data generation script, which creates (500) users and (500) projects and randomly spreads the users between the projects as members
- See here the 'instructions to reproduce' for setting up a local environment with test data
- We used `cProfile`, which gives a verbose output of function calls and the time elapsed in each call. I found this excessively granular in general, but it provides a useful way to identify the main pain points and to squeeze out some micro-optimisations in certain functions (to be honest, what we need right now are the big improvements)
- We wrote our own less granular profiling code using timers (link to branch code)
- Using Django silk
- Measuring the global "timing" tab during requests to the API, and when using client-side applications manually
- We compared the performance to mock solutions which weren't using djangoldp features
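The data-generation script itself isn't reproduced here, but its shape can be sketched in plain Python (names, counts, and the 1-5 memberships per user are illustrative; the real script writes Django model instances):

```python
import random

# Illustrative stand-in for the test-data generation script:
# N users, N projects, users spread randomly between projects as members.
N = 500
random.seed(42)  # reproducible fixtures

users = [f"user{i}" for i in range(N)]
projects = {f"project{i}": [] for i in range(N)}

for user in users:
    # each user joins a random handful of projects as a member
    for project in random.sample(list(projects), k=random.randint(1, 5)):
        projects[project].append(user)

total_memberships = sum(len(members) for members in projects.values())
print(f"{len(users)} users, {len(projects)} projects, {total_memberships} memberships")
```

Seeding the generator keeps runs comparable, which matters when you're benchmarking before/after an optimisation.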
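As a minimal sketch of the `cProfile` approach described above (the function profiled here is a stand-in, not DjangoLDP code):

```python
import cProfile
import io
import pstats


def serialize_container(resources):
    """Stand-in for an expensive serialization step."""
    return [{"@id": f"/resources/{i}/", "value": r} for i, r in enumerate(resources)]


profiler = cProfile.Profile()
profiler.enable()
serialize_container(range(10_000))
profiler.disable()

# Print the 10 most expensive calls by cumulative time
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(10)
print(stream.getvalue())
```

The verbosity the epic mentions comes from `print_stats` listing every function reached during the profiled span; sorting by cumulative time is what surfaces the main pain points.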
Existing findings
- Serialization is expensive (profiling found that most of the time was spent in `to_representation`). We found that the cost multiplies by N with each layer of nesting on the serialized resource (O(N^2) with one nested level), while serializing time grows linearly with the number of resources
- Pagination changes serialization so that the cost remains constant O(1) as rows are added (tied to the page size). See below
Existing suggestions
Serializing grows with the number of fields on a resource
Serializing permissions on every resource is expensive
The permissions in Solid follow a WebACL approach, where permissions are serialized for every resource explicitly. There is an open MR in djangoldp which changes our behaviour so that we're not doing this for every resource, only on the containers
Note that this is just about serializing permissions. The object-level permissions are applied by using this generic filter backend
Filtering on object-level permissions
- Django-Guardian is a dependency of djangoldp, introduced during an early refactor of the permissions system. In conventional Django coding, object-level permissions might be expensive, compared to applying permissions classes to specific views. I think that the idea was to have one generic permissions class and have "agents" (listeners) apply changes to object permissions which would allow for a more WebACL approach. In reality object-level permissions are hardly used in package code and most package developers are writing custom permissions classes for their models
- The `LDListMixin` applies filtering to the model it contains. This duplicates the work of a `FilterBackend`, but it does so because many of the containers being served won't be on the parent queryset, but on a nested field (e.g. `user.circles`). It's also an issue because in the future a container should be able to have many kinds of object
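What a filter backend does per request can be illustrated without Django at all. This is a generic sketch of object-level "view" filtering against a WebACL-style mapping, not DjangoLDP's actual backend; the `ACL` structure and `filter_queryset` name are made up for the example:

```python
from typing import Dict, Iterable, List, Set

# Hypothetical ACL: resource id -> set of agents allowed to view it
ACL: Dict[str, Set[str]] = {
    "/circles/1/": {"alice", "bob"},
    "/circles/2/": {"alice"},
    "/circles/3/": set(),
}


def filter_queryset(user: str, resources: Iterable[str]) -> List[str]:
    """Keep only the resources the user has object-level view access to,
    the way a generic filter backend restricts a queryset per request."""
    return [r for r in resources if user in ACL.get(r, set())]


print(filter_queryset("bob", ACL))
```

The cost concern in the section above is that this check runs per object per request, which is why doing the same work again in `LDListMixin` is wasteful.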
Pagination
Pagination improves serialization with increasing record counts by making the cost constant O(1) per request
In Linked Data it's a little more complex for the front-end because of the decentralised nature of applications (especially around ordering), and we had informally estimated we'd need around 5 000 € to complete the feature. On the backend it's already supported
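On a stock Django REST Framework setup, the backend side amounts to configuration like the following (the pagination class and page size here are assumptions for illustration, not DjangoLDP's actual defaults):

```python
# settings.py - illustrative DRF pagination config
REST_FRAMEWORK = {
    "DEFAULT_PAGINATION_CLASS": "rest_framework.pagination.LimitOffsetPagination",
    # serialization cost per request is bounded by this, regardless of table size
    "PAGE_SIZE": 20,
}
```

This is what makes the cost constant: the serializer only ever sees one page of rows, however many records exist.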
Cache
Caching improves the performance of serialization on subsequent requests for a resource. The initial issue demonstrated positive results (I think this is buried in a client issue somewhere). There's an open issue to replace the custom cache and instead tie it into Django's built-in cache system
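Tying into Django's built-in cache system would mean configuring a backend in settings and reading/writing through `django.core.cache` (e.g. `cache.get_or_set`) instead of a custom store. A minimal illustrative config (backend choice and timeout are assumptions, not a decided design):

```python
# settings.py - illustrative: backing the response cache with Django's cache framework
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.locmem.LocMemCache",
        "TIMEOUT": 300,  # seconds before a cached serialization is refreshed
    }
}
```

The advantage over a custom cache is that swapping local memory for Redis or Memcached later is a settings change, not a code change.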
Database pre-fetch
We implemented a pre-fetch mechanism which speeds up database access by making one (full) query up front that provides the data we need for all subsequent queries (during serialization) (initial results)
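The idea behind the pre-fetch (which Django exposes as `prefetch_related` for ORM relations) can be sketched in plain Python: one batched lookup up front, indexed in memory, replaces a per-resource query during serialization. The data and function names here are illustrative, not DjangoLDP code:

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Illustrative rows, as if returned by one batched "full" query
MEMBERSHIP_ROWS: List[Tuple[str, str]] = [
    ("project-1", "alice"),
    ("project-1", "bob"),
    ("project-2", "alice"),
]


def prefetch_members(rows) -> Dict[str, List[str]]:
    """One pass over a single query's rows, indexed for O(1) lookups later."""
    index = defaultdict(list)
    for project_id, username in rows:
        index[project_id].append(username)
    return index


def serialize_project(project_id: str, members_index: Dict[str, List[str]]) -> dict:
    """Serialization reads from the prefetched index - no extra query per project."""
    return {"@id": f"/projects/{project_id}/", "members": members_index[project_id]}


index = prefetch_members(MEMBERSHIP_ROWS)
print(serialize_project("project-1", index))
```

Without the prefetch, serializing N projects issues N membership queries; with it, there is one query plus N dictionary lookups.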
Compression
We enabled standard compression by default on responses. This reduced the file size and download time of responses to the browser
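The epic doesn't name the mechanism (Django's stock option is `django.middleware.gzip.GZipMiddleware`), but the size effect can be demonstrated with the stdlib alone; the payload here is made up to resemble a repetitive container response:

```python
import gzip
import json

# A repetitive JSON-LD-ish payload - the kind of response that compresses well
payload = json.dumps(
    {"ldp:contains": [{"@id": f"/users/{i}/", "@type": "foaf:user"} for i in range(500)]}
).encode("utf-8")

compressed = gzip.compress(payload)
ratio = len(compressed) / len(payload)
print(f"{len(payload)} B -> {len(compressed)} B ({ratio:.0%} of original)")
```

Container responses full of near-identical resources are highly redundant, which is why compression pays off so well here.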