Server-side Performance Epic: Updated for June 2022
Overview
I'm creating this issue to provide a state of play on the performance investigations in DjangoLDP to date: how things have improved, which findings are still relevant, and which areas we know still need improving. As we develop this track, we can keep it up to date with new findings and improvements.
References
- This previous performance epic
- This report on server-side performance dating from 08/2020
- Spreadsheet of results which accompany the report
- A number of performance tests can be found spread across various issues in various repos, where discussions with clients were happening on how to improve performance. Unfortunately we didn't centralise where these were stored, but I'll try to find what I can
Test Methodologies
- We set up local and staging servers to use fixtures and generated test data. We wrote this test data generation script, which creates (500) users and (500) projects and randomly spreads the users between the projects as members
- See here the 'instructions to reproduce' for setting up a local environment with test data
- We used `cProfile`, which gives a verbose output of function calls and the time elapsed in each call. I found this excessively granular in general, but it provides a useful way to identify the main pain points and to squeeze out some micro-optimisations in certain functions (to be honest, what we need right now are the big improvements)
- We wrote our own less granular profiling code using timers (link to branch code)
- Using Django silk
- Measuring the global "timing" tab during requests to the API, and when using client-side applications manually
- We compared the performance to mock solutions which weren't using djangoldp features
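The data-generation script itself isn't reproduced here, but its shape can be sketched in plain Python (names, counts, and the 1-5 memberships per user are illustrative; the real script writes Django model instances):

```python
import random

# Illustrative stand-in for the test-data generation script:
# N users, N projects, users spread randomly between projects as members.
N = 500
random.seed(42)  # reproducible fixtures

users = [f"user{i}" for i in range(N)]
projects = {f"project{i}": [] for i in range(N)}

for user in users:
    # each user joins a random handful of projects as a member
    for project in random.sample(list(projects), k=random.randint(1, 5)):
        projects[project].append(user)

total_memberships = sum(len(members) for members in projects.values())
print(f"{len(users)} users, {len(projects)} projects, {total_memberships} memberships")
```

Seeding the generator keeps runs comparable, which matters when you're benchmarking before/after an optimisation.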
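As a minimal sketch of the `cProfile` approach described above (the function profiled here is a stand-in, not DjangoLDP code):

```python
import cProfile
import io
import pstats


def serialize_container(resources):
    """Stand-in for an expensive serialization step."""
    return [{"@id": f"/resources/{i}/", "value": r} for i, r in enumerate(resources)]


profiler = cProfile.Profile()
profiler.enable()
serialize_container(range(10_000))
profiler.disable()

# Print the 10 most expensive calls by cumulative time
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(10)
print(stream.getvalue())
```

The verbosity the epic mentions comes from `print_stats` listing every function reached during the profiled span; sorting by cumulative time is what surfaces the main pain points.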
Existing findings
- Serialization is expensive (profiling found that most of the time was spent in `to_representation`). We found that the cost multiplies by N with each layer of nesting on the serialized resource (O(N^2) with one nested level), while serializing time grows linearly with the number of resources
- Pagination changes serialization so that the cost remains constant O(1) as rows are added (tied to the page size). See below
Existing suggestions
Serializing grows with the number of fields on a resource
Serializing permissions on every resource is expensive
The permissions in Solid follow a WebACL approach, where permissions are serialized for every resource explicitly. There is an open MR in djangoldp which changes our behaviour so that we're not doing this for every resource, only on the containers
Note that this is just about serializing permissions. The object-level permissions are applied by using this generic filter backend
Filtering on object-level permissions
- Django-Guardian is a dependency of djangoldp, introduced during an early refactor of the permissions system. In conventional Django coding, object-level permissions might be expensive, compared to applying permissions classes to specific views. I think that the idea was to have one generic permissions class and have "agents" (listeners) apply changes to object permissions which would allow for a more WebACL approach. In reality object-level permissions are hardly used in package code and most package developers are writing custom permissions classes for their models
- The `LDListMixin` applies filtering to the model it contains. This duplicates the work of a `FilterBackend`, but it does so because many of the containers being served won't be on the parent queryset, but on a nested field (e.g. `user.circles`). It's also an issue because in the future a container should be able to have many kinds of object
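What a filter backend does per request can be illustrated without Django at all. This is a generic sketch of object-level "view" filtering against a WebACL-style mapping, not DjangoLDP's actual backend; the `ACL` structure and `filter_queryset` name are made up for the example:

```python
from typing import Dict, Iterable, List, Set

# Hypothetical ACL: resource id -> set of agents allowed to view it
ACL: Dict[str, Set[str]] = {
    "/circles/1/": {"alice", "bob"},
    "/circles/2/": {"alice"},
    "/circles/3/": set(),
}


def filter_queryset(user: str, resources: Iterable[str]) -> List[str]:
    """Keep only the resources the user has object-level view access to,
    the way a generic filter backend restricts a queryset per request."""
    return [r for r in resources if user in ACL.get(r, set())]


print(filter_queryset("bob", ACL))
```

The cost concern in the section above is that this check runs per object per request, which is why doing the same work again in `LDListMixin` is wasteful.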
Pagination
Pagination improves serialization with increasing record counts by making the cost constant O(1) per request
In Linked Data it's a little more complex for the front-end because of the decentralised nature of applications (especially around ordering), and we had informally estimated we'd need around 5 000 € to complete the feature. On the backend it's already supported
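On a stock Django REST Framework setup, the backend side amounts to configuration like the following (the pagination class and page size here are assumptions for illustration, not DjangoLDP's actual defaults):

```python
# settings.py - illustrative DRF pagination config
REST_FRAMEWORK = {
    "DEFAULT_PAGINATION_CLASS": "rest_framework.pagination.LimitOffsetPagination",
    # serialization cost per request is bounded by this, regardless of table size
    "PAGE_SIZE": 20,
}
```

This is what makes the cost constant: the serializer only ever sees one page of rows, however many records exist.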
Cache
Caching improves the performance of serialization on subsequent requests for a resource. The initial issue demonstrated positive results (I think this is buried in a client issue somewhere). There's an open issue to replace the custom cache and instead tie it into Django's built-in cache system
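Tying into Django's built-in cache system would mean configuring a backend in settings and reading/writing through `django.core.cache` (e.g. `cache.get_or_set`) instead of a custom store. A minimal illustrative config (backend choice and timeout are assumptions, not a decided design):

```python
# settings.py - illustrative: backing the response cache with Django's cache framework
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.locmem.LocMemCache",
        "TIMEOUT": 300,  # seconds before a cached serialization is refreshed
    }
}
```

The advantage over a custom cache is that swapping local memory for Redis or Memcached later is a settings change, not a code change.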
Database pre-fetch
We implemented a pre-fetch mechanism which speeds up database access by making one (full) query up front that provides the data we need for all subsequent queries (during serialization) (initial results)
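The idea behind the pre-fetch (which Django exposes as `prefetch_related` for ORM relations) can be sketched in plain Python: one batched lookup up front, indexed in memory, replaces a per-resource query during serialization. The data and function names here are illustrative, not DjangoLDP code:

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Illustrative rows, as if returned by one batched "full" query
MEMBERSHIP_ROWS: List[Tuple[str, str]] = [
    ("project-1", "alice"),
    ("project-1", "bob"),
    ("project-2", "alice"),
]


def prefetch_members(rows) -> Dict[str, List[str]]:
    """One pass over a single query's rows, indexed for O(1) lookups later."""
    index = defaultdict(list)
    for project_id, username in rows:
        index[project_id].append(username)
    return index


def serialize_project(project_id: str, members_index: Dict[str, List[str]]) -> dict:
    """Serialization reads from the prefetched index - no extra query per project."""
    return {"@id": f"/projects/{project_id}/", "members": members_index[project_id]}


index = prefetch_members(MEMBERSHIP_ROWS)
print(serialize_project("project-1", index))
```

Without the prefetch, serializing N projects issues N membership queries; with it, there is one query plus N dictionary lookups.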
Compression
We enabled standard compression by default on responses. This reduced the file size and download time of responses to the browser
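The epic doesn't name the mechanism (Django's stock option is `django.middleware.gzip.GZipMiddleware`), but the size effect can be demonstrated with the stdlib alone; the payload here is made up to resemble a repetitive container response:

```python
import gzip
import json

# A repetitive JSON-LD-ish payload - the kind of response that compresses well
payload = json.dumps(
    {"ldp:contains": [{"@id": f"/users/{i}/", "@type": "foaf:user"} for i in range(500)]}
).encode("utf-8")

compressed = gzip.compress(payload)
ratio = len(compressed) / len(payload)
print(f"{len(payload)} B -> {len(compressed)} B ({ratio:.0%} of original)")
```

Container responses full of near-identical resources are highly redundant, which is why compression pays off so well here.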