When creating a bunch of activities, the ActivityQueue takes somewhere between 10 and 60 seconds before actually sending each activity.
To reproduce:
Create a DjangoLDP server following the Hubl format
You may want to clog the Prosody part; you can emulate it with this node server using fastify+fastify-plugin (or with the minimal stand-in sketched after these steps). If you use it, you'll need to set JABBER_DEFAULT_HOST in the template to the JABBER_HOST value from the js file, and PROSODY_HTTP_URL to http://localhost:4848.
Create an RSA key: `./manage.py creatersakey`
Register an administrator: `./manage.py createsuperuser`
Now create a user, a circle, or save any resource related to DjangoLDP-Account, DjangoLDP-Circle or DjangoLDP-Project, then go to http://localhost:8000/admin/djangoldp/activity/. Notice how long it takes to send any activity, and how long before you get something on the node server.
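If you don't want to run the fastify server, a rough Python stand-in that simply logs when each activity arrives (so you can measure the lag) could look like this; the port matches the PROSODY_HTTP_URL above, everything else is just an assumption:

```python
# Minimal stand-in for the Prosody HTTP endpoint used in the repro above: it
# accepts any POST and logs a timestamp so you can see how long the
# ActivityQueue waited before delivering. Paths and port are only assumptions.
import datetime
from http.server import BaseHTTPRequestHandler, HTTPServer

class ActivityLogger(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        print(f"{datetime.datetime.now().isoformat()} {self.path} {body[:200]!r}")
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    # 4848 matches the PROSODY_HTTP_URL from the steps above
    HTTPServer(("localhost", 4848), ActivityLogger).serve_forever()
```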
Just so it is written somewhere, this is the cause of one of the most pressing issues on Hubl at the moment. When someone adds a user to a circle, it sometimes takes minutes before being able to @mention them. Our users conclude that adding that person to the circle didn't work. I personally get this bug reported to me almost every day now.
Last week I discussed the ActivityQueueService and potential redesigns with JB:
Why the delay before sending the activity?
Referring to this delay, during which the processor switches back to the main thread
We decided to include it because previously a lot of redundant activities were being sent (especially with many-to-many fields); the delay before sending allows time for new activities to come along and invalidate the pending one
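For context, a simplified sketch of that idea, not the real ActivityQueueService code: each scheduled activity waits for a short delay, and a newer activity for the same target replaces the pending one so only the latest gets sent.

```python
# Simplified illustration of "delay so newer activities can invalidate older
# ones" -- not the actual ActivityQueueService implementation.
import threading

ACTIVITY_DELAY = 0.1   # seconds; purely illustrative
_pending = {}          # target URL -> latest scheduled activity
_lock = threading.Lock()

def schedule(target, activity, send):
    with _lock:
        _pending[target] = activity  # a newer activity replaces the pending one

    def flush():
        with _lock:
            if _pending.get(target) is not activity:
                return               # superseded; the newer activity's own timer will send it
            latest = _pending.pop(target)
        send(target, latest)         # only the most recent activity is delivered

    threading.Timer(ACTIVITY_DELAY, flush).start()
```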
Why are they sent in a single blocking thread?
In Python, having multiple threads is something of an illusion: because of the GIL they effectively execute one at a time. I can share my notes on the Python concurrency research I did when I decided to use multi-threading rather than multi-processing if you like, but from memory it was because the queue is thread-safe where multi-processing isn't (a minimal sketch of the pattern follows below)
Thread-safety
We decided against using Celery at the time because it adds a broker (like Redis) as an infrastructure dependency (post)
We didn't use asyncio because it requires coroutines to explicitly say when they will give up the processor
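For reference, the single-worker pattern described above boils down to roughly this (a sketch under my own assumptions, not the actual ActivityQueueService code):

```python
# Sketch of the current single-worker design: many request threads enqueue
# activities, one background thread sends them. queue.Queue handles the
# locking, which is why this design is thread-safe.
import queue
import threading
import requests

activity_queue = queue.Queue()

def worker():
    while True:
        target, activity = activity_queue.get()
        try:
            # this blocking HTTP call is what holds up every other activity
            requests.post(target, json=activity, timeout=10)
        except requests.RequestException:
            pass  # a real implementation would retry / log here
        finally:
            activity_queue.task_done()

threading.Thread(target=worker, daemon=True).start()

def schedule_activity(target, activity):
    activity_queue.put((target, activity))  # safe to call from any thread
```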
Questions about how it could be
Removing the sender delay
Ideally we'd remove this delay, but we would need to find an alternative solution to the problem it resolved. The activity is scheduled in a listener (e.g. post_save), where obviously we can't know what might be scheduled later, except for example that a Create activity will be followed by an Update in Django
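To make that concrete, an indicative sketch of where the scheduling happens (the receiver, the `urlid` check and the `schedule_activity` helper are illustrative, not the exact DjangoLDP code):

```python
# A post_save listener can't know whether another save (e.g. the Update that
# follows a Create) will arrive a moment later, which is what the sender
# delay was compensating for. Names below are illustrative only.
from django.db.models.signals import post_save
from django.dispatch import receiver

@receiver(post_save)
def schedule_backlink_activity(sender, instance, created, **kwargs):
    if not hasattr(instance, "urlid"):
        return  # only LDP resources carry a urlid
    activity_type = "Create" if created else "Update"
    schedule_activity(instance.urlid, {"type": activity_type})  # hypothetical helper
```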
Thread-safe storage of scheduled activities
We store activities in a scheduled state because it allows us to revive them if, for example, the server goes down before they were delivered. I discussed with JB using the file system to store the temporary activities, and I think this would be a better solution
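A rough sketch of that file-system idea (the spool path and layout are assumptions; nothing like this exists yet):

```python
# Assumed spool-directory layout for persisting scheduled activities so they
# can be revived after a crash; purely illustrative.
import json
import os
import uuid

SPOOL_DIR = "/var/spool/djangoldp-activities"   # assumed path

def persist(target, activity):
    os.makedirs(SPOOL_DIR, exist_ok=True)
    path = os.path.join(SPOOL_DIR, f"{uuid.uuid4()}.json")
    with open(path, "w") as f:
        json.dump({"target": target, "activity": activity}, f)
    return path   # delete this file once the activity is delivered

def revive(send):
    # called on startup: re-send anything that was never delivered
    if not os.path.isdir(SPOOL_DIR):
        return
    for name in os.listdir(SPOOL_DIR):
        path = os.path.join(SPOOL_DIR, name)
        with open(path) as f:
            entry = json.load(f)
        send(entry["target"], entry["activity"])
        os.remove(path)
```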
Sending activities concurrently
We could switch to a Celery/Redis-based solution and have it manage the concurrency for us (see the sketch after this list)
This avoids reinventing the wheel but adds infrastructure dependencies for all users of DjangoLDP
When we made the first version(s) of the backlinks system it was fairly simple, so we decided it was better to avoid the dependencies than the programming overhead. Is this still the case?
We could use a multi-processing solution
i.e. parallelism instead of concurrency: one or more parallel processes run the Activity Queue Worker(s) to send the activities
Using the file system for scheduled activities as suggested above should remove the thread-safety issue of using a separate process
Without having done a full estimation, I think refactoring to use Redis and using multiprocessing with the filesystem are similar scales of work
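For scale, the Celery variant mentioned above would look roughly like this (the task and its options are my assumptions, not an existing DjangoLDP API; Celery's broker would replace the in-process queue):

```python
# Rough shape of the Celery/Redis option; illustrative only.
from celery import shared_task
import requests

@shared_task(bind=True, max_retries=3, default_retry_delay=5)
def send_activity(self, target, activity):
    try:
        response = requests.post(target, json=activity, timeout=10)
        response.raise_for_status()
    except requests.RequestException as exc:
        raise self.retry(exc=exc)   # Celery reschedules the delivery for us

# callers would replace the in-process queue with:
#   send_activity.delay(target, activity)
```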
Waiting for the Activity response shouldn't block me from processing other activities
This is a shortcoming of the Python requests library which we're using. We should use an asynchronous variant, like aiohttp
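For comparison, the non-blocking send this points at could look something like this with aiohttp (a sketch, not existing code):

```python
# Sketch of sending several activities without waiting on each response in
# turn, using aiohttp instead of the blocking requests call.
import asyncio
import aiohttp

async def send_all(deliveries):
    async with aiohttp.ClientSession() as session:
        async def send(target, activity):
            async with session.post(target, json=activity) as resp:
                return target, resp.status
        # all requests are in flight at once; one slow inbox no longer blocks the rest
        return await asyncio.gather(*(send(t, a) for t, a in deliveries))

# asyncio.run(send_all([("https://example.org/inbox", {"type": "Update"})]))
```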
Keep the current infrastructure. Add a Celery+Redis alternative, make each one possible to activate/de-activate, and enable one or the other based on the configuration.
About the sender delay, let's keep it as low as possible, like <100ms. It sounds like the current delay is applied per activity, meaning that 10 identical activities with DEFAULT_ACTIVITY_DELAY set to 3 may lead to 30 seconds of waiting before the activity is actually sent.
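To make the suggestion concrete, the kind of settings this implies (the Celery flag is hypothetical, not an existing DjangoLDP setting; DEFAULT_ACTIVITY_DELAY is the one discussed above):

```python
# settings.py sketch: keep the current queue but make the backend and delay
# configurable. USE_CELERY_ACTIVITY_QUEUE is a hypothetical name.
DEFAULT_ACTIVITY_DELAY = 0.1         # seconds, applied per activity
USE_CELERY_ACTIVITY_QUEUE = False    # switch to the Celery+Redis backend when True
```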
@calummackervoy @balessan It's not OK to take 15+ days to answer messages.
I can totally understand that you guys have plenty to do, but answering us here takes one minute.
If this can be prioritized and done faster, that's really helpful.
That's one of the most painful things in Hubl at the moment.
@alexbourlier sorry about that. I didn't know the answer to your question but I suppose I could have replied "I don't know"
If this can be prioritized and done faster, that's really helpful
I was wondering about the relationship to #332. A priori it can be done alongside it but there will need to be good communication between those devs. Is one or the other a clear priority for FNK?
Yes, I think not letting conversations die is easy, and it removes this sense of a black hole that we sometimes get
I'm not able to answer regarding the priority between the two issues you're pointing out. My understanding is that they are both needed to have notifications on the job board working. That's our need: notifications on the job board.
@jbpasquier may have an opinion about how to prioritize those two, or whether they can be parallelized or not
In the meantime I opened an MR for a shorter activity delay (!209 (merged)); we could deploy it next week when JB's back
I changed the delay from 3s to 100ms; in my testing on the pre-prod it didn't mean that redundant activities were sent, and evidently the queue is a lot faster. Sorry, I could've thought to do this sooner, or been less cautious with my original default!
No worries, I'm not sure we're able to nail the right approach on our first try all the time. Coming from an English culture probably makes it worse (I'm joking)
P.S. I think that #380 and #381 are not required by FNK since you plan to use a new celery-based solution
There were some discussions on another thread about whether this issue is a core-team concern or a client package. I think that #382 at least is no-doubt an extension to DjangoLDP, it's not a bug that we decided not to use celery
Do I have a green light to provide an estimation for #382 then?