Federation: routine to clear old activities

Maybe we don't want to save any of them. In that case, would we ever need to keep an activity that has already been processed?

It's necessary for compliance with the ActivityPub and Linked Data Notifications specifications, for third-party applications

It's also necessary for my design for #249 (closed)

It can be used in providing "undo activity" functionality

#249 (closed) can save un-processed ones only

Still, that's not a solution then.

Does that mean anyone using AP will keep everything for years? That's really a lot of data.

We could make this a server setting so that it's up to the server admin, with a default of something like 6 months, which seems sensible?
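To make the retention idea concrete, here is a minimal sketch of the cutoff logic. The setting name `ACTIVITY_MAX_AGE` and the plain-dict activities are assumptions for illustration only; the thread does not name a setting, and "6 months" is approximated as 182 days.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical setting name and default; the thread only suggests
# "something like 6 months", approximated here as 182 days.
ACTIVITY_MAX_AGE = timedelta(days=182)


def split_expired(activities, now=None, max_age=ACTIVITY_MAX_AGE):
    """Split activities into (kept, expired) based on their created_at value."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - max_age
    kept, expired = [], []
    for activity in activities:
        # Anything created strictly before the cutoff is considered expired
        if activity["created_at"] < cutoff:
            expired.append(activity)
        else:
            kept.append(activity)
    return kept, expired
```

With the Django ORM the same cutoff could probably be applied as a `created_at__lt=cutoff` filter followed by `.delete()`, run from a periodic job, but that depends on how the activity model is defined.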

Implementing #249 (closed) would reduce the number of activities being sent as well

Do you know if there is any ActivityPub guidance on the lifetime of an activity?

I found this Git issue which discusses the topic

TL;DR:

Transient activities do not need to be stored indefinitely. The examples in the spec are (intentionally) loose, which implies that it's the choice of the application developer; the examples given are chat messages and game notifications
The conclusion was: "Clarify that it's up to servers if they want to keep around objects as long as they want. If they want to delete objects, like maybe delete a bunch of game notifications, that's a-ok"
Hubzilla and others are using expiry keys on activities

In our case, at the moment, I think that the only activities which we're storing and which continue to be useful after they are sent are failed activities, used in debugging? @jbpasquier

We could make the default be to not store an activity at all, unless a server setting STORE_ACTIVITIES = 'all' is present or something? This could also be a setting on the Model meta but we might want to make this TODO until someone tells us they need it

The Linked Data Notifications spec doesn't seem to have this same flexibility (I've not researched it in as much detail). It might be a good time to clarify that we're following ActivityStreams/ActivityPub and not Linked Data Notifications, except where the two intersect?

In our case, at the moment, I think that the only activities which we're storing and which continue to be useful after they are sent are failed activities, used in debugging?

We're storing everything. Community's database currently holds 103,495 activities. :-)

Only the failed ones, targeted at Prosody, are useful. As for the others, I usually don't even know what their purpose is.

We could make the default be to not store an activity at all, unless a server setting STORE_ACTIVITIES = 'all' is present or something? This could also be a setting on the Model meta but we might want to make this TODO until someone tells us they need it

What about an approach more-or-less like the logger of Django? With a top layer array that would accept some filter(s) to keep activities?

Like, say:

STORE_ACTIVITIES = [
  {
    "external_id": "https://jabber.happy-dev.fr/community.startinblox.com/happydev_user_admin"
  },
  {
    "object": "https://api.community.startinblox.com/users/john/",
    "type": "Update"
  }
]

This would keep activities which have the external_id specified or activities which have object john + type update.

I guess that STORE_ACTIVITIES = "*" or STORE_ACTIVITIES = ["*"] would also work.
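A rough sketch of how such a filter list could be evaluated, assuming each activity is available as a flat dict of fields (the function name `matches_store_rules` is made up for this example, and treating an empty setting as "store nothing" is my assumption):

```python
def matches_store_rules(activity, rules):
    """Return True if the activity should be kept according to STORE_ACTIVITIES.

    A rule (dict) matches when every key in it equals the activity's value
    for that key. "*" as the whole setting, or as an entry in the list,
    keeps everything. An empty or missing setting keeps nothing (assumption).
    """
    if not rules:
        return False
    if rules == "*":
        return True
    for rule in rules:
        if rule == "*":
            return True
        # Every key/value pair in the rule must match the activity
        if all(activity.get(key) == value for key, value in rule.items()):
            return True
    return False
```

Note that this only compares flat values. In ActivityStreams the `object` can be a nested object rather than a URL, so a real implementation would need to decide how to compare those.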

Only the failed ones, targeted at Prosody, are useful. As for the others, I usually don't even know what their purpose is.

Their purpose was to follow the spec, really. Very occasionally successful ones are useful in debugging, usually because there's another one which failed

What about an approach more-or-less like the logger of Django? With a top layer array that would accept some filter(s) to keep activities?

To be honest I think that for now this might be a sledgehammer to crack a nut. I think that we could make do with this?

STORE_ACTIVITIES = "verbose" | "error" | None

Old activities are useful in debugging. @jbpasquier can we make the logger standard on the production servers? I could log their success there instead, and when reproducing an activity we can set

STORE_ACTIVITIES = 'verbose'

temporarily
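The three-level setting could be interpreted roughly like this. This is only a sketch: the helper name `should_store_activity` and the boolean `success` flag are assumptions, not existing DjangoLDP code.

```python
def should_store_activity(success, mode):
    """Decide whether to persist an activity for a given STORE_ACTIVITIES mode.

    "verbose" stores everything, "error" stores only failed activities,
    and None stores nothing.
    """
    if mode == "verbose":
        return True
    if mode == "error":
        return not success
    if mode is None:
        return False
    # Fail loudly on typos in the setting rather than silently dropping data
    raise ValueError(f"Unknown STORE_ACTIVITIES value: {mode!r}")
```

Raising on unknown values is a design choice for the sketch; falling back to a default would work too.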

can we make the logger standard on the production servers?

What do you mean by standard?

I think that the logger.debug('...') doesn't go anywhere at the moment?

We just need to store them in a file in the server logs, and set the appropriate settings, something like

LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'handlers': {
        'file': {
            'level': 'DEBUG',
            'class': 'logging.FileHandler',
            'filename': os.path.join(os.path.dirname(BASE_DIR), 'admin', 'logs', 'sib_app', 'sib_app.log'),
        },
    },
    'loggers': {
        'djangoldp': {
            'handlers': ['file'],
            'level': 'DEBUG',
            'propagate': False,
        },
        '': {  # root logger (catch-all); Python logging has no '*' wildcard name
            'handlers': ['file'],
            'level': 'DEBUG',
            'propagate': True,
        },
    },
}

Oh, right, Sentry is the actual place, I think, but not for debug messages. I'm not sure that we want to keep every debug message in a file on production? Maybe we can add some settings like yours, but commented out, to make debugging easier?

@plup Any position on this?

For debugging when we're reproducing something I think we could just change the setting to

STORE_ACTIVITIES = 'verbose'

with this I had in mind that a record of historical activities might be useful, and somewhere in the admin logs would make sense - as with incoming HTTP logs

in other cases running more verbose logs would be useful as well though

incoming HTTP logs

We already have them, from the Apache proxy log by Alwaysdata

Maybe some kind of naive logrotate would avoid filling the disk with useless logs while still keeping a few days' worth. @plup Could we drop debug logs into a predefined file on each server and make sure we keep only 7 days' worth of logs, or something like that?
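If a system-level logrotate isn't available, one option is to let Python's logging do the rotation itself. The sketch below uses the standard library `TimedRotatingFileHandler`, rotating at midnight and keeping 7 old files. The temporary `LOG_DIR`, the handler name `debug_file` and the file name `debug.log` are placeholders for illustration, not our real paths.

```python
import logging
import logging.config
import os
import tempfile

# Placeholder directory for the example; on a server this would be the
# real log directory.
LOG_DIR = tempfile.mkdtemp()

LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        "debug_file": {
            "level": "DEBUG",
            "class": "logging.handlers.TimedRotatingFileHandler",
            "filename": os.path.join(LOG_DIR, "debug.log"),
            "when": "midnight",  # start a new file each day
            "backupCount": 7,    # keep 7 rotated files, delete older ones
        },
    },
    "loggers": {
        "djangoldp": {
            "handlers": ["debug_file"],
            "level": "DEBUG",
            "propagate": False,
        },
    },
}

# In a Django project the LOGGING dict in settings is applied automatically;
# dictConfig is called here only to make the example standalone.
logging.config.dictConfig(LOGGING)
logging.getLogger("djangoldp").debug("activity sent")
```

This keeps the current file plus seven rotated ones, so roughly a week of debug logs.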

I can do something with the logging options. I'm not sure I have enough freedom to implement that, though.

changed time estimate to 2h

@balessan @jbpasquier I mentioned in the stand-up that this is likely a big reason why the performance of activities is worse in production than in testing

Community's database currently holds 103,495 activities. :-)

We do some database lookups to check that an activity is definitely valid before sending it, e.g.

def get_most_recent_sent_activity(source_obj, source_target_origin):
    # NOTE: `url` is not defined in this excerpt; presumably it comes from the enclosing scope
    # get the 10 most recent finished activities with the right type
    activities = ActivityModel.objects.filter(external_id=url, is_finished=True,
                                              type__in=['add', 'remove']).order_by('-created_at')[:10]

    # we are searching for the most recent Add/Remove activity which shares inbox, object and target/origin
    for a in activities:
        astream = a.to_activitystream()
        obj = astream.get('object', None)
        target_origin = astream.get('target', astream.get('origin', None))
        # skip activities which are missing the fields we compare on
        if obj is None or target_origin is None:
            continue

        if source_obj == obj and source_target_origin == target_origin:
            return a
    return None

As you can see, we limit the comparison to 10, but this is still a SELECT ... ORDER BY query over roughly 103,500 rows

I realised that this wasn't in the scope of #362 (closed). If there's funding for it we could try it first and see how it affects reactivity. Obviously in any case we get the guaranteed bonus of freeing storage on the production databases

Adding two fields for target & object would allow filtering directly on the queryset, and so avoid the loop?
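To illustrate the idea outside of the ORM, here is a sketch using the standard library `sqlite3` module instead of Django. The column names `object_id` and `target_id`, the table name and the index are all hypothetical; the point is only that, with dedicated columns, the comparison moves into the WHERE clause and the query can use LIMIT 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE activity (
        id INTEGER PRIMARY KEY,
        external_id TEXT,
        type TEXT,
        is_finished INTEGER,
        object_id TEXT,   -- hypothetical new column
        target_id TEXT,   -- hypothetical new column (target or origin)
        created_at TEXT   -- ISO 8601 strings sort chronologically
    )
    """
)
# A composite index lets the database find matching rows without a full scan
conn.execute(
    "CREATE INDEX activity_lookup "
    "ON activity (external_id, object_id, target_id, created_at)"
)


def get_most_recent_sent_activity(conn, external_id, object_id, target_id):
    """Return the id of the newest finished Add/Remove activity, or None."""
    row = conn.execute(
        """
        SELECT id FROM activity
        WHERE external_id = ? AND is_finished = 1
          AND type IN ('add', 'remove')
          AND object_id = ? AND target_id = ?
        ORDER BY created_at DESC
        LIMIT 1
        """,
        (external_id, object_id, target_id),
    ).fetchone()
    return row[0] if row else None
```

One behavioural difference from the current function: this searches all matching activities, whereas the loop only inspects the 10 most recent ones for the inbox.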

Yeah nice 👍 I had made a note of it a while back in another issue (#285)

With #332 I've started moving the ActivityPub stuff into a different repository though, so DjangoLDP-side I think extending the models with these kinds of optimisations would be wise

Different repository? Why not an included package, like djangoldp_crypto? I can't think of any Startin'blox customer who won't need this package.

But there are people who might want to use Django + ActivityPub (without Startin'Blox), and might contribute to that for us

Does our ActivityPub system work without DjangoLDP? No need for urlid and co.?

I'm moving everything I can into the new repository, designing it so that DjangoLDP can inject its behaviour and with something like the (suggested) Rest Framework LDP refactor in mind, whilst minimising scope creep to avoid spending the budget 😅 So far so good, I haven't run into any major issues. I should be able to push some code & documentation soon™

I've done a little code duplication for things like the urlid field and allowing rdf_type on the Model Meta. One day I think these should belong to a Rest Framework LDP library which both DjangoLDP and Django-ActivityPub can extend

@balessan also wanted to highlight explicitly (#266 (comment 62513))

The Linked Data Notifications spec doesn't seem to have this same flexibility (I've not researched it in as much detail). It might be a good time to clarify that we're following ActivityStreams/ActivityPub and not Linked Data Notifications, except where the two intersect?

LDN is only a protocol that describes the way senders (applications) can send messages to receivers (servers) and how consumers (applications) can retrieve them.
It does not describe the life-cycle of the content of the notification, but only of the notification by itself, we can notify whatever we want whenever we feel that's a necessity for whichever reason.

OK great, thanks :)

changed time estimate to 3h

@balessan I think that this issue could improve reactivity of the backlinks system (i.e. the renewed interest in infra/prosody-modules#17 and #362 (closed))

See #266 (comment 64088) for why

As you can see, we limit the comparison to 10, but this is still a SELECT ... ORDER BY query over roughly 103,500 rows

It would also save a lot of database space. To recap, the main use of storing successful activities is to debug them in the context of unsuccessful ones or an issue with synchronisation (e.g. when I was debugging an issue in circle creation in djangoldp-circle#2 (closed)), but providing the setting STORE_ACTIVITIES = 'verbose' would cover this well enough, in my opinion

Does someone have 3h budget for it? 🙂

@calummackervoy If you are able to provide a solution to this in 3 hours, feel free to take it.

the main use of storing successful activities is to debug them in the context of unsuccessful ones

This isn't true 🤦 The main use of storing successful activities is to check past activities before sending a new one. For example, we used a check_update_is_new function to prevent an Update activity being fired on every save: it first verifies that the object being sent contains new information compared with the last successful Update
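For anyone unfamiliar with that check, the idea is roughly the following. This is a simplified sketch and not the actual check_update_is_new implementation; the function name, its arguments and the plain equality comparison are all assumptions.

```python
def update_has_new_information(new_object, last_successful_update):
    """Return True if an Update activity should be sent for new_object.

    last_successful_update is the previously sent Update activity as a dict,
    or None if no Update has been sent for this object yet.
    """
    if last_successful_update is None:
        # Nothing was sent before, so any Update is new
        return True
    # Only send when the serialised object differs from the one last sent
    return new_object != last_successful_update.get("object")
```

This is why the stored activities are needed at send time: without the last successful Update there is nothing to compare against.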

This is still a bottleneck, though. I have some ideas for how we might refactor away the database access here, and then we can continue with the same plan, but it would be good to discuss them next week

assigned to @calummackervoy

added 35m of time spent at 2021-08-25

added 2h 15m of time spent

added 10m of time spent

mentioned in merge request !236 (merged)

closed with merge request !236 (merged)

mentioned in commit dfa3299a
