2012-08-09

Avoiding re-parsing of unchanged feeds

I've recently been thinking of an idea to speed up feed parsing in gPodder in cases where the feed contents don't change (i.e. no new or changed episodes), but the server doesn't support E-Tag or If-Modified-Since headers.

Right now, whenever gPodder updates a podcast feed, it passes the stored E-Tag and Last-Modified values to feedparser, which uses them to make a conditional HTTP request. If the server supports this, the whole feed doesn't have to be downloaded or parsed: the server simply tells feedparser "Nothing new", which in turn finishes the feed update for gPodder - avoiding both the download of the full feed data and the parsing inside feedparser and gPodder.
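
For illustration, here's roughly what that conditional request looks like with feedparser's documented "etag" and "modified" parameters (the function name and return values are just a sketch, not the actual gPodder code):

    import feedparser

    def update_feed(url, etag=None, modified=None):
        # Pass the values remembered from the last update; feedparser turns
        # them into If-None-Match / If-Modified-Since request headers.
        d = feedparser.parse(url, etag=etag, modified=modified)

        if getattr(d, 'status', None) == 304:
            # Server says "Nothing new" - no feed body was transferred,
            # and there is nothing to parse.
            return None, etag, modified

        # Remember the new validators (if the server sent any) for next time
        return d, getattr(d, 'etag', None), getattr(d, 'modified', None)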

Now, the idea here is that on mobile devices like the N9, we want to avoid CPU usage as much as possible (in addition to keeping the traffic down where we can). One idea would be to calculate a hash over the downloaded feed content and remember that hash. In cases where the server doesn't support E-Tag or If-Modified-Since, this could effectively avoid having to parse the feed again. Of course, a hash of the feed contents can only be calculated after the content has been downloaded, so this won't avoid re-downloads, but I guess it could still improve performance on devices where parsing takes up a non-trivial amount of time.
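
A minimal sketch of that check (SHA-1 is just an example choice here, and storing the hash per podcast is left out):

    import hashlib

    def feed_changed(feed_data, stored_hash):
        """Check the raw, already-downloaded feed bytes against the last hash."""
        new_hash = hashlib.sha1(feed_data).hexdigest()
        # If the hash matches the one stored after the last successful parse,
        # the content is byte-for-byte identical and re-parsing can be skipped.
        return new_hash != stored_hash, new_hash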

This needs some experimentation to see if it is effective in the general case. One idea would be to check the gpodder.net Top 100 podcasts list and see how many of them support E-Tag and If-Modified-Since, and how many of them would benefit from having this special "content hash" in place to avoid re-parsing.
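
Such an experiment could be as simple as the following sketch; the list of feed URLs would have to come from the gpodder.net Top 100 list (fetching that list is left out here), and we only look at the response headers:

    import urllib.request

    def check_cache_support(feed_urls):
        """Count how many feeds send ETag and/or Last-Modified response headers."""
        stats = {'etag': 0, 'last-modified': 0, 'neither': 0}
        for url in feed_urls:
            try:
                with urllib.request.urlopen(url, timeout=30) as response:
                    headers = response.headers
            except OSError:
                continue
            has_etag = headers.get('ETag') is not None
            has_last_modified = headers.get('Last-Modified') is not None
            if has_etag:
                stats['etag'] += 1
            if has_last_modified:
                stats['last-modified'] += 1
            if not (has_etag or has_last_modified):
                stats['neither'] += 1
        return stats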

The issue has been filed here; let's see if it's something that could go upstream: feedparser issue 370

Update: Patch added for gPodder and feedparser, both patches can be found in gPodder bug 1634.

8 comments:

  1. Slow parsing has certainly been bugging me about gPodder on the N9. I always cringe when I accidentally trigger an update. It doesn't just take a long time (for a pretty modest subscription count of 16 feeds), it also makes the entire application unresponsive.
    If it were easy, I would suggest replacing feedparser with a native QXmlStreamReader-based one, possibly the one from Amarok [1]. It does incremental parsing while the feed is still being downloaded and can stop early if the first item is already known. I'm very happy with its performance, and it doesn't even try to do HTTP header tricks.

    [1] http://quickgit.kde.org/index.php?p=amarok.git&a=blob&h=3f3d5a4bfeb066dda220886e9fce3300af0d6775&hb=c4eb51e0581f8fe354958ab840d3c4c69c02096f&f=src%2Fcore%2Fpodcasts%2FPodcastReader.cpp

    Replies
    1. I like the idea of being able to stop early if the first item is already known (although it would break for feeds where the episodes are listed in the wrong order - what do you do about that?).

      The HTTP header tricks do allow for a big speed improvement for servers that support them (a conditional GET with only headers as the response vs. a full feed download plus parsing).

    2. Also, I've just checked, and feedparser doesn't do incremental parsing - it first downloads the whole content and then starts parsing. So maybe adding incremental parsing to feedparser could solve some of the problems.

    3. Incorrectly organized RSS feeds (chronological order when it should be latest on top) can be detected by parsing at least 2 items and comparing their pubDates.
      So for decent feeds, bailing out early can happen at the earliest after those 2 items. A small price to pay (a rough sketch of this idea follows below).

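To make the incremental parsing idea from the replies above a bit more concrete, here is a rough sketch using Python's ElementTree instead of QXmlStreamReader (plain RSS element names and a guid-based "already known" check are simplifications; Atom and namespaced feeds would need more work):

    import xml.etree.ElementTree as ET

    def parse_until_known(feed_file, known_guids):
        """Stream-parse an RSS file and stop at the first already-known item.

        Assumes a "latest first" feed; comparing the pubDates of the first
        two items (as suggested above) would detect the other ordering.
        """
        new_items = []
        for _event, elem in ET.iterparse(feed_file, events=('end',)):
            if elem.tag != 'item':
                continue
            guid = elem.findtext('guid') or elem.findtext('link')
            if guid in known_guids:
                break  # everything after this point should already be known
            new_items.append({
                'title': elem.findtext('title'),
                'guid': guid,
                'pubdate': elem.findtext('pubDate'),
            })
            elem.clear()  # keep memory usage low while streaming
        return new_items
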
  2. And what about the idea to centralize the feed parsing and create a slim API for that centralized service? I think with gpodder.net you already provide such a feed parsing service?

    We already talked about that idea and it would be possible to add some special features to the server part.

    Replies
    1. I don't really want the clients to depend on the web service. Especially since this might put too much load on the web service itself at this point.

      Would be easier if we had more web servers or if that was something like a subscriber-only feature where the demand is manageable and we can slowly scale along.

      We do have the feed service, and it's used by the gpodder.net backend. You could theoretically write a different feedcore.py module to use the web service instead of the feedparser module (although right now, model.py also has some assumptions about the format of the returned data, so it might need some additional massaging to get it right).

  3. Calculating hashes of the HTTP response would fail for feeds that use dynamic URLs for media files (I think Google's Feedproxy does that). I don't think other types of dynamic content are very common, though.

    Replies
    1. That will only give us false positives (i.e. we'd have to update the feed even though it didn't really change), but will not give us false negatives (i.e. we don't update the feed, but the content has changed). So it's okay, and we can't (shouldn't?) really deal with the dynamic URLs - it might be that they need to be updated anyway for the download link to be valid after a certain period of time.


Comments are moderated. Not all comments will be published. Feel free to post replies on your own blog if your comment is not published here.