2012-08-09

Avoiding re-parsing of unchanged feeds

I've recently been thinking of an idea to speed up feed parsing in gPodder in cases where the feed contents don't change (i.e. no new or changed episodes), but the server doesn't support E-Tag or If-Modified-Since headers.

Right now, whenever gPodder updates a podcast feed, it passes the stored E-Tag and Last-Modified values to feedparser, which uses them to make a conditional HTTP request. If the server supports this, the whole feed doesn't have to be downloaded or parsed: the server simply tells feedparser "Nothing new", which in turn finishes the feed update for gPodder - avoiding both the download of the full feed data and the parsing inside feedparser and gPodder.
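
For illustration, here's roughly what that conditional request looks like with feedparser's documented "etag" and "modified" parameters (the function name and return values are just a sketch, not the actual gPodder code):

    import feedparser

    def update_feed(url, etag=None, modified=None):
        # Pass the values remembered from the last update; feedparser turns
        # them into If-None-Match / If-Modified-Since request headers.
        d = feedparser.parse(url, etag=etag, modified=modified)

        if getattr(d, 'status', None) == 304:
            # Server says "Nothing new" - no feed body was transferred,
            # and there is nothing to parse.
            return None, etag, modified

        # Remember the new validators (if the server sent any) for next time
        return d, getattr(d, 'etag', None), getattr(d, 'modified', None)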

Now, the idea here is that on mobile devices like the N9, we want to avoid CPU usage as much as possible (in addition to keeping the traffic down where we can). One idea would be to calculate a hash over the downloaded feed content and remember that hash. In cases where the server doesn't support E-Tag or If-Modified-Since, this could effectively avoid having to parse the feed again. Of course, a hash of the feed contents can only be calculated after the content has been downloaded, so this won't avoid re-downloads, but I guess it could still improve performance on devices where parsing takes up a non-trivial amount of time.
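
A minimal sketch of that check (SHA-1 is just an example choice here, and storing the hash per podcast is left out):

    import hashlib

    def feed_changed(feed_data, stored_hash):
        """Check the raw, already-downloaded feed bytes against the last hash."""
        new_hash = hashlib.sha1(feed_data).hexdigest()
        # If the hash matches the one stored after the last successful parse,
        # the content is byte-for-byte identical and re-parsing can be skipped.
        return new_hash != stored_hash, new_hash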

This needs some experimentation to see if it is effective in the general case. One idea would be to check the gpodder.net Top 100 podcasts list and see how many of them support E-Tag and If-Modified-Since, and how many of them would benefit from having this special "content hash" in place to avoid re-parsing.
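
Such an experiment could be as simple as the following sketch; the list of feed URLs would have to come from the gpodder.net Top 100 list (fetching that list is left out here), and we only look at the response headers:

    import urllib.request

    def check_cache_support(feed_urls):
        """Count how many feeds send ETag and/or Last-Modified response headers."""
        stats = {'etag': 0, 'last-modified': 0, 'neither': 0}
        for url in feed_urls:
            try:
                with urllib.request.urlopen(url, timeout=30) as response:
                    headers = response.headers
            except OSError:
                continue
            has_etag = headers.get('ETag') is not None
            has_last_modified = headers.get('Last-Modified') is not None
            if has_etag:
                stats['etag'] += 1
            if has_last_modified:
                stats['last-modified'] += 1
            if not (has_etag or has_last_modified):
                stats['neither'] += 1
        return stats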

The issue has been filed here; let's see if it's something that could go upstream: feedparser issue 370

Update: Patch added for gPodder and feedparser, both patches can be found in gPodder bug 1634.

8 comments:

  1. Slow parsing has certainly been bugging me about gPodder on the N9. I always cringe when I accidentally trigger an update. It doesn't just take a long time (for a pretty modest subscription count of 16 feeds), it also makes the entire application unresponsive.
    If it were easy, I would suggest replacing feedparser with a native QXmlStreamReader-based one, possibly the one from Amarok [1]. It does incremental parsing while the feed is still being downloaded and can stop early if the first item is already known. I'm very happy with its performance, and it doesn't even try to do HTTP header tricks.

    [1] http://quickgit.kde.org/index.php?p=amarok.git&a=blob&h=3f3d5a4bfeb066dda220886e9fce3300af0d6775&hb=c4eb51e0581f8fe354958ab840d3c4c69c02096f&f=src%2Fcore%2Fpodcasts%2FPodcastReader.cpp

    Replies
    1. I like the idea of being able to stop early if the first item is already known (although it would break for feeds where the episodes are listed in the wrong order - what do you do about that?).

      The HTTP header tricks do allow for a big speed improvement for servers that support them (a conditional GET with only headers as the response vs. a full feed download plus parsing).

    2. Also, I've just checked, and feedparser doesn't do incremental parsing - it first downloads the whole content and then starts parsing. So maybe adding incremental parsing to feedparser could solve some of the problems.

    3. Incorrectly organized RSS feeds (chronological order when it should be latest on top) can be detected by parsing at least 2 items and comparing their pubDates.
      So for decent feeds, bailing out early can happen at the earliest after those 2 items. A small price to pay (a rough sketch of this idea follows below).

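To make the incremental parsing idea from the replies above a bit more concrete, here is a rough sketch using Python's ElementTree instead of QXmlStreamReader (plain RSS element names and a guid-based "already known" check are simplifications; Atom and namespaced feeds would need more work):

    import xml.etree.ElementTree as ET

    def parse_until_known(feed_file, known_guids):
        """Stream-parse an RSS file and stop at the first already-known item.

        Assumes a "latest first" feed; comparing the pubDates of the first
        two items (as suggested above) would detect the other ordering.
        """
        new_items = []
        for _event, elem in ET.iterparse(feed_file, events=('end',)):
            if elem.tag != 'item':
                continue
            guid = elem.findtext('guid') or elem.findtext('link')
            if guid in known_guids:
                break  # everything after this point should already be known
            new_items.append({
                'title': elem.findtext('title'),
                'guid': guid,
                'pubdate': elem.findtext('pubDate'),
            })
            elem.clear()  # keep memory usage low while streaming
        return new_items
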
  2. And what about the idea to centralize the feed parsing and create a slim API for that centralized service? I think with gpodder.net you already provide such a feed parsing service?

    We already talked about that idea and it would be possible to add some special features to the server part.

    Replies
    1. I don't really want the clients to depend on the web service. Especially since this might put too much load on the web service itself at this point.

      Would be easier if we had more web servers or if that was something like a subscriber-only feature where the demand is manageable and we can slowly scale along.

      We do have the feed service, and it's used by the gpodder.net backend. You could theoretically write a different feedcore.py module to use the web service instead of the feedparser module (although right now, model.py also has some assumptions about the format of the returned data, so it might need some additional massaging to get it right).

  3. Calculating hashes of the HTTP response would fail for feeds that use dynamic URLs for media files (I think Google's Feedproxy does that). I don't think other types of dynamic content are very common, though.

    Replies
    1. That will only give us false positives (i.e. we'd have to update the feed even though it didn't really change), but will not give us false negatives (i.e. we don't update the feed, but the content has changed). So it's okay, and we can't (shouldn't?) really deal with the dynamic URLs - it might be that they need to be updated anyway for the download link to be valid after a certain period of time.


Comments are moderated. Not all comments will be published. Feel free to post replies on your own blog if your comment is not published here.