blog.dbrgn.ch

Down the Tweet Chain Rabbit Hole

written on Wednesday, February 12, 2014 by

Today I stumbled over an apparently interesting tweet by @aendu:

The link in the tweet points to another tweet:

And, as you may have guessed, this link points to yet another tweet. About 20 levels deep, I stopped clicking on links and wrote a Python script instead:

# -*- coding: utf-8 -*-
"""
Tracing the twitter chain, down the rabbit hole.

Dependencies:

  - requests
  - beautifulsoup4

"""
from __future__ import print_function, division, absolute_import, unicode_literals

import re
from datetime import datetime

import requests
from bs4 import BeautifulSoup


START_URL = 'https://twitter.com/aendu/status/433586683615784960'


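def inception(url):
    """Print the tweet at `url` and return the URL of the tweet it links to.

    Returns None when the tweet is gone or links to no other tweet.
    """
    # Request tweet page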
    r = requests.get(url)
    if r.status_code == 404:
        print('TWEET DELETED, CHAIN BROKEN :(')
        return
    # Name the parser explicitly so results don't depend on what is installed
    soup = BeautifulSoup(r.text, 'html.parser')
    tweet = soup.select('div.tweet.permalink-tweet')[0]

    # Parse out & print tweet info
    text = tweet.find('p', class_='tweet-text').text
    user = tweet.get('data-screen-name')
    timestamp = tweet.find('span', class_='js-relative-timestamp').get('data-time')
    dt = datetime.fromtimestamp(int(timestamp))
    print('{0} @{1}: {2}'.format(dt.isoformat().replace('T', ' '), user, text))

    # And we need to go deeper!
    links = tweet.find('p', class_='tweet-text').find_all('a')
    for link in links:
        url = link.get('data-expanded-url')
        if not url:
            continue
        # Dots are escaped so that "www." and "twitter.com" match literally
        if re.match(r'^https?://(www\.)?twitter\.com/.*status', url):
            return url


if __name__ == '__main__':
    url = START_URL
    while url:
        url = inception(url)

The code is on Gist, feel free to fork it! Here's the result: https://gist.github.com/dbrgn/8956214

The first thing you'll notice is that the chain is broken after 57 tweets, because @charlescwcooke deleted his tweet in the chain. Too bad.

There are also some other things we can see from this data, for example the time distribution:

Time distribution scatter plot

You can clearly see that the tweets were sent in "bursts" throughout the day.
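
One way to make "bursts" concrete is to sort the timestamps and start a new group whenever the gap to the previous tweet exceeds some threshold. The sketch below is not part of the original script: the `group_bursts` name, the 30-minute threshold and the sample timestamps are all assumptions made up for illustration.

```python
from datetime import datetime, timedelta


def group_bursts(timestamps, max_gap=timedelta(minutes=30)):
    """Split datetimes into bursts separated by gaps longer than max_gap."""
    bursts = []
    for ts in sorted(timestamps):
        if bursts and ts - bursts[-1][-1] <= max_gap:
            # Close enough to the previous tweet: same burst
            bursts[-1].append(ts)
        else:
            # First tweet, or the gap was too long: start a new burst
            bursts.append([ts])
    return bursts


# Made-up sample data, not the real tweet times
sample = [
    datetime(2014, 2, 12, 9, 0),
    datetime(2014, 2, 12, 9, 5),
    datetime(2014, 2, 12, 9, 20),
    datetime(2014, 2, 12, 14, 0),
    datetime(2014, 2, 12, 14, 10),
]

for burst in group_bursts(sample):
    # Print the start time of each burst and how many tweets it contains
    print(burst[0].isoformat(' '), len(burst))
```

The threshold decides what counts as a burst, so it would have to be tuned against the real data.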

Unfortunately, following a single branch of the chain is not too interesting. A much more interesting analysis would be to see how many branches there are, and which one is the longest. This could be done in two ways. If you're very rich, you can buy access to the Twitter Firehose and analyze all the tweets sent over a few days.

The other possibility is some kind of backtracking. I didn't use the Twitter API because I was too lazy to register a new app, and resorted to HTML scraping instead. But with the API, one could first follow the chain down to the last working tweet, and then use the search API to find tweets containing a URL to that tweet. Repeating that search for every tweet found would yield a dataset containing all branches, ready to be analyzed.

But I leave that to somebody else :)
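
For whoever picks this up, the backtracking idea could be sketched roughly as follows. This is only an outline under stated assumptions: `find_linking_tweets` is a hypothetical callable standing in for a search API wrapper (no real Twitter API call is shown), and `build_tree`, `longest_branch` and the `tweet/N` identifiers are made-up names for illustration.

```python
from collections import deque


def build_tree(root, find_linking_tweets):
    """Breadth-first backtracking from the last working tweet.

    find_linking_tweets is a hypothetical callable that takes a tweet URL
    and returns the URLs of tweets containing a link to it.
    Returns a dict mapping each tweet URL to the tweets that link to it.
    """
    children = {}
    queue = deque([root])
    while queue:
        url = queue.popleft()
        if url in children:
            continue  # already expanded
        children[url] = list(find_linking_tweets(url))
        queue.extend(children[url])
    return children


def longest_branch(children, root):
    """Return the longest path from root to a leaf, as a list of URLs.

    Assumes there are no cycles, since a tweet can only link to older tweets.
    """
    best = [root]
    stack = [(root, [root])]
    while stack:
        url, path = stack.pop()
        if len(path) > len(best):
            best = path
        for child in children.get(url, []):
            stack.append((child, path + [child]))
    return best


# Fake link data in place of real search results
fake_links = {
    'tweet/1': ['tweet/2', 'tweet/3'],
    'tweet/2': ['tweet/4'],
    'tweet/4': ['tweet/5'],
}
tree = build_tree('tweet/1', lambda url: fake_links.get(url, []))
print(len(tree), 'tweets found')
print(' <- '.join(longest_branch(tree, 'tweet/1')))
```

Passing the search function in as a parameter keeps the tree logic testable without network access; a real implementation would also have to deal with rate limits and deleted tweets.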

This entry was tagged memes, python, social_media and twitter