Today I stumbled over an apparently interesting tweet by @aendu:
Tweet of aendu/433586683615784960
The link in the tweet points to another tweet:
Tweet of madmenna/433585834193715200
And, as you may have guessed, this link points to yet another tweet. About 20 levels deep, I stopped clicking on the links and wrote a Python script instead:
# -*- coding: utf-8 -*-
"""
Tracing the twitter chain, down the rabbit hole.

Dependencies:

- requests
- beautifulsoup4

"""
from __future__ import print_function, division, absolute_import, unicode_literals

import re
from datetime import datetime

import requests
from bs4 import BeautifulSoup

START_URL = 'https://twitter.com/aendu/status/433586683615784960'


def inception(url):
    """Print the tweet at `url` and return the next tweet URL in the chain, if any."""
    # Request tweet page
    r = requests.get(url)
    if r.status_code == 404:
        print('TWEET DELETED, CHAIN BROKEN :(')
        return None
    # Name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(r.text, 'html.parser')
    tweet = soup.select('div.tweet.permalink-tweet')[0]

    # Parse out & print tweet info
    text = tweet.find('p', class_='tweet-text').text
    user = tweet.get('data-screen-name')
    timestamp = tweet.find('span', class_='js-relative-timestamp').get('data-time')
    dt = datetime.fromtimestamp(int(timestamp))
    print('{0} @{1}: {2}'.format(dt.isoformat().replace('T', ' '), user, text))

    # And we need to go deeper!
    links = tweet.find('p', class_='tweet-text').find_all('a')
    for link in links:
        url = link.get('data-expanded-url')
        if not url:
            continue
        # Dots are escaped so they only match literal dots
        if re.match(r'^https?://(www\.)?twitter\.com/.*status.*$', url):
            return url
    # No link to another tweet found: end of the chain
    return None


if __name__ == '__main__':
    url = START_URL
    while url:
        url = inception(url)
The code is on Gist, feel free to fork it! Here's the result: https://gist.github.com/dbrgn/8956214
The first thing you'll notice is that the chain is broken after 57 tweets, because @charlescwcooke deleted his tweet in the chain. Too bad.
There are also some other things we can see from this data, for example the time distribution:
You can clearly see that the tweets were sent in "bursts" throughout the day.
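If you want to reproduce the time distribution from the script output, here is a minimal sketch of how it could be done. It assumes the output lines have the `YYYY-MM-DD HH:MM:SS @user: text` format printed by the script above. The helper names `tweets_per_hour` and `print_histogram` are made up for this example, and the sample lines are invented, not the real chain data.

```python
from __future__ import print_function

from collections import Counter
from datetime import datetime


def tweets_per_hour(lines):
    """Count tweets per hour from lines like '2014-02-12 14:03:11 @user: text'."""
    counts = Counter()
    for line in lines:
        # The first 19 characters hold the timestamp printed by the script.
        try:
            dt = datetime.strptime(line[:19], '%Y-%m-%d %H:%M:%S')
        except ValueError:
            # Skip lines that are not tweets, e.g. the "chain broken" message.
            continue
        counts[dt.strftime('%Y-%m-%d %H:00')] += 1
    return counts


def print_histogram(counts):
    """Print one text bar per hour, in chronological order."""
    for hour in sorted(counts):
        print('{0} {1}'.format(hour, '#' * counts[hour]))


if __name__ == '__main__':
    # Invented sample lines, only here to show the expected input format.
    sample = [
        '2014-02-12 14:03:11 @alice: look at this',
        '2014-02-12 14:59:00 @bob: and this',
        '2014-02-12 16:00:01 @carol: deeper',
        'TWEET DELETED, CHAIN BROKEN :(',
    ]
    print_histogram(tweets_per_hour(sample))
```

Hours without any tweets are simply missing from the output, so the "bursts" show up as clusters of bars with gaps in between.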
Unfortunately, following just one branch of the chain is not too interesting. A much more interesting analysis would be to see how many branches there are, and which one is the longest. This could be done in two ways. If you're very rich, you can buy access to the Twitter Firehose and analyze all the tweets sent over a few days. The other possibility is some kind of backtracking.

I didn't use the Twitter API because I was too lazy to register a new app, and resorted to HTML scraping instead. But with the API, one could first follow the chain down to the last working tweet, and then use the search API to find tweets containing a URL to that tweet. Repeating that search for every tweet found would build a dataset containing all branches, which could then be analyzed.
But I leave that to somebody else :)
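For whoever picks this up: once the backtracking data exists, the analysis itself is fairly short. The sketch below assumes the search results have already been collected into a dict that maps each tweet to the tweets linking to it. That dict, the name `analyze_branches`, and the tweet IDs in the example are all hypothetical. No Twitter API calls are shown here.

```python
from __future__ import print_function


def analyze_branches(linked_by, root):
    """Walk backwards from `root` through the tweets that link to it.

    `linked_by` maps a tweet ID to a list of tweet IDs that link to it.
    Returns (number of branches, longest branch from its start to root).
    """
    branches = 0
    longest = []
    seen = set()
    stack = [(root, [root])]
    while stack:
        node, path = stack.pop()
        if node in seen:
            # Guard against cycles and tweets reached twice.
            continue
        seen.add(node)
        children = [c for c in linked_by.get(node, []) if c not in seen]
        if not children:
            # Nobody links to this tweet: it is the start of a branch.
            branches += 1
            if len(path) > len(longest):
                longest = path
        for child in children:
            stack.append((child, path + [child]))
    return branches, list(reversed(longest))


if __name__ == '__main__':
    # Hypothetical data: 'end' is the last working tweet in the chain.
    linked_by = {
        'end': ['a', 'b'],
        'a': ['c'],
        'c': ['d'],
    }
    count, longest = analyze_branches(linked_by, 'end')
    print('{0} branches, longest has {1} tweets'.format(count, len(longest)))
```

The walk uses an explicit stack instead of recursion, so a very long chain would not run into Python's recursion limit.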