Releasing Arab Spring Twitter dataset
February 16, 2012
A few days ago Deen Freelon released what he legally could of his Arab Spring Twitter dataset. It’s a substantial dataset, spanning January to March and covering a number of significant Arab Spring keywords. Unfortunately, Twitter’s TOS doesn’t allow these data to be publicly distributed in full. However, Deen was able to release the status and user IDs of these tweets.
In a similar gesture, I’m going to release the status and user IDs of the dataset I collected from January 25, 2011 to March 1, 2011. This collection centered on Egypt and focused on Egyptian hashtags, but became a larger Arab Spring dataset as I progressively added keywords. For comparison, I checked the status IDs in my dataset against Deen’s and found significant overlap for a number of countries (most obviously Egypt) but not much for others. I’ve bolded the ones with over 50% overlap.
UPDATE (2012-02-28): There was a small percentage of duplicates in the dataset, so I’ve updated the numbers with the de-duped totals.
53792 of 85169 matched, 63.159131%
9060 of 361579 matched, 2.505676%
1674707 of 2339787 matched, 71.575190%
36355 of 48024 matched, 75.701732%
61293 of 885846 matched, 6.919148%
601008 of 665167 matched, 90.354452%
205567 of 2679617 matched, 7.671507%
10607 of 84458 matched, 12.558905%
20116 of 78823 matched, 25.520470%
111470 of 475078 matched, 23.463515%
I had one main collection going, so my tweets weren’t broken down into separate datasets. I may post the numbers on which hashtags were most used in the future.
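For anyone who wants to run the same kind of comparison, here’s a minimal sketch in Python of how the overlap figures above could be computed. It isn’t the script I actually used. The function names and the input format (one tweet per line, status ID first, then user ID, separated by whitespace) are assumptions for illustration. I’m also assuming the denominator is the size of the dataset being compared against, with duplicates dropped before counting.

```python
def load_status_ids(lines):
    """Collect unique status IDs from lines of 'status_id user_id'.

    The line format is an assumption; adjust the parsing to your files.
    Using a set drops duplicate status IDs automatically.
    """
    ids = set()
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            ids.add(int(fields[0]))
    return ids


def overlap_report(theirs, mine):
    """Report how many of their status IDs also appear in mine."""
    matched = len(theirs & mine)  # set intersection
    pct = 100.0 * matched / len(theirs) if theirs else 0.0
    return "%d of %d matched, %f%%" % (matched, len(theirs), pct)


# Tiny made-up example; real input would come from open(...) on each file.
theirs = load_status_ids(["1 100", "2 101", "3 102", "4 103"])
mine = load_status_ids(["2 101", "3 102", "3 102", "9 200"])
print(overlap_report(theirs, mine))
```

Holding the IDs in sets keeps the comparison simple, and a few million integers should fit in memory on most machines.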
For collection I used some hand-rolled Perl scripts that connected to Twitter’s Streaming API. However, in my current collections I’ve been using tweepy, which works reasonably well and needs very few packages installed. There’s a small patch that has yet to be incorporated into the trunk, so if you’re going to use this, make sure you make the change yourself.
Anyhow, without further ado, here’s the list of 12,264,248 status IDs and their corresponding user IDs.