Friday, February 20, 2009

Getting to Know Your Followers - Explained

This is a follow-up to yesterday's post, where I presented a Python program that used the Twitter interface to get information about my followers. I experimented with a newer Python feature, generators (strictly speaking, generator expressions), which might not be familiar to everyone and was new to me. In addition, I slipped in some extra changes to the Python Twitter interface so that it can retrieve the extended User information, if you want it. This blog post simply explains those two features. I really love the generators (2 statements in this program, spread over 5 printed lines). Please read on, and you'll see why!

Recall the earlier post where I showed how to get information about your followers:


import csv, math
import twitter as twitterapi
pwd='removed, of course'

api = twitterapi.Api(username='amyiris', password=pwd)
me = api.GetUser('amyiris')
numpages = int(math.ceil(me.followers_count/100.0))
followerpage = (api.GetFollowers(page=x)
                for x in range(1,numpages+1))
myfollower = (y[z]
              for y in followerpage \
              for z in range(len(y)))
csvfile=open('followers6.csv','a')
csvout=csv.writer(csvfile,dialect='excel')


for f in myfollower:
    csvout.writerow([f.screen_name,
                     f.followers_count,
                     f.status,
                     ])
csvfile.close()


Let's walk through the code. Here's the boring setup part that's easy to understand; skip down if you want to see the Generators in action:


import csv, math
import twitter as twitterapi
pwd='removed, of course'

api = twitterapi.Api(username='amyiris', password=pwd)
me = api.GetUser('amyiris')
numpages = int(math.ceil(me.followers_count/100.0))


These first few lines are initialization steps. We're going to use the csv library to create a .csv file (a comma-separated-value file, a simple format suitable for importing into Excel). The library "math" is only needed for the "ceil" function, a function which "rounds everything up" so that 1.01 rounds to 2. And of course, the twitter file is my modified Twitter API file, which I included a link to. (It was pointed out in the comments (oops) that I had modified an older file of Twitter API, thanks John Schneider). Basically, the changes that I made were to facilitate paging through my followers 100 at a time (the version I started from only retrieved the first 100 followers), and to retrieve the followers_count.

The call to twitterapi.Api sets up the authentication to the Twitter service, so that I can make future calls. You'd need to supply your own authentication here (user name and password).

The next line retrieves information about me - the @amyiris user. I perform this step to find out how many followers I have, so that I can loop through them. The API is capable of retrieving followers in groups of 100, so I need to figure out how many pages of followers I have. If I have 1-100 followers, that's 1 page; 101-200, that's 2 pages, and so on. The math.ceil() function performs that calculation in the last line above. I convert the value to an integer type (even though I know it will be a floating point number in the form x.0), so that I can use it in range() to produce the page numbers that get passed along to my modified GetFollowers() method.
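
The page arithmetic can be checked on its own, without touching Twitter. Here's a small sketch (Python 3 syntax), assuming 100 followers per page as described above; pages_needed is a name made up for this illustration and is not part of the Twitter library:

```python
import math

PAGE_SIZE = 100  # followers returned per call, as described in the post


def pages_needed(followers_count, page_size=PAGE_SIZE):
    """Hypothetical helper: number of pages needed to cover every follower."""
    # Dividing by a float forces true division; ceil rounds any partial page up.
    return int(math.ceil(followers_count / float(page_size)))


for count in (1, 100, 101, 200, 201):
    print(count, pages_needed(count))
```

The 100.0 in the original line matters on Python 2: with two integers, "/" would truncate the result before ceil() ever saw the fraction.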


The Cool Stuff - Generators:

followerpage = (api.GetFollowers(page=x)
                for x in range(1,numpages+1))
myfollower = (y[z]
              for y in followerpage \
              for z in range(len(y)))


This section contains the newer Python features that I was experimenting with - Generators. If you've never seen this before, take a close look. It's really cool stuff.

OK, here's the magic - this shows just one reason why Python is so cool. These few lines set up my generators. The give-away that these are generators is that they are in the form
symbol_name = (something for something_else)
The keys are the parentheses (not square brackets, which would build a whole list right away) and the word "for".
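
To see the difference the brackets make, here is a tiny sketch that has nothing to do with Twitter; the names squares_list and squares_gen are made up for the illustration:

```python
squares_list = [n * n for n in range(5)]   # square brackets: the whole list is built now
squares_gen = (n * n for n in range(5))    # parentheses: a generator, nothing computed yet

first = next(squares_gen)    # computes only the first value
second = next(squares_gen)   # computes only the second value
rest = list(squares_gen)     # drains whatever is left

print(squares_list, first, second, rest)
# prints: [0, 1, 4, 9, 16] 0 1 [4, 9, 16]
```

Note that a generator can only be walked through once; after the last line above, squares_gen is exhausted.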

So look at the first generator declaration statement:

followerpage = (api.GetFollowers(page=x)
                for x in range(1,numpages+1))

Important: this really "does" nothing, except establish a binding for future use. It's similar to a function declaration. It says "if I refer to the symbol called 'followerpage' later, know that it's a generator that will call the api.GetFollowers() method a number of times, passing in x, within this range." The key is that it does NOTHING, yet: no call to api.GetFollowers() is made when this statement runs. It simply binds the symbol 'followerpage' to a certain generating action. Each time a value is requested from 'followerpage' later, it's going to give us one page of followers, and it will continue to do that until the range of page numbers is exhausted.
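
You can convince yourself of this without a Twitter account. In the sketch below, fake_get_followers is a hypothetical stand-in for api.GetFollowers() that just records which pages were asked for; everything else mirrors the statement above:

```python
calls = []  # records every page number that is actually requested


def fake_get_followers(page):
    """Hypothetical stand-in for api.GetFollowers(page=...); no network involved."""
    calls.append(page)
    return ['user%d_%d' % (page, i) for i in range(3)]


numpages = 4
followerpage = (fake_get_followers(page=x) for x in range(1, numpages + 1))

calls_after_declaration = list(calls)   # still empty: nothing has been fetched

first_page = next(followerpage)         # this triggers the first real call
calls_after_first_next = list(calls)

print(calls_after_declaration)   # prints: []
print(calls_after_first_next)    # prints: [1]
print(first_page)                # prints: ['user1_0', 'user1_1', 'user1_2']
```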


And the next statement is similar:
myfollower = (y[z]
              for y in followerpage \
              for z in range(len(y)))


This generator says that if I later refer to the symbol 'myfollower', give me one follower. Pull it out of the list of followers that come from the followerpage. And pull them out of the list one at a time until the list is exhausted (hence, len(y)), and then go get another page from 'followerpage'. If Twitter is behaving properly, I expect len(y) to be 100 for every page except the last one, which will be a partial list of the remaining followers.

To reiterate, this generator declaration does nothing at this time, other than to set up the symbol 'myfollower' for later use.
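
For what it's worth, indexing with range(len(y)) is only one way to spell this "flattening" of pages into single followers. The sketch below uses made-up page data and shows two other spellings that I'd expect to produce the same sequence: iterating over each page directly, and itertools.chain.from_iterable from the standard library. It's an illustration of alternatives, not a claim that the original form is wrong:

```python
from itertools import chain

pages = [['amy', 'bob', 'cat'], ['dan', 'eve'], ['fay']]  # made-up follower pages

# The form used in the post: index into each page.
by_index = (y[z] for y in pages for z in range(len(y)))

# Iterating over each page directly yields the same followers in the same order.
by_item = (follower for page in pages for follower in page)

# chain.from_iterable also flattens one level, and is lazy as well.
by_chain = chain.from_iterable(pages)

result_index = list(by_index)
result_item = list(by_item)
result_chain = list(by_chain)
print(result_index)
# prints: ['amy', 'bob', 'cat', 'dan', 'eve', 'fay']
```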

So once again, these generators are simply "declared" at this point; apart from their little setup, they have done NOTHING so far in the execution. Now back to the traditional stuff:

csvfile=open('followers6.csv','a')
csvout=csv.writer(csvfile,dialect='excel')

These two statements prepare our .csv file. The first opens the file with mode 'a' for append. The second sets the dialect to "excel" and provides us with a writer (csvout).

Now we're all done with the setup. All we need to do is loop through each follower and send it on to the csv file.


for f in myfollower:
    csvout.writerow([f.screen_name,
                     f.followers_count,
                     f.status,
                     ])


Here's where the action starts. The for loop has a reference to the 'myfollower' generator. So the first time through the loop, Python will hit the generator asking it for a single value. Well, we've declared that the 'myfollower' generator retrieves one value from the list which is returned from 'followerpage'. So the first time through, it'll hit the 'followerpage' generator asking it for a value (which will be a list of 100 or so followers), of which it will return the first one.

That one follower gets written to the CSV file and then the loop repeats. Each time through the loop, the 'myfollower' generator is asked for a value. And each time, it pulls one out of the list, or, if there are none queued up, it asks 'followerpage' to generate a new list.
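
The back-and-forth between the loop and the two generators is easier to believe when you watch it happen. This sketch logs each step in order; fake_get_followers is again a hypothetical stand-in for api.GetFollowers(), returning two made-up followers per page, and "write" stands in for the csvout.writerow() call:

```python
events = []  # ordered log of what happens, and when


def fake_get_followers(page):
    """Hypothetical stand-in for api.GetFollowers(page=...)."""
    events.append('fetch page %d' % page)
    return ['p%d_f%d' % (page, i) for i in range(2)]


followerpage = (fake_get_followers(page=x) for x in range(1, 3))
myfollower = (y[z] for y in followerpage for z in range(len(y)))

for f in myfollower:
    events.append('write ' + f)   # stands in for csvout.writerow(...)

for line in events:
    print(line)
# prints:
# fetch page 1
# write p1_f0
# write p1_f1
# fetch page 2
# write p2_f0
# write p2_f1
```

Page 2 is not fetched until both followers from page 1 have been written.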

Finally, the file is closed in the last statement: csvfile.close().

The really cool thing to me about this program was the generators. Performing that setup once is really a cool, modular way of programming. I got this technique from @wjhuie, who recommended David Beazley's paper from Pycon 2008. See pages I-37 and I-38 (which are both on page 19 of 71 in the PDF).

The reason I find this to be so cool is that it's highly efficient and modular. Although the program LOOKS like it goes and gets all the pages, and then gets all the users, potentially consuming lots of memory, really it only gets them on demand. Presumably you'll only be working with one follower at a time in the "for f" loop, so the program only retrieves what is needed.
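
The same on-demand behavior can also be written as a generator function, using yield. One possible advantage is that it can stop when a page comes back empty instead of relying on a page count computed up front. This is only a sketch under assumptions: iter_followers and get_page are hypothetical names, and I'm assuming that asking for a page past the end returns an empty list, which I haven't verified against Twitter:

```python
def iter_followers(get_page):
    """Yield followers one at a time until a page comes back empty.

    get_page is a hypothetical callable that takes a 1-based page number and
    returns a list; in the program above it would wrap api.GetFollowers.
    """
    page = 1
    while True:
        followers = get_page(page)
        if not followers:
            return            # ends the generator; a for loop over it simply stops
        for follower in followers:
            yield follower
        page += 1


# Made-up data standing in for the Twitter service.
fake_pages = {1: ['amy', 'bob'], 2: ['cat']}
collected = list(iter_followers(lambda page: fake_pages.get(page, [])))
print(collected)
# prints: ['amy', 'bob', 'cat']
```

The trade-off is one extra request at the end, just to discover the empty page.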

Another Hidden Feature - Extended User Information:

The second element that I slipped into that post, without telling, was a change to GetUser() in the Twitter API file. I added the capability to get some of the extended user information from Twitter. The 'followers_count' is not extended information, so it's returned with a simple call to Twitter (but the 'Python Twitter' API did not expose it, so I added it to my version of 'Python Twitter').

However, the 'friends_count' and the 'statuses_count' are not included in a basic twitter API interface - you need to request the extended User information to get that data. The way I implemented it in my modified Python Twitter API was that you can make a follow-up call to GetUser(), and it will retrieve the extended information.

In other words, when I retrieve my followers, it brings back User objects, 100 at a time, with just the basic information about each user. This includes their id, screen_name, name, status, and my newly added followers_count. But if you take that User information and ask my new Python Twitter version again to look up that User by id or screen_name, with GetUser(), then you'll get some of their extended information, such as their friends_count, and statuses_count. You could get more if you'd like; that exercise is left up to the reader, as my lazy teachers always said! I was just after the counts, for now.

This can give us the familiar "Friends / Followers / Updates" information. Here's how. If you modify the first line of the 'myfollower' generator, you'll have access to the Friends, Followers, Updates values for each of your followers.

myfollower = (api.GetUser(y[z].id)
              for y in followerpage \
              for z in range(len(y)))


Then simply add those values to your CSV file:


for f in myfollower:
    csvout.writerow([f.screen_name,
                     f.friends_count,
                     f.followers_count,
                     f.statuses_count,
                     f.status,
                     ])


Note that I have added in f.friends_count and f.statuses_count into the CSV output, in the familiar "Friends / Followers / Updates" order.

The biggest issue with this method is that each call to GetUser() hits the Twitter server. And since there are rate limits (100 per hour), I found that the program doesn't work well without sleeping periodically to keep from hitting the Twitter rate limits. So depending on how many followers you have (if greater than 98 or so), you'd want to put a "time.sleep(20.0)" into the "for f" loop (and "import time" at the top). That would pace the program so that you don't exceed the Twitter API rate limits, but would also extend the amount of time it takes to run the program (from seconds to potentially hours).
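
As a rough check on the pacing arithmetic: a limit of 100 calls per hour works out to one call every 3600 / 100 = 36 seconds, so if each call returns quickly, a 20-second sleep on its own may not be quite enough. The sketch below computes the pause from the limit and wraps any iterable so that it waits before asking for the next item. Both min_delay_seconds and paced are hypothetical helpers written for this illustration, and the demo swaps in a fake sleep so it finishes instantly:

```python
import time


def min_delay_seconds(calls_per_hour):
    """Smallest pause between calls that stays within an hourly limit."""
    return 3600.0 / calls_per_hour


def paced(items, delay, sleep=time.sleep):
    """Yield each item, then pause before asking the source for the next one."""
    for item in items:
        yield item
        sleep(delay)


delay = min_delay_seconds(100)
print(delay)   # prints: 36.0

# Demo with a fake sleep that only records the requested pauses.
naps = []
paced_items = list(paced(['a', 'b', 'c'], delay, sleep=naps.append))
print(paced_items, naps)
# prints: ['a', 'b', 'c'] [36.0, 36.0, 36.0]
```

In the program above this would read "for f in paced(myfollower, delay):". Because the generators are lazy, each GetUser() call then happens only after the pause. The sketch also pauses once after the final item, and it ignores the GetFollowers() calls, which presumably count toward the limit too.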

Just to show you what this would look like, however, I ran it for a while to generate my FFU (Friends / Followers / Updates) spreadsheet for my followers. Here's what it looks like. Column A is the screen_name. B is the friends_count. C is the followers_count, and D is the statuses_count (the number of updates).





Once it's in Excel, you can manipulate the data several ways. For instance, the graph below shows a scattergram of a subset of my followers (yes, you have now been reduced to a dot!). The X-axis represents the number of friends that you have, and the Y-axis represents the number of followers.

As you'd expect, there's a cluster of users near the (100,100) point. And the dots stay close to the 45-degree line, with some outliers.





I hope there's some value in this for you. The graphing is interesting to me, but not very much different than expected. Still, it helps me to understand my followers better, and it gave me something geeky to do.

But mostly, I wanted to share how the Generators work, since that seemed to stump a few people who examined my code. And once again, apologies for putting some people through pain thanks to Blogger dropping my spaces. I'm happy to use pastie or pastebin to post code if it helps. Just ask. @amyiris on Twitter.

It seems like the for loop that generates the CSV could also be simplified using generators, but my head was about to explode at that point. If you know how to do that, I'd love to read it in the comments! Thanks!
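
One possibility, offered as a sketch rather than a definitive answer: the csv writer has a writerows() method that accepts any iterable of rows, so a generator of rows can be handed to it directly. The example below (Python 3) uses a made-up User tuple in place of the real Twitter objects and an in-memory buffer in place of the file:

```python
import csv
import io
from collections import namedtuple

# Made-up stand-in for the User objects returned by the Twitter library.
User = namedtuple('User', 'screen_name followers_count status')
myfollower = (u for u in [User('amy', 10, 'hello'), User('bob', 250, 'hi, all')])

buffer = io.StringIO()                        # in-memory file, nothing touches disk
csvout = csv.writer(buffer, dialect='excel')

# writerows() pulls rows from the generator one at a time as it writes them.
csvout.writerows([f.screen_name, f.followers_count, f.status] for f in myfollower)

print(repr(buffer.getvalue()))
# prints: 'amy,10,hello\r\nbob,250,"hi, all"\r\n'
```

When writing to a real file on Python 3, the csv documentation recommends opening it with newline='' so the line endings are not doubled on some platforms.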

7 comments:

Jorge Lugo said...

If you have a huge Twitter following, then your page count could change by the time of your last call to GetFollowers(). You could ignore page count and continue calling GetFollowers() until it returns an empty list.

It may also be possible for a follower to be in page 2 and page 3 of calls to GetFollowers() if your followers changed between calls. You could throw your followers in a set to prevent that problem.

Amy Iris said...

Jorge,
Thanks for the comment. Excellent feedback!

The set idea is a good one!

I like your suggestion of ignoring the page count, but I struggled with how to terminate my generator. Any suggestion as to how to do that in a Generator? (These were the first generators I've ever written)

If you help me with that, I'll post a refactored example.

Jorge Lugo said...

I'm new to generators. After a little research, I had trouble thinking of an alternative to page count for terminating the generator.

Amy Iris said...

Jorge-
@cmars tweeted that I could write a generator function, using yield and then return when I am out of values.

So more complicated generators can be written, but I'm happy with the brief version I have. Except I like your idea for using sets to avoid duplication.

I haven't worked with sets yet. Prior to sets being formally rolled into Python, I frequently found myself implementing my own sets with a dictionary pointing to a meaningless value (like {'user123':1, 'user456':1}, the 1 being the meaningless value). So I'm glad we have sets now.

Anonymous said...

Is it possible to obtain your followers email? I have been using a java4j api and I can't seem to find a way. Had you seen a way to do this?

Anonymous said...

Amy! I am so glad to have found you! I would love some help with this. I have been doing this manually and it is taking forever since we have about 10,000 followers. I don't know a whole lot about how to use code on twitter. Can you please help? The profile whose followers we are gathering data for is at twitter.com/garlandeharris. I am his daughter Safia. I added you on twitter to my own profile and his. Talk to you soon!

mesafia said...

P.s. I retweeted and then clicked on it and there is a 404 not found.. :-(

find me on twitter please @mesafia I would love some help getting to know my followers!