Tuesday, January 13, 2009

Twitter Automations with Python Scripts

Several readers have asked for code samples showing how to use Python to automate some Twitter tasks. Below is a simple sample that goes to the Search page and looks for users who have recently mentioned "Python Programming" in their tweets.

The script creates a file that contains a stripped-down version of each tweet, with any HTML removed. The first 20 characters of each line hold the Twitter user's name (padded with spaces), so that you can quickly read through the file looking for people who may share a common interest.

A similar program can be created to parse this file and add users automatically.
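
As a rough sketch of the first half of that idea, the snippet below reads lines in the layout described above (user name in the first 20 characters, tweet text after it) and collects the distinct user names. The function name read_usernames and the sample lines are made up for illustration, and the step of actually following the users is left out because it needs an authenticated Twitter client.

```python
def read_usernames(lines):
    """Return the distinct user names found in the lines, in first-seen order."""
    seen = set()
    names = []
    for line in lines:
        name = line[:20].strip()  # the name occupies the first 20 characters
        if name and name not in seen:
            seen.add(name)
            names.append(name)
    return names


# Made-up sample lines in the same layout the script writes
sample = [
    ("alice" + " " * 20)[:20] + "Learning Python programming today\n",
    ("bob" + " " * 20)[:20] + "Python programming tips\n",
    ("alice" + " " * 20)[:20] + "More Python programming\n",
]
print(read_usernames(sample))
```

To run it on the real output, pass the open file instead of the sample list, for example read_usernames(open("topictweetspython.txt")).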

Amy Iris uses simple scripts like this for some of her community-building tasks. I hope you find it useful (even though it's far from perfect). Feel free to use this code, for good only, and consider putting a pause in the code so that you do not hammer the Twitter servers!



# This program finds Twitter users by topic and dumps them to a file.
# Python 2 only: urllib2 does not exist in Python 3.

import re, urllib2


baseurl = "http://search.twitter.com/search?q="
query = "python+programming&lang=en"
addon = ""  # paging parameters; empty for the first page

for i in range(2, 100):  # i is the number of the next page to request
    r = urllib2.urlopen(baseurl + query + addon).read()

    # each match is a (user name, tweet HTML) pair
    f = re.findall(r'<div class="msg">.*?<a href="http://twitter.com/(.*?)".*? class="msgtxt.*?">(.*?)</span>', r, re.DOTALL)

    for g in f:
        g0 = g[1]

        # strip HTML tags
        while "<" in g0:
            p1 = g0.index("<")
            p2 = g0.find(">", p1)
            if p2 == -1:
                break  # unmatched "<": leave the rest of the text alone
            g0 = g0[:p1] + g0[p2+1:]

        # user name padded or cut to 20 characters, then the tweet text
        p = (g[0] + " " * 20)[0:20] + g0 + "\n"
        fil = open("topictweetspython.txt", "a")
        fil.write(p)
        fil.close()

    # the "Older" link carries the max_id needed for the next page
    f2 = re.findall(r'<a href="/search.max_id=(.*?)&page=.*?&q=(.*?)">Older</a>', r, re.DOTALL)
    if not f2:
        break  # no "Older" link, so there are no more pages

    addon = "&max_id=" + f2[0][0] + "&page=" + str(i)
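
For the pause suggested above, one option is a small wrapper that waits between requests. This is only a sketch: the name fetch_politely, its delay parameter and the two-second default are hypothetical choices for illustration, not anything Twitter specifies. It uses a stand-in fetch function so that it can run without touching the network.

```python
import time


def fetch_politely(urls, fetch, delay=2.0, sleep=time.sleep):
    """Call fetch(url) for each URL, sleeping `delay` seconds between calls."""
    results = []
    for n, url in enumerate(urls):
        if n > 0:
            sleep(delay)  # wait before every request except the first
        results.append(fetch(url))
    return results


# Demonstration with a stand-in fetch function (no network access)
pages = fetch_politely(["page1", "page2", "page3"],
                       fetch=lambda u: u.upper(), delay=0.01)
print(pages)
```

In the script above, the simplest equivalent is to import time and call time.sleep(2) at the end of the body of the for i loop.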

8 comments:

kumar said...

Hi Amy, that's a pretty neat idea. I follow a few automated feeds that mine Twitter for music speak. As for your strategy, there is actually an easier way, and one that is a bit more "endorsed" by the Twitter servers. It's an API you can use to get data instead of HTML, which is much easier to parse.

http://dev.twitter.com/2008/10/we-got-data.html
http://apiwiki.twitter.com/Search+API+Documentation

Here is a quick example using the simplejson library:

>>> import simplejson, urllib2
>>> f = urllib2.urlopen('http://search.twitter.com/search.json?q=python+programming')
>>> d = simplejson.load(f)
>>> d['results'][0]['text']
u'RT @pet3r @M4r14nn4 Decrease in Python programming jobs during recession second smallest of all languages... 1st was Lisp http://bit.ly/Kydj'
>>> d['results'][0]['from_user']
u'QuotdPython'
>>>

have fun!

PS. unfortunately a recession-related post was indeed the first result (I didn't cherry pick that)

Amy Iris said...

Kumar-
Thanks for spoiling my next post! :)

Yes, Amy Iris uses the Twitter API extensively. She's used the PythonTwitter API software from the Google Code library.

I thought the HTML parsing would be a good, readable example that the new Python programmer could make some sense out of.

Steve Oldner said...

Well, I'm a new Python programmer, still going thru the tutorials. All I can say is WOW! and Thanks!

Steve Oldner said...

Okay, please help the newbie.
I'm using Python 3 and do not have the urllib2 module. So where can I get it?

Thanks,

Steve Oldner said...

Well, enough for today. Now I've got an error in the 1st f=re.findall line.
Traceback (most recent call last):...
File "D:\Python30\lib\re.py", line 190, in findall
return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object

kumar said...

This code won't work in Python 3.0 as-is. Raw bytes are a little different and urllib (no longer called urllib2) is different. You can read all about it in http://docs.python.org/dev/3.0/whatsnew/3.0.html or just install Python 2.6 (both can co-exist peacefully).
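
To illustrate the bytes issue: in Python 3, the read() call on a urlopen response returns bytes, and re.findall refuses to mix a string pattern with bytes input, which is the TypeError above. Decoding the response first is one way around it. A minimal sketch, using a made-up HTML fragment in place of a real response and assuming the page is UTF-8 encoded:

```python
import re

# Made-up stand-in for what urllib.request.urlopen(url).read() returns in Python 3
raw = b'<a href="http://twitter.com/example_user">example_user</a>'
pattern = r'<a href="http://twitter.com/(.*?)"'

try:
    re.findall(pattern, raw)  # str pattern against bytes input
except TypeError as err:
    print("TypeError:", err)

text = raw.decode("utf-8")  # assumption: the response is UTF-8 encoded
users = re.findall(pattern, text)
print(users)
```

In Python 3 the urllib2.urlopen call itself becomes urllib.request.urlopen.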

Steve Oldner said...

Kumar,

Thanks! I worked thru the urllib2 issue, but I don't know enough of Python yet for the other.

Your suggestion to install 2.6 is probably better. I've got several old library books from 1999 - 2004 and have had some problems because of the changes. BTW, like your site!

Amy Iris said...

Sorry, I should have mentioned this is version 2 code (works on 2.5, as long as I didn't mess it up when I posted it; should work in 2.6.)

I want to emphasize that this is the HARD way to do this, and I was going to follow up with a subsequent post with the easy way.

I think the hard way is easier to understand for new Python programmers. It's just HTML scraping. But clearly using the API is shorter, and is more sanctioned - it takes less server resources from Twitter, and less network resources.