Tuesday, January 20, 2009

Twitter Automations with Python Scripts, part 2

Finding like-minded people on Twitter can be a challenge. Today's code demonstrates scraping a Twitter directory site, Twellow.com, looking for users who are listed under its directory listing for Python.

Twellow isn't a complete directory (right now, they index only about 800K users, which might be around 10% complete), but it has a listing of 280 Twitter users in their Python section. It'd be nice to know who those users are.

One known issue with this code is that it grabs too many users. Since Twellow also shows recent tweets, it's possible that a Python user mentions another user in an @reply, and this program picks up that user as well. (I consider that a feature.) Twellow reported, for example, that there are 280 Twitter users in its Python list, but this program retrieved 359 for me (apparently because it picked up users mentioned in @replies as well).



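import re
import urllib2  # Python 2 standard library

turl="http://www.twellow.com/category_users/cat_id/181/page_num/" #cat_id=181 for Python people
seen=[]  # usernames already written, so each one is saved only once

for page in range(1,15):
    # fetch one page of the directory listing
    r=urllib2.urlopen(turl+str(page)).read()

    # capture whatever follows twitter.com/ in each profile link
    friends=re.findall(r'<a href="http://www.twitter.com/(.*?)"',r,re.DOTALL)

    # append any names not seen before to the output file
    fl=open('twitterspythontwellow.txt','a')
    for fr in friends:
        if fr not in seen:
            fl.write(fr+'\n')
            seen.append(fr)
    fl.close()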
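
For anyone on Python 3, where urllib2 became urllib.request, here is a rough sketch of the same scrape. The function names (extract_usernames, fetch_page, scrape) are mine, and decoding the page as UTF-8 is an assumption. The URL and the link pattern are taken as-is from the code above. Splitting the fetch from the extraction also lets you try the regex on a saved page without hitting the site.

```python
import re
from urllib.request import urlopen

TURL = "http://www.twellow.com/category_users/cat_id/181/page_num/"  # cat_id=181 for Python people
LINK_RE = re.compile(r'<a href="http://www.twitter.com/(.*?)"', re.DOTALL)


def extract_usernames(html):
    """Return every username captured by the link pattern, in page order."""
    return LINK_RE.findall(html)


def fetch_page(page):
    """Download one directory page (UTF-8 decoding is an assumption)."""
    with urlopen(TURL + str(page)) as resp:
        return resp.read().decode("utf-8", errors="replace")


def scrape(pages, fetch=fetch_page):
    """Collect usernames across pages, keeping only the first sighting of each."""
    seen = []
    for page in pages:
        for name in extract_usernames(fetch(page)):
            if name not in seen:
                seen.append(name)
    return seen
```

Calling scrape(range(1, 15)) would walk the same 14 pages as the original loop. Passing your own fetch function (for example, one that reads saved HTML files) keeps the network out of it.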




Since my last code snippet wasn't very copy-paste friendly, I used the following tool today, to format my code for Blogger. Hopefully it will make the code more usable.

http://francois.schnell.free.fr/tools/BloggerPaste/BloggerPaste.html

Sorry about the last one!


Please leave comments with other tools. Know of a better directory than Twellow?


P.S. My social experiment yesterday was a flop. Still, the results may be worth mentioning. I suspected that I could put a half-ass blog post together, and throw it out onto a Wiki-style site (in this case, Wetpaint), and see the community rally to "finish it". Well, I was wrong! There were a couple of edits to it, but it wasn't nearly as effective as I thought it would be.

The model I was inspired by was the Twitter Fan site and the Twitter API wiki. The Twitter API is completely documented on the latter, and the former has a lot of other fun facts about Twitter. With my limited data, and unscientific trial, it appears that only about 0.2% of users will take the time to make an edit (on the first day, anyway). Makes me wonder how Wikipedia got to be so successful!


Please do me a favor. If you like this blog post, please Re-Tweet it. Thanks!

7 comments:

Timur Izhbulatov said...

Why do you use RE to parse the markup? This is so perlish!

I really like BeautifulSoup. Or you could just use HTMLParser or sax.xml from standard library.

Amy Iris said...

Timur-
I agree, it's perlish. Thanks for the feedback.

I chose to do it this way because I was coding fast, and it was trivial. I use BeautifulSoup for other projects, but in this case, I had to examine the HTML anyway, so it was as simple as copying and pasting the HTML into the code, and then replacing the sample name with "(.*?)".

Since I am using this basic code pattern for various other sites, this is far easier than the other techniques that you suggested.

But you are correct, for production code, if you don't mind the overhead of BeautifulSoup, that's a better choice.

But this was simple and quick too. I like BeautifulSoup too, but it adds complication to a relatively simple program.

dartdog said...

wanted to send an e-mail but you don't have any....??? FWIW if you're trying to do stuff on the web, become a real person with ways to contact you!
FWIW what I was going to send is:
FWIW the Eclipse pydev solution is very good, way better than the hacking I was doing: better view of structure, code hints, syntax highlighting and a controlled run environment... well worthwhile. Now if I can just get the d*** debugger/breakpoints going....

gotgenes said...

A couple of notes:

1) seen should be a dictionary, not a list. With seen as a list, "if fr not in seen" runs in linear (O(n)) time. If instead you use "seen[fr] = 1", then "if fr not in seen" runs in constant (O(1)) time on average. In other words, you should see a performance boost for large friend networks if you make seen a dictionary.

2) A couple of syntax highlighting solutions exist for Blogger. I use the one here currently, but another popular solution is the one detailed here.

3) Your profile description is too self-deprecating. You should rewrite it so you sound more confident (and realistic). All of us are learning, and none of us will master everything. Only sometimes do I post the optimal answer; often I'll receive comments that are better solutions than the ones I've come up with. That's the fun of blogging. Don't be afraid to be wrong; someone will correct you. Try to find a focus for your blog (it seems you've really found one in the social web and Python programming), and say that's your area of interest, and then code and blog away!

Best,
Chris
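
Note 1 above can be sketched roughly like this. It swaps in a set instead of the dictionary the comment suggests. Both are hash-based, so the membership test costs the same on average. The helper name unique_in_order is hypothetical, and the code is Python 3.

```python
def unique_in_order(names):
    """Return names with duplicates dropped, keeping first-seen order."""
    seen = set()      # hash-based: "name not in seen" is O(1) on average
    ordered = []      # preserves the order in which names first appeared
    for name in names:
        if name not in seen:
            seen.add(name)
            ordered.append(name)
    return ordered
```

With a few hundred names the difference from a list is hard to notice. It starts to matter once the list of names grows into the tens of thousands.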

Catherine said...

Timur, there is nothing wrong with using regular expressions in Python! In fact, I'd argue that this is more Pythonic - "do the simplest thing that could possibly work" - and going to a more "advanced" solution is only appropriate when you find the simple one isn't feeling simple anymore.

Anyway, Amy: you're in Cincinnati? Awesome! Please please please come to PyOhio next July! In fact, consider submitting a talk. Hope to see you there (if not before)!

Timur Izhbulatov said...

Catherine, is it practical to reflect all aspects of HTML markup with RE? I think it's not.

My point was that using standard universal tools is more reliable. Of course, this introduces some complexity, but I think it's acceptable since you don't have to spend your time debugging regular expressions. That is, your RE is your liability while a standard parser is not, and subclassing it is much cleaner than writing more and more complex REs.

Amy, I agree. The choice of tools depends on the task. If we're talking about a more general task like 'find all links pointing to given host in an HTML document', using specialized tools is better IMO. However, if the task is limited to 'find all links pointing to given host in this particular page', your approach is OK. Well, until someone changes the page and breaks your code :) But this is beyond the scope of the latter task.
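
As a rough illustration of the more general task, 'find all links pointing to given host in an HTML document', here is a minimal sketch that subclasses the standard library parser (html.parser in Python 3, the successor of the HTMLParser module). The names HostLinkParser and links_to_host are hypothetical, and matching on the exact host string "www.twitter.com" is an assumption carried over from the regex in the post.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class HostLinkParser(HTMLParser):
    """Collect the path of every <a href> that points at one host."""

    def __init__(self, host):
        super().__init__()
        self.host = host
        self.paths = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        parts = urlparse(href)
        if parts.netloc.lower() == self.host:
            self.paths.append(parts.path.lstrip("/"))


def links_to_host(html, host="www.twitter.com"):
    """Return the paths of all links in html whose host matches."""
    parser = HostLinkParser(host)
    parser.feed(html)
    parser.close()
    return parser.paths
```

Unlike the regex, this does not care about attribute order, quote style, or tag case, which is the reliability argument made above. It costs a class and a few more lines.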

Timur Izhbulatov said...

Stumbled upon a very good summary of the topic with examples http://j.mp/jOySua