Friday, February 27, 2009

Practical Approach to The Next Web

Many believe that the next step in the evolution of the internet is the "automated understanding" of the massive amounts of data that are available on the internet. Some call this the Semantic Web, or Web 3.0. Others believe that this is truly the next obvious step in the evolution of artificial intelligence. In this blog post, I lay out a practical method to unleash the vast troves of information that are currently available to humans on the web, but not encoded in a way that machines can interpret.

Some people will take issue with my premise. They argue that the Semantic Web is a "non-starter" - an idea that came from deep thinkers that never moved into the mainstream. And many people hate the Web 3.0 terminology. I am not trying to entertain either debate, but merely to present practical ways that we all can move the web forward.


It seems to me that there are three ways to help unlock the massive amount of data that exists on the internet. One method is to build semantics into every significant web page on the internet. Using technologies like RDF and Microformats, website editors can encode their data to include hints as to what the meaning of the data is. Efforts are underway in this direction, but it's a massive undertaking, and can really only be done by website owners, one-by-one, modifying each page (with some obvious automations).

Second, there's the API approach. Application Programming Interfaces are tools that allow programmers to easily access data without having to deal with the typical presentation information that's on a webpage, such as styles, layout, and things that help humans.
Best Buy recently released their API for access to their trove of product and store information. If each data owner releases an API, then the data can be unlocked, and meaning can be derived from the data.

Finally, there's the page-scraping approach. Currently Search Engines do this - they try to figure out what data "means" by examining the page as is. In addition, individual developers write little page scraping snippets that go get the data that they want. Python is an exceptional language for this task (reportedly, much of Google's early code was written in Python, and more recently, the creator of Python is employed by Google).

Let me show you what each of these techniques looks like, before I present my practical recommendation:


Method 1: Embedding Semantic information into the web page

Embedded semantic information does not change how websites appear to humans. However, embedded into the source of the web page would be little clues as to the meaning of the data.

This is already taking place in some places around the web. In fact, Blogger (and other blog software) will do this for you automatically. Below, you can see this blog, and the semantic information that is embedded behind the scenes. (Click on the image to get a clearer picture.)




This is more than just HTML tags. Specific classes are used which describe the type of data that follows. So the Blog Post's Title is surrounded by tags that say "hey search engines, this next part is the blog post title". The HTML tags of the 1990's simply described the page layout, but these tags describe the page semantics - what the page means.



The challenge with this method is getting everyone on the same page! If everyone uses the same class name designations for the same purposes, then this system has a great chance of success, and it is already showing signs of major progress. Organizations have sprung up to define these formats (like Microformats), so that class names are consistent across the web. If they aren't consistent, then we are no better off after semantic encoding than before.
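To make this concrete, here is a minimal sketch of how a program might read hAtom-style class names out of a page. The markup below is made up (loosely modeled on what Blogger emits), and real microformat parsers are far more thorough - this just illustrates why consistent class names matter.

import re

html = """
<div class="hentry">
  <h3 class="entry-title">Practical Approach to The Next Web</h3>
  <abbr class="published" title="2009-02-27">Friday, February 27, 2009</abbr>
</div>
"""

title = re.search(r'class="entry-title">([^<]+)<', html)
published = re.search(r'class="published" title="([^"]+)"', html)
print title.group(1)       # Practical Approach to The Next Web
print published.group(1)   # 2009-02-27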


Method 2: Application Programming Interfaces (APIs)

APIs can be any sort of programmatic interface to an application, but lately a certain style has emerged: URL-based (RESTful) APIs. With this technique, website and data owners publish a second set of web pages that are not really meant for humans. Instead, they are meant for other programs to access the data.
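Here is a rough sketch of what fetching data from a URL-based API looks like in Python. The endpoint, parameters, and key below are placeholders for illustration only - consult the actual API's documentation (Best Buy Remix, in this case) for the real URL format.

import urllib2

# Placeholder endpoint and key - not a real API URL.
url = ("http://api.example.com/v1/products"
       "?search=ipod&apiKey=YOUR_KEY_HERE&format=json")

response = urllib2.urlopen(url)
print response.read()   # structured data (JSON or XML), with no styles or layout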

Below is an example of the Best Buy Remix API which was released a few weeks ago. Best Buy has made it possible for developers around the world to access their store and product data.





Some corporations and individuals view this as a risk. After all, if I expose all my pricing to all my competitors, what's stopping them from offering everything for one penny less, and beating me on every deal? Or what about ad revenue from my page?

But I think that Remix is actually genius on the part of Best Buy. After all, the semantic web WILL happen. You have a choice of being first and defining the standard (forcing your competitors to chase you), or trying to protect the old way of doing things and ending up in a follower position.

This reminds me a little of an article I just read in this month's Wired Magazine, related to how the "Netbook" caught the traditional laptop manufacturers off guard. The key learning is that if you are too worried about protecting your old business, you might lose sight of a huge new business opportunity, and then forever be chasing the new model. No, Best Buy was right on - release the API, define the standard.

Here's what I think is going to happen in the API field: as developers begin to write code for the Best Buy API, it will bring value to Best Buy. And then Wal-Mart and Target and others will be scrambling to catch up. Some companies may take the typical "Sony tactic" and try to invent their own standard. But by then it's too late. Others will copy Best Buy's API definition (be "Remix-compatible"), and have immediate compatibility with developers' code that is already being written. But neither of these positions is as desirable as Best Buy's position of getting to define the standard!

APIs are a great way to move forward toward the next evolution of the web, as I mentioned in my earlier blog post.



Method 3: Page Scraping

The concept of page scraping seems ugly and primitive. It has been going on for years, but tools and knowledge have progressed to the point where this method has a realistic, practical place in the future of the web.

This method starts with the user - some user who wants some information. In my example below, an AFL (Australian Football League) fan wanted programmatic access to scores that were available on the web.

This user simply navigated to the website that contained the information he or she was looking for. A browser "view source" of the web page then reveals the format of the page, so that it can be analyzed and the scraper can be built.




The user then wrote a small snippet of code that was capable of snagging that information, and feeding it to his program.
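Something along these lines, for example - the URL and the HTML pattern here are made up; a real scraper would be written against the actual markup found with "view source":

import re
import urllib2

# Hypothetical page and pattern, for illustration only.
page = urllib2.urlopen("http://www.example.com/afl/scores").read()

# Pull out anything that looks like <td class="score">123</td>
scores = re.findall(r'<td class="score">(\d+)</td>', page)
print scores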





Finally, he publishes his code, and tells the world. And now the world is just a tiny bit better off than it was before.




Take a close look here, and think about what's going on. Is one little application that grabs AFL scores significant? It doesn't seem like it. I'm from Ohio; what do I care about Australian Football?

More significant, though, is the fact that the world was made a little better by one person who had a need, solved the problem on his own, and then provided the solution back to the community.

Does that model look at all familiar?





So to summarize, there seem to be three ways to advance the web and provide meaning to the vast amounts of data that are out there. Two of them are top-down, and one is bottom-up.

Two are elegant and sturdy. One is ugly and fragile.






If the third is ugly and fragile, then why even talk about it?

I believe the answer lies in history.





Think about the landscape of the encyclopedia market ten years ago. Surely, Britannica knew that people wanted online access to an encyclopedia, but it would also probably kill their print business. They likely recognized that their print business would face inevitable pressures anyway. Sound vaguely like the predicament that Best Buy and others are in? Lead change, or forever be chasing it.

Nupedia popped up with the idea of creating an online encyclopedia, and started hand-crafting well-researched articles, page by page. Never heard of Nupedia? Well, it's not easy for a small group of people to create an encyclopedia from scratch. To me, this sounds vaguely like the efforts to create the Semantic Web by recreating every web page and adding in semantics. These are hard problems. And Nupedia didn't get too far with its model of hand-crafting a free, well-researched, edited encyclopedia.

So in January 2001, Nupedia started a side project to allow collaboration on articles. It opened the process to the world, starting with fewer than twenty articles, and named it Wikipedia. As encyclopedias go, it was ugly and fragile. Fifth graders could change facts, or write "Dean Wormer Sucks" right into the article on Thomas Jefferson. But bit by bit, contribution after contribution, the world got to be a better place. And now, eight short years later, Wikipedia is recognized as a tremendous wealth of knowledge.

Compare the charts below, describing the Approaches to an Online Encyclopedia, to the ones above, regarding Approaches to the Next Generation of the Web:









To me it seems obvious and inevitable. These two situations map exactly.

Someone will invent a free-forever, open source, contributory mechanism to release the information that is hidden in the web. And bit-by-bit, contribution-by-contribution, the world will become a better place.

And that's what the Amy Iris project is all about. We're about ready to enter a Beta Test of our attempt at a free-forever, open source, contributory framework for pulling information out of the vast data of the internet. If you are interested in participating and seeing what we think could unleash the power of the internet, please drop me an email. I'm at amyiris at amyiris dot com.

As always, I'm open to feedback!

Smile! You're gettin' it!

Friday, February 20, 2009

Getting to Know Your Followers - Explained

This is a follow-up to yesterday's post, where I presented a Python program which used the Twitter interface to get information about my followers. I experimented with a newer Python feature, called Generators, which might not be familiar to everyone, and was new to me. In addition, I slipped in some extra changes to the Python Twitter interface so that it can retrieve the extended user information, if you want it. This blog post simply explains those two features. I really love the generators (2 statements in this program, spread over 5 printed lines). Please read on, and you'll see why!

Recall the earlier post where I showed how to get information about your followers:


import csv, math
import twitter as twitterapi
pwd='removed, of course'

api = twitterapi.Api(username='amyiris', password=pwd)
me = api.GetUser('amyiris')
numpages = int(math.ceil(me.followers_count/100.0))
followerpage = (api.GetFollowers(page=x)
                for x in range(1,numpages+1))
myfollower = (y[z]
              for y in followerpage
              for z in range(len(y)))
csvfile=open('followers6.csv','a')
csvout=csv.writer(csvfile,dialect='excel')


for f in myfollower:
    csvout.writerow([f.screen_name,
                     f.followers_count,
                     f.status,
                     ])
csvfile.close()


Let's walk through the code. Here's the boring setup part that's easy to understand; skip down if you want to see the Generators in action:


import csv, math
import twitter as twitterapi
pwd='removed, of course'

api = twitterapi.Api(username='amyiris', password=pwd)
me = api.GetUser('amyiris')
numpages = int(math.ceil(me.followers_count/100.0))


These first few lines are initialization steps. We're going to use the csv library to create a .csv file (a comma-separated-value file, which is a simplistic format suitable for importing into Excel). The library "math" is only needed for the "ceil" function, a function which "rounds everything up" so that 1.01 rounds to 2. And of course, the twitter file is my modified Twitter API file, which I included a link to. (It was pointed out in the comments (oops) that I had modified an older version of the Twitter API file - thanks, John Schneider.) Basically, the changes that I made were to facilitate paging through my followers 100 at a time (the unmodified version only retrieved the first 100 followers), and to retrieve the followers_count.

The call to twitterapi.Api sets up the authentication to the Twitter service, so that I can make future calls. You'd need to supply your own authentication here (user name and password).

The next line retrieves information about me - the @amyiris user. I perform this step to find out how many followers I have, so that I can loop through them. The API is capable of retrieving followers in groups of 100. So I need to figure out how many pages of followers I have. If I have 1-100 followers, that's 1 page; 101-200, that's 2 pages, and so on. The math.ceil() function performs that calculation in the second line above. I convert the value to an integer type (even though I know it will be a floating point number in the form x.0), so that I can pass it along to my modified GetFollowers() method.
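For example, here's how that page math works out at the boundaries:

import math

print int(math.ceil(100/100.0))   # 100 followers -> 1 page
print int(math.ceil(101/100.0))   # 101 followers -> 2 pages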


The Cool Stuff - Generators:

followerpage = (api.GetFollowers(page=x)
                for x in range(1,numpages+1))
myfollower = (y[z]
              for y in followerpage
              for z in range(len(y)))


This section contains the newer Python features that I was experimenting with - Generators. If you've never seen this before, take a close look. It's really cool stuff.

OK, here's the magic - this shows just one reason why Python is so cool. These few lines set up my generators. The give-away that these are generators is that they are in the form
symbol_name = (something for something_else)
The keys are the parentheses (not square brackets) and the word "for".

So look at the first generator declaration statement:

followerpage = (api.GetFollowers(page=x)
                for x in range(1,numpages+1))

Important: this really "does" nothing, except establish a binding for future use. It's similar to a function declaration. It says "if I refer to the symbol called 'followerpage' later, know that it's a generator that will call the api.GetFollowers() method a number of times, passing in x, within this range." The key is that it does NOTHING, yet. It simply binds the symbol 'followerpage' to a certain generating action. Each time we refer to 'followerpage' later, it's going to give us one page of followers. And it will continue to do that until it runs out of pages.


And the next statement is similar:
myfollower = (y[z]
              for y in followerpage
              for z in range(len(y)))


This generator says that if I later refer to the symbol 'myfollower', give me one follower. Pull it out of the list of followers that come from the followerpage. And pull them out of the list one at a time until the list is exhausted (hence, len(y)), and then go get another page from 'followerpage'. If Twitter is behaving properly, I expect len(y) to be 100 for every page except the last one, which will be a partial list of the remaining followers.

To reiterate, this generator declaration does nothing at this time, other than to set up the symbol 'myfollower' for later use.
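If you want to see this laziness in isolation, here's a tiny experiment (nothing to do with Twitter) that you can run in the interpreter:

squares = (x * x for x in range(5))   # declared - nothing computed yet
print squares.next()   # 0 - the first value is computed only when asked for
print squares.next()   # 1 - and so on, one value per request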

So once again, these generators are simply "declared" at this point, but have done NOTHING in the execution so far, except their little setup. Now back to the traditional stuff:

csvfile=open('followers6.csv','a')
csvout=csv.writer(csvfile,dialect='excel')

These two statements prepare our .csv file. The first opens the file with mode 'a' for append. The second sets the dialect to "excel" and provides us with a writer (csvout).

Now we're all done with the setup. All we need to do is loop through each follower and send it on to the csv file.


for f in myfollower:
    csvout.writerow([f.screen_name,
                     f.followers_count,
                     f.status,
                     ])


Here's where the action starts. The for loop has a reference to the 'myfollower' generator. So the first time through the loop, Python will hit the generator asking it for a single value. Well, we've declared that the 'myfollower' generator retrieves one value from the list which is returned from 'followerpage'. So the first time through, it'll hit the 'followerpage' generator asking it for a value (which will be a list of 100 or so followers), of which it will return the first one.

That one follower gets written to the CSV file and then the loop repeats. Each time through the loop, the 'myfollower' generator is asked for a value. And each time, it pulls one out of the list, or, if there are none queued up, it asks 'followerpage' to generate a new list.

Finally, the file is closed in the last statement: csvfile.close().

The really cool thing to me about this program was the generators. Performing that setup once is really a cool, modular way of programming. I got this technique from @wjhuie, who recommended David Beazley's paper from Pycon 2008. See pages I-37 and I-38 (which are both on page 19 of 71 in the PDF).

The reason I find this to be so cool is that it's highly efficient and modular. Although the program LOOKS like it goes and gets all the pages, and then gets all the users, potentially consuming lots of memory, it really only gets them on demand. Presumably you'll only be working with one follower at a time in the "for f" loop, so it only retrieves what is needed.

Another Hidden Feature - Extended User Information:

The second element that I slipped into that post without telling you was a change to GetUser() in the Twitter API file. I added the capability to get some of the extended user information from Twitter. The 'followers_count' is not extended information, so that's returned with a simple call to Twitter (but the 'Python Twitter' API did not expose this, and so I added it to my version of 'Python Twitter').

However, the 'friends_count' and the 'statuses_count' are not included in the basic Twitter API response - you need to request the extended user information to get that data. The way I implemented it in my modified Python Twitter API is that you can make a follow-up call to GetUser(), and it will retrieve the extended information.

In other words, when I retrieve my followers, it brings back User objects, 100 at a time, with just the basic information about each user. This includes their id, screen_name, name, status, and my newly added followers_count. But if you take that User information and ask my new Python Twitter version again to look up that User by id or screen_name, with GetUser(), then you'll get some of their extended information, such as their friends_count, and statuses_count. You could get more if you'd like; that exercise is left up to the reader, as my lazy teachers always said! I was just after the counts, for now.

This can give us the familiar "Friends / Followers / Updates" information. Here's how. If you modify the first line of the 'myfollower' generator, you'll have access to the Friends, Followers, and Updates values for each of your followers.

myfollower = (api.GetUser(y[z].id)
              for y in followerpage
              for z in range(len(y)))


Then simply add those values to your CSV file:


for f in myfollower:
    csvout.writerow([f.screen_name,
                     f.friends_count,
                     f.followers_count,
                     f.statuses_count,
                     f.status,
                     ])


Note that I have added f.friends_count and f.statuses_count to the CSV output, in the familiar "Friends / Followers / Updates" order.

The biggest issue with this method is that each call to GetUser() hits the Twitter server. And since there are rate limits (100 per hour), I found that the program doesn't work well without sleeping periodically to keep from hitting the Twitter rate limits. So depending on how many followers you have (if greater than 98 or so), you'd want to put a "time.sleep(20.0)" into the "for f" loop (and "import time" at the top). That would pace the program so that you don't exceed the Twitter API rate limits, but would also extend the amount of time it takes to run the program (from seconds to potentially hours).
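Here's roughly how that pacing might look - the same loop as above, with the sleep added (and "import time" assumed at the top of the script):

for f in myfollower:
    csvout.writerow([f.screen_name,
                     f.friends_count,
                     f.followers_count,
                     f.statuses_count,
                     f.status,
                     ])
    time.sleep(20.0)   # pause each iteration to stay under the Twitter rate limit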

Just to show you what this would look like, however, I ran it for a while to generate my FFU (Friends / Followers / Updates) spreadsheet for my followers. Here's what it looks like. Column A is the screen_name, B is the friends_count, C is the followers_count, and D is the statuses_count (updates).





Once it's in Excel, you can manipulate the data several ways. For instance, the graph below shows a scattergram of a subset of my followers (yes, you have now been reduced to a dot!). The X-axis represents the number of friends that you have, and the Y-axis represents the number of followers.

As you'd expect, there's a cluster of users near the (100,100) point. And the dots stay close to the 45-degree line, with some outliers.





I hope there's some value in this for you. The graphing is interesting to me, but not very much different than expected. Still, it helps me to understand my followers better, and it gave me something geeky to do.

But mostly, I wanted to share how the Generators work, since that seemed to stump a few people who examined my code. And once again, apologies for putting some people through pain thanks to Blogger dropping my spaces. I'm happy to use pastie or pastebin to post code if it helps. Just ask. @amyiris on Twitter.

It seems like the for loop that generates the CSV could also be simplified using generators, but my head was about to explode at that point. If you know of how to do that, I'd love to read it in the comments! Thanks!

Get to Know Your Followers on Twitter

If you like this post, please ReTweet It.

Now that you have a set of followers, wouldn't it be nice to know a little more about them? In this blog post, I provide some Python scripts that can help you get to know your followers better. With this information, you could tailor your tweets to their interests.

First, not all followers are created equal. Some of your followers might be following thousands of people, and so your tweets are likely to go unnoticed, lost in the mix of tweets from other people. As an aside, you may be asking how I keep up with the tweets of the hundreds of people that I follow. The examples below might give you some hints as to the software I have written to keep up on the statuses of those that I follow.

Personally, I feel that my most important followers are those that have engaged in conversations with me. (Numerically speaking, I give those followers a high "weight" value.) Send me a "custom" Direct Message or better yet, a meaningful @amyiris reply, and I know you are engaged in the discussion. ReTweets also win points in the weighting system! (Here's your chance to increase your value to me!...)

If you like this post, please ReTweet It.

Let's call the remaining followers "silent" followers - they've never engaged directly with me, or said @amyiris in a public Tweet. Even those followers are not created equal. For the sake of argument, I'll assume that someone following me who has 1000 followers is "more valuable" than someone who only has 1 follower. So it may be important to know who those people are, perhaps to aim Tweets at them to try to engage them, and move them out of the "silent" group.

So I decided to write a Python Script to allow me to get to know my followers. First I wanted to find out how many followers each one had. Using the Python Twitter library (with only a slight modification to handle pagination of my followers - since I have more than 100), I came up with the following script:



import csv, math
import twitter as twitterapi
pwd='removed, of course'

api = twitterapi.Api(username='amyiris', password=pwd)
me = api.GetUser('amyiris')
numpages = int(math.ceil(me.followers_count/100.0))
followerpage = (api.GetFollowers(page=x)
                for x in range(1,numpages+1))
myfollower = (y[z]
              for y in followerpage
              for z in range(len(y)))
csvfile=open('followers6.csv','a')
csvout=csv.writer(csvfile,dialect='excel')


for f in myfollower:
    csvout.writerow([f.screen_name,
                     f.followers_count,
                     f.status,
                     ])
csvfile.close()

Note, I have customized my twitter.py file to do this. See http://www.pastie.org/395307 for an updated copy. (I'm not thrilled with this version, so it's not quite ready for Prime Time, but if you are experimenting, you'll want this file, or something like it.) Also see http://www.pastie.org/395445 for a version of this program, since Blogger keeps killing my indentation! Sorry!

Anyway, this gave me a very cool CSV file, that has a row for each follower. Each row contains my follower's screen_name (Column A), how many followers he or she has (Column B), and their current status (Column C). Here's what it looked like in Excel:





Opening that file in Excel, I could manipulate it to get some stats:

My top follower has 39,060 followers of their own.
I have five followers with zero people following them, and five more that only have one.
My average follower has 386 followers of his or her own, which is significantly skewed thanks to a few big ones. The 50th percentile (median) follower has 156 followers of his or her own.

My "reach" is currently about 2000 people. That is, if I tweet something, 2000 people get it. But there are 769,000 people that are one step removed (including duplicates). That is, if I send out a tweet, and in the unlikely event that all 2000 people that I reach ReTweet it, then the ReTweet would be received 769,000 times (some people receiving it multiple times). Interesting, but not necessarily something to celebrate. This is all experimentation.

Once it's in Excel, you can sort and analyze it. Here's my "long tail" graph, showing how many followers each of my followers has. 32 of my followers have more than 2000 of their own followers, but I cut the graph off there, so that it's a little more meaningful:




Now I have access to some really cool information about my followers. For instance, I've captured their most recent tweets. So I can see what's on everyone's mind, at this precise moment.

You've seen those "tag clouds" (here's a sample, from Wikipedia; credit: Markus Angermeier, so you know what I am referring to):



Imagine if I could get a snapshot of what my followers are talking about or thinking about at this very moment.

I modified my program to parse the statuses of my followers, and to provide a short list of the top 25 words that my followers are Tweeting. Here were the results of my first attempt:


Top 25 words in my Followers Statuses, and the frequency with which they appear:
652 the
611 to
522 a
427 I
325 for
308 of
303 and
291 in
257 is
219 on
(etc.)

Obviously this isn't very helpful. What I'd really like to do is compare the frequency of words to a standard word frequency table. Then I could see if a word stands out from expected conversation.

One way to do this would be to maintain a list of commonly tweeted words and their frequencies. Another might be to start with a frequency table downloaded from the internet.

The strategy I took was to create a frequency table by grabbing the 1000 most common words from the Public Timeline, and then to compare the frequency of words in my followers' statuses against that frequency table. Not a perfect solution, but something that can be done rather quickly with the Python Twitter interface.

You can examine the code below. I chose to normalize words, making them lower case (so the words "I" and "i" get counted together), removing apostrophes (so "can't" and "cant" are counted together), and changing all other punctuation and special characters to spaces. Then I ranked words relative to how much more frequently they appear in my followers' statuses as opposed to the public timeline. If a word appears in my followers' statuses but never appeared in the public timeline, I arbitrarily pretend that it appeared .5 times on the public timeline (to avoid a divide-by-zero condition).

The results? Here are some of the interesting keywords, and how much more frequently they appeared in my followers' statuses as opposed to the public timeline (at the moment that I ran the program).

night 4.2x
morning 3.3x
google 2.8x
tomorrow 2.5x
thinking 2.5x
project 2.5x
wow 2.4x
code 2.4x
follow 2.4x
media 2.3x
tweet 2.0x
working 1.9x
facebook 1.9x
twurl 1.7x
sorry 1.7x
software 1.7x
mac 1.7x
iphone 1.7x
rt 1.6x
website 1.6x
twitter 1.6x
python 1.6x
html 1.6x
try 1.5x
reason 1.5x
design 1.5x
awesome 1.4x
firefox 1.3x
feed 1.3x
ruby 1.2x
programming 1.1x
presentation 1.1x


And words that my followers spoke about far less than the public timeline: dragon, headache, stay, god, gym, accountable, aesthetically, reasons, 4corners.

I believe that you could compile a very good profile of your followers, using these methods over time. This small sample tells me something that I suspected. My followers appear to be more likely than the average Twitter user to be interested in Python, Ruby, programming and code projects. They seem to work in the software industry, more so than the average Twitter user. You might also guess that they are apologetic that their code isn't awesome, that they don't think or talk about the gym, god, or being accountable!

The word 4corners on the public timeline is an example of a "hot" topic. Since my followers' tweets may be several hours or even days old, hot topics will not appear with the same frequencies as the public timeline. Apparently at the time of my program execution, there was a link circulating about a fire at the 4corners, and that's how that got picked up as one of the 1000 most frequently used words on the public timeline. This brings up another great way to use this data - informing your followers of hot topics that appear with abnormal frequency on the public timeline.

Here's how my code ended up:

import csv, math, threading, time
import twitter as twitterapi

pwd='removed, of course'
api = twitterapi.Api(username='amyiris', password=pwd)
me = api.GetUser('amyiris')
numpages = int(math.ceil(me.followers_count/100.0))
followerpage = (api.GetFollowers(page=x)
                for x in range(1,numpages+1))
myfollower = (y[z]
              for y in followerpage for z in range(len(y)))
csvfile=open('followers13.csv','a')
csvout=csv.writer(csvfile,dialect='excel')

tagcloud={}
pubcloud={}


def getpublicwords():
    statuses=api.GetPublicTimeline()
    for s in statuses:
        for word in normalize(s.text).split():
            pubcloud[word]=pubcloud.get(word,0)+1


def normalize(text):
    utext=unicode(text).lower()
    utext=utext.replace("'","") # remove contractions
    rtext=""
    for c in utext:
        if ("a" <= c <= "z") or ("0" <= c <= "9"):
            rtext += c
        else:
            rtext += " "
    return rtext


for f in myfollower:
    if f.status:
        fstatus = repr(f.status.text)[2:-1]
    else:
        fstatus=""

    csvout.writerow([f.screen_name,
                     f.followers_count,
                     fstatus,
                     ])
    for word in normalize(fstatus).split():
        tagcloud[word]=tagcloud.get(word,0)+1

csvfile.close()

while len(pubcloud)<1000:
    getpublicwords()
    time.sleep(61.0)

for word,count in tagcloud.items():
    tagcloud[word]=count/float(pubcloud.get(word,.5))


print "Top 25 words in my Followers Statuses:"
for count, word in sorted([(v,k)
                           for k,v in tagcloud.items()],
                          reverse=True)[:25]:
    print count, word




Imagine how much more interesting you can be, if you talk about things that interest your followers!

If you like this post, please ReTweet It.