Every Web Developer should learn a lesson from Twitter, and design a parallel set of web pages that are more machine readable (like the Twitter API). As we move toward the Semantic Web, this will become a business necessity.
The Semantic Web is widely considered to be the next major revolution in Internet technology. The concept is that the web currently has a wealth of information that is designed for humans to read, but there are efforts under way to make that same information meaningful to machines. Two specifications, RDF and Microformats, are leading the way.
In both cases, HTML pages would be modified slightly to include metadata so that machines can make sense of the information that is being presented. For example, if you have contact information on a web page, why not wrap the phone number with an HTML tag that identifies it as a phone number to a web crawler reading your page? Not just A phone number, but YOUR phone number.
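To make the phone number idea concrete, here is a minimal sketch of what a crawler could do with such markup. It assumes hCard-style class names (`vcard`, `tel`); the HTML snippet, the `TelExtractor` class and the `extract_phone_numbers` helper are made up for illustration and are not taken from any particular site or library.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: one phone number is tagged, one is plain text.
SAMPLE_HTML = """
<div class="vcard">
  <span class="fn">Jane Example</span>
  <span class="tel">555-0100</span>
</div>
<p>Call our partner at 555-0199.</p>
"""


class TelExtractor(HTMLParser):
    """Collect the text of every element whose class list includes 'tel'."""

    def __init__(self):
        super().__init__()
        self._depth = 0    # greater than 0 while inside a class="tel" element
        self.numbers = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._depth or "tel" in classes:
            self._depth += 1
            if self._depth == 1:
                self.numbers.append("")  # start collecting a new number

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.numbers[-1] += data


def extract_phone_numbers(html):
    """Return only the phone numbers the page author explicitly tagged."""
    parser = TelExtractor()
    parser.feed(html)
    return [number.strip() for number in parser.numbers]


print(extract_phone_numbers(SAMPLE_HTML))
```

The untagged number in the paragraph is ignored: the crawler only trusts what the page author marked up as a phone number.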
One more technique that could provide value, though there's little discussion about it, is the API approach.
Twitter has done a fantastic job building a very usable API that simplifies programming tasks to access its data. My previous post was an example of the hard way of scraping HTML to find the data that one might be looking for. It's not too difficult, but, as the first commenter pointed out, the API makes this much easier.
The issues with HTML scraping are that there's more setup time (the developer has to examine the HTML and decide how to parse it in advance), and there is less reliability (one change to the underlying HTML can break the scraping program).
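A small sketch of the reliability problem, under stated assumptions: both HTML fragments and their class names are invented for illustration, and `scrape_status` is a hypothetical helper. The pattern is written against the first layout and silently stops matching after a redesign.

```python
import re

# Hypothetical markup for the same status update, before and after a redesign.
OLD_HTML = '<td class="status-body"><span class="entry-content">Hello world</span></td>'
NEW_HTML = '<div class="tweet"><p class="tweet-text">Hello world</p></div>'

# Pattern written by inspecting OLD_HTML in advance (the setup time).
STATUS_PATTERN = re.compile(r'<span class="entry-content">(.*?)</span>')


def scrape_status(html):
    """Return the status text, or None when the expected markup is missing."""
    match = STATUS_PATTERN.search(html)
    return match.group(1) if match else None


print(scrape_status(OLD_HTML))  # works against the layout it was written for
print(scrape_status(NEW_HTML))  # same data, new markup: the scraper finds nothing
```

The data never changed, only the presentation did, and that alone was enough to break the program.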
A much better way to accomplish this task is through a well defined API. In a sense, this is what RDF and Microformats define - a predictable way to extract information from a web page.
Twitter could have used the same strategy as RDF and Microformats - they could have produced an API specification that said "If you want to programmatically interface with us, here is what you can rely on, as far as our HTML". And that would have achieved the objective.
Instead, they created a dead-simple API. If you want Twitter user data or status updates, hit a certain web page with a certain parameter, and you'll get back your data in XML or JSON format (or Atom or RSS). For example:
http://twitter.com/statuses/public_timeline.format
http://twitter.com/statuses/public_timeline.xml
This example link gives the public timeline in XML format.
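Here is a rough sketch of how a client might use that URL pattern. No network request is made: `SAMPLE_XML` is an invented stand-in for a response, and its element names (`statuses`, `status`, `text`, `user`, `screen_name`) are assumptions for illustration, as are the `timeline_url` and `parse_statuses` helpers.

```python
import xml.etree.ElementTree as ET

BASE_URL = "http://twitter.com/statuses/public_timeline"


def timeline_url(fmt):
    """Fill the .format placeholder with a concrete format such as 'xml' or 'json'."""
    return f"{BASE_URL}.{fmt}"


# Invented sample response; the element names are assumptions, not real output.
SAMPLE_XML = """<statuses>
  <status>
    <text>Hello world</text>
    <user><screen_name>example_user</screen_name></user>
  </status>
  <status>
    <text>Second update</text>
    <user><screen_name>another_user</screen_name></user>
  </status>
</statuses>"""


def parse_statuses(xml_text):
    """Return (screen_name, text) pairs from a timeline document."""
    root = ET.fromstring(xml_text)
    return [
        (status.findtext("user/screen_name"), status.findtext("text"))
        for status in root.iter("status")
    ]


print(timeline_url("xml"))
for name, text in parse_statuses(SAMPLE_XML):
    print(f"{name}: {text}")
```

Compared with scraping, there is nothing to reverse-engineer: the field names are the contract, and the page layout can change freely without breaking the client.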
The API documentation provides all you need to get started accessing data via the API.
Astute web developers will examine this design pattern and apply it to their business. What data do you have on your website that your customers are trying to get access to? Your web pages allow humans to read it, but the next step is to enable machines to get at it more easily.
Sure, machines can scrape your HTML, but why not meet them halfway? RDF and Microformats are a great step. But another simple step would be to provide an API.
If you're a developer, and you want to see the magnitude of this issue, simply go to any retailer's web site (Best Buy, Staples, Wal-Mart, etc.), and try to build an HTML scraper that grabs the product number, name, description and price of, say, every product that contains the word "television".
Let this be a call to every web developer in every major corporation! Build an API! You want to increase your online sales? Enable machines to be able to find your products or any other information that you have available. Enable machines to be able to search for your products, and buy your products. Look to the Twitter API as an example of simplicity!
Do NOT hide your shopping experience within a maze of session-dependent form posts. Works great for humans, but not for machines!
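As a sketch of the alternative, here is what a stateless product lookup could look like: one URL with one parameter in, JSON out, no session required. Everything here is hypothetical (the `/products.json` path, the `q` parameter, the catalog entries and the `handle_request` function); a real retailer would put this behind a web server and a database.

```python
import json
from urllib.parse import urlparse, parse_qs

# Hypothetical catalog standing in for a retailer's product database.
PRODUCTS = [
    {"sku": "TV-100", "name": "42-inch Television", "price": 499.99},
    {"sku": "TV-200", "name": "Television Wall Mount", "price": 39.99},
    {"sku": "PR-300", "name": "Laser Printer", "price": 129.99},
]


def handle_request(url):
    """Answer a GET-style URL such as /products.json?q=television with a JSON string."""
    parsed = urlparse(url)
    if parsed.path != "/products.json":
        return json.dumps({"error": "not found"})
    # Everything the request needs is in the URL; no cookies or form state.
    query = parse_qs(parsed.query).get("q", [""])[0].lower()
    matches = [p for p in PRODUCTS if query in p["name"].lower()]
    return json.dumps(matches)


print(handle_request("/products.json?q=television"))
```

A machine can answer the "every product that contains the word television" question with a single request, which is exactly what the scraping exercise above makes so painful.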

8 comments:
>why not wrap the phone number with
>an HTML tag that identifies it as a
>phone number to a web crawler
>reading your page. Not just A phone
>number, but YOUR phone number.
because if you do this a web crawler can identify my phone number...
a whole new way of simplifying telephone marketing / spam...
As good as it is to have data machine readable, it also means easier ways to abuse the data.
It makes a difference whether Twitter provides an easy way to access tweets or whether there is an easy way to access personal information.
You say:
One more technique could provide value, though, that there's little discussion about, is the API approach.
But I don't think that's true at all. Let me clarify: I think it's true that it can provide value, but I think it's untrue that there's very little discussion about APIs.
If anything, I think the first two methods you described are the "old way" and APIs are the "new way". The issue with a web developer designing metadata for their own web content is just that: they're developing it, meaning that they are going to mark it up however they want. They might denote a phone number as "phone" whereas some other developer on another site uses "telephone", and another one uses "mobile". There's no standard, and scraping through the source code to figure out what the standards are would be a nightmare.
The new/better way to do it would be to use an API that is either publicly documented, or universally approved of (I doubt that will happen any time soon though!) where you say "hey if you want their phone number, ask for it and I'll give you the value and you call it whatever you want".
Abstracting the data out this way makes it easier for everyone. Kind of how XML is a markup scheme that allows you to format the data any way you want. And if you want to give that data to someone else, you just give them the documentation saying "this means that" and they do what they want with it.
I have to agree with Timo above - opening up your API is not always a good thing, and can be downright bad (think about selling ads on your page - if you open up your API, someone might easily hijack it by yanking it through the API and providing the data on his own). Think about affiliate programs - you scrape the data, bypass affiliate links and ruin a business model.
Sounds paranoid? How about the alexa - statsaholic lawsuit? And that was just a very early bird, primitive example of the problem above.
Of course there are cases when an API is desirable (basecamp, lighthouse, github come to my mind - but these are all services, enabling you to be more effective through their API). There are non-contrived examples of non-service-only sites too, but it's really not trivial to decide what to give out in that case.
From a business perspective there are a few hiccups with building a useful API: if you are scrapable, there is no need to visit your site and be annoyed with millions of ads, and you cannot track consumer behavior, just to name a few. Doesn't sound like a great deal to me ;)
Thanks for the comments.
Peter, Timo -
You make great points. My feeling is that IF a company has information that is valuable to their customer base, then they should make it available in many formats.
Let's take an example: flight scheduling information. There are several sites that have this information (Travelocity and Expedia, to name two).
Whichever of those sites comes out with an API first will capture the developers. And the developers will provide mechanisms to buy from THAT provider. So it's a race.
The company that is stingy about their information will lose market share to the company that provides an API.
Here's one link (of many) where developers are looking for this information: http://www.ruby-forum.com/topic/83722
>The issue with a web developer
>designing meta data for their own
>web content is just that, they're
>developing it.[...]They might
>denote a phone number as "phone"
>whereas some other developer on
>another site uses "telephone",
>where another one uses "mobile".
That can be (mostly) solved by microformats. However, a more serious problem is that spam and porn sites will deliberately mistag their information. So when you're looking in a phone directory, you'll see ads for Viagra that were tagged as phone numbers.
The semantic web is trying to make information easier to understand for machines by adding extra tags. But that's just an extension of our current approach - remember meta tags? - and if that approach worked, Google wouldn't be in business.
The only realistic way for machines to better understand text is to improve our technology for reading and extracting meaning from natural language.
The semantic web is a nonstarter. It's an academic exercise that, for some reason, has had incredible staying power in the media.
APIs aren't the silver bullet either. They improve automated collection of structured data, but the vast majority of the information online is in natural language - the very opposite of structured data. And there we run into the same problem - machines don't understand natural language.
The answer is to teach the machines to read. It's a hard problem, but - look at Google - even small improvements make a big difference.
Great points! I would add that besides making the API, make sure that data is available as RDF and also as Linked Data. This will help bootstrap the semantic web and view the web as a global database!