The "Semantic Web" could have happened last decade

I've occasionally referred to the web as a bunch of "text crap." To see what I mean, simply look at the source to this or any other web page. It is indeed a bunch of text crap.

Because the web involves sending blocks of poorly formatted text around, it leads to all manner of problems: "innovations" like cookies and style sheets and JavaScript and all the rest.

For an example from one of the organizations that perpetrated the web, see this:

...I have had a lot of success lately using XSLT to screen-scrape RDF out of XHTML pages...

That is, quite simply, absurd. Because the W3C designed the web the way it is, their home page is just one big text file with no real "semantic" information readily available. To get some of that semantic information out of their big blob of text, they're forced to screen-scrape, as one would when dealing with a dinosaur-era mainframe system. Nevertheless, to people who don't know any better this might seem "advanced." It is not.

There's another, much more interesting and useful example of screen-scraping here. In a way, these are similar to the village elders mandating that everyone throw their garbage into one big pit, and then, since some of that garbage might be useful, employing some villagers to go into the pit looking for useful items.

All of this screen-scraping is designed to facilitate the "semantic web", the newest invention from the same people who brought you the web. Using the second case as an example, the idea would be to define what exactly a "Senator" is, then publish a list of Senators in a form that could be readily machine-processed. For instance, if you wanted to send an email to each Senator, in the current system you'd have to visit each Senator's page and find an email address or contact form. With the semantic web, you could directly get a list of Senators and their email addresses. You could also get other information, like their birthdays, or a list of Senators from 1989, or whatever. And with the semantic web, none of this screen-scraping would be necessary.
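To make that concrete, here's a minimal sketch in Java of what "readily machine-processed" would mean. The Senator class, its fields, and the hard-coded list (standing in for a machine-readable feed) are all my own hypothetical illustrations, not anything the semantic web actually specifies:

import java.util.List;

// Hypothetical sketch: if Senators were published as structured data,
// clients could work with typed records instead of scraped text.
class Senator {
    final String name;
    final String email;
    final int firstYearInOffice;

    Senator(String name, String email, int firstYearInOffice) {
        this.name = name;
        this.email = email;
        this.firstYearInOffice = firstYearInOffice;
    }
}

public class SenatorDirectory {
    public static void main(String[] args) {
        // In a real setup this list would come from a machine-readable
        // feed; it's hard-coded here purely for illustration.
        List<Senator> senators = List.of(
            new Senator("Jane Doe", "jane.doe@senate.example", 2001),
            new Senator("John Roe", "john.roe@senate.example", 1987));

        // "Send an email to each Senator" needs no scraping:
        senators.forEach(s -> System.out.println(s.email));

        // "A list of Senators from 1989" is just a filter:
        senators.stream()
                .filter(s -> s.firstYearInOffice <= 1989)
                .forEach(s -> System.out.println(s.name));
    }
}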

As part of the semantic web, they define "ontologies", which describe what constitutes various fields of knowledge (baseball, chemistry, etc.) and the interrelationships of those items. For instance, here's an ontology for baseball, and here's one for people. (I'll note that an Error is not a kind of fielding play, and Male is not a Subclass of Animal, so there already seem to be problems a-brewing in the semantic web).
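To see the complaint in object-oriented terms, here's a rough Java rendering of the disputed subclass claims (the class names are my own illustration; I've used FieldingError rather than Error to avoid clashing with java.lang.Error):

// Hypothetical rendering of the subclass claims in question. Writing
// them as "extends" makes the problem visible: a subclass inherits
// everything true of its parent.
class FieldingPlay { }
class FieldingError extends FieldingPlay { }  // the ontology's claim; but an
                                              // error is arguably the absence
                                              // of a proper play, not a play

class Animal { }
class Male extends Animal { }  // the ontology's claim; but "male" is arguably
                               // a property of things, not a kind of Animal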

Could all of this have been avoided, and could we have had the semantic web from the get-go? The web didn't have to be a bunch of poorly formatted text files, it could have been something infinitely better.

Let's imagine that the web had been designed by people familiar with object-oriented design. Suppose there were no such thing as the web, and various people had been shown a mock-up of this blog's main page. Some of those people might suggest using some variant of SGML to send blocks of text around, which would then be parsed by a text-processing program that would display the page. I.e., the current system.

However, an object-oriented designer would see the mock-up as a series of objects. Let's call the design he would have made the "OOWeb". With the OOWeb, you have a Site object that contains a series of Post objects. Each of those Post objects contains certain things and has certain characteristics:

CreationDate
Title
PostText
Comment[] (the "[]" means a series of Comment objects)
Trackback[]
Author

Other objects that the Post object contains are less visible: copyright information, revision history, other objects that reference the Post, etc. etc.
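Here's a minimal sketch of that design in Java. Every name and type here is illustrative; no such API ever existed:

import java.time.LocalDate;
import java.util.List;

// Illustrative sketch of the hypothetical OOWeb Post object.
class Post {
    LocalDate creationDate;
    String title;
    String postText;
    List<Comment> comments;
    List<Trackback> trackbacks;
    Author author;

    // The less visible contents mentioned above:
    String copyright;
    List<Revision> revisionHistory;
    List<Post> referencedBy;
}

class Author { String name; String email; }
class Comment { Author author; String text; }
class Trackback { String sourceUrl; }
class Revision { LocalDate date; String description; }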

When someone wants to see my blog, they send a request to my server. In the current case, they get a big blob of text back. In the OOWeb case, they'd get a series of objects. Their browser would then display those objects just as in the case of the web.

However, with the OOWeb the semantic web would have been built in. If you have a Post object, you could ask it for its Author object, then query that Author object for other things, like the Author's email. No screen-scraping required.
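Continuing the hypothetical sketch above, that query is plain object traversal:

class SemanticQueries {
    // Getting an author's email is a field access on the object
    // graph, not a screen-scrape. (Names continue the sketch above.)
    static String authorEmail(Post post) {
        return post.author.email;
    }
}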

After visiting my blog, you would have a cache of Post objects on your computer. You could look at a list of those Post objects, or browse through them looking for Posts with comments. You could directly tell a Post object to monitor its counterpart on my server, letting you know when someone had replied to that Post. In effect, you'd have a series of intelligent "agents" on your computer that could maintain a relationship with their original versions on my server.
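A rough sketch of such an agent, again with hypothetical names; the mechanism for fetching the server's copy is left abstract:

import java.util.Timer;
import java.util.TimerTask;

// Illustrative sketch: a cached Post polls its counterpart on the
// origin server and reports new comments.
interface PostFetcher {
    Post fetch(String title);   // hypothetical remote lookup
}

class PostMonitor {
    private final Post cached;
    private final PostFetcher fetcher;

    PostMonitor(Post cached, PostFetcher fetcher) {
        this.cached = cached;
        this.fetcher = fetcher;
    }

    void watch(long intervalMillis) {
        new Timer(true).scheduleAtFixedRate(new TimerTask() {
            @Override public void run() {
                Post latest = fetcher.fetch(cached.title);
                if (latest.comments.size() > cached.comments.size()) {
                    System.out.println("New reply on: " + cached.title);
                }
            }
        }, intervalMillis, intervalMillis);
    }
}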

Each Post object could be displayed using a standard display object, or you could use a third-party display object or write your own. Let's say you want to see the copyright notice for each post in your browser. You'd modify the code that displays Post objects to get the copyright information from the Post and then display that. The same goes for things like translating the text of a Post, or displaying the Post in large print or special fonts or on special devices, or whatever else. All of these things would be fairly easy with the OOWeb. With the current web, all of them are problematic and would involve some form of screen-scraping or parsing or other techniques fraught with the possibility of error. (I can already hear the complaints from web-supporters: 7-bit gateways, security, Java didn't exist when the web was invented, editing would have been problematic, binary formats are Micro$oftian, etc. etc. There are answers to each of these objections, some of which are answered here. Overcoming all of them would have been easier than the current situation.)
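A sketch of that pluggable-display idea, continuing the illustrative Java types from above:

// Illustrative sketch: display objects are interchangeable, so adding
// a copyright notice means swapping renderers, not re-parsing text.
interface PostDisplay {
    String render(Post post);
}

class StandardDisplay implements PostDisplay {
    public String render(Post post) {
        return post.title + "\n" + post.postText;
    }
}

class CopyrightDisplay implements PostDisplay {
    public String render(Post post) {
        return post.title + "\n" + post.postText + "\n" + post.copyright;
    }
}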

Now, perhaps the very fact that the web was so horribly designed led directly to its success. And its horrible design has certainly made a lot of money for a lot of people as they've invented "brilliant" workarounds like cookies, JavaScript, and all the rest.

Nevertheless, there was a better way, and it's too bad it wasn't explored at the time instead of the current mess.

(Additional information on the semantic web can be found here, here, here, and here.)

Comments

You seem to fall into the common pitfall of the OO designer: thinking that objects suddenly make the world tractable. Some problems have a high inherent complexity, and the representation of human speech is one of them. Let me just try to explain why I don't think your idea would have worked:

1. Need for a shared ontology. You need a central authority to designate the semantics of an ontology that covers the whole world of human speech; the last time this was tried was in the 18th century, by Wilkins, with very mixed results.
2. Overhead. For every different application that you would want to run on -- let's call it -- the objectual web, you'd have to write the semantics and behaviour of the objects. With the -- let us call it -- syntactic web, you merely have to fit your structure into the existing mechanisms for presenting information.
3. Detail. There are things that are hard to input on a form, and there are things that the objectual web would have plenty of difficulty with, simply because they would have to be either 1) manually declared by the content creators or 2) parsed as in the syntactic web. Example: say I have a table containing senators and their e-mail addresses. From an objectual standpoint, this is data-structurally a table. You can't deduce from the formal behaviour of the table that it contains, as it does, senators. Therefore, how would you find a list of senators? You might raise the following objection: a table could have a field -- or, in OO-speak, a public object -- which would describe the type of content it holds (see the sketch after this list). To make this work, though, you'd have to manually set the value of the "contains" object every time you input a table. Maybe you're diligent enough to do it, but I doubt many people are.
4. Granularity. How granular should the object tree be? Do we go down to syntactic parsing of the sentences? If so, how do we automate this process?
5. Trust. There are two main issues of trust. 5.1) Running untrusted code: even with Java virtual machines around, this hasn't been entirely solved. 5.2) Trust in the content creator to truthfully describe his data, instead of -- as it would be in his rational best interest -- untruthfully using objectual tricks to improve search rankings or to otherwise deceive the software agents that would depend on objectual information in order to do their work.
6. Objects are text. Yes, they are. Since the Church-Turing thesis, we define computation as the carrying out of a finite number of steps on a string of symbols. That is text. You might think of it in a different way, which is often a better way -- abstraction sometimes helps in seeing the high-level problem areas -- but in the end all computation is done on a strip of ones and zeroes by a read/write head with a finite state machine on it. So, all the things that can be accomplished with objects can be accomplished with text, provably so.
7. Editing and human readability. Human-readable files have a long history in the Unix tradition, and for very good reasons. There exist many tools capable of performing arbitrary computation on them -- see above for text = objects -- and they are easy for humans to modify, as opposed to requiring heavyweight editors. In addition, textual tools are as a rule more transparent than their non-textual counterparts -- an area in which I consider OO design misguided is its preference for loosely coupled systems over transparent systems, especially when transparency does not preclude loose coupling -- and this is a good thing for too many reasons to go into detail here.
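(A sketch of the objection in point 3, in Java; every name here is hypothetical:)

import java.util.List;

// Structurally this is just a table of string pairs. Nothing in the
// type says it holds senators unless the creator declares it by hand.
class Table {
    List<String[]> rows;
    String contains;   // must be set manually, e.g. "senators";
                       // most content creators won't bother
}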

In conclusion, I think the objectual web would partake of the same problems as the semantic web -- or more -- and it would bring forth complications absent from the syntactic web.