Tags
ESP, Han Jongstra, html crawling, http://, linkedin, Netwiz, sdxp://, search index, search result quality, xml
Enterprise Search is rapidly gaining popularity and the number of customers of ES products is growing. Every day more customers acquire ESP systems and start indexing their internal documents. Many of these companies will also see the advantage of competitive intelligence and information access and will start crawling external business websites. This blog post shows some of the consequences and disadvantages of indexing public websites outside the firewall.
Dutch companies can’t wait to have their own search index of government sites, YouTube, social networking sites (like www.hyves.nl) and financial and news sites. When ESP becomes a common product:
- growing amounts of duplicated data fetched for each index will consume the available bandwidth at every scheduled interval;
- websites will drop in performance because of the number of crawlers creating and updating their indexes;
- datacenters and co-location facilities will run out of space because of the many servers needed to restore website performance and to store the large ESP indexes;
- most importantly, the quality of the search results is simply awful when HTML is used to index a website.

screenshot from the site webwereld.nl
Example
As the screenshot of a Dutch news website on the right side shows, only one third of the page is real content. The rest consists of ads, highlights of other interesting articles and comments by users. In my opinion we do want to index this, but it should not affect the quality of the search results by adding irrelevant keywords and metadata to the content of the page.
We need a new protocol
My goal is to create an API on the information provider side, so that local indexes are no longer needed and the quality of the search results is based on the entities in the database of the website. The API should return the content in an XML format, so the indexing engine can distinguish between the entities or real content of the article on the one hand and the ‘extra’ information on the other.
A new standard will be developed that enables enterprise search in the future and guarantees the quality of the search results. One option is to introduce this search standard (next to http://) through a new protocol, for example sdxp:// (Search Data Xml Protocol). With this protocol the crawler no longer has to use HTML to index a website.
Why a new protocol? It avoids the endless discussion about using a subdomain or paths in the URL.
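To illustrate why a separate scheme sidesteps the subdomain-or-path discussion, here is a minimal sketch in Python. It assumes (this is my assumption, nothing has been specified yet) that a page keeps exactly the same host, path and query under sdxp:// as under http://. The function name to_sdxp and the example.com address are made up for the example.

```python
from urllib.parse import urlsplit

def to_sdxp(http_url):
    """Return the hypothetical sdxp:// address for an http(s):// page URL."""
    parts = urlsplit(http_url)
    if parts.scheme not in ("http", "https"):
        raise ValueError("expected an http:// or https:// URL")
    # Assumption: host, path and query stay identical; only the scheme differs.
    return parts._replace(scheme="sdxp").geturl()

page = "http://www.example.com/news/2009/article.html"
print(to_sdxp(page))  # sdxp://www.example.com/news/2009/article.html
```

Under this assumption a crawler that knows the normal address of a page can derive its data address without any site-specific convention.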
XML example
The image below shows an example of the XML structure. As you can see, the XML file contains only raw data that the indexer can parse.

XML example
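As a rough sketch of how an indexer could consume such a file, the Python snippet below parses a made-up response and keeps the real content apart from the ‘extra’ information. The element names (document, content, extra, title, body, comment, related) are assumptions for this example only and are not taken from the image or from any existing standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical sdxp:// response; every element name here is an assumption.
SAMPLE = """<document>
  <content>
    <title>Example headline</title>
    <body>Main text of the article.</body>
  </content>
  <extra>
    <comment>A reader comment.</comment>
    <related>Another interesting article</related>
  </extra>
</document>"""

def split_fields(xml_text):
    """Return (content, extra): the article fields and the surrounding items."""
    root = ET.fromstring(xml_text)
    content = {child.tag: (child.text or "").strip()
               for child in root.find("content")}
    extra = [(child.tag, (child.text or "").strip())
             for child in root.find("extra")]
    return content, extra

content, extra = split_fields(SAMPLE)
print(content["title"])  # Example headline
print(len(extra))        # 2
```

Because both groups arrive separately, the indexer can still store the comments and related links, but give them a lower weight than the title and body.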
HTML is not sufficient
HTML is intended to be a markup language, so a document contains a lot of markup. But because enterprise search indexes plain text, the crawler treats the markup and all other content on the page as connected to the main content of the page. This is not always the case.
Also, HTML is often badly written and tags are not always properly closed, so HTML tags can’t be trusted when parsing the content. The use of e.g. JavaScript and CSS makes this problem worse.
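The small Python sketch below shows the effect. It strips a made-up page down to plain text with the standard html.parser module, roughly the way a naive crawler might. The page, its class names and the TextExtractor class are invented for the example; note the unclosed div and p tags.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect every piece of visible text, ignoring which tag it came from."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

# Made-up page: an ad, the article, and a comment, with tags left unclosed.
PAGE = """<html><body>
<div class="ad">Buy cheap laptops now
<h1>Government opens new data portal</h1>
<p>The portal publishes public datasets.</p>
<div class="comments"><p>First!</div>
</body></html>"""

extractor = TextExtractor()
extractor.feed(PAGE)
extractor.close()
indexed_text = " ".join(extractor.chunks)
print(indexed_text)
```

The ad text and the comment end up in the same indexed text as the article, and because the ad div is never closed, the tag structure does not tell the indexer where the ad stops.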
Keywords and meta-information
Naturally the quality of the search results is still based on the use of keywords and metadata. So the ESP API on the information provider side needs some intelligence for tagging the articles and extracting meta-information.
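What that intelligence looks like is an open question. As a deliberately naive stand-in, the Python sketch below tags an article with its most frequent words. The stop word list, the function name extract_keywords and the sample sentence are all assumptions for the example, and a real provider would probably use something smarter.

```python
import re
from collections import Counter

# Tiny illustrative stop word list; a real one would be much longer.
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is", "for", "on"}

def extract_keywords(text, limit=3):
    """Return up to `limit` words, most frequent first, skipping stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(limit)]

article = "Search index quality depends on the search index. Search matters."
print(extract_keywords(article, limit=2))  # ['search', 'index']
```

The point is only that the provider runs this step on the clean article text from its own database, not on a page mixed with ads and comments.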
Robots.txt
To deal with the performance loss caused by crawlers, and for reasons of information ownership, more sites will use a robots.txt file. Crawling will then be prohibited and no longer possible for crawlers that respect the file.
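For a crawler this check is simple. The Python sketch below uses the standard urllib.robotparser module on two made-up robots.txt files, one that blocks everything and one that blocks only a private section. The crawler name and the example.com URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Two made-up robots.txt files for illustration.
BLOCK_ALL = """User-agent: *
Disallow: /
"""

BLOCK_PRIVATE = """User-agent: *
Disallow: /private/
"""

def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

url = "http://www.example.com/news/article.html"
print(is_allowed(BLOCK_ALL, "ExampleEnterpriseCrawler", url))      # False
print(is_allowed(BLOCK_PRIVATE, "ExampleEnterpriseCrawler", url))  # True
```

With the first file a well-behaved ESP crawler has nothing left to index, which is exactly the situation where a provider-side API becomes the only way in.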
System overview
The image below shows the system overview. Next to the regular http:// protocol another protocol is shown, the sdxp:// protocol. This interface returns the website in a format that is not meant for human readers, just plain XML.

system overview
Who is going to use this? The idea is that this becomes a standard, for example one adopted by the W3C.
Information
I was also inspired to write this article when I saw the video “Tim Berners-Lee: The next Web of open, linked data”.