Download the Internet Index for free!

Well, kind of… dotnetdotcom.org aka Dotbot

You might have noticed a new little critter gnawing at your web server:

Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)

A few Seattle-based guys (they use pseudonyms on their web site) promise an index of the web that is available to everybody. They intend to release as much information about the web’s structure (linking) and content as possible, though for a fee to cover their costs.

Seomoz Linkscape

The entity behind the pseudonyms is actually the SEO company Seomoz. One of their best-known products is the domain evaluation service Linkscape. Linkscape builds its own version of a web link graph, collecting and computing how all web sites relate to each other. It is a nice service for webmasters, but it will also provide quite a bit of information to your competition.

The best way to lock the Linkscape bot out of your site is an entry in your robots.txt control file. The file should start with the bots you want to lock out completely. For Linkscape you would include the following lines:

User-agent: dotbot
Disallow: /
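
If you want to check that such an entry does what you expect, Python's standard urllib.robotparser module can evaluate it locally. The sketch below is only an illustration: the helper name is_allowed and the example.com URL are made up for this example, and the parser is given the short token "DotBot" rather than the full user agent string. It tests the rule itself and says nothing about whether the real dotbot honours it.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt entry from above, as a list of lines.
ROBOTS_TXT = """\
User-agent: dotbot
Disallow: /
""".splitlines()


def is_allowed(user_agent, url, robots_lines=ROBOTS_TXT):
    """Return True if robots_lines permit user_agent to fetch url.

    is_allowed is a hypothetical helper written for this example only.
    """
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://www.example.com/some/page.html"  # placeholder URL
    print("DotBot   :", is_allowed("DotBot", page))
    print("Googlebot:", is_allowed("Googlebot", page))
```

With only these two lines in the file, the dotbot is refused everywhere while bots that are not mentioned remain allowed.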

Note that Linkscape promotes a different way of locking out their bot: including a “noindex” meta tag in the head section of a web page. Alas, this variant will not prevent the dotbot from crawling your site and extracting its links. It will only prevent the robot from storing the content of your site for its search engine.

The Dotbot Technology

The guys describe their tools in a joking tone: C and Python as programming languages, flat disk files instead of a database system, and some open source software. That’s saying nothing in an elaborate way, of course.

Downloadable Index

The current dotbot index is available for download to everybody. Each record in the index file follows the structure "URL-Without-Protocol NULL Optional-String-Not-Used NULL Complete-HTTP-Response NULL", where NULL is the zero byte. So this is really a web dump rather than a searchable web index; the sorting, filtering, and indexing still have to follow. I do wonder why the protocol is omitted when, on the other hand, the complete HTTP response is kept.
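
Reading such a file takes only a few lines. The following Python sketch walks through a dump one record at a time without loading all of it into memory. The function names iter_fields and iter_records are my own and not part of any dotbot tooling, the demo data is invented, and the code assumes that no field contains a zero byte itself, which the format description does not guarantee for binary responses.

```python
import io


def iter_fields(stream, chunk_size=1 << 20):
    """Yield each zero-byte-terminated field from a binary stream."""
    buffer = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last zero byte is complete; keep the rest.
        *fields, buffer = buffer.split(b"\x00")
        yield from fields
    # Bytes after the final zero byte (an unterminated field) are ignored.


def iter_records(stream, chunk_size=1 << 20):
    """Yield (url, unused, response) tuples, three fields per record."""
    fields = iter_fields(stream, chunk_size)
    # Zipping one iterator with itself groups the fields in threes;
    # an incomplete trailing group is dropped.
    for url, unused, response in zip(fields, fields, fields):
        yield url.decode("ascii", "replace"), unused, response


if __name__ == "__main__":
    # Invented two-record dump standing in for the real index file.
    demo = (b"www.example.com/\x00\x00HTTP/1.1 200 OK\r\n\r\nfirst\x00"
            b"www.example2.com/\x00\x00HTTP/1.1 200 OK\r\n\r\nsecond\x00")
    for url, _unused, response in iter_records(io.BytesIO(demo)):
        print(url, len(response), "bytes")
```

For the real file you would pass a file object opened in binary mode ("rb") instead of the BytesIO stand-in. Since the protocol is not stored, a consumer has to decide for itself whether to prepend "http://" to the URL.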

As of the end of January 2009, the index covers only a tiny fraction of the web. It comprises about 9 million pages, adding up to an index file size of 68 GB. You can find a link to download it on their site (see the weblink below).

Sample Dump

The example consists of two URLs:

	www.example.com/  HTTP/1.1 200 OK
	Date: Sat, 20 Sep 2008 15:43:15 GMT
	Server: Apache/2.0.52 (CentOS)
	X-Powered-By: PHP/4.3.9
	Content-Length: 557
	Connection: close
	Content-Type: text/html; charset=UTF-8			

	<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
	<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
	<head>
	<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
	<title>I am an example.</title>
	</head>
	<body>
	...
	</body>
	</html> www.example2.com/  HTTP/1.1 200 OK
	Date: Sat, 20 Sep 2008 15:43:15 GMT
	Server: Apache/2.0.52 (CentOS)
	X-Powered-By: PHP/4.3.9
	Content-Length: 557
	Connection: close
	Content-Type: text/html; charset=UTF-8			

	<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
	<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
	<head>
	<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
	<title>I am a different example.</title>
	</head>
	<body>
	...
	</body>
	</html>
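
Since each record keeps the complete HTTP response, the headers and the page body can be separated again at the first blank line. Here is a minimal Python sketch of that step. The name split_response is hypothetical, the code accepts both CRLF and bare LF line endings because I do not know which the dump uses, and repeated header names simply overwrite each other, which is good enough for a quick look but not for careful HTTP processing.

```python
import re


def split_response(raw):
    """Split a raw HTTP response into (status_line, headers, body).

    raw is a bytes object; headers come back as a dict keyed by
    lower-cased header name, the body stays as bytes.
    """
    # The first blank line separates the header block from the body.
    parts = re.split(rb"\r?\n\r?\n", raw, maxsplit=1)
    head = parts[0]
    body = parts[1] if len(parts) > 1 else b""

    lines = head.decode("iso-8859-1").splitlines()
    status_line = lines[0] if lines else ""
    headers = {}
    for line in lines[1:]:
        name, colon, value = line.partition(":")
        if colon:  # skip anything that is not "Name: value"
            headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


if __name__ == "__main__":
    # Shortened version of the first sample record above.
    sample = (b"HTTP/1.1 200 OK\r\n"
              b"Server: Apache/2.0.52 (CentOS)\r\n"
              b"Content-Type: text/html; charset=UTF-8\r\n"
              b"\r\n"
              b"<html><head><title>I am an example.</title></head></html>")
    status, headers, body = split_response(sample)
    print(status)
    print(headers.get("content-type"))
    print(len(body), "bytes of body")
```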

Weblinks

  • Dotbot, including a link to their Index (66 GB, torrent)

