This blog post was published on June 26, 2012.
Web Crawler in JavaScript
A web crawler is an automated program, or script, that scans (or crawls) through Internet pages to create an index of relevant information on those pages, most commonly the hyperlinks. Applied recursively, a crawler follows the hyperlinks it finds on each page to reach further pages, and so can crawl across the whole Internet. These programs are usually made to be used only once, but they can be programmed for long-term usage as well. A typical use of such programs is building a search engine.
Generally, these crawlers are written on the server side, mostly in programming languages like Java, Python, PHP, etc. This is because on the client side, as in JavaScript, requesting a URL on a domain other than the host of the current page violates what is known as the Same Origin Policy.
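To make the restriction concrete, here is a minimal sketch of how two URLs can be compared by origin, meaning scheme, host and port together. The helper name isSameOrigin and the example URLs are my own assumptions for illustration; the browser performs this check internally and does not expose a function by that name.

```javascript
// Hypothetical helper (not part of any library): two URLs share an origin
// only when scheme, host and port all match.
function isSameOrigin(urlA, urlB) {
  // URL is the standard URL class available in modern browsers and Node.js.
  return new URL(urlA).origin === new URL(urlB).origin;
}

// Same scheme, host and port; only the path differs.
console.log(isSameOrigin('http://example.com/a', 'http://example.com/b'));
// Different host: a script on example.com asking for google.com.
console.log(isSameOrigin('http://example.com/', 'http://google.com/'));
// Different scheme on the same host also counts as a different origin.
console.log(isSameOrigin('http://example.com/', 'https://example.com/'));
```

A page served from example.com that asks for http://google.com/ falls into the second case, which is why a plain Ajax request from the browser is refused.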
A very good algorithm and its implementation in Java are featured here.
However, this can be done in JavaScript using James Padolsey's jQuery plugin, which makes cross-domain requests with the help of YQL.
First let’s create a wrapper for our display. This can be a very simple HTML page.
<!DOCTYPE html>
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  <title>Cross-Domain Ajax Demo</title>
  <script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.4/jquery.min.js"></script>
  <script type="text/javascript" src="js/jquery.xdomainajax.js"></script>
  <style>
    body {
      font: 10px "Lucida Sans Unicode";
    }
    #links {
      font: 20px "Lucida Sans Unicode";
    }
  </style>
</head>
<body>
  <div id="wrapper">
    <p>Fetching all the links from Google Homepage</p>
  </div>
  <div id="links"></div>
</body>
</html>
Download the jquery.xdomainajax.js plugin from its GitHub page and place it where the page expects it (js/jquery.xdomainajax.js in the markup above). The code is self-explanatory, and I recommend going through it to watch the powerful YQL in action.
Now all that is left is to add our own script to fetch all the links from the Google homepage. I am writing a simple script for demonstration purposes. Readers are free to improve on it and write recursive code to make a real web crawler of their own.
$.ajax({
    url: 'http://google.com/',
    type: 'GET',
    success: function(res) {
        // res.responseText holds the HTML of the fetched page
        $(res.responseText).find('a').each(function(idx, item) {
            var url = $(this);
            // Show each link on its own line inside #links
            $('#links').append(url);
            $('#links').append('<br />');
        });
    }
});
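The script above handles a single page. For readers who want to try the recursive version, here is a rough sketch of the idea, not a finished crawler. To keep it runnable without a network, the fetch-and-extract step is replaced by a made-up lookup table: fakeWeb, getLinks and crawl are all hypothetical names, and the example.com style URLs are placeholders. In a real page, getLinks would be the asynchronous $.ajax call shown above, so the recursion would have to continue inside the success callback.

```javascript
// Hypothetical in-memory "web": each URL maps to the links found on that page.
const fakeWeb = {
  'http://a.example/': ['http://b.example/', 'http://c.example/'],
  'http://b.example/': ['http://a.example/', 'http://d.example/'],
  'http://c.example/': [],
  'http://d.example/': ['http://e.example/'],
  'http://e.example/': []
};

// Stand-in for fetching a page and extracting its links (an assumption
// made for this sketch; a real crawler would request the page here).
function getLinks(url) {
  return fakeWeb[url] || [];
}

// Visit url, then recurse into its links. maxDepth bounds the recursion
// depth, and the visited set stops the crawler from looping on cycles.
function crawl(url, maxDepth, visited = new Set()) {
  if (maxDepth < 0 || visited.has(url)) {
    return visited;
  }
  visited.add(url);
  for (const link of getLinks(url)) {
    crawl(link, maxDepth - 1, visited);
  }
  return visited;
}

// Crawl the start page and the pages it links to directly.
console.log([...crawl('http://a.example/', 1)]);
```

The two guards are the important part: without the visited set, pages that link back to each other would recurse forever, and without a depth limit the crawl has no natural stopping point on the real web.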