Google Spider

By Author: webriferco

A Google spider, or Web crawler, sometimes simply called a spider, is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing (web spidering). Web search engines and some other sites use Web crawling or spidering software to update their own web content or their indices of other sites' web content. Web crawlers can copy all the pages they visit for later processing by a search engine, which indexes the downloaded pages so that users can search much more efficiently.

Crawlers consume resources on the systems they visit and often visit sites without approval. Issues of schedule, load, and "politeness" come into play when large collections of pages are accessed. Mechanisms exist for public sites that do not wish to be crawled to make this known to the crawling agent. For instance, including a robots.txt file can ask bots to index only parts of a website, or nothing at all.
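As a minimal sketch of how a polite crawler might honor robots.txt, the example below uses Python's standard urllib.robotparser module. The robots.txt content, the bot name "ExampleBot" and the example.com URLs are assumptions made for illustration; a real crawler would download the file from the site before requesting any other page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for this sketch).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleBot" is a made-up user-agent name.
print(parser.can_fetch("ExampleBot", "https://example.com/index.html"))         # True
print(parser.can_fetch("ExampleBot", "https://example.com/private/data.html"))  # False
```

The crawler checks can_fetch before every request and simply skips any URL the site has asked bots to leave alone.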

As the number of pages on the Internet is extremely large, even the largest crawlers fall short of making a complete index. For that reason, search engines struggled to give relevant results in the early years of the World Wide Web, before 2000. Modern search engines have improved greatly on this and now return very good results almost instantly.

Crawlers can validate hyperlinks and HTML code. They can also be used for web scraping (see also data-driven programming).

A Web crawler starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies. If the crawler is archiving websites, it copies and saves the information as it goes. The archives are usually stored in such a way that they can be viewed, read and navigated as they were on the live web, but are preserved as 'snapshots'.
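The seed and frontier loop described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a production crawler: the LINKS dictionary is a hypothetical in-memory stand-in for the Web, and fetch_links is a made-up helper that would normally download a page and extract its hyperlinks. The frontier here is a simple first-in, first-out queue, which gives breadth-first ordering.

```python
from collections import deque

# Hypothetical link graph standing in for the Web: URL -> hyperlinks on that page.
LINKS = {
    "/": ["/about", "/news"],
    "/about": ["/", "/team"],
    "/news": ["/news/1", "/about"],
    "/team": [],
    "/news/1": ["/"],
}


def fetch_links(url):
    """Made-up helper: a real crawler would download and parse the page."""
    return LINKS.get(url, [])


def crawl(seeds, fetch_links, max_pages=100):
    """Visit pages starting from the seeds, growing the frontier as links are found."""
    frontier = deque(dict.fromkeys(seeds))  # URLs waiting to be visited
    seen = set(frontier)                    # URLs already queued or visited
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:            # never queue the same URL twice
                seen.add(link)
                frontier.append(link)
    return visited


print(crawl(["/"], fetch_links))
# ['/', '/about', '/news', '/team', '/news/1']
```

Swapping the queue for a different data structure changes the visiting policy without touching the rest of the loop.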

The sheer volume of the Web means that a crawler can download only a limited number of pages within a given time, so it needs to prioritize its downloads. The high rate of change means that pages may already have been updated or even deleted by the time the crawler reaches them.

The number of possible URLs generated by server-side software has also made it difficult for web crawlers to avoid retrieving duplicate content. Endless combinations of HTTP GET (URL-based) parameters exist, of which only a small selection will actually return unique content. For example, a simple online photo gallery may offer four options to users, as specified through HTTP GET parameters in the URL. If there are four ways to sort images, three choices of thumbnail size, two file formats, and an option to disable user-provided content, then the same set of content can be accessed with 4 x 3 x 2 x 2 = 48 different URLs, all of which may be linked on the site. This combinatorial growth creates a problem for crawlers, as they must sort through endless combinations of relatively minor scripted changes in order to retrieve unique content.
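The arithmetic in the photo gallery example can be checked with a short Python sketch. The parameter names and values (sort, thumb, format, user_content) and the example.com URL are assumptions made for illustration. The canonicalize function shows one possible way a crawler might collapse such URLs, on the further assumption that it already knows which parameters only change presentation.

```python
from itertools import product
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical gallery options: 4 sort orders, 3 thumbnail sizes,
# 2 file formats, and user-provided content switched on or off.
OPTIONS = {
    "sort": ["name", "date", "size", "rating"],
    "thumb": ["small", "medium", "large"],
    "format": ["jpg", "png"],
    "user_content": ["on", "off"],
}


def gallery_urls(base="http://example.com/gallery"):
    """Build every URL the gallery could link to for the same set of images."""
    keys = list(OPTIONS)
    return [
        f"{base}?{urlencode(dict(zip(keys, combo)))}"
        for combo in product(*OPTIONS.values())
    ]


def canonicalize(url, ignored=frozenset(OPTIONS)):
    """Drop parameters assumed to affect presentation only, and sort the rest."""
    parts = urlsplit(url)
    kept = sorted((k, v) for k, v in parse_qsl(parts.query) if k not in ignored)
    return parts._replace(query=urlencode(kept)).geturl()


urls = gallery_urls()
print(len(urls))                             # 48 = 4 * 3 * 2 * 2
print(len({canonicalize(u) for u in urls}))  # 1
```

In practice a crawler rarely knows in advance which parameters are safe to ignore, which is why duplicate content remains a hard problem.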

As Edwards et al. noted, "Given that the bandwidth for conducting crawls is neither infinite nor free, it is becoming essential to crawl the Web in not only a scalable, but efficient way, if some reasonable measure of quality or freshness is to be maintained." A crawler must carefully choose at each step which pages to visit next.

Given the current size of the Web, even large search engines cover only a portion of the publicly available part. A 2009 study showed that even large-scale search engines index no more than 40-70% of the indexable Web; a previous study by Steve Lawrence and Lee Giles showed that no search engine indexed more than 16% of the Web in 1999. As a crawler always downloads just a fraction of the Web pages, it is highly desirable for the downloaded fraction to contain the most relevant pages and not just a random sample of the Web.

This requires a metric of importance for prioritizing Web pages. The importance of a page is a function of its intrinsic quality, its popularity in terms of links or visits, and even of its URL (the latter is the case for vertical search engines restricted to a single top-level domain, or search engines restricted to a fixed Web site). Designing a good selection policy has an added difficulty: it must work with partial information, as the complete set of Web pages is not known during crawling.

Cho et al. made the first study on policies for crawl scheduling. Their data set was a 180,000-page crawl from the stanford.edu domain, on which a crawling simulation was run with different strategies. The ordering metrics tested were breadth-first, backlink count and partial PageRank calculations. One of the conclusions was that if the crawler wants to download pages with high PageRank early in the crawling process, then the partial PageRank strategy is the best, followed by breadth-first and backlink count. However, these results are for just a single domain. Cho also wrote his Ph.D. dissertation at Stanford on web crawling.
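To make the idea of an ordering metric concrete, here is a small Python sketch of a frontier ordered by backlink count, one of the metrics named above. It is an illustration only and does not reproduce the setup used by Cho et al.: the LINKS graph is hypothetical, and ties are broken by discovery order as an arbitrary design choice.

```python
# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "seed": ["a", "b", "hub"],
    "a": ["hub"],
    "b": ["hub"],
    "hub": ["c"],
    "c": [],
}


def crawl_by_backlinks(seeds, fetch_links, max_pages=100):
    """Always visit the frontier page with the most in-links seen so far."""
    backlinks = {url: 0 for url in seeds}             # frontier: URL -> known in-links
    order = {url: i for i, url in enumerate(seeds)}   # discovery order, used for ties
    done = set()
    visited = []
    while backlinks and len(visited) < max_pages:
        url = max(backlinks, key=lambda u: (backlinks[u], -order[u]))
        del backlinks[url]
        done.add(url)
        visited.append(url)
        for link in dict.fromkeys(fetch_links(url)):  # count each target once per page
            if link in done:
                continue
            order.setdefault(link, len(order))
            backlinks[link] = backlinks.get(link, 0) + 1
    return visited


print(crawl_by_backlinks(["seed"], lambda u: LINKS.get(u, [])))
# ['seed', 'a', 'hub', 'b', 'c']
```

A plain breadth-first crawl of the same graph would visit "b" before "hub"; the backlink metric promotes "hub" as soon as a second link to it is found. Note that the crawler only knows the backlinks it has seen so far, which is the partial-information difficulty mentioned above.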

Najork and Wiener performed an actual crawl on 328 million pages, using breadth-first ordering. They found that a breadth-first crawl captures pages with high PageRank early in the crawl (but they did not compare this strategy against other strategies). The explanation given by the authors for this result is that "the most important pages have many links to them from numerous hosts, and those links will be found early, regardless of on which host or page the crawl originates."

Abiteboul designed a crawling strategy based on an algorithm called OPIC (On-line Page Importance Computation). In OPIC, each page is given an initial sum of "cash" that is distributed equally among the pages it points to. It is similar to a PageRank computation, but it is faster and is done in only one step. An OPIC-driven crawler first downloads the pages in the crawling frontier with the highest amounts of "cash". Experiments were carried out on a 100,000-page synthetic graph with a power-law distribution of in-links. However, there was no comparison with other strategies, nor were there experiments on the real Web.
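The "cash" mechanism can be illustrated with a simplified Python sketch. This is not Abiteboul's full algorithm: it is a one-pass approximation written for this article, in which only the seeds start with cash, each page is downloaded once, and ties are broken by discovery order. The LINKS graph is hypothetical.

```python
# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "seed": ["a", "b"],
    "a": ["hub"],
    "b": ["hub", "x"],
    "hub": ["y"],
    "x": [],
    "y": [],
}


def opic_crawl(seeds, fetch_links, max_pages=100):
    """Download the frontier page holding the most cash, then pass its cash on."""
    cash = {url: 1.0 / len(seeds) for url in seeds}   # seeds share one unit of cash
    order = {url: i for i, url in enumerate(seeds)}   # discovery order, used for ties
    done = set()
    downloaded = []
    while len(downloaded) < max_pages:
        frontier = [u for u in cash if u not in done]
        if not frontier:
            break
        url = max(frontier, key=lambda u: (cash[u], -order[u]))
        done.add(url)
        downloaded.append(url)
        amount, cash[url] = cash[url], 0.0            # the page spends all its cash
        links = list(dict.fromkeys(fetch_links(url)))
        for link in links:
            order.setdefault(link, len(order))
            cash[link] = cash.get(link, 0.0) + amount / len(links)
    return downloaded


print(opic_crawl(["seed"], lambda u: LINKS.get(u, [])))
# ['seed', 'a', 'b', 'hub', 'y', 'x']
```

In this run "y" is downloaded before "x" even though "x" was discovered first, because "y" inherits all of the cash that accumulated on "hub" while "x" receives only half of the cash held by "b".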

A further study used pages from the .it domain and 100 million pages from the WebBase crawl, testing breadth-first against depth-first, random ordering and an omniscient strategy. The comparison was based on how well PageRank computed on a partial crawl approximates the true PageRank value. Surprisingly, some visits that accumulate PageRank very quickly (most notably, breadth-first and the omniscient visit) provide very poor progressive approximations.

Baeza-Yates et al. used simulation on two subsets of the Web of 3 million pages from the .gr and .cl domains, testing several crawling strategies. They showed that both the OPIC strategy and a strategy that uses the length of the per-site queues are better than breadth-first crawling, and that it is also very effective to use a previous crawl, when it is available, to guide the current one.

Daneshpajouh et al. designed a community-based algorithm for discovering good seeds. Their method crawls web pages with high PageRank from different communities in fewer iterations than a crawl starting from random seeds. Good seeds can be extracted from a previously crawled Web graph using this method, and a new crawl started from these seeds can be very effective.
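The comparison between PageRank on a partial crawl and PageRank on the full graph, mentioned above, can be sketched with a basic power iteration in Python. This is a textbook-style illustration under stated assumptions, not the method of any of the studies cited: the LINKS graph is hypothetical, the damping factor of 0.85 is a conventional choice, and pages without usable out-links spread their rank evenly over all pages.

```python
# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub", "a"],
    "hub": ["a"],
}


def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over the pages in `links`; links to unknown pages are ignored."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = [t for t in links[p] if t in links]
            if targets:
                for t in targets:
                    new[t] += damping * rank[p] / len(targets)
            else:
                for t in pages:               # dangling page: spread rank evenly
                    new[t] += damping * rank[p] / n
        rank = new
    return rank


def partial_graph(links, crawled):
    """Keep only the pages a partial crawl has downloaded so far."""
    return {p: links[p] for p in crawled}


full = pagerank(LINKS)
partial = pagerank(partial_graph(LINKS, ["a", "hub"]))
for page in partial:
    print(page, round(full[page], 3), round(partial[page], 3))
```

Because ranks are normalized over whichever pages are known, the values from a partial crawl can differ noticeably from the values on the full graph, which is the approximation error the comparison measures.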
