Identification of Spiders and Crawlers


Spiders (also called crawlers) are small web programs that harvest information for search engines. They follow links across the web and gather page content, which is useful because it lets websites show up quickly in search results. A site can also explicitly instruct a robot not to follow any of the links on a page. Alongside these good spiders there are bad ones, known as spam spiders, which try to harvest your email address. Some spiders also work inefficiently and fall into endless loops created by dynamically generated web pages. In this project we try to identify the bad spam spiders present in web server logs, filter them out, and thereby minimize bot traffic. A similar approach to bot filtering is used by Google in Google Analytics.
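The "do not follow" instruction mentioned above is normally given through the Robots Exclusion Protocol, either site-wide in a robots.txt file or per page with a `<meta name="robots" content="nofollow">` tag. As an illustration (the path is a placeholder, not from this project):

```
# robots.txt at the site root: ask compliant crawlers to skip a section
User-agent: *
Disallow: /private/
```

Well-behaved spiders honor these rules; spam spiders typically ignore them, which is part of why they must be detected from the server logs instead.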

Implementation Steps:

Software used: Java + Hadoop + Hive (Hive serves as the data warehouse layer on top of Hadoop)

The given dataset (Google bot/spider logs) is analyzed for bot identification

The data is uploaded to the Hadoop HDFS file system

The file is stored under hdfs/app/hadoop

The file is named web_log

We have to start the Hadoop server first.

Then we can check whether Hadoop is running.
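The HDFS steps above can be sketched as shell commands. Script names and paths vary with the Hadoop version and install layout, so treat this as an illustrative session against a standard Hadoop distribution, with the target directory taken from the path given above:

```
# start the HDFS daemons (NameNode, DataNode, SecondaryNameNode)
start-dfs.sh

# check that the Hadoop JVM processes are running
jps

# create the target directory and upload the web log to HDFS
hdfs dfs -mkdir -p /app/hadoop
hdfs dfs -put web_log /app/hadoop/web_log
```

If `jps` does not list the NameNode and DataNode processes, the daemons are not up and the upload will fail.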

Upload the web_log file to the Hive database

We created a server_log partition in Hive, where the data is stored
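A minimal HiveQL sketch of this step follows. The column layout and the partition value are assumptions based on a typical web server log; the real schema depends on the dataset:

```sql
-- hypothetical schema for the partitioned server_log table
CREATE TABLE server_log (
  ip         STRING,
  request    STRING,
  status     INT,
  user_agent STRING
)
PARTITIONED BY (log_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- move the uploaded HDFS file into one partition
LOAD DATA INPATH '/app/hadoop/web_log'
INTO TABLE server_log PARTITION (log_date = '2016-01-01');
```

Partitioning by date keeps each day's log in its own HDFS directory, so queries restricted to one day scan only that partition.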

Start the analysis, in which the dataset in Hive is analyzed for bot detection
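One simple way to flag bots in HiveQL is to match known crawler signatures in the user-agent field. The patterns below are illustrative, not exhaustive, and assume the hypothetical `server_log` schema with a `user_agent` column:

```sql
-- count requests whose user-agent looks like a crawler
SELECT user_agent, COUNT(*) AS hits
FROM server_log
WHERE lower(user_agent) LIKE '%bot%'
   OR lower(user_agent) LIKE '%crawler%'
   OR lower(user_agent) LIKE '%spider%'
GROUP BY user_agent
ORDER BY hits DESC;
```

Spam spiders that forge a browser user-agent will slip past this filter, so signature matching is usually combined with behavioral signals such as request rate or ignored robots.txt rules.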

Finally, the bot counts under different browsers are collected and plotted as a graph.

This shows how many bot URLs are detected in the web log.
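The same per-agent tallying can be sketched in plain Java with no Hadoop dependencies. The user-agent strings and the signature list are illustrative, not taken from the project's dataset:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BotCounter {

    // crude signature check; real deployments use curated bot lists
    static boolean isBot(String userAgent) {
        String ua = userAgent.toLowerCase();
        return ua.contains("bot") || ua.contains("crawler") || ua.contains("spider");
    }

    // tally hits per user-agent string, keeping only bot-like agents
    static Map<String, Integer> countBots(List<String> userAgents) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String ua : userAgents) {
            if (isBot(ua)) {
                counts.merge(ua, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // illustrative user-agent values, not from the real dataset
        List<String> agents = Arrays.asList(
            "Mozilla/5.0 (Windows NT 10.0) Firefox/45.0",
            "Googlebot/2.1 (+http://www.google.com/bot.html)",
            "Googlebot/2.1 (+http://www.google.com/bot.html)",
            "SpamSpider/1.0"
        );
        // prints each bot agent with its hit count
        countBots(agents).forEach((ua, n) -> System.out.println(ua + " -> " + n));
    }
}
```

The resulting counts per user-agent are exactly what gets plotted in the graph described above.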

Tools used: Hive, Hadoop, Java

Project Demo
