Identification of Spiders and Crawlers
Spiders (also called crawlers or bots) are small web programs that harvest information for search engines. They follow links across the web, visiting pages and gathering their content, which is how search engines find and index websites quickly. A site owner can also explicitly instruct a robot not to follow any of the links on a page. Alongside these legitimate spiders there are malicious ones, known as spam spiders, which crawl pages to harvest email addresses. Some spiders also work inefficiently and fall into endless loops created by dynamically generated web pages. So in this project we try to identify the spam-spider requests present in a web log, eliminate them, and thereby minimize the bot traffic counted in the analysis. A similar approach to bot filtering is used by Google in its Google Analytics service.
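As noted above, a well-behaved robot can be told not to crawl parts of a site. The standard mechanism is a robots.txt file at the site root; the directory name below is only illustrative:

```
# robots.txt — ask all crawlers to skip the /private/ directory
User-agent: *
Disallow: /private/
# spam spiders typically ignore these rules, which is one
# behavioral clue for telling good bots from bad ones
```

A per-page alternative is the `<meta name="robots" content="noindex, nofollow">` tag in the page's HTML head.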
Implementation Steps:
Software used: Java + Hadoop + Hive (Hive, the SQL-like data warehouse built on Hadoop, serves as the database)
The given dataset (a web log containing Googlebot/spider traffic) is analyzed for bot identification
The data is uploaded to the Hadoop HDFS file system
The file is stored under hdfs/app/hadoop
The file is named web_log
We have to start the Hadoop server first
Then we check whether Hadoop is running
Upload the web_log file to the Hive database
We created a server_log partition in Hive, where the data are stored
Start the analysis, in which the dataset in Hive is scanned for bot detection
Finally, the count of bot requests under each browser is taken and plotted as a graph
This shows how many bot URLs were detected in the web log
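The analysis step above can be sketched as a HiveQL query. The table name server_log and the column name user_agent are assumptions for illustration; the actual schema of the web_log data may differ:

```
-- count requests per user agent, flagging likely bots by common
-- crawler keywords in the User-Agent string (illustrative schema)
SELECT user_agent,
       COUNT(*) AS hits
FROM   server_log
WHERE  LOWER(user_agent) LIKE '%bot%'
   OR  LOWER(user_agent) LIKE '%spider%'
   OR  LOWER(user_agent) LIKE '%crawler%'
GROUP BY user_agent
ORDER BY hits DESC;
```

The grouped counts from such a query are what get plotted per browser in the final step.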
Tools used: Hive, Hadoop, Java
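The detection rule behind the per-browser bot counts can also be sketched in plain Java. This is a minimal sketch assuming bots identify themselves in the User-Agent field; the class name, keyword list, and sample log lines are illustrative, not the project's actual code:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the bot-detection idea applied to one log field:
// a request counts as a bot if its User-Agent string contains
// a common crawler keyword.
public class BotDetector {
    private static final String[] BOT_KEYWORDS = {"bot", "spider", "crawler"};

    // true if the User-Agent looks like a spider/crawler
    public static boolean isBot(String userAgent) {
        if (userAgent == null) return false;
        String ua = userAgent.toLowerCase();
        return Arrays.stream(BOT_KEYWORDS).anyMatch(ua::contains);
    }

    // count bot hits per user agent, mirroring the per-browser tally
    public static Map<String, Long> botCounts(String[] userAgents) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String ua : userAgents) {
            if (isBot(ua)) counts.merge(ua, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] log = {
            "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/115.0",
            "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
        };
        // only the two Googlebot lines are tallied
        System.out.println(botCounts(log));
    }
}
```

In the actual pipeline this matching runs over the Hive-partitioned web_log data rather than an in-memory array, but the classification logic is the same.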