How to make a Web crawler using Java?
There is a lot of useful information on the Internet. How can we collect it automatically? With a Web crawler.
This post shows how to make a simple Web crawler prototype using Java. Making a Web crawler is not as difficult as it sounds. Follow the guide and you should have one running in an hour or less, ready to collect information for you. As this is only a prototype, you will need to spend more time customizing it for your needs.
The following are prerequisites for this tutorial:
- Basic Java programming
- A little knowledge of SQL and the MySQL database
If you don't want to use a database, you can use a file to track the crawling history.
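A file-based history could look roughly like the sketch below. The class name `FileHistory`, the method `markVisited`, and the file name `crawl-history.txt` are made up for illustration; the idea is simply to keep one URL per line and reload the file when the crawler starts.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: tracks crawled URLs in a plain text file, one URL per line.
class FileHistory {
    private final Path file;
    private final Set<String> seen = new HashSet<>();

    FileHistory(Path file) {
        this.file = file;
        try {
            if (Files.exists(file)) {
                // Reload the history left behind by a previous run.
                seen.addAll(Files.readAllLines(file, StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns true if the URL is new (and records it), false if it was seen before.
    boolean markVisited(String url) {
        if (!seen.add(url)) {
            return false;
        }
        try {
            byte[] line = (url + System.lineSeparator()).getBytes(StandardCharsets.UTF_8);
            Files.write(file, line, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return true;
    }

    public static void main(String[] args) {
        FileHistory history = new FileHistory(Paths.get("crawl-history.txt"));
        if (history.markVisited("http://www.mit.edu")) {
            System.out.println("new page, crawl it");
        } else {
            System.out.println("already crawled, skip it");
        }
    }
}
```

This keeps the whole history in memory as well, which is fine for a prototype but would not scale to a very large crawl.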
1. The goal
In this tutorial, the goal is the following:
Given a school root URL, e.g., "mit.edu", return all pages from this school that contain the string "research".
A typical crawler works in the following steps:
- Parse the root web page ("mit.edu") and get all links from this page. To access each URL and parse the HTML page, I will use JSoup, which is a convenient web page parser written in Java.
- Take the URLs retrieved in step 1 and parse each of those pages in the same way.
- While doing the above steps, track which pages have already been processed, so that each web page is only processed once. This is the reason why we need a database.
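The three steps can be sketched without any network access by standing in for the Web with a small in-memory map from each page to its links. The names `CrawlSketch` and `crawl` and the sample pages are invented for this illustration, and it uses a queue rather than recursion; the real crawler later in this post fetches links with JSoup and tracks history in MySQL.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: crawl a fake "web" where each page maps to the links it contains.
class CrawlSketch {
    // Visits every page reachable from root exactly once, in breadth-first order.
    static List<String> crawl(String root, Map<String, List<String>> linksByPage) {
        List<String> order = new ArrayList<>();
        Set<String> visited = new HashSet<>();   // plays the role of the Record table
        Deque<String> queue = new ArrayDeque<>();
        visited.add(root);
        queue.add(root);
        while (!queue.isEmpty()) {
            String page = queue.remove();
            order.add(page);                      // "process" the page
            List<String> links = linksByPage.getOrDefault(page, Collections.<String>emptyList());
            for (String link : links) {
                // add() returns false for a page that was already seen, so it is skipped
                if (visited.add(link)) {
                    queue.add(link);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = new HashMap<>();
        web.put("mit.edu", Arrays.asList("mit.edu/a", "mit.edu/b"));
        web.put("mit.edu/a", Arrays.asList("mit.edu", "mit.edu/b"));
        web.put("mit.edu/b", Arrays.asList("mit.edu/a"));
        System.out.println(crawl("mit.edu", web)); // prints [mit.edu, mit.edu/a, mit.edu/b]
    }
}
```

Even though the pages link back to each other, each one is processed once because the visited set is checked before a link is queued.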
2. Set up MySQL database
If you are using Ubuntu, you can follow this guide to install Apache, MySQL, PHP, and phpMyAdmin.
If you are using Windows, you can simply use WampServer. Download it from wampserver.com, install it in a minute, and you are ready for the next step.
I will use phpMyAdmin to manipulate the MySQL database. It is simply a GUI for MySQL. It is totally fine if you use any other tool, or no GUI tool at all.
3. Create a database and a table
Create a database named "Crawler" and create a table called "Record" like the following:
CREATE TABLE IF NOT EXISTS `Record` (
  `RecordID` INT(11) NOT NULL AUTO_INCREMENT,
  `URL` text NOT NULL,
  PRIMARY KEY (`RecordID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=1;
4. Start crawling using Java
1). Download the JSoup core library from http://jsoup.org/download.
Download mysql-connector-java-xxxbin.jar from http://dev.mysql.com/downloads/connector/j/
2). Now create a project in Eclipse named "Crawler" and add the JSoup and mysql-connector jar files you downloaded to the Java Build Path (right click the project --> select "Build Path" --> "Configure Build Path" --> click "Libraries" tab --> click "Add External JARs").
3). Create a class named "DB" which is used for handling database actions.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class DB {

    public Connection conn = null;

    public DB() {
        try {
            // Load the MySQL JDBC driver and connect to the "Crawler" database.
            Class.forName("com.mysql.jdbc.Driver");
            String url = "jdbc:mysql://localhost:3306/Crawler";
            // Replace the user name and password with your own.
            conn = DriverManager.getConnection(url, "root", "admin213");
            System.out.println("conn built");
        } catch (SQLException e) {
            e.printStackTrace();
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        }
    }

    // Runs a statement that returns rows, such as SELECT.
    public ResultSet runSql(String sql) throws SQLException {
        Statement sta = conn.createStatement();
        return sta.executeQuery(sql);
    }

    // Runs a statement that returns no rows, such as TRUNCATE.
    public boolean runSql2(String sql) throws SQLException {
        Statement sta = conn.createStatement();
        return sta.execute(sql);
    }

    @Override
    protected void finalize() throws Throwable {
        // Use && (not ||) so isClosed() is never called on a null connection.
        if (conn != null && !conn.isClosed()) {
            conn.close();
        }
    }
}
4). Create a class with name "Main" which will be our crawler.
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Main {

    public static DB db = new DB();

    public static void main(String[] args) throws SQLException, IOException {
        db.runSql2("TRUNCATE Record;");
        processPage("http://www.mit.edu");
    }

    public static void processPage(String URL) throws SQLException, IOException {
        // check if the given URL is already in the database
        // (a parameterized query avoids breaking on URLs that contain quotes)
        String sql = "SELECT * FROM Record WHERE URL = ?";
        try (PreparedStatement check = db.conn.prepareStatement(sql)) {
            check.setString(1, URL);
            try (ResultSet rs = check.executeQuery()) {
                if (rs.next()) {
                    return; // already processed
                }
            }
        }

        // store the URL in the database to avoid parsing it again
        sql = "INSERT INTO `Crawler`.`Record` (`URL`) VALUES (?);";
        try (PreparedStatement stmt = db.conn.prepareStatement(sql)) {
            stmt.setString(1, URL);
            stmt.execute();
        }

        // get useful information: fetch the page that was passed in, not always the root page
        Document doc = Jsoup.connect(URL).get();
        if (doc.text().contains("research")) {
            System.out.println(URL);
        }

        // get all links and recursively call the processPage method
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            if (link.attr("href").contains("mit.edu")) {
                processPage(link.attr("abs:href"));
            }
        }
    }
}
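Because the database lookup compares URL strings exactly, "http://www.mit.edu/#top" and "HTTP://WWW.MIT.EDU" would be stored as two different pages even though they point to the same one. A possible refinement, not part of the original crawler, is to normalize each URL before checking it. The sketch below uses a hypothetical `UrlNormalizer` class and assumes that dropping the fragment, lowercasing the scheme and host, and treating an empty path as "/" is enough for your site.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper: reduces equivalent URLs to one canonical string.
class UrlNormalizer {
    static String normalize(String url) {
        URI uri = URI.create(url.trim()); // throws IllegalArgumentException on malformed input
        if (uri.getScheme() == null || uri.getHost() == null) {
            throw new IllegalArgumentException("expected an absolute URL with a host: " + url);
        }
        StringBuilder sb = new StringBuilder();
        // Scheme and host are case-insensitive, so lowercase them.
        sb.append(uri.getScheme().toLowerCase(Locale.ROOT));
        sb.append("://");
        sb.append(uri.getHost().toLowerCase(Locale.ROOT));
        if (uri.getPort() != -1) {
            sb.append(':').append(uri.getPort());
        }
        // Keep the path and query exactly as written; only fill in an empty path.
        String path = uri.getRawPath();
        sb.append(path == null || path.isEmpty() ? "/" : path);
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        // The fragment (#...) is dropped: it never changes which page is fetched.
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("HTTP://WWW.MIT.EDU"));        // prints http://www.mit.edu/
        System.out.println(normalize("http://www.mit.edu/#top"));   // prints http://www.mit.edu/
    }
}
```

With something like this in place, `processPage` could normalize the URL first and use the result for both the lookup and the insert.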
Now you have your own Web crawler. Of course, you will need to filter some links you don't want to crawl.
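One way to filter links is to check the host name instead of searching the whole link for "mit.edu", since a substring check also accepts links such as "http://example.com/?ref=mit.edu" and "mailto:" addresses. The sketch below is only one possible filter; the class `LinkFilter`, the method `shouldCrawl`, and the list of skipped file extensions are assumptions for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper: decides whether an absolute link is worth crawling.
class LinkFilter {
    // Returns true only for http(s) links whose host is the domain or one of its subdomains.
    static boolean shouldCrawl(String link, String domain) {
        try {
            URI uri = new URI(link);
            String scheme = uri.getScheme();
            String host = uri.getHost();
            if (scheme == null || host == null) {
                return false; // relative links, mailto:, javascript:, and so on
            }
            scheme = scheme.toLowerCase(Locale.ROOT);
            if (!scheme.equals("http") && !scheme.equals("https")) {
                return false;
            }
            host = host.toLowerCase(Locale.ROOT);
            if (!host.equals(domain) && !host.endsWith("." + domain)) {
                return false; // a different site
            }
            // Skip a few common non-HTML file types (an arbitrary, incomplete list).
            String path = uri.getPath() == null ? "" : uri.getPath().toLowerCase(Locale.ROOT);
            return !(path.endsWith(".pdf") || path.endsWith(".jpg")
                    || path.endsWith(".png") || path.endsWith(".zip"));
        } catch (URISyntaxException e) {
            return false; // malformed link
        }
    }

    public static void main(String[] args) {
        System.out.println(shouldCrawl("http://web.mit.edu/research/", "mit.edu"));    // prints true
        System.out.println(shouldCrawl("http://example.com/?ref=mit.edu", "mit.edu")); // prints false
    }
}
```

In the crawler above, this check would be applied to `link.attr("abs:href")` before the recursive call.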
The output was the following when I ran the code on May 26, 2014.
Links:
Java Crawler Source Code Download
Java Crawler on GitHub