How to make a Web crawler using Java

There is a lot of useful information on the Internet. How can we collect that information automatically? Yes, with a Web crawler.

In this post, I will show you how to build a prototype Web crawler step by step using Java. Building a Web crawler is not as difficult as it sounds: just follow this guide and you can get there in about an hour or less, and then enjoy the huge amount of information it can collect for you. The goal of this tutorial is to be the simplest tutorial in the world for making a crawler in Java. Since this is only a prototype, you will need to spend more time customizing it for your own needs.

I assume you know the following:

  • Basic Java programming
  • A little bit of SQL and the MySQL database

If you don't want to use a database, you can use a file to track the crawling history.
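
For example, here is a minimal sketch of such file-based tracking. It assumes exact-URL matching is enough for your needs, and the class name CrawlHistory and the file name history.txt are my own choices for illustration, not part of the tutorial's code:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.Set;

public class CrawlHistory {
	private final Set<String> seen = new HashSet<>();

	public CrawlHistory() throws IOException {
		// load previously crawled URLs, one per line, if the file exists
		if (Files.exists(Paths.get("history.txt"))) {
			seen.addAll(Files.readAllLines(Paths.get("history.txt")));
		}
	}

	// returns true only the first time a URL is seen, and records it on disk
	public boolean markVisited(String url) throws IOException {
		if (!seen.add(url)) return false;
		Files.write(Paths.get("history.txt"), (url + System.lineSeparator()).getBytes(),
				StandardOpenOption.CREATE, StandardOpenOption.APPEND);
		return true;
	}
}

You would call markVisited(url) before crawling a page and skip the page when the method returns false.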

1. The goal

In this tutorial, the goal is the following:

Given a school's root URL, e.g., "mit.edu", return all pages from this school that contain the string "research".

A typical crawler works in the following steps:

  1. Parse the root web page ("mit.edu") and get all links from this page. To access each URL and parse the HTML page, I will use JSoup, which is a convenient and simple Java library (a minimal sketch of this step follows the list).
  2. Take the URLs retrieved in step 1 and parse those pages in the same way.
  3. While doing the above steps, we need to track which pages have been processed before, so that each web page gets processed only once. This is the reason why we need a database.
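
Here is a minimal, self-contained sketch of step 1 on its own, using only JSoup calls; the class name LinkLister is my own, for illustration:

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkLister {
	public static void main(String[] args) throws IOException {
		// fetch the root page and print every link found on it
		Document doc = Jsoup.connect("http://www.mit.edu").get();
		for (Element link : doc.select("a[href]")) {
			// "abs:href" resolves relative links against the page URL
			System.out.println(link.attr("abs:href"));
		}
	}
}

With JSoup on your classpath, running this prints the absolute URL of every link on the root page.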

2. Set up MySQL database

If you are using Ubuntu, you can follow this guide to install Apache, MySQL, PHP, and phpMyAdmin.

If you are using Windows, you can simply use WampServer. Download it from wampserver.com, install it in a minute, and you are good to go for the next step.

I will use phpMyAdmin to manipulate the MySQL database. It is simply a GUI for using MySQL. It is totally fine if you use any other tool, or no GUI tool at all.

3. Create a database and a table

Create a database named "Crawler" and create a table called "Record" like the following:

CREATE TABLE IF NOT EXISTS `Record` (
  `RecordID` INT(11) NOT NULL AUTO_INCREMENT,
  `URL` text NOT NULL,
  PRIMARY KEY (`RecordID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;

[Image: web-crawler-db]
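
One optional tweak: the crawler will look up every URL in this table before visiting it, and since URL is a TEXT column with no index, each lookup scans the whole table. If your crawl grows large, a prefix index (my own suggestion, not required for the tutorial) speeds those lookups up:

ALTER TABLE `Record` ADD INDEX `url_idx` (`URL`(255));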

4. Start crawling using Java

1). Download the JSoup core library from http://jsoup.org/download.
Download mysql-connector-java-xxx-bin.jar from http://dev.mysql.com/downloads/connector/j/

2). Now create a project in Eclipse named "Crawler" and add the JSoup and mysql-connector JAR files you downloaded to the Java Build Path. (Right-click the project --> select "Build Path" --> "Configure Build Path" --> click the "Libraries" tab --> click "Add External JARs".)

3). Create a class named "DB", which handles the database actions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
 
public class DB {
 
	public Connection conn = null;
 
	public DB() {
		try {
			Class.forName("com.mysql.jdbc.Driver");
			String url = "jdbc:mysql://localhost:3306/Crawler";
			// use your own MySQL user name and password here
			// (WampServer's default root password is blank: "")
			conn = DriverManager.getConnection(url, "root", "admin213");
			System.out.println("conn built");
		} catch (SQLException e) {
			e.printStackTrace();
		} catch (ClassNotFoundException e) {
			e.printStackTrace();
		}
	}
 
	// runs a SELECT query and returns the result set
	public ResultSet runSql(String sql) throws SQLException {
		Statement sta = conn.createStatement();
		return sta.executeQuery(sql);
	}

	// runs a statement that modifies data (INSERT, TRUNCATE, ...);
	// executeQuery() would throw an exception for such statements
	public boolean runSql2(String sql) throws SQLException {
		Statement sta = conn.createStatement();
		return sta.execute(sql);
	}
 
	@Override
	protected void finalize() throws Throwable {
		// note: && rather than ||, otherwise a null conn would throw a
		// NullPointerException here
		if (conn != null && !conn.isClosed()) {
			conn.close();
		}
	}
}
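
To make sure the connection works before wiring up the crawler, you can run a quick sanity check like the one below. This test class is my own addition (the name DBTest is arbitrary), and it assumes the Crawler database and Record table from step 3 already exist:

import java.sql.ResultSet;
import java.sql.SQLException;

public class DBTest {
	public static void main(String[] args) throws SQLException {
		DB db = new DB();
		// count the rows currently stored in the Record table
		ResultSet rs = db.runSql("SELECT COUNT(*) FROM Record;");
		if (rs.next()) {
			System.out.println("Rows in Record: " + rs.getInt(1));
		}
	}
}

If everything is set up correctly, it prints "conn built" followed by the row count.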

4). Create a class named "Main", which will be our crawler.

import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
 
 
public class Main {
	public static DB db = new DB();
 
	public static void main(String[] args) throws SQLException, IOException {
		db.runSql2("TRUNCATE Record;");
		processPage("http://www.mit.edu");
	}
 
	public static void processPage(String URL) throws SQLException, IOException{
		// check whether the given URL is already in the database
		// (string concatenation is fine for this prototype, but use a
		// PreparedStatement in production to avoid SQL injection)
		String sql = "select * from Record where URL = '"+URL+"'";
		ResultSet rs = db.runSql(sql);
		if(!rs.next()){
			//store the URL to database to avoid parsing again
			sql = "INSERT INTO `Crawler`.`Record` (`URL`) VALUES (?);";
			PreparedStatement stmt = db.conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
			stmt.setString(1, URL);
			stmt.execute();

			// get useful information; note that we fetch the URL that was
			// passed in, not a hard-coded address, so the crawl moves forward
			Document doc = Jsoup.connect(URL).get();

			if(doc.text().contains("research")){
				System.out.println(URL);
			}

			// get all links and recursively call the processPage method;
			// "abs:href" resolves relative links to absolute URLs
			Elements links = doc.select("a[href]");
			for(Element link: links){
				if(link.attr("abs:href").contains("mit.edu"))
					processPage(link.attr("abs:href"));
			}
		}
	}
}

Now you have your own Web crawler. Of course, you will also need to filter out links you don't want to crawl; see the sketch below.
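
As a starting point, here is a sketch of such a filter. The helper name shouldCrawl and the exact rules are my own illustration, so adjust them to your needs: it skips non-HTTP(S) schemes such as mailto: and javascript:, plus file types that JSoup cannot parse as HTML (fetching a PDF, for example, throws the UnsupportedMimeTypeException mentioned in the comments below).

public static boolean shouldCrawl(String url) {
	String lower = url.toLowerCase();
	// only follow regular web links
	if (!lower.startsWith("http://") && !lower.startsWith("https://")) {
		return false;
	}
	// skip links to content JSoup cannot parse as HTML
	return !(lower.endsWith(".pdf") || lower.endsWith(".jpg")
			|| lower.endsWith(".png") || lower.endsWith(".gif")
			|| lower.endsWith(".zip"));
}

In processPage, you would then recurse only when shouldCrawl(link.attr("abs:href")) returns true.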

The following is the output when I ran the code on May 26, 2014.

[Image: web-crawler-java-result]
Let me know if this is not the simplest crawler in the world 🙂

Java Crawler Source Code Download
Java Crawler on GitHub


  1. Gaurav Mishra on 2013-9-18

    Jsoup.connect("http://www.udel.edu/").get();

    This line is throwing an exception; could you tell me why, and how to handle it?

  2. Narayan Prusty on 2013-10-9

    https://www.udemy.com/building-a-search-engine/ this tutorial explains everything

  3. smily on 2013-10-9

    plz try Jsoup.connect("url").timeout(0).get()

  4. Dwaraka on 2013-10-16

    thank you…
    I'm working on a project to develop a web crawler for YouTube,
    so can you help me or give me some ideas?


  6. disqus_Ex0bamMMcH on 2013-10-24

    import org.jsoup.Jsoup; please provide a link for this API

  7. Ahrusin on 2013-11-1

    Thank you for this :), but you are only processing the links on the first URL, http://www.mit.edu/, because in the Jsoup.connect call it should be Jsoup.connect(URL) instead of Jsoup.connect("http://www.mit.edu/").

  8. Juan on 2013-12-2

    When I run Main, it only shows "conn built", and I see that the Record table stores all the URLs from the HTML page when I open it in phpMyAdmin. But nothing more is shown in the console.
    I tried changing "PhD" for another word.
    And something strange: when I ran it a few more times, it kept displaying all the URLs (with no filter) in the console. Don't know why it doesn't do that anymore. Now when I try, it only does what I said at the beginning.

  9. swarnima on 2014-1-3

    If I am using DB2, then what do I have to write in place of the

    ResultSet runSql(String sql) and public boolean runSql2(String sql) methods?

  10. [email protected] on 2014-1-12

    How can I download all images from this site using Java?
    http://gaga.vn

  11. Beat Course Service on 2014-2-3

    I changed it to throws Exception instead of IOException. It is OK in my case.
    public static void main(String[] args) throws SQLException, Exception {
    db.runSql2("TRUNCATE Record;");
    processPage("http://www.beatcourse.com");

  12. Alessandro on 2014-2-26

    Hi. I want to thank you for this guide! It's been really helpful! Only one question:
    could you tell me what this code exactly means?

    sql = "INSERT INTO `Crawler`.`Record` " + "(`URL`) VALUES " + "(?);";

    My aim was to add the entire source code of each page, taken from the result of the .get(), into my database, and not only the URL. I've already made a new column called Source where I want to save it, but I can't find the right query.

    Thank you very much!

  13. Tushar Chawla on 2014-3-3

    Muahahahahahah!
    This is my lab assignment! Copy-Paste it!

  14. Patil Sir (Lab Assistant) on 2014-3-3

    To hell with you… don't pull these antics inside the lab.

  15. shahnaz on 2014-3-12

    How do I run this program? It shows error 404 not found.

  16. Computer Solutions on 2014-3-14

    sql = "INSERT INTO `Crawler`.`Record` " + "(`URL`) VALUES " + "(?);";

    This INSERTS the crawled URL into the "Record" TABLE belonging to the DATABASE called "Crawler"

  17. Kam on 2014-3-24

    lolll!!! ^^

  18. siddharth ganguly on 2014-4-10

    hey thanx a lot bro… i used to think that making a crawler is a big deal, you know.. but never knew it is that easy..
    it took me a couple of minutes to understand.. you made my life a lot easier… thanx n god bless.. cheers!!

  19. Giriraj Gupta on 2014-4-16

    Getting following exception when trying to get link to a pdf file

    org.jsoup.UnsupportedMimeTypeException: Unhandled content type. Must be text/*, application/xml, or application/xhtml+xml. Mimetype=application/pdf, URL=http://www.xyz.com/files/.pdf

    Any suggestions?

  20. Anshu on 2014-5-26

    Thanks nice tutorial.

    I had one question: if I use a web crawler and show that data or information on my own site, is it legal, or do we need to take permission from the URL owners?

    Thanks

  21. ryanlr on 2014-5-26

    I think quoting some part with a reference link is OK, but the whole page of content is not.

  22. Rituja Pawar on 2014-5-30

    Exception in thread "main" java.lang.NullPointerException in the following lines:
    Statement sta = conn.createStatement();
    db.runSql2("TRUNCATE Record;");

  23. Luiz Ramos on 2014-6-1

    Nice!

  24. Aastha on 2014-7-2

    This is amazing and so easy to implement. Thanks for sharing your knowledge.

  25. Aymën Charfi on 2014-7-12

    NetBeans' console displays nothing!!!

  26. Jose on 2014-7-12

    Wang, thanks. I tried the code and I am now searching all the pages I want and getting my results for each keyword I choose. This is very good, and I appreciate your gesture in schooling us.

  27. NotoriousZeus on 2014-7-17

    If you are using Wamp, I'm pretty sure you did not get rid of the password field mentioned in the DB class.

    It should be something like this: conn = DriverManager.getConnection(url, "root", "");

  28. Bryan on 2014-7-28

    Hi, I am in the mySQL database guide under step 2 and trying to create my new site in Apache2 (under Virtual Hosts). After I run the command gksudo gedit /etc/apache2/sites-available/mysite.conf and access the file, I can’t find the Directory directive to complete step 4. When I don’t follow his step, http://localhost doesn’t work, giving me the error that “I Don’t have access to this page.” Help!

  29. Zack on 2014-7-30

    Jsoup.connect(URL).ignoreContentType(true).get();

  30. java9 on 2014-8-28

    I can never find a Java source that shows a proper crawler. All of these are limited to domain-specific things. It would be cool if someone did a writeup of an actual commercial-style crawler (kind of the point of a crawler), because there are already a thousand simple programs for small-range collection.

  31. karan shukla on 2014-9-18

    If I want to crawl for a specific keyword inside the webpage, how can I do that? Can you please help me out on this one?

  32. Nathan on 2014-10-6

    This line might have a bug.

    if(link.attr("href").contains("mit.edu"))

    I think it should have abs: in it so that relative URLs are changed to include the domain.

    if(link.attr("abs:href").contains("mit.edu"))

  33. Patrick J on 2014-10-9

    Hey, it works for me except the db.runSql("TRUNCATE Record;"); I keep getting the error "Exception in thread "main" java.sql.SQLException: Can not issue data manipulation statements with executeQuery()." When I comment out the line, everything works great?

  34. xard on 2014-10-11

    I get a NullPointerException for the line: Statement sta = conn.createStatement();

    Does anyone know what to do?

  35. xard on 2014-10-11

    Eclipse says I do not get a response from the server…

  36. xard on 2014-10-11

    I do not get a response from the server, caused by the line: Statement sta = conn.createStatement(); (NullPointerException). Does anyone know what the problem might be?

  37. KelvinLegolas on 2014-10-22

    Do you have a java to mysql connector?

  38. KelvinLegolas on 2014-10-22

    It should be db.runSql2("TRUNCATE Record;"); since you are making changes to the table. Use executeQuery() only if you're just getting data from the table. 🙂

  39. KelvinLegolas on 2014-10-23

    I'm getting an error "java.net.ConnectException: Connection timed out: connect".

    I recently placed a timeout(0) before .get() on the Document. Are there any other solutions for this? My firewalls are completely shut off.

  40. ivusilvan on 2014-12-23

    I’m pretty sure you had to check the link ‘URL’ for the string “research”, instead of hard coding “www.mit.edu” for each recursive call. Good job on the crawler though

  41. Mehul Popat on 2015-1-6

    We can use Selenium to crawl a website.
    See the code below. It will open a Firefox window and crawl a website.

    I think this is the simplest way to do it.

    Hope this will help all of you.

    Thanks.

    – Mehul Popat

    import java.util.List;

    import org.openqa.selenium.By;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.WebElement;
    import org.openqa.selenium.firefox.FirefoxDriver;

    public class SimpleCrawler {

        public static void main(String[] args) {

            WebDriver wd = new FirefoxDriver();
            wd.get("http://www.mit.edu");

            List<WebElement> allLinks = wd.findElements(By.tagName("a"));

            for (int i = 0; i < allLinks.size(); i++) {
                if (allLinks.get(i).getAttribute("href").contains("mit.edu")) {
                    System.out.println(allLinks.get(i).getAttribute("href"));
                }
            }

            wd.close();
        }
    }

  43. Rajesh on 2015-2-11

    Access denied for user 'root'@'localhost' (using password: YES)
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4096)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4028)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:951)
    at com.mysql.jdbc.MysqlIO.proceedHandshakeWithPluggableAuthentication(MysqlIO.java:1717)
    at com.mysql.jdbc.MysqlIO.doHandshake(MysqlIO.java:1276)
    at com.mysql.jdbc.ConnectionImpl.coreConnect(ConnectionImpl.java:2395)
    at com.mysql.jdbc.ConnectionImpl.connectOneTryOnly(ConnectionImpl.java:2428)
    at com.mysql.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:2213)
    at com.mysql.jdbc.ConnectionImpl.<init>(ConnectionImpl.java:797)
    at com.mysql.jdbc.JDBC4Connection.<init>(JDBC4Connection.java:47)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
    at java.lang.reflect.Constructor.newInstance(Unknown Source)
    at com.mysql.jdbc.Util.handleNewInstance(Util.java:411)
    at com.mysql.jdbc.ConnectionImpl.getInstance(ConnectionImpl.java:389)
    at com.mysql.jdbc.NonRegisteringDriver.connect(NonRegisteringDriver.java:305)
    at java.sql.DriverManager.getConnection(Unknown Source)
    at java.sql.DriverManager.getConnection(Unknown Source)
    at DB.<init>(DB.java:11)
    at Main.<clinit>(Main.java:13)

    Exception in thread "main" java.lang.NullPointerException
    at DB.runSql2(DB.java:26)
    at Main.main(Main.java:16)

  44. Rajesh on 2015-2-11

    I'm getting this sort of error when running my Java code. Any ideas?

  45. Hoda on 2015-2-28

    Thanks, it was a very good starting point…

  46. SomebodyUnimportant on 2015-3-5

    For those new to this, here are some tips:

    1) To add JAR files, right-click the project name -> Properties -> Java Build Path -> Add External JARs…
    2) I had to add the mysql-connector JAR file as well
    3) Change the username/password (I'm not sure what it is) from "admin213" to "". This means the line should be: conn = DriverManager.getConnection(url, "root", "");

  47. Henrik on 2015-3-8

    Doesn’t really crawl as it is the same page that is fetched every time. Use:

    Document doc = Jsoup.connect(URL).get();

  48. Henrik on 2015-3-8

    Also, using a database for tracking is a bit over-engineered, since it is cleared on startup. However, it will come in handy if you want to track which pages have been crawled when you stop the process and want to continue later on.

  49. Aishwarya on 2015-3-29

    After inserting 46 rows, it refuses to insert any more in the database. How do I insert the rest of the links?

  50. ssharma on 2015-4-16

    Hi,

    Thanks for this wonderful code that you have put up! Makes writing the first crawler very very easy!
    😀

    Well, the problem that I am currently facing is that I am not sure of how to distinguish between a page that has an article like: http://tech.firstpost.com/news-analysis/new-twist-trai-says-war-between-telco-media-house-caused-net-neutrality-debate-263618.html and a page that has many links to other articles like: http://tech.firstpost.com/.

    Is there some meta tag or some other mechanism that exists to distinguish between these two types of pages?

    I also want to filter out pages that have photos, slideshows, links to social media accounts, etc. Other than checking for words like 'photo', 'gallery', 'slideshow', 'plus.google.com' in the URL, is there any way of filtering these out using some smarter JSoup (or other API) utilities?

  51. Andy Wyne on 2015-5-5

    It worked!!
    THANX ALOT!!

  52. Adrian C on 2015-5-15

    Hi, I'm new to making web crawlers and am doing so for the final project in my class. I want my web crawler to take in an address from a user, plug it into maps.google.com, and then take the route time and length to use in calculations. How do I adapt the crawler you provided to do that? Or, if that's not possible, how do I write a crawler that can do that operation?

  53. Taylor Smith on 2015-5-19

    Hey, nice post. It’s worth mentioning in crawling you should parse the domain’s robots.txt first and create a URL exclusion set to make sure you don’t anger any webmasters 😉

  54. Ronak Sangani on 2015-6-25

    Changed to
    Document doc = Jsoup.connect(URL).timeout(0).get();

    still getting:
    Exception in thread "main" java.lang.IllegalArgumentException: usage: java -cp jsoup.jar org.jsoup.examples.HtmlToPlainText url [selector]
    at org.jsoup.helper.Validate.isTrue(Validate.java:45)
    at org.jsoup.examples.HtmlToPlainText.main(HtmlToPlainText.java:35)

  56. Anubhab Banerjee on 2015-7-2

    Whenever I am running the code, an error message shows: "java.sql.SQLException: Access denied for user 'root'@'localhost' (using password: YES)"… tell me what to do next.

  57. asher on 2015-7-8

    Console isn’t displaying anything when I run it… but I’m not getting any error messages either. Anyone know what’s going on?

  58. test on 2015-8-2

    Hi, I am also getting the same error. I have installed WAMP, but have no clue how to create records in it.

  59. kurian on 2015-8-6

    error: cannot find symbol

    public static DB db = new DB();

  60. Jeffry Copps on 2015-10-5

    Check your password for connecting with your database.
    The default password in WampServer is blank: "".

  61. Akshay Kolte on 2015-10-26

    Exception in thread "main" java.lang.IllegalArgumentException: usage: supply url to fetch
    at org.jsoup.helper.Validate.isTrue(Validate.java:45)
    at org.jsoup.examples.ListLinks.main(ListLinks.java:16)

  62. Tahseena Mahmud on 2015-11-25

    I cannot crawl any webpages other than http://www.mit.edu

  63. SyedDanishAli on 2016-1-4

    To all those guys who are having this error: it is a database-related error which states that the connection to the database was not successful, due to the incorrect parameters you supplied to make a connection, i.e.

    url="jdbc:mysql://localhost:3306/DatabaseName";
    String user = "WhateverYourUserNameIs", password = "WhateverYourPasswordis";
    Connection conn = DriverManager.getConnection(url,user,password);

  64. SyedDanishAli on 2016-1-4

    Hello Taylor,
    Can you please elaborate this?

  65. Anurag Sharma on 2016-4-2

    Can we use this program to crawl multiple websites, and not just one (www.mit.edu)?

  66. kashif on 2016-4-20

    Hi!
    Thanks for the help.
    I am able to write data to a file, but it shows "ClassNotFoundException" when trying to use the database.
    How do I solve that issue?
    I searched the net but did not get any satisfactory solutions.
    Please help.

  67. kashif on 2016-4-23

    How do I extract and store images and emails from a webpage to my database…
    I have looked through the web but was not able to get a solution… please help

  68. janati on 2016-5-13

    thank u very much, really very helpful

  69. Parth Patil on 2016-5-27

    I could run your program successfully, but when I change the URL it gives me an unknown host exception. Can you help?
    Also, how should one get the desired data from a URL? In my case I'm looking for a product name and its price in an Excel file.

  70. Aman Gupta on 2016-6-7

    I copied the exact same code, but I am getting only the links present in http://www.mit.edu/. It's not returning all pages that contain the string "research". Please help.

  71. Shivani Singhal on 2016-6-22

    I am getting this error, how do I resolve it?
    java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Unknown Source)
    at DB.<init>(DB.java:13)
    at Main.<clinit>(Main.java:13)
    Exception in thread "main" java.lang.NullPointerException
    at DB.runSql2(DB.java:30)
    at Main.main(Main.java:16)

  72. Manisha Sutar on 2016-6-30

    I want the content of that URL (e.g., contact details: name, address, number). How does it work?
    Can you help me out?

  73. sayar samanta on 2016-7-16

    java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:264)
    at com.sayar.home.wiki.search.db.DatabaseManager.<init>(DatabaseManager.java:15)
    at com.sayar.home.wiki.search.crawler.Main.<clinit>(Main.java:15)
    Exception in thread "main" java.lang.NullPointerException
    at com.sayar.home.wiki.search.db.DatabaseManager.runSql2(DatabaseManager.java:32)
    at com.sayar.home.wiki.search.crawler.Main.main(Main.java:18)

    Why am I having this error while compiling the program? Please help me.

  74. sdf on 2016-8-11

    Add mysql-connector-java-5.1.23-bin.jar in your lib folder

  75. Ashok Manghat on 2016-9-30

    Yes, I found it useful.

  76. Rahul chauhan on 2016-11-12

    Sir, which search technique is used in this program? I mean, like (BFS, DFS), which one is used?

  77. Deepak Kumar Verma on 2016-11-16

    By the code, if you look at this snippet:
    //get all links and recursively call the processPage method
    Elements questions = doc.select("a[href]");
    for(Element link: questions){
    if(link.attr("href").contains("mit.edu"))
    processPage(link.attr("abs:href"));
    }

    it is getting all the children of a node, and inside the loop recursively calling into one of those children. That means it goes deep first in the tree, so it is DFS.

  78. Deepak Kumar Verma on 2016-11-16

    Rather than complicating this with the use of a DB, you could have simply used some Java collection like a Set to store traversed URLs. I guess many are finding problems here with setting up the DB connectors.

  79. Catherine Yu on 2016-11-18

    hey, thanks so much for the upload!

    I am just wondering why I can only extract data from mit.edu?
    I changed the search word from "research" to "people" and used other websites, but nothing shows up.

    Only mit.edu works. Why?
    How can I use this program on other websites?

  80. Md Wasim Mallick on 2017-1-12

    it's working fine

  81. manish on 2017-2-21

    Can someone help me to extract data? In the above example he extracts links only;
    I want to extract research-based publication databases.
    Thank you

  82. Sunny Sudan on 2017-6-9
  83. Abdely on 2017-11-8

    Thank you for this simple but yet powerful crawling app. I was actually learning to develop a crawling program in Python, then I had the idea to look for the same kind of program in Java. I'm pretty satisfied with your blog post. Awesome work!

  84. Real Estate Consultant on 2017-11-13

    Thanks for posting the web crawler program. Thanks.
