Java Code Examples for edu.uci.ics.crawler4j.crawler.CrawlController#start()

The following examples show how to use edu.uci.ics.crawler4j.crawler.CrawlController#start(). They are drawn from open-source projects; you can view each full source file in the original project linked above each example.
Example 1
Source File: CrawlerController.java    From Java-for-Data-Science with MIT License
public static void main(String[] args) throws Exception {
  int numberOfCrawlers = 2;

  // Basic crawl configuration: storage folder, politeness delay, depth and page limits
  CrawlConfig config = new CrawlConfig();
  config.setCrawlStorageFolder("data");
  config.setPolitenessDelay(500);          // wait 500 ms between requests to the same host
  config.setMaxDepthOfCrawling(2);
  config.setMaxPagesToFetch(20);
  config.setIncludeBinaryContentInCrawling(false);

  // Page fetcher and robots.txt handling shared by all crawler threads
  PageFetcher pageFetcher = new PageFetcher(config);
  RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
  RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
  CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

  controller.addSeed("https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly");

  // Blocks until the crawl finishes
  controller.start(SampleCrawler.class, numberOfCrawlers);
}
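The SampleCrawler class referenced above is not shown in this excerpt. A minimal WebCrawler subclass might look like the following sketch (assuming the crawler4j 4.x API; the filter pattern and domain check are illustrative choices, not part of the original source):

import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class SampleCrawler extends WebCrawler {

    private static final Pattern BINARY =
            Pattern.compile(".*\\.(css|js|gif|jpe?g|png|pdf|zip)$");

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        // Skip obvious non-HTML resources and stay on the seed domain
        return !BINARY.matcher(href).matches()
                && href.startsWith("https://en.wikipedia.org/");
    }

    @Override
    public void visit(Page page) {
        // Called once per fetched page; print its URL and title
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData html = (HtmlParseData) page.getParseData();
            System.out.println(page.getWebURL().getURL() + " : " + html.getTitle());
        }
    }
}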
 
Example 2
Source File: HtmlCrawlerController.java    From tutorials with MIT License
public static void main(String[] args) throws Exception {
    File crawlStorage = new File("src/test/resources/crawler4j");
    CrawlConfig config = new CrawlConfig();
    config.setCrawlStorageFolder(crawlStorage.getAbsolutePath());
    config.setMaxDepthOfCrawling(2);

    int numCrawlers = 12;

    // Page fetcher and robots.txt handling shared by all crawler threads
    PageFetcher pageFetcher = new PageFetcher(config);
    RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
    RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
    CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

    controller.addSeed("https://www.baeldung.com/");

    // The factory lets every crawler thread share the same statistics object
    CrawlerStatistics stats = new CrawlerStatistics();
    CrawlController.WebCrawlerFactory<HtmlCrawler> factory = () -> new HtmlCrawler(stats);

    // start() blocks until the crawl finishes, so the stats below are final
    controller.start(factory, numCrawlers);
    System.out.printf("Crawled %d pages%n", stats.getProcessedPageCount());
    System.out.printf("Total number of outbound links = %d%n", stats.getTotalLinksCount());
}
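CrawlerStatistics is not part of crawler4j; it is a small counter shared by all crawler threads in this example. A minimal sketch (hypothetical field and method names, matching only the two getters used above; AtomicInteger is used here because twelve crawler threads update the counters concurrently) could look like:

import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stats holder shared by all crawler threads (not part of crawler4j)
public class CrawlerStatistics {
    private final AtomicInteger processedPageCount = new AtomicInteger();
    private final AtomicInteger totalLinksCount = new AtomicInteger();

    // Called from HtmlCrawler.visit() once per processed page
    public void incrementProcessedPageCount() {
        processedPageCount.incrementAndGet();
    }

    // Called with the number of outbound links found on a page
    public void incrementTotalLinksCount(int linksCount) {
        totalLinksCount.addAndGet(linksCount);
    }

    public int getProcessedPageCount() {
        return processedPageCount.get();
    }

    public int getTotalLinksCount() {
        return totalLinksCount.get();
    }
}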
 
Example 3
Source File: ImageCrawlerController.java    From tutorials with MIT License
public static void main(String[] args) throws Exception {
    File crawlStorage = new File("src/test/resources/crawler4j");
    CrawlConfig config = new CrawlConfig();
    config.setCrawlStorageFolder(crawlStorage.getAbsolutePath());
    // Images are binary content, so binary crawling must be enabled
    config.setIncludeBinaryContentInCrawling(true);
    config.setMaxPagesToFetch(500);

    // Directory where the ImageCrawler writes downloaded images
    File saveDir = new File("src/test/resources/crawler4j");

    int numCrawlers = 12;

    PageFetcher pageFetcher = new PageFetcher(config);
    RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
    RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
    CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

    controller.addSeed("https://www.baeldung.com/");

    CrawlController.WebCrawlerFactory<ImageCrawler> factory = () -> new ImageCrawler(saveDir);

    // Blocks until the crawl finishes
    controller.start(factory, numCrawlers);
}
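All three examples use the blocking start(), which returns only after every crawler thread has finished. crawler4j also offers a non-blocking variant; a sketch of replacing the final call above (fragment only, assuming the same controller and factory variables from Example 3):

// startNonBlocking() returns immediately; waitUntilFinish() joins the crawl later
controller.startNonBlocking(factory, numCrawlers);
// ... do other work while the crawl runs ...
controller.waitUntilFinish();
System.out.println("Crawl complete");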