使用方法请参考 crawler4j-simple 模块
crawler4j-core: crawler4j core module.
crawler4j-simple: a simple WEB crawler implementation base on crawler4j-core.
0.Install the git and maven command line:
yum install git
or: apt-get install git
cd ~
wget http://www.apache.org/dist/maven/maven-3/3.0.4/binaries/apache-maven-3.0.4-bin.tar.gz
tar zxvf apache-maven-3.0.4-bin.tar.gz
vi .bash_profile
- edit: export PATH=$PATH:~/apache-maven-3.0.4/bin
source .bash_profile
1.Checkout the crawler4j source code:
cd ~
git clone https://github.com/zhuoran/crawler4j.git
cd crawler4j
git checkout -b crawler4j-0.1
git checkout master
2.Import the source code to eclipse project:
cd crawler4j/crawler4j-simple
mvn eclipse:eclipse
Eclipse -> Menu -> File -> Import -> Exsiting Projects to Workspace -> Browse -> Finish
Edit Config:
crawler4j-simple/src/main/resources/crawler4j.xml
3.Build the binary package:
cd ~/crawler4j
mvn clean install -Dmaven.test.skip
ll
4.Install the crawler4j simple demo:
cd ~/crawler4j/crawler4j-simple/target
tar zxvf crawler4j-simple-0.1-assembly.tar.gz
cd crawler4j-simple-0.1/bin
./start.sh
Crawler Config Example :
- See crawler4j-simple/src/main/resources/crawler4j.xml
- See Jsoup selector-syntax
- You can configure multiple crawling tasks in crawler4j.xml
<site>
<charset>UTF-8</charset>
<name>Infoq-News</name>
<delay>1000</delay>
<url><![CDATA[http://www.infoq.com/infoq.action?newsidx=10]]></url>
<extract_links_elementId>div.box-content-3</extract_links_elementId>
<parser>me.zhuoran.crawler4j.simple.ParserDemo</parser>
<crawler>me.zhuoran.crawler4j.simple.InfoqCrawler</crawler>
</site>
[charset] For target html page charset,like UTF-8 GBK .
[name] Task name or thread name.
[delay] Each fetch of a html page delay some time, unit is ms.
[url] URL can be list page url or target page url.
URL can contain one or more list page url must use "," split.
URL can contain one or more target page url must use "|" split.
[extract_links_elementId] This tag can be a regex string or a jsoup query.
[parser] This is class full name of parser.
[crawler] This is class full name of cralwer.
For more, please refer to:
[Wiki](http://www.zhuoran.me/crawler4j/wiki) 准备中