wikireverse

Hadoop jobs for the WikiReverse project. They parse Common Crawl data for links to Wikipedia articles and are launched using the elasticrawl CLI tool.

Running using Elasticrawl

This section shows how to configure Elasticrawl and walks through parsing a small sample of data. Running the example requires an AWS account and costs between 40 and 80 cents.

$ ./elasticrawl init wikireverse-2014-52
Enter AWS Access Key ID:
Enter AWS Secret Access Key: 
…

Bucket s3://wikireverse-2014-52 created
Config dir /home/vagrant/.elasticrawl created
Config complete

Next, set the job steps in the Elasticrawl config to use the WikiReverse JAR and classes:

steps:
  parse:
    jar: 's3://wikireverse/jar/wikireverse-0.0.1.jar'
    class: 'org.wikireverse.commoncrawl.WikiReverse'
    input_filter: 'wat/*.warc.wat.gz'
    emr_config: #'s3://wikireverse/jar/parse-mapred-site.xml'
  combine:
    jar: 's3://wikireverse/jar/wikireverse-0.0.1.jar'
    class: 'org.wikireverse.commoncrawl.SegmentCombiner'
    input_filter: 'part-*'
    emr_config: #'s3://wikireverse/jar/combine-mapred-site.xml'
$ ./elasticrawl parse CC-MAIN-2014-52 --max-segments 2 --max-files 2
Segments
Segment: 1418802765002.8 Files: 176
Segment: 1418802765093.40 Files: 176

Job configuration
Crawl: CC-MAIN-2014-52 Segments: 2 Parsing: 2 files per segment

Cluster configuration
Master: 1 m1.medium  (Spot: 0.12)
Core:   2 m1.medium  (Spot: 0.12)
Task:   --
Launch job? (y/n)
y

Job: 1422436508058 Job Flow ID: j-2KMT57YJN4EJA
$ ./elasticrawl combine --input-jobs 1422436508058
No entry for terminal type "xterm";
using dumb terminal settings.
Job configuration
Combining: 2 segments

Cluster configuration
Master: 1 m1.medium  (Spot: 0.12)
Core:   2 m1.medium  (Spot: 0.12)
Task:   --
Launch job? (y/n)
y

Job: 1422438064880 Job Flow ID: j-1A6Q7LJ1G9TX

To convert the combined output to text, run the OutputToText class as a JAR step with these settings:

JAR Location: s3://wikireverse/jar/wikireverse-0.0.1.jar
Arguments: org.wikireverse.commoncrawl.OutputToText s3://wikireverse-2014-52/data/2-combine/1422438064880/part-* s3://wikireverse-2014-52/data/3-output/2014-52-test/
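
The arguments above follow the bucket layout used in this example (data/2-combine/<job>/ as input, data/3-output/<name>/ as output). As a minimal sketch, the helper below assembles them for a different bucket or combine job. The function name output_to_text_args is hypothetical and not part of elasticrawl or WikiReverse; it only prints strings and does not submit anything to AWS.

```shell
#!/bin/sh
# Hypothetical helper: print the three OutputToText arguments, one per line.
# Usage: output_to_text_args <bucket> <combine-job-id> <output-name>
output_to_text_args() {
  bucket=$1
  combine_job=$2
  output_name=$3
  # The glob stays quoted so the shell passes "part-*" through unexpanded.
  printf '%s\n' \
    "org.wikireverse.commoncrawl.OutputToText" \
    "s3://${bucket}/data/2-combine/${combine_job}/part-*" \
    "s3://${bucket}/data/3-output/${output_name}/"
}

# Reproduces the arguments shown above for the example combine job.
output_to_text_args wikireverse-2014-52 1422438064880 2014-52-test
```

Paste the printed values into the Arguments field separated by spaces, in the same order.
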
$ ./elasticrawl destroy
WARNING:
Bucket s3://wikireverse-2014-52 and its data will be deleted
Config dir /Users/ross/.elasticrawl will be deleted
Delete? (y/n)
y

Bucket s3://wikireverse-2014-52 deleted
Config dir /Users/ross/.elasticrawl deleted
Config deleted

Quickstart Build Instructions

$ mvn clean package
$ aws s3 cp ./target/wikireverse-0.0.1.jar s3://yourbucket/jar/new-1.0.jar
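
After uploading your own build, the jar: entries in the steps config shown earlier need to point at the new location. The sketch below is one way to do that with sed. The function name set_job_jar is hypothetical, and the file name jobs.yml inside the Elasticrawl config dir is an assumption, so check your config dir for the file that contains the steps: section.

```shell
#!/bin/sh
# Hypothetical helper: point every "jar:" entry in a steps file at a new JAR.
# Usage: set_job_jar <config-file> <s3-jar-uri>
set_job_jar() {
  config_file=$1
  jar_uri=$2
  tmp_file=$(mktemp) || return 1
  # Keep the indentation and the "jar:" key, replace only the quoted value.
  # Assumes the URI contains no "|" or "&" characters.
  sed "s|^\([[:space:]]*jar:\).*|\1 '${jar_uri}'|" "$config_file" > "$tmp_file" || return 1
  # Write back in place so the original file keeps its permissions.
  cat "$tmp_file" > "$config_file"
  rm -f "$tmp_file"
}
```

For example, assuming the steps live in ~/.elasticrawl/jobs.yml: set_job_jar ~/.elasticrawl/jobs.yml s3://yourbucket/jar/new-1.0.jar. Other keys such as class: and input_filter: are left untouched.
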

TODO

Thanks

License

This code is licensed under the MIT license.