T-REx: A Large Scale Alignment of Natural Language with Knowledge Base Triples

This repository contains the extraction framework for the T-REx dataset. More details can be found at: T-REx Website

Paper accepted at LREC 2018: link

@inproceedings{DBLP:conf/lrec/ElSaharVRGHLS18,
  author    = {Hady ElSahar and
               Pavlos Vougiouklis and
               Arslen Remaci and
               Christophe Gravier and
               Jonathon S. Hare and
               Fr{\'{e}}d{\'{e}}rique Laforest and
               Elena Simperl},
  title     = {T-REx: {A} Large Scale Alignment of Natural Language with Knowledge
               Base Triples},
  booktitle = {Proceedings of the Eleventh International Conference on Language Resources
               and Evaluation, {LREC} 2018, Miyazaki, Japan, May 7-12, 2018.},
  year      = {2018},
  crossref  = {DBLP:conf/lrec/2018},
  timestamp = {Fri, 18 May 2018 10:35:14 +0200},
  biburl    = {https://dblp.org/rec/bib/conf/lrec/ElSaharVRGHLS18},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Setup

For the English version of the dataset, run startup_multilang.sh

For other language versions of the dataset, run it with the corresponding language code (es, eo, and ar are supported), e.g. startup_multilang.sh es

To run the DBpedia Spotlight server on port 2222, run the following in a separate session:

cd dbpedia-spotlight
# the DBpedia Spotlight server needs at least 6 GB of RAM
java -Xmx6g -jar dbpedia-spotlight-latest.jar en http://localhost:2222/rest 
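
Once the server is running, documents can be annotated through its REST endpoint. Below is a minimal Python sketch of querying it, assuming the standard Spotlight /rest/annotate API; the sample text and confidence threshold are arbitrary.

import requests

# Usage sketch (not part of the pipeline): annotate a sample sentence with
# the local Spotlight server started above (standard Spotlight REST API assumed).
resp = requests.post(
    "http://localhost:2222/rest/annotate",
    data={"text": "Barack Obama was born in Hawaii.", "confidence": 0.5},
    headers={"Accept": "application/json"},
)
resp.raise_for_status()
for resource in resp.json().get("Resources", []):
    print(resource["@URI"], resource["@surfaceForm"], resource["@offset"])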

Knowledge Base Dumps

DBpedia

DBpedia triples and sameAs links are downloaded automatically by the setup.sh script

Wikidata

Downloaded automatically using the setup.sh script

Wikidata provides a tool for exporting RDF dumps.

Simple RDF dumps were used, in which each statement is represented as a single triple and statements with qualifiers are omitted: Wikidata RDF dumps 20160801

The sameAs links between Wikidata and DBpedia are already extracted and can be found on wikidata.dbpedia.org

The latest version used in this project is the one available on the extraction page from 20150330

The downloaded dump is available here: 20150330-sameas-all-wikis.ttl.bz2
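
For reference, here is a minimal Python sketch of loading these links into a dictionary; it assumes the dump contains N-Triples-style owl:sameAs lines and that the file name matches the dump linked above.

import bz2

# Sketch (assumption: one "<subject> <owl:sameAs> <object> ." triple per line)
sameas = {}
with bz2.open("20150330-sameas-all-wikis.ttl.bz2", "rt", encoding="utf-8") as f:
    for line in f:
        parts = line.strip().split(None, 2)
        if len(parts) < 3 or "sameAs" not in parts[1]:
            continue
        subject = parts[0].strip("<>")
        obj = parts[2].rstrip(" .").strip("<>")
        sameas[subject] = obj

print(len(sameas), "sameAs links loaded")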

Text Dumps

Wikipedia Articles dump

Go to ./datasets/Wikipedia/ and run "setup.sh"; the script will download the latest Wikipedia dump and extract the text of the articles.

DBpedia Abstracts dump

Go to ./datasets/wikipedia-abstracts/ and run "setup.sh"; the script will download the latest DBpedia abstracts dump from the DBpedia website and extract the abstract text.

Output Format

All modules in the pipeline take a single JSON file (as described below) and output the same file after filling in some of its attributes.

  {
        "docid":                   Document id     -- Wikipedia document id when dealing with wikipedia dump
        "title":                    title of the wikipedia document
        "uri":                      URI of the item containing the main page
        "text":                     The whole text of the document
        "sentences_boundaries":                start and end offsets of sentences
                                    [(start,end),(start,end)] start/ end are character indices
        "words_boundaries":                                      # list of tuples (start, end) of each word in Wikipedia Article, start/ end are character indices
        "entities":                                             # list of Entities   (Class Entity)
                                    [
                                    {
                                    "uri":
                                    "boundaries": (start,end)   # tuple containing the of the surface form of the entity
                                    "surface-form": ""
                                    "annotator" : ""            # the annotator name used to detect this entity [NER,DBpediaspotlight,coref]
                                    }
                                    ]
        "triples":                  list of triples that occur in the document
                                    We opt to keep them independent of the other fields so they are self-contained and easy to process
                                    [
                                    {
                                    "subject":          class Entity
                                    "predicate":        class Entity
                                    "object":           class Entity
                                    "dependency_path": "lexicalized dependency path between sub and obj if exists" or None (if not existing)
                                    "confidence":      # confidence of annotation if possible
                                    "annotator":       # annotator used to annotate this triple with the sentence
                                    "sentence_id":     # integer shows which sentence does this triple lie in
                                    }
                                    ]
    }
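
As an illustration, the sketch below reads one such output file and prints every aligned triple together with its sentence. The file name is a placeholder, and the fields follow the schema above; if a file holds a list of documents, wrap the code in a loop over that list.

import json

# Sketch: consume one pipeline output document (schema as described above).
# "document.json" is a placeholder file name.
with open("document.json", encoding="utf-8") as f:
    doc = json.load(f)

text = doc["text"]
for triple in doc["triples"]:
    start, end = doc["sentences_boundaries"][triple["sentence_id"]]
    print("sentence :", text[start:end])
    print("subject  :", triple["subject"]["surface-form"], triple["subject"]["uri"])
    print("predicate:", triple["predicate"]["uri"])
    print("object   :", triple["object"]["surface-form"], triple["object"]["uri"])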