The Penn Discourse Treebank 2.0 (PDTB) is an incredibly rich resource for studying not only the way discourse coherence is expressed but also how information about discourse commitments (content attribution) is conveyed linguistically. However, the file format and annotation methods of the standard distribution can be an obstacle to research with this resource. The goal of this code is to remove those obstacles.
pdtb2.csv.zip: Reformatted and repackaged corpus. This link is password protected. I will give out the password to people who have the requisite LDC license. Unzip the file to use it.
pdtb2.py: Python classes for working with the corpus in the CSV format.
pdtb2_functions.py: illustrations of how to use pdtb2.py.
pdtb-template.dot: template for the Graphviz output produced by pdtb2.py.
The main interface provided by pdtb2.py is the CorpusReader class:

from pdtb2 import CorpusReader

corpus = CorpusReader('pdtb2.csv')
The central method for CorpusReader objects is iter_data, which allows you to iterate through the data in the corpus. Intuitively, iter_data reads each row of the source CSV file and turns it into a Datum object, which has lots of methods and attributes for doing cool things. See pdtb2_functions.py for a simple illustration (counting datum.Relation instances). There is one Datum object per row of the CSV file, i.e., one per annotated relation in the corpus.
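As a sketch of the counting idiom, here is roughly what tallying datum.Relation values looks like. Since the real corpus requires an LDC license, the loop below runs over hypothetical stand-in objects with just a Relation attribute; with the real data you would iterate over CorpusReader('pdtb2.csv').iter_data() instead. The relation labels themselves are drawn from PDTB 2.0's actual inventory (Explicit, Implicit, AltLex, EntRel, NoRel).

```python
from collections import Counter

class FakeDatum:
    """Hypothetical stand-in for a pdtb2.Datum, carrying only Relation."""
    def __init__(self, relation):
        self.Relation = relation

# Stand-in data; with the licensed corpus, this would be
# CorpusReader('pdtb2.csv').iter_data(display_progress=False).
data = [FakeDatum(r) for r in
        ['Explicit', 'Implicit', 'Explicit', 'AltLex', 'EntRel', 'Implicit']]

# Tally the Relation attribute, exactly as you would for real Datum objects.
counts = Counter(d.Relation for d in data)
print(counts)
```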
Datum objects have a large number of attributes and methods; for full details, see the documentation in pdtb2.py. Here's a simple example of working with text and trees (row 17 is chosen because it's a manageable but illustrative case):
from pdtb2 import CorpusReader, Datum

iterator = CorpusReader('pdtb2.csv').iter_data(display_progress=False)
for _ in range(17):
    next(iterator)
d = next(iterator)

d.arg1_words()
['that', '*T*-1', 'hung', 'over', 'parts', 'of', 'the', 'factory', ',']

d.arg1_words(lemmatize=True)
['that', '*T*-1', 'hang', 'over', 'part', 'of', 'the', 'factory', ',']

d.arg1_pos(wn_format=True)
[('that', 'wdt'), ('*T*-1', '-none-'), ('hung', 'v'), ('over', 'in'),
 ('parts', 'n'), ('of', 'in'), ('the', 'dt'), ('factory', 'n'), (',', ',')]

d.arg1_pos(lemmatize=True)
[('that', 'wdt'), ('*T*-1', '-none-'), ('hang', 'v'), ('over', 'in'),
 ('part', 'n'), ('of', 'in'), ('the', 'dt'), ('factory', 'n'), (',', ',')]

len(d.Arg1_Trees)
5

for t in d.Arg1_Trees:
    t.pprint()
(WHNP-1 (WDT that))
(NP-SBJ (-NONE- *T*-1))
(VBD hung)
(PP-LOC (IN over) (NP (NP (NNS parts)) (PP (IN of) (NP (DT the) (NN factory)))))
(, ,)
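The tree strings printed above are standard Penn Treebank bracketings. pdtb2.py hands you parsed tree objects with their own leaf and traversal methods, so you rarely need to process the strings yourself; still, here is a minimal, dependency-free sketch of how the leaf tokens relate to the bracketed form (the helper tree_leaves is illustrative and not part of pdtb2.py):

```python
import re

def tree_leaves(bracketing):
    """Return the leaf tokens of a Penn Treebank-style bracketing.

    A token immediately after '(' is a node label (NP, WDT, ...);
    any other non-paren token is a leaf.
    """
    tokens = re.findall(r'\(|\)|[^\s()]+', bracketing)
    leaves = []
    for prev, tok in zip([''] + tokens, tokens):
        if tok not in '()' and prev != '(':
            leaves.append(tok)
    return leaves

print(tree_leaves('(WHNP-1 (WDT that))'))  # prints ['that']
```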
There are similarly named methods for Sups, connectives, and attributions.
The GornList attributes are for connecting with the original Penn Treebank files. The relevant material is already inserted into the CSV file and accessible via the tree attributes (e.g., Arg1_Trees), so you probably won't need them, but they are there in case you need to connect with the external files.
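For readers unfamiliar with Gorn addresses: a Gorn address is a list of child indices tracing a path from the root of a tree to a node. As an illustrative sketch only (pdtb2.py resolves GornList addresses against the Penn Treebank parses for you), representing trees as (label, children) pairs, a node can be fetched like this:

```python
def gorn_fetch(tree, address):
    """Follow a Gorn address (a list of child indices) down a tree
    represented as a (label, children) pair; return the node's label."""
    label, children = tree
    for i in address:
        label, children = children[i]
    return label

# A toy parse: (S (NP (DT the) (NN factory)) (VP (VBD closed)))
toy = ('S', [('NP', [('DT', []), ('NN', [])]),
             ('VP', [('VBD', [])])])

print(gorn_fetch(toy, []))      # root: prints S
print(gorn_fetch(toy, [0, 1]))  # second child of first child: prints NN
```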
There's a much fuller overview here: http://compprag.christopherpotts.net/pdtb.html