A toolkit for making domain-specific probabilistic parsers
Do you have domain-specific text data that would be much more useful if you could derive structure from the strings? This toolkit will help you create a custom NLP model that learns from patterns in real data and then uses that knowledge to process new strings automatically. All you need is some training data to teach your parser about its domain.
Given a string, a probabilistic parser will break it into labeled components. The parser uses conditional random fields to label components based on (1) features of the component string and (2) the order of labels.
A probabilistic parser is particularly useful for sets of strings that may have common structure/patterns, but which deviate from those patterns in ways that are difficult to anticipate with hard-coded rules.
For example, in most cases, addresses in the United States start with a street number. But there are exceptions: sometimes valid U.S. addresses deviate from this pattern (e.g., addresses starting with a building name or a P.O. box). Furthermore, addresses in real data sets often include typos and other errors. Because there are far too many patterns and possible typos to enumerate with hard-coded rules, a probabilistic parser is well suited to parsing U.S. addresses.
With a probabilistic approach (as opposed to a rule-based one), the parser can continually learn from new training data and thus continually improve its performance!
Some other examples of domains where a probabilistic parser can be useful:
Try out these parsers on our web interface!
For more details on each step, see the parserator documentation.
Initialize a new parser
pip install parserator
parserator init [YOUR PARSER NAME]
python setup.py develop
Configure the parser to your domain
Define features relevant to your domain
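Features are per-token properties that the CRF uses, alongside label order, to decide each component's label. The sketch below is illustrative only: the feature names and the helper token_features are hypothetical, though parserator modules conventionally expose a tokens2features function in this spirit.

```python
# Hypothetical per-token feature function, in the style of a
# parserator module's feature definitions (names are illustrative).
def token_features(token):
    return {
        'length': len(token),               # token length
        'digits': token.isdigit(),          # e.g. street numbers are all digits
        'capitalized': token[:1].isupper(), # proper nouns tend to be capitalized
        'has.period': '.' in token,         # abbreviations like "St."
        'abbrev': token.strip('.').lower(), # normalized token text
    }

def tokens2features(tokens):
    # The CRF consumes one feature dict per token, in sequence,
    # so label order can also inform the prediction.
    return [token_features(t) for t in tokens]
```

For example, tokens2features(['123', 'Main', 'St.']) yields three feature dicts, with digits set for the first token and has.period set for the last.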
Prepare training data
parserator label [infile] [outfile] [modulename]
parserator label unlabeled/rawstrings.csv labeled_xml/labeled.xml usaddress
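The labeling step writes training data as XML, where each training string is wrapped in tags naming its labeled components. The snippet below is a sketch of that shape using only the standard library; the tag names (AddressString, AddressNumber, etc.) are borrowed from usaddress as an example, and your module's label set will differ.

```python
import xml.etree.ElementTree as ET

# A tiny labeled-training-data example in the XML shape produced by
# `parserator label` (tag names here follow the usaddress label set).
labeled = """
<Collection>
  <AddressString>
    <AddressNumber>123</AddressNumber>
    <StreetName>Main</StreetName>
    <StreetNamePostType>St.</StreetNamePostType>
  </AddressString>
</Collection>
"""

root = ET.fromstring(labeled)
for string_elem in root:
    # Each child element is one (label, token) pair, in order.
    pairs = [(child.tag, child.text) for child in string_elem]
    print(pairs)
```

Inspecting the XML this way is handy for sanity-checking labels before training.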
Train your parser
parserator train [traindata] [modulename]
parserator train labeled_xml/labeled.xml usaddress
or parserator train "labeled_xml/*.xml" usaddress
Repeat steps 3-5 as needed!
Once you are able to create a model from training data, install your custom parser by running python setup.py develop.

Then, in a Python shell, you can import your parser and use the parse and tag methods to process new strings. For example, to use the probablepeople module:
>>> import probablepeople
>>> probablepeople.parse('Mr George "Gob" Bluth II')
[('Mr', 'PrefixMarital'), ('George', 'GivenName'), ('"Gob"', 'Nickname'), ('Bluth', 'Surname'), ('II', 'SuffixGenerational')]
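The parse method returns (token, label) pairs in order. If you want the result keyed by label instead, a small helper like the hypothetical one below (not part of the library) can group the pairs, similar in spirit to the tag method:

```python
from collections import OrderedDict

def group_by_label(parsed):
    # Collapse ordered (token, label) pairs into label -> joined tokens.
    grouped = OrderedDict()
    for token, label in parsed:
        grouped[label] = (grouped[label] + ' ' + token) if label in grouped else token
    return grouped

# The pairs returned by probablepeople.parse in the example above:
parsed = [('Mr', 'PrefixMarital'), ('George', 'GivenName'),
          ('"Gob"', 'Nickname'), ('Bluth', 'Surname'),
          ('II', 'SuffixGenerational')]
```

Here group_by_label(parsed)['Surname'] is 'Bluth', and the labels keep their original order.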
If something is not behaving intuitively, it is a bug and should be reported. Report an issue.
We welcome your ideas! You can make suggestions in the form of GitHub issues (bug reports, feature requests, general questions), or you can submit a code contribution via a pull request.
How to contribute code:
Copyright (c) 2016 DataMade. Released under the MIT License.