There is some Python 2 support scattered throughout but the library has not been fully tested against it.
This library is in development -- APIs may change and features may be unstable.
broca is a NLP library for experimenting with various approaches. So everything in this library is somewhat experimental and meant for rapid prototyping of NLP methods.
When I implement a new method, often from a paper or another source, I add it here so that it can be re-applied elsewhere.
Eventually I hope that
broca can become a battery of experimental NLP methods which can easily be thrown at a new problem.
broca is structured like so:
common: misc utilities and classes reused across the whole library. Also includes shared objects.
distance: for measuring string distance. This should probably be renamed though, since "distance" means a lot more than just string distance.
tokenize: various tokenization methods
keyword: keyword-based tokenization methods (i.e. keyword extraction methods)
vectorize: various ways of representing documents as vectors
similarity: various ways of computing similarity
term: for computing similarity between two terms
doc: for computing similarity matrices for sets of documents
preprocess: for preprocessing text, i.e. cleaning
knowledge: tools for preparing or incorporating external knowledge sources, such as Wikipedia or IDF on auxiliary corpora
pipeline: for easily chaining
brocaclasses into pipelines - useful for rapid prototyping
broca is available through pypi, but the library is under active development, so it's recommended to install via git:
$ pip install git+ssh://email@example.com/ftzeng/broca.git
Or, if adding to a
requirements.txt, add the line:
If developing, you can clone the repo and from within the repo directory, install via
$ pip install --editable .
Your installed version will be aliased directly from the repo directory, so changes are always immediately accessible.
You also need to install the
nltk libraries' data:
$ python -m spacy.en.download $ python -m nltk.downloader all
You can use
broca's module conventionally, or you can take advantage of its pipelines:
from broca import Pipeline from broca.preprocess import BasicCleaner, HTMLCleaner from broca.vectorize import BoWVectorizer, DCSVectorizer p = Pipeline( HTMLCleaner(), BasicCleaner(), BoWVectorizer() ) vecs = p(docs)
Pipelines allow you to chain
broca's objects and easily swap them out.
Pipelines are validated upon creation to ensure that the outputs and inputs of adjacent components ("pipes") are compatible.
You can also build multi-pipelines to try out a variety of pipelines simultaneously:
p = Pipeline( HTMLCleaner(), BasicCleaner(), [BoWVectorizer(), DCSVectorizer()] ) vecs1, vecs2 = p(docs)
This results in two pipelines which are run simultaneously when
p(docs) is executed:
HTMLCleaner() -> BasicCleaner() -> BoWVectorizer()
HTMLCleaner() -> BasicCleaner() -> DCSVectorizer()
You can also nest pipelines and multi-pipelines:
clean = Pipeline( HTMLCleaner(), BasicCleaner(), ) vectr_pipeline = Pipeline( clean, [BoWVectorizer(), DCSVectorizer()] ) vecs1, vecs2 = p(docs)
Pipes can support input from multiple pipes or output to multiple pipes simultaneously.
Where multi-pipelines create distinct and separate pipelines, a branching pipeline is a singular pipeline where its inputs get mapped at branching segments and reduced afterwards.
Branches are specified as tuples.
Here's an example:
class A(Pipe): input = Pipe.type.vals output = Pipe.type.vals def __call__(self, vals): return [v+1 for v in vals] class B(Pipe): input = Pipe.type.vals output = Pipe.type.vals def __call__(self, vals): return [v+2 for v in vals] class C(Pipe): input = Pipe.type.vals output = Pipe.type.vals def __call__(self, vals): return [v+3 for v in vals] class D(Pipe): input = Pipe.type.vals output = Pipe.type.vals def __call__(self, vals): return [v+4 for v in vals] class E(Pipe): input = (Pipe.type.vals, Pipe.type.vals, Pipe.type.vals) output = Pipe.type.vals def __call__(self, vals1, vals2, vals3): return [sum([v1,v2,v3]) for v1,v2,v3 in zip(vals1,vals2,vals3)] branching_pipeline = Pipeline( A(), (B(), C(), D()) # A branching segment (B(), C(), D()) # Another branching segment E() # Reduced ) branching_pipeline([1,2,3,4]) # [24,27,30,33]
Whatever follows a branching segment must accept multiple inputs - this could be a single Pipe or another branching segment of equal size.
A Pipe in the example could have had its output defined as a tuple:
class A(Pipe): input = Pipe.type.vals output = (Pipe.type.vals, Pipe.type.vals, Pipe.type.vals) def __call__(self, vals): return [v+1 for v in vals], [v+2 for v in vals], [v+3 for v in vals] # Running the above with A defined as such returns [27, 30, 33, 36] instead.
By default, pipelines are frozen - that is, each pipe's output memoized to disk based on the inputs it receives. If the input changes or the pipe's
__call__ method is redefined, its output will be recomputed; otherwise, it will be loaded from disk. This means you can easily swap out components in a pipeline without needing to redundantly recompute parts which are not affected.
You can disable this behavior for a pipeline by specifying
p = Pipeline( HTMLCleaner(), BasicCleaner(), freeze=False )
You can force the recomputation of an entire pipeline by specifying
p = Pipeline( HTMLCleaner(), BasicCleaner(), refresh=True )
Implementing your own pipeline component is easy. Just define a class which inherits from
broca.pipeline.Pipe and define its
__call__ method and
output class attributes, which should be from
The call method must take only two arguments:
self and then the input from the preceding pipe. If there are parameters to be specified, they should be handled in the pipe's
from broca import Pipe class MyPipe(Pipe): input = Pipe.type.docs output = Pipe.type.vecs def __init__(self, some_param): self.some_param = some_param def __call__(self, docs): # do something with docs to get vectors vecs = make_vecs_func(docs, self.some_param) return vecs
__init__ method saves the initialization
kwargs as properties by their key names, so you won't need to implement
__init__ if you only need it to pass arguments to
You can use anything for your input and output pipe types, e.g.
Pipe.type.hello_there. They are dynamically generated as needed.
Sometimes you need a pipe to pass along input unmodified.
For example, the
WikipediaSimilarity pipe takes in as input
(Pipe.type.docs, Pipe.type.tokens). You want to pass docs and tokenized versions of those docs.
This can be accomplished with branching and the
IdentityPipe, which requires you specify the input pipe type:
docs = [ 'I am a cat', 'I have a lion' ] p = Pipeline( (IdentityPipe(Pipe.type.docs), OverkillTokenizer()), WikipediaSimilarity() )
Some pipes, including
HTMLCleaner), have support for parallel processing (multiprocessing) which is activated by specifying the
n_jobs keyword argument as something other than 0, e.g.
n_jobs is less than 1, the total number of cores minus that value will be used. For instance, if
n_jobs=-1 and your machine has 4 cores, then 3 cores will be used.
There is a bit of an overhead to setup multiprocessing; the performance gains are only seen with larger amounts of data. There's also an additional memory cost for each separate process, so keep in mind that there is a speed/memory trade-off.
There are a few usage examples in the
Unit tests can be run using
$ nosetests tests