Ocular

Ocular is a state-of-the-art historical OCR system.

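Its primary features include unsupervised font training that requires no gold transcriptions, support for multilingual (code-switching) documents, and optional glyph substitution modeling that yields both diplomatic (literal) and normalized transcriptions.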

It is described in the following publications:

Unsupervised Transcription of Historical Documents [pdf]
Taylor Berg-Kirkpatrick, Greg Durrett, and Dan Klein
ACL 2013

Improved Typesetting Models for Historical OCR [pdf]
Taylor Berg-Kirkpatrick and Dan Klein
ACL 2014

Unsupervised Code-Switching for Multilingual Historical Document Transcription [pdf] [data]
Dan Garrette, Hannah Alpert-Abrams, Taylor Berg-Kirkpatrick, and Dan Klein
NAACL 2015

An Unsupervised Model of Orthographic Variation for Historical Document Transcription [pdf] [data]
Dan Garrette and Hannah Alpert-Abrams
NAACL 2016

Continued development of Ocular is supported in part by a Digital Humanities Implementation Grant from the National Endowment for the Humanities for the project Reading the First Books: Multilingual, Early-Modern OCR for Primeros Libros.

Contents of this README

  1. Quick-Start Guide
  2. All Command-Line Options

1. Quick-Start Guide

Obtaining Ocular

The easiest way to get the Ocular software is to download the self-contained jar from http://www.dhgarrette.com/maven-repository/snapshots/edu/berkeley/cs/nlp/ocular/0.3-SNAPSHOT/ocular-0.3-SNAPSHOT-with_dependencies.jar

Once you have this jar, you will be able to run Ocular according to the instructions below in the Using Ocular section; the code in this repository is not a requirement if all you'd like to do is run the software.

The jar is executable, so when you want to use Ocular, run it following this template (where [MAIN-CLASS] specifies which program to run, as detailed in the Using Ocular section below):

java -Done-jar.main.class=[MAIN-CLASS] -mx7g -jar ocular-0.3-SNAPSHOT-with_dependencies.jar [options...]

This jar includes all the necessary dependencies, so you should be able to move it to, and run it from, wherever you like.
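Since every program is launched with the same template, a small wrapper function can save typing. The following is a minimal sketch, not part of Ocular: the function name ocular and the variables OCULAR_JAR, OCULAR_MEM, and OCULAR_DRY_RUN are hypothetical, and the package prefix edu.berkeley.cs.nlp.ocular.main is taken from the commands in the Using Ocular section.

```shell
# Hypothetical settings; adjust the jar location and heap size as needed.
OCULAR_JAR="${OCULAR_JAR:-ocular-0.3-SNAPSHOT-with_dependencies.jar}"
OCULAR_MEM="${OCULAR_MEM:-7g}"

# ocular MAIN-CLASS-SHORT-NAME [options...]
# Builds the java command from the template above and either runs it or,
# when OCULAR_DRY_RUN=1, only prints it.
ocular() {
  main_class="edu.berkeley.cs.nlp.ocular.main.$1"
  shift
  set -- java "-Done-jar.main.class=${main_class}" "-mx${OCULAR_MEM}" -jar "${OCULAR_JAR}" "$@"
  if [ "${OCULAR_DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"    # print the command without running it
  else
    "$@"                  # run the command
  fi
}

# Dry run: show the command that would be executed.
OCULAR_DRY_RUN=1
ocular InitializeFont -inputLmPath lm/trilingual.lmser -outputFontPath font/trilingual-init.fontser
```

Setting OCULAR_DRY_RUN back to 0 (or unsetting it) makes the same call launch Java.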

Optional: Building Ocular from source code

Clone this repository, and compile the project into a jar:

git clone https://github.com/tberg12/ocular.git
cd ocular
./make_jar.sh

This creates the same ocular-0.3-SNAPSHOT-with_dependencies.jar file discussed above, so it is sufficient for running Ocular with the detailed instructions in the Using Ocular section below.

As above, since this jar includes all the necessary dependencies, you should be able to move it wherever you like, without the rest of the contents of this repository.

Compiling to an executable script instead of jar

Alternatively, if you do not wish to create the entire jar, you can run make_run_script.sh, which compiles the code and generates an executable script, target/start. This script can be used directly in lieu of the jar file: run make_run_script.sh once, and then use the following template instead of the one given above:

export JAVA_OPTS="-mx7g"     # Increase the available memory
target/start [MAIN-CLASS] [options...]

Optional: Obtaining Ocular via a dependency management system

To incorporate Ocular into a larger project, you may use a dependency management system like Maven or SBT with the following information:

Repository location: http://www.dhgarrette.com/maven-repository/snapshots
Group ID: edu.berkeley.cs.nlp
Artifact ID: ocular
Version: 0.3-SNAPSHOT
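As a rough illustration of how these coordinates fit together, the sketch below assembles Maven's group:artifact:version string and prints a dependency:get command that would fetch the artifact into the local repository. The variable names are illustrative only, and whether your Maven installation accepts this repository URL is an assumption to check locally.

```shell
# Coordinates copied from the listing above.
REPO_URL="http://www.dhgarrette.com/maven-repository/snapshots"
GROUP_ID="edu.berkeley.cs.nlp"
ARTIFACT_ID="ocular"
VERSION="0.3-SNAPSHOT"

# Maven's group:artifact:version coordinate string.
COORDS="${GROUP_ID}:${ARTIFACT_ID}:${VERSION}"

# Print the command rather than running it, so it can be reviewed first.
echo mvn dependency:get "-DremoteRepositories=${REPO_URL}" "-Dartifact=${COORDS}"
```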

Using Ocular

  1. Initialize a language model:

    Acquire some files with text written in the language(s) of your documents. For example, download a book in English. The path specified by -inputTextPath should point to a text file or directory or directory hierarchy of text files; the path will be searched recursively for files. Use -outputLmPath to specify where the trained LM should be written.

    java -Done-jar.main.class=edu.berkeley.cs.nlp.ocular.main.InitializeLanguageModel -mx7g -jar ocular-0.3-SNAPSHOT-with_dependencies.jar \
      -inputTextPath texts/pg2600.txt \
      -outputLmPath lm/english.lmser

    For a multilingual (code-switching) model, specify multiple -inputTextPath entries, each composed of a language name and a path to files containing text in that language. For example, a combined Spanish/Latin/Nahuatl model might be trained as follows:

    java -Done-jar.main.class=edu.berkeley.cs.nlp.ocular.main.InitializeLanguageModel -mx7g -jar ocular-0.3-SNAPSHOT-with_dependencies.jar \
      -inputTextPath "spanish->texts/sp/,latin->texts/la/,nahuatl->texts/na/" \
      -outputLmPath lm/trilingual.lmser

    This program will work with any languages, and any number of languages; simply add an entry for every relevant language. The set of languages chosen should match the set of languages found in the documents that are to be transcribed.

    More details on the various command-line options can be found below.
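When many languages are involved, the multilingual -inputTextPath value from step 1 can be assembled rather than typed by hand. This is a minimal sketch: build_input_text_path is a hypothetical helper, and its "language=path" argument form is an assumption made for the example. Only the resulting "language->path" entries joined by commas follow the format shown above.

```shell
# build_input_text_path LANG=PATH [LANG=PATH ...]
# Joins the pairs into "lang->path,lang->path,...".
build_input_text_path() {
  result=""
  for pair in "$@"; do
    lang="${pair%%=*}"    # text before the first '='
    path="${pair#*=}"     # text after the first '='
    entry="${lang}->${path}"
    if [ -z "$result" ]; then
      result="$entry"
    else
      result="${result},${entry}"
    fi
  done
  printf '%s\n' "$result"
}

build_input_text_path spanish=texts/sp/ latin=texts/la/ nahuatl=texts/na/
# prints: spanish->texts/sp/,latin->texts/la/,nahuatl->texts/na/
```

The printed string should be quoted when it is passed on the command line, because it contains the > character.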

  2. Initialize a font:

    Before a font can be trained from texts, a font model consisting of a "guess" for each character must be initialized based on the fonts on your computer. Use -outputFontPath to specify where the initialized font should be written. Since different languages use different character sets, a language model must be given in order for the system to know what characters to initialize (-inputLmPath).

    java -Done-jar.main.class=edu.berkeley.cs.nlp.ocular.main.InitializeFont -mx7g -jar ocular-0.3-SNAPSHOT-with_dependencies.jar \
      -inputLmPath lm/trilingual.lmser \
      -outputFontPath font/trilingual-init.fontser

  3. Train a font:

    To train a font, a set of document pages must be given (-inputDocPath), along with the paths to the language model and initial font model. Use -outputFontPath to specify where the trained font model should be written, and -outputPath to specify where transcriptions and (optional) evaluation metrics should be written. The path specified by -inputDocPath should point to a pdf or image file or directory or directory hierarchy of such files. The value given by -inputDocPath will be searched recursively for non-.txt files; the transcriptions written to the -outputPath will maintain the same directory hierarchy.

    java -Done-jar.main.class=edu.berkeley.cs.nlp.ocular.main.TrainFont -mx7g -jar ocular-0.3-SNAPSHOT-with_dependencies.jar \
      -inputFontPath font/trilingual-init.fontser \
      -inputLmPath lm/trilingual.lmser \
      -inputDocPath sample_images/advertencias \
      -numDocs 10 \
      -outputFontPath font/advertencias/trained.fontser \
      -outputPath train_output

    Since the font trainer takes in a font model (-inputFontPath) and outputs a new and improved font model (-outputFontPath), TrainFont can be run multiple times, passing the output of one round back in as the input of the next, to continue making improvements.

    Many more command-line options, including several that affect speed and accuracy, can be found below.

    Optional: Glyph substitution modeling for variable orthography

    Ocular has the optional ability to learn, unsupervised, a mapping from archaic orthography to the orthography reflected in the trained language model. We call this a "glyph substitution model" (GSM). To train a GSM, add the -allowGlyphSubstitution, -updateGsm and -outputGsmPath options. If no -inputGsmPath is given, a new GSM will be created and then trained along with the font.

    java -Done-jar.main.class=edu.berkeley.cs.nlp.ocular.main.TrainFont -mx7g -jar ocular-0.3-SNAPSHOT-with_dependencies.jar \
      -inputFontPath font/trilingual-init.fontser \
      -inputLmPath lm/trilingual.lmser \
      -inputDocPath sample_images/advertencias \
      -numDocs 10 \
      -outputFontPath font/advertencias/trained.fontser \
      -outputPath train_output \
      -allowGlyphSubstitution true \
      -updateGsm true \
      -outputGsmPath gsm/advertencias/trained.gsmser

    If -allowGlyphSubstitution is set to true, Ocular will produce simultaneous dual transcriptions: one diplomatic (literal) and one normalized to match the LM training data's orthography.
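Running TrainFont for several rounds, as described in step 3, amounts to a loop in which each round's -outputFontPath becomes the next round's -inputFontPath. The sketch below only prints the commands. The number of rounds and the round-N file and directory names are assumptions chosen for illustration, not Ocular conventions.

```shell
ROUNDS=3                                  # hypothetical number of training rounds
in_font="font/trilingual-init.fontser"    # start from the initialized font
round=1
while [ "$round" -le "$ROUNDS" ]; do
  out_font="font/advertencias/round-${round}.fontser"
  # Remove "echo" to actually run the trainer.
  echo java -Done-jar.main.class=edu.berkeley.cs.nlp.ocular.main.TrainFont -mx7g \
    -jar ocular-0.3-SNAPSHOT-with_dependencies.jar \
    -inputFontPath "$in_font" \
    -inputLmPath lm/trilingual.lmser \
    -inputDocPath sample_images/advertencias \
    -numDocs 10 \
    -outputFontPath "$out_font" \
    -outputPath "train_output/round-${round}"
  in_font="$out_font"                     # feed this round's output into the next round
  round=$((round + 1))
done
```

Writing each round to its own file keeps earlier models available for comparison; reusing a single output path, as the Transcribe example with -updateFont does, is the other option.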

  4. Transcribe some pages:

    To transcribe pages, -inputFontPath should point to the newly-trained font model (the -outputFontPath from the training step, instead of the "initial" font model used during font training).

    java -Done-jar.main.class=edu.berkeley.cs.nlp.ocular.main.Transcribe -mx7g -jar ocular-0.3-SNAPSHOT-with_dependencies.jar \
      -inputDocPath sample_images/advertencias \
      -inputLmPath lm/trilingual.lmser \
      -inputFontPath font/advertencias/trained.fontser \
      -outputPath transcribe_output

    As above, if -allowGlyphSubstitution is set to true and the -inputGsmPath is given, Ocular will produce simultaneous dual transcriptions: one diplomatic (literal) and one normalized to match the LM training data's orthography.

    Many more command-line options, including several that affect speed and accuracy, can be found below. Among these, -skipAlreadyTranscribedDocs might be particularly useful.

    Optional: Continued model improvements during transcription

    Since training a model is done in an unsupervised fashion (it requires no gold transcriptions), the operation of transcribing is actually a subset of EM font training. Because of this, it is possible to make further improvements to the models during transcription, without having to make multiple iterations over the documents. This can be done by setting -updateFont to true, and -updateDocBatchSize to a reasonable number of training documents:

    java -Done-jar.main.class=edu.berkeley.cs.nlp.ocular.main.Transcribe -mx7g -jar ocular-0.3-SNAPSHOT-with_dependencies.jar \
      -inputDocPath sample_images/advertencias \
      -inputLmPath lm/trilingual.lmser \
      -inputFontPath font/advertencias/trained.fontser \
      -outputPath transcribe_output \
      -updateFont true \
      -updateDocBatchSize 50 \
      -outputFontPath font/advertencias/trained.fontser

    The same can be done to update the glyph substitution model by passing in the previously-trained model (-inputGsmPath) and setting -updateGsm to true.

    -allowGlyphSubstitution true \
    -inputGsmPath gsm/advertencias/trained.gsmser \
    -updateGsm true \
    -outputGsmPath gsm/advertencias/trained.gsmser

    Optional: Checking accuracy with a gold transcription

    If a gold standard transcription is available for a file, it should be written in a .txt file in the same directory as the corresponding image, and given the same filename (but with a different extension). These files will be used to evaluate the accuracy of the transcription (during either training or testing). Likewise, if a gold normalized transcription is available, it should be given the same filename, but with _normalized appended. For example:

    path/to/some/image_001.jpg              # document image
    path/to/some/image_001.txt              # corresponding transcription
    path/to/some/image_001_normalized.txt   # corresponding normalized transcription

    For pdf files, the transcription filename is based on both the pdf filename and the relevant page number (as a 5-digit number):

    path/to/some/filename.pdf                            # document image
    path/to/some/filename_pdf_page00001.txt              # transcription of the document's first page
    path/to/some/filename_pdf_page00001_normalized.txt   # corresponding normalized transcription
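The naming rules for gold transcriptions can be expressed as a small helper that prints the .txt path expected for a given image or pdf page. This is a sketch based only on the examples above: gold_path is a hypothetical name, and it assumes the document filename has an extension and that page numbers are passed without leading zeros.

```shell
# gold_path FILE [PAGE] [normalized]
#   FILE   image or pdf path
#   PAGE   1-based page number, used only for .pdf files (defaults to 1)
#   third argument "normalized" selects the normalized transcription
gold_path() {
  file="$1"
  page="${2:-1}"
  kind="${3:-}"
  base="${file%.*}"    # path without the final extension
  case "$file" in
    *.pdf) base="$(printf '%s_pdf_page%05d' "$base" "$page")" ;;
  esac
  if [ "$kind" = "normalized" ]; then
    base="${base}_normalized"
  fi
  printf '%s.txt\n' "$base"
}

gold_path path/to/some/image_001.jpg
# prints: path/to/some/image_001.txt
gold_path path/to/some/filename.pdf 1 normalized
# prints: path/to/some/filename_pdf_page00001_normalized.txt
```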

2. All Command-Line Options

InitializeLanguageModel

Required
Additional Options
Rarely Used Options

InitializeFont

Required
Additional Options
Rarely Used Options

TrainFont

Main Options
Additional Options

These options affect the speed of font training

Glyph Substitution Model Options

Glyph substitution is the feature that allows Ocular to use a probabilistic mapping from modern orthography (as used in the language model training text) to the orthography seen in the documents. If the glyph substitution feature is used, Ocular will jointly produce dual transcriptions: one that is an exact transcription of the document, and one that is a normalized version of the text.

Language Model Training Options
Line Extraction Options
Evaluate During Training
Rarely Used Options

Transcribe

Main Options
Additional Options

These options affect the speed of transcription

Glyph Substitution Model Options

Glyph substitution is the feature that allows Ocular to use a probabilistic mapping from modern orthography (as used in the language model training text) to the orthography seen in the documents. If the glyph substitution feature is used, Ocular will jointly produce dual transcriptions: one that is an exact transcription of the document, and one that is a normalized version of the text.

Model Updating Options

For updating the font model

For updating the glyph substitution model

For updating the language model

Line Extraction Options
Evaluate During Training
Rarely Used Options