QA catalogues - a metadata quality assessment tool for MARC records

This tool reads MARC dump files (in binary MARC or MARCXML format), analyses different aspects of their quality, and saves the results into CSV files. These CSV files can be used in different contexts; we provide a lightweight, web-based user interface for that.

Quick start guide


  1. wget
  2. unzip
  3. cd metadata-qa-marc-0.2-SNAPSHOT/


  1. cp
  2. nano

set your path to root MARC directories:

# the input directory, where your MARC dump files exist
# the output directory, where the output CSV files will land
  1. Create configuration based on some existing config files:
    • cp scripts/ scripts/[abbreviation-of-your-library].sh
    • edit scripts/[abbreviation-of-your-library].sh according to configuration guide

With docker

An experimental Docker image is publicly available on Docker Hub. This image contains Ubuntu 18.04 with Java, R and the current software. No installation is needed (given you have a running Docker environment). You only have to specify the directory on your local machine where the MARC files are located. The first time you run this command it downloads the Docker image, which takes a while. Once it is downloaded you are dropped into a bash shell (I denoted this with the # symbol), where you have to change directory to /opt/metadata-qa-marc, the location of the application.

> docker run -t -i -v [your-MARC-directory]:/opt/metadata-qa-marc/marc pkiraly/metadata-qa-marc /bin/bash
# cd /opt/metadata-qa-marc

Everything else works the same way as in other environments, so follow the next sections.

Note: at the time of writing, this Docker image doesn't contain the web user interface, only the command line interface (the content of this repository).


scripts/[abbreviation-of-your-library].sh all-analyses
scripts/[abbreviation-of-your-library].sh all-solr

For a catalogue with around 1 million records the first command will take 5-10 minutes, the latter 1-2 hours.


Prerequisites: Java 8 (I use OpenJDK), and Maven 3

  1. Optional step: clone and build the parent library, metadata-qa-api project:
git clone
cd metadata-qa-api
mvn clean install
cd ..
  2. Mandatory step: clone and build the current metadata-qa-marc project
git clone
cd metadata-qa-marc
mvn clean install

... or download

The released versions of the software are available from the Maven Central repository. The stable release (currently 0.1) is available from all Maven repositories; the developer version (0.2-SNAPSHOT) is available from the Sonatype Maven repository. The file you need is metadata-qa-marc-0.2-[timestamp]-1-jar-with-dependencies.jar.

Be aware that the developer version is not built automatically as a nightly build, so the latest features may not be available in it. If you want to use the latest version, build it yourself.

Since the jar file doesn't contain the helper scripts, you might also consider downloading them from this GitHub repository:


You should adjust common-script to point to the jar file you just downloaded.


Helper scripts

The tool comes with some bash helper scripts to run all these with default values. The generic scripts are located in the root directory, and the library-specific, configuration-like scripts are in the scripts directory. You can find predefined scripts for 19 library catalogues (if you want to run one, you have to configure it first).


scripts/[your script] [command]

The following commands are supported:

You can find information about these functionalities further down in this document.


  1. create the configuration file (

  2. edit the configuration file. Two lines are important here:

  3. edit the library-specific file

Here is an example file for analysing the Library of Congress' MARC records:

#!/usr/bin/env bash

. ./

. ./common-script

echo "DONE"
exit 0

Three variables are important here:

  1. NAME is a name for the output directory. The analysis result will land under $BASE_OUTPUT_DIR/$NAME directory
  2. MARC_DIR is the location of MARC files. All the files should be in the same directory
  3. MASK is a file mask, such as *.mrc or *.marc

You can also add any other parameters that this document mentions in the description of the individual commands, wrapped in the TYPE_PARAMS variable. For example, the Deutsche Nationalbibliothek's config file contains this:

TYPE_PARAMS="--marcVersion DNB --marcxml"

This line sets the DNB's MARC version (to cover fields defined within DNB's MARC version), and XML as input format.
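As a rough illustration of how the three variables and TYPE_PARAMS fit together, here is a minimal sketch. All values (`mylib`, the two base paths) are hypothetical placeholders, BASE_INPUT_DIR is an assumed name, and the output_dir helper exists only for this example. A real library script additionally sources the project's setup and common scripts, as in the example file above; this one only prints the directory where the results would land.

```shell
#!/usr/bin/env bash
# Hypothetical values for illustration only; adjust them to your environment.
BASE_INPUT_DIR=/data/marc          # assumed root directory of the MARC dumps
BASE_OUTPUT_DIR=/data/output       # assumed root directory of the results

NAME=mylib                                  # results land in $BASE_OUTPUT_DIR/$NAME
MARC_DIR=${BASE_INPUT_DIR}/${NAME}          # all MARC files sit in this one directory
MASK='*.mrc'                                # quoted, so the shell does not expand it here
TYPE_PARAMS="--marcVersion DNB --marcxml"   # extra parameters, as in the DNB example

# Helper for this sketch only: print the directory that will hold the results.
output_dir() {
  echo "${BASE_OUTPUT_DIR}/${NAME}"
}

output_dir
```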

Detailed instructions

We will use the same jar file in every command, so we save its path into a variable.

export JAR=target/metadata-qa-marc-0.2-SNAPSHOT-jar-with-dependencies.jar

Validating MARC records

java -cp $JAR de.gwdg.metadataqa.marc.cli.Validator [options] [file]

or with a bash script

./validator [options] [file]


The file argument might contain any wildcard the operating system supports ('*', '?', etc.)

It creates the file named in the fileName parameter.

Currently it detects the following errors:

Leader specific errors:

Control field specific errors:

Data field specific errors

Errors of specific fields:

An example:

Error in '   00000034 ': 
  110$ind1 has invalid code: '2'
Error in '   00000056 ': 
  110$ind1 has invalid code: '2'
Error in '   00000057 ': 
  082$ind1 has invalid code: ' '
Error in '   00000086 ': 
  110$ind1 has invalid code: '2'
Error in '   00000119 ': 
  700$ind1 has invalid code: '2'
Error in '   00000234 ': 
  082$ind1 has invalid code: ' '
Errors in '   00000294 ': 
  050$ind2 has invalid code: ' '
  260$ind1 has invalid code: '0'
  710$ind2 has invalid code: '0'
  710$ind2 has invalid code: '0'
  710$ind2 has invalid code: '0'
  740$ind2 has invalid code: '1'
Error in '   00000322 ': 
  110$ind1 has invalid code: '2'
Error in '   00000328 ': 
  082$ind1 has invalid code: ' '
Error in '   00000374 ': 
  082$ind1 has invalid code: ' '
Error in '   00000395 ': 
  082$ind1 has invalid code: ' '
Error in '   00000514 ': 
  082$ind1 has invalid code: ' '
Errors in '   00000547 ': 
  100$ind2 should be empty, it has '0'
  260$ind1 has invalid code: '0'
Errors in '   00000571 ': 
  050$ind2 has invalid code: ' '
  100$ind2 should be empty, it has '0'
  260$ind1 has invalid code: '0'

Some post-processing usage examples

After running the validation with tab-separated output written to validation-report.txt, you can:

get the number of errors:

wc -l validation-report.txt

get the number of records having errors

awk -F "\t" '{print $1}' validation-report.txt | uniq -c | wc -l
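The one-liner above relies on `uniq`, which only merges adjacent duplicates, so it assumes that all lines belonging to a record follow each other. A slightly more defensive sketch, which assumes only what the commands above already assume (the record identifier is the first tab-separated column of validation-report.txt), deduplicates with `sort -u` instead. The function name is made up for this example.

```shell
#!/usr/bin/env bash
# Count the distinct record identifiers in a tab-separated validation report.
# Assumption: the record identifier is the first column of every line.
count_records_with_errors() {
  # cut splits on tabs by default; tr strips the padding some wc versions add
  cut -f1 "$1" | sort -u | wc -l | tr -d ' '
}
```

It can be called as `count_records_with_errors validation-report.txt`.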

Display one MARC record

java -cp $JAR de.gwdg.metadataqa.marc.cli.Formatter [options] [file]

or with a bash script

./formatter [options] [file]

The output of the script is something like this one:

LEADER 01697pam a2200433 c 4500
001 1023012219
003 DE-101
005 20160912065830.0
007 tu
008 120604s2012    gw ||||| |||| 00||||ger  
015   $a14,B04$z12,N24$2dnb
016 7 $2DE-101$a1023012219
020   $a9783860124352$cPp. : EUR 19.50 (DE), EUR 20.10 (AT)$9978-3-86012-435-2
024 3 $a9783860124352
035   $a(DE-599)DNB1023012219
035   $a(OCoLC)864553265
035   $a(OCoLC)864553328
040   $a1145$bger$cDE-101$d1140
041   $ager
044   $cXA-DE-SN
082 04$81\u$a622.0943216$qDE-101$222/ger
083 7 $a620$a660$qDE-101$222sdnb
084   $a620$a660$qDE-101$2sdnb
085   $81\u$b622
085   $81\u$z2$s43216
090   $ab
110 1 $0(DE-588)4665669-8$0$0(DE-101)963486896$aHalsbrücke$4aut
245 00$aHalsbrücke$bzur Geschichte von Gemeinde, Bergbau und Hütten$chrsg. von der Gemeinde Halsbrücke anlässlich des Jubliäums "400 Jahre Hüttenstandort Halsbrücke". [Hrsg.: Ulrich Thiel]
264  1$a[Freiberg]$b[Techn. Univ. Bergakad.]$c2012
300   $a151 S.$bIll., Kt.$c31 cm, 1000 g
653   $a(Produktform)Hardback
653   $aGemeinde Halsbrücke
653   $aHüttengeschichte
653   $aFreiberger Bergbau
653   $a(VLB-WN)1943: Hardcover, Softcover / Sachbücher/Geschichte/Regionalgeschichte, Ländergeschichte
700 1 $0(DE-588)1113208554$0$0(DE-101)1113208554$aThiel, Ulrich$d1955-$4edt$eHrsg.
850   $aDE-101a$aDE-101b
856 42$mB:DE-101$qapplication/pdf$u$3Inhaltsverzeichnis
925 r $arb

Calculating simple completeness

java -cp $JAR de.gwdg.metadataqa.marc.cli.Completeness [options] [file]

or with a bash script

./completeness [options] [file]

The process will create two files in the output directory:

Calculating Thompson-Traill completeness

Kelly Thompson and Stacie Traill recently published their approach to calculating the quality of ebook records coming from different data sources. This command is an implementation of the scoring algorithm described in their article, Leveraging Python to improve ebook metadata selection, ingest, and management (Code4Lib Journal, Issue 38, 2017-10-18).

java -cp $JAR de.gwdg.metadataqa.marc.cli.ThompsonTraillCompleteness [options] [file]

or with a bash script

./tt-completeness [options] [file]

It produces a CSV file like this:

id,ISBN,Authors,Alternative Titles,Edition,Contributors,Series,TOC,Date 008,Date 26X,LC/NLM, \
LoC,Mesh,Fast,GND,Other,Online,Language of Resource,Country of Publication,noLanguageOrEnglish,RDA,total
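To get a first impression of such a file, one might average the total score. The sketch below assumes that the first line is the header, that `total` is the last column (as in the header above), and that no field contains an embedded comma. The function name is made up for this example.

```shell
#!/usr/bin/env bash
# Print the mean of the last CSV column ("total"), skipping the header line.
# Assumption: plain comma-separated values without quoted or embedded commas.
mean_total() {
  awk -F ',' 'NR > 1 { sum += $NF; n++ }
              END { if (n > 0) printf "%.2f\n", sum / n }' "$1"
}
```

Call it with the path of the CSV file the command produced, for example `mean_total path/to/output.csv` (the path here is a placeholder).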

Indexing MARC records with Solr

Set autocommit the following way in solrconfig.xml (inside Solr):


This is needed because the library's code does not issue commits itself, which makes parallel indexing faster.

In schema.xml (or in Solr web interface):

<dynamicField name="*_sni" type="string" indexed="false" stored="true"/>
<copyField source="*_ss" dest="_text_"/>

or use Solr API:


# add copy field
curl -X POST -H 'Content-type:application/json' --data-binary '{
}' $SOLR

curl -X POST -H 'Content-type:application/json' --data-binary '{
}' $SOLR

Run indexer:

java -cp $JAR de.gwdg.metadataqa.marc.cli.MarcToSolr [options] [file]


The Solr URL is something like this: http://localhost:8983/solr/loc. The indexer uses the Self Descriptive MARC code, in which encoded values are decoded to human-readable values (e.g. Leader/5 = "c" becomes Leader_recordStatus = "Corrected or revised"), so a record looks like this:

        "id":"   00004081 ",
        "Leader_ss":["00928cam a22002531  4500"],
        "Leader_recordStatus_ss":["Corrected or revised"],
        "Leader_typeOfRecord_ss":["Language material"],
        "Leader_typeOfControl_ss":["No specified type"],
        "Leader_encodingLevel_ss":["Full level, material not examined"],
        "Leader_multipartResourceRecordLevel_ss":["Not specified or not applicable"],
        "ControlNumber_ss":["   00004081 "],
        "PhysicalDescription_categoryOfMaterial_ss":["Electronic resource"],
        "PhysicalDescription_color_ss":["No attempt to code"],
        "PhysicalDescription_dimensions_ss":["22 cm."],
        "PhysicalDescription_sound_ss":["No attempt to code"],
        "PhysicalDescription_fileFormats_ss":["No attempt to code"],
        "PhysicalDescription_qualityAssuranceTargets_ss":["No attempt to code"],
        "PhysicalDescription_antecedentOrSource_ss":["No attempt to code"],
        "PhysicalDescription_levelOfCompression_ss":["No attempt to code"],
        "PhysicalDescription_reformattingQuality_ss":["No attempt to code"],
        "GeneralInformation_ss":["870303s1900    iauc          000 0 eng  "],
        "GeneralInformation_typeOfDateOrPublicationStatus_ss":["Single known date/probable date"],
        "GeneralInformation_date2_ss":["    "],
        "GeneralInformation_modifiedRecord_ss":["Not modified"],
        "GeneralInformation_catalogingSource_ss":["National bibliographic agency"],
        "GeneralInformation_illustrations_ss":["Portraits, No illustrations"],
        "GeneralInformation_targetAudience_ss":["Unknown or not specified"],
        "GeneralInformation_formOfItem_ss":["None of the following"],
        "GeneralInformation_natureOfContents_ss":["No specified nature of contents"],
        "GeneralInformation_governmentPublication_ss":["Not a government publication"],
        "GeneralInformation_conferencePublication_ss":["Not a conference publication"],
        "GeneralInformation_festschrift_ss":["Not a festschrift"],
        "GeneralInformation_index_ss":["No index"],
        "GeneralInformation_literaryForm_ss":["Not fiction (not further specified)"],
        "GeneralInformation_biography_ss":["No biographical material"],
        "IdentifiedByLccn_ss":["   00004081 "],
        "AdminMetadata_catalogingAgency_ss":["United States, Library of Congress"],
        "AdminMetadata_modifyingAgency_ss":["United States, Library of Congress"],
        "ClassificationLcc_ind1_ss":["Item is in LC"],
        "ClassificationLcc_ind2_ss":["Assigned by LC"],
        "MainPersonalName_personalName_ss":["Miller, James N."],
        "MainPersonalName_fullerForm_ss":["(James Newton)"],
        "Title_ind1_ss":["No added entry"],
        "Title_responsibilityStatement_ss":["by James N. Miller ..."],
        "Title_mainTitle_ss":["The story of Andersonville and Florence,"],
        "Publication_agent_ss":["Welch, the Printer,"],
        "Publication_ind1_ss":["Not applicable/No information provided/Earliest available publisher"],
        "Publication_place_ss":["Des Moines, Ia.,"],
        "PhysicalDescription_extent_ss":["47 p. incl. front. (port.)"],
        "AdditionalPhysicalFormAvailable_ss":["Also available in digital form on the Library of Congress Web site."],
        "CorporateNameSubject_ind2_ss":["Library of Congress Subject Headings"],
        "CorporateNameSubject_ss":["Florence Prison (S.C.)"],
        "CorporateNameSubject_ind1_ss":["Name in direct order"],
        "Geographic_ss":["United States"],
        "Geographic_generalSubdivision_ss":["Prisoners and prisons."],
        "Geographic_chronologicalSubdivision_ss":["Civil War, 1861-1865"],
        "Geographic_ind2_ss":["Library of Congress Subject Headings"],
        "ElectronicLocationAndAccess_materialsSpecified_ss":["Page view"],
        "ElectronicLocationAndAccess_ind2_ss":["Version of resource"],

"marc-tags" format

"100a_ss":["Jung-Baek, Myong Ja"],
"245c_ss":["Vorgelegt von Myong Ja Jung-Baek."],
"245ind2_ss":["No nonfiling characters"],
"245a_ss":["S. Tret'jakov und China /"],
"245ind1_ss":["Added entry"],
"260b_ss":["Georg-August-Universität Göttingen,"],
"260a_ss":["Göttingen :"],
"260ind1_ss":["Not applicable/No information provided/Earliest available publisher"],
"300a_ss":["141 p."],

"human-readable" format

"MainPersonalName_personalName_ss":["Jung-Baek, Myong Ja"],
"Title_responsibilityStatement_ss":["Vorgelegt von Myong Ja Jung-Baek."],
"Title_mainTitle_ss":["S. Tret'jakov und China /"],
"Title_titleAddedEntry_ss":["Added entry"],
"Title_nonfilingCharacters_ss":["No nonfiling characters"],
"Publication_sequenceOfPublishingStatements_ss":["Not applicable/No information provided/Earliest available publisher"],
"Publication_agent_ss":["Georg-August-Universität Göttingen,"],
"Publication_place_ss":["Göttingen :"],
"PhysicalDescription_extent_ss":["141 p."],

"mixed" format

"100a_MainPersonalName_personalName_ss":["Jung-Baek, Myong Ja"],
"245a_Title_mainTitle_ss":["S. Tret'jakov und China /"],
"245ind1_Title_titleAddedEntry_ss":["Added entry"],
"245ind2_Title_nonfilingCharacters_ss":["No nonfiling characters"],
"245c_Title_responsibilityStatement_ss":["Vorgelegt von Myong Ja Jung-Baek."],
"260b_Publication_agent_ss":["Georg-August-Universität Göttingen,"],
"260a_Publication_place_ss":["Göttingen :"],
"260ind1_Publication_sequenceOfPublishingStatements_ss":["Not applicable/No information provided/Earliest available publisher"],
"300a_PhysicalDescription_extent_ss":["141 p."],
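Judging from the three samples, a "mixed" key looks like the marc-tags key and the human-readable key joined with an underscore. Under that assumption (inferred from the examples above, not from the tool's code), the two other forms can be recovered with plain parameter expansion. The function name is made up for this example.

```shell
#!/usr/bin/env bash
# Split a "mixed" Solr field name into its marc-tags and human-readable forms.
# Assumption: the part before the first underscore is the MARC tag part and
# the part after the last underscore is the Solr type suffix (e.g. "ss").
split_mixed() {
  local key=$1
  local suffix=${key##*_}   # text after the last underscore
  local tag=${key%%_*}      # text before the first underscore
  local rest=${key#*_}      # everything after the first underscore
  echo "${tag}_${suffix} ${rest}"
}

split_mixed 245a_Title_mainTitle_ss   # prints: 245a_ss Title_mainTitle_ss
```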

I have created a distinct project, metadata-qa-marc-web, which provides a single-page web application to build a faceted search interface for this type of Solr index.

Indexing MARC JSON records with Solr

java -cp $JAR de.gwdg.metadataqa.marc.cli.utils.MarcJsonToSolr [Solr url] [MARC JSON file]

The MARC JSON file is a JSON serialization of a binary MARC file. For more details, see the MARC Pipeline project.

Export mapping table

to Avram JSON

Some background info: MARC21 structure in JSON.

java -cp $JAR de.gwdg.metadataqa.marc.cli.utils.MappingToJson [options] > marc-schema


An example output:

  "label":"Library of Congress Control Number",
    "national":"Mandatory if applicable",
    "minimal":"Mandatory if applicable"
      "label":"LC control number",
        "Data Management\/Identify",
        "Data Management\/Process"
        "national":"Mandatory if applicable",
        "minimal":"Mandatory if applicable"
  "label":"Patent Control Information",
        "name":"MARC Code List for Countries",
          "-ac":{"label":"Ashmore and Cartier Islands"},
          "aca":{"label":"Australian Capital Territory"},


To export the HTML table described at Self Descriptive MARC code

java -cp $JAR de.gwdg.metadataqa.marc.cli.utils.MappingToHtml > mapping.html

Extending the functionalities

The project is available as jar files from Maven Central, the central repository of open source Java projects. If you want to use it in your Java or Scala application, put this code snippet into the list of dependencies:




libraryDependencies += "de.gwdg.metadataqa" % "metadata-qa-marc" % "0.1"

or you can directly download the jars from

User interface

There is a web application (written in PHP) for displaying and navigating the output of the tool:

Appendix I: Where can I get MARC records?

Here is a list of data sources I am aware of so far:

United States of America



Thanks to Johann Rolschewski and Phú for their help in collecting this list! Do you know of more data sources? Please let me know.

There are two more data sources worth mentioning; however, they do not provide MARC records, but derivatives:

Appendix II: handling MARC versions

The tool provides two levels of customization:

Each MARC version has an identifier. These are defined in the code as an enumeration:

public enum MarcVersion {
  MARC21("MARC21", "MARC21"),
  DNB("DNB", "Deutsche Nationalbibliothek"),
  GENT("GENT", "Universiteitsbibliotheek Gent"),
  SZTE("SZTE", "Szegedi Tudományegyetem"),
  FENNICA("FENNICA", "National Library of Finland");
  // fields and constructor omitted
}

When you add a version-specific modification, you have to use one of these values.

  1. Defining version specific indicator codes:
Indicator::putVersionSpecificCodes(MarcVersion, List<Code>)

Code is a simple object with two properties: code and label.


public class Tag024 extends DataFieldDefinition {
   ind1 = new Indicator("Type of standard number or code")
                    new Code(" ", "Not specified")
  2. Defining version specific subfields:
DataFieldDefinition::putVersionSpecificSubfields(MarcVersion, List<SubfieldDefinition>)

SubfieldDefinition contains a definition of a subfield. You can construct it with three String parameters: a code, a label and a cardinality code which denotes whether the subfield can be repeatable ("R") or not ("NR").


public class Tag024 extends DataFieldDefinition {
         new SubfieldDefinition("9", "Standardnummer (mit Bindestrichen)", "NR")
  3. Marking indicator codes as obsolete:

The list should be pairs of code and description.

public class Tag082 extends DataFieldDefinition {
   ind1 = new Indicator("Type of edition")
                 " ", "No edition information recorded (BK, MU, VM, SE) [OBSOLETE]",
                 "2", "Abridged NST version (BK, MU, VM, SE) [OBSOLETE]"
  4. Marking subfields as obsolete:

The list should be pairs of code and description.

public class Tag020 extends DataFieldDefinition {
      "b", "Binding information (BK, MP, MU) [OBSOLETE]"

Appendix III: Special build process

"deployment" build (when deploying artifacts to Maven Central)

mvn clean deploy -Pdeploy

Docker image

Build and test

docker-compose -f docker-compose.yml build app
docker-compose -f docker-compose.yml up
docker run -t -i -v [local-MARC-dir]:/opt/metadata-qa-marc/marc metadata-qa-marc /bin/bash
cd /opt/metadata-qa-marc
scripts/[lib].sh all-analyses

Upload to Docker Hub:

docker tag metadata-qa-marc:latest pkiraly/metadata-qa-marc:latest
docker login
docker push pkiraly/metadata-qa-marc:latest

Feedback is welcome!
