Catmandu 1.20

On May 21st 2019, Nicolas Steenlant (our main developer and guru of Catmandu) released version 1.20 of our Catmandu toolkit with some very interesting new features. The main addition is a brand new way to implement Catmandu Fix-es using the new Catmandu::Path implementation. This work by Nicolas makes it much easier and more straightforward to implement any kind of fix in Perl.

In the previous versions of Catmandu there were only two options to create new fixes:

  1. Create a Perl package in the Catmandu::Fix namespace which implements a fix method. This was very easy: update the $data hash you got as first argument, return the updated $data and you were done (see the sketch after this list). The disadvantage was that accessing fields in a deeply nested record was tricky and slow to code.
  2. Create a Perl package in the Catmandu::Fix namespace which implements emit functions. These are functions that generate Perl code on the fly. Using emit functions it was easier to get fast access to deeply nested data, but creating such Fix packages was pretty complex.
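
As a reminder, a fix of the first kind could look like the sketch below. This is not code from the release, just a hypothetical uppercase_name fix illustrating the classic fix-method style:

package Catmandu::Fix::uppercase_name;   # hypothetical example fix

use Catmandu::Sane;
use Moo;

# the classic style: receive the $data hash, change it, return it
sub fix {
    my ($self, $data) = @_;
    $data->{name} = uc $data->{name} if defined $data->{name};
    $data;
}

1;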

In Catmandu 1.20 there is now support for a third and easy way to create new Fixes using the Catmandu::Fix::Builder and Catmandu::Fix::Path classes. Let me give a simple example of a skeleton Fix that does nothing:

package Catmandu::Fix::rot13;

use Catmandu::Sane;
use Moo;
use Catmandu::Util::Path qw(as_path);
use Catmandu::Fix::Has;

with 'Catmandu::Fix::Builder';

has path => (fix_arg => 1);

sub _build_fixer {
   my ($self) = @_;
   sub {
     my $data = $_[0];
     # ..do some magic here ...
     $data;
   }
}

1;

In the code above we start implementing a rot13(path) Fix that should read a string at a JSON path and encrypt it using the ROT13 algorithm. This Fix is only a skeleton which doesn’t do anything yet. What we have is:

    • We import the as_path method to be able to easily access data on JSON paths.
    • We import Catmandu::Fix::Has to be able to use has path constructs to read in arguments for our Fix.
    • We use the Catmandu::Fix::Builder role, new in Catmandu 1.20, which expects us to provide a _build_fixer method.
    • The fixer is nothing more than a closure that reads the data, does some action on the data and returns the data.

We can use this skeleton builder to implement our ROT13 algorithm. Add these lines in place of the # ..do some magic here ... comment:

# On the path update the string value...
as_path($self->path)->updater(
   if_string => sub {
       my $value = shift;
       $value =~ tr{N-ZA-Mn-za-m}{A-Za-z};
       $value;
   },
)->($data);

The as_path method receives a JSON path string and creates an object which you can use to manipulate data on that path. One can update the values found with the updater method, read data at that path with the getter method, or create a new path with the creator method. In our example, we update the string found at the JSON path using the if_string condition. The updater supports several conditions, which can be combined as shown in the sketch after this list:

  • if_string needs a closure describing what should happen when a string is found on the JSON path.
  • if_array_ref needs a closure describing what should happen when an array is found on the JSON path.
  • if_hash_ref needs a closure describing what should happen when a hash is found on the JSON path.
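
For instance, a hedged sketch (an assumption about how one might extend the rot13 example, not code from the release) that rotates a plain string and also every string inside an array found at the path:

# handle both a string and an array of strings at the path
as_path($self->path)->updater(
   if_string => sub {
       my $value = shift;
       $value =~ tr{N-ZA-Mn-za-m}{A-Za-z};
       $value;
   },
   if_array_ref => sub {
       my $values = shift;
       # rotate every plain string element, leave references untouched
       [ map { my $v = $_; $v =~ tr{N-ZA-Mn-za-m}{A-Za-z} unless ref $v; $v } @$values ];
   },
)->($data);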

In our case we are only interested in transforming strings using our rot13(path) fix. The ROT13 algorithm is very easy: it simply replaces each letter with the letter 13 positions further in the alphabet. When we execute this fix on some sample data we get this result:


$ catmandu -I lib convert Null to YAML --fix 'add_field(demo,hello);rot13(demo)'
---
demo: uryyb
...

In this case the Fix can be written much more compactly when we know that every Catmandu::Path method returns a closure (hint: look at the ->($data) in the code above). The complete Fix can look like:

package Catmandu::Fix::rot13;

use Catmandu::Sane;
use Moo;
use Catmandu::Util::Path qw(as_path);
use Catmandu::Fix::Has;

with 'Catmandu::Fix::Builder';

has path  => (fix_arg => 1);

sub _build_fixer {
    my ($self) = @_;
    # On the path update the string value...
    as_path($self->path)->updater(
      if_string => sub {
         my $value = shift;
         $value =~ tr{N-ZA-Mn-za-m}{A-Za-z};
         $value;
      },
   );
}

1;

This is as easy as it can get to manipulate deeply nested data with your own Perl tools. All the code is in Perl, and there is no limit on the number of external CPAN packages one can include in these Builder fixes.

We can’t wait to see what Catmandu extensions you will create.

Introducing FileStores

Catmandu is always our tool of choice when working with structured data. Using the Elasticsearch or MongoDB Catmandu::Store-s it is quite trivial to store and retrieve metadata records. Storing and retrieving YAML, JSON (and by extension XML, MARC, CSV, …) files can be as easy as the commands below:

$ catmandu import YAML to database < input.yml
$ catmandu import JSON to database < input.json
$ catmandu import MARC to database < marc.data
$ catmandu export database to YAML > output.yml

A catmandu.yml configuration file is required with the connection parameters to the database:

$ cat catmandu.yml
---
store:
  database:
    package: ElasticSearch
    options:
       client: '1_0::Direct' 
       index_name: catmandu
...

Given these tools to import, export and even transform structured data, can this be extended to unstructured data? In institutional repositories like LibreCat we would like to manage metadata records and binary content (for example PDF files related to the metadata). Catmandu 1.06 introduces the Catmandu::FileStore as an extension to the already existing Catmandu::Store to manage binary content.

A Catmandu::FileStore is a Catmandu::Store where each Catmandu::Bag acts as a “container” or a “folder” that can contain zero or more records describing File content. The file records themselves contain pointers to a backend storage implementation capable of serialising and streaming binary files. Out of the box, one Catmandu::FileStore implementation is available: Catmandu::Store::File::Simple, or File::Simple for short, which stores files in a directory.

Some examples. To add a file to a FileStore, the stream command needs to be executed:


$ catmandu stream /tmp/myfile.pdf to File::Simple --root /data --bag 1234 --id myfile.pdf

In the command above: /tmp/myfile.pdf is the file to be uploaded to the File::Store. File::Simple is the name of the File::Store implementation, which requires one mandatory parameter, --root /data, the root directory where all files are stored. The --bag 1234 is the “container” or “folder” which contains the uploaded files (with a numeric identifier 1234). And the --id myfile.pdf is the identifier for the newly created file record.

To download the file from the File::Store, the stream command needs to be executed in the opposite direction:

$ catmandu stream File::Simple --root /data --bag 1234 --id myfile.pdf to /tmp/file.pdf

or

$ catmandu stream File::Simple --root /data --bag 1234 --id myfile.pdf > /tmp/file.pdf
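
The same operations can also be done from Perl. The snippet below is a hedged sketch based on the Catmandu::FileStore interface (index, files, upload and download); treat the exact method names as an assumption and check the File::Simple documentation before relying on it:

use Catmandu;
use IO::File;

my $store = Catmandu->store('File::Simple', root => '/data');

# make sure the container "1234" exists in the index bag
my $index = $store->index;
$index->add({ _id => '1234' });

# get the files bag for the container and upload/download a file
my $files = $index->files('1234');
$files->upload(IO::File->new('/tmp/myfile.pdf', 'r'), 'myfile.pdf');
$files->download(IO::File->new('/tmp/file.pdf', 'w'), 'myfile.pdf');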

On the file system the files are stored in a deeply nested directory structure to be able to spread out the File::Store over many disks:


/data
  `--/000
      `--/001
          `--/234
              `--/myfile.pdf

A listing of all “containers” can be retrieved by requesting an export of the default (index) bag of the File::Store:


$ catmandu export File::Simple --root /data to YAML
_id: 1234
...

A listing of all files in the container “1234” can be done by adding the bag name to the export command:

$ catmandu export File::Simple --root /data --bag 1234 to YAML
_id: myfile.pdf
_stream: !!perl/code '{ "DUMMY" }'
content_type: application/pdf
created: 1498125394
md5: ''
modified: 1498125394
size: 883202
...

Each File::Store implementation supports at least the fields presented above:

  • _id: the name of the file
  • _stream: a callback function to retrieve the content of the file (it requires an IO::Handle as input; see the sketch after this list)
  • content_type: the MIME type of the file
  • created: a timestamp of when the file was created
  • modified: a timestamp of when the file was last modified
  • size: the byte length of the file
  • md5: an optional MD5 checksum
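
The _stream callback is what the stream command uses under the hood. A minimal hedged sketch of how it could be used directly from Perl (assuming the File::Simple store and container from the examples above):

use Catmandu;

my $store = Catmandu->store('File::Simple', root => '/data');
my $file  = $store->index->files('1234')->get('myfile.pdf');

# pass an IO::Handle to the _stream callback to receive the binary content
open(my $out, '>:raw', '/tmp/file.pdf') or die "can't write: $!";
$file->{_stream}->($out);
close($out);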

We envision that many FileStore implementations can be created for Catmandu to be able to store files in GitHub, BagIts, Fedora Commons and other backends.

Using the Catmandu::Plugin::SideCar, Catmandu::FileStore-s and Catmandu::Store-s can be combined into one endpoint. Using Catmandu::Store::Multi and Catmandu::Store::File::Multi many different implementations of Stores and FileStores can be combined.

This is a short introduction, but I hope you will experiment a bit with the new functionality and provide feedback to our project.

Catmandu 1.04

Catmandu 1.04 has been released with some nice new features. There are some new Fix routines that were requested by our community:

error

The “error” fix immediately stops the execution of the Fix script and throws an error. Use this to abort the processing of a data stream:

$ cat myfix.fix
unless exists(id)
    error("no id found?!")
end
$ catmandu convert JSON --fix myfix.fix < data.json

valid

The “valid” fix condition can be used to validate a record (or part of a record) against a JSONSchema. For instance we can select only the valid records from a stream:

$ catmandu convert JSON --fix 'select valid("", JSONSchema, schema:myschema.json)' < data.json

Or, create some logging:

$ cat myfix.fix
unless valid(author, JSONSchema, schema:authors.json)
    log("errors in the author field")
end
$ catmandu convert JSON --fix myfix.fix < data.json

rename

The “rename” fix can be used to recursively change the names of fields in your documents. For example, when you have this JSON input:

{
"foo.bar": "123",
"my.name": "Patrick"
}

you can transform all periods (.) in the key names to underscores with this fix:

rename('','\.','_')

The first parameter is the field “rename” should work on (in our case it is an empty string, meaning the complete record). The second and third parameters are the regex search and replace parameters. The result of this fix is:

{
"foo_bar": "123",
"my_name": "Patrick"
}

The “rename” fix will only work on the keys of JSON paths. For example, given the following path:

my.deep.path.x.y.z

The keys are:

  • my
  • deep
  • path
  • x
  • y
  • z

The second and third arguments search and replace these separate keys. When you want to change the path as a whole take a look at the “collapse()” and “expand()” fixes in combination with the “rename” fix:

collapse()
rename('',"my\.deep","my.very.very.deep")
expand()

Now the generated path will be:

my.very.very.deep.path.x.y.z

Of course the example above could be written more simply as “move_field(my.deep,my.very.very.deep)”, but it serves as an example that powerful renaming is possible.

import_from_string

This Fix is a generalisation of the “from_json” Fix. It can transform a serialised string field in your data into an array of data. For instance, take the following YAML record:


---
foo: '{"name":"patrick"}'
...

The field ‘foo’ contains a JSON fragment. You can transform this JSON into real data using the following fix:


import_from_string(foo,JSON)

Which creates a ‘foo’ array containing the deserialised JSON:


---
foo:
- name: patrick

The “import_from_string” fix looks very much like the “from_json” fix, but you can use any Catmandu::Importer. It always creates an array of hashes. For instance, given the following YAML record:


---
foo: "name;hobby\nnicolas;drawing\npatrick;music"

You can transform the CSV fragment in the ‘foo’ field into data by using this fix:


import_from_string(foo,CSV,sep_char:";")

Which gives as result:


---
foo:
- hobby: drawing
  name: nicolas
- hobby: music
  name: patrick
...

In the same way it can process MARC, XML, RDF, YAML or any other format supported by Catmandu.

export_to_string

The fix “export_to_string” is the opposite of “import_from_string” and is the generalisation of the “to_json” fix. Given the YAML from the previous example:


---
foo:
- hobby: drawing
  name: nicolas
- hobby: music
  name: patrick
...

You can create a CSV fragment in the ‘foo’ field with the following fix:


export_to_string(foo,CSV,sep_char:";")

Which gives as result:


---
foo: "name;hobby\nnicolas;drawing\npatrick;music"

search_in_store

The fix “search_in_store” is a generalisation of the “lookup_in_store” fix. The latter is used to query the “_id” field in a Catmandu::Store and return the first hit. The former, “search_in_store” can query any field in a store and return all (or a subset) of the results. For instance, given the YAML record:


---
foo: "(title:ABC OR author:dave) AND NOT year:2013"
...

then the following fix will replace the ‘foo’ field with the result of the query in a Solr index:


search_in_store('foo', store:Solr, url: 'http://localhost:8983/solr/catalog')

As a result, the document will be updated like:


---
foo:
    start: 0
    limit: 0
    hits: [...]
    total: 1000
...

where

  • start: the starting index of the search result
  • limit: the number of results per page
  • hits: an array containing the data from the result page
  • total: the total number of search results

Every Catmandu::Store can have a different layout of the result page. Look at the documentation of the Catmandu::Store implementations for the specific details.

Thanks for all your support for Catmandu and keep on data converting 🙂

Metadata Analysis at the Command-Line

Last week I was at the ELAG 2016 conference in Copenhagen and attended the excellent workshop by Christina Harlow of Cornell University on migrating digital collections metadata to RDF and Fedora4. One of the important steps required to migrate and model data to RDF is understanding what your data is about. Often old systems need to be converted for which little or no documentation is available. Instead of manually processing large XML or MARC dumps, tools like metadata breakers can be used to find out which fields are available in the legacy system and how they are used. Mark Phillips of the University of North Texas recently wrote a very inspiring article in Code4Lib on how this could be done in Python. In this blog post I’ll demonstrate how this can be done using a new Catmandu tool: Catmandu::Breaker.

To follow the examples below, you need to have a system with Catmandu installed. The Catmandu::Breaker tools can then be installed with the command:

$ sudo cpan Catmandu::Breaker

A breaker is a command that transforms data into a line format that can be easily processed with Unix command line tools such as grep, sort, uniq, cut and many more. If you need an introduction into Unix tools for data processing please follow the examples Johan Rolschewski of Berlin State Library and I presented as an ELAG bootcamp.

As a simple example let’s create a YAML file and demonstrate how this file can be analysed using Catmandu::Breaker:

$ cat test.yaml
---
name: John
colors:
 - black
 - yellow
 - red
institution:
 name: Acme
 years:
   - 1949
   - 1950
   - 1951
   - 1952

This example has a combination of simple name/value pairs, a list of colors and a deeply nested field. To transform this data into the breaker format execute the command:

$ catmandu convert YAML to Breaker < test.yaml
1 colors[]  black
1 colors[]  yellow
1 colors[]  red
1 institution.name  Acme
1 institution.years[] 1949
1 institution.years[] 1950
1 institution.years[] 1951
1 institution.years[] 1952
1 name  John

The breaker format is a tab-delimited output with three columns:

  1. A record identifier: read from the _id field in the input data, or a counter when no such field is present.
  2. A field name. Nested fields are separated by dots (.) and lists are indicated by square brackets ([]).
  3. A field value.

When you have a very large JSON or YAML file and need to find all the values of a deeply nested field you could do something like:

$ catmandu convert YAML to Breaker < data.yaml | grep "institution.years"

Using Catmandu you can do this analysis on input formats such as JSON, YAML, XML, CSV and XLS (Excel). Just replace the YAML with any of these formats and run the breaker command. Catmandu can also connect to OAI-PMH, Z39.50 or databases such as MongoDB, ElasticSearch, Solr or even relational databases such as MySQL, Postgres and Oracle. For instance, to get a breaker format for an OAI-PMH repository issue a command like:

$ catmandu convert OAI --url http://lib.ugent.be/oai to Breaker

If your data is in a database you could issue an SQL query like:

$ catmandu convert DBI --dsn 'dbi:Oracle' --query 'SELECT * from TABLE WHERE ...' --user 'user/password' to Breaker

Some formats, such as MARC, don’t map to a great breaker format out of the box. In Catmandu, MARC files are parsed into a list of lists. Running a breaker on a MARC input you get this:

$ catmandu convert MARC to Breaker < t/camel.usmarc  | head
fol05731351     record[][]  LDR
fol05731351     record[][]  _
fol05731351     record[][]  00755cam  22002414a 4500
fol05731351     record[][]  001
fol05731351     record[][]  _
fol05731351     record[][]  fol05731351
fol05731351     record[][]  082
fol05731351     record[][]  0
fol05731351     record[][]  0
fol05731351     record[][]  a

The MARC fields are part of the data, not part of the field name. This can be fixed by adding a special ‘marc’ handler to the breaker command:

$ catmandu convert MARC to Breaker --handler marc < t/camel.usmarc  | head
fol05731351     LDR 00755cam  22002414a 4500
fol05731351     001 fol05731351
fol05731351     003 IMchF
fol05731351     005 20000613133448.0
fol05731351     008 000107s2000    nyua          001 0 eng
fol05731351     010a       00020737
fol05731351     020a    0471383147 (paper/cd-rom : alk. paper)
fol05731351     040a    DLC
fol05731351     040c    DLC
fol05731351     040d    DLC

Now all the MARC subfields are visible in the output.

You can use this format to find, for instance, all unique values in a MARC file. Let’s try to find all unique 008 values:

$ catmandu convert MARC to Breaker --handler marc < camel.usmarc | grep "\t008" | cut -f 3 | sort -u
000107s2000 nyua 001 0 eng
000203s2000 mau 001 0 eng
000315s1999 njua 001 0 eng
000318s1999 cau b 001 0 eng
000318s1999 caua 001 0 eng
000518s2000 mau 001 0 eng
000612s2000 mau 000 0 eng
000612s2000 mau 100 0 eng
000614s2000 mau 000 0 eng
000630s2000 cau 001 0 eng
00801nam 22002778a 4500

Catmandu::Breaker doesn’t only break input data into an easy format for command line processing, it can also do a statistical analysis on the breaker output. First process some data into the breaker format and save the result in a file:

$ catmandu convert MARC to Breaker --handler marc < t/camel.usmarc > result.breaker

Now, use this file as input for the ‘catmandu breaker’ command:

$ catmandu breaker result.breaker
| name | count | zeros | zeros% | min | max | mean | median | mode   | variance | stdev | uniq | entropy |
|------|-------|-------|--------|-----|-----|------|--------|--------|----------|-------|------|---------|
| 001  | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 10   | 3.3/3.3 |
| 003  | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 1    | 0.0/3.3 |
| 005  | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 10   | 3.3/3.3 |
| 008  | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 10   | 3.3/3.3 |
| 010a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 10   | 3.3/3.3 |
| 020a | 9     | 1     | 10.0   | 0   | 1   | 0.9  | 1      | 1      | 0.09     | 0.3   | 9    | 3.3/3.3 |
| 040a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 1    | 0.0/3.3 |
| 040c | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 1    | 0.0/3.3 |
| 040d | 5     | 5     | 50.0   | 0   | 1   | 0.5  | 0.5    | [0, 1] | 0.25     | 0.5   | 1    | 1.0/3.3 |
| 042a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 1    | 0.0/3.3 |
| 050a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 1    | 0.0/3.3 |
| 050b | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 10   | 3.3/3.3 |
| 0822 | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 1    | 0.0/3.3 |
| 082a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 3    | 0.9/3.3 |
| 100a | 9     | 1     | 10.0   | 0   | 1   | 0.9  | 1      | 1      | 0.09     | 0.3   | 8    | 3.1/3.3 |
| 100d | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 100q | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 111a | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 111c | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 111d | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 245a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 9    | 3.1/3.3 |
| 245b | 3     | 7     | 70.0   | 0   | 1   | 0.3  | 0      | 0      | 0.21     | 0.46  | 3    | 1.4/3.3 |
| 245c | 9     | 1     | 10.0   | 0   | 1   | 0.9  | 1      | 1      | 0.09     | 0.3   | 8    | 3.1/3.3 |
| 250a | 3     | 7     | 70.0   | 0   | 1   | 0.3  | 0      | 0      | 0.21     | 0.46  | 3    | 1.4/3.3 |
| 260a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 6    | 2.3/3.3 |
| 260b | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 5    | 2.0/3.3 |
| 260c | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 2    | 0.9/3.3 |
| 263a | 6     | 4     | 40.0   | 0   | 1   | 0.6  | 1      | 1      | 0.24     | 0.49  | 4    | 2.0/3.3 |
| 300a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 5    | 1.8/3.3 |
| 300b | 3     | 7     | 70.0   | 0   | 1   | 0.3  | 0      | 0      | 0.21     | 0.46  | 1    | 0.9/3.3 |
| 300c | 4     | 6     | 60.0   | 0   | 1   | 0.4  | 0      | 0      | 0.24     | 0.49  | 4    | 1.8/3.3 |
| 300e | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 500a | 2     | 8     | 80.0   | 0   | 1   | 0.2  | 0      | 0      | 0.16     | 0.4   | 2    | 0.9/3.3 |
| 504a | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 630a | 2     | 9     | 90.0   | 0   | 2   | 0.2  | 0      | 0      | 0.36     | 0.6   | 2    | 0.9/3.5 |
| 650a | 15    | 0     | 0.0    | 1   | 3   | 1.5  | 1      | 1      | 0.65     | 0.81  | 6    | 1.7/3.9 |
| 650v | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 700a | 5     | 7     | 70.0   | 0   | 2   | 0.5  | 0      | 0      | 0.65     | 0.81  | 5    | 1.9/3.6 |
| LDR  | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 10   | 3.3/3.3

As a result you get a table listing the usage of subfields in all the input records. From this output we can learn:

  • The ‘001’ field is available in 10 records (see: count)
  • One record doesn’t contain a ‘020a’ subfield (see: zeros)
  • The ‘650a’ is available in all records, at least once and at most 3 times (see: min, max)
  • Only 8 out of 10 ‘100a’ subfields have unique values (see: uniq)
  • The last column, ‘entropy’, provides a measure of how interesting the field is for search engines: the higher the entropy, the more unique content can be found.

I hope these tools are of some use in your projects!

Catmandu 1.01

Catmandu 1.01 has been released today. There have been some speed improvements processing fixes due to switching from the Data::Util to the Ref::Util package which has better support on many Perl platforms.

For the command line there is now support for preprocessing Fix scripts. This means one can read in variables from the command line into a Fix script. For instance, when processing data you might want to keep some provenance data about your data sources in the output. This can be done with the following command:


$ catmandu convert MARC --fix myfixes.fix --var source=Publisher1 --var date=2014-2015 < data.mrc

with a myfixes.fix like:


add_field(my_source,{{source}})
add_field(my_data,{{date}})
marc_field(245,title)
marc_field(022,issn)
.
.
.
etc
.
.

Your JSON output will now contain the clean ‘title’ and ‘issn’ fields but also, for each record, a ‘my_source’ field with value ‘Publisher1’ and a ‘my_data’ field with value ‘2014-2015’.

By using the Text::Hogan compiler full support of the mustache language is available.

In this new Catmandu version there have been also some new fix functions you might want to try out, see our Fixes Cheat Sheet for a full overview.

 

Parallel Processing with Catmandu

In this blog post I’ll show a technique to scale out your data processing with Catmandu. All catmandu scripts use a single process, in a single thread. This means that if you need to process two times as much data, you need two times as much time. Running a catmandu convert command with the -v option will show you the speed of a typical conversion:

$ catmandu convert -v MARC to JSON --fix heavy_load.fix < input.marc > output.json
added       100 (55/sec)
added       200 (76/sec)
added       300 (87/sec)
added       400 (92/sec)
added       500 (90/sec)
added       600 (94/sec)
added       700 (97/sec)
added       800 (97/sec)
added       900 (96/sec)
added      1000 (97/sec)

In the example above we process an ‘input.marc’ MARC file into an ‘output.json’ JSON file with some difficult data cleaning in the ‘heavy_load.fix’ Fix script. Using a single process we can reach about 97 records per second. It would take 2.8 hours to process one million records and 28 hours to process ten million records.

Can we make this any faster?

Computers today are equipped with multiple processors. Using a single process, only one of these processors is used for calculations. One would get much more ‘bang for the buck’ if all the processors could be used. One technique to do that is called ‘parallel processing’.

To check the number of processors available on your machine, inspect the file ‘/proc/cpuinfo’ on your Linux system:

$ cat /proc/cpuinfo | grep processor
processor   : 0
processor   : 1

The example above shows two lines: I have two cores available to do processing on my laptop. In my library we have servers which contain 4, 8, 16 or more processors. This means that if we could do our calculations in a smart way then our processing could be 2, 4, 8 or 16 times as fast (in principle).

To check if your computer is using all that calculating power, use the ‘uptime’ command:

$ uptime
11:15:21 up 622 days,  1:53,  2 users,  load average: 1.23, 1.70, 1.95

In the example above I ran ‘uptime’ on one of our servers with 4 processors. It shows a load average of about 1.23 to 1.95. This means that in the last 15 minutes between 1 and 2 processors were being used and the other two did nothing. If the load average is less than the number of cores (4 in our case) it means: the server is waiting for input. If the load average is equal to the number of cores it means: the server is using all the CPU power available. If the load is bigger than the number of cores, then there is more work available than can be executed by the machine, and some processes need to wait.

Now that you know some Unix commands, we can start using the processing power available on your machine. In my examples I’m going to use a Unix tool called ‘GNU parallel’ to run Catmandu scripts on all the processors in my machine in the most efficient way possible. To do this you need to install GNU parallel:

sudo yum install parallel

The second ingredient we need is a way to cut our input data into many parts. For instance, if we have a 4-processor machine we would like to create 4 equal chunks of data to process in parallel. There are very many ways to cut your data into many parts. I’ll show you a trick we use at Ghent University Library with the help of a MongoDB installation.

First install MongoDB and the MongoDB Catmandu plugins (these examples are taken from our CentOS documentation):

$ sudo cat > /etc/yum.repos.d/mongodb.repo <<EOF
[mongodb]
baseurl=http://downloads-distro.mongodb.org/repo/redhat/os/x86_64
gpgcheck=0
enabled=1
name=MongoDB.org repository
EOF

$ sudo yum install -y mongodb-org mongodb-org-server mongodb-org-shell mongodb-org-mongos mongodb-org-tools
$ sudo cpanm Catmandu::Store::MongoDB

Next, we are going to store our input data in a MongoDB database with the help of a Catmandu Fix script that adds some random numbers to the data:

$ catmandu import MARC to MongoDB --database_name data --fix random.fix < input.marc

With the ‘random.fix’ like:


random("part.rand2","2")
random("part.rand4","4")
random("part.rand8","8")
random("part.rand16","16")
random("part.rand32","32")

The ‘random()’ Fix function will be available in Catmandu 1.003 but can also be downloaded here (install it in a directory ‘lib/Catmandu/Fix’). It will make sure that every record in your input file contains five random numbers: ‘part.rand2’, ‘part.rand4’, ‘part.rand8’, ‘part.rand16’ and ‘part.rand32’. This makes it possible to chop your data into two, four, eight, sixteen or thirty-two parts depending on the number of processors you have in your machine.
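
If you can’t download it, a random(path, max) fix is also easy to write yourself in the classic fix-method style shown earlier. The sketch below is a hypothetical stand-in, not the official implementation: it walks a dotted path and stores an integer between 0 and max-1:

package Catmandu::Fix::random;    # hypothetical sketch, not the official fix

use Catmandu::Sane;
use Moo;
use Catmandu::Fix::Has;

has path => (fix_arg => 1);
has max  => (fix_arg => 1);

sub fix {
    my ($self, $data) = @_;

    # walk (and create) the dotted path, e.g. "part.rand2"
    my @keys = split /\./, $self->path;
    my $last = pop @keys;
    my $node = $data;
    $node = $node->{$_} //= {} for @keys;

    # store a random integer between 0 and max-1
    $node->{$last} = int(rand($self->max));

    $data;
}

1;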

To access one chunk of your data the ‘catmandu export’ command can be used with a query. For instance, to export two equal chunks do:

$ catmandu export MongoDB --database_name data -q '{"part.rand2":0}' > part1
$ catmandu export MongoDB --database_name data -q '{"part.rand2":1}' > part2

We are going to use these catmandu commands in a Bash script which makes use of GNU parallel to run many conversions simultaneously.

#!/bin/bash
# file: parallel.sh
CPU=$1

if [ "${CPU}" == "" ]; then
    /usr/bin/parallel -u $0 {} <<EOF
0
1
EOF
elif [ "${CPU}" != "" ]; then
     catmandu export MongoDB --database_name data -q "{\"part.rand2\":${CPU}}" to JSON --line_delimited 1 --fix heavy_load.fix > result.${CPU}.json
fi

The example script above shows how a conversion process could run on a 2-processor machine. The lines with ‘/usr/bin/parallel’ show how GNU parallel is used to call this script with two arguments, ‘0’ and ‘1’ (for the 2-processor example). The line with ‘catmandu export’ shows how a chunk of data is read from the database and processed with the ‘heavy_load.fix’ Fix script.

If you have a 32-processor machine, you would need to provide parallel an input which contains the numbers 0, 1, 2 up to 31 and change the query to ‘part.rand32’.

GNU parallel is a very powerful command. It gives you the opportunity to run many processes in parallel and even to spread out the load over many machines if you have a cluster. When all these machines have access to your MongoDB database then all of them can receive chunks of data to be processed. The only task left is to combine all results, which can be as easy as a simple ‘cat’ command:

$ cat result.*.json > final_result.json

Catmandu 1.00

After 4 years of programming and 88 minor releases we are finally there: the release of Catmandu 1.00! We have pushed the test coverage of the code to 93.97% and added and cleaned a lot of our documentation.

For the new features read our Changes file.

A few important changes should be noted.

 

 

By default Catmandu will read and write valid JSON files. In previous versions the default input format was (new)line delimited JSON records as in:


{"record":"1"}
{"record":"2"}
{"record":"3"}

instead of the valid JSON array format:


[{"record":"1"},{"record":"2"},{"record":"3"}]

The old format can still be used as input but will be read much faster when using the --line_delimited option on the command line. Thus, write:


# fast
$ catmandu convert JSON --line_delimited 1  < lines.json.txt

instead of:


# slow
$ catmandu convert JSON < lines.json.txt

By default Catmandu will export in the valid JSON array format. If you still need to use the old format, then provide the --line_delimited option on the command line:


$ catmandu convert YAML to JSON --line_delimited 1 < data.yaml
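
The same option is available when using the importer from Perl. A hedged sketch (assuming, as is usual for Catmandu, that the command line option maps to an importer option of the same name):

use Catmandu;

# read newline-delimited JSON records the fast way
my $importer = Catmandu->importer('JSON',
    file           => 'lines.json.txt',
    line_delimited => 1,
);

$importer->each(sub {
    my $record = shift;
    # ... process each record ...
});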

We thank all contributors for these wonderful four years of open source coding and we wish you all four new hacking years. Our thanks go to:

Catmandu Chat

On Friday June 26 2015 at 16:00 CEST, we’ll provide a one hour introduction/demo into processing data with Catmandu.

If you are interested, join us on the event page:

https://plus.google.com/hangouts/_/event/c6jcknos8egjlthk658m1btha9o

More instructions on the exact Google Hangout coordinates for this chat will follow on this web page on Friday June 26 at 15:45.

To enter the chat session, a working version of the Catmandu VirtualBox needs to be running on your system:

https://librecatproject.wordpress.com/get-catmandu/

Matching authors against VIAF identities

At Ghent University Library we enrich catalog records with VIAF identities to enhance the search experience in the catalog. When searching for all the books about ‘Chekhov’ we want to match all name variants of this author. Consult VIAF http://viaf.org/viaf/95216565/#Chekhov,_Anton_Pavlovich,_1860-1904 and you will see many of them.

  • Chekhov
  • Čehov
  • Tsjechof
  • Txékhov
  • etc

Any of these name variants can be available in the catalog data if authority control is not in place (or not maintained). Searching for any of these names should return results for all the variants. In the past it was a labor intensive, manual job for catalogers to maintain an authority file. Using results from Linked Data Fragments research by Ruben Verborgh (iMinds) and the Catmandu-RDF tools created by Jakob Voss (GBV) and RDF-LDF by Patrick Hochstenbach, Ghent University started an experiment to automatically enrich authors with VIAF identities. In this blog post we will report on the setup and results of this experiment which will also be reported at ELAG2015.

Context

Three ingredients are needed to create a web of data:

  1. A scalable way to produce data.
  2. The infrastructure to publish data.
  3. Clients accessing the data and reusing them in new contexts.

On the production side there doesn’t seem to be any problem creating huge datasets by libraries. Any transformation of library data to linked data will quickly generate an enormous number of RDF triples. We see this in the size of publicly available datasets:

Also for accessing data, from a consumer’s perspective the “easy” part seems to be covered. Instead of thousands of APIs and many document formats for any dataset, SPARQL and RDF provide the programmer with a single protocol and document model.

The claim of the Linked Data Fragments researchers is that on the publication side, reliable queryable access to public Linked Data datasets largely remains problematic due to the low availability percentages of public SPARQL endpoints [Ref]. This is confirmed by a 2013 study by researchers from Pontificia Universidad Católica in Chile and the National University of Ireland, in which more than half of the public SPARQL endpoints seemed to be offline 1.5 days per month. This gives an availability rate of less than 95% [Ref].

The source of this high rate of unavailability can be traced back to the service model of Linked Data where two extremes exist to publish data (see image below).

At one side, data dumps (or dereferencing of URLs) can be made available, which requires a simple HTTP server and lots of processing power on the client side. At the other side, an open SPARQL endpoint can be provided, which requires a lot of processing power (hence, hardware investment) on the server side. With SPARQL endpoints, clients can demand the execution of arbitrarily complicated queries. Furthermore, since each client requests unique, highly specific queries, regular caching mechanisms are ineffective, since they can only be optimized for repeated identical requests.

This situation can be compared with providing a database SQL dump to end users or an open database connection on which any possible SQL statement can be executed. To a lesser extent libraries are well aware of the different modes of operation between running OAI-PMH services and Z39.50/SRU services.

Linked Data Fragments researchers provide a third way, Triple Pattern Fragments, to publish data which tries to provide the best of both worlds: access to a full dump of datasets while providing a queryable and cacheable interface. For more information on the scalability of this solution I refer to the report presented at the 5th International USEWOD Workshop.

The experiment

VIAF doesn’t provide a public SPARQL endpoint, but a complete dump of the data is available at http://viaf.org/viaf/data/. In our experiments we used this VIAF (Virtual International Authority File) dump, which is made available under the ODC Attribution License. From this dump we created an HDT database. HDT provides a very efficient format to compress RDF data while maintaining browse and search functionality. Using command line tools, RDF/XML, Turtle and NTriples can be compressed into an HDT file with an index. This standalone file can be used to query huge datasets without the need for a database. A VIAF conversion to HDT results in a 7 GB file and a 4 GB index.

Using the Linked Data Fragments server by Ruben Verborgh, available at https://github.com/LinkedDataFragments/Server.js, this HDT file can be published as a NodeJS application.

For a demonstration of this server visit the iMinds experimental setup at: http://data.linkeddatafragments.org/viaf

Using Triple Pattern Fragments a simple REST protocol is available to query this dataset. For instance it is possible to download the complete dataset using this query:


$ curl -H "Accept: text/turtle" http://data.linkeddatafragments.org/viaf

If we only want the triples concerning Chekhov (http://viaf.org/viaf/95216565) we can provide a query parameter:


$ curl -H "Accept: text/turtle" http://data.linkeddatafragments.org/viaf?subject=http://viaf.org/viaf/95216565

Likewise, using the predicate and object query parameters, any combination of triples can be requested from the server.


$ curl -H "Accept: text/turtle" http://data.linkeddatafragments.org/viaf?object="Chekhov"

The memory requirements of this server are small enough to run a copy of the VIAF database on a MacBook Air laptop with 8GB RAM.

Using specialised Triple Pattern Fragments clients, SPARQL queries can be executed against this server. For the Catmandu project we created a Perl client RDF::LDF which is integrated into Catmandu-RDF.

To request all triples from the endpoint use:


$ catmandu convert RDF --url http://data.linkeddatafragments.org/viaf --sparql 'SELECT * {?s ?p ?o}'

Or, only those Triples that are about “Chekhov”:


$ catmandu convert RDF --url http://data.linkeddatafragments.org/viaf --sparql 'SELECT * {?s ?p "Chekhov"}'
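
The same query can be run from Perl with the Catmandu-RDF importer. A hedged sketch (assuming, as is usual for Catmandu, that the --url and --sparql command line options map to importer options of the same name):

use Catmandu;

# query the Triple Pattern Fragments endpoint with a SPARQL query
my $importer = Catmandu->importer('RDF',
    url    => 'http://data.linkeddatafragments.org/viaf',
    sparql => 'SELECT * {?s ?p "Chekhov"}',
);

$importer->each(sub {
    my $result = shift;
    # ... each $result holds the bindings of one query solution ...
});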

In the Ghent University experiment a more direct approach was taken to match authors to VIAF. First, as input, a MARC dump from the catalog is streamed into a Perl program using a Catmandu iterator. Then, we extract the 100 and 700 fields which contain $a (name) and $d (date) subfields. These two subfields are combined into a search query, as if we would search:


Chekhov, Anton Pavlovich, 1860-1904
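
In Catmandu’s MARC model (the list-of-lists structure shown in the breaker examples above) this extraction step might look like the hedged sketch below; it is only an illustration of the idea, not the actual script:

use Catmandu;

my $importer = Catmandu->importer('MARC', file => 'file.mrc', type => 'USMARC');

$importer->each(sub {
    my $record = shift;
    for my $field (@{ $record->{record} }) {
        my ($tag, $ind1, $ind2, @subfields) = @$field;
        next unless $tag eq '100' || $tag eq '700';
        my %sf = @subfields;               # naive: keeps only the last $a and $d
        next unless defined $sf{a} && defined $sf{d};
        my $query = join ' ', $sf{a}, $sf{d};   # e.g. "Chekhov, Anton Pavlovich, 1860-1904"
        # ... look up $query in the local VIAF copy ...
    }
});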

If there is exactly one hit in our local VIAF copy, then the result is reported. A complete script to process MARC files this way is available at a GitHub gist. To run the program against a MARC dump execute the import_viaf.pl command:


$ ./import_viaf.pl --type USMARC file.mrc
000000089-2 7001  L $$aEdwards, Everett Eugene,$$d1900- http://viaf.org/viaf/110156902
000000122-8 1001  L $$aClelland, Marjorie Bolton,$$d1912-   http://viaf.org/viaf/24253418
000000124-4 7001  L $$aSchein, Edgar H.
000000124-4 7001  L $$aKilbridge, Maurice D.,$$d1920-   http://viaf.org/viaf/29125668
000000124-4 7001  L $$aWiseman, Frederick.
000000221-6 1001  L $$aMiller, Wilhelm,$$d1869- http://viaf.org/viaf/104464511
000000256-9 1001  L $$aHazlett, Thomas C.,$$d1928-  http://viaf.org/viaf/65541341

[edit: 2017-05-18 an updated version of the code is available as a Git project https://github.com/LibreCat/MARC2RDF ]

All the authors in the MARC dump will be exported. If there is exactly one single match against VIAF it will be added to the author field. We ran this command for one night in a single thread against 338.426 authors containing a date and found 135.257 exact matches in VIAF (=40%).

In a quite recent follow up of our experiments, we investigated how LDF clients can be used in a federated setup. When the LDF algorithm combines the triple results from many LDF servers, one SPARQL query can be run over many machines. These results are demonstrated at the iMinds demo site where a single SPARQL query can be executed over the combined VIAF and DBPedia datasets. A Perl implementation of this federated search is available in the latest version of RDF-LDF at GitHub.

We strongly believe in the success of this setup and the scalability of this solution as demonstrated by Ruben Verborgh at the USEWOD Workshop. Using Linked Data Fragments a range of solutions is available to publish data on the web. From simple data dumps to a full SPARQL endpoint, any service level can be provided given the resources available. For more than half a year DBPedia has been running an LDF server with 99.9994% availability on an 8 CPU, 15 GB RAM Amazon server handling 4.5 million requests. Scaling out, services such as the LOD Laundromat clean 650.000 datasets and provide access to them using a single fat LDF server (256 GB RAM).

For more information on federated searches with Linked Data Fragments visit the blog post of Ruben Verborgh at: http://ruben.verborgh.org/blog/2015/06/09/federated-sparql-queries-in-your-browser/