May 22, 2019

Catmandu 1.20

On May 21st 2019, Nicolas Steenlant (our main developer and guru of Catmandu) released version 1.20 of our Catmandu toolkit with some very interesting new features. The main addition is a brand new way in which Catmandu Fix-es can be implemented using the new Catmandu::Path implementation. This coding by Nicolas will make it much easier and more straightforward to implement any kind of fix in Perl.

In previous versions of Catmandu there were only two options to create new fixes:

- Create a Perl package in the Catmandu::Fix namespace which implements a fix method. This was very easy: update the $data hash you got as first argument, return the updated $data and you were done. The disadvantage was that accessing fields in a deeply nested record was tricky and slow to code.
- Create a Perl package in the Catmandu::Fix namespace which implements emit functions. These are functions that generate Perl code on the fly. Using emit functions it was easier to get fast access to deeply nested data, but creating such Fix packages was pretty complex.

In Catmandu 1.20 there is now support for a third and easy way to create new Fixes using the Catmandu::Fix::Builder and Catmandu::Path classes. Let me give a simple example of a skeleton Fix that does nothing:

package Catmandu::Fix::rot13;

use Catmandu::Sane;
use Moo;
use Catmandu::Util::Path qw(as_path);
use Catmandu::Fix::Has;

with 'Catmandu::Fix::Builder';

has path => (fix_arg => 1);

sub _build_fixer {
    my ($self) = @_;
    sub {
        my $data = $_[0];
        # ..do some magic here ...
        $data;
    }
}

1;

In the code above we start implementing a rot13(path) Fix that should read a string on a JSON path and encrypt it using the ROT13 algorithm. This Fix is only the skeleton, which doesn't do anything yet. What we have is:

- We import the as_path method to be able to easily access data on JSON paths.
- We import Catmandu::Fix::Has to be able to use has path constructs to read in arguments for our Fix.
- We import Catmandu::Fix::Builder to be able to use the new Catmandu 1.20 builder role, which requires a _build_fixer method. The fixer is nothing more than a closure that reads the data, performs some action on the data and returns the data.

We can use this skeleton builder to implement our ROT13 algorithm. Add these lines instead of the "# ..do some magic here ..." part:

# On the path update the string value...
as_path($self->path)->updater(
    if_string => sub {
        my $value = shift;
        $value =~ tr{N-ZA-Mn-za-m}{A-Za-z};
        $value;
    },
)->($data);

The as_path method receives a JSON path string and creates an object which you can use to manipulate data on that path. One can update the values found with the updater method, read data at that path with the getter method, or create a new path with the creator method. In our example, we update the string found at the JSON path using the if_string condition. The updater has many conditions:

- if_string needs a closure describing what should happen when a string is found on the JSON path.
- if_array_ref needs a closure describing what should happen when an array is found on the JSON path.
- if_hash_ref needs a closure describing what should happen when a hash is found on the JSON path.

In our case we are only interested in transforming strings using our rot13(path) fix. The ROT13 algorithm is very easy: it simply replaces each letter by the letter 13 positions further in the alphabet. When we execute this fix on some sample data we get this result:

$ catmandu -I lib convert Null to YAML --fix 'add_field(demo,hello);rot13(demo)'
---
demo: uryyb
...

In this case the Fix can be written much shorter when we know that every Catmandu::Path method returns a closure (hint: look at the ->($data) in the code above). The complete Fix can look like:

package Catmandu::Fix::rot13;

use Catmandu::Sane;
use Moo;
use Catmandu::Util::Path qw(as_path);
use Catmandu::Fix::Has;

with 'Catmandu::Fix::Builder';

has path => (fix_arg => 1);

sub _build_fixer {
    my ($self) = @_;
    # On the path update the string value...
    as_path($self->path)->updater(
        if_string => sub {
            my $value = shift;
            $value =~ tr{N-ZA-Mn-za-m}{A-Za-z};
            $value;
        },
    );
}

1;

This is as easy as it can get to manipulate deeply nested data with your own Perl tools. All the code is in Perl, and there is no limit on the number of external CPAN packages one can include in these Builder fixes. We can't wait to see what Catmandu extensions you will create.

Written by hochstenbach Posted in Advanced, Updates Tagged with catmandu, fix language, perl

April 8, 2019

LPW 2018: "Contrarian Perl" – Tom Hukins

At 09:10, Tom Hukins shares his enthusiasm for Catmandu!

Written by hochstenbach Posted in Uncategorized

June 22, 2017

Introducing FileStores

Catmandu is always our tool of choice when working with structured data. Using the Elasticsearch or MongoDB Catmandu::Store-s it is quite trivial to store and retrieve metadata records. Storing and retrieving YAML, JSON (and by extension XML, MARC, CSV, ...) files can be as easy as the commands below:

$ catmandu import YAML to database < input.yml
$ catmandu import JSON to database < input.json
$ catmandu import MARC to database < marc.data
$ catmandu export database to YAML > output.yml

A catmandu.yml configuration file is required with the connection parameters to the database:

$ cat catmandu.yml
---
store:
  database:
    package: ElasticSearch
    options:
      client: '1_0::Direct'
      index_name: catmandu
...

Given these tools to import, export and even transform structured data, can this be extended to unstructured data? In institutional repositories like LibreCat we would like to manage metadata records and binary content (for example PDF files related to the metadata). Catmandu 1.06 introduces Catmandu::FileStore as an extension to the already existing Catmandu::Store to manage binary content. A Catmandu::FileStore is a Catmandu::Store where each Catmandu::Bag acts as a "container" or a "folder" that can contain zero or more records describing File content.
The file records themselves contain pointers to a backend storage implementation capable of serialising and streaming binary files. Out of the box, one Catmandu::FileStore implementation is available, Catmandu::Store::File::Simple (File::Simple for short), which stores files in a directory.

Some examples. To add a file to a FileStore, the stream command needs to be executed:

$ catmandu stream /tmp/myfile.pdf to File::Simple --root /data --bag 1234 --id myfile.pdf

In the command above:

- /tmp/myfile.pdf is the file to be uploaded to the FileStore.
- File::Simple is the name of the FileStore implementation, which requires one mandatory parameter, --root /data, the root directory where all files are stored.
- --bag 1234 is the "container" or "folder" which contains the uploaded files (with a numeric identifier 1234).
- --id myfile.pdf is the identifier for the newly created file record.

To download the file from the FileStore, the stream command needs to be executed in the opposite order:

$ catmandu stream File::Simple --root /data --bag 1234 --id myfile.pdf to /tmp/file.pdf

or

$ catmandu stream File::Simple --root /data --bag 1234 --id myfile.pdf > /tmp/file.pdf

On the file system the files are stored in a deeply nested structure to be able to spread out the FileStore over many disks:

/data
 `--/000
    `--/001
       `--/234
          `--/myfile.pdf

A listing of all "containers" can be retrieved by requesting an export of the default (index) bag of the FileStore:

$ catmandu export File::Simple --root /data to YAML
---
_id: 1234
...

A listing of all files in the container "1234" can be done by adding the bag name to the export command:

$ catmandu export File::Simple --root /data --bag 1234 to YAML
---
_id: myfile.pdf
_stream: !!perl/code '{ "DUMMY" }'
content_type: application/pdf
created: 1498125394
md5: ''
modified: 1498125394
size: 883202
...
Each FileStore implementation supports at least the fields presented above:

- _id: the name of the file
- _stream: a callback function to retrieve the content of the file (requires an IO::Handle as input)
- content_type: the MIME type of the file
- created: a timestamp of when the file was created
- modified: a timestamp of when the file was last modified
- size: the byte length of the file
- md5: optionally, an MD5 checksum

We envision that in Catmandu many implementations of FileStores can be created to store files in GitHub, BagIts, Fedora Commons and more backends. Using Catmandu::Plugin::SideCar, Catmandu::FileStore-s and Catmandu::Store-s can be combined as one endpoint. Using Catmandu::Store::Multi and Catmandu::Store::File::Multi many different implementations of Stores and FileStores can be combined.

This is a short introduction, but I hope you will experiment a bit with the new functionality and provide feedback to our project.

Written by hochstenbach Posted in Uncategorized

March 24, 2017

Catmandu 1.04

Catmandu 1.04 has been released with some nice new features. There are some new Fix routines that were requested by our community:

error

The "error" fix immediately stops the execution of the Fix script and throws an error. Use this to abort the processing of a data stream:

$ cat myfix.fix
unless exists(id)
    error("no id found?!")
end
$ catmandu convert JSON --fix myfix.fix < data.json

valid

The "valid" fix condition can be used to validate a record (or part of a record) against a JSONSchema. For instance, we can select only the valid records from a stream:

$ catmandu convert JSON --fix "select valid('', JSONSchema, schema:myschema.json)" < data.json

Or, create some logging:

$ cat myfix.fix
unless valid(author, JSONSchema, schema:authors.json)
    log("errors in the author field")
end
$ catmandu convert JSON --fix myfix.fix < data.json

rename

The "rename" fix can be used to recursively change the names of fields in your documents.
For example, when you have this JSON input:

{
  "foo.bar": "123",
  "my.name": "Patrick"
}

you can transform all periods (.) in the key names to underscores with this fix:

rename('','\.','_')

The first parameter is the field "rename" should work on (in our case the empty string, meaning the complete record). The second and third parameters are the regex search and replace patterns. The result of this fix is:

{
  "foo_bar": "123",
  "my_name": "Patrick"
}

The "rename" fix will only work on the keys of JSON paths. For example, given the following path:

my.deep.path.x.y.z

the keys are: my, deep, path, x, y, z. The second and third argument search and replace these separate keys. When you want to change a path as a whole, take a look at the "collapse()" and "expand()" fixes in combination with "rename":

collapse()
rename('',"my\.deep","my.very.very.deep")
expand()

Now the generated path will be:

my.very.very.deep.path.x.y.z

Of course the example above could be written more simply as "move_field(my.deep,my.very.very.deep)", but it serves as an example that powerful renaming is possible.

import_from_string

This Fix is a generalisation of the "from_json" Fix. It can transform a serialised string field in your data into an array of data. For instance, take the following YAML record:

---
foo: '{"name":"patrick"}'
...

The field 'foo' contains a JSON fragment. You can transform this JSON into real data using the following fix:

import_from_string(foo,JSON)

which creates a 'foo' array containing the deserialised JSON:

---
foo:
- name: patrick

The "import_from_string" fix looks very much like "from_json", but you can use any Catmandu::Importer. It always creates an array of hashes.
For instance, given the following YAML record:

---
foo: "name;hobby\nnicolas;drawing\npatrick;music"

you can transform the CSV fragment in the 'foo' field into data by using this fix:

import_from_string(foo,CSV,sep_char:";")

which gives as result:

---
foo:
- hobby: drawing
  name: nicolas
- hobby: music
  name: patrick
...

In the same way it can process MARC, XML, RDF, YAML or any other format supported by Catmandu.

export_to_string

The fix "export_to_string" is the opposite of "import_from_string" and is the generalisation of the "to_json" fix. Given the YAML from the previous example:

---
foo:
- hobby: drawing
  name: nicolas
- hobby: music
  name: patrick
...

you can create a CSV fragment in the 'foo' field with the following fix:

export_to_string(foo,CSV,sep_char:";")

which gives as result:

---
foo: "name;hobby\nnicolas;drawing\npatrick;music"

search_in_store

The fix "search_in_store" is a generalisation of the "lookup_in_store" fix. The latter is used to query the "_id" field in a Catmandu::Store and return the first hit. The former, "search_in_store", can query any field in a store and return all (or a subset) of the results. For instance, given the YAML record:

---
foo: "(title:ABC OR author:dave) AND NOT year:2013"
...

the following fix will replace the 'foo' field with the result of the query in a Solr index:

search_in_store('foo', store:Solr, url: 'http://localhost:8983/solr/catalog')

As a result, the document will be updated like:

---
foo:
  start: 0
  limit: 0
  hits: [...]
  total: 1000
...

where

- start: the starting index of the search result
- limit: the number of results per page
- hits: an array containing the data from the result page
- total: the total number of search results

Every Catmandu::Store can have a different layout of the result page. Look at the documentation of the specific store implementations for the details.
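The import_from_string/export_to_string pair above is essentially a round trip between a serialised string field and structured data. Outside Catmandu the same round trip can be sketched with Python's csv module (the function names export_to_string and import_from_string are borrowed from the fixes for illustration; this is not Catmandu code):

```python
import csv
import io

def export_to_string(rows, sep=";"):
    """Serialise a list of dicts to a CSV fragment, like export_to_string(foo,CSV)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]), delimiter=sep)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def import_from_string(text, sep=";"):
    """Parse a CSV fragment back into a list of dicts, like import_from_string(foo,CSV)."""
    return list(csv.DictReader(io.StringIO(text), delimiter=sep))
```

Serialising the two-person example from above and parsing it again yields the original list of hashes, just as the fixes do.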
Thanks for all your support for Catmandu and keep on data converting 🙂

Written by hochstenbach Posted in Uncategorized

June 16, 2016

Metadata Analysis at the Command-Line

Last week I was at the ELAG 2016 conference in Copenhagen and attended the excellent workshop by Christina Harlow of Cornell University on migrating digital collections metadata to RDF and Fedora4. One of the important steps required to migrate and model data to RDF is understanding what your data is about. Often old systems need to be converted for which little or no documentation is available. Instead of manually processing large XML or MARC dumps, tools like metadata breakers can be used to find out which fields are available in the legacy system and how they are used. Mark Phillips of the University of North Texas recently wrote a very inspiring article in Code4Lib on how this could be done in Python. In this blog post I'll demonstrate how this can be done using a new Catmandu tool: Catmandu::Breaker.

To follow the examples below, you need to have a system with Catmandu installed. The Catmandu::Breaker tools can then be installed with the command:

$ sudo cpan Catmandu::Breaker

A breaker is a command that transforms data into a line format that can be easily processed with Unix command-line tools such as grep, sort, uniq, cut and many more. If you need an introduction to Unix tools for data processing, please follow the examples Johann Rolschewski of Berlin State Library and I presented as an ELAG bootcamp.

As a simple example, let's create a YAML file and demonstrate how this file can be analysed using Catmandu::Breaker:

$ cat test.yaml
---
name: John
colors:
- black
- yellow
- red
institution:
  name: Acme
  years:
  - 1949
  - 1950
  - 1951
  - 1952

This example has a combination of simple name/value pairs, a list of colors and a deeply nested field.
To transform this data into the breaker format, execute the command:

$ catmandu convert YAML to Breaker < test.yaml
1	colors[]	black
1	colors[]	yellow
1	colors[]	red
1	institution.name	Acme
1	institution.years[]	1949
1	institution.years[]	1950
1	institution.years[]	1951
1	institution.years[]	1952
1	name	John

The breaker format is a tab-delimited output with three columns:

- A record identifier: read from the _id field in the input data, or a counter when no such field is present.
- A field name. Nested fields are separated by dots (.) and lists are indicated by square brackets ([]).
- A field value.

When you have a very large JSON or YAML file and need to find all the values of a deeply nested field, you could do something like:

$ catmandu convert YAML to Breaker < data.yaml | grep "institution.years"

Using Catmandu you can do this analysis on input formats such as JSON, YAML, XML, CSV and XLS (Excel). Just replace YAML by any of these formats and run the breaker command. Catmandu can also connect to OAI-PMH, Z39.50 or databases such as MongoDB, ElasticSearch, Solr or even relational databases such as MySQL, Postgres and Oracle. For instance, to get a breaker format for an OAI-PMH repository, issue a command like:

$ catmandu convert OAI --url http://lib.ugent.be/oai to Breaker

If your data is in a database you could issue an SQL query like:

$ catmandu convert DBI --dsn 'dbi:Oracle' --query 'SELECT * from TABLE WHERE ...' --user 'user/password' to Breaker

Some formats, such as MARC, don't produce a great breaker format out of the box. In Catmandu, MARC files are parsed into a list of lists.
Running a breaker on a MARC input you get this:

$ catmandu convert MARC to Breaker < t/camel.usmarc | head
fol05731351	record[][]	LDR
fol05731351	record[][]	_
fol05731351	record[][]	00755cam 22002414a 4500
fol05731351	record[][]	001
fol05731351	record[][]	_
fol05731351	record[][]	fol05731351
fol05731351	record[][]	082
fol05731351	record[][]	0
fol05731351	record[][]	0
fol05731351	record[][]	a

The MARC fields are part of the data, not part of the field name. This can be fixed by adding a special 'marc' handler to the breaker command:

$ catmandu convert MARC to Breaker --handler marc < t/camel.usmarc | head
fol05731351	LDR	00755cam 22002414a 4500
fol05731351	001	fol05731351
fol05731351	003	IMchF
fol05731351	005	20000613133448.0
fol05731351	008	000107s2000 nyua 001 0 eng
fol05731351	010a	00020737
fol05731351	020a	0471383147 (paper/cd-rom : alk. paper)
fol05731351	040a	DLC
fol05731351	040c	DLC
fol05731351	040d	DLC

Now all the MARC subfields are visible in the output. You can use this format to find, for instance, all unique values in a MARC file. Let's try to find all unique 008 values:

$ catmandu convert MARC to Breaker --handler marc < camel.usmarc | grep "\t008" | cut -f 3 | sort -u
000107s2000 nyua 001 0 eng
000203s2000 mau 001 0 eng
000315s1999 njua 001 0 eng
000318s1999 cau b 001 0 eng
000318s1999 caua 001 0 eng
000518s2000 mau 001 0 eng
000612s2000 mau 000 0 eng
000612s2000 mau 100 0 eng
000614s2000 mau 000 0 eng
000630s2000 cau 001 0 eng
00801nam 22002778a 4500

Catmandu::Breaker doesn't only break input data into an easy format for command-line processing, it can also do a statistical analysis on the breaker output.
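The flattening that produces the breaker's three columns is a small recursive walk over the record. A Python sketch of the same idea (illustrative only, not the Catmandu implementation; the function name breaker is our own):

```python
def breaker(record, rec_id, prefix=""):
    """Flatten a nested record into (id, field, value) breaker triples.

    Nested hash keys are joined with dots, list entries are marked with []."""
    lines = []
    if isinstance(record, dict):
        for key, value in record.items():
            path = f"{prefix}.{key}" if prefix else key
            lines += breaker(value, rec_id, path)
    elif isinstance(record, list):
        for value in record:
            lines += breaker(value, rec_id, prefix + "[]")
    else:
        # A scalar: emit one breaker line
        lines.append((rec_id, prefix, record))
    return lines
```

Running this on the test.yaml record from above yields exactly the paths shown in the command output, e.g. ("1", "institution.years[]", 1949).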
First process some data into the breaker format and save the result in a file:

$ catmandu convert MARC to Breaker --handler marc < t/camel.usmarc > result.breaker

Now, use this file as input for the 'catmandu breaker' command:

$ catmandu breaker result.breaker
| name | count | zeros | zeros% | min | max | mean | median | mode   | variance | stdev | uniq | entropy |
|------|-------|-------|--------|-----|-----|------|--------|--------|----------|-------|------|---------|
| 001  | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 10   | 3.3/3.3 |
| 003  | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 1    | 0.0/3.3 |
| 005  | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 10   | 3.3/3.3 |
| 008  | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 10   | 3.3/3.3 |
| 010a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 10   | 3.3/3.3 |
| 020a | 9     | 1     | 10.0   | 0   | 1   | 0.9  | 1      | 1      | 0.09     | 0.3   | 9    | 3.3/3.3 |
| 040a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 1    | 0.0/3.3 |
| 040c | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 1    | 0.0/3.3 |
| 040d | 5     | 5     | 50.0   | 0   | 1   | 0.5  | 0.5    | [0, 1] | 0.25     | 0.5   | 1    | 1.0/3.3 |
| 042a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 1    | 0.0/3.3 |
| 050a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 1    | 0.0/3.3 |
| 050b | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 10   | 3.3/3.3 |
| 0822 | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 1    | 0.0/3.3 |
| 082a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 3    | 0.9/3.3 |
| 100a | 9     | 1     | 10.0   | 0   | 1   | 0.9  | 1      | 1      | 0.09     | 0.3   | 8    | 3.1/3.3 |
| 100d | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 100q | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 111a | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 111c | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 111d | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 245a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 9    | 3.1/3.3 |
| 245b | 3     | 7     | 70.0   | 0   | 1   | 0.3  | 0      | 0      | 0.21     | 0.46  | 3    | 1.4/3.3 |
| 245c | 9     | 1     | 10.0   | 0   | 1   | 0.9  | 1      | 1      | 0.09     | 0.3   | 8    | 3.1/3.3 |
| 250a | 3     | 7     | 70.0   | 0   | 1   | 0.3  | 0      | 0      | 0.21     | 0.46  | 3    | 1.4/3.3 |
| 260a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 6    | 2.3/3.3 |
| 260b | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 5    | 2.0/3.3 |
| 260c | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 2    | 0.9/3.3 |
| 263a | 6     | 4     | 40.0   | 0   | 1   | 0.6  | 1      | 1      | 0.24     | 0.49  | 4    | 2.0/3.3 |
| 300a | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 5    | 1.8/3.3 |
| 300b | 3     | 7     | 70.0   | 0   | 1   | 0.3  | 0      | 0      | 0.21     | 0.46  | 1    | 0.9/3.3 |
| 300c | 4     | 6     | 60.0   | 0   | 1   | 0.4  | 0      | 0      | 0.24     | 0.49  | 4    | 1.8/3.3 |
| 300e | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 500a | 2     | 8     | 80.0   | 0   | 1   | 0.2  | 0      | 0      | 0.16     | 0.4   | 2    | 0.9/3.3 |
| 504a | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 630a | 2     | 9     | 90.0   | 0   | 2   | 0.2  | 0      | 0      | 0.36     | 0.6   | 2    | 0.9/3.5 |
| 650a | 15    | 0     | 0.0    | 1   | 3   | 1.5  | 1      | 1      | 0.65     | 0.81  | 6    | 1.7/3.9 |
| 650v | 1     | 9     | 90.0   | 0   | 1   | 0.1  | 0      | 0      | 0.09     | 0.3   | 1    | 0.5/3.3 |
| 700a | 5     | 7     | 70.0   | 0   | 2   | 0.5  | 0      | 0      | 0.65     | 0.81  | 5    | 1.9/3.6 |
| LDR  | 10    | 0     | 0.0    | 1   | 1   | 1    | 1      | 1      | 0        | 0     | 10   | 3.3/3.3 |

As a result you get a table listing the usage of subfields in all the input records. From this output we can learn:

- The '001' field is available in 10 records (see: count).
- One record doesn't contain a '020a' subfield (see: zeros).
- The '650a' is available in all records, at least once and at most 3 times (see: min, max).
- Only 8 out of 10 '100a' subfields have unique values (see: uniq).

The last column, 'entropy', provides a number indicating how interesting the field is for search engines: the higher the entropy, the more unique content can be found. I hope these tools are of some use in your projects!

Written by hochstenbach Posted in Uncategorized

May 10, 2016

Catmandu 1.01

Catmandu 1.01 has been released today.
There have been some speed improvements in processing fixes due to switching from the Data::Util to the Ref::Util package, which has better support on many Perl platforms.

For the command line there is now support for preprocessing Fix scripts. This means one can read in variables from the command line into a Fix script. For instance, when processing data you might want to keep some provenance data about your data sources in the output. This can be done with the following command:

$ catmandu convert MARC --fix myfixes.fix --var source=Publisher1 --var date=2014-2015 < data.mrc

with a myfixes.fix like:

add_field(my_source,{{source}})
add_field(my_date,{{date}})
marc_field(245,title)
marc_field(022,issn)
. . . etc . .

Your JSON output will now contain the clean 'title' and 'issn' fields but also, for each record, a 'my_source' field with value 'Publisher1' and a 'my_date' field with value '2014-2015'. Because the Text::Hogan compiler is used, full support of the Mustache language is available.

In this new Catmandu version there are also some new fix functions you might want to try out; see our Fixes Cheat Sheet for a full overview.

Written by hochstenbach Posted in Updates

April 20, 2016

Parallel Processing with Catmandu

In this blog post I'll show a technique to scale out your data processing with Catmandu. All catmandu scripts use a single process, in a single thread. This means that if you need to process twice as much data, you need twice as much time.
Running a catmandu convert command with the -v option will show you the speed of a typical conversion:

$ catmandu convert -v MARC to JSON --fix heavy_load.fix < input.marc > output.json
added 100 (55/sec)
added 200 (76/sec)
added 300 (87/sec)
added 400 (92/sec)
added 500 (90/sec)
added 600 (94/sec)
added 700 (97/sec)
added 800 (97/sec)
added 900 (96/sec)
added 1000 (97/sec)

In the example above we process an 'input.marc' MARC file into an 'output.json' JSON file with some difficult data cleaning in the 'heavy_load.fix' Fix script. Using a single process we reach about 97 records per second. It would take 2.8 hours to process one million records and 28 hours to process ten million records. Can we make this any faster?

Computers nowadays ship with multiple processors. Using a single process, only one of these processors is used for calculations. One would get much more 'bang for the buck' if all the processors could be used. One technique to do that is called 'parallel processing'. To check the number of processors available on your Linux system, look at the file '/proc/cpuinfo':

$ cat /proc/cpuinfo | grep processor
processor : 0
processor : 1

The example above shows two lines: I have two cores available for processing on my laptop. In our library we have servers with 4, 8, 16 or more processors. This means that if we could do our calculations in a smart way, our processing could be 2, 4, 8 or 16 times as fast (in principle). To check if your computer is using all that calculating power, use the 'uptime' command:

$ uptime
11:15:21 up 622 days, 1:53, 2 users, load average: 1.23, 1.70, 1.95

In the example above I ran 'uptime' on one of our servers with 4 processors. It shows a load average of about 1.23 to 1.95. This means that in the last 15 minutes between 1 and 2 processors were being used and the other two did nothing.
If the load average is less than the number of cores (4 in our case), the server is waiting for input. If the load average equals the number of cores, the server is using all the CPU power available. If the load is bigger than the number of cores, there is more work available than the machine can execute and some processes need to wait.

Now that you know some Unix commands, we can start using the processing power available on your machine. In my examples I'm going to use a Unix tool called GNU parallel to run Catmandu scripts on all the processors in my machine in the most efficient way possible. To do this you need to install GNU parallel:

sudo yum install parallel

The second ingredient we need is a way to cut our input data into many parts. For instance, if we have a 4-processor machine we would like to create 4 equal chunks of data to process in parallel. There are very many ways to cut your data into parts. I'll show you a trick we use at Ghent University Library with the help of a MongoDB installation. First install MongoDB and the MongoDB catmandu plugins (these examples are taken from our CentOS documentation):

$ sudo cat > /etc/yum.repos.d/mongodb.repo

After importing the data into MongoDB with a Fix that tags every record with a random partition number (a 'part.rand2' field with value 0 or 1 for a 2-processor machine), each chunk can be exported with a query (here the database is simply called 'data'):

$ catmandu export MongoDB --database_name data -q '{"part.rand2":0}' > part1
$ catmandu export MongoDB --database_name data -q '{"part.rand2":1}' > part2

We are going to use these catmandu commands in a Bash script which makes use of GNU parallel to run many conversions simultaneously:

#!/bin/bash
# file: parallel.sh
CPU=$1

if [ "${CPU}" == "" ]; then
    # No argument given: ask GNU parallel to run this script once per chunk
    /usr/bin/parallel -u $0 {} <<EOF
0
1
EOF
else
    # Called with a chunk number: read that chunk from the database and process it
    catmandu export MongoDB --database_name data \
        -q "{\"part.rand2\":${CPU}}" --fix heavy_load.fix to JSON > result.${CPU}.json
fi

The example script above shows how a conversion process could run on a 2-processor machine. The lines with '/usr/bin/parallel' show how GNU parallel is used to call this script with two arguments, '0' and '1' (for the 2-processor example). The lines with 'catmandu export' show how chunks of data are read from the database and processed with the 'heavy_load.fix' Fix script.
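The partitioning trick itself does not depend on MongoDB: any way of tagging each record with a random chunk number works. A Python sketch of the idea (illustrative only; the function names add_partition_keys and chunk are our own):

```python
import random

def add_partition_keys(records, n, seed=None):
    """Tag each record with a random partition number 0..n-1 (the 'part.rand2' trick)."""
    rng = random.Random(seed)
    for rec in records:
        rec.setdefault("part", {})[f"rand{n}"] = rng.randrange(n)
    return records

def chunk(records, n, i):
    """Select the records of partition i, as each parallel worker would."""
    return [r for r in records if r["part"][f"rand{n}"] == i]
```

Because every record gets exactly one partition number, the chunks are disjoint and together cover the whole dataset, so the workers never process a record twice.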
If you have a 32-processor machine, you would need to provide parallel an input containing the numbers 0 to 31 and change the query to 'part.rand32'. GNU parallel is a very powerful command. It gives the opportunity to run many processes in parallel and even to spread out the load over many machines in a cluster. When all these machines have access to your MongoDB database, they can all receive chunks of data to be processed. The only task left is to combine all the results, which can be as easy as a simple 'cat' command:

$ cat result.*.json > final_result.json

Written by hochstenbach Posted in Advanced Tagged with catmandu, JSON Path, library, Linux, marc, parallel processing, perl

February 25, 2016

Catmandu 1.00

After 4 years of programming and 88 minor releases we are finally there: the release of Catmandu 1.00! We have pushed the test coverage of the code to 93.97% and added and cleaned a lot of our documentation. For the new features read our Changes file. A few important changes should be noted.

By default Catmandu will read and write valid JSON files. In previous versions the default input format was (new)line-delimited JSON records, as in:

{"record":"1"}
{"record":"2"}
{"record":"3"}

instead of the valid JSON array format:

[{"record":"1"},{"record":"2"},{"record":"3"}]

The old format can still be used as input, and will even be read much faster, when using the --line_delimited option on the command line. Thus, write:

# fast
$ catmandu convert JSON --line_delimited 1 < lines.json.txt

instead of:

# slow
$ catmandu convert JSON < lines.json.txt

By default Catmandu will export in the valid JSON-array format. If you still need the old format, provide the --line_delimited option on the command line:

$ catmandu convert YAML to JSON --line_delimited 1 < data.yaml

We thank all contributors for these wonderful four years of open source coding and we wish you all four new hacking years.
Our thanks go to:

Nicolas Steenlant
Christian Pietsch
Dave Sherohman
Dries Moreels
Friedrich Summann
Jakob Voss
Johann Rolschewski
Jonas Smedegaard
Jörgen Eriksson
Magnus Enger
Maria Hedberg
Mathias Lösch
Najko Jahn
Nicolas Franck
Patrick Hochstenbach
Petra Kohorst
Robin Sheat
Snorri Briem
Upasana Shukla
Vitali Peil

and to the Deutsche Forschungsgemeinschaft for providing us the travel funds, and to Lund University Library, Ghent University Library and Bielefeld University Library for providing a very welcome environment for open source collaboration.

Written by hochstenbach Posted in Uncategorized

June 19, 2015

Catmandu Chat

On Friday June 26 2015 at 16:00 CEST, we'll provide a one-hour introduction/demo into processing data with Catmandu. If you are interested, join us on the event page: https://plus.google.com/hangouts/_/event/c6jcknos8egjlthk658m1btha9o

More instructions on the exact Google Hangout coordinates for this chat will follow on this web page on Friday June 26 at 15:45. To enter the chat session, a working version of the Catmandu VirtualBox needs to be running on your system: https://librecatproject.wordpress.com/get-catmandu/

Written by hochstenbach Posted in Events

June 3, 2015

Matching authors against VIAF identities

At Ghent University Library we enrich catalog records with VIAF identities to enhance the search experience in the catalog. When searching for all the books about 'Chekhov' we want to match all name variants of this author. Consult VIAF http://viaf.org/viaf/95216565/#Chekhov,_Anton_Pavlovich,_1860-1904 and you will see many of them:

Chekhov
Čehov
Tsjechof
Txékhov
etc.

Any of these name variants can be available in the catalog data if authority control is not in place (or not maintained). Searching for any of these names should return results for all the variants. In the past it was a labor-intensive, manual job for catalogers to maintain an authority file.
Using results from the Linked Data Fragments research by Ruben Verborgh (iMinds), the Catmandu-RDF tools created by Jakob Voss (GBV) and RDF-LDF by Patrick Hochstenbach, Ghent University started an experiment to automatically enrich authors with VIAF identities. In this blog post we will report on the setup and results of this experiment, which will also be reported at ELAG2015.

Context

Three ingredients are needed to create a web of data:

- A scalable way to produce data.
- The infrastructure to publish data.
- Clients accessing the data and reusing them in new contexts.

On the production side there doesn't seem to be any problem with libraries creating huge datasets. Any transformation of library data to linked data will quickly generate an enormous number of RDF triples. We see this in the size of publicly available datasets:

UGent Academic Bibliography: 12.000.000 triples
Libris catalog: 50.000.000 triples
Gallica: 72.000.000 triples
DBPedia: 500.000.000 triples
VIAF: 600.000.000 triples
Europeana: 900.000.000 triples
The European Library: 3.500.000.000 triples
PubChem: 60.000.000.000 triples

Also for accessing data, from a consumer's perspective the "easy" part seems to be covered. Instead of thousands of APIs and many document formats for every dataset, SPARQL and RDF provide the programmer with a single protocol and document model. The claim of the Linked Data Fragments researchers is that on the publication side, reliable queryable access to public Linked Data datasets largely remains problematic due to the low availability percentages of public SPARQL endpoints [Ref]. This is confirmed by a 2013 study by researchers from Pontificia Universidad Católica in Chile and the National University of Ireland, in which more than half of the public SPARQL endpoints seemed to be offline 1.5 days per month. This gives an availability rate of less than 95% [Ref].
The source of this high rate of unavailability can be traced back to the service model of Linked Data, where two extremes exist to publish data (see image below).

From: http://www.slideshare.net/RubenVerborgh/dbpedias-triple-pattern-fragments

At one extreme, data dumps (or dereferencing of URLs) can be made available, which requires a simple HTTP server and lots of processing power on the client side. At the other extreme, an open SPARQL endpoint can be provided, which requires a lot of processing power (hence, hardware investment) on the server side. With SPARQL endpoints, clients can demand the execution of arbitrarily complicated queries. Furthermore, since each client requests unique, highly specific queries, regular caching mechanisms are ineffective, since they can only be optimized for repeated identical requests. This situation can be compared with providing a database SQL dump to end users versus an open database connection on which any possible SQL statement can be executed. To a lesser extent, libraries are well aware of the different modes of operation between running OAI-PMH services and Z39.50/SRU services.

Linked Data Fragments researchers provide a third way, Triple Pattern Fragments, to publish data which tries to provide the best of both worlds: access to a full dump of datasets while providing a queryable and cacheable interface. For more information on the scalability of this solution I refer to the report presented at the 5th International USEWOD Workshop.

The experiment

VIAF doesn't provide a public SPARQL endpoint, but a complete dump of the data is available at http://viaf.org/viaf/data/. In our experiments we used the VIAF (Virtual International Authority File) dump, which is made available under the ODC Attribution License. From this dump we created an HDT database. HDT provides a very efficient format to compress RDF data while maintaining browse and search functionality.
Using command line tools, RDF/XML, Turtle and NTriples can be compressed into an HDT file with an index. This standalone file can be used to query huge datasets without the need for a database. A VIAF conversion to HDT results in a 7 GB file and a 4 GB index.

Using the Linked Data Fragments server by Ruben Verborgh, available at https://github.com/LinkedDataFragments/Server.js, this HDT file can be published as a NodeJS application. For a demonstration of this server visit the iMinds experimental setup at: http://data.linkeddatafragments.org/viaf

Using Triple Pattern Fragments, a simple REST protocol is available to query this dataset. For instance, it is possible to download the complete dataset using this query:

$ curl -H "Accept: text/turtle" http://data.linkeddatafragments.org/viaf

If we only want the triples concerning Chekhov (http://viaf.org/viaf/95216565) we can provide a query parameter:

$ curl -H "Accept: text/turtle" http://data.linkeddatafragments.org/viaf?subject=http://viaf.org/viaf/95216565

Likewise, using the predicate and object queries, any combination of triples can be requested from the server:

$ curl -H "Accept: text/turtle" http://data.linkeddatafragments.org/viaf?object="Chekhov"

The memory requirements of this server are small enough to run a copy of the VIAF database on a MacBook Air laptop with 8 GB RAM. Using specialised Triple Pattern Fragments clients, SPARQL queries can be executed against this server. For the Catmandu project we created a Perl client RDF::LDF which is integrated into Catmandu-RDF. To request all triples from the endpoint use:

$ catmandu convert RDF --url http://data.linkeddatafragments.org/viaf --sparql 'SELECT * {?s ?p ?o}'

Or, only those triples that are about "Chekhov":

$ catmandu convert RDF --url http://data.linkeddatafragments.org/viaf --sparql 'SELECT * {?s ?p "Chekhov"}'

In the Ghent University experiment a more direct approach was taken to match authors to VIAF.
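The Triple Pattern Fragments interface used above is plain HTTP: each request fixes any subset of the subject, predicate and object positions, and unset positions act as wildcards. As a sketch of how such request URLs are composed (a hypothetical Python helper for illustration; the actual Catmandu client is the Perl module RDF::LDF):

```python
from urllib.parse import urlencode

VIAF_ENDPOINT = "http://data.linkeddatafragments.org/viaf"

def fragment_url(endpoint, subject=None, predicate=None, obj=None):
    """Build a Triple Pattern Fragments request URL.
    Unset positions are wildcards, as in the curl examples above."""
    params = [(k, v) for k, v in
              (("subject", subject), ("predicate", predicate), ("object", obj))
              if v is not None]
    return f"{endpoint}?{urlencode(params)}" if params else endpoint

# all triples with Chekhov's VIAF URI as subject (URL-encoded)
print(fragment_url(VIAF_ENDPOINT, subject="http://viaf.org/viaf/95216565"))
```

Note that the helper percent-encodes the triple pattern values, which the shorthand curl examples above leave to the server to tolerate.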
First, a MARC dump from the catalog is streamed into a Perl program using a Catmandu iterator. Then we extract the 100 and 700 fields, which contain the $a (name) and $d (date) subfields. These two subfields are combined into a search query, as if we would search:

Chekhov, Anton Pavlovich, 1860-1904

If there is exactly one hit in our local VIAF copy, the result is reported. A complete script to process MARC files this way is available as a GitHub gist. To run the program against a MARC dump execute the import_viaf.pl command:

$ ./import_viaf.pl --type USMARC file.mrc
000000089-2 7001 L $$aEdwards, Everett Eugene,$$d1900- http://viaf.org/viaf/110156902
000000122-8 1001 L $$aClelland, Marjorie Bolton,$$d1912- http://viaf.org/viaf/24253418
000000124-4 7001 L $$aSchein, Edgar H.
000000124-4 7001 L $$aKilbridge, Maurice D.,$$d1920- http://viaf.org/viaf/29125668
000000124-4 7001 L $$aWiseman, Frederick.
000000221-6 1001 L $$aMiller, Wilhelm,$$d1869- http://viaf.org/viaf/104464511
000000256-9 1001 L $$aHazlett, Thomas C.,$$d1928- http://viaf.org/viaf/65541341

[edit: 2017-05-18 an updated version of the code is available as a Git project https://github.com/LibreCat/MARC2RDF ]

All the authors in the MARC dump will be exported. If there is exactly one match against VIAF, the VIAF identity will be added to the author field. We ran this command for one night in a single thread against 338.426 authors containing a date and found 135.257 exact matches in VIAF (=40%).

In a quite recent follow-up of our experiments, we investigated how LDF clients can be used in a federated setup. When the LDF algorithm combines the triple results from many LDF servers, one SPARQL query can be run over many machines. These results are demonstrated at the iMinds demo site, where a single SPARQL query can be executed over the combined VIAF and DBPedia datasets. A Perl implementation of this federated search is available in the latest version of RDF-LDF at GitHub.
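The matching step described above boils down to concatenating the $a and $d subfields into one search key. A minimal sketch of that normalization (Python for illustration; the actual gist is written in Perl, and the trailing-punctuation handling shown here is an assumption, not taken from the gist):

```python
def viaf_search_key(name, date=None):
    """Combine the MARC 100/700 $a (name) and $d (date) subfields
    into a single query string such as
    'Chekhov, Anton Pavlovich, 1860-1904'."""
    name = name.strip().rstrip(",")  # drop trailing subfield punctuation
    return f"{name}, {date}" if date else name

print(viaf_search_key("Chekhov, Anton Pavlovich,", "1860-1904"))
# Chekhov, Anton Pavlovich, 1860-1904
```

A record without a $d subfield (such as "Schein, Edgar H." in the output above) yields a name-only key, which explains why only authors with a date were counted in the match statistics.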
We strongly believe in the success of this setup and the scalability of this solution, as demonstrated by Ruben Verborgh at the USEWOD Workshop. Using Linked Data Fragments, a range of solutions is available to publish data on the web. From simple data dumps to a full SPARQL endpoint, any service level can be provided given the available resources. For more than half a year DBPedia has been running an LDF server with 99.9994% availability on an 8-CPU, 15 GB RAM Amazon server, serving 4.5 million requests. Scaling out, services such as the LOD Laundromat clean 650.000 datasets and provide access to them using a single fat LDF server (256 GB RAM).

For more information on federated searches with Linked Data Fragments visit the blog post by Ruben Verborgh at: http://ruben.verborgh.org/blog/2015/06/09/federated-sparql-queries-in-your-browser/

Written by hochstenbach Posted in Advanced Tagged with LDF, Linked Data, marc, perl, RDF, SPARQL, Triple Pattern Fragments, VIAF