MarcEdit 7.5 Update Status

By reeset / On / In Uncategorized

I’m planning to make testing versions of the new MarcEdit instance broadly available around the first of the year, and available to a handful of testers in mid-December.  The transition from .NET 4.7.2 to .NET 5 was more significant than I would have thought – it includes a number of swapped default values – so I’ve been hunting down behavior changes.  Currently, the following updates have been completed.

    • Framework used: .NET 5.0
    • RDA Helper: 100$e process modified. Added criteria to $e generation. Previously, if a $e was already present, a new $e wasn’t added. Now, if a $e or $4 is present, a $e won’t be generated.
    • RDA Helper: Changes related to RDA updates
    • Added new elements to the new window programs for pinning
    • XML Editor: Delete Block element added
    • XML Editor: XQuery processing option
    • If a set of records includes bibliographic and authority records, the RDA helper will skip the authority records
    • Updated Installation Wizard (allows migration of 6.x and 7.x content into the tool)
    • Updating OCLC Integration to use new Metadata API Search
    • Delimited Text Translator – added ability to use custom mnemonic replacements
    • Delimited Text Translator – no longer a standalone program
      • App is now part of the main MarcEdit app
      • Command line options folded into the MarcEdit app
    • [in process] linked data rules file version 2
      • Enhancements to the rules file schema

–tr

Changes to System.Diagnostics.Process in .NET Core

By reeset / On / In Uncategorized

In .NET Core, one of the changes that caught me by surprise is the change related to starting processes.  In the .NET Framework – you can open a web site, file, etc. just by using the following:

System.Diagnostics.Process.Start(path);

However, in .NET Core – this won’t work.  When trying to open a file, the process will fail – reporting that a program isn’t associated with the file type.  When trying to open a folder on the system, the process will fail with a permission error unless the application is running with administrator permissions (which you don’t want to be doing).  The change is related to a change in a property default – specifically:

System.Diagnostics.ProcessStartInfo.UseShellExecute

In the .NET Framework – this property is set to true by default.  In .NET Core, it is set to false.  The difference here probably makes sense – .NET Core is meant to be more portable and you do need to change this value on some systems.  To fix this, I’d recommend removing any direct calls to Process.Start and running them through a function like this:

<code>

// Opens a URL with the program registered for it (usually the browser).
// Tries the OS shell first, then falls back to starting the target directly.
public static void OpenURL(string url)
{
  var psi = new System.Diagnostics.ProcessStartInfo
  {
    FileName = url,
    UseShellExecute = true
  };
  try {
    System.Diagnostics.Process.Start(psi);
  } catch {
    psi.UseShellExecute = false;
    System.Diagnostics.Process.Start(psi);
  }
}

// Opens a file (optionally with arguments) or a folder.
public static void OpenFileOrFolder(string spath, string sargs = "")
{
  var psi = new System.Diagnostics.ProcessStartInfo
  {
    FileName = spath,
    UseShellExecute = true
  };
  // Arguments only apply to files; folders are opened as-is.
  System.IO.FileAttributes attr = System.IO.File.GetAttributes(spath);
  bool isFolder = (attr & System.IO.FileAttributes.Directory) == System.IO.FileAttributes.Directory;
  if (!isFolder && sargs.Trim().Length != 0) {
    psi.Arguments = sargs;
  }
  try {
    System.Diagnostics.Process.Start(psi);
  } catch {
    // Shell execution failed; retry without it.
    psi.UseShellExecute = false;
    System.Diagnostics.Process.Start(psi);
  }
}

</code>

Since this vexed me for a little bit – I’m putting this here so I don’t forget.

–tr

MarcEdit 7.5/MarcEdit Mac 3.5 Work

By reeset / On / In MarcEdit

Every year, around this time, I try to dedicate significant time to address any large project work that may have been percolating around MarcEdit.  This year will be no different.  Over the past 4 months, I’ve been working on moving MarcEdit away from the .NET 4.7.2 Framework to .NET Core 3.1.  There are a lot of reasons for looking at this, the most important being that this is the direction Microsoft is taking the framework – a move to unify the various .NET development platforms to make distribution and maintenance easier.  Well, with the release of .NET 5 this Nov., all the tools I need to officially make this transition are now in place.

So, over the next two months, I’ll be working on shifting MarcEdit away from Framework 4.7.2 and to .NET 5.  I believe this will be possible – I only have concerns about two libraries that I rely on – and if I have to, both are open source so I can look at potentially spending time helping the project maintainers target a non-framework build.  My hope is to have a working version of MarcEdit using .NET 5 by Thanksgiving that I can start unit testing and testing locally.

Of course, with this change, I’ll also have to change the installer process.  The reason is that this transition will remove the necessity of having .NET installed on one’s machine.  One of the changes to the framework is the ability to publish self-contained applications – allowing for faster startup and lower memory usage.  This is something I’m excited about, as I currently move slowly when updating build frameworks due to the need to have these frameworks installed locally.  By removing that dependency, I’m hoping to be able to take advantage of changes to the C# language that make programming easier and more efficient, while also allowing me to remove some of the workaround code I’ve had to develop to account for bugs or limitations in previous frameworks.

Finally, this change is going to simplify a lot of cross platform development – and once the initial transition has occurred, I’ll be spending time working on expanding the MarcEdit MacOS version.  There are a couple of areas where this program still lacks parity in relation to the Windows version, and these changes will give me the opportunity to close many of these gaps. 

–tr

MarcEdit: Identifying Invalid UTF-8 Data in MARC Records

By reeset / On / In MarcEdit

The fifth circle, illustrated by Stradanus

Ah Dante – if only he had been a librarian.  I’m almost certain that had the Divine Comedy been written by a cataloger, character encodings, and those that mangle them, would definitely make an appearance.  I can almost see the story in my head.  Our wayward traveler, confused when our guide, Virgil, comments on the unholy mess libraries, vendors, and tool writers in general have made of the implementation of UTF-8 across the library spectrum, is taken to the 5th circle of hell, filled with broken characters and undefined character boxes.  But spend any time working in metadata management today, with its problems of mixed Unicode normalizations, the false equivalency of ISO-8859-2 and UTF-8 (especially by vendors that serve Western European markets), lackluster font development, and applications and programming languages that quietly and happily mangle UTF-8 data as a matter of course, and you can suddenly see why we might make a stop at the lake of fire and eternal damnation.

Within MarcEdit, one of the hardest things that the application does is attempt to correct and normalize character encodings across the various known codepoints.  This isn’t super easy – especially when our MARC forepersons made that fateful decision to create MARC-8, a 100% imaginary character encoding only (kind of) supported within the Library community and software.  These kinds of decisions, and the desire to maintain legacy compatibility, have haunted our metadata and made working with it immensely complicated.  Sometimes, these complications can be managed; other times, the data is so gruesomely mangled that Brutus, himself, would cry yield.  That’s what this new option will attempt to help remediate.

Through the years, I’ve often helped individuals come up with a wide variety of ways to identify invalid UTF-8 characters that litter library records.  Sometimes, this can be straightforward, but more often, it’s not.  To that end, I’ve attempted to provide a couple of tools that will hopefully help to identify and support some kind of remediation for catalogers haunted by the specter of bad data.

Identification

The first enhancement comes in the MARCValidator.  When validating a record against the rules file, the tool will automatically attempt to determine if UTF-8 data (if present) found within a record is valid.  If not, the information will be presented as a warning – identifying the field, record number, and data where the invalid data was identified.

By facilitating a process to identify invalid UTF-8 record data within the validator – the idea is that this will empower catalogers looking to take a more active role in rooting out bad diacritical data before a record is loaded into the catalog and made available to the public.

Removing bad data

In addition to identification, I’ve added three new options for dealing with invalid character data.

Delete Subfields

Added to the Edit Subfield Utility – I’ve included an option to evaluate and delete a subfield if invalid character set data is encountered.

Delete Fields

Added to the Add/Delete Field Utility – I’ve included an option to evaluate and delete a field if invalid character set data is encountered.

Delete Records

Added to the Delete Records tool within the MarcEditor – I’ve included an option to delete a record if a field or field group has been identified as having invalid character set data.  Additionally, this tool will create a second file in the same directory as the file being processed, that will contain the deleted records in a file structured as: [name of original file]_bad_yyyyMMddhhmmss.mrk

Caveat Emptor

Hopefully, the above sounds useful.  I think it will be.  There have been many times where I wished I had these tools readily at my fingertips.  If only it were this easy.  As I mentioned above, encodings are difficult.  The Unicode specification is constantly changing, and identifying invalid characters is definitely more art than science in many cases.  There are tools and established algorithms, and I use these approaches.  I’m also leveraging a method within the .NET Framework, CharUnicodeInfo.GetUnicodeCategory, which attempts to take a character and break it down into its character classification.  When a character isn’t classified, that’s usually a good indicator that it’s not valid.  This process won’t catch everything, but it hopefully will provide a good starting point for users vexed with these issues and in need of a tool in their toolbox to attempt to remediate them.
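To make the classification idea concrete, here is a minimal sketch of one way such a check could look.  It is not MarcEdit’s actual code: the class and method names (Utf8Check, LooksValid) are made up for illustration, and the choice to combine a strict UTF-8 decode with a GetUnicodeCategory pass, and to treat U+FFFD as suspect, is an assumption.

```csharp
using System.Globalization;
using System.Text;

public static class Utf8Check
{
    // Hypothetical helper: returns true when the bytes decode as strict UTF-8
    // and no decoded character is unassigned or a U+FFFD replacement character.
    public static bool LooksValid(byte[] data)
    {
        string text;
        try
        {
            // throwOnInvalidBytes: true makes malformed byte sequences throw
            // instead of being silently replaced.
            text = new UTF8Encoding(false, true).GetString(data);
        }
        catch (DecoderFallbackException)
        {
            return false;
        }

        for (int i = 0; i < text.Length; i++)
        {
            // U+FFFD usually means the data was already mangled upstream.
            if (text[i] == '\uFFFD') return false;

            // The (string, int) overload classifies a full surrogate pair.
            if (CharUnicodeInfo.GetUnicodeCategory(text, i) == UnicodeCategory.OtherNotAssigned)
                return false;

            // Skip the low half of a surrogate pair that was just classified.
            if (char.IsHighSurrogate(text[i])) i++;
        }
        return true;
    }
}
```

With this sketch, the bytes for "café" encoded as UTF-8 pass, while the same word saved as Latin-1 (a lone 0xE9 byte at the end) fails at the decode step.  As noted above, a check like this is only a starting point: text that is valid UTF-8 can still contain the wrong characters.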

Conclusion

My hope is that these new options will give catalogers a little more control and insight into their records – specifically given how invisible character encoding issues often are.  And maybe too, by shedding light on this most vexing of issues, I can buy myself a little less time in cataloging purgatory as I’m sure there will come a point, somewhere, sometime, where my own contributions to keeping MARC alive and active will be held to account.

These new options will show up in MarcEdit and MarcEdit Mac in versions 7.2.210 (Windows) and 3.2.100 (Mac).

Questions, let me know.

–tr

[1] The fifth circle, illustrated by Stradanus (https://en.wikipedia.org/wiki/Inferno_(Dante)#/media/File:Stradano_Inferno_Canto_08.jpg)

MarcEdit 7.2.200

By reeset / On / In MarcEdit

I’ve worked on a number of updates this weekend.  Here is the list:

UI Changes

I’ve removed the quick links on the front page, and changed this to a list of selectable topics.  This will make it easier for me to add to this list.

I’ve added a new Quick Access button to the top ribbon.  At this point, this isn’t configurable.  Will work to make it configurable later.

These Quick Access items have been added to the Marc Tools window – with the removal of the old quick links as well.

Network Changes

MarcEdit uses .NET 4.7.2.  Internally, the tool has traditionally used the HttpWebRequest class.  Accessing this class directly has been deprecated, with the preferred method shifting to the HttpClient class in the System.Net.Http assembly.  This object is thread-safe and works natively with the System.Threading.Tasks structure.  This also has the benefit of letting me have .NET gracefully support older TLS standards, which isn’t the default.  By default, .NET selects support for the default TLS instance utilized by the operating system and disables older standards.  This is problematic – and these changes will give me more control over which TLS instances are supported and how fallback is handled.  This required updating 9 assemblies.
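As a rough illustration of what controlling the TLS versions can look like with HttpClient, here is a small sketch.  The class and method names (HttpFactory, CreateHandler, CreateClient) and the specific protocol list are assumptions for illustration, not MarcEdit’s actual configuration; newer .NET releases may also warn that the older protocol values are obsolete.

```csharp
using System.Net.Http;
using System.Security.Authentication;

public static class HttpFactory
{
    // Hypothetical helper: builds a handler that explicitly lists the TLS
    // versions the client may negotiate, rather than relying on OS defaults.
    public static HttpClientHandler CreateHandler()
    {
        return new HttpClientHandler
        {
            // Assumed protocol list: TLS 1.2 preferred, TLS 1.1 as fallback.
            SslProtocols = SslProtocols.Tls12 | SslProtocols.Tls11
        };
    }

    // HttpClient is intended to be created once and reused across requests.
    public static HttpClient CreateClient()
    {
        return new HttpClient(CreateHandler());
    }
}
```

A single client created this way can be shared by every part of the application that makes web requests, which keeps the protocol policy in one place.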

MarcEditor Changes

Bug Fix:  When opening .mrc records in the MarcEditor, a memory leak could occur with large files.  I’ve corrected this.

Bug Fix: MarcEdit uses a custom-created control that allows the tool to select the most current version of the Richtext library when showing the MarcEditor.  In .NET 4.7 – there appears to have been a behavior change, in that the names used to register classes in Windows needed to be all upper case.  If they weren’t, then an error would be thrown when mixing the enhanced control and the .NET framework’s default Richtextbox control (which uses the older richtext library).  For example: if internally, the enhanced control used RichEdit5W and then the Richtextbox was used, the program would throw an error.  This wasn’t a problem in MarcEdit, because I only use the enhanced control, but users that create plugins against MarcEdit may experience issues.  The correction is to use uppercase text to normalize class names now used by .NET 4.7+ (Example: RICHEDIT5W).

Z39.50/SRU Changes

Enhancement: Cleaned up some code related to how records display inside the Results Viewer when pulling non-MARC data.

Validate Headings

Behavior Change: Check $a Only with Subjects.  When working with 60x or 610 – this setting doesn’t work like folks might expect.  This is because names often include additional information that must be provided or false variants can be noted.  When working with 60x or 610 data – the program will now include all subfields used when validating the 1xx fields and update data with variants accordingly.  When $a isn’t selected, then the tool will utilize all fields noted as used for validation in the rules file.  This is a behavior change, but likely more in line with the expectations that I’m guessing most folks have when using the $a option.

Behavior Change: When changing variants – it appears that multiple $a’s would be placed.  I’m not sure if there was a change on the source record side or not – so instead, I just updated the code to ensure that the tool validated specific data before making updates.

–tr

Build New Field Changes

By reeset / On / In MarcEdit

** Updated: Official Help page in the KB: https://marcedit.reeset.net/build-new-field

This isn’t going to meet all the use cases I’ve seen – but this should address the most common question that comes up – the ability to have the build new field generate multiple fields.

The process will be based on the presence or absence of a new element in the pattern – a variable marker that MarcEdit uses internally to hold a variable.

Example:

=040  \\$aMiU$cMiU

=040  \\$aBDS$beng$cBDS$dOCLCQ$dABCU

=041  \\$aengrusger

=043  \\$ae-gx---$ae-uk---$an-us---

=090  \\$aTK1005$b(INTERNET) $c[UK.]

Say we have these fields – and the pattern I want to create is a 999 field, and in that field, I want to create a new 999 field for each 040$a – but I would also like to have the 090$a to be a part of the pattern.

The new pattern would look like this:

=999  \\$a{040$a[x]} : {090$a}

This pattern would generate the following results:

=999  \\$aMiU : TK1005

=999  \\$aBDS : TK1005

If I changed the pattern to:

=999  \\$a{040$a} : {090$a}

The program falls back to use the current functionality (only one field is created).

Please note, you cannot ask for a specific 040 to be used (outside of using find/reg functions inside the pattern) – the data inside the [x] isn’t an integer you can set.  It is a value that indicates to MarcEdit that the subfield should be tracked and multiple fields are desired.

The [x] syntax works either after the subfield or after the field number, with data being scoped based on the location of the [x].  Any value other than [x] will likely produce inconsistent results.  The [x] bracket is a reserved element within the field to indicate that multiple field generation is desired, and to tell the program to tokenize the data marked.

Finally – the tool places data in the index range of the new field being generated.  So, consider this example:

=040  \\$aMiU$cMiU

=040  \\$aBDS$beng$cBDS$dOCLCQ$dABCU

=041  \\$aengrusger

=043  \\$ae-gx---$ae-uk---$an-us---

=090  \\$aTK1005$b(INTERNET) $c[UK.]

If I used the following pattern:

=999  \\$a{040$a[x]} : {090$a[x]}

The expected results would be:

=999  \\$aMiU : TK1005

=999  \\$aBDS :

Why?  Because the tool will slot values marked with the multi-field value [x] into the same field groups.  Since only one 090$a exists, the tool only updates the field group to which it belongs.  However, if I had the following data:

=040  \\$aMiU$cMiU

=040  \\$aBDS$beng$cBDS$dOCLCQ$dABCU

=041  \\$aengrusger

=043  \\$ae-gx---$ae-uk---$an-us---

=090  \\$aTK1005$b(INTERNET) $c[UK.]

=090  \\$aG24211$b(INTERNET)

And used this pattern:

=999  \\$a{040$a[x]} : {090$a[x]}

I would expect the following result:

=999  \\$aMiU : TK1005

=999  \\$aBDS : G24211

Again – internally, MarcEdit is creating tokens of data with the [x] and placing them within the same scope.  So, the tool would create new fields, placing data within the same scope onto the new fields.
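The slotting behavior described above can be modeled with a short sketch.  This is a simplified illustration and not MarcEdit’s tokenizer: the names (BuildFieldSketch, Expand), the token regular expression, and the input shape (one value per field occurrence, keyed like "040$a") are all assumptions, and it only handles [x] placed after the subfield.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class BuildFieldSketch
{
    // Assumed token syntax: {TAG$code} or {TAG$code[x]}
    static readonly Regex Token = new Regex(@"\{(\d{3})\$(\w)(\[x\])?\}");

    // values maps a key such as "040$a" to one value per field occurrence,
    // in record order.
    public static List<string> Expand(string pattern, Dictionary<string, List<string>> values)
    {
        List<string> Lookup(Match m) =>
            values.TryGetValue(m.Groups[1].Value + "$" + m.Groups[2].Value, out var found)
                ? found
                : new List<string>();

        // One output field per occurrence of the largest [x] group;
        // a pattern with no [x] marker yields a single field.
        int count = Token.Matches(pattern).Cast<Match>()
            .Where(m => m.Groups[3].Success)
            .Select(m => Lookup(m).Count)
            .DefaultIfEmpty(1)
            .Max();

        var results = new List<string>();
        for (int i = 0; i < count; i++)
        {
            int slot = i;
            string field = Token.Replace(pattern, m =>
            {
                var found = Lookup(m);
                // [x] tokens take the value from the matching slot;
                // plain tokens always take the first occurrence.
                int index = m.Groups[3].Success ? slot : 0;
                return index < found.Count ? found[index] : "";
            });
            results.Add(field.TrimEnd());
        }
        return results;
    }
}
```

Fed the two 040$a values and the single 090$a value from the examples, the pattern with [x] on the 040 only yields two 999 fields that both end in TK1005, while the pattern with [x] on both leaves the second field without a 090 value.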

I started making these changes with the last update – and have finished updating the tokenization algorithms so that the tracking of the data is correct.  I’ll be turning this new option on with the next update – and across both the Windows and Mac version.

Since the presence of the [x] is necessary to turn on the multi-field generation, any existing patterns within tasks shouldn’t be impacted by the changes.  They will work as they had previously.  Only patterns with the new [x] structure will activate the new processing logic.

Summary of recent MarcEdit Changes between Feb. 2020–May 5, 2020

By reeset / On / In MarcEdit

Like a lot of folks, I’ve been working from home and have had some free time to do some extra work on MarcEdit.  Nearly all of these changes (save for the XML Editor) were made in the Mac Version as well, or will be made in the Mac Version by week’s end (specifically, the Bibframe 2 MARC integration).

At this point, I’m adding some more transliterations, updating the bibframe 2 marc tool to make it more performant, and adding XQuery support to the XML Editor for both editing and transformations.

Anyway, here’s the list of changes that have been implemented over the past couple months.

Transliterations:

The Library of Congress provided me with their rules files for their transliteration work.  So, I’ve been working on adding new transliterations to the applications.  So far, this includes:

  1. Latin to Yiddish
  2. Latin to Serbian
  3. Serbian to Latin
  4. Classical Greek to Latin
  5. Latin to Classical Greek
  6. Latin to Belorussian
  7. Belorussian to Latin
  8. Bulgarian to Latin
  9. Latin to Bulgarian
  10. Latin to Russian
  11. Russian to Latin
  12. Latin to Ukrainian
  13. Ukrainian to Latin
  14. Updates to Latin to Arabic
  15. Update to Arabic to Latin

Additionally, I updated the transliteration tool to allow for transliterations to be run over the entire file, as well as new configuration settings to determine which fields/subfields should be included and excluded from the transliteration process.

Installer Changes:

  1. Added pre-check tool that determines if a mismatched version of the application is installed.  This way, you cannot install the User and Administrator version of MarcEdit on the same machine.
  2. Updated a bug/behavior change in Windows 10 1909 2020-04 Cumulative Update that caused registry keys on 64 bit systems to write to the 32-bit hive.
  3. Added an Updated Chinese language file for the MarcEdit UI


Format Translations:

  1. Integrated the Bibframe 2 MARC translation released by the US Library of Congress.  Additionally, enhanced the tool so that it can be run over a file with multiple works and instances, rather than a single work/instance pair.
  2. Added the JSON 2 MARC, MARC 2 JSON, and XML to JSON processing functions to the batch records processing tool


New and Updated Tools:

  1. Added an XML Editor to MarcEdit.  This is a light-weight XML editor that supports find/replace, as well as XSL transformation testing.
  2. Updated the MARCCompare application template to provide options to just show changed records.
  3. Updated the ILS Integration tooling with a new UI to make it easier to add new integrations, and provide templates for known ILS Integration patterns.
  4. Updated a large number of dependency files related to Saxon and the Linked Data framework in MarcEdit.  These changes introduced a bug in the Clustering Tool, which was later fixed.
  5. Added an Application Error Log to make debugging specific issues easier.
  6. Updated the DeDuplication Records Tool to allow users running the tool outside of the MarcEditor to run the tool on a single file.
  7. Updated the Classify Tool to allow call numbers to be added to any field.  Previously, there was a rule that limited call numbers to fields less than 100.
  8. Updated the MarcEdit Command-Line tool to make the silent function a bit more silent.  There were a few instances where the terminal, regardless of if the silent option was set, would output feedback.
  9. I added a new troubleshooting tool on the Main Window that will now guide users through the importing of settings data from previous versions of MarcEdit (had a user not imported the data on update)


MarcEditor Changes:

  1. Fixed a bug in the Conditional Replace function that was causing regular expressions to be interpreted as simple in-string searches when using the AND/OR conditionals.
  2. Added the ability to show line numbers in the MarcEditor.
  3. Returned the ability to have MarcEdit highlight the active line.
  4. Added a new Edit Shortcut that allows users to add a generic LDR field to any records missing one.
  5. Updated the Task Debugger UI
  6. Added the Task Debugger to the MacOS version

MarcEdit 7.2.160 Updates

By reeset / On / In Uncategorized

There is one change that I want to highlight in the new update, and it relates to the installer.  I noticed that with the Windows 10 2020-04 cumulative update, registry reflection (the process of moving registry keys into the 32-bit hive) has affected the MarcEdit installer.  This directly impacts the application’s ability to determine which type of installer the program should download when doing automated updates.

To fix this – I’ve added a check to the application that will see if there is a type mismatch between the installer downloaded and the version of MarcEdit currently installed.  This new check will prompt users to let them know that the mismatch exists and provide an option to uninstall the existing system or to stop the installation.

I’ve recorded an explanation of exactly what is happening here:

If you have trouble, please let me know.

–tr