Ted Lawless
Work notebook

Datasette hosting costs

I've been hosting a Datasette (https://baseballdb.lawlesst.net, aka baseballdb) of historical baseball data for a few years, and for the last year or so it has been hosted on Google Cloud Run. I thought I would share my hosting costs for 2020 as a point of reference for others who might be interested in running a Datasette but aren't sure how much it may cost.

The total hosting cost on Google Cloud Run for 2020 for the baseballdb was $51.31, or a monthly average of about $4.28 USD. The monthly bill varied a fair amount, from as high as $13 in May to as low as $2 in March. Since I did no deployments or updates to the site during this time, I assume the variation in cost is related to the number of queries the Datasette was serving. I don't have a good sense of how many total queries per month this instance serves since I'm not using Google Analytics or similar. Google does report that it subtracted $49.28 in credits for the year, but I don't expect those credits/promotions to expire anytime soon since my projected cost for 2021 is $59.

This cost information is somewhat incomplete without knowing the number of queries served per month, but it is a benchmark.

Connecting Python's RDFLib to AWS Neptune

I've written previously about using Python's RDFLib to connect to various triple stores. For a current project, I'm using Amazon Neptune as a triple store, and the RDFLib SPARQLStore implementation did not work out of the box. I thought I would share my solution.

The problem

Neptune returns ntriples by default, and RDFLib, by default in version 4.2.2, expects CONSTRUCT queries to return RDF/XML. The solution is to override RDFLib's SPARQLStore to explicitly request RDF/XML from Neptune via HTTP content negotiation. Once this is in place, you can query and update Neptune via SPARQL with RDFLib the same way that you would other triple stores.

Code

If you are interested in working with Neptune using RDFLib, here's a "NeptuneStore" and "NeptuneUpdateStore" implementation that you can use.
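As a rough sketch of the content-negotiation idea behind those classes (not the NeptuneStore code itself), the example below issues a CONSTRUCT query with an explicit Accept header for RDF/XML and parses the response with RDFLib; the endpoint URL and query are placeholders.

```python
# Sketch of requesting RDF/XML from a SPARQL endpoint via content negotiation.
# The endpoint URL and query are placeholders, not the actual NeptuneStore code.
import requests
from rdflib import Graph

NEPTUNE_SPARQL = "https://your-neptune-endpoint:8182/sparql"  # placeholder

query = """
CONSTRUCT { ?s ?p ?o }
WHERE { ?s ?p ?o }
LIMIT 10
"""

# Ask explicitly for RDF/XML rather than the endpoint's default serialization.
response = requests.post(
    NEPTUNE_SPARQL,
    data={"query": query},
    headers={"Accept": "application/rdf+xml"},
)
response.raise_for_status()

# RDFLib can now parse the CONSTRUCT result as RDF/XML.
graph = Graph()
graph.parse(data=response.text, format="xml")
print(len(graph))
```

Sending that same Accept header from an overridden SPARQLStore is what lets RDFLib treat Neptune like any other triple store.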
Usable sample researcher profile data

I've published a small set of web harvesting scripts to fetch information about researchers and their activities from the NIH Intramural Research Program website.

On various projects I've been involved with, it has been difficult to acquire usable sample, or test, data about researchers and their activities. You either need access to an HR system and a research information system (for the activities) or you have to create mock data. Mock, or fake, data doesn't work well when you want to start integrating information across systems or developing tools to find new publications. It's hard to build a publication harvesting tool without real author names and research interests.

To that end, the scripts I've published crawl the NIH Intramural Research Program website and pull out profile information for the thousand or so researchers who are members of the program, including a name, email, photo, short biography, research interests, and the Pubmed IDs for selected publications. A second script harvests the organizational structure of the program. Both types of data are output to a simple JSON structure that can then be mapped to your destination system.

Exploring 10 years of the New Yorker Fiction Podcast with Wikidata

Note: The online Datasette that supported the sample queries below is no longer available. The raw data is at: https://github.com/lawlesst/new-yorker-fiction-podcast-data.

The New Yorker Fiction Podcast recently celebrated its ten year anniversary. For those of you not familiar, this is a monthly podcast hosted by New Yorker fiction editor Deborah Treisman where a writer who has published a short story in the New Yorker selects a favorite story from the magazine's archive and reads and discusses it on the podcast with Treisman.1 I've been a regular listener to the podcast since it started in 2007 and thought it would be fun to look a little deeper at who has been invited to read and which authors they selected to read and discuss.

The New Yorker posts all episodes of the Fiction Podcast on their website in nice, clean, browsable HTML pages. I wrote a Python script to step through the pages and pull out the basic details about each episode: title, url, summary, date published, writer, reader. The reader and the writer for each story are embedded in the title, so a bit of text processing was required to cleanly identify each reader and writer. I also had to manually reconcile a few episodes that didn't follow the same pattern as the others. All code used here and the harvested data are available on Github.

Matching to Wikidata

I then took each of the writers and readers and matched them to Wikidata using the searchentities API. With the Wikidata ID, I'm able to retrieve many attributes of each reader and writer by querying the Wikidata SPARQL endpoint, such as gender, date of birth, awards received, Library of Congress identifier, etc.

Publishing with Datasette

I saved this harvested data to two CSV files - episodes.csv and people.csv - and then built a sqlite database to publish with Datasette using the built-in integration with Zeit Now.

Now Publishing Complete Lahman Baseball Database with Datasette

Summary: The Datasette API available at https://baseballdb.lawlesst.net now contains the full Lahman Baseball Database.

In a previous post, I described how I'm using Datasette to publish a subset of the Lahman Baseball Database. At that time, I only published three of the 27 tables available in the database. I've since expanded that Datasette API to include the complete Baseball Database.

The process for this was quite straightforward. I ran the MySQL dump Lahman helpfully provides through this mysql2sqlite tool to produce an import file for sqlite. Importing into sqlite for publishing with Datasette was as simple as:

$ ./mysql2sqlite lahman2016.sql | sqlite3 baseball.db

The complete sqlite version of the Lahman database is 31 megabytes.

Querying

With the full database now loaded, there are many more interesting queries that can be run.

Publishing the Lahman Baseball Database with Datasette

Summary: publishing the Lahman Baseball Database with a Datasette API available at https://baseballdb.lawlesst.net.

For those of us interested in open data, an exciting new tool was released this month. It's by Simon Willison and called Datasette. Datasette allows you to very quickly convert CSV files to a sqlite database and publish it on the web with an API. Head over to Simon's site for more details.

SPARQL to Pandas Dataframes

Update: See this Python module for converting SPARQL query results into Pandas dataframes.

Using Pandas to explore data

Pandas is a Python-based power tool for munging and analyzing data. While working with data from SPARQL endpoints, you may prefer to explore and analyze it with pandas given its full feature set, strong documentation and large community of users.
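As a minimal sketch of that pattern, assuming the SPARQLWrapper package and a placeholder Wikidata query (the notebook below is the fuller example), the idea is to keep only the plain values from the SPARQL JSON results and let pandas handle datatypes:

```python
# Sketch: load SPARQL SELECT results into a pandas dataframe.
# Assumes the SPARQLWrapper package; the query is only a sample.
import pandas as pd
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = "https://query.wikidata.org/sparql"
query = """
SELECT ?item ?itemLabel ?birth WHERE {
  ?item wdt:P39 wd:Q11696 .   # position held: President of the United States
  ?item wdt:P569 ?birth .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

sparql = SPARQLWrapper(endpoint, agent="sparql-to-pandas-example")
sparql.setQuery(query)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

# Keep only the plain values, dropping the type/datatype wrappers
# that the SPARQL JSON results format returns for each binding.
rows = [
    {var: binding[var]["value"] for var in binding}
    for binding in results["results"]["bindings"]
]
df = pd.DataFrame(rows)

# Datatypes are then handled on the pandas side.
df["birth"] = pd.to_datetime(df["birth"])
print(df.head())
```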
The code below is an example of issuing a query to the Wikidata SPARQL endpoint, loading the data into a pandas dataframe, and running basic operations on the returned data. This is a modified version of code from Su Labs. Here we remove the types returned by the SPARQL endpoint since they add noise, and we prefer to handle datatypes with Pandas.

{% notebook sparql_dataframe.ipynb %}

With a few lines of code, we can connect data stored in SPARQL endpoints with pandas, the powerful Python data munging and analysis library. See the Su Labs tutorial for more examples. You can also download the examples from this post as a Jupyter notebook.

Querying Wikidata to Identify Globally Famous Baseball Players

Earlier this year I had the pleasure of attending a lecture by Cesar Hidalgo of MIT's Media Lab. One of the projects Hidalgo discussed was Pantheon. Pantheon is a website and dataset that ranks "globally famous individuals" based on a metric the team created called the Historical Popularity Index (HPI). A key component of HPI is the number of Wikipedia pages an individual has in various languages. For a complete description of the project, see: Yu, A

Python ETL and JSON-LD

I've written an extension to petl, a Python ETL library, that applies JSON-LD contexts to data tables for transformation into RDF.

The problem

Converting existing data to RDF, such as for VIVO, often involves taking tabular data exported from a system of record, transforming or augmenting it in some way, and then mapping it to RDF for ingest into the platform. The W3C maintains an extensive list of tools designed to map tabular data to RDF. General purpose CSV to RDF tools, however, almost always require some advanced preparation or cleaning of the data. This means that developers and data wranglers often have to write custom code. This code can quickly become verbose and difficult to maintain. Using an ETL toolkit can help with this.

ETL with Python

One such ETL tool that I'm having good results with is petl, Python ETL.

OrgRef data as RDF

Summary: Notes on mapping OrgRef to DBPedia and publishing with Linked Data Fragments.

This past fall, Data Salon, a UK-based data services company, released an open dataset about academic and research organizations called OrgRef. The data is available as a CSV and contains basic information about over 30,000 organizations. OrgRef was created with publishers in mind, and so its main focus is on institutions involved with academic content: universities, colleges, schools, hospitals, government agencies and companies involved in research.

This announcement caught our attention at my place of work because we are compiling information about educational organizations in multiple systems, including a VIVO instance, and are looking for manageable ways to consume Linked Data that will enrich or augment our local systems. Since the OrgRef data has been curated and focuses on a useful subset of data that we are interested in, it seemed to be a good candidate for investigation, even though it isn't published as RDF. Due to its size, it is also easier to work with than attempting to consume and process something like VIAF or DBPedia itself.

Process

We downloaded the OrgRef CSV dataset and used the ever-helpful csvkit tool to get a handle on what data elements exist.

$ csvstat --unique orgref.csv
1. Name: 31149
2
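A rough Python equivalent of that csvstat check, as a sketch assuming the downloaded file is named orgref.csv, is to count distinct values per column:

```python
# Sketch: count unique values per column of the OrgRef CSV,
# roughly equivalent to `csvstat --unique orgref.csv`.
import csv

with open("orgref.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    unique_values = {name: set() for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            unique_values[name].add(value)

for name, values in unique_values.items():
    print(f"{name}: {len(values)} unique values")
```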