Let’s visualize some HAMLET data! Or, d3 and t-SNE for the lols. – andromeda yelton


Andromeda
November 20, 2020

In 2017, I trained a neural net on ~44K graduate theses using the Doc2Vec algorithm, in hopes that doing so would provide a backend that could support novel and delightful discovery mechanisms for unique library content. The result, HAMLET, worked better than I hoped; it not only pulls together related works from different departments (thus enabling discovery that can’t be supported with existing metadata), but it does a spirited job on documents whose topics are poorly represented in my initial data set (e.g. when given a fiction sample it finds theses from programs like media studies, even though there are few humanities theses in the data set).


That said, there are a bunch of exploratory tools I’ve had in my head ever since 2017 that I’ve not gotten around to implementing. But here, in the spirit of tossing out things that don’t bring me joy (like 2020) and keeping those that do, I’m gonna make some data viz!


There are only two challenges with this:


	By default Doc2Vec embeds content in a 100-dimensional space, which is kind of hard to visualize. I need to project that down to 2 or 3 dimensions. I don’t actually know anything about dimensionality reduction techniques, other than that they exist.
	I also don’t know JavaScript much beyond a copy-paste level. I definitely don’t know d3, or indeed the pros and cons of various visualization libraries. Also art. Or, like, all that stuff in Tufte’s book, which I bounced off of.


(But aside from that, Mr. Lincoln, how was the play?)


I decided I should start with the pages that display the theses most similar to a given thesis (shout-out to Jeremy Brown, startup founder par excellence) rather than with my ideas for visualizing the whole collection, because I’ll only need to plot ten or so points instead of 44K. This will make it easier for me to tell visually if I’m on the right track and should let me skip dealing with performance issues for now. On the down side, it means I may need to throw out any code I write at this stage when I’m working on the next one. 🤷‍♀️


And I now have a visualization on localhost! Which you can’t see because I don’t trust it yet. But here are the problems I’ve solved thus far:


	It’s hard to copy-paste d3 examples on the internet. d3’s been around long enough that there’s substantial content written for different versions, so you have to double-check which version an example targets. But also most of the examples are live code notebooks on Observable, which is a wicked cool service but not the same environment as a web page! If you just copy-paste from there you will have things that don’t work due to invisible environment differences and then you will be sad. 😢 I got tipped off to this by Mollie Marie Pettit’s great Your First d3 Scatterplot notebook, which both names the phenomenon and provides two versions of the code (the live-editable version and the one you can actually copy/paste into your editor).
	If you start googling for dimensionality reduction techniques you will mostly find people saying “use t-SNE”, but t-SNE is a lying liar who lies. Mind you, it’s what I’m using right now because it’s so well-documented it was the easiest thing to set up. (This is why I said above that I don’t trust my viz.) But it produces different results for the same data on different pageloads (obviously different, so no one looking at the page will trust it either), and it’s not doing a good job preserving the distances I care about. (I accept that anything projecting from 100d down to 2d will need to distort distances, but I want to adequately preserve meaning — I want the visualization to not just look pretty but to give people an intellectually honest insight into the data — and I’m not there yet.) 
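
One rough way to put a number on "is this layout preserving the distances I care about" is to check how many of each point's nearest neighbors in the original space are still its nearest neighbors in the 2-d layout. The sketch below is an illustration, not HAMLET's code: the function names are made up, and it uses plain Euclidean distance, which is an assumption (a Doc2Vec setup may well rank similarity by cosine instead).

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbors(points, index, k):
    """Indices of the k points closest to points[index], excluding itself."""
    others = [i for i in range(len(points)) if i != index]
    others.sort(key=lambda i: euclidean(points[index], points[i]))
    return set(others[:k])


def neighbor_preservation(high_dim, low_dim, k=3):
    """Average fraction of each point's k nearest neighbors kept by the projection.

    1.0 means every point has the same k neighbors in the 2-d layout as in
    the original space; values near 0 mean the neighborhoods were scrambled.
    """
    if len(high_dim) != len(low_dim):
        raise ValueError("need one projected point per original vector")
    total = 0.0
    for i in range(len(high_dim)):
        before = nearest_neighbors(high_dim, i, k)
        after = nearest_neighbors(low_dim, i, k)
        total += len(before & after) / k
    return total / len(high_dim)
```

Running this against the output of each candidate technique on the same ten-ish theses gives a crude but comparable score, which is easier to argue with than eyeballing a scatterplot.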


Conveniently this is not my first time at the software engineering rodeo, so I encapsulated my dimensionality reduction strategy inside a function, and I can swap it out for whatever I like without needing to rewrite the d3 as long as I return the same data structure.
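
For the curious, here is a minimal sketch of what that kind of seam can look like. Everything in it is hypothetical: the name `reduce_dimensions`, the list-of-`{"x", "y"}`-dicts return shape, and the `layout_for_plot` helper are illustrative guesses, not the actual HAMLET code. The stand-in strategy is a seeded random linear projection, chosen only because it fits in a few lines of standard-library Python and gives the same answer every time; it is not t-SNE, UMAP, or PCA.

```python
import random


def reduce_dimensions(vectors, seed=0):
    """Project equal-length vectors down to 2-d points.

    Stand-in strategy (an assumption for illustration): a random linear
    projection. The fixed seed means the same input always produces the
    same layout, so the picture does not change between page loads.
    """
    if not vectors:
        return []
    rng = random.Random(seed)
    dims = len(vectors[0])
    # Two random axes in the original space; each point is projected onto both.
    axes = [[rng.gauss(0, 1) for _ in range(dims)] for _ in range(2)]
    points = []
    for vector in vectors:
        x, y = (sum(a * v for a, v in zip(axis, vector)) for axis in axes)
        points.append({"x": x, "y": y})
    return points


def layout_for_plot(labels, vectors, reducer=reduce_dimensions):
    """Attach a label to each projected point; the reducer is swappable."""
    points = reducer(vectors)
    return [dict(point, label=label) for point, label in zip(points, labels)]
```

As long as any replacement reducer takes a list of vectors and returns the same list-of-dicts shape, the plotting code downstream never has to know which technique produced the coordinates.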


So that’s my next goal: try out UMAP (hat tip to Matt Miller for suggesting that to me), try out PCA, fiddle with some parameters, try feeding it just the data I want to visualize vs larger neighborhoods, see if I’m happier with what I get. UMAP in particular alleges itself to be fast with large data sets, so if I can get it working here I should be able to leverage that knowledge for my ideas for visualizing the whole thing.
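
Of those candidates, PCA is the easiest to sketch without any libraries, so here is a rough dependency-free version for a sense of what it does: center the data, find the direction of greatest variance by power iteration, remove it, and repeat once for the second axis. The function name `pca_2d`, the iteration count, and the seeded starting vector are all assumptions made for this illustration. In practice you would almost certainly reach for a library implementation rather than this.

```python
import math
import random


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _top_component(rows, dims, iterations, rng):
    """Power iteration for the leading eigenvector of rows^T rows."""
    w = [rng.gauss(0, 1) for _ in range(dims)]
    norm = math.sqrt(_dot(w, w))
    w = [x / norm for x in w]
    for _ in range(iterations):
        scores = [_dot(row, w) for row in rows]
        w_new = [sum(s * row[d] for s, row in zip(scores, rows)) for d in range(dims)]
        norm = math.sqrt(_dot(w_new, w_new))
        if norm == 0:
            break  # no variance left along any direction
        w = [x / norm for x in w_new]
    return w


def pca_2d(vectors, iterations=200, seed=0):
    """Project vectors onto their top two principal components.

    Deterministic for a given seed, unlike an unseeded t-SNE run.
    """
    rng = random.Random(seed)
    n, dims = len(vectors), len(vectors[0])
    means = [sum(v[d] for v in vectors) / n for d in range(dims)]
    centered = [[v[d] - means[d] for d in range(dims)] for v in vectors]

    first = _top_component(centered, dims, iterations, rng)
    xs = [_dot(row, first) for row in centered]

    # Deflate: subtract the part of each row explained by the first axis,
    # so the next power iteration finds the second-largest direction.
    deflated = [[row[d] - x * first[d] for d in range(dims)]
                for row, x in zip(centered, xs)]
    second = _top_component(deflated, dims, iterations, rng)
    ys = [_dot(row, second) for row in deflated]

    return list(zip(xs, ys))
```

Because PCA is a linear projection, it cannot untangle curved structure the way t-SNE or UMAP try to, but the same input always yields the same picture, which makes it a useful baseline to compare the fancier techniques against.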


Onward, upward, et cetera. 🎉


			Tagged
	fridAI
	hamlet

	