Shining A Blacklight on Scandal

It can be fascinating to see how technology, data and world events intersect. Last week, the news broke about “The Panama Papers”: almost three terabytes of data (11.5 million documents) lifted from a law firm that specializes in setting up offshore accounts, helping clients hide money in other jurisdictions and dodge paying their fair share of taxes. The results of this exposure were shocking. Iceland’s Prime Minister stepped down. Other world leaders have had a light shone on their financial shuffling. The full impact of these leaks is still playing out. With the story exposed, theories are emerging that as much as 32 trillion dollars is hiding in safe, untaxable and untouchable bank accounts.

A terabyte of data is a daunting amount of information. Between the books, articles, computer code, audio, video and images I have generated over my life, I have about 2 terabytes of data under my belt. Thanks to technology, it can fit on a portable drive, but it’s too much for any small team to dissect, digest and turn into something usable. Were my 2TB of data all images, that would amount to 40,000,000 pictures to pore over. Were my stash solely video, it would fill roughly 500 DVDs (2,000+ hours of footage).

The Panama Papers represented a wealth of scanned documents, emails and digital transactions, far too big to digest through manual means, and the raw material wasn’t even searchable. The team with access to the data first ran the documents through OCR to turn them into searchable text. Next, they turned to an Open Source application called Blacklight that enabled participating journalists worldwide to query the collection. For example, Marina Walker Guevara said that journalists could upload a list of the names of all the members of the House of Lords and match it against the data.
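The name-matching idea described above can be sketched in a few lines of Ruby. This is a minimal illustration, not the journalists’ actual pipeline: the real project indexed millions of OCR’d files in a search engine, and every name and document below is an invented placeholder.

```ruby
# Check a watch list of names against a pile of OCR'd document text.
# Returns a hash mapping each matched name to the IDs of documents
# that mention it. (Illustrative sketch only; placeholder data.)
def match_names(names, documents)
  names.each_with_object({}) do |name, hits|
    matches = documents.select { |_id, text| text.downcase.include?(name.downcase) }.keys
    hits[name] = matches unless matches.empty?
  end
end

# Hypothetical OCR output keyed by document ID.
documents = {
  "doc-001" => "Transfer authorized by Jane Placeholder, director.",
  "doc-002" => "Routine correspondence, no names of interest.",
  "doc-003" => "Shell entity registered for JANE PLACEHOLDER in 2007."
}

watch_list = ["Jane Placeholder", "John Example"]
p match_names(watch_list, documents)
```

At Panama Papers scale, a simple substring scan like this would be far too slow, which is why the team needed a full-text index behind a tool like Blacklight rather than brute-force matching.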

Blacklight is a Ruby on Rails application that ties into Solr. More information about the project and its extensions is available here:
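To make the Rails-to-Solr connection concrete, here is a hedged sketch of the kind of search request a Blacklight application ultimately sends to Solr. The host, core name (`documents`) and field names (`title_t`, `full_text_t`) are invented placeholders; `q`, `defType`, `qf`, `rows` and `wt` are standard Solr query parameters, and `edismax` is a query parser Blacklight is commonly configured to use.

```ruby
require "uri"

# Build a Solr select URL of the sort a Blacklight app issues under the hood.
# All names here are assumptions for illustration, not a real deployment.
params = {
  q: "panama",                # user's search terms
  defType: "edismax",         # extended DisMax query parser
  qf: "title_t full_text_t",  # fields to search (hypothetical OCR text field)
  rows: 10,                   # page size
  wt: "json"                  # response format
}
url = "http://localhost:8983/solr/documents/select?" + URI.encode_www_form(params)
puts url
```

Blacklight’s job is to turn a librarian-friendly configuration (facets, display fields, search fields) into requests like this and render the JSON response as a browsable catalog.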

What does this have to do with digital humanities?

Project Blacklight followed a lineage of applications striving for better text search of library collections, or, as developers including Bethany Nowviskie put it, “Adapting an Open-Source Scholarly Web 2.0 System for Findability in Library Collections (or: “Frankly, Vendors, We Don’t Give a Damn.”)” It was first adopted by the University of Virginia in 2008, followed by others:
  • 2009 - Stanford, Agriculture Network, Northwest Digital Archive
  • 2010 - NCSU, WGBH Open Vault, Wisconsin-Madison
  • 2011 - Alice Law, Clermont, Columbia, Johns Hopkins, NYPL, Penn State, Rock’n’Roll Hall of Fame, Tufts, U.S. Holocaust Museum, Hull, World Maritime Univ.
The ability to pull relevance rankings and relevant passages from large texts is key work in the field of digital humanities. Building a better mousetrap lets researchers do their work more effectively, and Project Blacklight is one of those tools. The Panama Papers demonstrate that a tool like Project Blacklight has a critical role in the world as we become a more data-centric civilization. Digital humanities is about using technology to find deeper meaning in humanist works. While usually focused on art and literature, there is an argument for its utility in less creative human works (like legal contracts and bank documents). The DH pursuit of finding deeper meaning in documents does not have to stop at the doorway of DH labs: it can have worldwide and far-reaching impact.

More reading

“How OPACs Suck, Part 1: Relevance Rank (Or the Lack of It) | ALA TechSource.” Accessed November 1, 2015.
“How OPACs Suck, Part 2: The Checklist of Shame | ALA TechSource.” Accessed November 1, 2015.
“How OPACs Suck, Part 3: The Big Picture | ALA TechSource.” Accessed November 2, 2015.
“Hydra for CNI Spring 2014 Meeting.” Accessed November 1, 2015.
“BSTF Final Report.” Accessed November 1, 2015.