Preserving Our Digital Heritage: Harnessing Open Data for Historical Archives

Some years ago, I attended FOSDEM, an annual open source software conference hosted in Brussels. Initially, I followed a lecture about linters, an interesting but challenging topic. After 45 minutes, with a lot of notes and a mild headache, I decided to take it easy and join some friends in the history track of the conference. The topic was digital archaeology. I was unsure of what to expect, but the lecture ended up being one of the most fascinating talks I have ever attended at a conference.

In a few years, it will be impossible to study the history of retro-computers, since their software was written in now-obsolete languages. The hardware it ran on is dying, and the cassette tapes holding the software are physically disintegrating. Furthermore, the production lines for the tapes and the hardware are long gone; a physical reconstruction would be implausible or at least astronomically expensive, and emulators have their limitations.

This realisation blew my mind. In an age when you can lose your phone and still access all your data when you set up the next one, an illusion is created: that digital objects are indestructible. I remember we talked about it a lot that night, and a cautionary tale about the illusory resilience of our data lodged itself in my mind.

During the early stages of my PhD studies, I came across an academic paper discussing the preservation of data journalism records. As tech companies close or stop supporting products, data journalism, and especially an integral part of it, visualisation, is exceptionally susceptible to deprecation. If you lose the visualisation of an article, you lose some information. But as long as you have the data set, you can still reproduce it somehow; it will take time and effort, but it is doable. If you lose the data set, things are bad. You have completely lost the data part and only have the old-fashioned text, which means you have lost half of the article, since all that remains is the commentary about the infographics and the data. The paper stops at that point, since its concern is the impact of technological change on data journalism.

Since journalism articles are a rough draft of history and more and more media companies are moving from print to digital editions, we must first identify the risks for our historical archives. What will happen if a digital media agency has to shut down? What about their digital archives and collected data sets?

I can think of three scenarios.

1 – Someone will buy it and keep the operation going or, in the worst case, will incorporate the data into their own digital media operation. In that case, some data may be lost, but most of the publications will likely remain available.

2 – Someone will buy it, but will not invest money and effort in it. Gradually, therefore, things will start to malfunction. Missing pictures, unsupported libraries and broken links will become commonplace after some time.

3 – It just shuts down. No one cares, and all the digital archives stored on the server will most likely be deleted to make space for the next paying customer.

I have to mention at this point that some initiatives for the preservation of digital archives already exist, like https://web.archive.org. Although old websites are available there in a basic form, they are hard to search, their multimedia components are often unsupported, and support for anything that has to do with infographics and data sets is almost nonexistent. See, for example, what happens if you try to access The Guardian’s data sets through the web.archive.org website.

Now that we have identified the risks, we can think about possible solutions. As we know, open data can be considered data that are free of charge and available for reuse (as long as you mention the source), a description that journalistic articles fit. The problem is that, in their simplest form, articles are plain text without any indicators of the information they contain. Therefore, they are not available in a machine-readable format; however, that is something we can work on.

We could create an open data set out of digital journalistic publications. This can be achieved by scraping the articles from digital media organisations. By applying machine learning algorithms, we can extract the main topics of the texts, along with other important information such as the publication date and the place, and store them as metadata. With this endeavour, we can create a digital library whose data collection can be fully automated and which preserves the journalistic data for future researchers. Searching the data set by tags will significantly decrease the effort required of history researchers in the future.
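
The metadata step described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a finished pipeline: downloading the pages is left out, the publication date is assumed to sit in an article:published_time meta tag (many sites use one, but not all), and a plain word-frequency count stands in for the machine learning topic extraction. The names ArticleParser, extract_record, STOP_WORDS and the sample page are hypothetical.

```python
import json
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical, deliberately tiny stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "and", "for", "that", "with", "are", "was", "has", "have", "this"}


class ArticleParser(HTMLParser):
    """Collects the title, meta tags and paragraph text of one HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.paragraphs = []
        self._current = None  # "title" or "p" while we are inside that tag

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            key = attrs.get("property") or attrs.get("name")
            if key and attrs.get("content"):
                self.meta[key] = attrs["content"]
        elif tag in ("title", "p"):
            self._current = tag
            if tag == "p":
                self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "p":
            self.paragraphs[-1] += data


def extract_record(html, url, top_n=5):
    """Turn one article page into a machine-readable record with simple tags."""
    parser = ArticleParser()
    parser.feed(html)
    text = " ".join(parser.paragraphs)
    # Word frequency is a stand-in for real topic extraction (an assumption).
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if len(w) > 2 and w not in STOP_WORDS]
    tags = [word for word, _ in Counter(words).most_common(top_n)]
    return {
        "url": url,
        "title": parser.title.strip(),
        "published": parser.meta.get("article:published_time"),
        "tags": tags,
        "text": text,
    }


# Hypothetical sample page, used only to show the shape of the output.
SAMPLE = """<html><head><title>Flood data for 2020</title>
<meta property="article:published_time" content="2020-03-01T09:00:00Z">
</head><body><p>Flood levels rose. Flood defences failed.</p>
<p>The flood data is open.</p></body></html>"""

record = extract_record(SAMPLE, "https://example.org/flood")
print(json.dumps(record, indent=2))
```

Storing each record as JSON keeps the text together with its tags and date, so the archive can be searched without depending on the original website's layout.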

This is the first step, and it can be implemented with relative ease, but it covers only the cases of traditional digital journalism. We also have to address the instances of data journalism where interactive infographics are used and external data sets have to be incorporated. First, we have to secure the data sets used in the articles. A possible solution could be to process the data sets into a common format, which will require advanced data interoperability techniques and tools. Afterwards, we must adopt or create, and then maintain, a library that can provide the functionality needed to render the infographics.
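
As a rough idea of what securing a data set could look like, the sketch below bundles a CSV file with a checksum, a list of its columns and a link back to the article that used it. It is an assumption about one possible approach, not an established preservation format: the function name package_dataset, the field names and the sample data are all hypothetical, and real data sets would need far more careful handling of types, encodings and licences.

```python
import csv
import hashlib
import io
import json


def package_dataset(csv_text, article_url):
    """Bundle a CSV data set with a checksum and a basic column description."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    return {
        "article_url": article_url,  # which article the data set belongs to
        # The checksum lets a future reader verify the data was not altered.
        "sha256": hashlib.sha256(csv_text.encode("utf-8")).hexdigest(),
        "columns": list(reader.fieldnames or []),
        "row_count": len(rows),
        "rows": rows,
    }


# Hypothetical sample data, used only to show the shape of the output.
SAMPLE_CSV = "year,floods\n2019,12\n2020,17\n"

package = package_dataset(SAMPLE_CSV, "https://example.org/flood")
print(json.dumps(package, indent=2))
```

Because the package is plain JSON, it can be stored next to the article record and read back decades later without the charting library that originally drew the infographic.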

Up to now, many historians have conducted their research using old newspapers that have been maintained in public or personal archives and libraries. Some of them have been digitised, but most, especially the ones in private hands, have not.
We are experiencing a decrease in printed media and an increase in digital ones. Unfortunately, this, combined with the continuous progress of technology and the lack of maintenance for outdated software, poses a significant risk to the preservation of our historical archives. When we also consider that digital media organisations can close, and that we don’t possess any way to archive the data sets and the infographics effectively, it becomes clear that as we progress technologically, we will lose more and more data from our history. Still, proper planning and research can turn that to our collective advantage, as we can embrace new open-data-related technologies to store and catalogue our data effectively. That way, we can prevent the loss of digital archives and data journalism articles, along with their data sets, which could prove to be of immense importance when future historians study them to “decode” our times.

Authors: