“I call it our digital Dunkirk. We’re literally getting anything that will float out there.”
Dena Strong, Senior Information Design Specialist at Technology Services
Written by Dana Mancuso
Beyond the personal and physical devastation and loss taking place in Ukraine, Internet data is disappearing at an unimaginable scale. Ukrainian websites are under attack, and the Internet is no longer a safe place for storage. Groups and institutions hope to quickly preserve their digital collections, images, archives, data, and more.
A collective to save collections
An ad hoc group of international volunteers has taken on the challenge of preserving Ukrainian digital heritage using their high-tech skills. The Ukrainian digital heritage preservation project, Saving Ukrainian Cultural Heritage Online (SUCHO, https://www.sucho.org/), founded by iSchool ’09 alum Quinn Dombrowski of Stanford University, Anna Kijas of Tufts University, and Sebastian Majstorovic of Vienna, has spent recent weeks working across continents, time zones, and language divides to develop methods of taking in and storing data. The aim is to capture data and images from Ukrainian cultural heritage institutions “before the Russian bombs can take their ISPs offline,” according to Dena Strong, a ’14 iSchool graduate and Senior Information Design Specialist at Technology Services.
Strong has taken personal time off to help the effort before more records disappear.
She and a core group are devising logistics and workflows and lining up large-scale storage for high-speed archiving systems so that individuals can upload data.
Her skill set, which she said is “at the intersection of theater improvisation and workflow management,” meant she was able to quickly integrate into what she termed a “digital librarian army” in a matter of days.
The work is moving at breakneck speed, explained Strong.
“I call it our digital Dunkirk. We’re literally getting anything that will float out there into the digital ocean, ranging from mobile phone coracles to HPC cluster superyachts, and bringing back boatloads of data any way we can. Somebody has even tried to press their 3-year-old’s pastel-covered toddler computer into service. It’s amazing!”
The week of March 7, the information collected for the project made up 15% of the entire intake volume of the Internet Archive. That number has increased since then with the addition of so many volunteers and help from data storage companies. “The Internet Archive was capturing 20,000 URLs per second even before SUCHO ramped up,” she noted.
As Strong and a core team of coders and programmers press on, they are making great strides in short sprints. For example, after four hours on a recent Saturday, Strong and colleagues from New York and Finland, neither of whom Strong had met, had put together a way for volunteer data capturers to capture DSpace repositories [1] without a specific programming background.
“1,200 plus volunteers now don’t need to speak Python to capture these hundreds of DSpace repository archives,” Strong said.
Volunteers are self-sorting by expertise and interest into subgroups—communicating mostly through 15 unique chat channels.
Strong said her personal efforts focus on creating, refining, and documenting processes so that the work can continue regardless of who is volunteering.
How they save the data
Anyone can submit a URL through the SUCHO intake form. The address is then sorted into a workflow that takes it down one of a few branching pathways. The websites are crawled, or “scraped,” either with specialized software or manually by a volunteer, depending on the nature of the digital resources on the site.
The automatic scraping is moving quickly and getting great results, according to Strong. “We have site archives of as many websites as Browsertrix software is able to crawl. Ilya Kreymer, the man who built it, is in the chat with us, and he is releasing new versions as we go,” Strong remarked.
“We try to automate the capturing when we can, but sometimes host mirroring gets in the way or sometimes there are dynamic visual tours that require a person to go through it with a web recorder,” she said.
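The branching intake described above can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions, not SUCHO's actual tooling: the `Submission` fields, the `Pathway` names, and the `route` function are all hypothetical, and the rule that mirrored hosts and dynamic tours fall back to a human with a web recorder is inferred from Strong's description rather than documented.

```python
from dataclasses import dataclass
from enum import Enum
from urllib.parse import urlparse


class Pathway(Enum):
    # Hypothetical pathway names; the article only says "a few branching pathways".
    AUTOMATED_CRAWL = "automated crawl"
    MANUAL_RECORDING = "manual recording by a volunteer"


@dataclass
class Submission:
    url: str
    # Hypothetical triage flags, assumed to be set by a volunteer reviewing the site.
    has_dynamic_tour: bool = False
    host_mirroring: bool = False


def route(submission: Submission) -> Pathway:
    """Pick a capture pathway for one submitted URL (illustrative rules only)."""
    parsed = urlparse(submission.url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not a usable web address: {submission.url!r}")

    # Assumption: anything that defeats automated crawling goes to a person.
    if submission.has_dynamic_tour or submission.host_mirroring:
        return Pathway.MANUAL_RECORDING

    # Default: try to automate the capture when possible.
    return Pathway.AUTOMATED_CRAWL


if __name__ == "__main__":
    for sub in (
        Submission("https://example.org/archive"),
        Submission("https://example.org/virtual-tour", has_dynamic_tour=True),
    ):
        print(sub.url, "->", route(sub).value)
```

In this sketch the automated crawl is the default and manual recording is the exception, which mirrors Strong's remark that the team automates capture when it can.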
Illinois SHIELD experience proved invaluable
Strong noted that working on the University of Illinois Urbana-Champaign COVID-19 SHIELD program gave her the confidence that she could help at this level.
The SHIELD effort required an immediate ramp up for collecting and safely storing many types of data during the pandemic. It has been the backbone for managing COVID testing and tracking at the university and across the state of Illinois. Strong was involved with quickly spinning up documentation and advising on documentation best practices for the project.
She indicated that SHIELD moved quickly, and that a lightweight process and simple storage were the best way for the entire team to stay connected and make progress. The same is true for SUCHO’s work.
“SHIELD taught me how to dream that things like [saving Ukraine’s digital data] could be done, at scale, at speed, with an all-volunteer squad who didn’t even know each other. SHIELD taught me about, ‘Hey, who can do X and Y? OK, you, you, and me, here’s a Google doc and an .ipynb notebook [2], let’s go.’”
Strong is balancing learning on the job with teaching others so that the effort will be even more successful.
“I’ve been learning things one day, and the next day I’m teaching them to volunteers from 20 time zones,” she marveled.
Volunteers aren’t sure if they have found all the sites there are to find. Strong indicated that the more websites they can save in this two-week stretch, the more they can catch their collective breath and process the metadata later.
“Right now, we are racing the bombs,” she said.
“This is what I can do from where I am. We preserve the heritage and the institutional memory because neither librarians nor the Internet forget.”
[1] DSpace – A special type of repository used extensively by libraries and archives, but one that interacts differently than many websites do.
[2] .ipynb – A file type used by Jupyter notebooks within Google Colab (and in other places too). Jupyter notebooks allow people to write annotated Python code by alternating text explanations and either individually runnable or batch-runnable Python code snippets.