Legacy Code Support & Enhancement for the Internet Archive

The client

The Internet Archive, a non-profit, builds a vast digital library of internet sites and cultural artifacts to provide universal access to all knowledge.

The goal

Support and enhance the book digitization software, Scribe, by improving performance, camera integration, and centralized log tracking, while ensuring a consistent team and effective knowledge transfer.

The outcome

COXIT provided a dedicated team of Python engineers. Within a month, the team was fully integrated and began producing quality work, facilitating four major software releases with over 40 new features and numerous improvements. This collaboration supports the digitization of thousands of books per week from stations worldwide.

CHALLENGE

Think About Digitizing 40,000,000 Books. How Do You Keep the System Alive?

The Internet Archive, often likened to a modern-day Library of Alexandria, is a non-profit organization building a vast digital library of internet sites and cultural artifacts. This immense repository includes 835 billion web pages, 44 million books and texts, 15 million audio recordings, 10.6 million videos, 4.8 million images, and 1 million software programs.

Central to their mission is the Scribe3 project, a complex system responsible for digitizing millions of books.

SOLUTION

Meet the Internet Archive Books Digitization cornerstone: one-piece, 40,000-line core software

The Archive's book digitization process, built over the years, was designed to scan thousands of books weekly.

At the heart of it was Scribe3, the Python-based software powering their scanning stations. Operators use it to control cameras, capture images, and process the data into digital books.

Any change to Scribe3 required navigating a maze of undocumented code written over the years by a succession of individual developers.

Fragility within the system limited the ability to add support for new cameras and interfered with plans to improve digitization quality and speed.

Jude Coelho,

Software Manager, Internet Archive

I’m really impressed with how quickly COXIT was able to come up to speed and learn about our software. I also like how communicative and receptive to feedback they are, especially because they deal with feedback from the people in our book development team as well as 100 other people in the field doing the work of digitizing books.

The new development partner, COXIT, brought the technical expertise necessary to navigate the legacy code and integrate into the Archive's existing workflow.

After getting acquainted with Scribe3's codebase and its interactions with the Archive's broader ecosystem, the team was ready to immerse themselves in the project, streamline the code, and simplify the system's development and support.

Jude Coelho,

Software Manager, Internet Archive

We needed a company that could help us scale our development capabilities to make it easier for more people to work on the project and not have knowledge confined to a single person.

SOLUTION

How COXIT added dozens of enhancements over 5 years (and counting)

The IA Books Digitization Team's mission isn't limited to just single, bound works. New tasks come up every day, and when they don’t, the COXIT team proposes enhancements based on what they see, which the Archive team is always happy to hear.

Here are just a few of the improvement projects from this long list:

1

Improved Scribe3 Camera Efficiency

COXIT improved Scribe3's camera integration, addressing compatibility issues with newer Sony models. Using the Sony SDK, they reduced shot time from 8 seconds to 3 seconds for models like the Sony a7r4, greatly increasing efficiency for scanning thousands of pages daily.

2

Centralized Log Tracking with Grafana Loki

COXIT deployed Grafana Loki, a log aggregation system, to address the lack of centralized log tracking. A single searchable log dashboard now lets the team quickly pinpoint errors across 40+ scan centers worldwide, saving hours of troubleshooting and keeping the system running smoothly.

3

RAW Image Support Added to Scribe3 for Enhanced Fidelity

COXIT enabled RAW image support within Scribe3, providing operators with more granular control over image data. This ensured the highest fidelity for archival purposes.

4

LCP DRM Integration Secures Archive's eBook Lending

To facilitate the Archive's secure and ethical eBook lending program, COXIT integrated LCP DRM (Digital Rights Management) software. This added protection to copyrighted works while respecting the Archive's core values of accessible information.

5

Improved Upload System Reduces Disk Space Usage

COXIT implemented a better uploading system that drastically reduced disk space usage as files awaited uploading. This allowed the Archive to handle multiple days of bandwidth/uploading issues without running out of local disk space.

6

Machine Learning Enhances Book Cover Display on Archive.org

COXIT used machine learning to train software that determines if a book cover is suitable as the primary display image or if it should be excluded. This improved the display of books on Archive.org.

and counting...
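
To make the centralized log tracking in item 2 more concrete, here is a minimal sketch of how a scanning station could ship a log line to Grafana Loki through its HTTP push endpoint (/loki/api/v1/push). This is an illustration under stated assumptions, not the code COXIT deployed: the label names (app, station, level), the station ID, and the function names are hypothetical, and production setups often use a log shipping agent instead of direct HTTP calls.

```python
import json
import time
import urllib.request


def build_loki_payload(station_id, message, level="error", timestamp_ns=None):
    """Build the JSON body for Loki's push endpoint.

    The label names below are hypothetical; pick labels that match
    how your own stations and applications are organized.
    """
    if timestamp_ns is None:
        timestamp_ns = time.time_ns()
    return {
        "streams": [
            {
                # Labels identify the log stream and can be filtered on later.
                "stream": {"app": "scribe3", "station": station_id, "level": level},
                # Each entry is [unix time in nanoseconds as a string, log line].
                "values": [[str(timestamp_ns), message]],
            }
        ]
    }


def push_to_loki(base_url, payload, timeout=5):
    """POST the payload to <base_url>/loki/api/v1/push and return the HTTP status."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/loki/api/v1/push",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Build and print a payload without contacting any server.
    demo = build_loki_payload("station-042", "camera timeout during capture")
    print(json.dumps(demo, indent=2))
```

With labels like these, a LogQL selector such as {app="scribe3", level="error"} would list errors from every station in one place, which is the kind of cross-site search the dashboard provides.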

Being a part of a greater mission

Our partnership with the Internet Archive remains strong and dynamic. This collaboration has significantly enhanced the Internet Archive’s ability to digitize and preserve millions of books, leading to faster, more efficient processes.

The Internet Archive now operates with increased efficiency and reliability, thanks to our joint efforts.

With COXIT's ongoing support, the Archive is well-positioned to expand its digital collection and further its mission of universal access to knowledge.

Looking ahead, the COXIT-Internet Archive partnership promises more innovations, ensuring the preservation of human history for future generations.

Jude Coelho,

Software Manager, Internet Archive

COXIT's support drove efficiencies in our digitization efforts, allowing us to process thousands of books per week. Their technically proficient team is able to work effectively with minimal supervision, and their responsiveness and receptiveness to feedback ensure a seamless workflow.