Converting Historical Archives into Cloud-based, Worldwide Libraries

Documenting the History of Science: California Institute of Technology’s Einstein Papers Project

Accessing and Integrating Albert Einstein’s Massive Written Legacy

Located on the campus of the California Institute of Technology is a small building that houses the library of Albert Einstein. Caltech, along with other prominent libraries in the US and Israel, is engaged in the noteworthy quest to collect, catalog, and organize documents from among the many works of Albert Einstein.

The Collected Papers of Albert Einstein is one of the most ambitious publishing ventures ever undertaken in the documentation of science history. It provides the first complete picture of Einstein’s massive written legacy ranging from his first work on the special and general theories of relativity to the origins of the quantum theory to his active involvement with international collaboration, cooperation, human rights, education, and disarmament.

The published volumes draw upon more than 40,000 documents contained in the personal papers of Einstein (1879-1955), and more than 30,000 additional Einstein and Einstein-related documents discovered by the editors since the 1980s. The printed series will contain over 14,000 scientific and non-scientific documents, and will fill close to 30 volumes.

The director and general editor of the Einstein Papers Project at Caltech engaged the services of Global Archives to help plan the launch of an online, electronic archive.

Launching a multilingual, international archive of written treasures

Using the existing hard-copy document index scheme provided by the Caltech team, Global Archives built an electronic version of the works. This ensured that the library’s contents could be integrated seamlessly with other Einstein libraries worldwide (now that’s truly a “global archive”!). The library contains a wide range of document types, including letters, bound documents, magazine articles, and other personal and professional pieces. Because the existing indexing method was preserved, users of the online archive can easily find, search, and share documents; searches can be run by index or by content, and in a variety of languages.

Global Archives analyzed the Einstein Library’s paper indexing method, evaluated the document types, and built a database structure to meet the director’s requirements. Global Archives also created a front-end tool based on LockBox that opens the library’s contents to other libraries, allowing them to securely place URL links to target documents on their own sites.

Secure, customized search of and access to the Einstein Papers Project (in effect, a universal library of all things Einstein) is the primary goal of the caretakers at Caltech and elsewhere. Global Archives led a painstaking conversion phase in which source documents were imaged and their key indexing data entered into the records database system. In addition, a full content scrape was performed on most mechanical print, including typewritten and published text. Since many of the personal papers were handwritten, these documents can be accessed via the conventional index method only. Languages scraped by OCR included Hebrew, German, French, and English.
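To make the distinction between index access and content access concrete, here is a minimal sketch in Python. It is not the project's actual database design: the record fields, the function names, and the sample records are all hypothetical placeholders. The only idea taken from the text is that every imaged document carries index data, while only machine-printed documents also carry scraped text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ArchiveRecord:
    """One imaged document. All field names are hypothetical."""
    index_number: str               # value taken from the paper index scheme
    doc_type: str
    language: str
    ocr_text: Optional[str] = None  # scraped text; None when no scrape was possible


def search_by_index(records: List[ArchiveRecord], index_number: str) -> List[ArchiveRecord]:
    """Index lookup works for every record, handwritten or not."""
    return [r for r in records if r.index_number == index_number]


def search_by_content(records: List[ArchiveRecord], phrase: str) -> List[ArchiveRecord]:
    """Case-insensitive content search; records without scraped text never match."""
    needle = phrase.casefold()
    return [
        r for r in records
        if r.ocr_text is not None and needle in r.ocr_text.casefold()
    ]


# Placeholder records for illustration only, not real archive entries.
SAMPLE_RECORDS = [
    ArchiveRecord("A-001", "typed letter", "German", "Sehr geehrter Herr Kollege"),
    ArchiveRecord("A-002", "magazine article", "English", "An article on relativity"),
    ArchiveRecord("A-003", "handwritten note", "German"),
]
```

In this sketch, a content search for "relativity" finds only the typed article, while the handwritten note remains reachable through its index number alone.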

Now, the Einstein Papers Project is universally accessible using Global Archives’ LockBox. Users worldwide can review the entire library either from LockBox’s website or via the Einstein Archives Online home page.

Content Scraping

Scraping content data is an optical character recognition (OCR) function that captures content electronically in a few different ways:

  1. Zones. Data capture can be organized to look at specific areas within a document. These areas are called “zones.” The zones are scraped to capture machine-generated characters, barcodes, text, and other data. The scraped data is then interpreted and written to a hidden text file attached to the image, where it can be used as a search or index field value. This is a fast and effective method for capturing specific data on a form.
  2. Full OCR. Scraping the entire document is the generally accepted method for a full OCR search of the document’s content. It is most effective for machine-generated documents, including typewritten or typeset documents. As in the Zone method, a hidden text file is attached to the document image and can be used for content search. Many documents in the library could be processed in this manner.
  3. Custom Reference Library. The custom reference library is used when a document has been handwritten or created in a non-standard character set, such that conventional scraping is ineffective. In this process, sample letters are captured in pixel format and entered into a character table. The table maps each handwritten sample to its corresponding standard character. Several language interpretations are allowed. Although there is a lower probability of success, most Custom Reference Libraries can provide a reasonable rendering of the content in electronic format and, most importantly, allow electronic viewing and content searching.
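As a rough illustration of how the Zone and Custom Reference Library methods fit together, the Python sketch below crops a zone out of a tiny black-and-white "page" and matches each glyph against a character table of pixel samples. It is a toy under stated assumptions, not the method used by any real OCR engine or by LockBox: the 3x5 glyphs, the fixed glyph spacing, the names CHARACTER_TABLE, crop_zone, match_glyph, and read_zone, and the distance threshold are all invented for the example.

```python
from typing import Dict, List, Tuple

Glyph = Tuple[str, ...]  # rows of pixels: '#' is ink, '.' is background

# Hypothetical character table: sample letters captured in pixel form.
CHARACTER_TABLE: Dict[str, Glyph] = {
    "E": ("###", "#..", "##.", "#..", "###"),
    "I": ("###", ".#.", ".#.", ".#.", "###"),
    "L": ("#..", "#..", "#..", "#..", "###"),
}


def crop_zone(page: List[str], top: int, left: int, height: int, width: int) -> List[str]:
    """Return the rectangular area ("zone") of the page image."""
    return [row[left:left + width] for row in page[top:top + height]]


def pixel_distance(a: Glyph, b: Glyph) -> int:
    """Count the pixels that differ between two same-sized glyphs."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def match_glyph(glyph: Glyph, table: Dict[str, Glyph] = CHARACTER_TABLE,
                max_distance: int = 3) -> str:
    """Pick the closest table entry, or '?' when nothing is close enough."""
    best_char, best_dist = min(
        ((char, pixel_distance(glyph, sample)) for char, sample in table.items()),
        key=lambda pair: pair[1],
    )
    return best_char if best_dist <= max_distance else "?"


def read_zone(page: List[str], top: int, left: int, height: int,
              glyph_width: int, count: int) -> str:
    """Read `count` glyphs from a zone, assuming one blank column between glyphs."""
    chars = []
    for i in range(count):
        offset = left + i * (glyph_width + 1)
        glyph = tuple(crop_zone(page, top, offset, height, glyph_width))
        chars.append(match_glyph(glyph))
    return "".join(chars)


# A made-up page with a one-pixel margin; the last glyph has one stray ink pixel.
PAGE = [
    "............",
    ".#...###.###",
    ".#....#..#.#",
    ".#....#..##.",
    ".#....#..#..",
    ".###.###.###",
]

hidden_text = read_zone(PAGE, top=1, left=1, height=5, glyph_width=3, count=3)
```

The string returned by read_zone plays the role of the hidden text attached to the image. The distance threshold reflects the lower probability of success noted above: a glyph that resembles nothing in the table is marked as unknown rather than guessed.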

