UC Berkeley details plans to digitize majority of pre-1912 Chinese texts

Photo of East Asian Library
David McAllister/Staff
UC Berkeley's C.V. Starr East Asian Library is planning to digitize nearly 10,000 texts. The digitized materials will be available to students.

On May 20, the UC Berkeley C.V. Starr East Asian Library announced plans to digitize most of its pre-1912 collection of Chinese texts.

The project will make nearly 10,000 texts digitally accessible to students around the world, according to a UC Berkeley press release. The texts come from the East Asian Library, which holds the largest collection of Chinese texts among research libraries in North America.

The project will be conducted in collaboration with Sichuan University, a public research university in China, and is funded by the Alibaba Foundation, a philanthropic organization focused on outreach and accessibility.

Peter Zhou, director of the East Asian Library, will oversee the creation of digital images of the texts. Sichuan University will then process the images using optical character recognition, or OCR, to create searchable text.

“They’re training the machine to recognize Asian characters with many different variants through history, and they are used in different books in different ways, so to convert the image into the text, you have to build up a large vocabulary where the word and its variants can be matched,” Zhou said. “Here in Berkeley, we don’t have such staff resources, so in a way, we divide it up.”
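
The variant matching Zhou describes can be illustrated with a minimal Python sketch. It is not the project's actual pipeline: the table `VARIANT_TO_CANONICAL` and the functions `normalize` and `matches` are hypothetical names, and the four character pairs are only a small sample of the kind of variant forms a real table would contain. The idea is that both the recognized text and a reader's search query are folded to one canonical form before comparison.

```python
# Hypothetical variant table (an assumption for illustration): each key is a
# variant glyph, each value the canonical form chosen for indexing.
# A production table would hold many thousands of entries.
VARIANT_TO_CANONICAL = {
    "爲": "為",
    "眞": "真",
    "敎": "教",
    "靑": "青",
}


def normalize(text: str) -> str:
    """Replace every known variant character with its canonical form."""
    return "".join(VARIANT_TO_CANONICAL.get(ch, ch) for ch in text)


def matches(query: str, ocr_text: str) -> bool:
    """Return True if the query occurs in the OCR text once both are normalized."""
    return normalize(query) in normalize(ocr_text)


if __name__ == "__main__":
    # A search for the canonical form finds a passage printed with a variant.
    print(matches("真", "眞理"))
```

With a table like this, a search typed in one form of a character can still find a passage printed in another form, which is what makes the recognized text searchable across editions.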

Deborah Rudolph, curator of the Fong Yun Wah Rare Book Room at the East Asian Library, said many texts contain written commentary from collectors. She noted that the digital images the library is creating will also help researchers compare different copies of the documents.

The materials will be publicly available for all students to use, as is common among institutions in the United States, according to Zhou. However, this type of open access is not widely embraced in China, he added.

Zhou said he hopes this project will help foster a “new ecosystem” of collaboration across institutions globally.

“It has been difficult for our students and our scholars who try to access such collections in libraries in China and other countries because these materials are blocked, so we hope we can set an example that by sharing such content in the open web, such a barrier can be removed,” Zhou said. “It’s a game-changer to imagine our students and scholars could have access to all kinds of special collections in China and Japan and Korea.”

The availability of digital copies will help preserve the original materials because fewer people will need to handle them physically, he added.

Zhou said the project will take place in two phases, each spanning three years, with the eventual goal of digitizing the East Asian Library’s entire pre-1912 collection. According to Zhou and the UC Berkeley press release, each text will be made available through the UC Berkeley Library Digital Collections portal as it is digitized.

“(Any project like this) may not be an eye-opener for the entire population, but for scholars with a certain interest or focus, they’ll certainly find material that is important for them,” Rudolph said.

Contact Emma Taila at [email protected] and follow her on Twitter at @emmataila.