Historical-document recognition system improves upon existing technology

Sifting through documents that date back to the advent of the printing press can be time-consuming, but UC Berkeley researchers launched a new font-reading program this week that can simplify the process.

Ocular, a historical optical character recognition, or OCR, system, transcribes scanned documents produced on a printing press into text files. The program, which can read fonts it has not seen before more accurately than other systems, was created by Dan Klein, campus associate professor of electrical engineering and computer sciences, along with graduate students Taylor Berg-Kirkpatrick and Greg Durrett, all members of the Berkeley Natural Language Processing Group.

“Ocular works by using machine learning to adapt to the historical document, figuring out these glyph shapes directly from the document itself,” Berg-Kirkpatrick said. “It’s kind of like old-school code-breaking in a way because it deciphers the relationship between the image of characters and what the character is itself.”

When transcribing documents, Ocular’s error rate is 26 percent lower than that of existing software.

Berg-Kirkpatrick said the system was created to analyze printing-press-era documents because existing systems are not designed to scan them. Ocular, which took a year and a half to develop, can accommodate characters that are not placed in a straight line and characters printed in different tones, discrepancies that occur in 19th-century printing-press documents, he added.

“We see it as inspired both by existing OCR systems but also from techniques used in machine translation systems like Google Translate and automatic speech recognition systems like Siri,” Durrett said.

Quinn Dombrowski, a digital humanities coordinator in the research IT group who has worked with such systems, said existing OCR software requires a lot of human intervention to correct transcription errors. Dombrowski said she thought Ocular would be more effective than other systems because it has a greater ability to self-correct its errors.

“If you are looking at crime reports across the 19th century and you want to look at all of the documents, you are looking at thousands to hundreds of thousands of documents,” Dombrowski said. “As a human, you can’t look at every single page image of the documents, so it’s really important to have OCR systems.”

Ocular can transcribe one page per minute on a newer computer, but on an older computer, the process can take “a lot longer,” according to Berg-Kirkpatrick. If placed in fast mode, Ocular can transcribe one page every 30 seconds, but the transcript would be 5 percent less accurate, he added.

Because handwritten characters vary so widely, the system is unable to analyze handwritten text, according to Durrett. Ocular also cannot properly transcribe newspapers, because the system can only transcribe single lines of text and cannot accommodate newspaper columns.

But Durrett said he hopes later versions of the program will be able to separate columns into single lines of text, as well as transcribe handwritten documents.

Contact Sophie Mattson on Twitter @MattsonSophie.