
Google to scan 800,000 manuscripts, books from Indian university

Google's dream of organizing all the world's information has come one step …

Need to dig up some information from a centuries-old text on ayurvedic medicine? Soon you'll be able to do so from the comfort of your living room. Google has agreed to index and digitize 800,000 texts stored at the University of Mysore in India as part of its attempt to broaden the Google Book Search program, according to the Indo-Asian News Service.

"Written in both papers and palm leaves, there are around 100,000 manuscripts in our library, some dating back to the eighth century," said the vice chancellor of Mysore. "The effort is to restore and preserve this cultural heritage for effective dissemination of knowledge." He also added, cryptically, that the University plans to "patent them before making them available on public domain."

Google has been aggressively expanding its Book Search program to include non-English library materials. It recently announced a deal with the University of Lausanne to scan a large collection of French-language works, and the new partnership with Mysore will add works in Sanskrit and Kannada. These schools lack the fear of Google shown by the French government, which has backed projects like Gallica and Quaero to challenge the search giant, so far without apparent success.

India has become increasingly important to Google in the last few years. The company opened a billion-dollar data center in Andhra Pradesh, and it recently announced the availability of Google News in Hindi. But how will the might of Google's technology fare when confronted with handwritten Sanskrit?

How steady is your hand?

Making an archive like this useful to scholars will involve using optical character recognition to translate the handwritten texts into searchable characters—and it's a tough task. Our own Jon Stokes has done extensive research in this area and says, "The hard part about doing a project like this lies not so much in the actual digitization of the page images, but in doing OCR on a handwritten script. OCR can work quite well on handwritten manuscript pages, if the handwriting is regular enough. Researchers doing this stuff with Greek manuscripts have gotten some good results, but again only on regular hands."
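The article doesn't detail Google's own recognition pipeline, but the basic operation is easy to illustrate. The sketch below assumes the open-source Tesseract engine and its pytesseract Python wrapper are installed, along with Tesseract's Sanskrit ("san") and Kannada ("kan") language packs; the file name is hypothetical, and this is a minimal illustration rather than Google's production setup.

```python
# Minimal sketch: run OCR over a scanned manuscript page with the
# open-source Tesseract engine via pytesseract. The file name is
# hypothetical, and this is an illustration, not Google's pipeline.
from PIL import Image      # pip install pillow
import pytesseract         # pip install pytesseract (requires tesseract itself)

# A hypothetical scan of one palm-leaf page from the Mysore collection.
page = Image.open("mysore_page_001.png")

# Tesseract ships trained data for Sanskrit ("san") and Kannada ("kan");
# both language packs need to be installed for this call to succeed.
text = pytesseract.image_to_string(page, lang="san+kan")
print(text)
```

As Jon notes, an off-the-shelf engine like this is tuned for printed text, which is exactly why irregular handwriting is such a hard problem.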

Google has sponsored the development of open-source tools like OCRopus to address these problems. The new project is built on Tesseract, the open-source OCR engine whose development Google also sponsors, and it adds a handwriting recognizer and "novel high-performance layout analysis methods." The research is clearly of more than academic interest to Google: as the company expands its digitization efforts, OCR is the only feasible way to convert handwriting into text on such a massive scale. But the challenges go beyond character recognition itself; storing and marking up the resulting data is a problem in its own right.
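For a sense of what such a pipeline looks like in practice, here is a rough sketch of the binarize, segment, and recognize stages that the later open-source OCRopus (ocropy) release exposes as command-line tools. The command names are real, but the flags, file names, and the handwriting model below are placeholders, not a tested recipe from this project.

```python
# Rough sketch of OCRopus's binarize -> layout-analyze -> recognize stages,
# driven from Python. Paths and the model file are placeholders.
import glob
import subprocess

scan = "mysore_page_001.png"    # hypothetical input scan

# 1. Binarize and deskew the scan; results land in ./book/ as 0001.bin.png.
subprocess.run(["ocropus-nlbin", scan, "-o", "book"], check=True)

# 2. Layout analysis: split the binarized page into individual text lines.
subprocess.run(["ocropus-gpageseg", "book/0001.bin.png"], check=True)

# 3. Recognize each line image with a trained model (here, a hypothetical
#    model trained on a regular Sanskrit or Kannada hand).
lines = sorted(glob.glob("book/0001/*.bin.png"))
subprocess.run(["ocropus-rpred", "-m", "sanskrit-hand.pyrnn.gz", *lines],
               check=True)
```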

The Text Encoding Initiative (TEI) was founded in 1987 with the aim of providing SGML-compliant, machine-readable texts for humanities scholars and social scientists. The organization's "P3" text encoding guidelines have been in use since 1994 in a range of digital library and manuscript encoding projects, but marking up documents into a TEI-compliant format is a challenge.
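To make the markup challenge concrete, here is a skeletal sketch of what a TEI-style document for one transcribed passage might look like, generated with Python's standard library. A real TEI file needs a much fuller header (publication and source descriptions, for instance), and the content here is invented for illustration.

```python
# Skeletal sketch of TEI-style markup for a transcribed passage.
# Real TEI documents require a fuller header; this shows only the shape.
import xml.etree.ElementTree as ET

tei = ET.Element("TEI", xmlns="http://www.tei-c.org/ns/1.0")

header = ET.SubElement(tei, "teiHeader")
file_desc = ET.SubElement(header, "fileDesc")
title_stmt = ET.SubElement(file_desc, "titleStmt")
ET.SubElement(title_stmt, "title").text = "Ayurvedic manuscript (illustrative title)"

text = ET.SubElement(tei, "text")
body = ET.SubElement(text, "body")
ET.SubElement(body, "p").text = "The transcribed text of the page goes here."

print(ET.tostring(tei, encoding="unicode"))
```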

If Google is using the OCRopus project to do the handwriting recognition, then the engine will probably generate text encoded as HTML. "HTML is fine if you're making the texts directly available online," Jon says, "but the Holy Grail is really to do automated capture of handwritten texts into some TEI-compliant flavor of SGML. Once the text is marked up with TEI tags, you can output to HTML or any other format from that. You can also let scholars come behind the OCR engine and do things to make the marked-up version more useful, like tagging proper names, changes in hand or ink color, supralinear and marginal corrections, and so on."
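The kind of scholarly enrichment Jon describes, and the HTML output that falls out of it, can be sketched with a tiny example. The TEI elements below (persName for a proper name, handShift for a change of hand, add with place="margin" for a marginal correction) are standard, but the text and the naive tag-to-span conversion are purely illustrative; a real pipeline would use XSLT stylesheets or a dedicated TEI toolchain.

```python
# Tiny illustration: a TEI-tagged line (proper name, change of hand,
# marginal addition) naively converted into HTML spans. Purely a sketch;
# real projects would use XSLT or a dedicated TEI toolchain.
import xml.etree.ElementTree as ET

tei_line = ('<p>As <persName>Charaka</persName> writes, '
            '<handShift new="#scribe2"/>the remedy is prepared '
            '<add place="margin">with honey</add>.</p>')

root = ET.fromstring(tei_line)

# Turn each TEI element into an HTML span, keeping its TEI role as a class.
for elem in root.iter():
    if elem.tag in ("persName", "handShift", "add"):
        elem.set("class", elem.tag)
        elem.tag = "span"

print(ET.tostring(root, encoding="unicode"))
```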

Until that Holy Grail is found, marking up handwritten texts by hand remains the only solution.
