Marion No More

October 05, 2005

Google Print and OCR

Although one cannot know what is truly going on since all partners are mum on the technology, standards, and methods being used, it is generally assumed that the scanning machine in use for Google Print is the Kirtas book scanner. Having witnessed a full Kirtas demo in May, I can attest to two things. First, it is indeed an amazing machine with a lot of potential, although it has some very clear drawbacks: it can't handle certain paper weights, is largely useless for unbound materials, can't handle books cracked along the spine, etc. So those books and materials--of which we collectively have millions--require special handling.

Second, and more significantly, Kirtas is not currently experimenting with ABBYY's FineReader XIX, which is the only viable Fraktur OCR engine at this point. The software behind the Kirtas scanner uses FineReader (and only FineReader, might I add). When I asked the rep about XIX I got a blank stare. One could say that he's just a sales rep, but Kirtas is a tiny firm, so its people tend to know a lot about the various parts of the business, and he was otherwise intimately familiar with the workings of the software and OCR engine. Also, Kirtas freely admits that many languages have no viable OCR options at all, first and foremost Arabic. Thankfully, no important texts were ever penned in German before 1945 or in Arabic, so we don't have to worry about that!

Another thing to consider is OCR quality. Google Print's is garbage, but I'm not sure they care. Why? Because newer books will probably be added from native digital sources, so no OCR will be necessary, and even if they are scanned, OCRing texts from recent decades is child's play in OCR terms. It's the newer books that will drive the business model by creating sales and spinoff revenue. Consider that Gale, when they create OCR for products such as ECCO, uses five different OCR engines, compares results, and then runs the best match against a painstakingly developed proprietary dictionary that copes with a host of quirks such as ligatures, f/s issues, variant forms of caps, etc. And after all of that, they're still so displeased with the results (which are actually quite good) that they developed fuzzy search technology to allow even mismatched terms to be found. I'm not saying that Google couldn't do all of this, too, but frankly, I'd wager they won't. It's too expensive, and they'd need a market to which they could sell the results for a high price (have you seen the price tag on ECCO lately?).
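To give a feel for what that kind of pipeline involves, here is a rough Python sketch of the three steps described above: voting across several engines, checking the winner against a dictionary, and fuzzy search over whatever errors remain. None of this is Gale's actual code, which is proprietary. The function names, the tiny `DICTIONARY`, the similarity thresholds, and the assumption that the engines' outputs are already aligned word for word are all mine, for illustration only.

```python
from collections import Counter
from difflib import SequenceMatcher, get_close_matches

# Hypothetical stand-in for a large, hand-built period dictionary.
DICTIONARY = ["possession", "wisdom", "blessed", "house", "the", "of"]


def vote(readings):
    """Pick the reading most engines agree on; ties go to the earliest engine."""
    counts = Counter(readings)
    best = max(counts.values())
    return next(r for r in readings if counts[r] == best)


def normalise(word):
    """Map the long s and two common ligatures to modern letters."""
    return word.replace("ſ", "s").replace("ﬁ", "fi").replace("ﬂ", "fl")


def correct(word, cutoff=0.75):
    """Snap a word to its closest dictionary entry, if one is close enough."""
    cleaned = normalise(word).lower()
    matches = get_close_matches(cleaned, DICTIONARY, n=1, cutoff=cutoff)
    return matches[0] if matches else cleaned


def merge_engines(outputs):
    """Combine token lists from several engines (assumed already aligned)."""
    return [correct(vote(column)) for column in zip(*outputs)]


def fuzzy_find(term, tokens, threshold=0.75):
    """Return the positions of tokens that are similar to the search term."""
    return [
        i for i, tok in enumerate(tokens)
        if SequenceMatcher(None, term.lower(), tok.lower()).ratio() >= threshold
    ]


# Three imaginary engines; two of them misread the long s as an f.
engine_outputs = [
    ["the", "poffession", "of", "wifdom"],
    ["the", "possession", "of", "wisdom"],
    ["tne", "poffession", "of", "wifdom"],
]
merged = merge_engines(engine_outputs)
hits = fuzzy_find("wisdom", ["the", "poffession", "of", "wifdom"])
```

In this toy case the majority vote actually picks the wrong reading ("poffession"), and it is the dictionary step that rescues it. The fuzzy search then finds "wifdom" in uncorrected text when you search for "wisdom". Scale that up to five commercial engines and a dictionary built by hand, and the price tag starts to make sense.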

Incidentally, for the first time today I searched for a title in Google Print and was asked to log in with my Google Account info to see the text. I'm beginning to grow weary of Google wanting to track the usage of everything I touch. I've stopped using Google Desktop and won't even touch a Gmail account, no matter how hip they are at the moment. The fact that they're willing to singlehandedly decide what copyright means in 2005 makes me wonder how seriously they take privacy. If publishers can be beaten back by Google, what chance do I as an individual consumer have if they choose to misuse my personal data? No thanks. As is being written these days, we have a new Microsoft on our hands.

No, the irony that this blog is hosted by a Google firm is not lost on me. I dislike Microsoft's tactics, but haven't penned a document in anything but Word for about 15 years.
