Google can now OCR PDFs
November 10, 2008
Posted by Jill (@bonnjill) in Tools.
Google just keeps offering new and exciting improvements that make our lives easier, from Google Print, Google Earth and Google Video to Google Translate and now the Tesseract OCR Engine. You simply have to respect a company that has the goal of making every last bit of the world’s information searchable. It is an awesome endeavor indeed. I am catching up on my feed reads and just learned that Google can now OCR PDFs, and has been doing so since October 31st. For those of you who are not familiar with the abbreviations, OCR stands for optical character recognition and PDF stands for Portable Document Format.
As announced on the Official Google Blog, the company is now performing OCR on indexed documents that it identifies as scanned PDFs. Google has indexed documents that were saved as text-based PDFs for quite some time, but many documents wind up being made into PDFs through scans, which store the text as images. You can see the words on the screen, but your computer can’t. When such a scan is put up on a Web site, search engines have been unable to index its content because they didn’t recognize the text as text … until now.
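For the technically curious, the difference between a text-based PDF and a scanned one can be roughly illustrated in a few lines of Python. This is only a sketch under my own assumptions, not how Google identifies scans: the function name `looks_scanned` is made up, and the check simply looks for font entries and image entries in the raw bytes of the file. It will misjudge PDFs whose internals are compressed, as well as scans that already carry a hidden OCR text layer.

```python
import re

# Rough markers in the raw PDF bytes (an assumption of this sketch):
# text-based PDFs reference fonts, scanned pages are stored as image objects.
IMAGE_MARKER = re.compile(rb"/Subtype\s*/Image")


def looks_scanned(pdf_bytes: bytes) -> bool:
    """Guess whether a PDF contains page images but no font-based text."""
    has_fonts = b"/Font" in pdf_bytes
    has_images = IMAGE_MARKER.search(pdf_bytes) is not None
    return has_images and not has_fonts


# Tiny hand-made fragments standing in for real files.
text_pdf = b"%PDF-1.4 << /Resources << /Font << /F1 5 0 R >> >> >>"
scan_pdf = b"%PDF-1.4 << /Type /XObject /Subtype /Image /Width 1700 >>"

print(looks_scanned(text_pdf))
print(looks_scanned(scan_pdf))
```

A file flagged this way is the kind of document that needs OCR before its words can be searched or copied.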
According to the Google Code Blog:
In a nutshell, we are all about making information available to users, and when this information is in a paper document, OCR is the process by which we can convert the pages of this document into text that can then be used for indexing.
…
For now it only supports the English language, and does not include a page layout analysis module (yet), so it will perform poorly on multi-column material. It also doesn’t do well on grayscale and color documents, and it’s not nearly as accurate as some of the best commercial OCR packages out there. Yet, as far as we know, despite its shortcomings, Tesseract is far more accurate than any other Open Source OCR package out there.
As a medical translator I frequently receive my source texts as PDFs. I run them through ABBYY FineReader, which uses OCR to generate Word files that I can then translate with Trados. People can use the service to create texts from scanned PDFs by simply uploading them to the web site (caveat: do not upload documents you want kept private, particularly translations and source texts that belong to the client), but I am more excited about the prospects for Internet research.
This is good news for those of us who rely on Internet research to earn our bread and butter, and the impact on that research will be enormous. Now that Google can OCR PDFs, PDFs that consist of images will finally be indexed and searchable. Google’s “View as HTML” feature is quite useful for these documents, especially if you need to copy portions of them for your notes or paste terms you find in them into your translation.
Oh, this is good news! Thanks for sharing!