Big Blog




parse: search

OCR Tech Allows Google to Index Millions of Scanned Documents

Scanned PDFs are a kind of darknet on the web: at best, search engines see an image inside a PDF but can't parse out the actual text. That has now changed. Google recently announced that it will begin using OCR (optical character recognition) technology to index the text inside scanned PDF documents.

Tip: Detect XML document encodings with SAX and XNI: Quickly find input encoding with streaming APIs

Sometimes when you forward XML documents, you just want to copy the bytes from point A to point B. You don't necessarily want to parse the entire thing, but you do need to determine the character encoding to set the metadata appropriately. In these cases, streaming APIs such as SAX and XNI offer a fast and efficient way to inspect the encoding without paying for full parsing.
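
The tip itself concerns Java's SAX and XNI. As a rough illustration of the same streaming idea, here is a minimal sketch in Python that swaps in the standard library's expat binding: it reads the XML declaration and abandons the parse as soon as the answer is known. The function name `sniff_xml_encoding`, the 4096-byte window, and the UTF-8 fallback are assumptions made for this example, not part of the original tip.

```python
import xml.parsers.expat


class _Stop(Exception):
    """Raised from a handler to abandon parsing once the answer is known."""


def sniff_xml_encoding(data: bytes, default: str = "UTF-8") -> str:
    """Return the encoding named in the XML declaration, or `default`.

    Hypothetical helper: only the first 4096 bytes are fed to the parser,
    and a document without a declared encoding is assumed to be `default`
    (a simplification that ignores UTF-16 documents identified only by BOM).
    """
    found = {}

    def on_xml_decl(version, encoding, standalone):
        # `encoding` is None when the declaration omits encoding="..."
        found["encoding"] = encoding
        raise _Stop

    def on_start_element(name, attrs):
        # Root element reached with no declaration: nothing more to learn
        raise _Stop

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_xml_decl
    parser.StartElementHandler = on_start_element
    try:
        # is_final=False so a truncated document is not reported as an error
        parser.Parse(data[:4096], False)
    except _Stop:
        pass
    return found.get("encoding") or default
```

Calling `sniff_xml_encoding(raw_bytes)` on the first chunk of a document gives a value that could be placed in, say, a Content-Type header before the bytes are forwarded unchanged. Stopping with an exception is the same trick commonly used with SAX handlers: the parser does no work beyond the prolog.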




Copyright © 2001-2008 Jonathan Hedley