Augmented reality is a hot topic right now, especially with the iPhone’s hardware capabilities finally catching up to people’s imagination. I recently saw the “Word Lens” app in the App Store, and its demo video really got me excited about this space.

In this next series of posts, I’ll document my attempt to create a framework that can do what Word Lens does. I’ve been meaning to add video and audio processing capabilities to Cloud Browse for a while now, so this will be a good excuse to dig into how to access the camera on iOS.

Overall Plan

Word Lens uses OCR to detect letters in video frames, then removes them and replaces them with translated text, matching the location and size of the original text.

Abstracting away the details, you have these basic steps (sketched in code right after the list):

  1. Capture video frame data
  2. Detect features
  3. Remove unwanted features
  4. Render augmentation
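
Wired together, the per-frame loop is simple. Here is a minimal Python sketch, assuming the prototype work happens server-side in Python; detect_features, remove_features, and render_augmentation are placeholder names for the real work in steps 2–4:

    def detect_features(frame):
        # Placeholder: step 2 will use OCR to find letters and their boxes.
        return []

    def remove_features(frame, features):
        # Placeholder: step 3 will paint out the detected letters.
        return frame

    def render_augmentation(frame, features):
        # Placeholder: step 4 will draw the replacement text over the frame.
        return frame

    def process_frame(frame):
        features = detect_features(frame)              # step 2: detect
        cleaned = remove_features(frame, features)     # step 3: remove
        return render_augmentation(cleaned, features)  # step 4: render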

What is impressive about Word Lens is that they do all of that on the iPhone itself, which is convenient for the user. But since we only care about duplicating the functionality right now, we’ll be doing most of our work on the desktop. Apparently they spent 2 1/2 years creating their technology; we will use off-the-shelf open source technologies and cobble them together quickly to get the desired effect.

Here is my plan, with rough code sketches for a couple of the steps after the list:

  1. Capture video on iPhone and send to server
  2. Detect letters using Tesseract-ocr
  3. Remove letters using techniques found in Patch Match or Resynthesizer
  4. Translate using Google Translate
  5. Render translated word using ImageMagick
  6. Send image back to be displayed on iPhone
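
For the OCR step, tesseract-ocr can report not only the recognized words but also where they sit in the frame, which is exactly what the later steps need. A minimal sketch, assuming Python on the server with the pytesseract wrapper and a captured frame saved as frame.png (both my assumptions, not part of the plan above):

    import pytesseract
    from pytesseract import Output
    from PIL import Image

    frame = Image.open("frame.png")
    data = pytesseract.image_to_data(frame, output_type=Output.DICT)

    # Each recognized word comes with a bounding box we can reuse when we
    # erase the original letters and draw the translation in their place.
    for i, word in enumerate(data["text"]):
        if word.strip():
            print(word, data["left"][i], data["top"][i],
                  data["width"][i], data["height"][i])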

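Steps 4 and 5 can be roughed out just as quickly. In the sketch below, translate_word is a hypothetical stand-in for a Google Translate call, the white rectangle is a crude stand-in for the real letter-removal step, and the frame is re-rendered by shelling out to ImageMagick’s convert; the filenames, font size, and example word and box are all made up for illustration:

    import subprocess

    def translate_word(word, target="en"):
        # Placeholder: the real pipeline would call Google Translate here.
        return word

    def render_word(src, dst, word, box):
        left, top, width, height = box
        # Cover the original letters with a plain box (a stand-in for real
        # removal), then draw the translated word roughly where they were.
        subprocess.run([
            "convert", src,
            "-fill", "white",
            "-draw", f"rectangle {left},{top} {left + width},{top + height}",
            "-fill", "black",
            "-pointsize", str(height),
            "-annotate", f"+{left}+{top + height}",
            translate_word(word),
            dst,
        ], check=True)

    render_word("frame.png", "augmented.png", "gato", (40, 60, 120, 32))
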
The hardest step seems to be removing the letters, since there isn’t a ready-made library for doing this. We’ll tackle that step last. The first step will be to get an image to the server and run OCR on it. That will be the subject of the next post.