Channel: Open Codex by Joseph Reagle

Improving PDF Annotations from GoodReader


For many years now, I’ve printed out PDFs and scribbled annotations on them. I then dictate my annotations (i.e., excerpts and comments) into a text file that I can transform and include in my bibliographic mindmap system (see de.py in thunderdell).

With the purchase of an iPad—I gave up on waiting for a decent Android tablet—I’m now annotating PDFs via the GoodReader app. Of course, the accuracy of the text highlighted is only as good as the PDF. The copyable text, generated by OCR, can have conjoined words or suffer from errors resulting from misunderstood ligatures, accents, or cruft. Also, the actual page number of the PDF probably doesn’t correspond to the document’s pagination.

With the short Python script gr-fix.py, I use a dictionary to correct OCR errors and transform the GoodReader format into the one used by de.py. This doesn’t correct everything (e.g., words with capitals) and can introduce a few errors itself, but it greatly improves on the original OCR. The --number argument also lets you correct the page numbers by an offset.
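To make the dictionary idea concrete, here is a minimal sketch of that kind of cleanup. It is not the code of gr-fix.py, which isn't reproduced here; the function names, the tiny `WORDS` set, and the way the offset is applied are all assumptions made for illustration. It expands Unicode ligature characters, splits a conjoined lowercase token when both halves are dictionary words, and shifts a PDF page number by an offset.

```python
# Hypothetical stand-in for a real word list; an actual run would load
# a full dictionary file instead of this handful of words.
WORDS = {"the", "of", "in", "text", "final", "office", "page"}

# Unicode ligature characters that often survive in text copied from PDFs.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def expand_ligatures(text):
    """Replace single-character ligatures with their plain letters."""
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    return text


def split_conjoined(token, words=WORDS):
    """Split a lowercase token into two dictionary words when possible.

    Tokens that are already words, contain capitals, or carry punctuation
    are returned unchanged, so capitalized words are not corrected.
    """
    if not token.isalpha() or not token.islower() or token in words:
        return token
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        if left in words and right in words:
            return left + " " + right
    return token


def fix_text(text, words=WORDS):
    """Expand ligatures, then try to split each whitespace-separated token."""
    text = expand_ligatures(text)
    return " ".join(split_conjoined(token, words) for token in text.split())


def fix_page(pdf_page, offset):
    """Map a PDF page index to the document's printed page number (assumed: simple addition)."""
    return pdf_page + offset
```

With this sketch, `fix_text("the \ufb01nal textof theoffice")` returns `"the final text of the office"`, and `fix_page(15, -12)` returns `3`. Splitting at the first dictionary match is deliberately naive: it is exactly the kind of rule that can introduce an occasional error of its own when a wrong split happens to produce two valid words.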

