Recent changes to 128: Add support for ocrmypdf

#128 Add support for ocrmypdf

nwdm — Sun, 29 Jan 2023 20:41:58 -0000

I've been using it in a bash script that uses GNU find to list PDFs, then pdfinfo to ignore them unless they came from gscan2pdf, and pdftotext to ignore ones that already have text. Then it just runs ocrmypdf from the command line on whatever is left.
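
For anyone who wants to try the same filtering, here is a minimal sketch of that kind of pipeline, written in Python rather than bash. It is not the actual script described above. It assumes that pdfinfo, pdftotext and ocrmypdf are on the PATH, that gscan2pdf identifies itself in the PDF's Creator or Producer field (an assumption worth checking against your own files), and that writing the result next to the original with an "_ocr" suffix is acceptable. The helper names (parse_pdfinfo, came_from_gscan2pdf, has_text, needs_ocr, ocr_tree) are made up for illustration.

```python
import subprocess
from pathlib import Path


def parse_pdfinfo(output: str) -> dict:
    """Turn pdfinfo's 'Key:   value' lines into a dict of strings."""
    fields = {}
    for line in output.splitlines():
        # Split on the first colon only, so values such as dates stay whole.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def came_from_gscan2pdf(fields: dict) -> bool:
    """Assumption: gscan2pdf names itself in the Creator or Producer field."""
    stamp = fields.get("Creator", "") + " " + fields.get("Producer", "")
    return "gscan2pdf" in stamp.lower()


def has_text(extracted: str) -> bool:
    """True if pdftotext produced anything besides whitespace and form feeds."""
    return bool(extracted.strip())


def needs_ocr(pdf: Path) -> bool:
    """Run pdfinfo and pdftotext on one file and apply both filters."""
    info = subprocess.run(
        ["pdfinfo", str(pdf)], capture_output=True, text=True, check=True
    ).stdout
    if not came_from_gscan2pdf(parse_pdfinfo(info)):
        return False
    # "-" makes pdftotext write the extracted text to stdout.
    text = subprocess.run(
        ["pdftotext", str(pdf), "-"], capture_output=True, text=True, check=True
    ).stdout
    return not has_text(text)


def ocr_tree(root: str) -> None:
    """Walk a directory tree and OCR every PDF that passes the filters."""
    # The file list is built before the loop, so new outputs are not revisited.
    for pdf in sorted(Path(root).rglob("*.pdf")):
        if needs_ocr(pdf):
            out = pdf.with_name(pdf.stem + "_ocr.pdf")  # hypothetical naming
            subprocess.run(["ocrmypdf", str(pdf), str(out)], check=True)
```

Calling ocr_tree(".") would process the current directory. The parsing and filtering are kept in small pure functions so they can be checked without any PDFs at hand; only needs_ocr and ocr_tree touch the external tools.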

#128 Add support for ocrmypdf

Jeffrey Ratcliffe — Sun, 29 Jan 2023 15:08:47 -0000

I am in the process of rewriting gscan2pdf in Python. When I have finished, it should be reasonably easy to hook into ocrmypdf (which is also Python) to more accurately place the text than gscan2pdf currently does.

#128 Add support for ocrmypdf

nwdm — Sun, 29 Jan 2023 15:03:41 -0000

ocrmypdf is awesome!
https://ocrmypdf.readthedocs.io/en/latest/introduction.html

I found that Tesseract was better than gOCR in gscan2pdf. But while investigating a way to offload OCR asynchronously, I stumbled upon a Python project called ocrmypdf. And it is amazing!

It uses Tesseract, I think. It seems to be more accurate, and the text lines up better with the document image. It is also easier to select and copy text, because it uses a transparent font: you see the original document image, but you select the OCR'd text.

I'd love it if it was an option in gscan2pdf next to Tesseract and gOCR.