Add support for ocrmypdf
Brought to you by:
ra28145
ocrmypdf is awesome!
https://ocrmypdf.readthedocs.io/en/latest/introduction.html
I found that Tesseract was better than gocr in gscan2pdf. But while investigating a way to offload OCR asynchronously, I stumbled upon a Python project called ocrmypdf. And it is amazing!
I think it uses Tesseract. It seems to be more accurate, and the text lines up better with the document image. It is also easier to select and copy from, because it uses a transparent font: you see the original document image, but you select the OCR'd text.
I'd love it if it was an option in gscan2pdf next to Tesseract and gocr.
I am in the process of rewriting gscan2pdf in Python. When I have finished, it should be reasonably easy to hook into ocrmypdf (which is also Python) to more accurately place the text than gscan2pdf currently does.
I've been using it in a bash script that uses GNU find to list PDFs, then pdfinfo to ignore them unless they came from gscan2pdf, and pdftotext to ignore ones that already have text. Then it just runs ocrmypdf from the command line.
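For anyone who wants to try the same workflow, here is a rough Python sketch of it (not the actual bash script). It assumes pdfinfo, pdftotext and ocrmypdf are on the PATH, and it assumes that gscan2pdf can be recognised from the Creator or Producer field that pdfinfo prints; check that against your own files. The function names and the "-ocr.pdf" output naming are made up for illustration.

```python
import subprocess
from pathlib import Path


def made_by_gscan2pdf(pdfinfo_output: str) -> bool:
    """True if the Creator or Producer line mentions gscan2pdf.

    Assumption: gscan2pdf identifies itself in one of these two fields.
    """
    for line in pdfinfo_output.splitlines():
        key, _, value = line.partition(":")
        if key.strip() in ("Creator", "Producer") and "gscan2pdf" in value.lower():
            return True
    return False


def has_text(pdftotext_output: str) -> bool:
    """True if pdftotext found anything other than whitespace or form feeds."""
    return bool(pdftotext_output.strip())


def run(cmd: list[str]) -> str:
    """Run a command and return its stdout, raising if it fails."""
    result = subprocess.run(
        cmd, capture_output=True, text=True, errors="replace", check=True
    )
    return result.stdout


def ocr_tree(root: Path) -> None:
    """OCR every gscan2pdf PDF under root that has no text layer yet."""
    for pdf in sorted(root.rglob("*.pdf")):
        if not made_by_gscan2pdf(run(["pdfinfo", str(pdf)])):
            continue  # not produced by gscan2pdf
        if has_text(run(["pdftotext", str(pdf), "-"])):
            continue  # already has text
        # Hypothetical naming scheme: write the result next to the original.
        out = pdf.with_name(pdf.stem + "-ocr.pdf")
        subprocess.run(["ocrmypdf", str(pdf), str(out)], check=True)
```

Calling something like ocr_tree(Path.home() / "scans") (the folder name is only an example) would walk that directory. Keeping the two checks as plain functions on the command output makes them easy to try out without any PDFs at hand.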