#128 Add support for ocrmypdf

Milestone: Next_Release_(example)

Status: open

Owner: nobody

Labels: None

Priority: 5

Updated: 2023-02-05

Created: 2023-01-29

Creator: nwdm

Private: No

ocrmypdf is awesome!
https://ocrmypdf.readthedocs.io/en/latest/introduction.html

I found that Tesseract was better than gocr in gscan2pdf. But while investigating a way to asynchronously offload OCRing, I stumbled upon a Python project called ocrmypdf, and it is amazing!

I think it uses Tesseract, but it seems to be more accurate, and the text lines up better with the document image. It is also easier to select and copy from, because it uses a transparent font: you see the original document image, but you select the OCR'd text.

I'd love it if it was an option in gscan2pdf next to Tesseract and gOCR.

Discussion

Jeffrey Ratcliffe - 2023-01-29

I am in the process of rewriting gscan2pdf in Python. When I have finished, it should be reasonably easy to hook into ocrmypdf (which is also Python) to more accurately place the text than gscan2pdf currently does.

nwdm - 2023-01-29

I've been using it in a bash script that uses GNU find to list PDFs, then pdfinfo to ignore them unless they came from gscan2pdf, and pdftotext to ignore ones that already have text. Then it just runs the ocrmypdf command line on whatever is left.
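
For anyone who wants to try the same filter, here is a rough Python sketch of those steps. It is not the bash script itself, just one way the workflow could look. Assumptions: pdfinfo, pdftotext and ocrmypdf are on the PATH, gscan2pdf identifies itself in the Creator or Producer field reported by pdfinfo (worth checking against your own files), and overwriting the PDF in place is acceptable. The function names are made up for illustration.

```python
import subprocess
from pathlib import Path


def made_by_gscan2pdf(pdfinfo_output: str) -> bool:
    """Check the Creator/Producer lines of pdfinfo output for 'gscan2pdf'.

    Assumption: gscan2pdf writes its name into one of these two fields.
    """
    for line in pdfinfo_output.splitlines():
        key, _, value = line.partition(":")
        if key.strip() in ("Creator", "Producer") and "gscan2pdf" in value.lower():
            return True
    return False


def needs_ocr(pdfinfo_output: str, extracted_text: str) -> bool:
    """True for gscan2pdf PDFs whose pdftotext output is empty or whitespace."""
    return made_by_gscan2pdf(pdfinfo_output) and not extracted_text.strip()


def run(cmd: list[str]) -> str:
    """Run a command, raise if it fails, and return its stdout as text."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def ocr_tree(root: Path) -> None:
    """Walk root recursively and OCR every PDF that passes the filter."""
    for pdf in sorted(root.rglob("*.pdf")):
        info = run(["pdfinfo", str(pdf)])
        text = run(["pdftotext", str(pdf), "-"])  # "-" sends the text to stdout
        if needs_ocr(info, text):
            # Same path for input and output, so the file is replaced in place.
            run(["ocrmypdf", str(pdf), str(pdf)])
```

Calling `ocr_tree(Path("scans"))` would process everything under a (hypothetical) `scans` directory. The two small helper functions only look at text, so the filter can be checked without any PDFs at hand.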
