issue #4: Support document split pattern to separate a document into several parts. #20

Closed
AceTheFace wants to merge 0 commits from documentSplitPattern into master

Implements https://geimist.eu:30443/geimist/synOCR/issues/4.

This PR adds the ability to split a document into several parts before synOCR processing. The original document is searched for a given pattern. If the pattern is found, the document is split at every page on which it occurs.

For pattern searching, "pdfToText" is used. The original document is split with qpdf (which is already included with OCRmyPDF).
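The search step could look roughly like the following Python sketch. It is not the code from this PR: it assumes the `pdftotext` command-line tool is on the PATH and that it separates pages with a form feed character in its output. The function names `extract_page_texts` and `find_split_pages` are hypothetical.

```python
import re
import subprocess


def extract_page_texts(pdf_path):
    """Return one text string per page (assumes the pdftotext CLI is installed)."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],  # "-" writes the extracted text to stdout
        capture_output=True,
        text=True,
        check=True,
    )
    pages = result.stdout.split("\f")  # assumption: a form feed ends each page
    # Drop the empty piece that follows the final form feed.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages


def find_split_pages(page_texts, pattern):
    """Return the 1-based numbers of all pages whose text matches the pattern."""
    regex = re.compile(pattern)
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if regex.search(text)
    ]
```

`find_split_pages` works on plain strings, so it can be tried without any PDF tooling. Note that a scanned document has no text layer until OCR has run, in which case the search finds nothing.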

Example log:

Searching for document split pattern in document SCN_0005.pdf
0 split pages detected in file SCN_0005.pdf
Searching for document split pattern in document test.pdf
3 split pages detected in file test.pdf
splitting pdf: pages 9-z into test-4.pdf
splitting pdf: pages 6-7 into test-3.pdf
splitting pdf: pages 4-4 into test-2.pdf
splitting pdf: pages 1-2 into test-1.pdf
document split processing finished
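The page ranges in this log suggest that the pages containing the pattern (3, 5 and 8 in test.pdf) are dropped and only the pages between them are kept. That is an inference from the log, not something the PR states. Under that assumption, the range calculation can be sketched in Python as follows; `split_ranges` and `qpdf_command` are hypothetical helper names, and the page count of 10 in the usage note is made up for illustration.

```python
def split_ranges(split_pages, page_count):
    """Return qpdf page ranges for the parts between separator pages.

    Separator pages themselves are left out (assumption inferred from the log).
    The final range ends in "z", which qpdf interprets as the last page.
    """
    ranges = []
    start = 1
    for separator in sorted(set(split_pages)):
        if separator > start:
            ranges.append(f"{start}-{separator - 1}")
        start = separator + 1
    if start <= page_count:
        ranges.append(f"{start}-z")
    return ranges


def qpdf_command(source, page_range, target):
    """Build a qpdf call that copies one page range into a new file."""
    return ["qpdf", "--empty", "--pages", source, page_range, "--", target]
```

With separator pages 3, 5 and 8 in a 10-page file, `split_ranges([3, 5, 8], 10)` returns the four ranges shown in the log, in ascending order (the log lists them last to first). With no separator pages the function returns a single `1-z` range, so a caller would skip splitting in that case.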
AceTheFace added 1 commit 2 years ago
Owner

OCR preprocessing for the source file is additionally required.

geimist closed this pull request 2 years ago
This pull request cannot be reopened because the branch was deleted.
Reference: geimist/synOCR#20