Upload a PDF document or a JPG, PNG or GIF image and extract the text in it (TXT, HOCR, BOX). Read the barcodes. Convert the document or the image to a PDF/A. Validate a PDF/A. Scan a PDF searching for potential threats.

Configure how images are extracted from a PDF and how images are prepared for the OCR or the barcode reader (resolution, orientation, contrast, brightness, resizing, cropping, borders, etc.) and reuse this set of parameters programmatically through the API.

legal_en.pdf • 351.3k

legal_en.txt

The PDF contains 2 images, one for each of the 2 pages of the website's legal information, probably a photocopy. The text was read with Tesseract in page segmentation mode 6 - Assume a single uniform block of text - after resizing the images to 125% and sharpening the contours. NOTE: The document was analyzed with the trained data for both French and English because of the accent in SàRL. Click on a link to download a file.
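The settings described above can be approximated on the command line. The following is a minimal sketch that only builds the commands. It assumes ImageMagick (`convert`) for the resizing and sharpening step, which is an assumption since the tool actually used by the service is not stated, and the standard `tesseract` command for the OCR. The function names and file names are hypothetical.

```python
import shlex


def build_preprocess_command(src, dst, scale_percent=125):
    """Build an ImageMagick command that enlarges and sharpens an image.

    Assumption: ImageMagick is used here for illustration only.
    """
    return ["convert", src, "-resize", f"{scale_percent}%", "-sharpen", "0x1", dst]


def build_tesseract_command(image, output_base, languages=("fra", "eng"),
                            psm=6, outputs=("txt", "hocr")):
    """Build a tesseract command line.

    --psm 6 asks Tesseract to assume a single uniform block of text.
    Languages are joined with '+' so that several trained data files
    are used together (here French and English).
    """
    return ["tesseract", image, output_base,
            "-l", "+".join(languages),
            "--psm", str(psm),
            *outputs]


if __name__ == "__main__":
    # Print the two commands; file names are placeholders.
    print(shlex.join(build_preprocess_command("page-1.png", "page-1-prep.png")))
    print(shlex.join(build_tesseract_command("page-1-prep.png", "page-1")))
```

The commands are returned as argument lists so that they can be passed to `subprocess.run` without shell quoting issues.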

NEWDOC
2137919

Only 1 QR code is read by ZBar on the full image. With YOLO, the image is analyzed and cropped into 2 separate images, each of which ZBar can easily decode.
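The detect-then-crop step can be sketched as follows. This is an illustration under assumptions: the detection boxes stand in for the output of an object detector such as a YOLO model, the `margin` parameter and the function names are hypothetical, and the decoding step uses the `zbarimg` command shipped with ZBar.

```python
def crop_boxes(detections, image_width, image_height, margin=10):
    """Turn detector boxes into crop rectangles, padded and clamped to the image.

    detections: list of (x_min, y_min, x_max, y_max) pixel boxes, as a
    hypothetical object detector might return them.
    The margin keeps some of the quiet zone around each code.
    """
    crops = []
    for x_min, y_min, x_max, y_max in detections:
        crops.append((
            max(0, x_min - margin),
            max(0, y_min - margin),
            min(image_width, x_max + margin),
            min(image_height, y_max + margin),
        ))
    return crops


def build_zbar_command(image):
    """Build a zbarimg command that prints only the decoded data."""
    return ["zbarimg", "--quiet", "--raw", image]


if __name__ == "__main__":
    # Two made-up boxes on a 300x200 image, one box per code.
    for box in crop_boxes([(5, 5, 100, 100), (150, 20, 295, 180)], 300, 200):
        print(box)
```

Each crop rectangle can then be cut out with any imaging library and saved to its own file before it is passed to `zbarimg`.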

Ask us to add custom processing of the text extracted from your documents (plain text from a PDF, text read from images by OCR, or the content of a barcode) to verify the result, correct it, or produce formatted output in CSV, JSON or XML that you can feed directly to another service.
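As an illustration of what formatted output can look like, here is a minimal sketch that serializes extracted values to JSON and CSV with the Python standard library. The record fields (`type`, `data`) and the sample values are made up for the example and do not describe the actual output of the service.

```python
import csv
import io
import json


def to_json(records):
    """Serialize a list of dictionaries to indented JSON, keeping accents as-is."""
    return json.dumps(records, ensure_ascii=False, indent=2)


def to_csv(records, fieldnames):
    """Serialize a list of dictionaries to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical records, e.g. one per decoded barcode.
    records = [
        {"type": "QR-Code", "data": "EXAMPLE-1"},
        {"type": "QR-Code", "data": "EXAMPLE-2"},
    ]
    print(to_json(records))
    print(to_csv(records, ["type", "data"]))
```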

All features are available for free in the interface of your personal space, or programmatically as a paid service through a simple REST API. See the User's Guide.

Tesseract is an open-source optical character recognition engine sponsored by Google since 2006.

ZBar is open-source software for reading barcodes (EAN-13/UPC-A, UPC-E, EAN-8, Code 128, Code 39, Interleaved 2 of 5 and QR Code).

YOLO (You Only Look Once) is an image processing system for object detection with royalty-free implementations.

PDF/A is an ISO-standardized version of the PDF format specialized for the archiving and long-term preservation of electronic documents.

The veraPDF consortium, led by the Open Preservation Foundation and the PDF Association, was created in response to the EU Commission's PREFORMA challenge to develop an open-source validator for the PDF/A format.

Ghostscript is a suite of software for processing PostScript and PDF files.

Poppler provides a set of commands for extracting the pages, the text and the images of PDF files.
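For reference, the three kinds of extraction mentioned above map to separate Poppler commands. The sketch below only builds the command lines; the function name, file names and output prefixes are hypothetical, and the options shown are one reasonable choice among others.

```python
def build_poppler_commands(pdf, prefix):
    """Return the Poppler commands for splitting pages and extracting text and images.

    pdfseparate writes one PDF per page (%d is replaced by the page number),
    pdftotext -layout keeps the physical layout of the text,
    pdfimages -png saves the embedded images as PNG files.
    """
    return {
        "pages": ["pdfseparate", pdf, f"{prefix}-page-%d.pdf"],
        "text": ["pdftotext", "-layout", pdf, f"{prefix}.txt"],
        "images": ["pdfimages", "-png", pdf, f"{prefix}-img"],
    }


if __name__ == "__main__":
    for name, command in build_poppler_commands("legal_en.pdf", "legal_en").items():
        print(name, command)
```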

ClamAV is a free, open-source antivirus engine.
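A scan with ClamAV can be scripted through the `clamscan` command, which reports its verdict in the exit code: 0 when no virus is found, 1 when a virus is found, 2 when an error occurred. The sketch below is a minimal illustration, not a description of how the service runs its scans; the function names are hypothetical and `scan_file` requires ClamAV to be installed.

```python
import subprocess

# Meaning of the clamscan exit codes.
CLAMSCAN_STATUS = {0: "clean", 1: "infected", 2: "error"}


def interpret_clamscan_exit(code):
    """Map a clamscan exit code to a short status string."""
    return CLAMSCAN_STATUS.get(code, "unknown")


def scan_file(path):
    """Scan one file with clamscan and return its status.

    --no-summary suppresses the statistics printed at the end of the scan.
    """
    result = subprocess.run(
        ["clamscan", "--no-summary", path],
        capture_output=True,
        text=True,
    )
    return interpret_clamscan_exit(result.returncode)
```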

All communications are encrypted.

The files you upload are inaccessible to others and the files which are processed and generated by the API are automatically deleted.

Would you like to add OCR text reading to your application, with control over how images are extracted from a PDF and how they are prepared for the OCR or the barcode reader? tesseractor.com is software developed by a publisher open to sharing knowledge and code. To contact mcPaLo, click here.