User's guide

Scan the text

scantext

POST https://tesseractor.com/api/v1/scantext?login=&password=

login	Your identification code.
password	Your password.
multipart/form-data
file	Binary content of a PDF, or of a JPG, PNG or GIF image.
lang	Language of the text.
psm	Scan mode.
out	Output type.
firstpage	First page to process in a PDF.
lastpage	Last page to process in a PDF.
resolution	Resolution in dpi of the image generated for each page of a PDF.
text	Directly extract just the plain text in a PDF.
images	Directly extract just the images in a PDF.
rotate	Rotate images.
crop	Crop images, or limit plain text extraction to an area.
reframe	Reframe images on a background.
unborder	Remove border lines.
resize	Resize images.
negate	Invert colors.
normalize	Add contrast to the colors.
colorspace	Convert to grayscale.
unsharp	Sharpen the contours.
dots	Remove white dots.

lang - language of the text: eng, fra, deu, spa, ita or rus. Specify several languages by separating them with a +, e.g. eng+fra. NOTE: The order is important.

psm - Page Segmentation Mode: 1 - Automatic page segmentation with OSD (Orientation and Script Detection), 3 - Fully automatic page segmentation, but no OSD, 4 - Assume a single column of text of variable sizes, 6 - Assume a single uniform block of text.

out - selection of the output type: txt, hocr or box.
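Since a request with an unsupported value is only rejected after the upload, it can be convenient to check lang, psm and out locally first. The following is a minimal sketch; the function names check_lang, check_psm and check_out are hypothetical helpers, not part of the service, and they only mirror the values listed above.

```shell
#!/bin/sh
# Hypothetical pre-flight checks (not part of the service): they only
# mirror the values listed in this guide.

# Succeeds if $1 is one or more known language codes joined by +.
check_lang() {
    case "$1" in
        ''|+*|*+|*++*|*[!a-z+]*) return 1 ;;   # empty, stray + or odd characters
    esac
    _rest=$1
    while [ -n "$_rest" ]; do
        _code=${_rest%%+*}                     # text before the first +
        case "$_code" in
            eng|fra|deu|spa|ita|rus) ;;
            *) return 1 ;;
        esac
        case "$_rest" in
            *+*) _rest=${_rest#*+} ;;          # drop the code just checked
            *)   _rest= ;;
        esac
    done
    return 0
}

# Succeeds if $1 is one of the page segmentation modes listed above.
check_psm() {
    case "$1" in 1|3|4|6) return 0 ;; *) return 1 ;; esac
}

# Succeeds if $1 is one of the output types listed above.
check_out() {
    case "$1" in txt|hocr|box) return 0 ;; *) return 1 ;; esac
}
```

A call such as check_lang "eng+fra" && check_psm 6 && check_out txt succeeds, whereas check_lang "eng+xyz" fails before any request is sent.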

Specify the extraction mode of each page of a PDF:

firstpage : Number of the first page to process,
lastpage : Number of the last page to process,
resolution : resolution in dpi of the image generated for each page - 50, 75, 100, 125, 150 or 200. IMPORTANT: If a page contains only one image and no text, the image is always extracted directly from the document.
images : 1 - directly extract only the images.
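The page range and resolution can be assembled and checked before the call. This is only a sketch: pdf_page_args and is_uint are hypothetical helper names, and the accepted resolutions are simply the ones listed above.

```shell
#!/bin/sh
# Hypothetical helper: prints the -F options for a PDF page range.

# Succeeds if $1 is a non-empty string of digits.
is_uint() {
    case "$1" in ''|*[!0-9]*) return 1 ;; esac
}

# Usage: pdf_page_args FIRSTPAGE LASTPAGE RESOLUTION
pdf_page_args() {
    is_uint "$1" && is_uint "$2" || return 1
    case "$3" in
        50|75|100|125|150|200) ;;     # resolutions listed in this guide
        *) return 1 ;;
    esac
    printf '%s\n' "-F firstpage=$1 -F lastpage=$2 -F resolution=$3"
}
```

The printed options contain no spaces inside a value, so they can be spliced unquoted into a command, e.g. curl ... -F "lang=eng" -F "psm=3" $(pdf_page_args 2 5 150) -F "file=@document.pdf", where document.pdf is a placeholder name.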

Activate the processing options of each image before analysis:

rotate : 180 to flip the image, -90 or 90 to rotate it to the left or to the right,
crop : cut the image to the size specified by a width and a height separated by an x from a position specified by x and y coordinates preceded by a + in pixels for the given resolution, e.g. 640x200+50+80,
reframe : reframe the image on a background with a blur level between 1 and 20, e.g. 5,
unborder : remove the borders with, separated by an x, the maximum width and height of a text between 10 and 1000 pixels, e.g. 30x30,
resize : resize the image by 50, 75, 125, 150 or 200 %,
negate : 1 - invert colors,
normalize : 1 - add contrast to the colors,
colorspace : 1 - convert the image to grayscale,
unsharp : 1 - sharpen the contours,
dots : 1 - remove white dots.

IMPORTANT: Image processing options are applied in the order listed above.
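The value formats described above (geometry strings, ranges, fixed lists) are easy to mistype. The sketch below checks a single option locally; check_image_opt, is_uint and in_range are hypothetical names, and accepting 90 for rotate is an assumption about the rotation to the right.

```shell
#!/bin/sh
# Hypothetical local check of one image processing option.

is_uint() { case "$1" in ''|*[!0-9]*) return 1 ;; esac; }

# Succeeds if $1 is an integer between $2 and $3 inclusive.
in_range() { is_uint "$1" && [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]; }

# Usage: check_image_opt NAME VALUE
check_image_opt() {
    _name=$1 _value=$2
    case "$_name" in
        rotate)     # 90 is assumed to be the rotation to the right
            case "$_value" in 180|-90|90) ;; *) return 1 ;; esac ;;
        crop)       # WIDTHxHEIGHT+X+Y, e.g. 640x200+50+80
            printf '%s\n' "$_value" |
                grep -Eq '^[0-9]+x[0-9]+\+[0-9]+\+[0-9]+$' ;;
        reframe)    # blur level between 1 and 20
            in_range "$_value" 1 20 ;;
        unborder)   # WIDTHxHEIGHT, each between 10 and 1000 pixels
            case "$_value" in *x*) ;; *) return 1 ;; esac
            in_range "${_value%%x*}" 10 1000 &&
                in_range "${_value#*x}" 10 1000 ;;
        resize)
            case "$_value" in 50|75|125|150|200) ;; *) return 1 ;; esac ;;
        negate|normalize|colorspace|unsharp|dots)
            [ "$_value" = 1 ] ;;
        *)          # unknown option name
            return 1 ;;
    esac
}
```

For example, check_image_opt crop 640x200+50+80 succeeds, while check_image_opt unborder 5x30 fails because 5 is below the minimum of 10 pixels.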

To extract the plain text in a PDF, use the following options:

text : 1 - directly extract just the plain text in a PDF,
firstpage : Number of the first page to process,
lastpage : Number of the last page to process,
resolution : resolution of the image of a page of a PDF in dpi - 50, 75, 100, 125, 150 or 200,
crop : extract the text in the area defined by a width and a height separated by an x from a position specified by x and y coordinates preceded by a + in pixels for the given resolution, e.g. 640x200+50+80.
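A direct text extraction can be wrapped in a small shell function. This is a sketch under assumptions: pdf_text is a hypothetical name, the login and password are assumed to contain only URL-safe characters, and lang and psm are assumed to be unnecessary when text=1, which this guide does not state. The CURL variable lets you substitute another command to inspect the request without sending it.

```shell
#!/bin/sh
# CURL can be overridden, e.g. with a command that prints its arguments.
CURL=${CURL:-curl}

# Usage: pdf_text LOGIN PASSWORD FILE FIRSTPAGE LASTPAGE [CROP]
# Assumes LOGIN and PASSWORD contain only URL-safe characters.
pdf_text() {
    _url="https://tesseractor.com/api/v1/scantext?login=$1&password=$2"
    _file=$3 _first=$4 _last=$5 _crop=${6:-}

    # Build the form fields; crop is only sent when given.
    set -- -F "text=1" -F "firstpage=$_first" -F "lastpage=$_last"
    if [ -n "$_crop" ]; then
        set -- "$@" -F "crop=$_crop"
    fi

    "$CURL" -s --fail --show-error -X POST "$_url" "$@" \
        -F "file=@$_file" -o -
}
```

With the placeholder credentials used in this guide, pdf_text abcdef ABCDEF legal_fr.pdf 1 2 would request the plain text of the first two pages on standard output.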

To understand the effects of these parameters, test them in the interface of your personal space.

fox.jpg

$ curl -s --fail --show-error -X POST "https://tesseractor.com/api/v1/scantext?login=abcdef&password=ABCDEF" -F "lang=eng" -F "psm=6" -F "file=@fox.jpg" -o -
The quick brown fox
jumps over
the lazy dog.

The text is read with Tesseract in mode 6 - Assume a single uniform block of text - with the trained data for the English language.

legal_en.pdf

$ curl -s --fail --show-error -X POST "https://tesseractor.com/api/v1/scantext?login=abcdef&password=ABCDEF" -F "lang=fra+eng" -F "psm=4" -F "out=hocr" -F "resize=125" -F "unsharp=1" -F "file=@legal_en.pdf" -o ocr.html

The PDF is a photocopy which contains 1 image per page. The text is read with Tesseract in mode 4 - Assume a single column of text of variable sizes - after resizing the images to 125% and sharpening the contours. The output is formatted in HTML. NOTE: The document was analyzed with the trained data for the French and the English languages, because of the accent on SàRL.

Display the HTML in your browser:

$ firefox ocr.html

Try adding -F "images=1" to the command line. Since the PDF contains only one image per page, the result is the same and the processing is slightly faster.

legal_fr.pdf

$ curl -s --fail --show-error -X POST "https://tesseractor.com/api/v1/scantext?login=abcdef&password=ABCDEF" -F "lang=fra+eng" -F "psm=6" -F "out=hocr" -F "resolution=150" -F "file=@legal_fr.pdf" -o ocr.html

This PDF was produced with the browser's Print to file function from the Legal information page of the website. If you pass the option images=1, no image is found, and the result is an empty file. NOTE: If you upload this PDF in the interface of your personal space, you can directly retrieve the plain text it contains without analyzing it.

Download the code of the sendpost and file_mime_type functions from the iZend library. Copy the files into the directory of your application.

sendhttp.php

filemimetype.php

NOTE: See the page Call the service API for a description of the sendpost and file_mime_type functions.

Add the file scantext.php with the following content:

scantext.php

require_once 'sendhttp.php';
require_once 'filemimetype.php';

Loads the code of the sendpost and file_mime_type functions.

function scantext($login, $password, $file, $lang='eng', $psm='3', $out='txt', $output='ocr.txt', $params=false) {

Defines the function scantext. $login is your identification code. $password is your password. $file is the pathname of the PDF, JPEG, PNG or GIF file to convert. $lang is the language of the text, e.g. 'eng' or 'eng+fra'. $psm specifies the analysis mode of the text, i.e. 1, 3, 4 or 6. $out is the output format, i.e. txt, hocr or box. $output is the pathname of the file which will contain the text or the HTML returned by the analysis of $file. $params is an optional associative array containing the names and the values of the parameters specifying the extraction mode of each page of a PDF and the processing options of each image before analysis, e.g. array('resolution' => 125, 'unsharp' => true).

$curl = 'https://tesseractor.com/api/v1/scantext' . '?' . 'login=' . urlencode($login) . '&' . 'password=' . urlencode($password);

Sets $curl to the URL of the scantext action with the identification code and the password of the user's account. $login and $password must be escaped.

$args = array(
'lang' => $lang,
'psm' => $psm,
'out' => $out,
);
// $params defaults to false; merge it only when an array is given.
if (is_array($params)) {
$args = array_merge($args, $params);
}

Prepares the list of arguments of the POST.

$files=array('file' => array('name' => basename($file), 'tmp_name' => $file, 'type' => file_mime_type($file)));

Prepares the list of files attached to the POST: file - the PDF, JPEG, PNG or GIF to analyze with the name of the file, the pathname of the file and its MIME type.

$response=sendpost($curl, $args, $files);

Sends the HTTP request with sendpost. The arguments login and password are already in $curl.

if (!$response or $response[0] != 200) {
return false;
}

If $response is false, the server is unreachable. If $response[0] doesn't contain the HTTP return code 200 OK, an execution error has occurred. In both cases, scantext returns false.

return @file_put_contents($output, $response[2]);
}

Returns the number of bytes written if the text or the HTML returned by the request could be written to the output file, false otherwise.
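For comparison, the same status check can be sketched at the command line with curl's -w '%{http_code}' option, which prints the HTTP status code after the transfer. The function name scantext_sh is hypothetical, the login and password are assumed to contain only URL-safe characters, and the CURL variable only exists so that the command can be replaced when trying the function without sending a request.

```shell
#!/bin/sh
CURL=${CURL:-curl}

# Usage: scantext_sh LOGIN PASSWORD FILE LANG PSM OUT OUTPUT
# Assumes LOGIN and PASSWORD contain only URL-safe characters.
scantext_sh() {
    _url="https://tesseractor.com/api/v1/scantext?login=$1&password=$2"

    # The response body goes to the output file; -w prints the HTTP
    # status code on standard output once the transfer is done.
    _status=$("$CURL" -s -X POST "$_url" \
        -F "lang=$4" -F "psm=$5" -F "out=$6" -F "file=@$3" \
        -o "$7" -w '%{http_code}') || return 1    # server unreachable

    if [ "$_status" != 200 ]; then
        rm -f "$7"      # discard the body of an error response
        return 1
    fi
}
```

As with the PHP function, a non-zero return means either that the request could not be sent or that the service answered with a code other than 200.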

EXAMPLE

Assuming you have saved the files sendhttp.php, filemimetype.php and scantext.php in the current directory, run PHP in interactive mode, load the scantext function and call it. Pass as arguments your identification code and password, the pathname of a PDF, JPEG, PNG or GIF file, a language, an analysis mode, an output type and the name of the output file:

$ php -a
php > require_once 'scantext.php';
php > scantext('abcdef', 'ABCDEF', 'file.pdf', 'eng', '4', 'hocr', 'ocr.html', array('resolution' => 125, 'unsharp' => true));
php > quit

Display the HTML result in your browser:

$ firefox ocr.html

Add the following tag to the <head> section of the HTML output file to display each recognized word in red when the mouse moves over it:

<style>
.ocrx_word:hover {color:#f30}
</style>
