Scan the text
scantext
POST https://tesseractor.com/api/v1/scantext?login=&password=
login | Your identification code. |
---|---|
password | Your password. |
multipart/form-data | |
file | Binary content of the PDF or of the JPG, PNG or GIF image. |
lang | Language of the text. |
psm | Scan mode. |
out | Output type. |
firstpage | First page to process in a PDF. |
lastpage | Last page to process in a PDF. |
resolution | Resolution in dpi of the image generated for each page of a PDF. |
text | Directly extract just the plain text in a PDF. |
images | Directly extract just the images in a PDF. |
rotate | Rotate images. |
crop | Crop images. Cut the text. |
reframe | Reframe images on a background. |
unborder | Remove border lines. |
resize | Resize images. |
negate | Invert colors. |
normalize | Add contrast to the colors. |
colorspace | Convert to grayscale. |
unsharp | Sharpen the contours. |
dots | Remove white dots. |
lang - language of the text: eng, fra, deu, spa, ita or rus.
Specify several languages by separating them with a +, e.g. eng+fra.
NOTE: The order is important.
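As a small illustration of the lang syntax, the sketch below joins language codes with a + while keeping the order in which they are given. The helper name join_langs is hypothetical and not part of the service.

```shell
# Hypothetical helper: joins language codes with "+", preserving their order.
join_langs() {
    # "$*" joins the arguments with the first character of IFS;
    # the subshell keeps the IFS change local.
    ( IFS=+; echo "$*" )
}

join_langs fra eng    # prints fra+eng
```

The result can be passed as -F "lang=$(join_langs fra eng)" in the curl commands shown in this page.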
psm - Page Segmentation Mode:
1 - Automatic page segmentation with OSD (Orientation and Script Detection),
3 - Fully automatic page segmentation, but no OSD,
4 - Assume a single column of text of variable sizes,
6 - Assume a single uniform block of text.
out - selection of the output type: txt, hocr or box.
Specify the extraction mode of each page of a PDF:
firstpage: number of the first page to process,
lastpage: number of the last page to process,
resolution: resolution in dpi of the image generated for each page - 50, 75, 100, 125, 150 or 200.
IMPORTANT: If a page contains only one image and no text, the image is always extracted directly from the document.
images: 1 - directly extract only the images.
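The page-selection parameters can be sketched as a small shell helper that checks the resolution against the values listed above and prints the matching form fields. The function name pdf_page_fields is hypothetical. Only the parameter names firstpage, lastpage and resolution come from this page.

```shell
# Hypothetical helper: prints the -F fields selecting a page range of a PDF.
# Usage: pdf_page_fields FIRST LAST DPI
pdf_page_fields() {
    case "$3" in
        50|75|100|125|150|200) ;;   # resolutions listed in this page
        *) echo "unsupported resolution: $3" >&2; return 1 ;;
    esac
    printf '%s\n' "-F firstpage=$1 -F lastpage=$2 -F resolution=$3"
}

# Example (not run here): analyze pages 2 to 4 at 150 dpi.
# curl -s --fail --show-error -X POST \
#   "https://tesseractor.com/api/v1/scantext?login=abcdef&password=ABCDEF" \
#   -F "lang=eng" $(pdf_page_fields 2 4 150) -F "file=@legal_en.pdf" -o ocr.txt
```

The unquoted $(pdf_page_fields ...) in the commented command relies on word splitting, which is acceptable here because the values are plain numbers.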
Activate the processing options of each image before analysis:
rotate: 180 to flip the image, -90 to rotate it to the left or to the right,
crop: cut the image to the size specified by a width and a height separated by an x, from a position specified by x and y coordinates each preceded by a +, in pixels for the given resolution, e.g. 640x200+50+80,
reframe: reframe the image on a background with a blur level between 1 and 20, e.g. 5,
unborder: remove the border lines, given the maximum width and height of a text, each between 10 and 1000 pixels and separated by an x, e.g. 30x30,
resize: resize the image to 50, 75, 125, 150 or 200% of its size,
negate: 1 - invert colors,
normalize: 1 - add contrast to the colors,
colorspace: 1 - convert the image to grayscale,
unsharp: 1 - sharpen the contours,
dots: 1 - remove white dots.
IMPORTANT: Image processing options are run in the above order.
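The crop value follows the WIDTHxHEIGHT+X+Y pattern described above. A minimal sketch of a helper that builds it from four numbers is shown below. The name crop_geometry is hypothetical and the validation is only an assumption about what the service accepts.

```shell
# Hypothetical helper: formats a crop area as WIDTHxHEIGHT+X+Y.
# Usage: crop_geometry WIDTH HEIGHT X Y
crop_geometry() {
    [ "$#" -eq 4 ] || { echo "usage: crop_geometry WIDTH HEIGHT X Y" >&2; return 1; }
    # Reject anything that is not a non-negative integer.
    for n in "$@"; do
        case "$n" in
            ''|*[!0-9]*) echo "not a number: $n" >&2; return 1 ;;
        esac
    done
    printf '%sx%s+%s+%s\n' "$1" "$2" "$3" "$4"
}

crop_geometry 640 200 50 80    # prints 640x200+50+80
```

Since rotate is listed before crop and the options run in the listed order, the coordinates presumably apply to the already rotated image. Checking this in the interface of your personal space is the safest way to confirm it.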
To extract the plain text in a PDF, use the following options:
text: 1 - directly extract just the plain text in a PDF,
firstpage: number of the first page to process,
lastpage: number of the last page to process,
resolution: resolution in dpi of the image of a page of a PDF - 50, 75, 100, 125, 150 or 200,
crop: extract the text in the area defined by a width and a height separated by an x, from a position specified by x and y coordinates each preceded by a +, in pixels for the given resolution, e.g. 640x200+50+80.
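A request using the text option might look like the following sketch. The wrapper name pdf_text and the DRY_RUN switch are hypothetical. Whether lang or psm are still expected when text=1 is passed is not stated in this page, so they are left out here as an assumption.

```shell
# Hypothetical wrapper: extracts the plain text of a page range without OCR.
# Usage: pdf_text LOGIN PASSWORD FILE FIRST LAST
# With DRY_RUN=1, the curl command is printed instead of being executed.
# NOTE: LOGIN and PASSWORD are inserted as-is; URL-encode them beforehand
# if they contain special characters.
pdf_text() {
    # Rebuild the positional parameters as the full curl command line.
    set -- curl -s --fail --show-error -X POST \
        "https://tesseractor.com/api/v1/scantext?login=$1&password=$2" \
        -F "text=1" -F "firstpage=$4" -F "lastpage=$5" -F "file=@$3" -o -
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

Running DRY_RUN=1 pdf_text abcdef ABCDEF legal_fr.pdf 1 3 shows the command that would be sent, which helps when checking the parameters before spending a request.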
To see the effect of each of these parameters, test them in the interface of your personal space.
$ curl -s --fail --show-error -X POST "https://tesseractor.com/api/v1/scantext?login=abcdef&password=ABCDEF" -F "lang=eng" -F "psm=6" -F "file=@fox.jpg" -o -
The quick brown fox
jumps over
the lazy dog.
The text is read with Tesseract in mode 6 - Assume a single uniform block of text - with the trained data for the English language.
legal_en.pdf • 351.3k
$ curl -s --fail --show-error -X POST "https://tesseractor.com/api/v1/scantext?login=abcdef&password=ABCDEF" -F "lang=fra+eng" -F "psm=4" -F "out=hocr" -F "resize=125" -F "unsharp=1" -F "file=@legal_en.pdf" -o ocr.html
The PDF is a photocopy which contains 1 image per page. The text is read with Tesseract in mode 4 - Assume a single column of text of variable sizes - after resizing the images to 125% and sharpening the contours. The output is formatted in HTML. NOTE: The document was analyzed with the trained data for the French and the English languages, because of the accent on SàRL.
Display the HTML in your browser:
$ firefox ocr.html
Try adding -F "images=1" to the command line.
Since the PDF contains only one image per page, the process is the same, only slightly faster.
$ curl -s --fail --show-error -X POST "https://tesseractor.com/api/v1/scantext?login=abcdef&password=ABCDEF" -F "lang=fra+eng" -F "psm=6" -F "out=hocr" -F "resolution=150" -F "file=@legal_fr.pdf" -o ocr.html
This PDF is the result of the Print to file function of the browser on the page Legal information of the website.
If you pass the option images=1, no image is found and the result is an empty file.
NOTE: If you upload this PDF in the interface of your personal space, you can directly retrieve the plain text it contains without analyzing it.
Download the code of the sendpost and file_mime_type functions from the iZend library.
Copy the files in the space of your application.
NOTE: See the page Call the service API for a description of the sendpost and file_mime_type functions.
Add the file scantext.php with the following content:
- require_once 'sendhttp.php';
- require_once 'filemimetype.php';
Loads the code of the sendpost and file_mime_type functions.
- function scantext($login, $password, $file, $lang='eng', $psm='3', $out='txt', $output='ocr.txt', $params=false) {
Defines the function scantext.
$login is your identification code. $password is your password.
$file is the pathname of the PDF, JPEG, PNG or GIF file to convert.
$lang is the language of the text, e.g. 'eng' or 'eng+fra'.
$psm specifies the analysis mode of the text, i.e. 1, 3, 4 or 6.
$out is the output format, i.e. txt, hocr or box.
$output is the pathname of the file which will contain the text or the HTML returned by the analysis of $file.
$params is an associative array containing the names and the values of the parameters specifying the extraction mode of each page of a PDF and the processing options of each image before analysis, e.g. array('resolution' => 125, 'unsharp' => true).
- $curl = 'https://tesseractor.com/api/v1/scantext' . '?' . 'login=' . urlencode($login) . '&' . 'password=' . urlencode($password);
Sets $curl to the URL of the scantext action with the identification code and the password of the user's account.
$login and $password must be escaped.
- $args = array(
- 'lang' => $lang,
- 'psm' => $psm,
- 'out' => $out,
- );
- $args = array_merge($args, is_array($params) ? $params : array());
Prepares the list of arguments of the POST.
- $files=array('file' => array('name' => basename($file), 'tmp_name' => $file, 'type' => file_mime_type($file)));
Prepares the list of files attached to the POST: file - the PDF, JPEG, PNG or GIF to analyze, with the name of the file, the pathname of the file and its MIME type.
- $response=sendpost($curl, $args, $files);
Sends the HTTP request with sendpost.
The arguments login and password are already in $curl.
- if (!$response or $response[0] != 200) {
- return false;
- }
If $response is false, the server is unreachable.
If $response[0] doesn't contain the HTTP return code 200 OK, an execution error has occurred.
In case of error, scantext returns false.
- return @file_put_contents($output, $response[2]) !== false;
- }
Returns true if the text or the HTML returned by the request could be written to the output file, false otherwise.
EXAMPLE
Assuming you have saved the files sendhttp.php, filemimetype.php and scantext.php in the current directory, run PHP in interactive mode, load the scantext function and call it with your identification code and password, the pathname of a PDF, JPEG, PNG or GIF file, a language, an analysis mode, an output type and the name of the output file as arguments:
$ php -a
php > require_once 'scantext.php';
php > scantext('abcdef', 'ABCDEF', 'file.pdf', 'eng', '4', 'hocr', 'ocr.html', array('resolution' => 125, 'unsharp' => true));
php > quit
Display the resulting HTML in your browser:
$ firefox ocr.html
Add the following tag in the <head> section of the HTML output file to display each word read in red when the mouse moves over it:
<style>
.ocrx_word:hover {color:#f30}
</style>