@@ -52,7 +53,7 @@ createWorker is a factory function that creates a tesseract worker, a worker is
A Worker performs the OCR-related tasks. It takes a few steps to set up a Worker before it is fully functional. The full flow is:
- load
- FS functions // optional
- loadLanguage
- initialize
@@ -80,6 +82,23 @@ Each function is async, so using async/await or Promise is required. When it is
A jobId is generated by Tesseract.js, but you can supply your own when calling any of the functions above.
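
As a rough illustration of the setup steps with a caller-supplied jobId, the sketch below wraps the first steps of the flow in a helper. The helper name `prepareWorker` and the job id string are hypothetical; the sketch assumes a worker that exposes the `load`, `loadLanguage` and `initialize` methods described in this document.

```javascript
// Hypothetical helper: runs the setup steps in order on an existing worker.
// `jobId` is optional; when omitted, Tesseract.js generates one itself.
async function prepareWorker(worker, lang, jobId) {
  await worker.load(jobId);         // load the tesseract.js-core scripts
  await worker.loadLanguage(lang);  // fetch the language model
  await worker.initialize(lang);    // initialize the engine with that language
  return worker;
}
```

It could be called as `await prepareWorker(worker, 'eng', 'my-load-job')`, after which the worker is ready for `recognize` or `detect`.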
<a name="worker-load"></a>
### Worker.load(jobId): Promise
Worker.load() loads the tesseract.js-core scripts (downloading them from a remote source if they are not present) and makes the Web Worker/Child Process ready for the next action.
**Arguments:**
- `jobId` Please see details above
**Examples:**
```javascript
(async () => {
await worker.load();
})();
```
<a name="worker-writeText"></a>
### Worker.writeText(path, text, jobId): Promise
@@ -206,7 +225,7 @@ Worker.setParameters() set parameters for Tesseract API (using SetVariable()), i
- `params` an object with key and value of the parameters
@@ -215,8 +234,11 @@ Worker.setParameters() set parameters for Tesseract API (using SetVariable()), i
| tessedit\_char\_whitelist | string | '' | restricts the result to the listed characters; useful when the content of the image is limited to a known character set |
| preserve\_interword\_spaces | string | '0' | '0' or '1'; when '1', keeps the spaces between words |
| user\_defined\_dpi | string | '' | defines a custom DPI; use it to fix **Warning: Invalid resolution 0 dpi. Using 70 instead.** |
| tessjs\_create\_hocr | string | '1' | '0' or '1'; when '1', tesseract.js includes hocr in the result |
| tessjs\_create\_tsv | string | '1' | '0' or '1'; when '1', tesseract.js includes tsv in the result |
| tessjs\_create\_box | string | '0' | '0' or '1'; when '1', tesseract.js includes box in the result |
| tessjs\_create\_unlv | string | '0' | '0' or '1'; when '1', tesseract.js includes unlv in the result |
| tessjs\_create\_osd | string | '0' | '0' or '1'; when '1', tesseract.js includes osd in the result |

This list is incomplete. As Tesseract.js passes parameters to the Tesseract engine, all parameters supported by the underlying version of Tesseract should also be supported by Tesseract.js. (Note that parameters marked as “init only” in Tesseract documentation cannot be set by `setParameters` or `recognize`.)
**Examples:**
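
A minimal sketch of passing a `params` object, assuming an already initialized worker. The helper name `useDigitsOnly` and the chosen values are hypothetical; only parameter names from the table above are used.

```javascript
// Hypothetical helper: restrict recognition to digits on an initialized worker.
async function useDigitsOnly(worker) {
  await worker.setParameters({
    tessedit_char_whitelist: '0123456789', // only these characters may appear
    preserve_interword_spaces: '1',        // keep the spaces between words
  });
}
```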
@@ -240,9 +262,8 @@ Figures out what words are in `image`, where the words are in `image`, etc.
**Arguments:**
- `image` see [Image Format](./image-format.md) for more details.
- `options` an object of customized options
  - `rectangle` an object specifying the region of the image to recognize; it should contain top, left, width and height (see example below)
  - `output` an object specifying which output formats to return (by default `text`, `blocks`, `hocr`, and `tsv` are returned)
- `jobId` Please see details above
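
The shape of the `rectangle` option can be sketched with a small builder. The helper name `rectangleOptions` and its validation (finite, non-negative numbers) are assumptions made for illustration, not part of the library.

```javascript
// Hypothetical helper: build the `options` argument for worker.recognize().
function rectangleOptions(left, top, width, height) {
  const rectangle = { left, top, width, height };
  // Assumed sanity check: every field must be a finite, non-negative number.
  for (const [name, value] of Object.entries(rectangle)) {
    if (!Number.isFinite(value) || value < 0) {
      throw new RangeError(`${name} must be a non-negative number`);
    }
  }
  return { rectangle };
}
```

The result would then be passed as the second argument, for example `worker.recognize(image, rectangleOptions(0, 0, 500, 250))`.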
**Output:**
@@ -252,7 +273,8 @@ Figures out what words are in `image`, where the words are in `image`, etc.
```javascript
const { createWorker } = Tesseract;
(async () => {
const worker = await createWorker(); // await is harmless if createWorker() returns the worker synchronously
await worker.load();
await worker.loadLanguage('eng');
await worker.initialize('eng');
const { data: { text } } = await worker.recognize(image);
@@ -265,7 +287,8 @@ With rectangle
@@ -290,7 +313,8 @@ Worker.detect() does OSD (Orientation and Script Detection) to the image instead
```javascript
const { createWorker } = Tesseract;
(async () => {
const worker = await createWorker(); // await is harmless if createWorker() returns the worker synchronously
await worker.load();
await worker.loadLanguage('eng');
await worker.initialize('eng');
const { data } = await worker.detect(image);
@@ -337,7 +361,7 @@ Scheduler.addWorker() adds a worker into the worker pool inside scheduler, it is
Tesseract.js is the JavaScript/WebAssembly port of the Tesseract OCR engine. We do not edit the underlying Tesseract recognition engine in any way. Therefore, if you encounter bugs caused by the Tesseract engine, you may open an issue here to raise awareness among other users, but fixing them is outside the scope of this repository.
If you encounter a Tesseract bug you would like to see fixed, you should confirm that the behavior is the same in the [main (CLI) version](https://github.com/tesseract-ocr/tesseract) of Tesseract and then open an issue in that repository.
# Trained Data
## How does tesseract.js download and keep \*.traineddata?
The language model is downloaded by `worker.loadLanguage()` and you need to pass the langs to `worker.initialize()`.
@@ -16,5 +9,34 @@ During the downloading of language model, Tesseract.js will first check if \*.tr
## How can I train my own \*.traineddata?
See the documentation from the main [Tesseract project](https://tesseract-ocr.github.io/tessdoc/) for training instructions.
For tesseract.js v2, check [TrainingTesseract 4.00](https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00)
For tesseract.js v1, check [Training Tesseract 3.03–3.05](https://tesseract-ocr.github.io/tessdoc/Training-Tesseract-3.03%E2%80%933.05)
## How can I get HOCR, TSV, Box, UNLV, OSD?
Starting from 2.0.0-beta.1, you can get all of this information in the final result.
The main Tesseract.js functions (e.g. recognize, detect) take an `image` parameter, which should be an image-like value. What counts as "image-like" depends on whether the code runs in the browser or in Node.js; the supported image formats and data types are listed below.
Supported image formats: **bmp, jpg, png, pbm, webp**
On a browser, an image can be:
- an `img` or `canvas` element
- a `File` object (from a file `<input>`)
- a `Blob` object
- a path or URL to an accessible image
- a base64-encoded image matching the regexp `data:image\/([a-zA-Z]*);base64,([^"]*)`
For browser and Node, supported data types are:
- string with a base64-encoded image (matching the regexp `data:image\/([a-zA-Z]*);base64,([^"]*)`)
- buffer
In Node.js, an image can be:
- a path to a local image
- a `Buffer` storing a binary image
- a base64-encoded image matching the regexp `data:image\/([a-zA-Z]*);base64,([^"]*)`
For browser only, supported data types are:
- `File` or `Blob` object
- `img` or `canvas` element
For Node only, supported data types are:
- string containing a path to a local image
Note: images must be a supported image format **and** a supported data type. For example, a buffer containing a png image is supported. A buffer containing raw pixel data is not supported.
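
To make the base64 rule concrete, the sketch below checks a value against the regexp quoted in the lists above. The helper name `isBase64Image` is hypothetical, and the check only mirrors the documented pattern; it does not verify that the payload decodes to a supported image format.

```javascript
// Pattern quoted in the lists above, written as a JavaScript RegExp literal.
const BASE64_IMAGE_PATTERN = /data:image\/([a-zA-Z]*);base64,([^"]*)/;

// Hypothetical helper: true when `value` is a string matching the pattern.
function isBase64Image(value) {
  return typeof value === 'string' && BASE64_IMAGE_PATTERN.test(value);
}
```

With this helper, `isBase64Image('data:image/png;base64,iVBORw0KGgo=')` is `true`, while a plain path such as `'./images/sample.png'` gives `false`.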
@@ -33,6 +33,6 @@ A string specifying the location of the [worker.js](./dist/worker.min.js) file.
A string specifying the location of the tesseract language files, with default value 'https://tessdata.projectnaptha.com/4.0.0'. Language file URLs are calculated according to the formula `langPath + '/' + langCode + '.traineddata.gz'`.
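
The URL formula can be sketched as a small function. The helper name `languageFileUrl` is hypothetical, and the sketch assumes a `/` separates `langPath` from the language code, since the default `langPath` has no trailing slash.

```javascript
// Hypothetical helper mirroring the language file URL formula.
function languageFileUrl(langPath, langCode) {
  const base = langPath.replace(/\/+$/, ''); // drop trailing slashes, if any
  return `${base}/${langCode}.traineddata.gz`;
}

// With the default langPath and the English model:
// languageFileUrl('https://tessdata.projectnaptha.com/4.0.0', 'eng')
// → 'https://tessdata.projectnaptha.com/4.0.0/eng.traineddata.gz'
```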
### corePath
A string specifying the location of the [tesseract.js-core library](https://github.com/naptha/tesseract.js-core), with default value 'https://unpkg.com/tesseract.js-core@v2.0.0/tesseract-core.wasm.js' (falling back to tesseract-core.asm.js when WebAssembly is not available).
Another WASM option is 'https://unpkg.com/tesseract.js-core@v2.0.0/tesseract-core.js', a script that loads 'https://unpkg.com/tesseract.js-core@v2.0.0/tesseract-core.wasm'. However, it currently fails to fetch.