Version 2 is now available and under development in the master branch<br>
Version 2 beta is now available and under development in the master branch<br>
Check the <a href="https://github.com/naptha/tesseract.js/tree/support/1.x">support/1.x</a> branch for version 1
</h3>
@@ -26,25 +26,45 @@ It works in the browser using [webpack](https://webpack.js.org/) or plain script
After you [install it](#installation), using it is as simple as:
const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
console.log(text);
await worker.terminate();
})();
```
[Check out the docs](#docs) for a full explanation of the API.
## Major changes in v2
- Upgrade to tesseract v4
## Major changes in v2 beta
- Upgrade to tesseract v4.1 (using emscripten 1.38.45)
- Support multiple languages at the same time, eg: eng+chi_tra for English and Traditional Chinese
- Supported image formats: png, jpg, bmp, pbm
- Support WebAssembly (falls back to asm.js when the browser doesn't support it)
- Support TypeScript
## Installation
@@ -54,7 +74,7 @@ Tesseract.js works with a `<script>` tag via local copy or CDN, with webpack via
The development server will be available at http://localhost:3000/examples/browser/demo.html in your favorite browser.
It will automatically rebuild `tesseract.dev.js` and `worker.min.js` when you change files in the src folder.
It will automatically rebuild `tesseract.dev.js` and `worker.dev.js` when you change files in the **src** folder.
You can also run the development server in Gitpod (a free online IDE and dev environment for GitHub that automates your dev setup) with a single click.
createWorker is a factory function that creates a tesseract worker. A worker is basically a Web Worker in the browser and a Child Process in Node.
**Arguments:**
- `options` an object of customized options
  - `corePath` path for tesseract-core.js script
  - `langPath` path for downloading traineddata, do not include `/` at the end of the path
  - `workerPath` path for downloading worker script
  - `dataPath` path for saving traineddata in the WebAssembly file system, rarely needs to be modified
  - `cachePath` path for the cached traineddata, more useful for Node; in the browser it only changes the key in IndexedDB
  - `cacheMethod` a string to indicate the method of cache management, should be one of the following options
    - write: read the cache and write back (default method)
    - readOnly: read the cache but do not write back
    - refresh: do not read the cache, but write back
    - none: do not read the cache and do not write back
  - `workerBlobURL` a boolean to define whether to use a Blob URL for the worker script, default: true
  - `gzip` a boolean to define whether the traineddata from the remote is gzipped, default: true
  - `logger` a function to log the progress, a quick example is `m => console.log(m)`
**Examples:**
```javascript
const { createWorker } = Tesseract;
const worker = createWorker({
  langPath: '...',
  logger: m => console.log(m),
});
```
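The four `cacheMethod` values listed above differ only in whether the cache is read and whether it is written back. A minimal sketch of that mapping follows; `cachePolicy` and `CACHE_POLICIES` are hypothetical names used for illustration and are not part of the Tesseract.js API.

```javascript
// Hypothetical helper (not part of Tesseract.js): maps each documented
// cacheMethod value to the read/write behaviour described in the list above.
const CACHE_POLICIES = {
  write: { read: true, write: true }, // default method
  readOnly: { read: true, write: false },
  refresh: { read: false, write: true },
  none: { read: false, write: false },
};

function cachePolicy(cacheMethod = 'write') {
  // Use an own-property check so names like 'toString' are rejected too.
  if (!Object.prototype.hasOwnProperty.call(CACHE_POLICIES, cacheMethod)) {
    throw new Error(`Unknown cacheMethod: ${cacheMethod}`);
  }
  return CACHE_POLICIES[cacheMethod];
}
```

For instance, `cachePolicy('refresh')` describes a worker that skips the cached copy but stores the freshly downloaded traineddata.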
## Worker
A Worker performs the OCR-related tasks. It takes a few steps to set up a Worker before it is fully functional. The full flow is:
- load
- loadLanguage
- initialize
- setParameters // optional
- recognize or detect
- terminate
Each function is asynchronous, so you must use async/await or Promises. Each call resolves to an object:
```json
{
"jobId": "Job-1-123",
"data": { ... }
}
```
`jobId` is generated by Tesseract.js, but you can supply your own when calling any of the functions above.
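The full flow above can be sketched as a single function. This is an illustration rather than a recommended wrapper: `recognizeOnce` is a hypothetical name, and `worker` is assumed to be the object returned by `createWorker()`.

```javascript
// Sketch of the documented flow. `worker` is assumed to come from
// createWorker(); `image` and `langs` are supplied by the caller.
async function recognizeOnce(worker, image, langs = 'eng') {
  try {
    await worker.load();
    await worker.loadLanguage(langs);
    await worker.initialize(langs);
    // Each call resolves to { jobId, data }; only `data` is used here.
    const { data } = await worker.recognize(image);
    return data.text;
  } finally {
    // Always release the Web Worker / Child Process, even on failure.
    await worker.terminate();
  }
}
```

Taking the worker as a parameter keeps the sketch independent of how the worker was configured, at the cost of creating and terminating one worker per image.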
<a name="worker-load"></a>
### Worker.load(jobId): Promise
Worker.load() loads the tesseract.js-core scripts (downloading them from the remote if they are not present) and makes the Web Worker/Child Process ready for the next action.
**Arguments:**
- `jobId` Please see details above
**Examples:**
```javascript
(async () => {
await worker.load();
})();
```
<a name="worker-load-language"></a>
### Worker.loadLanguage(langs, jobId): Promise
Worker.loadLanguage() loads traineddata from the cache or downloads it from the remote, and puts it into the WebAssembly file system.
**Arguments:**
- `langs` a string indicating which languages' traineddata to download; multiple languages are concatenated with **+**, e.g. **eng+chi\_tra**
- `jobId` Please see details above
**Examples:**
```javascript
(async () => {
await worker.loadLanguage('eng+chi_tra');
})();
```
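When the language codes come from a list, for example a user selection, the `langs` string can be built by joining them with `+`. A small sketch, where `joinLangs` is a hypothetical helper and not part of Tesseract.js:

```javascript
// Hypothetical helper: builds the `langs` argument from an array of codes.
function joinLangs(codes) {
  if (!Array.isArray(codes) || codes.length === 0) {
    throw new TypeError('codes must be a non-empty array of language codes');
  }
  return codes.join('+');
}

// joinLangs(['eng', 'chi_tra']) returns 'eng+chi_tra', which can then be
// passed to worker.loadLanguage().
```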
<a name="worker-initialize"></a>
### Worker.initialize(langs, oem, jobId): Promise
Worker.initialize() initializes the Tesseract API and makes sure it is ready for OCR tasks.
**Arguments:**
- `langs` a string indicating the languages to be used by the Tesseract API; it can be a subset of the language traineddata you loaded with Worker.loadLanguage.
- `oem` an enum indicating the OCR Engine Mode to use
- `jobId` Please see details above
**Examples:**
```javascript
(async () => {
/** You can load more languages in advance, but use only part of them in Worker.initialize() */
await worker.loadLanguage('eng+chi_tra');
await worker.initialize('eng');
})();
```
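Because `initialize()` may only use languages that were loaded beforehand, it can help to check the subset relation before calling it. The helper below is a hypothetical sketch and not part of Tesseract.js.

```javascript
// Hypothetical helper: checks that every language passed to initialize()
// was included in the earlier loadLanguage() call.
function isLoadedSubset(initLangs, loadedLangs) {
  const loaded = new Set(loadedLangs.split('+'));
  return initLangs.split('+').every((lang) => loaded.has(lang));
}

// isLoadedSubset('eng', 'eng+chi_tra') returns true
// isLoadedSubset('deu', 'eng+chi_tra') returns false
```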
<a name="worker-set-parameters"></a>
### Worker.setParameters(params, jobId): Promise
Worker.setParameters() sets parameters for the Tesseract API (using SetVariable()). It changes the behavior of Tesseract, and some parameters, such as tessedit\_char\_whitelist, are very useful.
**Arguments:**
- `params` an object with key and value of the parameters
- `jobId` Please see details above
**Supported Parameters:**
| name | type | default value | description |
| ---- | ---- | ------------- | ----------- |
| tessedit\_ocr\_engine\_mode | enum | OEM.LSTM\_ONLY | Check [HERE](https://github.com/tesseract-ocr/tesseract/blob/4.0.0/src/ccstruct/publictypes.h#L268) for definition of each mode |
| tessedit\_pageseg\_mode | enum | PSM.SINGLE\_BLOCK | Check [HERE](https://github.com/tesseract-ocr/tesseract/blob/4.0.0/src/ccstruct/publictypes.h#L163) for definition of each mode |
| tessedit\_char\_whitelist | string | '' | restricts the result to the listed characters; useful when the content of the image is limited |
| tessjs\_create\_hocr | string | '1' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes hocr in the result |
| tessjs\_create\_tsv | string | '1' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes tsv in the result |
| tessjs\_create\_box | string | '0' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes box in the result |
| tessjs\_create\_unlv | string | '0' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes unlv in the result |
| tessjs\_create\_osd | string | '0' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes osd in the result |
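As an illustration of the table above, the following sketch restricts recognition to digits through `tessedit_char_whitelist`. `recognizeDigits` is a hypothetical name, and `worker` is assumed to be a worker that has already been loaded and initialized.

```javascript
// Sketch: limit the recognized characters to digits before recognizing.
// `worker` is assumed to be an initialized worker from createWorker().
async function recognizeDigits(worker, image) {
  await worker.setParameters({
    tessedit_char_whitelist: '0123456789',
  });
  const { data: { text } } = await worker.recognize(image);
  return text;
}
```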
Scheduler.getQueueLen() returns the length of the job queue.
<a name="scheduler-get-num-workers"></a>
### Scheduler.getNumWorkers(): number
Scheduler.getNumWorkers() returns the number of workers added to the scheduler.
<a name="scheduler-terminate"></a>
### Scheduler.terminate(): Promise
Scheduler.terminate() terminates all added workers, which is useful for a quick cleanup.
**Examples:**
```javascript
(async () => {
await scheduler.terminate();
})();
```
<a name="set-logging"></a>
## setLogging(logging: boolean)
setLogging() sets the logging flag. Call `setLogging(true)` to see detailed information, which is useful for debugging.
**Arguments:**
- `logging` boolean to define whether to see detailed logs, default: false
**Examples:**
```javascript
const { setLogging } = Tesseract;
setLogging(true);
```
<a name="recognize"></a>
## recognize(image, langs, options): Promise
recognize() is a shortcut for running a recognition task quickly. It is not recommended for real applications, but it is useful when you want to save some time.
See [Tesseract.js](../src/Tesseract.js)
<a name="detect"></a>
## detect(image, options): Promise
Same background as recognize(), but it performs detection instead.
## How does tesseract.js download and keep \*.traineddata?
When you call the recognize() function (e.g. `recognize(image, 'eng')`), the language model to download is determined by the 2nd argument of recognize() (`eng` in the example).
The language model is downloaded by `worker.loadLanguage()` and you need to pass the langs to `worker.initialize()`.
Tesseract.js will first check whether \*.traineddata already exists (browser: [IndexedDB](https://developer.mozilla.org/en-US/docs/Web/API/IndexedDB_API), Node.js: fs, in the folder where you execute the command). If the \*.traineddata doesn't exist, it fetches \*.traineddata.gz from [tessdata](https://github.com/naptha/tessdata), ungzips it and stores it in IndexedDB or fs. You can delete it manually and it will be downloaded again.
While downloading the language model, Tesseract.js will first check whether \*.traineddata already exists (browser: [IndexedDB](https://developer.mozilla.org/en-US/docs/Web/API/IndexedDB_API), Node.js: fs, in the folder where you execute the command). If the \*.traineddata doesn't exist, it fetches \*.traineddata.gz from [tessdata](https://github.com/naptha/tessdata), ungzips it and stores it in IndexedDB or fs. You can delete it manually and it will be downloaded again.
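The check-then-download behaviour described above can be sketched as follows. This is illustrative only and is not Tesseract.js's internal code: `loadTrainedData` is a hypothetical name, `cache` is any object with `get`/`set` (standing in for IndexedDB or fs), and `fetchGz` and `gunzip` are caller-supplied functions.

```javascript
// Illustrative sketch, not the library's implementation.
// cache:   object with get(key) / set(key, value), sync or async
// fetchGz: function that downloads a gzipped file by name
// gunzip:  function that decompresses the downloaded bytes
async function loadTrainedData(lang, { cache, fetchGz, gunzip }) {
  const key = `${lang}.traineddata`;
  const cached = await cache.get(key);
  if (cached !== undefined) {
    return cached; // already stored, no download needed
  }
  const gz = await fetchGz(`${key}.gz`);
  const data = await gunzip(gz);
  await cache.set(key, data); // keep it for the next run
  return data;
}
```

Deleting the cached entry makes the next call take the download path again, which matches the manual-delete behaviour described above.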
@@ -9,10 +9,20 @@ Because of this we recommend loading `tesseract.js` from a CDN. But if you reall
In Node.js environment, the only path you may want to customize is languages/langPath.
@@ -23,6 +33,6 @@ A string specifying the location of the [worker.js](./dist/worker.min.js) file.
A string specifying the location of the tesseract language files, with default value 'https://tessdata.projectnaptha.com/4.0.0'. Language file URLs are calculated according to the formula `langPath + '/' + langCode + '.traineddata.gz'`.
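A small sketch of how a language file URL could be derived from `langPath`. It assumes a `/` separator between `langPath` and the language code, since the default value carries no trailing slash; `trainedDataUrl` is a hypothetical helper, not a Tesseract.js function.

```javascript
// Hypothetical helper: builds the traineddata URL for one language.
// Assumes a '/' separator between langPath and the language code.
function trainedDataUrl(langPath, langCode, gzip = true) {
  const base = langPath.replace(/\/+$/, ''); // tolerate a trailing slash
  return `${base}/${langCode}.traineddata${gzip ? '.gz' : ''}`;
}

// trainedDataUrl('https://tessdata.projectnaptha.com/4.0.0', 'eng')
// returns 'https://tessdata.projectnaptha.com/4.0.0/eng.traineddata.gz'
```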
### corePath
A string specifying the location of the [tesseract.js-core library](https://github.com/naptha/tesseract.js-core), with default value 'https://unpkg.com/tesseract.js-core@v2.0.0-beta.10/tesseract-core.wasm.js' (fallback to tesseract-core.asm.js when WebAssembly is not available).
A string specifying the location of the [tesseract.js-core library](https://github.com/naptha/tesseract.js-core), with default value 'https://unpkg.com/tesseract.js-core@v2.0.0-beta.13/tesseract-core.wasm.js' (fallback to tesseract-core.asm.js when WebAssembly is not available).
Another WASM option is 'https://unpkg.com/tesseract.js-core@v2.0.0-beta.10/tesseract-core.js' which is a script that loads 'https://unpkg.com/tesseract.js-core@v2.0.0-beta.10/tesseract-core.wasm'. But it fails to fetch at this moment.
Another WASM option is 'https://unpkg.com/tesseract.js-core@v2.0.0-beta.13/tesseract-core.js' which is a script that loads 'https://unpkg.com/tesseract.js-core@v2.0.0-beta.13/tesseract-core.wasm'. But it fails to fetch at this moment.
In the 3rd argument of `TesseractWorker.recognize()`, you can pass a params object to customize the result of OCR. Below are the parameters supported in tesseract.js so far.
When initializing
In the 3rd argument of `recognize()`, you can pass a params object to customize the result of OCR. Below are the parameters supported in tesseract.js so far.
Example:
```javascript
import Tesseract from 'tesseract.js';
import { createWorker, OEM, PSM } from 'tesseract.js';
| tessedit\_ocr\_engine\_mode | enum | OEM.LSTM\_ONLY | Check [HERE](https://github.com/tesseract-ocr/tesseract/blob/4.0.0/src/ccstruct/publictypes.h#L268) for definition of each mode |
| tessedit\_pageseg\_mode | enum | PSM.SINGLE\_BLOCK | Check [HERE](https://github.com/tesseract-ocr/tesseract/blob/4.0.0/src/ccstruct/publictypes.h#L163) for definition of each mode |
| tessedit\_char\_whitelist | string | '' | restricts the result to the listed characters; useful when the content of the image is limited |
| tessjs\_create\_pdf | string | '0' | only 2 values, '0' or '1', when the value is '1', tesseract.js generates a pdf output |
| tessjs\_create\_hocr | string | '1' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes hocr in the result |
| tessjs\_create\_tsv | string | '1' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes tsv in the result |
| tessjs\_create\_box | string | '0' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes box in the result |
| tessjs\_create\_unlv | string | '0' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes unlv in the result |
| tessjs\_create\_osd | string | '0' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes osd in the result |
| tessjs\_pdf\_name | string | 'tesseract.js-ocr-result' | the name of the generated pdf file |
| tessjs\_pdf\_title | string | 'Tesseract.js OCR Result' | the title of the generated pdf file |
| tessjs\_pdf\_auto\_download | boolean | true | If the value is true, tesseract.js will automatically download/writeFile the pdf file |
| tessjs\_pdf\_bin | boolean | false | whether to include pdf binary array in the result object (result.files.pdf) |
| tessjs\_image\_rectangle\_left | number | 0 | The left of the sub-rectangle of the image. |
| tessjs\_image\_rectangle\_top | number | 0 | The top of the sub-rectangle of the image. |
| tessjs\_image\_rectangle\_width | number | -1 | The width of the sub-rectangle of the image, -1 means auto width detection |
| tessjs\_image\_rectangle\_height | number | -1 | The height of the sub-rectangle of the image, -1 means auto height detection |