# API

- [createWorker()](#create-worker)
  - [Worker.load](#worker-load)
  - [Worker.loadLanguage](#worker-load-language)
  - [Worker.initialize](#worker-initialize)
  - [Worker.setParameters](#worker-set-parameters)
  - [Worker.recognize](#worker-recognize)
  - [Worker.detect](#worker-detect)
  - [Worker.terminate](#worker-terminate)
- [createScheduler()](#create-scheduler)
  - [Scheduler.addWorker](#scheduler-add-worker)
  - [Scheduler.addJob](#scheduler-add-job)
  - [Scheduler.getQueueLen](#scheduler-get-queue-len)
  - [Scheduler.getNumWorkers](#scheduler-get-num-workers)
  - [Scheduler.terminate](#scheduler-terminate)
- [setLogging()](#set-logging)
- [recognize()](#recognize)
- [detect()](#detect)
- [PSM](#psm)
- [OEM](#oem)

---

<a name="create-worker"></a>
## createWorker(options): Worker

createWorker() is a factory function that creates a Tesseract worker. A worker is essentially a Web Worker in the browser and a Child Process in Node.

**Arguments:**

- `options` an object of customized options
  - `corePath` path to the tesseract-core.js script
  - `langPath` path for downloading traineddata; do not include `/` at the end of the path
  - `workerPath` path for downloading the worker script
  - `dataPath` path for saving traineddata in the WebAssembly file system; rarely needs to be changed
  - `cachePath` path for the cached traineddata; mostly useful in Node, in the browser it only changes the key in IndexedDB
  - `cacheMethod` a string indicating the cache management method, one of:
    - write: read the cache and write back (default)
    - readOnly: read the cache but do not write back
    - refresh: do not read the cache, but write back
    - none: neither read the cache nor write back
  - `workerBlobURL` a boolean that defines whether to use a Blob URL for the worker script, default: true
  - `gzip` a boolean that defines whether the remote traineddata is gzipped, default: true
  - `logger` a function that logs progress, for example `m => console.log(m)`


**Examples:**

```javascript
const { createWorker } = Tesseract;
const worker = createWorker({
  langPath: '...',
  logger: m => console.log(m),
});
```
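
Because `langPath` must not end with `/` and `cacheMethod` accepts only four values, it can help to validate the options before passing them to createWorker(). The following is a minimal sketch; `normalizeLangPath` and `buildWorkerOptions` are hypothetical helpers, not part of Tesseract.js, and the URL is a placeholder.

```javascript
// Hypothetical helper (not part of Tesseract.js): strip trailing slashes
// so the value is safe to use as `langPath`.
function normalizeLangPath(path) {
  return path.replace(/\/+$/, '');
}

// Hypothetical helper: build an options object for createWorker().
// Only option names documented above are used.
function buildWorkerOptions(langPath, cacheMethod = 'write') {
  const allowed = ['write', 'readOnly', 'refresh', 'none'];
  if (!allowed.includes(cacheMethod)) {
    throw new Error(`Unknown cacheMethod: ${cacheMethod}`);
  }
  return {
    langPath: normalizeLangPath(langPath),
    cacheMethod,
    logger: m => console.log(m),
  };
}

// The URL below is a placeholder, not a real traineddata host.
const options = buildWorkerOptions('https://example.com/lang-data/', 'readOnly');
console.log(options.langPath); // https://example.com/lang-data
```

The resulting object can then be passed as `createWorker(options)`.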

## Worker

A Worker helps you perform OCR-related tasks. A few steps are needed to set up a Worker before it is fully functional. The full flow is:

- load
- loadLanguage
- initialize
- setParameters // optional
- recognize or detect
- terminate

Each function is asynchronous, so you need to use async/await or Promises. When the promise resolves, you get an object:

```json
{
  "jobId": "Job-1-123",
  "data": { ... }
}
```

The jobId is generated by Tesseract.js, but you can supply your own when calling any of the functions above.
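
The flow above can be wrapped in a single function. This is only a sketch: `recognizeOnce` is a hypothetical helper, `worker` is assumed to come from createWorker(), and `image` is assumed to be anything Worker.recognize() accepts.

```javascript
// Hypothetical helper: run the full Worker flow once.
// The worker is terminated even if one of the steps rejects.
async function recognizeOnce(worker, image, lang = 'eng') {
  try {
    await worker.load();
    await worker.loadLanguage(lang);
    await worker.initialize(lang);
    // Every call resolves to { jobId, data }; only data.text is used here.
    const { data: { text } } = await worker.recognize(image);
    return text;
  } finally {
    await worker.terminate();
  }
}
```

Using `try`/`finally` keeps the Web Worker or Child Process from lingering when loading or recognition fails. For repeated recognition, reusing one worker instead of creating a new one per image is usually the better choice.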

<a name="worker-load"></a>
### Worker.load(jobId): Promise

Worker.load() loads the tesseract.js-core scripts (downloading them from the remote if they are not present) and makes the Web Worker/Child Process ready for the next action.

**Arguments:**

- `jobId` Please see details above

**Examples:**

```javascript
(async () => {
  await worker.load();
})();
```

<a name="worker-load-language"></a>
### Worker.loadLanguage(langs, jobId): Promise

Worker.loadLanguage() loads traineddata from the cache, or downloads it from the remote, and puts it into the WebAssembly file system.

**Arguments:**

- `langs` a string indicating the languages whose traineddata should be loaded; multiple languages are joined with **+**, e.g. **eng+chi\_tra**
- `jobId` Please see details above

**Examples:**

```javascript
(async () => {
  await worker.loadLanguage('eng+chi_tra');
})();
```
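
If the languages are kept in an array, they need to be joined with `+` first. A minimal sketch, where `joinLangs` and `loadLanguages` are hypothetical helpers and not part of Tesseract.js:

```javascript
// Hypothetical helper: ['eng', 'chi_tra'] becomes 'eng+chi_tra'.
// Empty entries and surrounding whitespace are dropped.
function joinLangs(langs) {
  return langs.map(l => l.trim()).filter(Boolean).join('+');
}

// Hypothetical helper: load every language in the list with a single call.
// `worker` is assumed to have completed worker.load() already.
async function loadLanguages(worker, langs) {
  const joined = joinLangs(langs);
  await worker.loadLanguage(joined);
  return joined;
}

console.log(joinLangs(['eng', 'chi_tra'])); // eng+chi_tra
```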

<a name="worker-initialize"></a>
### Worker.initialize(langs, oem, jobId): Promise

Worker.initialize() initializes the Tesseract API and makes sure it is ready for OCR tasks.

**Arguments:**

- `langs` a string indicating the languages to be loaded by the Tesseract API; it can be a subset of the language traineddata you loaded with Worker.loadLanguage
- `oem` an enum indicating the OCR Engine Mode to use
- `jobId` Please see details above

**Examples:**

```javascript
(async () => {
  /** You can load more languages in advance, but use only part of them in Worker.initialize() */
  await worker.loadLanguage('eng+chi_tra');
  await worker.initialize('eng');
})();
```

<a name="worker-set-parameters"></a>
### Worker.setParameters(params, jobId): Promise

Worker.setParameters() sets parameters for the Tesseract API (using SetVariable()). This changes the behavior of Tesseract, and some parameters, such as tessedit\_char\_whitelist, are very useful.

**Arguments:**

- `params` an object with key and value of the parameters
- `jobId` Please see details above

**Supported Parameters:**

| name | type | default value | description |
| ---- | ---- | ------------- | ----------- |
| tessedit\_ocr\_engine\_mode | enum | OEM.DEFAULT | Check [HERE](https://github.com/tesseract-ocr/tesseract/blob/4.0.0/src/ccstruct/publictypes.h#L268) for definition of each mode | 
| tessedit\_pageseg\_mode | enum | PSM.SINGLE\_BLOCK | Check [HERE](https://github.com/tesseract-ocr/tesseract/blob/4.0.0/src/ccstruct/publictypes.h#L163) for definition of each mode |
| tessedit\_char\_whitelist | string | '' | restricts the result to the listed characters, useful when the content of the image is limited |
| preserve\_interword\_spaces | string | '0' | '0' or '1', keeps the space between words |
| tessjs\_create\_hocr | string | '1' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes hocr in the result |
| tessjs\_create\_tsv | string | '1' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes tsv in the result |
| tessjs\_create\_box | string | '0' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes box in the result |
| tessjs\_create\_unlv | string | '0' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes unlv in the result |
| tessjs\_create\_osd | string | '0' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes osd in the result |

**Examples:**

```javascript
(async () => {
  await worker.setParameters({
    tessedit_char_whitelist: '0123456789',
  });
})();
```
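
Several parameters from the table can be set in one call. The sketch below restricts output to digits and turns off the hocr and tsv outputs, which are enabled by default according to the table. `configureForDigits` is a hypothetical helper, and `worker` is assumed to be initialized already.

```javascript
// Hypothetical helper: digits only, without hocr/tsv in the result.
// All parameter names and values are taken from the table above.
async function configureForDigits(worker) {
  const params = {
    tessedit_char_whitelist: '0123456789',
    tessjs_create_hocr: '0',
    tessjs_create_tsv: '0',
  };
  await worker.setParameters(params);
  return params;
}
```

Note that the `tessjs_create_*` values are strings (`'0'` or `'1'`), not booleans.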

<a name="worker-recognize"></a>
### Worker.recognize(image, options, jobId): Promise

Worker.recognize() provides the core function of Tesseract.js: it executes OCR.

It figures out which words are in `image`, where they are located, and so on.
> Note: `image` should be sufficiently high resolution.
> Often, the same image will get much better results if you upscale it before calling `recognize`.

**Arguments:**

- `image` see [Image Format](./image-format.md) for more details.
- `options` an object of customized options
  - `rectangles` an array of objects specifying the regions of the image you want to recognize; each object should contain top, left, width and height, see the example below.
- `jobId` Please see details above

**Output:**

**Examples:**

```javascript
const { createWorker } = Tesseract;
(async () => {
  const worker = createWorker();
  await worker.load();
  await worker.loadLanguage('eng');
  await worker.initialize('eng');
  const { data: { text } } = await worker.recognize(image);
  console.log(text);
})();
```

With rectangles

```javascript
const { createWorker } = Tesseract;
(async () => {
  const worker = createWorker();
  await worker.load();
  await worker.loadLanguage('eng');
  await worker.initialize('eng');
  const { data: { text } } = await worker.recognize(image, {
    rectangles: [
      { top: 0, left: 0, width: 100, height: 100 },
    ],
  });
  console.log(text);
})();
```
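
To read several regions of the same image, one option is to call recognize() once per region. This is a sketch under assumptions: `recognizeRegions` is a hypothetical helper, `worker` is assumed to be initialized, and each call passes a single-element `rectangles` array so that each result maps to exactly one region.

```javascript
// Hypothetical helper: recognize each region separately and collect the text.
// `regions` is an array of { top, left, width, height } objects.
async function recognizeRegions(worker, image, regions) {
  const texts = [];
  for (const region of regions) {
    // Sequential on purpose: a single worker handles one job at a time.
    const { data: { text } } = await worker.recognize(image, {
      rectangles: [region],
    });
    texts.push(text);
  }
  return texts;
}
```

The returned array has the same order as `regions`, so `texts[i]` belongs to `regions[i]`.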

<a name="worker-detect"></a>
### Worker.detect(image, jobId): Promise

Worker.detect() runs OSD (Orientation and Script Detection) on the image instead of OCR.

**Arguments:**

- `image` see [Image Format](./image-format.md) for more details.
- `jobId` Please see details above

**Examples:**

```javascript
const { createWorker } = Tesseract;
(async () => {
  const worker = createWorker();
  await worker.load();
  await worker.loadLanguage('eng');
  await worker.initialize('eng');
  const { data } = await worker.detect(image);
  console.log(data);
})();
```

<a name="worker-terminate"></a>
### Worker.terminate(jobId): Promise

Worker.terminate() terminates the worker and cleans up.

**Arguments:**

- `jobId` Please see details above

**Examples:**

```javascript
(async () => {
  await worker.terminate();
})();
```

<a name="create-scheduler"></a>
## createScheduler(): Scheduler

createScheduler() is a factory function that creates a scheduler. A scheduler manages a job queue and a pool of workers so that multiple workers can work together, which is useful when you want to speed up processing.

**Examples:**

```javascript
const { createScheduler } = Tesseract;
const scheduler = createScheduler();
```

### Scheduler

<a name="scheduler-add-worker"></a>
### Scheduler.addWorker(worker): string

Scheduler.addWorker() adds a worker to the scheduler's worker pool. It is recommended to add each worker to only one scheduler.

**Arguments:**

- `worker` see Worker above

**Examples:**

```javascript
const { createWorker, createScheduler } = Tesseract;
const scheduler = createScheduler();
const worker = createWorker();
scheduler.addWorker(worker);
```

<a name="scheduler-add-job"></a>
### Scheduler.addJob(action, ...payload): Promise

Scheduler.addJob() adds a job to the job queue; the scheduler then waits for an idle worker to take the job.

**Arguments:**

- `action` a string indicating the action you want to perform; currently only **recognize** and **detect** are supported
- `payload` an arbitrary number of arguments, depending on the action you call

**Examples:**

```javascript
(async () => {
  const { data: { text } } = await scheduler.addJob('recognize', image, options);
  const { data } = await scheduler.addJob('detect', image);
})();
```
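
A scheduler pays off when several images are queued at once. The sketch below builds a pool and recognizes a list of images in parallel. `createPool` and `recognizeAll` are hypothetical helpers; `createWorker` and `createScheduler` are passed in so the sketch does not assume how Tesseract is imported, and it assumes each worker should be fully set up before it is added.

```javascript
// Hypothetical helper: create `size` ready-to-use workers and add them
// to a new scheduler.
async function createPool({ createWorker, createScheduler }, size, lang = 'eng') {
  const scheduler = createScheduler();
  for (let i = 0; i < size; i += 1) {
    const worker = createWorker();
    await worker.load();
    await worker.loadLanguage(lang);
    await worker.initialize(lang);
    scheduler.addWorker(worker);
  }
  return scheduler;
}

// Hypothetical helper: queue one 'recognize' job per image and wait for all.
// The scheduler hands each job to the next idle worker.
async function recognizeAll(scheduler, images) {
  const results = await Promise.all(
    images.map(image => scheduler.addJob('recognize', image)),
  );
  return results.map(({ data: { text } }) => text);
}
```

With the `Tesseract` object used in the other examples, this could be called as `createPool(Tesseract, 4)` followed by `recognizeAll(scheduler, images)`. `Promise.all` preserves the order of `images`, regardless of which worker finishes first.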

<a name="scheduler-get-queue-len"></a>
### Scheduler.getQueueLen(): number

Scheduler.getQueueLen() returns the length of the job queue.

<a name="scheduler-get-num-workers"></a>
### Scheduler.getNumWorkers(): number

Scheduler.getNumWorkers() returns the number of workers added to the scheduler.
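
The two getters can be combined into a quick status line, for example while debugging a backlog. `describeScheduler` is a hypothetical helper, not part of Tesseract.js.

```javascript
// Hypothetical helper: summarize queue length and pool size as a string.
function describeScheduler(scheduler) {
  const queued = scheduler.getQueueLen();
  const workers = scheduler.getNumWorkers();
  return `${queued} job(s) queued, ${workers} worker(s) registered`;
}
```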

<a name="scheduler-terminate"></a>
### Scheduler.terminate(): Promise

Scheduler.terminate() terminates all workers added to the scheduler, which is useful for quick cleanup.

**Examples:**

```javascript
(async () => {
  await scheduler.terminate();
})();
```

<a name="set-logging"></a>
## setLogging(logging: boolean)

setLogging() sets the logging flag. Call `setLogging(true)` to see detailed information, which is useful for debugging.

**Arguments:**

- `logging` boolean to define whether to see detailed logs, default: false

**Examples:**

```javascript
const { setLogging } = Tesseract;
setLogging(true);
```

<a name="recognize"></a>
## recognize(image, langs, options): Promise

recognize() is a shortcut for running a recognition task quickly. It is not recommended for real applications, but it is useful when you want to save some time.

See [Tesseract.js](../src/Tesseract.js)

<a name="detect"></a>
## detect(image, options): Promise

Same background as recognize(), but it runs detection instead.

See [Tesseract.js](../src/Tesseract.js)

<a name="psm"></a>
## PSM

See [PSM.js](../src/constants/PSM.js)

<a name="oem"></a>
## OEM

See [OEM.js](../src/constants/OEM.js)