Browse Source

Update docs

develop
Jerome Wu 5 years ago
parent
commit
b59d144af3
  1. 48
      README.md
  2. 244
      docs/api.md
  3. 288
      docs/examples.md
  4. 38
      docs/faq.md
  5. 20
      docs/local-installation.md
  6. 15
      docs/tesseract_parameters.md

48
README.md

@ -11,7 +11,7 @@ @@ -11,7 +11,7 @@
[![Downloads Month](https://img.shields.io/npm/dm/tesseract.js.svg)](https://www.npmjs.com/package/tesseract.js)
<h3 align="center">
Version 2 is now available and under development in the master branch<br>
Version 2 beta is now available and under development in the master branch<br>
Check the <a href="https://github.com/naptha/tesseract.js/tree/support/1.x">support/1.x</a> branch for version 1
</h3>
@ -26,25 +26,45 @@ It works in the browser using [webpack](https://webpack.js.org/) or plain script @@ -26,25 +26,45 @@ It works in the browser using [webpack](https://webpack.js.org/) or plain script
After you [install it](#installation), using it is as simple as:
```javascript
import { TesseractWorker } from 'tesseract.js';
const worker = new TesseractWorker();
worker.recognize(myImage)
.progress(progress => {
console.log('progress', progress);
}).then(result => {
console.log('result', result);
});
import Tesseract from 'tesseract.js';
Tesseract.recognize(
'https://tesseract.projectnaptha.com/img/eng_bw.png',
'eng',
{ logger: m => console.log(m) }
).then(({ data: { text } }) => {
console.log(text);
})
```
Or more imperative
```javascript
import { createWorker } from 'tesseract.js';
const worker = createWorker({
logger: m => console.log(m)
});
(async () => {
await worker.load();
await worker.loadLanguage('eng');
await worker.initialize('eng');
const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
console.log(text);
await woker.terminate();
})();
```
[Check out the docs](#docs) for a full explanation of the API.
## Major changes in v2
- Upgrade to tesseract v4
## Major changes in v2 beta
- Upgrade to tesseract v4.1 (using emscripten 1.38.45)
- Support multiple languages at the same time, eg: eng+chi_tra for English and Traditional Chinese
- Supported image formats: png, jpg, bmp, pbm
- Support WebAssembly (fallback to ASM.js when browser doesn't support)
- Support Typescript
## Installation
@ -54,7 +74,7 @@ Tesseract.js works with a `<script>` tag via local copy or CDN, with webpack via @@ -54,7 +74,7 @@ Tesseract.js works with a `<script>` tag via local copy or CDN, with webpack via
### CDN
```html
<!-- v2 -->
<script src='https://unpkg.com/tesseract.js@v2.0.0-alpha.16/dist/tesseract.min.js'></script>
<script src='https://unpkg.com/tesseract.js@v2.0.0-beta.1/dist/tesseract.min.js'></script>
<!-- v1 -->
<script src='https://unpkg.com/tesseract.js@1.0.19/src/index.js'></script>
@ -103,7 +123,7 @@ npm start @@ -103,7 +123,7 @@ npm start
```
The development server will be available at http://localhost:3000/examples/browser/demo.html in your favorite browser.
It will automatically rebuild `tesseract.dev.js` and `worker.min.js` when you change files in the src folder.
It will automatically rebuild `tesseract.dev.js` and `worker.dev.js` when you change files in the **src** folder.
You can also run the development server in Gitpod ( a free online IDE and dev environment for GitHub that will automate your dev setup ) with a single click.

244
docs/api.md

@ -1,5 +1,249 @@ @@ -1,5 +1,249 @@
# API
- [createWorker()](#create-worker)
- [Worker.load](#worker-load)
- [Worker.loadLanguage](#worker-load-language)
- [Worker.initialize](#worker-initialize)
- [Worker.setParameters](#worker-set-parameters)
- [Worker.recognize](#worker-recognize)
- [Worker.detect](#worker-detect)
- [Worker.terminate](#worker-terminate)
- [createScheduler()](#create-scheduler)
- [Scheduler.addWorker](#scheduler-add-worker)
- [Scheduler.addJob](#scheduler-add-job)
- [Scheduler.getQueueLen](#scheduler-get-queue-len)
- [Scheduler.getNumWorkers](#scheduler-get-num-workers)
- [setLogging()](#set-logging)
- [recognize()](#recognize)
- [detect()](#detect)
- [PSM](#psm)
- [OEM](#oem)
---
<a name="create-worker"></a>
## createWorker(options): Worker
createWorker is a factory function that creates a tesseract worker, a worker is basically a Web Worker in browser and Child Process in Node.
**Arguments:**
- `options` an object of customized options
- `corePath` path for tesseract-core.js script
- `langPath` path for downloading traineddata, do not include `/` at the end of the path
- `workerPath` path for downloading worker script
- `dataPath` path for saving traineddata in WebAssembly file system, not common to modify
- `cachePath` path for the cached traineddata, more useful for Node, for browser it only changes the key in IndexDB
- `cacheMethod` a string to indicate the method of cache management, should be one of the following options
- write: read cache and write back (default method)
- readOnly: read cache and not to write back
- refresh: not to read cache and write back
- none: not to read cache and not to write back
- `workerBlobURL` a boolean to define whether to use Blob URL for worker script, default: true
- `gzip` a boolean to define whether the traineddata from the remote is gzipped, default: true
- `logger` a function to log the progress, a quick example is `m => console.log(m)`
**Examples:**
```javascript
const { createWorker } = Tesseract;
const worker = createWorker({
langPath: '...',
logger: m => console.log(m),
});
```
## Worker
A Worker helps you to do the OCR related tasks, it takes few steps to setup Worker before it is fully functional. The full flow is:
- load
- loadLanguauge
- initialize
- setParameters // optional
- recognize or detect
- terminate
Each function is async, so using async/await or Promise is required. When it is resolved, you get an object:
```json
{
"jobId": "Job-1-123",
"data": { ... }
}
```
jobId is generated by Tesseract.js, but you can put your own when calling any of the function above.
<a name="worker-load"></a>
### Worker.load(jobId): Promise
Worker.load() loads tesseract.js-core scripts (download from remote if not presented), it makes Web Worker/Child Process ready for next action.
**Arguments:**
- `jobId` Please see details above
**Examples:**
```javascript
(async () => {
await worker.load();
})();
```
<a name="worker-load-language"></a>
### Worker.loadLanguage(langs, jobId): Promise
Worker.loadLanguage() loads traineddata from cache or download traineddata from remote, and put traineddata into the WebAssembly file system.
**Arguments:**
- `langs` a string to indicate the languages traineddata to download, multiple languages are concated with **+**, ex: **eng+chi\_tra**
- `jobId` Please see details above
**Examples:**
```javascript
(async () => {
await worker.loadLanguage('eng+chi_tra');
})();
```
<a name="worker-initialize"></a>
### Worker.initialize(langs, oem, jobId): Promise
Worker.initialize() initializes the Tesseract API, make sure it is ready for doing OCR tasks.
**Arguments:**
- `langs` a string to indicate the languages loaded by Tesseract API, it can be the subset of the languauge traineddata you loaded from Worker.loadLanguage.
- `oem` a enum to indicate the OCR Engine Mode you use
- `jobId` Please see details above
**Examples:**
```javascript
(async () => {
/** You can load more languages in advance, but use only part of them in Worker.initialize() */
await worker.loadLanguage('eng+chi_tra');
await worker.initialize('eng');
})();
```
<a name="worker-set-parameters"></a>
### Worker.setParameters(params, jobId): Promise
Worker.setParameters() set parameters for Tesseract API (using SetVariable()), it changes the behavior of Tesseract and some parameters like tessedit\_char\_whitelist is very useful.
**Arguments:**
- `params` an object with key and value of the parameters
- `jobId` Please see details above
**Supported Paramters:**
| name | type | default value | description |
| ---- | ---- | ------------- | ----------- |
| tessedit\_ocr\_engine\_mode | enum | OEM.LSTM\_ONLY | Check [HERE](https://github.com/tesseract-ocr/tesseract/blob/4.0.0/src/ccstruct/publictypes.h#L268) for definition of each mode |
| tessedit\_pageseg\_mode | enum | PSM.SINGLE\_BLOCK | Check [HERE](https://github.com/tesseract-ocr/tesseract/blob/4.0.0/src/ccstruct/publictypes.h#L163) for definition of each mode |
| tessedit\_char\_whitelist | string | '' | setting white list characters makes the result only contains these characters, useful the content in image is limited |
| tessjs\_create\_hocr | string | '1' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes hocr in the result |
| tessjs\_create\_tsv | string | '1' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes tsv in the result |
| tessjs\_create\_box | string | '0' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes box in the result |
| tessjs\_create\_unlv | string | '0' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes unlv in the result |
| tessjs\_create\_osd | string | '0' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes osd in the result |
**Examples:**
```javascript
(async () => {
await worker.setParameters({
tessedit_char_whitelist: '0123456789',
});
})
```
<a name="worker-recognize"></a>
### Worker.recognize(image, options, jobId): Promise
<a name="worker-detect"></a>
### Worker.detect(image, jobId): Promise
<a name="worker-terminate"></a>
### Worker.terminate(jobId): Promise
<a name="create-scheduler"></a>
## createScheduler(): Scheduler
<a name="scheduler-add-worker"></a>
### Scheduler.addWorker(worker): string
<a name="scheduler-add-job"></a>
### Scheduler.addJob(worker): Promise
<a name="scheduler-get-queue-len"></a>
### Scheduler.getQueueLen(): number
Scheduler.getNumWorkers() returns the length of job queue.
<a name="scheduler-get-num-workers"></a>
### Scheduler.getNumWorkers(): number
Scheduler.getNumWorkers() returns number of workers added into the scheduler
<a name="scheduler-terminate"></a>
### Scheduler.terminate(): Promise
Scheduler.terminate() terminates all workers added, useful to do quick clean up.
**Examples:**
```javascript
(async () => {
await scheduler.terminate();
})();
```
<a name="set-logging"></a>
## setLogging(logging: boolean)
setLogging() sets the logging flag, you can `setLogging(true)` to see detailed information, useful for debugging.
**Arguments:**
- `logging` boolean to define whether to see detailed logs, default: false
**Examples:**
```javascript
const { setLogging } = Tesseract;
setLogging(true);
```
<a name="recognize"></a>
## recognize(image, langs, options): Promise
recognize() is a function to quickly achieve recognize() task, it is not recommended to use in real application, but useful when you want to save some time.
See [Tesseract.js](../src/Tesseract.js)
<a name="detect"></a>
## detect(image, options): Promise
Same background as recongize(), but it does detect instead.
See [Tesseract.js](../src/Tesseract.js)
<a name="psm"></a>
## PSM
See [PSM.js](../src/constatns/PSM.js)
<a name="oem"></a>
## OEM
See [OEM.js](../src/constatns/OEM.js)
## TesseractWorker.recognize(image, lang, [, options]) -> [TesseractJob](#tesseractjob)
Figures out what words are in `image`, where the words are in `image`, etc.
> Note: `image` should be sufficiently high resolution.

288
docs/examples.md

@ -12,217 +12,147 @@ Example repositories: @@ -12,217 +12,147 @@ Example repositories:
### basic
```javascript
import Tesseract from 'tesseract.js';
const { TesseractWorker } = Tesseract;
const worker = new TesseractWorker();
worker
.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png')
.progress((p) => {
console.log('progress', p);
})
.then(({ text }) => {
console.log(text);
worker.terminate();
});
import { createWorker } from 'tesseract.js';
const worker = createWorker();
(async () => {
await worker.load();
await worker.loadLanguage('eng');
await worker.initialize('eng');
const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
console.log(text);
await worker.terminate();
})();
```
### with detailed progress
```javascript
import Tesseract from 'tesseract.js';
const { TesseractWorker } = Tesseract;
const worker = new TesseractWorker();
worker
.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png')
.progress((p) => {
console.log('progress', p);
})
.then(({ text }) => {
console.log(text);
worker.terminate();
});
import { createWorker } from 'tesseract.js';
const worker = createWorker({
logger: m => console.log(m), // Add logger here
});
(async () => {
await worker.load();
await worker.loadLanguage('eng');
await worker.initialize('eng');
const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
console.log(text);
await worker.terminate();
})();
```
### with multiple languages, separate by '+'
```javascript
import Tesseract from 'tesseract.js';
const { TesseractWorker } = Tesseract;
const worker = new TesseractWorker();
worker
.recognize(
'https://tesseract.projectnaptha.com/img/eng_bw.png',
'eng+chi_tra'
)
.progress((p) => {
console.log('progress', p);
})
.then(({ text }) => {
console.log(text);
worker.terminate();
});
import { createWorker } from 'tesseract.js';
const worker = createWorker();
(async () => {
await worker.load();
await worker.loadLanguage('eng+chi_tra');
await worker.initialize('eng+chi_tra');
const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
console.log(text);
await worker.terminate();
})();
```
### with whitelist char (^2.0.0-beta.1)
### with whitelist char (^2.0.0-alpha.5)
```javascript
import { createWorker } from 'tesseract.js';
Sadly, whitelist chars is not supported in tesseract.js v4, so in tesseract.js we need to switch to tesseract v3 mode to make it work.
const worker = createWorker();
```javascript
import Tesseract from 'tesseract.js';
const { TesseractWorker, OEM } = Tesseract;
const worker = new TesseractWorker();
worker
.recognize(
'https://tesseract.projectnaptha.com/img/eng_bw.png',
'eng',
{
'tessedit_ocr_engine_mode': OEM.TESSERACT_ONLY,
'tessedit_char_whitelist': '0123456789-.',
}
)
.progress((p) => {
console.log('progress', p);
})
.then(({ text }) => {
console.log(text);
worker.terminate();
(async () => {
await worker.load();
await worker.loadLanguage('eng');
await worker.initialize('eng');
await worker.setParameters({
tessedit_char_whitelist: '0123456789',
});
const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
console.log(text);
await worker.terminate();
})();
```
### with different pageseg mode (^2.0.0-alpha.5)
### with different pageseg mode (^2.0.0-beta.1)
Check here for more details of pageseg mode: https://github.com/tesseract-ocr/tesseract/blob/4.0.0/src/ccstruct/publictypes.h#L163
```javascript
import Tesseract from 'tesseract.js';
const { TesseractWorker, PSM } = Tesseract;
const worker = new TesseractWorker();
worker
.recognize(
'https://tesseract.projectnaptha.com/img/eng_bw.png',
'eng',
{
'tessedit_pageseg_mode': PSM.SINGLE_BLOCK,
}
)
.progress((p) => {
console.log('progress', p);
})
.then(({ text }) => {
console.log(text);
worker.terminate();
});
```
### with pdf output (^2.0.0-alpha.12)
import { createWorker, PSM } from 'tesseract.js';
In this example, pdf file will be downloaded in browser and write to file system in Node.js
const worker = createWorker();
```javascript
import Tesseract from 'tesseract.js';
const { TesseractWorker } = Tesseract;
const worker = new TesseractWorker();
worker
.recognize(
'https://tesseract.projectnaptha.com/img/eng_bw.png',
'eng',
{
'tessjs_create_pdf': '1',
}
)
.progress((p) => {
console.log('progress', p);
})
.then(({ text }) => {
console.log(text);
worker.terminate();
(async () => {
await worker.load();
await worker.loadLanguage('eng');
await worker.initialize('eng');
await worker.setParameters({
tessedit_pageseg_mode: PSM.SINGLE_BLOCK,
});
const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
console.log(text);
await worker.terminate();
})();
```
If you want to handle pdf file by yourself
### with pdf output (^2.0.0-beta.1)
```javascript
import Tesseract from 'tesseract.js';
const { TesseractWorker } = Tesseract;
const worker = new TesseractWorker();
worker
.recognize(
'https://tesseract.projectnaptha.com/img/eng_bw.png',
'eng',
{
'tessjs_create_pdf': '1',
'tessjs_pdf_auto_download': false, // disable auto download
'tessjs_pdf_bin': true, // add pdf file bin array in result
}
)
.progress((p) => {
console.log('progress', p);
})
.then(({ files: { pdf } }) => {
console.log(Object.values(pdf)); // As pdf is an array-like object, you need to do a little convertion first.
worker.terminate();
});
```
Please check **examples** folder for details.
### with preload language data
Browser: [download-pdf.html](../examples/browser/download-pdf.html)
Node: [download-pdf.js](../examples/node/download-pdf.js)
```javascript
const Tesseract = require('tesseract.js');
const { TesseractWorker, utils: { loadLang } } = Tesseract;
const worker = new TesseractWorker();
loadLang({ langs: 'eng', langPath: worker.options.langPath })
.then(() => {
worker
.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png')
.progress(p => console.log(p))
.then(({ text }) => {
console.log(text);
worker.terminate();
});
});
### with only part of the image (^2.0.0-beta.1)
```javascript
import { createWorker } from 'tesseract.js';
const worker = createWorker();
const rectangles = [
{ left: 0, top: 0, width: 500, height: 250 },
];
(async () => {
await worker.load();
await worker.loadLanguage('eng');
await worker.initialize('eng');
const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png', 'eng', { rectangles });
console.log(text);
await worker.terminate();
})();
```
### with only part of the image (^2.0.0-alpha.12)
### with multiple workers to speed up (^2.0.0-beta.1)
```javascript
import Tesseract from 'tesseract.js';
const { TesseractWorker } = Tesseract;
const worker = new TesseractWorker();
worker
.recognize(
'https://tesseract.projectnaptha.com/img/eng_bw.png',
'eng',
{
tessjs_image_rectangle_left: 0,
tessjs_image_rectangle_top: 0,
tessjs_image_rectangle_width: 500,
tessjs_image_rectangle_height: 250,
}
)
.progress((p) => {
console.log('progress', p);
})
.then(({ text }) => {
console.log(text);
worker.terminate();
});
import { createWorker, createScheduler } from 'tesseract.js';
const scheduler = createScheduler();
const worker1 = createWorker();
const worker2 = createWorker();
(async () => {
await worker1.load();
await worker2.load();
await worker1.loadLanguage('eng');
await worker2.loadLanguage('eng');
await worker1.initialize('eng');
await worker2.initialize('eng');
scheduler.addWorker(worker1);
scheduler.addWorker(worker2);
/** Add 10 recognition jobs */
const results = await Promise.all(Array(10).fill(0).map(() => (
await scheduler.addJob('recognize', 'https://tesseract.projectnaptha.com/img/eng_bw.png')
)))
console.log(results);
await scheduler.terminate(); // It also terminates all workers.
})();
```

38
docs/faq.md

@ -3,9 +3,9 @@ FAQ @@ -3,9 +3,9 @@ FAQ
## How does tesseract.js download and keep \*.traineddata?
When you execute recognize() function (ex: `recognize(image, 'eng')`), the language model to download is determined by the 2nd argument of recognize(). (`eng` in the example)
The language model is downloaded by `worker.loadLanguage()` and you need to pass the langs to `worker.initialize()`.
Tesseract.js will first check if \*.traineddata already exists. (browser: [IndexedDB](https://developer.mozilla.org/en-US/docs/Web/API/IndexedDB_API), Node.js: fs, in the folder you execute the command) If the \*.traineddata doesn't exist, it will fetch \*.traineddata.gz from [tessdata](https://github.com/naptha/tessdata), ungzip and store in IndexedDB or fs, you can delete it manually and it will download again for you.
During the downloading of language model, Tesseract.js will first check if \*.traineddata already exists. (browser: [IndexedDB](https://developer.mozilla.org/en-US/docs/Web/API/IndexedDB_API), Node.js: fs, in the folder you execute the command) If the \*.traineddata doesn't exist, it will fetch \*.traineddata.gz from [tessdata](https://github.com/naptha/tessdata), ungzip and store in IndexedDB or fs, you can delete it manually and it will download again for you.
## How can I train my own \*.traineddata?
@ -15,26 +15,28 @@ For tesseract.js v1, check [Training Tesseract 3.03–3.05](https://github.com/t @@ -15,26 +15,28 @@ For tesseract.js v1, check [Training Tesseract 3.03–3.05](https://github.com/t
## How can I get HOCR, TSV, Box, UNLV, OSD?
Starting from 2.0.0-alpha.10, you can get all these information in the final result.
Starting from 2.0.0-beta.1, you can get all these information in the final result.
```javascript
import Tesseract from 'tesseract.js';
const { TesseractWorker } = Tesseract;
const worker = new TesseractWorker();
worker
.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png', 'eng', {
import { createWorker } from 'tesseract.js';
const worker = createWorker({
logger: m => console.log(m)
});
(async () => {
await worker.load();
await worker.loadLanguage('eng');
await worker.initialize('eng');
await worker.setParameters({
tessedit_create_box: '1',
tessedit_create_unlv: '1',
tessedit_create_osd: '1',
})
.then((result) => {
console.log(result.text);
console.log(result.hocr);
console.log(result.tsv);
console.log(result.box);
console.log(result.unlv);
console.log(result.osd);
});
const { data: { text, hocr, tsv, box, unlv } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
console.log(text);
console.log(hocr);
console.log(tsv);
console.log(box);
console.log(unlv);
})();
```

20
docs/local-installation.md

@ -9,10 +9,20 @@ Because of this we recommend loading `tesseract.js` from a CDN. But if you reall @@ -9,10 +9,20 @@ Because of this we recommend loading `tesseract.js` from a CDN. But if you reall
In Node.js environment, the only path you may want to customize is languages/langPath.
```javascript
const worker = Tesseract.TesseractWorker({
workerPath: 'https://unpkg.com/tesseract.js@v2.0.0-alpha.13/dist/worker.min.js',
Tesseract.recognize(image, langs, {
workerPath: 'https://unpkg.com/tesseract.js@v2.0.0-beta.1/dist/worker.min.js',
langPath: 'https://tessdata.projectnaptha.com/4.0.0',
corePath: 'https://unpkg.com/tesseract.js-core@v2.0.0-beta.10/tesseract-core.wasm.js',
corePath: 'https://unpkg.com/tesseract.js-core@v2.0.0-beta.13/tesseract-core.wasm.js',
})
```
Or
```javascript
const worker = createWorker({
workerPath: 'https://unpkg.com/tesseract.js@v2.0.0-beta.1/dist/worker.min.js',
langPath: 'https://tessdata.projectnaptha.com/4.0.0',
corePath: 'https://unpkg.com/tesseract.js-core@v2.0.0-beta.13/tesseract-core.wasm.js',
});
```
@ -23,6 +33,6 @@ A string specifying the location of the [worker.js](./dist/worker.min.js) file. @@ -23,6 +33,6 @@ A string specifying the location of the [worker.js](./dist/worker.min.js) file.
A string specifying the location of the tesseract language files, with default value 'https://tessdata.projectnaptha.com/4.0.0'. Language file URLs are calculated according to the formula `langPath + langCode + '.traineddata.gz'`.
### corePath
A string specifying the location of the [tesseract.js-core library](https://github.com/naptha/tesseract.js-core), with default value 'https://unpkg.com/tesseract.js-core@v2.0.0-beta.10/tesseract-core.wasm.js' (fallback to tesseract-core.asm.js when WebAssembly is not available).
A string specifying the location of the [tesseract.js-core library](https://github.com/naptha/tesseract.js-core), with default value 'https://unpkg.com/tesseract.js-core@v2.0.0-beta.13/tesseract-core.wasm.js' (fallback to tesseract-core.asm.js when WebAssembly is not available).
Another WASM option is 'https://unpkg.com/tesseract.js-core@v2.0.0-beta.10/tesseract-core.js' which is a script that loads 'https://unpkg.com/tesseract.js-core@v2.0.0-beta.10/tesseract-core.wasm'. But it fails to fetch at this moment.
Another WASM option is 'https://unpkg.com/tesseract.js-core@v2.0.0-beta.13/tesseract-core.js' which is a script that loads 'https://unpkg.com/tesseract.js-core@v2.0.0-beta.13/tesseract-core.wasm'. But it fails to fetch at this moment.

15
docs/tesseract_parameters.md

@ -1,12 +1,14 @@ @@ -1,12 +1,14 @@
Tesseract.js Parameters
=======================
In the 3rd argument of `TesseractWorker.recognize()`, you can pass a params object to customize the result of OCR, below are supported parameters in tesseract.js so far.
When initializing
In the 3rd argument of `ecognize()`, you can pass a params object to customize the result of OCR, below are supported parameters in tesseract.js so far.
Example:
```javascript
import Tesseract from 'tesseract.js';
import { createWorker, OEM, PSM } from 'tesseract.js';
const { TesseractWorker, OEM, PSM } = Tesseract;
const worker = new TesseractWorker();
@ -24,17 +26,8 @@ worker @@ -24,17 +26,8 @@ worker
| tessedit\_ocr\_engine\_mode | enum | OEM.LSTM\_ONLY | Check [HERE](https://github.com/tesseract-ocr/tesseract/blob/4.0.0/src/ccstruct/publictypes.h#L268) for definition of each mode |
| tessedit\_pageseg\_mode | enum | PSM.SINGLE\_BLOCK | Check [HERE](https://github.com/tesseract-ocr/tesseract/blob/4.0.0/src/ccstruct/publictypes.h#L163) for definition of each mode |
| tessedit\_char\_whitelist | string | '' | setting white list characters makes the result only contains these characters, useful the content in image is limited |
| tessjs\_create\_pdf | string | '0' | only 2 values, '0' or '1', when the value is '1', tesseract.js generates a pdf output |
| tessjs\_create\_hocr | string | '1' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes hocr in the result |
| tessjs\_create\_tsv | string | '1' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes tsv in the result |
| tessjs\_create\_box | string | '0' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes box in the result |
| tessjs\_create\_unlv | string | '0' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes unlv in the result |
| tessjs\_create\_osd | string | '0' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes osd in the result |
| tessjs\_pdf\_name | string | 'tesseract.js-ocr-result' | the name of the generated pdf file |
| tessjs\_pdf\_title | string | 'Tesseract.js OCR Result' | the title of the generated pdf file |
| tessjs\_pdf\_auto\_download | boolean | true | If the value is true, tesseract.js will automatic download/writeFile pdf file |
| tessjs\_pdf\_bin | boolean | false | whether to include pdf binary array in the result object (result.files.pdf) |
| tessjs\_image\_rectangle\_left | number | 0 | The left of the sub-rectangle of the image. |
| tessjs\_image\_rectangle\_top | number | 0 | The top of the sub-rectangle of the image. |
| tessjs\_image\_rectangle\_width | number | -1 | The width of the sub-rectangle of the image, -1 means auto width detection |
| tessjs\_image\_rectangle\_height | number | -1 | The height of the sub-rectangle of the image, -1 means auto height detection |

Loading…
Cancel
Save