# API

- [createWorker()](#create-worker)
  - [Worker.load](#worker-load)
  - [Worker.loadLanguage](#worker-load-language)
  - [Worker.initialize](#worker-initialize)
  - [Worker.setParameters](#worker-set-parameters)
  - [Worker.recognize](#worker-recognize)
  - [Worker.detect](#worker-detect)
  - [Worker.terminate](#worker-terminate)
- [createScheduler()](#create-scheduler)
  - [Scheduler.addWorker](#scheduler-add-worker)
  - [Scheduler.addJob](#scheduler-add-job)
  - [Scheduler.getQueueLen](#scheduler-get-queue-len)
  - [Scheduler.getNumWorkers](#scheduler-get-num-workers)
- [setLogging()](#set-logging)
- [recognize()](#recognize)
- [detect()](#detect)
- [PSM](#psm)
- [OEM](#oem)

---

## createWorker(options): Worker

createWorker is a factory function that creates a Tesseract worker. A worker is essentially a Web Worker in the browser and a Child Process in Node.

**Arguments:**

- `options` an object of customized options
  - `corePath` path to the tesseract-core.js script
  - `langPath` path for downloading traineddata; do not include a trailing `/`
  - `workerPath` path for downloading the worker script
  - `dataPath` path for saving traineddata in the WebAssembly file system; rarely needs to be changed
  - `cachePath` path for the cached traineddata; more useful in Node, in the browser it only changes the key in IndexedDB
  - `cacheMethod` a string indicating the cache management method, one of the following:
    - write: read the cache and write back (default method)
    - readOnly: read the cache and do not write back
    - refresh: do not read the cache, but write back
    - none: do not read the cache and do not write back
  - `workerBlobURL` a boolean defining whether to use a Blob URL for the worker script, default: true
  - `gzip` a boolean defining whether the traineddata from the remote is gzipped, default: true
  - `logger` a function to log the progress, a quick example is `m => console.log(m)`

**Examples:**

```javascript
const { createWorker } = Tesseract;
const worker = createWorker({
  langPath: '...',
  logger: m => console.log(m),
});
```

## Worker

A Worker performs the OCR-related tasks. It takes a few steps to set up a Worker before it is fully functional. The full flow is:

- load
- loadLanguage
- initialize
- setParameters // optional
- recognize or detect
- terminate

Each function is async, so using async/await or Promises is required. When the promise resolves, you get an object:

```json
{
  "jobId": "Job-1-123",
  "data": { ... }
}
```

jobId is generated by Tesseract.js, but you can supply your own when calling any of the functions above.

### Worker.load(jobId): Promise

Worker.load() loads the tesseract.js-core scripts (downloading them from remote if not present) and makes the Web Worker/Child Process ready for the next action.

**Arguments:**

- `jobId` Please see details above

**Examples:**

```javascript
(async () => {
  await worker.load();
})();
```

### Worker.loadLanguage(langs, jobId): Promise

Worker.loadLanguage() loads traineddata from the cache, or downloads it from remote, and puts it into the WebAssembly file system.

**Arguments:**

- `langs` a string indicating the language traineddata to download; multiple languages are concatenated with **+**, ex: **eng+chi\_tra**
- `jobId` Please see details above

**Examples:**

```javascript
(async () => {
  await worker.loadLanguage('eng+chi_tra');
})();
```

### Worker.initialize(langs, oem, jobId): Promise

Worker.initialize() initializes the Tesseract API and makes sure it is ready for OCR tasks.

**Arguments:**

- `langs` a string indicating the languages loaded by the Tesseract API; it can be a subset of the language traineddata you loaded with Worker.loadLanguage.
- `oem` an enum indicating the OCR Engine Mode to use
- `jobId` Please see details above

**Examples:**

```javascript
(async () => {
  /** You can load more languages in advance, but use only part of them in Worker.initialize() */
  await worker.loadLanguage('eng+chi_tra');
  await worker.initialize('eng');
})();
```

### Worker.setParameters(params, jobId): Promise

Worker.setParameters() sets parameters for the Tesseract API (using SetVariable()). It changes the behavior of Tesseract, and some parameters, such as tessedit\_char\_whitelist, are very useful.

**Arguments:**

- `params` an object with the keys and values of the parameters
- `jobId` Please see details above

**Supported Parameters:**

| name | type | default value | description |
| ---- | ---- | ------------- | ----------- |
| tessedit\_ocr\_engine\_mode | enum | OEM.LSTM\_ONLY | Check [HERE](https://github.com/tesseract-ocr/tesseract/blob/4.0.0/src/ccstruct/publictypes.h#L268) for the definition of each mode |
| tessedit\_pageseg\_mode | enum | PSM.SINGLE\_BLOCK | Check [HERE](https://github.com/tesseract-ocr/tesseract/blob/4.0.0/src/ccstruct/publictypes.h#L163) for the definition of each mode |
| tessedit\_char\_whitelist | string | '' | restricts the result to the listed characters, useful when the content of the image is limited |
| tessjs\_create\_hocr | string | '1' | only 2 values, '0' or '1'; when the value is '1', tesseract.js includes hocr in the result |
| tessjs\_create\_tsv | string | '1' | only 2 values, '0' or '1'; when the value is '1', tesseract.js includes tsv in the result |
| tessjs\_create\_box | string | '0' | only 2 values, '0' or '1'; when the value is '1', tesseract.js includes box in the result |
| tessjs\_create\_unlv | string | '0' | only 2 values, '0' or '1'; when the value is '1', tesseract.js includes unlv in the result |
| tessjs\_create\_osd | string | '0' | only 2 values, '0' or '1'; when the value is '1', tesseract.js includes osd in the result |

**Examples:**

```javascript
(async () => {
  await worker.setParameters({
    tessedit_char_whitelist: '0123456789',
  });
})();
```

### Worker.recognize(image, options, jobId): Promise

### Worker.detect(image, jobId): Promise

### Worker.terminate(jobId): Promise

## createScheduler(): Scheduler

### Scheduler.addWorker(worker): string

### Scheduler.addJob(worker): Promise

### Scheduler.getQueueLen(): number

Scheduler.getQueueLen() returns the length of the job queue.

### Scheduler.getNumWorkers(): number

Scheduler.getNumWorkers() returns the number of workers added to the scheduler.

### Scheduler.terminate(): Promise

Scheduler.terminate() terminates all workers added to the scheduler, useful for a quick clean-up.

**Examples:**

```javascript
(async () => {
  await scheduler.terminate();
})();
```

## setLogging(logging: boolean)

setLogging() sets the logging flag. Call `setLogging(true)` to see detailed information, which is useful for debugging.

**Arguments:**

- `logging` a boolean defining whether to show detailed logs, default: false

**Examples:**

```javascript
const { setLogging } = Tesseract;

setLogging(true);
```

## recognize(image, langs, options): Promise

recognize() is a shortcut for quickly running a recognition task. It is not recommended for real applications, but is useful when you want to save some time.

See [Tesseract.js](../src/Tesseract.js)

## detect(image, options): Promise

Same background as recognize(), but it performs detection instead.

See [Tesseract.js](../src/Tesseract.js)

## PSM

See [PSM.js](../src/constatns/PSM.js)

## OEM

See [OEM.js](../src/constatns/OEM.js)

## TesseractWorker.recognize(image, lang, [, options]) -> [TesseractJob](#tesseractjob)

Figures out what words are in `image`, where the words are in `image`, etc.

> Note: `image` should be sufficiently high resolution.
> Often, the same image will get much better results if you upscale it before calling `recognize`.

- `image` see [Image Format](./image-format.md) for more details.
- `lang` a value from the [list of lang parameters](./tesseract_lang_list.md); you can use multiple languages separated by '+', ex. `eng+chi_tra`
- `options` a flat JSON object that may include properties that override some subset of the [default tesseract parameters](./tesseract_parameters.md)

Returns a [TesseractJob](#tesseractjob) whose `then`, `progress`, `catch` and `finally` methods can be used to act on the result.

### Simple Example:

```javascript
const worker = new Tesseract.TesseractWorker();

worker
  .recognize(myImage)
  .then(function(result){
    console.log(result);
  });
```

### More Complicated Example:

```javascript
const worker = new Tesseract.TesseractWorker();

// if we know our image is of spanish words without the letter 'e':
worker
  .recognize(myImage, 'spa', {
    tessedit_char_blacklist: 'e',
  })
  .then(function(result){
    console.log(result);
  });
```

## TesseractWorker.detect(image) -> [TesseractJob](#tesseractjob)

Figures out what script (e.g. 'Latin', 'Chinese') the words in `image` are written in.

- `image` see [Image Format](./image-format.md) for more details.

Returns a [TesseractJob](#tesseractjob) whose `then`, `progress`, `catch` and `finally` methods can be used to act on the result of the script detection.

```javascript
const worker = new Tesseract.TesseractWorker();

worker
  .detect(myImage)
  .then(function(result){
    console.log(result);
  });
```

## TesseractJob

A TesseractJob is an object returned by a call to `recognize` or `detect`. It is inspired by the ES6 Promise interface and provides `then` and `catch` methods. It also provides a `finally` method, which is fired regardless of the job's fate. One important difference is that these methods return the job itself (to enable chaining) rather than a new promise.
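The difference from ES6 Promises can be illustrated without the library. The sketch below uses a hypothetical `MiniJob` class (an assumption for illustration only, not Tesseract.js code and not its actual implementation) whose `then`, `catch` and `finally` methods store the callback and return the job itself, so a whole chain evaluates to the same object:

```javascript
// Hypothetical stand-in for a job object; NOT the library's implementation.
class MiniJob {
  constructor() {
    this.handlers = { then: [], catch: [], finally: [] };
  }

  // Each registration method stores the callback and returns the job itself.
  then(callback) { this.handlers.then.push(callback); return this; }
  catch(callback) { this.handlers.catch.push(callback); return this; }
  finally(callback) { this.handlers.finally.push(callback); return this; }

  // Settle the job: run the success or failure callbacks, then the finally callbacks.
  settle(error, result) {
    const primary = error ? this.handlers.catch : this.handlers.then;
    const value = error || result;
    primary.forEach(cb => cb(value));
    this.handlers.finally.forEach(cb => cb(value));
  }
}

const job = new MiniJob();
const seen = [];

const chained = job
  .then(result => seen.push(`then:${result.text}`))
  .catch(err => seen.push(`catch:${err.message}`))
  .finally(() => seen.push('finally'));

job.settle(null, { text: 'hello' });

console.log(chained === job); // true: the chain is the same job object
console.log(seen);            // the 'then' entry followed by 'finally'
```

With a native Promise, each of these calls would return a new promise instead of the original object.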
Typical use is:

```javascript
const worker = new Tesseract.TesseractWorker();

worker.recognize(myImage)
  .progress(message => console.log(message))
  .catch(err => console.error(err))
  .then(result => console.log(result))
  .finally(resultOrError => console.log(resultOrError));
```

Which is equivalent to:

```javascript
const worker = new Tesseract.TesseractWorker();

const job1 = worker.recognize(myImage);

job1.progress(message => console.log(message));

job1.catch(err => console.error(err));

job1.then(result => console.log(result));

job1.finally(resultOrError => console.log(resultOrError));
```

### TesseractJob.progress(callback: function) -> TesseractJob

Sets `callback` as the function that will be called every time the job progresses.

- `callback` is a function with the signature `callback(progress)` where `progress` is a JSON object.

For example:

```javascript
const worker = new Tesseract.TesseractWorker();

worker.recognize(myImage)
  .progress(function(message){console.log('progress is: ', message)});
```

The console will show something like:

```javascript
progress is: {loaded_lang_model: "eng", from_cache: true}
progress is: {initialized_with_lang: "eng"}
progress is: {set_variable: Object}
progress is: {set_variable: Object}
progress is: {recognized: 0}
progress is: {recognized: 0.3}
progress is: {recognized: 0.6}
progress is: {recognized: 0.9}
progress is: {recognized: 1}
```

### TesseractJob.then(callback: function) -> TesseractJob

Sets `callback` as the function that will be called if and when the job successfully completes.

- `callback` is a function with the signature `callback(result)` where `result` is a JSON object.

For example:

```javascript
const worker = new Tesseract.TesseractWorker();

worker.recognize(myImage)
  .then(function(result){console.log('result is: ', result)});
```

The console will show something like:

```javascript
result is: {
  blocks: Array[1]
  confidence: 87
  html: "
  ...
```

### TesseractJob.catch(callback: function) -> TesseractJob

Sets `callback` as the function that will be called if the job fails.

- `callback` is a function with the signature `callback(error)` where `error` is a JSON object.

### TesseractJob.finally(callback: function) -> TesseractJob

Sets `callback` as the function that will be called regardless of whether the job fails or succeeds.

- `callback` is a function with the signature `callback(resultOrError)` where `resultOrError` is a JSON object.
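Because these methods register callbacks rather than produce new promises, it can be convenient to adapt a job to a standard Promise, for instance to combine several jobs with `Promise.all`. The following is a minimal sketch under stated assumptions: `jobToPromise` and `makeStubJob` are hypothetical helpers, not part of Tesseract.js, and the stub only mimics the documented `then(callback)` and `catch(callback)` signatures so the example runs on its own.

```javascript
// Hypothetical helper: wrap a callback-style job in a standard Promise.
// Assumes the job exposes then(callback) and catch(callback) as documented above.
function jobToPromise(job) {
  return new Promise((resolve, reject) => {
    job.then(resolve);
    job.catch(reject);
  });
}

// Stub job used only for this illustration; it settles on a later timer tick.
function makeStubJob(error, result) {
  const onSuccess = [];
  const onFailure = [];
  const job = {
    then(cb) { onSuccess.push(cb); return job; },
    catch(cb) { onFailure.push(cb); return job; },
  };
  setTimeout(() => {
    // Call the failure callbacks if an error was given, otherwise the success ones.
    (error ? onFailure : onSuccess).forEach(cb => cb(error || result));
  }, 0);
  return job;
}

jobToPromise(makeStubJob(null, { text: 'hello' }))
  .then(result => console.log('recognized:', result.text))
  .catch(err => console.error('failed:', err.message))
  .finally(() => console.log('done'));
```

With a real job, the stub would be replaced by the object returned from `recognize` or `detect`; whether the wrapper is needed depends on how the job is consumed.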