Pure Javascript OCR for more than 100 Languages 📖🎉🖥

API


createWorker(options): Worker

createWorker is a factory function that creates a Tesseract worker. A worker is essentially a Web Worker in the browser and a Child Process in Node.

Arguments:

  • options an object of customized options
    • corePath path for tesseract-core.js script
    • langPath path for downloading traineddata, do not include / at the end of the path
    • workerPath path for downloading worker script
    • dataPath path for saving traineddata in the WebAssembly file system, rarely needs to be changed
    • cachePath path for the cached traineddata, more useful for Node; in the browser it only changes the key in IndexedDB
    • cacheMethod a string to indicate the method of cache management, should be one of the following options
      • write: read the cache and write back (default method)
      • readOnly: read the cache but do not write back
      • refresh: do not read the cache, but write back
      • none: neither read the cache nor write back
    • workerBlobURL a boolean to define whether to use Blob URL for worker script, default: true
    • gzip a boolean to define whether the traineddata from the remote is gzipped, default: true
    • logger a function to log the progress, a quick example is m => console.log(m)

Examples:

const { createWorker } = Tesseract;
const worker = createWorker({
  langPath: '...',
  logger: m => console.log(m),
});
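
The four cacheMethod values described above can be read as combinations of two independent decisions: whether to read the cache and whether to write back to it. The following is a minimal sketch of that mapping. The names CACHE_POLICIES and cachePolicy are hypothetical and are not part of Tesseract.js; the sketch only restates the option list and does not reflect how the library implements caching internally.

```javascript
// Hypothetical lookup table (not part of Tesseract.js): each documented
// cacheMethod value expressed as a read decision and a write decision.
const CACHE_POLICIES = {
  write: { read: true, write: true }, // default method
  readOnly: { read: true, write: false },
  refresh: { read: false, write: true },
  none: { read: false, write: false },
};

// Returns the read/write decisions for a cacheMethod value and rejects
// any value that is not one of the four documented options.
function cachePolicy(cacheMethod = 'write') {
  if (!Object.prototype.hasOwnProperty.call(CACHE_POLICIES, cacheMethod)) {
    throw new Error(`Unknown cacheMethod: ${cacheMethod}`);
  }
  return CACHE_POLICIES[cacheMethod];
}

console.log(cachePolicy('refresh')); // { read: false, write: true }
```

Under this reading, refresh is the option to pick when the cached traineddata may be stale, and readOnly suits environments where the cache location must not be modified.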

Worker

A Worker helps you perform OCR-related tasks. It takes a few steps to set up a Worker before it is fully functional. The full flow is:

  • load
  • loadLanguage
  • initialize
  • setParameters // optional
  • recognize or detect
  • terminate

Each function is async, so using async/await or Promise is required. When it is resolved, you get an object:

{
  "jobId": "Job-1-123",
  "data": { ... }
}

jobId is generated by Tesseract.js, but you can supply your own when calling any of the functions above.
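
The flow above can be sketched as a single helper that runs the steps in order and always terminates the worker. This is an illustrative sketch, not an official recipe: runOcr is a hypothetical name, and worker is assumed to be any object that exposes the Worker methods documented in this section (for example, the value returned by createWorker()).

```javascript
// Hypothetical helper: runs the documented steps in order on `worker`.
// `image` and `lang` are supplied by the caller.
async function runOcr(worker, image, lang = 'eng') {
  try {
    await worker.load();
    await worker.loadLanguage(lang);
    await worker.initialize(lang);
    // Each call resolves to { jobId, data }; only data.text is kept here.
    const { jobId, data } = await worker.recognize(image);
    return { jobId, text: data.text };
  } finally {
    // Release the Web Worker / Child Process even if a step above failed.
    await worker.terminate();
  }
}
```

Placing terminate() in a finally block is a design choice of this sketch: it keeps a failed download or recognition from leaving a worker running.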

Worker.load(jobId): Promise

Worker.load() loads the tesseract.js-core scripts (downloading them from the remote if they are not present) and makes the Web Worker/Child Process ready for the next action.

Arguments:

  • jobId Please see details above

Examples:

(async () => {
  await worker.load();
})();

Worker.loadLanguage(langs, jobId): Promise

Worker.loadLanguage() loads traineddata from the cache or downloads it from the remote, then puts the traineddata into the WebAssembly file system.

Arguments:

  • langs a string to indicate the language traineddata to download; multiple languages are concatenated with +, e.g. eng+chi_tra
  • jobId Please see details above

Examples:

(async () => {
  await worker.loadLanguage('eng+chi_tra');
})();
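
When the list of languages is built at runtime, the + separated string can be assembled from an array. The sketch below assumes a hypothetical helper named joinLangs, which is not part of Tesseract.js; it only produces the string format described above.

```javascript
// Hypothetical helper: builds the `langs` string from an array of codes.
function joinLangs(codes) {
  if (!Array.isArray(codes) || codes.length === 0) {
    throw new Error('Expected a non-empty array of language codes');
  }
  // Drop duplicates while keeping the original order.
  return [...new Set(codes)].join('+');
}

console.log(joinLangs(['eng', 'chi_tra'])); // eng+chi_tra
```

The result can be passed directly, as in worker.loadLanguage(joinLangs(['eng', 'chi_tra'])).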

Worker.initialize(langs, oem, jobId): Promise

Worker.initialize() initializes the Tesseract API and makes sure it is ready for OCR tasks.

Arguments:

  • langs a string to indicate the languages loaded by the Tesseract API; it can be a subset of the language traineddata you loaded with Worker.loadLanguage.
  • oem an enum to indicate the OCR Engine Mode you use
  • jobId Please see details above

Examples:

(async () => {
  /** You can load more languages in advance, but use only part of them in Worker.initialize() */
  await worker.loadLanguage('eng+chi_tra');
  await worker.initialize('eng');
})();
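
Because the languages passed to Worker.initialize() should be a subset of those already loaded, a quick check before initializing can catch a missing language early. The sketch below is an assumption-laden illustration: missingLangs is a hypothetical helper, both arguments are assumed to use the + separated format, and 'deu' is used only as an example code.

```javascript
// Hypothetical helper: lists the codes requested for initialize() that
// were not included in the string passed to loadLanguage().
function missingLangs(initLangs, loadedLangs) {
  const loaded = new Set(loadedLangs.split('+'));
  return initLangs.split('+').filter((code) => !loaded.has(code));
}

console.log(missingLangs('eng', 'eng+chi_tra')); // []
console.log(missingLangs('eng+deu', 'eng+chi_tra')); // [ 'deu' ]
```

An empty array means every requested language has been loaded.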

Worker.setParameters(params, jobId): Promise

Worker.setParameters() sets parameters for the Tesseract API (using SetVariable()). It changes the behavior of Tesseract, and some parameters, such as tessedit_char_whitelist, are very useful.

Arguments:

  • params an object with key and value of the parameters
  • jobId Please see details above

Supported Parameters:

| name | type | default value | description |
| --- | --- | --- | --- |
| tessedit_ocr_engine_mode | enum | OEM.DEFAULT | Check HERE for definition of each mode |
| tessedit_pageseg_mode | enum | PSM.SINGLE_BLOCK | Check HERE for definition of each mode |
| tessedit_char_whitelist | string | '' | setting whitelist characters makes the result contain only these characters, useful when the content in the image is limited |
| preserve_interword_spaces | string | '0' | '0' or '1', keeps the space between words |
| tessjs_create_hocr | string | '1' | only 2 values, '0' or '1'; when the value is '1', tesseract.js includes hocr in the result |
| tessjs_create_tsv | string | '1' | only 2 values, '0' or '1'; when the value is '1', tesseract.js includes tsv in the result |
| tessjs_create_box | string | '0' | only 2 values, '0' or '1'; when the value is '1', tesseract.js includes box in the result |
| tessjs_create_unlv | string | '0' | only 2 values, '0' or '1'; when the value is '1', tesseract.js includes unlv in the result |
| tessjs_create_osd | string | '0' | only 2 values, '0' or '1'; when the value is '1', tesseract.js includes osd in the result |

Examples:

(async () => {
  await worker.setParameters({
    tessedit_char_whitelist: '0123456789',
  });
})();
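
Since the tessjs_create_* parameters are strings ('0' or '1') rather than booleans, it can help to build the params object from booleans in one place. The following sketch assumes a hypothetical helper named outputFlags, which is not part of Tesseract.js; it only produces an object in the shape described above.

```javascript
// The five tessjs_create_* switches documented above.
const OUTPUT_NAMES = ['hocr', 'tsv', 'box', 'unlv', 'osd'];

// Hypothetical helper: turns booleans into the '0' / '1' strings that the
// tessjs_create_* parameters expect.
function outputFlags(outputs) {
  const flags = {};
  for (const [name, enabled] of Object.entries(outputs)) {
    if (!OUTPUT_NAMES.includes(name)) {
      throw new Error(`Unknown output format: ${name}`);
    }
    flags[`tessjs_create_${name}`] = enabled ? '1' : '0';
  }
  return flags;
}

const params = {
  tessedit_char_whitelist: '0123456789', // digits only
  ...outputFlags({ hocr: false, tsv: false }),
};
console.log(params);
// Pass the object to the worker with: await worker.setParameters(params);
```

Turning off hocr and tsv here is only an example of overriding the documented defaults when those outputs are not needed.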

Worker.recognize(image, options, jobId): Promise

Worker.recognize() provides the core function of Tesseract.js: it executes OCR, figuring out which words are in the image, where the words are in the image, etc.

Note: image should be sufficiently high resolution. Often, the same image will get much better results if you upscale it before calling recognize.

Arguments:

  • image see Image Format for more details.
  • options an object of customized options
    • rectangles an array of objects to specify the regions you want to recognize in the image; each object should contain top, left, width and height, see the example below.
  • jobId Please see details above

Output:

Examples:

const { createWorker } = Tesseract;
(async () => {
  const worker = createWorker();
  await worker.load();
  await worker.loadLanguage('eng');
  await worker.initialize('eng');
  const { data: { text } } = await worker.recognize(image);
  console.log(text);
})();

With rectangles

const { createWorker } = Tesseract;
(async () => {
  const worker = createWorker();
  await worker.load();
  await worker.loadLanguage('eng');
  await worker.initialize('eng');
  const { data: { text } } = await worker.recognize(image, {
    rectangles: [
      { top: 0, left: 0, width: 100, height: 100 },
    ],
  });
  console.log(text);
})();
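
Regions are often available as corner coordinates rather than as top, left, width and height. The sketch below converts one form to the other. It assumes a hypothetical helper named toRectangle and a hypothetical input shape { x0, y0, x1, y1 } in pixels; neither is defined by this document.

```javascript
// Hypothetical helper: converts two opposite corners (in pixels) into the
// { top, left, width, height } object used in the rectangles option.
function toRectangle({ x0, y0, x1, y1 }) {
  return {
    top: Math.min(y0, y1),
    left: Math.min(x0, x1),
    width: Math.abs(x1 - x0),
    height: Math.abs(y1 - y0),
  };
}

const options = {
  rectangles: [toRectangle({ x0: 10, y0: 20, x1: 110, y1: 70 })],
};
console.log(options.rectangles[0]);
// { top: 20, left: 10, width: 100, height: 50 }
```

The resulting options object can then be passed as the second argument of worker.recognize(image, options).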

Worker.detect(image, jobId): Promise

Worker.detect() performs OSD (Orientation and Script Detection) on the image instead of OCR.

Arguments:

  • image see Image Format for more details.
  • jobId Please see details above

Examples:

const { createWorker } = Tesseract;
(async () => {
  const worker = createWorker();
  await worker.load();
  await worker.loadLanguage('eng');
  await worker.initialize('eng');
  const { data } = await worker.detect(image);
  console.log(data);
})();

Worker.terminate(jobId): Promise

Worker.terminate() terminates the worker and cleans up.

Arguments:

  • jobId Please see details above

Examples:

(async () => {
  await worker.terminate();
})();

createScheduler(): Scheduler

createScheduler() is a factory function that creates a scheduler. A scheduler manages a job queue and a pool of workers so that multiple workers can work together, which is useful when you want to speed up processing.

Examples:

const { createScheduler } = Tesseract;
const scheduler = createScheduler();

Scheduler

Scheduler.addWorker(worker): string

Scheduler.addWorker() adds a worker to the worker pool inside the scheduler. It is suggested to add each worker to only one scheduler.

Arguments:

  • worker see Worker above

Examples:

const { createWorker, createScheduler } = Tesseract;
const scheduler = createScheduler();
const worker = createWorker();
scheduler.addWorker(worker);

Scheduler.addJob(action, ...payload): Promise

Scheduler.addJob() adds a job to the job queue; the scheduler then waits for an idle worker to take the job.

Arguments:

  • action a string to indicate the action you want to perform; right now only recognize and detect are supported
  • payload an arbitrary number of arguments, depending on the action you called.

Examples:

(async () => {
  const { data: { text } } = await scheduler.addJob('recognize', image, options);
  const { data } = await scheduler.addJob('detect', image);
})();
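
A scheduler pays off when several jobs are queued at once instead of awaited one by one. The sketch below shows that pattern. It assumes a hypothetical helper named recognizeAll and a scheduler that already has initialized workers added to it; it is an illustration of the idea, not the library's own batching API.

```javascript
// Hypothetical helper: queues one 'recognize' job per image and resolves
// to the recognized texts in the same order as `images`.
async function recognizeAll(scheduler, images) {
  // Queue every job up front so idle workers can pick them up in parallel.
  const jobs = images.map((image) => scheduler.addJob('recognize', image));
  // Promise.all keeps the input order, regardless of which job ends first.
  const results = await Promise.all(jobs);
  return results.map(({ data }) => data.text);
}
```

With a single worker in the pool the jobs still complete, only without any speed-up over sequential calls.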

Scheduler.getQueueLen(): number

Scheduler.getQueueLen() returns the length of the job queue.

Scheduler.getNumWorkers(): number

Scheduler.getNumWorkers() returns the number of workers added to the scheduler.

Scheduler.terminate(): Promise

Scheduler.terminate() terminates all workers that were added, which is useful for a quick cleanup.

Examples:

(async () => {
  await scheduler.terminate();
})();

setLogging(logging: boolean)

setLogging() sets the logging flag. Call setLogging(true) to see detailed information, which is useful for debugging.

Arguments:

  • logging boolean to define whether to see detailed logs, default: false

Examples:

const { setLogging } = Tesseract;
setLogging(true);

recognize(image, langs, options): Promise

recognize() is a shortcut for running a recognize() task quickly. It is not recommended for real applications, but it is useful when you want to save some time.

See Tesseract.js

detect(image, options): Promise

Same background as recognize(), but it performs detect instead.

See Tesseract.js

PSM

See PSM.js

OEM

See OEM.js