From c41619f68018b75866632b3394a75d955e8b5248 Mon Sep 17 00:00:00 2001
From: Balearica
Date: Mon, 19 Sep 2022 19:12:04 -0700
Subject: [PATCH] Updated docs

---
 docs/api.md                | 12 +++++-------
 docs/examples.md           |  8 ++++----
 docs/faq.md                | 39 +++++++++-----------------------------
 docs/local-installation.md |  2 +-
 4 files changed, 19 insertions(+), 42 deletions(-)

diff --git a/docs/api.md b/docs/api.md
index 8e795b0..5b87b3b 100644
--- a/docs/api.md
+++ b/docs/api.md
@@ -206,7 +206,7 @@ Worker.setParameters() set parameters for Tesseract API (using SetVariable()), i
 - `params` an object with key and value of the parameters
 - `jobId` Please see details above
 
-**Supported Paramters:**
+**Useful Paramters:**
 
 | name | type | default value | description |
 | --------------------------- | ------ | ----------------- | ------------------------------------------------------------------------------------------------------------------------------- |
@@ -215,11 +215,8 @@ Worker.setParameters() set parameters for Tesseract API (using SetVariable()), i
 | tessedit\_char\_whitelist | string | '' | setting white list characters makes the result only contains these characters, useful the content in image is limited |
 | preserve\_interword\_spaces | string | '0' | '0' or '1', keeps the space between words |
 | user\_defined\_dpi | string | '' | Define custom dpi, use to fix **Warning: Invalid resolution 0 dpi. Using 70 instead.** |
-| tessjs\_create\_hocr | string | '1' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes hocr in the result |
-| tessjs\_create\_tsv | string | '1' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes tsv in the result |
-| tessjs\_create\_box | string | '0' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes box in the result |
-| tessjs\_create\_unlv | string | '0' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes unlv in the result |
-| tessjs\_create\_osd | string | '0' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes osd in the result |
+
+This list is incomplete. As Tesseract.js passes parameters to the Tesseract engine, all parameters supported by the underlying version of Tesseract should also be supported by Tesseract.js. (Note that parameters marked as “init only” in Tesseract documentation cannot be set by `setParameters` or `recognize`.)
 
 **Examples:**
 
@@ -243,8 +240,9 @@ Figures out what words are in `image`, where the words are in `image`, etc.
 **Arguments:**
 
 - `image` see [Image Format](./image-format.md) for more details.
-- `options` a object of customized options
+- `options` an object of customized options
   - `rectangle` an object to specify the regions you want to recognized in the image, should contain top, left, width and height, see example below.
+- `output` an object specifying which output formats to return (by default `text`, `blocks`, `hocr`, and `tsv` are returned)
 - `jobId` Please see details above
 
 **Output:**
diff --git a/docs/examples.md b/docs/examples.md
index 188fa13..11b5b9e 100644
--- a/docs/examples.md
+++ b/docs/examples.md
@@ -51,7 +51,7 @@ const worker = await createWorker();
   await worker.terminate();
 })();
 ```
-### with whitelist char (^2.0.0-beta.1)
+### with whitelist char
 
 ```javascript
 const { createWorker } = require('tesseract.js');
@@ -70,7 +70,7 @@ const worker = await createWorker();
 })();
 ```
 
-### with different pageseg mode (^2.0.0-beta.1)
+### with different pageseg mode
 
 Check here for more details of pageseg mode: https://github.com/tesseract-ocr/tesseract/blob/4.0.0/src/ccstruct/publictypes.h#L163
 
@@ -91,7 +91,7 @@ const worker = await createWorker();
 })();
 ```
 
-### with pdf output (^2.0.0-beta.1)
+### with pdf output
 
 Please check **examples** folder for details.
 
@@ -189,7 +189,7 @@ const rectangles = [
 })();
 ```
 
-### with multiple workers to speed up (^2.0.0-beta.1)
+### with multiple workers to speed up
 
 ```javascript
 const { createWorker, createScheduler } = require('tesseract.js');
diff --git a/docs/faq.md b/docs/faq.md
index 900ea7a..5f44824 100644
--- a/docs/faq.md
+++ b/docs/faq.md
@@ -1,6 +1,13 @@
 FAQ
 ===
 
+# Project
+## What is the scope of this project?
+Tesseract.js is the JavaScript/Webassembly port of the Tesseract OCR engine. We do not edit the underlying Tesseract recognition engine in any way. Therefore, if you encounter bugs caused by the Tesseract engine you may open an issue here for the purposes of raising awareness to other users, but fixing is outside the scope of this repository.
+
+If you encounter a Tesseract bug you would like to see fixed you should confirm the behavior is the same in the [main (CLI) version](https://github.com/tesseract-ocr/tesseract) of Tesseract and then open a Git Issue in that repository.
+
+# Trained Data
 ## How does tesseract.js download and keep \*.traineddata?
 
 The language model is downloaded by `worker.loadLanguage()` and you need to pass the langs to `worker.initialize()`.
@@ -9,33 +16,5 @@ During the downloading of language model, Tesseract.js will first check if \*.tr
 
 ## How can I train my own \*.traineddata?
 
-For tesseract.js v2, check [TrainingTesseract 4.00](https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00)
-
-For tesseract.js v1, check [Training Tesseract 3.03–3.05](https://tesseract-ocr.github.io/tessdoc/Training-Tesseract-3.03%E2%80%933.05)
-
-## How can I get HOCR, TSV, Box, UNLV, OSD?
-
-Starting from 2.0.0-beta.1, you can get all these information in the final result.
-
-```javascript
-import { createWorker } from 'tesseract.js';
-const worker = await createWorker({
-  logger: m => console.log(m)
-});
-
-(async () => {
-  await worker.loadLanguage('eng');
-  await worker.initialize('eng');
-  await worker.setParameters({
-    tessedit_create_box: '1',
-    tessedit_create_unlv: '1',
-    tessedit_create_osd: '1',
-  });
-  const { data: { text, hocr, tsv, box, unlv } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
-  console.log(text);
-  console.log(hocr);
-  console.log(tsv);
-  console.log(box);
-  console.log(unlv);
-})();
-```
+See the documentation from the main [Tesseract project](https://tesseract-ocr.github.io/tessdoc/) for training instructions.
+
diff --git a/docs/local-installation.md b/docs/local-installation.md
index f3fd35b..cb3eadc 100644
--- a/docs/local-installation.md
+++ b/docs/local-installation.md
@@ -33,6 +33,6 @@ A string specifying the location of the [worker.js](./dist/worker.min.js) file.
 A string specifying the location of the tesseract language files, with default value 'https://tessdata.projectnaptha.com/4.0.0'. Language file URLs are calculated according to the formula `langPath + langCode + '.traineddata.gz'`.
 
 ### corePath
-A string specifying the location of the [tesseract.js-core library](https://github.com/naptha/tesseract.js-core), with default value 'https://unpkg.com/tesseract.js-core@v2.0.0/tesseract-core.wasm.js' (fallback to tesseract-core.asm.js when WebAssembly is not available).
+A string specifying the location of the [tesseract.js-core library](https://github.com/naptha/tesseract.js-core), with default value 'https://unpkg.com/tesseract.js-core@v2.0.0/tesseract-core.wasm.js'. Another WASM option is 'https://unpkg.com/tesseract.js-core@v2.0.0/tesseract-core.js' which is a script that loads 'https://unpkg.com/tesseract.js-core@v2.0.0/tesseract-core.wasm'. But it fails to fetch at this moment.
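
Review note on the `langPath` context line in the docs/local-installation.md hunk: it gives the language file URL formula as `langPath + langCode + '.traineddata.gz'`. The sketch below illustrates that formula in isolation. The helper name `languageFileUrl` is hypothetical and not part of tesseract.js, and the inserted `/` separator is an assumption, since the formula as written does not show one while the default `langPath` has no trailing slash.

```javascript
// Hypothetical helper for illustration only; not a tesseract.js API.
// Follows the documented formula langPath + langCode + '.traineddata.gz'.
// Assumption: a '/' is added when langPath does not already end with one.
function languageFileUrl(langPath, langCode) {
  const base = langPath.endsWith('/') ? langPath : `${langPath}/`;
  return `${base}${langCode}.traineddata.gz`;
}

// Default langPath value quoted in the docs.
const defaultLangPath = 'https://tessdata.projectnaptha.com/4.0.0';
console.log(languageFileUrl(defaultLangPath, 'eng'));
```

If a self-hosted `langPath` is used, the same helper can be pointed at that location to check which file names the server needs to expose.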