diff --git a/README.md b/README.md index 53759ca..6a697e5 100644 --- a/README.md +++ b/README.md @@ -56,205 +56,13 @@ or # Documentation * [Examples](./docs/examples.md) -* [Tesseract.recognize](#tesseractrecognizeimage-imagelike-options---tesseractjob) - + [Simple Example](#simple-example) - + [More Complicated Example](#more-complicated-example) -* [Tesseract.detect](#tesseractdetectimage-imagelike---tesseractjob) -* [ImageLike](#imagelike) -* [TesseractJob](#tesseractjob) - + [TesseractJob.progress](#tesseractjobprogresscallback-function---tesseractjob) - + [TesseractJob.then](#tesseractjobthencallback-function---tesseractjob) - + [TesseractJob.catch](#tesseractjobcatchcallback-function---tesseractjob) - + [TesseractJob.finally](#tesseractjobfinallycallback-function---tesseractjob) -* [Local Installation](#local-installation) - + [corePath](#corepath) - + [workerPath](#workerpath) - + [langPath](#langpath) -* [Contributing](#contributing) - + [Development](#development) - + [Building Static Files](#building-static-files) - + [Send us a Pull Request!](#send-us-a-pull-request) +* [Image Format](./docs/image-format.md) +* [API](./docs/api.md) +* [Local Installation](./docs/local-installation.md) +# Contributing -## Tesseract.recognize(image: [ImageLike](#imagelike)[, options]) -> [TesseractJob](#tesseractjob) -Figures out what words are in `image`, where the words are in `image`, etc. -> Note: `image` should be sufficiently high resolution. -> Often, the same image will get much better results if you upscale it before calling `recognize`. - -- `image` is any [ImageLike](#imagelike) object. 
-- `options` is either absent (in which case it is interpreted as `'eng'`), a string specifing a language short code from the [language list](./docs/tesseract_lang_list.md), or a flat json object that may: - + include properties that override some subset of the [default tesseract parameters](./docs/tesseract_parameters.md) - + include a `lang` property with a value from the [list of lang parameters](./docs/tesseract_lang_list.md) - -Returns a [TesseractJob](#tesseractjob) whose `then`, `progress`, `catch` and `finally` methods can be used to act on the result. - -### Simple Example: -```javascript -Tesseract.recognize(myImage) -.then(function(result){ - console.log(result) -}) -``` - -### More Complicated Example: -```javascript -// if we know our image is of spanish words without the letter 'e': -Tesseract.recognize(myImage, { - langs: 'spa', - tessedit_char_blacklist: 'e' -}) -.then(function(result){ - console.log(result) -}) -``` - - - - -## Tesseract.detect(image: [ImageLike](#imagelike)) -> [TesseractJob](#tesseractjob) - -Figures out what script (e.g. 'Latin', 'Chinese') the words in image are written in. - -- `image` is any [ImageLike](#imagelike) object. - -Returns a [TesseractJob](#tesseractjob) whose `then`, `progress`, `catch` and `finally` methods can be used to act on the result of the script. - - -```javascript -Tesseract.detect(myImage) -.then(function(result){ - console.log(result) -}) -``` - - -## ImageLike - -The main Tesseract.js functions take an `image` parameter, which should be something that is like an image. What's considered "image-like" differs depending on whether it is being run from the browser or through NodeJS. 
- -On a browser, an image can be: -- an `img`, `video`, or `canvas` element -- a `File` object (from a file `` or drag-drop event) -- a path or URL to an accessible image (the image must either be hosted locally) - -In Node.js, an image can be -- a path to a local image - - -## TesseractJob - -A TesseractJob is an object returned by a call to `recognize` or `detect`. It's inspired by the ES6 Promise interface and provides `then` and `catch` methods. It also provides `finally` method, which will be fired regardless of the job fate. One important difference is that these methods return the job itself (to enable chaining) rather than new. - -Typical use is: -```javascript -Tesseract.recognize(myImage) - .progress(message => console.log(message)) - .catch(err => console.error(err)) - .then(result => console.log(result)) - .finally(resultOrError => console.log(resultOrError)) -``` - -Which is equivalent to: -```javascript -var job1 = Tesseract.recognize(myImage); - -job1.progress(message => console.log(message)); - -job1.catch(err => console.error(err)); - -job1.then(result => console.log(result)); - -job1.finally(resultOrError => console.log(resultOrError)); -``` - - - -### TesseractJob.progress(callback: function) -> TesseractJob -Sets `callback` as the function that will be called every time the job progresses. -- `callback` is a function with the signature `callback(progress)` where `progress` is a json object. 
- -For example: -```javascript -Tesseract.recognize(myImage) - .progress(function(message){console.log('progress is: ', message)}) -``` - -The console will show something like: -```javascript -progress is: {loaded_lang_model: "eng", from_cache: true} -progress is: {initialized_with_lang: "eng"} -progress is: {set_variable: Object} -progress is: {set_variable: Object} -progress is: {recognized: 0} -progress is: {recognized: 0.3} -progress is: {recognized: 0.6} -progress is: {recognized: 0.9} -progress is: {recognized: 1} -``` - - -### TesseractJob.then(callback: function) -> TesseractJob -Sets `callback` as the function that will be called if and when the job successfully completes. -- `callback` is a function with the signature `callback(result)` where `result` is a json object. - - -For example: -```javascript -Tesseract.recognize(myImage) - .then(function(result){console.log('result is: ', result)}) -``` - -The console will show something like: -```javascript -result is: { - blocks: Array[1] - confidence: 87 - html: "
TesseractJob -Sets `callback` as the function that will be called if the job fails. -- `callback` is a function with the signature `callback(error)` where `error` is a json object. - -### TesseractJob.finally(callback: function) -> TesseractJob -Sets `callback` as the function that will be called regardless if the job fails or success. -- `callback` is a function with the signature `callback(resultOrError)` where `resultOrError` is a json object. - -## Local Installation - -In the browser, `tesseract.js` simply provides the API layer. Internally, it opens a WebWorker to handle requests. That worker itself loads code from the Emscripten-built `tesseract.js-core` which itself is hosted on a CDN. Then it dynamically loads language files hosted on another CDN. - -Because of this we recommend loading `tesseract.js` from a CDN. But if you really need to have all your files local, you can use the `Tesseract.create` function which allows you to specify custom paths for workers, languages, and core. - -```javascript -window.Tesseract = Tesseract.create({ - workerPath: '/path/to/worker.js', - langPath: 'https://cdn.jsdelivr.net/gh/naptha/tessdata@gh-pages/3.02/', - corePath: 'https://cdn.jsdelivr.net/gh/naptha/tesseract.js-core@0.1.0/index.js', -}) -``` - -### corePath -A string specifying the location of the [tesseract.js-core library](https://github.com/naptha/tesseract.js-core), with default value 'https://cdn.jsdelivr.net/gh/naptha/tesseract.js-core@0.1.0/index.js'. Set this string before calling `Tesseract.recognize` and `Tesseract.detect` if you want Tesseract.js to use a different file. - -### workerPath -A string specifying the location of the [worker.js](./dist/worker.js) file. Set this string before calling `Tesseract.recognize` and `Tesseract.detect` if you want Tesseract.js to use a different file. - -### langPath -A string specifying the location of the tesseract language files, with default value 'https://cdn.jsdelivr.net/gh/naptha/tessdata@gh-pages/3.02/'. 
Language file URLs are calculated according to the formula `langPath + langCode + '.traineddata.gz'`. Set this string before calling `Tesseract.recognize` and `Tesseract.detect` if you want Tesseract.js to use different language files. - - -## Contributing -### Development +## Development To run a development copy of tesseract.js, first clone this repo. ```shell > git clone https://github.com/naptha/tesseract.js.git @@ -269,18 +77,16 @@ Then, `cd tesseract.js && npm install && npm start` Starting up http-server, serving ./ Available on: - http://127.0.0.1:7355 - http://[your ip]:7355 + http://127.0.0.1:3000 + http://[your ip]:3000 ``` -Then open `http://localhost:7355/examples/file-input/demo.html` in your favorite browser. The devServer automatically rebuilds `tesseract.js` and `tesseract.worker.js` when you change files in the src folder. +Then open `http://localhost:3000/examples/browser/demo.html` in your favorite browser. The devServer automatically rebuilds `tesseract.dev.js` and `worker.min.js` when you change files in the src folder. -### Building Static Files +## Building Static Files After you've cloned the repo and run `npm install` as described in the [Development Section](#development), you can build static library files in the dist folder with + ```shell > npm run build ``` - -### Send us a Pull Request! -Thanks :) diff --git a/docs/api.md b/docs/api.md new file mode 100644 index 0000000..186e64a --- /dev/null +++ b/docs/api.md @@ -0,0 +1,146 @@ +# API + +## Tesseract.recognize(image [, options]) -> [TesseractJob](#tesseractjob) +Figures out what words are in `image`, where the words are in `image`, etc. +> Note: `image` should be sufficiently high resolution. +> Often, the same image will get much better results if you upscale it before calling `recognize`. + +- `image` see [Image Format](./image-format.md) for more details. 
+- `options` is either absent (in which case it is interpreted as `'eng'`), a string specifying a language short code from the [language list](./tesseract_lang_list.md), or a flat json object that may: + + include properties that override some subset of the [default tesseract parameters](./tesseract_parameters.md) + + include a `lang` property with a value from the [list of lang parameters](./tesseract_lang_list.md); multiple languages can be joined with '+', e.g. `eng+chi_tra` + +Returns a [TesseractJob](#tesseractjob) whose `then`, `progress`, `catch` and `finally` methods can be used to act on the result. + +### Simple Example: +```javascript +const worker = new Tesseract.TesseractWorker(); +worker + .recognize(myImage) + .then(function(result){ + console.log(result); + }); +``` + +### More Complicated Example: +```javascript +const worker = new Tesseract.TesseractWorker(); +// if we know our image is of Spanish words without the letter 'e': +worker + .recognize(myImage, { + lang: 'spa', + tessedit_char_blacklist: 'e', + }) + .then(function(result){ + console.log(result); + }); +``` + +## Tesseract.detect(image) -> [TesseractJob](#tesseractjob) + +Figures out what script (e.g. 'Latin', 'Chinese') the words in image are written in. + +- `image` see [Image Format](./image-format.md) for more details. + +Returns a [TesseractJob](#tesseractjob) whose `then`, `progress`, `catch` and `finally` methods can be used to act on the result of the script. + +```javascript +const worker = new Tesseract.TesseractWorker(); +worker + .detect(myImage) + .then(function(result){ + console.log(result); + }); +``` + +## TesseractJob + +A TesseractJob is an object returned by a call to `recognize` or `detect`. It's inspired by the ES6 Promise interface and provides `then` and `catch` methods. It also provides a `finally` method, which fires regardless of the job's outcome. One important difference is that these methods return the job itself (to enable chaining) rather than a new promise. 
+ +Typical use is: +```javascript +const worker = new Tesseract.TesseractWorker(); +worker.recognize(myImage) + .progress(message => console.log(message)) + .catch(err => console.error(err)) + .then(result => console.log(result)) + .finally(resultOrError => console.log(resultOrError)); +``` + +Which is equivalent to: +```javascript +const worker = new Tesseract.TesseractWorker(); +const job1 = worker.recognize(myImage); + +job1.progress(message => console.log(message)); + +job1.catch(err => console.error(err)); + +job1.then(result => console.log(result)); + +job1.finally(resultOrError => console.log(resultOrError)); +``` + + + +### TesseractJob.progress(callback: function) -> TesseractJob +Sets `callback` as the function that will be called every time the job progresses. +- `callback` is a function with the signature `callback(progress)` where `progress` is a json object. + +For example: +```javascript +const worker = new Tesseract.TesseractWorker(); +worker.recognize(myImage) + .progress(function(message){console.log('progress is: ', message)}); +``` + +The console will show something like: +```javascript +progress is: {loaded_lang_model: "eng", from_cache: true} +progress is: {initialized_with_lang: "eng"} +progress is: {set_variable: Object} +progress is: {set_variable: Object} +progress is: {recognized: 0} +progress is: {recognized: 0.3} +progress is: {recognized: 0.6} +progress is: {recognized: 0.9} +progress is: {recognized: 1} +``` + + +### TesseractJob.then(callback: function) -> TesseractJob +Sets `callback` as the function that will be called if and when the job successfully completes. +- `callback` is a function with the signature `callback(result)` where `result` is a json object. 
+ + +For example: +```javascript +const worker = new Tesseract.TesseractWorker(); +worker.recognize(myImage) + .then(function(result){console.log('result is: ', result)}); +``` + +The console will show something like: +```javascript +result is: { + blocks: Array[1] + confidence: 87 + html: "
 TesseractJob +Sets `callback` as the function that will be called if the job fails. +- `callback` is a function with the signature `callback(error)` where `error` is a json object. + +### TesseractJob.finally(callback: function) -> TesseractJob +Sets `callback` as the function that will be called regardless of whether the job fails or succeeds. +- `callback` is a function with the signature `callback(resultOrError)` where `resultOrError` is a json object. diff --git a/docs/examples.md b/docs/examples.md index 5385c4e..e778f97 100644 --- a/docs/examples.md +++ b/docs/examples.md @@ -1,5 +1,7 @@ # Tesseract.js Examples +You can also check the [examples](../examples) folder. + ### basic ```javascript diff --git a/docs/image-format.md b/docs/image-format.md new file mode 100644 index 0000000..48b2126 --- /dev/null +++ b/docs/image-format.md @@ -0,0 +1,13 @@ +# Image Format + +Supported formats: **bmp, jpg, png, pbm** + +The main Tesseract.js functions (e.g. `recognize`, `detect`) take an `image` parameter, which should be something that is like an image. What's considered "image-like" differs depending on whether it is being run in the browser or in Node.js. + +In a browser, an image can be: +- an `img`, `video`, or `canvas` element +- a `File` object (from a file ``) +- a path or URL to an accessible image + +In Node.js, an image can be: +- a path to a local image diff --git a/docs/local-installation.md b/docs/local-installation.md new file mode 100644 index 0000000..aeb53bc --- /dev/null +++ b/docs/local-installation.md @@ -0,0 +1,24 @@ +## Local Installation + +In a browser environment, `tesseract.js` simply provides the API layer. Internally, it opens a WebWorker to handle requests. That worker loads code from the Emscripten-built `tesseract.js-core`, which is itself hosted on a CDN. Then it dynamically loads language files hosted on another CDN. + +Because of this we recommend loading `tesseract.js` from a CDN. 
But if you really need to have all your files local, you can pass extra arguments to `TesseractWorker` to specify custom paths for workers, languages, and core. + +In a Node.js environment, the only path you may want to customize is `langPath`. + +```javascript +const worker = new Tesseract.TesseractWorker({ + workerPath: 'https://cdn.jsdelivr.net/gh/naptha/tesseract.js@v2.0.0/dist/worker.min.js', + langPath: 'https://tessdata.projectnaptha.com/4.0.0', + corePath: 'https://cdn.jsdelivr.net/gh/naptha/tesseract.js-core@v2.0.0-beta.5/tesseract-core.js', +}); +``` + +### workerPath +A string specifying the location of the [worker.js](./dist/worker.min.js) file. + +### langPath +A string specifying the location of the tesseract language files, with default value 'https://tessdata.projectnaptha.com/4.0.0'. Language file URLs are calculated according to the formula `langPath + langCode + '.traineddata.gz'`. + +### corePath +A string specifying the location of the [tesseract.js-core library](https://github.com/naptha/tesseract.js-core), with default value 'https://cdn.jsdelivr.net/gh/naptha/tesseract.js-core@v2.0.0-beta.5/tesseract-core.js'.