> # UNDER CONTRUCTION > ## Due for Release on Monday, Oct 3, 2016 # tesseract.js Tesseract.js is a pure javascript version of the Tesseract OCR Engine that can recognize English, Chinese, Russian, and 60 other languages. Tesseract.js lets your code get the words out of scanned documents and other images. # Installation Tesseract.js works with a ` ``` ### Local First grab copies of `tesseract.js` and `tesseract.worker.js` from the [dist folder](https://github.com/naptha/tesseract.js/tree/master/dist). Then include `tesseract.js` on your page, and set `Tesseract.workerUrl` like this: ```html ``` ## npm ### TODO walp # Docs * [Tesseract.recognize(image: ImageLike[, options]) -> [TesseractJob](#tesseractjob)](#tesseractrecognizeimage-imagelike-options---tesseractjob) + [Simple Example](#simple-example) + [More Complicated Example](#more-complicated-example) * [Tesseract.detect(image: ImageLike) -> [TesseractJob](#tesseractjob)](#tesseractdetectimage-imagelike---tesseractjob) * [ImageLike](#imagelike) * [TesseractJob](#tesseractjob) + [TesseractJob.progress(callback: function) -> TesseractJob](#tesseractjobprogresscallback-function---tesseractjob) + [TesseractJob.then(callback: function) -> TesseractJob](#tesseractjobthencallback-function---tesseractjob) + [TesseractJob.error(callback: function) -> TesseractJob](#tesseractjoberrorcallback-function---tesseractjob) * [Tesseract Remote File Options](#tesseract-remote-file-options) + [Tesseract.coreUrl](#tesseractcoreurl) + [Tesseract.workerUrl](#tesseractworkerurl) + [Tesseract.langUrl](#tesseractlangurl) ## Tesseract.recognize(image: [ImageLike](#imagelike)[, options]) -> [TesseractJob](#tesseractjob) Figures out what words are in `image`, where the words are in `image`, etc. - `image` is any [ImageLike](#imagelike) object. - `options` is an optional flat json object. `options` may: + include properties that override some subset of the [default tesseract parameters](./tesseract_parameters.md) + include a `lang` property with a value from the [list of lang parameters](./tesseract_lang_list.md) Returns a [TesseractJob](#tesseractjob) whose `then`, `progress`, and `error` methods can be used to act on the result. ### Simple Example: ```javascript Tesseract.recognize('#my-image') .then(function(result){ console.log(result) }) ``` ### More Complicated Example: ```javascript // if we know our image is of spanish words without the letter 'e': Tesseract.recognize('#my-image', { lang: 'spa', tessedit_char_blacklist: 'e' }) .then(function(result){ console.log(result) }) ``` ## Tesseract.detect(image: [ImageLike](#imagelike)) -> [TesseractJob](#tesseractjob) Figures out what script (e.g. 'Latin', 'Chinese') the words in image are written in. - `image` is any [ImageLike](#imagelike) object. Returns a [TesseractJob](#tesseractjob) whose `then`, `progress`, and `error` methods can be used to act on the result of the script. ```javascript Tesseract.detect('#my-image') .then(function(result){ console.log(result) }) ``` ## ImageLike The main Tesseract.js functions take an `image` parameter, which should be something that is 'image-like'. That means `image` should be - an `img` element or querySelector that matches an `img` element - a `video` element or querySelector that matches a `video` element - a `canvas` element or querySelector that matches a `canvas` element - a CanvasRenderingContext2D (returned by `canvas.getContext('2d')`) - the absolute `url` of an image from the same website that is running your script. Browser security policies don't allow access to the content of images from other websites :( ## TesseractJob A TesseractJob is an an object returned by a call to recognize or detect. All methods of a given TesseractJob return that TesseractJob to enable chaining. Typical use is: ```javascript Tesseract.recognize('#my-image') .progress(function(message){console.log(message)}) .error(function(err){console.error(err)}) .then(function(result){console.log(result)}) ``` Which is equivalent to: ```javascript var job1 = Tesseract.recognize('#my-image'); job1.progress(function(message){console.log(message)}); job1.error(function(err){console.error(err)}); job1.then(function(result){console.log(result)}) ``` ### TesseractJob.progress(callback: function) -> TesseractJob Sets `callback` as the function that will be called every time the job progresses. - `callback` is a function with the signature `callback(progress)` where `progress` is a json object. For example: ```javascript Tesseract.recognize('#my-image') .progress(function(message){console.log('progress is: 'message)}) ``` The console will show something like: ```javascript progress is: {loaded_lang_model: "eng", from_cache: true} progress is: {initialized_with_lang: "eng"} progress is: {set_variable: Object} progress is: {set_variable: Object} progress is: {recognized: 0} progress is: {recognized: 0.3} progress is: {recognized: 0.6} progress is: {recognized: 0.9} progress is: {recognized: 1} ``` ### TesseractJob.then(callback: function) -> TesseractJob Sets `callback` as the function that will be called if and when the job successfully completes. - `callback` is a function with the signature `callback(result)` where `result` is a json object. For example: ```javascript Tesseract.recognize('#my-image') .then(function(result){console.log('result is: 'result)}) ``` The console will show something like: ```javascript progress is: { blocks: Array[1] confidence: 87 html: "
TesseractJob Sets `callback` as the function that will be called if the job fails. - `callback` is a function with the signature `callback(erros)` where `error` is a json object. ## Tesseract Remote File Options ### Tesseract.coreUrl A string specifying the location of the [tesseract.js-core library](https://github.com/naptha/tesseract.js-core), with default value 'https://cdn.rawgit.com/naptha/tesseract.js-core/master/index.js'. Set this string before calling `Tesseract.recognize` and `Tesseract.detect` if you want Tesseract.js to use a different file. For example: ```javascript Tesseract.coreUrl = 'https://absolute-path-to/tesseract.js-core/index.js' ``` ### Tesseract.workerUrl A string specifying the location of the [tesseract.worker.js](./dist/tesseract.worker.js) file, with default value 'https://cdn.rawgit.com/naptha/tesseract.js/8b915dc/dist/tesseract.worker.js'. Set this string before calling `Tesseract.recognize` and `Tesseract.detect` if you want Tesseract.js to use a different file. For example: ```javascript Tesseract.workerUrl = 'https://absolute-path-to/tesseract.worker.js' ``` ### Tesseract.langUrl A string specifying the location of the tesseract language files, with default value 'https://cdn.rawgit.com/naptha/tessdata/gh-pages/3.02/'. Language file urls are calculated according to the formula `Tesseract.langUrl + lang + '.traineddata.gz'`. Set this string before calling `Tesseract.recognize` and `Tesseract.detect` if you want Tesseract.js to use different language files. In the following exampple, Tesseract.js will download the language file from 'https://absolute-path-to/lang/folder/rus.traineddata.gz': ```javascript Tesseract.langUrl = 'https://absolute-path-to/lang/folder/' Tesseract.recognize('#my-im', { lang: 'rus' }) ``` # Contributing ## Development To run a development copy of tesseract.js, first clone this repo. ```shell > git clone https://github.com/naptha/tesseract.js.git ``` Then, cd in to the folder, `npm install`, and `npm start` ```shell > cd tesseract.js > npm install && npm start ... a bunch of npm stuff ... tesseract.js@1.0.0 start /Users/guillermo/Desktop/code_static/tesseract.js node devServer.js Listening at http://localhost:7355 ``` Then open `http://localhost:7355` in your favorite browser. The devServer automatically rebuilds tesseract.js and tesseract.worker.js when you change files in the src folder. ## Building Static Files After you've cloned the repo and run `npm install` as described in the [Development Section](#development), you can build static library files in the dist folder with ```shell > npm run build ``` ## Send us a Pull Request! Thanks :)