Pure Javascript OCR for more than 100 Languages 📖🎉🖥
You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

198 lines
5.9 KiB

> # UNDER CONTRUCTION
> ## Due for Release on Monday, Oct 3, 2016
10 years ago
# tesseract.js
8 years ago
Tesseract.js is a pure javascript version of the Tesseract OCR Engine that can recognize English, Chinese, Russian, and 60 other languages.
Tesseract.js lets your code get the words out of scanned documents and other images.
8 years ago
<!-- ![alt text]( "Logo Title Text 1") -->
# Installation
Tesseract.js works with a `<script>` tag via local copy or cdn, or with `npm` (if you're using webpack / browserify).
8 years ago
## Script Tag
### CDN
```html
<script src='https://cdn.rawgit.com/naptha/tesseract.js/a01d2a2/dist/tesseract.js'></script>
<script>
Tesseract.recognize('#my-image')
.progress(function (p) { console.log('progress', p) })
.then(function (result) { console.log('result', result) })
</script>
```
### Local
First grab copies of `tesseract.js` and `tesseract.worker.js` from the [dist folder](https://github.com/naptha/tesseract.js/tree/master/dist). Then include `tesseract.js` on your page, and set `Tesseract.workerUrl` like this:
8 years ago
```html
<script src='/path/to/tesseract.js'></script>
<script>
Tesseract.workerUrl = 'http://www.absolute-path-to/tesseract.worker.js'
Tesseract.recognize('#my-image')
.progress(function (p) { console.log('progress', p) })
.then(function (result) { console.log('result', result) })
</script>
8 years ago
```
## npm
### TODO
<!-- ```shell
> npm install tesseract-js ?
```-->
# Docs
## ImageLike
The main Tesseract.js functions take an `image` parameter, which should be something that is 'image-like'.
That means `image` should be
- an `img` element or querySelector that matches an `img` element
- a `video` element or querySelector that matches a `video` element
- a `canvas` element or querySelector that matches a `canvas` element
- a CanvasRenderingContext2D (returned by `canvas.getContext('2d')`)
- the absolute `url` of an image from the same website that is running your script. Browser security policies don't allow access to the content of images from other websites :(
## Tesseract.recognize(image: [ImageLike](#imagelike)[, options]) -> [TesseractJob](#tesseractjob)
Figures out what words are in the image, where the words are, etc.
- `image` should be an [ImageLike](#imagelike) object.
- `options` is an optional parameter with tesseract specific keys
+ hi
Returns a [TesseractJob](#tesseractjob) whose `then` method can be used to act on the result.
Example:
```javascript
Tesseract.recognize('#my-image')
.then(function(result){
console.log(result)
})
```
## Tesseract.detect(image: [ImageLike](#imagelike)) -> [TesseractJob](#tesseractjob)
Figures out what script (e.g. 'Latin', 'Chinese') the words in the image are written in.
`image` should be an [ImageLike](#imagelike) object.
Returns a [TesseractJob](#tesseractjob) whose `then` method can be used to act on the result of the script.
```javascript
Tesseract.detect('#my-image')
.then(function(result){
console.log(result)
})
```
## TesseractJob
A TesseractJob is an an object returned by a call to recognize or detect.
All methods of a TesseractJob return itself to enable chaining.
Typical use is:
```javascript
var job1 = Tesseract.recognize('#my-image')
job1.progress(function(message){console.log(message)})
.error(function(err){console.error(err)})
.then(function(result){console.log(result)})
```
or more concisely:
```javascript
Tesseract.recognize('#my-image')
.progress(function(message){console.log(message)})
.error(function(err){console.error(err)})
.then(function(result){console.log(result)})
```
### TesseractJob.progress(callback: function) -> TesseractJob
Sets `callback` as the function that will be called every time the job progresses.
`callback` is a function with the signature `callback(progress)` where progress is json object.
For example:
```javascript
Tesseract.recognize('#my-image')
.progress(function(message){console.log('progress is: 'message)})
```
The console will show something like:
```javascript
progress is: {loaded_lang_model: "eng", from_cache: true}
progress is: {initialized_with_lang: "eng"}
progress is: {set_variable: Object}
progress is: {set_variable: Object}
progress is: {recognized: 0}
progress is: {recognized: 0.3}
progress is: {recognized: 0.6}
progress is: {recognized: 0.9}
progress is: {recognized: 1}
```
### TesseractJob.then(callback: function) -> TesseractJob
Sets `callback` as the function that will be called if and when the job successfully completes.
For example:
```javascript
Tesseract.recognize('#my-image')
.then(function(result){console.log('result is: 'result)})
```
The console will show something like:
```javascript
progress is: {
blocks: Array[1]
confidence: 87
html: "<div class='ocr_page' id='page_1' ..."
lines: Array[3]
oem: "DEFAULT"
paragraphs: Array[1]
psm: "SINGLE_BLOCK"
symbols: Array[33]
text: "Hello World↵from beyond↵the Cosmic Void↵↵"
version: "3.04.00"
words: Array[7]
}
```
### TesseractJob.error(callback: function) -> TesseractJob
Sets `callback` as the function that will be called if and when the job successfully completes.
# Contributing
## Development
To run a development copy of tesseract.js, first clone this repo.
```shell
> git clone https://github.com/naptha/tesseract.js.git
```
Then, cd in to the folder, `npm install`, and `npm start`
8 years ago
```shell
> cd tesseract.js
> npm install && npm start
8 years ago
... a bunch of npm stuff ...
tesseract.js@1.0.0 start /Users/guillermo/Desktop/code_static/tesseract.js
node devServer.js
Listening at http://localhost:7355
```
Then open `http://localhost:7355` in your favorite browser. The devServer automatically rebuilds tesseract.js and tesseract.worker.js when you change files in the src folder.
## Building Static Files
After you've cloned the repo and run `npm install` as described in the [Development Section](#development), you can build static library files in the dist folder with
```shell
> npm run build
```
## Send us a Pull Request!