Tesseract is Google's optical character recognition library, and is not natively a Python package. Because of this, there's a Python binding for it that calls the executable, which can then be installed manually.
Go to the GitHub repository for Tesseract, which is found at the following link: https://github.com/tesseract-ocr/tesseract.
Scroll down to the Installing Tesseract section in the GitHub readme. Here, we are presented with two options:
- Installing it via a pre-built binary package
- Building it from source
We want to install it via the pre-built binary package, so click on that link. We can also build it from source if we want to, but that doesn't really offer any advantages. The Tesseract Wiki explains the steps to install it on various different operating systems.
As we're using Windows, and we want to install a pre-built one, click on the Tesseract at UB Mannheim link, where you will find all the latest setup files. Download the latest setup from the site.
Once downloaded, run the installer or execute the command. However, this is not going to put Tesseract in your path. We need to make sure it is in your path; otherwise, when you call Tesseract from within Python, you're going to get an error message.
So, we need to figure out where Tesseract is and modify our path variable. To do this, type the where tesseract command in Command Prompt, as shown in the following screenshot:
Once you have the binary packages, use the pip command to apply the Python binding to the packages. Use the following commands:
$ pip install tesseract
$ pip install pytesseract
We should be good to go with Tesseract now.