Introduction
Optical Character Recognition (OCR) technology has revolutionized the way we convert different types of documents, such as scanned paper documents, PDFs, or images captured by a digital camera, into editable and searchable data. OCR technology is crucial for automating data entry processes, improving accuracy, and saving time by eliminating the need for manual data extraction. Its applications span across various industries including banking, healthcare, logistics, and government services, making it an essential tool in the digital transformation journey.
In this tutorial, we will focus on a specific use case of OCR technology: driver's license recognition. Recognizing and extracting information from driver's licenses is a common requirement for businesses and organizations that need to verify identity, such as car rental services, financial institutions, and security agencies. Automating this process using OCR can significantly enhance operational efficiency, reduce human error, and streamline customer interactions.
For this tutorial, we will use the API4AI OCR API, a robust and versatile solution that offers high accuracy and performance for general OCR tasks. API4AI was chosen for its ease of use, comprehensive documentation, and competitive pricing. It provides a flexible API that can be integrated into various applications to perform OCR on different types of documents, including driver's licenses. You are, of course, free to use any other tool, treating this tutorial as inspiration.
One of the key motivations behind using a general OCR API like API4AI, as opposed to specialized solutions designed specifically for driver's license recognition, is cost-effectiveness. Specialized solutions often come with higher costs and less flexibility, which can be a significant burden, especially for small to medium-sized businesses. By leveraging a general OCR API, you can achieve similar results at a fraction of the cost while maintaining the flexibility to adapt the solution to other OCR needs as well.
In the sections that follow, we will guide you through the process of setting up your environment, integrating the API4AI OCR API, and writing the necessary code to recognize and extract information from driver's licenses. Whether you're a developer looking to add OCR capabilities to your application or a business owner seeking to automate identity verification, this step-by-step tutorial will provide you with the knowledge and tools to get started.
Understanding OCR and Its Applications
Definition of Optical Character Recognition (OCR)
Optical Character Recognition (OCR) is a technology that enables the conversion of different types of documents, such as scanned paper documents, PDFs, or images, into editable and searchable text data. OCR algorithms analyze the visual patterns of characters within these documents and translate them into machine-readable text, allowing computers to understand and process the content. OCR has become an indispensable tool in the digitization of information, enabling automation and streamlining workflows across various industries.
Common Applications of OCR Technology
OCR technology finds applications in a wide range of industries and scenarios, including:
Document Digitization: Converting physical documents into digital formats for storage, retrieval, and sharing.
Data Entry Automation: Automating data entry tasks by extracting text from documents and entering it into databases or other systems.
Text Recognition in Images: Recognizing text within images captured by digital cameras or smartphones, such as signs, labels, or handwritten notes.
Translation Services: Enabling the translation of printed or handwritten text from one language to another.
Accessibility: Making printed materials accessible to visually impaired individuals by converting them into text-to-speech or braille formats.
Specific Use Cases in Driver's License Recognition
Driver's license recognition is a specialized application of OCR technology that involves extracting information from driver's licenses, such as the name of the license holder, license number, date of birth, and address. This information is commonly required for identity verification purposes in various industries, including:
Car Rental Services: Verifying the identity of customers before renting vehicles to ensure compliance with age restrictions and driver eligibility.
Financial Institutions: Authenticating customer identities for account opening, loan applications, or financial transactions.
Government Agencies: Processing driver's license renewals, registrations, and other administrative tasks efficiently.
Security and Access Control: Granting access to restricted areas or sensitive information based on verified identities.
Importance of Choosing the Right OCR API for the Task
When it comes to driver's license recognition and other OCR tasks, choosing the right OCR API is crucial for achieving accurate and reliable results. Factors to consider when selecting an OCR API include:
Accuracy: The ability of the OCR engine to accurately recognize text, even in challenging conditions such as low-quality images or distorted text.
Speed: The processing speed of the OCR API, especially when dealing with large volumes of documents or real-time applications.
Ease of Integration: The simplicity and flexibility of integrating the OCR API into existing applications or workflows.
Language Support: The support for multiple languages and character sets, especially for applications in multilingual environments.
Cost: The pricing structure of the OCR API, including any usage-based fees or subscription plans, and its affordability for the intended use case.
By carefully evaluating these factors and choosing a reliable OCR API like API4AI, you can ensure the success of your driver's license recognition project and maximize its benefits in terms of efficiency, accuracy, and cost-effectiveness.
Why Not Use Specialized Solutions for Driver's License Recognition?
Overview of Specialized Solutions for Driver's License Recognition
Specialized solutions for driver's license recognition are designed specifically to extract and verify information from driver's licenses. These solutions often come with pre-built templates and algorithms tailored for different license formats, making them seemingly convenient for businesses that require high accuracy and quick deployment. These solutions typically offer features such as automatic format detection, advanced data extraction, and integration with identity verification services.
Discussion on the High Costs Associated with Specialized Solutions
While specialized solutions offer convenience and high accuracy, they come with significant drawbacks, primarily in terms of cost. These solutions often involve:
High Licensing Fees: Specialized software typically comes with high upfront licensing costs or subscription fees that can be prohibitively expensive for small to medium-sized businesses.
Per-Transaction Costs: Many specialized solutions charge based on the number of transactions or scans, leading to escalating costs as the volume of processed licenses increases.
Maintenance and Support Fees: Ongoing costs for software maintenance, updates, and support can add up, further increasing the total cost of ownership.
Vendor Lock-In: Businesses may become dependent on a single vendor, limiting their flexibility to switch to alternative solutions without incurring additional costs or undergoing significant disruptions.
Benefits of Building a Solution on Top of General OCR APIs
Using a general OCR API, such as API4AI, for driver's license recognition offers several advantages over specialized solutions:
Cost-Effectiveness: General OCR APIs typically have lower upfront costs and more flexible pricing models, including pay-as-you-go options. This makes them more affordable, especially for businesses with varying processing volumes.
Flexibility and Customization: General OCR APIs provide the flexibility to adapt and customize the OCR process to specific needs. Developers can fine-tune the data extraction process, implement custom validation rules, and integrate with other systems without being constrained by the limitations of a specialized solution.
Scalability: General OCR APIs are designed to handle a wide range of document types and can scale easily with the growth of the business. As the volume of processed licenses increases, the solution can be scaled up without significant changes to the underlying infrastructure.
By leveraging the power of general OCR APIs, organizations can achieve significant cost savings, improve efficiency, and keep the flexibility to adapt their solutions as their needs evolve. This makes a strong case for using general OCR solutions in real-world applications, including driver's license recognition.
Writing Code to Recognize Driver's Licenses with API4AI OCR
Assumptions
In this tutorial, we will explore the application of the API4AI OCR API to recognize key information from a driver's license. Leveraging OCR technology, we can automate the extraction of this data, making processes more efficient and reducing the potential for human error. To keep the tutorial focused and manageable, we will use a sample driver's license from Washington, D.C., and will work with the ID and the name of the license holder. This will help us demonstrate the process clearly and effectively. However, the principles and methods we discuss can be applied to driver's licenses from any US state. By the end of this tutorial, you should have a solid understanding of how to integrate and utilize the API4AI OCR API for driver's license recognition in your own projects.
Additionally, for our demonstration, we will use the demo API endpoint provided by API4AI, which offers a limited number of queries. This will be quite sufficient for our experimental purposes, allowing us to illustrate the capabilities of the OCR API without any cost. If you wish to implement a full-featured solution in a production environment, please refer to the API4AI documentation page for detailed instructions on obtaining an API key and understanding the full range of features available.
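For example, when you move from the demo endpoint to a production setup, it is convenient to keep the endpoint URL and your credentials configurable instead of hard-coding them. The snippet below is only a sketch: the environment variable names, the header name, and the file name are placeholders of our own, and the exact endpoint and authentication scheme should be taken from the API4AI documentation.

import os

import requests

# Placeholders: take the real endpoint and authentication details from the API4AI docs.
API_URL = os.environ.get('OCR_API_URL',
                         'https://demo.api4ai.cloud/ocr/v1/results?algo=simple-words')
API_KEY = os.environ.get('OCR_API_KEY')  # not needed for the demo endpoint

headers = {}
if API_KEY:
    headers['X-API-Key'] = API_KEY  # hypothetical header name; check the docs

with open('license.png', 'rb') as f:  # 'license.png' is a placeholder path
    response = requests.post(API_URL, files={'image': f}, headers=headers)
print(response.json())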
For testing and development we will use the picture below.
Understanding API4AI OCR API
The OCR API can be used in two modes: “simple-text” (the default) and “simple-words”. The first mode returns recognized phrases separated by line breaks, together with their positions. That is not what we need here: we want the location of every individual word so that we can work out where each field sits on the license, which is exactly what the second mode provides. But first, let's see how the API works in practice. As they say, one code example is worth 1024 words.
import math
import sys

import cv2
import requests

API_URL = 'https://demo.api4ai.cloud/ocr/v1/results?algo=simple-words'

# Get the image path from the 1st command-line argument.
image_path = sys.argv[1]

# Use the HTTP API to get recognized words from the specified image.
with open(image_path, 'rb') as f:
    response = requests.post(API_URL, files={'image': f})
json_obj = response.json()

for elem in json_obj['results'][0]['entities'][0]['objects']:
    box = elem['box']  # normalized x, y, width, height
    text = elem['entities'][0]['text']  # recognized text
    print(  # show every word with its bounding box
        f'[{box[0]:.4f}, {box[1]:.4f}, {box[2]:.4f}, {box[3]:.4f}], {text}'
    )
In this short script, we access the API by sending the picture in a POST request; the path to the picture is passed as the first command-line argument. The program simply prints the normalized top-left coordinate, width, and height of the area containing each recognized word, along with the word itself. Here is a fragment of the output for the picture above:
...
[0.6279, 0.6925, 0.0206, 0.0200], All
[0.6529, 0.6800, 0.1118, 0.0300], 02/21/1984
[0.6162, 0.7175, 0.0309, 0.0200], BEURT
[0.6515, 0.7350, 0.0441, 0.0175], 4a.ISS
[0.6515, 0.7675, 0.1132, 0.0250], 02/17/2010
[0.7662, 0.1725, 0.0647, 0.1125], tomand
[0.6529, 0.8550, 0.0324, 0.0275], ♥♥
[0.6941, 0.8550, 0.0809, 0.0275], DONOR
[0.6529, 0.8950, 0.1074, 0.0300], VETERAN
[0.9000, 0.0125, 0.0691, 0.0375], USA
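One practical note: the script above assumes that the request succeeds and that the response contains the expected JSON. In a real application it is worth checking the HTTP status before parsing the result; a minimal sketch could look like this:

import requests

API_URL = 'https://demo.api4ai.cloud/ocr/v1/results?algo=simple-words'

with open('license.png', 'rb') as f:  # placeholder path
    response = requests.post(API_URL, files={'image': f})
response.raise_for_status()  # raise an exception on HTTP errors (4xx/5xx)
json_obj = response.json()
if not json_obj.get('results'):
    raise RuntimeError(f'Unexpected OCR response: {json_obj}')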
Let’s try to apply the obtained data to an image by drawing bounding boxes using OpenCV. To do this, we need to convert the normalized values into absolute values expressed in integer pixels. We need the exact coordinate values of the upper left corner and the lower right corner so that we can use them to draw the bounding box. To achieve this, let’s create the get_corner_coords function.
def get_corner_coords(height, width, box):
    x1 = int(box[0] * width)
    y1 = int(box[1] * height)
    obj_width = box[2] * width
    obj_height = box[3] * height
    x2 = int(x1 + obj_width)
    y2 = int(y1 + obj_height)
    return x1, y1, x2, y2
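As a quick check, here is how the conversion works for the box of the word “DONOR” from the output above, assuming a hypothetical image size of 800×500 pixels (width × height):

# Box of the word 'DONOR' from the output above: normalized x, y, width, height.
box = [0.6941, 0.8550, 0.0809, 0.0275]

# Assuming a hypothetical 800x500 (width x height) image.
x1, y1, x2, y2 = get_corner_coords(500, 800, box)
print(x1, y1, x2, y2)  # -> 555 427 619 440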
The function for drawing the bounding box will be very simple:
def draw_bounding_box(image, box):
    x1, y1, x2, y2 = get_corner_coords(image.shape[0], image.shape[1], box)
    cv2.rectangle(image, (x1 - 2, y1 - 2), (x2 + 2, y2 + 2), (127, 0, 0), 2)
In this function, we expand the box by two pixels on each side so that it does not press too tightly against the words. The color (127, 0, 0) is navy blue specified in BGR format, and the line thickness is two pixels.
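If you also want to see which word each rectangle corresponds to, you can label the boxes as well. This is optional and not used in the rest of the tutorial; a minimal sketch based on cv2.putText might look like this:

def draw_labeled_box(image, box, text):
    x1, y1, x2, y2 = get_corner_coords(image.shape[0], image.shape[1], box)
    cv2.rectangle(image, (x1 - 2, y1 - 2), (x2 + 2, y2 + 2), (127, 0, 0), 2)
    # Put the recognized text slightly above the box.
    cv2.putText(image, text, (x1, max(y1 - 6, 0)),
                cv2.FONT_HERSHEY_SIMPLEX, 0.4, (127, 0, 0), 1)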
Of course, to work with an image, it must first be read. Let's modify the last part of our script: read the image, remove the debug output with the bounding-box information, draw each bounding box on the loaded image, and then save the result to the file “output.png”.
image = cv2.imread(image_path)
for elem in json_obj['results'][0]['entities'][0]['objects']:
    box = elem['box']  # normalized x, y, width, height
    text = elem['entities'][0]['text']  # recognized text
    draw_bounding_box(image, box)  # add boundaries to image
cv2.imwrite('output.png', image)
Here is what we get now:
Extracting the ID and Name of the License Holder
So far, we have managed to call the API and extract text information from the picture of the driver's license. That's great! But how do we get to the ID number and the name?
These are the elements we have in the area we are interested in:
[0.3059, 0.1975, 0.0500, 0.0175], 4d.DLN
[0.3059, 0.2325, 0.1059, 0.0275], A9999999
[0.3074, 0.2800, 0.0603, 0.0200], 1.FAMILY
[0.3735, 0.2800, 0.0412, 0.0175], NAME
[0.3059, 0.3150, 0.0794, 0.0300], JONES
[0.3059, 0.3675, 0.0574, 0.0225], 2.GIVEN
[0.3691, 0.3675, 0.0529, 0.0225], NAMES
[0.3074, 0.4025, 0.1191, 0.0275], ANGELINA
[0.3074, 0.4375, 0.1191, 0.0300], GABRIELA
The POST request happened to return these results in a convenient order, but that order is not guaranteed and may differ from request to request, so we can't rely on it. It is safer to assume that the recognized elements always arrive in arbitrary order.
Let’s create a list named words, so that we can easily search for words and their positions:
words = []
for elem in json_obj['results'][0]['entities'][0]['objects']:
    box = elem['box']
    text = elem['entities'][0]['text']
    words.append({'box': box, 'text': text})
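Each entry of words is now a small dictionary. For example, the license number from the fragment above ends up as:

{'box': [0.3059, 0.2325, 0.1059, 0.0275], 'text': 'A9999999'}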
Let's call “4d.DLN,” “1.FAMILY,” and “2.GIVEN” the field names, and the text below them in the picture the field values. The easiest approach is to take the position of each field name and look for the closest element below it. Simply taking the next element along the vertical axis could pick up a word far to the right or left, so we measure the Euclidean distance between the text elements rather than comparing positions along a single axis. Let's write some code.
First, let’s find the positions of the field names:
ID_MARK = '4d.DLN'
FAMILY_MARK = '1.FAMILY'
NAME_MARK = '2.GIVEN'

id_mark_info = {}
fam_mark_info = {}
name_mark_info = {}
for elem in words:
    if elem['text'] == ID_MARK:
        id_mark_info = elem
    elif elem['text'] == FAMILY_MARK:
        fam_mark_info = elem
    elif elem['text'] == NAME_MARK:
        name_mark_info = elem
Next, we will write a function that finds the nearest elements below the given reference element:
def find_label_below(word_info):
    x = word_info['box'][0]
    y = word_info['box'][1]
    candidate = words[0]
    candidate_dist = math.inf
    for elem in words:
        if elem['text'] == word_info['text']:
            continue
        curr_box_x = elem['box'][0]
        curr_box_y = elem['box'][1]
        curr_vert_dist = curr_box_y - y
        curr_horiz_dist = x - curr_box_x
        if curr_vert_dist > 0:  # we are only looking for items below
            dist = math.hypot(curr_vert_dist, curr_horiz_dist)
            if dist > candidate_dist:
                continue
            candidate_dist = dist
            candidate = elem
    return candidate
Let’s try to apply this function and draw the boundaries of the found elements:
id_info = find_label_below(id_mark_info)
fam_info = find_label_below(fam_mark_info)
name_info = find_label_below(name_mark_info)
name2_info = find_label_below(name_info)
canvas = image.copy()
draw_bounding_box(canvas, id_info['box'])
draw_bounding_box(canvas, fam_info['box'])
draw_bounding_box(canvas, name_info['box'])
draw_bounding_box(canvas, name2_info['box'])
cv2.imwrite('result.png', canvas)
Let's take a look at what we have accomplished so far:
It looks like we successfully extracted the required fields :)
Finalizing Results
Based on everything we've discussed, let's create a practically useful program without using OpenCV. This program will take the path to the picture as an argument and output the ID number and full name to the terminal.
#!/usr/bin/env python3
import math
import sys

import requests

API_URL = 'https://demo.api4ai.cloud/ocr/v1/results?algo=simple-words'

ID_MARK = '4d.DLN'
FAMILY_MARK = '1.FAMILY'
NAME_MARK = '2.GIVEN'
ADDRESS_MARK = '8.ADDRESS'


def find_text_below(words, word_info):
    x = word_info['box'][0]
    y = word_info['box'][1]
    candidate = words[0]
    candidate_dist = math.inf
    for elem in words:
        if elem['text'] == word_info['text']:
            continue
        curr_box_x = elem['box'][0]
        curr_box_y = elem['box'][1]
        curr_vert_dist = curr_box_y - y
        curr_horiz_dist = x - curr_box_x
        if curr_vert_dist > 0:  # we are only looking for items below
            dist = math.hypot(curr_vert_dist, curr_horiz_dist)
            if dist > candidate_dist:
                continue
            candidate_dist = dist
            candidate = elem
    return candidate


if __name__ == '__main__':
    if len(sys.argv) != 2:
        print('Expected one argument: path to image.')
        sys.exit(1)
    image_path = sys.argv[1]

    with open(image_path, 'rb') as f:
        response = requests.post(API_URL, files={'image': f})
    json_obj = response.json()

    words = []
    for elem in json_obj['results'][0]['entities'][0]['objects']:
        box = elem['box']
        text = elem['entities'][0]['text']
        words.append({'box': box, 'text': text})

    id_mark_info = {}
    fam_mark_info = {}
    name_mark_info = {}
    for elem in words:
        if elem['text'] == ID_MARK:
            id_mark_info = elem
        elif elem['text'] == FAMILY_MARK:
            fam_mark_info = elem
        elif elem['text'] == NAME_MARK:
            name_mark_info = elem

    license = find_text_below(words, id_mark_info)['text']
    family_name = find_text_below(words, fam_mark_info)['text']
    name1_info = find_text_below(words, name_mark_info)
    name1 = name1_info['text']
    name2 = find_text_below(words, name1_info)['text']
    if name2 == ADDRESS_MARK:  # no second name
        full_name = f'{name1} {family_name}'
    else:  # with second name
        full_name = f'{name1} {name2} {family_name}'

    print(f'Driver license: {license}')
    print(f'Full name: {full_name}')
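To run the program, save it to a file and pass the path to the license image as the only argument, for example (both file names here are just placeholders):

python3 recognize_license.py dc_license.png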
Here is the output of the program when we pass the picture shown at the beginning as the first argument:
Driver license: A9999999
Full name: ANGELINA GABRIELA JONES
The program can easily be extended to retrieve other data from driver’s licenses. Of course, we didn’t consider all possible problematic situations because the goal was to demonstrate the practical use of the API, leaving room for improvement by the reader. For instance, to handle rotated images, we could determine the angle of rotation from the key fields and use that information to search for “underlying” elements with field values. Give it a try! By using these general ideas, it’s easy to implement program logic for other types of documents and images with text.
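As a starting point for the rotation exercise, here is a rough sketch of estimating the tilt angle. It is only our own illustration, not part of the tutorial's program: it assumes that two field names that sit one above the other on an upright license, such as "1.FAMILY" and "2.GIVEN", have both been recognized.

import math

def estimate_rotation(words, upper_text='1.FAMILY', lower_text='2.GIVEN'):
    """Estimate the card tilt in degrees from two labels that are vertically
    aligned on an upright license. Returns 0.0 if either label is missing.
    Note: for an exact angle, convert the normalized coordinates to pixels
    first, since x and y are normalized by different image dimensions."""
    upper = next((w for w in words if w['text'] == upper_text), None)
    lower = next((w for w in words if w['text'] == lower_text), None)
    if not upper or not lower:
        return 0.0
    dx = lower['box'][0] - upper['box'][0]
    dy = lower['box'][1] - upper['box'][1]
    # On an upright card dx is ~0; a horizontal offset indicates a tilt.
    return math.degrees(math.atan2(dx, dy))

The resulting angle could then be used to rotate the image (or the search direction in find_text_below) before looking for the field values.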
Read the OCR API documentation and examples of code written in different programming languages to learn more.
Conclusion
In this tutorial, we have taken you through a step-by-step process of using the API4AI OCR API to recognize and extract information from a US driver’s license. We began by understanding the fundamentals of OCR technology and its various applications. We then discussed the rationale behind using a general OCR API over specialized solutions, highlighting the benefits of cost-effectiveness, flexibility, and scalability.
Through the tutorial, we wrote code to send the image to the API, extract the ID number and name of the license holder, and handle the OCR results effectively. We also demonstrated how to parse and validate the extracted data, and discussed ways to extend the program to retrieve additional information.
Using OCR for driver's license recognition offers numerous benefits. It automates the data extraction process, reducing manual effort and minimizing errors. This can significantly enhance operational efficiency in various industries such as car rentals, financial institutions, and government agencies. Moreover, the flexibility of general OCR APIs allows for customization and adaptation to various document types and use cases.
We encourage you to explore further applications of OCR technology beyond driver's license recognition. OCR can be applied to a wide range of documents and scenarios, from digitizing printed texts to automating form processing and enhancing accessibility. By leveraging the power of OCR, you can streamline workflows, improve accuracy, and unlock new possibilities for innovation in your projects.
Thank you for following along with this tutorial. We hope you found it informative and useful. For more details and advanced usage, be sure to check out the OCR API documentation and explore additional examples in various programming languages. Happy coding!
Additional Resources
Links to API4AI OCR API Documentation
To dive deeper into the features and capabilities of the API4AI OCR API, refer to the official documentation. It provides comprehensive guides, code examples, and detailed explanations of the API endpoints, helping you leverage the full potential of OCR in your applications.
Further Reading on OCR Technology and Image Processing
For those interested in expanding their knowledge of OCR technology and image processing, here are some valuable resources:
Books:
"Digital Image Processing" by Rafael C. Gonzalez and Richard E. Woods
"Mastering OpenCV 4 with Python" by Alberto Fernandez Villan
"Deep Learning for Computer Vision" by Rajalingappaa Shanmugamani
Articles and Papers:
"Optical character recognition: an illustrated guide to the frontier" by George Nagy, Thomas A. Nartker, Stephen V. Rice
"End-to-End Text Recognition with Convolutional Neural Networks" by Tao Wang, David J. Wu, Adam Coates, and Andrew Y. Ng
"Best OCR Solutions for Various Use Cases: How to Choose the Right One"
Websites:
Towards Data Science - A platform with numerous articles on machine learning, deep learning, and image processing.
OpenCV - Official site for OpenCV with tutorials and documentation.
Links to Related Tutorials and Courses
Enhance your practical skills and gain hands-on experience with these tutorials and courses focused on OCR and image processing:
Tutorials:
OpenCV-Python Tutorials - Official OpenCV tutorials for Python.
Real Python: OCR with Tesseract and OpenCV - A practical guide to using Tesseract and OpenCV in Python.
Online Courses:
Coursera: Introduction to Computer Vision and Image Processing - A comprehensive course on computer vision and OpenCV.
Udacity: Computer Vision Nanodegree Program - An in-depth program covering various aspects of computer vision.
edX: Computer Vision and Image Processing Fundamentals - A foundational course on computer vision principles and applications.
CS231n: Deep Learning for Computer Vision - A deep dive into deep learning architectures, with a focus on learning end-to-end models for vision tasks, particularly image classification.
By exploring these additional resources, you can enhance your understanding of OCR technology, refine your skills, and discover new ways to implement OCR in your projects. Happy learning!