Locate Text and Shapes in a Raster PDF and Annotate

Posted 2 weeks ago

Worldwide

Summary

This task needs to be completed ASAP (i.e., in less than 7 days). We have a set of 30,000 technical pictures in raster PDF format that describe the components of a coal power plant. We need someone to create an OCR program that detects text inside these raster PDFs and creates JSON files containing the coordinates of the bounding boxes for the identified text.

PROGRAM INPUTS (THE PDFS):

1. These PDFs contain two different types of pictures:
   * Process & Instrumentation pictures (PNIDs or P&IDs) - Classic engineering process pictures. The most important information they contain relates to how components are connected upstream and downstream. These files are clearer and will probably work better for OCR.
   * Drawings - More detailed pictures that depict components to scale.
2. A handful of manufacturers built this coal power plant. Each produced PNIDs and Drawings in its own unique format.

We have no reliable index or information about which pictures are PNIDs and which are Drawings. You will likely need to programmatically classify each picture by type and manufacturer. Since OCR will probably be easier for PNIDs, we should probably start the OCR process with the PNIDs for one manufacturer. Once we solve it, we can repeat for PNIDs for other manufacturers, and then repeat the process for Drawings.

ABOUT THE TEXT CONTENT:

We are attempting to identify the following items:

1. PI tags - There are two types:
   * PI – Live data: Components for which the plant is receiving live data
   * PI – No live data: Components for which the plant is NOT receiving live data
   Notes:
   * There is a list called the PI Tag Explorer that has ~22K PI tags that we can use as a bible, though we do not have information about which PI tags have live data.
2. Other tags - These come in different categories: equipment, cable, instrument. We don't know how to differentiate, so we should just identify as many other tags as we can.
3. Picture attributes:
   * Number
   * Title - (Our UI will allow users to fuzzy match on this.)

PROGRAM OUTPUTS:

The program needs to output the following for each of the 30,000 PDFs:

1. An image file in PNG format that has all of the bounding boxes. See the example PNG file.
2. A JSON file that details the text identified, the type of component, and the coordinates of the bounding box. See the example JSON file.

PROJECT MILESTONES:

This will be a fixed-price project. Payments will be made after we verify you have completed each of the following milestones:

1. Identifying family and metadata for each drawing - 15% of fixed price. For each drawing we can identify:
   * P&ID vs Drawing
   * Manufacturer
   * Drawing number (?)
   * Drawing name
2. Identifying all tags in shapes (e.g., in circles, circles within squares, etc.) - 30% of fixed price
   * We identify at least 98% of all shapes that are present (shape detection)
     - For 100 shapes that exist in a drawing, we are able to identify at least 98 of them
     - For all shapes that are identified, the shapes are indeed really shapes (no false positives)
   * We identify the alphanumeric tags correctly 95%+ of the time (OCR)
     - If we have 100 alphanumeric tags that exist in shapes on a page, then we correctly tag at least 95 alphanumeric tags with accurate alphanumeric transcription.
     - E.g., if a tag's value is "11BSFT500" and we tag it with anything BUT this exact value (e.g., we don't identify it to begin with or we tag it erroneously with something like "IIBSFT500"), then it is considered a failure
3. Identifying cable numbers (i.e., values from the cable number database) - 30% of fixed price. We identify at least 98% of the cable numbers that exist in a drawing. If there are 100 cable numbers, we correctly identify and correctly alphanumerically tag at least 98 of them. Anything other than the exact value (e.g., we don't identify a given cable number or we tag it with an erroneous alphanumeric string) is considered a failure.
4. Identifying remaining freestanding tags in drawings - 25% of fixed price. We identify 98%+ of the freestanding equipment tags in a drawing. Same accuracy constraints as for cable numbers.

For any of these milestones to count as completed, it must be completed for the full set of 30,000 raster PDF files.

PROJECT LOGISTICS:

* All of the raster PDFs are located on a Windows server on AWS. We will give you access to this machine, where you will need to perform all of your work. The pictures cannot leave the machine.
* You will need to hire and manage any resources you need to generate any ground truth data you need. We have generated ground truth in a Google Sheet for 15 files. We don't think this is going to be sufficient, though.
* We will perform our own, independent sample testing of your outputs to confirm you have achieved our quality thresholds.

My company will have similar projects from time to time. If you do a good job, we will likely want to use you on them.
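For applicants who want to check their own outputs before our sample testing, the exact-match rule in the milestones could be scored roughly as follows. This is only a sketch under stated assumptions: the function names, the `THRESHOLDS` keys, and the tag "11BSFT501" are made up for illustration, ground truth and predictions are assumed to be plain lists of tag strings per drawing, and bounding-box positions are not checked here.

```python
from collections import Counter


def exact_match_recall(truth, predicted):
    """Share of ground-truth tags that were reproduced exactly.

    Both arguments are lists of tag strings for one drawing. Tags are
    compared as multisets, so a tag that appears twice must be found twice.
    """
    truth_counts = Counter(truth)
    total = sum(truth_counts.values())
    if total == 0:
        return 1.0  # nothing to find, so nothing was missed
    # Counter & Counter keeps the minimum count per tag (multiset intersection).
    matched = sum((truth_counts & Counter(predicted)).values())
    return matched / total


def false_positive_count(truth, predicted):
    """Number of predicted tags with no exact counterpart in the ground truth."""
    # Counter - Counter keeps only the positive leftovers on the predicted side.
    return sum((Counter(predicted) - Counter(truth)).values())


# Thresholds as stated in the milestones; the dictionary keys are hypothetical.
THRESHOLDS = {
    "shape_tags": 0.95,
    "cable_numbers": 0.98,
    "freestanding_tags": 0.98,
}


def meets_milestone(kind, truth, predicted):
    """True if the exact-match recall reaches the threshold for this tag kind."""
    return exact_match_recall(truth, predicted) >= THRESHOLDS[kind]


if __name__ == "__main__":
    # "IIBSFT500" is the misread from the posting; "11BSFT501" is invented.
    truth = ["11BSFT500", "11BSFT501"]
    predicted = ["IIBSFT500", "11BSFT501"]
    print(exact_match_recall(truth, predicted))
    print(false_positive_count(truth, predicted))
```

Because the comparison is on exact strings, a misread such as "IIBSFT500" counts both as a missed tag and as a false positive, which matches how a failure is described in milestone 2.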

Fixed-price: $4,000.00
Experience Level: Expert
Remote Job
Project Type: One-time project

Skills and Expertise

Mandatory skills

Image Annotation

Image Analysis

Activity on this job

Proposals: 20 to 50
Last viewed by client: 2 weeks ago
Hires: 1
Interviewing: 3
Invites sent: 0
Unanswered invites: 0

About the client

Member since Jun 25, 2020

United States
Commack, 10:00 AM
$33K total spent
5 hires, 1 active
530 hours
Individual client
