Locate Text and Shapes in a Raster PDF and Annotate

Posted 2 weeks ago

Worldwide

Summary

This task needs to be completed ASAP (i.e., in less than 7 days). We have a set of 30,000 technical pictures in raster PDF format that describe the components of a coal power plant. We need someone to create an OCR program that detects text inside these raster PDFs and creates JSON files containing the coordinates of the bounding boxes for the identified text.

PROGRAM INPUTS (THE PDFS):

1. These PDFs contain two different types of pictures:
   * Process & Instrumentation pictures (PNIDs or P&IDs) - Classic engineering process pictures. The most important information they contain relates to how components are connected upstream and downstream. These files are clearer and will probably work better for OCR.
   * Drawings - More detailed pictures that depict components to scale.
2. A handful of manufacturers built this coal power plant. Each produced PNIDs and Drawings in its own unique format. We have no reliable index or information about which pictures are PNIDs and which are Drawings. You will likely need to programmatically classify each picture by type and manufacturer.

Since OCR will probably be easier for PNIDs, we should probably start the OCR process with the PNIDs for one manufacturer. Once we solve it, we can repeat for PNIDs for other manufacturers, and then repeat the process for Drawings.

ABOUT THE TEXT CONTENT:

We are attempting to identify the following items:

1. PI tags - There are two types:
   * PI – Live data: Components for which the plant is receiving live data
   * PI – No live data: Components for which the plant is NOT receiving live data
   Notes:
   * There is a list called the PI Tag Explorer that has ~22K PI tags that we can use as a bible, though we do not have information about which PI tags have live data.
2. Other tags - These come in different categories: equipment, cable, instrument. We don't know how to differentiate them, so we should just identify as many other tags as we can.
3. Picture attributes:
   * Number
   * Title - (Our UI will allow users to fuzzy match on this.)

PROGRAM OUTPUTS:

The program needs to output the following for each of the 30,000 PDFs:

1. An image file in PNG format that shows all of the bounding boxes. See the example png file.
2. A JSON file that details the text identified, the type of component, and the coordinates of the bounding box. See the example JSON file.

PROJECT MILESTONES:

This will be a fixed-price project. Payments will be made after we verify you have completed each of the following milestones:

1. Identifying family and metadata for each drawing - 15% of fixed price
   For each drawing we can identify:
   * P&ID vs Drawing
   * Manufacturer
   * Drawing number (?)
   * Drawing name
2. Identifying all tags in shapes (e.g., in circles, circles within squares, etc.) - 30% of fixed price
   * We identify at least 98% of all shapes that are present (shape detection)
     - For 100 shapes that exist in a drawing, we are able to identify at least 98 of them
     - For all shapes that are identified, the shapes are indeed really shapes (no false positives)
   * We identify the alphanumeric tags correctly 95%+ of the time (OCR)
     - If we have 100 alphanumeric tags that exist in shapes on a page, then we correctly tag at least 95 of them with an accurate alphanumeric transcription.
     - E.g., if a tag's value is "11BSFT500" and we tag it with anything BUT this exact value (e.g., we don't identify it to begin with or we tag it erroneously with something like "IIBSFT500"), then it is considered a failure.
3. Identifying cable numbers (i.e., values from cable number database) - 30% of fixed price
   We identify at least 98% of the cable numbers that exist in a drawing. If there are 100 cable numbers, we correctly identify and correctly alphanumerically tag at least 98 of them. Anything other than the exact value (e.g., we don't identify a given cable number or we tag it with an erroneous alphanumeric string) is considered a failure.
4. Identifying remaining freestanding tags in drawings - 25% of fixed price
   We identify 98%+ of the freestanding equipment tags in a drawing. Same accuracy constraints as for cable numbers.

For any of these milestones to count as completed, it must be completed for the full set of 30,000 raster PDF files.

PROJECT LOGISTICS:

* All of the raster PDFs are located on a Windows server on AWS. We will give you access to this machine, where you will need to perform all of your work. The pictures cannot leave the machine.
* You will need to hire and manage any resources you need to generate any ground truth data you need. We have generated ground truth in a Google Sheet for 15 files. We don't think this is going to be sufficient, though.
* We will perform our own independent sample testing of your outputs to confirm you have achieved our quality thresholds.

My company will have similar projects from time to time. If you do a good job, we will likely want to use you on them.
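
As an illustration only, the exact-match thresholds in the milestones could be checked against ground truth along the lines of the Python sketch below. This is not our actual sample-testing procedure. The function names are hypothetical, and every tag value other than "11BSFT500" and "IIBSFT500" is made up for the example.

```python
from collections import Counter


def exact_match_rate(ground_truth, predicted):
    """Fraction of ground-truth tags reproduced exactly by the predictions.

    Both arguments are lists of tag strings for one drawing. Duplicates are
    counted separately, so a tag that appears twice must be found twice.
    """
    if not ground_truth:
        return 1.0  # nothing to find, so nothing was missed
    truth = Counter(ground_truth)
    found = Counter(predicted)
    matched = sum(min(count, found[tag]) for tag, count in truth.items())
    return matched / len(ground_truth)


def false_positives(ground_truth, predicted):
    """Predicted tags that have no exact ground-truth counterpart."""
    extra = Counter(predicted) - Counter(ground_truth)
    return sorted(extra.elements())


# Hypothetical drawing: one tag is misread ("11" transcribed as "II").
truth = ["11BSFT500", "11BSFT501", "11BSFT502", "11BSFT503"]
pred = ["IIBSFT500", "11BSFT501", "11BSFT502", "11BSFT503"]

rate = exact_match_rate(truth, pred)
meets_ocr_threshold = rate >= 0.95  # 95% threshold from milestone 2
```

In this example the misread tag counts both as a miss (it lowers the match rate) and as a false positive, which matches the rule that anything other than the exact value is a failure.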

  • $4,000.00

    Fixed-price
  • Expert
    Experience Level
  • Remote Job
  • One-time project
    Project Type
Skills and Expertise
Mandatory skills
Image Annotation
Image Analysis
Activity on this job
  • Proposals: 20 to 50
  • Last viewed by client: 2 weeks ago
  • Hires:
    1
  • Interviewing:
    3
  • Invites sent:
    0
  • Unanswered invites:
    0
About the client
Member since Jun 25, 2020
  • United States
    Commack 10:00 AM
  • $33K total spent
    5 hires, 1 active
  • 530 hours
  • Individual client
