Evaluating Google Cloud Vision for OCR

This is a detailed report evaluating whether Google Cloud Vision can be used for real-world, business-process OCR of forms.

Published in: Technology

  1. Evaluation of Google Cloud Vision API 1.1 (beta): POC Report for Japanese Health Care Report OCR. 26 May 2017, Asia Technology Office, Shinichi Hashitani
  2. Executive Summary
     • Google Cloud Vision API is an AIaaS offering from Google built on a machine learning engine. It provides OCR (Optical Character Recognition), image classification, landmark detection, and other features. The OCR feature is highly sophisticated and recognizes Japanese characters with almost perfect accuracy.
     • Its OCR capability is suited to standard document scanning. It does not recognize multi-column document structure well and is not well suited to tabular documents. The format of health care reports varies across medical institutions, but they are all primarily tabular, which makes it difficult to extract meaningful data accurately. Only about 30% of the text is extracted, and it is not structured well enough for processing.
     • The OCR feature does not accept any parameter other than the image itself; therefore, the response JSON must be processed in-house. There are two primary approaches: (1) text mining on the full-text section of the response, and (2) programmatic text composition based on character coordinate information. For health care report scanning, neither approach is feasible.
     • Google Cloud Vision API can still be used for standard-format documents (books, whitepapers, academic writing, public announcements, etc.). It can also be used on publications (newspapers, magazines, case study reports) for text mining purposes. Other formats were also evaluated during the POC.
  3. About Google Cloud Vision API
     Google Cloud Vision API is a REST-based service, accessible from any system, in any language, that can exchange JSON over HTTP. Requests are authenticated with either OAuth2 (recommended) or a Cloud API key. The request format is common to all Cloud Vision services; a "type" must be specified for each specific use. It is also possible to request multiple services on the same image within a single call. (In this case, each type specified is counted as one unit.)
     The round-trip response time for an A4 page with 500 characters is 3 to 6 seconds. Pricing is per unit of task and relatively inexpensive: 1.5 USD per 1,000 units per month; 1.0 USD beyond 20 million units per month; free below 1,000 units per month.
     Request: {"requests": [ {"image": {"content": image_base64}, "features": [ {"type": "TEXT_DETECTION", "maxResults": 1} ] } ]}
     Response: {"responses": [ …. {"blockType": "TEXT", "boundingBox": {"vertices": [ {"x": 594, "y": 327} …. "text": "遥" ….
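The request format described on this slide can be sketched in Python as follows. This is a minimal illustration, not the author's code: the helper names `build_annotate_request` and `count_units` and the sample image bytes are hypothetical, and the sketch only builds the JSON body without contacting the API. It also shows the unit counting rule stated above, where each feature type requested on an image counts as one unit.

```python
import base64
import json


def build_annotate_request(image_bytes, feature_types):
    """Build the JSON body for an images:annotate call (hypothetical helper)."""
    image_base64 = base64.b64encode(image_bytes).decode("utf-8")
    features = [{"type": t, "maxResults": 1} for t in feature_types]
    return {"requests": [{"image": {"content": image_base64}, "features": features}]}


def count_units(request_body):
    """Count billable units: one per feature type requested on each image."""
    return sum(len(r["features"]) for r in request_body["requests"])


# Placeholder bytes stand in for a real scanned image (assumption for illustration).
body = build_annotate_request(b"fake image bytes", ["TEXT_DETECTION", "LABEL_DETECTION"])
payload = json.dumps(body)  # this string would be sent as the HTTP POST body
units = count_units(body)
```

Requesting two feature types on one image, as here, would be billed as two units under the rule quoted on the slide.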
  4. Google Cloud Vision API – Output Format
     The output is JSON composed of two sets of information: 1. Character (or small group of characters) information. 2. The restructured full text of the converted text.
     The full description mimics the actual text structure by concatenating characters based on their coordinates and appending a line break character at the end of each line. The output confirms that the engine analyzes text character by character and can accurately process text that mixes characters from different languages. At this point in time, this capability is far superior to that of Microsoft Cognitive Services, which confuses alphabet characters with similar-looking Kanji characters.
     {"boundingBox": {"vertices": [{"x": 444, "y": 71}, {"x": 485, "y": 67}, {"x": 488, "y": 104}, {"x": 447, "y": 108}]}, "property": {"detectedLanguages": [{"languageCode": "ja"}]}, "text": "基"} ….
     "textAnnotations": [ {"boundingPoly": {"vertices": [{"x": 24, "y": 62}, {"x": 1538, "y": 62}, {"x": 1538, "y": 3096}, {"x": 24, "y": 3096}]}, "description": "-基準値\n|今回ー前回ー前々回ー\n総合判定\n要経過観察! 要経過観察|要経過観察\nメタボリックシンドローム判定\n非該当 1予備群 該当1基準該当\n【心電図】不完全右脚ブロック\n甩\n血中脂質] LDLコレス テロールやや高値。食べ過ぎに注意し、動物\n性脂肪や卵などコレステ ロールの多いものを制限し、経過を見て下\nさい。\n[尿酸]尿酸が高めです。 注意してください。\n治療中の場合は、この結果表を主治医にお見せ下さ い。\n総合判定医師名: 川口 毅 ーーーーーッ童\n総合所見\n", "locale": "ja", ….
  5. Google Cloud Vision API – Output Processing
     Since Google Cloud Vision API does not support structured documents and does not accept any additional information for processing, in-house output processing is needed to extract the desired data from the output. There are two ways: 1. Restructure the data from individual characters using their coordinates. 2. Text mining on the structured full text.
     The approach to take depends on the accuracy, the composition of the structured text, and what needs to be extracted. Text mining is the simpler of the two methods.
     {"boundingBox": {"vertices": [{"x": 444, "y": 71}, {"x": 485, "y": 67}, {"x": 488, "y": 104}, {"x": 447, "y": 108}]}, "property": {"detectedLanguages": …. "textAnnotations": [ … "description": "-基準値\n|今回ー前回ー前々回ー\n総合判定\n要経過観察!要経過観察|要経過観察\n… ….
     Restructuring from characters: "要" + "経" + "過" + "観" + "察" = "要経過観察"
     Text mining on the full text: "…\n総合判定\n要経過観察…" = "要経過観察"
  6. Output Processing – Text Restructuring
     This method processes the raw data. Like the structured full text provided in the output itself, it rebuilds text by concatenating individual characters based on their coordinates.
     Pros:
     - Targets the specific area to be extracted. (Suitable for structured documents.)
     - Less affected by the accuracy of the scan.
     Cons:
     - Requires complex logic. (Requires coordinate-based calculation for each string.)
     - Requires tailored logic for each type of document.
     It is ideal for extracting a small amount of information from the entire document. The logic depends on coordinates; therefore, it cannot process unstructured or semi-structured documents. It also depends strongly on how the document is positioned in the scan; a small mispositioning can cause the logic to fail to fetch the characters to process.
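A minimal sketch of coordinate-based restructuring follows, assuming each recognized character is available as a dict with `text` and `boundingBox.vertices`, shaped like the response fragments on the earlier slides. The helper names (`make_symbol`, `box_center`, `text_in_region`), the sample coordinates, and the 10-pixel line tolerance are all hypothetical choices for illustration, not the deck's implementation.

```python
def make_symbol(text, cx, cy, half=10):
    """Build a sample symbol dict centered at (cx, cy); stands in for API output."""
    return {
        "text": text,
        "boundingBox": {"vertices": [
            {"x": cx - half, "y": cy - half},
            {"x": cx + half, "y": cy - half},
            {"x": cx + half, "y": cy + half},
            {"x": cx - half, "y": cy + half},
        ]},
    }


def box_center(symbol):
    """Return the center point of a symbol's bounding box."""
    vertices = symbol["boundingBox"]["vertices"]
    xs = [v["x"] for v in vertices]
    ys = [v["y"] for v in vertices]
    return sum(xs) / len(xs), sum(ys) / len(ys)


def text_in_region(symbols, left, top, right, bottom, line_tolerance=10):
    """Concatenate symbols whose centers fall inside a rectangle, line by line."""
    hits = []
    for symbol in symbols:
        cx, cy = box_center(symbol)
        if left <= cx <= right and top <= cy <= bottom:
            hits.append((cy, cx, symbol["text"]))
    hits.sort()  # top to bottom first
    lines, current, last_y = [], [], None
    for cy, cx, text in hits:
        # Start a new line when the vertical gap exceeds the tolerance.
        if last_y is not None and cy - last_y > line_tolerance:
            lines.append(current)
            current = []
        current.append((cx, text))
        last_y = cy
    if current:
        lines.append(current)
    # Within each line, order characters left to right.
    return "\n".join("".join(t for _, t in sorted(line)) for line in lines)


SAMPLE = [("要", 100, 100), ("経", 140, 102), ("過", 180, 99),
          ("観", 220, 101), ("察", 260, 100), ("基", 500, 100)]
symbols = [make_symbol(t, x, y) for t, x, y in SAMPLE]
field = text_in_region(symbols, 80, 80, 300, 120)

# The same page scanned 50 pixels lower: the fixed region no longer matches.
shifted = [make_symbol(t, x, y + 50) for t, x, y in SAMPLE]
field_shifted = text_in_region(shifted, 80, 80, 300, 120)
```

The second call illustrates the positioning dependency noted above: with a fixed target rectangle, a modest shift of the scan leaves the region empty.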
  7. Output Processing – Full Text Mining
     Text mining disregards the coordinate information of each character. Instead, it takes the restructured full text as input and searches through the string to extract text.
     Pros:
     - The logic is simple, and known text mining techniques are directly applicable.
     - One piece of logic can possibly be reused across multiple document formats.
     Cons:
     - The accuracy depends entirely on the accuracy of the full text extraction.
     - Failing to read the "key" text also means failing to extract the value.
     It is ideal for processing large amounts of text, especially for analytical purposes. It can still be used to extract a particular set of information if the accuracy of the extracted text is high.
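The full text mining approach can be sketched as a key lookup over the restructured text. This is an illustrative sketch only: the helper `value_after_key`, the rule "the value is on the line after the key", and the separator set used to split the value line are assumptions, and the sample string is an excerpt of the description text shown on the output format slide.

```python
import re


def value_after_key(full_text, key):
    """Return the line following the first line that contains key, or None."""
    lines = full_text.split("\n")
    for i, line in enumerate(lines[:-1]):
        if key in line:
            return lines[i + 1].strip()
    return None


# Excerpt of the restructured full text from the sample response.
full_text = ("-基準値\n|今回ー前回ー前々回ー\n総合判定\n"
             "要経過観察! 要経過観察|要経過観察\n"
             "メタボリックシンドローム判定\n非該当 1予備群 該当1基準該当\n")

raw_value = value_after_key(full_text, "総合判定")
# Split on stray border characters and whitespace; keep the first column only.
first_column = re.split(r"[!|\s]+", raw_value)[0]

# A key that OCR failed to read yields no value at all.
missing = value_after_key(full_text, "血圧")
```

The last lookup shows the second drawback listed above: when the key text is absent from the extracted string, nothing can be recovered for that field.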
  8. Google Cloud Vision API – Restructured Full Text
     Google Cloud Vision is designed for a standard single-column document, which it reads and processes from top to bottom, left to right. It cannot restructure the full text well when the document is in a multi-column format. Google Cloud Vision tries to read and process the page line by line; therefore, an entire row is output as one line, with the columns concatenated and separated by spaces.
     AAAAA\n
     BBBBB EEEEE HHHHH\n
     CCCCC FFFFF IIIII\n
     DDDDD GGGGG JJJJJ\n
     The sentence flows from B to C to D, but the text comes out as B, E, H. A word can be split across two lines, so words that span multiple lines cannot be recognized correctly. Also, when lines do not align horizontally across columns, or when the space between columns is too wide, the entire sentence is often not processed.
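The reading-order problem in the column example on this slide can be reproduced with a toy model. This does not reflect how the service works internally; it only contrasts the row-by-row output the slide describes with the column-by-column order a human reader would follow. The function names are hypothetical.

```python
# Three columns of body text, as in the slide's example layout.
columns = [
    ["BBBBB", "CCCCC", "DDDDD"],
    ["EEEEE", "FFFFF", "GGGGG"],
    ["HHHHH", "IIIII", "JJJJJ"],
]


def read_row_by_row(cols):
    """Join the cells of each visual row with spaces (the observed output)."""
    return "\n".join(" ".join(row) for row in zip(*cols))


def read_column_by_column(cols):
    """Read each column top to bottom before moving right (the intended order)."""
    return "\n".join(line for col in cols for line in col)


observed = read_row_by_row(columns)
intended = read_column_by_column(columns)
```

Comparing `observed` with `intended` makes the interleaving concrete: B is followed by E and H in the former, and by C and D in the latter.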
  9. Reading Health Care Report – POC Procedures
     In this POC, an actual health care report is scanned on an MFP, and the Cloud API is called from a Python program running on a local machine. The same report is prepared in three modes (color, monochrome, grayscale) at the same resolution (300 dpi, JPEG). Since the MFP does not support grayscale, a color TIFF is converted into a grayscale JPEG.
     1. The program reads the image and encodes it into a text format (base64).
     2. The program constructs a JSON request that includes the encoded image and sends it to the Cloud API.
     3. The Cloud API processes the image and sends back the text in JSON format.
     4. The program dumps the JSON response into a physical file for analysis.
  10. POC Result - Monochrome
     - Overall read accuracy is very poor. The left-most pane is not scanned at all.
     - Only limited parts of the document are scanned. Where text is scanned, characters are recognized correctly in most cases.
     - Traditional OCR engines work better with monochrome input, but that is not the case with Google Cloud Vision.
     [Figure legend: Correct / Incorrect / Not Scanned]
  11. POC Result - Grayscale
     - Overall read accuracy is the worst of the three options.
     - The left-most pane is recognized well; outlined characters are also read.
     - Only limited parts of the document are scanned. Where text is scanned, characters are recognized correctly in most cases.
  12. POC Result - Color
     - Overall read accuracy is poor, but better than the other two options.
     - The left-most pane is recognized well; outlined characters are also read.
     - Only limited parts of the document are scanned. Where text is scanned, characters are recognized correctly in most cases.
  13. POC Result – Summary
     All patterns failed to deliver dependable results for production use.
     - The results vary among the three patterns, but none of them recognized even half of the fields of interest.
     - Character recognition accuracy itself is high (around 95%). Still, it is not reliable enough for production use.
     Health care reports are often in a multi-pane, tabular format and are not suited to this solution.
     - Because of the document structure, large parts of the document are not recognized as text areas to scan.
     - Table column borders are wrongly recognized as characters.
     - Table columns are often not fully scanned. (Whitespace between columns is treated as the end of a sentence.)
  14. POC Result – Critical Issues
     Rows not scanned in multi-column structure
     - Since the entire image is scanned as a single-column paragraph, some rows are skipped entirely, depending on how lines align across columns.
     Table border is often wrongly converted to "!" or "1"
     - Since each row is processed as a single line, table borders are also converted to characters, typically "|", but often to a meaningful-looking value such as "1".
     - This happens by chance, and the wrongly converted character can alter the actual value. (In the slide's example, 80 is read as 180.)
  15. POC Result – Critical Issues cont'd
     Columns are skipped due to whitespace between them
     - In tabular formats, the whitespace between column values is often treated as the end of the line, and the remaining columns are not scanned.
  16. Follow Up Case – Overview
     Because document structure significantly affects scan accuracy, the complexity of the health care report makes it particularly challenging for Google Cloud Vision API to process correctly. An additional test was conducted in which the image is divided into three independent images, so that a single 3-pane tabular image becomes three tabular images. Each divided image is sent to Cloud Vision API as a separate request.
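Dividing a scan into panes comes down to computing one crop rectangle per pane. The sketch below assumes three equal-width vertical panes and a page of 2480 x 3508 pixels (roughly A4 at 300 dpi); both are assumptions for illustration, since the deck does not state how the pane boundaries were chosen, and a real report would likely need measured boundaries. The function name `pane_boxes` is hypothetical.

```python
def pane_boxes(width, height, panes=3):
    """Return (left, upper, right, lower) boxes splitting a page into vertical panes."""
    boxes = []
    for i in range(panes):
        left = i * width // panes
        right = (i + 1) * width // panes
        boxes.append((left, 0, right, height))
    return boxes


# Assumed page size: about A4 at 300 dpi.
boxes = pane_boxes(2480, 3508)
```

Each tuple uses the (left, upper, right, lower) convention, so it could be passed to an image library's crop function (for example Pillow's `Image.crop`) before encoding each pane and sending it as its own request.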
  17. Follow Up Case - Result
     - Read accuracy is significantly improved. About 90% of the fields of interest are scanned.
     - Character recognition accuracy is high, at about the same level as in the previous cases.
     - All the critical issues are still present. (They are caused not by the multi-pane document structure, but by the tabular format.)
  18. Overall Summary
     Google Cloud Vision API is not suitable for HCR scanning.
     - The nature of the document structure prevents it from scanning the desired values.
     - Because of critical issues in tabular data scanning, incorrect values can be extracted.
     - For HCR, neither the Text Restructuring nor the Full Text Mining approach can compensate for the scanning inaccuracy.
     By dividing or cutting the image and processing it in parts, Google Cloud Vision API could possibly be used as part of a solution. However:
     - Each image sent is counted as one request. The number of partial images per HCR multiplies the cost and response time of the processing.
     - A fair amount of pre-processing and post-processing effort is needed to extract the right set of data.
     - The required logic depends strongly on the accuracy of the service. There is a high risk that a change in Cloud Vision API behavior affects the entire solution.
     - By the same token, improvements to Google Cloud Vision API could significantly simplify the overall solution. (Cloud Vision API is still in beta.)
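The cost multiplication mentioned above can be estimated with simple arithmetic. This is a rough sketch under stated assumptions: it applies the flat rate of 1.5 USD per 1,000 units quoted earlier in the deck, ignores the free allowance and the high-volume tier, and uses a hypothetical volume of 10,000 reports per month. The function name `monthly_cost_usd` is also hypothetical.

```python
USD_PER_1000_UNITS = 1.5  # rate quoted in the deck; tiers and free allowance ignored


def monthly_cost_usd(reports_per_month, images_per_report, features_per_image=1):
    """Return (units, cost in USD) for a month of OCR requests."""
    units = reports_per_month * images_per_report * features_per_image
    return units, units / 1000 * USD_PER_1000_UNITS


# Hypothetical volume: 10,000 health care reports per month.
whole_page = monthly_cost_usd(10000, 1)   # one request per report
three_panes = monthly_cost_usd(10000, 3)  # one request per pane
```

Under these assumptions the three-pane approach triples both the unit count and the cost relative to sending each report as a single image.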
  19. Appendix 1 – Sample Scanning 1: Standard report with a footer annotation. Scan rate: 100%. Scan accuracy (without punctuation): 100%. Scan accuracy (with punctuation): 99%. Source: Reinsurance Trend Report by SOMPO Japan.
  20. Appendix 1 – Sample Scanning 2: Standard report within a single-column table. Scan rate: 99%. Scan accuracy (without punctuation): 100%. Scan accuracy (with punctuation): 99%. Source: Overview on Japan Pension System by Ministry of Health, Labour, and Welfare.
  21. Appendix 1 – Sample Scanning 3: Standard report within a single-column table and a standard paragraph. Scan rate: 100%. Scan accuracy (without punctuation): 99%. Scan accuracy (with punctuation): 99%. Source: Reinsurance Trend Report by SOMPO Japan.
  22. Appendix 1 – Sample Scanning 4: Standard report with a row-wide image. Scan rate: 100%. Scan accuracy (without punctuation): 100%. Scan accuracy (with punctuation): 99%. Source: Overview on Japan Pension System by Ministry of Health, Labour, and Welfare.
  23. Appendix 1 – Sample Scanning 5: Case study report in two columns with a row-wide image. Scan rate: 94%. Scan accuracy (without punctuation): 99%. Scan accuracy (with punctuation): 98%. Source: IoT Case Study on Fujitsu i Network Systems by CISCO Solution.
  24. Appendix 1 – Sample Scanning 6: Case study report in three columns with in-text images. Scan rate: 93%. Scan accuracy (without punctuation): 100%. Scan accuracy (with punctuation): 99%. Source: IoT Case Study on Fujitsu i Network Systems by CISCO Solution.
  25. Appendix 2 – Program Source Code (Python)

        # coding: utf-8
        import sys
        import json
        import base64
        import requests


        def process_image(image_path):
            GOOGLE_CLOUD_VISION_API_URL = "https://vision.googleapis.com/v1/images:annotate?key="
            GOOGLE_CLOUD_VISION_API_KEY = "you_need_a_real_API_key_here"
            REQUEST_HEADER = {'Content-Type': 'application/json'}

            # Load the image in binary mode and encode it as base64 text
            with open(image_path, 'rb') as image_file:
                image_base64 = base64.b64encode(image_file.read()).decode("utf-8")

            request_json = {
                'requests': [
                    {
                        'image': {'content': image_base64},
                        'features': [{'type': "TEXT_DETECTION", 'maxResults': 1}]
                    }
                ]
            }

  26. Appendix 2 – Program Source Code (Python) cont'd

            # Prepare and send the request
            ocr_session = requests.Session()
            ocr_request = requests.Request("POST",
                                           GOOGLE_CLOUD_VISION_API_URL + GOOGLE_CLOUD_VISION_API_KEY,
                                           data=json.dumps(request_json),
                                           headers=REQUEST_HEADER)
            ocr_response = ocr_session.send(ocr_session.prepare_request(ocr_request),
                                            verify=True, timeout=60)

            # Handle the response
            if ocr_response.status_code == requests.codes.ok:
                print("Process Successful")
                # Dump the JSON response to a file for analysis
                with open(r"D:\ocr_result.json", 'w', encoding="utf-8") as json_file:
                    json.dump(ocr_response.json(), json_file,
                              ensure_ascii=False, indent=4, sort_keys=True)
                return ocr_response.json()
            else:
                print("Process Failed")
                ocr_response.raise_for_status()
                return "error"


        if __name__ == '__main__':
            # Execute process_image with the file name passed as a command line parameter
            print("File name:" + sys.argv[1])
            process_image(sys.argv[1])
