pierreguillou committed on
Commit 6dbb555 · 1 Parent(s): eaf48db

Update app.py

Files changed (1)
  1. app.py +44 -15
app.py CHANGED
@@ -56,35 +56,64 @@ def app_outputs(uploaded_pdf):
     if not msg.startswith("Error with the PDF"):
 
         # Extraction of image data (text and bounding boxes)
-        dataset, lines, row_indexes, par_boxes, line_boxes = extraction_data_from_image(images)
+        dataset, texts_lines, texts_pars, texts_lines_par, row_indexes, par_boxes, line_boxes, lines_par_boxes = extraction_data_from_image(images)
+        print(dataset)
         # prepare our data in the format of the model
-        encoded_dataset = dataset.map(prepare_inference_features, batched=True, batch_size=64, remove_columns=dataset.column_names)
+        encoded_dataset = dataset.map(prepare_inference_features_paragraph, batched=True, batch_size=64, remove_columns=dataset.column_names)
         custom_encoded_dataset = CustomDataset(encoded_dataset, tokenizer)
         # Get predictions (token level)
         outputs, images_ids_list, chunk_ids, input_ids, bboxes = predictions_token_level(images, custom_encoded_dataset)
-        # Get predictions (paragraph level)
-        probs_bbox, bboxes_list_dict, input_ids_dict_dict, probs_dict_dict, df = predictions_paragraph_level(dataset, outputs, images_ids_list, chunk_ids, input_ids, bboxes)
-        # Get labeled images with paragraphs bounding boxes
-        images = get_labeled_images(dataset, images_ids_list, bboxes_list_dict, probs_dict_dict)
+        # Get predictions (paragraph level)
+        probs_bbox, bboxes_list_dict, input_ids_dict_dict, probs_dict_dict, df = predictions_paragraph_level_gradio(dataset, outputs, images_ids_list, chunk_ids, input_ids, bboxes)
+        # Get labeled images with paragraph bounding boxes
+        labeled_images = get_labeled_images_gradio(dataset, images_ids_list, bboxes_list_dict, probs_dict_dict)
 
         img_files = list()
-        # get image of PDF without bounding boxes
+        # get image of PDF with bounding boxes
         for i in range(num_images):
             if filename != "files/blank.png": img_file = f"img_{i}_" + filename.replace(".pdf", ".png")
             else: img_file = filename.replace(".pdf", ".png")
-            images[i].save(img_file)
+            labeled_images[i].save(img_file)
             img_files.append(img_file)
 
         if num_images < max_imgboxes:
+            num_true_images = num_images
             img_files += [image_blank]*(max_imgboxes - num_images)
-            images += [Image.open(image_blank)]*(max_imgboxes - num_images)
+            labeled_images += [Image.open(image_blank)]*(max_imgboxes - num_images)
             for count in range(max_imgboxes - num_images):
                 df[num_images + count] = pd.DataFrame()
         else:
+            num_true_images = max_imgboxes
             img_files = img_files[:max_imgboxes]
-            images = images[:max_imgboxes]
+            labeled_images = labeled_images[:max_imgboxes]
            df = dict(itertools.islice(df.items(), max_imgboxes))
 
+        for num_page in range(num_true_images):
+            example = dataset[num_page]
+            df_num_page = df[num_page]
+            width, height = example["images"].size
+
+            # apply same transformations
+            bboxes_par_list = [denormalize_box(normalize_box(upperleft_to_lowerright(bbox), width, height), width, height) for bbox in example['bboxes_par']]
+
+            texts_list = list()
+            for bbox_par, label in zip(df_num_page["bboxes"].tolist(), df_num_page["labels"]):
+                index_par = bboxes_par_list.index(bbox_par)
+                bboxes_lines_par_list = dataset[num_page]["bboxes_lines_par"][index_par]
+                texts_lines_par_list = dataset[num_page]["texts_lines_par"][index_par]
+                boxes, texts = sort_data_wo_labels(bboxes_lines_par_list, texts_lines_par_list)
+                # apply the text strategy matching the label
+                if label == "Text" or label == "Caption" or label == "Footnote":
+                    texts = ' '.join(texts)
+                else:
+                    texts = '\n'.join(texts)
+                texts_list.append(texts)
+
+            df[num_page]["Paragraph text"] = texts_list
+
+            cols = ["bboxes", "Paragraph text", "labels"]
+            df[num_page] = df[num_page][cols]
+
         # save
         csv_files = list()
         for i in range(max_imgboxes):
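Two parts of the hunk above lean on helpers defined earlier in app.py that the diff does not show. First, prepare_inference_features_paragraph splits each page's tokens into chunks of at most 512 tokens so that no OCR text is truncated away. A minimal sketch of that idea, with assumed names and details (not the committed implementation):

# Hypothetical sketch of the 512-token chunking idea behind
# prepare_inference_features_paragraph (assumed behavior, not the app's code).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def chunk_page(words, boxes, max_length=512):
    # Tokenize pre-split words; overflowing tokens spill into extra chunks
    # instead of being truncated away.
    enc = tokenizer(
        words,
        is_split_into_words=True,
        truncation=True,
        max_length=max_length,
        return_overflowing_tokens=True,
    )
    chunks = []
    for i, input_ids in enumerate(enc["input_ids"]):
        # carry each token's bounding box by mapping it back to its source word
        word_ids = enc.word_ids(batch_index=i)
        bboxes = [boxes[w] if w is not None else [0, 0, 0, 0] for w in word_ids]
        chunks.append({"input_ids": input_ids, "bbox": bboxes})
    return chunks

Second, the for num_page loop matches each predicted paragraph box in df back to the page's OCR paragraphs with bboxes_par_list.index(bbox_par), then space-joins the paragraph's lines for running text (Text, Caption, Footnote) and newline-joins them for the other labels. The exact-match lookup only works because both sides of the comparison went through the same coordinate round trip. A plausible sketch of the three box helpers (assumed implementations; the real ones live elsewhere in app.py):

def upperleft_to_lowerright(bbox):
    # make sure (x0, y0) is the upper-left corner and (x1, y1) the lower-right one
    x0, y0, x1, y1 = tuple(bbox)
    if x1 < x0:
        x0, x1 = x1, x0
    if y1 < y0:
        y0, y1 = y1, y0
    return [x0, y0, x1, y1]

def normalize_box(bbox, width, height):
    # scale pixel coordinates to the 0-1000 range used by LayoutLM-style models
    return [
        int(1000 * bbox[0] / width),
        int(1000 * bbox[1] / height),
        int(1000 * bbox[2] / width),
        int(1000 * bbox[3] / height),
    ]

def denormalize_box(bbox, width, height):
    # map 0-1000 coordinates back to pixel space
    return [
        int(width * bbox[0] / 1000),
        int(height * bbox[1] / 1000),
        int(width * bbox[2] / 1000),
        int(height * bbox[3] / 1000),
    ]

Under these assumptions the round trip is not a no-op: both directions truncate to integers, so applying it to example['bboxes_par'] reproduces the rounding already baked into the boxes stored in df, and the .index() lookup can find an exact match.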
@@ -93,25 +122,25 @@ def app_outputs(uploaded_pdf):
             df[i].to_csv(csv_file, encoding="utf-8", index=False)
 
     else:
-        img_files, images, csv_files = [""]*max_imgboxes, [""]*max_imgboxes, [""]*max_imgboxes
+        img_files, labeled_images, csv_files = [""]*max_imgboxes, [""]*max_imgboxes, [""]*max_imgboxes
         img_files[0], img_files[1] = image_blank, image_blank
-        images[0], images[1] = Image.open(image_blank), Image.open(image_blank)
+        labeled_images[0], labeled_images[1] = Image.open(image_blank), Image.open(image_blank)
         csv_file = "csv_wo_content.csv"
         csv_files[0], csv_files[1] = gr.File.update(value=csv_file, visible=True), gr.File.update(value=csv_file, visible=True)
         df, df_empty = dict(), pd.DataFrame()
         df[0], df[1] = df_empty.to_csv(csv_file, encoding="utf-8", index=False), df_empty.to_csv(csv_file, encoding="utf-8", index=False)
 
-    return msg, img_files[0], img_files[1], images[0], images[1], csv_files[0], csv_files[1], df[0], df[1]
+    return msg, img_files[0], img_files[1], labeled_images[0], labeled_images[1], csv_files[0], csv_files[1], df[0], df[1]
 
 # gradio APP
 with gr.Blocks(title="Inference APP for Document Understanding at paragraph level (v1)", css=".gradio-container") as demo:
     gr.HTML("""
     <div style="font-family:'Times New Roman', 'Serif'; font-size:26pt; font-weight:bold; text-align:center;"><h1>Inference APP for Document Understanding at paragraph level (v1)</h1></div>
-    <div style="margin-top: 40px"><p>(02/12/2023) This Inference APP uses the <a style="text-decoration: none; border-bottom: #64b5f6 0.125em solid; color: #64b5f6" href="https://huggingface.co/pierreguillou/lilt-xlm-roberta-base-finetuned-with-DocLayNet-base-at-paragraphlevel-ml512-v4" target="_blank">model LiLT base combined with XLM-RoBERTa base and finetuned on the dataset DocLayNet base at paragraph level</a> (chunk size of 512 tokens).</p></div>
+    <div style="margin-top: 40px"><p>(02/12/2023) This Inference APP uses the <a style="text-decoration: none; border-bottom: #64b5f6 0.125em solid; color: #64b5f6" href="https://huggingface.co/pierreguillou/lilt-xlm-roberta-base-finetuned-with-DocLayNet-base-at-paragraphlevel-ml512" target="_blank">model LiLT base combined with XLM-RoBERTa base and finetuned on the dataset DocLayNet base at paragraph level</a> (chunk size of 512 tokens).</p></div>
     <div><p><a style="text-decoration: none; border-bottom: #64b5f6 0.125em solid; color: #64b5f6" href="https://arxiv.org/abs/2202.13669" target="_blank">LiLT (Language-Independent Layout Transformer)</a> is a Document Understanding model that uses both layout and text in order to detect labels of bounding boxes. Combined with the model <a style="text-decoration: none; border-bottom: #64b5f6 0.125em solid; color: #64b5f6" href="https://huggingface.co/xlm-roberta-base" target="_blank">XLM-RoBERTa base</a>, this finetuned model has the capacity to <b>understand any language</b>. Finetuned on the dataset <a style="text-decoration: none; border-bottom: #64b5f6 0.125em solid; color: #64b5f6" href="https://huggingface.co/datasets/pierreguillou/DocLayNet-base" target="_blank">DocLayNet base</a>, it can <b>classify any bounding box (and its OCR text) into 11 labels</b> (Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, Title).</p></div>
     <div><p>It relies on an external OCR engine to get words and bounding boxes from the document image. Thus, this APP first runs an OCR engine (<a style="text-decoration: none; border-bottom: #64b5f6 0.125em solid; color: #64b5f6" href="https://github.com/madmaze/pytesseract#python-tesseract" target="_blank">PyTesseract</a>) to get the bounding boxes, then runs LiLT (already fine-tuned on the dataset DocLayNet base at paragraph level) on the individual tokens, and then visualizes the result at paragraph level!</p></div>
     <div><p><b>It allows you to get all pages of any PDF (in any language) with bounding boxes labeled at paragraph level and the associated dataframes with labeled data (bounding boxes, texts, labels) :-)</b></p></div>
-    <div><p>However, the inference time per page can be high when running the model on CPU due to the number of paragraph predictions to be made. Therefore, to avoid running this APP for too long, <b>only the first 2 pages are processed by this APP</b>. If you want to increase this limit, you can either clone this APP in Hugging Face Space (or run its <a style="text-decoration: none; border-bottom: #64b5f6 0.125em solid; color: #64b5f6" href="https://github.com/piegu/language-models/blob/master/Gradio_inference_on_LiLT_model_finetuned_on_DocLayNet_base_in_any_language_at_levelparagraphs_ml512-v4.ipynb" target="_blank">notebook</a> on your own platform) and change the value of the parameter <code>max_imgboxes</code>, or run the inference notebook "<a style="text-decoration: none; border-bottom: #64b5f6 0.125em solid; color: #64b5f6" href="https://github.com/piegu/language-models/blob/master/inference_on_LiLT_model_finetuned_on_DocLayNet_base_in_any_language_at_levelparagraphs_ml512-v4.ipynb" target="_blank">Document AI | Inference at paragraph level with a Document Understanding model (LiLT fine-tuned on DocLayNet dataset)</a>" on your own platform as it does not have this limit.</p></div>
+    <div><p>However, the inference time per page can be high when running the model on CPU due to the number of paragraph predictions to be made. Therefore, to avoid running this APP for too long, <b>only the first 2 pages are processed by this APP</b>. If you want to increase this limit, you can either clone this APP in Hugging Face Space (or run its <a style="text-decoration: none; border-bottom: #64b5f6 0.125em solid; color: #64b5f6" href="https://github.com/piegu/language-models/blob/master/Gradio_inference_on_LiLT_model_finetuned_on_DocLayNet_base_in_any_language_at_levelparagraphs_ml512.ipynb" target="_blank">notebook</a> on your own platform) and change the value of the parameter <code>max_imgboxes</code>, or run the inference notebook "<a style="text-decoration: none; border-bottom: #64b5f6 0.125em solid; color: #64b5f6" href="https://github.com/piegu/language-models/blob/master/inference_on_LiLT_model_finetuned_on_DocLayNet_base_in_any_language_at_levelparagraphs_ml512.ipynb" target="_blank">Document AI | Inference at paragraph level with a Document Understanding model (LiLT fine-tuned on DocLayNet dataset)</a>" on your own platform as it does not have this limit.</p></div>
     <div style="margin-top: 20px"><p>More information about the DocLayNet datasets, the finetuning of the model and this APP can be found in the following blog posts:</p>
     <ul><li>(02/14/2023) <a style="text-decoration: none; border-bottom: #64b5f6 0.125em solid; color: #64b5f6" href="https://medium.com/@pierre_guillou/document-ai-inference-app-for-document-understanding-at-line-level-a35bbfa98893" target="_blank">Document AI | Inference APP for Document Understanding at line level</a></li><li>(02/10/2023) <a style="text-decoration: none; border-bottom: #64b5f6 0.125em solid; color: #64b5f6" href="https://medium.com/@pierre_guillou/document-ai-document-understanding-model-at-line-level-with-lilt-tesseract-and-doclaynet-dataset-347107a643b8" target="_blank">Document AI | Document Understanding model at line level with LiLT, Tesseract and DocLayNet dataset</a></li><li>(01/31/2023) <a style="text-decoration: none; border-bottom: #64b5f6 0.125em solid; color: #64b5f6" href="https://medium.com/@pierre_guillou/document-ai-doclaynet-image-viewer-app-3ac54c19956" target="_blank">Document AI | DocLayNet image viewer APP</a></li><li>(01/27/2023) <a style="text-decoration: none; border-bottom: #64b5f6 0.125em solid; color: #64b5f6" href="https://medium.com/@pierre_guillou/document-ai-processing-of-doclaynet-dataset-to-be-used-by-layout-models-of-the-hugging-face-hub-308d8bd81cdb" target="_blank">Document AI | Processing of DocLayNet dataset to be used by layout models of the Hugging Face hub (finetuning, inference)</a></li></ul></div>
     """)
 