---
language:
- en
- de
tags:
- parser
- parsing
- PDF
- pdfplumber
- docling
- txt
- tables
- python
- windows
- RAG
---

# <b>PDF to TXT converter ready to chunk for your RAG</b>
<b>ONLY WINDOWS</b><br>
<b>EXE and PY available (English and German)</b><br>
better input = better output<br>
* PDF Parser - Sevenof9_v7d.exe - EXE with GUI<br>
* PDF Parser - Sevenof9_v7d.py - Python, you need to install the required libraries<br>
* docling_by_sevenof9_v1.py - Python, you need an NVIDIA RTX GPU to run it fast<br>
all other files are older versions<br><br>
<b>&#x21e8;</b> give me a ❤️, if you like  ;)<br><br>

DOWNLOAD: "PDF Parser - Sevenof9_v7d.exe" or "PDF Parser - Sevenof9_v7d.py" or "docling_by_sevenof9_v1.py" (read below)
...

Most LLM applications convert your PDF to plain txt and nothing more; it is like saving the PDF as a txt file. For ordinary flowing-text books that is quite okay! But blocks of text that sit close together often get mixed up, and tables cannot be read logically.
It is therefore better to convert the PDF with the help of a <b>"parser"</b>, so the embedder can find better context.<br>
I work with "<b>pdfplumber/pdfminer</b>" without OCR, so it is very fast!<br>
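
As a rough illustration (a minimal sketch, not the shipped Sevenof9 code; "example.pdf" is a placeholder), this is the kind of OCR-free extraction pdfplumber makes possible:

```python
import pdfplumber

# Sketch only: pdfplumber reads the PDF layout directly, so text and tables
# can be pulled out without any image recognition, which is why it is fast.
with pdfplumber.open("example.pdf") as pdf:
    for page_number, page in enumerate(pdf.pages, start=1):
        text = page.extract_text() or ""     # flowing text of the page
        tables = page.extract_tables()       # tables as nested Python lists
        print(f"PAGE {page_number}: {len(text)} characters, {len(tables)} table(s)")
```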
<ul style="line-height: 1.05;">
<li>Works with single PDFs, multi-PDF lists and whole folders</li>
<li>Intelligent multiprocessing</li>
<li>Error tolerant: if a PDF is not convertible, it is simply skipped, no special handling needed</li>
<li>Instant view of the result: click a PDF at the top of the list</li>
<li>Converts common tables to JSON format inside the txt file, readable for the embedder (see the sketch after this list)</li>
<li>Adds the absolute PAGE number to each page</li>
<li>Adds the label “chapter” for large fonts and/or “important” for bold fonts</li>
<li>All txt files are created in the original folder of the PDF</li>
<li>All previous txt files are overwritten</li>
<li>Approx. 5 to 20 pages/sec, depending on complexity and system power</li>
<li>Tested on 300 PDF files (~30,000 pages)</li>
</ul>
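
As mentioned in the list, tables end up as JSON blocks and pages get markers and labels. Here is a rough sketch of how such a page could be rendered (an illustration of the idea, not the actual v7d code; the font-size cutoff and the file name are assumptions):

```python
import json
import pdfplumber

def page_to_txt(page, page_number):
    """Sketch: one page of the txt output with an absolute PAGE marker,
    "chapter"/"important" labels from font metadata and tables as JSON."""
    lines = [f"PAGE {page_number}"]
    for word in page.extract_words(extra_attrs=["size", "fontname"]):
        if word["size"] > 14:                    # "large font" cutoff is an assumption
            lines.append('"chapter" ' + word["text"])
        elif "Bold" in word["fontname"]:         # bold fonts usually carry "Bold" in their name
            lines.append('"important" ' + word["text"])
    lines.append(page.extract_text() or "")      # the regular page text
    for table in page.extract_tables():          # common tables become JSON blocks
        if not table:
            continue
        header, *rows = table
        lines.append(json.dumps([dict(zip(header, row)) for row in rows], ensure_ascii=False))
    return "\n".join(lines)

with pdfplumber.open("example.pdf") as pdf:      # placeholder file name
    txt = "\n\n".join(page_to_txt(p, i) for i, p in enumerate(pdf.pages, start=1))
```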
<br>
I created this with my own brain and the help of ChatGPT; I am not a coder, sorry, so I will not fulfill feature requests unless there are real errors.<br>
The GUI, the functionality and, on top of that, compiling it are really hard for me.<br>
For the Python file you need to install the missing libraries yourself.<br>
Of course there is still a lot of room for optimization (saving/error handling) or for the use of other parser libraries, but it's a start.
<br><br>
I am working on a ~50% faster version. In addition, the GUI should allow more influence on the processing, e.g. a faster raw-text mode, trimming margins (yes/no) with a user-set percentage, setting the minimum size of "unimportant" text blocks, keeping the layout with line breaks or forcing more continuous text, and a preview of the first 10 pages with generated images showing what is detected, with borders around text and tables.<br>
Give me a hand if you can ;)<br>
...
<br>
I also have a "<b>docling</b>" parser with OCR (a GPU is needed for fast processing); it is only a Python file, not compiled.<br>
You have to install all libraries, and on the first start the OCR models are downloaded automatically. At the moment I have prepared a kind of multi-docling:
the number of PDFs processed in parallel depends on your VRAM and on whether you use OCR only for tables or for everything. I have set VRAM = 16 GB (my GPU RAM, you should set yours), and the number of parallel docling calls is VRAM/1.3,
so it uses ~12 GB (in my version) and processes 12 PDFs at once. Only text and tables are converted, so no images and no diagrams (processing pages in parallel is too complicated). For now all PDFs must be in the same folder as the Python file.
If you switch OCR to everything, the VRAM consumption rises and you have to change 1.3 to 2 or more (see the sketch below).
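
A tiny sketch of that worker-count rule (variable names are illustrative, not taken from docling_by_sevenof9_v1.py):

```python
# VRAM-based worker count as described above.
VRAM_GB = 16          # set this to your GPU's VRAM
OCR_DIVISOR = 1.3     # ~1.3 for table-only OCR; set to 2 or more for full-page OCR

parallel_pdfs = max(1, int(VRAM_GB / OCR_DIVISOR))
print(parallel_pdfs)  # 16 / 1.3 -> 12 PDFs processed at once
```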
<br><br>

<b>Now have fun and leave a comment if you like ;)</b><br>
On Discord: "sevenof9"
<br>
My embedder collection:<br>
https://huggingface.co/kalle07/embedder_collection

<br>
<br>
I am not responsible for any errors or crashes on your system. If you use it, you take full responsibility!