Support multi-language
Does this model support languages other than English?
If it doesn't, how can data be synthesized and the model retrained for other languages?
Since this is a popular request, I will try to put together a script that demonstrates fine-tuning on a different language, and I will leave the issue open until that is done. We are working on multilingual support and will evaluate it across different languages. In the meantime, the easiest way to get started is to convert your documents to DoclingDocuments, export them to DocTags, and fine-tune on those. See "export_to_document_tokens": https://docling-project.github.io/docling/reference/docling_document/
@phonk2682 Thanks for your interest. To fine-tune properly, I suggest having your documents in DocTags format. To get there, check https://docling-project.github.io/docling/reference/docling_document/: you can import documents from HTML, Markdown, etc., or build the DoclingDocument yourself, and then export it to DocTags. That gives you the exact format we use for training.
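For anyone preparing fine-tuning data this way, here is a rough sketch of collecting the exported DocTags into a single JSONL file. It is not the training script used for the model. The function name `build_doctags_jsonl`, the record keys `source` and `doctags`, the suffix list, and the `export` callable are all hypothetical choices made for illustration; `export` stands for whatever function turns one file into a DocTags string in your setup.

```python
import json
from pathlib import Path

# Illustrative choice of input types; adjust to whatever your converter accepts.
SUFFIXES = {".md", ".html"}


def build_doctags_jsonl(input_dir, output_path, export):
    """Write one JSON line per source file: {"source": ..., "doctags": ...}.

    `export` is any callable that maps a file path to a DocTags string
    (hypothetical hook; plug in your own Docling-based conversion).
    Returns the number of records written.
    """
    # Collect the inputs first so the output file is never picked up as input.
    files = sorted(
        p for p in Path(input_dir).rglob("*")
        if p.is_file() and p.suffix.lower() in SUFFIXES
    )
    with open(output_path, "w", encoding="utf-8") as out:
        for path in files:
            record = {"source": str(path), "doctags": export(path)}
            # ensure_ascii=False keeps non-English text readable in the file.
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(files)
```

Keeping the conversion behind a callable means the same loop works whether the DocTags come from Docling's converter or from a DoclingDocument you built yourself.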
Hi @asnassar ,
I've been reviewing the Docling documentation at https://docling-project.github.io/docling/reference/docling_document but couldn't find information on converting HTML or Markdown files into DocTags format. The documentation appears to cover only the conversion from DocTags to Markdown/HTML.
Could you please provide guidance on how to perform this reverse conversion? Any examples or documentation would be greatly appreciated.
Thank you for your help!
Hello @hatimbr, you can find it over here:
https://docling-project.github.io/docling/examples/minimal/
Instead of exporting to Markdown, just use result.document.export_to_document_tokens.
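A minimal sketch of that suggestion, assuming the `DocumentConverter` class from the linked minimal example and the `export_to_document_tokens` method named in this thread are available in your installed version (newer docling-core releases may expose the export under a different name). The wrapper `to_doctags` and its `converter` parameter are hypothetical, added here only so the function can be tried without Docling installed.

```python
def to_doctags(source, converter=None):
    """Convert one document (file path or URL) and return its DocTags string."""
    if converter is None:
        # Imported lazily so this module loads even where docling is absent.
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()
    result = converter.convert(source)
    # Same call chain as the minimal example, with the Markdown export
    # swapped for the DocTags export mentioned above.
    return result.document.export_to_document_tokens()
```

With Docling installed, calling `to_doctags("page.html")` or `to_doctags("notes.md")` should go through the same converter, since it picks the input format from the file itself.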
@hatimbr This is easy:
- First, convert the document into a DoclingDocument.
- Then export the DoclingDocument to DocTags (see: https://github.com/docling-project/docling-core/blob/2371c11b8f74628169a9bb377036511235070af0/docling_core/types/doc/document.py#L3552)
PS: New serializers are coming today: https://github.com/docling-project/docling-core/pull/192
=> If you can wait, I would suggest using those.
@asnassar @PeterWJStaar Thank you for your answers. I understand them, but I still don't know how to transform a Markdown or HTML file into a DoclingDocument without doing it manually. I don't see any method for that. Maybe I should check the PR you pushed.