Multilingual Support Plans for SmolDocling-256M

#13
by reix2098 - opened

Hello,

I am interested in utilizing the SmolDocling-256M model for document conversion tasks across multiple languages. Could you please let me know whether there are any plans to extend the model's capabilities to support multilingual document processing in the near future?

Thank you for your efforts in developing this model.

Best regards,

Adrian Rey.

Docling org

Hello Adrian,
We definitely plan on supporting more languages soon; in fact, we already support quite a few indirectly. The only issue with claiming multilingual support is that we then have to evaluate on multiple languages and need a good benchmark dataset covering most of our features. Which languages are you interested in, so we can add them to the to-do list?

Hello @asnassar, I am also interested to know whether Docling supports Arabic. From what I tested, it seems to read the characters from left to right.
Is there any configuration I can apply so that it outputs the words from right to left?

Thanks

Docling org

@rimahajou, this was a bug in the training, and we are currently fixing it. As an Arabic speaker myself, I assure you we will fix this in an upcoming release. The bug carried over from a previous version of Docling, which we have already fixed in Docling; we just have to apply the same fix to SmolDocling.

An example:

[image attachment]
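
Until the retrained checkpoint is released, one possible stopgap is to post-process the generated text. This is not an official Docling or SmolDocling feature: fix_reversed_arabic below is a hypothetical helper, and it assumes the bug simply emits each Arabic run with its characters in reversed (visual, left-to-right) order, so verify that assumption against your own outputs first.

```python
import re

# Matches a contiguous run of Arabic words (Arabic letters, possibly separated
# by whitespace). Range U+0600-U+06FF covers the core Arabic block.
ARABIC_RUN = re.compile(r"[\u0600-\u06FF]+(?:\s+[\u0600-\u06FF]+)*")

def fix_reversed_arabic(text: str) -> str:
    """Reverse each Arabic run, assuming the model emitted it in visual
    (left-to-right) character order instead of logical right-to-left order."""
    return ARABIC_RUN.sub(lambda m: m.group(0)[::-1], text)

if __name__ == "__main__":
    # "مرحبا" (hello) with its characters reversed, as the bug would produce it:
    garbled = "Greeting: ابحرم"
    print(fix_reversed_arabic(garbled))  # -> "Greeting: مرحبا"
```

The sketch does not handle digits or Latin text embedded inside an Arabic run, nor punctuation placement, so treat it only as a temporary workaround.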

Dear Ahmed,

Thank you for considering the inclusion of additional languages in the SmolDocling-256M model. I am primarily interested in Spanish support, as it is among the most spoken languages globally, with approximately 548 million speakers. Additionally, support for French (280 million speakers), Portuguese (257 million), German (134.6 million), and Italian (67.9 million) would be highly beneficial, given their significant numbers of speakers worldwide. These languages are not only widely spoken but also use the Latin alphabet, similar to English, which could facilitate their integration into the model without the need for extensive changes to the training dataset.

Best regards,

Adrian Rey.

Thanks @asnassar! Excited to see the new Docling releases with that fix!

SmolDocling-256M does an amazing job.
Do you have any plans to add Thai (which has about 70 million speakers) to this model?

Does it support Greek?

First of all, I want to congratulate you on the beautiful work you have done. I am interested in using it for Portuguese.

Thank you for this model; this is something I have been looking for for a long time. If possible, please consider adding support for the Polish language, which is spoken by 43 million people.
