Multilingual Support Plans for SmolDocling-256M

#13
by reix2098 - opened

Hello,

I am interested in utilizing the SmolDocling-256M model for document conversion tasks across multiple languages. Could you please let me know whether there are any plans to extend the model's capabilities to support multilingual document processing in the near future?

Thank you for your efforts in developing this model.

Best regards,

Adrian Rey.

Docling org

Hello Adrian,
We definitely plan on supporting more languages soon; in fact, we already support quite a few indirectly. The only issue with claiming multilingual support is that we then have to evaluate on multiple languages and need a good benchmark dataset covering most of our features. Which languages are you interested in, so we can add them to the to-do list?

Hello @asnassar, I am also interested to know whether Docling supports Arabic. From what I tested, it seems to read the characters from left to right.
Is there any configuration I can apply so that it outputs the words from right to left?

Thanks

Docling org

@rimahajou, this was a bug in the training, and we are currently fixing it. As an Arabic speaker myself, I assure you we will fix this in an upcoming release. The bug carried over from a previous version of Docling, which we have already fixed in Docling; we just have to apply the same fix to SmolDocling.

An example:

[image attachment]
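
Until the retrained checkpoint is released, one possible stopgap is to post-process the generated text. This is not an official Docling or SmolDocling feature: fix_reversed_arabic below is a hypothetical helper, and it assumes the bug simply emits each Arabic run with its characters in reversed (visual, left-to-right) order, so verify that assumption against your own outputs first.

```python
import re

# Matches a contiguous run of Arabic words (Arabic letters, possibly separated
# by whitespace). Range U+0600-U+06FF covers the core Arabic block.
ARABIC_RUN = re.compile(r"[\u0600-\u06FF]+(?:\s+[\u0600-\u06FF]+)*")

def fix_reversed_arabic(text: str) -> str:
    """Reverse each Arabic run, assuming the model emitted it in visual
    (left-to-right) character order instead of logical right-to-left order."""
    return ARABIC_RUN.sub(lambda m: m.group(0)[::-1], text)

if __name__ == "__main__":
    # "مرحبا" (hello) with its characters reversed, as the bug would produce it:
    garbled = "Greeting: ابحرم"
    print(fix_reversed_arabic(garbled))  # -> "Greeting: مرحبا"
```

The sketch does not handle digits or Latin text embedded inside an Arabic run, nor punctuation placement, so treat it only as a temporary workaround.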

Dear Ahmed,

Thank you for considering the inclusion of additional languages in the SmolDocling-256M model. I am primarily interested in Spanish support, as it is among the most spoken languages globally, with approximately 548 million speakers. Additionally, support for French (280 million speakers), Portuguese (257 million), German (134.6 million), and Italian (67.9 million) would be highly beneficial, given their significant numbers of speakers worldwide. These languages are not only widely spoken but also use the Latin alphabet, similar to English, which could facilitate their integration into the model without the need for extensive changes to the training dataset.

Best regards,

Adrian Rey.

Thanks @asnassar! Excited to see the new Docling releases with that fix!

SmolDocling-256M does an amazing job.
Do you have any plans to add Thai (which has about 70 million speakers) to this model?

Does it support Greek?

First of all, I want to congratulate you on the beautiful work you have done. I am interested in using it for Portuguese.

Thank you for this model; this is something I have been looking for for a long time. If possible, please consider adding support for the Polish language, which is spoken by 43 million people.
