Join the conversation

Join the community of Machine Learners and AI enthusiasts.

Sign Up
as-cle-bertย 
posted an update 5 days ago
Post
1777
One of the biggest challenges I've been facing since I started developing [๐๐๐Ÿ๐ˆ๐ญ๐ƒ๐จ๐ฐ๐ง](https://github.com/AstraBert/PdfItDown) was handling correctly the conversion of files like Excel sheets and CSVs: table conversion was bad and messy, almost unusable for downstream tasks๐Ÿซฃ

That's why today I'm excited to introduce ๐ซ๐ž๐š๐๐ž๐ซ๐ฌ, the new feature of PdfItDown v1.4.0!๐ŸŽ‰

With ๐˜ณ๐˜ฆ๐˜ข๐˜ฅ๐˜ฆ๐˜ณ๐˜ด, you can choose among three (for now๐Ÿ‘€) flavors of text extraction and conversion to PDF:

- ๐——๐—ผ๐—ฐ๐—น๐—ถ๐—ป๐—ด, which does a fantastic work with presentations, spreadsheets and word documents๐Ÿฆ†

- ๐—Ÿ๐—น๐—ฎ๐—บ๐—ฎ๐—ฃ๐—ฎ๐—ฟ๐˜€๐—ฒ by LlamaIndex, suitable for more complex and articulated documents, with mixture of texts, images and tables๐Ÿฆ™

- ๐— ๐—ฎ๐—ฟ๐—ธ๐—œ๐˜๐——๐—ผ๐˜„๐—ป by Microsoft, not the best at handling highly structured documents, by extremly flexible in terms of input file format (it can even convert XML, JSON and ZIP files!)โœ’๏ธ

You can use this new feature in your python scripts (check the attached code snippet!๐Ÿ˜‰) and in the command line interface as well!๐Ÿ

Have fun and don't forget to star the repo on GitHub โžก๏ธ https://github.com/AstraBert/PdfItDown