as-cle-bert
posted an update about 18 hours ago
Ever dreamt of ingesting into a vector DB that pile of CSVs, Word documents and presentations lying in some remote folder on your PC? 🗂️
What if I told you that you can do it in three to six lines of code? 🤯
Well, with my latest open-source project, ingest-anything (https://github.com/AstraBert/ingest-anything), you can take all your non-PDF files, convert them to PDF, extract their text, then chunk, embed and load them into a vector database, all in one go! 🚀
How? It's pretty simple!
๐Ÿ“ The input files are converted into PDF by PdfItDown (https://github.com/AstraBert/PdfItDown)
๐Ÿ“‘ The PDF text is extracted using LlamaIndex readers
๐Ÿฆ› The text is chunked exploiting Chonkie
๐Ÿงฎ The chunks are embedded thanks to Sentence Transformers models
๐Ÿ—„๏ธ The embeddings are loaded into a Qdrant vector database

And you're done! ✅
Curious to try it? Install it by running:

๐˜ฑ๐˜ช๐˜ฑ ๐˜ช๐˜ฏ๐˜ด๐˜ต๐˜ข๐˜ญ๐˜ญ ๐˜ช๐˜ฏ๐˜จ๐˜ฆ๐˜ด๐˜ต-๐˜ข๐˜ฏ๐˜บ๐˜ต๐˜ฉ๐˜ช๐˜ฏ๐˜จ

And you can start using it in your Python scripts! 🐍
Don't forget to star it on GitHub and let me know if you have any feedback! ➡️ https://github.com/AstraBert/ingest-anything