Ever dreamt of ingesting into a vector DB that pile of CSVs, Word documents and presentations lying in some remote folder on your PC? What if I told you that you can do it in three to six lines of code? Well, with my latest open-source project, ingest-anything (https://github.com/AstraBert/ingest-anything), you can take all your non-PDF files, convert them to PDF, extract their text, chunk, embed and load them into a vector database, all in one go!

How? It's pretty simple!
- The input files are converted into PDF by PdfItDown (https://github.com/AstraBert/PdfItDown)
- The PDF text is extracted using LlamaIndex readers
- The text is chunked with Chonkie
- The chunks are embedded with Sentence Transformers models
- The embeddings are loaded into a Qdrant vector database
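
If you want a feel for what the chunk, embed and load stages do, here is a toy sketch in plain Python. It is not the ingest-anything API and it does not use the real libraries: `chunk_words`, `toy_embed`, `InMemoryStore` and `ingest` are hypothetical stand-ins I made up for illustration (a sliding word window instead of Chonkie, a hashed bag-of-words instead of a Sentence Transformers model, a Python list instead of Qdrant).

```python
import hashlib
import math


def chunk_words(text: str, chunk_size: int = 8, overlap: int = 2) -> list[str]:
    """Split text into overlapping windows of words (stand-in for a real chunker)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Hash each word into one slot of a fixed-size vector, then L2-normalise.

    This is only a placeholder for a real embedding model.
    """
    vec = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


class InMemoryStore:
    """Hypothetical stand-in for a vector database collection."""

    def __init__(self) -> None:
        self.points: list[tuple[list[float], str]] = []

    def upsert(self, vector: list[float], text: str) -> None:
        self.points.append((vector, text))

    def search(self, query_vector: list[float], top_k: int = 1) -> list[tuple[float, str]]:
        # Vectors are unit length, so the dot product is the cosine similarity.
        scored = [
            (sum(a * b for a, b in zip(query_vector, vec)), text)
            for vec, text in self.points
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


def ingest(text: str, store: InMemoryStore, chunk_size: int = 8, overlap: int = 2) -> int:
    """Chunk the text, embed every chunk and load it into the store."""
    chunks = chunk_words(text, chunk_size, overlap)
    for chunk in chunks:
        store.upsert(toy_embed(chunk), chunk)
    return len(chunks)


if __name__ == "__main__":
    store = InMemoryStore()
    count = ingest("CSV files and slide decks become searchable once they are chunked and embedded", store)
    print(count, store.search(toy_embed("searchable slide decks"), top_k=1))
```

The real project swaps each of these toy pieces for a proper library, which is why the whole thing fits in a handful of lines on the user's side.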
And you're done! Curious to try it? Install it by running:
And you can start using it in your Python scripts! Don't forget to star it on GitHub and let me know if you have any feedback! https://github.com/AstraBert/ingest-anything