Let's pipe some data from the web into our vector database, shall we?
With ingest-anything (https://github.com/AstraBert/ingest-anything) you can now scrape content starting simply from URLs, extract the text, chunk it, and load it into your favorite LlamaIndex-compatible vector database!
This is powered by crawlee by Apify, an open-source crawling library for Python and JavaScript that handles the data flow from the web: ingest-anything then combines it with BeautifulSoup, PdfItDown and PyMuPDF to scrape HTML pages, convert them to PDF and extract the text - hassle-free!
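To give a feel for the scrape-extract-chunk flow described above, here is a minimal, dependency-free sketch. It approximates the pipeline with Python's stdlib `html.parser` standing in for BeautifulSoup and a naive fixed-size chunker standing in for the real chunking step - the function names are illustrative, not ingest-anything's actual API:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script>/<style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside script/style tags

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def extract_text(html: str) -> str:
    """Scraped HTML -> plain text (stand-in for the BeautifulSoup step)."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)


def chunk(text: str, size: int = 200, overlap: int = 20) -> list[str]:
    """Naive fixed-size character chunks with overlap (illustrative only)."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


html = "<html><body><h1>Title</h1><script>var x=1;</script><p>Some content.</p></body></html>"
text = extract_text(html)          # "Title Some content."
chunks = chunk(text, size=10, overlap=2)
```

In the real library, each chunk would then be embedded and written to the LlamaIndex-compatible vector store of your choice.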
Check the attached code snippet if you're curious about how to get started!
PS: Don't tell anybody, but this release also has another gem... It supports OpenAI models for agentic chunking, following the new releases of Chonkie!
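For context on why agentic chunking is a gem: a classic non-agentic chunker just groups sentences up to a size budget, while an agentic chunker asks an LLM where the semantic boundaries are. Below is a sketch of the classic baseline only (plain stdlib, not Chonkie's API); the agentic variant would replace the size check with a model call deciding the split points:

```python
import re


def sentence_chunks(text: str, max_chars: int = 80) -> list[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A non-agentic baseline: splits on sentence-ending punctuation and packs
    greedily. Agentic chunking would instead ask an LLM where to split.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > max_chars:
            chunks.append(current)  # budget exceeded: start a new chunk
            current = s
        else:
            current = f"{current} {s}".strip() if current else s
    if current:
        chunks.append(current)
    return chunks


doc = ("Chunking matters. Good chunks keep ideas together. "
       "Bad chunks split mid-thought. Size alone is a blunt tool.")
parts = sentence_chunks(doc, max_chars=60)
```

The size budget keeps chunks embedding-friendly, but it can still group unrelated sentences - exactly the weakness agentic chunking addresses.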