PostgreSQL
PostgreSQL is a powerful, open source object-relational database system. It is the most popular database by application developers for a few years running. pgai is a PostgreSQL extension that allows you to easily ingest huggingface datasets into your PostgreSQL database.
Run PostgreSQL with pgai installed
You can easily run a docker container containing PostgreSQL with pgai.
docker run -d --name pgai -p 5432:5432 \ -v pg-data:/home/postgres/pgdata/data \ -e POSTGRES_PASSWORD=password timescale/timescaledb-ha:pg17
Then run the following command to install pgai into the database.
docker exec -it pgai psql -c "CREATE EXTENSION ai CASCADE;"
You can then connect to the database using the psql
command line tool in the container.
docker exec -it pgai psql
or using your favorite PostgreSQL client using the following connection string: postgresql://postgres:password@localhost:5432/postgres
Alternatively, you can install pgai into an existing PostgreSQL database. For instructions on how to install pgai into an existing PostgreSQL database, follow the instructions in the github repo.
Create a table from a dataset
To load a dataset into PostgreSQL, you can use the ai.load_dataset
function. This function will create a PostgreSQL table, and load the dataset from the Hugging Face Hub
in a streaming fashion.
select ai.load_dataset('rajpurkar/squad', table_name => 'squad');
You can now query the table using standard SQL.
select * from squad limit 10;
Full documentation for the ai.load_dataset
function can be found here.
Import only a subset of the dataset
You can also import a subset of the dataset by specifying the max_batches
parameter.
This is useful if the dataset is large and you want to experiment with a smaller subset.
SELECT ai.load_dataset('rajpurkar/squad', table_name => 'squad', batch_size => 100, max_batches => 1);
Load a dataset into an existing table
You can also load a dataset into an existing table. This is useful if you want more control over the data schema or want to predefine indexes and constraints on the data.
select ai.load_dataset('rajpurkar/squad', table_name => 'squad', if_table_exists => 'append');