cuDF
cuDF is a Python GPU DataFrame library.
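To read a single Parquet file into a DataFrame, use the read_parquet
function. The example below then groups the rows by horoscope and returns the five groups with the longest average text length: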
import cudf
df = (
    # Load one Parquet shard straight into GPU memory
    cudf.read_parquet("https://huggingface.co/datasets/barilan/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0000.parquet")
    .groupby('horoscope')['text']
    # Mean character length of the text column within each group
    .apply(lambda x: x.str.len().mean())
    .sort_values(ascending=False)
    .head(5)
)
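To make the aggregation concrete, here is a plain-Python sketch of what the pipeline above computes: the mean text length per group, sorted in descending order and truncated to the top five. The toy rows and the helper name `mean_text_length_by_group` are invented for illustration; this mirrors the result of the cuDF expression, not how cuDF evaluates it on the GPU.

```python
from collections import defaultdict
from statistics import mean

# Toy rows, invented for illustration (not taken from the dataset).
rows = [
    {"horoscope": "Aries", "text": "short post"},
    {"horoscope": "Aries", "text": "a somewhat longer post"},
    {"horoscope": "Leo", "text": "hi"},
    {"horoscope": "Libra", "text": "medium length"},
]


def mean_text_length_by_group(rows, key="horoscope", column="text", top=5):
    """Hypothetical helper: up to `top` (group, mean length) pairs, longest first."""
    lengths = defaultdict(list)
    for row in rows:
        # Counterpart of x.str.len(): character count of each text value
        lengths[row[key]].append(len(row[column]))
    # Counterpart of .mean() applied per group
    means = {group: mean(values) for group, values in lengths.items()}
    # Counterpart of .sort_values(ascending=False).head(top)
    ranked = sorted(means.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top]


result = mean_text_length_by_group(rows)
print(result)
```

On the toy data, the "Aries" group ranks first because its two texts average 16 characters.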
To read multiple Parquet files, for example if the dataset is sharded, you’ll need to use dask-cudf:
import dask
import dask.dataframe as dd
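# Have dask.dataframe build cuDF-backed (GPU) partitions instead of pandas ones
dask.config.set({"dataframe.backend": "cudf"})
df = (
    # The *.parquet wildcard is meant to match every shard of the train split
    dd.read_parquet("https://huggingface.co/datasets/barilan/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/*.parquet")
)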
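If the wildcard cannot be expanded for a given remote location, one possible alternative is to pass an explicit list of shard URLs, since dd.read_parquet also accepts a list of paths. The sketch below only builds that list. The helper `shard_urls` is hypothetical, the four-digit zero-padded naming is inferred from the 0000.parquet file used earlier, and the shard count is a placeholder rather than the dataset's real number of shards.

```python
def shard_urls(base_url, num_shards):
    """Hypothetical helper: URLs for shards named 0000.parquet, 0001.parquet, ..."""
    return [f"{base_url}/{i:04d}.parquet" for i in range(num_shards)]


BASE = (
    "https://huggingface.co/datasets/barilan/blog_authorship_corpus/"
    "resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train"
)

# num_shards=2 is an assumed placeholder; check the dataset for the real count.
urls = shard_urls(BASE, num_shards=2)
print(urls)

# With the cudf backend configured as in the snippet above, the list
# could then be handed to Dask:
# df = dd.read_parquet(urls)
```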