{
"cells": [
{
"cell_type": "markdown",
"id": "6fb06d81-1778-403c-b15b-d68200a5e6b5",
"metadata": {},
"source": [
"# Spark on Hugging Face"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7399a5ed-aea8-45cf-866f-2decd7097456",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from pyspark.sql import SparkSession\n",
"\n",
"# Create a Spark session, or reuse one that is already running\n",
"spark = SparkSession.builder.appName(\"demo\").getOrCreate()"
]
},
{
"cell_type": "markdown",
"id": "8bf07f63-6fed-4cf9-8fee-5f3a5fb6bed1",
"metadata": {
"tags": []
},
"source": [
"Example usage of the `read_parquet` and `write_parquet` helpers (imported in the next cell):\n",
"\n",
"```python\n",
"# Load the BAAI/Infinity-Instruct dataset\n",
"df = read_parquet(\"hf://datasets/BAAI/Infinity-Instruct/7M/*.parquet\")\n",
"\n",
"# Load only one column\n",
"df_langdetect_only = read_parquet(\"hf://datasets/BAAI/Infinity-Instruct/7M/*.parquet\", columns=[\"langdetect\"])\n",
"\n",
"# Load only the rows that match a filter (here: langdetect equals zh-cn)\n",
"criteria = [(\"langdetect\", \"=\", \"zh-cn\")]\n",
"df_chinese_only = read_parquet(\"hf://datasets/BAAI/Infinity-Instruct/7M/*.parquet\", filters=criteria)\n",
"\n",
"# Save the filtered dataset to the Hub (replace username with your own namespace)\n",
"write_parquet(df_chinese_only, \"hf://datasets/username/Infinity-Instruct-Chinese-Only\")\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ca71b3ac-3291-4e4e-8fee-b3550b0426d6",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from hf_spark_utils import read_parquet, write_parquet, set_session\n",
"set_session(spark)"
]
},
{
"cell_type": "markdown",
"id": "07ea62a4-7549-4a75-8a12-9d830f6e3cde",
"metadata": {},
"source": [
"#### (Optional) Login"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "343b3a9a-2dce-492b-9384-703368ba3975",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from huggingface_hub import notebook_login\n",
"notebook_login(new_session=False)"
]
},
{
"cell_type": "markdown",
"id": "332b7609-f0eb-4703-aea6-fec3d09f5870",
"metadata": {},
"source": [
"#### Run your code:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6c0dfe01-9190-454c-9c52-216f74d339e1",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}