mbudisic committed
Commit 2ee6a32 · 1 Parent(s): f8e1015

Created the golden dataset.
TODO.md ADDED
@@ -0,0 +1,11 @@
+ # TODO
+
+ 1. Package the RAG (non-agentic) model generator into a factory function
+ 2. Ensure one can swap embeddings
+ 3. Vibe check text embeddings vs snowflake-arctic-embed-l
+ 4. Generate dataset
+ 5. Finetune snowflake
+ 6. Upload to HF
+ 7. Use in the agentic
+
+ https://www.notion.so/Session-11-Certification-Challenge-1c8cd547af3d8103a6b6c5f3408d5854
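TODO items 1 and 2 call for a factory function that builds the non-agentic RAG pipeline around swappable embeddings. A minimal stdlib sketch of that shape (every name below is hypothetical, for illustration only; the real project wires LangChain components instead):

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical: a swappable embedding function maps text to a vector.
Embedder = Callable[[str], List[float]]

@dataclass
class RagPipeline:
    embed: Embedder

    def retrieve(self, query: str) -> List[float]:
        # A real pipeline would search a vector store; here we only
        # demonstrate that retrieval goes through the injected embedder.
        return self.embed(query)

def make_rag_pipeline(embed: Embedder) -> RagPipeline:
    """Factory function: assemble the (non-agentic) pipeline."""
    return RagPipeline(embed=embed)

# Swapping embeddings is just passing a different callable:
toy_embed = lambda text: [float(len(text))]
pipeline = make_rag_pipeline(toy_embed)
print(pipeline.retrieve("abc"))  # [3.0]
```

Because the embedder is injected rather than hard-coded, comparing e.g. OpenAI text embeddings against snowflake-arctic-embed-l (TODO item 3) reduces to calling the factory twice with different callables.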
data/kg_pstuts_transcripts_test_dev.json ADDED (diff too large to render)
data/kg_pstuts_transcripts_test_dev_train.json ADDED (diff too large to render)
data/kg_test_dev.json ADDED (diff too large to render)
dataset_card.md ADDED
@@ -0,0 +1,115 @@
+ ---
+ language:
+ - en
+ license: mit
+ pretty_name: PsTuts-RAG Q&A Dataset
+ size_categories:
+ - n<1K
+ tags:
+ - rag
+ - question-answering
+ - photoshop
+ - ragas
+ ---
+
+ # 📊 PsTuts-RAG Q&A Dataset
+
+ This dataset contains question-answer pairs generated using [RAGAS](https://github.com/explodinggradients/ragas) from Photoshop tutorial video transcripts. It's designed for training and evaluating RAG (Retrieval-Augmented Generation) systems focused on Photoshop tutorials.
+
+ ## 📝 Dataset Description
+
+ ### Dataset Summary
+
+ The dataset contains 100 question-answer pairs related to Photoshop usage, generated from video transcripts using RAGAS's knowledge-graph and testset-generation capabilities. The questions are formulated from the perspective of different user personas (Beginner Photoshop User and Photoshop Trainer).
+
+ ### Dataset Creation
+
+ The dataset was created through the following process:
+ 1. Loading transcripts from Photoshop tutorial videos
+ 2. Building a knowledge graph using RAGAS with the following transformations:
+    - Headlines extraction
+    - Headline splitting
+    - Summary extraction
+    - Embedding extraction
+    - Theme extraction
+    - NER extraction
+    - Similarity calculations
+ 3. Generating synthetic question-answer pairs using different query synthesizers:
+    - SingleHopSpecificQuerySynthesizer (80%)
+    - MultiHopAbstractQuerySynthesizer (10%)
+    - MultiHopSpecificQuerySynthesizer (10%)
+
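The 80/10/10 synthesizer mix above amounts to a weighted draw per generated question. A minimal stdlib sketch of that idea (the helper below is illustrative, not the RAGAS API):

```python
import random

# The three RAGAS synthesizer names and their weights, as listed above.
QUERY_DISTRIBUTION = [
    ("SingleHopSpecificQuerySynthesizer", 0.8),
    ("MultiHopAbstractQuerySynthesizer", 0.1),
    ("MultiHopSpecificQuerySynthesizer", 0.1),
]

def sample_synthesizers(n: int, seed: int = 0) -> list:
    """Illustrative helper: pick one synthesizer per question by weight."""
    rng = random.Random(seed)
    names = [name for name, _ in QUERY_DISTRIBUTION]
    weights = [w for _, w in QUERY_DISTRIBUTION]
    return rng.choices(names, weights=weights, k=n)

# The weights must cover the whole testset (sum to 1.0).
assert abs(sum(w for _, w in QUERY_DISTRIBUTION) - 1.0) < 1e-9
picks = sample_synthesizers(100)
```

With 100 questions, roughly 80 are single-hop specific and about 10 each come from the two multi-hop synthesizers, though the exact counts vary with the random seed.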
+ ### Languages
+
+ The dataset is in English.
+
+ ## 📊 Dataset Structure
+
+ ### Data Instances
+
+ Each instance in the dataset contains:
+ - `user_input`: A question about Photoshop
+ - `reference`: The reference answer
+ - Additional metadata from RAGAS generation
+
+ Example:
+ ```json
+ {
+   "user_input": "How can I use the Move tool to move many layers at once in Photoshop?",
+   "reference": "If you have the Move tool selected in Photoshop, you can move multiple layers at once by selecting those layers in the Layers panel first, then dragging any of the selected layers with the Move tool."
+ }
+ ```
+
+ ### Data Fields
+
+ - `user_input`: String containing the question
+ - `reference`: String containing the reference answer
+ - Additional RAGAS metadata fields
+
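A quick way to sanity-check a record against the two documented fields; the validator below is hypothetical (not part of the dataset or RAGAS), and extra metadata keys are deliberately ignored:

```python
# Check that a record carries the two documented string fields;
# any extra RAGAS metadata keys are allowed and ignored.
REQUIRED_FIELDS = ("user_input", "reference")

def is_valid_record(record: dict) -> bool:
    return all(
        isinstance(record.get(field), str) and record[field].strip()
        for field in REQUIRED_FIELDS
    )

record = {
    "user_input": "How can I use the Move tool to move many layers at once in Photoshop?",
    "reference": "Select the layers in the Layers panel, then drag with the Move tool.",
    "synthesizer_name": "SingleHopSpecificQuerySynthesizer",  # hypothetical extra metadata key
}
print(is_valid_record(record))  # True
```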
+ ### Data Splits
+
+ The dataset was generated from the test and dev splits of the original transcripts.
+
+ ## 🚀 Usage
+
+ This dataset can be used for:
+ - Fine-tuning RAG systems for Photoshop-related queries
+ - Evaluating RAG system performance on domain-specific (Photoshop) knowledge
+ - Benchmarking question-answering models in the design/creative software domain
+
+ ### Loading the Dataset
+
+ ```python
+ from datasets import load_dataset
+
+ dataset = load_dataset("mbudisic/pstuts_rag_qa")
+ ```
+
+ ## 📚 Additional Information
+
+ ### Source Data
+
+ The source data consists of transcripts from Photoshop tutorial videos, processed and transformed into a knowledge graph using RAGAS.
+
+ ### Personas Used in Generation
+
+ 1. **Beginner Photoshop User**: Learning to complete simple tasks, use tools in Photoshop, and navigate the UI
+ 2. **Photoshop Trainer**: Experienced trainer looking to develop step-by-step guides for Photoshop beginners
+
+ ### Citation
+
+ If you use this dataset in your research, please cite:
+
+ ```bibtex
+ @misc{pstuts_rag_qa,
+   author = {Budisic, Marko},
+   title = {PsTuts-RAG Q&A Dataset},
+   year = {2023},
+   publisher = {Hugging Face},
+   howpublished = {\url{https://huggingface.co/datasets/mbudisic/pstuts_rag_qa}}
+ }
+ ```
+
+ ### Contributions
+
+ Thanks to RAGAS for providing the framework to generate this dataset.
notebooks/create_golden_dataset.ipynb CHANGED
@@ -717,6 +717,71 @@
 "kg"
 ]
 },
+ {
+ "cell_type": "code",
+ "execution_count": 165,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "264714962ac74dec896548157e691803",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "VBox(children=(HTML(value='<center> <img\\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "560ae4c3b1b1446f9ba6bd9687e11c57",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Uploading the dataset shards: 0%| | 0/1 [00:00<?, ?it/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "c89b9b036ee44a93907628e1ad9ba2af",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Creating parquet from Arrow format: 0%| | 0/1 [00:00<?, ?ba/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/plain": [
+ "CommitInfo(commit_url='https://huggingface.co/datasets/mbudisic/pstuts_rag_qa/commit/6b0d6bb3d96c1d26e5678fd1b781401047ea5d92', commit_message='Upload dataset', commit_description='', oid='6b0d6bb3d96c1d26e5678fd1b781401047ea5d92', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/mbudisic/pstuts_rag_qa', endpoint='https://huggingface.co', repo_type='dataset', repo_id='mbudisic/pstuts_rag_qa'), pr_revision=None, pr_num=None)"
+ ]
+ },
+ "execution_count": 165,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "from huggingface_hub import login\n",
+ "login()\n",
+ "ragas_dataset = testset.to_hf_dataset()\n",
+ "# ragas_dataset.push_to_hub(\"mbudisic/pstuts_rag_qa\")"
+ ]
+ },
 {
 "cell_type": "code",
 "execution_count": null,
uv.lock CHANGED
@@ -998,6 +998,15 @@ wheels = [
 { url = "https://files.pythonhosted.org/packages/78/ce/5e897ee51b7d26ab4e47e5105e7368d40ce6cfae2367acdf3165396d50be/ipython-9.2.0-py3-none-any.whl", hash = "sha256:fef5e33c4a1ae0759e0bba5917c9db4eb8c53fee917b6a526bd973e1ca5159f6", size = 604277 },
 ]

+ [[package]]
+ name = "ipython-genutils"
+ version = "0.2.0"
+ source = { registry = "https://pypi.org/simple" }
+ sdist = { url = "https://files.pythonhosted.org/packages/e8/69/fbeffffc05236398ebfcfb512b6d2511c622871dca1746361006da310399/ipython_genutils-0.2.0.tar.gz", hash = "sha256:eb2e116e75ecef9d4d228fdc66af54269afa26ab4463042e33785b887c628ba8", size = 22208 }
+ wheels = [
+ { url = "https://files.pythonhosted.org/packages/fa/bc/9bd3b5c2b4774d5f33b2d544f1460be9df7df2fe42f352135381c347c69a/ipython_genutils-0.2.0-py2.py3-none-any.whl", hash = "sha256:72dd37233799e619666c9f639a9da83c34013a73e8bbc79a7a6348d93c61fab8", size = 26343 },
+ ]
+
 [[package]]
 name = "ipython-pygments-lexers"
 version = "1.1.1"

@@ -1251,6 +1260,37 @@ wheels = [
 { url = "https://files.pythonhosted.org/packages/ca/77/71d78d58f15c22db16328a476426f7ac4a60d3a5a7ba3b9627ee2f7903d4/jupyter_console-6.6.3-py3-none-any.whl", hash = "sha256:309d33409fcc92ffdad25f0bcdf9a4a9daa61b6f341177570fdac03de5352485", size = 24510 },
 ]

+ [[package]]
+ name = "jupyter-contrib-core"
+ version = "0.4.2"
+ source = { registry = "https://pypi.org/simple" }
+ dependencies = [
+ { name = "jupyter-core" },
+ { name = "notebook" },
+ { name = "setuptools" },
+ { name = "tornado" },
+ { name = "traitlets" },
+ ]
+ sdist = { url = "https://files.pythonhosted.org/packages/50/94/0d37e5b49ea1c8bf204c46f9b0257c1f3319a4ab88acbd401da2cab25e55/jupyter_contrib_core-0.4.2.tar.gz", hash = "sha256:1887212f3ca9d4487d624c0705c20dfdf03d5a0b9ea2557d3aaeeb4c38bdcabb", size = 17490 }
+
+ [[package]]
+ name = "jupyter-contrib-nbextensions"
+ version = "0.7.0"
+ source = { registry = "https://pypi.org/simple" }
+ dependencies = [
+ { name = "ipython-genutils" },
+ { name = "jupyter-contrib-core" },
+ { name = "jupyter-core" },
+ { name = "jupyter-highlight-selected-word" },
+ { name = "jupyter-nbextensions-configurator" },
+ { name = "lxml" },
+ { name = "nbconvert" },
+ { name = "notebook" },
+ { name = "tornado" },
+ { name = "traitlets" },
+ ]
+ sdist = { url = "https://files.pythonhosted.org/packages/50/91/78cc4362611dbde2b0cd068204aaf1b8899d0459c50d8ff9daca8c069791/jupyter_contrib_nbextensions-0.7.0.tar.gz", hash = "sha256:06e33f005885eb92f89cbe82711e921278201298d08ab0d886d1ba09e8c3e9ca", size = 23462252 }
+
 [[package]]
 name = "jupyter-core"
 version = "5.7.2"

@@ -1284,6 +1324,15 @@ wheels = [
 { url = "https://files.pythonhosted.org/packages/e2/48/577993f1f99c552f18a0428731a755e06171f9902fa118c379eb7c04ea22/jupyter_events-0.12.0-py3-none-any.whl", hash = "sha256:6464b2fa5ad10451c3d35fabc75eab39556ae1e2853ad0c0cc31b656731a97fb", size = 19430 },
 ]

+ [[package]]
+ name = "jupyter-highlight-selected-word"
+ version = "0.2.0"
+ source = { registry = "https://pypi.org/simple" }
+ sdist = { url = "https://files.pythonhosted.org/packages/cd/a5/3dfeb7c8643ef502e82969fdebb201b63b33ded15a7761b27299bacebc3a/jupyter_highlight_selected_word-0.2.0.tar.gz", hash = "sha256:9fa740424859a807950ca08d2bfd28a35154cd32dd6d50ac4e0950022adc0e7b", size = 12592 }
+ wheels = [
+ { url = "https://files.pythonhosted.org/packages/50/d7/19ab7cfd60bf268d2abbacc52d4295a40f52d74dfc0d938e4761ee5e598b/jupyter_highlight_selected_word-0.2.0-py2.py3-none-any.whl", hash = "sha256:9545dfa9cb057eebe3a5795604dcd3a5294ea18637e553f61a0b67c1b5903c58", size = 11699 },
+ ]
+
 [[package]]
 name = "jupyter-lsp"
 version = "2.2.5"

@@ -1296,6 +1345,23 @@ wheels = [
 { url = "https://files.pythonhosted.org/packages/07/e0/7bd7cff65594fd9936e2f9385701e44574fc7d721331ff676ce440b14100/jupyter_lsp-2.2.5-py3-none-any.whl", hash = "sha256:45fbddbd505f3fbfb0b6cb2f1bc5e15e83ab7c79cd6e89416b248cb3c00c11da", size = 69146 },
 ]

+ [[package]]
+ name = "jupyter-nbextensions-configurator"
+ version = "0.6.4"
+ source = { registry = "https://pypi.org/simple" }
+ dependencies = [
+ { name = "jupyter-contrib-core" },
+ { name = "jupyter-core" },
+ { name = "jupyter-server" },
+ { name = "notebook" },
+ { name = "pyyaml" },
+ { name = "tornado" },
+ { name = "traitlets" },
+ ]
+ wheels = [
+ { url = "https://files.pythonhosted.org/packages/05/fe/cffb14a4fbb43cf276aa3047e42c3f9ecfda851ba3c466295401f6b1e085/jupyter_nbextensions_configurator-0.6.4-py2.py3-none-any.whl", hash = "sha256:fe7a7b0805b5926449692fb077e0e659bab8b27563bc68cba26854532fdf99c7", size = 466890 },
+ ]
+
 [[package]]
 name = "jupyter-server"
 version = "2.16.0"

@@ -2351,6 +2417,7 @@ dependencies = [
 { name = "isort" },
 { name = "jq" },
 { name = "jupyter" },
+ { name = "jupyter-contrib-nbextensions" },
 { name = "langchain" },
 { name = "langchain-community" },
 { name = "langchain-core" },

@@ -2403,6 +2470,7 @@ requires-dist = [
 { name = "isort", specifier = ">=6.0.1" },
 { name = "jq", specifier = ">=1.8.0" },
 { name = "jupyter", specifier = ">=1.1.1" },
+ { name = "jupyter-contrib-nbextensions", specifier = ">=0.7.0" },
 { name = "langchain", specifier = ">=0.3.25" },
 { name = "langchain-community", specifier = ">=0.3.23" },
 { name = "langchain-core", specifier = ">=0.3.59" },