OSainz neildlf committed on
Commit a4f7364 · verified · 1 Parent(s): 1d02ab6

Updated; we can consider it finished, only the results are missing (#2)

- Updated; we can consider it finished, only the results are missing (b84e6f0609f36886fc5ead02d3863a38028c027c)


Co-authored-by: Neil de la fuente <[email protected]>

Files changed (1)
  1. README.md +93 -3
README.md CHANGED
@@ -47,9 +47,11 @@ tags:
 
  This model achieves state-of-the-art performance on zero-shot Named Entity Recognition (NER) by first training on `GuideX`, a large-scale synthetic dataset with executable guidelines, and then fine-tuning on a collection of gold-standard IE datasets.
 
- - **Homepage:** [https://neilus03.github.io/guidex.com/](https://neilus03.github.io/guidex.com/)
- - **Paper:** [GuideX: Guided Synthetic Data Generation for Zero-Shot Information Extraction](https://arxiv.org/abs/2506.00649)
- - **Code & Data:** The code and data for reproducing the GuideX methodology are available on the project homepage.
 
  ## Model Description
 
@@ -60,3 +62,91 @@ This model achieves state-of-the-art performance on zero-shot Named Entity Recog
  - **License:** Llama 2 Community License
  - **Finetuned from model:** `meta-llama/Llama-3.1-8B`
 
 
  This model achieves state-of-the-art performance on zero-shot Named Entity Recognition (NER) by first training on `GuideX`, a large-scale synthetic dataset with executable guidelines, and then fine-tuning on a collection of gold-standard IE datasets.
 
+ - 💻 **Project Page:** [https://neilus03.github.io/guidex.com/](https://neilus03.github.io/guidex.com/)
+ - 📒 **Code:** [GuideX codebase](https://github.com/Neilus03/GUIDEX)
+ - 📖 **Paper:** [GuideX: Guided Synthetic Data Generation for Zero-Shot Information Extraction](https://arxiv.org/abs/2506.00649)
+ - 🐕 **GuideX Collection in the 🤗 Hugging Face Hub:** [GuideX Collection](https://huggingface.co/collections/neildlf/guidex-6842ef478e8d9bb0a00c844d)
+ - 🚀 **Example Jupyter Notebooks:** [GuideX Notebooks](https://github.com/Neilus03/GUIDEX/tree/dev-neil/notebooks)
 
  ## Model Description
 
 
  - **License:** Llama 2 Community License
  - **Finetuned from model:** `meta-llama/Llama-3.1-8B`
 
+ ## Schema definition and inference example
+
+ The labels are represented as Python classes, and the guidelines or instructions are written as docstrings. The model starts generating after the `result = [` line.
+ ```python
+ # Entity definitions
+ @dataclass
+ class Launcher(Template):
+     """Refers to a vehicle designed primarily to transport payloads from the Earth's
+     surface to space. Launchers can carry various payloads, including satellites,
+     crewed spacecraft, and cargo, into various orbits or even beyond Earth's orbit.
+     They are usually multi-stage vehicles that use rocket engines for propulsion."""
+
+     mention: str
+     """
+     The name of the launcher vehicle.
+     Such as: "Saturn V", "Atlas V", "Soyuz", "Ariane 5"
+     """
+     space_company: str  # The company that operates the launcher. Such as: "Blue Origin", "ESA", "Boeing", "ISRO", "Northrop Grumman", "Arianespace"
+     crew: List[str]  # Names of the crew members boarding the Launcher. Such as: "Neil Armstrong", "Michael Collins", "Buzz Aldrin"
+
+
+ @dataclass
+ class Mission(Template):
+     """Any planned or accomplished journey beyond Earth's atmosphere with specific objectives,
+     either crewed or uncrewed. It includes missions to satellites, the International
+     Space Station (ISS), other celestial bodies, and deep space."""
+
+     mention: str
+     """
+     The name of the mission.
+     Such as: "Apollo 11", "Artemis", "Mercury"
+     """
+     date: str  # The start date of the mission
+     departure: str  # The place from which the vehicle will be launched. Such as: "Florida", "Houston", "French Guiana"
+     destination: str  # The place or planet to which the launcher will be sent. Such as: "Moon", "low-orbit", "Saturn"
+
+ # This is the text to analyze
+ text = (
+     "The Ares 3 mission to Mars is scheduled for 2032. The Starship rocket built by SpaceX will take off from Boca Chica, "
+     "carrying the astronauts Max Rutherford, Elena Soto, and Jake Martinez."
+ )
+
+ # The annotation instances that take place in the text above are listed here
+ result = [
+     Mission(mention='Ares 3', date='2032', departure='Boca Chica', destination='Mars'),
+     Launcher(mention='Starship', space_company='SpaceX', crew=['Max Rutherford', 'Elena Soto', 'Jake Martinez'])
+ ]
+ ```
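
Because the model continues the prompt after `result = [`, its output is itself Python source that instantiates the schema classes. The following is a minimal sketch of turning that text back into objects. It assumes a hypothetical stand-in `Template` base class and an illustrative `parse_result` helper, neither of which comes from the GuideX codebase, and the default field values are added only to keep the sketch tolerant of missing arguments. The project notebooks remain the reference for the supported parsing route.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Template:
    """Hypothetical stand-in for the base class used in the schema."""


@dataclass
class Launcher(Template):
    mention: str
    space_company: str = ""
    crew: List[str] = field(default_factory=list)


@dataclass
class Mission(Template):
    mention: str
    date: str = ""
    departure: str = ""
    destination: str = ""


def parse_result(generated: str) -> list:
    """Evaluate the text generated after `result = [` against the schema classes."""
    allowed = {"Launcher": Launcher, "Mission": Mission}
    # Built-in names are removed from the evaluated expression, so only the
    # schema classes can be called. This is not a hardened security sandbox.
    return eval("[" + generated, {"__builtins__": {}}, allowed)


# Example continuation, copied from the expected output in the schema example.
generated = (
    "Mission(mention='Ares 3', date='2032', departure='Boca Chica', destination='Mars'),\n"
    "Launcher(mention='Starship', space_company='SpaceX', "
    "crew=['Max Rutherford', 'Elena Soto', 'Jake Martinez'])\n]"
)
annotations = parse_result(generated)
```

Each parsed annotation is an ordinary dataclass instance, so fields such as `annotations[0].destination` can be read directly or compared against gold annotations.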
+
+ ## How to Get Started with the Model
+
+ Please read our [🚀 Example Jupyter Notebooks](https://github.com/Neilus03/GUIDEX/tree/dev-neil/notebooks) to get started with GuideX.
+
+ The best way to load the model is with our custom `load_model` function. However, you can also load it with the `AutoModelForCausalLM` class.
+
+ **Important**: Our flash attention implementation has small numerical differences compared to the attention implementation in Hugging Face.
+ You must use the flag `trust_remote_code=True` or you will get inferior results. Flash attention requires an available CUDA GPU. Running GuideX
+ pre-trained models on a CPU is not supported. We plan to address this in future releases. First, install flash attention 2:
+ ```bash
+ pip install flash-attn --no-build-isolation
+ pip install git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary
+ ```
+
+ Then you can load the model with:
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ tokenizer = AutoTokenizer.from_pretrained("HiTZ/Llama-3.1-GuideX-8B")
+ model = AutoModelForCausalLM.from_pretrained("HiTZ/Llama-3.1-GuideX-8B", trust_remote_code=True, torch_dtype=torch.bfloat16)
+ model.to("cuda")
+ ```
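
Once the model is loaded, inference amounts to completing a prompt that ends with `result = [`. The sketch below shows one way this could look with the standard `transformers` generation API. The helpers `build_prompt` and `annotate` are hypothetical names, and the exact prompt layout the model expects is an assumption here; the notebooks show the layout used by the authors.

```python
def build_prompt(schema_source: str, text: str) -> str:
    """Join schema, input text, and the `result = [` cue (hypothetical helper)."""
    return (
        f"{schema_source.strip()}\n\n"
        "# This is the text to analyze\n"
        f"text = {text!r}\n\n"
        "# The annotation instances that take place in the text above are listed here\n"
        "result = ["
    )


def annotate(
    prompt: str,
    model_name: str = "HiTZ/Llama-3.1-GuideX-8B",
    max_new_tokens: int = 128,
) -> str:
    """Generate the continuation of `prompt`; needs a CUDA GPU and the model weights."""
    # Imported here so build_prompt stays usable without torch or a GPU.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, trust_remote_code=True, torch_dtype=torch.bfloat16
    ).to("cuda")

    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        # Greedy decoding; the token budget is an arbitrary choice for this sketch.
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)

    # Drop the prompt tokens and keep only the newly generated continuation.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


prompt = build_prompt(
    "@dataclass\nclass Mission(Template):\n    mention: str\n",
    "Ares 3 goes to Mars.",
)
```

Calling `annotate(prompt)` on a machine with a CUDA GPU would return the generated annotation list as text. Loading the model inside `annotate` keeps the sketch short; in practice the model would be loaded once and reused across calls.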
+
+ Read our [🚀 Example Jupyter Notebooks](https://github.com/hitz-zentroa/GoLLIE/tree/main/notebooks) to learn how to easily define guidelines, generate model inputs, and parse the output!
+
+ ## Citation
+ ```bibtex
+ @misc{delafuente2025guidexguidedsyntheticdata,
+   title={GuideX: Guided Synthetic Data Generation for Zero-Shot Information Extraction},
+   author={Neil De La Fuente and Oscar Sainz and Iker García-Ferrero and Eneko Agirre},
+   year={2025},
+   eprint={2506.00649},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL},
+   url={https://arxiv.org/abs/2506.00649},
+ }
+ ```