muhammad-mujtaba-ai committed
Commit 59218a6 · verified · 1 Parent(s): 1bda82d

Upload folder using huggingface_hub

Files changed (3):
  1. Benchmark.png +0 -0
  2. README.md +293 -0
  3. logo.png +0 -0
Benchmark.png ADDED
README.md ADDED
@@ -0,0 +1,293 @@
+ ---
+ # For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
+ # Doc / guide: https://huggingface.co/docs/hub/model-cards
+ #{{ card_data }}
+
+ license: mit
+ language:
+   - en
+ base_model: Salesforce/codegen-2B-multi
+ pipeline_tag: text-generation
+ library_name: transformers
+ inference:
+   parameters:
+     provider: "spaces"
+     space_url: "https://huggingface.co/spaces/muhammad-mujtaba-ai/SolidityLLMDemo"
+ tags:
+   - Solidity
+   - BlockChain
+   - Smart Contracts
+   - Code Generation
+ ---
+
+ # **Solidity-Code-LLM**
+ Solidity-Code-LLM is a fine-tuned large language model designed to understand, generate, and analyze smart contracts written in Solidity. Developed by ChainGPT, a leader in AI infrastructure for the Web3 and blockchain space, the model is purpose-built for the decentralized development ecosystem.
+
+ - **Developed by:** ChainGPT
+ - **License:** MIT License
+ - **Finetuned from model:** Salesforce/codegen-2B-multi
+
+ ![ChainGPT logo](logo.png)
+
+
+ # Model Details
+
+ ### Model Description
+ Solidity-Code-LLM is a specialized language model trained in two stages: pre-training on a large, unstructured Solidity dataset, followed by instruction-based fine-tuning on a cleaned, curated dataset. Unlike general-purpose code models, it is exclusively focused on Solidity, the dominant language for Ethereum-compatible blockchains, making it an efficient and accurate assistant for writing and debugging smart contracts across a wide range of use cases, including tokens, DApps, DAOs, and governance protocols.
+
+
+ ### Model Features
+ - **Type**: CodeGen, causal language model
+ - **Tokenizer**: GPT2Tokenizer
+ - **Number of Parameters**: 2B
+ - **Number of Layers**: 32 transformer blocks
+ - **Context Length**: 2048 tokens
+ - **Dtype**: bfloat16
+
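+ These values can be cross-checked against the checkpoint configuration. A minimal sketch, assuming the repo keeps the standard CodeGen config fields (`model_type`, `n_layer`, `n_ctx`):
+
+ ```Python
+ from transformers import AutoConfig
+
+ # Load only the configuration; no model weights are downloaded.
+ config = AutoConfig.from_pretrained("ChainGPT/SolidityLLM")
+
+ print(config.model_type)  # expected: "codegen"
+ print(config.n_layer)     # expected: 32 transformer blocks
+ print(config.n_ctx)       # expected: 2048-token context window
+ ```
+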
+ ### Model Sources
+ For more details, please refer to:
+ - **Paper:** [More Information Needed]
+ - **Demo:** [Demo on Hugging Face Spaces](https://huggingface.co/spaces/Chain-GPT/SolidityLLMDemo)
+
+
+ # Model Comparison
+ We compared our model with the following models:
+ - Qwen/CodeQwen1.5-7B
+ - deepseek-ai/deepseek-coder-1.3b-base
+ - codellama/CodeLlama-7b-hf
+ - GPT-4o mini
+
+ on the following metrics:
+ - **Compilation (%)**: percentage of generated contracts that compile successfully without modification (a compile-check sketch follows below).
+ - **OpenZeppelin Compliance (%)**: adherence to OpenZeppelin library usage and standards.
+ - **Gas Efficiency (%)**: degree of gas optimization based on Slither's suggestions.
+ - **Security (%)**: percentage of code free from common vulnerabilities detected by Slither.
+
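+ As an illustration of how the compilation metric can be computed, here is a minimal sketch using the third-party `py-solc-x` package; the pinned compiler version and helper names are assumptions for illustration, not the exact evaluation harness:
+
+ ```Python
+ import solcx
+ from solcx.exceptions import SolcError
+
+ # Assumption: generated contracts target a 0.8.x compiler; pin one version for the check.
+ solcx.install_solc("0.8.20")
+
+ def compiles(source: str) -> bool:
+     """Return True if a generated contract compiles without modification."""
+     try:
+         solcx.compile_source(source, output_values=["abi", "bin"], solc_version="0.8.20")
+         return True
+     except SolcError:
+         return False
+
+ def compilation_rate(contracts: list[str]) -> float:
+     """Percentage of contracts that compile successfully."""
+     return 100.0 * sum(compiles(c) for c in contracts) / len(contracts)
+ ```
+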
+ ## Benchmark
+ The figure below presents a detailed comparison of the models across all evaluation criteria.
+ ![Benchmark](Benchmark.png)
+
+
+ # Uses
+ ### Direct Use
+ - Assisting developers in writing Solidity smart contracts.
+ - Educational tool for learning Solidity.
+ - Auto-generating documentation or contract templates.
+
+ ### Downstream Use
+ - Integration into IDEs or smart contract development platforms.
+ - Support for autonomous agents that interact with blockchains.
+
+ ### Out-of-Scope Use
+ - Not suitable for general-purpose code generation in languages other than Solidity.
+ - Not intended for legal auditing or formal verification without human oversight.
+ - Should not be used to deploy contracts to production without expert review.
+
+ ### Bias, Risks, and Limitations
+ - May reflect biases from web-scraped content (e.g., outdated or insecure coding practices).
+ - May hallucinate code or produce syntactically valid but logically incorrect suggestions.
+ - Carries risk when AI-generated code is used in high-stakes or financial environments without thorough vetting.
+
+ ### Recommendations
+ Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. Manual code review and testing are strongly recommended before deployment.
+
+ # How to Get Started with the Model
+ The model follows a two-step generation process: it first produces a natural language description of the code, then generates the corresponding source code for the given prompt. The complete output is generated internally before being displayed to the user. For scenarios requiring direct code generation without the intermediate description, streaming mode can be used to produce code in real time.
+
+ Install the model's requirements:
+ ```bash
+ pip install transformers torch accelerate
+ ```
+
+ Use the code below to get started with the model.
+
+ ```Python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_path = "ChainGPT/SolidityLLM"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
+ model = AutoModelForCausalLM.from_pretrained(model_path)
+
+ prompt = "Write a Solidity function to transfer tokens."
+ inputs = tokenizer(prompt, return_tensors="pt")
+
+ # The output contains a natural language description followed by the contract code.
+ outputs = model.generate(**inputs, max_new_tokens=1400, pad_token_id=tokenizer.eos_token_id)
+ generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
+
+ print(generated_text)
+ ```
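+
+ Since the full output places the description before a fenced code block, the two parts can be separated after generation. A minimal sketch, assuming the code is wrapped in triple-backtick fences (the same convention the streaming example below relies on):
+
+ ```Python
+ def split_output(generated_text: str) -> tuple[str, str]:
+     """Split model output into (description, code), assuming a ``` fence around the code."""
+     if "```" not in generated_text:
+         return generated_text.strip(), ""      # no fence: treat everything as description
+     description, rest = generated_text.split("```", 1)
+     # Drop an optional language tag (e.g. "solidity") on the fence line.
+     rest = rest.split("\n", 1)[1] if "\n" in rest else rest
+     code = rest.split("```", 1)[0]             # stop at the closing fence, if any
+     return description.strip(), code.strip()
+
+ description, code = split_output(generated_text)
+ print(code)
+ ```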
+
+ ## Streaming
+ To stream the generated code in real time, use the snippet below. During streaming, no description is generated.
+
+ ```Python
+ import time
+ import torch
+ from queue import Empty
+ from threading import Thread
+ from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
+
+ model = AutoModelForCausalLM.from_pretrained(
+     "ChainGPT/SolidityLLM",
+     torch_dtype=torch.bfloat16,
+     device_map="cuda"
+ )
+ tokenizer = AutoTokenizer.from_pretrained("ChainGPT/SolidityLLM")
+
+ CodeInstruction = "Develop a Solidity Contract for lottery which requires 1 eth for registration fee and the winner gets a reward of 10 eth."
+ prompt = (
+     f'You are given a coding instruction. Generate only the code that completes the task. Do not include any explanation or description.\n'
+     f'Instruction: {CodeInstruction}\nCode:'
+ )
+
+ inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
+ streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
+
+ # Run generation on a background thread so tokens can be consumed as they arrive.
+ generation_thread = Thread(target=model.generate, kwargs={
+     "input_ids": inputs["input_ids"],
+     "max_new_tokens": 1800,
+     "temperature": 0.7,
+     "do_sample": True,
+     "streamer": streamer
+ })
+ generation_thread.start()
+
+ def consume_streamer(streamer: TextIteratorStreamer):
+     # Yield only the characters between the first pair of ``` fences.
+     inside_code_block = False
+     buffer = ""
+     while True:
+         try:
+             for token in streamer:
+                 buffer += token
+
+                 while "```" in buffer:
+                     parts = buffer.split("```", 1)
+                     if not inside_code_block:
+                         inside_code_block = True   # opening fence found
+                         buffer = parts[1]
+                     else:
+                         for char in parts[0]:      # flush text up to the closing fence
+                             yield char
+                         return
+                 if inside_code_block:
+                     while buffer:
+                         yield buffer[0]
+                         buffer = buffer[1:]
+             break
+         except Empty:
+             time.sleep(0.1)
+
+ for chunk in consume_streamer(streamer):
+     print(chunk, end="", flush=True)
+ ```
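+
+ Note that this snippet assumes a CUDA-capable GPU; on a CPU-only machine, drop `device_map="cuda"` and the `.to("cuda")` call. Because the consumer yields only the content of the first fenced block, any text the model emits before the opening ``` is discarded.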
+
+
+ # Training Details
+
+ ### Training Procedure
+ We used the following compute for fine-tuning the model:
+ - A cluster of 4 GPUs with 80 GB of memory each
+ - Training duration: 1,095 hours (approximately one and a half months)
+
+ ### Pre-training
+ Trained on raw, unstructured data:
+ - 1B tokens, raw corpus
+
+ ### Fine-tuning
+ The instruction dataset was filtered as follows (a sketch of these filters in code appears below):
+ - Contracts written in Solidity version >= 0.5.
+ - Token length between 200 and 4000.
+ - Deduplicated and filtered out interfaces, libraries, contracts with irrelevant comments, or relative imports.
+ - Only compilable and executable contracts retained.
+ - Fine-tuned on 650K instructions.
+
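+ The exact preprocessing code is not published; the sketch below illustrates the listed filters under stated assumptions (the pragma regex, measuring the 200-4000 token window with the model's tokenizer, and hash-based deduplication are all illustrative choices):
+
+ ```Python
+ import re
+ import hashlib
+
+ def keep_contract(source: str, tokenizer, seen_hashes: set) -> bool:
+     """Illustrative version of the fine-tuning data filters."""
+     # Solidity version >= 0.5 (crude pragma check).
+     m = re.search(r"pragma solidity\s*[\^>=]*\s*0\.(\d+)", source)
+     if not m or int(m.group(1)) < 5:
+         return False
+     # Token length between 200 and 4000.
+     n_tokens = len(tokenizer(source)["input_ids"])
+     if not 200 <= n_tokens <= 4000:
+         return False
+     # Drop pure interfaces/libraries and relative imports.
+     if re.search(r"^\s*(interface|library)\s", source, re.MULTILINE):
+         return False
+     if re.search(r'import\s+"\.', source):
+         return False
+     # Deduplicate by content hash.
+     digest = hashlib.sha256(source.encode()).hexdigest()
+     if digest in seen_hashes:
+         return False
+     seen_hashes.add(digest)
+     return True  # compile/execute checks would follow (e.g., via solc)
+ ```
+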
+ # Results
+ Prompt:
+ ```
+ Develop a Solidity program for a decentralized library where users can store and retrieve data after paying a small token fee.
+ ```
+
+ Output code:
+ ```Solidity
+ // SPDX-License-Identifier: MIT
+ pragma solidity ^0.8.0;
+
+ import "@openzeppelin/contracts/access/Ownable.sol";
+ import "@openzeppelin/contracts/token/ERC20/IERC20.sol";
+
+ contract DecentralizedLibrary is Ownable(msg.sender) {
+     IERC20 public token;
+     uint256 public tokenFee;
+
+     struct DataEntry {
+         string data;
+         address owner;
+     }
+
+     mapping(uint256 => DataEntry) private dataStore;
+     mapping(address => uint256) private userBalances;
+     uint256 private dataCount;
+
+     event DataStored(uint256 indexed dataId, address indexed owner, string data);
+     event DataRetrieved(uint256 indexed dataId, address indexed owner);
+
+     constructor(IERC20 _token, uint256 _initialTokenFee) {
+         token = _token;
+         tokenFee = _initialTokenFee;
+     }
+
+     function storeData(string memory _data) public {
+         payTokenFee();
+         dataStore[dataCount] = DataEntry(_data, msg.sender);
+         emit DataStored(dataCount, msg.sender, _data);
+         dataCount++;
+     }
+
+     function retrieveData(uint256 _dataId) public {
+         payTokenFee();
+         require(_dataId < dataCount, "Data does not exist");
+         emit DataRetrieved(_dataId, msg.sender);
+     }
+
+     function payTokenFee() internal {
+         require(token.balanceOf(msg.sender) >= tokenFee, "Insufficient token balance");
+         token.transferFrom(msg.sender, address(this), tokenFee);
+     }
+
+     function getTokenBalance() public view returns (uint256) {
+         return token.balanceOf(msg.sender);
+     }
+
+     function setTokenFee(uint256 _newTokenFee) public onlyOwner {
+         tokenFee = _newTokenFee;
+     }
+
+     function withdrawTokens(uint256 _amount) public onlyOwner {
+         require(token.balanceOf(address(this)) >= _amount, "Insufficient contract balance");
+         token.transfer(msg.sender, _amount);
+     }
+ }
+ ```
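+
+ Note that `payTokenFee` uses `transferFrom`, so callers must first `approve` the contract to spend at least `tokenFee` of the ERC-20 token before calling `storeData` or `retrieveData`.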
+
+ # Evaluation Metrics
+ To evaluate the performance of our fine-tuned LLM specialized in Solidity smart contract generation, we used **[Slither](https://github.com/crytic/slither)**, a static analysis framework widely used for analyzing Solidity code.
+
+ We focused on four key evaluation criteria:
+
+ - **Compilation Success Rate**
+   We measured the percentage of generated smart contracts that compile successfully without modification. This helps assess the syntactic and structural correctness of the model outputs.
+
+ - **OpenZeppelin Standards Compliance**
+   We verified whether the generated contracts adhere to best practices by checking for proper usage of OpenZeppelin libraries. This includes ensuring that the latest or stable versions of the libraries are used and that the overall contract structure aligns with established OpenZeppelin patterns.
+
+ - **Gas Optimization Opportunities**
+   Using Slither's gas optimization analysis, we identified areas in the generated contracts where gas usage could be reduced. We measured the number and types of optimization suggestions as an indicator of how efficient the generated code is.
+
+ - **Security Vulnerabilities**
+   We analyzed each contract for known security vulnerabilities using Slither's built-in detectors. We recorded the number and severity of the vulnerabilities detected, providing a measure of the security quality of the model's outputs.
+
+ These evaluation metrics help quantify the practical usability and reliability of the generated smart contracts in real-world scenarios.
+
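+ As an illustration, per-contract findings can be aggregated from Slither's JSON output. The sketch below shells out to the `slither` CLI with its standard `--json -` flag; the aggregation logic is our own assumption, not the exact evaluation harness:
+
+ ```Python
+ import json
+ import subprocess
+
+ def slither_findings(contract_path: str) -> list:
+     """Run Slither on one contract and return its detector findings.
+     Requires `pip install slither-analyzer` and a matching solc on PATH."""
+     result = subprocess.run(
+         ["slither", contract_path, "--json", "-"],
+         capture_output=True, text=True
+     )
+     report = json.loads(result.stdout)
+     return report.get("results", {}).get("detectors", [])
+
+ findings = slither_findings("DecentralizedLibrary.sol")
+ severities = [f.get("impact") for f in findings]
+ print(f"{len(findings)} findings; impacts: {severities}")
+ ```
+
+ Slither labels each finding with an `impact` level (High, Medium, Low, Informational, Optimization), which can be mapped onto the security and gas-efficiency percentages above.
+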
+ # Summary
+ The model shows improved understanding and generation capabilities in Solidity when compared to baseline LLMs not trained on Solidity data.
logo.png ADDED