---
license: other
gated: true
extra_gated_heading: Investment Access Request - G-Operator
extra_gated_description: >-
  G-Operator is available exclusively to qualified investors under NDA. Access
  is restricted to investment evaluation purposes only.
extra_gated_button_content: Request Investment Access
extra_gated_prompt: >-
  By requesting access, you acknowledge that this model is proprietary
  technology subject to NDA restrictions. You agree to use this model solely for
  investment evaluation purposes and maintain strict confidentiality of all
  technical details, training methodologies, and performance characteristics.
  Unauthorized use, reproduction, or distribution is strictly prohibited.
extra_gated_fields:
  Email: text
  Investment Purpose: text
  Institution or Fund: text
  Are you a qualified investor?:
    type: select
    options:
      - 'Yes'
      - 'No'
  Expected Investment Timeline:
    type: select
    options:
      - 0-3 months
      - 3-6 months
      - 6-12 months
      - 12+ months
  NDA Status:
    type: select
    options:
      - Will sign NDA
      - Already have NDA
      - Need NDA template
  Contact Email for Follow-up: text
language:
  - en
metrics:
  - accuracy
base_model:
  - google/gemma-3n-E4B-it
pipeline_tag: image-text-to-text
library_name: transformers
tags:
  - android
  - control
  - gemma
  - google
  - device
---
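
# G-Operator: Android Device Control with Gemma 3N

<div align="center">

![G-Operator](https://img.shields.io/badge/G--Operator-Android%20Control-blue?style=for-the-badge&logo=android)

**Multimodal Android Device Control Agent**

[![License](https://img.shields.io/badge/License-Proprietary-red.svg)](#license--terms)
[![Model](https://img.shields.io/badge/Model-Gemma%203N%20E4B-orange.svg)](https://huggingface.co/google/gemma-3n-E4B-it)
[![Python](https://img.shields.io/badge/Python-3.8+-blue.svg)](https://www.python.org/)
[![Transformers](https://img.shields.io/badge/Transformers-4.40+-green.svg)](https://huggingface.co/docs/transformers/)

</div>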

## Overview

G-Operator is a fine-tuned multimodal AI agent based on Google's Gemma 3N-E4B-IT model, designed for Android device control through visual understanding and action generation. The model analyzes Android device screenshots and generates JSON actions to control the device.

## Investment Access Control

This model is **proprietary technology** available exclusively to **qualified investors** under **NDA restrictions**. Access is granted solely for investment evaluation purposes.

## Available Model Versions

This repository contains multiple versions of the G-Operator model:

### Recommended: Merged Model
- **`gemma3n_e4b_it_merged`**: Complete merged model ready for inference
- **Best for**: Production use and direct inference
- **Size**: Full model weights (merged LoRA adapters)

### Training Checkpoints
- **`checkpoint-5500`**: Training checkpoint at 5,500 steps
- **`checkpoint-6000`**: Training checkpoint at 6,000 steps
- **`checkpoint-6252`**: Final training checkpoint at 6,252 steps
- **Best for**: Resuming training or analysis of training progression

### LoRA Adapter
- **`adapter_model.safetensors`**: LoRA adapter weights
- **Best for**: Parameter-efficient fine-tuning or adapter-based inference

## Key Features

- **Multimodal Understanding**: Processes both text instructions and Android device screenshots
- **JSON Action Generation**: Outputs structured JSON actions for device control
- **LoRA Fine-tuning**: Parameter-efficient fine-tuning with LoRA adapters
- **Android-Specific Training**: Trained on real Android control episodes
- **High Performance**: Based on the Gemma 3N architecture

## Model Details

| Property | Value |
|----------|-------|
| **Base Model** | [google/gemma-3n-E4B-it](https://huggingface.co/google/gemma-3n-E4B-it) |
| **Architecture** | Gemma 3N E4B (4B effective parameters) |
| **Fine-tuning Method** | LoRA (Low-Rank Adaptation) |
| **LoRA Rank** | 32 |
| **LoRA Alpha** | 64 |
| **Target Modules** | q_proj, v_proj, k_proj, o_proj, gate_proj, up_proj, down_proj |
| **Training Data** | Android control episodes with screenshots and actions |
| **License** | Proprietary (see License & Terms); base model under the Gemma Terms of Use |
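
To make the LoRA rank and alpha values above concrete, here is a minimal sketch of what they imply, assuming the standard (non-rank-stabilized) LoRA formulation in which the low-rank update is scaled by `alpha / rank`. The layer width of 2048 is a hypothetical value chosen only for illustration; it is not taken from this card or from the Gemma 3N configuration.

```python
def lora_scaling(rank: int, alpha: int) -> float:
    """Scaling applied to the low-rank update in standard LoRA: alpha / rank."""
    return alpha / rank


def lora_added_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_in x d_out linear layer.

    The update is B @ A, with A of shape (rank, d_in) and B of shape (d_out, rank).
    """
    return rank * (d_in + d_out)


RANK, ALPHA = 32, 64  # values from the Model Details table

print(lora_scaling(RANK, ALPHA))            # 2.0
# 2048 is a hypothetical layer width, used only for illustration.
print(lora_added_params(2048, 2048, RANK))  # 131072
```

With these settings the adapter update is weighted twice as strongly as it would be with `alpha == rank`.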

## Installation

### Prerequisites

Before installing the model, you must:

1. **Request Access**: Click the "Request Investment Access" button on this page and fill out the form
2. **Wait for Approval**: Access requests are typically reviewed within 1-2 business days
3. **Authenticate**: Once approved, authenticate with Hugging Face

### Authentication Required

**Important**: You must be authenticated with Hugging Face to access this gated model. Ensure you have:
1. Received access approval
2. Logged in using `huggingface-cli login` or `login()` from `huggingface_hub`

### Basic Usage (Merged Model)

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

# Load merged model and processor
model_id = "Tonic/g-operator"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
)

# Prepare input
image = Image.open("android_screenshot.png").convert("RGB")
goal = "Open the Settings app"
instruction = "Navigate to the Settings app on the home screen"

# Build conversation
conversation = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a helpful multimodal assistant specialized in Android device control. You respond with JSON actions to control Android devices."}
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": f"Goal: {goal}\nStep: {instruction}\nRespond with a JSON action containing relevant keys (e.g., action_type, x, y, text, app_name, direction)."}
        ]
    }
]

# Tokenize the conversation; tokenize=True and return_dict=True are needed
# to get tensors (input_ids, pixel_values, ...) instead of a prompt string.
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)
input_len = inputs["input_ids"].shape[1]

# Generate response
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.7,
        top_p=0.9
    )

# Decode only the newly generated tokens
response = processor.tokenizer.decode(outputs[0][input_len:], skip_special_tokens=True)
print(response)
```

### Using LoRA Adapter

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText
from peft import PeftModel

# Load base model
base_model_id = "google/gemma-3n-E4B-it"
model = AutoModelForImageTextToText.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
)

# Load LoRA adapter
adapter_model_id = "Tonic/g-operator"
model = PeftModel.from_pretrained(model, adapter_model_id)

# Load processor
processor = AutoProcessor.from_pretrained(adapter_model_id, trust_remote_code=True)

# Use the same inference code as above...
```

### Loading Specific Checkpoints

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

# Checkpoints are assumed to be subfolders of the repository, so they are
# passed via `subfolder` rather than appended to the repo id.
repo_id = "Tonic/g-operator"
checkpoint = "checkpoint-6252"  # or checkpoint-6000, checkpoint-5500
model = AutoModelForImageTextToText.from_pretrained(
    repo_id,
    subfolder=checkpoint,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(repo_id, subfolder=checkpoint, trust_remote_code=True)

# Use the same inference code as above...
```

### Expected Output Format

The model generates JSON actions in the following format:

```json
{
  "action_type": "tap",
  "x": 540,
  "y": 1200,
  "text": "Settings",
  "app_name": "com.android.settings",
  "confidence": 0.95
}
```
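
Because the action arrives as generated text, it may be surrounded by extra words or fail to parse. The sketch below shows one way to pull the first JSON object with an `action_type` key out of a response string, using only the standard library. The helper name `parse_action` is hypothetical and not part of this repository.

```python
import json
import re
from typing import Any, Dict, Optional


def parse_action(response: str) -> Optional[Dict[str, Any]]:
    """Return the first JSON object in `response` that has an "action_type" key, or None."""
    decoder = json.JSONDecoder()
    # Try to decode a JSON value starting at every opening brace.
    for match in re.finditer(r"\{", response):
        try:
            obj, _ = decoder.raw_decode(response, match.start())
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and "action_type" in obj:
            return obj
    return None


raw = 'Action: {"action_type": "tap", "x": 540, "y": 1200} done'
print(parse_action(raw))            # {'action_type': 'tap', 'x': 540, 'y': 1200}
print(parse_action("no json here")) # None
```

Returning `None` instead of raising lets the caller decide whether to retry generation or stop the episode.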

## Training Configuration

### Training Parameters

| Parameter | Value |
|-----------|-------|
| **Learning Rate** | 3e-4 |
| **Batch Size** | 1 (per device) |
| **Gradient Accumulation** | 16 |
| **Epochs** | 1.0 |
| **Warmup Ratio** | 0.1 |
| **Weight Decay** | 0.01 |
| **Optimizer** | AdamW |
| **Scheduler** | Cosine |
| **Mixed Precision** | bfloat16 |
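
The batch size, gradient accumulation, and warmup ratio combine into a few derived quantities. The arithmetic below is a rough sketch under two assumptions that this card does not state: training ran on a single device, and the 6,252 steps of the final checkpoint are optimizer steps. Rounding the warmup length up is also an assumption about the trainer.

```python
import math

PER_DEVICE_BATCH = 1   # from the table above
GRAD_ACCUM_STEPS = 16  # from the table above
WARMUP_RATIO = 0.1     # from the table above
TOTAL_STEPS = 6252     # final checkpoint step listed in this card
NUM_DEVICES = 1        # assumption: the device count is not stated

# Samples contributing to each optimizer step.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_DEVICES
# Approximate number of training samples seen over the run.
samples_seen = effective_batch * TOTAL_STEPS
# Steps spent ramping the learning rate up to 3e-4.
warmup_steps = math.ceil(WARMUP_RATIO * TOTAL_STEPS)

print(effective_batch)  # 16
print(samples_seen)     # 100032
print(warmup_steps)     # 626
```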

### Vision Configuration

| Parameter | Value |
|-----------|-------|
| **Max Image Tokens** | 256 |
| **Min Image Tokens** | 64 |
| **Image Splitting** | Enabled |
| **Image Format** | RGB |

## Use Cases

### 1. Automated Testing
- UI automation for Android apps
- Regression testing with visual verification
- Cross-device compatibility testing
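
For UI automation, a generated action has to be turned into something the device can execute. One common route is `adb shell input`; the sketch below builds the command as an argument list without running it. Only `tap` appears in this card's sample output; the `input_text` and `open_app` action names are hypothetical, and the function `action_to_adb` is not part of this repository.

```python
from typing import Any, Dict, List


def action_to_adb(action: Dict[str, Any]) -> List[str]:
    """Translate one action dict into an adb argument list (nothing is executed here)."""
    kind = action.get("action_type")
    if kind == "tap":
        return ["adb", "shell", "input", "tap",
                str(int(action["x"])), str(int(action["y"]))]
    if kind == "input_text":  # hypothetical action name
        # `input text` interprets %s as a space.
        return ["adb", "shell", "input", "text",
                str(action["text"]).replace(" ", "%s")]
    if kind == "open_app":  # hypothetical action name
        return ["adb", "shell", "monkey", "-p", str(action["app_name"]),
                "-c", "android.intent.category.LAUNCHER", "1"]
    raise ValueError(f"unsupported action_type: {kind!r}")


print(action_to_adb({"action_type": "tap", "x": 540, "y": 1200}))
# ['adb', 'shell', 'input', 'tap', '540', '1200']
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once a device is connected. Text containing shell metacharacters would need additional escaping, which this sketch does not attempt.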

### 2. Accessibility Support
- Voice-controlled device navigation
- Assistive technology integration
- Screen reader enhancement

### 3. Remote Device Management
- Remote troubleshooting
- Device configuration automation
- Support ticket resolution

### 4. App Development
- UI/UX testing automation
- User flow validation
- Performance testing

## Safety and Limitations

### Safety Considerations
- **Device Control**: Model generates actions that can modify device state
- **Testing Environment**: Always test in controlled environments first
- **Human Oversight**: Implement safety checks for critical operations
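
One possible shape for such a safety check is a small gate that every proposed action passes through before execution. The sketch below is illustrative only: the allowlist, the set of actions that require confirmation, and the function name `check_action` are all hypothetical choices, not behavior shipped with the model.

```python
from typing import Any, Dict, Tuple

ALLOWED_ACTIONS = {"tap", "scroll", "input_text"}  # hypothetical allowlist
NEEDS_CONFIRMATION = {"input_text"}                # hypothetical: ask a human first


def check_action(action: Dict[str, Any], screen: Tuple[int, int]) -> str:
    """Return "allow", "confirm", or "reject" for a proposed action."""
    kind = action.get("action_type")
    if kind not in ALLOWED_ACTIONS:
        return "reject"
    # If the action carries coordinates, they must be numeric and on screen.
    if "x" in action or "y" in action:
        x, y = action.get("x"), action.get("y")
        width, height = screen
        if not (isinstance(x, (int, float)) and isinstance(y, (int, float))):
            return "reject"
        if not (0 <= x < width and 0 <= y < height):
            return "reject"
    if kind in NEEDS_CONFIRMATION:
        return "confirm"
    return "allow"


print(check_action({"action_type": "tap", "x": 540, "y": 1200}, (1080, 2400)))  # allow
print(check_action({"action_type": "factory_reset"}, (1080, 2400)))             # reject
```

Rejecting anything outside the allowlist by default keeps unexpected action types from reaching the device.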

### Known Limitations
- **Screen Resolution**: Performance may vary with different screen sizes
- **App-Specific**: Training focused on common Android apps
- **Language**: Primarily English language support
- **Real-time**: Not optimized for real-time video processing
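
If the screenshot given to the model has a different resolution from the physical screen, predicted coordinates may need rescaling before use. This card does not say which coordinate space the model outputs, so the sketch below assumes pixel coordinates in the screenshot's own space; the helper `scale_point` is hypothetical.

```python
from typing import Tuple


def scale_point(x: float, y: float,
                src: Tuple[int, int], dst: Tuple[int, int]) -> Tuple[int, int]:
    """Map (x, y) from a src (width, height) image to a dst (width, height) screen."""
    scaled_x = round(x * dst[0] / src[0])
    scaled_y = round(y * dst[1] / src[1])
    # Clamp so the result always lands inside the target screen.
    return (min(dst[0] - 1, max(0, scaled_x)),
            min(dst[1] - 1, max(0, scaled_y)))


# A tap predicted on a 1080x2400 screenshot, replayed on a 720x1600 device.
print(scale_point(540, 1200, (1080, 2400), (720, 1600)))  # (360, 800)
```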

## License & Terms

This model is **proprietary technology** owned by Tonic and is subject to strict licensing terms:

### Investment Evaluation License

- **Purpose**: Access granted solely for investment evaluation and due diligence
- **Restrictions**: No commercial use, reproduction, or distribution without written consent
- **NDA Required**: All access is subject to a Non-Disclosure Agreement
- **Confidentiality**: All technical details, training methodologies, and performance characteristics are confidential

### Base Model Attribution

- **Gemma 3N-E4B-IT**: Licensed by Google under the [Gemma Terms of Use](https://ai.google.dev/gemma/terms)
- **Fine-tuning**: Proprietary to Tonic, subject to separate licensing terms

## Acknowledgments

- **Google**: For the base Gemma 3N model
- **Hugging Face**: For the transformers library and hosting

## Related Links

- [Base Model: Gemma 3N-E4B-IT](https://huggingface.co/google/gemma-3n-E4B-it)
- [Training Repository](https://github.com/Josephrp/train_android_models)
- [Documentation](https://docs.your-org.com/g-operator)
- [Demo Space](https://huggingface.co/spaces/Tonic/g-operator-demo)

---

<div align="center">

**Made with ❤️ by the Tonic Team**

[![Hugging Face](https://img.shields.io/badge/Hugging%20Face-Models-blue?style=for-the-badge&logo=huggingface)](https://huggingface.co/Tonic)
[![GitHub](https://img.shields.io/badge/GitHub-Repository-black?style=for-the-badge&logo=github)](https://github.com/Josephrp/train_android_models)

</div>