---
license: other
gated: true
extra_gated_heading: Investment Access Request - G-Operator
extra_gated_description: >-
  G-Operator is available exclusively to qualified investors under NDA. Access
  is restricted to investment evaluation purposes only.
extra_gated_button_content: Request Investment Access
extra_gated_prompt: >-
  By requesting access, you acknowledge that this model is proprietary
  technology subject to NDA restrictions. You agree to use this model solely for
  investment evaluation purposes and maintain strict confidentiality of all
  technical details, training methodologies, and performance characteristics.
  Unauthorized use, reproduction, or distribution is strictly prohibited.
extra_gated_fields:
  Email: text
  Investment Purpose: text
  Institution or Fund: text
  Are you a qualified investor?:
    type: select
    options:
      - 'Yes'
      - 'No'
  Expected Investment Timeline:
    type: select
    options:
      - 0-3 months
      - 3-6 months
      - 6-12 months
      - 12+ months
  NDA Status:
    type: select
    options:
      - Will sign NDA
      - Already have NDA
      - Need NDA template
  Contact Email for Follow-up: text
language:
  - en
metrics:
  - accuracy
base_model:
  - google/gemma-3n-E4B-it
pipeline_tag: image-text-to-text
library_name: transformers
tags:
  - android
  - control
  - gemma
  - google
  - device
---

# G-Operator: Android Device Control with Gemma 3N

<div align="center">

![G-Operator Logo](https://huggingface.co/Tonic/g-operator/resolve/main/g-operator-banner.png)

Multimodal Android Device Control Agent

[![Model License](https://img.shields.io/badge/License-Proprietary-red.svg)](#license--terms)
[![Model Size](https://img.shields.io/badge/Size-4B%20Effective%20Parameters-green.svg)](https://huggingface.co/google/gemma-3n-E4B-it)
[![Python](https://img.shields.io/badge/Python-3.8+-blue.svg)](https://www.python.org/)
[![Transformers](https://img.shields.io/badge/Transformers-4.54.0+-orange.svg)](https://huggingface.co/docs/transformers/)

</div>

## 🌟 Overview

G-Operator is a fine-tuned multimodal AI agent based on Google's Gemma 3N-E4B-IT model, designed specifically for Android device control through visual understanding and action generation. Given an Android device screenshot and a task instruction, the model generates a structured JSON action that can be used to control the device.

## 🔐 Investment Access Control

This model is proprietary technology available exclusively to qualified investors under NDA restrictions. Access is granted solely for investment evaluation purposes.

## 📦 Available Model Versions

This repository contains multiple versions of the G-Operator model:

### 🎯 Recommended: Merged Model
- `gemma3n_e4b_it_merged`: Complete merged model ready for inference
- Best for: Production use and direct inference
- Size: Full model weights (LoRA adapter merged into the base model)

### 🔄 Training Checkpoints
- `checkpoint-5500`: Training checkpoint at 5,500 steps
- `checkpoint-6000`: Training checkpoint at 6,000 steps
- `checkpoint-6252`: Final training checkpoint at 6,252 steps
- Best for: Resuming training or analyzing training progression

### 🔧 LoRA Adapter
- `adapter_model.safetensors`: LoRA adapter weights
- Best for: Parameter-efficient fine-tuning or adapter-based inference

## 🚀 Key Features

- Multimodal Understanding: Processes both text instructions and Android device screenshots
- JSON Action Generation: Outputs structured JSON actions for device control
- LoRA Fine-tuning: Parameter-efficient fine-tuning with low-rank adapters
- Android-Specific Training: Trained on real Android control episodes
- Gemma 3N Base: Built on Google's Gemma 3N architecture, which is designed for efficient on-device use

## 📋 Model Details

| Property | Value |
|----------|-------|
| Base Model | [google/gemma-3n-E4B-it](https://huggingface.co/google/gemma-3n-E4B-it) |
| Architecture | Gemma 3N E4B (4B effective parameters) |
| Fine-tuning Method | LoRA (Low-Rank Adaptation) |
| LoRA Rank | 32 |
| LoRA Alpha | 64 |
| Target Modules | q_proj, v_proj, k_proj, o_proj, gate_proj, up_proj, down_proj |
| Training Data | Android control episodes with screenshots and actions |
| License | Proprietary fine-tune; base model under the Gemma 3N License |
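
For reference, the standard LoRA formulation keeps each targeted projection's frozen base weight $W_0$ and learns two low-rank factors $B$ and $A$. The rank $r$ and alpha $\alpha$ in the table set the size of those factors and the scale of the update. The following is a sketch that assumes the default `peft` scaling of $\alpha / r$ (not rank-stabilized LoRA), with $d_{\text{in}}$ and $d_{\text{out}}$ standing for a projection's input and output widths, which this card does not list:

```math
\begin{aligned}
h &= W_0 x + \frac{\alpha}{r}\, B A x, \qquad B \in \mathbb{R}^{d_{\text{out}} \times r},\; A \in \mathbb{R}^{r \times d_{\text{in}}} \\
\frac{\alpha}{r} &= \frac{64}{32} = 2 \\
W_{\text{merged}} &= W_0 + 2\, B A \\
N_{\text{trainable}} &= r\,(d_{\text{in}} + d_{\text{out}}) = 32\,(d_{\text{in}} + d_{\text{out}}) \quad \text{per targeted projection}
\end{aligned}
```

The third line describes what merging does. It folds the scaled update into the base weights, so a merged checkpoint can be loaded with plain `transformers` and does not need `peft`.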

## 🛠️ Installation

### Prerequisites

Before installing the model, you must:

1. Request Access: Click the "Request Investment Access" button on this page and complete the form
2. Wait for Approval: Access requests are typically reviewed within 1-2 business days
3. Authenticate: Once approved, authenticate with Hugging Face

### Authentication Required

Important: You must be authenticated with Hugging Face to access this gated model. Ensure that you have:
1. Received access approval
2. Logged in using `huggingface-cli login` or `login()` from `huggingface_hub`
3. Accepted the Gemma license on the [google/gemma-3n-E4B-it](https://huggingface.co/google/gemma-3n-E4B-it) page if you load the LoRA adapter on top of the base model, since the base model is gated separately
### Basic Usage (Merged Model)

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

# Load merged model and processor from the `gemma3n_e4b_it_merged` subfolder of this repo
model_id = "Tonic/g-operator"
subfolder = "gemma3n_e4b_it_merged"
processor = AutoProcessor.from_pretrained(model_id, subfolder=subfolder, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    subfolder=subfolder,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

# Prepare input: a device screenshot plus the task goal and the current step
image = Image.open("android_screenshot.png").convert("RGB")
goal = "Open the Settings app"
instruction = "Navigate to the Settings app on the home screen"

# Build conversation
conversation = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a helpful multimodal assistant specialized in Android device control. You respond with JSON actions to control Android devices."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": f"Goal: {goal}\nStep: {instruction}\nRespond with a JSON action containing relevant keys (e.g., action_type, x, y, text, app_name, direction)."},
        ],
    },
]

# Tokenize the text and preprocess the image in one call;
# tokenize=True and return_dict=True are needed to get input_ids and pixel_values
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)  # dtype is applied only to floating-point tensors
input_len = inputs["input_ids"].shape[-1]

# Generate response
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )

# Decode only the newly generated tokens, skipping the prompt
response = processor.decode(outputs[0][input_len:], skip_special_tokens=True)
print(response)
```

### Using LoRA Adapter

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText
from peft import PeftModel

# Load base model (google/gemma-3n-E4B-it is gated: accept the Gemma license on its page first)
base_model_id = "google/gemma-3n-E4B-it"
model = AutoModelForImageTextToText.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

# Apply the LoRA adapter stored at the root of this repository
adapter_model_id = "Tonic/g-operator"
model = PeftModel.from_pretrained(model, adapter_model_id)

# Load processor
processor = AutoProcessor.from_pretrained(adapter_model_id, trust_remote_code=True)

# Use the same inference code as above...
```

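### Loading Specific Checkpoints

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
from peft import PeftModel

# A Hub repo id cannot include a folder path, so the checkpoint is selected with `subfolder`.
# Assumption: like the root adapter, the LoRA training checkpoints hold adapter weights,
# so they are applied on top of the base model.
base_model_id = "google/gemma-3n-E4B-it"
checkpoint = "checkpoint-6252"  # or "checkpoint-6000", "checkpoint-5500"

model = AutoModelForImageTextToText.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, "Tonic/g-operator", subfolder=checkpoint)

# LoRA leaves the processor unchanged, so the base model's processor is used
processor = AutoProcessor.from_pretrained(base_model_id)

# Use the same inference code as above...
```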

### Expected Output Format

The model generates JSON actions in the following format (example values shown):

```json
{
  "action_type": "tap",
  "x": 540,
  "y": 1200,
  "text": "Settings",
  "app_name": "com.android.settings",
  "confidence": 0.95
}
```
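
The reply is decoded as free-form text, so it is worth extracting and checking the JSON before acting on it. Below is a minimal sketch that assumes the reply contains one JSON object, possibly wrapped in a Markdown `json` code fence. The helper name `parse_action` is hypothetical and is not part of this repository.

```python
import json
import re

# Matches an optional Markdown code fence (with or without a "json" tag) around the reply
_FENCE = re.compile(r"```(?:json)?\s*(.*?)```", re.DOTALL)


def parse_action(response: str) -> dict:
    """Extract the first JSON object from a model reply and return it as a dict."""
    match = _FENCE.search(response)
    text = match.group(1) if match else response
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found in response")
    # raw_decode parses a single JSON value and ignores any text after it
    action, _ = json.JSONDecoder().raw_decode(text[start:])
    if not isinstance(action, dict) or "action_type" not in action:
        raise ValueError("response is not a JSON action with an 'action_type' key")
    return action


reply = '```json\n{"action_type": "tap", "x": 540, "y": 1200}\n```'
print(parse_action(reply))  # {'action_type': 'tap', 'x': 540, 'y': 1200}
```

Raising on malformed output, rather than guessing, lets the calling loop re-prompt or stop instead of sending a half-parsed action to the device.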

## 📊 Training Configuration

### Training Parameters

| Parameter | Value |
|-----------|-------|
| Learning Rate | 3e-4 |
| Batch Size | 1 (per device) |
| Gradient Accumulation | 16 |
| Epochs | 1.0 |
| Warmup Ratio | 0.1 |
| Weight Decay | 0.01 |
| Optimizer | AdamW |
| Scheduler | Cosine |
| Mixed Precision | bfloat16 |
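
The schedule in the table uses linear warmup over the first 10% of steps, followed by cosine decay. It can be sketched in plain Python. This sketch makes three assumptions: the 6,252 steps of the final checkpoint are the total optimizer steps for the single epoch, warmup steps are rounded up as `transformers`' `TrainingArguments` does, and the decay follows the shape of `get_cosine_schedule_with_warmup` with its default half cycle. The effective batch size is the per-device batch times gradient accumulation times the number of devices. The card does not state the device count, so the sketch defaults to one.

```python
import math


def lr_at(step: int, total_steps: int = 6252, peak_lr: float = 3e-4,
          warmup_ratio: float = 0.1) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero (assumed schedule shape)."""
    warmup = math.ceil(total_steps * warmup_ratio)  # 626 steps for the defaults
    if step < warmup:
        return peak_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, total_steps - warmup)
    return peak_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


def effective_batch_size(per_device: int = 1, grad_accum: int = 16,
                         num_devices: int = 1) -> int:
    """Examples contributing to each optimizer step."""
    return per_device * grad_accum * num_devices


print(lr_at(626))              # 0.0003 (peak, end of warmup)
print(effective_batch_size())  # 16
```

With these values, the learning rate peaks at step 626, falls to half the peak (1.5e-4) around step 3,439, and reaches zero at the final step.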

### Vision Configuration

| Parameter | Value |
|-----------|-------|
| Max Image Tokens | 256 |
| Min Image Tokens | 64 |
| Image Splitting | Enabled |
| Image Format | RGB |

## 🎯 Use Cases

### 1. Automated Testing
- UI automation for Android apps
- Regression testing with visual verification
- Cross-device compatibility testing

### 2. Accessibility Support
- Voice-controlled device navigation
- Assistive technology integration
- Screen reader enhancement

### 3. Remote Device Management
- Remote troubleshooting
- Device configuration automation
- Support ticket resolution

### 4. App Development
- UI/UX testing automation
- User flow validation
- Performance testing


## 🔒 Safety and Limitations

### Safety Considerations
- Device Control: Model generates actions that can modify device state
- Testing Environment: Always test in controlled environments first
- Human Oversight: Implement safety checks for critical operations
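
A safety check can sit between the model and the device. It validates each parsed action and builds the corresponding `adb` command without running it, so a person or a policy layer can review it first. The following is a minimal sketch under stated assumptions. Only `tap` appears in this card's example output. The other action names (`type`, `scroll`, `open_app`, `back`, `home`), the swipe geometry, and the helper `action_to_adb` are hypothetical. The `adb shell input` and `monkey` commands themselves are standard Android tooling.

```python
from typing import Dict, List, Optional, Set

# Hypothetical allowlist: action types this harness is willing to execute
DEFAULT_ALLOWED = {"tap", "type", "scroll", "open_app", "back", "home"}


def action_to_adb(action: Dict, width: int, height: int,
                  allowed: Optional[Set[str]] = None) -> List[str]:
    """Validate a parsed JSON action and translate it into an adb command (not executed)."""
    allowed = DEFAULT_ALLOWED if allowed is None else allowed
    kind = action.get("action_type")
    if kind not in allowed:
        raise PermissionError(f"action_type {kind!r} is not on the allowlist")

    if kind == "tap":
        x, y = int(action["x"]), int(action["y"])
        # Reject coordinates outside the physical screen
        if not (0 <= x < width and 0 <= y < height):
            raise ValueError(f"tap ({x}, {y}) is outside the {width}x{height} screen")
        return ["adb", "shell", "input", "tap", str(x), str(y)]

    if kind == "type":
        # `input text` reads %s as a space; other shell metacharacters are NOT escaped here
        return ["adb", "shell", "input", "text", str(action["text"]).replace(" ", "%s")]

    if kind == "scroll":
        # Swipe from the screen centre; the finger moves opposite to the scroll direction
        offsets = {"down": (0, -(height // 4)), "up": (0, height // 4),
                   "right": (-(width // 4), 0), "left": (width // 4, 0)}
        dx, dy = offsets[action["direction"]]
        cx, cy = width // 2, height // 2
        return ["adb", "shell", "input", "swipe",
                str(cx), str(cy), str(cx + dx), str(cy + dy)]

    if kind == "open_app":
        # Launch the package's launcher activity via the monkey tool
        return ["adb", "shell", "monkey", "-p", str(action["app_name"]),
                "-c", "android.intent.category.LAUNCHER", "1"]

    keycodes = {"back": "KEYCODE_BACK", "home": "KEYCODE_HOME"}
    if kind in keycodes:
        return ["adb", "shell", "input", "keyevent", keycodes[kind]]
    raise ValueError(f"no adb mapping for action_type {kind!r}")


print(action_to_adb({"action_type": "tap", "x": 540, "y": 1200}, 1080, 2400))
```

After review, the returned list can be passed to `subprocess.run(cmd, check=True)`. Keeping translation and execution separate means every command that reaches the device has passed the allowlist and bounds checks.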

### Known Limitations
- Screen Resolution: Performance may vary with different screen sizes
- App-Specific: Training focused on common Android apps
- Language: Primarily English language support
- Real-time: Not optimized for real-time video processing

## 📄 License & Terms

This model is proprietary technology owned by Tonic and is subject to strict licensing terms:

### Investment Evaluation License

- Purpose: Access granted solely for investment evaluation and due diligence
- Restrictions: No commercial use, reproduction, or distribution without written consent
- NDA Required: All access is subject to Non-Disclosure Agreement
- Confidentiality: All technical details, training methodologies, and performance characteristics are confidential

### Base Model Attribution

- Gemma 3N-E4B-IT: Licensed under [Gemma 3N License](https://ai.google.dev/gemma/terms) from Google
- Fine-tuning: Proprietary to Tonic, subject to separate licensing terms

## 🙏 Acknowledgments

- Google: For the base Gemma 3N model
- Hugging Face: For the transformers library and hosting

## 🔗 Related Links

- [Base Model: Gemma 3N-E4B-IT](https://huggingface.co/google/gemma-3n-E4B-it)
- [Training Repository](https://github.com/Josephrp/train_android_models)
- [Demo Space](https://huggingface.co/spaces/Tonic/g-operator-demo)

---

<div align="center">

Made with ❤️ by the Tonic Team

[![Hugging Face](https://img.shields.io/badge/Hugging%20Face-Follow-blue?logo=huggingface)](https://huggingface.co/Tonic)
[![GitHub](https://img.shields.io/badge/GitHub-Star-yellow?logo=github)](https://github.com/Josephrp/train_android_models)

</div>