VictorOntiveros/real-estate-assistant

@@ -1,123 +1,133 @@
 ---
-license: mit
 ---
 # Real Estate Assistant – RAG-Based LLM (Victor Ontiveros)
 ## 1. Introduction
-The homebuying process is filled with complexity—buyers must consider mortgage eligibility, neighborhood factors, taxes, and investment potential. However, general-purpose large language models (LLMs) often struggle with specificity, hallucinate responses, and fail to capture local nuances. To address these limitations, we developed a Retrieval-Augmented Generation (RAG)-based real estate assistant using the `mistral-7B-instruct` model. This assistant was trained on a validated, synthetic dataset coverin...
----
 ## 2. Training Data
-We generated a structured synthetic dataset using GPT-4 to simulate realistic user queries. Each example consists of:
 - A user query
-- A context passage from trusted sources (e.g., Zillow, Freddie Mac, HUD)
-- A validated model response
-Data generation followed a few-shot prompting strategy and included rule-based financial checks and real-time API cross-validation. The dataset was split:
 - Train: 70%
 - Validation: 15%
 - Test: 15%
 - Random seed: 42
-No pre-existing datasets were used; all examples were generated programmatically and verified using housing, education, and legal APIs.
----
 ## 3. Training Method
-We fine-tuned `mistral-7B-instruct` using PEFT via LoRA on 4-bit quantized weights (via `bitsandbytes`). This allowed efficient training with reduced hardware requirements.
-**Key settings:**
-- Epochs: 3
-- Batch size: 8
-- Learning rate: 2e-5
-- Optimizer: AdamW
-- Quantization: 4-bit (nf4)
-- Frameworks: Hugging Face Transformers, PEFT
----
-## 4. Evaluation
-### Benchmarks Used:
-1. Internal Q&A (ROUGE-1) – Measures context comprehension
-2. Mortgage Approval Classification (F1-score)
-3. Zestimate Price Prediction (MAE in USD)
-We compared our model to:
-- Base `mistral-7B-instruct`
-- FLAN-T5 XL (strong multi-task generalist)
-- LLaMA 2 7B (similar size, general domain)
-### Results Table:
-| Task                         | Ours (RAG + LoRA) | Mistral-7B | FLAN-T5 XL | LLaMA-2 7B |
-|-----------------------------|-------------------|------------|------------|------------|
-| Internal QA (ROUGE-1)       | 0.68              | 0.44       | 0.51       | 0.49       |
-| Mortgage Approval (F1)      | 0.84              | 0.62       | 0.68       | 0.71       |
-| Zestimate Prediction (MAE)  | $21,000           | $48,000    | $39,000    | $34,000    |
-**Summary:** Our model shows clear improvements over baseline LLMs, especially in retrieval-grounded Q&A and structured financial tasks.
----
 ## 5. Usage and Intended Uses
 ```python
-from transformers import AutoTokenizer, AutoModelForCausalLM
-model = AutoModelForCausalLM.from_pretrained("victorontiveros/real-estate-assistant")
-tokenizer = AutoTokenizer.from_pretrained("victorontiveros/real-estate-assistant")
-prompt = "User Query: What are the first-time homebuyer programs in Florida?\nContext: Florida offers incentives like..."
-inputs = tokenizer(prompt, return_tensors="pt")
-output = model.generate(**inputs, max_new_tokens=128)
 print(tokenizer.decode(output[0], skip_special_tokens=True))
 ```
-Use Cases:
-- Estimate monthly mortgage payments
-- Identify top-rated school neighborhoods
-- Check eligibility for state-specific programs
-- Answer zoning or property tax questions
----
 ## 6. Prompt Format
-Each input follows this format:
 ```
-User Query: [natural language question]
-Context: [retrieved information relevant to the query]
 ```
----
 ## 7. Expected Output Format
-The model returns a concise, well-structured response:
 ```
-Based on your savings and income, you qualify for both FHA and conventional loans. Your monthly payment would be ~$1,918 with 20% down.
 ```
----
 ## 8. Limitations
-- The model relies on synthetic data which may underrepresent rare or legal-specific scenarios.
-- It doesn’t support real-time data retrieval unless integrated with an external retriever or LangChain.
-- There is regional bias toward popular areas like Texas, Florida, and California due to the prompt diversity.
-- The assistant may express overconfidence in borderline financial situations without disclaimers.
 ---
-## References
-- [Zillow API](https://www.zillow.com/howto/api/APIOverview.htm)
-- [Freddie Mac](https://www.freddiemac.com/research)
-- [GreatSchools API](https://www.greatschools.org/gk/about-greatschools-api/)
-- [LoRA: Low-Rank Adaptation](https://arxiv.org/abs/2106.09685)
-- [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-Instruct)
-## Presented by
-Victor Ontiveros – UVA SDS Final Project (Spring 2025)

 ---
+license: apache-2.0
+base_model: mistralai/Mistral-7B-Instruct
+tags:
+- real-estate
+- RAG
+- retrieval-augmented-generation
+- faiss
+- sentence-transformers
+- financial-qa
+- huggingface
+- synthetic-data
+datasets:
+- custom-synthetic-real-estate-dataset
+metrics:
+- rouge1
+- rougeL
+library_name: transformers
+model_type: mistral
+pipeline_tag: text-generation
+inference: true
 ---
 # Real Estate Assistant – RAG-Based LLM (Victor Ontiveros)
 ## 1. Introduction
+The homebuying process is complex: buyers must weigh mortgage eligibility, neighborhood factors, property taxes, and investment potential. General-purpose large language models (LLMs) often lack specificity, hallucinate responses, and miss local nuances. To address these limitations, I developed a Retrieval-Augmented Generation (RAG)-based real estate assistant using the Mistral-7B model. Instead of traditional fine-tuning, the assistant dynamically retrieves embedded real estate knowledge to ground its responses. The system was evaluated on synthetic real estate queries and on three external benchmarks (FiQA v2, HotpotQA, and Natural Questions), scoring highest on FiQA v2.
 ## 2. Training Data
+The training data consisted of a structured synthetic real estate dataset designed to simulate realistic user queries. Each example included:
 - A user query
+- A real estate context passage
+- A validated reference response
+The dataset was split as follows:
 - Train: 70%
 - Validation: 15%
 - Test: 15%
 - Random seed: 42
+This synthetic data was specifically generated to cover common real estate financial, legal, and neighborhood-related topics.
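A 70/15/15 split with a fixed seed can be reproduced along these lines. This is only a sketch: the card does not say which tool performed the split, so the `split_dataset` helper and the placeholder records below are assumptions for illustration.

```python
import random

def split_dataset(examples, seed=42, train_frac=0.70, val_frac=0.15):
    """Shuffle a copy of `examples` and cut it into train/validation/test lists."""
    shuffled = list(examples)
    # A dedicated Random instance keeps the shuffle reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test

# Hypothetical records carrying the three fields listed above.
examples = [
    {"query": f"q{i}", "context": f"c{i}", "response": f"r{i}"}
    for i in range(100)
]
train, val, test = split_dataset(examples)
print(len(train), len(val), len(test))
```

Calling `split_dataset` again with the same seed returns the same partition, which is what makes the reported split repeatable.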
 ## 3. Training Method
+Rather than traditional supervised fine-tuning, I implemented a custom Retrieval-Augmented Generation (RAG) pipeline:
+- Context embeddings were generated using a sentence-transformer model (`all-MiniLM-L6-v2`).
+- FAISS indexing enabled efficient retrieval of similar contexts.
+- The retrieved contexts were fed into the Mistral-7B model for response generation.
+**Hyperparameters and retrieval settings:**
+- Top-k retrieved contexts: 3
+- Max new tokens during generation: 300
+- Temperature: 0.7
+- Device: CUDA (float16 precision)
+- Frameworks used: Hugging Face Transformers, Sentence-Transformers, FAISS
+This setup allows dynamic, context-grounded generation without traditional parameter updates to the base model.
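The retrieval step (embed the contexts, embed the query, return the top-k most similar contexts) can be sketched without any third-party dependencies. Note the substitutions: a bag-of-words vector stands in for `all-MiniLM-L6-v2` embeddings, and a brute-force cosine ranking stands in for the FAISS index. The function names and the sample passages are hypothetical and are not taken from the project.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, contexts, k=3):
    """Return the k contexts most similar to the query, best match first."""
    q = embed(query)
    ranked = sorted(contexts, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

# Made-up passages, for illustration only.
contexts = [
    "FHA loans allow down payments as low as 3.5 percent for qualifying buyers.",
    "Property tax rates vary by county and are applied to assessed home value.",
    "School ratings are one factor buyers weigh when comparing neighborhoods.",
    "Closing costs typically include lender fees, title insurance, and escrow deposits.",
]
top = retrieve("What down payment do FHA loans require?", contexts, k=3)
print(top[0])
```

In the pipeline described above, the three retrieved passages would then be placed in the prompt that is passed to Mistral-7B for generation.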
+## 4. Evaluation
+Performance was measured on both synthetic real estate queries and external benchmarks.
+### Evaluation Results
+| Dataset                 | ROUGE-1 F1 | ROUGE-L F1 |
+|:-------------------------|:-----------|:-----------|
+| Synthetic (Internal)      | ~0.10      | ~0.07      |
+| FiQA v2 (Financial QA)    | ~0.33      | ~0.30      |
+| HotpotQA (Multi-hop QA)   | ~0.11      | ~0.10      |
+| Natural Questions (Open QA) | ~0.02    | ~0.01      |
+The model scored highest on FiQA v2 (ROUGE-1 F1 of about 0.33), the benchmark closest to its financial domain. Scores on the internal synthetic set (about 0.10) and HotpotQA (about 0.11) were much lower, and the model struggled on open-ended Natural Questions (about 0.02). This pattern suggests that domain-specific retrieval matters for real estate tasks.
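For readers unfamiliar with the two metrics in the table, here is a minimal sketch of ROUGE-1 F1 (unigram overlap) and ROUGE-L F1 (longest common subsequence). It assumes lowercased whitespace tokenization with no stemming. The card does not say which ROUGE implementation produced the reported scores, so values from this sketch may differ from those of a packaged scorer.

```python
from collections import Counter

def _tokens(text):
    # Assumption: lowercase + whitespace split; real scorers may stem or strip punctuation.
    return text.lower().split()

def rouge1_f1(reference, candidate):
    """F1 over clipped unigram overlap between reference and candidate."""
    ref, cand = Counter(_tokens(reference)), Counter(_tokens(candidate))
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def _lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rougeL_f1(reference, candidate):
    """F1 based on the longest common subsequence of tokens."""
    ref, cand = _tokens(reference), _tokens(candidate)
    lcs = _lcs_len(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge1_f1(reference, candidate), rougeL_f1(reference, candidate))
```

Both metrics reward word overlap with a single reference answer, so a fluent response that is phrased differently from the reference can still score low.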
 ## 5. Usage and Intended Uses
 ```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+model = AutoModelForCausalLM.from_pretrained("your-username/your-real-estate-assistant")
+tokenizer = AutoTokenizer.from_pretrained("your-username/your-real-estate-assistant")
+query = "I have a $90K budget. What are my first-time homebuyer options in Austin, TX?"
+inputs = tokenizer(query, return_tensors="pt")
+output = model.generate(**inputs, max_new_tokens=300)
 print(tokenizer.decode(output[0], skip_special_tokens=True))
 ```
+**Intended Uses:**
+- Homebuyer education and affordability guidance
+- Mortgage eligibility exploration
+- Real estate investment decision support
+- School rating and relocation assistance
 ## 6. Prompt Format
+Each input consists of a user query combined with retrieved context:
 ```
+User Query: [user's question]
+Context: [retrieved background information]
+Answer:
 ```
+Example:
+```
+User Query: What are the average property taxes in Travis County, TX?
+Context: The average property tax rate in Travis County, Texas is 2.1% as of 2024, based on Zillow estimates and tax assessor data.
+Answer:
+```
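A small helper can assemble this prompt from a query and the retrieved passages. This is a sketch: `build_prompt` is a hypothetical name, and joining multiple passages with a single space is an assumption, since the card does not say how the top-k contexts are concatenated.

```python
def build_prompt(query, contexts):
    """Assemble the User Query / Context / Answer prompt from retrieved passages."""
    # Assumption: multiple retrieved passages are joined into one Context line.
    context_block = " ".join(c.strip() for c in contexts)
    return f"User Query: {query.strip()}\nContext: {context_block}\nAnswer:"

prompt = build_prompt(
    "What are the average property taxes in Travis County, TX?",
    [
        "The average property tax rate in Travis County, Texas is 2.1% as of 2024, "
        "based on Zillow estimates and tax assessor data."
    ],
)
print(prompt)
```

Ending the prompt with `Answer:` leaves the model to continue from that point, so the generated text after it is the response.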
 ## 7. Expected Output Format
+The model returns fluent, grounded natural language responses:
 ```
+In Travis County, property tax rates typically average 2.1% of assessed home value, although actual rates may vary based on exemptions and city-specific factors.
 ```
 ## 8. Limitations
+- The model relies heavily on the quality and relevance of retrieved contexts; weak retrieval can degrade output quality.
+- Performance on open-domain general queries (as seen with Natural Questions) remains limited.
+- Because the underlying dataset is synthetic, the assistant may not generalize to all real-world scenarios.
+- Generated responses do not constitute legal, tax, or financial advice and should be validated by experts.
+## References
+- [FAISS - Facebook AI Similarity Search](https://github.com/facebookresearch/faiss)
+- [Hugging Face Transformers](https://huggingface.co/docs/transformers/index)
+- [Sentence Transformers](https://www.sbert.net/)
+- [Mistral-7B Model Card](https://huggingface.co/mistralai/Mistral-7B-Instruct)
 ---
+Presented by Victor Ontiveros – UVA SDS Final Project (Spring 2025)