Floorplan Retrieval with Design Intent Models
This repository contains two models trained for the research paper: "Unlocking Floorplan Retrieval with Design Intent via Contrastive Multimodal Learning".
These models are designed to retrieve architectural floorplans from a database based on a source image and a natural language instruction describing a desired change. This enables a more intuitive and goal-driven search for architects and designers.
Model Details
Two architectures were trained for this task using a triplet contrastive learning framework. The goal is to learn a shared embedding space where a query (source image + text instruction) is closer to a positive target image (that satisfies the instruction) than to a negative image.
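For intuition, here is a minimal PyTorch sketch of that triplet objective, using the cosine-distance variant listed for the first model below; the embeddings are random placeholder data standing in for the fused query/target vectors:

```python
import torch
import torch.nn.functional as F

# Placeholder fused embeddings for a batch of triplets: the query
# (source image + instruction), a positive target that satisfies the
# instruction, and a negative that does not. Shape: (batch, embed_dim).
query = torch.randn(8, 512)
positive = torch.randn(8, 512)
negative = torch.randn(8, 512)

# Cosine distance = 1 - cosine similarity.
def cosine_distance(a, b):
    return 1.0 - F.cosine_similarity(a, b)

loss_fn = torch.nn.TripletMarginWithDistanceLoss(
    distance_function=cosine_distance, margin=0.2
)
# Pulls the query toward the positive and away from the negative
# until they are separated by at least the margin.
loss = loss_fn(query, positive, negative)
```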
1. CLIP-MLP-Floorplan-Retriever (Recommended)
This model uses the pre-trained multimodal embeddings from CLIP (ViT-B/32). The image and text embeddings are concatenated and passed through a simple MLP for fusion. This model demonstrated superior performance in both quantitative metrics and user studies.
- Image Encoder: CLIP Vision Transformer (ViT-B/32)
- Text Encoder: CLIP Text Transformer
- Fusion: Concatenation + Multi-Layer Perceptron (MLP)
- Training Loss: `TripletMarginWithDistanceLoss` with cosine similarity (margin=0.2)
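A minimal sketch of this concatenate-and-fuse design, assuming the `openai/clip-vit-base-patch32` checkpoint from `transformers`; the MLP depth and widths are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
from transformers import CLIPModel

class ClipMlpFusion(nn.Module):
    """Concatenate CLIP image/text embeddings and fuse them with an MLP."""

    def __init__(self, embed_dim: int = 512, out_dim: int = 512):
        super().__init__()
        self.clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        # Illustrative fusion head; the paper's exact depth/width may differ.
        self.mlp = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, out_dim),
        )

    def forward(self, pixel_values, input_ids, attention_mask):
        img = self.clip.get_image_features(pixel_values=pixel_values)
        txt = self.clip.get_text_features(
            input_ids=input_ids, attention_mask=attention_mask
        )
        # Fuse the two modalities by concatenation, then project.
        return self.mlp(torch.cat([img, txt], dim=-1))
```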
2. BERT-ResNet-CA-Floorplan-Retriever
This model uses separate pre-trained encoders for image and text. A cross-attention module is used to fuse the features, allowing the image representation to attend to linguistic cues from the instruction.
- Image Encoder: ResNet50
- Text Encoder: BERT (base-uncased)
- Fusion: Cross-Attention Module
- Training Loss: `TripletMarginLoss` with L2 (Euclidean) distance (margin=1.0)
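A hedged sketch of this cross-attention fusion, assuming torchvision's ResNet50 and `bert-base-uncased` from `transformers`; the projection size, head count, and mean pooling are assumptions rather than the paper's exact design:

```python
import torch
import torch.nn as nn
from transformers import BertModel
from torchvision.models import resnet50

class CrossAttentionFusion(nn.Module):
    """Image region features attend to instruction tokens via attention."""

    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        backbone = resnet50(weights="IMAGENET1K_V2")
        # Keep everything up to global pooling; drop the classifier head.
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])
        self.img_proj = nn.Linear(2048, dim)  # ResNet50 channels -> BERT dim
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, images, input_ids, attention_mask):
        # (B, 2048, H, W) -> (B, H*W, dim): a sequence of region features.
        feats = self.cnn(images).flatten(2).transpose(1, 2)
        img_seq = self.img_proj(feats)
        txt_seq = self.bert(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        # Image regions (queries) attend to instruction tokens (keys/values).
        fused, _ = self.cross_attn(
            img_seq, txt_seq, txt_seq,
            key_padding_mask=(attention_mask == 0),
        )
        return fused.mean(dim=1)  # pool to a single embedding (assumption)
```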
How to Use
Use these models to compute a fused embedding for a (floorplan, instruction) pair, then compare that embedding (e.g., via cosine similarity) against a pre-computed database of floorplan embeddings to find the best matches.
First, install the necessary libraries:
```bash
pip install torch transformers Pillow
```
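Then, building on the `ClipMlpFusion` sketch above, a query can be scored against a pre-computed database roughly as follows; the image path, instruction, and database tensor are placeholders to replace with your own data and the released weights:

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPProcessor

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = ClipMlpFusion()  # placeholder: load the released checkpoint instead
model.eval()

image = Image.open("source_floorplan.png")  # hypothetical input file
instruction = "add a second bathroom next to the master bedroom"
inputs = processor(text=[instruction], images=image,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    query = model(inputs["pixel_values"], inputs["input_ids"],
                  inputs["attention_mask"])  # shape: (1, out_dim)

# Placeholder database of pre-computed floorplan embeddings: (N, out_dim).
database = torch.randn(1000, 512)
scores = F.cosine_similarity(query, database)  # shape: (N,)
top3 = scores.topk(3).indices  # indices of the best-matching floorplans
print(top3)
```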
Evaluation results
All metrics are self-reported on the Synthetic Floorplan Intent Dataset.

| Model | Precision@3 | Unique Preference Rate |
| --- | --- | --- |
| CLIP-MLP-Floorplan-Retriever | 0.393 | 0.607 |
| BERT-ResNet-CA-Floorplan-Retriever | 0.226 | 0.179 |