Platonic Representations for Poverty Mapping: Unified Vision-Language Codes or Agent-Induced Novelty?
Abstract
A multimodal framework using satellite imagery and text data outperforms vision-only models in predicting household wealth, with LLM-generated text proving more effective than agent-retrieved text.
We investigate whether socio-economic indicators like household wealth leave recoverable imprints in satellite imagery (capturing physical features) and Internet-sourced text (reflecting historical/economic narratives). Using Demographic and Health Survey (DHS) data from African neighborhoods, we pair Landsat images with LLM-generated textual descriptions conditioned on location/year and text retrieved by an AI search agent from web sources. We develop a multimodal framework that predicts household wealth (International Wealth Index) through five pipelines: (i) vision model on satellite images, (ii) LLM using only location/year, (iii) AI agent searching/synthesizing web text, (iv) joint image-text encoder, (v) ensemble of all signals. Our framework yields three contributions. First, fusing vision and agent/LLM text outperforms vision-only baselines in wealth prediction (e.g., R-squared of 0.77 vs. 0.63 on out-of-sample splits), with LLM-internal knowledge proving more effective than agent-retrieved text and improving robustness under out-of-country and out-of-time generalization. Second, we find partial representational convergence: fused embeddings from vision/language modalities correlate moderately (median cosine similarity of 0.60 after alignment), suggesting a shared latent code of material well-being while retaining complementary details, consistent with the Platonic Representation Hypothesis. Although LLM-only text outperforms agent-retrieved data, challenging our Agent-Induced Novelty Hypothesis, modest gains from combining agent data in some splits weakly support the notion that agent-gathered information introduces unique representational structures not fully captured by static LLM knowledge. Third, we release a large-scale multimodal dataset comprising more than 60,000 DHS clusters linked to satellite images, LLM-generated descriptions, and agent-retrieved texts.
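To make the five pipelines concrete, here is a minimal sketch of one possible late-fusion setup, assuming precomputed embeddings for each modality. The function names, the ridge-regression heads, and the averaging ensemble are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: feature extraction and model choices are assumptions,
# not the paper's actual pipeline code.
import numpy as np
from sklearn.linear_model import Ridge

def fit_head(features: np.ndarray, iwi: np.ndarray) -> Ridge:
    """Fit one wealth-prediction head (IWI regression) on a feature matrix."""
    return Ridge(alpha=1.0).fit(features, iwi)

def fuse(image_emb: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Joint image-text features via simple concatenation (pipeline iv)."""
    return np.concatenate([image_emb, text_emb], axis=-1)

def ensemble_predict(models, feature_sets) -> np.ndarray:
    """Pipeline (v): average the predictions of the individual pipelines."""
    return np.mean([m.predict(x) for m, x in zip(models, feature_sets)], axis=0)
```

Under this reading, pipelines (i)-(iii) are single-modality heads, (iv) regresses on the concatenated embedding, and (v) averages the per-pipeline predictions.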
Community
Setting the Stage
Measuring poverty and household wealth in low- and middle-income countries is challenging. Traditional surveys, like those from the Demographic and Health Surveys (DHS), provide reliable data but are expensive, infrequent, and often miss remote or fast-changing areas. Researchers have started using satellite images to spot signs of wealth, such as roads or buildings, but these visuals alone can’t capture cultural, historical, or social factors. With advances in AI, including large language models (LLMs) like GPT, we can now tap into text from the internet or AI-generated descriptions to add context. This paper explores whether combining satellite images (visual data) with text (from AI “memory” or web searches) can create a more complete picture of poverty, potentially revealing a shared “ideal” representation of wealth across these data types—drawing from ideas like the Platonic Representation Hypothesis.
What the Authors Did
The authors analyzed data from more than 60,000 DHS neighborhoods across Africa, spanning 1990 to 2020, using the International Wealth Index (IWI) as the measure of household wealth. For each location, they paired Landsat satellite images (showing physical features like infrastructure) with two types of text: (1) descriptions generated by LLMs from only the location and year, drawing on the model’s built-in knowledge; and (2) real-world text retrieved by an AI “search agent” that queries the web (e.g., Wikipedia or news) for historical and economic details. They built five prediction systems: one using just images, one using LLM text, one using agent-searched text, one fusing images and text into a shared code, and an ensemble combining all of them, and tested these on different data splits (random, out-of-country, out-of-time) to check robustness. Results showed that blending images with LLM text boosted prediction accuracy (e.g., explaining 77% of wealth variation vs. 63% for images alone), with LLM knowledge outperforming web-searched text. Embeddings (compact AI representations) from images and text showed moderate overlap (cosine similarity around 0.60), suggesting some shared underlying structure.
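One hedged way to picture the "moderate overlap" measurement: linearly align the text embeddings to the image embeddings, then take the median per-cluster cosine similarity. The least-squares alignment and the variable names below are assumptions rather than the authors' exact procedure.

```python
# Illustrative convergence check (not the authors' code): align text embeddings
# to image embeddings with a linear map, then report the median cosine similarity.
import numpy as np

def median_aligned_cosine(img_emb: np.ndarray, txt_emb: np.ndarray) -> float:
    # Least-squares linear map W from text space into image space.
    W, *_ = np.linalg.lstsq(txt_emb, img_emb, rcond=None)
    aligned = txt_emb @ W
    # Cosine similarity per DHS cluster, then the median across clusters.
    num = np.sum(aligned * img_emb, axis=1)
    denom = np.linalg.norm(aligned, axis=1) * np.linalg.norm(img_emb, axis=1)
    return float(np.median(num / np.maximum(denom, 1e-12)))
```

A value near 0.60 under a check like this would indicate shared structure without the two modalities collapsing into identical representations.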
Why the Results Matter
This work advances poverty mapping by making it more accurate and scalable, helping policymakers target aid in underserved African regions without relying solely on costly surveys. It highlights how AI can bridge gaps in data by fusing visuals and text, with LLM “memory” proving surprisingly effective for generalization across countries and time periods. While web-searched text added minor unique insights (weak support for “agent-induced novelty”), the findings broadly align with the idea of unified representations in AI. The authors released a large multimodal dataset on Hugging Face (with ~60,000 entries including images, texts, and IWI labels) to enable further research in AI for social good, such as fairer models or causal analysis, ultimately supporting global efforts to reduce inequality.
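For readers who want to explore the released data, a loading sketch might look like the following. The repository id and field names are placeholders, since the exact Hugging Face path and schema are not given here; consult the dataset card before use.

```python
# Placeholder repository id and field names: check the authors' dataset card for
# the actual path and schema.
from datasets import load_dataset

ds = load_dataset("some-org/dhs-multimodal-wealth", split="train")  # hypothetical id
example = ds[0]
print(example.keys())  # expected fields: satellite image, LLM text, agent text, IWI label
```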
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- GeoReg: Weight-Constrained Few-Shot Regression for Socio-Economic Estimation using LLM (2025)
- Escaping Plato's Cave: JAM for Aligning Independently Trained Vision and Language Models (2025)
- Few-Shot Vision-Language Reasoning for Satellite Imagery via Verifiable Rewards (2025)
- TAMMs: Temporal-Aware Multimodal Model for Satellite Image Change Understanding and Forecasting (2025)
- MGCR-Net: Multimodal Graph-Conditioned Vision-Language Reconstruction Network for Remote Sensing Change Detection (2025)
- Following the Clues: Experiments on Person Re-ID using Cross-Modal Intelligence (2025)
- Video Event Reasoning and Prediction by Fusing World Knowledge from LLMs with Vision Foundation Models (2025)