AI & ML interests

1. Synthetic Data Generation
   - Text-to-Table and Text-to-Database modeling using LLMs.
   - Controllable data synthesis: preserving correlation structures, class balance, and statistical realism.
   - Model-based evaluation of synthetic vs. real data (utility, privacy leakage, fairness).

2. Domain-Specific Simulation
   - Biomedical EMR synthesis (EHR, claims, genomics, EEG, imaging metadata).
   - Environmental, urban, and energy-consumption datasets from public technical reports.
   - Financial or behavioral simulations built from research abstracts or SEC filings.

3. Data Quality & Validation
   - AI models for detecting implausible synthetic records.
   - Automatic alignment of generated data distributions with reference real-world datasets.
   - Metrics for realism, diversity, and representativeness.

4. LLM & Diffusion Applications
   - Fine-tuning domain LLMs on synthetic corpora to expand scarce data domains.
   - Diffusion models for structured data generation (e.g., time-series, sensor data).
   - Multi-modal dataset creation: text + tabular + temporal.

5. Privacy & Governance
   - Privacy auditing via membership-inference and re-identification resistance tests.
   - “Privacy-free” pipelines: no PII ever enters training loops.
   - Model cards and dataset cards documenting provenance and safety.

6. Downstream ML Enablement
   - Benchmark creation for ML model validation when real data access is restricted.
   - Rapid prototyping datasets for AI startups, hackathons, and education.
   - Synthetic data augmentation for supervised learning and reinforcement learning environments.

7. AI-for-Science Integration
   - Using LLMs to read research papers and auto-extract structured experiment data.
   - Building machine-readable “paper-to-data” repositories across disciplines.
   - Enabling reproducibility and meta-analysis with synthetic but statistically consistent datasets.

DBbun creates synthetic databases for research, analytics, and machine learning. Unlike traditional datasets, DBbun’s resources are “privacy-free”: they are not derived from real patient or customer data, so they contain no personal records. Instead, they are produced with generative AI that draws on publicly available scientific papers and other open resources to build entirely new datasets and models.

Contact: [email protected]