🏠 DALAS: Multimodal Real Estate Valuation via Deep Learning

A hybrid architecture combining Computer Vision (DINOv2), NLP (CamemBERT), and Gradient Boosting (XGBoost) to estimate real estate prices in France, reaching an $R^2$ of 0.973 on the rental market.

📄 Technical Report

This project is based on extensive research involving the scraping of 675k listings and 3M+ images. For a detailed explanation of the Masked Multi-Task Loss and the Visual Premium analysis, please refer to the full report:

👉 Read the Full Technical Report (PDF): Rapport.pdf, at the repository root.
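
The report remains the reference for the exact loss. As a rough illustration only, a masked multi-task loss generally averages each task's error over the samples that actually carry a label for that task, so a rental listing does not penalize a sale-price head and vice versa. The sketch below assumes two regression heads (rent and sale price), a squared-error criterion, and `None` as the missing-label marker. None of these details are confirmed by this README, and the name `masked_multitask_mse` is hypothetical.

```python
from typing import List, Optional, Sequence


def masked_multitask_mse(
    preds: Sequence[Sequence[float]],
    targets: Sequence[Sequence[Optional[float]]],
) -> float:
    """Average of per-task MSEs, each computed only over labeled samples.

    Assumption: targets[i][t] is None when sample i has no label for task t.
    """
    if not preds:
        return 0.0
    n_tasks = len(preds[0])
    task_losses: List[float] = []
    for t in range(n_tasks):
        # Keep only samples whose label for task t is present (the "mask").
        sq_errors = [
            (p[t] - y[t]) ** 2
            for p, y in zip(preds, targets)
            if y[t] is not None
        ]
        if sq_errors:  # skip tasks with no labeled sample in this batch
            task_losses.append(sum(sq_errors) / len(sq_errors))
    return sum(task_losses) / len(task_losses) if task_losses else 0.0


# Toy batch: column 0 = rent, column 1 = sale price (hypothetical layout).
batch_preds = [[1.0, 10.0], [2.0, 20.0]]
batch_targets = [[1.5, None], [None, 18.0]]
loss = masked_multitask_mse(batch_preds, batch_targets)
```

In a PyTorch training loop the same idea is usually written with a boolean mask tensor instead of `None` values, but the masking logic is the same.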

🚀 Project Overview

Traditional Automated Valuation Models (AVMs) rely heavily on tabular data (surface, location). DALAS bridges the semantic gap by integrating unstructured data (images and descriptions) to capture the "intrinsic condition" of a property.

Core Innovation: The Two-Stage Hybrid Pipeline

1. Neural Feature Extraction: A custom backbone fuses visual features (via ConvNeXt/DINOv2) and textual features (via CamemBERT) to learn a high-dimensional latent representation of the property.
2. Gradient Boosting Regressor: These embeddings are fed into an XGBoost model (trained with gpu_hist) to handle non-linear geographic interactions and minimize error.
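
The hand-off between the two stages can be sketched as follows. This is a minimal illustration, not the code in `model/my_xgboost.py`: it assumes the backbone embedding is concatenated with raw tabular features (for example surface and coordinates) before boosting, and every hyperparameter except `tree_method="gpu_hist"` is a placeholder. The helper names `build_stage2_row` and `build_stage2_matrix` are hypothetical.

```python
from typing import Dict, List, Sequence, Union


def build_stage2_row(
    embedding: Sequence[float], tabular: Sequence[float]
) -> List[float]:
    """Concatenate one frozen backbone embedding with its tabular features."""
    return list(embedding) + list(tabular)


def build_stage2_matrix(
    embeddings: Sequence[Sequence[float]],
    tabular_rows: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Build the design matrix consumed by the boosting stage."""
    if len(embeddings) != len(tabular_rows):
        raise ValueError("embeddings and tabular_rows must have the same length")
    return [build_stage2_row(e, t) for e, t in zip(embeddings, tabular_rows)]


# Only tree_method comes from the README; the other values are placeholders.
# Note: XGBoost >= 2.0 deprecates "gpu_hist" in favor of
# tree_method="hist" together with device="cuda".
XGB_PARAMS: Dict[str, Union[str, int, float]] = {
    "tree_method": "gpu_hist",
    "objective": "reg:squarederror",
    "n_estimators": 500,
    "learning_rate": 0.05,
    "max_depth": 8,
}

# Toy example: 2 listings, 3-dim embedding, 2 tabular features each.
X = build_stage2_matrix(
    embeddings=[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    tabular_rows=[[45.0, 48.85], [72.0, 43.30]],
)
# With xgboost installed, the matrix would then be used roughly as:
#   model = xgboost.XGBRegressor(**XGB_PARAMS).fit(X, y)
```

Keeping the boosting stage separate from the backbone means the embeddings can be computed once and cached, then reused while tuning the regressor.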

👥 Team & Credits

Project developed at Sorbonne Université (Master of Computer Science).

📊 Key Results

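We report state-of-the-art performance on the French rental market: the hybrid model reaches an $R^2$ of 0.973 and lowers MAE by ~31% relative to the tabular baseline (from 53.40 € to 36.96 €), which quantifies the gain from adding image and text features.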

| Market Segment | Model Architecture | $R^2$ Score | MAE (Mean Absolute Error) |
|---|---|---|---|
| Rental (Location) | Hybrid (Ours) | 0.973 | 36.96 € |
| Rental (Location) | Tabular Baseline | 0.954 | 53.40 € |
| Sales (Achat) | Hybrid (Ours) | 0.760 | ~64k € |
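
For readers who want to check the figures, the snippet below gives the standard definitions of MAE and $R^2$ and recomputes the relative MAE reduction from the two rental rows of the table. The toy rents are made up purely to exercise the functions, and the function names are illustrative rather than taken from `model/eval.py`.

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """MAE: average absolute gap between true and predicted prices."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """R^2 = 1 - SS_res / SS_tot (share of variance explained)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


# Rental MAE values from the table: tabular baseline vs. hybrid model.
mae_gain = relative_reduction(53.40, 36.96)  # about 0.308, i.e. ~31%

# Made-up monthly rents, only to show the metric functions in use.
toy_true = [800.0, 1200.0, 1500.0]
toy_pred = [820.0, 1150.0, 1530.0]
toy_mae = mean_absolute_error(toy_true, toy_pred)
toy_r2 = r2_score(toy_true, toy_pred)
```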

🛠 Repository Structure

The codebase is organized to separate data acquisition, analysis, and modeling logic:

DALAS/
├── analyse/                  # 📊 Exploratory Data Analysis (EDA)
│   ├── analyse_bivariee.py   # Correlation matrices & ANOVA tests
│   ├── analyse_multivariee.py # PCA & Dimensionality reduction analysis
│   └── tools.py              # Statistical utility functions
├── data_acquisition/         # 🕷️ Distributed Scraping Pipeline
│   ├── get_image.py          # Asynchronous image downloader
│   └── dataset_stat.py       # Data volume monitoring
├── images_process/           # 🖼️ Computer Vision Pipeline
│   ├── ai_part.py            # DINOv2 Feature Extraction logic
│   └── filter_images.py      # Zero-Shot Semantic Filtering (CLIP)
├── model/                    # 🧠 Deep Learning & Training Core
│   ├── model.py              # PyTorch Backbone definition (Multi-modal)
│   ├── train.py              # Training loop with Masked Multi-Task Loss
│   ├── my_xgboost.py         # Stage 2: Boosting Regressor implementation
│   ├── data_loader.py        # Custom Dataset class with Robust Scaling
│   └── eval.py               # Inference & Metrics calculation
├── dataset/                  # Dataset generation scripts
└── environment.yml           # Conda environment configuration
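
The tree above mentions Robust Scaling in `data_loader.py`. As a hedged sketch of what that term usually means, the snippet below centers each value on the median and divides by the interquartile range (IQR), which keeps extreme listings from dominating the scale. It is a dependency-free illustration with hypothetical function names, not the project's actual `Dataset` class, and the rents are made up.

```python
import statistics
from typing import List, Sequence, Tuple


def fit_robust_scaler(values: Sequence[float]) -> Tuple[float, float]:
    """Estimate (median, IQR) from training values only."""
    median = statistics.median(values)
    # "inclusive" interpolates between data points, like a linear percentile.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr <= 0:
        iqr = 1.0  # degenerate column: avoid division by zero
    return median, iqr


def robust_scale(values: Sequence[float], median: float, iqr: float) -> List[float]:
    """Apply (x - median) / IQR using previously fitted statistics."""
    return [(v - median) / iqr for v in values]


# Made-up rents with one extreme outlier.
train_rents = [500, 700, 800, 900, 1000, 1200, 15000]
center, spread = fit_robust_scaler(train_rents)
scaled = robust_scale(train_rents, center, spread)
```

The outlier barely affects `center` and `spread`, whereas a mean and standard deviation would be pulled strongly toward it. The statistics should be fitted on the training split and reused unchanged for validation and inference.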
