A Reproducible Workflow for Multi-Source Firm-Level Statistics
This project demonstrates how to integrate heterogeneous business datasets into a coherent analytical framework for economic statistics. It reflects challenges commonly encountered in official statistics (e.g., Destatis, Eurostat): inconsistent identifiers, divergent variable definitions, missing values, measurement errors, and discrepancies across data sources.
The project uses fully synthetic data to illustrate transparent, reproducible methods for:
- data generation
- cleaning & validation
- harmonization across sources
- statistical integration
- computation of structural indicators
- visualization of economic patterns
All code is implemented in R using a modular pipeline suitable for production-like environments.
Modern economic statistics increasingly rely on multi-source integration: administrative business registers, structural business surveys, short-term indicators, and accounting extracts.
Such sources differ in:
- reporting frequency
- timeliness
- variable definitions
- quality and completeness
- industry and regional classification detail
This project provides a compact but realistic framework to:
- Generate synthetic firm-level source datasets
- Clean and validate each dataset
- Link identifiers and harmonize classifications
- Integrate monthly panels across sources
- Compute economic indicators at firm, sector, and regional level
- Produce reproducible tables and visualizations
business-data-integration/
├── data
│   ├── clean          # cleaned intermediate data
│   ├── processed      # unified firm-level panel (analysis-ready)
│   └── raw            # synthetic raw datasets (generated)
├── LICENSE
├── output
│   ├── figures        # visualizations
│   └── tables         # aggregated indicators
├── R
│   ├── 01_generate_synthetic_data.R
│   ├── 02_clean_and_validate_data.R
│   ├── 03_integrate_sources.R
│   ├── 04_compute_indicators.R
│   └── 05_visualize_results.R
├── README.md
├── renv
│   ├── activate.R
│   ├── library
│   ├── settings.json
│   └── staging
└── renv.lock
Reproducibility: The project uses renv for a full dependency snapshot.
Reproducibility With renv
This project uses renv to ensure that anyone who clones the repository obtains exactly the same R package environment.
Before running the pipeline for the first time, start R inside the project directory and check the environment:
renv::status()
If packages need to be restored, run:
renv::restore()
This guarantees that all scripts operate identically across machines.
Three realistic (but fully artificial) firm-level datasets are generated:
Variables:
- firm_id
- region_code
- nace_code
- legal_form
- employees
- revenue_last_year
Intentionally includes:
- missing values
- negative values
- inconsistent reporting patterns
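A minimal sketch of how such a register could be generated in base R. The number of firms, the region codes, the legal forms, and the distribution parameters are assumptions chosen for illustration; only the variable names and NACE codes come from this README.

```r
set.seed(42)  # fixed seed so the synthetic data are reproducible

n_firms <- 200  # assumed size, for illustration only

register <- data.frame(
  firm_id           = sprintf("F%05d", seq_len(n_firms)),
  region_code       = sample(c("DE1", "DE2", "DE3"), n_firms, replace = TRUE),
  nace_code         = sample(c("G47", "I55", "I56", "C10", "C29", "H49"),
                             n_firms, replace = TRUE),
  legal_form        = sample(c("GmbH", "AG", "KG"), n_firms, replace = TRUE),
  employees         = rpois(n_firms, lambda = 25) + 1L,
  revenue_last_year = round(rlnorm(n_firms, meanlog = 13, sdlog = 1)),
  stringsAsFactors  = FALSE
)

# Inject defects on purpose: missing values and impossible negatives
na_rows  <- sample(n_firms, 10)
neg_rows <- sample(setdiff(seq_len(n_firms), na_rows), 5)
register$employees[na_rows] <- NA
register$revenue_last_year[neg_rows] <- -register$revenue_last_year[neg_rows]
```

Injecting the defects after drawing clean values keeps the number of problem records known in advance, which makes it easy to check later that the cleaning step catches all of them.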
Panel data for Jan–Dec 2023:
- firm_id
- month
- employees
- synthetic missingness for interpolation
- regional & sector attributes copied from the register
Industry-Specific Seasonal Patterns (Added Realism)
The monthly employment dataset includes sector-specific seasonal variation, reflecting realistic trends observed in economic statistics:
- Retail (G47): strong December activity
- Accommodation & Food (I55, I56): summer employment peaks
- Manufacturing (C10, C29): mild seasonal movement
- Transport (H49): steady with slight autumn increases
Seasonality is introduced using multiplicative adjustment factors, producing more realistic monthly employment curves.
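The multiplicative adjustment can be sketched as follows. The factor values and the helper name `apply_seasonality` are hypothetical and do not necessarily match what `01_generate_synthetic_data.R` uses; the sketch only shows the mechanism of scaling a base employment level by a month-specific factor.

```r
# Hypothetical multiplicative factors by month (1 = no seasonal effect)
seasonal_factors <- list(
  G47 = c(rep(1.00, 10), 1.05, 1.20),            # retail: December peak
  I55 = c(0.90, 0.90, 0.95, 1.00, 1.05, 1.15,
          1.20, 1.20, 1.05, 0.95, 0.90, 0.90)    # accommodation: summer peak
)

# Scale a firm's base employment by its sector's monthly factors
apply_seasonality <- function(base_employees, nace_code) {
  factors <- seasonal_factors[[nace_code]]
  if (is.null(factors)) factors <- rep(1, 12)    # sectors without a profile
  round(base_employees * factors)
}

retail <- apply_seasonality(40, "G47")
hotel  <- apply_seasonality(40, "I55")
```

With a base of 40 employees, the retail series stays flat until it rises in November and December, while the accommodation series peaks in July and August.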
Variables:
- firm_id
- month
- turnover
- missing values for interpolation
- moderate log-normal variability
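One way to produce a turnover panel with these properties is to multiply a firm-specific base level by log-normal noise and then blank out a few observations. The firm IDs, base levels, noise scale (`sdlog = 0.15`), and number of missing cells below are assumptions for illustration.

```r
set.seed(1)

firm_ids <- c("F00001", "F00002", "F00003")
months   <- seq(as.Date("2023-01-01"), by = "month", length.out = 12)

# One row per firm and month
turnover_panel <- data.frame(
  firm_id = rep(firm_ids, each = length(months)),
  month   = rep(months, times = length(firm_ids)),
  stringsAsFactors = FALSE
)

# Assumed base monthly turnover per firm
base_turnover <- c(F00001 = 50000, F00002 = 120000, F00003 = 8000)

# Multiplicative log-normal noise centred near 1
noise <- rlnorm(nrow(turnover_panel), meanlog = 0, sdlog = 0.15)
turnover_panel$turnover <-
  round(unname(base_turnover[turnover_panel$firm_id]) * noise, 2)

# Remove a few values so the cleaning step has something to interpolate
missing_idx <- sample(nrow(turnover_panel), size = 4)
turnover_panel$turnover[missing_idx] <- NA
```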
┌────────────────────────────┐
│ 01_generate_synthetic_data │
└──────────────┬─────────────┘
               ▼
┌────────────────────────────┐
│ 02_clean_and_validate_data │
└──────────────┬─────────────┘
               ▼
┌────────────────────────────┐
│ 03_integrate_sources       │
└──────────────┬─────────────┘
               ▼
┌────────────────────────────┐
│ 04_compute_indicators      │
└──────────────┬─────────────┘
               ▼
┌────────────────────────────┐
│ 05_visualize_results       │
└────────────────────────────┘
- structural key validation
- detection of inconsistencies (negative values, missingness)
- industry/region-based imputation
- time-series interpolation (approximation)
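The cleaning steps above can be sketched in base R. The helper names `clean_series` and `impute_group_median` are hypothetical, and the choices shown (treating negatives as missing, linear interpolation with `approx()`, group medians for imputation) are one plausible reading of the list, not necessarily the exact rules in `02_clean_and_validate_data.R`.

```r
# Treat negative values as invalid, then fill gaps by linear interpolation.
# rule = 2 carries the nearest observed value to leading/trailing gaps.
clean_series <- function(x) {
  x[!is.na(x) & x < 0] <- NA
  observed <- which(!is.na(x))
  if (length(observed) < 2) return(x)   # not enough points to interpolate
  approx(x = observed, y = x[observed],
         xout = seq_along(x), rule = 2)$y
}

# Replace missing values with the median of the same group
# (for example the same industry or region)
impute_group_median <- function(value, group) {
  group_median <- ave(value, group,
                      FUN = function(v) median(v, na.rm = TRUE))
  ifelse(is.na(value), group_median, value)
}

monthly_employees <- c(20, NA, 24, -5, 28, NA)
interpolated <- clean_series(monthly_employees)

register_employees <- c(10, NA, 30, 5, NA, 7)
sector <- c("A", "A", "A", "B", "B", "B")
imputed <- impute_group_median(register_employees, sector)
```

Interpolation suits the monthly panels, where neighbouring months carry information about the gap; group-based imputation suits the register, where each firm has a single record.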
- standardization of variable names
- controlled join logic between datasets
- date harmonization
- preparation of unified firm-month records
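A minimal sketch of these integration steps, using base R rather than the project's dplyr-based code. The toy input tables, the mixed column naming, and the helper `assert_unique_key` are assumptions made for illustration.

```r
# Two toy sources with inconsistent column names and date formats
employment <- data.frame(
  Firm_ID   = c("F00001", "F00001", "F00002"),
  Month     = c("2023-01", "2023-02", "2023-01"),
  Employees = c(20, 21, 8),
  stringsAsFactors = FALSE
)
turnover <- data.frame(
  firm_id  = c("F00001", "F00001", "F00002"),
  month    = as.Date(c("2023-01-15", "2023-02-10", "2023-02-28")),
  turnover = c(50000, 52000, 9000),
  stringsAsFactors = FALSE
)

# Standardize variable names
names(employment) <- tolower(names(employment))

# Harmonize dates: represent every month by its first day
employment$month <- as.Date(paste0(employment$month, "-01"))
turnover$month   <- as.Date(format(turnover$month, "%Y-%m-01"))

# Controlled join: refuse to merge if a firm-month key is duplicated
assert_unique_key <- function(df, keys) {
  if (anyDuplicated(df[keys]) > 0) {
    stop("duplicate keys found in: ", paste(keys, collapse = ", "))
  }
  invisible(df)
}
assert_unique_key(employment, c("firm_id", "month"))
assert_unique_key(turnover, c("firm_id", "month"))

# Full join keeps firm-months that appear in only one source
panel <- merge(employment, turnover,
               by = c("firm_id", "month"), all = TRUE)
```

Checking key uniqueness before the merge prevents a silent many-to-many join, which would otherwise duplicate firm-month records and inflate aggregates.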
For each firm × month:
- turnover precedence rules
- employment precedence rules
- consistency checks
- generation of firm-level indicators
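A precedence rule can be expressed as "take the first non-missing value in a fixed order of sources". The sketch below assumes that the monthly employment panel takes precedence over the register and that a relative gap above 20% is worth flagging; both the ordering and the threshold are illustrative assumptions, and `first_non_missing` is a hypothetical helper.

```r
# Return, element by element, the first non-missing value
# across the supplied vectors (highest-priority source first)
first_non_missing <- function(...) {
  candidates <- list(...)
  Reduce(function(preferred, fallback) {
    ifelse(is.na(preferred), fallback, preferred)
  }, candidates)
}

employees_monthly  <- c(20, NA, 35, NA)   # assumed preferred source
employees_register <- c(22, 18, 12, NA)   # assumed fallback source

employees_final <- first_non_missing(employees_monthly, employees_register)

# Consistency check: flag records where both sources report a value
# but disagree by more than 20% (threshold chosen for illustration)
relative_gap <- abs(employees_monthly - employees_register) /
  pmax(abs(employees_monthly), abs(employees_register))
inconsistent <- !is.na(relative_gap) & relative_gap > 0.2
```

Here the second record falls back to the register value, the fourth stays missing because no source reports it, and the third is flagged because the two sources disagree strongly.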
Computed at firm level:
- turnover YoY growth
- monthly employment growth
- labor productivity
- simple seasonal index
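Three of these firm-level indicators can be sketched for a single firm as follows. The toy values are invented, `lag1` is a hypothetical helper, and the seasonal index shown (monthly turnover relative to the firm's own mean) is only one simple definition. Year-over-year growth follows the same pattern as the monthly growth rate, with a 12-month lag in place of a 1-month lag.

```r
firm <- data.frame(
  month     = seq(as.Date("2023-01-01"), by = "month", length.out = 4),
  turnover  = c(100000, 110000, 121000, 90000),
  employees = c(10, 10, 11, 9)
)

# Shift a series by one period; the first value has no predecessor
lag1 <- function(x) c(NA, head(x, -1))

# Monthly employment growth: relative change on the previous month
firm$employment_growth <- firm$employees / lag1(firm$employees) - 1

# Labor productivity: turnover per employee
firm$labor_productivity <- firm$turnover / firm$employees

# Simple seasonal index: turnover relative to the firm's average month
firm$seasonal_index <- firm$turnover / mean(firm$turnover, na.rm = TRUE)
```

For a multi-firm panel the same calculations would be applied within each firm, with rows sorted by month first.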
Computed at sector/region level:
- total turnover
- average turnover per firm
- total employees
- productivity aggregates
- firm counts
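The sector and region aggregates can be sketched with base R's `aggregate()`. The toy panel and the output column names are assumptions; in a monthly panel, `month` would normally be added to the grouping columns.

```r
panel <- data.frame(
  firm_id     = c("F1", "F2", "F3", "F4"),
  nace_code   = c("G47", "G47", "I55", "I55"),
  region_code = c("DE1", "DE1", "DE1", "DE2"),
  turnover    = c(100, 300, 200, 400),
  employees   = c(2, 6, 4, 5),
  stringsAsFactors = FALSE
)

by_cols <- panel[c("nace_code", "region_code")]

# Totals per sector and region
totals <- aggregate(panel[c("turnover", "employees")],
                    by = by_cols, FUN = sum)
names(totals)[3:4] <- c("total_turnover", "total_employees")

# Number of firm records per sector and region
counts <- aggregate(list(firm_count = panel$firm_id),
                    by = by_cols, FUN = length)

indicators <- merge(totals, counts, by = c("nace_code", "region_code"))

# Derived ratios are computed from the aggregates, not averaged over firms
indicators$avg_turnover_per_firm <-
  indicators$total_turnover / indicators$firm_count
indicators$productivity <-
  indicators$total_turnover / indicators$total_employees
```

Computing productivity as a ratio of totals weights each firm by its size; averaging firm-level productivity instead would give small firms the same influence as large ones.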
- firm size distribution
- sectoral turnover profiles
- monthly aggregate turnover trends
Visual outputs are stored in output/figures/.
- R
- dplyr, tidyr: reshaping, joins, aggregation
- lubridate: date manipulation
- ggplot2: visualization
- readr: data I/O
- janitor: column cleaning
- purrr: functional utilities
- renv for reproducibility
- Supports VS Code, RStudio, and command-line R
Before Running the Pipeline
Start R in the project root and ensure the correct environment is active:
renv::status()
If packages are missing:
renv::restore()
Then proceed with the pipeline steps below.
# 1. Generate synthetic data
source("R/01_generate_synthetic_data.R")
# 2. Clean & validate
source("R/02_clean_and_validate_data.R")
# 3. Integrate sources & build indicators
source("R/03_integrate_sources.R")
# 4. Compute sectoral and regional aggregates
source("R/04_compute_indicators.R")
# 5. Produce visualizations
source("R/05_visualize_results.R")
Outputs appear in:
- data/clean/
- data/processed/
- output/tables/
- output/figures/
Future enhancements might include:
- probabilistic record linkage
- multiple-year panels
- Monte Carlo simulations
- machine-learning-based imputation
- benchmarking algorithms for multi-source coherence (e.g. Denton, Chow-Lin)
- firm-level microdata anonymization techniques
MIT License
Golib Sanaev
Applied Data Scientist & Analyst | ML • Data Analysis • Forecasting • Python • SQL • Econometrics
GitHub: @gsanaev
Email: gsanaev80@gmail.com
LinkedIn: golib-sanaev
Sanaev, G. (2025). Business Data Integration & Economic Indicators Pipeline in R.
GitHub Repository: https://github.com/gsanaev/business-data-integration