Open datasets from Sparky for research and innovation. Curated data across healthcare, AI, and science that are anonymized, documented and ready for ML/AI applications.

SparkyScience/sparky-open-datasets

Sparky Open Datasets

Curated, research-grade datasets for advancing open science and data-driven innovation

License: CC BY 4.0 • Datasets • GDPR Compliant • EU Funded

Explore Datasets • Get Started • Contribute • Research Guide


🌟 Welcome

Welcome to Sparky Open Datasets. We are committed to advancing open science by providing high-quality, well-documented datasets. They let researchers, educators, and practitioners worldwide conduct reproducible research, explore market trends, develop analytical methods, and innovate with open data.

Our mission is to connect people, technology, and data to drive meaningful insights across industries, from sustainability to healthcare.


📊 Available Datasets

💼 ESG LinkedIn Jobs Dataset

Comprehensive ESG job market analysis from LinkedIn

Explore the evolving landscape of Environmental, Social, and Governance (ESG) careers across Europe. This dataset captures 3,190 unique job postings from August 2025 to January 2026, providing insights into sustainability careers, skill requirements, geographic trends, and industry adoption.

Key Features:

  • 🌍 Global coverage with Europe focus
  • 📊 5-month temporal data
  • 📝 Full job descriptions for NLP analysis
  • 🏢 2,827 unique companies

Quick Access:

📖 View Documentation →

Research Applications:

  • Job market trend forecasting
  • Skills gap analysis
  • Geographic ESG adoption patterns
  • Compensation benchmarking
  • Curriculum development

πŸ₯ HealthChain Sensor Data

Domain Size

Real-time health monitoring from medical-grade wearable sensors

High-frequency biosensor data from Sparky Health Sensors (Class IIa medical devices) deployed at KBC Rijeka Hospital. This dataset contains 183,736 measurements at 52 Hz from 3 sensors during a 55-patient clinical study, providing rich 9-axis IMU data for healthcare AI research.

Key Features:

  • πŸ₯ Real clinical environment
  • ⚑ 52 Hz sampling (high-frequency)
  • πŸ” GDPR compliant & fully anonymized
  • πŸ† Medical-grade certified sensors

Quick Access:

📖 View Documentation →

Research Applications:

  • Fall detection algorithms
  • Activity recognition (ML/DL)
  • Gait analysis & mobility assessment
  • Sleep quality monitoring
  • Sensor fusion techniques
  • Real-time patient monitoring
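
For a rough sense of scale, the figures quoted in the description can be turned into a recording duration. This is a back-of-the-envelope sketch only: it assumes each "measurement" is one sample row and that every sensor recorded continuously at the nominal 52 Hz, neither of which is confirmed here.

```python
# Figures taken from the dataset description; the interpretation is an assumption.
total_measurements = 183_736   # assumed to be one row per sample
sampling_rate_hz = 52          # nominal rate, assumed constant
n_sensors = 3

total_seconds = total_measurements / sampling_rate_hz
total_minutes = total_seconds / 60
per_sensor_minutes = total_minutes / n_sensors  # only if split evenly across sensors

print(f"Total recording time: {total_minutes:.1f} sensor-minutes")    # 58.9
print(f"Per sensor, if split evenly: {per_sensor_minutes:.1f} min")   # 19.6
```

Under these assumptions the dataset covers roughly an hour of combined sensor time, which is worth keeping in mind when planning train/test splits.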

🚀 Getting Started

Installation

# Clone the repository
git clone https://github.com/SparkyScience/sparky-open-datasets.git
cd sparky-open-datasets

# Install Python dependencies
pip install pandas numpy scipy scikit-learn matplotlib seaborn pyarrow
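
After installing, it can help to confirm that every dependency is importable before running the examples. The sketch below is one possible check, not part of the repository; the helper name `missing_packages` is hypothetical. Note that the pip name `scikit-learn` corresponds to the import name `sklearn`.

```python
from importlib.util import find_spec

# pip name -> import name (they differ for scikit-learn).
REQUIRED = {
    "pandas": "pandas",
    "numpy": "numpy",
    "scipy": "scipy",
    "scikit-learn": "sklearn",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "pyarrow": "pyarrow",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import module cannot be found."""
    return [pip_name for pip_name, module in required.items()
            if find_spec(module) is None]


missing = missing_packages()
if missing:
    print(f"Missing packages: {', '.join(missing)}")
else:
    print("All dependencies found.")
```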

Quick Start Examples

📊 ESG Job Market Analysis

import pandas as pd
import matplotlib.pyplot as plt

# Load ESG jobs dataset
df = pd.read_parquet('datasets/esg-linkedin-jobs/esg_linkedin_jobs.parquet')

# Analyze job trends
print(f"Total ESG Jobs: {len(df):,}")
print(f"Average text length: {df['text'].str.len().mean():.0f} characters")

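# Extract and visualize top locations from full_header.
# regex=False matters: a multi-character pattern is otherwise treated as a regex,
# and ' | ' as a regex means "space OR space", which splits on every space.
df['location'] = df['full_header'].str.split(' | ', regex=False).str[0]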
top_locations = df['location'].value_counts().head(10)

# Create visualization
fig, ax = plt.subplots(figsize=(10, 6))
top_locations.plot(kind='barh', ax=ax, color='#2E7D32')
ax.set_title('Top 10 Locations for ESG Jobs', fontsize=16, fontweight='bold')
ax.set_xlabel('Number of Job Postings')
plt.tight_layout()
plt.show()

# Skills analysis: count postings whose text mentions each keyword
# (a case-insensitive substring match, so this is a rough indicator only)
skills = ['Python', 'sustainability', 'ESG', 'reporting', 'analysis', 'climate']
for skill in skills:
    count = df['text'].str.contains(skill, case=False, na=False, regex=False).sum()
    print(f"Jobs mentioning '{skill}': {count} ({count/len(df)*100:.1f}%)")
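
Splitting on a multi-character delimiter such as `' | '` deserves care, because `Series.str.split` treats any pattern longer than one character as a regular expression unless `regex=False` is passed. The sketch below illustrates the difference on made-up header strings; the real `full_header` layout may differ.

```python
import pandas as pd

# Made-up headers for illustration; not taken from the dataset.
headers = pd.Series([
    "Zagreb, Croatia | Full-time | Hybrid",
    "Berlin, Germany | Contract",
])

# Default behaviour: ' | ' is read as the regex "space OR space",
# so the string is split on every single space.
as_regex = headers.str.split(" | ").str[0]

# Literal split on the three-character delimiter.
as_literal = headers.str.split(" | ", regex=False).str[0]

print(as_regex.tolist())    # ['Zagreb,', 'Berlin,']
print(as_literal.tolist())  # ['Zagreb, Croatia', 'Berlin, Germany']
```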
πŸ₯ Healthcare Sensor Analysis
import pandas as pd
import numpy as np

# Load HealthChain sensor data (Parquet is 6x faster!)
df = pd.read_parquet('datasets/healthchain-sensor-data/healthchain_sensor_data.parquet')

# Basic statistics
print(f"Total Measurements: {len(df):,}")
print(f"Unique Sensors: {df['sensor_id'].nunique()}")
print(f"Date Range: {df['datetime'].min()} to {df['datetime'].max()}")
print(f"Sampling Rate: ~52 Hz")

# Calculate movement intensity
df['movement'] = np.sqrt(df['AccX']**2 + df['AccY']**2 + df['AccZ']**2)
df['rotation'] = np.sqrt(df['GyroX']**2 + df['GyroY']**2 + df['GyroZ']**2)

print(f"\nMovement Statistics:")
print(f"  Average: {df['movement'].mean():.2f} m/s²")
print(f"  Max: {df['movement'].max():.2f} m/s²")

# Detect high-activity events (potential falls or rapid movements)
high_activity = df[df['movement'] > 15]  # > 1.5g
print(f"\nHigh Activity Events: {len(high_activity):,}")

# Analyze by sensor
sensor_stats = df.groupby('sensor_id').agg({
    'movement': ['mean', 'std', 'max'],
    'datetime': 'count'
}).round(2)
print(f"\nPer-Sensor Statistics:")
print(sensor_stats)
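
The example above prints the nominal sampling rate rather than measuring it. One way to check the rate per sensor is to look at the median gap between consecutive timestamps. This is a sketch under assumptions: it presumes `datetime` can be parsed by `pd.to_datetime`, the helper name `estimate_sampling_rate` is hypothetical, and the demo uses synthetic 50 Hz data rather than the real file.

```python
import pandas as pd


def estimate_sampling_rate(df, time_col="datetime", group_col="sensor_id"):
    """Estimate the sampling rate (Hz) per sensor from the median timestamp gap."""
    times = pd.to_datetime(df[time_col])
    ordered = df.assign(**{time_col: times}).sort_values([group_col, time_col])
    # Seconds between consecutive samples of the same sensor (NaN for each first row).
    dt = ordered.groupby(group_col)[time_col].diff().dt.total_seconds()
    median_dt = dt.groupby(ordered[group_col]).median()
    return 1.0 / median_dt


# Synthetic stand-in: two sensors sampled every 20 ms (50 Hz). Not real data.
t0 = pd.Timestamp("2025-07-01 12:00:00")
stamps = [t0 + pd.Timedelta(milliseconds=20 * i) for i in range(100)]
demo = pd.DataFrame({
    "sensor_id": ["A"] * 100 + ["B"] * 100,
    "datetime": stamps * 2,
})

rates = estimate_sampling_rate(demo)
print(rates.round(1))
```

Using the median rather than the mean keeps the estimate stable when a recording contains occasional gaps.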

🔬 Advanced: Fall Detection Model

import pandas as pd
import numpy as np
from scipy import signal
from sklearn.ensemble import RandomForestClassifier

# Load data
df = pd.read_parquet('datasets/healthchain-sensor-data/healthchain_sensor_data.parquet')

# Feature engineering for fall detection
df['acc_magnitude'] = np.sqrt(df['AccX']**2 + df['AccY']**2 + df['AccZ']**2)
df['gyro_magnitude'] = np.sqrt(df['GyroX']**2 + df['GyroY']**2 + df['GyroZ']**2)

# Detect sudden changes (potential fall indicators)
df['acc_change'] = df['acc_magnitude'].diff().abs()
df['gyro_change'] = df['gyro_magnitude'].diff().abs()

# Identify high-impact events
fall_candidates = df[
    (df['acc_change'] > 5) |  # Sudden acceleration change
    (df['gyro_change'] > 50) |  # Sudden rotation
    (df['acc_magnitude'] > 20)  # High impact
]

print(f"Potential fall events detected: {len(fall_candidates)}")

# Extract features for ML (window-based)
def extract_features(window):
    return {
        'acc_mean': window['acc_magnitude'].mean(),
        'acc_std': window['acc_magnitude'].std(),
        'acc_max': window['acc_magnitude'].max(),
        'gyro_mean': window['gyro_magnitude'].mean(),
        'gyro_std': window['gyro_magnitude'].std(),
        'acc_peak_count': len(signal.find_peaks(window['acc_magnitude'], height=15)[0])
    }

# Example: extract features from non-overlapping windows over roughly the first 10,000 samples
window_size = 52 * 2  # 2 seconds at 52 Hz
features = []
for i in range(0, min(10000, len(df) - window_size), window_size):
    window = df.iloc[i:i+window_size]
    features.append(extract_features(window))

feature_df = pd.DataFrame(features)
print(f"\nExtracted {len(feature_df)} feature windows")
print(feature_df.head())
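
The loop above slices the whole table by row position, so a window can straddle two sensors if the file stores their rows back to back. A possible refinement is to window each sensor separately. The sketch below assumes the `sensor_id` and `datetime` columns used earlier; the helper `iter_windows` is hypothetical and the demo data is synthetic.

```python
import numpy as np
import pandas as pd


def iter_windows(df, window_size, group_col="sensor_id", time_col="datetime"):
    """Yield (sensor_id, window) pairs of non-overlapping, full-length windows."""
    for sensor_id, group in df.groupby(group_col, sort=True):
        group = group.sort_values(time_col)
        # Drop the trailing partial window so every window has the same length.
        n_full = len(group) // window_size
        for k in range(n_full):
            yield sensor_id, group.iloc[k * window_size:(k + 1) * window_size]


# Synthetic stand-in data: two sensors, 250 samples each. Not real measurements.
rng = np.random.default_rng(0)
n = 250
demo = pd.DataFrame({
    "sensor_id": ["A"] * n + ["B"] * n,
    "datetime": list(pd.date_range("2025-07-01", periods=n, freq="20ms")) * 2,
    "acc_magnitude": rng.normal(9.81, 0.5, size=2 * n),
})

window_size = 52 * 2  # 2 seconds at the nominal 52 Hz
per_sensor_features = pd.DataFrame([
    {
        "sensor_id": sensor_id,
        "acc_mean": window["acc_magnitude"].mean(),
        "acc_std": window["acc_magnitude"].std(),
    }
    for sensor_id, window in iter_windows(demo, window_size)
])
print(per_sensor_features)
```

With 250 samples per sensor and 104-sample windows, each sensor contributes two full windows and the remaining 42 samples are discarded.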

📂 Repository Structure

sparky-open-datasets/
│
├── 📊 datasets/
│   ├── esg-linkedin-jobs/           # ESG job market data
│   │   ├── README.md                # Comprehensive documentation
│   │   ├── data_dictionary.md       # Field definitions
│   │   └── *.json / *.csv / *.parquet
│   │
│   └── healthchain-sensor-data/     # Real-time health sensors
│       ├── README.md                # Comprehensive documentation
│       ├── data_dictionary.md       # Field definitions
│       ├── *.csv / *.parquet / *.json
│       ├── by_sensor/               # Per-sensor data files
│       └── sensor_mapping.json
│
├── 📚 docs/
│   └── RESEARCH_GUIDE.md            # Comprehensive research guide
│
├── 🎨 .github/
│   └── assets/                      # Logos and images
│
├── 📄 README.md                     # This file
├── 📜 LICENSE                       # CC BY 4.0 + MIT
└── 🤝 CONTRIBUTING.md               # Contribution guidelines

Each dataset includes:

  • 📄 README.md – Comprehensive documentation
  • 📖 Data Dictionary – Schema and field descriptions
  • 📊 Multiple Formats – JSON, CSV, Parquet
  • 📈 Statistics – Dataset summary and quality metrics
  • 🐍 Code Examples – Python usage snippets
  • 📜 Citation – BibTeX and APA formats
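
Because each dataset ships in several formats, a small loader that picks the pandas reader from the file extension can keep analysis code format-agnostic. This is a minimal sketch: `load_dataset` is a hypothetical helper, and the JSON branch assumes a layout that `pd.read_json` can infer (for example a list of records), which is not confirmed for these files.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Extension -> pandas reader. The JSON layout is an assumption.
_READERS = {
    ".parquet": pd.read_parquet,
    ".csv": pd.read_csv,
    ".json": pd.read_json,
}


def load_dataset(path):
    """Load a dataset file, choosing the pandas reader from its extension."""
    path = Path(path)
    try:
        reader = _READERS[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"Unsupported file type: {path.suffix!r}") from None
    return reader(path)


# Usage with a throwaway CSV file, so the sketch runs without the real data.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    pd.DataFrame({"a": [1, 2]}).to_csv(sample, index=False)
    print(load_dataset(sample).shape)  # (2, 1)
```

Parquet is generally the most convenient of the three for analysis because it preserves column types, whereas CSV and JSON may need dtype fixes after loading.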

🎯 Use Cases

For Researchers

  • Conduct reproducible research with clean, validated data
  • Access real-world clinical and job market datasets
  • Publish findings with proper citations
  • Contribute to open science initiatives

For Educators

  • Build course materials and assignments
  • Teach data science and ML with real datasets
  • Demonstrate best practices in data handling
  • Inspire students with impactful research questions

For Practitioners

  • Benchmark algorithms against standardized data
  • Prototype healthcare AI solutions
  • Analyze job market trends for workforce planning
  • Develop data-driven business insights

For Students

  • Learn data analysis with real-world datasets
  • Complete thesis and capstone projects
  • Build portfolio projects
  • Gain hands-on ML/AI experience

📚 Citation

If you use these datasets in your research, please cite appropriately:

ESG LinkedIn Jobs Dataset:

@dataset{sparky_esg_2026,
  author = {Sparky* (Sparky solution d.o.o.)},
  title = {ESG LinkedIn Jobs Dataset},
  year = {2026},
  month = {January},
  publisher = {GitHub},
  version = {1.0},
  url = {https://github.com/SparkyScience/sparky-open-datasets/tree/main/datasets/esg-linkedin-jobs},
  note = {3,190 ESG job postings, August 2025 - January 2026}
}

HealthChain Sensor Dataset:

@dataset{sparky_healthchain_2026,
  author = {Sparky* (Sparky solution d.o.o.)},
  title = {HealthChain Real-Time Health Sensor Dataset},
  year = {2026},
  month = {January},
  publisher = {GitHub},
  version = {1.0},
  url = {https://github.com/SparkyScience/sparky-open-datasets/tree/main/datasets/healthchain-sensor-data},
  note = {IMU sensor measurements from 3 Sparky Health Sensors, July 2025}
}

🤝 Contributing

We welcome contributions from the community. You can:

  • 🐛 Report data quality issues
  • 📖 Improve documentation
  • 💡 Suggest new datasets
  • 🔧 Enhance processing scripts
  • 🌍 Translate documentation

Please read our Contributing Guidelines to get started.

Stay tuned! Follow this repository for updates.


📜 License

Datasets: Licensed under Creative Commons Attribution 4.0 International (CC BY 4.0)

  • ✅ Free to share and adapt
  • ✅ Commercial use allowed
  • ⚠️ Attribution required

Code & Scripts: Licensed under MIT License

  • ✅ Free to use, modify, and distribute

📞 Contact & Support

Questions? Ideas? Found an issue?

Issues • Discussions • Website


⭐ Star this repository if you find it useful!

Made with ❤️ by Sparky*
Connecting people, technology, and data for a better future
