Curated, research-grade datasets for advancing open science and data-driven innovation
Explore Datasets • Get Started • Contribute • Research Guide
Welcome to Sparky Open Datasets. We are committed to advancing open science by providing high-quality, well-documented datasets that help researchers, educators, and practitioners worldwide conduct reproducible research, explore market trends, develop analytical methods, and innovate with open data.
Our mission is to connect people, technology, and data to drive meaningful insights across industries, from sustainability to healthcare.
Comprehensive ESG job market analysis from LinkedIn
Explore the evolving landscape of Environmental, Social, and Governance (ESG) careers across Europe. This dataset captures 3,190 unique job postings from August 2025 to January 2026, providing insights into sustainability careers, skill requirements, geographic trends, and industry adoption.
Real-time health monitoring from medical-grade wearable sensors
High-frequency biosensor data from Sparky Health Sensors (Class IIa medical devices) deployed at KBC Rijeka Hospital. This dataset contains 183,736 measurements sampled at 52 Hz from 3 sensors during a 55-patient clinical study, providing 9-axis IMU data for healthcare AI research.
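As a quick sanity check on the figures above, the sample count and sampling rate imply a total recording time. The sketch below is back-of-the-envelope arithmetic only: it assumes continuous 52 Hz sampling with no gaps, and the even split across the 3 sensors is a hypothetical simplification that the dataset description does not state.

```python
# Figures quoted in the dataset description
TOTAL_MEASUREMENTS = 183_736
SAMPLING_RATE_HZ = 52
N_SENSORS = 3


def recording_minutes(n_samples: int, rate_hz: float) -> float:
    """Duration in minutes covered by n_samples at a constant sampling rate."""
    return n_samples / rate_hz / 60


total_min = recording_minutes(TOTAL_MEASUREMENTS, SAMPLING_RATE_HZ)
# Assumption: measurements are spread evenly across the sensors
per_sensor_min = total_min / N_SENSORS

print(f"Combined sensor-time: {total_min:.1f} min")
print(f"Per sensor (if evenly split): {per_sensor_min:.1f} min")
```

Under these assumptions the dataset covers a little under an hour of combined sensor-time, which is worth keeping in mind when sizing train/test splits.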
# Clone the repository
git clone https://github.com/SparkyScience/sparky-open-datasets.git
cd sparky-open-datasets
# Install Python dependencies
pip install pandas numpy scipy scikit-learn matplotlib seaborn pyarrow

ESG Job Market Analysis
import pandas as pd
import matplotlib.pyplot as plt
# Load ESG jobs dataset
df = pd.read_parquet('datasets/esg-linkedin-jobs/esg_linkedin_jobs.parquet')
# Analyze job trends
print(f"Total ESG Jobs: {len(df):,}")
print(f"Average text length: {df['text'].str.len().mean():.0f} characters")
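# Extract and visualize top locations from full_header.
# regex=False (pandas >= 1.4) splits on the literal ' | '; as a regex it would split on single spaces.
df['location'] = df['full_header'].str.split(' | ', regex=False).str[0]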
top_locations = df['location'].value_counts().head(10)
# Create visualization
fig, ax = plt.subplots(figsize=(10, 6))
top_locations.plot(kind='barh', ax=ax, color='#2E7D32')
ax.set_title('Top 10 Locations for ESG Jobs', fontsize=16, fontweight='bold')
ax.set_xlabel('Number of Job Postings')
plt.tight_layout()
plt.show()
# Skills analysis from job descriptions (case-insensitive substring match)
skills = ['Python', 'sustainability', 'ESG', 'reporting', 'analysis', 'climate']
for skill in skills:
    count = df['text'].str.contains(skill, case=False, na=False).sum()
    print(f"Jobs mentioning '{skill}': {count} ({count/len(df)*100:.1f}%)")

Healthcare Sensor Analysis
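The sensor description quotes a nominal 52 Hz sampling rate. One way to check that against the timestamps is to take the median gap between consecutive samples for each sensor. This is a minimal sketch on synthetic data: `estimate_sampling_rate` is a hypothetical helper, the sensor labels and start time are made up, and it assumes the `sensor_id` and `datetime` columns used elsewhere in this README, with `datetime` already parsed as timestamps (apply `pd.to_datetime` first if it is stored as text).

```python
import numpy as np
import pandas as pd


def estimate_sampling_rate(df: pd.DataFrame) -> pd.Series:
    """Estimate the sampling rate (Hz) per sensor from the median timestamp gap."""
    def rate(ts: pd.Series) -> float:
        gaps = ts.sort_values().diff().dt.total_seconds().dropna()
        return 1.0 / gaps.median()

    return df.groupby('sensor_id')['datetime'].apply(rate)


# Synthetic stand-in for the real data: two hypothetical sensors sampled at 52 Hz
n = 520
start = pd.Timestamp('2025-07-01 12:00:00')
timestamps = list(start + pd.to_timedelta(np.arange(n) / 52, unit='s'))
demo = pd.DataFrame({
    'sensor_id': ['A'] * n + ['B'] * n,
    'datetime': timestamps * 2,
})

rates = estimate_sampling_rate(demo)
print(rates.round(1))
```

Using the median rather than the mean keeps the estimate stable when a recording contains occasional long gaps, for example when a sensor is taken off and reattached.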
import pandas as pd
import numpy as np
# Load HealthChain sensor data (Parquet is 6x faster!)
df = pd.read_parquet('datasets/healthchain-sensor-data/healthchain_sensor_data.parquet')
# Basic statistics
print(f"Total Measurements: {len(df):,}")
print(f"Unique Sensors: {df['sensor_id'].nunique()}")
print(f"Date Range: {df['datetime'].min()} to {df['datetime'].max()}")
print(f"Sampling Rate: ~52 Hz")
# Calculate movement intensity
df['movement'] = np.sqrt(df['AccX']**2 + df['AccY']**2 + df['AccZ']**2)
df['rotation'] = np.sqrt(df['GyroX']**2 + df['GyroY']**2 + df['GyroZ']**2)
print(f"\nMovement Statistics:")
print(f"  Average: {df['movement'].mean():.2f} m/s²")
print(f"  Max: {df['movement'].max():.2f} m/s²")
# Detect high-activity events (potential falls or rapid movements)
high_activity = df[df['movement'] > 15] # > 1.5g
print(f"\nHigh Activity Events: {len(high_activity):,}")
# Analyze by sensor
sensor_stats = df.groupby('sensor_id').agg({
    'movement': ['mean', 'std', 'max'],
    'datetime': 'count'
}).round(2)
print(f"\nPer-Sensor Statistics:")
print(sensor_stats)

Advanced: Fall Detection Model
import pandas as pd
import numpy as np
from scipy import signal
from sklearn.ensemble import RandomForestClassifier
# Load data
df = pd.read_parquet('datasets/healthchain-sensor-data/healthchain_sensor_data.parquet')
# Feature engineering for fall detection
df['acc_magnitude'] = np.sqrt(df['AccX']**2 + df['AccY']**2 + df['AccZ']**2)
df['gyro_magnitude'] = np.sqrt(df['GyroX']**2 + df['GyroY']**2 + df['GyroZ']**2)
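# Detect sudden changes (potential fall indicators).
# Differences are taken per sensor, in time order, so they never span two devices.
df = df.sort_values(['sensor_id', 'datetime'])
df['acc_change'] = df.groupby('sensor_id')['acc_magnitude'].diff().abs()
df['gyro_change'] = df.groupby('sensor_id')['gyro_magnitude'].diff().abs()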
# Identify high-impact events
fall_candidates = df[
    (df['acc_change'] > 5) |       # Sudden acceleration change
    (df['gyro_change'] > 50) |     # Sudden rotation
    (df['acc_magnitude'] > 20)     # High impact
]
print(f"Potential fall events detected: {len(fall_candidates)}")
# Extract features for ML (window-based)
def extract_features(window):
    return {
        'acc_mean': window['acc_magnitude'].mean(),
        'acc_std': window['acc_magnitude'].std(),
        'acc_max': window['acc_magnitude'].max(),
        'gyro_mean': window['gyro_magnitude'].mean(),
        'gyro_std': window['gyro_magnitude'].std(),
        'acc_peak_count': len(signal.find_peaks(window['acc_magnitude'], height=15)[0])
    }

# Example: extract features from non-overlapping windows over roughly the first 10,000 samples
window_size = 52 * 2  # 2 seconds at 52 Hz
features = []
for i in range(0, min(10000, len(df) - window_size), window_size):
    window = df.iloc[i:i+window_size]
    features.append(extract_features(window))
feature_df = pd.DataFrame(features)
print(f"\nExtracted {len(feature_df)} feature windows")
print(feature_df.head())

sparky-open-datasets/
│
├── datasets/
│   ├── esg-linkedin-jobs/           # ESG job market data
│   │   ├── README.md                # Comprehensive documentation
│   │   ├── data_dictionary.md       # Field definitions
│   │   └── *.json / *.csv / *.parquet
│   │
│   └── healthchain-sensor-data/     # Real-time health sensors
│       ├── README.md                # Comprehensive documentation
│       ├── data_dictionary.md       # Field definitions
│       ├── *.csv / *.parquet / *.json
│       ├── by_sensor/               # Per-sensor data files
│       └── sensor_mapping.json
│
├── docs/
│   └── RESEARCH_GUIDE.md            # Comprehensive research guide
│
├── .github/
│   └── assets/                      # Logos and images
│
├── README.md                        # This file
├── LICENSE                          # CC BY 4.0 + MIT
└── CONTRIBUTING.md                  # Contribution guidelines

Each dataset includes:
- README.md: Comprehensive documentation
- Data Dictionary: Schema and field descriptions
- Multiple Formats: JSON, CSV, Parquet
- Statistics: Dataset summary and quality metrics
- Code Examples: Python usage snippets
- Citation: BibTeX and APA formats
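
Because each dataset ships in several formats, a small loader that picks the pandas reader from the file suffix can keep analysis code independent of the format in use. The following is a sketch under assumptions: `load_table` and `READERS` are hypothetical names, the demo columns are not the real schema, and the JSON files are assumed to be plain arrays of records (JSON Lines files would need `lines=True`).

```python
import tempfile
from pathlib import Path

import pandas as pd

# Map file suffix to a pandas reader. Reading Parquet needs an engine such as pyarrow.
READERS = {
    '.csv': pd.read_csv,
    '.json': pd.read_json,
    '.parquet': pd.read_parquet,
}


def load_table(path) -> pd.DataFrame:
    """Load a table with the pandas reader that matches the file suffix."""
    path = Path(path)
    try:
        reader = READERS[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"Unsupported format: {path.suffix!r}") from None
    return reader(path)


# Demo with a throwaway CSV/JSON pair (hypothetical columns, not the real schema)
sample = pd.DataFrame({'id': [1, 2], 'text': ['a', 'b']})
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / 'sample.csv'
    json_path = Path(tmp) / 'sample.json'
    sample.to_csv(csv_path, index=False)
    sample.to_json(json_path, orient='records')
    from_csv = load_table(csv_path)
    from_json = load_table(json_path)

print(from_csv)
print(from_json)
```

With the repository cloned, the same helper could be pointed at a path such as `datasets/esg-linkedin-jobs/esg_linkedin_jobs.parquet`.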

For researchers:
- Conduct reproducible research with clean, validated data
- Access real-world clinical and job market datasets
- Publish findings with proper citations
- Contribute to open science initiatives

For educators:
- Build course materials and assignments
- Teach data science and ML with real datasets
- Demonstrate best practices in data handling
- Inspire students with impactful research questions

For industry practitioners:
- Benchmark algorithms against standardized data
- Prototype healthcare AI solutions
- Analyze job market trends for workforce planning
- Develop data-driven business insights

For students:
- Learn data analysis with real-world datasets
- Complete thesis and capstone projects
- Build portfolio projects
- Gain hands-on ML/AI experience

If you use these datasets in your research, please cite appropriately:
ESG LinkedIn Jobs Dataset:
@dataset{sparky_esg_2026,
  author    = {Sparky* (Sparky solution d.o.o.)},
  title     = {ESG LinkedIn Jobs Dataset},
  year      = {2026},
  month     = {January},
  publisher = {GitHub},
  version   = {1.0},
  url       = {https://github.com/SparkyScience/sparky-open-datasets/tree/main/datasets/esg-linkedin-jobs},
  note      = {3,190 ESG job postings, August 2025 - January 2026}
}

HealthChain Sensor Dataset:
@dataset{sparky_healthchain_2026,
  author    = {Sparky* (Sparky solution d.o.o.)},
  title     = {HealthChain Real-Time Health Sensor Dataset},
  year      = {2026},
  month     = {January},
  publisher = {GitHub},
  version   = {1.0},
  url       = {https://github.com/SparkyScience/sparky-open-datasets/tree/main/datasets/healthchain-sensor-data},
  note      = {IMU sensor measurements from 3 Sparky Health Sensors, July 2025}
}

We welcome contributions from the community, whether you want to:
- Report data quality issues
- Improve documentation
- Suggest new datasets
- Enhance processing scripts
- Translate documentation

Please read our Contributing Guidelines to get started.
Stay tuned! Follow this repository for updates.
Datasets: Licensed under Creative Commons Attribution 4.0 International (CC BY 4.0)
- Free to share and adapt
- Commercial use allowed
- Attribution required
Code & Scripts: Licensed under MIT License
- Free to use, modify, and distribute
- Bug Reports: Open an issue
- Questions: Start a discussion
- Email: info@sparky.science
- Website: sparky.science
Star this repository if you find it useful!
Made with ❤️ by Sparky*
Connecting people, technology, and data for a better future