Originally developed by Arindham Krishna
GitHub: arindham-codes-cmd
This project explores solar energy forecasting using time series models such as ARIMA and SARIMA. We analyse generation and weather sensor data from two solar plants. The goal is to predict daily energy output and support smarter grid management, load balancing, and consumer awareness.
The dataset includes DC/AC output power readings, ambient and module temperature, irradiation, and cumulative yield, merged and modeled to simulate real-world forecasting scenarios. This project also benchmarks model performance for academic and portfolio purposes, with plans to visualize insights through a Power BI dashboard for business-ready storytelling.
Solar energy production is highly sensitive to weather conditions like cloud cover, temperature, and irradiation. These can cause unpredictable fluctuations in daily output. For solar companies managing multiple plants, this variability makes it difficult to plan grid operations, balance loads, and inform consumers about expected energy availability.
This project simulates a real-world scenario where a solar company operates two plants and wants to:
- Forecast the next 7 days of energy generation based on weather data
- Compare operational performance between plants
- Benchmark forecasting models for future deployment
By building predictive models and visualizing insights, the project supports better operational planning and showcases data-driven decision making for both academic and portfolio purposes.
This project was executed following a structured, five-phase blueprint designed to take raw solar data all the way to a dashboard-ready forecasting solution:
- Imported generation and weather data for both plants
- Cleaned nulls, duplicates, and outliers
- Converted timestamps and extracted HOUR, DAY, MONTH features
- Merged weather + generation data on timestamp and source key
Outcome: Two clean datasets—plant1_merged.csv and plant2_merged.csv
Investigated how weather variables influence energy generation
Key visualizations included:
- IRRADIATION vs DC_POWER correlation
- Daily Yield trends over time
- Hourly generation patterns (peak hours)
- Plant-wise generation comparison
- Inverter efficiency: DC to AC conversion
Outcome: Visual insights into solar plant behavior and performance drivers
- Resampled data to daily yield
- Applied seasonal decomposition and stationarity checks
- Built ARIMA and SARIMA models for each plant
- Forecasted next 7 days and compared with regression models
Outcome: Forecast comparison charts and residual analysis
- Developed an interactive dashboard for decision-makers
- Used Python output CSVs for actuals and predictions
- Visuals included forecast trends, plant comparisons, and model metrics
Outcome: Interview-ready dashboard showcasing forecasting logic and insights
- Structured findings into a professional README
Outcome: A complete data-to-dashboard story, ready for portfolio and interviews
The project uses real-world solar energy data collected over 34 days from two solar power plants in India. Each plant provides two datasets:
- Generation Data: Captured at the inverter level, includes DC/AC power, daily yield, and total yield
- Weather Sensor Data: Captured at the plant level, includes ambient temperature, module temperature, and solar irradiation
Raw Data Folder (original CSV files):
- Plant_1_Generation_Data.csv
- Plant_1_Weather_Sensor_Data.csv
- Plant_2_Generation_Data.csv
- Plant_2_Weather_Sensor_Data.csv
Merged Data Folder (cleaned and merged datasets for analysis):
- plant1_merged.csv
- plant2_merged.csv
Dataset Structure: each file includes timestamped entries with the following key columns.
Generation data:
- DATE_TIME: Timestamp of reading
- PLANT_ID: Unique plant identifier
- SOURCE_KEY: Inverter ID
- DC_POWER, AC_POWER: Power readings (units: kW)
- DAILY_YIELD, TOTAL_YIELD: Energy produced (units: kWh)
Weather sensor data:
- DATE_TIME: Timestamp of reading
- PLANT_ID: Unique plant identifier
- SOURCE_KEY: Sensor ID
- AMBIENT_TEMPERATURE, MODULE_TEMPERATURE: Temperature readings (units: degrees Celsius)
- IRRADIATION: Solar irradiance level (units: kW/m²)
These datasets were merged on DATE_TIME to create a unified view of generation and weather conditions for each plant.
This phase covers the first notebook: Exploratory Data Analysis EDA.ipynb, where we load, inspect, clean, and prepare the raw datasets for modeling.
We started by loading four CSVs, generation and weather data for both plants. To avoid repetitive code, we created a dictionary of datasets and used loops for inspection, reducing 4–5 lines into 2–3 clean ones. This approach improves readability, scalability, and keeps the notebook DRY (Don’t Repeat Yourself).
We looped through each dataset to check:
- Shape and sample rows
- Null values and duplicates
- Data types and structure via .info()
All datasets were clean, with no missing or duplicate rows. However, the DATE_TIME column was imported as object, which is expected when pandas reads timestamps from a CSV without date parsing.
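The dictionary-and-loop pattern described above could look roughly like the sketch below. The tiny in-memory frames stand in for the CSV files (in the notebook they would come from pd.read_csv), and the `inspect` helper name is an assumption made for illustration, not the notebook's actual code.

```python
import pandas as pd

# Tiny stand-in frames; in the notebook these would be loaded with pd.read_csv(...)
datasets = {
    "plant1_generation": pd.DataFrame({
        "DATE_TIME": ["15-05-2020 00:00", "15-05-2020 00:15"],
        "DC_POWER": [0.0, 0.0],
    }),
    "plant1_weather": pd.DataFrame({
        "DATE_TIME": ["2020-05-15 00:00:00", "2020-05-15 00:15:00"],
        "IRRADIATION": [0.0, 0.0],
    }),
}


def inspect(df: pd.DataFrame) -> dict:
    """Collect the basic inspection checks (hypothetical helper)."""
    return {
        "shape": df.shape,
        "nulls": int(df.isna().sum().sum()),
        "duplicates": int(df.duplicated().sum()),
    }


# One loop covers every dataset instead of repeating the checks per file
summary = {name: inspect(df) for name, df in datasets.items()}
for name, info in summary.items():
    print(name, info)
```

Adding a new file then only requires one more dictionary entry.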
Before converting DATE_TIME to datetime, we checked format consistency. We found that Plant 1 Generation used DD-MM-YYYY, while others used YYYY-MM-DD. So we used:
pd.to_datetime(df['DATE_TIME'], dayfirst=True, errors='coerce')
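We didn’t set a single format manually because the files do not share one, and a fixed format string caused parsing errors. With dayfirst=True, pandas reads ambiguous day/month values day-first, so the DD-MM-YYYY file is parsed correctly. For the YYYY-MM-DD files pandas emits a warning, because dayfirst=True conflicts with year-first strings. errors='coerce' converts any value that cannot be parsed to NaT instead of raising an error, so it is worth counting NaT values after conversion to confirm nothing was silently dropped.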
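An alternative to relying on inference is to try explicit formats one after another. The sketch below is not what the notebook does; the helper name `parse_timestamps` and the two format strings are assumptions chosen to match the DD-MM-YYYY and YYYY-MM-DD layouts described above.

```python
import pandas as pd


def parse_timestamps(series, formats):
    """Try each explicit format in turn; values no format matches stay NaT."""
    parsed = pd.to_datetime(series, format=formats[0], errors="coerce")
    for fmt in formats[1:]:
        # Only fill the positions that earlier formats could not parse
        parsed = parsed.fillna(pd.to_datetime(series, format=fmt, errors="coerce"))
    return parsed


# Assumed layouts: day-first without seconds, and ISO with seconds
formats = ["%d-%m-%Y %H:%M", "%Y-%m-%d %H:%M:%S"]

day_first = pd.Series(["15-05-2020 00:00", "15-05-2020 00:15"])
iso = pd.Series(["2020-05-15 00:00:00", "2020-05-15 00:15:00"])

parsed_day_first = parse_timestamps(day_first, formats)
parsed_iso = parse_timestamps(iso, formats)

# Counting NaT makes silent parse failures visible
print(parsed_day_first.isna().sum(), parsed_iso.isna().sum())
```

Because each format is explicit, no dayfirst warning is involved, and any leftover NaT points to a layout that was not anticipated.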
We explored key metrics:
- DC_POWER & AC_POWER: Plant 1 had slightly higher average values than Plant 2
- DAILY_YIELD: Median yield was ~2600 kWh for Plant 1 and ~2900 kWh for Plant 2
- IRRADIATION: Mean values hovered around 0.66–0.69, with some zero readings indicating nighttime or cloudy conditions
These insights helped us understand baseline performance and variability.
We applied IQR-based logic to detect outliers:
```python
# Simple IQR-based outlier detection
# `datasets` is the dictionary of DataFrames built during loading
for name, df in datasets.items():
    print(f"\n{name} - Outlier Summary:")
    # Exclude non-numeric columns
    num_df = df.select_dtypes(include='number')
    for col in num_df.columns:
        Q1 = num_df[col].quantile(0.25)
        Q3 = num_df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        # Count values outside the 1.5 * IQR fences
        outliers = ((num_df[col] < lower_bound) | (num_df[col] > upper_bound)).sum()
        if outliers > 0:
            print(f"  {col}: {outliers} potential outliers out of total {num_df.shape[0]} rows")
```
Findings:
- Plant 2 Generation had ~3000 outliers in DC/AC power
- Weather datasets had minimal outliers (1–2 values)
We chose not to remove outliers at this stage to preserve real-world variability. If model performance suggests they are a problem, we will cap them with .clip() and re-evaluate:
df[col] = df[col].clip(lower_bound, upper_bound)
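As a minimal sketch of that capping step, the helper below recomputes the IQR fences for one column and clips to them. The function name `clip_iqr` and the sample values are assumptions for illustration.

```python
import pandas as pd


def clip_iqr(series, k=1.5):
    """Cap values outside the k * IQR fences (hypothetical helper)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Made-up readings with one obvious spike
readings = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 100.0])
clipped = clip_iqr(readings)
print(clipped.tolist())
```

Clipping keeps the row count unchanged, which matters for a time series where dropping rows would create gaps.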
Before merging generation and weather data, we checked timestamp intervals. All datasets had consistent 15-minute gaps.
Why it matters:
- Ensures proper alignment during merging
- Prevents accidental NaNs due to mismatched timestamps
- Helps decide between merge() (exact match) and merge_asof() (nearest match)
Since timestamps were regular, we used pd.merge().
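The interval check and the exact-match merge can be sketched as follows. The two small frames and the `interval_counts` helper are assumptions for illustration; merging on DATE_TIME alone leaves both SOURCE_KEY columns in the result with pandas' default `_x` and `_y` suffixes.

```python
import pandas as pd

# Small stand-ins for one plant's generation and weather data
generation = pd.DataFrame({
    "DATE_TIME": pd.date_range("2020-05-15", periods=4, freq="15min"),
    "SOURCE_KEY": ["inverter_A"] * 4,
    "DC_POWER": [0.0, 1.0, 2.0, 3.0],
})
weather = pd.DataFrame({
    "DATE_TIME": pd.date_range("2020-05-15", periods=4, freq="15min"),
    "SOURCE_KEY": ["sensor_1"] * 4,
    "IRRADIATION": [0.0, 0.1, 0.2, 0.3],
})


def interval_counts(df):
    """Count the gaps between consecutive unique timestamps."""
    stamps = df["DATE_TIME"].drop_duplicates().sort_values()
    return stamps.diff().dropna().value_counts()


gaps = interval_counts(weather)
print(gaps)  # a single 15-minute entry means the timeline is regular

# Exact-match merge; a left join keeps every generation row
merged = pd.merge(generation, weather, on="DATE_TIME", how="left")
print(merged.isna().sum())
```

If more than one gap size shows up, pd.merge_asof() with a tolerance would be the option to evaluate instead.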
We merged generation and weather data for each plant. Plant 2 merged cleanly, but Plant 1 had missing weather readings at 2020-06-03 14:00.
The Plant 1 weather dataset has no record for 14:00:00 on that day, so merging it with the Plant 1 generation data produced null weather values for that timestamp. We filled the gap using time-based interpolation.
```python
import pandas as pd

# Fix the gap by interpolating values in the Plant 1 weather dataset.
# Make sure DATE_TIME is datetime and sorted
plant1_weather['DATE_TIME'] = pd.to_datetime(plant1_weather['DATE_TIME'])
plant1_weather = plant1_weather.set_index('DATE_TIME').sort_index()

# Create a full 15-minute timeline from start to end
# ('15min' replaces the deprecated '15T' alias; the name keeps the index labelled)
full_index = pd.date_range(
    start=plant1_weather.index.min(),
    end=plant1_weather.index.max(),
    freq='15min',
    name='DATE_TIME',
)

# Reindex the DataFrame to include *all* 15-minute timestamps
plant1_weather_reindexed = plant1_weather.reindex(full_index)

# Interpolate missing weather readings along the time index
weather_cols = ['AMBIENT_TEMPERATURE', 'MODULE_TEMPERATURE', 'IRRADIATION']
plant1_weather_reindexed[weather_cols] = (
    plant1_weather_reindexed[weather_cols]
    .interpolate(method='time', limit_direction='both')
)
plant1_weather_reindexed.head(10)
```
What this does:
- Ensures every 15-min slot has a value
- Uses time-aware interpolation to estimate missing readings
- Preserves temporal continuity; because only a short gap is filled, the estimate introduces little distortion
We then re-merge the data and extract the date-time components required for modeling.
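Extracting the HOUR, DAY, and MONTH features could look like the sketch below. The two-row frame is a stand-in for a merged dataset, and treating DAY as day of month is an assumption.

```python
import pandas as pd

# Stand-in for a merged plant dataset
merged = pd.DataFrame({
    "DATE_TIME": pd.to_datetime(["2020-05-15 06:15:00", "2020-06-03 14:00:00"]),
    "AC_POWER": [10.0, 500.0],
})

# The .dt accessor exposes the components of a datetime column
merged["HOUR"] = merged["DATE_TIME"].dt.hour
merged["DAY"] = merged["DATE_TIME"].dt.day      # day of month (assumed meaning)
merged["MONTH"] = merged["DATE_TIME"].dt.month

print(merged)
```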
This phase uses the merged datasets to uncover operational patterns and performance differences between Plant 1 and Plant 2. Each plot is backed by code logic and real-world interpretation.
We plotted daily average irradiation against daily DC power output for both plants. The Pearson correlation coefficient was:
- Plant 1: 0.993 → near-perfect linear relationship
- Plant 2: 0.871 → weaker correlation, more scatter
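A minimal sketch of how such a daily correlation can be computed is shown below. The data is synthetic and perfectly linear by construction, so the coefficient comes out at essentially 1; the values are not the project's.

```python
import numpy as np
import pandas as pd

# Five synthetic days of 15-minute readings (96 per day)
levels = [0.2, 0.4, 0.6, 0.5, 0.3]
stamps = pd.date_range("2020-05-15", periods=96 * len(levels), freq="15min")
irradiation = np.repeat(levels, 96)

df = pd.DataFrame({
    "DATE_TIME": stamps,
    "IRRADIATION": irradiation,
    "DC_POWER": 1000.0 * irradiation,  # perfectly linear by construction
})

# Average both variables per calendar day, then correlate the daily means
daily = df.groupby(df["DATE_TIME"].dt.date)[["IRRADIATION", "DC_POWER"]].mean()
r = daily["IRRADIATION"].corr(daily["DC_POWER"])  # Pearson by default
print(round(r, 3))
```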
Interpretation:
- In Plant 1, higher sunlight (irradiation) consistently leads to higher DC power output.
- In Plant 2, even when irradiation increases, DC power remains low or inconsistent.
Possible reasons:
- Inverter degradation or partial shading
- Sensor misalignment or calibration issues
- Maintenance gaps or panel-level faults
- Data quality issues (e.g., timestamp mismatches or unit inconsistencies)
- Daily Yield Comparison:
We grouped data by DATE_TIME.dt.date and aggregated DAILY_YIELD to compare total daily energy output in MWh.
Insights:
- Plant 1 consistently outperforms Plant 2 across most days.
- The gap widens on high-irradiation days, suggesting better conversion and fewer losses.
This grouped data was stored in a list and later concatenated into a combined DataFrame for plotting.
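The group, collect, and concatenate steps might be sketched like this. The `make_plant` helper and its numbers are made up for illustration, and the PLANT and DATE column names are assumptions.

```python
import pandas as pd


def make_plant(scale):
    """Build a tiny two-day stand-in for a merged plant dataset."""
    stamps = pd.date_range("2020-05-15", periods=8, freq="360min")
    readings = [0, 1, 2, 3, 0, 2, 4, 6]
    return pd.DataFrame({
        "DATE_TIME": stamps,
        "DAILY_YIELD": [scale * value for value in readings],
    })


plants = {"Plant 1": make_plant(1000.0), "Plant 2": make_plant(800.0)}

frames = []
for name, df in plants.items():
    # Sum DAILY_YIELD per calendar date, as described in the text
    daily = (
        df.groupby(df["DATE_TIME"].dt.date)["DAILY_YIELD"]
        .sum()
        .rename_axis("DATE")
        .reset_index()
    )
    daily["PLANT"] = name  # label rows so the plants can be told apart when plotting
    frames.append(daily)

combined = pd.concat(frames, ignore_index=True)
print(combined)
```

A long-format frame like `combined` can be passed to most plotting libraries with PLANT as the grouping column.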
- Hourly AC Power Trend (Peak Hour Analysis)
Interpretation:
- If each plant has 20 inverters, with a per-inverter peak AC output of roughly 800 W (Plant 1) and 600 W (Plant 2):
- Plant 1 → 20 × 800 = 16,000 Watts or 16 kW peak AC output
- Plant 2 → 20 × 600 = 12,000 Watts or 12 kW
This suggests Plant 1 has better panel exposure, inverter health, or fewer operational disruptions.
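One way to locate the peak hour is to average AC_POWER by hour of day, as in the sketch below. The readings are invented for illustration and are not the plants' values.

```python
import pandas as pd

# Two invented days with morning, midday, and evening readings
df = pd.DataFrame({
    "DATE_TIME": pd.to_datetime([
        "2020-05-15 06:00", "2020-05-15 12:00", "2020-05-15 18:00",
        "2020-05-16 06:00", "2020-05-16 12:00", "2020-05-16 18:00",
    ]),
    "AC_POWER": [100.0, 800.0, 50.0, 120.0, 760.0, 70.0],
})

# Mean AC power for each hour of the day across all days
hourly = df.groupby(df["DATE_TIME"].dt.hour)["AC_POWER"].mean()
peak_hour = int(hourly.idxmax())

print(hourly)
print("Peak hour:", peak_hour)
```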
- Inverter Conversion Efficiency (DC to AC)
We calculated energy using:
Energy (MWh) = Power (W) × Time (0.25 hours) / 1e6, or / 1e3 when power is in kW
Then grouped by SOURCE_KEY_x to compute total DC and AC energy per inverter and their conversion ratio:
Conversion Ratio (%) = (AC Energy / DC Energy) × 100
Initial finding:
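- Plant 1 showed ~9% conversion, which is unrealistically low.
Diagnosis:
- DC_POWER in Plant 1 was likely in Watts, while AC_POWER was in kW
- After converting DC_POWER by dividing by 1000, the efficiency corrected to ~97–98%, which is realistic and healthy. Note that moving from ~9% to ~97% is a factor of about 10 rather than 1000, so the divisor should be verified against the raw values.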
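The energy and conversion-ratio calculation can be sketched as below, assuming both power columns are already in the same unit (Watts here). The inverter names and readings are invented for illustration.

```python
import pandas as pd

INTERVAL_HOURS = 0.25  # each reading covers 15 minutes

# Invented readings, both columns assumed to be in Watts
merged = pd.DataFrame({
    "SOURCE_KEY_x": ["inverter_A", "inverter_A", "inverter_B", "inverter_B"],
    "DC_POWER": [1000.0, 2000.0, 1500.0, 500.0],
    "AC_POWER": [980.0, 1960.0, 1455.0, 485.0],
})

# Energy (MWh) = Power (W) x 0.25 h / 1e6
merged["DC_ENERGY_MWH"] = merged["DC_POWER"] * INTERVAL_HOURS / 1e6
merged["AC_ENERGY_MWH"] = merged["AC_POWER"] * INTERVAL_HOURS / 1e6

# Total energy per inverter, then the AC/DC ratio as a percentage
per_inverter = merged.groupby("SOURCE_KEY_x")[["DC_ENERGY_MWH", "AC_ENERGY_MWH"]].sum()
per_inverter["CONVERSION_PCT"] = (
    per_inverter["AC_ENERGY_MWH"] / per_inverter["DC_ENERGY_MWH"] * 100
)
print(per_inverter)
```

A ratio far outside the 90–100% range is usually a sign of mismatched units between the two columns rather than a real efficiency figure.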
This phase focuses on building forecasting models to predict solar energy generation for the next 7 days using historical daily yield data. We used both ARIMA and SARIMA to benchmark performance and capture trends, seasonality, and residual noise.
We grouped the merged datasets by DATE_TIME.dt.date and summed DAILY_YIELD to get daily totals. These were converted to MWh for clarity:
daily_yield['DAILY_YIELD_MWh'] = round(daily_yield['DAILY_YIELD'] / 1e6, 2)
This gave us 34 days of daily yield data per plant, which we stored in daily_df for modeling.
Using seasonal_decompose(), we broke down each time series into:
- Trend: Long-term direction of energy output
- Seasonality: Daily cycles or repeating patterns
- Residual: Random noise or anomalies
We applied the Augmented Dickey-Fuller test to check if the data was stationary:
- Plant 1: p-value = 0.00007 → Stationary
- Plant 2: p-value = 0.0389 → Stationary
Since both passed the test (p ≤ 0.05), we proceeded without differencing.
We split the time series into:
- Train: First 27 days
- Test: Last 7 days
This setup allowed us to forecast the final week and compare predictions with actuals.
We fit an ARIMA(2,0,2) model and forecasted 7 days ahead.

Visuals: Forecast lines closely followed actuals for Plant 1, with slightly more deviation in Plant 2.
We fit a SARIMA(2,0,2)(1,1,1,7) model to capture weekly seasonality.

Interpretation: SARIMA captured seasonality but slightly overfit Plant 1’s fluctuations. ARIMA performed better for Plant 1; SARIMA was comparable for Plant 2.
Key Takeaways
- ARIMA is effective for short-term yield forecasting when seasonality is minimal or stable.
- SARIMA adds value when patterns repeat weekly or monthly.
- Forecasting accuracy depends on data quality, inverter consistency, and weather variability.
The forecasting models, ARIMA and SARIMA, predicted solar power output with reasonable accuracy across both plants. Despite minor deviations, the predictions aligned well with actual trends, supporting the use of time series techniques for short-term energy planning.
The Power BI dashboard, powered by these forecasts, enables energy planners to:
- Anticipate dips in solar generation due to weather variability or system inefficiencies
- Optimise energy dispatch by aligning predicted output with grid demand
- Schedule maintenance proactively, especially for underperforming inverters
- Compare plant performance and conversion efficiency side by side
Together, the models and dashboard form a decision-support system for solar operations, turning raw data into actionable insights.
This project was originally created and maintained by Arindham Krishna.
All logic, forecasting methodology, and data handling scripts were authored by arindham-codes-cmd.
If you’re viewing this project from a forked repository, please note that the original version was developed by Arindham Krishna.
Visit the source repo here: PV-Output-Forecasting






