Using deep machine learning techniques, this project is exploring whether more accurate predictions of solar electricity generation could reduce the amount of "spinning reserve" required. This would reduce carbon emissions and costs to end-users, as well as increase the amount of solar generation the grid can handle.
Benefits
Not required.
Learnings
Outcomes
The outcomes of the project to date (2024 progress) are:
- A fully operational PV Nowcasting service running on two ML models (an illustrative sketch of this horizon-based blend follows the list):
  - PVNet for 0-6 hours,
  - a blend of PVNet and National_xg for 6-8 hours, and
  - National_xg from 8 hours onwards.
- An accuracy improvement over the previous OCF model of approximately 30% for the GSP and National forecasts (4-8 hours), resulting in forecasts approximately 40% more accurate than the BMRS model and over 40% more accurate than the PEF forecast (for 0-8 hours).
- Probabilistic forecasts for all horizons.
- Backtest runs for the DRS project.
- A UI including a new Delta view, dashboard view and probabilistic display.
- A UI speed-up, with query times reduced from 20 seconds to under 1 second.
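As an illustration of how the two models can be combined by horizon, the sketch below cross-fades linearly between PVNet and National_xg over the 6-8 hour window. The function name, blend weights and example values are assumptions for illustration only, not the production blending logic.

```python
import numpy as np

def blend_national_forecast(pvnet_mw: np.ndarray, national_xg_mw: np.ndarray,
                            horizons_hours: np.ndarray) -> np.ndarray:
    """Blend two forecasts by horizon: PVNet up to 6 h, a linear blend
    between 6 and 8 h, and National_xg beyond 8 h.

    The linear cross-fade between 6 and 8 hours is an assumption for
    illustration; the production blending logic may differ.
    """
    # Weight on National_xg: 0 before 6 h, ramps to 1 by 8 h, 1 afterwards.
    w_xg = np.clip((horizons_hours - 6.0) / 2.0, 0.0, 1.0)
    return (1.0 - w_xg) * pvnet_mw + w_xg * national_xg_mw

# Example: forecasts at 0, 4, 7 and 10 hours ahead (illustrative values, MW).
horizons = np.array([0.0, 4.0, 7.0, 10.0])
pvnet = np.array([5200.0, 4800.0, 3100.0, 900.0])
national_xg = np.array([5100.0, 4700.0, 2900.0, 1000.0])
print(blend_national_forecast(pvnet, national_xg, horizons))
```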
The most significant of these is the achievement of the target set by NG-ESO of a 20% reduction in MAE. This is an extremely large improvement in renewable forecasting, and is the result of numerous machine learning improvements.
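For reference, MAE here is the mean absolute error between the forecast and the measured (outturn) generation over n forecast-outturn pairs:

\[
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{y}_i - y_i \right|
\]

so a 20% reduction means the average absolute forecast error shrank by a fifth.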
Lastly, the forecast from Open Climate Fix is delivered completely in the open and is fully documented. Resilience was significantly increased over the project duration, resulting in over 99.5% availability. This resilience is implementable by NG-ESO, with all the infrastructure defined as code to allow replicability.
Lessons Learnt
In the course of WP1 and WP2, the project identified the following lessons:
Never underestimate the importance of cleaning up and checking data in advance
Several approaches to loading data were tried, from on-the-fly loading to pre-preparation, and automatic and visual tests of the data were instituted to ensure the various data sources were always lined up correctly.
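A minimal sketch of such an automated alignment check is shown below, assuming pandas DataFrames indexed by timestamp; the source names and the 30-minute grid are illustrative assumptions, and the project's actual tests were more extensive.

```python
import pandas as pd

def check_time_alignment(sources: dict[str, pd.DataFrame], freq: str = "30min") -> None:
    """Check that each data source sits on the expected time grid and that all
    sources overlap in time, failing early rather than training on misaligned data.

    `sources` maps a source name (e.g. "pv", "nwp", "satellite") to a DataFrame
    with a DatetimeIndex; the names are illustrative only.
    """
    for name, df in sources.items():
        # Every timestamp should sit on the expected grid (e.g. 30-minute).
        off_grid = df.index[df.index != df.index.round(freq)]
        if len(off_grid) > 0:
            raise ValueError(f"{name}: {len(off_grid)} timestamps off the {freq} grid")
        if df.isna().any().any():
            raise ValueError(f"{name}: contains missing values")

    # All sources must share a non-empty overlapping period.
    starts = [df.index.min() for df in sources.values()]
    ends = [df.index.max() for df in sources.values()]
    if max(starts) >= min(ends):
        raise ValueError("Data sources do not overlap in time")
```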
Having infrastructure as code allows the main production service to run uninterrupted
Having code to easily instantiate infrastructure is very useful for the efficient management of environments, and ensured the project could bring the algorithm into productive use. The Terraform software tool was used, which makes spinning up (and down) environments easy and repeatable. Being able to spin up new environments allowed the project to test new features in development environments while the main production service kept running uninterrupted.
Using Microservices to “start simple and iterate” accelerates development
Using a microservice architecture allowed the project to upgrade individual components as benefit was seen in improving them, independently of other components' behaviour. This was very useful when building out the prototype service, as it allowed the project team to start with a simple architecture - even a trivial forecast model - and iteratively improve the functionality of the components. For example, starting with a single PV data provider allowed the project to get a prototype working, and in WP3 an additional PV provider will be onboarded.
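As an illustration of the "trivial forecast model" starting point, the sketch below shows a persistence baseline that simply carries the latest observed generation forward. The function and parameter names are assumptions, not the production interface; the point is that a real ML model can later be swapped in behind the same interface without touching other components.

```python
import pandas as pd

def persistence_forecast(recent_pv_mw: pd.Series, horizon_minutes: int = 480,
                         freq_minutes: int = 30) -> pd.Series:
    """Trivial baseline forecast: repeat the latest observed PV generation
    for every future timestep out to the requested horizon.
    """
    last_time = recent_pv_mw.index[-1]
    last_value = recent_pv_mw.iloc[-1]
    future_times = pd.date_range(
        start=last_time + pd.Timedelta(minutes=freq_minutes),
        periods=horizon_minutes // freq_minutes,
        freq=f"{freq_minutes}min",
    )
    return pd.Series(last_value, index=future_times, name="pv_forecast_mw")
```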
Data processing may take longer than expected
While it was initially planned to extend the dataset back to 2016 for all data sources during WP2, data processing took much longer than expected. This does not have a direct impact on project deliverables, but it is something to consider in further ML research.
Data validation is important
For both ML training and inference, using clear and simple data validation builds trust in the data. This helps build a reliable production system and keeps software bugs to a minimum.
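A minimal sketch of such validation is shown below; the column name "generation_mw" and the capacity bound are assumptions for illustration, and the aim is simply that bad data fails loudly and early.

```python
import pandas as pd

def validate_pv_frame(df: pd.DataFrame, capacity_mw: float) -> pd.DataFrame:
    """Simple, explicit checks run before training or inference."""
    assert "generation_mw" in df.columns, "missing generation_mw column"
    assert df.index.is_monotonic_increasing, "timestamps are not sorted"
    assert not df.index.has_duplicates, "duplicate timestamps found"
    assert not df["generation_mw"].isna().any(), "missing generation values"
    assert (df["generation_mw"] >= 0).all(), "negative generation values"
    assert (df["generation_mw"] <= capacity_mw).all(), "generation above capacity"
    return df
```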
Engaging specialist UX/UI skills is important
By acknowledging that UX and UI design is a specialised area and incorporating those skills, a UI has been developed which will be easier to use and will convey information effectively. This will be validated during WP3 through working with the end users.
Building our own hardware demonstrates value for money but may pose other challenges for a small team
Two computers were built during the project with a total of six GPUs, and it is estimated that using on-premises hardware instead of the cloud for data- and GPU-heavy machine learning R&D can significantly reduce direct costs. However, the time required for a small team to put together all the components is significant (approximately 25 person-days in total). While the total costs would still be lower, appropriate resource planning should be considered for any future hardware upgrades.
In the course of WP3 and WP1 (extension), the project identified the following lessons:
Merging the code right away when performing frontend testing is of utmost importance
Merging the code only after frontend testing proved to be time-consuming, and this is important to consider when planning tests.
Large Machine Learning models are harder to productionise
Large machine learning models proved to be difficult to productionise, and the size of the model makes it difficult to use. Going forward, we need to investigate further how to deploy large models.
Machine Learning training always takes longer than expected
Even with a ready-made model, getting datapipes to work correctly takes time. It is important to allocate enough time when planning ML training activities.
Security and authentication is hard
Ensuring that robust authentication and security measures are in place is harder than envisaged. It may be easier to use already-built packages or to contract third-party providers to support the process.
Separate National Forecast Model
The PVLive estimate of national solar generation does not equal the sum of PVLive's per-GSP generation estimates. This motivates building a separate National forecast model, rather than summing our GSP forecasts.
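A small sketch of the consistency check that surfaces this discrepancy is shown below; the data layout (one column per GSP, indexed by timestamp) is an assumption for illustration. A persistently non-zero difference is what motivates the separate national model.

```python
import pandas as pd

def national_vs_gsp_sum(national_mw: pd.Series, gsp_mw: pd.DataFrame) -> pd.Series:
    """Compare the national estimate with the sum of the per-GSP estimates
    at each timestamp; non-zero values indicate the two are inconsistent."""
    gsp_total = gsp_mw.sum(axis=1)
    return (national_mw - gsp_total).rename("national_minus_gsp_sum_mw")
```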
Investment is needed to take open-source contributions to the next level
Time and resources are needed to engage with open-source contributors and develop an active community. We may want to consider hiring an additional resource to support this activity.
Further lessons from WPS (extension) were as follows:
Expensive cloud machines' storage disks left idle
We use some GCP/AWS machines for R&D and often pause them when they are not in immediate use, because some GPU machines can cost significant amounts per hour. It was discovered that costs still accrue for the disk (storage) of paused machines. There is no golden rule for balancing pausing machines (so they can be restarted quickly) against starting a cloud machine from scratch each time, but it is useful to be aware of this cost.
Challenging to maintain active communication with National Grid ESO
Particularly high turnover in the National Grid ESO forecasting team affected communication on the project. This evolved over the duration of the project, and more active and easier communication was observed in the latter phase.
Reproducibility on cloud vs local servers
When results differ between the cloud and local servers, it can be tricky to determine the cause. Verbose logging, saving intermediate data, and keeping package versions and setup consistent on both machines helped. One particular bug involved results differing because multiple CPUs were used locally but only one CPU was used in the cloud.
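A minimal sketch of recording the run environment and fixing random seeds is shown below; the specific fields and the num_workers parameter are illustrative assumptions, chosen because a differing number of data-loading CPUs was the culprit in our case.

```python
import platform
import random
import sys

import numpy as np

def log_run_environment(seed: int = 42, num_workers: int = 1) -> dict:
    """Fix the random seed and record the details that most often explain
    cloud-vs-local differences; the returned dict can be saved with results."""
    random.seed(seed)
    np.random.seed(seed)
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "numpy": np.__version__,
        "seed": seed,
        "num_workers": num_workers,
    }
```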
Protection of production data
Two cloud environments, “development” and “production”, are maintained to protect the production data. This setup allows developers to access the development environment, where changes do not affect the live service. Although maintaining two environments increases costs, it is considered worthwhile.
Probabilistic forecasts
Some unreleased open-source packages were used to implement one of the probabilistic forecasts. Using this code before its release had advantages, but it also meant more thorough checks for bugs were required, which can take more time.
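For context, one common way to produce probabilistic forecasts is to train against a quantile (pinball) loss. The sketch below shows that loss in general terms; it is an illustration of the technique rather than the project's exact implementation, which relied on the unreleased packages mentioned above.

```python
import numpy as np

def pinball_loss(y_true: np.ndarray, y_pred: np.ndarray, quantile: float) -> float:
    """Quantile (pinball) loss: penalises under-prediction by `quantile` and
    over-prediction by (1 - quantile), so minimising it yields that quantile."""
    diff = y_true - y_pred
    return float(np.mean(np.maximum(quantile * diff, (quantile - 1.0) * diff)))

# Example: evaluate an assumed 90th-percentile forecast against outturn (illustrative MW values).
outturn = np.array([120.0, 150.0, 80.0])
p90_forecast = np.array([140.0, 160.0, 95.0])
print(pinball_loss(outturn, p90_forecast, quantile=0.9))
```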