Over time, various factors have adversely affected the quality of DNO’s Geospatial Information System (GIS) data. If datasets are shared with patterns of error, different users will fill these gaps in different ways, leading to inconsistent analysis results and inconsistent use of the data by applications. A Machine Learning (ML) tool is proposed to carry out data cleansing and data gap closure of WPD’s network GIS and relevant network data. The ML tool will be trained and populated with existing network asset data and run to identify data gaps. The model will initially be developed from a set of business rules, which will be updated based on the outcomes of the ML algorithms. Appropriate testing will be carried out to validate the results and assess the accuracy of the model.
Benefits
The project will focus on errors that cannot currently be identified and fixed automatically. Highlighting potential errors and suggesting suitable fixes will improve the usability of the GIS datasets by third parties, particularly where issues relate to missing data. It is also expected to reduce the time taken to manually identify and fix these GIS issues.
Learnings
Outcomes
The main outcomes of the project are:
1. An assessment of our initial data evaluation in the trial area has been made as part of the Interim Learning Report. This has provided an overview of the completeness of the different datasets and proportions of different asset types between the voltage layers rather than commenting on the accuracy of the data presented.
2. The types of error that can affect the GIS data have been documented as use cases and grouped together to allow for mapping between use case groups and potential evaluation methods. This is likely to be transferrable knowledge to other DNOs.
3. Interim learning has been shared with other DNOs that have already carried out work in this area, or are planning to, in order to avoid duplication of effort.
4. The applicability of AI approaches to identifying and suggesting corrections to GIS errors has been confirmed.
5. Two complementary modelling approaches have been selected and the rationale for their selection has been documented and shared.
6. The Proof of Concept model has been developed and tested both by Capgemini staff and on our hardware with a configuration that does not require access to the internet by our staff.
7. The accuracy of the Proof of Concept models has been evaluated and shown to be above that achieved by assuming the most frequently occurring result.
8. The results of the models have been evaluated and confidence metrics have been used to separate values with high and low confidence. The separate groups are seen to differ in accuracy with the high confidence group achieving better results than the low confidence group, confirming the usefulness of the confidence metrics.
9. Reports from using the model have been passed back to the business to allow identified errors and proposed corrections to be examined further with a view to correcting the errors identified.
10. Comparison with INM errors has shown that the different approaches are complementary.
11. Priorities for Business As Usual implementation, and for further analysis likely to improve data accuracy, have been proposed.
12. Learning has been disseminated via published reports and a webinar, enabling other DNOs to build on the learning generated by the project without duplicating the work.
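Outcomes 7 and 8 describe two checks: comparing model accuracy with the accuracy obtained by always predicting the most frequently occurring value, and comparing accuracy between high and low confidence groups. The following is a minimal sketch of those two checks only. The function names, the 0.8 threshold and the sample labels are illustrative assumptions and are not taken from the project.

```python
from collections import Counter

def majority_baseline_accuracy(true_labels):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(true_labels).most_common(1)[0][1]
    return most_common_count / len(true_labels)

def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

def accuracy_by_confidence(true_labels, predicted_labels, confidences, threshold=0.8):
    """Split predictions at a confidence threshold (assumed value) and
    return (high_confidence_accuracy, low_confidence_accuracy)."""
    rows = list(zip(true_labels, predicted_labels, confidences))
    high = [(t, p) for t, p, c in rows if c >= threshold]
    low = [(t, p) for t, p, c in rows if c < threshold]

    def group_accuracy(pairs):
        # None signals an empty group rather than a misleading 0.0.
        return sum(t == p for t, p in pairs) / len(pairs) if pairs else None

    return group_accuracy(high), group_accuracy(low)

# Illustrative (made-up) labels for an operational voltage attribute.
true_voltage = ["LV", "LV", "HV", "LV", "HV", "LV"]
predicted = ["LV", "LV", "HV", "LV", "HV", "HV"]
confidence = [0.95, 0.90, 0.85, 0.55, 0.92, 0.40]

baseline = majority_baseline_accuracy(true_voltage)
model_accuracy = accuracy(true_voltage, predicted)
high_acc, low_acc = accuracy_by_confidence(true_voltage, predicted, confidence)
print(baseline, model_accuracy, high_acc, low_acc)
```

A model is only adding value if its accuracy exceeds the baseline, and a confidence metric is only useful if the high confidence group is measurably more accurate than the low confidence group.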
Lessons Learnt
The lessons learnt relate to two main areas: modelling methodology and data.
Modelling Methodology
· In order to effectively test data-driven and machine learning methods to identify data errors, the process had to introduce data quality exceptions in a way that simulates real life. Therefore an understanding of the prevalence of data errors is required to inform the synthetic errors added to train the Spatial Graph model.
· The data contains multiple distinct asset types with attributes that are not comparable. This leaves only a limited number of features to which traditional Machine Learning imputation techniques can be applied, so an alternative, more complex approach is required, e.g. merging information about the other asset types into one main table per asset type, or some form of multi-model graph model.
· There are a number of key attributes that are high-cardinality categorical features, such as specification description, structure number, site number, circuit ID, etc. Basic approaches for encoding categorical features, such as one-hot encoding, can only be used effectively with categorical features of low cardinality. It may be possible to simplify some of these features by splitting them into separate columns, i.e. separating composite features like specification description.
· Where possible, rules-based algorithms should be preferred to data-driven ones, since these are easy to verify and understand, and the resulting suggestions have a very high probability of being correct.
· The methods used in SEAM are robust to different topologies and configurations of the networks, accommodating both radial and mesh arrangements, and can be used in a number of different scenarios where data on network topology may not be of high quality or complete.
· The use of this model could be more iterative in nature, with a data steward checking violations, updating Electric Office where violations may be caused by configuration or specifications, and re-running the model to see the improvements made and the reduction in violations.
· Traditional Machine Learning approaches that work with table-based observations (e.g. regression techniques such as k-nearest neighbours) will have limited usefulness. This was an initial hypothesis and a proposed approach for Use Cases 2 and 3. In the context of geospatial data, the absolute location of each asset is of limited utility on its own: what matters more is the local neighbourhood of each asset, i.e. what are the attributes of the other assets in the surrounding area? Therefore our approach will utilise a graph model (i.e. based on a connected graph of nodes and relationships with properties and labels).
· Traditional graph models for power networks are focused on power systems analysis and network management, rather than on asset management. They typically rely on electrical properties and require complete electrical connectivity, ignoring spatial relationships. This approach is well suited where the physical connectivity of the model is central to conducting the modelling or forms a part of the pattern identification.
· The performance of the model was evaluated against a set of synthetic errors. The model was trained/optimised to correct those synthetic errors, and this is reflected in any "confidence scores" assigned. The confidence scores were found to be useful in separating results into high and low confidence groups, and the high confidence groups were seen to have greater accuracy.
· There is a trade-off between the depth of the neural network (greater depth requires more processing and risks overfitting) and the performance of the model.
· The full spatial graph model is a heterogeneous graph (or heterograph), which necessitates the use of Relational Graph Convolutional Layers (RGCN), and types of layers derived from them, in the neural network model. Furthermore, each asset attribute to be predicted by the neural network requires a separate "head", resulting in a multi-headed architecture. This meant that the neural network had a backbone of several RGCN (or similar) graph convolutional layers followed by several Machine Learning Process heads taking the asset node embeddings as inputs.
· The application of a Neural Network to predict asset attributes and relationships based on spatial relationships continues to be a viable method as additional complexity is added. It has been shown to produce a demonstrable level of performance in predicting the network type and operational voltage of each asset.
· As a side-effect of the process of creating the spatial mesh, it is possible to create a report of coordinates from all geometries in the region of interest that are very close to each other (about 10% of edges in the spatial mesh are under 10cm). This could be used to make minor modifications to the geometry of some assets to ensure that they "snap together" exactly.
· For the purposes of the Proof of Concept, it is simplest if all of the node attributes to be predicted/corrected are categorical ones. While it is relatively straightforward to support both classification and regression heads in the neural network, it is not a high enough priority, especially given the time remaining. Also, creation of "confidence scores" is easier for classification tasks. This means that numerical attributes to be predicted, e.g. conductor rating, must be binned into fixed ranges.
· Evaluation of the spatial model has shown that the performance is good for the Proof of Concept. Furthermore, this model has the ability to be trained on one subset of the network and then used to identify and correct errors in another subset of the network. It is also able to be extended with more data, functionality and optimization. This proves that Generalised Neural Networks are a good approach for data cleaning for asset management.
· Evaluation of the spatial model has also identified some patterns of false alarms that are consequences of the highly limited data available to the current version of the model. Some "quick wins" have been identified that should bring significant performance benefits.
· The spatial and connectivity models are complementary. For example, the features calculated for the connectivity model (e.g. electrical connectivity, electrical properties) are valuable inputs for the prediction of the cable/wire specifications. Similarly, the electrical properties can be back-filled using the spatial model in order to get better estimates of the network capacity for the connectivity model. Hence, combining the models will bring significant performance benefits to both.
· Network flow provides an exact characterization of network capacity in the single commodity case (i.e. real power flow). The linear formulation of network flow is functional, efficient and fast to solve. It is sufficient for analysis that does not need to consider systems away from their operating points (blackouts, instabilities), losses, or coupling between real and reactive power.
· A network with multiple demand and supply points, commonly known as the circulation with demands problem, can be reduced to a network flow problem (which has many fast and efficient algorithms) by adding synthetic 'super' source and 'super' sink nodes, with edges that lead to the actual source and sink nodes and have capacity equal to the supply / demand. This was used in the SEAM solution.
· Max flow is fast (data preparation and post processing phases take the majority of the model running time), robust and efficient. If the processing is done on a circuit by circuit basis, these procedures can be done in parallel.
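The reduction to a single-source, single-sink flow problem described above can be sketched as follows. This is an illustrative sketch only, not the SEAM implementation: it uses a plain Edmonds-Karp maximum flow, treats each line as an undirected edge (capacity in both directions, matching the undirected connectivity graph), and the node names, capacities and the function names `max_flow` and `feasible_supply` are all hypothetical.

```python
from collections import defaultdict, deque

def max_flow(capacity, source, sink):
    """Edmonds-Karp maximum flow. `capacity` maps u -> {v: capacity}."""
    # Residual capacities, with reverse edges initialised to zero.
    residual = defaultdict(lambda: defaultdict(float))
    for u, edges in capacity.items():
        for v, cap in edges.items():
            residual[u][v] += cap
            residual[v][u] += 0.0
    total = 0.0
    while True:
        # Breadth-first search for the shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total  # no augmenting path left
        # Find the bottleneck along the path, then push flow along it.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck

def feasible_supply(lines, supplies, demands):
    """Return (deliverable flow, total demand) for one circuit.

    lines: (node_a, node_b, capacity) tuples, treated as undirected.
    supplies / demands: node -> supply or demand (hypothetical units).
    """
    capacity = defaultdict(dict)
    for a, b, cap in lines:
        capacity[a][b] = cap
        capacity[b][a] = cap
    # Synthetic super source / super sink with capacity = supply / demand.
    for node, supply in supplies.items():
        capacity["super_source"][node] = supply
    for node, demand in demands.items():
        capacity[node]["super_sink"] = demand
    flow = max_flow(capacity, "super_source", "super_sink")
    return flow, sum(demands.values())

# Made-up circuit: substation S feeds A and C directly, and B via A.
lines = [("S", "A", 60), ("A", "B", 30), ("S", "C", 50)]
supplies = {"S": 100}
demands = {"A": 20, "B": 40, "C": 30}
flow, total_demand = feasible_supply(lines, supplies, demands)
print(flow, total_demand)
```

In this made-up circuit the deliverable flow is lower than the total demand because the line between A and B cannot carry the demand at B, which is the kind of capacity violation that would be flagged for review. Because each circuit is solved independently, calls to `feasible_supply` could be run in parallel.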
Data
· There is no direction / parent-child relationships within the GIS data and so for connected point assets / line segment elements there is no indication within the data of the direction of flow of power expected within the asset. This meant that the connectivity graph needed to be an undirected graph and any method chosen needed to be robust to the lack of direction / parent-child relationships within the data.
· The ability to eliminate reasons for violations (customer wrongly assigned, profile class wrongly assigned, Estimated Annual Consumption or half hourly consumption error, for example) is diminished due to the level of missing assets (cables and wires to create connectivity and connections to customers) and missing labels for cable and wire specifications. Again, this suggests that an iterative approach may be useful where this data is progressively added. Some improvements to the specification descriptions for cables and wires were required (potentially as an input from model 2), as well as assessment / review of the missing / synthetic service cables.
· There are few ‘true’ violations of network capacity indicated in the data, as mostly the components of the network flagged as bottlenecks are where capacity values have been or reflect simulated cables / wires or the simplifying assumptions used to model ways in which customers are connected. This meant there was a need to combine outputs from the spatial graph model / manual interventions in the quality of technical data for circuits to reduce the number of false positives.
· A sufficient completeness of physical circuits is required to understand the relationship between assets in different locations and how this can be pooled and used to improve the data quality in all of those areas. While work was ongoing during the project to build LV network connectivity in Electric Office (the circuits in our dataset include the outcome of Phase 1 of this project), a significant number of LV cables, wires and point assets with no circuit ID remained.
· Complete and detailed data dictionaries/catalogues do not currently exist for all our data sources. This slowed the process of forming a detailed understanding of the data to determine the best suited modelling approaches.
· There are different naming conventions for attributes across the different systems/sources which introduced an element of confusion. A data catalogue is being created to support the project data model to ensure there is a clear understanding of the relationship between datasets. This includes mapping to the Common Information Model outputs from the Integrated Network Model project.
· The project would have liked to use the Energy Performance Certificate (EPC) dataset (issued for domestic and non-domestic buildings constructed, sold or let since 2008) to enrich the customer features. There is a challenge linking this to Meter Point Registration System (MPRS) data because the EPC dataset does not contain a Unique Property Reference Number (UPRN) and the address data available is not well structured. Either including a UPRN on EPCs or improving the quality of the address data in CROWN would be beneficial.
· The lack of standard formats for data extracts from CROWN and Electric Office would need to be resolved to make the data analysis repeatable.
· The cable and wire specification attributes in EO are a concatenation of three associated components (size, type/material, number of conductors). A significant number of these contain at least one component that is ‘unknown’.
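Splitting the concatenated specification into its three components, as suggested in the modelling lessons for high-cardinality composite features, can be sketched as below. The actual format of the specification description in EO is not given here, so the whitespace-separated layout ("size material conductors"), the sample values and the names `CableSpec` and `split_specification` are assumptions made purely for illustration.

```python
from typing import NamedTuple, Optional

class CableSpec(NamedTuple):
    size: Optional[str]
    material: Optional[str]
    conductors: Optional[str]

# Tokens treated as missing values (assumed spellings).
UNKNOWN_TOKENS = {"unknown", "unk", ""}

def split_specification(description: str) -> CableSpec:
    """Split an assumed 'size material conductors' string into components.

    Components marked as unknown, or absent altogether, become None.
    """
    parts = description.split()
    parts = (parts + [""] * 3)[:3]  # pad or truncate to exactly 3 parts
    cleaned = [None if p.lower() in UNKNOWN_TOKENS else p for p in parts]
    return CableSpec(*cleaned)

def has_unknown_component(spec: CableSpec) -> bool:
    """True if at least one of the three components is missing."""
    return any(component is None for component in spec)

# Made-up specification descriptions.
descriptions = ["185 AL 3C", "Unknown CU 4C", "95 Unknown Unknown"]
specs = [split_specification(d) for d in descriptions]
incomplete = sum(has_unknown_component(s) for s in specs)
print(specs, incomplete)
```

Each component then becomes a separate, lower-cardinality categorical column, and a partially unknown specification still contributes its known components rather than being discarded as a whole.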