The journey to a self-healing network: Intelligence, agents and complexity

This article is part of: Network Innovation

As the deployment of AI/ML ramps up, telcos need to develop an ‘intelligence architecture’ that supports the move towards Levels 4 and 5 in the TM Forum’s Autonomous Network Framework. How will such an intelligence architecture support the self-healing network?

How AI agents can support the self-healing network

Among the trends shaping the deployment of new AI/ML in assurance, one that was top of mind for the telcos and vendors we recently interviewed is the need to bring together data from across domains and up and down stacks, along with the necessary intelligence, to support decision-making for more complex root cause analysis.

Telcos have already invested significantly in data gathering, but data federation (the ‘stitching’ together of multiple data sources so that assurance solutions can use them) remains an ongoing task. It matters because some decisions require information from several data silos on the network (to answer a question such as: “Is the problem being experienced by this group of customers caused by an issue on the RAN or in the core?”). Data from systems such as billing and customer relationship management (CRM) may also be needed to provide customer-specific insight (“How much will this customer-affecting network problem cost the company?”).
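To make the idea of ‘stitching’ concrete, the following is a minimal Python sketch of how RAN alarms, core alarms and CRM records might be joined to answer the two example questions. Every record, field name and the `federate` function are hypothetical illustrations rather than any vendor’s data model, and the attribution rule is deliberately naive.

```python
# Hypothetical records from three separate silos (all field names are illustrative).
ran_alarms = [{"cell_id": "cell-17", "alarm": "high_packet_loss"}]
core_alarms = []  # no core alarms in this toy example
crm_customers = [
    {"customer_id": "c1", "serving_cell": "cell-17", "monthly_revenue": 120.0},
    {"customer_id": "c2", "serving_cell": "cell-17", "monthly_revenue": 80.0},
    {"customer_id": "c3", "serving_cell": "cell-42", "monthly_revenue": 200.0},
]


def federate(ran_alarms, core_alarms, crm_customers, affected_ids):
    """Join the three sources for one group of affected customers."""
    alarmed_cells = {alarm["cell_id"] for alarm in ran_alarms}
    affected = [c for c in crm_customers if c["customer_id"] in affected_ids]
    on_alarmed_cell = [c for c in affected if c["serving_cell"] in alarmed_cells]

    # Naive rule (an assumption): blame the RAN only if every affected
    # customer sits on an alarmed cell; otherwise fall back to core alarms.
    if affected and len(on_alarmed_cell) == len(affected):
        domain = "RAN"
    elif core_alarms:
        domain = "core"
    else:
        domain = "unknown"

    revenue_at_risk = sum(c["monthly_revenue"] for c in affected)
    return {"suspected_domain": domain, "revenue_at_risk": revenue_at_risk}


result = federate(ran_alarms, core_alarms, crm_customers, {"c1", "c2"})
print(result)
```

In practice the join keys (here, a serving cell) are rarely this clean, which is one reason federation remains an ongoing task.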

The figure below sets out a range of uses for this federated intelligence and data.

The need for federated intelligence and data

Federated data will be necessary across all these uses for several reasons.

Observability

5G standalone (SA) creates significant demand for additional reporting, visualisation and clustering for triage and decision-making – first across domains and stacks; and then across customers for service assurance:

  • Across domains and vertical stacks: Apart from the already noted need to collect and understand data from across domains and vendors, the move from dedicated hardware to virtual machines and containers requires new data sets from IT stacks. These more complex views will require data federation, including third-party data from other systems gathered to a suitable point in the network for analysis.
  • Across customers: Over the last 10 years, increasing focus on service assurance has required data analysis from individual customers or services to identify customer-impacting events. The continuing improvement in network resilience (from virtualisation and new automations) means the network operations centre (NOC) needs to focus less on individual network issues, allowing more focus on identifying and fixing problems that impact individual services and customers. Data from the network and a range of other sources (test data, weather patterns and customer sentiment) is needed – and new ML models have been developed to deal with the increasing volume of data and undertake anomaly detection, prediction and optimisation tasks. Our interviewees also reported a sustained focus on assurance products for enterprises for VPN services, cloud gaming and other latency-sensitive applications.

New automations

Vendors interviewed noted that a good percentage of their engagements now included automation, as telcos look towards Levels 3 and 4 of the TM Forum Autonomous Network Framework. Automating root cause analysis using new ML techniques has been the most prominent activity in the last few years – and will remain the crucial first step in a self-healing network. Other automation areas include predicting future performance and the customer impact of actions taken, as well as dynamically setting thresholds and remediations such as the opening of trouble tickets.
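As an illustration of dynamically setting thresholds and triggering a remediation such as a trouble ticket, the sketch below derives a threshold from recent KPI history (mean plus k standard deviations) instead of using a fixed value. This simple statistical rule stands in for the ML techniques mentioned above; the function names, the choice of k = 3 and the ticket structure are all assumptions made for the example.

```python
from statistics import mean, stdev


def dynamic_threshold(history, k=3.0):
    """Threshold that adapts to recent behaviour: mean + k * sample std dev."""
    return mean(history) + k * stdev(history)


def check_sample(history, sample, k=3.0):
    """Return a hypothetical ticket request if the sample breaches the threshold."""
    threshold = dynamic_threshold(history, k)
    if sample > threshold:
        return {"action": "open_ticket", "threshold": threshold, "value": sample}
    return None  # within normal range, no remediation needed


# Illustrative latency samples (ms) for one network function.
latency_history = [10, 11, 9, 10, 12, 10, 11, 9]

print(check_sample(latency_history, 11))  # normal sample
print(check_sample(latency_history, 25))  # clear outlier
```

Because the threshold is recomputed from the history passed in, the same check tolerates a KPI whose normal level drifts over time, which a static threshold would not.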

Example of automated fault detection

Part of service orchestration

End-to-end service orchestration is a single process that fulfils a customer’s requirement for a new service by combining pre-existing solutions from the telco and its partners – from order handling through service design, service/resource provisioning and assurance. It is increasingly needed to support a much wider range and complexity of 5G services. Assurance data, along with other data sources such as inventories, will be required to support orchestrations and service design. For example, a newly designed service will notify assurance about how network functions are chained together and the KQIs/SLAs required to support the customer. The assurance platform will then track compliance (i.e. checking that all metrics are within the required range) and forecast potential threats to KQIs/SLAs.
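The compliance-tracking step can be sketched as follows, assuming the service design hands assurance a list of KQI targets with allowed ranges. The `KqiTarget` class, the metric names and the two-point linear forecast are hypothetical simplifications; a production platform would use far richer models.

```python
from dataclasses import dataclass


@dataclass
class KqiTarget:
    """One KQI/SLA target as notified by service design (hypothetical shape)."""
    name: str
    min_value: float
    max_value: float


def check_compliance(targets, measurements):
    """Return the names of KQIs that are missing or outside their range."""
    breaches = []
    for target in targets:
        value = measurements.get(target.name)
        if value is None or not (target.min_value <= value <= target.max_value):
            breaches.append(target.name)
    return breaches


def forecast_next(values):
    """Naive forecast: extend the trend between the last two samples."""
    return values[-1] + (values[-1] - values[-2])


targets = [KqiTarget("latency_ms", 0, 20), KqiTarget("availability_pct", 99.9, 100)]

# Current measurements are compliant...
current = {"latency_ms": 18, "availability_pct": 99.95}
print(check_compliance(targets, current))

# ...but the latency trend suggests a breach at the next sample.
predicted = {"latency_ms": forecast_next([12, 15, 18]), "availability_pct": 99.95}
print(check_compliance(targets, predicted))
```

Running the same check on forecast values is one simple way to surface a potential threat to a KQI before it becomes an actual breach.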

Data to external parties

Meeting enterprise customer requirements around new services will require telcos to integrate their data and operations across diverse and complex ecosystems. Delivering the right data (expected to be a mix of assurance, inventory and external third-party data) to the right people and processes at the right time will provide new visibility for enterprise customers and other partners. In the future, it will likely also support a range of data feeds and automations that stretch from the telco into these customers and partners.

Self-healing networks

Telcos face challenges in supporting new, more dynamic networks that generate multiple concurrent issues – making the self-healing network concept very attractive.

The journey towards self-healing actually started many years ago. And, as discussed by one research participant, “the sad reality is that many problems can be fixed by just turning off the offending hardware/software and turning it on again”, meaning that many of the first self-healing capabilities seen as far back as 2011 (see 3GPP’s Self-Organizing Network (SON) Release 10) focused on the restarting of equipment after software glitches. The next iteration then used the term ‘self-healing’ to describe issues such as traffic rerouting in the event of a fibre cut.

More recently, self-healing has focussed on architectural builds in the cloud where services fail over automatically to standby hardware and backup links. Indeed, the advent of virtualised networks allows the term to expand from very simple activities (e.g., scheduling the nightly rebooting of a network function) to more complicated closed-loop activities such as adapting in real time to faults or new demands on the network. These require automations to understand the customer or service-level problem and then create resolutions in a particular network domain or across multiple domains.
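A closed loop of this kind can be reduced to three steps: detect a service-level problem, localise it to one or more domains, and act there. The sketch below shows one pass of such a loop under strong simplifying assumptions; the KPI names, the per-domain health flags and the `failover:`/`escalate:` action strings are invented for illustration only.

```python
def diagnose(service_kpis, domain_health):
    """Return the domains suspected of causing a service-level breach."""
    if service_kpis["latency_ms"] <= service_kpis["latency_target_ms"]:
        return None  # service is within target, nothing to heal
    # Hypothetical localisation: any domain reporting itself unhealthy.
    return [domain for domain, healthy in domain_health.items() if not healthy]


def closed_loop_step(service_kpis, domain_health):
    """One pass of the loop: detect, localise, then choose actions."""
    suspects = diagnose(service_kpis, domain_health)
    if suspects is None:
        return []
    if not suspects:
        # Service is degraded but no domain is flagged: hand over to people.
        return ["escalate:noc"]
    return [f"failover:{domain}" for domain in suspects]


actions = closed_loop_step(
    {"latency_ms": 45, "latency_target_ms": 20},
    {"ran": True, "transport": False, "core": True},
)
print(actions)
```

The escalation branch reflects that a loop which cannot localise a fault should defer to human operators rather than act blindly.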

Table of contents

  • Foreword
  • Executive summary
    • Creating a set of solid first steps towards building the intelligence for a self-healing network
    • Other recommended actions
  • Introduction
  • What is intelligence architecture?
    • Centralised intelligence
    • Distributed intelligence
    • Hierarchical intelligence
  • Creating an intelligence architecture
    • Developing hierarchical and distributed intelligence architectures
    • The main components of a MAS
    • The development of assurance capabilities using a MAS architecture
  • Decision-making around development of an intelligence architecture
    • Do we need a MAS?
    • How much distribution of intelligence is needed?
    • Is a knowledge plane necessary?
    • Is a digital twin necessary?
  • Conclusion
  • Appendix
    • The main components of a MAS

Charlotte Patrick

Associate Senior Analyst

Charlotte has 27 years of professional experience in strategy, marketing and finance, most recently at the largest global technology analyst firm and previously at two of the world’s largest global telecommunications companies. She is an electronics graduate and MBA with excellent business analysis, commercial and strategic planning skills.