Carriers Are Scaling Backbones With Merchant Silicon & Disaggregated, Distributed Networking

Jan 9, 2022

This post originally appeared on the Packet Pushers’ Ignition site on August 20, 2021.

For both individuals and businesses, the past 18 months have vastly increased reliance on the Internet to access cloud services, online retail and entertainment venues, and each other via high-definition video conferences. In the period from just before the initial SARS-CoV-2 outbreak through next year, global consumer Internet traffic will increase by about 30 percent annually to nearly triple the level of 2018. Add in vastly more business data, driven by increased usage of cloud services (see my summary and analysis here), and you have the makings of a traffic jam unless Internet backbone providers do some innovative engineering.

Internet backbone operators rarely share details about their engineering strategies and plans, so it was refreshing to see a lengthy, detailed blog earlier this year from Telia Carrier about how the company is Rethinking Internet Backbone Architectures. In it, the Head of Network Engineering and Architecture describes a three-part strategy for network redesign “to keep up with customer demands for more, consistent bandwidth and a high-quality experience” and “accommodate the insatiable bandwidth requirements of 5G.” Telia’s plan, alongside other technological shifts in backbone networking, illustrates how Tier 1 carriers can accommodate traffic that doubles every two and a half years.
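The two growth figures above (roughly 30 percent a year, and a doubling about every two and a half years) are consistent with each other, which a few lines of compound-growth arithmetic can confirm. This is only a back-of-the-envelope sketch: the four-year span from 2018 to 2022 and the assumption of a constant growth rate are mine, not Telia's.

```python
import math


def growth_multiple(annual_rate: float, years: float) -> float:
    """Total growth factor after compounding annual_rate over the given years."""
    return (1.0 + annual_rate) ** years


def doubling_time(annual_rate: float) -> float:
    """Years needed for traffic to double at a constant annual growth rate."""
    return math.log(2.0) / math.log(1.0 + annual_rate)


ANNUAL_GROWTH = 0.30   # "about 30 percent annually", as cited in the article
YEARS_FROM_2018 = 4    # assumption: 2018 baseline compounded through 2022

multiple = growth_multiple(ANNUAL_GROWTH, YEARS_FROM_2018)
years_to_double = doubling_time(ANNUAL_GROWTH)

print(f"Traffic multiple vs. 2018: {multiple:.2f}x")
print(f"Doubling time: {years_to_double:.2f} years")
```

Under those assumptions the multiple comes out a little under 3x and the doubling time a little over two and a half years, matching the "nearly triple" and "every two and a half years" figures.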

Significant Trends In Carrier Networking

Telia Carrier is the second-largest Tier 1 network provider in the world when ranked by its customer cone size, which measures the number of direct and indirect customers by using topological data from BGP routing tables. Thus, Telia provides an instructive proxy for how Internet providers writ large are handling the explosion of traffic crossing between enterprises, homes and mobile users and infrastructure, SaaS or streaming service providers.

Elements of Telia’s strategy shown above will be familiar to Packet Pushers readers since they encapsulate frequent themes from our blogs and podcasts, namely:

1. A shift from proprietary network processing units (NPUs) to merchant switch silicon, like Broadcom’s Jericho series, designed to maximize routing throughput. (Note: Broadcom offers three lines of switch silicon: Jericho for carriers, the more familiar Tomahawk for hyperscale cloud operators and Trident for enterprise data centers.) Telia’s move is motivated by two factors: “performance, as defined by bandwidth, is increasing exponentially” and “the frequency of meaningful performance increases has become higher,” down to a two-year product cadence from four to six years for traditional designs.

Over the past decade, the difference in performance gains between proprietary and merchant silicon has been astounding. The routing bandwidth of legacy NPUs has increased about fourfold since 2011, compared to 90-fold for streamlined designs like Jericho, whose latest incarnation, the Jericho2c+, provides up to 14.4 Tbps of packet throughput. The rapid cadence of merchant silicon updates required Telia to significantly shorten its product validation and procurement cycle and develop a process to redeploy older but still useful core routers to the network periphery.

Telia’s rapid migration to merchant silicon is stunning, with more than 70 percent of its operational capacity now running on chipsets optimized for bandwidth and power efficiency, up from zero just three years ago.

2. Radical simplification of the IP/MPLS architecture by using virtualization and network overlays to decouple network and services architectures and adopting simple, standards-based BGP, EVPN and L3VPN routing protocols for services. According to Telia, such “functional disaggregation allow[s] for greater flexibility and scalability of network functions.” Converging on IP networking improves switching and routing performance through merchant chips designed for IP Ethernet, while separating the network control, data and application planes allows more efficient scaling of capacity and quicker insertion of new network services.

3. Partial disaggregation of optical networks, which splits optics from line systems, increases vendor competition and standardizes alien wavelengths, namely the DWDM wavelengths used to connect Telia’s optical line system (OLS) to another carrier. Much like merchant silicon, the switch to disaggregated transponders like Ciena’s Waveserver yields a faster technology cycle, the lower pricing of a competitive market and “added flexibility in times of supply chain constraints.”

4. Standardized, pluggable DWDM optics via the OIF 400G-ZR coherent optical interface enable interchangeable QSFP-DD modules in switch-router chassis. Telia says that pluggable optics simplify power and capacity planning, lower component inventory, speed circuit installations and deployments and eliminate mismatches between router and transponder hardware.
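To put the silicon figures from the first item in perspective, the cumulative gains (about fourfold for legacy NPUs versus 90-fold for merchant designs) can be converted into implied annual growth rates. The sketch below assumes a ten-year window (2011 to 2021) and smooth compounding; real product generations arrive in steps, so treat the output as an average, not a roadmap.

```python
def implied_annual_growth(total_multiple: float, years: float) -> float:
    """Constant annual growth rate that compounds to total_multiple over years."""
    return total_multiple ** (1.0 / years) - 1.0


YEARS = 10               # assumption: 2011 through 2021
LEGACY_MULTIPLE = 4.0    # "about fourfold", from the article
MERCHANT_MULTIPLE = 90.0 # "90-fold", from the article

legacy_rate = implied_annual_growth(LEGACY_MULTIPLE, YEARS)
merchant_rate = implied_annual_growth(MERCHANT_MULTIPLE, YEARS)

# Gain per product generation if merchant silicon ships on a two-year cadence.
merchant_per_generation = (1.0 + merchant_rate) ** 2

print(f"Legacy NPU bandwidth growth:   {legacy_rate:.0%} per year")
print(f"Merchant silicon growth:       {merchant_rate:.0%} per year")
print(f"Merchant gain per 2-year step: {merchant_per_generation:.1f}x")
```

On these assumptions, legacy NPU bandwidth grew at roughly 15 percent a year while merchant silicon grew at roughly 57 percent, or about two and a half times per two-year generation. Only the faster of the two curves outruns traffic growing at about 30 percent a year.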

Virtual Distributed Routing (VDR)

Virtual distributed routing isn’t a new concept, having been part of OpenStack (Neutron), but there is renewed interest in the idea via its inclusion in open, programmable switch operating systems. As we detailed in a podcast with Arrcus last year, VDR software runs on most popular merchant switch silicon and turns multiple heterogeneous switches into a unified switch fabric with up to 768 Tbps of routing performance.

VDR is critical to Tier 1 carriers like Telia, which face continual increases in capacity demand, since it allows them to scale routing capacity horizontally by adding switch-routers rather than vertically via line cards in a proprietary router chassis. Such a scale-out architecture has long been used to cheaply scale enterprise storage via a collection of distributed storage nodes instead of a large, central storage array. VDR brings similar topological and cost advantages to the world of core routing by replacing forklift upgrades of modular router chassis with a collection of switch-router nodes based on merchant silicon.

The Arrcus VDR system is composed of a distributed Clos data plane with leaf forwarding (LF) and spine fabric (FC) nodes, an underlay connector (UC) providing out-of-band management and telemetry and a distributed control plane cluster (CC) handling routing and security policies. Arrcus claims to have the first publicly available “software-based, virtualized, distributed and massively scalable router” capable of handling hyperscale cloud or carrier workloads.
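The scale-out idea behind a leaf/spine cluster can be illustrated with simple port arithmetic. The sketch below is a hypothetical sizing model, not Arrcus's design or API: the `LeafSpec` class, the 36 x 400G port layout and the even split between external and fabric-facing ports are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LeafSpec:
    """Hypothetical fixed-form-factor switch-router used as a leaf node."""
    port_count: int        # total ports on the box
    port_speed_gbps: int   # speed of each port
    fabric_ports: int      # ports reserved for uplinks to the spine fabric

    @property
    def external_gbps(self) -> int:
        """Capacity left for customer- or network-facing ports."""
        return (self.port_count - self.fabric_ports) * self.port_speed_gbps


def leaves_needed(leaf: LeafSpec, target_tbps: int) -> int:
    """Leaf nodes required to offer target_tbps of external capacity."""
    return math.ceil(target_tbps * 1000 / leaf.external_gbps)


# Assumption: 36 x 400G ports, half of them facing the fabric so that
# fabric bandwidth matches external bandwidth (no oversubscription).
leaf = LeafSpec(port_count=36, port_speed_gbps=400, fabric_ports=18)

for target in (100, 200):
    print(f"{target} Tbps external capacity -> {leaves_needed(leaf, target)} leaf nodes")
```

In this model, doubling the target from 100 Tbps to 200 Tbps means growing the cluster from 14 leaf nodes to 28 rather than replacing a chassis, which is the horizontal scaling the section describes. Spine count, failure headroom and control-plane limits are deliberately left out.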

Enormous demands for additional backbone capacity have forced carriers to redesign their network architecture and equipment strategy to exploit the benefits of standardized components, including merchant Ethernet switch silicon, pluggable optics and extensible, software-defined routing and switching software. Doing so allows them to benefit from the power of competitive markets that have significantly increased the pace of performance improvements and the flexibility of network equipment. As the following slide from Arista’s recent earnings presentation illustrates, devices with 800G interfaces and adaptable software will soon dominate the data center switch market. Carriers like Telia see this same dynamic and have adjusted their strategies, processes and organizational culture accordingly.

Source: Arista Networks, Q2 2021 earnings presentation
