Pluris Massively Parallel Routing

(White Paper)


DISCLAIMER

THIS WHITEPAPER IS OBSOLETE AND DOES NOT REFLECT THE VIEWS OF PLURIS INC. FOR INFORMATION ON THE ACTUAL PLURIS PRODUCT AND ARCHITECTURE, PLEASE REFER TO www.pluris.com.

This disclaimer was added at the request of Sam Halabi on July 11, 1999.


The Problem

Internet traffic is growing exponentially, faster than the performance of semiconductor devices. Barring a miraculous leap in the performance of silicon, it is clear that the Internet will soon need routing devices that simply cannot be built with the available technology. The first signs of the crisis are everywhere, and complaints about the poor service of Internet backbones have reached national news services.

Proposed Solutions

Several established and start-up router vendors have recognized the problem and are trying to solve it by using very high-speed proprietary integrated circuits. However, this only pushes their designs closer to the leading edge of high-frequency integrated-circuit technology; it does not change the fundamental dependence of their devices' performance on progress in silicon performance. Once the slack between the technology used in today's routers and the state of the art in high-speed electronics is used up, the router vendors are doomed to fall behind the Internet's growth curve.

Another approach is to replace native IP routing with a much simpler process, known as cell switching, using a technology called Asynchronous Transfer Mode (ATM). ATM side-steps the problem of the performance of individual devices by allowing traffic to be routed over a large number of virtual circuits traversing different ATM switches.

However, ATM has severe problems, both in design (such as a ridiculously small cell size that makes ATM barely useful for packet traffic) and in fundamental properties. The fundamental problem is that virtual circuit based networks cannot scale; a VC-based network of the size of today's Internet is not technically feasible (see "ATM: Another Technological Mirage" for a detailed discussion).

Recently, a number of "hybrid" approaches (namely Ipsilon's IP Switching and Cisco Systems' Tag Switching) were proposed in an attempt to combine ATM's performance and native IP's scalability.

The first such technology, IP switching, relies on detecting busy paths in native IP traffic and establishing ATM virtual circuits to expedite packet forwarding along those paths. It is easy to see that this approach offers only marginal improvement in scalability over ATM-only networks: it approaches the behavior of ATM networks when the threshold for path establishment is low, and becomes close to a very inefficient form of native IP routing when the threshold is high. This means that IP switching is effectively useless for Internet backbone networks.

Tag switching is based on building, in the ATM switching fabric, unidirectional "circuit trees" corresponding to all possible paths packets would take if native IP routing were performed on the network. This scheme, however, does not allow multiple alternative paths to the same destination. Even worse, tag switching does not work well with aggregated routes, because an aggregating router needs to split the tag-switched stream using native IP routing. Since large-scale aggregation is generally performed at exchange points between backbone networks, the border routers will have to perform native IP routing. However, those routers also have to handle far more traffic than routers or switches inside a backbone. In other words, tag switching does nothing to solve the problem at the places where it is most severe.

The final (and so far, most practical) approach is to create two-level flattened backbones where native IP routing is performed by edge routers connected by a mesh of permanent virtual circuits carried by an ATM network. Although such an approach allows an increase in capacity within a backbone, it does nothing to improve inter-backbone connectivity. Again, the border routers at exchange points have to perform native IP routing.

As the reader can see, no widely known approach allows the building of a network that is able to keep up with the demand.

Solution By Pluris Inc. - Massively Parallel Routing

We are proud to present a simple and elegant solution to the problem, embodied in our patent-pending massively parallel routing technology.

Our approach is based on the observation that although aggregate data rates of Internet traffic are skyrocketing, the bandwidth of individual communication sessions remains relatively small (in fact, it cannot grow faster than the performance of host computers). This means that a high aggregate routing capacity can be achieved by distributing the paths of packets in those connections between a large number of medium-performance routing engines.

A Pluris router is composed of a large number of such routing engines (we call them "processing nodes") communicating with each other via a linearly scalable high-speed data interconnect. Such interconnects are a well-understood technology, commonly used in loosely coupled massively parallel computers.

The processing nodes are connected via low-speed lines to a number of synchronous multiplexers that combine low-speed data streams into high-speed streams on backbone circuits, as shown in the diagram below:

Pluris Router

Every processing node has its own copy of the forwarding table (that table is not large, unlike BGP routing information bases).

Instead of conventional single-step IP routing (i.e. determination of exit interface from destination address), our process performs two steps for each packet: in the first step the exit high-speed communication line is determined, and in the second step one of the low-speed lines corresponding to the exit high-speed line is selected. The packet is then sent through the data interconnect to the processing node corresponding to the selected low-speed communication line.
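The two-step lookup described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the table contents, the trunk and line names, and the `pick` callback are hypothetical, and the paper does not say how the forwarding table is actually organized.

```python
import ipaddress
from typing import Callable, Dict, List, Tuple

# Hypothetical forwarding table: destination prefix -> exit high-speed line.
FORWARDING_TABLE: Dict[str, str] = {
    "10.0.0.0/8": "trunk-east",
    "10.1.0.0/16": "trunk-west",
    "0.0.0.0/0": "trunk-east",
}

# Hypothetical membership: each high-speed line is fed, through a
# multiplexer, by several (low-speed line, processing node) pairs.
TRUNK_MEMBERS: Dict[str, List[Tuple[str, int]]] = {
    "trunk-east": [("east-0", 0), ("east-1", 1), ("east-2", 2)],
    "trunk-west": [("west-0", 3), ("west-1", 4)],
}


def exit_trunk(dst: str) -> str:
    """Step 1: longest-prefix match of the destination address."""
    addr = ipaddress.ip_address(dst)
    best = None
    for prefix, trunk in FORWARDING_TABLE.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, trunk)
    if best is None:
        raise LookupError(f"no route to {dst}")
    return best[1]


def forward(dst: str, pick: Callable[[int], int]) -> Tuple[str, str, int]:
    """Step 2: choose one low-speed line belonging to the exit trunk.

    `pick` maps the number of member lines to an index; the selection
    policy is deliberately left open in this sketch.
    """
    trunk = exit_trunk(dst)
    members = TRUNK_MEMBERS[trunk]
    line, node = members[pick(len(members))]
    return trunk, line, node
```

With these made-up tables, `forward("10.1.2.3", lambda n: 0)` resolves to the more specific `/16` route and returns `("trunk-west", "west-0", 3)`; the packet would then cross the interconnect to processing node 3.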

Obviously, "naive" second-step selection techniques such as random selection and round-robin would cause reordering of packets. Such reordering is unacceptable, because it causes false packet loss detection by the TCP Fast Retransmit algorithm. To avoid this problem, the selection is made by computing a hash function of the packet's source and destination addresses and, optionally, port numbers.

Because the hash is computed from packet fields that are invariant for all packets within a single TCP (or any other transport protocol) session, all those packets are guaranteed to follow the same path, and therefore will not be reordered:

Backbone Routing
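A minimal sketch of the hash-based second step is given here for illustration. The paper does not name the hash function; BLAKE2 from Python's standard library is used purely as a stand-in (a real router would presumably use something much cheaper), and the function names are hypothetical.

```python
import hashlib
from typing import Optional


def flow_hash(src: str, dst: str,
              sport: Optional[int] = None,
              dport: Optional[int] = None) -> int:
    """Hash the fields that stay constant for every packet of a session."""
    key = f"{src}|{dst}|{sport}|{dport}".encode()
    digest = hashlib.blake2b(key, digest_size=8).digest()
    return int.from_bytes(digest, "big")


def select_line(n_lines: int, src: str, dst: str,
                sport: Optional[int] = None,
                dport: Optional[int] = None) -> int:
    """Index of the low-speed line used by this session."""
    return flow_hash(src, dst, sport, dport) % n_lines


# Every packet of one session lands on the same line, so the packets
# of that session cannot overtake each other on parallel lines.
session = ("192.0.2.10", "198.51.100.7", 40000, 80)
lines_used = {select_line(8, *session) for _ in range(1000)}
```

Here `lines_used` contains exactly one index, whereas a round-robin selector fed the same 1000 packets would spread them over all 8 lines.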

It is easy to see that hashing effectively randomizes packet routes, so the load is uniformly distributed between all participating processing nodes and low-speed lines. This, together with the linear scalability of the data interconnect, means that the aggregate capacity of the massively parallel packet router can be increased nearly indefinitely by the simple addition of processing nodes. The only high-speed circuitry is in the synchronous multiplexers, and that circuitry is much simpler and cheaper than hardware implementations of IP routing or ATM switching. In fact, since Pluris routers treat high-speed backbone links as bundles of parallel low-speed circuits, a number of parallel multiplexed high-speed lines (for example, different strands of fiber in a cable, or different wavelength channels) can be combined into a single very high-speed communication line. In other words, the capacity of a network built using massively parallel routers is not limited by the capacity of any physical component.
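The uniformity claim can be checked informally with a small experiment. This is only a sketch: the flows are synthetic integer identifiers, BLAKE2 again stands in for the unspecified hash function, and the counts of 100,000 flows and 16 lines are arbitrary assumptions.

```python
import hashlib
from collections import Counter


def line_for_flow(flow_id: int, n_lines: int) -> int:
    """Map a synthetic flow identifier to a low-speed line by hashing."""
    digest = hashlib.blake2b(flow_id.to_bytes(8, "big"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % n_lines


def load_per_line(n_flows: int, n_lines: int) -> Counter:
    """Count how many flows are assigned to each line."""
    return Counter(line_for_flow(f, n_lines) for f in range(n_flows))


N_FLOWS, N_LINES = 100_000, 16
loads = load_per_line(N_FLOWS, N_LINES)

# Gap between the busiest and the idlest line, relative to the mean load.
spread = (max(loads.values()) - min(loads.values())) / (N_FLOWS / N_LINES)
```

With this many flows, `spread` stays at a few percent of the mean. Note that the experiment counts flows rather than bytes; the two are roughly equivalent only under the paper's premise that individual sessions are small compared with the capacity of a line.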

An interesting property of a massively parallel packet router is that it can be configured to form a number of independent routers interconnected with a very fast "LAN", and thus can be used as a scalable platform for Internet Exchange Points (IXPs) (also known as Network Access Points, NAPs):

Scalable IXP

The groups of processing nodes belonging to different networks can run independent copies of the operating system, thus leaving participants in the IXP in complete control of their routing policies and software.

Technical Description Of Pluris MPR

The Pluris Massively Parallel Router (MPR) is a collection of single-board computers (processing nodes) and a proprietary data interconnect. Each processing node has 16 or more megabytes of DRAM and a 100+ MHz general-purpose microprocessor, sufficient to route IP packets at OC-3c speed (155 Mbps).

Processing Node

The only unusual feature of a processing node is the ring station module, which connects the processing node to the high-speed data interconnect.

The second generation of processing nodes will support at least OC-12 per node, and will be compatible with the first generation nodes (i.e. there will be no need to replace older nodes, and both generations will be able to co-exist in the same machine). The effect of higher per-node performance will mostly be in decreased size of the hardware and better price/performance.

The data interconnect is a patent-pending Self-Healing Butterfly Switch based on 1.2 Gbps serial communication lines:

Butterfly Switch

Unlike the well-known butterfly and Benes switches, the Pluris switch is fault-tolerant (the diagram above does not show secondary links), so the packets are automatically rerouted in case of failures in links or routing elements. The use of radio-frequency serial lines reduces the amount of wiring between card cages.
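For background, the sketch below shows the textbook destination-tag routing used in a plain (non-fault-tolerant) butterfly network. It is not the Pluris design: the paper does not describe the internals of the Self-Healing Butterfly Switch or its secondary links, and the function name and port numbering are assumptions made for illustration.

```python
from typing import List


def butterfly_path(src: int, dst: int, stages: int) -> List[int]:
    """Rows visited when routing src -> dst through a plain butterfly.

    At stage i the packet either stays on its row or crosses to the row
    that differs in bit (stages - 1 - i), whichever matches that bit of
    the destination. After the last stage the row equals dst.
    """
    size = 1 << stages
    if not (0 <= src < size and 0 <= dst < size):
        raise ValueError("port out of range")
    row = src
    path = [row]
    for i in range(stages):
        bit = 1 << (stages - 1 - i)
        row = (row & ~bit) | (dst & bit)
        path.append(row)
    return path
```

For example, `butterfly_path(0b000, 0b101, 3)` visits rows `[0, 4, 4, 5]`. Because a plain butterfly offers exactly one such path per source and destination pair, a single failed link or element cuts off some pairs, which is the weakness the secondary links mentioned above are meant to remove.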

Every card cage holds 16 processing nodes, plus switch circuitry on additional intershelf link boards:

Pluris Box

All boards are hot-swappable, and every card cage has redundant power supplies. When several card cages are interconnected to form a larger system, the wiring is similar to the wiring of a hypercube-based massively parallel computer:

Pluris Box

One or more dedicated processing nodes equipped with 64-256 MB of DRAM run the routing protocols. When several such nodes are used, the output of every protocol engine is broadcast to all forwarding nodes, so operation continues if a protocol engine node fails or is removed. The failure of a forwarding node causes only a reduction in throughput, not an interruption of service.

Pluris MPR is a very high-performance machine, but it is composed entirely of off-the-shelf integrated circuits, making it a low-cost and very reliable device. The maximum capacity of an MPR is limited solely by the maximum length of the coaxial cables interconnecting parts of the machine. The present design can house 16K processing nodes in 64 open racks arranged in 4 rows, for an aggregate routing capacity of 2.4 Tbps (or 7 billion packets per second).
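The quoted figures can be cross-checked with some simple arithmetic. The sketch below rests on a few assumptions that the paper does not state: "16K" is read as 16,384, every node is taken to run at the full OC-3c line rate of 155.52 Mbps, and all nodes are assumed to sit in 16-node card cages.

```python
NODES = 16 * 1024        # "16K" read as 16,384 (assumption)
RACKS = 64
NODES_PER_CAGE = 16      # card cage size given earlier in the paper
OC3C_BPS = 155.52e6      # SONET OC-3c line rate in bits per second
QUOTED_BPS = 2.4e12      # aggregate capacity quoted in the paper
QUOTED_PPS = 7e9         # packet rate quoted in the paper

nodes_per_rack = NODES // RACKS
cages_per_rack = nodes_per_rack // NODES_PER_CAGE

# Raw capacity if every node forwarded at the full line rate.
raw_bps = NODES * OC3C_BPS

# Average packet size implied by the two quoted figures.
implied_packet_bytes = QUOTED_BPS / QUOTED_PPS / 8

print(f"{nodes_per_rack} nodes and {cages_per_rack} cages per rack")
print(f"raw line-rate capacity: {raw_bps / 1e12:.2f} Tbps")
print(f"implied packet size: {implied_packet_bytes:.1f} bytes")
```

Under these assumptions the raw line-rate total comes out slightly above the quoted 2.4 Tbps, and the quoted packet rate corresponds to packets of about 43 bytes, close to the 40-byte minimum of a TCP/IP packet without options. The packets-per-second figure therefore appears to describe a worst-case load of minimum-size packets.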

Although the diagram of a massively parallel router includes multiplexers, in most cases ISPs do not have to purchase them, because telcos already have synchronous multiplexers installed to step down their backbone networks to the DS-3 levels accepted by telephone switches.

Advantages Of Massively Parallel Routers

The first, and most significant, advantage is that Pluris MPR technology is the only technology known today that makes building a global terabit-per-second network possible. Other technologies either cannot reach such speeds (conventional IP routing) or cannot scale to truly global networks (ATM).

It is easy to migrate existing points of presence to MPRs. In the first phase, an MPR is installed in place of the LAN switch and the backbone routers:

Pluris Deployment, Phase 1

After the MPRs are deployed, future customer access connections may be fanned out from OC-3c trunks with cheap low-end ATM switches, routers, and xDSL access racks:

Pluris Deployment, Phase 2

This allows ISPs to select the cheapest customer access technology. Unlike dedicated backbone routers, Pluris MPR can be configured to service thousands of such fan-out devices, making it ideal not only for backbone switch sites, but also for central office installations.

The inverse-multiplexing-like operation of the MPR makes it an ideal match for the new Wavelength Division Multiplexers, which separate the bandwidth of a fiber into many independent channels.

A backbone site built from conventional routers must have at least two of them to achieve redundant operation. This means that the number of hops in the network is increased by the intra-POP hops over the cluster LAN, which increases the variance of network latency. Those routers also participate in backbone routing as separate devices, which increases the quantity of routing information that every router has to process and so lengthens the convergence time of routing protocols. The highly redundant design of the MPR eliminates any need for more than one router per POP, making convergence times below 0.1 second realistic and thus eliminating any need for link-layer redundancy. This will allow ISPs to load existing hot-spare fibers with user traffic, effectively reducing the cost of transmission by nearly 50%.

Unlike hardware-assisted IP routers, MPR routing engines are completely programmable, and therefore routers will not need any hardware upgrades to support new protocols. Pluris plans to leverage the flexibility to improve traffic management by using techniques not possible with hardware-assisted native IP routers. Conventional high-end routers always have a "fast" and a "slow" forwarding path, because complete implementation of IP routing in hardware is not feasible; therefore an "unfortunate" traffic pattern can easily overwhelm the "slow" forwarding path in those routers. The performance of purely software IP routing used in MPR is always consistent and does not depend on traffic patterns.

The programmability also means that an MPR machine can eventually be equipped with additional processing nodes that interface with mass-storage devices and are programmed to provide services such as Web hosting and video-on-demand, or to act as large-scale caching proxy servers. This eliminates the communication bottleneck between servers and the backbone.

Essentially, Pluris massively-parallel router technology is future-proof, allowing ISP operators to start building infrastructure capable of growing and adapting to the new requirements far better than any other known networking technology.

Cisco and "bridge" logo are registered trademarks of Cisco Systems.