Pluris Massively Parallel Routing
(White Paper)
THIS WHITEPAPER IS OBSOLETE AND DOES NOT REFLECT THE VIEWS OF PLURIS INC, FOR INFORMATION ON THE ACTUAL PLURIS PRODUCT AND ARCHITECTURE PLEASE REFER TO www.pluris.com.
This disclaimer was added at the request of Sam Halabi on July 11, 1999.
Internet traffic is growing exponentially, faster than the performance of semiconductor devices.
Barring a miraculous leap in the performance of silicon, it is clear
that the Internet will soon need routing devices that simply
cannot be built with the available technology.
The first signs of the crisis are everywhere, and the complaints
about the poor service of Internet backbones have reached national news
services.
Several established and start-up router vendors have recognized the
problem and are trying to solve it by using very high speed
proprietary integrated circuits.
However, this only pushes their designs closer to the leading
edge of high-frequency integrated-circuit technology; it
does not change the fundamental correspondence between progress
in silicon performance and progress in the performance of their
devices.
When the slack between the technology used in today's routers and
the state of the art in high-speed electronics is eliminated,
the router vendors are doomed to fall behind the Internet's growth curve.
Another approach is to replace native IP routing with a much
simpler process, known as cell switching, using technology
called Asynchronous Transfer Mode (ATM).
ATM side-steps the problem of the performance of individual
devices by allowing traffic to be routed over a large number of
virtual circuits traversing different ATM switches.
However, ATM has severe
problems, both in design (such as a ridiculously small cell size that
makes ATM barely useful for packet traffic) and in fundamental
properties.
The fundamental problem is that virtual circuit based networks
cannot scale; a VC-based network of the size of today's Internet is
not technically feasible (see
"ATM: Another Technological Mirage"
for a detailed discussion).
Recently, a number of "hybrid" approaches (namely Ipsilon's
IP Switching and Cisco Systems' Tag Switching)
were proposed in an attempt to combine ATM's performance and
native IP's scalability.
The first such technology, IP switching, relies on detecting
busy paths in native IP traffic and establishing ATM
virtual circuits to expedite packet forwarding along those paths.
It is easy to see that this approach offers only marginal improvement
in scalability over ATM-only networks, as it approaches the behavior of
ATM networks when the threshold of path establishment is low and becomes
close to a very inefficient form of native IP routing when the threshold
is high.
This means that IP switching is effectively useless for Internet backbone
networks.
Tag switching is based on building unidirectional "circuit trees" in the
ATM switching fabric, corresponding to all possible paths packets would
take if native IP routing were performed on the network.
This scheme, however, does not allow multiple alternative paths to
the same destination.
Even worse, tag switching does not work well with aggregated routes,
because an aggregating router needs to split the tag-switched stream
using native IP routing.
Since large-scale aggregation is generally performed at exchange points
between backbone networks, the border routers will have to perform
native IP routing.
However, those routers also have to handle far more
traffic than routers or switches inside a backbone.
In other words, tag switching does nothing to solve the problem at
the places where it is most severe.
The final (and so far, most practical) approach is to create two-level
flattened backbones where native IP routing is performed by edge routers
connected by a mesh of permanent virtual circuits carried by an ATM
network.
Although such an approach allows an increase in capacity within a backbone, it
does nothing to improve inter-backbone connectivity.
Again, the border routers at exchange points have to perform native IP
routing.
As the reader can see, no widely known approach allows the building of a network
that is able to keep up with the demand.
We are proud to present a simple and elegant solution to the problem,
embodied in our patent-pending massively parallel routing technology.
Our approach is based on the observation that although aggregate data rates
of Internet traffic are skyrocketing, the bandwidth of individual
communication sessions remains relatively small (in fact, it cannot grow
faster than the performance of host computers).
This means that a high aggregate routing capacity can be achieved
by distributing the paths of packets in those connections between
a large number of medium-performance routing engines.
A Pluris router is composed of a large number of such routing engines
(we call them "processing nodes") communicating with each other via a
linearly scalable high-speed data interconnect.
Such interconnects are a well-understood technology,
commonly used in loosely coupled massively parallel computers.
The processing nodes are connected via low-speed lines to a number of
synchronous multiplexers that combine low-speed data streams into
high-speed streams on backbone circuits, as shown in the diagram below:
Every processing node has its own copy of the forwarding table
(that table is not large, unlike BGP routing information bases).
Instead of conventional single-step IP routing (i.e. determination of
exit interface from destination address), our process performs two
steps for each packet: in the first step the exit high-speed
communication line is determined, and in the second step one of the
low-speed lines corresponding to the exit high-speed line is selected.
The packet is then sent through the data interconnect to the
processing node corresponding to the selected low-speed communication line.
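The two-step process described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Pluris implementation: the table contents, the line names such as "hs-east", the node numbers, and the function names are all hypothetical, and the second-step selection rule is passed in as a parameter so that the structure of the lookup stays separate from the choice of selection technique.

```python
import ipaddress
from typing import Callable, Dict, List, Tuple

# Hypothetical forwarding state; every name and value here is illustrative.
# Step-1 table: destination prefix -> exit high-speed line.
FORWARDING_TABLE: Dict[str, str] = {
    "10.0.0.0/8": "hs-east",
    "10.1.0.0/16": "hs-west",
    "0.0.0.0/0": "hs-east",
}

# Step-2 table: exit high-speed line -> (low-speed line, processing node).
LOW_SPEED_LINES: Dict[str, List[Tuple[str, int]]] = {
    "hs-east": [("east-0", 0), ("east-1", 1), ("east-2", 2)],
    "hs-west": [("west-0", 3), ("west-1", 4)],
}


def exit_high_speed_line(dst: str) -> str:
    """Step 1: longest-prefix match on the destination address."""
    addr = ipaddress.ip_address(dst)
    best = None  # (prefix length, high-speed line) of the best match so far
    for prefix, line in FORWARDING_TABLE.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, line)
    if best is None:
        raise LookupError(f"no route to {dst}")
    return best[1]


def forward(dst: str, select: Callable[[int], int]) -> Tuple[str, str, int]:
    """Run both steps and return (high-speed line, low-speed line, node).

    `select` receives the number of candidate low-speed lines and returns
    the index of the one to use (step 2).
    """
    hs_line = exit_high_speed_line(dst)
    candidates = LOW_SPEED_LINES[hs_line]
    low_line, node = candidates[select(len(candidates))]
    return hs_line, low_line, node
```

For example, `forward("10.1.2.3", lambda n: 0)` matches the /16 prefix and yields `("hs-west", "west-0", 3)`; the packet would then be handed through the interconnect to node 3.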
Obviously, the "naive" second-step selection techniques such
as random selection and round-robin would cause reordering of
packets.
Such reordering is unacceptable, because it causes false
packet-loss detection by the TCP Fast Retransmit algorithm.
To alleviate this problem, the selection is made by computing
a hash function from the packet's source and destination addresses
and, optionally, port numbers.
Because the hash is computed from packet fields that
are invariant for all packets within a single TCP (or any other
transport protocol) session, all of those packets are guaranteed
to follow the same path, and therefore will not be reordered.
It is easy to see that hashing effectively randomizes packet routes,
so the load is uniformly distributed between all participating
processing nodes and low-speed lines.
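A minimal sketch of hash-based selection follows, assuming SHA-256 as the hash function; the paper does not name a specific hash, and the function names and synthetic addresses are hypothetical. It shows the two properties claimed above: packets of one session always map to the same low-speed line, and many distinct sessions spread roughly evenly across the lines.

```python
import hashlib
from collections import Counter
from typing import Optional


def select_low_speed_line(src: str, dst: str, n_lines: int,
                          sport: Optional[int] = None,
                          dport: Optional[int] = None) -> int:
    """Return the index of the low-speed line for a packet.

    Only fields that are identical for every packet of a session are
    hashed, so the whole session maps to one line.
    SHA-256 is an assumption; the source does not name the hash function.
    """
    key = f"{src}|{dst}|{sport}|{dport}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % n_lines


def line_load(n_flows: int, n_lines: int) -> Counter:
    """Count how many synthetic flows land on each low-speed line."""
    load: Counter = Counter()
    for i in range(n_flows):
        src = f"10.0.{i // 256}.{i % 256}"  # synthetic source addresses
        line = select_low_speed_line(src, "198.51.100.7", n_lines,
                                     1024 + i % 5000, 80)
        load[line] += 1
    return load
```

Calling `line_load(16000, 16)` gives a per-line count that can be compared with the ideal of 1000 flows per line. Note that the balance is statistical: it holds when there are many sessions of modest bandwidth, which is the observation the approach is built on.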
This, together with linear scalability of the data interconnect, means that
the aggregate capacity of the massively parallel packet router can be
increased nearly indefinitely by the simple addition of processing nodes.
The only high-speed circuitry is in the synchronous multiplexers, and
that circuitry is much simpler and cheaper than hardware implementations
of IP routing or ATM switching.
In fact, since Pluris routers treat high-speed backbone links as
quantities of parallel low-speed circuits, a number of parallel
multiplexed high-speed lines (for example, different strands of fiber
in a cable, or different wavelength channels) can be combined into a single very
high-speed communication line.
In other words, the capacity of a network built using massively
parallel routers is not limited by the capacity of any physical component.
An interesting property of a massively parallel packet router is that
it can be configured to form a number of independent routers interconnected
with a very fast "LAN", and thus can be used as a scalable platform for
Internet Exchange Points (IXPs) (also known as Network Access Points, NAPs):
The groups of processing nodes belonging to different networks can
run independent copies of the operating system, thus leaving participants
in the IXP in complete control of their routing policies and software.
The Pluris Massively Parallel Router (MPR) is a collection of single-board
computers (processing nodes) and a proprietary data interconnect.
Each processing node has 16 or more megabytes of DRAM and a 100+ MHz
general-purpose microprocessor, sufficient to route IP packets at
OC-3c speed (155 Mbps).
The only unusual feature of a processing node is the ring
station module, which connects the processing node to the high-speed
data interconnect.
The second generation of processing nodes will support at least OC-12 per
node, and will be compatible with the first generation nodes (i.e. there
will be no need to replace older nodes, and both generations will be
able to co-exist in the same machine).
The effect of higher per-node performance will mostly be in decreased
size of the hardware and better price/performance.
The data interconnect is a patent-pending Self-Healing Butterfly
Switch based on 1.2Gbps serial communication lines:
Unlike the well-known butterfly and Benes switches, the Pluris switch
is fault-tolerant (the diagram above does not show secondary links),
so the packets are automatically rerouted in
case of failures in links or routing elements.
The use of radio-frequency serial lines reduces the amount of wiring
between card cages.
Every card cage has 16 processing nodes and switch circuitry in
additional intershelf link boards:
All boards are hot-swappable, and every card cage has redundant
power supplies.
When several card cages are interconnected to form a larger system,
the wiring is similar to the wiring of a hypercube-based massively parallel
computer:
One or several dedicated processing nodes equipped with 64-256 MB of DRAM
run the routing protocols.
When several such nodes are used, the output of every protocol engine is
broadcast to all forwarding nodes, so if a protocol engine node
fails or is removed, operation continues.
The failure of a forwarding node only causes reduction of throughput,
but not interruption of service.
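One way to picture this graceful degradation is to hash each session over the set of nodes that are still alive. The sketch below is an assumption made for illustration; the paper does not describe how traffic is redistributed after a failure, and the function names and the use of SHA-256 are hypothetical.

```python
import hashlib
from typing import Iterable, List


def flow_hash(src: str, dst: str) -> int:
    """Hash of the session-invariant fields (SHA-256 is an assumption)."""
    digest = hashlib.sha256(f"{src}|{dst}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def pick_node(src: str, dst: str, nodes: List[int],
              failed: Iterable[int] = ()) -> int:
    """Choose a forwarding node for a session, skipping failed nodes."""
    down = set(failed)
    live = [n for n in nodes if n not in down]
    if not live:
        raise RuntimeError("no forwarding nodes available")
    return live[flow_hash(src, dst) % len(live)]


def remaining_capacity_fraction(total_nodes: int, failed_nodes: int) -> float:
    """Fraction of forwarding capacity left after some nodes fail."""
    return (total_nodes - failed_nodes) / total_nodes
```

With 16 nodes and 2 failures, `remaining_capacity_fraction(16, 2)` is 0.875: service continues at seven eighths of the original throughput. A plain modulo over the live set remaps many sessions when membership changes, so a real design would probably use a scheme that moves fewer sessions; that refinement is outside the scope of this sketch.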
Pluris MPR is a very high-performance machine, but it is
composed entirely of off-the-shelf integrated circuits, making it
a low-cost and very reliable device.
The maximal capacity of MPR is limited solely by the maximal
length of coaxial cables interconnecting parts of the machine.
The present design is capable of housing 16K processing nodes in
64 open racks arranged in 4 rows, to achieve the
aggregate routing capacity of 2.4 Tbps (or 7 billion packets per second).
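The quoted figures can be cross-checked with simple arithmetic. The sketch below uses only numbers stated in this paper plus the standard OC-3c line rate of 155.52 Mbps; the even spread of card cages across racks and of racks across rows is an assumption, as are the function and key names.

```python
NODES = 16 * 1024               # "16K processing nodes"
RACKS = 64                      # open racks
ROWS = 4                        # rows of racks
NODES_PER_CAGE = 16             # processing nodes per card cage
OC3C_LINE_RATE_BPS = 155.52e6   # standard OC-3c line rate, one per node
QUOTED_CAPACITY_BPS = 2.4e12    # aggregate capacity quoted in the text
QUOTED_PPS = 7e9                # packets per second quoted in the text


def summarize() -> dict:
    """Derive layout and throughput figures from the quoted numbers."""
    cages = NODES // NODES_PER_CAGE
    return {
        "card_cages": cages,
        # Assumes cages and racks are spread evenly.
        "cages_per_rack": cages // RACKS,
        "racks_per_row": RACKS // ROWS,
        # Upper bound: every node forwarding at full OC-3c line rate.
        "aggregate_tbps": NODES * OC3C_LINE_RATE_BPS / 1e12,
        # Average packet size implied by the two quoted figures.
        "implied_packet_bytes": QUOTED_CAPACITY_BPS / QUOTED_PPS / 8,
        "pps_per_node": QUOTED_PPS / NODES,
    }
```

The line-rate product comes to about 2.55 Tbps, in line with the quoted 2.4 Tbps once framing overhead is allowed for. The two quoted figures together imply packets of roughly 43 bytes, so the packets-per-second number appears to describe near-minimum-size packets, at about 427,000 packets per second per node.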
Although the diagram of a massively parallel router includes multiplexers,
in most cases ISPs do not have to purchase them, because telcos
already have synchronous multiplexers installed to step down their backbone
networks to the DS-3 levels accepted by telephone switches.
The first, and most significant, advantage is that Pluris MPR technology
is the only technology known today that makes building a global
terabit-per-second network possible.
Other technologies either cannot achieve high speeds (conventional
IP routing) or cannot scale to truly global networks (ATM).
It is easy to migrate existing points of presence to the use of MPRs.
In the first phase, an MPR is installed instead of the LAN switch
and the backbone routers:
After the MPRs are deployed, future customer access connections may be
fanned out from OC-3c trunks with cheap low-end ATM switches, routers,
and xDSL access racks:
This allows ISPs to select the cheapest customer access technology.
Unlike dedicated backbone routers, Pluris MPR can be configured
to service thousands of such fan-out devices, making it ideal
not only for backbone switch sites, but also for central office
installations.
The inverse-multiplexing-like operation of the MPR makes it an ideal
match for the new wavelength-division multiplexers, which separate
bandwidth on fiber into many independent channels.
Any backbone site must have at least two conventional backbone routers
to achieve redundant operation.
That means that the number of hops in the network is increased by
the intra-POP hops over the cluster LAN, thus increasing variance
of network latency.
Also, those routers participate in backbone routing as separate
devices, thus increasing the quantity of routing information that has
to be processed by every router, and so increasing convergence time
of routing protocols.
The highly redundant design of the MPR eliminates any need for more
than one router per POP, making convergence times under 0.1
second realistic and thus eliminating any need for link-layer
redundancy.
This will allow ISPs to load existing hot-spare fibers with user
traffic, effectively reducing the cost of transmission by nearly 50%.
Unlike hardware-assisted IP routers, MPR routing engines are completely
programmable, and therefore routers will not need any hardware upgrades
to support new protocols.
Pluris plans to leverage the flexibility to improve traffic management
by using techniques not possible with hardware-assisted native IP routers.
Conventional high-end routers always have a "fast" and a "slow" forwarding path,
because complete implementation of IP routing in hardware is not feasible;
therefore an "unfortunate" traffic pattern can easily overwhelm the
"slow" forwarding path in those routers.
The performance of purely software IP routing used in MPR is always
consistent and does not depend on traffic patterns.
The programmability also means that an MPR machine can eventually be
equipped with additional processing nodes interfacing with mass-storage devices and
programmed to provide services such as Web hosting and video-on-demand, or
to act as large-scale caching proxy servers.
This eliminates the communication bottleneck between servers
and the backbone.
Essentially, Pluris massively-parallel router technology is future-proof,
allowing ISP operators to start building infrastructure capable of
growing and adapting to the new requirements far better than any other
known networking technology.
Cisco and "bridge" logo are registered trademarks of Cisco Systems.