
Sharif Torpis of PAC*BELL volunteered his notes from this talk. They are included here.
This talk will be an operational perspective on what is important to measure and can be easily measured.
IfOutDiscards is normally a very good indicator of congestion.
IfInOctets and IfOutOctets give reasonable link utilization.
Packet counters are a good indicator of CPU utilization and do not require using enterprise MIBs.
IfIndex, IfDescr and ipAdEntIfIndex are good for mapping interfaces to human-readable reports and topology information. Driving the reports from the IP address is much more consistent. Using ipAdEntIfIndex will give you the address data you need.
Someone asks if ifSpeed causes a problem. Daniel acknowledges that this is a problem. Often the customer database and the actual configuration differ (especially if the installation engineer sets the value manually).
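As a rough illustration of the octet-counter point, the sketch below turns two successive samples of a 32-bit octet counter (ifInOctets or ifOutOctets in MIB-II) into a utilization figure. The function names, the sample values and the T1 line rate are assumptions made for the example; they are not from the talk.

```python
COUNTER32_MODULUS = 2 ** 32  # MIB-II octet counters are 32 bits wide


def counter_delta(previous, current, modulus=COUNTER32_MODULUS):
    """Difference between two counter samples, allowing for a single wrap."""
    return (current - previous) % modulus


def link_utilization(prev_octets, curr_octets, interval_seconds, link_bps):
    """Fraction of link capacity used in one direction over the interval."""
    bits = counter_delta(prev_octets, curr_octets) * 8
    return bits / (interval_seconds * link_bps)


# Hypothetical samples from a T1 (1,544,000 bit/s) polled 300 seconds apart.
utilization = link_utilization(1_000_000, 30_000_000, 300, 1_544_000)
print(f"{utilization:.1%}")
```

The modulo only recovers one wrap. On a fast link a 32-bit octet counter can wrap more than once between polls, in which case the delta is not recoverable from two samples.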
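One possible way to drive a report from the IP address is to join the address table (ipAdEntIfIndex in MIB-II, which maps each address to an ifIndex) against ifDescr. The sketch below assumes both tables have already been polled into plain dictionaries; the function name and the sample data are hypothetical.

```python
def interfaces_by_address(ip_ad_ent_if_index, if_descr):
    """Map each IP address to (ifIndex, ifDescr).

    ip_ad_ent_if_index: {ip_address: ifIndex}, as polled from the address table
    if_descr:           {ifIndex: description}, as polled from the interface table
    """
    report = {}
    for address, if_index in ip_ad_ent_if_index.items():
        # An address may point at an ifIndex we did not manage to poll.
        report[address] = (if_index, if_descr.get(if_index, "unknown"))
    return report


# Hypothetical polled values (documentation address range).
addresses = {"192.0.2.1": 1, "192.0.2.5": 3}
descriptions = {1: "Ethernet0", 3: "Serial1"}
for address, (if_index, descr) in sorted(interfaces_by_address(addresses, descriptions).items()):
    print(address, if_index, descr)
```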
Correlation across counters is very useful. For example, correlating IfOutDiscards and IfOutPackets will help point up performance issues.
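A minimal sketch of this kind of correlation: compare the change in output discards with the change in output packets over the same polling interval and flag interfaces where the ratio is high. The 1% threshold, the function names and the sample figures are assumptions for illustration only.

```python
def discard_ratio(discards_delta, packets_delta):
    """Output discards per output packet over one polling interval."""
    if packets_delta == 0:
        return 0.0  # no traffic in the interval, nothing to compare against
    return discards_delta / packets_delta


def flag_congested(samples, threshold=0.01):
    """Return names of interfaces whose discard ratio exceeds the threshold.

    samples: iterable of (interface_name, discards_delta, packets_delta)
    """
    return [
        name
        for name, discards, packets in samples
        if discard_ratio(discards, packets) > threshold
    ]


# Hypothetical per-interval deltas for three interfaces.
print(flag_congested([("Serial0", 500, 20_000), ("Serial1", 2, 50_000), ("Ethernet0", 0, 0)]))
```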
Why these variables? MIB-II is available in almost anything that does SNMP.
ANS uses 5-minute intervals; however, 15 minutes is probably OK. Polling more frequently is probably not possible on large networks and generally won’t provide much more insight.
There are some useful variables in the enterprise MIBs -- CPU utilization, queue loss information and so on.
RTT can be an indicator of standing queue and congestion, not just distance.
Keeping RTT data long term is useful for trend analysis and correlation against other data. Changes in the RTT are usually an indication of configuration changes or changes in congestion levels.
The tools that ANS uses are ICMP-based and run on POP hosts (not routers).
Some applications are sensitive to variance in transit time. Video and audio applications are particularly sensitive. At ANS, this kind of measurement is typically done in cooperation with a customer. ANS uses ICMP, but there may be other good methods using RMON probes.
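One simple way to use stored RTT data, sketched under assumptions: compare the median of a recent window against the median of the longer history and report the shift if it is large enough to matter. The 5 ms threshold and the function name are hypothetical, and this is not a description of the ANS tools.

```python
from statistics import median


def rtt_shift(history_ms, recent_ms, min_change_ms=5.0):
    """Return the change in median RTT if it is at least min_change_ms, else None."""
    change = median(recent_ms) - median(history_ms)
    return change if abs(change) >= min_change_ms else None


# Hypothetical samples: the path got about 15 ms slower.
print(rtt_shift([20, 21, 19, 20], [35, 36, 34, 35]))
```

The median is used rather than the mean so that a few lost or delayed probes do not look like a path change.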
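Variance in transit time can be summarized from the same RTT samples. The sketch below reports two assumed summary figures: the standard deviation of the samples and the mean change between consecutive samples. Neither is claimed to be the measure ANS uses with its customers.

```python
from statistics import mean, pstdev


def transit_variation(rtts_ms):
    """Summarize RTT variation for a sequence of at least two samples."""
    # Absolute change from each sample to the next one.
    steps = [abs(later - earlier) for earlier, later in zip(rtts_ms, rtts_ms[1:])]
    return {"stdev_ms": pstdev(rtts_ms), "mean_step_ms": mean(steps)}


# Hypothetical samples alternating between 10 ms and 20 ms.
print(transit_variation([10, 20, 10, 20]))
```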
Packet loss provides a critical view of performance -- the end user will feel it. Correlation can lead to the reasons for packet loss. The ideal end result will be the ability to predict this before it happens and act to prevent it.
Polling all devices in less than 60 seconds (for 10,000 devices) is a design criterion
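To see what this criterion implies, a back-of-the-envelope calculation helps: 10,000 devices in 60 seconds is about 167 devices per second, so the poller has to keep many requests in flight at once. The sketch below assumes a hypothetical average of 0.5 seconds per device (request, response and any retry); that figure is not from the talk.

```python
import math


def required_concurrency(devices, seconds_per_device, window_seconds):
    """Minimum number of polls that must be in flight to finish within the window."""
    return math.ceil(devices * seconds_per_device / window_seconds)


# 10,000 devices, an assumed 0.5 s per device, 60 s window.
print(required_concurrency(10_000, 0.5, 60))
```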
How often is one edge of your network unreachable from the other?
Track circuit errors -- degrading circuits can be caught before they go down hard.
Set thresholds with your circuit provider and live by them and make the carrier live by them.
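A sketch of how an agreed threshold might be applied to polled error data: flag a circuit when its error rate has exceeded the threshold for several polling intervals in a row, which is one way to catch a circuit that is degrading rather than reacting to a single bad sample. The threshold value, the choice of three intervals and the function name are all assumptions.

```python
def degrading(error_rates, threshold, consecutive=3):
    """True if the last `consecutive` polling intervals all exceeded the threshold.

    error_rates: per-interval error rates, oldest first (for example errored
    packets divided by total packets in each interval).
    """
    recent = error_rates[-consecutive:]
    return len(recent) == consecutive and all(rate > threshold for rate in recent)


# Hypothetical error rates climbing over five polling intervals.
print(degrading([0.0, 1e-6, 2e-5, 3e-5, 5e-5], threshold=1e-5))
```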
Don't buy hardware that you can't query
What does ANS use? A large mix.
There are gaps in the standards. LQM on high speed interfaces is needed (LQM without PPP). LQM on ATM VCs is needed.
Router logs are very useful when doing a post-mortem; however, log formats are very vendor specific. Log information can also be lost because of router problems.
Most traps are vendor specific, and traps can be lost just like syslogs.
Daniel has presented information at the previous two NANOGs.
Problems with source-to-destination data: it is difficult to collect. RMON-2 can help, but doesn't address lots of IP-level issues. If it had routing information, it would be much better.
Route changes are good to measure as well. The Merit tools can help there.
Lab measurements -- Route convergence, flat-out forwarding performance
Misc correlation -- NTP is very useful, use it. Don't use RR option when doing RTT measurements. Polling a very large network is taxing. Most off-the-shelf NMSs don't have good pollers. Save data and think about the data before digging in.
Work with your customers on designing useful measurements. If they can't get what they need, they may go elsewhere.
Bill suggests that Daniel and he get together and write up a set of recommendations for basic measurements.
       abq   alt   arc
abq     -     n     n
alt     n     -     n
arc     n     n     -