Chapter 38, "Accounting System," teaches about the UNIX accounting system, and the tools that the accounting system provides. Some of these utilities and reports give you information about system utilization and performance. In Chapter 18,
"What Is a Process," you learned that the sadc command, in combination with the shell scripts sa1 and sa2, enables you to automatically collect activity data. These automatic reports can create a documented history of how the system behaves,
which can be a valuable reference in times of performance problems. Requesting similar reports in an ad hoc manner is demonstrated in this chapter, as this method is usually most appropriate when investigating performance problems that are in progress.
In this portion of the guide, you will learn all about performance monitoring. There are a series of commands that enable system administrators, programmers, and users to examine each of the resources that a UNIX system uses. By examining these resources
you can determine if the system is operating properly or poorly. More important than the commands themselves, you will also learn strategies and procedures that can be used to search for performance problems. Armed with both the commands and the overall
methodologies with which to use them, you will understand the factors that are affecting system performance, and what can be done to optimize them so that the system performs at its best.
Although this chapter is helpful for users, it is particularly directed at new system administrators that are actively involved in keeping the system they depend on healthy, or trying to diagnose what has caused its performance to deteriorate.
This chapter introduces several new tools to use in your system investigations and revisits several commands that were introduced in Chapter 19, "Administrative Processes."
The sequence of the chapter is not based on particular commands. It is instead based on the steps and the strategies that you will use during your performance investigations. In other words, the chapter is organized to mirror the logical progression
that a system administrator uses to determine the state of the overall system and the status of each of its subsystems.
You will frequently start your investigations by quickly looking at the overall state of the system load, as described in the section "Monitoring the Overall System Status." To do this you see how the commands uptime and sar can be used to
examine the system load and the general level of Central Processing Unit (CPU) loading. You also see how tools such as SunOS's perfmeter can be helpful in gaining a graphic, high-level view of several components at once.
Next, in the section "Monitoring Processes with ps," you learn how ps can be used to determine the characteristics of the processes that are running on your system. This is a natural next step after you have determined that the overall system
status reflects a heavier-than-normal loading. You will learn how to use ps to look for processes that are consuming inordinate amounts of resources and the steps to take after you have located them.
After you have looked at the snapshot of system utilization that ps gives you, you may well have questions about how to use the memory or disk subsystems. So, in the next section, "Monitoring Memory Utilization," you learn how to monitor
memory performance with tools such as vmstat and sar, and how to detect when paging and swapping have become excessive (thus indicating that memory must be added to the system).
In the section "Monitoring Disk Subsystem Performance," you see how tools such as iostat, sar, and df can be used to monitor disk Input/Output (I/O) performance. You will see how to determine when your disk subsystem is unbalanced and what to
do to alleviate disk performance problems.
After the section on disk I/O performance is a related section on network performance. (It is related to the disk I/O discussion because of the prevalent use of networks to provide extensions of local disk service through such facilities as NFS.) Here
you learn to use netstat, nfsstat, and spray to determine the condition of your network.
This is followed by a brief discussion of CPU performance monitoring, and finally a section on kernel tuning. In this final section, you will learn about the underlying tables that reside within the UNIX operating system and how they can be tuned to
customize your system's UNIX kernel and optimize its use of resources.
You have seen before in this guide that the diversity of UNIX systems make it important to check each vendor's documentation for specific details about their particular implementation. The same thing applies here as well. Furthermore, modern developments
such as symmetric multiprocessor support and relational databases add new characteristics and problems to the challenge of performance monitoring. These are touched on briefly in the discussions that follow.
Before you get into the technical side of UNIX performance monitoring, there are a few guidelines that can help system administrators avoid performance problems and maximize their overall effectiveness.
All too typically, the UNIX system administrator learns about performance when there is a critical problem with the system. Perhaps the system is taking too long to process jobs or is far behind on the number of jobs that it normally processes. Perhaps
the response times for users have deteriorated to the point where users are becoming distracted and unproductive (which is a polite way of saying frustrated and angry!). In any case, if the system isn't actually failing to help its users attain their
particular goals, it is at least failing to meet their expectations.
It may seem obvious that when user productivity is being affected, money and time, and sometimes a great deal of both, are being lost. Simple measurements of the amount of time lost can often provide the cost justification for upgrades to the system. In
this chapter you learn how to identify which components of the system are the best candidates for such an upgrade. (If you think people were unhappy to begin with, try talking to them after an expensive upgrade has produced no discernible improvement in
Often, it is only when users begin complaining that people begin to examine the variables that are affecting performance. This in itself is somewhat of a problem. The system administrator should have a thorough understanding of the activities on the
system before users are affected by a crisis. He should know the characteristics of each group of users on the system. This includes the type of work that they submit while they are present during the day, as well as the jobs that are to be processed
during the evening. What is the size of the CPU requirement, the I/O requirement, and the memory requirement of the most frequently occurring and/or the most important jobs? What impact do these jobs have on the networks connected to the machine? Also
important is the time-sensitivity of the jobs, the classic example being payrolls that must be completed by a given time and date.
These profiles of system activity and user requirements can help the system administrator acquire a holistic understanding of the activity on the system. That knowledge will not only be of assistance if there is a sudden crisis in performance, but also
if there is a gradual erosion of it. Conversely, if the system administrator has not compiled a profile of his various user groups, and examined the underlying loads that they impose on the system, he will be at a serious disadvantage in an emergency when
it comes to figuring out where all the CPU cycles, or memory, have gone. This chapter examines the tools that can be used to gain this knowledge, and demonstrates their value.
Finally, although all users may have been created equal, the work of some users inevitably will have more impact on corporate profitability than the work of other users. Perhaps, given UNIX's academic heritage, running the system in a completely
democratic manner should be the goal of the system administrator. However, the system administrator will sooner or later find out, either politely or painfully, who the most important and the most influential groups are. This set of characteristics should
also somehow be factored into the user profiles the system administrator develops before the onset of crises, which by their nature obscure the reasoning process of all involved.
While the system is running, UNIX maintains several counters to keep track of critical system resources. The relevant resources that are tracked are the following:
Disk I/O activity
Tape I/O activity
System call activity
Context switching activity
File access utilization
Interprocess communication (IPC)
Free memory and swap space
Kernel memory allocation (KMA)
Remote file sharing (RFS)
By looking at reports based on these counters you can determine how the three major subsystems are performing. These subsystems are the following:
The CPU processes instructions and programs. Each time you submit a job to the system, it makes demands on the CPU. Usually, the CPU can service all demands in a timely manner. However, there is only so much available processing power, which must be
shared by all users and the internal programs of the operating system, too.
Every program that runs on the system makes some demand on the physical memory on the machine. Like the CPU, it is a finite resource. When the active processes and programs that are running on the system request more memory than the machine actually
has, paging is used to move parts of the processes to disk and reclaim their memory pages for use by other processes. If further shortages occur, the system may also have to resort to swapping, which moves entire processes to disk to make room.
The I/O subsystem(s) transfers data into and out of the machine. I/O subsystems comprise devices such as disks, printers, terminals/keyboards, and other relatively slow devices, and are a common source of resource contention problems. In addition,
there is a rapidly increasing use of network I/O devices. When programs are doing a lot of I/O, they can get bogged down waiting for data from these devices. Each subsystem has its own limitations with respect to the bandwidth that it can effectively use
for I/O operations, as well as its own peculiar problems.
Performance monitoring and tuning is not always an exact science. In the displays that follow, there is a great deal of variety in the system/subsystem loadings, even for the small sample of systems used here. In addition, different user groups have
widely differing requirements. Some users will put a strain on the I/O resources, some on the CPU, and some will stress the network. Performance tuning is always a series of trade-offs. As you will see, increasing the kernel size to alleviate one problem
may aggravate memory utilization. Increasing NFS performance to satisfy one set of users may reduce performance in another area and thereby aggravate another set of users. The goal of the task is often to find an optimal compromise that will satisfy the
majority of user and system resource needs.
The examination of specific UNIX performance monitoring techniques begins with a look at three basic tools that give you a snapshot of the overall performance of the system. After getting this high-level view, you will normally proceed to examine each
of the subsystems in detail.
One of the simplest reports that you use to monitor UNIX system performance measures the number of processes in the UNIX run queue during given intervals. It comes from the command uptime. It is both a high-level view of the system's workload and a
handy starting place when the system seems to be performing slowly. In general, processes in the run queue are active programs (that is, not sleeping or waiting) that require system resources. Here is an example:
The useful parts of the display are the three load-average figures. The 1.90 load average was measured over the last minute. The 1.98 average was measured over the last 5 minutes. The 2.01 load average was measured over the last 15 minutes.
TIP: What you are usually looking for is the trend of the averages. This particular example shows a system that is under a fairly consistent load. However, if a system is having problems, but the load
averages seem to be declining steadily, then you may want to wait a while before you take any action that might affect the system and possibly inconvenience users. While you are doing some ps commands to determine what caused the problem, the imbalance may
NOTE: uptime has certain limitations. For example, high-priority jobs are not distinguished from low-priority jobs although their impact on the system can be much greater.
Run uptime periodically and observe both the numbers and the trend. When there is a problem it will often show up here, and tip you off to begin serious investigations. As system loads increase, more demands will be made on your memory and I/O
subsystems, so keep an eye out for paging, swapping, and disk inefficiencies. System loads of 2 or 3 usually indicate light loads. System loads of 5 or 6 are usually medium-grade loads. Loads above 10 are often heavy loads on large UNIX machines. However,
there is wide variation among types of machines as to what constitutes a heavy load. Therefore, the mentioned technique of sampling your system regularly until you have your own reference for light, medium, and heavy loads is the best technique.
Because the goal of this first section is to give you the tools to view your overall system performance, a brief discussion of graphical performance meters is appropriate. SUN Solaris users are provided with an OpenWindows XView tool called perfmeter,
which summarizes overall system performance values in multiple dials or strip charts. Strip charts are the default. Not all UNIX systems come with such a handy tool. That's too bad because in this case a picture is worth, if not a thousand words, at least
30 or 40 man pages. In this concise format, you get information about the system resources shown in Table 39.1:
Table 39.1. System resources and their descriptions.
Percent of CPU being utilized
EtherNet activity, in packets per second
Paging, in pages per second
Jobs swapped per second
Number of device interrupts per second
Disk traffic, in transfers per second
Number of context switches per second
Average number of runnable processes over the last minute
Collisions per second detected on the EtherNet
Errors per second on receiving packets
The charts of the perfmeter are not a source for precise measurements of subsystem performance, but they are graphic representations of them. However, the chart can be very useful for monitoring several aspects of the system at the same time. When you
start a particular job, the graphics can demonstrate the impact of that job on the CPU, on disk transfers, and on paging. Many developers like to use the tool to assess the efficiency of their work for this very reason. Likewise, system administrators use
the tool to get valuable clues about where to start their investigations. As an example, when faced with intermittent and transitory problems, glancing at a perfmeter and then going directly to the proper display may increase the odds that you can catch in
the act the process that is degrading the system.
The scale value for the strip chart changes automatically when the chart refreshes to accommodate increasing or decreasing values on the system. You add values to be monitored by clicking the right mouse button and selecting from the menu. From the same
menu you can select properties, which will let you modify what the perfmeter is monitoring, the format (dials/graphs, direction of the displays, and solid/lined display), remote/local machine choice, and the frequency of the display.
You can also set a ceiling value for a particular strip chart. If the value goes beyond the ceiling value, this portion of the chart will be displayed in red. Thus, a system administrator who knows that someone is periodically running a job that eats up
all the CPU memory can set a signal that the job may be run again. The system administrator can also use this to monitor the condition of critical values from several feet away from his monitor. If he or she sees red, other users may be seeing red, too.
The perfmeter is a utility provided with SunOS. You should check your own particular UNIX operating system to determine if similar performance tools are provided.
If your machine does not support uptime, there is an option for sar that can provide the same type of quick, high-level snapshot of the system. The -q option reports the average queue length and the percentage of time that the queue is occupied.
This is the length of the run queue during the interval. The run queue list doesn't include jobs that are sleeping or waiting for I/O, but does include jobs that are in memory and ready to run.
This is the percentage of time that the run queue is occupied.
This is the average length of the swap queue during the interval. Jobs or threads that have been swapped out and are therefore unavailable to run are shown here.
This is the percentage of time that there are swapped jobs or threads.
The run queue length is used in a similar way to the load averages of uptime. Typically the number is less than 2 if the system is operating properly. Consistently higher values indicate that the system is under heavier loads, and is quite possibly CPU
bound. When the run queue length is high and the run queue percentage is occupied 100% of the time, as it is in this example, the system's idle time is minimized, and it is good to be on the lookout for performance-related problems in the memory and disk
subsystems. However, there is still no activity indicated in the swapping columns in the example. You will learn about swapping in the next section, and see that although this system is obviously busy, the lack of swapping is a partial vote of confidence
that it may still be functioning properly.
Another quick and easy tool to use to determine overall system utilization is sar with the -u option. CPU utilization is shown by -u, and sar without any options defaults on most versions of UNIX to this option. The CPU is either busy or idle. When it
is busy, it is either working on user work or system work. When it is not busy, it is either waiting on I/O or it is idle.
This is the percentage of time that the processor is in user mode (that is, executing code requested by a user).
This is the percentage of time that the processor is in system mode, servicing system calls. Users can cause this percentage to increase above normal levels by using system calls inefficiently.
This is the percentage of time that the processor is waiting on completion of I/O, from disk, NFS, or RFS. If the percentage is regularly high, check the I/O systems for inefficiencies.
This is the percentage of time the processor is idle. If the percentage is high and the system is heavily loaded, there is probably a memory or an I/O problem.
In this example, you see a system with ample CPU capacity left (that is, the average idle percentage is 31%). The system is spending most of its time on user tasks, so user programs are probably not too inefficient with their use of system calls. The
I/O wait percentage indicates an application that is making a fair amount of demands on the I/O subsystem.
Most administrators would argue that %idle should be in the low 'teens rather than 0, at least when the system is under load. If it is 0 it doesn't necessarily mean that the machine is operating poorly. However, it is usually a good bet that the machine
is out of spare computational capacity and should be upgraded to the next level of CPU speed. The reason to upgrade the CPU is in anticipation of future growth of user processing requirements. If the system work load is increasing, even if the users
haven't yet encountered the problem, why not anticipate the requirement? On the other hand, if the CPU idle time is high under heavy load, a CPU upgrade will probably not help improve performance much.
Idle time will generally be higher when the load average is low.
A high load average and idle time is a symptom of potential problems. Either the memory or the I/O subsystems, or both, are hindering the swift dispatch and completion of the jobs. You should review the following sections that show how to look for
paging, swapping, disk, or network-related problems.
You have probably noticed that, while throughout the rest of this chapter the commands are listed under the topic in which they are used (for example, nfsstat is listed in the section "Monitoring Network Performance"), this section is
dedicated to just one command. What's so special about ps? It is singled out in this manner because of the way that it is used in the performance monitoring process. It is a starting point for generating theories (for example, processes are using up so
much memory that you are paging and that is slowing down the system). Conversely, it is an ending point for confirming theories (for example, here is a burst of network activityI wonder if it is caused by that communications test job that the
programmers keep running?). Since it is so pivotal, and provides a unique snapshot of the processes on the system, ps is given its own section.
One of the most valuable commands for performance monitoring is the ps command. It enables you to monitor the status of the active processes on the system. Remember the words from the movie Casablanca, "round up the usual suspects"? Well, ps
helps to identify the usual suspects (that is, suspect processes that could be using inordinate resources). Then you can proceed to determine which of the suspects is actually guilty of causing the performance degradation. It is at once a powerful tool and
a source of overhead for the system itself. Using various options, the following information is shown:
Current status of the process
Parent process ID
Address of process
CPU time used
Using ps provides you a snapshot of the system's active processes. It is used in conjunction with other commands throughout this section. Frequently, you will look at a report from a command, for example vmstat, and then look to ps either to confirm or
to deny a theory you have come up with about the nature of your system's problem. The particular performance problem that motivated you to look at ps in the first place may have been caused by a process that is already off the list. It provides a series of
clues to use in generating theories that can then be tested by detailed analysis of the particular subsystem.
The ps command is described in detail in Chapter 19, "Administrating Processes." The following are the fields that are important in terms of performance tuning:
Flags that indicate the process's current state and are calculated by adding each of the hexadecimal values:
Process has terminated
System process, always in memory
Process is being traced by its parent
Process is being traced by parent, and is stopped
Process cannot be awakened by a signal
Process is in memory and locked, pending an event
Process cannot be swapped
The current state of the process, as indicated by one of the following letters:
Process is currently running on the processor
Process is sleeping, waiting for an I/O event (including terminal I/O) to complete
Process is ready to run
Process is idle
Process is a zombie process (it has terminated, and the parent is not waiting but is still in the process table)
Process is stopped because of parent tracing it
Process is waiting for more memory
User ID of the process's owner
Process ID number
Parent process ID number
CPU utilization for scheduling (not shown when -c is used)
Scheduling class, real-time, time sharing, or system (only shown when the -c option is used)
Process scheduling priority (higher numbers mean lower priorities).
Process nice number (used in scheduling prioritiesraising the number lowers the priority so the process gets less CPU time)
The amount of virtual memory required by the process (This is a good indication of the memory load the process places on the systems memory.)
The terminal that started the process, or its parent (A ? indicates that no terminal exists.)
The total amount of CPU time used by the process since it began
The command that generated the process
If your problem is immediate performance, you can disregard processes that are sleeping, stopped, or waiting on terminal I/O, as these will probably not be the source of the degradation. Look instead for the jobs that are ready to run, blocked for disk
I/O, or paging.
The following are tips for using ps to determine why system performance is suffering.
Look at the UID (user ID) fields for a number of identical jobs that are being submitted by the same user. This is often caused by a user who runs a script that starts a lot of background jobs without waiting for any of the jobs to complete. Sometimes
you can safely use kill to terminate some of the jobs. Whenever you can, you should discuss this with the user before you take action. In any case, be sure the user is educated in the proper use of the system to avoid a replication of the problem. In the
example, User ID 1001 has multiple instances of the same process running. In this case, it is a normal situation, in which multiple processes are spawned at the same time for searching through database tables to increase interactive performance.
Look at the TIME fields for a process that has accumulated a large amount of CPU time. In the example, you can see the large amount of time acquired by the processes whose command is shown as velox_ga. This may indicate that the process is in an
infinite loop, or that something else is wrong with its logic. Check with the user to determine whether it is appropriate to terminate the job. If something is wrong, ask the user if a dump of the process would assist in debugging it (check your UNIX
system's reference material for commands, such as gcore, that can dump a process).
Request the -l option and look at the SZ fields for processes that are consuming too much memory. In the example you can see the large amount of memory acquired by the processes whose command is shown as velox_ga. You could check with the user of this
process to try to determine why it behaves this way. Attempting to renice the process may simply prolong the problem that it is causing, so you may have to kill the job instead. SZ fields may also give you a clue as to memory shortage problems caused by
this particular combination of jobs. You can use vmstat or sar -wpgr to check the paging and swapping statistics that are examined.
Look for processes that are consuming inordinate CPU resources. Request the -c option and look at the CLS fields for processes that are running at inappropriately high priorities. Use the nice command to adjust the nice value of the process. Beware in
particular of any real-time (RT) process, which can often dominate the system. If the priority is higher than you expected, you should check with the user to determine how it was set. If he is resetting the priority because he has figured out the superuser
password, dissuade him from doing this. (See Chapter 19 to find out more about using the nice command to modify the priorities of processes.)
If the processes that are running are simply long-running, CPU-intensive jobs, ask the users if you can nice them to a lower priority or if they can run them at night, when other users will not be affected by them.
Look for processes that are blocking on I/O. Many of the example processes are in this state. When that is the case, the disk subsystem probably requires tuning. The section "Monitoring Disk Performance Using vmstat" examines how to
investigate problems with your disk I/O. If the processes are trying to read/write over NFS, this may be a symptom that the NFS server to which they are attached is down, or that the network itself is hung.
You could say that one can never have too much money, be too thin, or have too much system memory. Memory sometimes becomes a problematic resource when programs that are running require more physical memory than is available. When this occurs UNIX
systems begin a process called paging. During paging the system copies pages of physical memory to disk, and then allows the now-vacated memory to be used by the process that required the extra space. Occasional paging can be tolerated by most systems, but
frequent and excessive paging is usually accompanied by poor system performance and unhappy users.
Paging uses an algorithm that selects portions, or pages, of memory that are not being used frequently and displaces them to disk. The more frequently used portions of memory, which may be the most active parts of a process, thus remain in memory, while
other portions of the process that are idle get paged out.
In addition to paging, there is a similar technique used by the memory management system called swapping. Swapping moves entire processes, rather than just pages, to disk in order to free up memory resources. Some swapping may occur under normal
conditions. That is, some processes may just be idle enough (for example, due to sleeping) to warrant their return to disk until they become active once more. Swapping can become excessive, however, when severe memory shortages develop. Interactive
performance can degrade quickly when swapping increases since it often depends on keyboard-dependent processes (for example, editors) that are likely to be considered idle as they wait for you to start typing again.
As the condition of your system deteriorates, paging and swapping make increasing demands on disk I/O. This, in turn, may further slow down the execution of jobs submitted to the system. Thus, memory resource inadequacies may result in I/O resource
By now, it should be apparent that it is important to be able to know if the system has enough memory for the applications that are being used on it.
TIP: A rule of thumb is to allocate twice the swap space as you have physical memory. For example, if you have 32 MB of physical Random Access Memory (RAM) installed upon your system, you would set up 64 MB
of swap space when configuring the system. The system would then use this diskspace for its memory management when displacing pages or processes to disk.
Both vmstat and sar provide information about the paging and swapping characteristics of a system. Let's start with vmstat. On the vmstat reports you will see information about page-ins, or pages moved from disk to memory, and page-outs, or pages moved
from memory to disk. Further, you will see information about swap-ins, or processes moved from disk to memory, and swap-outs, or processes moved from memory to disk.
The vmstat command is used to examine virtual memory statistics, and present data on process status, free and swap memory, paging activity, disk reports, CPU load, swapping, cache flushing, and interrupts. The format of the command is:
vmstat t [n]
This command takes n samples, at t second intervals. For example, the following frequently used version of the command takes samples at 5-second intervals without stopping until canceled:
The following screen shows the output from the SunOS variant of the command
vmstat -S 5
which provides extra information regarding swapping.
The fields in the vmstat report are the following:
Reports the number of processes in each of the following states
In the Run queue
Blocked, waiting for resources
Swapped, waiting for processing resources
Reports on real and virtual memory
Available swap space
Size of free list
Reports on page faults and paging, averaged over an interval (typically 5 seconds) and provided in units per second
Pages reclaimed from the free list (not shown when the -S option is requested)
Minor faults (not shown when -S option is requested)
Number of pages swapped in (only shown with the -S option)
Number of pages swapped out (only shown with the -S option)
Kilobytes paged in
Kilobytes paged out
Anticipated short-term memory shortfall
Pages scanned by clock algorithm, per second
Shows the number of disk operations per second
Shows the per-second trap/interrupt rates
System faults per second
CPU context switches
Shows the use of CPU time
NOTE: The vmstat command's first line is rarely of any use. When reviewing the output from the command, always start at the second line and go forward for pertinent data.
Let's look at some of these fields for clues about system performance. As far as memory performance goes, po and w are very important. For people using the -S option so is similarly important. These fields all clearly show when a system is paging and
swapping. If w is non-zero and so continually indicates swapping, the system probably has a serious memory problem. If, likewise, po consistently has large numbers present, the system probably has a significant memory resource problem.
TIP: If your version of vmstat doesn't specifically provide swapping information, you can infer the swapping by watching the relationship between the w and the fre fields. An increase in w, the swapped-out
processes, followed by an increase in fre, the number of pages on the free list, can provide the same information in a different manner.
Other fields from the vmstat output are helpful, as well. The number of runnable and blocked processes can provide a good indication of the flow of processes, or lack thereof, through the system. Similarly, comparing each percentage CPU idle versus CPU
in system state, and versus CPU in user state, can provide information about the overall composition of the workload. As the load increases on the system, it is a good sign if the CPU is spending the majority of the time in the user state. Loads of 60 or
70 percent for CPU user state are ok. Idle CPU should drop as the user load picks up, and under heavy load may well fall to 0.
If paging and swapping are occurring at an unusually high rate, it may be due to the number and types of jobs that are running. Usually you can turn to ps to determine what those jobs are.
Imagine that ps shows a large number of jobs that require significant memory resources. (You saw how to determine this in the ps discussion in the previous section.) That would confirm the vmstat report. To resolve the problem, you would have to
restrict memory-intensive jobs, or the use of memory, or add more memory physically.
TIP: You can see that having a history of several vmstat and ps reports during normal system operation can be extremely helpful in determining what the usual conditions are, and, subsequently, what the
unusual ones are. Also, one or two vmstat reports may indicate a temporary condition, rather than a permanent problem. Sample the system multiple times before deciding that you have the answer to your system's performance problems.
Number of blocks transferred for swap-ins per second.
Number of transfers from memory to swap area per second. (More memory may be needed if the value is greater than 1.)
Number of blocks transferred for swap-outs per second.
Number of process switches per second.
Number of attaches per second (that is, page faults where the page is reclaimed from memory).
Number of times per second that file systems get page-in requests.
Number of pages paged in per second.
Number of page faults from protection errors per second.
Number of address translation page (validity) faults per second.
Number of faults per second caused by software lock requests requiring I/O.
Number of times per second that file systems get page-out requests.
Number of pages paged out per second.
Number of pages that are put on the free list by the page-stealing daemon. (More memory may be needed if this is a large value.)
Number of pages scanned by the page-stealing daemon. (More memory may be needed if this is a large value, because it shows that the daemon is checking for free memory more than it should need to.)
Percentage of the ufs inodes that were taken off the free list that had reusable pages associated with them. (Large values indicate that ufs inodes should be increased, so that the free list of inodes will not be page bound.) This will be %s5ipf for
System V file systems, like in the example.
The average number of pages, over this interval, of memory available to user processes.
The number of disk blocks available for page swapping.
You should use the report to examine each of the following conditions. Any one of them would imply that you may have a memory problem. Combinations of them increase the likelihood all the more.
Check for page-outs, and watch for their consistent occurrence. Look for a high incidence of address translation faults. Check for swap-outs. If they are occasional, it may not be a cause for concern as some number of them is normal (for example,
inactive jobs). However, consistent swap-outs are usually bad news, indicating that the system is very low on memory and is probably sacrificing active jobs. If you find memory shortage evidence in any of these, you can use ps to look for memory-intensive
jobs, as you saw in the section on ps.
In the CPU columns of the report, the vmstat command summarizes the performance of multiprocessor systems. If you have a two-processor system and the CPU load is reflected as 50 percent, it doesn't necessarily mean that both processors are equally busy.
Rather, depending on the multiprocessor implementation it can indicate that one processor is almost completely busy and the next is almost idle.
The first column of vmstat output also has implications for multiprocessor systems. If the number of runnable processes is not consistently greater than the number of processors, it is less likely that you can get significant performance increases from
adding more CPUs to your system.
Disk operations are the slowest of all operations that must be completed to enable most programs to complete. Furthermore, as more and more UNIX systems are being used for commercial applications, and particularly those that utilize relational database
systems, the subject of disk performance has become increasingly significant with regard to overall system performance. Therefore, probably more than ever before, UNIX system tuning activities often turn out to be searches for unnecessary and inefficient
disk I/O. Before you learn about the commands that can help you monitor your disk I/O performance, some background is appropriate.
Some of the major disk performance variables are the hard disk activities themselves (that is, rotation and arm movement), the I/O controller card, the I/O firmware and software, and the I/O backplane of the system.
For example, for a given disk operation to be completed successfully, the disk controller must be directed to access the information from the proper part of the disk. This results in a delay known as a queuing delay. When it has located the proper part
of the disk, the disk arm must begin to position itself over the correct cylinder. This results in a delay called seek latency. The read/write head must then wait for the relevant data to happen as the disk rotates underneath it. This is known as
rotational latency. The data must then be transferred to the controller. Finally, the data must be transferred over the I/O backplane of the system to be used by the application that requested the information.
If you think about your use of a compact disk, many of the operations are similar in nature. The CD platter contains information, and is spinning all the time. When you push 5 to request the fifth track of the CD, a controller positions the head that
reads the information at the correct area of the disk (similar to the queuing delay and seek latency of disk drives). The rotational latency occurs as the CD spins around until the start of your music passes under the reading head. The datain this
case your favorite songis then transferred to a controller and then to some digital to analog converters that transform it into amplified musical information that is playable by your stereo.
Seek time is the time required to move the head of the disk from one location of data, or track, to another. Moving from one track to another track that is adjacent to it takes very little time and is called minimum seek time. Moving the head between
the two furthest tracks on a disk is measured as the maximum seek time. The average seek time approximates the average amount of time a seek takes.
As data access becomes more random in nature, seek time can become more important. In most commercial database applications that feature relational databases, for example, the data is often being accessed in a random manner, at a high rate, and in
relatively small packets (for example, 512 bytes). Therefore, the disk heads are moving back and forth all the time looking for the pertinent data. Therefore, choosing disks that have small seek times for those systems can increase I/O performance.
Many drives have roughly the same rotational speed, measured as revolutions per minute, or RPMs. However, some manufacturers are stepping up the RPM rates of their drives. This can have a positive influence on performance by reducing the rotational
delay, which is the time that the disk head has to wait for the information to get to it (that is, on average one-half of a rotation). It also reduces the amount of time required to transfer the read/write information.
While reviewing the use of the commands to monitor disk performance, you will see how these clearly show which disks and disk subsystems are being the most heavily used. However, before examining those commands, there are some basic hardware-oriented
approaches to this problem that can help increase performance significantly. The main idea is to put the hardware where the biggest disk problem is, and to evenly spread the disk work load over available I/O controllers and disk drives.
If your I/O work load is heavy (for example, with many users constantly accessing large volumes of data from the same set of files), you can probably get significant performance increases by reducing the number of disk drives that are daisy chained off
one I/O controller from five or six to two or three. Perhaps doing this will force another daisy chain to increase in size past a total of four or five, but if the disks on that I/O controller are only used intermittently, system performance will be
Another example of this type of technique is if you had one group of users that are pounding one set of files all day long, you could locate the most frequently used data on the fastest disks.
Notice that, once again, the more thorough your knowledge of the characteristics of the work being done on your system, the greater the chance that your disk architecture will answer those needs.
NOTE: Remember, distributing a work load evenly across all disks and controllers is not the same thing as distributing the disks evenly across all controllers, or the files evenly across all disks. You must
know which applications make the heaviest I/O demands, and understand the work load itself, to distribute it effectively.
TIP: As you build file systems for user groups, remember to factor in the I/O work load. Make sure your high-disk I/O groups are put on their own physical disks and preferably their own I/O controllers as
well. If possible, keep them, and /usr, off the root disk as well.
Disk-striping software frequently can help in cases where the majority of disk access goes to a handful of disks. Where a large amount of data is making heavy demands on one disk or one controller, striping distributes the data across multiple disks
and/or controllers. When the data is striped across multiple disks, the accesses to it are averaged over all the I/O controllers and disks, thus optimizing overall disk throughput. Some disk-striping software also provides Redundant Array of Inexpensive
Disks (RAID) support and the ability to keep one disk in reserve as a hot standby (that is, a disk that can be automatically rebuilt and used when one of the production disks fails). When thought of in this manner, this can be a very useful feature in
terms of performance because a system that has been crippled by the failure of a hard drive will be viewed by your user community as having pretty bad performance.
This information may seem obvious, but it is important to the overall performance of a system. Frequently, the answer to disk performance simply rests on matching the disk architecture to the use of the system.
With the increasing use of relational database technologies on UNIX systems, I/O subsystem performance is more important than ever. While analyzing all the relational database systems and making recommendations is beyond the scope of this chapter, some
basic concepts are in order.
More and more often these days an application based on a relational database product is the fundamental reason for the procurement of the UNIX system itself. If that is the case in your installation, and if you have relatively little experience in terms
of database analysis, you should seek professional assistance. In particular, insist on a database analyst that has had experience tuning your database system on your operating system. Operating systems and relational databases are both complex systems,
and the performance interactions between them is difficult for the inexperienced to understand.
The database expert will spend a great deal of time looking at the effectiveness of your allocation of indexes. Large improvements in performance due to the addition or adjustment of a few indexes are quite common.
You should use raw disks versus the file systems for greatest performance. File systems incur more overhead (for example, inode and update block overhead on writes) than do raw devices. Most relational databases clearly reflect this performance
advantage in their documentation.
If the database system is extremely active, or if the activity is unbalanced, you should try to distribute the load more evenly across all the I/O controllers and disks that you can. You will see how to determine this in the following section.
The iostat command is used to examine disk input and output, and produces throughput, utilization, queue length, transaction rate, and service time data. It is similar both in format and in use to vmstat. The format of the command is:
iostat t [n]
This command takes n samples, at t second intervals. For example, the following frequently used version of the command takes samples at 5-second intervals without stopping, until canceled:
For example, the following shows disk statistics sampled at 5-second intervals.
The first line of the report shows the statistics since the last reboot. The subsequent lines show the interval data that is gathered. The default format of the command shows statistics for terminals (tty), for disks (fd and sd), and CPU.
For each terminal, iostat shows the following:
Characters in the terminal input queue
Characters in the terminal output queue
For each disk, iostat shows the following:
Blocks per second
Transfers per second
Average service time, in milliseconds
For the CPU, iostat displays the CPU time spent in the following modes:
Waiting for I/O
The first two fields, tin and tout, have no relevance to disk subsystem performance, as these fields describe the number of characters waiting in the input and output terminal buffers. The next fields are relevant to disk subsystem performance over the
preceding interval. The bps field indicates the size of the data transferred (read or written) to the drive. The tps field describes the transfers (that is, I/O requests) per second that were issued to the physical disk. Note that one transfer can combine
multiple logical requests. The serv field is for the length of time, in milliseconds, that the I/O subsystem required to service the transfer. In the last set of fields, note that I/O waiting is displayed under the wt heading.
You can look at the data within the report for information about system performance. As with vmstat, the first line of data is usually irrelevant to your immediate investigation. Looking at the first disk, sd0, you see that it is not being utilized as
the other three disks are. Disk 0 is the root disk, and often will show the greatest activity. This system is a commercial relational database implementation, however, and the activity that is shown here is often typical of online transaction processing,
or OLTP, requirements. Notice that the activity is mainly on disks sd53 and sd55. The database is being exercised by a high volume of transactions that are updating it (in this case over 100 updates per second).
Disks 30, 53, and 55 are three database disks that are being pounded with updates from the application through the relational database system. Notice that the transfers per second, the kilobytes per second, and the service times are all reflecting a
heavier load on disk 53 than on disks 30 and 55. Notice that disk 30's use is more intermittent but can be quite heavy at times, while 53's is more consistent. Ideally, over longer sample periods, the three disks should have roughly equivalent utilization
rates. If they continue to show disparities in use like these, you may be able to get a performance increase by determining why the load is unbalanced and taking corrective action.
You can use iostat -xtc to show the measurements across all of the drives in the system.
% iostat -xtc 10 5 _
extended disk statistics tty cpu
disk r/s w/s Kr/s Kw/s wait actv svc_t %w %b tin tout us sy wt id
sd0 0.0 0.9 0.1 6.3 0.0 0.0 64.4 0 1 0 26 12 11 21 56
sd30 0.2 1.4 0.4 20.4 0.0 0.0 21.5 0 3 _
sd53 2.6 2.3 5.5 4.6 0.0 0.1 23.6 0 9 _
sd55 2.7 2.4 5.6 4.7 0.0 0.1 24.2 0 10 _
extended disk statistics tty cpu
disk r/s w/s Kr/s Kw/s wait actv svc_t %w %b tin tout us sy wt id
sd0 0.0 0.3 0.0 3.1 0.0 0.0 20.4 0 1 0 3557 5 8 14 72
sd30 0.0 0.2 0.1 0.9 0.0 0.0 32.2 0 0 _
sd53 0.1 0.2 0.4 0.5 0.0 0.0 14.6 0 0 _
sd55 0.1 0.2 0.3 0.4 0.0 0.0 14.7 0 0 _
This example shows five samples of all disks at 10-second intervals.
Each line shows the following:
Reads per second
Writes per second
KB read per second
KB written per second
Average transactions waiting for service (that is, queue length)
Average active transactions being serviced
Average time, in milliseconds, of service
Percentage of time that the queue isn't empty
Percentage of time that the disk is busy
Once again, you can check to make sure that all disks are sharing the load equally, or if this is not the case, that the most active disk is also the fastest.
Percentage of time that the device is busy servicing transfers
Average number of requests outstanding during the period
Read/write transfers to the device per second
Number of blocks transferred to the device per second
Average number of milliseconds that a transfer request spends waiting in the queue for service
Average number of milliseconds for a transfer to be completed, including seek, rotational delay, and data transfer time.
You can see from the example that this system is lightly loaded, since %busy is a small number and the queue lengths and wait times are small as well. The average service times for most of the disks is consistent; however, notice that SCSI disk 3, sd3,
has a larger service time than the other disks. Perhaps the arrangement of data on the disk is not organized properly (a condition known as fragmentation) or perhaps the organization is fine but the disproportionate access of sd3 (see the blks/s column) is
bogging it down in comparison to the other drives.
TIP: You should double-check vmstat before you draw any conclusions based on these reports. If your system is paging or swapping with any consistency, you have a memory problem, and you need to address that
first because it is surely aggravating your I/O performance.
As this chapter has shown, you should distribute the disk load over I/O controllers and drives, and you should use your fastest drive to support your most frequently accessed data. You should also try to increase the size of your buffer cache if your
system has sufficient memory. You can eliminate fragmentation by rebuilding your file systems. Also, make sure that the file system that you are using is the fastest type supported with your UNIX system (for example, UFS) and that the block size is the
One of the biggest and most frequent problems that systems have is running out of disk space, particularly in /tmp or /usr. There is no magic answer to the question How much space should be allocated to these? but a good rule of thumb is between 1500KB
and 3000KB for /tmp and roughly twice that for /usr. Other file systems should have about 5 or 10 percent of the system's available capacity.
From this display you can see the following information (all entries are in KB):
Total size of usable space in file system (size is adjusted by allotted head room)
Space available for use
Percentage of total capacity used
The usable space has been adjusted to take into account a 10 percent reserve head room adjustment, and thus reflects only 90 percent of the actual capacity. The percentage shown under capacity is therefore used space divided by the adjusted usable
TIP: For best performance, file systems should be cleansed to protect the 10 percent head room allocation. Remove excess files with rm, or archive/move files that are older and no longer used to tapes with
tar or cpio, or to less-frequently-used disks.
"The network is the computer" is an appropriate saying these days. What used to be simple ASCII terminals connected over serial ports have been replaced by networks of workstations, Xterminals, and PCs, connected, for example, over 10 BASE-T
EtherNet networks. Networks are impressive information transmission media when they work properly. However, troubleshooting is not always as straightforward as it should be. In other words, he who lives by the network can die by the network without the
The two most prevalent standards that you will have to contend with in the UNIX world are TCP/IP, (a communications protocol) and NFS, (a popular network file system). Each can be a source of problems. In addition, you need to keep an eye on the
implementation of the network, which can also can be a problem area. Each network topology has different capacities, and each implementation (for example, using thin-net instead of 10 BASE-T twisted pair, or using intelligent hubs, and so on) has
advantages and problems inherent in its design. The good news is that even a simple EtherNet network has a large amount of bandwidth for transporting data. The bad news is that with every day that passes users and programmers are coming up with new methods
of using up as much of that bandwidth as possible.
Most networks are still based on EtherNet technologies. Ethernet is referred to as a 10 Mps medium, but the throughput that can be used effectively by users and applications is usually significantly less than 10 MB. Often, for various reasons, the
effective capacity falls to 4 Mps. That may still seem like a lot of capacity, but as the network grows it can disappear fast. When the capacity is used up, EtherNet is very democratic. If it has a capacity problem, all users suffer equally. Furthermore,
one person can bring an EtherNet network to its knees with relative ease. Accessing and transferring large files across the network, running programs that test transfer rates between two machines, or running a program that has a loop in it that happens to
be dumping data to another machine, and so on, can affect all the users on the network. Like other resources (that is, CPU, disk capacity, and so on), the network is a finite resource.
If given the proper instruction, users can quite easily detect capacity problems on the network by which they are supported. A quick comparison of a simple command executed on the local machine versus the same command executed on a remote machine (for
example, login and rlogin) can indicate that the network has a problem.
A little education can help your users and your network at the same time. NFS is a powerful tool, in both the good and the bad sense. Users should be taught that it will be slower to access the file over the network using NFS, particularly if the file
is sizable, than it will be to read or write the data directly on the remote machine by using a remote login. However, if the files are of reasonable size, and the use is reasonable (editing, browsing, moving files back and forth), it is a fine tool to
use. Users should understand when they are using NFS appropriately or not.
One of the most straightforward checks you can make of the network's operation is with netstat -i. This command can give you some insight into the integrity of the network. All the workstations and the computers on a given network share it. When more
than one of these entities try to use the network at the same time, the data from one machine "collides" with that of the other. (Despite the sound of the term, in moderation this is actually a normal occurrence, but too many collisions can be a
problem.) In addition, various technical problems can cause errors in the transmission and reception of the data. As the errors and the collisions increase in frequency, the performance of the network degrades because the sender of the data retransmits the
garbled data, thus further increasing the activity on the network.
Using netstat -i you can find out how many packets the computer has sent and received, and you can examine the levels of errors and collisions that it has detected on the network. Here is an example of the use of netstat:
The name of the network interface. The names show what the type of interface is (for example, an en followed by a digit indicates an EtherNet card, the lo0 shown here is a loopback interface used for testing networks).
The maximum transfer unit, also known as the packet size, of the interface.
The network to which the interface is connected.
The Internet address of the interface. (The Internet address for this name may be referenced in /etc/hosts.)
The number of packets the system has received since the last boot.
The number of input errors that have occurred since the last boot. This should be a very low number relative to the Ipkts field (that is, less than 0.25 percent, or there is probably a significant network problem).
Same as Ipkts, but for sent packets.
Same as Ierrs, but for output errors.
The number of collisions that have been detected. This number should not be more than 5 or 10 percent of the output packets (Opkts) number or the network is having too many collisions and capacity is reduced.
In this example you see that the collision ratio shows a network without too many collisions (approximately 1 percent). If collisions are constantly averaging 10 percent or more, the network is probably being over utilized.
The example also shows that input and output error ratios are negligible. Input errors usually mean that the network is feeding the system bad input packets, and the internal calculations that verify the integrity of the data (called checksums) are
failing. In other words, this normally indicates that the problem is somewhere out on the network, not on your machine. Conversely, rapidly increasing output errors probably indicates a local problem with your computer's network adapters, connectors,
interface, and so on.
If you suspect network problems you should repeat this command several times. An active machine should show Ipkts and Opkts consistently incrementing. If Ipkts changes and Opkts doesn't, the host is not responding to the client requesting data. You
should check the addressing in the hosts database. If Ipkts doesn't change, the machine is not receiving the network data at all.
It is quite possible that you will not detect collisions and errors when you use netstat -i, and yet will still have slow access across the network. Perhaps the other machine that you are trying to use is bogged down and cannot respond quickly enough.
Use spray to send a burst of packets to the other machine and record how many of them actually made the trip successfully. The results will tell you if the other machine is failing to keep up. Here is an example of a frequently used test:
% spray SCAT
sending 1162 packets of length 86 to SCAT ...
no packets dropped by SCAT
3321 packets/sec, 285623 bytes/sec
This shows a test burst sent from the source machine to the destination machine called SCAT. No packets were dropped. If SCAT were badly overloaded some probably would have been dropped. The example defaulted to sending 1162 packets of 86 bytes each.
Another example of the same command uses the -c option to specify the number of packets to send, the -d option to specify the delay so that you don't overrun your buffers, and the -l option to specify the length of the packet. This example of the command
is a more realistic test of the network:
% spray c 100 d 20 0 l 2048 SCAT
sending 100 packets of length 2048 to SCAT ...
no packets dropped by SCAT
572 packets/sec, 1172308 bytes/sec
Had you seen significant numbers (for example, 5 to 10 percent or more) of packets dropped in these displays, you would next try looking at the remote system. For example, using commands such as uptime, vmstat, sar, and ps as described earlier in this
section, you would check on the status of the remote machine. Does it have memory or CPU problems, or is there some other problem that is degrading its performance so it can't keep up with its network traffic?
Systems running NFS can skip spray and instead use nfsstat -c. The -c option specifies the client statistics, and -s can be used for server statistics. As the name implies, client statistics summarize this system's use of another machine as a server.
The NFS service uses synchronous procedures called RPCs (remote procedure calls). This means that the client waits for the server to complete the file activity before it proceeds. If the server fails to respond, the client retransmits the request. Just as
with collisions, the worse the condition of the communication, the more traffic that is generated. The more traffic that is generated, the slower the network and the greater the possibility of collisions. So if the retransmission rate is large, you should
look for servers that are under heavy loads, high collision rates that are delaying the packets en route, or EtherNet interfaces that are dropping packets.
The number of times no available client handles caused waiting
The number of refreshed authentications
The number of times the time-out value is reached or exceeded
The number of reads made to a symbolic link
If the timeout ratio is high, the problem can be unresponsive NFS servers or slow networks that are impeding the timely delivery and response of the packets. In the example, there are relatively few time-outs compared to the number of calls (72/74107 or
about 1/10 of 1 percent) that do retransmissions. As the percentage grows toward 5 percent, system administrators begin to take a closer look at it. If badxid is roughly the same as retrans, the problem is probably an NFS server that is falling behind in
servicing NFS requests, since duplicate acknowledgments are being received for NFS requests in roughly the same amounts as the retransmissions that are required. (The same thing is true if badxid is roughly the same as timeout.) However, if badxid is a
much smaller number than retrans and timeout, then it follows that the network is more likely to be the problem.
TIP: nfsstat enables you to reset the applicable counters to 0 by using the -z option (executed as root). This can be particularly handy when trying to determine if something has caused a problem in the
immediate time frame, rather than looking at the numbers collected since the last reboot.
One way to check for network loading is to use netstat without any parameters:
Local Address Remote Address Swind SendQ Rwind RecvQ State
AAA1.1023 bbb2.login 8760 0 8760 0 ESTABLISHED
AAA1.listen Cccc.32980 8760 0 8760 0 ESTABLISHED
AAA1.login Dddd.1019 8760 0 8760 0 ESTABLISHED
AAA1.32782 AAA1.32774 16384 0 16384 0 ESTABLISHED
In the report, the important field is the Send-Q field, which indicates the depth of the send queue for packets. If the numbers in Send-Q are large and increasing in size across several of the connections, the network is probably bogged down.
The netstat -s command displays statistics for each of several protocols supported on the system (that is, UDP, IP, TCP, and ICMP). The information can be used to locate problems for the protocol. Here is an example:
The checksum fields should always show extremely small values, as they are a percentage of total traffic sent along the interface.
By using netstat -s on the remote system in combination with spray on your own, you can determine whether data corruption (as opposed to network corruption) is impeding the movement of your network data. Alternate between the two displays, observing the
differences, if any, between the reports. If the two reports agree on the number of dropped packets, the file server is probably not keeping up. If they don't, suspect network integrity problems. Use netstat -i on the remote machine to confirm this.
If you suspect that there are problems with the integrity of the network itself, you must try to determine where the faulty piece of equipment is. Hire network consultants, who will use network diagnostic scopes to locate and correct the problems.
If the problem is that the network is extremely busy, thus increasing collisions, time-outs, retransmissions, and so on, you may need to redistribute the work load more appropriately. This is a good example of the "divide and conquer" concept
as it applies to computers. By partitioning and segmenting the network nodes into subnetworks that more clearly reflect the underlying work loads, you can maximize the overall performance of the network. This can be accomplished by installing additional
network interfaces in your gateway and adjusting the addressing on the gateway to reflect the new subnetworks. Altering your cabling and implementing some of the more advanced intelligent hubs may be needed as well. By reorganizing your network, you will
maximize the amount of bandwidth that is available for access to the local subnetwork. Make sure that systems that regularly perform NFS mounts of each other are on the same subnetwork.
If you have an older network and are having to rework your network topology, consider replacing the older coax-based networks with the more modern twisted-pair types, which are generally more reliable and flexible.
Make sure that the work load is on the appropriate machine(s). Use the machine with the best network performance to do its proper share of network file service tasks.
Check your network for diskless workstations. These require large amounts of network resources to boot up, swap, page, etc. With the cost of local storage descending constantly, it is getting harder to believe that diskless workstations are still
cost-effective when compared to regular workstations. Consider upgrading the workstations so that they support their users locally, or at least to minimize their use of the network.
If your network server has been acquiring more clients, check its memory and its kernel buffer allocations for proper sizing.
If the problem is that I/O-intensive programs are being run over the network, work with the users to determine what can be done to make that requirement a local, rather than a network, one. Educate your users to make sure they understand when they are
using the network appropriately and when they are being wasteful with this valuable resource.
The biggest problem a system administrator faces when examining performance is sorting through all the relevant information to determine which subsystem is really in trouble. Frequently, users complain about the need to upgrade a processor that is
assumed to be causing slow execution, when in fact it is the I/O subsystem or memory that is the problem. To make matters even more difficult, all of the subsystems interact with one another, thus complicating the analysis.
You already looked at the three most handy tools for assessing CPU load in the section "Monitoring the Overall System Status." As stated in that section, processor idle time can, under certain conditions, imply that I/O or memory subsystems
are degrading the system. It can also, under other conditions, imply that a processor upgrade is appropriate. Using the tools that have been reviewed in this chapter, you can by now piece together a competent picture of the overall activities of your
system and its subsystems. You should use the tools to make absolutely sure that the I/O and the memory subsystems are indeed optimized properly before you spend the money to upgrade your CPU.
If you have determined that your CPU has just run out of gas, and you cannot upgrade your system, all is not lost. CPUs are extremely powerful machines that are frequently underutilized for long spans of time in any 24 hour period. If you can rearrange
the schedule of the work that must be done to use the CPU as efficiently as possible, you can often overcome most problems. This can be done by getting users to run all appropriate jobs at off-hours (off work load hours, that is, not necessarily 9 to 5).
You can also get your users to run selected jobs at lower priorities. You can educate some of your less efficient users and programmers. Finally, you can carefully examine the work load and eliminate some jobs, daemons, and so on, that are not needed.
The following is a brief list of jobs and daemons that deserve review, and possibly elimination, based on the severity of the problem and their use, or lack thereof, on the system. Check each of the following and ask yourself whether you use it or need
them: accounting services, printer daemons, mountd remote mount daemon, sendmail daemon, talk daemon, remote who daemon, NIS server, and database daemons.
One of the most recent developments of significance in the UNIX server world is the rapid deployment of symmetric multiprocessor (SMP) servers. Of course, having multiple CPUs can mean that you may desire a more discrete picture of what is actually
happening on the system than sar -u can provide.
You learned about some multiprocessor issues in the examination of vmstat, but there are other tools for examining multiprocessor utilization. The mpstat command reports the per-processor statistics for the machine. Each row of the report shows the
activity of one processor.
All values are in terms of events per second, unless otherwise noted. You may specify a sample interval, and a number of samples, with the command, just as you would with sar. The fields of the report are the following:
CPU processor ID
Interrupts as threads (not counting clock interrupt)
Involuntary context switches
Thread migrations (to another processor)
Spins on mutexes (lock not acquired on first try)
Spins on reader/writer locks (lock not acquired on first try)
Percentage of user time
Percentage of system time
Percentage of wait time
Percentage of idle time
Don't be intimidated by the technical nature of the display. It is included here just as an indication that multiprocessor systems can be more complex than uniprocessor systems to examine for their performance. Some multiprocessor systems actually can
bias work to be done to a particular CPU. That is not done here, as you can see. The user, system, wait, and idle times are all relatively evenly distributed across all the available CPUs.
Kernel tuning is a complex topic, and the space that can be devoted to it in this section is limited. In order to fit this discussion into the space allowed, the focus is on kernel tuning for SunOS in general, and Solaris 2.x in particular. In addition,
the section focuses mostly on memory tuning. Your version of UNIX may differ in several respects from the version described here, and you may be involved in other subsystems, but you should get a good idea of the overall concepts and generally how the
parameters are tuned.
The most fundamental component of the UNIX operating system is the kernel. It manages all the major subsystems, including memory, disk I/O, utilization of the CPU, process scheduling, and so on. In short, it is the controlling agent that enables the
system to perform work for you.
As you can imagine from that introduction, the configuration of the kernel can dramatically affect system performance either positively or negatively. There are parameters that you can tune for various kernel modules that you can tune. A couple reasons
could motivate you to do this. First, by tuning the kernel you can reduce the amount of memory required for the kernel, thus increasing the efficiency of the use of memory, and increasing the throughput of the system. Second, you can increase the capacity
of the system to accommodate new requirements (users, processing, or both).
This is a classic case of software compromise. It would be nice to increase the capacity of the system to accommodate all users that would ever be put on the system, but that would have a deleterious effect on performance. Likewise, it would be nice to
tune the kernel down to its smallest possible size, but that would have negative side-effects as well. As in most software, the optimal solution is somewhere between the extremes.
Some people think that you only need to change the kernel when the number of people on the system increases. This is not true. You may need to alter the kernel when the nature of your processing changes. If your users are increasing their use of X
Windows, or increasing their utilization of file systems, running more memory-intensive jobs, and so on, you may need to adjust some of these parameters to optimize the throughput of the system.
Two trends are changing the nature of kernel tuning. First, in an effort to make UNIX a commercially viable product in terms of administration and deployment, most manufacturers are trying to minimize the complexity of the kernel configuration process.
As a result, many of the tables that were once allocated in a fixed manner are now allocated dynamically, or else are linked to the value of a handful of fields. Solaris 2.x takes this approach by calculating many kernel values based on the maxusers field.
Second, as memory is dropping in price and CPU power is increasing dramatically, the relative importance of precise kernel tuning for most systems is gradually diminishing. However, for high-performance systems, or systems with limited memory, it is still
a pertinent topic.
Your instruction in UNIX kernel tuning begins with an overview of the kernel tables that are changed by it, and how to display them. It continues with some examples of kernel parameters that are modified to adjust the kernel to current system demands,
and it concludes with a detailed example of paging and swapping parameters under SunOS.
CAUTION: Kernel tuning can actually adversely affect memory subsystem performance. As you adjust the parameters upward, the kernel often expands in size. This can affect memory performance, particularly
if your system is already beginning to experience a memory shortage problem under normal utilization. As the kernel tables grow, the internal processing related to them may take longer, too, so there may be some minor degradation related to the greater
time required for internal operating system activities. Once again, with a healthy system this may be transparent, but with a marginal system the problems may become apparent or more pronounced.
CAUTION: In general you should be very careful with kernel tuning. People that don't understand what they are doing can cripple their systems. Many UNIX versions come with utility programs that help
simplify configuration. It's best to use them. It also helps to read the manual, and to procure the assistance of an experienced system administrator, before you begin.
CAUTION: Finally, always make sure that you have a copy of your working kernel before you begin altering it. Some experienced system administrators actually make backup copies even if the utility
automatically makes one. And it is always a good idea to do a complete backup before installing a new kernel. Don't assume that your disk drives are safe because you are "just making a few minor adjustments," or that the upgrade that you are
installing "doesn't seem to change much with respect to the I/O subsystem." Make sure you can get back to your original system state if things go wrong.
When should you consider modifying the kernel tables? You should review your kernel parameters in several cases, such as before you add new users, before you increase your X Window activity significantly, or before you increase your NFS utilization
markedly. Also review them before the makeup of the programs that are running is altered in a way that will significantly increase the number of processes that are run or the demands they will make on the system
Some people believe that you always increase kernel parameters when you add more memory, but this is not necessarily so. If you have a thorough knowledge of your system's parameters and know that they are already adjusted to take into account both
current loads and some future growth, then adding more memory, in itself, is not necessarily a reason to increase kernel parameters.
Some of the tables are described as follows:
Process table The process table sets the number of processes that the system can run at a time. These processes include daemon processes, processes that local users are running, and processes that remote users are running. It also includes
forked or spawned processes of usersit may be a little more trouble for you to accurately estimate the number of these. If the system is trying to start system daemon processes and is prevented from doing so because the process table has reached its
limit, you may experience intermittent problems (possibly without any direct notification of the error).
User process table The user process table controls the number of processes per user that the system can run.
Inode table The inode table lists entries for such things as the following:
Each open pipe
Each current user directory
Mount points on each file system
Each active I/O device
When the table is full, performance will degrade. The console will have error messages written to it regarding the error when it occurs. This table is also relevant to the open file table, since they are both concerned with the same subsystem.
Open file table This table determines the number of files that can be open on the system at the same time. When the system call is made and the table is full, the program will get an error indication and the console will have an error logged to
Quota table If your system is configured to support disk quotas, this table contains the number of structures that have been set aside for that use. The quota table will have an entry for each user who has a file system that has quotas turned
on. As with the inode table, performance suffers when the table fills up, and errors are written to the console.
Callout table This table controls the number of timers that can be active concurrently. Timers are critical to many kernel-related and I/O activities. If the callout table overflows, the system is likely to crash.
The -v option enables you to see the current process table, inode table, open file table, and shared memory record table.
The fields in the report are as follows:
The number of process table entries in use/the number allocated
The number of inode table entries in use/the number allocated
The number of file table entries currently in use/the number 0 designating that space is allocated dynamically for this entry
The number of shared memory record table entries in use/the number 0 designating that space is allocated dynamically for this entry
The overflow field, showing the number of times the field to the immediate left has had to overflow
Any non-zero entry in the ov field is an obvious indication that you need to adjust your kernel parameters relevant to that field. This is one performance report where you can request historical information, for the last day, the last week, or since
last reboot, and actually get meaningful data out of it.
This is also another good report to use intermittently during the day to sample how much reserve capacity you have.
Since all the ov fields are 0, you can see that the system tables are healthy for this interval. In this display, for example, there are 122 process table entries in use, and there are 4058 process table entries allocated.
The relevant fields in the report are the following:
The index of the symbol (appears in brackets).
The value of the symbol.
The size, in bytes, of the associated object.
A symbol is one of the following types: NOTYPE (no type was specified), OBJECT (a data object such as an array or variable), FUNC (a function or other executable code), SECTION (a section symbol), or FILE (name of the source file).
The symbol's binding attributes. LOCAL symbols have a scope limited to the object file containing their definition; GLOBAL symbols are visible to all object files being combined; and WEAK symbols are essentially global symbols with a lower precedence
Except for three special values, this is the section header table index in relation to which the symbol is defined. The following special values exist: ABS indicates that the symbol's value will not change through relocation; COMMON indicates an
allocated block and the value provides alignment constraints; and UNDEF indicates an undefined symbol.
To display a list of the current values assigned to the tunable kernel parameters, you can use the sysdef -i command:
% sysdef -i
... (portions of display are deleted for brevity)
* System Configuration
swapfile dev swaplo blocks free
/dev/dsk/c0t3d0s1 32,25 8 547112 96936
* Tunable Parameters
5316608 maximum memory allowed in buffer cache (bufhwm)
4058 maximum number of processes (v.v_proc)
99 maximum global priority in sys class (MAXCLSYSPRI)
4053 maximum processes per user id (v.v_maxup)
30 auto update time limit in seconds (NAUTOUP)
25 page stealing low water mark (GPGSLO)
5 fsflush run rate (FSFLUSHR)
25 minimum resident memory for avoiding deadlock (MINARMEM)
25 minimum swapable memory for avoiding deadlock (MINASMEM)
* Utsname Tunables
5.3 release (REL)
DDDD node name (NODE)
SunOS system name (SYS)
Generic_10131831 version (VER)
* Process Resource Limit Tunables (Current:Maximum)
Infinity:Infinity cpu time
Infinity:Infinity file size
7ffff000:7ffff000 heap size
800000:7ffff000 stack size
Infinity:Infinity core file size
40: 400 file descriptors
Infinity:Infinity mapped memory
* Streams Tunables
9 maximum number of pushes allowed (NSTRPUSH)
65536 maximum stream message size (STRMSGSZ)
1024 max size of ctl part of message (STRCTLSZ)
* IPC Messages
200 entries in msg map (MSGMAP)
2048 max message size (MSGMAX)
65535 max bytes on queue (MSGMNB)
25 message queue identifiers (MSGMNI)
128 message segment size (MSGSSZ)
400 system message headers (MSGTQL)
1024 message segments (MSGSEG)
SYS system class name (SYS_NAME)
As stated earlier, over the years there have been many enhancements that have tried to minimize the complexity of the kernel configuration process. As a result, many of the tables that were once allocated in a fixed manner are now allocated dynamically,
or else linked to the value of the maxusers field. The next step in understanding the nature of kernel tables is to look at the maxusers parameter and its impact on UNIX system configuration.
SunOS uses the /etc/system file for modification of kernel-tunable variables. The basic format is this:
set parameter = value
It can also have this format:
set [module:]variablename = value
The /etc/system file can also be used for other purposes (for example, to force modules to be loaded at boot time, to specify a root device, and so on). The /etc/system file is used for permanent changes to the operating system values. Temporary changes
can be made using adb kernel debugging tools. The system must be rebooted for the changes made for them to become active using /etc/system. With adb the changes take place when applied.
CAUTION: Be very careful with set commands in the /etc/system file! They basically cause patches to be performed on the kernel itself, and there is a great deal of potential for dire consequences from
misunderstood settings. Make sure you have handy the relevant system administrators' manuals for your system, as well as a reliable and experienced system administrator for guidance.
Many of the tables are dynamically updated either upward or downward by the operating system, based on the value assigned to the maxusers parameter, which is an approximation of the number of users the system will have to support. The quickest and, more
importantly, safest way to modify the table sizes is by modifying maxusers, and letting the system perform the adjustments to the tables for you.
The maxusers parameter can be adjusted by placing commands in the /etc/system file of your UNIX system:
A number of kernel parameters adjust their values according to the setting of the maxusers parameter. For example, Table 39.2 lists the settings for various kernel parameters, where maxusers is utilized in their calculation.
Table 39.2. Kernel parameters affected by maxusers.
10 + 16 * maxusers (sets the size of the process table)
max_nprocs-5 (sets the number of user processes)
16 + max_nprocs (sets the size of the callout table)
max_nprocs + 16 + maxusers + 64 (sets size of the directory lookup cache)
max_nprocs + 16 + maxusers + 64 (sets the size of the inode table)
(maxusers * NMOUNT) / 4 + max_nprocs (sets the number of disk quota structures)
The directory name lookup cache (dnlc) is also based on maxusers in SunOS systems. With the increasing usage of NFS, this can be an important performance tuning parameter. Networks that have many clients can be helped by an increased name cache
parameter ncsize (that is, a greater amount of cache). By using vmstat with the -s option, you can determine the directory name lookup cache hit rate. A cache miss indicates that disk I/O was probably needed to access the directory when traversing the path
components to get to a file. If the hit rate falls below 70 percent, this parameter should be checked.
% vmstat -s
0 swap ins
0 swap outs
0 pages swapped in
0 pages swapped out
1530750 total address trans. faults taken
39351 page ins
22369 page outs
45565 pages paged in
114923 pages paged out
73786 total reclaims
65945 reclaims from free list
0 micro (hat) faults
1530750 minor (as) faults
38916 major faults
88376 copyonwrite faults
120412 zero fill page faults
634336 pages examined by the clock daemon
10 revolutions of the clock hand
122233 pages freed by the clock daemon
45913303 cpu context switches
28556694 device interrupts
665339442 system calls
622350 total name lookups (cache hits 94%)
2281992 user cpu
3172652 system cpu
62275344 idle cpu
967604 wait cpu
In this example, you can see that the cache hits are 94 percent, and therefore enough directory name lookup cache is allocated on the system.
By the way, if your NFS traffic is heavy and irregular in nature, you should increase the number of nfsd NFS daemons. Some system administrators recommend that this should be set between 40 and 60 on dedicated NFS servers. This will increase the speed
with which the nfsd daemons take the requests off the network and pass them on to the I/O subsystem. Conversely, decreasing this value can throttle the NFS load on a server when that is appropriate.
The section isn't large enough to review in detail how tuning can affect each of the kernel tables. However, for illustration purposes, this section describes how kernel parameters influence paging and swapping activities in a SunOS system. Other tables
affecting other subsystems can be tuned in much the same manner as these.
As processes make demands on memory, pages are allocated from the free list. When the UNIX system decides that there is no longer enough free memoryless than the lotsfree parameterit searches for pages that haven't been used lately to add
them to the free list. The page daemon will be scheduled to run. It begins at a slow rate, based on the slowscan parameter, and increases to a faster rate, based on the fastscan parameter, as free memory continues toward depletion. If there is less memory
than desfree, and there are two or more processes in the run queue, and the system stays in that condition for more than 30 seconds, the system will begin to swap. If the system gets to a minimum level of required memory, specified by the minfree
parameter, swapping will begin without delay. When swapping begins, entire processes will be swapped out as described earlier.
NOTE: If you have your swapping spread over several disks, increasing the maxpgio parameter may be beneficial. This parameter limits the number of pages scheduled to be paged out, and is based on
single-disk swapping. Increasing it may improve paging performance. You can use the po field from vmstat, as described earlier, which checks against maxpgio and pagesize to examine the volumes involved.
The kernel swaps out the oldest and the largest processes when it begins to swap. The maxslp parameter is used in determining which processes have exceeded the maximum sleeping period, and can thus be swapped out as well. The smallest higher-priority
processes that have been sleeping the longest will then be swapped back in.
The most pertinent kernel parameters for paging and swapping are the following:
minfree This is the absolute minimum memory level that the system will tolerate. Once past minfree, the system immediately resorts to swapping.
desfree This is the desperation level. After 30 seconds at this level, paging is abandoned and swapping is begun.
lotsfree Once below this memory limit, the page daemon is activated to begin freeing memory.
fastscan This is the number of pages scanned per second.
slowscan This is the number of pages scanned per second when there is less memory than lotsfree available. As memory decreases from lotsfree the scanning speed increases from slowscan to fastscan.
maxpgio This is the maximum number of page out I/O operations per second that the system will schedule. This is normally set at approximately 40 under SunOS, which is appropriate for a single 3600 RPM disk. It can be increased with more or faster
Newer versions of UNIX, such as Solaris 2.x, do such a good job of setting paging parameters that tuning is usually not required.
Increasing lotsfree will help on systems on which there is a continuing need to allocate new processes. Heavily used interactive systems with many Windows users often force this condition as users open multiple windows and start processes. By increasing
lotsfree you create a large enough pool of free memory that you will not run out when most of the processes are initially starting up.
For servers that have a defined set of users and a more steady-state condition to their underlying processes, the normal default values are usually appropriate.
However, for servers such as this with large, stable work loads, but that are short of memory, increasing lotsfree is the wrong idea. This is because more pages will be taken from the application and put on the free list.
Some system administrators recommend that you disable the maxslp parameter on systems where the overhead of swapping normally sleeping processes (such as clock icons and update processes) isn't offset by any measurable gain due to forcing the processes
out. This parameter is no longer used in Solaris 2.x releases, but is used on older versions of UNIX.
You have now seen how to optimize memory subsystem performance by tuning a system's kernel parameters. Other subsystems can be tuned by similar modifications to the relevant kernel parameters. When such changes correct existing kernel configurations
that have become obsolete and inefficient due to new requirements, the result can sometimes dramatically increase performance even without a hardware upgrade. It's not quite the same as getting a hardware upgrade for free, but it's about as close as you're
likely to get in today's computer industry.
With a little practice using the methodology described in this chapter, you should be able to determine what the performance characteristics, positive or negative, are for your system. You have seen how to use the commands that enable you to examine
each of the resources a UNIX system uses. In addition to the commands themselves, you have learned procedures that can be utilized to analyze and solve many performance problems.