What to monitor on a Windows operating system

I posted this answer on a forum in response to a question, and it seemed like a good blog entry too.
When monitoring an application under load that's hosted on a Windows operating system, start by monitoring the following metrics on the Windows servers.

Physical Disk

% Disk Time: The percentage of time the disk was busy reading or writing bytes. Anything over 90% is bad.
Queue Length: The number of requests outstanding on the disk at the time the performance data is collected. It also includes requests in service at the time of the collection. Multi-spindle disk devices can have multiple requests active at one time while other concurrent requests await service. This counter might reflect a transitory high or low queue length, but if there is a sustained load on the disk drive, it is likely to be consistently high. Requests experience delays proportional to the length of this queue minus the number of spindles on the disks. For high performance, this difference should average less than two.
Time per Transfer: The time in milliseconds of the average disk transfer. Anything over 30 ms is not good.
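As a rough sketch, the three disk rules of thumb above could be checked like this in Python. The function name and parameters are made up for illustration, and the values are assumed to have been collected already by whatever tool you use.

```python
def disk_warnings(disk_time_pct, avg_queue_length, spindles, ms_per_transfer):
    """Return warnings based on the rule-of-thumb thresholds in this post.

    All names here are hypothetical; the inputs are assumed to be
    averages taken over the monitoring period.
    """
    warnings = []
    if disk_time_pct > 90:
        warnings.append("% Disk Time above 90%")
    # Delay is proportional to queue length minus the number of spindles;
    # that difference should average less than two.
    if avg_queue_length - spindles >= 2:
        warnings.append("queue length minus spindles is 2 or more")
    if ms_per_transfer > 30:
        warnings.append("time per transfer above 30 ms")
    return warnings
```

For example, a two-spindle disk with an average queue length of 3 gives a difference of 1, which is within the guideline.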

Memory

% Used: Calculate this as 100 * committed bytes / (committed bytes + available bytes). This value should not exceed 95%.
Page Faults: The average number of pages faulted per second. It is measured in pages faulted per second because only one page is faulted in each fault operation, so it is also equal to the number of page fault operations. This counter includes both hard faults (those that require disk access) and soft faults (where the faulted page is found elsewhere in physical memory). Most processors can handle large numbers (more than 350/s) of soft faults without significant consequence. However, hard faults, which require disk access, can cause significant delays.
Paging: The rate at which pages are read from or written to disk to resolve hard page faults. This counter is a primary indicator of the kinds of faults that cause system-wide delays. It is the sum of "Pages Input/sec" and "Pages Output/sec". It is counted in numbers of pages, so it can be compared to other counts of pages, such as "Page Faults/sec", without conversion. It includes pages retrieved to satisfy faults in the file system cache (usually requested by applications) and in non-cached mapped memory files.
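The % Used calculation is simple arithmetic, but a small Python sketch makes the intended formula explicit. The function name is hypothetical, and the two inputs are assumed to be the committed and available byte counts read from the memory counters.

```python
def memory_used_pct(committed_bytes, available_bytes):
    """Percentage of memory used: 100 * committed / (committed + available)."""
    total = committed_bytes + available_bytes
    if total == 0:
        raise ValueError("committed and available bytes cannot both be zero")
    return 100.0 * committed_bytes / total


# 6 GiB committed with 2 GiB available is 75% used, under the 95% limit.
print(memory_used_pct(6 * 2**30, 2 * 2**30))
```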

Network Interface

Bytes/s: Bytes Total/sec is the rate at which bytes are sent and received over each network adapter, including framing characters. Network Interface\Bytes Total/sec is the sum of Network Interface\Bytes Received/sec and Network Interface\Bytes Sent/sec.
Output Queue Length: The length of the output packet queue (in packets). If this is longer than two, there are delays and the bottleneck should be found and eliminated, if possible.
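For completeness, here is a minimal Python sketch of the two network checks. Both function names are invented for this example, and the inputs are assumed to be counter values you have already sampled.

```python
def bytes_total_per_sec(bytes_received_per_sec, bytes_sent_per_sec):
    """Bytes Total/sec is the sum of the received and sent rates."""
    return bytes_received_per_sec + bytes_sent_per_sec


def output_queue_delayed(output_queue_length):
    """True when the output queue is longer than two packets."""
    return output_queue_length > 2
```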

Processor

% Used: % Processor Time is the percentage of elapsed time that the processor spends executing a non-idle thread. It is calculated by measuring the percentage of time that the processor spends executing the idle thread and then subtracting that value from 100%. (Each processor has an idle thread that consumes cycles when no other threads are ready to run.) This counter is the primary indicator of processor activity, and displays the average percentage of busy time observed during the sample interval. Note that the accounting calculation of whether the processor is idle is performed at an internal sampling interval of the system clock (10 ms). On today's fast processors, % Processor Time can therefore underestimate processor utilization, as the processor may spend a lot of time servicing threads between system clock samples. Workload-based timer applications are one example of applications that are more likely to be measured inaccurately, as timers are signaled just after the sample is taken.
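The "subtract idle from 100%" calculation can be sketched in Python as follows. This only illustrates the arithmetic; the function name and the idea of passing idle and elapsed seconds are assumptions for the example, not how Windows exposes the counter.

```python
def processor_time_pct(idle_seconds, elapsed_seconds):
    """Busy percentage: 100% minus the share of time spent in the idle thread."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    idle_pct = 100.0 * idle_seconds / elapsed_seconds
    return 100.0 - idle_pct
```

So a processor that was idle for 2.5 seconds of a 10 second sample interval would report 75% used.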

System

Congestion: Calculate this as Processor Queue Length / number of CPUs. The value should be less than 10 for a well-performing system.
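A quick Python sketch of the congestion calculation, using a made-up function name and assuming you already have the Processor Queue Length value and the CPU count:

```python
def congestion(processor_queue_length, cpu_count):
    """Processor Queue Length divided by the number of CPUs."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return processor_queue_length / cpu_count


# A queue length of 8 on a 4-CPU server gives 2.0, well under the limit of 10.
print(congestion(8, 4))
```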

TCP

Segments Retransmitted/s: Segments Retransmitted/sec is the rate at which segments are retransmitted, that is, segments transmitted containing one or more previously transmitted bytes. A faster network connection will show a higher rate per second, but retransmissions should not normally happen, and a value greater than 1/s would indicate a problem with the network.
Here's an interesting conversation on retransmits: http://fixunix.com/tcp-ip/66636-segments-retransmitted.html
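If you only have cumulative retransmit counts rather than the per-second counter, a rate can be derived from two samples. This is a sketch under that assumption; the function name and the idea of sampling a cumulative count at the start and end of an interval are mine, not part of the counter definition.

```python
def retransmit_rate(count_start, count_end, interval_seconds):
    """Retransmitted segments per second between two cumulative samples."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (count_end - count_start) / interval_seconds


# 30 retransmits over 60 seconds is 0.5/s, below the 1/s warning level.
rate = retransmit_rate(100, 130, 60)
print(rate, "problem" if rate > 1 else "ok")
```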

Wednesday, 2 February 2011
