NmonFAQ

nmon for Linux and AIX Frequently Asked Questions (FAQ)


This is a work in progress - November 2016 - Recovered from backup after it went missing.


The postings on this site solely reflect the personal views of the authors and do not necessarily represent the views, positions, strategies or opinions of IBM or IBM management.

Colour key:

  • Answers in GREEN are related to nmon for AIX
  • Answers in RED are for very old versions of nmon for AIX
  • Answers in BLUE are for nmon for Linux
  • Answers in BLACK apply to both versions

Summary of the questions:

  • Question 1: Which nmon for my version of AIX or Linux?
  • Question 2: nmon crashes shortly after starting a data capture, please send me the next version?
  • Question 3: I have a problem with nmon running on AIX 4.0.3 (or any really old AIX versions)?
  • Question 4: All I get is "nmon not found"?
  • Question 5: Can you add the monitoring tape drive on AIX?
  • Question 6: Can I get the adapters stats from other tools?
  • Question 7: When I start nmon 9 on a system where it used to run fine I now get an error message?
  • Question 8: What is the most reported error for nmon?
  • Question 9: Can you add the monitoring of process priority?
  • Question 10: on AIX, nmon 9 does not run, please fix?
  • Question 11: Can I decide the filename it saves data to?
  • Question 12: What is the default output filename?
  • Question 13: I want nmon output piped into a further command, how?
  • Question 14: Why do you support all these old unsupported AIX versions?
  • Question 15: What if I want Support?
  • Question 16: Why don't you add a Java front end to nmon and get graphical output?
  • Question 17: The command line options don't seem to work right for file capture?
  • Question 18: What is paging to a filesystem?
  • Question 19: Where can I get nmon and further information?
  • Question 20: nmon crashes after about 200 snapshots on AIX?
  • Question 21: TOP process stats get switched on when I request Asynchronous I/O stats?
  • Question 23: nmon2rrd fails, please fix it?
  • Question 24: NANQ and INF?
  • Question 25: nmon and AIX commands do not agree?
  • Question 26: nmon reports more than 100% for a process - clearly it is wrong?
  • Question 27: On AIX the disk adapters are wrong?
  • Question 28: on AIX the adapter busy goes over 100%. That is impossible surely?
  • Question 29: What about nmon for HP/UX, Solaris on Sparc or x86 or Linux on Itanium?
  • Question 30: What about nmon for Windows?
  • Question 31: Seeing double the number of CPUs?
  • Question 32: 0509-036 Cannot load program /usr/lib/drivers/nfs_kdes.ext ?
  • Question 33: Hello, I am new to UNIX and want to tune AIX, what do you recommend?
  • Question 34: CPU wait is too high, how can I reduce it?
  • Question 35: On AIX, free memory is near zero, how do I free more memory?
  • Question 36: How can I set numperm better?
  • Question 37: What format is the nmon output file?
  • Question 38: I have collected once a second for 8 hours but I can't get the Analyser to work?
  • Question 39: nmon does not work on my Linux machine!!
  • Question 40: When do we get nmon 10 for Linux?
  • Question 41: The boxes and lines in nmon do not work right online with: DTterm, xterm, rxvt, putty, VNC, (whatever you have)?
  • Question 42: I have 2400 disk (small SAN LUNs) and nmon is slow to collect the stats from so many, can you help?
  • Question 43: Adapter stats and IOADAPT is not saved to the nmon file seems to be missing with AIX 5.1?
  • Question 44: What is CharIO (a column of the TOP processes stats)?
  • Question 45: On Linux, the disk stats are all doubled?
  • Question 46: On AIX the disk seems to be mostly on the first adapter?
  • Question 47: On nmon for Linux the CPU Wait for IO number is zero or odd?
  • Question 48: On nmon for Linux the paging details are missing and the PAGE lines for the capture to file are missing.
  • Question 49: I want to collect data every second and then see weekly and monthly reports. How?
  • Question 50: nmon will not start on AIX 5.1 due to a libperfstat error?
  • Question 51: How do I work out the Physical CPU use on Linux on POWER for shared processor LPARs?
  • Question 52: The Disk Busy stats are missing on AIX
  • Question 53: Sort order problems with massive nmon output files.
  • Question 54: AIX 5.3 updated but then nmon gives "Illegal instruction(core dump)"
  • Question 54: AIX 5.3 updated but then nmon gives "Assert Failure"
  • Question 55: On AIX 5.3 ML6, nmon output files contain zeros, missing CPU stats, corrupt ZZZ lines and "NFS" strings found in the stats
  • Question 56: Does nmon capture point in time stats or averages?
  • Question 57: Why is the Process memory percentage zero? (same for System and User percent)
  • Question 100: When will nmon collect data from lots of machines or LPARs?
  • Question 101: When will nmon collect data like "topas -C"?
  • Question 102: If nmon crashes, how to determine where in the nmon code that happened?

Question 1: Which nmon for my version of AIX or Linux?

AIX

  • On AIX 5.3 TL09 or later, AIX 6.1 TL02 or later, and any version of AIX 7, you should run the nmon that comes with AIX and is installed by default.
  • If you have problems, it is strongly recommended that you first apply all available service packs for your AIX release, as this removes 99% of problems.
  • If you have earlier AIX versions then you can run nmon classic downloadable from XXX

Linux

  • On Linux go to the nmon for Linux website (http://nmon.sourceforge.net) to download nmon. It is compiled for 50 different combinations of platform (POWER, x86, x86_64 and Mainframe) and Linux distribution.
  • If your combination is not on the list or you have a newer Linux version you can compile it yourself.

Question 2: nmon crashes shortly after starting a data capture, please fix this and send me the next version?

  • When you are capturing data to a file, the nmon tool disconnects from the shell, to ensure that it continues running even if you log out.
  • This means that nmon can appear to crash but it is still running in the background.
  • Use: ps -ef | grep nmon to see the nmon process still running.
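
A plain grep for nmon usually also matches the grep command itself. The following is a minimal sketch of a slightly tidier check, assuming a POSIX shell; the helper name count_nmon is made up for illustration and is not part of nmon.

```shell
# Count the lines mentioning nmon in "ps -ef" style text on standard input.
# The pattern [n]mon matches the text "nmon" but not the grep command line
# itself, because that command line contains the literal text "[n]mon".
count_nmon() {
    grep -c '[n]mon'
}

# Typical use on a live system:
#   ps -ef | count_nmon
```

A count of one or more shortly after starting a capture means nmon is still collecting in the background.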

Question 3: I have a problem with nmon running on AIX 4.0.3 (or any really old AIX versions)?

  • Hard luck
  • I will actively help get AIX 5 bugs fixed but older versions are very much less interesting.
  • In particular, on AIX 4.1.5 the TOP processes display does not work, but I am not going to fix it unless someone offers me hard currency.

Question 4: All I get is "nmon not found"?

Linux

  • First check it is executable (this gets switched off by FTP).
  • Second, if you are the root user, you have to name the executable directly with the full path name or (if it is in the current working directory) as ./nmon, or put it into a directory in your $PATH.
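
The two steps for a downloaded Linux binary can be sketched as below. This assumes a POSIX shell; make_runnable is a hypothetical helper name and ./nmon is only an example path.

```shell
# Restore the execute permission that FTP drops, and confirm it took effect.
# The argument is the path to the downloaded nmon binary.
make_runnable() {
    chmod +x "$1" && [ -x "$1" ]
}

# Typical use, naming the binary by explicit path:
#   make_runnable ./nmon && ./nmon
# or copy it into a directory that is already in $PATH.
```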

AIX

  • Since AIX 5.3 TL09, AIX 6.1 TL02 and any version of AIX 7, nmon is installed by default. The starting shell script can be found in /usr/bin/nmon; it actually starts the executable called nmon_topas.

Question 5: Can you add the monitoring tape drive on AIX?

AIX

  • No - the data is not available. The adapter statistics only add up the attached disks, so they do not help. The best you can do is guess at the tape drive I/O rates by looking at the disk I/O rates; after all, this is where the data is coming from. It is only approximate and does not account for memory caching of data.
  • Yes - if your tape drive is Fibre Channel connected. It is very common to have it connected on a different FC adapter to allow performance settings to suit the tape drive = streams of large blocks.
  • In this case, use the Adapter stats using the ^ key or -^ startup option to monitor the tape(s).

Linux

  • No FC Adapter options for Linux - unless you know the /proc file to find tape stats. In which case let Nigel know ASAP.

Question 6: Can I get the adapters stats from other tools?

AIX

  • Not in AIX 4 - there are no adapter stats in this AIX.
  • This is now available in AIX 5 and higher via the libperfstat library so programmers can get this information - but be warned: this is data derived from the connected disks (NOT tape drives) because there are no real adapter stats. XXX

Question 7: When I start nmon 9 on a system where it used to run fine I now get an error message?

  • The error is something about "lslpp" on AIX 5.1 from about ML03 onwards - or - WLM stats go missing after upgrading to AIX 5.2 ML5 - can you fix nmon?
  • These are bugs in AIX and not nmon - there are fixes available.
  • Please report these problems to your AIX support channel and not me.

nmon 10 has also been backported to AIX 5.1 and AIX 5.2 and has code to work around these bugs and can be used instead of nmon9a.

Question 8: What is the most reported error for nmon?

  1. nmon crashes when it starts in collecting to a file mode.
  2. nmon Analyser does not work - because the nmon file is empty or incomplete
  3. Can we have a new feature XYZ - and it's already implemented so read the nmon -h output
  4. I have a problem with the nmon options - turns out they can't read nmon -h, which states -f or -F MUST be the first option on the line
  5. How do I interpret nmon output - first do your homework by learning UNIX and Linux performance statistics: read the command manuals, take a course, spend 5 years benchmarking

Question 9: Can you add the monitoring of process priority?

  • Available from AIX 5.1 onwards

Question 10: On AIX, nmon 9 does not run, please fix?

  • With reports like:
  • read error: No such device or address
  • nmon file=nmon.c line=1278 version=XXX
  • 95% of the time it is because AIX was upgraded or a maintenance level was added but the system was not rebooted. It is very easy to miss the "You must reboot" message in the gallons of installp output. The reboot is required because the AIX kernel image has been updated and the reboot is the only way to activate the new /unix file. nmon reads the /unix file to find kernel data structure addresses but if the /unix file does not match what is actually running, you get this message. You can also get really weird effects if you have messed up LIBPATH.

Question 11: Can I decide the filename nmon saves data to?

  • Use nmon -h and check out the -F <file> option which must be the first option on the line

Question 12: What is the default output filename?

  • <hostname>_<Year><Month><Day><Hours><Minute>.nmon
  • Notes:
    • This has been very carefully chosen after years of experience
    • A directory of nmon files will sort in machine order and then date+time order. So you can find the data file you want in a simple way.
    • Many people needlessly make up their own names via scripts and date commands that will not be in any sensible order = a pointless waste of time.
    • One side effect is that, if two nmon captures are started in the same minute they might use the same filename, so stagger the startup by 61 seconds.

Question 13: I want nmon output piped into a further command, how?

  • Use a FIFO and the -F option.
    mkfifo /tmp/xyz
    nmon -F /tmp/xyz -s 5 -c 300
    your-command </tmp/xyz
  • If you are doing this with the online data output, I think you are "barking mad" but some people are still trying it.

Question 14: Why do you support all these old unsupported AIX versions?

  • You would be amazed at what AIX versions are running out there!
  • I guess it is a case of - "if it isn't broken don't touch it".
  • nmon can also help when planning server consolidation from these old versions to, for example, micro-partitions on newer hardware.

Question 15: What if I want Support?

  • You have a few options:
    • Give me money (and I have no problem with this) or
    • Pay for and use the IBM Tivoli Performance Monitoring product with support
    • Pay for and use PM for AIX, a remote service where your server's performance data is sent and it generates all the graphs that you can view online.

AIX

  • nmon for AIX is a fully supported AIX command so you can raise an IBM Problem report (PMR). However, you can't really ask for help with post-processing graphing tools that are not part of AIX.

Linux

  • nmon for Linux is becoming part of the popular distributions - if you have paid for Support you could request help
  • You can raise bugs on the sourceforge.net website for the nmon project: https://sourceforge.net/projects/nmon/

If it is something fairly simple you could ask a question on the IBM Performance tools Forum: https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000749

Question 16: Why don't you add a Java front end to nmon and get graphical output?

  • I don't have the time or the interest.
  • I have had a great laugh at Linux tools that do this sort of thing but then they highlight that the graphing takes serious CPU cycles. I have seen very simple tools take from 20% to 100% of a CPU - which is not what nmon is all about. I don't want to waste server CPU time collecting the data when that CPU should be used for running the application, RDBMS or whatever.

  • nmon aims to keep below a few percent of one CPU - this gets smaller as CPUs get faster.

Question 17: The command line options don't seem to work right for file capture?

  • The -f, -F, -x, -X or -z MUST be the first option on the line and only one of them.
  • This is documented in the nmon -h output
  • This option sets all the other option flags to a sensible set.
  • You can then use the other flags to modify their default behavior.
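
The rule above can be checked before a capture is launched. This is only a sketch under stated assumptions: check_capture_args is a hypothetical helper, not part of nmon, and it assumes every option is written as a separate word (combined forms are not handled).

```shell
# Return 0 if an nmon argument list follows the rule above: the first option
# is one of -f, -F, -x, -X or -z, and no second capture option appears later.
check_capture_args() {
    case "$1" in
        -f|-F|-x|-X|-z) ;;
        *) return 1 ;;
    esac
    shift
    for arg in "$@"; do
        case "$arg" in
            -f|-F|-x|-X|-z) return 1 ;;
        esac
    done
    return 0
}

# Typical use (the interval and count values are only examples):
#   check_capture_args -f -s 30 -c 120 && nmon -f -s 30 -c 120
```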

Question 18: What is paging to a filesystem (rather than to paging space)?

  • Hopefully, you already understand paging to paging space (also called virtual memory).
  • There are other types of paging.
  • AIX (and other UNIX versions) page in the read-only code from a program as you start it and as it runs. This is just like paging in from the paging space but is directly from the filesystem. This is also true for shared libraries (which you might not be aware you are using).
  • Also programs using memory mapped files access regular filesystem files - this allows access by simply reading and writing memory addresses - AIX will page in the file pages as necessary and they will get paged back to the filesystem to free up memory or if the program forces it or if the program stops.

Question 19: Where can I get nmon and further information?

  • The data displayed by nmon are similar to the displays generated by the standard AIX and Linux commands such as vmstat, iostat, netpmon, df, and sar. Use the AIX and Linux manual pages for these standard commands to understand what the displayed data means.
  • Following are several useful IBM Redbooks that you can buy or download for free from http://www.redbooks.ibm.com/Redbooks.nsf/portals/Power?Open:
    1. Understanding IBM pSeries Performance and Sizing (new version SG24-4810-1) 400 pages.
    2. For Performance tuning on pSeries and AIX - Database Performance on AIX in the DB2 UDB and Oracle Environments (SG24-5511) 450 pages. The techie's bible for tuning these databases for high performance.
    3. AIX 5L Performance Tools Handbook (SG24-6039) 950 pages - All the latest tools for AIX5L including truss and WLM.
    4. PowerVM Virtualization on IBM System p: Introduction and Configuration Fourth Edition - http://www.redbooks.ibm.com/Redbooks.nsf/RedbookAbstracts/sg247940.html
    5. AIX 5L Practical Performance Tools and Tuning Guide - http://www.redbooks.ibm.com/Redbooks.nsf/RedbookAbstracts/sg246478.html
    6. AIX Performance Management Guide - http://publib.boulder.ibm.com/infocenter/systems/scope/aix/index.jsp?topic=/com.ibm.aix.doc/doc/base/performance.htm&tocNode=toc:com.ibm.aix.doc/aix/7/
    7. XXX

Question 20: Very old question about nmon 10 and WPAR stats removed

  • Removed

Question 21: TOP process stats get switched on when I request AIX Asynchronous I/O stats?

  • This is working as normal. To get the AIX aioserver stats the details of all processes have to be collected, sorted and searched. Having paid the CPU cycles for the TOP process stats you may as well see them on the screen or in the output file, so nmon automatically switches them on for you at no additional charge.

Question 23: nmon2rrd fails, please fix it?

  • nmon2rrd is a C program that takes nmon files and changes the data ready for the excellent RRDTOOL, which can be used to generate graphs in .gif files for displaying on a webserver.
  • You have been supplied with the source code for nmon2rrd and it is supplied as a "toolbox".
  • This means users, rather than the original developer, are expected to come up with fixes.
  • Note there are updated versions from users on the nmon download site - well done guys.

Question 24: What are NANQ and INF?

  • These are output when calculations within nmon have gone wrong.
  • Typically, when dividing by zero. NANQ means "Not a number" and INF means infinite.
  • Sometimes this can happen due to rounding errors but mostly it is a bug or numbers have overflowed the C data types.
  • When nmon uses printf to display the invalid number it outputs these strings instead.

Question 25: Old nmon version question: nmon and AIX commands do not agree?

  • A lot of this happens with nmon 10 and Shared Processor Logical Partitions (SPLPAR) - what marketing calls Micro-partitions.
  • Some of it is because the AIX commands are very unclear about what they are reporting.
  • What were simply CPU numbers can now be Physical CPU, Logical CPU or Virtual CPU numbers and the documentation is unclear.
  • So you may not be comparing "like with like". This has been improved in nmon 11 - please report further issues from nmon 11 onwards.
  • Also see Question 26.

Question 26: nmon reports more than 100% for a process - clearly it is wrong?

  • Unlike AIX and some Linux commands, nmon reports the CPU utilisation of a process per CPU (the commands report as a percentage of all CPUs).
  • If your process is, for example, taking 250% then it is using 2.5 CPUs and must be multi-threaded as it is using more than one CPU.
  • This is far better than the commands because the percentages on larger machines make it very hard to determine if a process is using a whole CPU.
  • On a 64 CPU machine a single rogue process uselessly spinning on the CPU takes up 1.56% of the total CPU - this makes it very unclear what is going on.
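
The two ways of reporting can be converted into each other with simple arithmetic. The sketch below uses awk from a POSIX shell; the function names are made up for illustration.

```shell
# Convert nmon's per-CPU process percentage into the number of CPUs in use,
# e.g. 250 (%) -> 2.50 CPUs.
cpus_used() {
    awk -v pct="$1" 'BEGIN { printf "%.2f\n", pct / 100 }'
}

# Convert the same figure into the "percentage of the whole machine" style
# used by many commands, given the number of CPUs in the machine.
percent_of_machine() {
    awk -v pct="$1" -v ncpu="$2" 'BEGIN { printf "%.2f\n", pct / ncpu }'
}

cpus_used 250              # prints 2.50
percent_of_machine 100 64  # one spinning process on a 64 CPU machine
```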

Question 27: On AIX the disk adapters are wrong?

AIX

  • nmon just outputs what it gets from the libperfstat library.
  • With multipath I/O, the disk to adapter mapping often reflects the order of disk discovery rather than some balanced view.
  • This is an AIX problem and not nmon's fault.
  • To list what nmon is extracting from the libperfstat library you can use the sample code and the binaries precompiled for AIX 5.3 from the Roll Your Own wiki page - look for the adapt sample program: https://www.ibm.com/developerworks/community/wikis/home?lang=en#/wiki/Power%20Systems/page/Roll-Your-Own-Performance-Tool

  • If you don't like the way libperfstat reports the adapter stats raise a PMR and refer to the adapt sample - as you will get nowhere reporting nmon errors.

Question 28: On AIX the adapter busy goes over 100%. That is impossible surely?

  • There are no adapter stats in AIX (see above). They are derived from the disk stats. The adapter busy% is simply the sum of the disk busy%.
  • So if the adapter busy% is, for example, 350% then you have 3.5 disks busy on that adapter. Or it could be 7 disks at 50% busy or 14 disks at 25% or ....
  • There is no way to determine the adapter busy and in fact, it is not clear what it would really mean. The adapter has a dedicated onboard CPU that is always busy (probably no real OS) and we don't run nmon on these adapter CPUs to find out what they are really doing!!
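
The derivation described above is just a sum. As a minimal sketch (the function name adapter_busy is hypothetical, and the input is assumed to be one disk busy percentage per line):

```shell
# Sum per-disk busy percentages (one number per line on standard input) to
# get the derived "adapter busy" figure described above.
adapter_busy() {
    awk '{ total += $1 } END { printf "%d\n", total }'
}

# Seven disks at 50% busy each on one adapter:
printf '%s\n' 50 50 50 50 50 50 50 | adapter_busy   # prints 350
```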

Question 29: What about nmon for HP/UX, Solaris on Sparc or x86 or Linux on Itanium?

  • As I don't have access to such machines this is not going to happen.
  • There is also a problem that IBM gives me access to the current hardware because nmon is seen as a competitive advantage. If this was ported to every UNIX then I would not be allowed this access.
  • There have been attempts to port nmon for Linux to other operating systems but they have not been continued after a year or so.

Question 30: What about nmon for Windows?

  • Now you must be joking.
  • This does get asked a couple of times a year.
  • The real problems are:
    1. How would the stats be extracted from Windows by a C program given no libperfstat or /proc?
    2. The stats would be completely different and for AIX/UNIX/Linux performance people very hard/impossible to understand
    3. Given 2, none of the graphing tools would work.

Question 31: Seeing double the number of CPUs on my POWER server?

  • This is a POWER based machine question
  • This is due to the SMT feature of the POWER5 chip (and later POWER chips), where each CPU (core) runs two processes at the same time.
  • This gives you a 40% boost in performance for most commercial workloads and it is really a "good thing".
  • You need to read up on SMT or get yourself a presentation from IBM on the subject.

Question 32: Very old nmon version for AIX: question about NFS driver failures removed

  • Removed

Question 33: Hello, I am new to UNIX and want to tune AIX, what do you recommend?

  • Don't do it.
  • AIX is very good at looking after itself and self-tuning. I have seen rookie system admins nearly halt a machine by making "improvements".
  • Go on a course or read the AIX performance Redbooks from http://www.redbooks.ibm.com but don't just try changing things unless you first of all have a problem and second know what you are doing and have practiced on a non-production machine or LPAR.

Question 34: CPU wait is too high, how can I reduce it?

  • This question is asked a lot and it can mean your CPUs are actually too fast!
  • The CPU "waiting for I/O" state and its utilisation number (as opposed to User, System and Idle) means the CPU is idle but has a disk I/O outstanding. Historically, this was used to highlight that your application is being held up by slow disks or disk problems. In the Wait for I/O state, the CPU is actually free to do other work and is NOT looping waiting for the disk - it, in fact, actioned the adapter to perform the disk I/O, put the calling process to sleep and carried on. If there is no other process to run, it is in the same loop as in the Idle state, i.e. it is available to do other things. In AIX the processor does one of two things:
    • In regular stand-alone machines or a dedicated CPU LPAR the processor runs a special kernel level process called "wait" from which it can exit very quickly at the arrival of the next interrupt.
    • In a micro-partition (Shared Processor LPAR) the processor after a few microseconds will call the Hypervisor to yield the processor for other LPARs.
  • In benchmarks, Wait for I/O is seen positively as an opportunity - we can throw in more work to boost throughput.
  • Any workload in which the CPU does comparatively little work compared to the volume of disk I/O is going to give you high Wait for I/O.
  • If this high Wait for I/O is a sudden change from the normal pattern then it needs investigating and you should make sure as many disks as possible are involved in the disk I/O.
  • But lots of workloads just run like this - a common example I come across regularly is SAP databases. SAP cleverly caches lots of data but on a large database, it has to do lots of disk I/O for a particular customer or whatever records. Once the data is available it is sent to the SAP application servers i.e. little work is done on the database.
  • In fact, faster CPUs would mean even higher wait values.

Question 35: On AIX, free memory is near zero, how do I free more memory?

AIX

  • This is just how AIX works and is perfectly normal. All of the memory will be soaked up with copies of filesystem blocks after a reasonable length of time and the free memory will be near zero. AIX will then use the lrud process to keep the free list at a reasonable level.

  • If you see the lrud process taking more than 30% of a CPU then you need to investigate and make memory parameter changes.

Question 36: How can I set numperm better?

  • You can't. This number just reflects the amount of memory being used for disk blocks - called the buffer cache. It is controlled by three parameters: minperm, maxperm and strictperm, but these set thresholds and algorithms. The actual numperm number reflects what is actually going on. You will have to look elsewhere for advice on tuning these parameters as it is beyond the scope of this FAQ.
  • It is also worth noting that the nmon values for numperm and maxperm are based on a percentage of physical memory. The AIX commands report a percentage but not of all memory - they seem to remove some memory that might be something like the memory allocated to the AIX kernel (i.e. it could never be used as cache). Unfortunately, this is not documented and the memory size not counted is not available with any public API. So nmon does the best it can but the numbers will not be absolutely the same.

Question 37: What format is the nmon output file?

  • Plain ASCII text that you can view and edit with vi (but you might hit the 2048 byte line limit on the AIX vi).
  • I use the Open Source vim on AIX to avoid this or do it on Linux.
  • The first token on the line tells you what sort of data it is
    • AAA lines are basic nmon data about this collection of data
    • BBB lines are about the configuration of the machine
    • ZZZ lines include the date and time stamp, stored here once to reduce output
    • others should be obvious
  • The second field is the timestamp - see the ZZZ section for the actual time
  • Then there is the data
  • Each sort of data (CPU, DISK, etc.) has a Header line that describes the columns and the header lines also include the graph titles
  • You do not need to sort the nmon output file for nmon2rrd or the Analyser but if you do then you can see the sections easier for editing.
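
Because the first field names the section, a capture file can be summarised with a few lines of awk. This is a minimal sketch: it relies on the fields being comma separated, count_sections is a made-up helper name, and the filename in the usage comment is only an example.

```shell
# Count the lines in each section of an nmon capture file read from standard
# input. Fields are comma separated and the first field names the section.
count_sections() {
    awk -F, '{ n[$1]++ } END { for (s in n) print s, n[s] }' | sort
}

# Usage on a real capture (the filename is a made-up example):
#   count_sections < myhost_capture.nmon
```

The output is one line per section name followed by its line count, which is a quick way to spot an empty or truncated capture before handing it to a graphing tool.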

Question 38: I have collected once a second for 8 hours but I can't get the Analyser to work?

  • You have 28800 data points and you want to see this on a screen with say 1024 pixels wide !!
    • that is about 28 data points per pixel.
  • My new Thinkpad has 1400 pixels across the screen, so I am down to just 21 data points per pixel
    • What were you thinking when you started collecting so much data!!
  • I think even with the best will in the world, the analyser spreadsheet is going to struggle at some point with too much data.
  • On a tiny machine you get about 1.5KB per snapshot and on a normal size machine with a few nmon options it is more like 60KB each. At 60KB the maths --> 28800*60KB = 1.6GB. How big is your output file?
  • I hope you have at least 16 GBs of memory in your PC to handle this!
  • As I hope you know the nmon file is text and editable with vi (but you might hit the 2048 byte line limit on the AIX vi). I use the Open Source vim on AIX to avoid this or do it on Linux. If you take a look at the file format you should be able to cut down the file size and make a series of files, but each will need the header section that you will find at the top of the file and then a different set of snapshots.
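
Before starting a long capture it is worth doing this arithmetic for your own interval and screen. The sketch below uses awk from a POSIX shell; the function names are made up, and 1 GB is taken as 1024*1024 KB.

```shell
# Data points per screen pixel for a capture lasting <hours>, with one
# snapshot every <interval> seconds, drawn on a screen <pixels> wide.
points_per_pixel() {
    awk -v h="$1" -v i="$2" -v px="$3" 'BEGIN { printf "%.1f\n", h * 3600 / i / px }'
}

# Estimated capture size in GB for <snapshots> snapshots of <kb> KB each.
capture_size_gb() {
    awk -v n="$1" -v kb="$2" 'BEGIN { printf "%.1f\n", n * kb / 1024 / 1024 }'
}

points_per_pixel 8 1 1024    # 8 hours at 1 second on a 1024 pixel screen
points_per_pixel 8 60 1024   # the same 8 hours at one snapshot a minute
capture_size_gb 28800 60     # 28800 snapshots at 60 KB each
```

A longer interval brings the number of points down to something a graph can actually show, and shrinks the file in the same proportion.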

Question 39: nmon does not work on my Linux machine!

  • nmon runs on x86 (Intel and AMD), mainframe, ARM and POWER processors and on a dozen or so versions of Linux.
  • If your Linux system has a C compiler and ncurses you can have nmon running in a minute or two.
  • If you report problems I will need to know which platform and which Linux version plus distro before I can help so please include these with initial questions.

Question 40: When do we get nmon for AIX version X for Linux?

  • The Linux and AIX source code for nmon is very different apart from the curses framework and basic approach.
  • AIX gets all the information from system and library calls (with two exceptions) and in Linux, this has to be read from the /proc filesystem and some classic UNIX style kernel functions.
  • This means the AIX code is more straightforward.
  • The code bases of the AIX and Linux versions are completely different.
  • So there is no need for Linux and AIX to have the same version number.

Question 41: The boxes and lines in nmon do not work right online with: DTterm, xterm, rxvt, putty, VNC, (whatever you have)?

  • nmon uses curses to handle the displaying of characters on the terminal.
  • This is controlled mostly by your TERM variable setting.
  • The nmon developer tests with all of the above.
  • They work perfectly and they work perfectly all the time.
  • If it does not work for you then you have some setting wrong on your machine or X Windows, or have some strange settings for the TERM and/or TERMINFO shell variables, or you are using a duff terminal emulator.
  • For example, if you tell putty this is an xterm session via putty sessions but tell Linux this is a vt100 session with TERM=vt100, then expect odd things to happen.
  • Let me state that again: your system has a problem not nmon.
  • The TERM shell variable should be set to the terminal emulator you are using.
    • If you are using an xterm then TERM should be xterm
    • If you are using DTterm then TERM should be dtterm
    • If you are using an AIX term then TERM should be set to aixterm
    • Get the idea - other combinations are your problem.
    • Unless you are using a genuine 1970's DEC VT100 then you should not be using TERM=vt100 with more advanced terminal emulators. I remember VT100's well, even found a bug in the firmware once!
  • The TERMINFO variable should not be set to anything (in fact not set at all). If it is then you or someone has been mucking about with terminfo databases and why are you blaming nmon?

  • Terminal Emulators:
    • xterm works well in black and white.
    • aixterm works well and has colour and nmon uses the colour.
    • DTterm works well and has colour and nmon uses the colour.
    • rxvt and xterm-color combination (see WWW for details on setup, on google.com search for xterm-color and AIX) - this combination also lets vim (the improved vi from Open Source) use syntax highlighting in C code.
  • The Windows telnet terminal emulation is very poor indeed and not recommended under any circumstances - you are on your own.
  • The best alternative on a Windows PC is putty (see WWW for details and download) and is highly recommended - I use this every day - this will work perfectly with TERM set to xterm.

  • VNC is, of course, even better and gives you X windows on a Windows workstation at zero cost - again highly recommended.
  • The -B option starts nmon with no boxes (or colour). Some purists do not like to waste the screen space with the box lines. You could add 'B' to the NMON shell variable to make this automatic: export NMON=B
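
The TERM and TERMINFO advice above can be turned into a quick pre-flight check. This is only a sketch: check_term is a hypothetical helper written for this FAQ, not part of nmon, and it covers just the two mistakes named above.

```shell
# Report likely terminal setting problems before starting nmon.
# Arguments: the TERM value and the TERMINFO value (empty if unset).
check_term() {
    rc=0
    if [ -z "$1" ]; then
        echo "TERM is not set"; rc=1
    elif [ "$1" = "vt100" ]; then
        echo "TERM=vt100: set TERM to match your real terminal emulator"; rc=1
    fi
    if [ -n "$2" ]; then
        echo "TERMINFO is set to $2: unset it unless you know why it is there"; rc=1
    fi
    return $rc
}

# Typical use:
#   check_term "$TERM" "${TERMINFO:-}" && nmon
# To start nmon without boxes every time:
#   export NMON=B
```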

Question 42: I have 2400 disks (or 2400 small SAN LUNs) and nmon is slow to collect the stats from so many, can you help?

  • I guess you are learning the folly of small LUNs and that it makes the machine totally unmanageable. But you are not the first or worst - the record stands at 4500. Some suggestions:

  • Have you got more than four paths to each LUN?
    • If yes, you need to fix this ASAP as it is bad for performance and terrible for RAS (and I mean really bad).
  • Use the -D flag to stop nmon collecting the disk configuration each time; this can really help to reduce the startup time.
  • Collect this disk configuration just the once - unless you are changing the disks a lot!!
  • You can use nmon User Defined Disk Groups to limit the output but nmon will still have to collect all the data from all the disks and then

reduce what is actually reported.

  • But the only real solution is to reduce the number of disks you have - yes, I know this is a lot of work but you have a machine setup that

can not be managed and that is not viable in the long term.

  • Don't blame nmon for highlighting the issue.
  • I recommend 32 to 64 LUNs and make the disk subsystem do the hard work of spreading the data across disks - i.e. not you, as your time is much more valuable. After all, that is what you buy big disk subsystems for and there are better uses of your time and thought.
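To check the "more than four paths" point above, here is a minimal sketch that counts the paths per disk. It assumes the input is in the default lspath layout of "status name parent" (one path per line); the function name paths_over_limit and the sample disk names are made up for illustration.

```shell
#!/bin/sh
# paths_over_limit LIMIT
# Reads "status name parent" lines on stdin (the default lspath layout)
# and prints every disk that has more than LIMIT paths, with its count.
# Hypothetical helper, not part of nmon or AIX.
paths_over_limit() {
    limit=$1
    awk -v limit="$limit" '
        { count[$2]++ }                       # one line = one path to disk $2
        END {
            for (d in count)
                if (count[d] > limit) print d, count[d]
        }
    ' | sort
}

# Made-up sample: hdisk1 has two paths, hdisk2 has five.
printf '%s\n' \
    'Enabled hdisk1 fscsi0' \
    'Enabled hdisk1 fscsi1' \
    'Enabled hdisk2 fscsi0' \
    'Enabled hdisk2 fscsi1' \
    'Enabled hdisk2 fscsi2' \
    'Enabled hdisk2 fscsi3' \
    'Enabled hdisk2 fscsi4' | paths_over_limit 4

# On AIX itself this would be: lspath | paths_over_limit 4
```

With the sample input only hdisk2 is reported, as it is the only disk over the limit of four.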

Question 43: Old nmon for AIX: Adapter stats and IOADAPT is not saved to the nmon file seems to be missing with AIX 5.1?

AIX

  • Correct, this data is not available on AIX 5.1 from the libperfstat library.
  • This also causes a problem on nmon2rrd version 10 where it expects the IOADAPT section and crashes.
  • Recommended action: upgrade AIX, as 5.1 is not supported without purchasing extended support.

Question 44: What is CharIO (a column of the TOP processes stats)?

This is the character I/O that a process is generating and it is counted from calls to the read() and write() system calls. I/O started in other ways, like Async I/O (commonly used by an RDBMS), paging or memory mapped files, is not included. The number is fetched from the AIX kernel using the getprocs64() system call and the structure found in /usr/include/procinfo.h - look for the pi_ioch variable.

Question 45: On Linux the disk stats are all doubled?

nmon collects the data from /proc and displays it. On newer kernels, this is the /proc/diskstats file. It was decided a long time ago that hiding data was a very bad idea as it can go wrong and then be very misleading - this is how the ozone hole was missed for 5 years: the algorithm decided the data must be wrong and deleted it from the stats. The Linux disk stats (in three different files and four formats depending on the Linux version - great coding guys!!) report both disk level and disk partition level stats in the same file, so the same I/O shows up once for the disk and again for its partitions. nmon just shows you the stats - it is your job to understand them. nmon does not hide any of them and, with LUNs on SAN disks, software RAID and LVM, it is much safer to show everything.
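If you want to see which lines are whole disks and which are partitions, here is a minimal sketch. It assumes the third column of the stats file is the device name and that whole block devices (but not partitions) have their own directory directly under /sys/block; the function name classify_diskstats is made up and nmon does not do this for you.

```shell
#!/bin/sh
# classify_diskstats STATS_FILE SYS_BLOCK_DIR
# Labels each device in a diskstats-style file as "disk" or "partition".
# Assumption: whole devices have a directory under SYS_BLOCK_DIR and
# partitions do not. Hypothetical helper for illustration only.
classify_diskstats() {
    stats_file=$1
    sys_block=$2
    while read -r major minor name rest; do
        [ -n "$name" ] || continue
        # /sys/block uses "!" where a device name contains "/"
        sysname=$(printf '%s' "$name" | tr '/' '!')
        if [ -d "$sys_block/$sysname" ]; then
            echo "disk $name"
        else
            echo "partition $name"
        fi
    done < "$stats_file"
}

# On a live Linux system this would be:
#   classify_diskstats /proc/diskstats /sys/block
```

Adding up only the "disk" lines avoids counting the same I/O twice, but note that LVM and software RAID devices also appear as whole devices.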

Question 46: On AIX the disk seem to be mostly on the first adapter?

nmon now collects the adapter data from AIX libperfstat. This is the sum of the disk stats, added up by knowing which disk is connected to which adapter. This, of course, is complex for multipath I/O disks. AIX seems to build this map from the order in which disks are discovered rather than used. Depending on your initial setup, it can often mean that most disks are assigned to the first one or two adapters. Sorry, there is nothing that nmon can do about this. To list what nmon is extracting from the libperfstat library you can use the sample code and the binaries precompiled for AIX 5.3 from the Roll Your Own Wiki page at ryo

Question 47: On nmon for Linux the CPU Wait for IO number is zero or odd?

This number is not available in the /proc filesystem until the 2.6 kernel and then it appears in the undocumented fields at the end of a line - I have fixed this for the 2.6 kernels in nmon for Linux version 11c.

Question 48: On nmon for Linux the paging details are missing and the PAGE lines for the capture to file are missing.

This data was very hard to locate and now appears in nmon for Linux version 11d onward for the 2.6 kernel. Before this kernel version the data is not present in /proc.
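If you want to look at the raw numbers yourself, a 2.6 kernel exposes cumulative paging counters in /proc/vmstat. The sketch below just picks out four of them; whether nmon reads exactly these fields is an assumption, and the function name paging_counters is made up.

```shell
#!/bin/sh
# paging_counters VMSTAT_FILE
# Prints the page-in/out and swap-in/out counters from a vmstat-style file
# of "name value" lines. The values are totals since boot, so a rate needs
# two readings and a subtraction. Hypothetical helper for illustration.
paging_counters() {
    awk '$1 == "pgpgin" || $1 == "pgpgout" ||
         $1 == "pswpin" || $1 == "pswpout" { print $1 "=" $2 }' "$1"
}

# On a live Linux system this would be: paging_counters /proc/vmstat
```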

Question 49: I want to collect data every second and then see weekly and monthly reports. How?

Let us take this in simple bite-size chunks:

    First, a practical point: most laptop and PC screens are 1024 x 768 pixels. The point is that no matter how many data points you have, you cannot see more than about 800 of them. This is why I recommend about 300 to 400 data captures with nmon to get good looking graphs.

    Second, one-second stats for a day give you (60 x 60 x 24) 86400 data points! So OK, let us try one-minute stats: then we have 1440 data points, which is still too many. So we need to move to 5 minute captures and we get to a sensible 288 data points and a good looking graph.

    Third, we then collect data for a month: 288 x 31 = 8928 data points - oh dear, far too many data points again!! So now we have to drop down to once an hour data capture (24 x 31) and we have 744 data points, which is only just possible - we had better start thinking about the purchase of a bigger screen.

    If you then want to compare months or have a yearly report ... well, you get the idea by now, we are now monitoring 12 hour periods.
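The arithmetic above is easy to redo for your own interval. A minimal sketch follows; the function name snapshots is made up for illustration.

```shell
#!/bin/sh
# snapshots INTERVAL_SECONDS DAYS
# Number of data points collected at the given interval over the given days.
snapshots() {
    interval=$1
    days=$2
    echo $(( days * 24 * 60 * 60 / interval ))
}

snapshots 1 1        # one-second stats for a day      -> 86400
snapshots 60 1       # one-minute stats for a day      -> 1440
snapshots 300 1      # five-minute stats for a day     -> 288
snapshots 300 31     # five-minute stats for a month   -> 8928
snapshots 3600 31    # hourly stats for a month        -> 744
```

Compare each result with the roughly 800 points a screen can show and the 300 to 400 recommended for a good looking graph.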

But the above is only a physical problem. The much larger logical problem is still there to catch you out and that problem is averaging out. A long time ago I noticed that the shorter the time period that you use to monitor the more fluctuation you notice in the data.

Philosophy: If you keep using shorter and shorter periods you will eventually see that the CPUs are either 100% busy or 100% idle. All the other numbers are just a feature of humans not thinking fast enough and having to average out the CPU use over longer periods.

Anyway, for performance tuning, we need to concentrate on the peaks. Take a look at the below graph:

If we average the whole day we get 50%, which completely hides the peaks of the day time and the heavy CPU load during the evening batch. If this computer was not used during Saturday and Sunday the average might come down to 35%. The point is that averaging data over longer periods removes all the important peaks.
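To see how much an average can hide, here is a small sketch with made-up CPU busy percentages (they are not taken from any real capture). The function name summarise is hypothetical.

```shell
#!/bin/sh
# summarise
# Reads one number per line on stdin and prints the average and the peak.
summarise() {
    awk '{
             v = $1 + 0                      # force a numeric value
             sum += v
             if (NR == 1 || v > max) max = v
         }
         END {
             if (NR > 0) printf "average=%.1f peak=%d\n", sum / NR, max
         }'
}

# Six invented samples: quiet, quiet, two busy peaks, then quiet again.
printf '%s\n' 10 10 95 90 20 15 | summarise    # prints average=40.0 peak=95
```

An average of 40% looks like a relaxed machine, yet it was nearly flat out for a third of the samples.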

This is in addition to the data management problem.

Due to these three problems:

    Data overload - too many data points
    Averaging out - eliminates the vital data
    Manipulation - the data will need to be stored, manipulated and displayed - non-trivial

I think many people make the mistake of thinking that producing long term reports from nmon is an easy task, but it will turn out to be very hard work and often the results are utterly pointless or meaningless.

If you must attempt this then I recommend:

    rrdtool to summarise data for you and draw graphs
    ploticus looks like a good tool
    take a look at Ganglia

Question 50: nmon will not start on AIX 5.1 due to a libperfstat error?

The error is something like:

exec(): 0509-036 Cannot load program <nmon binary file here> because of the following errors:
0509-150 Dependent module libperfstat.a(shr.o) could not be loaded.
0509-022 Cannot load module libperfstat.a(shr.o).
0509-026 System error: A file or directory in the path name does not exist.

You will need to install the libperfstat library from the AIX CDROMs. It is in the bos.perf.libperfstat package.

I hope you realise that AIX 5.1 is not normally supported without extra payments as it is so old.

Question 51: How do I work out the Physical CPU use on Linux on POWER for shared processor LPARs?

Here is a Korn shell script that shows you where to get the data and the maths involved.

#!/usr/bin/ksh
# Read the LPAR's purr counter twice, a few seconds apart.
before=`grep purr /proc/ppc64/lparcfg | sed 's/purr=//'`
echo before=$before

integer seconds=2
sleep $seconds

after=`grep purr /proc/ppc64/lparcfg | sed 's/purr=//'`
echo after=$after

timebase=`grep timebase /proc/cpuinfo | awk '{print $3 }' `
echo timebase=$timebase

# Physical CPU use = (after - before) / timebase / seconds
string="($after-$before)/$timebase/$seconds"
echo string $string
bc <<EOF
scale=5
$string
EOF

Question 52: The Disk Busy stats are missing on AIX

If you are watching this online, the following message will be flashing at you - this is a big hint on how to switch them on!!!

To enable disk stats as root: chdev -l sys0 -a iostat=true

Question 53: Sort order problems with massive nmon output files.

So you collected more than 9999 snapshots in a single nmon capture. Ignoring the fact that the Excel Analyser can't cope with all this data and that it makes the data unmanageable, we suggest a good aim is between 400 and 700 snapshots per file for good graphs and manageable file sizes. Anyway, you then find out that if you sort the file the rows don't even sort in the right order. The problem is you have four digit and five digit snapshot numbers - the T numbers. This mucks up the sort ordering. What can you do? Try this on the AIX system (it should work on Linux too); it makes all the T numbers 5 digits and then they can be sorted:

sed 's/\(,T\)\([0-9][0-9][0-9][0-9]\)\(,\)/\10\2\3/' original.nmon >original5digit.csv
sort -n original5digit.csv >fixed.csv

Full marks if you understand the sed command - this is very advanced regular expression stuff.
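For those not getting full marks: the pattern captures three groups (the ",T", exactly four digits, and the following comma) and the replacement puts them back with a 0 inserted after the ",T". A five digit T number does not match because its fourth digit is followed by another digit, not a comma. Here it is wrapped in a function (the name pad_t_numbers is made up) and tried on two invented lines:

```shell
#!/bin/sh
# pad_t_numbers
# Same sed expression as above, reading stdin and writing stdout:
#   \(,T\)                        group 1: the ",T" prefix
#   \([0-9][0-9][0-9][0-9]\)      group 2: exactly four digits
#   \(,\)                         group 3: the comma that must follow
#   \10\2\3                       group 1, a literal 0, group 2, group 3
pad_t_numbers() {
    sed 's/\(,T\)\([0-9][0-9][0-9][0-9]\)\(,\)/\10\2\3/'
}

# Invented sample lines in the nmon "NAME,Tnnnn,values" layout.
printf '%s\n' 'CPU_ALL,T9999,10.0,5.0' 'CPU_ALL,T10000,11.0,4.0' | pad_t_numbers
# prints:
#   CPU_ALL,T09999,10.0,5.0
#   CPU_ALL,T10000,11.0,4.0
```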

Question 54: AIX 5.3 updated but then nmon gives "Illegal instruction(coredump)"

This has been reported shortly after an upgrade to a higher AIX 5.3 ML (like ML5 or ML6) and reboot. After a lot of research and experiments the following was found by a persistent nmon user called Xi Chen. The problem seems to be nmon jumping to a library like libperfstat and the jump vectors are not right so the library/system call jumps to address zero and attempts to execute instruction zero (invalid, of course). This is a bug in AIX and its update process where the libperfstat kernel package does not match the library. Try the following command: # lslpp -L | grep -i perfstat

You may get something like:

# lslpp -L | grep -i perfstat
  bos.perf.libperfstat      5.3.0.50    C     F    Performance Statistics Library
  bos.perf.perfstat         5.3.0.60    C     F    Performance Statistics

Update the package bos.perf.libperfstat to the same (5.3.0.60) or at least much closer levels (like 5.3.0.60 and 5.3.0.61) as bos.perf.perfstat. Preferably, the latest available levels.
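If you have many machines to check, the comparison can be scripted. This is a minimal sketch that assumes the lslpp output has the fileset name in the first column and the level in the second, as in the example above; the function name perfstat_levels_match is made up.

```shell
#!/bin/sh
# perfstat_levels_match
# Reads "lslpp -L | grep -i perfstat" style lines on stdin and reports
# whether the two filesets are at the same level.
# Hypothetical helper for illustration only.
perfstat_levels_match() {
    awk '$1 == "bos.perf.libperfstat" { lib = $2 }
         $1 == "bos.perf.perfstat"    { krn = $2 }
         END {
             if (lib == "" || krn == "") { print "missing"; exit }
             if (lib == krn) print "match " lib
             else print "mismatch libperfstat=" lib " perfstat=" krn
         }'
}

# The two lines from the example above:
printf '%s\n' \
    '  bos.perf.libperfstat      5.3.0.50    C     F    Performance Statistics Library' \
    '  bos.perf.perfstat         5.3.0.60    C     F    Performance Statistics' \
    | perfstat_levels_match
# prints: mismatch libperfstat=5.3.0.50 perfstat=5.3.0.60

# On AIX itself this would be: lslpp -L | grep -i perfstat | perfstat_levels_match
```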

Question 54: AIX 5.3 updated but then nmon gives "Assert Failure"

This has been reported shortly after an upgrade - some machines have this problem while others don't. There does not seem to be a pattern. There has been a lot of investigation of this issue with tools being written but it is still a mystery. The libperfstat library is claiming that an invalid parameter has been passed but tools have shown this is not true. The three parameters are a pointer to memory (just malloc'ed in the code), the number of adapters (just returned by the previous call to libperfstat) and the size of the disk adapter structure (which has never changed). The output looks like this:

ERROR: Assert Failure in file="nmon11.c" in function="main" at line=3300
ERROR: Reason=System call returned -1
ERROR: Expression=perfstat_diskadapter((perfstat_id_t * )FIRST_DISKADAPTER, p?
ERROR: errno=22
ERROR: errno means : Invalid argument

Then it has been found that a reboot fixes most of these Assert Failures. We don't fully understand this but it may be adapters in funny states, or kernel modules need to be reloaded or libperfstat in a twist - one thing we do know - it is not nmon! If you hit this problem:

    Check the software levels, see the previous question ("Illegal instruction").
    Do you think that you rebooted after the upgrade or do you know for absolutely sure?
    Try: export NMON_IGNORE_ASSERT=1 and then start nmon from this same ksh. This may work around the problem as nmon bravely tries to carry on even with library errors.
    Try the latest beta version of nmon (if it supports your AIX level).
    I know rebooting can be a problem with production systems but it fixes this the vast majority of the time.
    If it is still a problem, let us know via the usual AIX Performance Tools Forum.

Question 55: On AIX 5.3 ML6, nmon output files contain zeros, missing CPU stats, corrupt ZZZ lines and "nfs" strings found in the stats

This is yet another bug in the AIX libperfstat library at this ML6. The NFS data returned to nmon is corrupt and these characters may be output directly from the library (very bad form chaps!).

The workaround is:

    Do not include NFS statistics (remove the -N)
    Move to nmon12 that codes around these bugs.

Question 56: Does nmon capture point in time stats or averages?

Well, there are two types of numbers:

    rates and
    absolutes.

For an absolute example, free memory is an absolute - nmon just shows you how much memory is free. For a rate example, the network stats are rates; here nmon does the following:

    Capture a complete set of counters - these are incremented by the kernel like the number of bytes sent.
    then nmon waits the number of seconds you asked
    then nmon captures the second set of these counters
    then nmon calculates the difference between the two sets and divides by the number of seconds, so everything is per second
    this number is then displayed on screen or written to the data file

So the rates are the average between the two capture points. As the number of seconds increases the rates get more and more steady but note if you reduce the seconds to just one (the minimum to make sure nmon does not use too much CPU time) you will see lots more peaks and dips in the numbers.
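The steps above boil down to one subtraction and one division. Here is a minimal sketch of that calculation (this is not the nmon source; the function name rate_per_second and the counter values are made up):

```shell
#!/bin/sh
# rate_per_second BEFORE AFTER SECONDS
# Turns two readings of an ever-increasing counter into a per-second rate.
# Hypothetical helper; counter wrap-around is not handled.
rate_per_second() {
    before=$1
    after=$2
    seconds=$3
    echo $(( (after - before) / seconds ))
}

# Invented counter values: 1000 bytes, then 7000 bytes two seconds later.
rate_per_second 1000 7000 2    # prints 3000

# On a Linux box the two readings could come from, for example,
#   cat /sys/class/net/eth0/statistics/tx_bytes    (eth0 is an assumption)
# taken either side of a "sleep 2".
```

Whatever happened between the two readings is averaged into that one number, so a burst of 6000 bytes in the first second and nothing in the next also reports 3000 per second.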

"Point in time" numbers would be very misleading as they would miss all the peaks and dips in between - you would have to take dozens of them to be sure you are really seeing a representative number.

Question 57: Why is the Process memory percentage zero? (same for System and User percent)

This seems to happen in AIX 5.3 TL07 or thereabout. In fact, it is the AIX libperfstat library, which nmon uses, that has a bug in it that returns a large negative number for the Process% value. The Process, System and User Percentages are approximations (remember memory has many modes, types and uses and some overlap) and the calculation goes wrong.

nmon reports this problem by showing 0% - which is clearly impossible.

The bug was very hard to reproduce and track down because the problem only happens in particular circumstances and changes in memory use (like starting and stopping large memory applications). I am pretty sure you have a good chance of the number being fixed (for at least some time but may reappear), if you reboot the machine/LPAR.

The fix is to update AIX to AIX 5.3 TL09 (or even better AIX 6) but there may be a PTF or efix. You will have to ask AIX Support by asking for a fix to the libperfstat library to fix the real_system, real_process and real_user members of the perfstat_memory_total_t structure. That will give them the right details to search for in the Retain database. Do not ask for nmon classic support as the answer could be short and/or rude!

In my experience, AIX systems administrators don't like adding these updates to a production machine. So it may be better to just accept that if any of these numbers are zero then do not use any of these percentages.

Question 100: When will nmon collect data from lots of machines or LPARs?

Answer: Never. I like to think nmon does one job and does it well - it collects data from one machine and saves it in one file. Going multiple machine or LPAR has many problems:

    Collecting data from lots of machines or LPARs would require network access and lots of error handling for missing or late data.

    The nmon output file would then be far more complex and have to include the machine names and totally rewrite the time stamps.
    We already suffer from more data than Excel can handle.
    There would simply be too much data to display
    This complication would mean nmon becomes very large and code stability would take a long time to settle down

What you do need is:

    Less data, and then you drill down on particular nodes
    Automated database generation to store the data
    Automated graphing of the data you really want
    History for the last hour, day, week, month, year
    Small simple daemons on the nodes and automated central collection point
    Simple method of collecting more stats
    Open Source code to make it safe and simple to implement.

This tool is called Ganglia, see http://ganglia.sourceforge.net/ and Question 101.

Question 101: When will nmon collect data like "topas -C"?

It may not be obvious but topas and topas -C are two completely different programs hidden in one binary. The cross partition stats involve communicating with each LPAR and the HMC to get the data, unlike the local stats that just call the local kernel API. The cross partition version of nmon has already been written: it is called Ganglia, please see http://www.ibm.com/collaboration/wiki/display/WikiPtype/ganglia for more details. OK, it is an excellent Open Source tool and nothing to do with nmon but it has all the right stats, many brilliant features, is very simple to implement and has very little impact on performance. There is no need to duplicate this work. It also supports lots of operating systems, the output is via a website, the data is in graph form and it keeps historic data - so this is better than text output on a dumb screen and only for root users.

Question 102: If nmon crashes, how to determine where in the nmon code that happened?

This is a great help, identifying the nmon code with the problem.

To debug the problem code, we need a stack trace to identify the nmon code calling libc.

This assumes you have gdb available and the core file, source code and binary in the current directory. If compiling yourself, include -g and maybe don't optimise, by removing the -O option.

$ gdb nmon
GNU gdb version xxxxxxxx
GDB Information here

For help, type "help".
Reading symbols from nmon...done.
(gdb) run [[YOUR NMON COMMAND LINE HERE]]
Starting program: /home/nag/nmon xxxxxxx
. . .
. . .
Program received signal SIGSEGV, Segmentation fault.
main (argc=<optimised out>, argv=0x7fffffffe4d8) at nmon.c:2247
2247 *crashptr = 42;

(gdb) where full

#0 main (argc=<optimised out>, argv=0x7fffffffe4d8) at nmon.c:2247
. . .
- lots of details and variables here
. . .
(gdb) quit

Copy all the gdb output lines and send them to the nmon developer or report them on the nmon project on SourceForge.