nmon for Linux and AIX Frequently Asked Questions (FAQ)


This is a work in progress and may never be finished - Last update Jan 2017.

This website is about nmon for Linux but many users also use nmon for AIX so both are covered in this FAQ.


Frequently Asked Questions may be Answered by a Quick Video


Prefer to watch a YouTube Video from the nmon designer / developer?

This is Nigel Griffiths' YouTube Channel - for lots of videos on nmon, POWER Chips, Power Systems servers, Performance, AIX, PowerVM, PowerVC, PowerSC, Linux on Power.

nmon on AIX on Power

nmon for Linux on Power, x86/AMD64, mainframe, ARM

  • There are 3 videos for getting started with nmon for Linux, which take roughly 57 minutes in total
    1. Install, download and online on-screen - https://www.youtube.com/watch?v=prVzcj3vXNc
    2. Data Capture to file - https://www.youtube.com/watch?v=_PDAQLflfEc
    3. Graphing with nmonchart - https://www.youtube.com/watch?v=5P4neOqoCTo
  • If you don't mind using the Microsoft Excel spreadsheet, also take a look at the nmon Analyser video above.
  • Ignore the YouTube videos which are more than 2 years old as they are based on older versions, although they are still mostly true.
  • If you don't have access to a machine to run the nmon command then you can't read the help output - so these links will help you:
    • The Flash screen that you see when nmon starts up: nmon Flash welcome with basic help information
    • The command Help Information is very useful so here is a link to nmon -h output: nmon -h output

Frequently Asked Questions


Colour key:

  • Answers in BLUE are for nmon for Linux
  • Answers in GREEN are related to nmon for AIX
  • Answers in BLACK apply to both versions

Summary of the questions

  1. Which nmon version am I running?
  2. Which nmon for my version of AIX or Linux?
  3. nmon crashes shortly after starting a data capture, please send me the next version?
  4. Significant nmon dates?
  5. All I get is "nmon not found"?
  6. What are the most reported errors for nmon?
  7. Can I decide the filename it saves data to?
  8. What is the default output filename?
  9. I want nmon output piped into a further command, how?
  10. Why do you support all these old unsupported AIX versions?
  11. What if I want support?
  12. Why don't you add a Java front end to nmon and get graphical output?
  13. The command line options don't seem to work right for file capture?
  14. What is paging to a filesystem?
  15. Where can I get nmon and further information?
  16. TOP process stats get switched on when I request Asynchronous I/O stats?
  17. nmon2rrd fails, please fix it?
  18. What are NANQ and INF?
  19. nmon reports more than 100% for a process - clearly it is wrong?
  20. On AIX the disk adapters are wrong?
  21. On AIX the adapter busy goes over 100%. That is impossible surely?
  22. What about nmon for HP/UX, Solaris on Sparc or x86 or Linux on Itanium?
  23. What about nmon for Windows?
  24. Seeing double the number of CPUs?
  25. Hello, I am new to UNIX and want to tune AIX, what do you recommend?
  26. CPU wait is too high, how can I reduce it?
  27. On AIX, free memory is near zero, how do I free more memory?
  28. How can I set numperm better?
  29. What format is the nmon output file?
  30. I have collected once a second for 8 hours but I can't get the Analyser to work?
  31. nmon does not work on my Linux machine!!
  32. When do we get nmon 10 for Linux?
  33. The boxes and lines in nmon do not work right online with: DTterm, xterm, rxvt, putty, VNC, (whatever you have)?
  34. I have 2400 disk (small SAN LUNs) and nmon is slow to collect the stats from so many, can you help?
  35. What is CharIO (a column of the TOP processes stats)?
  36. On Linux the disk stats are all doubled?
  37. On AIX the disk seem to be mostly on the first adapter?
  38. On nmon for Linux the CPU Wait for IO number is zero or odd?
  39. On nmon for Linux the paging details are missing and the PAGE lines for the capture to file are missing.
  40. I want to collect data every second and then see weekly and monthly reports. How?
  41. How do I work out the Physical CPU use on Linux on POWER for shared processor LPARs?
  42. Automatic starting with certain statistics for nmon online mode?
  43. Sort order problems with massive nmon output files?
  44. Does nmon capture point in time stats or averages?
  45. When will nmon collect data from lots of machines or LPARs?
  46. When will nmon collect data like "topas -C"?
  47. nmon will not stay running - What should I check?
  48. Why isn't nmon for Linux on the Distro media or online repository or it is there but out of date?
  49. Do you have nmon presentations I could use for training others?
  50. nmon Analyser: What is Wavg?
  51. LPAR Tab/Statistics missing with Dedicated CPU mode?
  52. Adding External Data Collectors to nmon files so it graphs your extra data ?
  53. Sharing nmon files - Are they a security risk?
  54. The Disk stats are far too high or 100%, nmon is broken?
  55. What files does nmon for Linux use to get its data?
  56. Can you add the monitoring tape drive on AIX?
  57. How to use an External Data Collector with nmon?
  58. How to RDBMS Oracle Transaction Counters External Data Collectors Example?
  59. How to use the AIX Workload Manager stats?
  60. How to change the Top Processes Minimum CPU Threshold?
  61. How to start nmon file collection with cron?
  62. Can I reset the peak counters for disks, network, AIO (AIX only) and CPU graphs online?
  63. Is sharing nmon data capture file a possible security risk?
  64. How to determine optimal memory size for a VM from nmon data?
  65. Please explain the TOP Process Memory stats?
  66. What are User Defined Disk Groups for?
  67. Using User Defined Disk Groups with nmon for AIX?
  68. Using User Defined Disk Groups with nmon for Linux?
  69. How do I get more disk stats because I can never get enough of these?
  70. How to limit top processes to certain commands?
  71. How does nmon for AIX extract its data?
  72. How can I see 100's of disks on-screen?
  73. On-screen displaying only busy Top Processes and Hot disks?
  74. Do not use kill -9 on nmon as kill -USR2 will end it cleanly!

- To do:

  1. nmon for Linux 16 - major user interface upgrade with pictures:nmon for Linux v16 - New Stats On screen & Face lift
  2. Got a suggestion ?

- Historic questions


Question 1: Which nmon version am I running?

  • AIX
    • nmon -v - Will just say TOPAS-NMON
    • lslpp -w /usr/bin/nmon - Details that nmon is in the bos.perf.tools package
    • lslpp -l bos.perf.tools - Details the version number of that package. Normally that is very similar to the AIX version: oslevel -s
  • Linux
    • nmon -V - Details the nmon version
    • The version is also displayed at the top left when used interactively on a terminal.

Question 2: Which nmon for my version of AIX or Linux?

  • AIX
    • On AIX 5.3 TL09+, AIX 6.1 TL02+ and any version of AIX 7, you should run the nmon that comes with AIX and is installed by default.
    • If you have problems, it is strongly recommended to first apply all available service packs for your AIX release as this removes 99% of problems.
    • If you have earlier AIX versions then you can run nmon classic downloadable from XXX
  • Linux
    • On Linux go to the nmon for Linux website (http://nmon.sourceforge.net) to download nmon. It is compiled for 50 different combinations of platform (POWER, x86, x86_64 and Mainframe) and Linux distribution.
    • If your combination is not on the list or you have a newer Linux version you can now compile it yourself.

Question 3: nmon crashes shortly after starting a data capture, please fix this and send me the next version?

  • When you are capturing data to a file, the nmon tool disconnects from the shell, to ensure that it continues running even if you log out.
  • This means that nmon can appear to crash but it is still running in the background.
  • Use: ps -ef | grep nmon to see the nmon process still running.

Question 4: Significant nmon dates?

  • 1997 - nmon for AIX (AIX version 3.1.5). Developed to save Nigel Griffiths effort in monitoring benchmarks and creating benchmark report graphs.
    • Remember back then these benchmarks only used dumb terminals with 80x25 character screens (hence dense on-screen stats). Graphics adapters were available but took too much CPU time to use in benchmarks.
    • Also, Microsoft Excel had not been invented, so we used Lotus 1-2-3 with very restrictive CSV file size limits (hence dense file stats with low duplication).
  • 1998 - nmon for AIX (AIX 3+) only available on floppy disk to IBM Benchmark team in the UK and IBM Montpellier.
  • 2001 - nmon for AIX 5.5 binaries first released internally in IBM on a Webserver.
  • 2003 - nmon for Linux code started, distributed only as binaries
  • 21st Nov 2008 - nmon made part of AIX
    • From the following year nmon appears in every subsequent release as a default installed command.
    • This means nmon for AIX gets full IBM Problem Management Report (PMR) support like any other command.
    • Note graphing tools like nmon Analyser and nmonchart are not part of AIX and not part of AIX Support. These are still handled by willing IBMers and largely in their own time.
  • Ongoing nmon for AIX development by the IBM AIX development team - especially the performance tools team.
  • 27th July 2009 - nmon for Linux released to open source
    • nmon for Linux released to open source under GPL - it was an internal project at IBM for many years.
    • The source code and further information is available at Sourceforge and in particular at the new nmon for Linux wiki at http://nmon.sourceforge.net
    • This means that you can compile nmon for Linux for your specific Linux flavour and help improve it further.
  • Ongoing nmon for Linux development by Nigel Griffiths

Question 5: All I get is "nmon not found"?

  • Linux
    • First check the nmon file is executable (this gets switched off by FTP) with: ls -l `which nmon`
    • In case you are new to Linux or UNIX set the executable flag with: chmod ugo+x `which nmon`
    • Second, if you are the root user, you have to name the executable directly with the full path name or (if in the current working directory) ./nmon or put it into a directory in your $PATH, for example: /usr/local/bin
  • AIX
    • nmon since AIX 5.3 TL09+ and AIX 6.1 TL02+ and AIX 7 any version is a default install and the starting shell script can be found in /usr/bin/nmon - it actually starts the executable called nmon_topas.

Question 6: What is the most reported errors for nmon?

  1. nmon crashes as it starts in collecting-to-a-file mode.
    • See question 3.
  2. nmon Analyser does not work
    • Quite often the nmon output file is empty or only has the config info due to not waiting long enough - if you request data every 5 minutes then wait at least 15 minutes (three snapshots of performance data) before you try analysing the file!
    • Incomplete last line. If nmon is still running and outputting data when you grab the file, it is possible to have an incomplete last line in the file - you could edit with vi to remove the last set of output - see the lines starting ZZZ.
  3. Can we have a new feature XYZ?
    • But it turns out XYZ is already implemented (and has been for a few years)
    • So read the nmon -h output and you might find it
    • See below for External Data Providers (question 57) and User Defined Disk Groups (questions 66 to 68)
  4. I have a problem with the nmon options
    • Turns out the user can't read the nmon -h output which states: The -f or -F MUST be the first option on the line
  5. How do I interpret nmon output?
    • First do your homework by learning UNIX and Linux performance statistics: read the command manuals, take a course or spend 5 years in a benchmark centre.
    • Sorry but I can't write nmon and teach the world the basics on UNIX/Linux performance tuning.
  6. The AIX and Linux memory stats are different or missing?
    • The answer is "Yes you are correct". Some of the basic memory stats map OK between AIX and Linux for example: memory total size and memory free but the bulk are very different.
    • Also note that early Linux on Intel/AMD had to cope with small memory size with high and low memory areas due to 16 bit then 32 bit hardware. This has died out now with the move to 64 bit memory addressing.
    • Linux and AIX are very different in the memory area and it is not me forgetting to implement some of the stats.
    • For example, the AIX NEWMEM stats are not available under Linux and never will be.
  7. What is causing AIX to run at 99% memory used?
    • This is perfectly normal and shows AIX is making use of memory to optimise performance. It is a "good thing".
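
For the incomplete last line case in item 2 above, the vi edit can also be scripted. This is a minimal sketch, assuming an unsorted capture file in which each snapshot begins with a line whose first field starts with ZZZ; the helper name and the sample lines are made up for illustration.

```python
def drop_last_snapshot(lines):
    """Return the lines with the final snapshot removed.

    Assumption: every snapshot starts with a line whose first
    comma-separated field begins with "ZZZ", and the file has not
    been sorted, so the last such line opens the last snapshot.
    """
    last = None
    for i, line in enumerate(lines):
        if line.split(",", 1)[0].startswith("ZZZ"):
            last = i
    # No timestamp line found: nothing to trim.
    return list(lines) if last is None else list(lines[:last])


# Illustrative lines only; the final CPU_ALL line is cut short.
sample = [
    "AAA,progname,nmon",
    "ZZZZ,T0001,10:00:00,17-JAN-2017",
    "CPU_ALL,T0001,1.0,0.5,0.0,98.5",
    "ZZZZ,T0002,10:05:00,17-JAN-2017",
    "CPU_ALL,T0002,2.0,0.",
]
trimmed = drop_last_snapshot(sample)
```

Dropping the whole last snapshot, rather than only the broken line, keeps every remaining snapshot complete.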

Question 7: Can I decide the filename nmon saves data to?

  • Use nmon -h and check out the -F <file> option which must be the first option on the line

Question 8: What is the default output filename?

  • <hostname>_<Year><Month><Day><Hours><Minute>.nmon
  • Notes:
    • This has been very carefully chosen after years of experience
    • A directory of nmon files will sort in machine and then date+time order. So you can find the data file you want in a simple way.
    • Many people needlessly make up their own names via scripts and date commands that will not be in any sensible order = a pointless waste of time.
    • One side effect is that, if two nmon captures are started in the same minute they might use the same filename, so stagger the start up by 61 seconds.
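
The sorting and same-minute collision points can be seen in a small sketch. The field order follows the pattern above, but the exact field widths and the underscore between date and time are assumptions here, so compare against a real file from your nmon version. The host names are made up.

```python
from datetime import datetime


def capture_name(hostname, when):
    # Assumed layout: host_YYMMDD_HHMM.nmon (two-digit fields, minute resolution).
    return f"{hostname}_{when:%y%m%d_%H%M}.nmon"


names = [
    capture_name("web2", datetime(2017, 1, 16, 9, 0)),
    capture_name("db1", datetime(2017, 1, 17, 0, 0)),
    capture_name("db1", datetime(2017, 1, 16, 23, 55)),
]
# A plain sort groups by machine first, then by date and time.
ordered = sorted(names)

# Two captures started within the same minute get the same name.
first = capture_name("db1", datetime(2017, 1, 17, 0, 0, 5))
second = capture_name("db1", datetime(2017, 1, 17, 0, 0, 50))
collision = first == second
```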

Question 9: I want nmon output piped into a further command, how?

  • Use a FIFO and the -F option.
    mkfifo /tmp/xyz
    nmon -F /tmp/xyz -s 5 -c 300
    your-command </tmp/xyz
  • If you are doing this with the online data output, I think you are "barking mad" but some people are still trying it.
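
As a sketch of what "your-command" might look like, here is a small reader that splits each line of capture output into its leading tag and the remaining fields. The function name and the sample lines are illustrative assumptions; it accepts any line-by-line stream, so the FIFO is replaced here by an in-memory stand-in.

```python
import io


def records(stream):
    """Yield (tag, fields) for each non-empty comma-separated line."""
    for line in stream:
        line = line.strip()
        if line:
            tag, *fields = line.split(",")
            yield tag, fields


# Illustrative stand-in for the FIFO; the sample lines are made up.
demo = io.StringIO(
    "ZZZZ,T0001,10:00:00,17-JAN-2017\n"
    "CPU_ALL,T0001,12.0,3.0,1.0,84.0\n"
    "\n"
)
seen = list(records(demo))

# Reading from the FIFO created above would look like this:
# with open("/tmp/xyz") as fifo:
#     for tag, fields in records(fifo):
#         print(tag, fields)
```

Opening a FIFO for reading blocks until a writer has it open, so the reader can be started before or after nmon.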

Question 10: Why do you support all these old unsupported AIX versions?

  • You would be amazed at what AIX versions are running out there!
  • I guess it is a case of - "if it isn't broken don't touch it".
  • nmon can also help when planning server consolidation from these old versions to, for example, micro-partitions on newer hardware.

Question 11: What if I want support for nmon?

  • You have a few options:
    • Give me (Nigel Griffiths) loads of money (and I have no problem with this) or
    • AIX: Pay for and use IBM Tivoli Performance Monitoring product with Support
    • AIX: Pay for and use PM for AIX (Also called Performance Manager for AIX) which is a remote service. Your server's performance data is sent to IBM over a secure link and it generates all the performance graphs, reports and suggestions that you can view via a Web Browser.
  • AIX - assuming you have paid IBM for Support
    • nmon for AIX is a fully supported AIX command so you can raise an IBM Problem Management Report (PMR).
    • However, you can't really ask for help with post-processing graphing tools that are not part of AIX. So, please, no nmon Analyser or nmonchart questions.
    • AIX Support is not really there to read the nmon for AIX manual pages for you!
    • Also please read very carefully the nmon help information: nmon -h
    • If you raise a PMR then AIX Support will immediately want a snap report and a PerfPMR data collection during the problem - so have these ready.
  • Next note that nmon is not really a problem determination tool (it is a performance monitoring tool) so AIX Support will only take a quick look at the nmon data and move on to other problem determination tools like the excellent AIX trace, svmon and other advanced tools.
  • Linux
    • nmon for Linux is becoming part of the popular distributions - if you have paid for Linux support you could request help.
    • You can raise bugs on the sourceforge.net website (if you have or get a Sourceforge user account = not hard) for the nmon project: https://sourceforge.net/projects/nmon/

If it is something fairly simple you could ask a question on the IBM Performance Tools Forum (if you have or get an IBM developerWorks user account = not hard): IBM Performance Tools Forum

  • It is rather annoying when I get asked questions like: "The nmon feature #### is broken, please fix it immediately?"
    • Which OS and its version?
    • Which nmon version?
    • What are the symptoms?
  1. How to report an nmon problem well?
  • Please include the following to get a quicker response and save wasting time asking all these questions before getting started:
    1. OS version
      • AIX: oslevel -s
      • Linux Distro: cat /etc/*ease
    2. nmon version:
      • See Question 1.
    3. Briefly, the type and size of hardware
      • Processor type: POWER, Z, AMD, ARM, Intel other
      • No of CPUs and size of RAM
      • Equipment type: appliance, PC, Laptop, small server, large server - virtual machine or whole server
    4. The actual nmon command run
      • with all the options
      • Or send me the script used to start because I like a laugh! 9 times out of 10 the script is pointless.
    5. Have you read carefully the nmon -h output?
      • This answers 33% of questions - particularly the line: "the -f or -F MUST be the first option on the line"
    6. Describe the symptoms of the perceived problem
      • What were you expecting?
      • What did you get?
    7. Have you tried something simpler?
      • Instead of 25 options have you tried the problem one by itself or using nmon without that option?
    8. Send details
      • Send me a sample file - hopefully not too large
      • or a screen capture/scrape showing the problem.

Then you get your question answered sooner.

Question 12: Why don't you add a Java front end to nmon and get graphical output?

  • I don't have the time or the interest.
  • I have had a great laugh at Linux tools that do this sort of thing but then they highlight that the graphing takes serious CPU cycles. I have seen very simple tools take from 20% to 100% of a CPU - which is not what nmon is all about. I don't want to waste server CPU time collecting the data when that CPU should be used for running the application, RDBMS or what-ever.
  • nmon aims to keep below a few percent of one CPU - this gets smaller as CPUs get faster.

Question 13: The command line options don't seem to work right for file capture?

  • The -f, -F, -x, -X or -z MUST be the first option on the line and only one of them.
  • This is documented in the nmon -h output.
  • This option sets all the other option flags to a sensible set.
  • You can then use the other flags to modify their default behaviour.
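
The rule above can be checked before a script launches nmon. This is a deliberately simple sketch: check_capture_options is a hypothetical helper, not part of nmon, and it only looks at the first letter after each dash, so it will not notice a capture letter buried inside a combined option.

```python
CAPTURE_LETTERS = set("fFxXz")


def is_capture_option(arg):
    """True if arg looks like -f, -F, -x, -X or -z (other letters may follow)."""
    return len(arg) >= 2 and arg[0] == "-" and arg[1] in CAPTURE_LETTERS


def check_capture_options(args):
    """Return a list of problems with the option order; empty means none found."""
    problems = []
    if not args or not is_capture_option(args[0]):
        problems.append("first option is not one of -f, -F, -x, -X, -z")
    misplaced = [a for a in args[1:] if is_capture_option(a)]
    if misplaced:
        problems.append("capture option not in first position: " + " ".join(misplaced))
    return problems


good = check_capture_options(["-f", "-s", "5", "-c", "300"])
bad = check_capture_options(["-s", "5", "-f"])
```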

Question 14: What is paging to a filesystem (rather than to paging space)?

  • Hopefully, you already understand paging to paging space (also called virtual memory).
  • There are other types of paging.
  • AIX (and other UNIX versions) page in the read-only code from a program as you start it and as it runs. This is just like paging in from the paging space but is directly from the filesystem, this is also true for shared libraries (which you might not be aware you are using).
  • Also programs using memory mapped files access regular filesystem files - this allows access by simply reading and writing memory addresses - AIX will page in the file pages as necessary and they will get paged back to the filesystem to free up memory or if the program forces it or if the program stops.

Question 15: Where can I get nmon and further information?

Question 16: TOP process stats get switched on when I request AIX Asynchronous I/O stats?

  • These are often just called AIO on AIX.
  • This is working as normal.
  • To get the AIX aioserver process stats the details of all processes have to be collected, sorted and searched.
  • Having paid the CPU cycles for the TOP process stats you may as well see them on the screen or in the output file, so nmon automatically switches them on for you at no additional charge!

Question 17: nmon2rrd fails, please fix it?

  • nmon2rrd is a C program that takes nmon files and changes the data ready for the excellent RRDTOOL which can be used to generate graphs in .gif files for displaying on a webserver.
  • You have been supplied with the source code for nmon2rrd and it is supplied as a "toolbox".
  • This means users are expected to come up with fixes rather than the original developer.
  • Note there are updated versions from users on the nmon download site - well done guys.

Question 18: What are NANQ and INF?

  • These are output when calculations within nmon have gone wrong.
  • Typically, when dividing by zero. NANQ means "Not a number" and INF means infinite.
  • Sometimes this can happen due to rounding errors but mostly it is a bug or the numbers have overflowed the C data types.
  • When nmon uses printf to display the invalid number it outputs these strings instead.
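
If you post-process capture files yourself, it helps to screen these strings out before doing arithmetic. This is a minimal sketch, assuming the fields arrive as text; to_number is a hypothetical helper name.

```python
import math


def to_number(field):
    """Convert one text field to a float, or None if it is not a usable number."""
    try:
        value = float(field)
    except ValueError:
        return None  # e.g. "NANQ" or an empty field
    # float() itself accepts "INF" and "nan", so reject those explicitly.
    return value if math.isfinite(value) else None


cleaned = [to_number(f) for f in ["12.5", "NANQ", "INF", "", "0"]]
```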

Question 19: nmon reports more than 100% for a process - clearly it is wrong?

  • Unlike AIX and some Linux commands, nmon reports the CPU utilisation of a process per CPU (the commands report as a percentage of all CPUs).
  • If your process is, for example, taking 250% then it is using 2.5 CPUs and must be multi-threaded, as it is using more than one CPU.
  • This is far better than the commands because the percentages on larger machines make it very hard to determine if a process is using a whole CPU.
  • On a 64 CPU machine a single rogue process uselessly spinning on the CPU takes up 1.56% of the total CPU - this makes it very unclear what is going on.
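
The two ways of reporting are related by a single division, sketched below. The function names are made up for illustration.

```python
def share_of_machine(per_cpu_percent, cpus):
    """Convert an nmon-style per-CPU percentage to a percentage of all CPUs."""
    return per_cpu_percent / cpus


def cpus_in_use(per_cpu_percent):
    """Number of CPUs' worth of time a process is consuming."""
    return per_cpu_percent / 100.0


# One process spinning on one CPU of a 64 CPU machine.
rogue = share_of_machine(100.0, 64)

# A multi-threaded process that nmon shows at 250%.
busy_cpus = cpus_in_use(250.0)
```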

Question 20: On AIX the disk adapters are wrong?

  • AIX
    • nmon just outputs what it gets from the libperfstat library.
    • For multipath I/O, the disk-to-adapter mapping often reflects the order of disk discovery rather than some balanced view.
    • This is an AIX problem and not nmon's fault.
    • To list what nmon is extracting from the libperfstat library you can use the sample code and the binaries precompiled for AIX 5.3 from the Roll-Your-Own Performance Tools wiki page, in particular the adapt sample program.
  • If you don't like the way libperfstat reports the adapter stats raise a PMR and refer to the adapt sample - as you will get nowhere reporting nmon errors.

Question 21: On AIX the adapter busy goes over 100%. That is impossible surely?

  • There are no adapter stats in AIX (see above). They are derived from the disk stats. The adapter busy% is simply the sum of the disk busy%.
  • So if the adapter busy% is, for example, 350% then you have 3.5 disks busy on that adapter. Or it could be 7 disks at 50% busy or 14 disks at 25% or ....
  • There is no way to determine the adapter busy and in fact it is not clear what it would really mean. The adapter has a dedicated on-board CPU that is always busy (probably no real OS) and we don't run nmon on these adapter CPUs to find out what they are really doing!!
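
The derivation described above is just a per-adapter sum, as this sketch shows. The disk and adapter names are illustrative, and the function is not nmon's actual code.

```python
def adapter_busy(disk_busy, disk_to_adapter):
    """Sum disk busy% per adapter, which is why totals can exceed 100."""
    totals = {}
    for disk, busy in disk_busy.items():
        adapter = disk_to_adapter[disk]
        totals[adapter] = totals.get(adapter, 0.0) + busy
    return totals


# Seven disks at 50% busy on one adapter, two at 25% on another.
disk_busy = {f"hdisk{n}": 50.0 for n in range(7)}
disk_busy.update({"hdisk7": 25.0, "hdisk8": 25.0})
disk_to_adapter = {f"hdisk{n}": "fcs0" for n in range(7)}
disk_to_adapter.update({"hdisk7": "fcs1", "hdisk8": "fcs1"})

totals = adapter_busy(disk_busy, disk_to_adapter)
```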

Question 22: What about nmon for HP/UX, Solaris on Sparc or x86 or Linux on Itanium?

  • As I don't have access to such machines this is not going to happen.
  • There is also a problem that IBM gives me access to the current hardware because nmon is seen as a competitive advantage. If this was ported to every UNIX then I would not be allowed this access.
  • There have been attempts to port nmon for Linux to other operating systems but they have not been continued after a year or so.

Question 23: What about nmon for Windows?

  • Now you must be joking.
  • This does get asked a couple of times a year.
  • The real problems are:
    1. How would the stats be extracted from Windows by a C program given no libperfstat or /proc?
    2. The stats would be completely different and for AIX/UNIX/Linux performance people very hard/impossible to understand
    3. Given point 2, none of the graphing tools would work.

Question 24: Seeing double the number of CPUs on my POWER server?

  • This is a POWER based machine question
  • This is due to the SMT feature of the POWER5 chip (and later POWER chips), where each CPU (core) runs two processes at the same time.
  • This gives you a 40% boost in performance for most commercial workloads and it is a really "good thing".
  • You need to read up on SMT or get yourself a presentation from IBM on the subject.
  • Of course, POWER6, POWER7 and POWER8 machines have higher SMT levels.

Question 25: Hello, I am new to UNIX and want to tune AIX, what do you recommend?

  • Don't do it.
  • AIX is very good at looking after itself and self tuning. I have seen rookie systems admin nearly halt a machine by making "improvements".
  • Go on a course or read the AIX performance Redbooks from http://www.redbooks.ibm.com but don't just try changing things unless you first of all have a problem, and second know what you are doing and have practised on a non-production machine or LPAR.

Question 26: CPU wait is too high, how can I reduce it?

  • This question is asked a lot and it can mean your CPUs are actually too fast!
  • CPU "waiting for I/O" state and utilisation numbers (as opposed to User, System and Idle) mean the CPU is Idle but has a disk I/O outstanding. Historically this was used to highlight that your application is being held up by slow disks or disk problems.
  • In the Wait for I/O state the CPU is actually free to do other work and is NOT looping waiting for the disk. It has in fact told the adapter to perform the disk I/O, put the calling process to sleep and carried on. If there is no other process to run it is in the same loop as in the Idle state, i.e. it is available to do other things.
  • In AIX the processor does one of two things:
    • In regular stand-alone machines or a dedicated CPU LPAR the processor runs a special kernel level process called "wait" from which it can exit very quickly at the arrival of the next interrupt.
    • In a micro-partition (Shared Processor LPAR) the processor after a few microseconds will call the Hypervisor to yield the processor for other LPARs.
  • In benchmarks, Wait for I/O is seen positively as an opportunity - we can throw in more work to boost throughput.
  • Any workload in which the CPU does comparatively little work compared to the volume of disk I/O is going to give you high Wait for I/O.
  • If this high Wait for I/O is a sudden change from the normal pattern then it needs investigating and you should make sure as many disks as possible are involved in the disk I/O.
  • But lots of workloads just run like this - a common example I come across regularly is SAP databases. SAP cleverly caches lots of data but on large databases it has to do lots of disk I/O for particular customer or whatever records. Once the data is available it is sent to the SAP application servers i.e. little work is done on the database.
  • In fact, faster CPUs would mean even higher wait values.

Question 27: On AIX, free memory is near zero, how do I free more memory?

  • AIX
    • This is just how AIX works and is perfectly normal. All of memory will be soaked up with copies of filesystem blocks after a reasonable length of time and the free memory will be near zero. AIX will then use the lrud process to keep the free list at a reasonable level.
    • If you see the lrud process taking more than 30% of a CPU then you need to investigate and make memory parameter changes.

Question 28: How can I set numperm better?

  • You can't. This number just reflects the amount of memory being used for disk blocks - called the buffer cache. It is controlled by three parameters minperm, maxperm and strictperm but these set thresholds and algorithms. The actual numperm number reflects what is actually going on. You will have to find other places for tuning these parameters as it is beyond the scope of this FAQ.
  • It is also worth noting that the nmon values for numperm and maxperm are based on a percentage of physical memory. The AIX commands report a percentage but not of all memory - they seem to remove some memory that might be something like the memory allocated to the AIX kernel (i.e. it could never be used as cache). Unfortunately this is not documented and the memory size not counted is not available with any public API. So nmon does the best it can but the numbers will not be absolutely the same.

Question 29: What format is the nmon output file?

  • Plain ASCII text that you can edit with vi (but you might hit the 2048 byte line limit on the AIX vi).
  • I use the Open Source vim on AIX to avoid this or do it on Linux.
  • The first token on the line tells you what sort of data it is
    • AAA lines are basic nmon data about this collection of data
    • BBB lines are about the configuration of the machine
    • ZZZ lines include the date and time stamp, stored here once to reduce output
    • others should be obvious
  • The second field is the Timestamp - see the ZZZ section for the actual time
  • Then there is the data
  • Each sort of data (CPU, DISK, etc.) has a Header line that describes the columns and the header lines also include the graph titles
  • You do not need to sort the nmon output file for nmon2rrd or the Analyser but if you do then you can see the sections more easily for editing.
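
To show how the pieces fit together, here is a minimal sketch that looks up the real time for each data row through the ZZZ lines. The exact column layout (label, then time, then date on ZZZ lines) and the sample values are assumptions for illustration, so check them against one of your own files.

```python
def snapshot_times(lines):
    """Map each snapshot label (second field of a ZZZ line) to (time, date)."""
    times = {}
    for line in lines:
        fields = line.strip().split(",")
        if fields[0].startswith("ZZZ") and len(fields) >= 4:
            times[fields[1]] = (fields[2], fields[3])
    return times


def rows_for(lines, section, times):
    """Yield (time, date, values) for the data rows of one section."""
    for line in lines:
        fields = line.strip().split(",")
        # Header lines are skipped because their second field is not a label.
        if fields[0] == section and len(fields) > 1 and fields[1] in times:
            yield (*times[fields[1]], fields[2:])


# Illustrative lines only.
sample = [
    "AAA,host,db1",
    "CPU_ALL,CPU Total db1,User%,Sys%,Wait%,Idle%",
    "ZZZZ,T0001,10:00:00,17-JAN-2017",
    "CPU_ALL,T0001,12.0,3.0,1.0,84.0",
    "ZZZZ,T0002,10:05:00,17-JAN-2017",
    "CPU_ALL,T0002,20.0,5.0,0.0,75.0",
]
times = snapshot_times(sample)
cpu_rows = list(rows_for(sample, "CPU_ALL", times))
```

Building the label-to-time map in a first pass means the second pass works whether or not the file has been sorted.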

Question 30: I have collected once a second for 8 hours but I can't get the Analyser to work?

  • You have 28800 data points and you want to see this on a screen, say, 1024 pixels wide!!
    • That is about 28 data points per pixel.
  • My new Thinkpad has 1400 pixels across the screen, so I am down to just about 21 data points per pixel
    • What were you thinking when you started collecting so much data!!
  • I think even with the best will in the world, the analyser spreadsheet is going to struggle at some point with too much data.
  • On a tiny machine you get about 1.5KB per snapshot and a normal size machine with a few nmon options it is more like 60KB each. At 60KB the maths --> 28800*60KB = 1.6GB. How big is your output file?
  • I hope you have at least 16 GBs of memory in your PC to handle this!
  • As I hope you know the nmon file is text and editable with vi (but you might hit the 2048 byte line limit on the AIX vi). I use the Open Source vim on AIX to avoid this or do it on Linux. If you take a look at the file format you should be able to cut down the file size and make a series of files, but each will need the header section that you will find at the top of the file and then a different set of snapshots.
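
The arithmetic in this answer is easy to redo for your own capture before you start it. This is a rough sketch; the function names are made up and the per-snapshot size is whatever you measure on your own machine.

```python
import math


def snapshots(hours, interval_seconds):
    """Number of snapshots a capture of this length will contain."""
    return int(hours * 3600 // interval_seconds)


def points_per_pixel(hours, interval_seconds, screen_pixels):
    """How many data points have to share each horizontal pixel of a graph."""
    return snapshots(hours, interval_seconds) / screen_pixels


def estimated_size_kb(hours, interval_seconds, kb_per_snapshot):
    """Rough output file size in KB, ignoring the header section."""
    return snapshots(hours, interval_seconds) * kb_per_snapshot


def interval_for(hours, target_points):
    """Smallest whole-second interval giving at most target_points snapshots."""
    return math.ceil(hours * 3600 / target_points)


count = snapshots(8, 1)
density = points_per_pixel(8, 1, 1024)
size_kb = estimated_size_kb(8, 1, 60)
suggested = interval_for(8, 1024)
```

Working backwards from the number of points you can actually display gives a more sensible capture interval than collecting every second.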

Question 31: nmon does not work on my Linux machine!

  • nmon runs on x86 (Intel and AMD), mainframe, ARM and POWER processors and on a dozen or so versions of Linux.
  • If your Linux system has a C compiler and ncurses you can have nmon running in a minute or two.
  • If you report problems I will need to know which platform and which Linux version plus distro before I can help so please include these with initial questions.

Question 32: When do we get nmon for AIX version X for Linux?

  • The Linux & AIX source code for nmon is very different apart from curses framework and basic approach.
  • AIX gets all the information from system and library calls (with two exceptions) and in Linux this has to be read from the /proc filesystem and some classic UNIX style kernel functions.
  • This means the AIX code is more straightforward.
  • The code base of the AIX and Linux version are completely different.
  • So there is no need for Linux and AIX to have the same version number.

Question 33: The boxes and lines in nmon do not work right online with: DTterm, xterm, rxvt, putty, VNC, (whatever you have)?

  • nmon uses curses to handle the displaying of characters on the terminal.
  • This is controlled mostly by your TERM variable setting.
  • The nmon developer tests with all of the above.
  • They work perfectly and they work perfectly all the time.
  • If it does not work for you then you have some setting wrong on your machine or in X Windows, or you have strange settings for the TERM and/or TERMINFO shell variables, or you are using a duff terminal emulator.
  • For example, if you tell putty this is an xterm session via the putty settings but tell Linux this is a vt100 session with TERM=vt100, then expect odd things to happen.
  • Let me state that again: your system has a problem not nmon.
  • The TERM shell variable should be set to the terminal emulator you are using.
    • If you are using a xterm then TERM should be xterm
    • If you are using DTterm then TERM should be dtterm
    • If you are using an AIX term then TERM should be set to aixterm
    • Get the idea - other combinations are your problem.
    • Unless you are using a genuine 1970's DEC VT100 then you should not be using this setting with more advanced terminal emulators. I remember VT100's well, even found a bug in the firmware once!
    • The TERMINFO variable should not be set to anything (in fact not set at all). If it is then you or someone has been mucking about with terminfo databases and why are you blaming nmon?
  • Terminal Emulators:
    • xterm works well in black and white.
    • aixterm works well and has colour and nmon uses the colour.
    • DTterm works well and has colour and nmon uses the colour.
    • rxvt and xterm-color combination (see WWW for details on setup, on google.com search for xterm-color and AIX) - this combination also lets vim (the improved vi from Open Source) use syntax highlighting in C code.
    • The Windows telnet terminals emulation is very poor indeed and not recommended under any circumstances - you are on your own.
    • The best alternative on a Windows PC is putty (see WWW for details and download) and is highly recommended - I use this every day - this will work with TERM set to xterm perfectly.
    • VNC is, of course, even better and gives you X windows on a Windows workstation at zero cost - again highly recommended.
  • The -B option starts nmon with no boxes (or colour). Some purists do not like to waste the screen space with the box lines. You could add 'B' to the NMON shell variable to make this automatic: export NMON=B
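
The TERM and TERMINFO advice above can be turned into a quick self-check before blaming nmon. This is only a sketch: check_term is a hypothetical helper, not part of nmon, and it tests nothing beyond the two settings this answer warns about (TERM=vt100 and a set TERMINFO).

```shell
# Hypothetical helper (not part of nmon): flag the TERM/TERMINFO settings
# that this FAQ answer warns about.
# Usage: check_term [term] [terminfo]   (defaults: current environment)
check_term() {
    term="${1-${TERM-}}"
    terminfo="${2-${TERMINFO-}}"
    status=0
    if [ -z "$term" ]; then
        echo "TERM is not set"
        status=1
    elif [ "$term" = "vt100" ]; then
        echo "TERM=vt100 is only right for a genuine DEC VT100"
        status=1
    fi
    if [ -n "$terminfo" ]; then
        echo "TERMINFO is set to $terminfo but should not be set at all"
        status=1
    fi
    if [ "$status" -eq 0 ]; then
        echo "TERM=$term looks plausible"
    fi
    return "$status"
}
```

Run check_term with no arguments in the shell you start nmon from. A non-zero return code means one of the settings above deserves a second look.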

Question 34: I have 2400 disks (or 2400 small SAN LUNs) and nmon is slow to collect the stats from so many, can you help?

  • I guess you are learning the folly of small LUNs and that it makes the machine totally unmanageable. But you are not the first or worst - the record stands at 4500. Some suggestions:
    • Have you got more than four paths to each LUN?
      • If yes, you need to fix this ASAP as it is bad for performance and terrible for RAS (and I mean really bad).
    • nmon for AIX only: Using the -D flag to stop nmon collecting the disk configuration each time can really help to reduce the start up time.
    • Collect this disk configuration just the once - unless you are changing the disks a lot!!
    • You can use nmon User Defined Disk Groups to limit the output but nmon will still have to collect all the data from all the disks and then reduce what is actually reported.
    • But the only real solution is to reduce the number of disks you have - yes, I know this is a lot of work but you have a machine setup that can not be managed and that is not viable in the long term.
    • Don't blame nmon for highlighting the issue.
    • I recommend 32 to 64 LUNs and make the disk subsystem do the hard work of spreading the data across disks - i.e. not you, as your time is much more valuable. After all, that is what you buy big disk subsystems for and there are better uses of your time and thought.

Question 35: What is CharIO (a column of the TOP processes stats)?

  • This is the Character I/O that a process is generating and it is counted from calls to the read() and write() system calls.
  • This will include I/O to files, terminals (now rare), FIFO, pipes and network sockets.
  • I/O started in other ways like Async I/O (commonly used by an RDBMS), paging or memory mapped files are not included.
  • The number is fetched from the AIX kernel using the getprocs64() system call and the structure found in /usr/include/procinfo.h - look for the pi_ioch variable.

Question 36: On Linux the disk stats are all doubled?

  • Linux
    • nmon collects the data from /proc and displays it.
    • On newer Linux Kernels this is the /proc/diskstats file.
    • It was decided a long time ago that hiding data was a very bad idea as it can go wrong and then be very misleading
      • This is how the ozone hole was missed for 5 years and not detected - the algorithm decided the data must be wrong and deleted it from the stats.
    • The Linux disk stats (in three different files and four formats depending on the Linux version - great coding guys!!) report both disk level and disk partition level stats in the same file. nmon just shows you the stats - it is your job to understand them.
    • nmon does not hide any of them: with LUNs on SAN disks, software RAID and LVMs it is much safer to show everything.
    • Consider using the nmon feature called "User Defined Disk Groups" to remove the doubling and make disks simpler to understand.

Question 37: On AIX the disks seem to be mostly on the first adapter?

  • AIX
    • nmon now collects the adapter data from AIX libperfstat.
    • This is the addition of the disk stats added up by knowing which disk is connected to which adapter.
    • This of course, is complex for multipath IO disks.
    • AIX seems to build this map from the order in which disks are discovered rather than used.
    • Depending on your initial setup it can often mean that most disks are assigned to the first one or two adapters.
    • Sorry, there is nothing that nmon can do about this.
    • To list what nmon is extracting from the libperfstat library you can use the sample code and the precompiled binaries for AIX 5.3 (and onwards) from the Roll Your Own Wiki page at ryo
    • Consider using the nmon feature called "User Defined Disk Groups" to remove the doubling and make disks simpler to understand.

Question 38: On nmon for Linux the CPU Wait for IO number is zero or odd?

  • This number is not available in the /proc filesystem until the 2.6 kernel and then it appears in the undocumented fields at the end of a line - I have fixed this for the 2.6 kernels in nmon for Linux version 11c onwards.

Question 39: nmon for Linux has paging details missing and the PAGE lines for the capture to file are missing.

  • This data was very hard to locate and now appears in nmon for Linux version 11d onward for the 2.6 kernel.
  • Before this kernel version the data is not present in /proc.

Question 40: I want to collect data every second and then see weekly and monthly reports. How?

  • Let us take this in simple bite-size chunks:
    1. First, a practical point: most Laptop and PC screens are 1024 x 768 pixels (or about 1.5 times that). The point is that no matter how many data points you have, you cannot see more than about 800 of them. This is why I recommend about 300 to 400 data captures with nmon to get good looking graphs.
    2. Second, one second stats for a day give you (60 x 60 x 24) 86400 data points! So OK, let us try one minute stats: then we have 1440 data points, which is still too many. So we need to move to 5 minute captures and we get to a sensible 288 data points and a good looking graph.
    3. Third, we then collect data for a month: 288 x 31 = 8928 data points - oh dear, far too many data points again!! So now we have to drop down to once an hour data capture (24 x 31) and we have 744 data points, which is possible - we had better start thinking about the purchase of a bigger screen.
    4. If you then want to compare months or have a yearly report ... well you get the idea by now, we are now monitoring 12 hour periods.
  • But the above is only a physical problem. The much larger logical problem is still there to catch you out and that problem is averaging out.
  • A long time ago I noticed that the shorter the time period that you use to monitor the more fluctuations you notice in the data.
  • Philosophy: If you keep using shorter and shorter periods you will eventually see that the CPUs are either 100% busy or 100% idle. All the other numbers are just a feature of humans not thinking fast enough and having to average out the CPU use over longer periods.
  • Anyway, for performance tuning we need to concentrate on the peaks. Take a look at the below graph:
  • If we average the whole day we get 50%, which completely hides the peaks of the day time and the heavy CPU load during the evening batch. If this computer was not used during Saturday and Sunday the average might come down to 35%. The point is that averaging data over longer periods removes all the important peaks.
  • This is in addition to these 3 data management problems:
    1. Data overload - too many data points
    2. Averaging out - eliminates the vital data
    3. Manipulation - the data will need to be stored, manipulated and displayed - non-trivial
  • I think many people make the mistake of thinking that producing these long term reports from nmon is an easy task, but it will turn out to be very hard work and often the results are utterly pointless or meaningless.
  • If you must attempt this then I recommend:
    • rrdtool to manage and aggregate the data for you and draw graphs
    • nmon2web
  • or, for tools that are not based on nmon data, take a look at
    • Ganglia
    • LPAR2RRD
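
The data point arithmetic in this answer can be sketched as a small shell helper that turns "how long" and "how many points" into nmon -s (seconds between snapshots) and -c (number of snapshots) values. The function name nmon_capture_args and the default of 300 points are assumptions for illustration; only the -f, -s and -c options come from this FAQ.

```shell
# Hypothetical helper: pick nmon -s and -c values for a capture that lasts
# $1 hours and should end up with roughly $2 data points (default 300).
nmon_capture_args() {
    hours="$1"
    target="${2:-300}"
    total=$(( $hours * 3600 ))         # capture length in seconds
    interval=$(( $total / $target ))   # seconds between snapshots
    if [ "$interval" -lt 1 ]; then
        interval=1                     # one second is the minimum
    fi
    count=$(( $total / $interval ))    # snapshots that fit in the period
    echo "-s $interval -c $count"
}
```

For example, nmon_capture_args 24 288 gives the 5 minute captures for a day described above, and nmon_capture_args 744 744 gives the once an hour capture for a 31 day month. The result could be passed on as: nmon -f $(nmon_capture_args 24 288)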

Question 41: How do I work out the Physical CPU use on Linux on POWER for shared processor LPARs?

Linux

  • Here is a Korn shell script that shows you where to get the data and the maths involved.
  • #!/usr/bin/ksh
    before=`grep purr /proc/ppc64/lparcfg | sed 's/purr=//'`
    echo before=$before
    
    integer seconds=2
    sleep $seconds
    
    after=`grep purr /proc/ppc64/lparcfg | sed 's/purr=//'`
    echo after=$after
    
    timebase=`grep timebase /proc/cpuinfo | awk '{print $3 }' `
    echo timebase=$timebase
    
    string="($after-$before)/$timebase/$seconds"
    echo string $string
    bc <<EOF
    scale=5
    $string
    EOF
    

Question 42: Automatic starting with certain statistics for nmon online mode?

  • Use the NMON shell variable to determine which statistics are shown automatically at start up time.
  • If you find you always want CPU, kernel, Memory and Disks i.e. you type: ckmd then set the shell variable as below:
    export NMON=ckmd
    
  • Next time you start nmon these will be shown automatically.

Question 43: Sort order problems with massive nmon output files?

  • So you collected more than 9999 snapshots in a single nmon capture.
  • Ignoring the fact that the Excel Analyser can't cope with all this data and it makes the data unmanageable.
    • I suggest a good aim is between 400 and 700 snapshots per file for good graphs and manageable file sizes.
  • Anyway, you then find out that if you sort the file the rows don't even sort in the right order.
  • The problem is you have four digit and five digit Timeshot numbers - the T numbers.
  • This mucks up the sort ordering.
  • What can you do?
    1. nmon for AIX: add the -w 8 option to the end of the command line. This makes the Timestamp string 8 digits wide instead of 4, i.e. up to T99999999
    2. nmon for Linux does not have this option yet but could.
  • If you have already collected the data, fix the nmon file using the below - it makes all the T numbers 5 digits and then they can be sorted:
  • sed 's/\(,T\)\([0-9][0-9][0-9][0-9]\)\(,\)/\10\2\3/' original.nmon >original5digit.csv
    sort -n original5digit.csv >fixed5digit.csv
    
  • Full marks if you understand the sed command - this is very advanced regular expression stuff
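
For anyone not claiming full marks, here is the same sed expression wrapped in a function with each part explained. The function name pad_tnumbers and the two sample lines are made up for illustration; the sed expression itself is the one shown above.

```shell
# Pad four digit T numbers to five digits so that they sort correctly.
#   \(,T\)                        group 1: the ",T" that starts the field
#   \([0-9][0-9][0-9][0-9]\)      group 2: exactly four digits
#   \(,\)                         group 3: the comma that ends the field,
#                                 so five digit T numbers do not match
#   \10\2\3                       group 1, a literal 0, group 2, group 3
pad_tnumbers() {
    sed 's/\(,T\)\([0-9][0-9][0-9][0-9]\)\(,\)/\10\2\3/'
}

# Made-up sample lines: the first is padded, the second is left alone.
printf '%s\n' 'CPU_ALL,T9999,1.0,2.0' 'CPU_ALL,T10000,1.5,2.5' | pad_tnumbers
```

The first sample line comes out as CPU_ALL,T09999,1.0,2.0 and the second is unchanged because it already has five digits.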

Question 44: Does nmon capture point in time stats or averages?

  • Well, there are two types of numbers
    1. rates and
    2. absolutes.
  • For an absolute example, free memory is an absolute
    1. nmon just shows you how much memory is free at that specific point in time.
  • For a rate example, the network stats are rates (so too are CPU and disks KBps), here nmon does the following:
    1. Capture a complete set of counters - these are incremented by the kernel like the number of bytes sent.
    2. then nmon waits the number of seconds you asked
    3. then nmon captures a second set of these counters
    4. then nmon calculates the difference between the two sets and divides by the number of seconds, so everything is per second
    5. this number is then displayed on screen or written to the data file
  • So the rates are the average between the two capture points. As the number of seconds increases the rates get more and more steady.
  • Note: if you reduce the time between snapshots, i.e. the seconds, to just one (the minimum, to make sure nmon does not use too much CPU time) you will see lots more peaks and dips in the numbers.
  • "Point in time" numbers for rates would be very misleading as they would miss all the peaks and dips in between - you would have to take dozens of them to be sure you are really seeing a representative number.
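
The rate calculation in the numbered steps above comes down to one line of arithmetic. The sketch below assumes you already have two readings of a kernel counter; the function name rate_per_second is hypothetical and this is not nmon's own code.

```shell
# Rate = (second counter reading - first counter reading) / seconds between.
# $1 = first reading, $2 = second reading, $3 = seconds between the two.
rate_per_second() {
    awk -v before="$1" -v after="$2" -v seconds="$3" \
        'BEGIN { printf "%.1f\n", (after - before) / seconds }'
}
```

For example, a bytes-sent counter that goes from 1000 to 5000 over a 2 second interval gives rate_per_second 1000 5000 2, which is 2000.0 bytes per second. All the peaks and dips inside those 2 seconds are averaged into that single number.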

Question 45: When will nmon collect data from lots of machines or LPARs?

  • Answer: Never.
  • I like to think nmon does one job and does it well - it collects data from one Server / virtual machine and saves it in one file.
  • Going multiple machine or LPAR has many problems:
    1. Collecting data from lots of machines or LPARs would require network access and lots of error handling for missing or late data.
    2. The nmon output file would then be far more complex and have to include the machine names and totally rewrite the time stamps.
    3. We already suffer from more data than Excel can handle.
    4. There would simply be too much data to display
    5. This complication would mean nmon becomes very large and code stability would take a long time to settle down
  • What you do need is:
    1. Less data and then you drill down on particular nodes
    2. Automated database generation to store the data
    3. Automated graphing of the data you really want
    4. History for the last hour, day, week, month, year
    5. Small simple daemons on the nodes and automated central collection point
    6. Simple method of collecting more stats
    7. Open Source code to make it safe and simple to implement.
  • This tool is called Ganglia, see http://ganglia.sourceforge.net/ See Question 101

Question 46: When will nmon collect data like the AIX "topas -C"?

  • It may not be obvious but topas and topas -C are two completely different programs hidden in one binary.
  • The cross partition stats involve communicating with each LPAR and the HMC to get the data, unlike the local stats that just call the local kernel API.
  • The cross partition version of nmon has already been written: it is called Ganglia, please see http://www.ibm.com/collaboration/wiki/display/WikiPtype/ganglia for more details. OK, it is an excellent Open Source tool and nothing to do with nmon, but it has all the right stats, many brilliant features, is very simple to implement and has very little impact on performance.
  • There is no need to duplicate this work. It also supports lots of operating systems, the output is via a website, the data is in graph form and it keeps historic data - so this is better than text output on a dumb screen and only for root users.

Question 47: nmon will not stay running - What should I check?

First the regular house keeping:

  • Have you filled up a filesystem? df -m
  • Have you got a recent Operating System level? i.e. missing bug fixes
    • Linux cat /etc/*ease
    • AIX oslevel -s if the last four digits don't start with a 15 or 16 then your AIX is probably not fully supported !!!
  • On Linux have you got a current version of nmon?
    • You would be amazed how often I get questions about 2 to 5 year old versions of nmon for Linux
    • There seems to be a mentality with some users that if nmon works now then you can pass that version on to your grand children!
  • What do you get in the nmon output file?
    • That can provide clues about where it stopped.
    • Use vi to take a look at the end of the file.
  • Have you forgotten to reboot AIX after an upgrade?
    • The result is that /unix does not match the kernel that is actually running in memory
  • Stop using duff nmon command line options.
    • Is it possible for you to read the nmon -h output :-)
    • No one seems to read the line Note: use only one of f,F,z,x or X and make it the first argument
  • Are you sure you really want all these two dozen options.
    • Sometimes invalid options cause problems, for example, requesting NFS stats when the OS is not using NFS.
  • KISS = What happens if you try something simple like: nmon -f -s1 -c 10
    • This eliminates the advanced options/stats being the cause of an issue.

Still got a problem? Get some help

Question 48: Why isn't nmon for Linux on the Distro media or online repository or it is there but out of date?

  • Good question but there are size limits to the typical Distribution 4 GB Media DVD.
  • Things are improving: for Ubuntu, Red Hat and SUSE - my focus for enterprise Linux.
  • But each has bizarre processes to get packages accepted and updated.
  • For POWER systems the IBM Internal repositories have current nmon for Linux for current and new SUSE and Red Hat releases.
  • For Ubuntu the person who added the original nmon for Linux 14g package fell asleep for two years!! Only recently (Sept 2016) was the package updated, and that might appear in Ubuntu in 2017.
  • If you, as a Linux user, request this from your Distributor it might improve the situation.

Question 49: Do you have nmon presentations I could use for training others?

  • We are not in the 1990s any more! The world has moved on.
  • The new way is watching YouTube videos to learn at your own speed and at a convenient time.
  • See the nmon Documentation for the list of nmon for Linux and nmon for AIX videos by me.

Roughly 45 minutes plus either of the two popular graphing tools: nmonchart (browser graphs) or nmon analyser (Excel) - which both work for Linux and AIX files.

Question 50: nmon Analyser: What is Wavg?

  • Wavg or WAVG is the Weighted Average.
  • This data is not in the nmon output file but calculated in the Analyser.
  • This is the average of the busy periods and largely ignores the idle times.
  • If you take 100 snapshots of a statistic, with 50 busy at 100% and 50 idle at 0%, then the average is 50% but that number does not really describe the situation and can be misleading.
  • The Analyser uses a mathematical trick to boost the importance of the busy times and discount the idle periods.
  • Take a look at the Analyser spreadsheet for the calculation.
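
To get a feel for how a weighted average can favour the busy periods, here is a sketch that weights each sample by its own value (sum of squares divided by the sum). This formula is an assumption chosen for illustration only and may not be what the Analyser does; the spreadsheet remains the reference for the real calculation. The function name weighted_avg is hypothetical.

```shell
# Read one number per line and print the plain average next to a weighted
# average in which each sample is weighted by its own value.
# ASSUMPTION: sum(x*x)/sum(x) is used here only as an example of a
# weighting that discounts idle samples.
weighted_avg() {
    awk '{ sum += $1; sumsq += $1 * $1; n++ }
         END {
             if (n == 0) { print "no data"; exit 1 }
             wavg = (sum == 0) ? 0 : sumsq / sum
             printf "avg=%.1f wavg=%.1f\n", sum / n, wavg
         }'
}
```

Feeding it 50 samples of 100 and 50 samples of 0 gives a plain average of 50 but a weighted average of 100, which describes the busy periods much better.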

Question 51: LPAR Tab/Statistics missing with Dedicated CPU mode?

  • The LPAR Tab or LPAR Statistics lines in the nmon file are all about Shared CPU usage.
  • For a LPAR in Dedicated CPU mode these stats don't make any sense, so they are not collected.
  • This is NOT a bug.

Question 52: Adding External Data Collectors to nmon files so it graphs your extra data?

  • So you have extra data you want nmon to collect and add to the nmon file for graphing
  • nmon will help you collect the data at the right intervals and in the right format
  • So the data can simply be added to the end of the nmon file and graphed
  • The nmon Analyser is pretty good at graphing "unexpected extra data" provided the numbers are in a similar range.
  • Read the AIXpert Blog article for all the details nmon and External Data Collectors

Question 53: Sharing nmon files - Are they a security risk?

  • If you are worried, remove the following:
    • Hostname - Which you can simply alter with vi or sed
    • IP address - Unlikely to be directly on the Internet anyway
    • Some process names show software that you are running - You may not want others to know what you use!
    • Same for some file system mount points.
    • Machine serial numbers - IBM does not recommend making these public
    • Old Machine types, firmware levels and OS level - Could be embarrassing!!
  • Are the files a risk? Nothing here that helps a hacker.
  • Read my AIXpert Blog Article at nmon Data Files Are they a Security Risk?

Question 54: The Disk stats are far too high or 100%, nmon is broken?

  • Check that your OS is at the current level
    • on Linux upgrade it and
    • on AIX run oslevel -s (the last four digits are the year (16 = 2016) and week number). If your AIX is two years out of date you badly need to update.
  • Just because a disk is 100% busy it does not necessarily mean you have a performance issue.
    • Perhaps you really are reading a large file(s)!
    • If your CPUs deal with the data faster than the disk can deliver it then you get a 100% busy disk.
  • What happens when you just look at the stat numbers ?
    • nmon
    • Then D
  • If on AIX try an alternative command like topas -D
    • Does that give you the same stats ?
  • If on Linux, if necessary, install sysstat and check the iostat output
  • If they show the same sorts of disk I/O numbers then it is very likely you are doing the I/O.
  • Look at top processes which are doing lots of I/O with: nmon
    • Then t 5
  • For Linux: If it still looks wrong report the problem at the DeveloperWorks Performance Tools Forum - or email me if you can work out the email address
  • For AIX, If you think you have a performance issue or a nmon issue
    1. create a snap
    2. capture the workload: perfpmr
    3. raise a PMR
    4. Note: AIX Support is not there to analyse your nmon data and graphs for comment, just as they are not there to correct your spelling in files created by vi!
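
The oslevel -s check mentioned in this answer can be scripted. The sketch below pulls the year and week number out of the last four digits, as described above. The function name aix_level_date and the sample level string are illustrative assumptions.

```shell
# Split the last field of an "oslevel -s" string into year and week.
# $1 = level string, for example the sample value 7100-04-03-1642
aix_level_date() {
    level="$1"
    yyww="${level##*-}"                # text after the last "-"
    case "$yyww" in
        [0-9][0-9][0-9][0-9]) ;;       # exactly four digits: OK
        *) echo "unexpected oslevel -s format: $level"; return 1 ;;
    esac
    echo "year=20${yyww%??} week=${yyww#??}"
}
```

On AIX this could be run as: aix_level_date "$(oslevel -s)"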

Question 55: What files does nmon for Linux use to get its data?

  • It is a popular mistake to think that nmon for Linux uses Linux commands to get its data. That would be expensive in CPU cycles especially if you request the data every second.
  • For efficiency nmon for Linux gets the data from system calls (where possible) and from the /proc file system mostly.
    • Even though /proc looks like a bunch of files, they are in fact more like device drivers, where a file read results in a system call to get the data and not disk I/O.
  • nmon for Linux reads the text from the following files. Hopefully the file names explain the data in each file. If not then you can take a look at the file.
  • There is some information available: from the Linux manual: man 5 proc
    • but the Manual is often vague and does not explain units or why the data is sometimes missing.
  1. Performance stats
    • /proc/cpuinfo
    • /proc/stat
    • /proc/version
    • /proc/meminfo
    • /proc/uptime
    • /proc/loadavg
    • /proc/net/rpc/nfs
    • /proc/net/rpc/nfsd
    • /proc/vmstat
    • /proc/ppc64/lparcfg - POWER systems only
    • /proc/diskstats
    • /proc/partitions
    • /proc/net/dev
  2. Process stats where PID is replaced with the Process ID number in turn
    • /proc/PID/stat
    • /proc/PID/statm
    • /proc/PID/io
  3. Configuration data - includes the above in full text and then these too
    • /proc/device-tree/host-model
    • /proc/device-tree/host-serial
    • /proc/device-tree/ibm,partition-name
    • /proc/diskinfo
    • /proc/sysinfo
    • /proc/modules
  4. Some extra data is extracted using classic UNIX system calls like those to detail the file systems and mount points
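
To illustrate what "reading the text from /proc" means in practice, here is a sketch that works out CPU busy percent from two copies of the first "cpu" line of /proc/stat. It assumes the field order described in man 5 proc (user, nice, system, idle, iowait, irq, softirq, steal). The function name cpu_busy_pct is hypothetical and this is not nmon's own code.

```shell
# $1 and $2 are the "cpu ..." summary lines from two reads of /proc/stat.
# Busy percent = share of elapsed ticks not spent in idle or iowait.
cpu_busy_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= 9; i++) total += $i   # user .. steal
            idle = $5 + $6                         # idle + iowait
            if (NR == 1) { total0 = total; idle0 = idle }
            else         { total1 = total; idle1 = idle }
        }
        END {
            elapsed = total1 - total0
            if (elapsed <= 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * (elapsed - (idle1 - idle0)) / elapsed
        }'
}

# Example use on a live Linux system:
#   a=$(grep "^cpu " /proc/stat); sleep 2; b=$(grep "^cpu " /proc/stat)
#   cpu_busy_pct "$a" "$b"
```

Note that this is a rate in the sense of Question 44: two reads of the counters, the difference between them, and then the share of that difference that was not idle.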

Question 56: Can you add tape drive monitoring on AIX?

  • AIX
    • No - the data is not available. The best you can do is to watch the disks and guess what the tape is doing. The adapter statistics is only adding up the attached disks - so it does not help. You can guess at the tape drive I/O rates by looking at the disk I/O rates - after all this is where the data is coming from but it is only approximate and does not account for memory caching of data.
    • Yes - if your tape drive is Fibre Channel connected it is very common to have it connected on a different FC adapter to allow performance settings to suit the tape drive = streams of large blocks.
    • In this case, use the Adapter stats using the ^ key or -^ startup option to monitor the tape(s).
  • Linux
    • No FC Adapter options for Linux - unless you know the /proc file to find tape stats. In which case let Nigel know ASAP.

Question 57: How to use External Data Collectors with nmon?

  • The external data collectors feature is to get nmon to run other commands that you can then add to the nmon data file for analysis. A typical example is to collect DB2 or Oracle stats to compare against nmon data. You can run a command when:
    • nmon starts using the shell variable NMON_START
    • nmon ends using the shell variable NMON_END
    • each snap shot using the shell variable NMON_SNAP
    • a subset of snap shots using the shell variable NMON_ONE_IN
      • This is controlled by shell variables set before you run nmon. The separate file that the data collectors generate is merged into the nmon file before analysis with the cat command. You don't need to have all of these - i.e. could do start + end or just the snap shots or - a special start-up plus snap shots. This is a bit complex so here is a worked example.
    • First set the TIMESTAMP shell variable:
    • if TIMESTAMP = 0, then lines will have the classic nmon Tnnnn timestamps at the start of the line and work well with the nmon data file
    • if TIMESTAMP = 1, then lines will have a timestamp that has the hours, minutes, seconds and day, month, year - this can be used if you don't want to merge the data with the nmon file for analysis.
  • Setting the shell variables
  • export TIMESTAMP=0
    export NMON_START="mystart"
    export NMON_SNAP="mysnap"
    export NMON_END="myend"
    export NMON_ONE_IN=1        # 1 is the default
    
  • We set the above shell variables, so they refer to a program or shell script
  • If the mystart, myend, mysnap contain the following shell scripts
    • mystart
    • ps -ef >start_ps.txt
      echo "PROCCOUNT,Process Count, Procs" >ps.csv
      
    • mysnap
    • echo PROCCOUNT,$1,`ps -ef | wc -l` >>ps.csv
      
    • myend
    • ps -ef >end_ps.txt
      
  • Now run nmon as normal, for example: nmon -f -s 2 -c 10
  • At the end of the capture, the ps.csv file might contain (for example):
    • PROCCOUNT,T0001,56
      PROCCOUNT,T0002,58
      PROCCOUNT,T0003,67
      PROCCOUNT,T0004,65
      PROCCOUNT,T0005,71
      PROCCOUNT,T0006,68
      PROCCOUNT,T0007,66
      PROCCOUNT,T0008,58
      PROCCOUNT,T0009,57
      PROCCOUNT,T0010,60
      
  • The start_ps.txt and end_ps.txt files would have a list of running processes at the time. The ps.csv file can be merged with the nmon output file (below called this_050607_0916.nmon, yes my machine is called "this") after nmon finishes with the following command:
    • cat this_050607_0916.nmon ps.csv >combined.csv
  • Then run the nmon Analyser on the combined file - if you are lucky, the analyser may draw you a graph. Here is what was produced:
  • Hints:
    • comma separate the data and don't go over 2K bytes in line length
    • make the important data in the first couple of columns.
    • keep the stats in the same range - i.e. all KB/s or all percentages
  • If you set the NMON_ONE_IN variable you can also run the NMON_SNAP command less often!!
  • By default this is set to 1 - run it every time - but if the command you want to capture is heavy in CPU terms or takes a long elapsed time to finish, you can run it less often. For example, to run it in just one in ten snapshots: export NMON_ONE_IN=10

Another nmon user wanted to be able to track the username of processes that are using a lot of CPU time. This is the approach recommended:

  • Briefly in pseudo code and commands:
    NMON_START would create the empty file.
        rm -f /tmp/nmon_proc_user; 
        touch /tmp/nmon_proc_user
    
    NMON_SNAP would append ps output to a log file: 
        # This ps command outputs lines like
        #  PID USER
        # 2122 root
        # 2143 root
        # 2175 root
        # 2224 nag
        # 2226 nag
        ps -Ao pid,user >>/tmp/nmon_proc_user
    
    NMON_END would sort the file and remove duplicates then you have a map of PID to Username
        sort -n /tmp/nmon_proc_user | uniq | awk '{ print "BBBU," $1 "," $2 }'
    
  • You could also look at man ps and select any further columns you fancy.
  • It is assumed we are letting the nmon capture run to completion and do post processing.
  • The data could be appended at the end of the nmon file - perhaps making lines start with (say) BBBU so it's treated as configuration data for a look up feature.
  • A note of warning: running ps takes CPU time, but it's better than, say, opening all the /proc/PID/status files, grep-ing out the Pid and Uid lines and then converting the User ID to a User name. But don't go running this ps command every second on a machine with 1000's of processes or extremely low memory, as it could take a whole CPU out and fail to complete in under a second. If you are capturing, say, once a minute or at a slower rate the ps command should not damage performance.

Question 58: How to collect RDBMS Oracle Transaction Counters - an External Data Collectors Example?

  • Here is another example collecting transaction commits and rollback statistics from the Oracle database using two scripts called oraclestart and oraclesnap that run an SQL statement and save the data in a file called dbstats.csv:
  • oraclestart
  • echo "DATABASE,Oracle Transactions,commit,rollback" >dbstats.csv
    
  • oraclesnap
  • export ORACLE_SID=MYDATABASE
    ( sqlplus -s "system/manager as sysdba" <<EOF
    set heading off
    set headsep off
    set echo off
    set lines 2000
    set feedback off
    set newpage none
    set recsep off
    select 'DATABASE,$1,'||
    	sum(decode(name, 'user commits', value, 0))||','||
    	sum(decode(name, 'user rollbacks', value, 0))
    	from
    		sys.v_\$sysstat;
    EOF
    ) >> dbstats.csv
    
  • Setting up the shell variables
  • export TIMESTAMP=0
    export NMON_START="oraclestart"
    export NMON_SNAP="oraclesnap"
    unset NMON_END
    
  • Now run nmon
  • You need to ensure the ORACLE_SID and usernames and password work in your environment. Do this by running the command manually with: oraclesnap T9999
  • And checking the results in the file dbstats.csv
  • This should put one line in the file dbstats.csv. This script has to log on to the Oracle database each time it runs, so you should not be doing this every second as it will take elapsed time and CPU resources. But if you are collecting nmon data once a minute or more this overhead should be small.
  • Thanks to Ralf Schmidt-Dannert of the IBM SAP and Oracle Solutions team in Minneapolis, USA for this example.
  • One Caveat on External Data Collectors
    • The "T" or "t" as the first letter of the second column is used by tools to recognise the difference between new header lines of new data sections and the data lines (i.e. those containing the timestamp values for example, T0000, T0001, etc.) So do not use a header line like "PROCCOUNT,The Process Count, Procs" - the "T" in "The" will cause problems.
  • Also see my AIXpert Blog article - nmon_and_External_Data_Collectors
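
The caveat above about the second column is easy to check before a capture starts. This is a sketch only: check_header is a hypothetical helper that tests a header line for the one problem described in the caveat and nothing else.

```shell
# Return non-zero if the second comma separated column of a header line
# starts with "T" or "t", which tools would mistake for a data line.
# Only use this on header lines: data lines are meant to have a T number.
check_header() {
    second=$(printf '%s\n' "$1" | cut -d, -f2)
    case "$second" in
        [Tt]*) echo "bad header, second column starts with T: $1"
               return 1 ;;
        *)     return 0 ;;
    esac
}
```

For example, check_header "PROCCOUNT,Process Count, Procs" passes, while check_header "PROCCOUNT,The Process Count, Procs" is rejected.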

Question 59: How to use the AIX Workload Manager Statistics?

  • This is an AIX feature. Press W (upper-case) to see the Work Load Management statistics. Note: AIX 433 does not support the gathering of WLM stats. Work Load Management is a major benefit of AIX and at no charge too. I have written a white paper on this, find it at: http://www.ibm.com/developerworks/aix/]/au-Practical_WLM.html If you use passive mode you can use WLM to find out which applications are taking the CPU, RAM and IO resources of the machine with zero overhead. I tested WLM and could not detect WLM taking any resources at all, or at least it was below 0.25% of one CPU. nmon outputs:
  • actual resource use percentage per class
  • desired percent AIX sets as a target based on active class shares and limits. These are worth watching as for example classes without processes get zero targets. See the Junk class in the example below.
  • share values (-1 means it is not set)
  • number of processes per class (try for zero in Default class)
  • class Inheritance and Shared Memory flags
  • Is there missing data you need? Remember things like min, hard and soft limits exist for CPU, RAM and Block I/O and for each class, so there are limits to what we can output on the screen. The -S option allows you to see sub-classes, but if you have lots they may not fit on the screen or may overrun the line length limit that Excel imposes on the captured data file.
  • The nmon file capture records the full WLM details once (at the start) in the BBBP section but then only the actual resources used to reduce output. Online the output looks like this:
    • Online WLM example
    • Work Load Manager CPU MEM BIO  CPU MEM IO  CPU   MEM   BIO     Tier Inheritance
      Class Name       |---Used----||--Desired-||----Shares-----|Proc's T I Localshm
      Unclassified       0%  0%  0% 100 100 100    -1    -1    -1     1 0 0 0
      Unmanaged          0% 11%  0% 100  99 100    -1    -1    -1     1 0 0 0
      Default            0% 29%  0% 100  98 100    -1    -1    -1    34 0 0 0
      Shared             0% 21%  0% 100  98 100    -1    -1    -1     0 0 0 0
      System             0% 50%  0% 100  99 100    50    -1    -1    80 0 0 0
      database          72%  0%  0%  75 100 100   300    -1    -1     9 0 1 0
      batch             26%  0%  0%  25 100 100   100    -1    -1     4 0 1 0
      junk               0%  0%  0% 100 100 100   400    -1    -1     0 0 0 0
      

Question 60: How to change the Top Processes Minimum CPU Threshold?

  • nmon will not save to file any process using less than 0.1% of a CPU. This is to reduce the file output to useful information. But 0.1% of the fastest CPU is now quite a lot of CPU power, so the threshold is now changeable using the -I option. This was requested by an nmon user as a useful idea. So add the following option when you start nmon:
    • -I <percent>
  • This sets the Ignore Process Percent threshold (default 0.1) i.e. don't save TOP stats if proc using less CPU than this percentage. Example:
    • nmon -f -s 10 -c 300 -I 0.01
  • This will mean a lot more top processes statistics will be gathered.

Question 61: How to start nmon file collection with cron?

  • The nmon default capture-to-file filenames have been carefully chosen. If you save the output of many machines and captures in one directory and list the directory, you will have the files ordered first by machine hostname and then by date and time. This is a sensible ordering. Many people have written scripts to start nmon via cron and many of the scripts are a complete waste of time or even wrong. One feature that was added to nmon to make this easy is the -m flag, which makes nmon move to a particular directory before saving data.
  • So here is what I put in my crontab (use crontab -e to add tasks to your crontab file). This starts a collection once a day at midnight in the directory /home/nmon_data, taking a snapshot every 5 minutes for 288 snapshots (24 hours), which makes an excellent graph detail level. It also collects top processes and user command lines (T), NFS stats (N), Workload Manager but no Subclasses (W), Large page stats (L) and Asynchronous I/O details (A). The reporting threshold is 0.001 percent of a CPU.
    • cron entry example
    • 0 0 * * * /usr/lbin/nmon_aix53 -fTNWLA -I 0.001 -s 300 -c 288 -m /home/nmon_data
      
  • There is no need for any shell scripts to start this collection.
  • Note: if you start two nmon processes at the same time they will have the same filename. So if you want to, for example, collect detailed and summary stats, start them at least one minute apart. So if I also wanted hourly statistics with less top process detail, a second crontab entry might be:
    • cron entry example
    • 2 0 * * * /usr/lbin/nmon_aix53 -ftNWLA -s 3600 -c 24 -m /home/nmon_data
      
  • Also note that only one of f, F, z, x or X should be used and it should be the first argument. You have been warned as not following this can cause confusion.
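
The -s and -c values in the two crontab entries above are chosen so that each capture lasts exactly 24 hours. The following is a small sketch of that arithmetic; capture_seconds is a hypothetical helper, not an nmon feature.

```shell
#!/bin/sh
# capture_seconds: hypothetical helper that multiplies the -s interval
# (seconds) by the -c snapshot count to give the capture length in seconds.
capture_seconds() {
    echo $(( $1 * $2 ))
}

day=86400   # seconds in 24 hours

# The two crontab entries above use -s 300 -c 288 and -s 3600 -c 24.
for pair in "300 288" "3600 24"; do
    set -- $pair
    total=$(capture_seconds "$1" "$2")
    if [ "$total" -eq "$day" ]; then
        echo "-s $1 -c $2 covers exactly one day"
    else
        echo "-s $1 -c $2 covers $total seconds, not one day"
    fi
done
```

If you change the interval, adjust the count so the product stays at 86400, otherwise a daily cron start will leave gaps or overlapping captures.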

Question 62: Can I reset the peak counters for disks, network, AIO (AIX only) and CPU graphs online?

  • The Network stats, Disk stats (not graphs; hit D, upper-case d) and AIO statistics track the peak values and display them. The CPU graphs also provide a peak indicator.
  • These can all be reset to zero by typing 0 (zero).

Question 63: How do I use User Defined Disk Groups to monitor large numbers of disks in ESS disk ranks?

  • On a recent benchmark with 3 x ESS = 1024 disks it became impossible to monitor them to ensure balanced I/O loading. So this was developed. The idea is to merge the disks into sets and monitor the sets. It is like the adapter stats but you get to choose which disks go into which set (adapter). Three obvious ways of doing this are by:
    • disk use = group disks that have common data, for example a database's data, index, sort, logs, archive = 5 disk groups
    • disk placement = the disks in a particular rack/drawer for example ESS, cluster, rank, loop - makes 8 groups per ESS
    • disk type or volume group/logical volume
    • Or anything else you think up.
  • To set this up create a file with:
    • one line per disk group
    • starting with the name of the group
    • then a list of hdisks
    • all space separated
  • Then start nmon with the following option: -g filename
    • If online hit: g
    • If saving to a file there will be more sections for diskgroups = DGxxxx. The nmon analyser understands these new sections thanks to Stephen Atkins its developer.
  • Here are a few examples:
    • For my ESS placement disk groups I used the following script (this assumes you have the lsess command installed):
    • Creating the ESS disk group file example
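    • FILE1=/tmp/lsess_array.tmp1    # raw lsess output
      FILE2=/tmp/lsess_array.tmp2    # unique identifiers cut from that output
      lsess >$FILE1
      # Take field 3 of every ready hdisk line and keep bytes 4-8 as the identifier
      grep hdisk $FILE1 | grep -v "not ready" | awk '{ print $3 }' | cut -b 4-8 | sort | uniq >$FILE2
      for j in `cat $FILE2`
      do
      	# One disk group per array number
      	for i in 1100 1101 1300 1301 1500 1501 1700 1701 1000 1001 1200 1201 1400 1401 1600 1601
      	do
      		# Group name first, then all matching hdisks on the same line
      		echo "ESS${j}_${i} \c"
      		grep hdisk $FILE1 | grep $j | grep ${i} | awk '{ printf " %s", $1 }'
      		echo
      	done
      done
      rm $FILE1 $FILE2
      exit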
      
    • and generated the following disk group file:
    • Generated file:
      
      array_1100  hdisk44 hdisk45 hdisk46 hdisk47 hdisk48 hdisk49 hdisk50 hdisk51
      array_1101  hdisk52 hdisk53 hdisk54 hdisk55 hdisk56 hdisk57 hdisk58 hdisk59
      array_1300  hdisk60 hdisk61 hdisk62 hdisk63 hdisk64 hdisk65 hdisk66 hdisk67
      array_1301  hdisk68 hdisk69 hdisk70 hdisk71 hdisk72 hdisk73 hdisk74 hdisk75
      		... etc.
      

Question 63: Is sharing an nmon data capture file a possible security risk?

  • Briefly, the data that might be a risk:
    • nmon filename includes the hostname
    • network config - IP Address and hostname
    • File system mount points can include the names of products in use, like the RDBMS
    • System serial numbers
  • The paranoid could remove or change these with a script.
  • There is nothing of major risk here.
  • See my AIXpert Blog article for the full information - nmon data files are they a security risk

Question 64: How to determine optimal memory size for a VM from nmon data?

  • Briefly, this is very hard to determine.
    • Completely unused memory can be "harvested".
    • You may or may not be able to reassign file-system cache memory
    • But if you are short on memory there can be latent demand that can't be predicted.
  • See my AIXpert Blog article for the full information - How to determine optimal memory size for a VM from nmon data

Question 65: Please explain the TOP Process Memory stats?

Before answering, I am going to assume you are aware that there is no single number that tells you everything about the memory of a process. This is because of many complications: programs share program code memory (one read-only copy for all processes running the same program) and partially share data (on a fork() the memory is shared with a Copy-On-Write flag, so separate copies are made only if a page is written to); some of the program can be paged to/from disk or paged in from file systems; and some of it may not exist in memory unless it is updated (static data in the program file).

TOP process stats (switched on with -t or -T) have a header line describing the columns like this for Linux

  • TOP,+PID,Time,Usr,%Sys,Size,ResSet,ResText,ResData,ShdLib,MinorFault,MajorFault,Command,Threads,IOwaitTime

and like this for AIX

  • TOP,+PID,Time,Usr,RAM,Paging,Command,WLMclass

Size, ResSet, ResText, ResData are the Memory stats

  • Size is the program size as found on the file system file from which you start the program - this is fixed.
  • ResSet is the resident set size - this is the memory of the process running the program - it changes as the program runs (typically growing but it can shrink) and it is partly shared across other processes running the same program.
  • ResText is the resident set size of the code of the program (this is read-only so highly shared).
  • ResData is the resident set size of the data of the program (this is mostly read-write and can be shared, but on the first write to a memory page a copy is made for that particular process).
  • To make life complex, a typical C program will be linked to shared libraries (the most common is the C library, which supports the C library functions and system calls). Typically there is a minimum of around eight libraries but it could be 50+. Each library can have read-only code, read-only static data plus read-write memory which is partly shared.
  • Now add in memory-mapped files and you start to see it's complicated.

If you want one number for the memory size of a process then use (ResText + ResData) but note some of that memory is shared between processes.

nmonchart in its TOP Process bubble chart reports the maximum value found in all the memory sizes reported for a particular process, i.e. ResText + ResData.

  • Note nmonchart assumes all the processes with the same name are the same. So for, say, Apache, the processes with the name "httpd" are combined as follows:
    • For CPU: all the CPU time is added together
    • For memory it takes the highest value of ResText + ResData across all the processes
    • For IO: all the I/O is added together
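
The memory rule described above can be tried out on TOP lines from an nmon file. The following is a sketch of the same aggregation and not nmonchart's own code: top_mem_max is a hypothetical helper, the column positions are looked up by name from the TOP header line, and the sample data lines are invented for illustration.

```shell
#!/bin/sh
# top_mem_max: hypothetical helper, not part of nmon or nmonchart.
# Reads nmon file lines on standard input and prints, for each command
# name, the highest ResText + ResData seen on any TOP line.
top_mem_max() {
    awk -F, '
        # The header line names the columns; remember where each one is.
        $1 == "TOP" && $2 == "+PID" {
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        # Data lines: only processed once the header has been seen.
        $1 == "TOP" && ("ResText" in col) && ("ResData" in col) && ("Command" in col) {
            mem = $(col["ResText"]) + $(col["ResData"])
            cmd = $(col["Command"])
            if (!(cmd in max) || mem > max[cmd]) max[cmd] = mem
        }
        END {
            for (cmd in max) print cmd, max[cmd]
        }
    ' | sort
}

# Invented sample: two httpd processes over two snapshots and one sshd.
top_mem_max <<'EOF'
TOP,+PID,Time,Usr,%Sys,Size,ResSet,ResText,ResData,ShdLib,MinorFault,MajorFault,Command,Threads,IOwaitTime
TOP,0001234,T0001,1.0,0.5,5000,3000,100,2000,0,0,0,httpd,1,0
TOP,0001235,T0001,2.0,0.5,5000,3500,100,2600,0,0,0,httpd,1,0
TOP,0002000,T0001,0.2,0.1,900,400,30,300,0,0,0,sshd,1,0
TOP,0001234,T0002,1.0,0.5,5000,3000,100,2400,0,0,0,httpd,1,0
EOF
```

Looking the columns up by name rather than by position means the same function should cope with header layouts that differ between nmon versions, as long as the ResText, ResData and Command names are present.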

Question 66: What are User Defined Disk Groups for?

Here are a few good use cases for this nmon feature, which is covered in more detail in the following three questions:

  1. Servers with 100's or 1000's of disks are very difficult to monitor on screen.
    • Unless you have a screen that can display 100's of lines!
    • You can reduce the on-screen disk graphs to a tiny font size and use a modern HD screen, but there are limits.
  2. Servers with 100's or 1000's of disks are very difficult to graph later.
    • One extreme case with 4000 disks produced a black oblong because there were so many lines.
    • They complained that they could not see the details so the disks were unmanageable. They are correct - the problem was their default LUN size on the Fibre-Channel disks was ridiculously small but this is not nmon's fault. It was down to set-in-stone, out-of-date systems management practices.
  3. With many disks with the same data it is useful to group the disks together and then see the total I/O to that group of disks.
    • For example: the disks that make up a RDBMS data, RDBMS index and RDBMS logs - each should have different I/O characteristics in RW ratio, and block sizes.
    • For example: the disks used for backup, batch processing or background tasks like data arriving to be loaded in to a database - will be busy at different times.
  4. On AIX hdiskN and on Linux sdX are not helpful names while monitoring - changing the name to something meaningful aids comprehension
    • For example: rootvg, paging, webpages or rdbms_log immediately lets you know the data on the disk(s).

This feature is covered in the nmon -h output as follows

  •         -g <filename> User decided Disk Groups
                          - file = on each line: group_name <hdisk_list> space separated
                          - like: rootvg hdisk0 hdisk1 hdisk2
                          - upto 32 groups hdisks can appear more than once
    
    
  • I think this limit is actually 64 disk groups and each can have 512 disks but don't quote me on that - I don't have the source code any longer.
  • If you want the name to appear nicely on-screen then keep the name below 12 characters.

Question 67: Using User Defined Disk Groups with nmon for AIX?

  • To use User Defined Disk Groups you need to prepare a diskgroup to disks mapping file.
  • This is not complicated or hard to do - if you know your machine
  • This small text file has a simple format:
    • At the start of the line is the diskgroupname, followed by a list of hdiskN, all separated by space characters.
    • For example: a file called diskgroups
      rootvg hdisk5 hdisk5
      backup hdisk7
      
  • You can call your file anything you like but here we call it diskgroups

Then when you start nmon add at the end:

  • nmon -f . . .  -g diskgroups
    
  • If collecting data to a file you will find new data in the file on lines starting DG
    • The nmon Analyser and nmonchart both know what to do with this data.
  • If monitoring on-screen, to get the Disk Groups displayed, hit: g
  • +--Disk-Group-I/O-------------------------------------------------------------------+
    |Name          Disks AvgBusy Read|Write-KB/s  TotalMB/s   xfers/s BlockSizeKB       |
    |rootvg             2   0.0%       0.0|0.0          0.0       0.0    0.0            |
    |backup             1  99.5%   71687.5|71677.5    140.0     844.0  169.9            |
    |Groups= 2 TOTALS   3  33.2%   71687.5|71677.5    140.0     844.0                   |
    +-----------------------------------------------------------------------------------+
    
  • You can immediately see there is a backup running, rather than just seeing that one disk is inexplicably very busy and needs further investigation.
  • Note you could make this more complex and even have a disk in more than one group line - a disk might be in the RDBMS Disk Group but also in the RDBMS_Logs disk group.
  • Here is a sample I used on a benchmark:
    root hdisk0 hdisk1
    home hdisk2 hdisk3
    apps hdisk4 hdisk5 hdisk6
    data hdisk7 hdisk8 hdisk9 hdisk10 hdisk11 hdisk12 hdisk13 hdisk14
    index hdisk15 hdisk16 hdisk17 hdisk18 hdisk19 hdisk20 hdisk21 hdisk22
    archive hdisk23 hdisk24 hdisk25
    sort hdisk26 hdisk27 hdisk28 hdisk29 hdisk30
    logs hdisk31 hdisk32
    others hdisk33 hdisk34
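
A disk group file like the ones above can be sanity-checked before it is handed to nmon with -g. The following is a minimal sketch: check_diskgroups is a hypothetical helper (not part of nmon), and the limits it applies (64 groups, 512 disks per group, names shorter than 12 characters for tidy on-screen output) are simply the figures quoted in this FAQ, not values read from nmon itself.

```shell
#!/bin/sh
# check_diskgroups: hypothetical helper, not part of nmon.
# Reads a disk group file on standard input, prints any problems found
# and exits non-zero if a line breaks the format or a quoted limit.
check_diskgroups() {
    awk '
        BEGIN { bad = 0; groups = 0 }
        NF == 0 { next }    # blank lines are skipped here (an assumption of this sketch)
        { groups++ }
        NF < 2       { printf "line %d: group %s has no disks\n", NR, $1; bad = 1 }
        NF - 1 > 512 { printf "line %d: group %s has more than 512 disks\n", NR, $1; bad = 1 }
        # A long name is only cosmetic, so warn without failing.
        length($1) >= 12 { printf "line %d: warning: name %s may not fit on-screen\n", NR, $1 }
        END {
            if (groups > 64) { print "more than 64 groups"; bad = 1 }
            exit bad
        }
    '
}

# Example run on a two-group file.
check_diskgroups <<'EOF'
rootvg hdisk0 hdisk1
backup hdisk7
EOF
echo "exit status: $?"
```

A disk that appears in more than one group is deliberately not reported, because the text above says that is allowed.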
    

Question 68: Using User Defined Disk Groups with nmon for Linux?

  • The above applies to nmon on Linux too
  • The Linux equivalent of the Question 67 AIX example file could be
    Linux-OS sda sdb
    backup sdd sde
    
  • The nmon limits of 64 groups and 512 disks are definitely correct as the code is open source - I checked.
  • Of course, the disks are named very differently - typically sda, sdb, sdc etc. (not hdisk0, hdisk1 etc that we find in AIX).
  • However, Linux comes with its own problems. One is historic but still true for PC size machines.
    • This is that the disks are often partitioned - for example /dev/sda1 is the first (1) disk partition on the first disk (a).
    • But that is not the problem. In the /proc file-system (where nmon gets the disk stats) there are multiple problems:
    1. In the past there were different files in different formats, which made developing nmon unnecessarily hard work.
      • Fortunately all current Linux distributions have moved to /proc/diskstats
    2. We still have one problem: the disk I/O is reported against a partition (like sda1) AND the disk (like sda) which results in duplication.
  • If you want to see your disks + partitions, a good command is lsblk (I think this means list block devices).
  • The result is nmon reports double or even triple the disk I/O stats.
  • Why don't we just code around this problem?
  • nmon rule: Don't hide (or remove) data, because if the code is wrong or the data format changes in the future that would drive the nmon user mad!!!
  • I should add: or some device driver developer uses truly bonkers disk names.
  • If it was just the case of sda and sda1 duplication we could code in ignoring the sda1, but I have seen disks with all sorts of names, disk partition naming conventions as letters and numbers or mixed, in different orders and names including punctuation like &*%$"!?@~
  • That makes removing the duplicates very error prone and, as the nmon developer, I don't want to take the blame for some massive performance problem that I made invisible by removing the wrong data.
  • It's better to live with duplicates that can be explained, as nmon is just the messenger, and if you don't like it talk to the Linux Kernel developers.
  • However, I relented in the end.
    • With the new lsblk command, which defines very clearly what is a disk and what is a partition, we have off-loaded the decision.
    • By using the User Defined Disk Groups feature we can define the included disks and what is ignored.
  • To switch on this partition-ignoring (and thus deduplicating) feature, use the -g auto option
    • Before it really starts, nmon takes the output of the lsblk command, ignores partitions, generates a User Defined Disk Groups file called auto and uses that for the disk groups.
    • The actual command run depends on the Linux version:
    1. lsblk --nodeps --output NAME --noheadings
    2. lsblk --nodeps --output NAME,TYPE --raw
    • Here is some example output
      $ lsblk
      NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
      sda      8:0    1 204.8G  0 disk
      +-sda1   8:1    1 172.8G  0 part /
      +-sda2   8:2    1     1K  0 part
      \-sda5   8:5    1    32G  0 part [SWAP]
      sr0     11:0    1  1024M  0 rom
      
      $ lsblk --nodeps --output NAME,TYPE --raw
      NAME TYPE
      sda disk
      sr0 rom
      
      $ lsblk --nodeps --output NAME --noheadings
      sda
      sr0
      $
      
    • Note: It will leave the file auto in the current directory. If you start nmon again with an auto file present, it will overwrite the current file.
    • Note: If you have User Defined Disk Groups switched on then you can also switch on the Extended Disk stats with -D - see the next question.
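
If you would rather build the disk group file yourself than rely on -g auto, the lsblk output shown above is easy to convert. The following is a sketch under stated assumptions: disks_to_groups is a hypothetical helper, the layout of the file that nmon itself writes is not shown in this FAQ, so one group per whole disk named after that disk is my assumption, and keeping only the devices with TYPE disk may differ from what nmon does.

```shell
#!/bin/sh
# disks_to_groups: hypothetical helper, not part of nmon.
# Reads the output of "lsblk --nodeps --output NAME,TYPE --raw" on
# standard input and prints one disk group line per whole disk, using
# the disk name as the group name. The header line (NR == 1) is skipped.
disks_to_groups() {
    awk 'NR > 1 && $2 == "disk" { print $1, $1 }'
}

# Using the sample lsblk output from the text above:
disks_to_groups <<'EOF'
NAME TYPE
sda disk
sr0 rom
EOF

# On a real machine the equivalent would be something like:
#   lsblk --nodeps --output NAME,TYPE --raw | disks_to_groups > mygroups
#   nmon -f -s 10 -c 600 -g mygroups
```

Because partitions never reach the group file, each disk's I/O should be counted once in the disk group stats. The file name mygroups is just an example.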

Question 69: How do I get more disk stats because I can never get enough of these?

  • A decade ago I realised that there are
    1. far too many stats available for disks - this is due to the "old school" thinking that it's always the disks causing problems
    2. far too many disks - I have samples of servers with 1000's of disks (pretty dumb IMHO as you can't graph them all)
  • This could result in nmon becoming a disk stat collection tool with minor extra data covering CPU and memory on the side!!
  • We have
    • DISKBUSY highlights which disks are slowing you down,
    • DISKREAD and DISKWRITE highlights how much data you are shifting,
    • DISKXFER highlights if you are approaching the Disk seek limits and adapter operation limits and
    • DISKBSIZE highlights if your application is doing silly small blocks.
  • I think that is enough to work out most disk problems

nmon for Linux

  • nmon for Linux already has what you ask for but it does come at a price for servers with large numbers of disks in the output file size.
  • You can switch on extended disk stats with (for example): nmon -f -s10 -c 600 -g auto -D
  • It's the -g auto -D that is important.
  • You could use your own User Defined Disk Groups file for the -g option but "auto" generates this for you and strips out the disk partition duplication.
  • This adds for my simple two disk (it is RAID5-ed) server called violet the following stats:
    1. DGBUSY,Disk Group Busy violet,sda,sdb
    2. DGREAD,Disk Group Read KB/s violet,sda,sdb
    3. DGWRITE,Disk Group Write KB/s violet,sda,sdb
    4. DGSIZE,Disk Group Block Size KB violet,sda,sdb
    5. DGXFER,Disk Group Transfers/s violet,sda,sdb
    6. DGREADS,Disk Group read/s violet,sda,sdb
    7. DGREADMERGE,Disk Group merged read/s violet,sda,sdb
    8. DGREADSERV,Disk Group read service time (SUM ms) violet,sda,sdb
    9. DGWRITES,Disk Group write/s violet,sda,sdb
    10. DGWRITEMERGE,Disk Group merged write/s violet,sda,sdb
    11. DGWRITESERV,Disk Group write service time (SUM ms) violet,sda,sdb
    12. DGINFLIGHT,Disk Group in flight IO violet,sda,sdb
    13. DGIOTIME,Disk Group time spent for IO (ms) violet,sda,sdb
    14. DGBACKLOG,Disk Group Backlog time (ms) violet,sda,sdb

I hope the names are clear enough for you to understand the meaning.

nmon for AIX

  • Warning: these flags are different for AIX
    • -D switches off the disk configuration collection at the start of the nmon files. This can be useful when you have 100's of disks as this config collection can take time (10's of seconds to a few minutes) and can hang if you have serious disk problems - in which case, don't go blaming nmon, go and fix your disks. errpt is a good place to start.
    • -d switches on Disk Service time stats
  • For example on my four disk web server running AIX 7.2 I get these extra lines:
    1. DISKSERV,Disk Service Time msec/xfer blue,usbms0,hdisk7,hdisk6,hdisk5,hdisk4
    2. DISKREADSERV,Disk Read Service Time msec/xfer blue,usbms0,hdisk7,hdisk6,hdisk5,hdisk4
    3. DISKWRITESERV,Disk Write Service Time msec/xfer blue,usbms0,hdisk7,hdisk6,hdisk5,hdisk4
    4. DISKWAIT,Disk Wait Queue Time msec/xfer blue,usbms0,hdisk7,hdisk6,hdisk5,hdisk4

Question 70: How to limit top processes to certain commands?

  • If there are lots of processes running but you want to limit your monitoring to just a few commands of particular interest then you can do this in two ways for online and file capture modes. Note these are the program names and don't include the parameters.
  • Method 1: Using shell variables
    • There are 64 shell variables to use and set to the commands you want to monitor. Follow this simple example to monitor just ksh, vi and syncd commands:
    • Setting the commands of interest:
      export NMONCMD0=ksh
      export NMONCMD1=vi
      export NMONCMD2=syncd
      
    • Then start nmon and it will show just these commands on-screen or save just these to the nmon file.
  • Method 2: Using nmon command line options
    • This involves using the -C option:
      nmon -C ksh:vi:syncd
      
    • Up to 64 commands in this list.
  • Notes:
    1. The command name is only checked up to the characters you give, so "or" will match "oracle" and "orifice" = a sort of limited wild cards feature!
    2. If you are new to UNIX then also note that you use the "unset" command to remove this shell variable as in: unset NMONCMD0
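
The prefix rule in note 1 can be surprising, so here is a small illustration of it. This is a sketch of the rule as described above, not nmon's own code, and matches_nmoncmd is a hypothetical helper.

```shell
#!/bin/sh
# matches_nmoncmd: hypothetical illustration of the prefix rule in note 1.
# $1 is a process name; the remaining arguments are the names you would
# put in NMONCMD0, NMONCMD1, ... or in the -C list.
matches_nmoncmd() {
    name=$1
    shift
    for prefix in "$@"; do
        case "$name" in
            "$prefix"*) return 0 ;;   # name starts with this entry
        esac
    done
    return 1
}

# Which of these process names would a list of "or", "ksh" and "vi" select?
for p in oracle orifice ksh vi vim bash; do
    if matches_nmoncmd "$p" or ksh vi; then
        echo "$p: shown"
    else
        echo "$p: filtered out"
    fi
done
```

Under this rule "vi" also selects "vim", so give as many characters as you need to avoid unwanted matches.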

Question 71: How does nmon for AIX extract its data?

  • These comments are based on the original source code handed over to the AIX Performance Tools developer team - the code could have changed since.
  • It is a common mistake to assume nmon simply uses AIX commands to extract the data it needs. This is not true.
    • That would require nmon to fork and exec dozens of commands every second (the fastest nmon will run) and that could easily take a whole CPU in computer time.
  • If saving to a nmon file then nmon does use AIX commands (just once) to collect the machine configuration = the BBBx lines at the start of the nmon file.
  • The bulk of the stats comes from a special C language library that comes with AIX called libperfstat.
    • Making a library call to extract performance stats is something like 1000 times faster = 1/1000th of the CPU cycles.
    • This covers all the basics like CPU, memory, Disk I/O and networks in details plus lots of POWER specific stats for example, LPAR config & stats, WPAR and virtual CPUs etc.
    • For more information:
      • Read on AIX the C header file for the data structures and function call interfaces at /usr/include/libperfstat.h
      • Read the manual in KnowledgeCenter: perftools libperfstat
      • Note: this often gets updated with each AIX release to add new features.
      • This can make compiling a binary to run on many AIX releases impossible.
      • Fortunately, you get a new updated nmon with every AIX release upgrade.
  • There are some other places it gets specific information:
    • Top Process stats are extracted with a getprocs64() system call - see KnowledgeCenter getprocs manual page
    • File system use is extracted with the old classic UNIX system calls setfsent(), getfs() and endfsent()
  • The three exceptions to this are due to there not being a library function to get the data.
    • The three commands are:
      1. fcstat for Fibre Channel adapter stats - Command line option -~
      2. entstat for Ethernet stats used for VIOS SEA stats - command line option -O (upper-case O for Ocean = SEA !!!)
      3. Top Process user arguments i.e. expanded command line settings - command line option -T (-t only saves the command name). If using nmon online to a screen these are requested using u or U.
    • The first two commands are not regular UNIX ones but highly device driver dependent and the only place to get the adapter level stats on things like packets per adapter sent and received.
    • On very busy production machines it is recommended to
      1. Not use these nmon command line options to switch on the collection of these stats
      2. Or not collect the stats too quickly - if you collect them, say, once a minute or less often they will not add significant CPU cycles.
      3. Particular Warning: if you have thousands of processes (I have seen servers with 40,000 processes). In this case, nmon can struggle to collect process data at all.
    • The Top Process command line uses the regular user ps -Aeo pid,args command to gather the process command lines used to start the processes
      • This is because they can be long i.e. multiple KB's in size (especially the insane Java commands) and so they are not held in the UNIX Kernel data structures.
    • So system calls do not return the full command line.
      • The full command line is held within the user process virtual memory and can cause paging of process memory to extract them to return to nmon.
      • Note nmon caches these command lines to reduce CPU overheads.
      • If you use nmon online with AIX and further processes are started hit u or U twice to refresh the user command line cache.
      • If collecting to a nmon file the ps -p PID -o "c," -o thcount -o",G,"s (PID is the Process ID) command is only called once for every new process found that needs to go in the output file. This also collects the parent PID, number of threads and the user name plus user group.

Question 72: How can I see 100's of disks on-screen?

  • This is a common problem and nmon for Linux and AIX has a solution for this with the Disk Busy Map
  • This only works online on a screen.
  • Hit: o (lowercase oh!) for the Map
  • Here each disk gets one single character on the screen.
  • The busier the disk the more pixels are shown by using different characters - this is surprisingly effective.
  • nmon for AIX does 50 disks per line
  • nmon for Linux does 64 disks per line
  • Example output from nmon for AIX:
    Disk-Busy-Map-Key(%): @=90 #=80 X=70 8=60 O=50 0=40 o=30 +=20 -=10 .=5 _=0%
     hdisks numbers->           1         2         3         4
                      01234567890123456789012345678901234567890123456789
     hdisk0 to 49     __X_X_.__X__Oooo+____#_+___---___X_____@___.______
                      _#___@____X_O___.__O__X____--____.__--__@@_.__@___
                      ___++_O__+__O_O_.__._@@@#___#__oOOo____@__________
                      _--_X@@OO_oo+__#.___X_.__O_+_______@ @XoOOO##@0O_-
    
  • Example output from nmon for Linux:
    + Disk %Busy Map --Key: @=90 #=80 X=70 8=60 O=50 0=40 o=30 +=20 -=10 .=5 _=0% ----------------------------------------------+
    |             Disk No.  1         2         3         4         5         6                                                 |
    |Disks=4      0123456789012345678901234567890123456789012345678901234567890123                                              |
    |disk 0 to 63 @o_8O0__XXXXXXXX                                                                                              |
    +---------------------------------------------------------------------------------------------------------------------------+
    
  • Note to self: We could use colour on nmon for Linux to highlight the hotter disks.
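
To read the Disk Busy Map it helps to see how a busy percentage turns into a character. The following is a sketch and not nmon's own code: busy_char is a hypothetical helper, and it assumes each value in the key is a lower bound (so "X=70" covers 70% up to just under 80%).

```shell
#!/bin/sh
# busy_char: hypothetical re-creation of the key shown above, assuming
# each value in the key is the lower bound of its band.
busy_char() {
    awk -v b="$1" 'BEGIN {
        if      (b >= 90) c = "@"
        else if (b >= 80) c = "#"
        else if (b >= 70) c = "X"
        else if (b >= 60) c = "8"
        else if (b >= 50) c = "O"
        else if (b >= 40) c = "0"
        else if (b >= 30) c = "o"
        else if (b >= 20) c = "+"
        else if (b >= 10) c = "-"
        else if (b >= 5)  c = "."
        else              c = "_"
        print c
    }'
}

# Build one map line from a list of invented busy percentages, one per disk.
line=""
for busy in 0 3 72 95 55 12; do
    line="$line$(busy_char "$busy")"
done
echo "$line"
```

Each disk costs one character, which is how 50 or 64 disks fit on a single line.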

Question 73: On-screen displaying only busy Top Processes and Hot disks?

  • If you want to NOT see on-screen disks that are zero busy - OR - Top Processes using zero CPU time then you need to use the dot command i.e. "."
  • If currently displayed online then this dot toggles both Top Processes and Disk Graph stats at the same time.
  • In this example a Linux server has 5 disks and 300+ processes but below they are reduced to just the busy ones:
    | Disk I/O --/proc/diskstats-----mostly in KB/s------Warning:contains duplicates--------------------------------------------+
    |DiskName Busy  Read WriteMB|0          |25         |50          |75       100|                                             |
    |sda       65%    0.0   81.8|WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW>                |                                             |
    |sda1      66%    0.0   81.8|WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW>                |                                             |
    |Totals Read-MB/s=0.0      Writes-MB/s=163.6    Transfers/sec=659.2                                                         |
    | Top Processes Procs=301-mode=3-1=Base 3=Perf 4=Size 5=I/O[RootOnly] u=Args------------------------------------------------+
    |  PID        %CPU      Size       Res      Res       Res       Res      Shared   Faults   Faults Command                   |
    |              Used        KB       Set      Text      Data       Lib        KB      Min      Maj                           |
    |    27906    146.5   2436432    931052        12   1894772         0     47408        0        0 compiz                    |
    |     8863    100.0    121848     17172        32      7484         0      9764        0        0 appstreamcli              |
    |    20880    100.0      7292       704        28       324         0       632        0        0 yes                       |
    |     2027     66.8   1145756    857556      2288    841072         0     20884        0        0 Xorg                      |
    |    13590     20.3    709840    132616      3728    345632         0     81244        0        0 update-manager            |
    |    16905     14.4  13616476   6746892      6396  13309284         0     24864       45        0 qemu-system-x86           |
    |    20822      8.4         0         0         0         0         0         0        0        0 kworker/u16:0             |
    |      396      2.5    747412     42428     27716    688312         0     24172        0        0 docker                    |
    |       51      1.0         0         0         0         0         0         0        0        0 ksmd                      |
    |     2210      1.0         0         0         0         0         0         0        0        0 kworker/4:1H              |
    |    20845      1.0     20168      6308       152      6420         0      2152        0        0 nmon_x86_ubuntu           |
    |        7      0.5         0         0         0         0         0         0        0        0 rcu_sched                 |
    |    20811      0.5         0         0         0         0         0         0        0        0 kworker/u16:3             |
    |    27851      0.5    528156     30556       288    301184         0     23616        0        0 bamfdaemon                |
    +---------------------------------------------------------------------------------------------------------------------------+
    

Question 74: Do not use kill -9 on nmon as kill -USR2 will end it cleanly!

  • Using the kill -9 PID command on nmon to instantly stop it is a thoroughly unpleasant thing to do because if the last line being output is not complete then you have just broken the file format and later this file might fail in a graphing tool in an ugly way.
  • One case where you do want to stop nmon quickly (before its natural end) is in benchmarks, where once the benchmark run is finished you want to stop nmon as any further details are not required.
  • nmon in file capture mode detaches itself from the shell session so that it will continue to run even if you log out or switch off your terminal or X Windows session.
  • This can make it hard to kill, as you have to search the "ps -ef | grep nmon | grep -v grep" command output to find the nmon process, and if there is more than one you have to guess.
  • If you add the -p option to the nmon start command, it will return the process id of the nmon process before going into the background. For example:
    $ nmon -f -s60 -c 60 -p
    428963
    
  • The 428963 is the PID.
  • To shut down nmon cleanly, use the PID with the polite kill signal USR2 (instead of -9).
  • This requests nmon to stop after the next collection and thus avoids the last line of output being incomplete.
  • So in this example use:
    $ kill -USR2 428963
    
  • You can save the Process Id (pid) easily in a script, for example:
    pid=$(nmon -f -s 60 -c 60 -p)
    echo $pid
    . . .
    . . .
    . . .
    # Later in the script you can just use $pid
    kill -USR2 $pid
    . . .
    . . .
    . . .
    # Alternatively, you could save the pid to a file to pick it up later so your script can finish.
    echo $pid >nmon_pid
    . . . 
    . . .
    . . .
    # Then much later read the pid back in from the file and stop nmon
    killpid=$(cat nmon_pid)
    kill -USR2 $killpid
    
  • Either way you get a nice clean stop of nmon.
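
Because nmon only stops after the next collection, a script that wants to copy or graph the file straight away may need to wait until the process has really gone. The following is a minimal sketch of that idea, not something shipped with nmon: the function name stop_nmon, the 120 second default timeout and the return codes are all made up for illustration. It uses only the standard kill and sleep commands.

```shell
# Hypothetical helper (not part of nmon): ask nmon to stop with USR2,
# then wait until the process has actually exited.
# Usage: stop_nmon PID [TIMEOUT_SECONDS]
stop_nmon() {
    pid=$1
    timeout=${2:-120}   # assumed default; choose something larger than your -s interval

    # Send the polite signal; give up if there is no such process.
    kill -USR2 "$pid" 2>/dev/null || return 1

    waited=0
    # kill -0 sends no signal, it only tests whether the process still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            return 2    # still running after the timeout
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Example use with the pid saved earlier in the script:
#   stop_nmon "$pid" 90 && echo "nmon has stopped, the file is complete"
```

A return code of 0 means the process has gone, 1 means the signal could not be sent (for example the pid no longer exists) and 2 means nmon was still running when the timeout ran out. With -s 60 the wait can be up to about a minute, so set the timeout above the capture interval.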

      - - - F r e q u e n t l y  -  A s k e d  -  Q u e s t i o n s  -  E n d  - - -

Page last modified on January 03, 2017, at 04:11 PM