		NTPROF





WHAT IT'S USEFUL FOR:

--------------------

 	NTPROF is a performance profiling tool for the Windows NT operating

	system.  For any process in the system, it gives information

	such as the DLLs being loaded, where the most time is spent in the

	kernel and where the most time is spent in user space.  On Alpha

	platforms, it can also locate areas of a process which are producing

	large numbers of events such as alignment faults, cache misses and

	more.  



WHEN TO USE NTPROF:

------------------

	NTPROF can be an asset in finding software performance problems.

	However, before diving into NTPROF, take the time to look at the

	problem at a higher level using such tools as the NT Performance

	Monitor.  Make sure that the problem is not caused by a factor

	which has nothing to do with software performance, such as

	excessive disk I/O or network activity.



	Once you are reasonably sure there is a software performance

	problem, NTPROF can be invaluable.



	For more information on when to use NTPROF, see triage.doc, which

	is included in triage.zip, www address <TBS>.



WHAT IT DOES:

------------

	NTPROF performs sample-based PC profiling for any process in the

	system.   To do the profiling, NTPROF uses:

	

	    Profile range -  the portion of the target process's address

	    space which is being profiled.  The default profile range is

	    the entire process address space.  Or, the user can specify a

	    portion of the process's address space using the /range option.



	    Buckets and bucket size -  NTPROF divides the profile range

	    into buckets.  Each bucket is assigned a specific address range of

	    the profile range and the number of bytes in the bucket is

	    called the bucket size.



	For time-based profiles, NTPROF samples the program

	counter (PC) at a regular interval, and counts the number of times

	the PC is within the range of each bucket.  If run for a

	sufficiently long time, this will give a statistically accurate

	view of where the process time is spent.
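
	The bucket counting described above can be illustrated with a
	short Python sketch.  This is not NTPROF's implementation; the
	function name bucket_histogram and the sample addresses are made
	up for illustration.  It only shows how sampled PC values map to
	buckets within a profile range.

```python
from collections import Counter

def bucket_histogram(pc_samples, range_start, range_end, bucket_size):
    """Count PC samples per bucket; samples outside the range are ignored."""
    counts = Counter()
    for pc in pc_samples:
        if range_start <= pc <= range_end:
            index = (pc - range_start) // bucket_size
            # Key each bucket by its starting address.
            counts[range_start + index * bucket_size] += 1
    return dict(counts)

# Hypothetical PC samples; the last one falls outside the profile range.
samples = [0x77DCBD04, 0x77DCBD0C, 0x77DCBD10,
           0x77DCBD74, 0x77DCBD78, 0x77DCBD7C, 0x12345678]
hist = bucket_histogram(samples, 0x77DCBD00, 0x77DCBDFF, 16)
for start in sorted(hist):
    print(f"{start:08x}-{start + 15:08x}  {hist[start]} ticks")
```

	With more samples, the per-bucket counts approach the true
	distribution of where the process spends its time.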



	The sampling frequency is time-based by default.  On Alpha platforms, 

	NTPROF can also profile based on the frequency of events such as

	alignment faults.



NTPROF FEATURES:

---------------



	- Available for x86, Alpha, MIPS, and PPC platforms.

	- Both user-mode and kernel-mode profiling are supported.

	- Images do not need to be recompiled or relinked in order for

	  NTPROF to operate.  

	- Both new and currently running processes may be targeted for

 	  profiling.

	- NTPROF can profile multiple processes simultaneously.

	- The user can specify an arbitrary number of profile address

	  ranges and bucket sizes.

	- If the program is linked with symbols, they will be displayed

	  in place of addresses. There is support for both COFF and

	  CodeView symbols and split symbols are handled.

	- On Alpha platforms, code disassembly and instruction-level

	  profiling are possible when a 4-byte bucket size is specified.







HOW TO USE NTPROF:

-----------------

    Starting NTPROF:

	NTPROF and the process being profiled each run in a separate process

	context.  Start the process and then start NTPROF.  Or, start NTPROF

	using the /wait parameter, and then start the process.



    Getting a good profile:

	Be sure to run NTPROF long enough to get a statistically significant

	sampling of the interesting portions of the target process.  It may

	be necessary to put a load on the process during the profiling.  For

	example, for applications with windows, you can move the mouse, click,

	type, etc. in order to get a profile which reflects real use of the

	application.



    Use NTPROF iteratively:

	For general system performance problems,

	start by profiling the entire system (/all).  Choose a process

	which is using an unexpectedly large amount of CPU time.  Profile

	this process with a large bucket size to get a coarse, high level

	view of where the time is being spent.  Once you get a high-level

	view, you can zero in on a range of a process's address space, and

	use smaller bucket sizes to get a finer grained view of where the

	time is being spent.



	For example: 



	Suppose you have chosen the File Manager to profile.  First, use

	NTPROF to profile the entire process's address space:



		ntprof /process:winfile /time:30

	

	By default, NTPROF will use a 4 GB range and a 64 KB bucket size.

	Examine the NTPROF output, and locate the address range(s) with

	the highest event (tick) count(s).  Then, rerun NTPROF against the

	process, this time specifying a range (/range).   Choose a range

	(or ranges) with a suspiciously high tick count.  If the process

	is linked with symbols, add /dlls and /symbols to display them.

  	Note that, if more than one symbol falls within a bucket, NTPROF

	will display "(two symbols)" or "(multiple symbols)".  Use smaller

	bucket sizes to display these symbols.

	 

		ntprof /p:winfile /t:30 /range:77dc0000-77dcffff /dlls /symbols 

					

	When /range is used without a bucket size, as above, NTPROF will use

	a 256 byte bucket size and profile only the specified profile range.

	NTPROF always displays the bucket size at the head of the process's

	portion of the output. 



	For the next iteration, use a smaller bucket size (add @bucket_size at

	the end of the range), again choosing a range with a suspiciously high

	tick count:



		ntprof /p:winfile /t:30 /r:77dcbd00-77dcbdff@16 /dlls /symbols 



	Continue, using a smaller profile range and bucket size each time.

	When your bucket size reaches the instruction size of the platform,

	there is no point in continuing with smaller bucket sizes.  

	On Alpha platforms, when the bucket size reaches 4, use the /dis

	parameter to see the disassembled instructions and set /floor to 0

	so that all instructions in the range are disassembled whether they

	are sampled or not:



		ntprof /p:winfile /t:30 /r:77dcbd70-77dcbd8f@4 /dis /floor:0



	Note that it may not be necessary to iterate all the way down

	to the instruction level.  At any point in the iterative process,

	you may decide you have learned enough to continue your investigation

	elsewhere.  This could mean examining part of the program's sources,

	or continuing with NTSTEP for a trace of part of the program.

	Or you may decide that the particular range you have been investigating

	is an unlikely contributor to the program's performance problem, and

	use NTPROF to examine a different range which is getting a high tick

	count.
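
	One step of the iteration above can be sketched in Python as
	follows.  The helper next_range_option and the tick counts are
	hypothetical; NTPROF does not produce machine-readable output
	in this form, so the bucket data would have to be copied from
	the profile table by hand.  The sketch picks the bucket with the
	most ticks and builds the /range argument for the next run.

```python
def next_range_option(bucket_ticks, bucket_size, next_bucket_size):
    """Return a /range argument covering the bucket with the most ticks.

    bucket_ticks maps each bucket's starting address to its tick count.
    """
    hot_start = max(bucket_ticks, key=bucket_ticks.get)
    hot_end = hot_start + bucket_size - 1
    return f"/range:{hot_start:08x}-{hot_end:08x}@{next_bucket_size}"

# Hypothetical tick counts from a run that used 256-byte buckets.
ticks = {0x77DCBC00: 12, 0x77DCBD00: 340, 0x77DCBE00: 25}
option = next_range_option(ticks, 256, 16)
print(f"ntprof /p:winfile /t:30 {option} /dlls /symbols")
```

	In practice you may want to follow several hot buckets rather
	than only the hottest one, since /range can be given more than
	once.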

	



NTPROF COMMANDS AND OPTIONS: 

---------------------------

   The command syntax for NTPROF is:



		ntprof [options] /process:PID ...



   Following are NTPROF options.  Most options can be abbreviated to one or

   a few characters.



   HELP:

	    /help        Print usage message.





   PROFILE CONTROL OPTIONS:

	    /all         Generate summary profile of all processes.



	    /process:PID Specify target process(es) to profile.



			 Process IDs may be decimal, hex (0x prefix), or process

			 names. Multiple process options may be given.  See

			 example below.



	    /dlls        Profile each image (DLL) of target process(es).



			 Set the environment variable _nt_symbol_path to

			 display DLL symbols.  Use with /symbols.



	    /time:#      Specify profile duration (decimal seconds).



			 Without the /time option, you are prompted for

			 start/end points.



	    /wait[:#]    Wait for creation of target process (then delay

			 # msec).



			 This option allows profiling of processes with

			 relatively short run times.



   OUTPUT OPTIONS:

	    /floor:x#.#  Specify minimum profile bucket value to print. Floor

			 argument is t# or # (ticks), p#.# (percent), or

			 m# (milliseconds).    Default floor value is 1 tick

			 count.



			 This option eliminates most uninteresting buckets,

			 but the first and last will always be printed to

			 indicate the address range used for the profile.



	    /range:#-#@# Specify profile address range (#-#) and bucket size

			 (@#).  Multiple range options may be given.  Bucket

			 size is optional and defaults to 256 bytes.



			 Range beginning and ending addresses are given in hex.

			 Range may instead be a wild-card address.  For

			 example:

			 	75* = 75000000-75ffffff 

			 	75c9* = 75c90000-75c9ffff

			 Bucket sizes are in #[kmg]b format. For example: 64,

			 8kb, 128mb.



	    /full        Synonym for 4 GB range with a 64 KB bucket size.



			 /full is the default if no range is specified.



	    /scale:#     Print simple bar chart with # ticks per character.



			 Provides a crude histogram which can quickly be

			 visually scanned for address ranges with the most

			 events.



	    /verbose     Print profile detail for /all and /dlls options.
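
	The /range address and bucket-size formats described above can
	be made concrete with a small Python parser.  This is a sketch
	only, not NTPROF's own parsing code.  The function names are
	hypothetical, and it assumes that k, m, and g mean powers of
	1024 and that wild-card prefixes are padded to 32-bit addresses,
	as in the 75* and 75c9* examples.

```python
import re

# Assumption: k/m/g are binary multiples (1024-based).
UNIT_BYTES = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}

def parse_bucket_size(text):
    """Parse sizes such as '64', '8kb', or '128mb' into a byte count."""
    match = re.fullmatch(r"(\d+)(?:([kmg])b)?", text.lower())
    if match is None:
        raise ValueError(f"bad bucket size: {text!r}")
    return int(match.group(1)) * UNIT_BYTES[match.group(2) or ""]

def parse_range(text, default_bucket=256):
    """Parse 'start-end[@size]' or 'prefix*[@size]'.

    Returns (start address, end address, bucket size in bytes).
    """
    address_part, _, size_part = text.partition("@")
    if address_part.endswith("*"):
        prefix = address_part[:-1]
        # Pad the hex prefix to 8 digits: zeros for start, f's for end.
        start = int(prefix.ljust(8, "0"), 16)
        end = int(prefix.ljust(8, "f"), 16)
    else:
        low, _, high = address_part.partition("-")
        start, end = int(low, 16), int(high, 16)
    bucket = parse_bucket_size(size_part) if size_part else default_bucket
    return start, end, bucket

for arg in ("75*", "75c9*", "7694*@16", "77dc0000-77dcffff"):
    start, end, bucket = parse_range(arg)
    print(f"{arg:20} -> {start:08x}-{end:08x} bucket={bucket}")
```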







    SYMBOLS DISPLAY OPTIONS:

	    /symbols     Translate addresses to symbols when possible.



			 To get symbols information, the program must be linked

			 with:

				linkdebug = -debug:partial -debugtype:both.

			 Use with /dlls.  See also /kernel.  Because NTPROF 

			 displays one symbol per bucket, be sure to use small

			 bucket sizes with this parameter.



	    /kernel      Display kernel mode symbols.



			 The _nt_symbol_path environment variable must point to

			 kernel symbols.  Use with /symbols and /dlls.



  SPECIALIZED OPTIONS:

	    /affinity:#  Set profile clock processor affinity mask.



			 On a multiprocessor, this parameter restricts profiles

			 to threads running on processor #.

			 Note: /affinity:0 profiles all processors.

 

	    /dis         Print instruction disassembly for ranges with 4 byte

			 buckets.  Alpha platforms only.



			 Use with /floor:0 to see all instructions even

			 if they were not sampled. 



	    /hz:#        Specify profile sample rate (decimal Hz).



			 The default sample rate with which NTPROF samples a

			 process is approximately 4000 Hz (an interval of 250

			 uSec) for all platforms.  The /hz parameter allows

			 the user to vary the sample rate, although it should

			 be noted that it is treated as a recommended sample

			 rate, and that the platform may not be able to honor

			 the request.  NTPROF displays the precise profile

			 frequency in the Profiler Statistics section of the

			 output.



			 Note that the higher the profile frequency, the more

			 system overhead NTPROF induces in the form of

			 interrupts.  For example, interrupt rates > 10,000/sec

			 can severely skew the profile results.



	    /source:#    By default, NTPROF samples on a time basis.  On Alpha

			 platforms only, other event sources can be specified.

			 # = one of the following profile sources:

			    Time (the default)	    TotalNonissues

			    AlignmentFixup	    DcacheMisses

			    TotalIssues		    IcacheMisses

			    PipelineDry		    CacheMisses

			    LoadInstructions	    BranchMispredictions

			    PipelineFrozen	    StoreInstructions

			    BranchInstructions



			 Except for the default Time source, these sources are

			 sampled on a threshold basis. Generally every 4K or

			 64K occurrences of the event, NTPROF increments the

			 bucket for the PC at which the most recent event

			 occurred.

	

			 To list the sources available, type:

				ntprof /sources:?
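
	The arithmetic behind the /hz and /source descriptions above can
	be sketched in Python.  The function names are hypothetical, and
	the event estimate assumes that "4K" means a threshold of 4096
	occurrences per tick; the actual threshold depends on the
	platform and the source.

```python
def sample_interval_usec(hz):
    """Microseconds between samples at the given sample rate."""
    return 1_000_000 / hz

def expected_samples(hz, seconds):
    """Approximate samples taken per processor over a profiling run."""
    return hz * seconds

def estimated_events(ticks, threshold):
    """Rough event count for a threshold-sampled source.

    Each tick stands for about `threshold` occurrences of the event.
    """
    return ticks * threshold

print(sample_interval_usec(4000))    # default rate: 250.0 usec interval
print(expected_samples(4000, 30))    # a 30 second run: 120000 samples
print(estimated_events(10, 4096))    # assumed 4K threshold: 40960 events
```

	The same interval calculation shows why very high rates are
	costly: at 10,000 Hz the system takes a profile interrupt every
	100 microseconds.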









EXAMPLES:

--------

   1. Profile all active processes, and run the profile for 10 seconds.

      Displays two profiles:

	- Any Process Any Mode: all processes.

	- All Initall Process Summary: each process into a single bucket.



	    ntprof /all /time:10



   2. Profile all active processes into one set of buckets.  In

      addition, create individual /full profiles and Image/DLL

      Summaries for each of ntvdm and csrss.  Run the profile for

      30 seconds.



	    ntprof /all /process:ntvdm /process:csrss /dlls /t:30



   3. Profile process ntvdm, address range 76940000-7694ffff, using

      16 byte buckets.   Minimum profile bucket to display = 0.1% of

      elapsed time.  Also, generate an Image/DLL Summary for both

      ntvdm and csrss.

     

	    ntprof /p:ntvdm /r:7694*@16 /p:csrss /dlls /floor:p0.1 /t:30



   4. Profile process csrss, address range 76405000-76405fff, using

      4 byte buckets and display all buckets, regardless of the

      number of ticks in the bucket.  On Alpha platforms, display the

      disassembled instruction in each bucket (instruction size

      on Alpha = 4 bytes).  Profile csrss for 10 seconds.



	    ntprof /p:csrss /r:76405*@4 /floor:0 /dis /t:10



INTERPRETING THE OUTPUT:

-----------------------



    The output from NTPROF is divided into several sections depending on

    the command parameters.  



    Profiler Statistics:

	This section is always present and always first.  It displays

	information about the profiling process itself.



	For time-based profiling:

	    Clock frequencies: Profile clock: displays the sample

		frequency used by NTPROF during the profiling.



	    Samples Accounted For: 

		- Unless you are running on a heavily loaded system, you 

		  should expect this number to be > 98%.

		- On a multiprocessor, expect this number to be the sum of the

		  percentages for all processors.





    Process profiles:

	The Profiler Statistics are followed by a series of tables.

	Each of these profile tables contains the following information:



	Address Range	Address range of the process represented by

			the bucket.



	Ticks		The number of samples for which the PC was within

			the bucket's address range.  Samples are time-based

			by default.  Other sample criteria are specified

			by the /source parameter.



	Millisec	Milliseconds of execution time the PC was within the

			bucket specified by Address Range.  Meaningful for

			time-based profiles only. 



	Elapse %	Percent of total profiling time for which the

			PC was within the bucket specified by Address Range.

			Meaningful for time-based profiles only.



	Target %	Percent of profiled process execution time in which

			the PC was within the bucket specified by Address

			Range.



	Description	Symbol or scale information.  



			Symbols are displayed here when /symbols and 

			/dlls and/or /kernel are used, and the program being

			profiled is linked with symbols.  Since, with larger

			buckets, more than one symbol can fall within 

			a bucket, try smaller and smaller buckets until

			all the desired symbols are displayed.



			When /scale is used, this column contains the bar

			chart for each bucket.
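
	The relationship between the Ticks, Millisec, Elapse %, and
	Target % columns can be approximated with the Python sketch
	below.  The formulas are inferred from the column descriptions
	above, not taken from NTPROF itself, and the function name and
	input values are hypothetical.  It assumes a time-based profile
	in which each tick represents one sample interval.

```python
def bucket_columns(ticks, hz, elapsed_seconds, target_ticks):
    """Derive the time-based columns for one bucket.

    ticks           samples that landed in this bucket
    hz              profile clock frequency from Profiler Statistics
    elapsed_seconds total profiling time
    target_ticks    total samples attributed to the profiled process
    """
    millisec = ticks * 1000 / hz
    elapse_pct = millisec * 100 / (elapsed_seconds * 1000)
    target_pct = ticks * 100 / target_ticks
    return millisec, elapse_pct, target_pct

# Hypothetical bucket: 1200 ticks at 4000 Hz during a 30 second run
# in which the target process received 6000 ticks in total.
ms, elapse, target = bucket_columns(1200, 4000, 30, 6000)
print(f"Millisec={ms:.0f}  Elapse%={elapse:.1f}  Target%={target:.1f}")
```

	A bucket with a small Elapse % but a large Target % belongs to a
	process that ran for little of the profiling time but spent much
	of its own time in that bucket.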







BUGS

----

/dis	Kernel mode disassembly does not work.

	

