dkBusy is a tool for investigating the memory performance of a machine from an application's perspective. I wrote the first version in 2001 to compare local and remote memory access on a Unisys ES/7000 and to show the importance of cache and memory locality on NUMA systems.
The tool uses the "memset" function from the C runtime library to measure memory bandwidth. "memset" is normally implemented as highly optimized machine code for each target architecture.
The tool is available for the x86, x64 and Itanium processor architectures. It has been tested on Windows 7, Windows Server 2008 and Windows Vista. In theory it should also run on earlier versions of Windows NT. The x86 version is built "Large Address Aware", which means it can access 2GB on an x86 system, 3GB on an x86 system booted with the /3GB switch, and almost 4GB on an x64 system. On an x64 system the x86 version can therefore test up to 24 CPUs. Interestingly, even on the same hardware the x86 results are very different from the x64 results, which I attribute to the different "memset" implementations in the x86 and x64 C runtimes.
When the tool is started without parameters, it creates one thread per CPU. Each thread allocates 128MB of memory. All threads then synchronize and start the first test. Each test writes 0xFF to memory using the "memset" function. The buffer size is 8KB for the first test and is increased on each test iteration, so that the drop in memory bandwidth can be seen at the L1, L2 and possibly L3 cache boundaries. The last test uses a buffer size of 128MB, which, at least as of 2008, is not a common L3 cache size and should therefore touch main memory.
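The measurement loop described above can be sketched in portable C. This is a minimal single-threaded illustration, not the dkBusy source: the function names memset_bandwidth_mbs and run_sweep, the doubling of the buffer size between tests, the CSV column names and the use of clock() for timing are all assumptions made for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Call memset through a volatile pointer so the compiler cannot drop
   the repeated writes as dead stores. */
static void *(*volatile memset_fn)(void *, int, size_t) = memset;

/* Hypothetical helper: fill `size` bytes of `buf` with 0xFF over and over
   until roughly `total` bytes have been written. Returns MB/s, or 0.0 if
   the run was too short for clock() to measure. */
double memset_bandwidth_mbs(unsigned char *buf, size_t size, size_t total)
{
    assert(buf != NULL && size > 0);
    size_t passes = total / size;
    if (passes == 0)
        passes = 1;

    clock_t start = clock();
    for (size_t i = 0; i < passes; i++)
        memset_fn(buf, 0xFF, size);
    clock_t stop = clock();

    double seconds = (double)(stop - start) / CLOCKS_PER_SEC;
    if (seconds <= 0.0)
        return 0.0;
    return ((double)passes * (double)size) / (1024.0 * 1024.0) / seconds;
}

/* Sweep buffer sizes from min_size to max_size, doubling each step
   (the doubling is an assumption), and print one CSV row per size.
   Returns the number of sizes tested, or -1 if allocation fails. */
int run_sweep(size_t min_size, size_t max_size, size_t total)
{
    assert(min_size > 0 && min_size <= max_size);
    unsigned char *buf = malloc(max_size);
    if (buf == NULL)
        return -1;

    int steps = 0;
    printf("buffer_bytes,mb_per_sec\n");
    for (size_t size = min_size; size <= max_size; size *= 2) {
        printf("%zu,%.1f\n", size, memset_bandwidth_mbs(buf, size, total));
        steps++;
    }
    free(buf);
    return steps;
}
```

Calling run_sweep((size_t)8 << 10, (size_t)128 << 20, (size_t)1 << 30) from main would cover the 8KB to 128MB range mentioned above. Writing a fixed total number of bytes per test keeps the run time of each test roughly comparable across buffer sizes.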
After the first run, the threads test each other's memory buffers one after another to determine which buffer has the lowest bandwidth and is therefore remote. The threads then synchronize and rerun the tests in parallel to show the penalty of accessing remote memory. The results can help to determine whether it pays off on that specific machine to invest in cache and memory locality through CPU affinity and other techniques. The results also show the benefits of the Itanium architecture compared to the x64 architecture on today's hardware.
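The selection of the remote buffer comes down to picking the lowest measured bandwidth among the buffers owned by other threads. A minimal sketch, assuming the per-buffer results are already available as an array of MB/s values; the function name slowest_buffer and the array layout are hypothetical and not taken from dkBusy:

```c
#include <assert.h>
#include <stddef.h>

/* Given the bandwidth (MB/s) that one thread measured on every thread's
   buffer, return the index of the slowest buffer other than its own.
   Returns -1 if there is no other buffer to compare against. */
int slowest_buffer(const double *mbs, int count, int self)
{
    assert(mbs != NULL && count > 0);
    int slowest = -1;
    for (int i = 0; i < count; i++) {
        if (i == self)
            continue; /* skip the thread's own (local) buffer */
        if (slowest < 0 || mbs[i] < mbs[slowest])
            slowest = i;
    }
    return slowest;
}
```

On a two-node machine, the buffers on the other node would be expected to show clearly lower values than those on the thread's own node, so the returned index identifies a remote buffer for the parallel rerun.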
The first parameter defines how many CPUs will be tested. If the -uniform switch is used, the threads are assigned to CPUs incrementally, e.g. CPU 1, 2, 3, 4 and so on. If the switch is not specified, the threads are distributed evenly over the NUMA nodes.
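The two placement policies can be illustrated as a mapping from thread index to CPU index. The sketch below is one possible reading, not the dkBusy implementation: it assumes zero-based CPU numbers, an equal number of CPUs on every NUMA node, CPUs numbered contiguously within a node, and a round-robin walk over the nodes for the even distribution. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Incremental placement (as with -uniform): thread i runs on CPU i. */
int cpu_uniform(int thread)
{
    assert(thread >= 0);
    return thread;
}

/* Even placement: hand out CPUs round-robin across the nodes, so that
   consecutive threads land on different nodes.
   Returns -1 when the machine has no CPU left for this thread. */
int cpu_spread(int thread, int nodes, int cpus_per_node)
{
    assert(thread >= 0 && nodes > 0 && cpus_per_node > 0);
    int node = thread % nodes; /* which node this thread goes to */
    int slot = thread / nodes; /* position within that node */
    if (slot >= cpus_per_node)
        return -1;
    return node * cpus_per_node + slot;
}
```

With two nodes of four CPUs each, four threads would be placed on CPUs 0, 1, 2, 3 (all on the first node) under the incremental policy and on CPUs 0, 4, 1, 5 (two per node) under the round-robin policy.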
After the test completes, two CSV files are created that can be used to analyze the data with an Excel PivotTable. The first CSV file contains the header so that multiple CSV files can be appended.
dkBusy [<p1>] [-uniform]
© 2001 by David Kubelka - firstname.lastname@example.org