
     Open Fabrics Enterprise Distribution (OFED)
                Performance Tests README

===============================================================================

Table of Contents

  1. Overview
  2. Notes on Testing Methodology
  3. Test Descriptions
  4. Running Tests
  5. Known Issues

===============================================================================

1. Overview

This is a collection of tests written over uverbs intended for use as a performance micro-benchmark. The tests may be used for HW or SW tuning as well as for functional testing.

The collection contains a set of bandwidth and latency benchmarks such as:

* Send        - ib_send_bw and ib_send_lat
* RDMA Read   - ib_read_bw and ib_read_lat
* RDMA Write  - ib_write_bw and ib_write_lat
* RDMA Atomic - ib_atomic_bw and ib_atomic_lat
* Native Ethernet (when working with MOFED2) - raw_ethernet_bw, raw_ethernet_lat 

Please post results/observations to the openib-general mailing list. See "Contact Us" at http://openib.org/mailman/listinfo/openib-general and http://www.openib.org.

===============================================================================

2. Notes on Testing Methodology

  • The benchmarks use the CPU cycle counter to get time stamps without a context switch. Some CPU architectures (e.g., Intel's 80486 or older PPC) do not have such a capability.

  • The latency benchmarks measure round-trip time but report half of that as one-way latency. This means that the results may not be accurate for asymmetrical configurations.

  • On all unidirectional bandwidth benchmarks, the client measures the bandwidth. On bidirectional bandwidth benchmarks, each side measures the bandwidth of the traffic it initiates, and at the end of the measurement period, the server reports its result to the client, which combines the two.

  • Latency tests report minimum, median and maximum latency results. The median latency is typically less sensitive to high latency variations, compared to average latency measurement. Typically, the first value measured is the maximum value, due to warmup effects.

  • Long sampling periods have very limited impact on measurement accuracy. The default value of 1000 iterations is usually sufficient. Note that the program keeps data structures with a memory footprint proportional to the number of iterations. Setting a very high number of iterations may therefore have a negative impact on the measured performance that is not related to the devices under test. If a high number of iterations is strictly necessary, it is recommended to use the -N flag (No Peak).

  • Bandwidth benchmarks may be run for a number of iterations, or for a fixed duration. Use the -D flag to instruct the test to run for the specified number of seconds. The --run_infinitely flag instructs the program to run until interrupted by the user, and print the measured bandwidth every 5 seconds.

  • The "-H" option in latency benchmarks dumps a histogram of the results. See xgraph, ygraph, r-base (http://www.r-project.org/), PSPP, or other statistical analysis programs.
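As a rough illustration of the latency reporting described above, the sketch below halves a few round-trip samples and computes the minimum, median, maximum and average. The sample values and the function name summarize_latency are made up for illustration; this is not perftest code or perftest output.

```python
from statistics import mean, median


def summarize_latency(round_trip_usec):
    """Halve round-trip samples and summarize them as one-way latency."""
    one_way = [rtt / 2 for rtt in round_trip_usec]
    return {
        "min": min(one_way),
        "median": median(one_way),
        "max": max(one_way),
        "average": mean(one_way),
    }


# Hypothetical round-trip times in microseconds. The first sample is
# inflated on purpose to mimic a warmup effect.
samples = [40.0, 4.0, 4.0, 4.0, 4.0]
summary = summarize_latency(samples)
print(summary)
```

With these made-up samples the single warmup outlier pulls the average well above the median, which is why the median is the more stable figure to compare between runs.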

*** IMPORTANT NOTE: When running the benchmarks over an Infiniband fabric, a Subnet Manager must run on the switch or on one of the nodes in your fabric, prior to starting the benchmarks.

Architectures tested: i686, x86_64, ia64

===============================================================================

3. Test Descriptions

The benchmarks generate a synthetic stream of operations, which is very useful for hardware and software benchmarking and analysis. The benchmarks are not designed to emulate any real application traffic. Real application traffic may be affected by many parameters, and hence might not be predictable based only on the results of those benchmarks.

  ib_send_lat     latency test with send transactions
  ib_send_bw      bandwidth test with send transactions
  ib_write_lat    latency test with RDMA write transactions
  ib_write_bw     bandwidth test with RDMA write transactions
  ib_read_lat     latency test with RDMA read transactions
  ib_read_bw      bandwidth test with RDMA read transactions
  ib_atomic_lat   latency test with atomic transactions
  ib_atomic_bw    bandwidth test with atomic transactions

Raw Ethernet interface benchmarks:

  raw_ethernet_send_lat   latency test over raw Ethernet interface
  raw_ethernet_send_bw    bandwidth test over raw Ethernet interface

===============================================================================

4. Running Tests

Prerequisites:

  kernel 2.6
  (kernel module) matches libibverbs
  (kernel module) matches librdmacm
  (kernel module) matches libibumad
  (kernel module) matches libmath (lm)

Server: ./<test name> <options>
Client: ./<test name> <options> <server IP address>

    o  <server IP address> is an IPv4 or IPv6 address. You can use the IPoIB
               address if IPoIB is configured.
    o  --help lists the available <options>

*** IMPORTANT NOTE: The SAME OPTIONS must be passed to both server and client.
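The note above can be illustrated with a small helper that builds both command lines from one shared option list, so the two sides cannot drift apart. This is only a sketch: it assumes the usual perftest invocation in which the client passes the server address as its last argument, and the helper name build_commands, the device name mlx5_0 and the address 192.0.2.10 are hypothetical.

```python
import shlex


def build_commands(test_name, options, server_address):
    """Return (server_argv, client_argv) built from one shared option list."""
    shared = ["./" + test_name, *options]
    # Assumption: the client takes the server address as its final argument.
    return list(shared), shared + [server_address]


# Hypothetical device name and address, for illustration only.
server_cmd, client_cmd = build_commands(
    "ib_send_bw", ["-d", "mlx5_0", "-s", "4096", "-n", "5000"], "192.0.2.10"
)
print("server:", shlex.join(server_cmd))
print("client:", shlex.join(client_cmd))
```

The commands are only printed, not executed, so the sketch can be tried on a machine without RDMA hardware.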

Common Options to all tests:

  -h, --help                  Display this help message screen
  -p, --port=<port>           Listen on/connect to port (default: 18515)
  -R, --rdmacm                Connect QPs with rdma_cm and run test on those QPs
  -z, --com_rdma_cm           Communicate with rdma_cm module to exchange data - use regular QPs
  -m, --mtu=<mtu>             QP MTU size (default: active_mtu from ibv_devinfo)
  -c, --connection=<type>     Connection type RC/UC/UD/XRC/DC/SRD (default: RC)
  -d, --ib-dev=<dev>          Use IB device (default: first device found)
  -i, --ib-port=<port>        Use network port of IB device (default: 1)
  -s, --size=<size>           Size of message to exchange (default: 1)
  -a, --all                   Run sizes from 2 to 2^23
  -n, --iters=<iters>         Number of exchanges (at least 100, default: 1000)
  -x, --gid-index=<index>     Test uses GID with GID index taken from the command line
  -V, --version               Display version number
  -e, --events                Sleep on CQ events (default: poll)
  -F, --CPU-freq              Do not fail even if cpufreq_ondemand module is loaded
  -I, --inline_size=<size>    Max size of message to be sent in inline mode
  -u, --qp-timeout=<timeout>  QP timeout = (4 uSec)*(2^timeout) (default: 14)
  -S, --sl=<sl>               Service Level (default: 0)
  -r, --rx-depth=<dep>        Receive queue depth (default: 600)
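Two of the options above involve simple arithmetic, sketched here. The timeout helper uses the formula exactly as this README states it, (4 uSec)*(2^timeout). The size helper assumes that -a steps through powers of two, which the option text ("from 2 to 2^23") suggests but does not state outright. Both function names are hypothetical.

```python
def qp_timeout_usec(timeout):
    """QP timeout in microseconds, using the README formula (4 uSec) * (2^timeout)."""
    return 4 * 2 ** timeout


def all_sizes(max_exponent=23):
    """Message sizes for -a, assuming powers of two from 2 up to 2^max_exponent."""
    return [2 ** exponent for exponent in range(1, max_exponent + 1)]


default_timeout = qp_timeout_usec(14)  # default value of -u
sizes = all_sizes()
print(default_timeout, "usec")
print(len(sizes), "sizes from", sizes[0], "to", sizes[-1])
```

With the default of 14, the formula gives 65536 microseconds, i.e. roughly 65 ms.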

Options for latency tests:

  -C, --report-cycles         Report times in CPU cycle units
  -H, --report-histogram      Print out all results (default: summary only)
  -U, --report-unsorted       Print out unsorted results (default: sorted)

Options for BW tests:

  -b, --bidirectional            Measure bidirectional bandwidth (default: unidirectional)
  -N, --no peak-bw               Cancel peak-bw calculation (default: with peak-bw)
  -Q, --cq-mod                   Generate a CQE only after <cq-mod> completions
  -t, --tx-depth=<dep>           Size of tx queue (default: 128)
  -O, --dualport                 Run test in dual-port mode (2 QPs). Both ports must be active (default: OFF)
  -D, --duration=<seconds>       Run test for a period of <seconds> seconds
  -f, --margin=<seconds>         When in Duration, measure results within margins (default: 2)
  -l, --post_list=<list size>    Post list of send WQEs of size <list size> (instead of single post)
      --recv_post_list=<list size>  Post list of receive WQEs of size <list size> (instead of single post)
  -q, --qp=<num of qps>          Number of QPs running in the process (default: 1)
      --run_infinitely           Run test until interrupted by user, print results every 5 seconds

SEND tests (ib_send_lat or ib_send_bw) flags:

  -r, --rx-depth=<dep>        Size of receive queue (default: 512 in BW test)
  -g, --mcg=<num of qps>      Send messages to multicast group with <num of qps> QPs attached to it
  -M, --MGID=<multicast gid>  In multicast, uses <multicast gid> as the group MGID

ATOMIC tests (ib_atomic_lat or ib_atomic_bw) flags:

  -A, --atomic_type=<type>    Type of atomic operation from {CMP_AND_SWAP,FETCH_AND_ADD}
  -o, --outs=<num>            Number of outstanding read/atomic requests - also on READ tests

Options for raw_ethernet_send_bw:

  -B, --source_mac    Source MAC address in the format XX:XX:XX:XX:XX:XX (default: take the MAC address from the GID)
  -E, --dest_mac      Destination MAC address in the format XX:XX:XX:XX:XX:XX; MUST be entered
  -J, --server_ip     Server IP address in the format X.X.X.X (used to send packets with an IP header)
  -j, --client_ip     Client IP address in the format X.X.X.X (used to send packets with an IP header)
  -K, --server_port   Server UDP port number (used to send packets with a UDP header)
  -k, --client_port   Client UDP port number (used to send packets with a UDP header)
  -Z, --server        Choose server side for the current machine (--server/--client must be selected)
  -P, --client        Choose client side for the current machine (--server/--client must be selected)


Special feature detailed explanation in tests:

  1. Usage of the post_list feature (-l, --post_list=<list size> and --recv_post_list=<list size>)
     In this case, each QP will prepare <list size> WQEs (instead of 1) and chain them to each other. By chaining we mean allocating an array and setting the 'next' pointer of each WQE in the array to point to the following element in the array. The last WQE in the array will point to NULL. Posting the first WQE in the list then instructs the HW to post all of those WQEs, which means each post send/recv will post <list size> messages. This feature is useful if we want to know the maximum message rate of QPs in a single process. Since we are limited by SW posts (for example, on post_send ~ 10 Mpps, since we have ~ 500 ns between each SW post_send), we can see the true HW message rate when setting a <list size> of 64 (for example), since it no longer depends on SW limitations.

  2. RDMA Connected Mode (CM)
     You can add the "-R" flag to all tests to connect the QPs from each side with the rdma_cm library. In this case, the library will connect the QPs and will use the IPoIB interface to do so. This helps when you don't have an Ethernet connection between the 2 nodes. You must supply the IPoIB interface address as the server IP.

  3. Multicast support in ib_send_lat and in ib_send_bw
     Send tests have a built-in feature for testing multicast performance at the verbs level. You can use "-g" to specify the number of QPs to attach to this multicast group. The "-M" flag allows you to choose the multicast group address.

  4. GPUDirect usage:
     To utilize the GPUDirect feature, perftest should be compiled as:

       ./autogen.sh && ./configure CUDA_H_PATH=<path to cuda.h> && make -j

     e.g.:

       ./autogen.sh && ./configure CUDA_H_PATH=/usr/local/cuda/include/cuda.h && make -j

     The --use_cuda=<gpu_index> flag will then be available to add to a command line:

       ./ib_write_bw -d ib_dev --use_cuda=<gpu_index> -a
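Returning to the post list feature in item 1, the chaining it describes (an array of WQEs whose 'next' pointers link each element to the following one, with the last pointing to NULL) can be sketched as below. This is a plain-Python stand-in and not the verbs API: the WorkRequest class and both function names are hypothetical, and None plays the role of NULL.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkRequest:
    """Stand-in for a WQE; only the fields needed to show chaining."""
    wr_id: int
    next: Optional["WorkRequest"] = None


def build_post_list(list_size):
    """Allocate list_size work requests and chain them through 'next'."""
    if list_size < 1:
        raise ValueError("list_size must be at least 1")
    wqes = [WorkRequest(wr_id=i) for i in range(list_size)]
    for current, following in zip(wqes, wqes[1:]):
        current.next = following  # point each WQE at the following element
    # The last element keeps next=None, which terminates the chain.
    return wqes[0]


def chain_length(head):
    """Count the work requests reachable from head by following 'next'."""
    count = 0
    while head is not None:
        count += 1
        head = head.next
    return count


head = build_post_list(64)
print(chain_length(head))
```

Handing only the head of such a chain to a single post call is what lets one software post cover the whole list.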

===============================================================================

5. Known Issues

  1. Multicast support in ib_send_lat and in ib_send_bw is not stable. The benchmark program may hang or exhibit other unexpected behavior.

  2. Bidirectional ib_send_bw tests running in UD or UC mode may, in rare cases, hang. The perftest-2.3 release includes a hang detection feature, which exits the test after 2 minutes in those situations.

  3. Different versions of perftest may not be compatible with each other. Please use the same perftest version on both sides to ensure consistency of benchmark results.

  4. Test version 5.3 and above won't work with previous versions of perftest. The same applies to version 5.70 and above.

  5. This perftest package won't compile on MLNX_OFED-2.1 due to API changes in MLNX_OFED-2.2. In order to compile it properly, please do:

       ./configure --disable-verbs_exp
       make

  6. In an s390x platform virtualized environment, the results shown by the package's test applications can be incorrect.

  7. The perftest-2.3 release includes support for dual-port VPI tests with port1-Ethernet, port2-IB (in addition to Eth:Eth and IB:IB). Running dual-port with port1-IB, port2-Ethernet does not work yet.

  8. If GPUDirect is not working (e.g. you see a "Couldn't allocate MR" error message), consider disabling the Scatter to CQE feature by setting the environment variable MLX5_SCATTER_TO_CQE=0. E.g.:

       MLX5_SCATTER_TO_CQE=0 ./ib_write_bw -d ib_dev --use_cuda=<gpu_index> -a
