Port of Argonne National Laboratory's FMA chains benchmark flops.cpp.

#include "shambase/assert.hpp"
#include "shambase/time.hpp"
#include "shambackends/DeviceBuffer.hpp"
#include "shambackends/DeviceScheduler.hpp"
#include "shambackends/math.hpp"

Classes
struct	sham::benchmarks::fma_chains_result
	Structure containing the results of an fma_chains benchmark.

Namespaces
namespace	sham
	Namespace for the backends library; named sham rather than shambackends for brevity.

Macros
#define	MAD_4(x, y)
#define	MAD_16(x, y)

Functions
template<class T>
void	sham::benchmarks::fma_chains (u32 i, int nrotation, T y0, T *__restrict in, T *__restrict out)
	Kernel for the fma_chains benchmark.
template<class T>
fma_chains_result	sham::benchmarks::fma_chains_bench (DeviceScheduler_ptr sched, int N, f64 time_threshold)
	Run the fma_chains benchmark.

Detailed Description

Port of Argonne National Laboratory's FMA chains benchmark flops.cpp.

Author: Timothée David–Cléris (tim.shamrock@proton.me)

Definition in file fma_chains.hpp.

Macro Definition Documentation

◆ MAD_16

#define MAD_16(x, y)

Value:

    MAD_4(x, y); \
    MAD_4(x, y); \
    MAD_4(x, y); \
    MAD_4(x, y);

◆ MAD_4

#define MAD_4(x, y)

Value:

    x = y * x + y; \
    y = x * y + x; \
    x = y * x + y; \
    y = x * y + x;
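
As an illustration of what the two macros expand to, the following minimal sketch reproduces them and counts the arithmetic they issue. The `Counted` type, the `g_muls`/`g_adds` counters, and the `check_mad4` and `ops_in_mad16` helpers are hypothetical names introduced only for this example; they are not part of fma_chains.hpp.

```cpp
#include <cassert>

// Macros reproduced from the definitions documented above.
#define MAD_4(x, y) \
    x = y * x + y;  \
    y = x * y + x;  \
    x = y * x + y;  \
    y = x * y + x;
#define MAD_16(x, y) \
    MAD_4(x, y);     \
    MAD_4(x, y);     \
    MAD_4(x, y);     \
    MAD_4(x, y);

// Hypothetical counters and wrapper type (illustration only): every
// multiplication and addition on a Counted value is tallied.
static int g_muls = 0;
static int g_adds = 0;

struct Counted {
    double v;
};
inline Counted operator*(Counted a, Counted b) { ++g_muls; return {a.v * b.v}; }
inline Counted operator+(Counted a, Counted b) { ++g_adds; return {a.v + b.v}; }

// Worked example: starting from x = 0, y = 1, one MAD_4 gives x = 4, y = 12.
inline void check_mad4() {
    double x = 0.0, y = 1.0;
    MAD_4(x, y);
    assert(x == 4.0 && y == 12.0);
}

// Returns the number of arithmetic operations issued by one MAD_16:
// 16 multiply-add statements, i.e. 16 multiplies plus 16 adds.
inline int ops_in_mad16() {
    g_muls = 0;
    g_adds = 0;
    Counted x{0.0}, y{0.0}; // zeros keep the chain from overflowing
    MAD_16(x, y);
    return g_muls + g_adds;
}
```

Each statement depends on the result of the previous one, so the 16 multiply-adds form a single dependent chain per element; whether the compiler actually fuses them into hardware FMA instructions depends on the target and compiler flags.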

Function Documentation

◆ fma_chains()

template<class T>
inline void sham::benchmarks::fma_chains(u32 i, int nrotation, T y0, T *__restrict in, T *__restrict out)

Kernel for the fma_chains benchmark.

Performs enough arithmetic per element to saturate the FPU, so that memory latency is hidden. Each iteration executes 16 FMAs at 2 flops each (16 * 2 = 32 flops), so the achieved flop rate can be computed from the iteration count and the elapsed time.

Template Parameters

T	value type of the input and output vectors

Parameters

i	index of the element to process
nrotation	number of FMA-chain rotations to apply
y0	initial value of the second input vector
in	input vector
out	output vector

Definition at line 41 of file fma_chains.hpp.
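
The flop accounting described for this kernel can be sketched as follows. This is a minimal example, not code from fma_chains.hpp: the helper name `achieved_flops` and its parameters are hypothetical, and it assumes that each of the `nrotation` iterations performs exactly 16 FMAs per element, with one FMA counted as 2 flops.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of fma_chains.hpp).
// Assumption: every element runs `nrotation` iterations of 16 FMAs,
// and one FMA counts as 2 flops (one multiply, one add).
inline double achieved_flops(std::uint64_t n_elements, std::uint64_t nrotation, double seconds) {
    assert(seconds > 0.0);
    const double flops_per_iteration = 16.0 * 2.0;
    const double total_flops = static_cast<double>(n_elements)
                               * static_cast<double>(nrotation) * flops_per_iteration;
    return total_flops / seconds; // flop/s
}
```

Under these assumptions, 1048576 elements with 1000 rotations completed in 0.01 s would correspond to roughly 3.36e12 flop/s.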

◆ fma_chains_bench()

template<class T>
inline fma_chains_result sham::benchmarks::fma_chains_bench(DeviceScheduler_ptr sched, int N, f64 time_threshold)

Run the fma_chains benchmark.

Based on Argonne's Aurora node performance overview: https://docs.alcf.anl.gov/aurora/node-performance-overview/node-performance-overview/

Template Parameters

T	value type used in the benchmark

Parameters

sched	scheduler for the target device
N	number of elements (independent FMA chains) to process
time_threshold	minimum wall-clock time to run the benchmark in seconds

Returns: benchmark results as an fma_chains_result

Definition at line 85 of file fma_chains.hpp.
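
One plausible way to honour a minimum wall-clock time such as `time_threshold` is to repeat the workload until enough time has elapsed. The sketch below shows that pattern on the host with `std::chrono`; it is an assumption about the general technique, not the implementation in fma_chains.hpp, and `timed_run` and `run_for_at_least` are hypothetical names.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical result type (illustration only).
struct timed_run {
    double seconds;  // total elapsed wall-clock time
    int repetitions; // number of times the workload was executed
};

// Hypothetical helper: calls `work` repeatedly until at least
// `time_threshold` seconds of wall-clock time have elapsed.
// The workload always runs at least once.
template<class F>
timed_run run_for_at_least(double time_threshold, F &&work) {
    assert(time_threshold >= 0.0);
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    int reps = 0;
    double elapsed = 0.0;
    do {
        work();
        ++reps;
        elapsed = std::chrono::duration<double>(clock::now() - start).count();
    } while (elapsed < time_threshold);
    return {elapsed, reps};
}
```

Running for a minimum duration rather than a fixed repetition count keeps the measurement long enough to amortise launch overhead on fast devices. On an asynchronous device queue, the workload would also need to wait for kernel completion before the clock is read.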
