goyalankit/set_associativity.md

Table of Contents

Set Associativity cache conflicts

Set Associativity cache conflicts

It’s a tool built to get the different type of cache conflicts for a given program. It works by inserting instrumentation to a user specified method and records all the address access within that method. It’s part of the MACPO framework and there are two types of instrumentations that are supported: static or dynamic. The static instrumentation will generate a trace file, which can later be analyzed using macpo-analyze to get the conflict values whereas the dynamic instrumentation will calculate the counter values at run time for each set.

Usage

Running the analysis involves three steps. Instrumentation, compilations and run. In case of static instrumentation, an additional step on analyzing is also required.

Example gives expected output of each step for a given program.

Instrumentation

Here we insert the instrumentation functions in the method_name specified by the user in file source.c. Make sure to give the file containing the main function along with the file containing the method method_name (if it's not present in the same file).

Static Instrumentation

macpo.sh --macpo:disable-sampling --macpo:instrument=method_name source.c

Dynamic Instrumentation

For dynamic instrumentation pass additional flag: macpo:dynamic_inst

macpo.sh --macpo:disable-sampling --macpo:instrument=method_name --macpo:instrument=dynamic_inst source.c

Compilation

Above command will create a new file rose_source.c with instrumentation. You can compile the above file using any compiler using the following command

g++ rose_source.c -std=c++0x -lmrt -lstdc++ -ldl -rdynamic -lelf -lbfd -liberty -lz -o macpo

Running

./macpo

Analyze

The above command will create a macpo.out file which contains the trace information and can be analyzed using the macpo-analyze command.

macpo-analyze macpo.out

if you ran the dynamic instrumentation, then at the end of execution it should print results. It will also print results if you cancel the job by sending SIGINT to the job (i.e., equivalent to CTRL + C).

Algorithms

To calculate set associativity conflicts

Link to Dynamic code: set_cache_conflict_lib.cpp
Link to Static code: set_cache_conflict_analysis.cpp

Main Algorithm for conflict detection (static version)

Data Structures

/*
Trace that we get after running the instrumented code consists of mem_info_t structs.
Note: I have written comments for the variables that we are using in the algorithm.
*/
typedef struct {
    uint16_t coreID;            // we get the core number from here
    uint16_t read_write:2;
    size_t address;             // address that was accessed
    size_t var_idx;             // Variable id. Using this we can identify the variable in original source code
    size_t line_number;
    int type_size;
} mem_info_t;

/*
This is the conflict object that we store on a conflict
*/
struct conflict_t{
    size_t set_number; // conflicting set number
    size_t var_idx;    // Variable for which cache conflict miss happened.
};

// This struct is used to store the stats
struct cache_stats_t {
    uint64_t addr_accesses;
    uint64_t addr_hits;
    uint64_t addr_conflict_misses;
    uint64_t addr_cold_misses;
    uint64_t addr_capacity_misses;
}

Algorithm

1. Initialize Reuse Distance Vector RVs for each set.
2. Initialize vector < vector<conflict_t> > set_conflicts_per_core; // array of an array of conflict_t for each core
3. for each mem_info Mi in trace: - O(n)
4.       Calculate set number Sn and Cache line Cli for Mi.address
5.       Get reuse distance Ri from RVs using Cli - O(n) 
6.       if Ri is non-negative:
7.             if reuse distance Ri > l1_associativity && reuse distance Ri < total number of cache lines:
8.                   Add the conflict(Sn, Mi.var_idx) to set_conflicts_per_core[Mi.coreId]
9.                   Increment the conflict misses
10.             else if reuse_distance >= l1_cache_lines // l1_cache_lines = 512 on stampede.
11.                  Increment the capacity misses
12.             else
13.                  Increment the hit count
14.     else
15.            Increment the cold mist count
16.      Insert or update RVs with Mi.address - O(n)

Cache line and set Number calculation:

# Give an address for example, 
63 62   12 11 10 9  8  7  6  5  4  3  2  1  0         
0  0  ..0  0  0  1  1  0  0  0  0  0  0  1  0

6-63 bits: cache line -> 00..00 -> 0. (This is the reason why two arrays separated by more than 64 bytes will always be on different cache line)
0-5 bits: offset within the cache line.  000010 -> 2
6-11 bits: set number 001100 -> 12

To calculate reuse distance

Link to Code: reuse_distance.h

Definition: Reuse distance is calculated per cache line for each set. For a given cache line, reuse distance is calculated as the number of distinct cache lines that were accessed between two accesses of that cache line before the cache line have being evicted.

The main idea of the implementation below is to maintain a counter (access_count) which gets incremented everytime when the current cache line number is not equal to the last cache line. For example consider the following cases where C[N] represents that the cache line number N has been accessed. For example:

Case 1: When the last cache line accessed is same as current cache line.

# Initial List
C1 C2 C3 C4
access_count = 4

# C4 is accesed again
C1 C2 C3 C4 C4
access_count = 4


Case 2: When the last cache line accessed is different from the current cache line

# Initial List
C1 C2 C3 C4
access_count = 4

# C4 is accesed again
C1 C2 C3 C4 C1
access_count = 5

1. Initialize the counter access_count
2. Initialize hash map cache_line_count_map;  // mapping cache_line to a counter. Only maintains 512 cache_lines
3. Initialize hash map already_accessed; // this contained record of all the cache lines that has been accessed.

// If the cache line is still in cache, it will return the number of
// distinct cache lines that have been accessed (reuse distance).
// If the cache line is not in the cache but has been evicted, so it must be
// present in the already_accessed set, it will return number greater than 512. which indicates capacity miss.
// Otherwise it return -1, indicating the first time access.

get_distance(cache_line_t cache_line) {
  // if we have already seen the cache_line
  if cache_line in cache_line_count_map {
    return access_count - cache_line_count_map[cache_line]
  } else if cache_line in already_accessed {
    return 513; // anything greater than 512 will do.
  } else { // new cache_line
      return INFINITY;
  }
}

// while inserting we increment access count if the cache current cache line has been accessed and
// is not equal to the last cache line. (as explained in the above examples.)
// we also need to increment all the counters for the cache lines that are not in the current and last access
// of the current cache line. since there will be no increase in reuse distance increase for those cache lines.

insert(cache_line_t cache_line) {
  if (get_distance(cache_line) != 0) { // not on the head
    adjust_other_counters(cache_line);
    access_count++;
  }
  already_accessed = 1;
  cache_line_counter[cache_line] = access_count;
}

adjust_other_counters(cache_line_t cache_line) {
  int previous = cache_line_counter[cache_line];
  // we need to do this for all cache lines since
  // we need capacity misses.
  for (cl in cache_line_counter) {
    if (cache_line_counter[cl] < previous) {
      cache_line_counter[cl]++;
    }
  }
}

Error handling, Testing and Expected outputs

To give an idea of things that you may need to include

More explicit commands:

Instrumentation

macpo.sh -I/work/02681/ankitg/installs/perfexpert_install/include -g -L/work/01174/ashay/apps/binutils/lib --macpo:disable-sampling --macpo:instrument=method_name --macpo:dynamic_inst hello.c -L/work/02681/ankitg/installs/perfexpert_install/lib -L/work/02681/ankitg/installs/libelf/lib -Wl,-rpath=/work/02681/ankitg/installs/libelf/lib -lmrt -lstdc++ -ldl -rdynamic -lelf -lbfd -liberty -lz -lset

Compilation

g++ rose_hello.c -std=c++0x -I/work/02681/ankitg/installs/perfexpert_install/include -g -L/work/01174/ashay/apps/binutils/lib  -L/work/02681/ankitg/installs/perfexpert_install/lib -L/work/02681/ankitg/installs/libelf/lib -Wl,-rpath=/work/02681/ankitg/installs/libelf/lib -lmrt -lstdc++ -ldl -rdynamic -lelf -lbfd -liberty -lz -o matrix

AST Graph Generation

To debug any instrumentation related issues, you can generate a .dot file and then analyze it in Graphviz software.

Uncomment generateDOT(*project); in minst.cpp to generate a .dot file.

Code to generate controlled trace

Link to code: trace.c (file attached below)

This program generates a controlled trace using which you can test your set associativity conflicts

My configure command

./configure --prefix=/work/02681/ankitg/installs/perfexpert_install --with-rose=/work/02681/ankitg/installs/rose2_install --with-jvm=/usr/java/latest/jre/lib/amd64/server/ --with-papi=/opt/apps/papi/5.3.0/x86_64 --with-rose-include=/work/02681/ankitg/installs/rose2_install --with-boost=/work/02681/ankitg/installs/boost_install --with-gsl=/opt/apps/intel13/gsl/1.15 --with-libelf=/work/02681/ankitg/installs/libelf --with-libdwarf=/work/01174/ashay/apps/libdwarf --with-matheval=/work/01174/ashay/apps/perfexpert/externals/ --disable-module-hpctoolkit-mic --disable-module-timb --disable-memsniffer --disable-module-gcc --disable-module-lcpi --disable-module-gpp --disable-module-make --disable-module-icc --disable-module-sqlrules --disable-module-vtune --no-create --no-recursion

Valgrind trace to Macpo trace

Link to code

The link above contains instructions on how to convert trace from valgrind to a trace that macpo-analyze can understand and then run the analysis on top of that.

Table of Contents

Expected Outputs

Expected Outputs

The original source file `matrix.c`

#include <stdio.h>

#define FIXED 400

#define X_1 FIXED
#define Y_1 FIXED
#define X_2 FIXED
#define Y_2 FIXED
#define Z_1 FIXED
#define Z_2 FIXED


void mul()
{
  double m1[X_1][Y_1];
  double m2[Y_1][Y_2];
  double m3[Z_1][Z_2];

  int i,j,k;


  // warm up the cache
  for (i = 0; i < X_1; ++i)
  {
    for (j = 0; j < Y_2; ++j)
    {
      m1[i][j] = 1;
      m2[i][j] = 3;
      m3[i][j] = 0;
    }
  }

  int sum=0;
  for(i = 0; i < X_1; ++i)
  {
    for (j = 0; j < Y_2; ++j)
    {
      for (k = 0; k < Y_2; ++k)
      {
        sum = sum + m1[i][k] * m2[k][j];
      }
      m3[i][j] = sum;
      sum=0;
    }
  }

#ifdef PRINT
  for (i = 0; i < X_1; ++i)
  {
    for (j = 0; j < Y_2; ++j)
    {
      printf("%f\t", m3[i][j]);

    }
    printf("\n");
  }
#endif
}


int main(void)
{
  mul();
}

Static Instrumentation Output `rose_matrix.c`

void indigo__create_map();
#include <stdio.h>
#define FIXED 400
#define X_1 FIXED
#define Y_1 FIXED
#define X_2 FIXED
#define Y_2 FIXED
#define Z_1 FIXED
#define Z_2 FIXED
#include "mrt.h"

void mul()
{
  double m1[400UL][400UL];
  double m2[400UL][400UL];
  double m3[400UL][400UL];
  int i;
  int j;
  int k;
// warm up the cache
  for (i = 0; i < 400; ++i) {
    for (j = 0; j < 400; ++j) {
      indigo__record_c(2,27,((void *)(&m1[i][j])),0,sizeof(double ));
      m1[i][j] = 1;
      indigo__record_c(2,28,((void *)(&m2[i][j])),1,sizeof(double ));
      m2[i][j] = 3;
      indigo__record_c(2,29,((void *)(&m3[i][j])),2,sizeof(double ));
      m3[i][j] = 0;
    }
  }
  int sum = 0;
  for (i = 0; i < 400; ++i) {
    for (j = 0; j < 400; ++j) {
      for (k = 0; k < 400; ++k) {
        indigo__record_c(1,40,((void *)(&m1[i][k])),0,sizeof(double ));
        indigo__record_c(1,40,((void *)(&m2[k][j])),1,sizeof(double ));
        sum = (sum + (m1[i][k] * m2[k][j]));
      }
      indigo__record_c(2,42,((void *)(&m3[i][j])),2,sizeof(double ));
      m3[i][j] = sum;
      sum = 0;
    }
  }
#ifdef PRINT
#endif
}

int main()
{
  indigo__init_(1,0);
  indigo__create_map();
  mul();
  return 0;
}

void indigo__create_map()
{
  indigo__write_idx_c("m1",2);
  indigo__write_idx_c("m2",2);
  indigo__write_idx_c("m3",2);
}

Dynamic Instrumentation Output `rose_matrix.c`

void indigo__create_map();
#include <stdio.h>
#define FIXED 400
#define X_1 FIXED
#define Y_1 FIXED
#define X_2 FIXED
#define Y_2 FIXED
#define Z_1 FIXED
#define Z_2 FIXED
#include "mrt.h"
#include "set_cache_conflict_lib.h"

void mul()
{
  double m1[400UL][400UL];
  double m2[400UL][400UL];
  double m3[400UL][400UL];
  int i;
  int j;
  int k;
// warm up the cache
  for (i = 0; i < 400; ++i) {
    for (j = 0; j < 400; ++j) {
      indigo__process_c(2,27,((void *)(&m1[i][j])),0,sizeof(double ));
      m1[i][j] = 1;
      indigo__process_c(2,28,((void *)(&m2[i][j])),1,sizeof(double ));
      m2[i][j] = 3;
      indigo__process_c(2,29,((void *)(&m3[i][j])),2,sizeof(double ));
      m3[i][j] = 0;
    }
  }
  int sum = 0;
  for (i = 0; i < 400; ++i) {
    for (j = 0; j < 400; ++j) {
      for (k = 0; k < 400; ++k) {
        indigo__process_c(1,40,((void *)(&m1[i][k])),0,sizeof(double ));
        indigo__process_c(1,40,((void *)(&m2[k][j])),1,sizeof(double ));
        sum = (sum + (m1[i][k] * m2[k][j]));
      }
      indigo__process_c(2,42,((void *)(&m3[i][j])),2,sizeof(double ));
      m3[i][j] = sum;
      sum = 0;
    }
  }
#ifdef PRINT
#endif
}

int main()
{
  indigo__init_(1,0);
  indigo__create_map();
  mul();
  indigo__end();
  return 0;
}

void indigo__create_map()
{
  indigo__write_idx_c("m1",2);
  indigo__write_idx_c("m2",2);
  indigo__write_idx_c("m3",2);
}

Makefile

all: rose_matrix.c matrix

rose_matrix.c:
	macpo.sh -O0 -I/work/02681/ankitg/installs/perfexpert_install/include -g -L/work/01174/ashay/apps/binutils/lib --macpo:dynamic_inst --macpo:disable-sampling --macpo:instrument=mul matrix.c  -L/work/02681/ankitg/installs/perfexpert_install/lib -L/work/02681/ankitg/installs/libelf/lib -Wl,-rpath=/work/02681/ankitg/installs/libelf/lib -lmrt -lstdc++ -ldl -rdynamic -lelf -lbfd -liberty -lz

matrix: rose_matrix.c
	g++ rose_matrix.c -I/work/02681/ankitg/installs/perfexpert_install/include -g -L/work/01174/ashay/apps/binutils/lib  -L/work/02681/ankitg/installs/perfexpert_install/lib -L/work/02681/ankitg/installs/libelf/lib -Wl,-rpath=/work/02681/ankitg/installs/libelf/lib -lmrt -lstdc++ -ldl -rdynamic -lelf -lbfd -liberty -lz	-lset -lhwloc -o matrix

clean:
	rm rose_matrix.c matrix

	#include "trace.h"
	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/mman.h>
	#include <sys/stat.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <assert.h>
	//#include "papi_helper.h"
	#ifdef SET
	#include <set>
	#endif

	#define PG_SIZE 4096
	#define PG_START(_v) ((_v) & ~(unsigned long)(PG_SIZE-1))

	// has to be aligned to page. otherwise it will be done for you.
	#define ARRAY_ONE_BASE PG_START(0x10ffeffd000)
	#define ARRAY_ONE_SIZE (4096 * 3 * sizeof(int))
	#define ARRAY_TWO_BASE PG_START(PG_START(0x10ffeffd000) + ARRAY_ONE_SIZE + 1)

	#define BIT_MASK(a, b) (size_t)(((size_t) -1 >> ((sizeof(size_t)*8) \
	- (b))) & ~((1U << (a)) - 1))

	#define CACHE_LINE(addr) (addr >> 6)

	unsigned address_to_set(size_t address) {
	return ((address & BIT_MASK(6, 12)) >> 6);
	}

	unsigned long * allocateArray(unsigned long addr, size_t size) {

	int stack_prot = PROT_READ \| PROT_WRITE;
	int flags = MAP_SHARED \| MAP_FIXED \| MAP_ANONYMOUS;
	unsigned long *m_map;

	m_map = (unsigned long *)mmap((caddr_t)PG_START(addr), size, stack_prot, flags, -1, 0);

	// fail if mmap faield
	if (m_map == MAP_FAILED) {
	perror("mmap failed");
	abort();
	}

	//printf("Base address of allocated variable: %li\n", m_map);
	assert((void )m_map == (void )addr);

	return m_map;
	}

	void printSetAndCacheLine(int *array) {
	#ifdef SET
	std::set<size_t> addresses;
	#endif
	//printf("sizeof int is %d", sizeof(int));
	int i;
	for (i = 0; i < 4096 * 3; ++i) {
	#ifdef SET
	if (addresses.find(CACHE_LINE(((size_t)&array[i]))) == addresses.end()) {
	#endif
	printf("%li\n", &array[i]);
	// printf("Set: %d, CL: %li, ADD: %li\n", address_to_set((size_t)&array[i]), CACHE_LINE((size_t)&array[i]), &array[i]);
	#ifdef SET
	addresses.insert(CACHE_LINE((size_t)&array[i]));
	}
	#endif
	}
	}

	void iterateToProduceTrace(int *array) {
	int i, nthreads;
	for (i = 0; i < 4096 * 2; ++i) {
	array[i] = i;
	}
	}

	int main(int argc, char *argv[]) {

	unsigned long *addrOne = allocateArray(ARRAY_ONE_BASE, ARRAY_ONE_SIZE);
	unsigned long *addrSecond = allocateArray(ARRAY_TWO_BASE, ARRAY_ONE_SIZE);

	int firstArray = (int )addrOne;
	int secondArray = (int )addrSecond;

	//printf("varinfo:a:4096:%08lx:%lu\n",(void *) &firstArray[0], sizeof(firstArray[0]));
	//printf("varinfo:b:4096:%08lx:%lu\n",(void *) &secondArray[0], sizeof(secondArray[0]));

	//startPapiCounters();
	iterateToProduceTrace(firstArray);
	iterateToProduceTrace(secondArray);
	iterateToProduceTrace(firstArray);
	iterateToProduceTrace(secondArray);
	//stopPapiCounters();


	// printf("---First Array---\n");
	// printSetAndCacheLine(firstArray);
	// printf("---Second Array---\n");
	// printSetAndCacheLine(secondArray);

	// printf("---First Array---\n");
	//printSetAndCacheLine(firstArray);

	// printf("---Second Array---\n");
	// printSetAndCacheLine(secondArray);

	return 0;
	}

goyalankit/set_associativity.md

Set Associativity cache conflicts

Usage

Instrumentation

Static Instrumentation

Dynamic Instrumentation

Compilation

Running

Analyze

Algorithms

To calculate set associativity conflicts

Main Algorithm for conflict detection (static version)

To calculate reuse distance

Error handling, Testing and Expected outputs

Instrumentation

Compilation

AST Graph Generation

Code to generate controlled trace

My configure command

Valgrind trace to Macpo trace

Expected Outputs

The original source file matrix.c

Static Instrumentation Output rose_matrix.c

Dynamic Instrumentation Output rose_matrix.c

Makefile

The original source file `matrix.c`

Static Instrumentation Output `rose_matrix.c`

Dynamic Instrumentation Output `rose_matrix.c`