Table of Contents
It’s a tool built to get the different type of cache conflicts for a given program. It works by inserting instrumentation to a user specified method and records all the address access within that method. It’s part of the MACPO framework and there are two types of instrumentations that are supported: static or dynamic. The static instrumentation will generate a trace file, which can later be analyzed using macpo-analyze to get the conflict values whereas the dynamic instrumentation will calculate the counter values at run time for each set.
Running the analysis involves three steps. Instrumentation, compilations and run. In case of static instrumentation, an additional step on analyzing is also required.
Example gives expected output of each step for a given program.
Here we insert the instrumentation functions in the method_name
specified by the user in file source.c
. Make sure to give the file containing the main
function along with the file containing the method method_name
(if it's not present in the same file).
macpo.sh --macpo:disable-sampling --macpo:instrument=method_name source.c
For dynamic instrumentation pass additional flag: macpo:dynamic_inst
macpo.sh --macpo:disable-sampling --macpo:instrument=method_name --macpo:instrument=dynamic_inst source.c
Above command will create a new file rose_source.c with instrumentation. You can compile the above file using any compiler using the following command
g++ rose_source.c -std=c++0x -lmrt -lstdc++ -ldl -rdynamic -lelf -lbfd -liberty -lz -o macpo
./macpo
The above command will create a macpo.out
file which contains the trace information and can be analyzed using the macpo-analyze
command.
macpo-analyze macpo.out
- if you ran the dynamic instrumentation, then at the end of execution it should print results. It will also print results if you cancel the job by sending
SIGINT
to the job (i.e., equivalent toCTRL + C
).
-
Link to Dynamic code: set_cache_conflict_lib.cpp
-
Link to Static code: set_cache_conflict_analysis.cpp
Data Structures
/*
Trace that we get after running the instrumented code consists of mem_info_t structs.
Note: I have written comments for the variables that we are using in the algorithm.
*/
typedef struct {
uint16_t coreID; // we get the core number from here
uint16_t read_write:2;
size_t address; // address that was accessed
size_t var_idx; // Variable id. Using this we can identify the variable in original source code
size_t line_number;
int type_size;
} mem_info_t;
/*
This is the conflict object that we store on a conflict
*/
struct conflict_t{
size_t set_number; // conflicting set number
size_t var_idx; // Variable for which cache conflict miss happened.
};
// This struct is used to store the stats
struct cache_stats_t {
uint64_t addr_accesses;
uint64_t addr_hits;
uint64_t addr_conflict_misses;
uint64_t addr_cold_misses;
uint64_t addr_capacity_misses;
}
Algorithm
1. Initialize Reuse Distance Vector RVs for each set.
2. Initialize vector < vector<conflict_t> > set_conflicts_per_core; // array of an array of conflict_t for each core
3. for each mem_info Mi in trace: - O(n)
4. Calculate set number Sn and Cache line Cli for Mi.address
5. Get reuse distance Ri from RVs using Cli - O(n)
6. if Ri is non-negative:
7. if reuse distance Ri > l1_associativity && reuse distance Ri < total number of cache lines:
8. Add the conflict(Sn, Mi.var_idx) to set_conflicts_per_core[Mi.coreId]
9. Increment the conflict misses
10. else if reuse_distance >= l1_cache_lines // l1_cache_lines = 512 on stampede.
11. Increment the capacity misses
12. else
13. Increment the hit count
14. else
15. Increment the cold mist count
16. Insert or update RVs with Mi.address - O(n)
Cache line and set Number calculation:
# Give an address for example,
63 62 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 ..0 0 0 1 1 0 0 0 0 0 0 1 0
6-63 bits: cache line -> 00..00 -> 0. (This is the reason why two arrays separated by more than 64 bytes will always be on different cache line)
0-5 bits: offset within the cache line. 000010 -> 2
6-11 bits: set number 001100 -> 12
Link to Code: reuse_distance.h
Definition: Reuse distance is calculated per cache line for each set. For a given cache line, reuse distance is calculated as the number of distinct cache lines that were accessed between two accesses of that cache line before the cache line have being evicted.
The main idea of the implementation below is to maintain a counter (access_count
) which gets incremented everytime when the current cache line number is not equal to the last cache line. For example consider the following cases where C[N]
represents that the cache line number N
has been accessed. For example:
Case 1: When the last cache line accessed is same as current cache line.
# Initial List
C1 C2 C3 C4
access_count = 4
# C4 is accesed again
C1 C2 C3 C4 C4
access_count = 4
Case 2: When the last cache line accessed is different from the current cache line
# Initial List
C1 C2 C3 C4
access_count = 4
# C4 is accesed again
C1 C2 C3 C4 C1
access_count = 5
1. Initialize the counter access_count
2. Initialize hash map cache_line_count_map; // mapping cache_line to a counter. Only maintains 512 cache_lines
3. Initialize hash map already_accessed; // this contained record of all the cache lines that has been accessed.
// If the cache line is still in cache, it will return the number of
// distinct cache lines that have been accessed (reuse distance).
// If the cache line is not in the cache but has been evicted, so it must be
// present in the already_accessed set, it will return number greater than 512. which indicates capacity miss.
// Otherwise it return -1, indicating the first time access.
get_distance(cache_line_t cache_line) {
// if we have already seen the cache_line
if cache_line in cache_line_count_map {
return access_count - cache_line_count_map[cache_line]
} else if cache_line in already_accessed {
return 513; // anything greater than 512 will do.
} else { // new cache_line
return INFINITY;
}
}
// while inserting we increment access count if the cache current cache line has been accessed and
// is not equal to the last cache line. (as explained in the above examples.)
// we also need to increment all the counters for the cache lines that are not in the current and last access
// of the current cache line. since there will be no increase in reuse distance increase for those cache lines.
insert(cache_line_t cache_line) {
if (get_distance(cache_line) != 0) { // not on the head
adjust_other_counters(cache_line);
access_count++;
}
already_accessed = 1;
cache_line_counter[cache_line] = access_count;
}
adjust_other_counters(cache_line_t cache_line) {
int previous = cache_line_counter[cache_line];
// we need to do this for all cache lines since
// we need capacity misses.
for (cl in cache_line_counter) {
if (cache_line_counter[cl] < previous) {
cache_line_counter[cl]++;
}
}
}
To give an idea of things that you may need to include
More explicit commands:
macpo.sh -I/work/02681/ankitg/installs/perfexpert_install/include -g -L/work/01174/ashay/apps/binutils/lib --macpo:disable-sampling --macpo:instrument=method_name --macpo:dynamic_inst hello.c -L/work/02681/ankitg/installs/perfexpert_install/lib -L/work/02681/ankitg/installs/libelf/lib -Wl,-rpath=/work/02681/ankitg/installs/libelf/lib -lmrt -lstdc++ -ldl -rdynamic -lelf -lbfd -liberty -lz -lset
g++ rose_hello.c -std=c++0x -I/work/02681/ankitg/installs/perfexpert_install/include -g -L/work/01174/ashay/apps/binutils/lib -L/work/02681/ankitg/installs/perfexpert_install/lib -L/work/02681/ankitg/installs/libelf/lib -Wl,-rpath=/work/02681/ankitg/installs/libelf/lib -lmrt -lstdc++ -ldl -rdynamic -lelf -lbfd -liberty -lz -o matrix
To debug any instrumentation related issues, you can generate a .dot
file and then analyze it in Graphviz software.
Uncomment generateDOT(*project);
in minst.cpp to generate a .dot
file.
Link to code: trace.c (file attached below)
This program generates a controlled trace using which you can test your set associativity conflicts
./configure --prefix=/work/02681/ankitg/installs/perfexpert_install --with-rose=/work/02681/ankitg/installs/rose2_install --with-jvm=/usr/java/latest/jre/lib/amd64/server/ --with-papi=/opt/apps/papi/5.3.0/x86_64 --with-rose-include=/work/02681/ankitg/installs/rose2_install --with-boost=/work/02681/ankitg/installs/boost_install --with-gsl=/opt/apps/intel13/gsl/1.15 --with-libelf=/work/02681/ankitg/installs/libelf --with-libdwarf=/work/01174/ashay/apps/libdwarf --with-matheval=/work/01174/ashay/apps/perfexpert/externals/ --disable-module-hpctoolkit-mic --disable-module-timb --disable-memsniffer --disable-module-gcc --disable-module-lcpi --disable-module-gpp --disable-module-make --disable-module-icc --disable-module-sqlrules --disable-module-vtune --no-create --no-recursion
The link above contains instructions on how to convert trace from valgrind to a trace that macpo-analyze
can understand and then run the analysis on top of that.