- What can be traced?
- How can trace events be specified?
- "match specifications": twisty passages, all alike
- WTF, can I just use DTrace and drink my coffee/beer/whisky in peace?
- Trace delivery mechanisms: pick one of two
- Trace event receivers: what can they do?
- Redbug vs. The World
- Examples of event receivers
- See docs: dbg:p/2 or erlang:trace/3
- They're somewhat interchangeable.
- Use whichever one you're comfortable with, API-wise (see the sketch below)
- They are twiddling the same knobs inside the VM
- I talk a lot about the Redbug app
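A minimal sketch of the same request made both ways, "trace all calls to lists:seq/2" (the target function is just an example):

%% dbg flavor: start the default tracer, enable call tracing on all
%% processes, then set a trace pattern on the target function.
dbg:tracer(),
dbg:p(all, call),
dbg:tp(lists, seq, 2, []).

%% erlang:trace flavor: same knobs, lower-level API; trace events arrive
%% as messages in the calling process's mailbox.
erlang:trace(all, true, [call]),
erlang:trace_pattern({lists, seq, 2}, true, []).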
- Which processes?
- A specific pid.
- existing: all processes currently existing
- new: all processes that will be created in the future
- all: existing and new
- What events?
- function calls: call, return
- sending of messages
- receiving of messages
- process-related events: spawn, exit, register, unregister, link, unlink, getting linked, getting unlinked
- scheduling of processes: in, out
- process exiting
- garbage collection activities
- Additional bells & whistles (see the dbg:p/2 sketch below)
- microsecond timestamps on all messages
- report only the arity instead of the full argument list
- set on spawn / first spawn / link / first link
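The process specs and event flags above map onto dbg:p/2 roughly like this (a sketch; Pid stands in for whatever process you care about):

dbg:tracer(),
%% All existing *and* future processes: function calls and message sends,
%% with microsecond timestamps, reporting only the arity of each call.
dbg:p(all, [call, send, timestamp, arity]),
%% One specific pid: process events (spawn/exit/link/...), scheduling,
%% and garbage collection, also following any processes it spawns.
dbg:p(Pid, [procs, running, garbage_collection, set_on_spawn]).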
- See docs in ets:fun2ms/1 and ms_transform:parse_transform/2
- Because regular expressions on Erlang data suck.
- Match specs attempt to be Erlang'ish
- And try to be quick to execute
- But are obtuse to the 7th degree....
- Example match specs:
[{'_',[],[{message,{return_trace}}]}]
[{{'$1','$2'},[{'>','$2',{const,3}},{is_atom,'$1'}],['$1']}]
ets:fun2ms(fun({A, B}) when is_atom(A), B > 3 -> A end).
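For call tracing, dbg:fun2ms/1 is the cousin of ets:fun2ms/1: its fun takes the traced function's argument list as a single list. A quick sketch:

%% Produces a call-trace match spec in the spirit of the examples above,
%% with {return_trace} as the action.
dbg:fun2ms(fun([A, B]) when is_atom(A), B > 3 -> return_trace() end).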
- Mostly useful for restricting tracing on function calls
- After-the-fact tracing filters are good but...
- ... generating exactly the traces that you want is better.
- E.g., match foo:bar/3, but only if the 1st arg is hellooooooo_world (see the sketch below).
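A minimal sketch of exactly that, using dbg and a hand-written match spec (foo:bar/3 and the atom are the hypothetical example above):

dbg:tracer(),
dbg:p(all, call),
%% Head: 1st argument must be the atom hellooooooo_world, the other two
%% are wildcards; no guards; action: also emit a return_trace event.
dbg:tp(foo, bar, 3,
       [{['hellooooooo_world', '_', '_'], [], [{return_trace}]}]).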
- Erlang process receiver
- Send the trace events to any pid in the cluster
- Easy to use
- TCP receiver
- Send binary trace events to a TCP port listener
- Faster & lower overhead: less likely to kill VM in trace event tsunami
- IMPORTANT: All tracing events must go to the same single receiver.
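With dbg, the choice looks roughly like this; pick one (the handler fun and the TCP port number are made up for illustration):

%% 1. Erlang process receiver: a handler fun runs inside a tracer process.
dbg:tracer(process,
           {fun(TraceMsg, Count) ->
                    io:format("~p~n", [TraceMsg]),
                    Count + 1
            end, 0}).

%% 2. TCP receiver: binary trace events are streamed to TCP port 4711,
%%    to be consumed elsewhere via dbg:trace_client(ip, 4711).
dbg:tracer(port, dbg:trace_port(ip, 4711)).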
- DTrace is far more flexible, alas.
- I sooooo wish.......
- ... unfortunate reality hits ....
- DTrace isn't available everywhere that Erlang is
- The VM's tracing infrastructure isn't a good match with DTrace, yet.
- Pull requests are welcome!
It is quite possible to create a tsunami of trace events.
- You can easily perturb performance with trace event receiver @ 100% CPU
- You can easily run your machine out of RAM
- You can deadlock the tracer and/or other processes.
- Some trace patterns (e.g., return tracing) disable tail call optimization ... and can run your VM out of memory even when the # of trace events is very small.
To be safe, do all of these things:
- Make your trace specifications as specific as possible
- Do not use "'_':'_'"
- I.e.: any module + any function name
- If you must use "your_module_name:'_'", run for a very short time, until you are confident that you will not create an event tsunami.
- Simplify your trace receiver as much as possible
- Limit the amount of time that tracing is enabled
- Read the docs!
- Redbug has all of these limits built in; that's one reason we use/recommend it (see below).
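A sketch of Redbug's limit options (the specific numbers are arbitrary; time is in milliseconds, msgs is a trace message count):

%% Stop tracing after 10 seconds or 100 trace messages, whichever comes first.
redbug:start("your_module_name:'_' -> return", [{time, 10000}, {msgs, 100}]).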
- Anything, really
- The events are just data. Fold/spindle/mutilate/....
- In practice, simple is better
- Most tracers just format the event, in the order received
- Simple, stateless
- For example, this is what Redbug does by default
- But Redbug can use a custom event handler, if you wish.... (sketch below)
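Here is a minimal sketch of a stateful receiver (written against dbg's handler-fun API rather than Redbug's) that counts call events per function and prints the most-called functions every 10000 events:

Handler =
    fun({trace, _Pid, call, {M, F, Args}}, {N, Counts0}) ->
            Counts = maps:update_with({M, F, length(Args)},
                                      fun(C) -> C + 1 end, 1, Counts0),
            case (N + 1) rem 10000 of
                0 -> Top = lists:sublist(
                             lists:reverse(lists:keysort(2, maps:to_list(Counts))), 5),
                     io:format("~p events, most-called: ~p~n", [N + 1, Top]);
                _ -> ok
            end,
            {N + 1, Counts};
       (_OtherEvent, State) ->
            State
    end,
dbg:tracer(process, {Handler, {0, #{}}}),
dbg:p(all, call),
dbg:tp(lists, seq, 2, []).    %% example trace pattern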
- Redbug vs. The World
- Redbug's default trace event receiver is stateless, prints to console
- Redbug has an option to write binary events to a file for offline processing
- Redbug can use multiple match specs at "full strength & power"
- So ... when should you not use Redbug?
- When an online & stateful event receiver is needed
Redbug specification:
> redbug:start("erlang:now -> return").
ok
21:31:52 <timer_server> {erlang,now,[]}
21:31:52 <timer_server> {erlang,now,0} -> {1385,555512,322417}
21:31:53 <timer_server> {erlang,now,[]}
21:31:53 <timer_server> {erlang,now,0} -> {1385,555513,322384}
[...]
quitting: msg_count
To limit the formatting depth (note all of the "..." placeholders for omitted data):
([email protected])26> redbug:start("timer:'_' -> return", {print_depth,7}).
ok
21:33:41 <{erlang,apply,2}> {timer,tc,[riak_pipe_stat,produce_stats,[]]}
21:33:41 <<0.3234.1>> {timer,now_diff,[{1385,555621,...},{1385,...}]}
21:33:41 <<0.3234.1>> {timer,now_diff,2} -> 130
21:33:41 <<0.3234.1>> {timer,tc,3} -> {130,
{riak_pipe,[{{...},...},{...}|...]}}
For Redbug help, run redbug:help(). For an explanation of Redbug's print_depth, see the ~P formatting description in the io man page.
- These are not great examples of Erlang code style
- Close your eyes if you are sensitive to ugly code.
^_^
Source: Scott Fritchie: https://gist.github.com/slfritchie/007b399675fb76f2d4f0
Report, every Interval seconds, the count of init() function calls for the following Riak KV FSM modules: riak_kv_buckets_fsm, riak_kv_exchange_fsm, riak_kv_get_fsm, riak_kv_index_fsm, riak_kv_keys_fsm, riak_kv_put_fsm.
Source: Greg Burd: https://gist.github.com/2656289
%% @doc Dynamically add timing to MFA. There are various types of
%% timing.
%%
%% {all, Max} - time latency of all calls to MFA
%%
%% {sample, N, Max} - sample every N calls and stop sampling after Max
%%
%% {threshold, Millis, Max} - count # of calls where latency is > Millis
%% and count # of calls total, thus percentage of calls over threshold
Sample output:
> timeit:timeit(timer, tc, 3, {sample, 10, 5}).
{ok,[{matched,'[email protected]',1},{saved,1}]}
[sample 1/10] <0.10771.1>:timer:tc: 0.152 ms
[sample 2/20] <0.10796.1>:timer:tc: 0.484 ms
[sample 3/30] <0.10825.1>:timer:tc: 0.111 ms
[sample 4/40] <0.10849.1>:timer:tc: 0.101 ms
[sample 5/50] <0.10875.1>:timer:tc: 194.682 ms
Source: Scott Fritchie: https://gist.github.com/slfritchie/7a87269148280e4724ce
Report each riak_kv_put_fsm and riak_kv_get_fsm process that executes in more than LatencyMS milliseconds. Also, for various strategic API functions in the file, bitcask, and eleveldb modules, report latencies above the same limit.
Source: Scott Fritchie: https://gist.github.com/slfritchie/e7c3ad866f67d3fc8935
A more sophisticated version of fsm_latency_tracer. Watches more API functions in the bitcask, eleveldb, file, and riak_kv_fs2_backend modules. Uses the folsom library to generate histograms of latencies of the file:pread() function, to show outliers and multi-modal patterns that simple averages & maximums cannot show.
Source: Scott Fritchie: https://gist.github.com/slfritchie/c01690d9c30e5e7b0100
Demonstrates how much of every second the file_server_2 process is busy. It shows pretty dramatically how Bitcask's highest latencies are frequently caused by serialization by file_server_2.
Source: Joe Blomstedt: https://gist.github.com/jtuple/244c578773962077e215
- For calls to riak_kv_get_fsm:calculate_objsize/2, print the Bucket, Key, and object size.
- For calls to riak_kv_eleveldb_backend:put/5 and riak_kv_bitcask_backend:put/5 and riak_kv_memory_backend:put/5, report if the object size is larger than Size.
- Stop after Count events have been recorded or Time milliseconds have elapsed.
Source: Scott Fritchie: https://gist.github.com/slfritchie/23121803af63d76eeca5
A small example of reading a binary-formatted Erlang trace file and doing something simple with it.
Note that Redbug can write its trace events into a binary file. This technique can be used for off-line processing, e.g., to avoid high processing cost in a production environment.
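A minimal sketch (not the gist's code) of replaying such a binary trace file with dbg's file client, counting events by type; the file path is a placeholder:

CountByType =
    fun(end_of_trace, Acc) ->
            io:format("event counts: ~p~n", [Acc]),
            Acc;
       (Msg, Acc) when element(1, Msg) =:= trace; element(1, Msg) =:= trace_ts ->
            maps:update_with(element(3, Msg), fun(N) -> N + 1 end, 1, Acc);
       (_Other, Acc) ->
            Acc
    end,
dbg:trace_client(file, "/path/to/trace.bin", {CountByType, #{}}).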
Source: Scott Fritchie: https://gist.github.com/slfritchie/159a8ce1f49fc03c77c6
func_args_tracer.erl
%% For example: what ETS tables are being called the most by ets:lookup/2?
%% The 1st arg of ets:lookup/2 is the table name.
%% Watch for 10 seconds.
%%
%% > func_args_tracer:start(ets, lookup, 2, 10, fun(Args) -> hd(Args) end).
%%
%% Tracer pid: <0.16102.15>, use func_args_tracer:stop() to stop
%% Otherwise, tracing stops in 10 seconds
%% Current date & time: {2013,9,19} {18,5,48}
%% {started,<0.16102.15>}
%% Total calls: 373476
%% Call stats:
%% [{folsom_histograms,114065},
%% {ac_tab,69689},
%% {ets_riak_core_ring_manager,67147},
%% {folsom_spirals,57076},
%% {riak_capability_ets,48862},
%% {riak_core_node_watcher,8149},
%% {riak_api_pb_registrations,8144},
%% {folsom,243},
%% {folsom_meters,43},
%% {folsom_durations,20},
%% {timer_tab,18},
%% {folsom_gauges,8},
%% {riak_core_stat_cache,5},
%% {sys_dist,3},
%% {inet_db,1},
%% {21495958,1},
%% {3145765,1},
%% {3407910,1}]
latency_histogram_tracer.erl
%% For example: create a histogram of call latencies for bitcask:put/3.
%% Watch for 10 seconds.
%%
%% > latency_histogram_tracer:start(bitcask, put, 3, 10).
%%
%% Tracer pid: <0.2108.18>, use latency_histogram_tracer:stop() to stop
%% Otherwise, tracing stops in 10 seconds
%% Current date & time: {2013,9,19} {18,14,13}
%% {started,<0.2108.18>}
%% Histogram stats:
%% [{min,0},
%% {max,48},
%% {arithmetic_mean,2.765411819271055},
%% [...]
%% {percentile,[{50,3},{75,4},{90,5},{95,6},{99,8},{999,14}]},
%% {histogram,[{1,13436},
%% {2,12304},
%% {3,10789},
%% {4,7397},
%% {5,4191},
%% {6,1929},
%% [...]
%% {30,0},
%% {31,0},
%% {40,2},
%% {50,1}]},
%% {n,51746}]
Source: Joe Blomstedt: https://gist.github.com/jtuple/161ad7c8c278b45729df
Source: Jordan West: https://gist.github.com/jrwest/2999f8f217f1cbceca83
- Attempts to capture enough data from Erlang tracing to construct a flamegraph.
- Aside/see also: https://github.com/proger/eflame
- On VM #1: ktrace:c()
- On VM #2, on same machine, running Riak: ktrace:s()
- Then get Riak to do some get requests.
- ktrace output should appear in VM #1's console. Save it to a file.
- Run ktrace_fold:go("/path/to/input-file", "/path/to/output-file").
- Run flamegraph.pl /path/to/output-file > /path/to/output.svg
For example: http://www.snookles.com/scotttmp/vnode-get.svg ... which suggests that only about 15% of the wall-clock time for riak_kv_vnode get processing is due to the backend (Bitcask in this case).