Skip to content

Latest commit

 

History

History
327 lines (254 loc) · 12.2 KB

tracing.asciidoc

File metadata and controls

327 lines (254 loc) · 12.2 KB

Tracing

Concurrency and distributed systems present unique challenges and complexities, particularly regarding debugging and monitoring. Tracing is helpful in this context, offering developers a window into the running system’s behavior. This chapter begins by exploring the role of tracing on the BEAM.

Tracing provides visibility into process execution, communication between these processes, and the system’s overall behavior. Erlang systems are known for their robustness and fault tolerance, traits that we want to achieve. However, these traits can complicate pinpointing the source of errors or unexpected behavior. Tracing allows developers to follow the execution flow in real time or through recorded logs, making identifying and rectifying bugs easier. Tracing also serves a vital function in monitoring the performance of Erlang applications. Tracing function calls, message passing, and other system activities allows developers to gather data on performance bottlenecks, inefficient code paths, and unexpected system behavior. Erlang’s tracing capabilities extend to providing insights into the system’s behavior. Tracing helps in understanding how different parts of the system interact, how they scale, and how they recover from failures. Tracing thus becomes an educational tool, enabling developers to model and predict system behavior more accurately.

We will review some of the tracing tools provided by ERTS and some built on top of the built-in support.

Built-in Erlang Tracing Facilities

io:format Tracing

While not a tracing tool in the conventional sense, io:format serves as Erlang’s equivalent to traditional debugging techniques like printf debugging found in other programming languages. It allows developers to insert print statements within their code, facilitating immediate and simple output of variable states, function call results, and process statuses. This method is highly accessible and requires minimal setup, making it an excellent first step for identifying issues before employing more sophisticated tracing tools.

Erlang Trace (erl_trace)

This tracing facility offers a low-level interface for tracing Erlang functions, messages, and processes. It provides the core functionalities upon which higher-level tracing tools are built, delivering essential tracing primitives directly supported by the Erlang runtime.

Sequential Tracing (seq_trace)

Specialized in tracing the flow of messages within distributed Erlang systems, seq_trace enables developers to track the sequence of message passing between processes.

Third-party Tools

Redbug

Redbug is part of the eper performance and debugging suite and is known for its user-friendly interface and minimal performance impact. It focuses on safe, efficient tracing, avoiding common pitfalls like overwhelming trace output or system overload, making it an ideal choice for live system analysis. It is an external library and therefore it has to be installed separately. One of the best Redbug features is its ability to shut itself down in case of overload.

Installing Redbug

You can clone redbug via:

$ git clone https://github.com/massemanet/redbug

You can then compile it with:

$ cd redbug
$ make

Ensure redbug is included in your path when starting an Erlang shell and you are set to go. This can be done by explicitly adding the path to the redbug beam files when invoking erl:

$ erl -pa /path/to/redbug/ebin

Alternatively, the following line can be added to the ~/.erlang file. This will ensure that the path to redbug gets included automatically at every startup:

code:add_patha("/path/to/redbug/ebin").
Using Redbug

Redbug is safe to be used in production, thanks to a self-protecting mechanism against overload, which kills the tool in case too many tracing messages are sent, preventing the Erlang node to become overloaded. Let’s see it in action:

$ erl
Erlang/OTP 19 [erts-8.2] [...]

Eshell V8.2 (abort with ^G)
1> l(redbug). (1)
{module,redbug}
2> redbug:start("lists:sort/1"). (2)
{30,1}
3> lists:sort([3,2,1]).
[1,2,3]

% 15:20:20 <0.31.0>({erlang,apply,2}) (3)
% lists:sort([3,2,1])
redbug done, timeout - 1 (4)
  1. First, we ensure that the redbug module is available and loaded.

  2. We then start redbug. We are interested in the function named sort with arity 1, exported by the module lists. Remember that, in Erlang lingo, the arity represents the number of input arguments that a given function takes.

  3. Finally, we invoke the lists:sort/1 function and we verify that a message is produced by redbug.

  4. After the default timeout (15 seconds) is reached, redbug stops and displays the message "redbug done". Redbug is also kind enough to tell us the reason why it stopped (timeout) and the number of messages that collected until that point (1).

Let’s now look at the actual message produced by redbug. By default messages are printed to the standard output, but it’s also possible to dump them to file:

% 15:20:20 <0.31.0>({erlang,apply,2})
% lists:sort([3,2,1])

Depending on the version of redbug you are using, you may get a slightly different message. In this case, the message is split across two lines. The first line contains a timestamp, the Process Identifier (or PID) of the Erlang process which invoked the function and the caller function. The second line contains the function called, including the input arguments. Both lines are prepended with a %, which reminds us of the syntax for Erlang comments.

We can also ask Redbug to produce an extra message for the return value. This is achieved using the following syntax:

4> redbug:start("lists:sort/1->return").
{30,1}

Let’s invoke the lists:sort/1 function again. This time the output from redbug is slightly different.

5> lists:sort([3,2,1]).
[1,2,3]

% 15:35:52 <0.31.0>({erlang,apply,2})
% lists:sort([3,2,1])

% 15:35:52 <0.31.0>({erlang,apply,2})
% lists:sort/1 -> [1,2,3]
redbug done, timeout - 1

In this case two messages are produced, one when entering the function and one when leaving the same function.

When dealing with real code, trace messages can be complex and therefore hardly readable. Let’s see what happens if we try to trace the sorting of a list containing 10.000 elements.

6> lists:sort(lists:seq(10000, 1, -1)).
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,
23,24,25,26,27,28,29|...]

% 15:48:42.208 <0.77.0>({erlang,apply,2})
% lists:sort([10000,9999,9998,9997,9996,9995,9994,9993,9992,9991,9990,9989,9988,9987,9986,
% 9985,9984,9983,9982,9981,9980,9979,9978,9977,9976,9975,9974,9973,9972,9971,
% 9970,9969,9968,9967,9966,9965,9964,9963,9962,9961,9960,9959,9958,9957,9956,
% 9955,9954,9953,9952,9951,9950,9949,9948,9947,9946,9945,9944,9943,9942,9941,
% 9940,9939,9938,9937,9936,9935,9934,9933,9932,9931,9930,9929,9928,9927,9926,
% 9925,9924,9923,9922,9921,9920,9919,9918,9917,9916,9915,9914,9913,9912,9911,
% [...]
% 84,83,82,81,80,79,78,77,76,75,74,73,72,71,70,69,68,67,66,65,64,63,62,61,60,
% 59,58,57,56,55,54,53,52,51,50,49,48,47,46,45,44,43,42,41,40,39,38,37,36,35,
% 34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,15,14,13,12,11,10,9,
% 8,7,6,5,4,3,2,1])

% 15:48:42.210 <0.77.0>({erlang,apply,2}) lists:sort/1 ->
% [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,
% 23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,
% 42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,
% 61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,
% 80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,
% 99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,
% [...]
% 9951,9952,9953,9954,9955,9956,9957,9958,9959,9960,9961,
% 9962,9963,9964,9965,9966,9967,9968,9969,9970,9971,9972,
% 9973,9974,9975,9976,9977,9978,9979,9980,9981,9982,9983,
% 9984,9985,9986,9987,9988,9989,9990,9991,9992,9993,9994,
% 9995,9996,9997,9998,9999,10000]
redbug done, timeout - 1

Most of the output has been truncated here, but you should get the idea. To improve things, we can use a couple of redbug options. The option {arity, true} instructs redbug to only display the number of input arguments for the given function, instead of their actual value. The {print_return, false} option tells Redbug not to display the return value of the function call, and to display a …​ symbol, instead. Let’s see these options in action.

7> redbug:start("lists:sort/1->return", [{arity, true}, {print_return, false}]).
{30,1}

8> lists:sort(lists:seq(10000, 1, -1)).
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,
23,24,25,26,27,28,29|...]

% 15:55:32 <0.77.0>({erlang,apply,2})
% lists:sort/1

% 15:55:32 <0.77.0>({erlang,apply,2})
% lists:sort/1 -> '...'
redbug done, timeout - 1

By default, redbug stops after 15 seconds or after 10 messages are received. Those values are a safe default, but they are rarely enough. You can bump those limits by using the time and msgs options. time is expressed in milliseconds.

9> redbug:start("lists:sort/1->return", [{arity, true}, {print_return, false}, {time, 60 * 1000}, {msgs, 100}]).
{30,1}

We can also activate redbug for several function calls simultaneously. Let’s enable tracing for both functions lists:sort/1 and lists:sort_1/3 (an internal function used by the former):

10> redbug:start(["lists:sort/1->return", "lists:sort_1/3->return"]).
{30,2}

11> lists:sort([4,4,2,1]).
[1,2,4,4]

% 18:39:26 <0.32.0>({erlang,apply,2})
% lists:sort([4,4,2,1])

% 18:39:26 <0.32.0>({erlang,apply,2})
% lists:sort_1(4, [2,1], [4])

% 18:39:26 <0.32.0>({erlang,apply,2})
% lists:sort_1/3 -> [1,2,4,4]

% 18:39:26 <0.32.0>({erlang,apply,2})
% lists:sort/1 -> [1,2,4,4]
redbug done, timeout - 2

Last but not least, redbug offers the ability to only display results for matching input arguments. This is when the syntax looks a bit like magic.

12> redbug:start(["lists:sort([1,2,5])->return"]).
{30,1}

13> lists:sort([4,4,2,1]).
[1,2,4,4]

14> lists:sort([1,2,5]).
[1,2,5]

% 18:45:27 <0.32.0>({erlang,apply,2})
% lists:sort([1,2,5])

% 18:45:27 <0.32.0>({erlang,apply,2})
% lists:sort/1 -> [1,2,5]
redbug done, timeout - 1

In the above example, we are telling redbug that we are only interested in function calls to the lists:sort/1 function when the input arguments is the list [1,2,5]. This allows us to remove a huge amount of noise in the case our target function is used by many actors at the same time and we are only interested in a specific use case. Oh, and don’t forget that you can use the underscore as a wildcard:

15> redbug:start(["lists:sort([1,_,5])->return"]).  {30,1}

16> lists:sort([1,2,5]).  [1,2,5]

% 18:49:07 <0.32.0>({erlang,apply,2}) lists:sort([1,2,5])

% 18:49:07 <0.32.0>({erlang,apply,2}) lists:sort/1 -> [1,2,5]

17> lists:sort([1,4,5]).  [1,4,5]

% 18:49:09 <0.32.0>({erlang,apply,2}) lists:sort([1,4,5])

% 18:49:09 <0.32.0>({erlang,apply,2}) lists:sort/1 -> [1,4,5] redbug
% done, timeout - 2

This section does not pretend to be a comprehensive guide to redbug, but it should be enough to get you going. To get a full list of the available options for redbug, you can ask the tool itself:

18> redbug:help().

Recon Trace (recon_trace)

As a component of the recon library for production system diagnosis, recon_trace stands out for its powerful tracing features, designed with safety and minimal performance impact in mind.

Comparison and Use Cases

The selection among built-in and third-party tools largely depends on the specific debugging needs. io:format offers a straightforward, quick debugging method, whereas tools like erl_trace and recon_trace provide deeper insights with more comprehensive tracing capabilities.

Tools designed to minimize operational impact, such as redbug and recon_trace, are particularly valuable in production environments. They enable real-time issue diagnosis without significant performance drawbacks.

seq_trace provides essential insights into the dynamics of message passing and process interaction across distributed Erlang nodes.