Skip to content

Latest commit

 

History

History
590 lines (471 loc) · 21.7 KB

ops.asciidoc

File metadata and controls

590 lines (471 loc) · 21.7 KB

Operation and Maintenance

One guiding principle behind the design of the runtime system is that bugs are more or less inevitable. Even if through an enormous effort you manage to build a bug free application you will soon learn that the world or your user changes and your application will need to be "fixed."

The Erlang runtime system is designed to facilitate change and to minimize the impact of bugs.

The impact of bugs is minimized by compartmentalization. This is done from the lowest level where each data structure is separate and immutable to the highest level where running systems are dived into separate nodes. Change is facilitated by making it easy to upgrade code and interacting and examining a running system.

Connecting to the System

We will look at many different ways to monitor and maintain a running system. There are many tools and techniques available but we must not forget the most basic tool, the shell and the ability to connect a shell to node.

In order to connect two nodes they need to share or know a secret passphrase, called a cookie. As long as you are running both nodes on the same machine and the same user starts them they will automatically share the cookie (in the file $HOME/.erlang.cookie).

We can see this in action by starting two nodes, one Erlang node and one Elixir node. First we start an Erlang node called node1. We see that it has no connected nodes:

$ erl -sname node1
Erlang/OTP 19 [erts-8.1] [source-0567896] [64-bit] [smp:4:4]
              [async-threads:10] [hipe] [kernel-poll:false]

Eshell V8.1  (abort with ^G)
(node1@GDC08)1> nodes().
[]
(node1@GDC08)2>

Then in another terminal window we start an Elixir node called node2. (Note that the command line flags have to be specified with double dashes in Elixir!)

$ iex --sname node2
Erlang/OTP 19 [erts-8.1] [source-0567896] [64-bit] [smp:4:4]
              [async-threads:10] [hipe] [kernel-poll:false]

Interactive Elixir (1.4.0) - press Ctrl+C to exit (type h() ENTER for help)
iex(node2@GDC08)1>

In Elixir we can connect the nodes by running the command Node.connect name. In Erlang you do this with net_kernel:connect_node(Name). The node connection is bidirectional so you only need to run the command on one of the nodes.

iex(node2@GDC08)1> Node.connect :node1@GDC08
true
iex(node2@GDC08)2> Node.list
[:node1@GDC08]
iex(node2@GDC08)3>

And from the Erlang side we can see that now both nodes know about each other. (Note that the result from nodes() does not contain the local node itself.)

(node1@GDC08)2> nodes().
[node2@GDC08]
(node1@GDC08)3>

In the distributed case this is somewhat more complicated since we need to make sure that all nodes know or share the cookie. This can be done in three ways. You can set the cookie used when talking to a specific node, you can set the same cookie for all systems at start up with the -set_cookie parameter, or you can copy the file .erlang.cookie to the home directory of the user running the system on each machine.

The last alternative, to have the same cookie in the cookie file of each machine in the system is usually the best option since it makes it easy to connect to the nodes from a local OS shell. Just set up some secure way of logging in to the machine either through VPN or ssh.

Let’s create a node on a separate machine, and connect it to the nodes we have running. We start a third terminal window, check the contents of our local cookie file, and ssh to the other machine:

happi@GDC08:~$ cat ~/.erlang.cookie
pepparkaka
happi@GDC08:~$ ssh gds01
happi@gds01:~$

We launch a node on the remote machine, telling it to use the same cookie passphrase, and then we are able to connect to our existing nodes. (Note that we have to specify node1@GDC08 so Erlang knows where to find these nodes.)

happi@gds01:~$ erl -sname node3 -setcookie pepparkaka
Erlang/OTP 18 [erts-7.3] [source-d2a6d81] [64-bit] [smp:8:8]
              [async-threads:10] [hipe] [kernel-poll:false]

Eshell V7.3  (abort with ^G)
(node3@gds01)1> net_kernel:connect('node1@GDC08').
true
(node3@gds01)2> nodes().
[node1@GDC08,node2@GDC08]
(node3@gds01)3>

Even though we did not explicitly talk to node 2, it has automatically been informed that node 3 has joined, as we can see:

iex(node2@GDC08)3> Node.list
[:node1@GDC08,:node3@gds01]
iex(node2@GDC08)4>

In the same way, if we terminate one of the nodes, the remaining nodes will remove that node from their lists automatically. If the node gets restarted, it can simply rejoin the network again.

Note
A Potential Problem with Different Cookies

Note that the default for the Erlang distribution is to create a fully connected network. That is, all nodes become connected to all other nodes in the network. If each node has its own cookie, you will have to tell each node the cookies of every other node before you try to connect them. You can start up a node with the flag -connect_all false in order to prevent the system from trying to make a fully connected network. Alternatively, you can start a node as hidden with the flag -hidden, which makes node connections to that node non-transitive.

Now that we know how to connect nodes, even on different machines, to each other, we can look at how to connect a shell to a node.

The Shell

The Elixir and the Erlang shells works much the same way as a shell or a terminal window on your computer, except that they give you a terminal window directly into your runtime system. This gives you an extremely powerful tool, a CLI with full access to the runtime. This is fantastic for operation and maintenance.

In this section we will look at different ways of connecting to a node through the shell and some of the shell’s perhaps less known but more powerful features.

Configuring Your Shell

Both the Elixir shell and the Erlang shell can be configured to provide you with shortcuts for functions that you often use.

The Elixir shell will look for the file .iex.exs first in the local directory and then in the users home directory. The code in this file is executed in the shell process and all variable bindings will be available in the shell.

In this file you can configure aspects such as the syntax coloring and the size of the history. [See hexdocs for a full documentation.](https://hexdocs.pm/iex/IEx.html#module-the-iex-exs-file)

You can also execute arbitrary code in the shell context.

When the Erlang runtime system starts, it first interprets the code in the Erlang configuration file. The default location of this file is in the users home directory ~/.erlang. It can contain any Erlang expressions, each terminated by a dot and a newline.

This file is typically used to add directories to the code path for loading Erlang modules:

code:add_path("/home/happi/hacks/ebin").

as well as to load the custom user_default module (a .beam file) which you can use to extend the Erlang shell with user-defined functions:

code:load_abs("/home/happi/.config/erlang/user_default").

Just replace the paths above to match your own system. Do not include the .beam extension in the load_abs command.

If you call a function from the shell without specifying a module name, for example foo(bar), it will try to find the function first in the module user_default (if it exists) and then in the module shell_default (which is part of stdlib). This is how shell commands such as ls() and help() are implemented, and you are free to add your own or override the existing ones.

Connecting a Shell to a Node

When running a production system you will want to start the nodes in daemon mode through run_erl. We will go through how to do this, and some of the best practices for deployment and running in production in [xxx](#ch.live). Fortunately, even when you have started a system in daemon mode, which implies it does not have a default shell, you can connect another shell to the system. There are actually several ways to do that. Most of these methods rely on the normal distribution mechanisms and hence require that you have the same Erlang cookie on both machines as described in the previous section.

Remote shell (Remsh)

The easiest and probably the most common way to connect to an Erlang node is by starting a named node that connects to the system node through a remote shell. This is done with the erl command line flag -remsh NodeName. Note that you need to be running a named node in order to be able to connect to another node. If you don’t specify the -name or -sname flag when you use -remsh, Erlang will generate a random name for the new node. In either case, you will typically not see this name printed, since your shell will get connected directly to the remote node. For example:

$ erl -sname node4 -remsh node2
Erlang/OTP 18 [erts-7.3] [source-d2a6d81] [64-bit] [smp:8:8]
              [async-threads:10] [hipe] [kernel-poll:false]

Eshell V7.3  (abort with ^G)
(node2@GDC08)1>

Or using Elixir:

$ iex --sname node4 --remsh node2
Erlang/OTP 19 [erts-8.1] [source-0567896] [64-bit] [smp:4:4]
              [async-threads:10] [hipe] [kernel-poll:false]

Interactive Elixir (1.4.0) - press Ctrl+C to exit (type h() ENTER for help)
iex(node2@GDC08)1>

Note that an Erlang node can typically start a shell on an Elixir node (node2 above), but starting an Elixir shell on an Erlang node (node1) will not work, because the Erlang node will be missing the necessary Elixir libraries:

$ iex --remsh node1
Could not start IEx CLI due to reason: nofile
$
Note
No Default Security

There is no security built into either the normal Erlang distribution or to the remote shell implementation. You should not leave your system node exposed to the internet, and you do not want to connect from a node on your development machine to a live node. You would typically access your live environment through a VPN tunnel or ssh via a bastion host, so you can log in to a machine that runs one of your live nodes. Then from there you can connect to one of the nodes using remsh.

It is important to understand that there are actually two nodes involved when you start a remote shell. The local node, named node4 in the previous example, and the remote node node2. These nodes can be on the same machine or on different machines. The local node is always running on the machine on which you gave the iex or erl command with remsh. On the local node there is a process running the tty program which interacts with the terminal window. The actual shell process runs on the remote node. This means, first of all, that the code for the shell you want to run (Iex or the Erlang shell) has to exist at the remote node. It also means that code is executed on the remote node. And it also means that any shell default settings are taken from the settings of the remote machine.

Imagine that we have the following .erlang file in our home directory on the machine GDC08.

link:../code/ops_chapter/src/.erlang[role=include]

And the user_default.erl file looks like this:

link:../code/ops_chapter/src/user_default.erl[role=include]

Then we create two directories ~/example/dir1 and ~/example/dir2 and we put two different .iex.exs files in those directories, which will set the prompt to show either <d1> or <d2>, respectively.

# File 1
link:../code/ops_chapter/src/example/dir1/.iex.exs[role=include]
# File 2
link:../code/ops_chapter/src/example/dir2/.iex.exs[role=include]

Now if we start four different nodes from these directories we will see how the shell configurations are loaded. First node1 in dir1:

GDC08:~/example/dir1$ iex --sname node1
Erlang/OTP 19 [erts-8.1] [source-0567896] [64-bit]
              [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

ERTS is starting in /home/happi/example/dir1
 on [node1@GDC08]
Interactive Elixir (1.4.0) - press Ctrl+C to exit (type h() ENTER for help)
iEx starting in
/home/happi/example/dir1
iEx starting on
node1@GDC08
(node1@GDC08)iex<d1>

Then node2 in dir2:

GDC08:~/example/dir2$ iex --sname node2
Erlang/OTP 19 [erts-8.1] [source-0567896] [64-bit]
              [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

ERTS is starting in /home/happi/example/dir2
 on [node2@GDC08]
Interactive Elixir (1.4.0) - press Ctrl+C to exit (type h() ENTER for help)
iEx starting in
/home/happi/example/dir2
iEx starting on
node2@GDC08
(node2@GDC08)iex<d2>

Then node3 in dir1, but launching a remote shell on node2:

GDC08:~/example/dir1$ iex --sname node3 --remsh node2@GDC08
Erlang/OTP 19 [erts-8.1] [source-0567896] [64-bit] [smp:4:4]
              [async-threads:10] [hipe] [kernel-poll:false]

ERTS is starting in /home/happi/example/dir1
 on [node3@GDC08]
Interactive Elixir (1.4.0) - press Ctrl+C to exit (type h() ENTER for help)
iEx starting in
/home/happi/example/dir2
iEx starting on
node2@GDC08
(node2@GDC08)iex<d2>

As we see, the remote shell started up in dir2, since that is the directory of node2. Finally, we start an Erlang node and check that the function we defined in our user_default module can be called from the shell:

GDC08:~/example/dir2$ erl -sname node4
Erlang/OTP 19 [erts-8.1] [source-0567896] [64-bit] [smp:4:4]
              [async-threads:10] [hipe] [kernel-poll:false]

ERTS is starting in /home/happi/example/dir2
 on [node4@GDC08]
Eshell V8.1  (abort with ^G)
(node4@GDC08)1> tt().
test
(node4@GDC08)2>

These shell configurations are loaded from the node running the shell, as you can see from the above examples. If we were to connect to a node on a different machine, these configurations would not be present.

Passing the -remsh flag at startup is not the only way to launch a remote shell. You can actually change which node and shell you are connected to on the fly by going into job control mode.

Job Control Mode

By pressing Ctrl+G you enter the job control mode (JCL). You are then greeted by another prompt:

User switch command
 -->

By typing h (followed by enter) you get a help text with the available commands in JCL:

  c [nn]            - connect to job
  i [nn]            - interrupt job
  k [nn]            - kill job
  j                 - list all jobs
  s [shell]         - start local shell
  r [node [shell]]  - start remote shell
  q                 - quit erlang
  ? | h             - this message

The interesting command here is the r command which starts a remote shell. You can give it the name of the shell you want to run, which is needed if you want to start an Elixir shell, since the default is the standard Erlang shell. Once you have started a new job (i.e. a new shell) you need to connect to that job with the c command. You can also list all jobs with j.

(node2@GDC08)iex<d2>
User switch command
 --> r node1@GDC08 'Elixir.IEx'
 --> c
Interactive Elixir (1.4.0) - press Ctrl+C to exit (type h() ENTER for help)
iEx starting in
/home/happi/example/dir1
iEx starting on
node1@GDC08

Starting a new local shell with s and connecting to that instead is useful if you have started a very long running command and want to do something else in the meantime. If your command seems to be stuck, you can interrupt it with i, or you can kill that shell with k and start a fresh one.

See the [Erlang Shell manual](http://erlang.org/doc/man/shell.html) for a full description of JCL mode.

You can quit your session by typing ctrl+G q [enter]. This shuts down the local node. You do not want to quit with any of q()., halt(), init:stop(), or System.halt. All of these will bring down the remote node which seldom is what you want when you have connected to a live server. Instead use ctrl+\, ctrl+c ctrl+c, ctrl+g q [enter] or ctrl+c a [enter].

If you do not want to use a remote shell, which requires you to have two instances of the Erlang runtime system running, there are actually two other ways to connect to a node. You can also connect either through a Unix pipe or directly through ssh, but both of these methods require that you have prepared the node you want to connect to by starting it in a special way or by starting an ssh server.

Connecting through a Pipe

By starting the node through the command run_erl you will get a named pipe for IO and you can attach a shell to that pipe without the need to start a whole new node. As we shall see in the next chapter there are some advantages to using run_erl instead of just starting Erlang in daemon mode, such as not losing standard IO and standard error output.

The run_erl command is only available on Unix-like operating systems that implement pipes. If you start your system with run_erl, something like:

> run_erl -daemon log/erl_pipe log "erl -sname node1"

or

> run_erl -daemon log/iex_pipe log "iex --sname node2"

You can then attach to the system through the named pipe (the first argument to run_erl).

> to_erl dir1/iex_pipe

iex(node2@GDC08)1>

You can exit the shell by sending EOF (ctrl+d) and leave the system running in the background.

Note
With to_erl the terminal is connected directly to the live node so if you type ctrl-C or ctrl-G q [enter] you will bring down that node - probably not what you want! When using run_erl, it can be a good idea to also use the flag +Bi, which disables the ctrl-C signal and removes the q option from the ctrl-G menu.

The last method for connecting to the node is through ssh.

Connecting through SSH

Erlang comes with a built in SSH server which you can start on your node and then connect to directly. This is completely separate from the Erlang distribution mechanism, so you do not need to start the system with -name or -sname. The [documentation for the ssh module](http://erlang.org/doc/man/ssh.html) explains all the details. For a quick test all you need is a server key which you can generate with ssh-keygen:

> mkdir ~/ssh-test/
> ssh-keygen -t rsa -f ~/ssh-test/ssh_host_rsa_key

Then you start the ssh daemon on the Erlang node:

gds01> erl
Erlang/OTP 18 [erts-7.3] [source-d2a6d81] [64-bit] [smp:8:8]
              [async-threads:10] [hipe] [kernel-poll:false]

Eshell V7.3  (abort with ^G)
1> ssh:start().
{ok,<0.47.0>}
2> ssh:daemon(8021, [{system_dir, "/home/happi/ssh-test"},
                     {auth_methods, "password"},
                     {password, "pwd"}]).
Note
The system_dir defaults to /etc/ssh, but those keys are only readable the root user, which is why we create our own in this example.

You can now connect from another machine. Note that even though you’re using the plain ssh command to connect, you land directly in an Erlang shell on the node:

happi@GDC08:~> ssh -p 8021 happi@gds01
happi@gds01's password: [pwd]
Eshell V7.3  (abort with ^G)
1>

In a real world setting you would want to set up your server and user ssh keys as described in the documentation. At a minimum you would want to have a better password.

In this shell you do not have access to neither JCL mode (Ctrl+G) nor the BREAK mode (Ctrl+C). Entering q(), halt() or init:stop() will bring down the remote node. To disconnect from the shell you can enter exit(). to terminate the shell session, or you can shut down your terminal window.

The break mode is really powerful when developing, profiling and debugging. We will take a look at it next.

Breaking (out or in).

When you press ctrl+c you enter BREAK mode. This is most often used just to terminate the node by either typing a [enter] for abort, or simply by hitting ctrl+c once more. But you can actually use this mode to look in to the internals of the Erlang runtime system.

When you enter BREAK mode you get a short menu:

BREAK: (a)bort (A)bort with dump (c)ontinue (p)roc info (i)nfo
       (l)oaded (v)ersion (k)ill (D)b-tables (d)istribution

c [enter] (continue) takes you back to the shell. You can also terminate the node with a forced crash dump (different from a core dump - see [CH-Debugging]) for debugging purposes with A [enter] .

Hitting p [enter] will give you internal information about all processes in the system. We will look closer at what this information means in the next chapter (See [xxx](#ch.processes)).

You can also get information about the memory and the memory allocators in the node through i [enter]. In [xxx](#ch.memory) we will look at how to decipher this information.

You can see all loaded modules and their sizes with l [enter] and the system version with v [enter], while k [enter] will let you step through all processes and inspect them and kill them. Capital D [enter] will show you information about all the ETS tables in the system, and lower case d [enter] will show you information about the distribution (basically just the node name).

If you have built your runtime with OPPROF or DEBUG you will be able to get even more information. We will look at how to do this in [AP-BuildingERTS]. The code that implements the break mode can be found in <filename>[OTP_SOURCE]/erts/emulator/beam/break.c</filename>.

Note
If you prefer that the node shuts down immediately on ctrl+c instead of bringing up the BREAK menu, you can pass the option +Bd to erl. This is typically what you want for running Erlang in Docker or similar. There is also the variant +Bc which makes ctrl-c just interrupt the current shell command - this can be nice for interactive work.

Note that going into break mode freezes the node. This is not something you want to do on a production system. But when debugging or profiling in a test system, this mode can help us find bugs and bottlenecks, as we will see later in this book.