
Commit a9adb8d

updated README.md
1 parent 3d6cfa8 commit a9adb8d


README.md

Lines changed: 105 additions & 8 deletions
@@ -126,14 +126,81 @@ cmake ..
make
```

### build examples

Simply do the following:

```bash
cmake .. -DBUILD_EXAMPLES=On
make
```

### build tests

```bash
cmake .. -DBUILD_TESTS=On
```

### enable serialization

Serialization requires boost and boost-serialization; see your distro's documentation on how to install them.
To enable it, pass a cmake flag:

```bash
cmake .. -DUSING_BOOST_SERIALIZATION=On
```

For example, if you want to build the examples **with** serialization:

```bash
cmake .. -DUSING_BOOST_SERIALIZATION=On -DBUILD_EXAMPLES=On
```
159+
160+
You can also set this flag for your **own** project, if you wish to save and load
161+
policies, states or actions.
162+
Do bear in mind that the `state_trait` (e.g., your state **descriptor**) and the
163+
`action_trait` (e.g., your action **descriptor**) must **also be serializable**.
164+
On how to achieve this, [have a look at this tutorial](http://www.boost.org/doc/libs/1_64_0/libs/serialization/doc/tutorial.html)
165+
if this condition is not met, you will end up with compilation errors.
166+
167+
Because of the flexibility of boost serialization, you can save and load binary, text or xml archives.
168+
Later versions of boost support smart pointers, so even if your descriptors are
169+
`std::shared_ptr` or `std::unique_ptr` you can still save and load them.
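
As a minimal sketch of what a serializable descriptor can look like (the `grid_cell` struct below is a hypothetical example, not part of the library), using boost's intrusive `serialize` member:

```cpp
#include <boost/serialization/access.hpp>

// Hypothetical state descriptor (a `state_trait`): a grid cell.
// Any comparable struct works, provided it is serializable.
struct grid_cell
{
    unsigned int x = 0;
    unsigned int y = 0;

    bool operator==(const grid_cell & rhs) const
    {
        return x == rhs.x && y == rhs.y;
    }

private:
    friend class boost::serialization::access;

    // boost-serialization hook: archive both members
    template <class Archive>
    void serialize(Archive & ar, const unsigned int /*version*/)
    {
        ar & x;
        ar & y;
    }
};
```

With a descriptor like this, any of the boost archives (text, binary or xml) can be used on the objects built from it.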
170+

# Examples

There is a folder `examples` which I'm populating with examples, starting from your typical *gridworld* problem,
and then moving on to a *blackjack* program.
Currently there are two "Gridworld" examples:

- an offline on-policy algorithm: `examples/gridworld_offline.cpp` built as `ex_gridworld_offline`
- an online on-policy algorithm: `examples/gridworld_online.cpp` built as `ex_gridworld_online`

## basic usage

The basic way of using the library is the following:

1. create a class **state**, or use an existing class, structure, or POD which describes (in a *Markovian* sense) your state
2. create a class **action**, or use an existing class, structure, or POD which describes the action
3. create an *episode*, which by default is an `std::deque<relearn::link<state,action>>`, which you populate and then reward

At this point, depending on whether you are using an **online** or **offline** algorithm/approach, you have the following options:

4. keep creating episodes, obtain a reward for the last/terminal state, and once you have finished, train the policy with all of them,
**or**
5. every time you create an episode, obtain the reward, then train your policy with it.

That choice is up to you, and almost always depends on the domain, system or problem you're trying to solve.
It is for this reason that there is no implementation of `on_policy`, `off_policy` or `e_greedy`:
those are very simple algorithms, and are application-specific.
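
As a rough sketch of steps 1 to 5 (the `grid` and `direction` descriptors are made up for illustration, and the header name plus the exact `relearn::policy`/`relearn::q_learning` spellings are assumptions that may differ from the actual headers):

```cpp
#include <deque>
#include "relearn.hpp"   // assumed header name

// hypothetical descriptors: a grid cell as state, a direction as action;
// descriptors normally also need equality and hashing support so that
// states and actions can be keyed inside the policy storage
struct grid      { unsigned int x = 0, y = 0; double R = 0; };
struct direction { unsigned int dir = 0; };

int main()
{
    using state  = relearn::state<grid>;        // 1. state wrapping the descriptor
    using action = relearn::action<direction>;  // 2. action wrapping the descriptor

    // 3. an episode is a sequence of state/action links
    std::deque<relearn::link<state, action>> episode;
    // ... populate the episode while exploring, then reward the
    //     last (terminal) state ...

    // 4./5. train a policy with one or more episodes
    relearn::policy<state, action> policies;
    relearn::q_learning<state, action> learner{0.9, 0.9};  // learning rate, discount
    learner(episode, policies);
}
```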

Take a look at the `gridworld` examples, which demonstrate two different ways of achieving the same task.
The `blackjack` example is different: because playing Blackjack is a *probability* task, we can't use
a *deterministic* approach; instead we use a *probabilistic* approach, in which case:

1. we have to take an offline approach (we don't know the transition from one state to another until we've experienced it)
2. we have to train on probabilities of transitioning (e.g., non-deterministic)
203+
137204
## Gridworld
138205

139206
![gridworld image](https://github.com/alexge233/relearn/blob/master/images/gridworld.png?raw=true)
@@ -158,14 +225,22 @@ And the online approach:
- the entire process is repeated until the goal is discovered.

The actual gridworld is saved in a text file `gridworld.txt` (feel free to change it).
The example `examples/gridworld_header.hpp` provides the minimal code to demonstrate this staged approach.
The two files:

- `examples/gridworld_offline.cpp`
- `examples/gridworld_online.cpp`

contain the different versions of how this task can be solved.

Once we have loaded the world (using the function `populate`) we set the start at x:1, y:8 and then
begin the exploration.

### offline q-learning

The offline exploration runs in an infinite loop until the grid block with a **positive** reward is found.
Until that happens, the agent takes a *stochastic* (e.g., random) approach and searches the gridworld.
The function (template parameter `S` is state, and `A` is action):

```cpp
template <typename S,
          typename A>
std::deque<relearn::link<S,A>> explore(const world & w,
                                       /* ...remaining parameters elided by the diff... */)
```

does the following:

- creates a new episode (e.g., a `relearn::markov_chain`)
- sets as root state the starting gridblock
- randomly picks a direction (see struct `rand_direction` for more)
- repeats this until either (a) a negative reward has been found (e.g., stepped into a fire block), or (b) the goal block is discovered
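
A condensed sketch of that loop follows; the `world`, `grid` and `direction` types and their members are simplified, hypothetical stand-ins for what the example headers actually provide:

```cpp
#include <deque>
#include <random>
#include "relearn.hpp"   // assumed header name

// Hypothetical, simplified stand-ins for the example's structures:
struct grid      { unsigned int x = 0, y = 0; double R = 0; };  // a grid block and its reward
struct direction { unsigned int dir = 0; };                     // N/E/S/W
struct world
{
    grid start() const;                                   // the starting block
    grid move(const grid & g, const direction & d) const; // the neighbouring block
    bool is_goal(const grid & g) const;                   // positive-reward block?
};

// Condensed sketch of the offline explore loop described above.
template <typename S, typename A>
std::deque<relearn::link<S, A>> explore(const world & w, std::mt19937 & gen)
{
    std::deque<relearn::link<S, A>> episode;              // a new episode (markov chain)
    std::uniform_int_distribution<unsigned int> pick(0, 3);
    grid curr = w.start();                                // root state: the starting grid block
    bool stop = false;
    while (!stop) {
        direction d{pick(gen)};                           // stochastic: a random direction
        episode.push_back({S(curr.R, curr), A(d)});       // record the state/action link
        curr = w.move(curr, d);                           // take the step
        // stop on a negative reward (a fire block) or on the goal block
        stop = (curr.R < 0 || w.is_goal(curr));
    }
    return episode;
}
```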
@@ -212,6 +287,25 @@ and unnecessarily searching the gridworld.
This is a __deterministic__ scenario, because the agent knows at any given moment which action it is taking,
and to __which__ state that action will lead.

### online q-learning

The online exploration is somewhat different, because the `explore` method does the following:

- creates a new episode
- sets the root state
- if a good policy (a Q-value higher than zero and a valid action pointer) exists, it follows it
- else, if no good policy (or action pointer) exists, it takes a random action
- repeats until a reward is found, then **trains** the policies with the latest episode
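
A sketch of that variant, reusing the stand-ins from the offline sketch above; the `best_action`, `best_value` and `trait` accessors are assumed names, not verbatim library API:

```cpp
// Online variant: follow the learnt policy when it has something useful,
// otherwise act randomly, and train after every episode.
template <typename S, typename A>
void explore_online(const world & w,
                    relearn::policy<S, A> & policies,
                    relearn::q_learning<S, A> & learner,
                    std::mt19937 & gen)
{
    std::deque<relearn::link<S, A>> episode;              // a new episode
    std::uniform_int_distribution<unsigned int> pick(0, 3);
    grid curr = w.start();                                // root state
    bool stop = false;
    while (!stop) {
        S state(curr.R, curr);
        auto best = policies.best_action(state);          // valid pointer if seen before
        // follow the policy only if it is "good" (Q-value above zero),
        // otherwise fall back to a random direction
        direction d = (best && policies.best_value(state) > 0)
                          ? best->trait()                 // assumed accessor for the descriptor
                          : direction{pick(gen)};
        episode.push_back({state, A(d)});
        curr = w.move(curr, d);
        stop = (curr.R < 0 || w.is_goal(curr));
    }
    learner(episode, policies);                           // train with the latest episode
}
```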

The difference here is that the actual exploration is instantly affected by what has already been learnt.
In comparison, the offline method is not affected, and may repeat the same sequences over and over again.
However, if the online version stops too early, there is no guarantee that the agent has learned the
ideal or optimal path to the goal; it could in fact be just a *mediocre* or *silly* path it has discovered.
This, of course, is also a problem with the offline approach, where the solution may never be discovered.

Other, more complex algorithms exist (e.g., e-greedy) where the agent may follow the policy,
but randomly choose to ignore it, in order to try and discover a better solution.
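
Such a selector is left to the application; a rough, generic sketch of the e-greedy idea (same assumed accessors and stand-ins as in the sketches above) might look like:

```cpp
// e-greedy selection: with probability epsilon ignore the policy and act
// randomly, otherwise follow the best known action.
template <typename S, typename A>
A select_action(const S & state,
                relearn::policy<S, A> & policies,
                double epsilon,
                std::mt19937 & gen)
{
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    std::uniform_int_distribution<unsigned int> pick(0, 3);
    auto best = policies.best_action(state);
    if (best && coin(gen) > epsilon) {
        return *best;                          // exploit: follow the policy
    }
    return A(direction{pick(gen)});            // explore: random action
}
```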

## Blackjack

A simplified attempt, where one player uses classic probabilities, the dealer (house) simply draws until 17,
@@ -224,9 +318,12 @@ as well as the label or symbol of the cards held (feel free to change this, simp
This example takes a lot of time to run, as the agent maps the transitional probabilities,
using the observations from playing multiple games.
The header file `examples/blackjack_header.hpp` contains the simple structures and methods needed to play blackjack,
whereas the source file `examples/blackjack.cpp` has the high-level logic behind it.
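Mapping transitional probabilities from observations is essentially frequency counting; a generic, self-contained sketch of the idea (not the example's actual code, with placeholder key types) is:

```cpp
#include <cstddef>
#include <map>
#include <utility>

// Frequency-count estimator for transition probabilities:
// P(next | current, action) ~ count(current, action, next) / count(current, action).
// `StateId` and `ActionId` are placeholder key types (e.g., summaries of the
// blackjack hands held); this is not the example's actual code.
template <typename StateId, typename ActionId>
struct transition_counter
{
    std::map<std::pair<StateId, ActionId>, std::map<StateId, std::size_t>> counts;

    void observe(const StateId & s, const ActionId & a, const StateId & next)
    {
        counts[{s, a}][next] += 1;             // record one observed transition
    }

    double probability(const StateId & s, const ActionId & a, const StateId & next) const
    {
        auto it = counts.find({s, a});
        if (it == counts.end()) return 0.0;
        std::size_t total = 0;
        for (const auto & kv : it->second) total += kv.second;
        auto jt = it->second.find(next);
        return (jt == it->second.end() || total == 0)
                   ? 0.0
                   : static_cast<double>(jt->second) / static_cast<double>(total);
    }
};
```
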
# TODO

1. do the R-Learning continuous algorithm
2. add eligibility traces (decay)

[1]: Sutton, R.S. and Barto, A.G., 1998. Reinforcement learning: An introduction (Vol. 1, No. 1). Cambridge: MIT press
