@@ -126,14 +126,81 @@ cmake ..
make
```

+ ### build examples
+
+ Simply do the following:
+
+ ```bash
+ cmake .. -DBUILD_EXAMPLES=On
+ make
+ ```
+
+ ### build tests
+
+ ```bash
+ cmake .. -DBUILD_TESTS=On
+ ```
+
+ ### enable serialization
+
+ In order to enable serialization, you need boost and boost-serialization.
+ See your distro's documentation on how to install them.
+ To enable it, pass a cmake flag:
+
+ ```bash
+ cmake .. -DUSING_BOOST_SERIALIZATION=On
+ ```
+
+ For example, if you want to build the examples **with** serialization:
+
+ ```bash
+ cmake .. -DUSING_BOOST_SERIALIZATION=On -DBUILD_EXAMPLES=On
+ ```
+
+ You can also set this flag for your **own** project, if you wish to save and load
+ policies, states or actions.
+ Do bear in mind that the `state_trait` (i.e., your state **descriptor**) and the
+ `action_trait` (i.e., your action **descriptor**) must **also be serializable**.
+ On how to achieve this, [have a look at this tutorial](http://www.boost.org/doc/libs/1_64_0/libs/serialization/doc/tutorial.html);
+ if this condition is not met, you will end up with compilation errors.
+
+ Because of the flexibility of boost serialization, you can save and load binary, text or XML archives.
+ Later versions of boost support smart pointers, so even if your descriptors are
+ `std::shared_ptr` or `std::unique_ptr` you can still save and load them.
+
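+ As a rough illustration of that requirement, here is a hedged sketch (not code from this repository) of a
+ boost-serializable descriptor being saved to and loaded from a text archive; the `grid_cell` struct and the
+ `cell.txt` file name are made up for this snippet, and the same archive mechanics apply to whatever you
+ build with the flag above:
+
+ ```cpp
+ #include <fstream>
+ #include <boost/archive/text_oarchive.hpp>
+ #include <boost/archive/text_iarchive.hpp>
+
+ // hypothetical state descriptor (not the one used in the examples);
+ // boost serialization only needs a serialize() member or free function
+ struct grid_cell
+ {
+     unsigned int x = 0;
+     unsigned int y = 0;
+     double reward = 0;
+
+     template <class Archive>
+     void serialize(Archive & ar, const unsigned int /*version*/)
+     {
+         ar & x;
+         ar & y;
+         ar & reward;
+     }
+ };
+
+ int main()
+ {
+     grid_cell cell;
+     cell.x = 1;
+     cell.y = 8;
+     cell.reward = 1.0;
+     {
+         std::ofstream ofs("cell.txt");
+         boost::archive::text_oarchive oa(ofs);
+         oa << cell;                   // save (binary or xml archives work the same way)
+     }
+     grid_cell loaded;
+     {
+         std::ifstream ifs("cell.txt");
+         boost::archive::text_iarchive ia(ifs);
+         ia >> loaded;                 // load
+     }
+     return 0;
+ }
+ ```
+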
# Examples
There is a folder `examples` which I'm populating with examples, starting from your typical *gridworld* problem,
and then moving on to a *blackjack* program.
- Currently there is a classical "Gridworld" example, with two versions :
+ Currently there are two "Gridworld" examples:
- an offline on-policy algorithm: `examples/gridworld_offline.cpp` built as `ex_gridworld_offline`
- an online on-policy algorithm: `examples/gridworld_online.cpp` built as `ex_gridworld_online`

+ ## basic usage
+
+ The basic way of using the library is the following:
+
+ 1. create a class **state**, or use an existing class, structure, or PDT which describes (in a *Markovian* sense) your state
+ 2. create a class **action**, or use an existing class, structure, or PDT which describes the action
+ 3. create an *episode*, which by default is an `std::deque<relearn::link<state,action>>`, which you populate and then reward
+
+ At this point, depending on whether you are using an **online** or **offline** algorithm/approach, you have the following options:
+
+ 4. keep creating episodes, obtain a reward for the last/terminal state, and once you have finished, train the policy with all of them,
+ **or**
+ 5. every time you create an episode, obtain the reward, and then train your policy with it.
+
+ That choice is up to you, and almost always depends on the domain, system or problem you're trying to solve.
+ It is for this reason that there is no implementation of `on_policy`, `off_policy` or `e_greedy`:
+ those are very simple algorithms, and they are application-specific.
+
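+ To make those steps a little more concrete, here is a rough sketch of the offline flavour of this loop.
+ It is **not** lifted from the examples: the `grid` and `direction` descriptors are invented, and the
+ `relearn::policy` / `relearn::q_learning` names and constructor parameters follow the gridworld examples,
+ so double-check them against the headers before relying on this:
+
+ ```cpp
+ #include <deque>
+ #include <functional>
+ #include "relearn.hpp" // adjust the include path to wherever relearn.hpp lives in your tree
+
+ // invented descriptors (steps 1 and 2); the library expects them to be
+ // equality-comparable and hashable - see examples/gridworld_header.hpp for real ones
+ struct grid
+ {
+     int x = 0;
+     int y = 0;
+     bool operator==(const grid & rhs) const { return x == rhs.x && y == rhs.y; }
+ };
+ struct direction
+ {
+     int d = 0; // 0:north, 1:east, 2:south, 3:west
+     bool operator==(const direction & rhs) const { return d == rhs.d; }
+ };
+ namespace std {
+ template <> struct hash<grid>
+ {
+     size_t operator()(const grid & g) const { return hash<int>()(g.x) ^ (hash<int>()(g.y) << 1); }
+ };
+ template <> struct hash<direction>
+ {
+     size_t operator()(const direction & d) const { return hash<int>()(d.d); }
+ };
+ }
+
+ using state  = relearn::state<grid>;
+ using action = relearn::action<direction>;
+ using link   = relearn::link<state, action>;
+
+ int main()
+ {
+     // step 3: an episode is a deque of state/action links, populated while exploring;
+     // the exact state/action/link constructors are an assumption here, e.g.:
+     // episode.push_back(link{state{grid{1, 8}}, action{direction{1}}});
+     std::deque<link> episode;
+
+     // steps 4/5: once the terminal state has been rewarded, train a policy with the episode(s)
+     relearn::policy<state, action> policies;
+     relearn::q_learning<state, action> learner{0.9, 0.9}; // assumed {learning rate, discount}
+     for (int i = 0; i < 10; i++) {
+         learner(episode, policies);
+     }
+     return 0;
+ }
+ ```
+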
+ Take a look at the `gridworld` examples, which demonstrate two different ways of achieving the same task.
+ The `blackjack` example is different: because playing Blackjack is a *probability* task, we can't use
+ a *deterministic* approach; instead we use a *probabilistic* approach, in which case
+
+ 1. we have to take an offline approach (we don't know the transition from one state to another until we've experienced it)
+ 2. we have to train on probabilities of transitioning (i.e., non-deterministic)
+
## Gridworld

![gridworld image](https://github.com/alexge233/relearn/blob/master/images/gridworld.png?raw=true)
@@ -158,14 +225,22 @@ And the online approach:
- the entire process is repeated until the goal is discovered.

The actual gridworld is saved in a text file `gridworld.txt` (feel free to change it).
- The example `src/gridworld.cpp` provides the minimal code to demonstrate this staged approach.
+ The example `examples/gridworld_header.hpp` provides the minimal code to demonstrate this staged approach.
+ The two files:
+
+ - `examples/gridworld_offline.cpp`
+ - `examples/gridworld_online.cpp`
+
+ contain the different versions of how this task can be solved.

Once we have loaded the world (using the function `populate`), we set the start at x:1, y:8 and then
begin the exploration.

- The exploration runs in an inifinite until the grid block with a ** positive** reward is found.
+ ### offline q-learning
+
+ The offline exploration runs in an infinite loop until the grid block with a **positive** reward is found.
Until that happens, the agent takes a *stochastic* (i.e., random) approach and searches the gridworld.
- The function:
+ The function (template parameter `S` is the state, and `A` is the action):

```cpp
template <typename S,
@@ -178,7 +253,7 @@ std::deque<relearn::link<S,A>> explore(const world & w,
does the following:

- creates a new episode (e.g., `relearn::markov_chain`)
- - sets as root state the starting gridblock x:1, y:8
+ - sets as root state the starting gridblock
- randomly picks a direction (see struct `rand_direction` for more)
- repeats this until either (a) a negative reward has been found (e.g., stepped into a fire block), or (b) the goal block is discovered
@@ -212,6 +287,25 @@ and unnecessarily searching the gridworld.
This is a __deterministic__ scenario, because the agent knows at any given moment which action he is taking,
and __which__ state that action will lead to.

+ ### online q-learning
+
+ The online exploration is somewhat different, because the `explore` method does the following:
+
+ - creates a new episode
+ - sets the root state
+ - if a good policy (a Q-value higher than zero and a valid action pointer) exists, it follows it
+ - else, if no good policy (or action pointer) exists, it takes a random action
+ - repeats until a reward is found, and then **trains** the policies with the latest episode
+
+ The difference here is that the actual exploration is instantly affected by what has already been learnt.
+ In comparison, the offline method is not affected, and may repeat the same sequences over and over again.
+ However, if the online version stops too early, there is no guarantee that the agent has learned the
+ ideal or optimal path to the goal; it could in fact be just a *mediocre* or *silly* path it has discovered.
+ This, of course, is also a problem with the offline version, where the solution may never be discovered.
+
+ Other, more complex algorithms exist (e.g., e-greedy) where the agent may follow the policy,
+ but randomly choose to ignore it, in order to try and discover a better solution.
+
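+ As a hedged illustration of that idea (nothing like this ships with the library, by design), a minimal
+ e-greedy chooser could look like the sketch below; the `best_action()` and `best_value()` calls mirror
+ the bullet list above and are assumptions, so adapt them to the actual policy interface:
+
+ ```cpp
+ #include <random>
+
+ // Generic epsilon-greedy selection: with probability epsilon take a random
+ // action, otherwise follow the best known action for the current state.
+ // `Policy` is assumed to expose best_action(state) returning a (possibly null)
+ // action pointer, and best_value(state) returning its Q-value.
+ template <typename Action, typename State, typename Policy, typename RandomAction>
+ Action e_greedy(const State & s_now,
+                 Policy & policies,
+                 RandomAction && random_action,
+                 double epsilon = 0.1)
+ {
+     static thread_local std::mt19937 gen{std::random_device{}()};
+     std::uniform_real_distribution<double> coin(0.0, 1.0);
+
+     auto best = policies.best_action(s_now);    // assumed call
+     if (coin(gen) < epsilon || !best || policies.best_value(s_now) <= 0) {
+         return random_action();                 // explore
+     }
+     return *best;                               // exploit (follow the policy)
+ }
+ ```
+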
## Blackjack
A simplified attempt, where one player uses classic probabilities, the dealer (house) simply draws until 17,
@@ -224,9 +318,12 @@ as well as the label or symbol of the cards held (feel free to change this, simp
This example takes a lot of time to run, as the agent maps the transitional probabilities,
using the observations from playing multiple games.

- ## TODO
+ The header file `examples/blackjack_header.hpp` contains the simple structures and methods needed to play blackjack,
+ whereas the source file `examples/blackjack.cpp` has the high-level logic behind it.
+
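+ As a small, hedged illustration of the house rule mentioned above, the "draw until 17" behaviour could be
+ sketched as follows; `hand_value` and `dealer_plays` are made up for this snippet (they are not the helpers
+ in `examples/blackjack_header.hpp`), and the ace's dual value is ignored for brevity:
+
+ ```cpp
+ #include <numeric>
+ #include <vector>
+
+ // hypothetical helper: treat a hand as the plain numeric values of the cards held
+ inline unsigned int hand_value(const std::vector<unsigned int> & hand)
+ {
+     return std::accumulate(hand.begin(), hand.end(), 0u);
+ }
+
+ // the house policy described above: keep drawing while the hand is below 17
+ template <typename DrawCard>
+ void dealer_plays(std::vector<unsigned int> & hand, DrawCard && draw_card)
+ {
+     while (hand_value(hand) < 17) {
+         hand.push_back(draw_card());
+     }
+ }
+ ```
+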
+ # TODO

- 1. implement the `boost_serialization` with internal header
- 2. do the R-Learning continous algorithm
+ 1. do the R-Learning continuous algorithm
+ 2. add eligibility traces (decay)

[1]: Sutton, R.S. and Barto, A.G., 1998. Reinforcement learning: An introduction (Vol. 1, No. 1). Cambridge: MIT press