---
engine: julia
---
# Building pipelines
```{julia}
#| echo: false
#| output: false
import Pkg
Pkg.activate(".")
using GeoStats
```
In previous chapters, we learned a large number of transforms for
manipulating and processing geotables. In all those code examples,
we used Julia's pipe operator `|>` to apply the transform and send
the resulting geotable to the next transform:
```julia
geotable |> transform1 |> transform2 |> ... |> viewer
```
In this chapter, we will learn two new powerful operators `→` and `⊔` provided
by the framework to combine transforms into **pipelines** that can be optimized
and reused with different geotables.
## Motivation
The pipe operator `|>` in Julia is very convenient for sequential application of functions.
Given an input `x`, we can type `x |> f1 |> f2` to apply functions `f1` and `f2` in sequence,
in a way that is equivalent to `f2(f1(x))` or, alternatively, to the function composition
`(f2 ∘ f1)(x)`.
Its syntax can drastically improve code readability when the number of functions is large.
However, the operator has a major limitation in the context of geospatial data science:
it evaluates all intermediate results as soon as the data is inserted in the pipe.
This is known in computer science as [eager evaluation](https://en.wikipedia.org/wiki/Evaluation_strategy).
Taking the expression above as an example, the operator will first evaluate `f1(x)` and store
the result in a variable `y`. After `f1` is completed, the operator evaluates `f2(y)` and
produces the final (desired) result. If `y` requires a lot of computer memory, as is usually
the case with large geotables, the application of the pipeline will be slow.
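As a minimal sketch of this eager behavior in plain Julia, consider two hypothetical functions `f1` and `f2` (illustrative names, not part of the framework). The pipe gives the same result as nested calls and function composition, and the intermediate array `y` is fully materialized before `f2` runs:

```julia
# Hypothetical element-wise functions, used only for illustration
f1(x) = x .+ 1
f2(y) = 2 .* y

x = [1, 2, 3]

# What the pipe does step by step: the intermediate result `y`
# is fully computed and stored before `f2` starts
y = f1(x)
z = f2(y)

# Three equivalent spellings of the same computation
piped    = x |> f1 |> f2
nested   = f2(f1(x))
composed = (f2 ∘ f1)(x)
```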
Another evaluation strategy, known as **lazy evaluation**, consists of building the entire
pipeline without the data in it. The major advantage of this strategy is that it can analyze
the functions, and potentially simplify the code before evaluation. For example, the pipeline
`cos → acos` can be replaced by the much simpler pipeline `identity` when the input `x` lies
in the interval [0, π].
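The idea can be sketched in plain Julia with Base's composition operator `∘` standing in for `→`. This is an assumption made only for illustration, since `→` is defined by the framework for transforms and not for ordinary functions. The pipeline is built before any data exists, and it matches `identity` only for some inputs:

```julia
# Build the pipeline first, with no data in it.
# `∘` composes right to left, so this applies `cos` and then `acos`.
lazy = acos ∘ cos

# For an input in [0, π] the pipeline returns the input unchanged
inside = lazy(0.5)

# For an input outside that interval it does not:
# acos(cos(4.0)) equals 2π - 4.0
outside = lazy(4.0)
```

A simplification of this kind is therefore only valid when something is known about the inputs, which is why it has to be decided case by case.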
## Operator →
In our framework, the operator `→` (`\to`) can be used in place of the pipe operator to build
lazy **sequential pipelines** of transforms. Consider the synthetic data from previous chapters:
```{julia}
N = 10000
a = [2randn(N÷2) .+ 6; randn(N÷2)]
b = [3randn(N÷2); 2randn(N÷2)]
c = randn(N)
d = c .+ 0.6randn(N)
table = (; a, b, c, d)
gtb = georef(table, CartesianGrid(100, 100))
```
And suppose that we are interested in converting the columns "a", "b" and "c" of the geotable with
the `Quantile` transform. Instead of creating the intermediate geotable with the `Select` transform,
and then sending the result to the `Quantile` transform, we can create the entire pipeline without
reference to the data:
```{julia}
pipeline = Select("a", "b", "c") → Quantile()
```
The operator `→` creates a special `SequentialTransform`, which can be applied like any other
transform in the framework:
```{julia}
gtb |> pipeline
```
It will perform optimizations whenever possible. For instance, we know a priori that adding the
`Identity` transform anywhere in the pipeline doesn't have any effect:
```{julia}
pipeline → Identity()
```
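To give an intuition for why a lazy pipeline can be simplified, here is a rough sketch in plain Julia. It is not the framework's implementation: the vector `steps` and the helper `run_steps` are hypothetical. Because the steps are stored before any data arrives, they can be inspected and the no-op steps dropped:

```julia
# A pipeline stored as a list of steps, with no data attached yet
steps = Function[abs, identity, sqrt, identity]

# Hypothetical helper: feed the output of each step into the next one
run_steps(steps, x) = foldl((acc, f) -> f(acc), steps; init=x)

# Simplification before evaluation: drop every `identity` step
simplified = filter(f -> f !== identity, steps)

# Both versions produce the same result on the same input
full_result = run_steps(steps, -9.0)
fast_result = run_steps(simplified, -9.0)
```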
## Operator ⊔
The operator `⊔` (`\sqcup`) can be used to create lazy **parallel transforms**. There is no
equivalent in Julia as this operator is very specific to tables. It combines the geotables
produced by two or more pipelines into a single geotable with the disjoint union of all
columns.
Let's illustrate this concept with two pipelines:
```{julia}
pipeline1 = Select("a") → Indicator("a", k=3)
```
```{julia}
pipeline2 = Select("b", "c", "d") → PCA(maxdim=2)
```
The first pipeline creates 3 indicator variables from variable "a":
```{julia}
gtb |> pipeline1
```
The second pipeline runs principal component analysis with variables "b", "c" and "d"
and produces 2 principal components:
```{julia}
gtb |> pipeline2
```
We can combine the two pipelines into a single pipeline that executes in parallel:
```{julia}
pipeline = pipeline1 ⊔ pipeline2
```
```{julia}
gtb |> pipeline
```
All 5 columns are present in the final geotable.
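The column-wise union can be pictured with a small sketch in plain Julia, using a `NamedTuple` of vectors as a table without geometry. The names `branch1`, `branch2` and `parallel` are hypothetical, and this sketch is not how the framework implements `⊔`:

```julia
# A column table as a NamedTuple of vectors
tbl = (a = [1.0, 2.0, 3.0], b = [4.0, 5.0, 6.0])

# Two hypothetical branches, each producing its own columns
branch1(t) = (a_shifted = t.a .- 2.0,)
branch2(t) = (b_doubled = 2 .* t.b,)

# Hypothetical helper: apply every branch to the same input and
# merge the resulting columns into a single table
parallel(branches...) = t -> merge((f(t) for f in branches)...)

combined = parallel(branch1, branch2)
result = combined(tbl)
```

Note that `merge` keeps the last column when two branches produce the same name, so this sketch only mirrors the disjoint case described above.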
## Revertibility
An important concept related to pipelines that is very useful in
geospatial data science is **revertibility**. The concept is useful
whenever we need to answer geoscientific questions in terms of
variables that have been transformed for geostatistical analysis.
Let's illustrate the concept with the following geotable and pipeline:
```{julia}
a = [-1.0, 4.0, 1.6, 3.4]
b = [1.6, 3.4, -1.0, 4.0]
c = [3.4, 2.0, 3.6, -1.0]
table = (; a, b, c)
gtb = georef(table, [(0, 0), (1, 0), (1, 1), (0, 1)])
```
```{julia}
pipeline = Center()
```
We saw that our pipelines can be evaluated with Julia's pipe operator:
```{julia}
gtb |> pipeline
```
To revert a pipeline, however, we need to save the auxiliary constants that
were used to transform the data (e.g., mean of selected columns). The `apply` function
serves this purpose:
```{julia}
newgtb, cache = apply(pipeline, gtb)
newgtb
```
The function produces the new geotable as usual and an additional `cache`
with all the information needed to revert the transforms in the pipeline.
We say that a pipeline `isrevertible` if there is an efficient way to
revert its transforms starting from any geotable that has the same schema
as the geotable produced by the `apply` function:
```{julia}
isrevertible(pipeline)
```
```{julia}
revert(pipeline, newgtb, cache)
```
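The role of the cache can be illustrated on a single column with plain Julia. The functions `center_apply` and `center_revert` below are hypothetical stand-ins for `apply` and `revert`, written only to show which constant must be remembered. They are not the framework's code:

```julia
using Statistics: mean

# Hypothetical stand-in for `apply`: returns the new values and the cache
center_apply(x) = (x .- mean(x), mean(x))

# Hypothetical stand-in for `revert`: uses the cache to undo the centering
center_revert(y, cache) = y .+ cache

col = [-1.0, 4.0, 1.6, 3.4]   # values of column "a" in the geotable above

newcol, μ = center_apply(col)
restored = center_revert(newcol, μ)
```

Without the saved mean `μ`, the centered values alone would not be enough to recover the original column.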
A very common workflow in geospatial data science consists of:
1. Transforming the data to an appropriate sample space for geostatistical analysis
2. Doing additional modeling to predict variables in new geospatial locations
3. Reverting the modeling results with the saved pipeline and cache
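The three steps can be sketched with toy numbers in plain Julia. All values and weights below are made up, and the "model" is just a weighted average chosen for illustration. The point is that the value being reverted is a new prediction, not the data that was originally transformed:

```julia
# Observed values at two known locations (hypothetical numbers)
observed = [10.0, 14.0]

# 1. Transform: center the data and keep the mean as cache
μ = sum(observed) / length(observed)
centered = observed .- μ

# 2. Model: a toy predictor in the transformed space, here a
#    weighted average of the two observations (made-up weights)
weights = [0.25, 0.75]
prediction_centered = sum(weights .* centered)

# 3. Revert: bring the prediction back to the original units
prediction = prediction_centered + μ
```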
We will see examples of this workflow in **Part V** of the book.
## Congratulations!
Congratulations on finishing **Part II** of the book. Let’s quickly review what
we have learned so far:
- Transforms and pipelines are powerful tools to achieve **reproducible**
geospatial data science.
- The operators `→` and `⊔` can be used to build **lazy pipelines**. After
a pipeline is built, it can be applied to different geotables, which may
have different types of geospatial domain.
- Lazy pipelines can be optimized for **computational performance**,
and the Julia language strives to dispatch the appropriate
optimizations when they are available.
- Map projections are specific types of coordinate transforms. They can be
combined with many other transforms in the framework to produce advanced
geostatistical visualizations.
There is a long journey until the technology reaches its full potential.
The good news is that Julia code is easy to read and modify, and
you can become an active contributor after just a few weeks working with
the language. We invite you to contribute new transforms and optimizations
as soon as you feel comfortable with the framework.