-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy path01-software_for_modeling.Rmd
116 lines (88 loc) · 6.6 KB
/
01-software_for_modeling.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
# Software para modelado
**Objetivos de aprendizaje:**
- **Reconocer los principios** alrededor del diseño de los paquetes en `{tidymodels}`.
- Clasificar modelos en **descriptivos, inferenciales,** y/o **predictivos**
- Definir un **modelo descriptivo.**
- Definir un **modelo inferencial.**
- Definir un **model predictivo.**
- Diferenciar entre modelos **supervisados** y **no supervisados**.
- Diferenciar entre modelos de **regresión** y **clasificación**.
- Diferenciar entre datos **cuantitativos** y **cualitativos**.
- Entender los **roles que los datos pueden tener** en un análisis.
- Aplicar el **proceso de ciencia de datos**.
- Reconocer las **fases de modelado**.
>The utility of a model hinges on its ability to be *reductive*. The primary influences in the data can be captured mathematically in a useful way, such as in a relationship that can be expressed as an equation.
<blockquote> <img src="https://www.tmwr.org/images/robot.png" class="robot"> There are two reasons that models permeate our lives today: an abundance of software exists to create models and it has become easier to record data and make it accessible. </blockquote>
## The pit of success
`{tidymodels}` aims to help us fall into the Pit of Success:
> The Pit of Success: in stark contrast to a summit, a peak, or a journey across a desert to find victory through many trials and surprises, we want our customers to simply fall into winning practices by using our platform and frameworks.
- **Avoid confusion:** Software should facilitate proper usage.
- **Avoid mistakes:** Software should make it easy for users to do the right thing.
Examples of creating a pit of success (discussed in more details later)
- internal consistency
- sensible defaults
- fail with meaningful error messages rather than silently producing incorrect results
## Types of models
- **Descriptive models:** Describe or illustrate characteristics of data.
- **Inferential models:** Make some statement of truth regarding a predefined conjecture or idea.
- Inferential techniques typically produce some type of probabilistic output, such as a p-value, confidence interval, or posterior probability.
- Usually delayed feedback between inference and actual result.
- **Predictive models:** Produce the most accurate possible prediction for new data. *Estimation* ("How much?") rather than *inference* ("Will it?").
- **Mechanistic models** are derived using first principles to produce a model equation that is dependent on assumptions.
- Depend on the assumptions that define their model equations.
- Unlike inferential models, it is easy to make data-driven statements about how well the model performs based on how well it predicts the existing data
- **Empirically driven models** have more vague assumptions, and are derived directly from the data.
- No theoretical or probabilistic assumptions are made about the equations or the variables
- The primary method of evaluating the appropriateness of the model is to assess its accuracy using existing data
<sub>1. Broader discussions of these distinctions can be found in Breiman ([2001b](https://www.tmwr.org/software-modeling.html#ref-breiman2001)) and Shmueli ([2010](https://www.tmwr.org/software-modeling.html#ref-shmueli2010))</sub>
## Terminology
- **Unsupervised models** are used to understand relationships between variables or sets of variables without an explicit relationship between variables and an outcome.
- Examples: PCA, clustering, autoencoders.
- **Supervised models** have an outcome variable.
- Examples: linear regression, decision trees, neural networks.
- **Regression:** numeric outcome
- **Classification:** ordered or unordered qualitative values.
- **Quantitative** data: numbers.
- **Qualitative** (nominal) data: non-numbers.
- *Qualitative data still might be coded as numbers, e.g. one-hot encoding or dummy variable encoding*
- Data can have different roles in analyses:
- **Outcomes** (labels, endpoints, dependent variables): the value being predicted in supervised models.
- **Predictors** (independent variables): the variables used to predict the outcome.
- Identifiers
Choosing a model type will depend on the type of question we want to answer / problem to solve and on the available data, among other things.
## The data analysis process
1. Cleaning the data: investigate the data to make sure that they are applicable to the project goals, accurate, and appropriate
2. Understanding the data: often referred to as exploratory data analysis (EDA). EDA brings to light how the different variables are related to one another, their distributions, typical ranges, and other attributes.
- "How did I come by *these* data?"
- "Is the data *relevant*?"
3. Develop clear expectations of the goal of your model and how performance will be judged ([Chapter 9](https://www.tmwr.org/performance.html))
- "What is/are the *performance metrics or realistic goal/s* of what can be achieved?"
::: {style="text-align:center;"}

:::
## The modeling process
::: {style="text-align:center;"}

:::
- **Exploratory data analysis:** Explore the data to see what they might tell you. (See previous)
- **Feature engineering:** Create specific model terms that make it easier to accurately model the observed data. Covered in [Chapter 6](https://www.tmwr.org/recipes.html#recipes).
- **Model tuning and selection:** Generate a variety of models and compare performance.
- Some models require **hyperparameter tuning**
- **Model evaluation:** Use EDA-like analyses and compare model performance metrics to choose the best model for your situation.
The final model may be used for a conclusion and/or produce predictions on new data.
## Videos de las reuniones
### Cohorte 1
`r knitr::include_url("https://www.youtube.com/embed/O4cJiSH0zIc")`
<details>
<summary> Chat de la reunión </summary>
```
00:08:44 Armando Ocampo: ok, de acuerdo
00:21:35 Roberto Villegas-Diaz: GitHub: https://github.com/r4ds/bookclub-tmwr_es
00:22:52 Roberto Villegas-Diaz: Hoja de Excel para presentar: https://docs.google.com/spreadsheets/d/1apDY5yyimVUwebhZvTwM3P7Pysa9ztZvj4EUF_KFVdw/edit#gid=0
00:26:35 Armando Ocampo: yo estoy bien en esta hora
00:27:49 Armando Guzmán Mx: No tengo problema con el horario que acuerden.
00:44:50 Armando Ocampo: la hoja me pide permiso de acceso
00:46:01 Cecilia Cruz: Perdón, soy esmeralda
00:47:33 Armando Guzmán Mx: 👋🏼
```
</details>