01-software_for_modeling.Rmd

# Software para modelado

**Objetivos de aprendizaje:**

- **Reconocer los principios** alrededor del diseño de los paquetes en `{tidymodels}`.
- Clasificar modelos en **descriptivos, inferenciales,** y/o **predictivos**
- Definir un **modelo descriptivo.**
- Definir un **modelo inferencial.**
- Definir un **model predictivo.**
- Diferenciar entre modelos **supervisados** y **no supervisados**.
- Diferenciar entre modelos de **regresión** y **clasificación**.
- Diferenciar entre datos **cuantitativos** y **cualitativos**.
- Entender los **roles que los datos pueden tener** en un análisis.
- Aplicar el **proceso de ciencia de datos**.
- Reconocer las **fases de modelado**.

>The utility of a model hinges on its ability to be *reductive*. The primary influences in the data can be captured mathematically in a useful way, such as in a relationship that can be expressed as an equation.

<blockquote> <img src="https://www.tmwr.org/images/robot.png" class="robot"> There are two reasons that models permeate our lives today: an abundance of software exists to create models and it has become easier to record data and make it accessible. </blockquote>

## The pit of success

`{tidymodels}` aims to help us fall into the Pit of Success:

> The Pit of Success: in stark contrast to a summit, a peak, or a journey across a desert to find victory through many trials and surprises, we want our customers to simply fall into winning practices by using our platform and frameworks.

- **Avoid confusion:** Software should facilitate proper usage.
- **Avoid mistakes:** Software should make it easy for users to do the right thing.

Examples of creating a pit of success (discussed in more details later)

-   internal consistency
-   sensible defaults
-   fail with meaningful error messages rather than silently producing incorrect results

## Types of models

- **Descriptive models:** Describe or illustrate characteristics of data.
- **Inferential models:** Make some statement of truth regarding a predefined conjecture or idea.
  - Inferential techniques typically produce some type of probabilistic output, such as a p-value, confidence interval, or posterior probability.
  - Usually delayed feedback between inference and actual result.
- **Predictive models:** Produce the most accurate possible prediction for new data. *Estimation* ("How much?") rather than *inference* ("Will it?").
  - **Mechanistic models** are derived using first principles to produce a model equation that is dependent on assumptions.
    - Depend on the assumptions that define their model equations.
    - Unlike inferential models, it is easy to make data-driven statements about how well the model performs based on how well it predicts the existing data
  - **Empirically driven models** have more vague assumptions, and are derived directly from the data.
    - No theoretical or probabilistic assumptions are made about the equations or the variables
    - The primary method of evaluating the appropriateness of the model is to assess its accuracy using existing data

<sub>1. Broader discussions of these distinctions can be found in Breiman ([2001b](https://www.tmwr.org/software-modeling.html#ref-breiman2001)) and Shmueli ([2010](https://www.tmwr.org/software-modeling.html#ref-shmueli2010))</sub>

## Terminology

-   **Unsupervised models** are used to understand relationships between variables or sets of variables without an explicit relationship between variables and an outcome.
    -   Examples: PCA, clustering, autoencoders.
-   **Supervised models** have an outcome variable.
    -   Examples: linear regression, decision trees, neural networks.
    -   **Regression:** numeric outcome
    -   **Classification:** ordered or unordered qualitative values.
-   **Quantitative** data: numbers.
-   **Qualitative** (nominal) data: non-numbers.
    -   *Qualitative data still might be coded as numbers, e.g. one-hot encoding or dummy variable encoding*
-   Data can have different roles in analyses:
    -   **Outcomes** (labels, endpoints, dependent variables): the value being predicted in supervised models.
    -   **Predictors** (independent variables): the variables used to predict the outcome.
    -   Identifiers

Choosing a model type will depend on the type of question we want to answer / problem to solve and on the available data, among other things.

## The data analysis process

1.  Cleaning the data: investigate the data to make sure that they are applicable to the project goals, accurate, and appropriate
2.  Understanding the data: often referred to as exploratory data analysis (EDA). EDA brings to light how the different variables are related to one another, their distributions, typical ranges, and other attributes.
    -   "How did I come by *these* data?"
    -   "Is the data *relevant*?"
3.  Develop clear expectations of the goal of your model and how performance will be judged ([Chapter 9](https://www.tmwr.org/performance.html))
    -   "What is/are the *performance metrics or realistic goal/s* of what can be achieved?"

::: {style="text-align:center;"}
![The data science process (from R for Data Science by Wickham and Grolemund.](https://www.tmwr.org/premade/data-science-model.svg)
:::

## The modeling process

::: {style="text-align:center;"}
![The modeling process.](https://www.tmwr.org/premade/modeling-process.svg)
:::

-   **Exploratory data analysis:** Explore the data to see what they might tell you. (See previous)
-   **Feature engineering:** Create specific model terms that make it easier to accurately model the observed data. Covered in [Chapter 6](https://www.tmwr.org/recipes.html#recipes).
-   **Model tuning and selection:** Generate a variety of models and compare performance.
    -   Some models require **hyperparameter tuning**
-   **Model evaluation:** Use EDA-like analyses and compare model performance metrics to choose the best model for your situation.

The final model may be used for a conclusion and/or produce predictions on new data.

## Videos de las reuniones

### Cohorte 1

`r knitr::include_url("https://www.youtube.com/embed/O4cJiSH0zIc")`

<details>
  <summary> Chat de la reunión </summary>
  
```
00:08:44	Armando Ocampo:	ok, de acuerdo
00:21:35	Roberto Villegas-Diaz:	GitHub: https://github.com/r4ds/bookclub-tmwr_es
00:22:52	Roberto Villegas-Diaz:	Hoja de Excel para presentar: https://docs.google.com/spreadsheets/d/1apDY5yyimVUwebhZvTwM3P7Pysa9ztZvj4EUF_KFVdw/edit#gid=0
00:26:35	Armando Ocampo:	yo estoy bien en esta hora
00:27:49	Armando Guzmán Mx:	No tengo problema con el horario que acuerden.
00:44:50	Armando Ocampo:	la hoja me pide permiso de acceso
00:46:01	Cecilia Cruz:	Perdón, soy esmeralda 
00:47:33	Armando Guzmán Mx:	👋🏼
```
</details>