-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathideas.txt
61 lines (52 loc) · 2.6 KB
/
ideas.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
move hardcoded values to config.py
validate elo calculation
broaden tournament and player scope
gridsearchcv for xgb params. be sure to explore the ranges of params in the grid..
Input:
round 1:
param_grid = {
'max_depth': [3,4,5],
'learning_rate': [0.1,0.01,0.05],
'gamma': [0, 0.25, 1.0],
'reg_lambda': [0, 1.0, 10.0],
'scale_pos_weight': [1,3,5]
}
Output:
max_depth: 4, ,learning_rate: 0.1, gamma:0.25, reg_lambda: 10, scale_pos_weight: 3
Note how max_depth, gamma, and scale_pos_weight were in the middle of the range.
learning rate and reg_lambda were at the extremes, so
we should run a second round of gridsearchcv with more extreme values to explore the ranges.
round 2:
param_grid = {
'max_depth': [4],
'learning_rate': [0.1,0.5,1],
'gamma': [0.25],
'reg_lambda': [10.0, 20, 100],
'scale_pos_weight': [3]
}
Output:
max_depth: 4, learning_rate: 0.1, reg_lambda: 10
Note: to speed up cross-validation, and to prevent overfitting,
use a random subset of the data (e.g. 90%) and only use random subset of features (columns) e.g. (50% per tree)
subsample: 0.9, colsample_bytree: 0.5
n_estimators: build 1 tree to begin with, so we can get a sense of the performance of the model.
Categorical Features:
Surface: Represents the type of court surface (e.g., 'Hard', 'Clay', 'Grass').
Round: Indicates the stage of the tournament (e.g., 'First Round', 'Quarterfinal', 'Final').
Best of: Specifies the maximum number of sets in the match (e.g., 3, 5).
Winner: Name or identifier of the winning player.
Loser: Name or identifier of the losing player.
Comment: Contains match-specific notes or observations.
Converting these columns to categorical data types can improve memory efficiency and performance in analyses.
Ordinal Categorical Columns:
Round: If the 'Round' column has a natural order (e.g., 'First Round' < 'Quarterfinal' < 'Final'), it can be treated as an ordinal categorical variable.
Best of: While numerical, it represents discrete categories and can be treated as ordinal if there's a meaningful order.
Numerical Columns:
Date: Represents the date of the match; typically stored as a datetime object.
WRank, LRank: Rankings of the winner and loser, respectively.
WPts, LPts: Points of the winner and loser, respectively.
Wsets, Lsets: Number of sets won by the winner and loser, respectively.
PSW, PSL: Probabilities or predictions related to the winner and loser.
elo_winner, elo_loser: Elo ratings of the winner and loser.
proba_elo: Probability derived from Elo ratings.
These columns are numerical and should remain as such for quantitative analyses.