Appendix A: Complete Parameter List
- x: A vector containing the names of the predictors in the model. No default.
- y: The name of the response variable in the model. No default.
- data: An H2OParsedData object containing the training data. No default.
- key: The unique hex key assigned to the resulting model. If none is given, a key will automatically be generated.
- override_with_best_model: If enabled, override the final model with the best model found during training. Default is true.
- classification: A logical value indicating whether the algorithm should conduct classification. Otherwise, regression is performed on a numeric response variable.
- nfolds: Number of folds for cross-validation. If the number of folds is more than 1, then validation must remain empty. Default is 0.
- validation: An H2OParsedData object indicating the validation dataset used to construct the confusion matrix. If left blank, the default is the training data.
- holdout_fraction: (Optional) Fraction of the training data to hold out for validation.
- checkpoint: Model checkpoint (either a key or an H2ODeepLearningModel) to resume training with.
- activation: The choice of nonlinear, differentiable activation function used throughout the network. Options are Tanh, TanhWithDropout, Rectifier, RectifierWithDropout, Maxout, and MaxoutWithDropout; the default is Rectifier. See section Activation and Loss Functions for more details.
- hidden: The number and size of each hidden layer in the model. For example, if c(100,200,100) is specified, a model with 3 hidden layers will be produced, and the middle hidden layer will have 200 neurons. The default is c(200,200). For grid search, use list(c(10,10), c(20,20)), etc. See section Performing a Trial Run for more details.
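To make the hidden specification concrete, the sketch below (plain Python, not part of the H2O API; `layer_shapes` is a hypothetical helper) lists the weight-matrix shapes a given hidden layout implies:

```python
def layer_shapes(n_inputs, hidden, n_outputs):
    # sizes of all layers: input -> hidden layers -> output
    sizes = [n_inputs] + list(hidden) + [n_outputs]
    # one weight matrix connects each consecutive pair of layers
    return [(sizes[i], sizes[i + 1]) for i in range(len(sizes) - 1)]

# hidden=c(100,200,100) in R corresponds to [100, 200, 100] here;
# 784 inputs and 10 output classes are illustrative values
shapes = layer_shapes(784, [100, 200, 100], 10)
```

For a 784-input, 10-class problem this yields four weight matrices: 784x100, 100x200, 200x100, and 100x10.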
- autoencoder: Default is false. See section Deep Autoencoders for more details.
- use_all_factor_levels: Use all factor levels of categorical variables. Otherwise, the first factor level is omitted (without loss of accuracy). Useful for variable importances, and automatically enabled for autoencoders.
- epochs: The number of passes over the training dataset to be carried out. It is recommended to start with lower values for initial grid searches. The value can be modified during checkpoint restarts, allowing continuation of selected models. Default is 10.
- train_samples_per_iteration: Default is -1, but performance might depend greatly on this parameter. See section Specifying the Number of Training Samples per Iteration for more details.
- seed: The random seed controls sampling and initialization. Reproducible results are only expected with single-threaded operation (i.e., when running on one node, turning off load balancing, and providing a small dataset that fits in one chunk). In general, the multi-threaded asynchronous updates to the model parameters will result in (intentional) race conditions and non-reproducible results. Default is a randomly generated number.
- adaptive_rate: Enables the adaptive learning rate; enabled by default. See section Adaptive Learning for more details.
- rho: The first of two hyperparameters for the adaptive learning rate (when it is enabled). This parameter is similar to momentum and relates to the memory of prior weight updates. Typical values are between 0.9 and 0.999. Default is 0.95. See section Adaptive Learning for more details.
- epsilon: The second of two hyperparameters for the adaptive learning rate (when it is enabled). This parameter is similar to learning rate annealing during initial training and to momentum at later stages, where it allows forward progress. Typical values are between 1e-10 and 1e-4. Default is 1e-6. See section Adaptive Learning for more details.
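rho and epsilon are the two hyperparameters of an ADADELTA-style adaptive scheme. The toy scalar update below is a sketch under the standard ADADELTA formulas, not H2O's actual implementation:

```python
import math

def adadelta_step(grad, state, rho=0.95, epsilon=1e-6):
    # state holds running averages of squared gradients and squared updates
    e_g2, e_dx2 = state
    e_g2 = rho * e_g2 + (1.0 - rho) * grad ** 2  # rho: memory of prior updates
    # epsilon keeps the ratio well-defined early on and allows
    # forward progress once gradients become small
    dx = -math.sqrt(e_dx2 + epsilon) / math.sqrt(e_g2 + epsilon) * grad
    e_dx2 = rho * e_dx2 + (1.0 - rho) * dx ** 2
    return dx, (e_g2, e_dx2)
```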
- rate: The learning rate. Higher values lead to less stable models, while lower values lead to slower convergence. Default is 0.005.
- rate_annealing: Default is 1e-6 (when adaptive learning is disabled). See section Rate Annealing for more details.
- rate_decay: The learning rate decay parameter controls the change of learning rate across layers. Default is 1.0 (when adaptive learning is disabled).
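When adaptive learning is disabled, rate and rate_annealing combine into a decaying schedule. The sketch below assumes the common rate / (1 + rate_annealing * samples) annealing formula (an assumption; see section Rate Annealing for the exact form used):

```python
def annealed_rate(rate, rate_annealing, samples_seen):
    # the learning rate shrinks as more training samples are processed
    return rate / (1.0 + rate_annealing * samples_seen)
```

With the defaults (rate = 0.005, rate_annealing = 1e-6), the rate halves after one million training samples.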
- momentum_start: The momentum_start parameter controls the amount of momentum at the beginning of training (when adaptive learning is disabled). Default is 0. See section Momentum Training for more details.
- momentum_ramp: The momentum_ramp parameter controls the amount of training for which momentum increases (assuming momentum_stable is larger than momentum_start). It applies when adaptive learning is disabled, and the ramp is measured in the number of training samples. Default is 1e6. See section Momentum Training for more details.
- momentum_stable: The momentum_stable parameter controls the final momentum value reached after momentum_ramp training samples (when adaptive learning is disabled). The momentum used for training remains the same once that point is reached. Default is 0. See section Momentum Training for more details.
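The interplay of momentum_start, momentum_ramp, and momentum_stable can be sketched as a linear ramp (hypothetical helper; the argument values in the usage below are for illustration only):

```python
def momentum_at(samples_seen, momentum_start, momentum_stable, momentum_ramp):
    # after the ramp, momentum stays at its stable value
    if samples_seen >= momentum_ramp:
        return momentum_stable
    # before that, momentum rises linearly from start to stable
    frac = samples_seen / float(momentum_ramp)
    return momentum_start + (momentum_stable - momentum_start) * frac
```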
- nesterov_accelerated_gradient: The default is true (when adaptive learning is disabled). See section Momentum Training for more details.
- input_dropout_ratio: The fraction of input-layer units to drop during training. The default is 0. See section Regularization for more details.
- hidden_dropout_ratios: The fraction of hidden-layer units to drop during training, which can improve generalization; specify one value per hidden layer. The default is 0.5 for the *WithDropout activation functions. See section Regularization for more details.
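As a sketch of what a dropout ratio does to a layer's activations (the inverted-dropout rescaling shown here is a common convention and an assumption, not necessarily H2O's exact scheme):

```python
import random

def apply_dropout(activations, ratio, rng):
    # drop each unit with probability `ratio`; rescale the survivors
    # by 1/(1 - ratio) so the expected activation is unchanged
    keep = 1.0 - ratio
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
out = apply_dropout([1.0] * 1000, 0.5, rng)
```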
- l1: L1 regularization (can add stability and improve generalization; causes many weights to become 0). The default is 0. See section Regularization for more details.
- l2: L2 regularization (can add stability and improve generalization; causes many weights to be small). The default is 0. See section Regularization for more details.
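The effect of l1 and l2 on a single weight's gradient can be sketched as follows (hypothetical helper; the penalty terms l1*|w| and (l2/2)*w^2 are the standard formulation):

```python
def regularized_grad(w, grad, l1=0.0, l2=0.0):
    # d/dw of: loss + l1*|w| + (l2/2)*w^2
    sign = (w > 0) - (w < 0)   # subgradient of |w|; 0 at w == 0
    return grad + l1 * sign + l2 * w
```

The constant-magnitude L1 term keeps pushing weights all the way to 0, while the L2 term shrinks large weights proportionally, which is why L1 yields sparse models and L2 yields small weights.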
- max_w2: A maximum on the sum of the squared incoming weights into any one neuron. This tuning parameter is especially useful for unbounded activation functions such as Maxout or Rectifier. The default leaves this maximum unbounded.
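One way such a constraint can be enforced is to rescale a neuron's incoming weight vector whenever its squared norm exceeds the bound; the helper below is a hypothetical sketch of that idea:

```python
import math

def constrain_weights(w_in, max_w2):
    # sum of squared incoming weights for one neuron
    s = sum(x * x for x in w_in)
    if s <= max_w2:
        return list(w_in)              # already within the bound
    scale = math.sqrt(max_w2 / s)      # shrink back onto the constraint
    return [x * scale for x in w_in]
```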
- initial_weight_distribution: The distribution from which initial weights are to be drawn. The default is the uniform adaptive option. Other options are Uniform and Normal distributions. See section Initialization for more details.
- initial_weight_scale: The scale of the distribution function for the Uniform or Normal distributions. For Uniform, the values are drawn uniformly from (-initial_weight_scale, initial_weight_scale). For Normal, the values are drawn from a Normal distribution with a standard deviation of initial_weight_scale. The default is 1.0. See section Initialization for more details.
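The two explicit distributions can be sketched as follows (hypothetical helper; the uniform adaptive default chooses its own scale and is omitted here):

```python
import random

def init_weights(n, dist, initial_weight_scale, seed=42):
    rng = random.Random(seed)
    if dist == "Uniform":
        # drawn uniformly from (-scale, scale)
        return [rng.uniform(-initial_weight_scale, initial_weight_scale)
                for _ in range(n)]
    # "Normal": mean 0, standard deviation = scale
    return [rng.gauss(0.0, initial_weight_scale) for _ in range(n)]
```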
- loss: The default is automatic, based on the particular learning problem. See section Activation and Loss Functions for more details.
- score_interval: The minimum time (in seconds) to elapse between model scorings. The actual interval is determined by the number of training samples per iteration and the scoring duty cycle. Default is 5.
- score_training_samples: The number of training dataset points to be used for scoring. Will be randomly sampled. Use 0 to select the entire training dataset. Default is 10000.
- score_validation_samples: The number of validation dataset points to be used for scoring. Can be randomly sampled or stratified (if balance_classes is enabled and score_validation_sampling is set to Stratified). Use 0 to select the entire validation dataset (this is also the default).
- score_duty_cycle: Maximum fraction of wall-clock time spent on model scoring on training and validation samples, and on diagnostics such as the computation of feature importances (i.e., not on training). Default is 0.1.
- classification_stop: The stopping criterion in terms of classification error (1 - accuracy) on the training scoring dataset. When the error is at or below this threshold, training stops. Default is 0.
- regression_stop: The stopping criterion in terms of regression error (MSE) on the training scoring dataset. When the error is at or below this threshold, training stops. Default is 1e-6.
- quiet_mode: Enable quiet mode for less output to standard output. Default is false.
- max_confusion_matrix_size: For classification models, the maximum size (in terms of classes) of the confusion matrix for it to be printed. This option is meant to avoid printing extremely large confusion matrices. Default is 20.
- max_hit_ratio_k: The maximum number (top K) of predictions to use for hit-ratio computation (multi-class only; 0 to disable). Default is 10.
- balance_classes: For imbalanced data, balance training data class counts via over/under-sampling. This can result in improved predictive accuracy. Default is false.
- class_sampling_factors: Desired over/under-sampling ratios per class (in lexicographic order). Only applies when balance_classes is enabled. If not specified, the ratios will be automatically computed to obtain class balance during training.
- max_after_balance_size: When classes are balanced, limit the resulting dataset size to the specified multiple of the original dataset size. This is the maximum relative size of the training data after balancing class counts (can be less than 1.0). Default is 5.0.
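What the automatic computation of class_sampling_factors aims at can be sketched as follows (hypothetical helper; balancing every class toward the largest one is one simple choice, with growth capped per max_after_balance_size, and is not necessarily H2O's exact procedure):

```python
def sampling_factors(counts, max_after_balance_size=5.0):
    # ratio that would bring each class up/down to the largest class
    target = max(counts.values())
    factors = {c: target / n for c, n in counts.items()}
    # cap total growth at max_after_balance_size x the original size
    total = sum(counts.values())
    grown = sum(factors[c] * n for c, n in counts.items())
    if grown > max_after_balance_size * total:
        cap = max_after_balance_size * total / grown
        factors = {c: f * cap for c, f in factors.items()}
    return factors
```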
- score_validation_sampling: Method used to sample the validation dataset for scoring. The possible methods are Uniform and Stratified. Default is Uniform.
- diagnostics: Gather diagnostics for the hidden layers, such as mean and RMS values of learning rate, momentum, weights, and biases. Default is true.
- variable_importances: Whether to compute variable importances for input features. The implementation considers the weights connecting the input features to the first two hidden layers. Default is false.
- fast_mode: Enable fast mode (a minor approximation in back-propagation) that should not affect results significantly. Default is true.
- ignore_const_cols: Ignore constant training columns (no information can be gained from them anyway). Default is true.
- force_load_balance: Increase training speed on small datasets by splitting them into many chunks, allowing utilization of all cores. Default is true.
- replicate_training_data: Replicate the entire training dataset onto every node, for faster training on small datasets. Default is true.
- single_node_mode: Run on a single node for fine-tuning of model parameters. Can be useful for faster convergence during checkpoint resumes after training on a very large number of nodes (for fast initial convergence). Default is false.
- shuffle_training_data: Enable shuffling of training data (on each node). This option is recommended if the training data is replicated on N nodes and the number of training samples per iteration is close to N times the dataset size, where all nodes train with (almost) all of the data. It is automatically enabled if the number of training samples per iteration is set to -1 (or to N times the dataset size or larger); otherwise it is disabled by default.
- max_categorical_features: Maximum number of categorical features, enforced via hashing (experimental).
- reproducible: Force reproducibility on small data (will be slow; only uses one thread).
- sparse: Enable sparse data handling (experimental).
- col_major: Use a column-major weight matrix for the input layer; can speed up forward propagation, but may slow down back-propagation.