# Config Structure

> **Info:** If you are not familiar with Hydra, please read our short introduction or the Hydra docs.

Our config is located in the `conf/` folder and consists of the following groups:
## Backbone

**Path:** `conf/backbone`

**Default:** `rugpt3large`

**Description:** Defines the name of the pretrained model and tokenizer.

**Options:**

- `rugpt3small` - loads `sberbank-ai/rugpt3small_based_on_gpt2`.
- `rugpt3medium` - loads `sberbank-ai/rugpt3medium_based_on_gpt2`.
- `rugpt3large` - loads `sberbank-ai/rugpt3large_based_on_gpt2`.

**Option format:**

```yaml
pretrained_model_name_or_path: <string>
```
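For instance, the `rugpt3large` option corresponds to a config roughly like the following (a sketch of a hypothetical `conf/backbone/rugpt3large.yaml`; the filename is an assumption, while the model identifier is taken from the options list above):

```yaml
# Hypothetical conf/backbone/rugpt3large.yaml:
# resolves the backbone option to a Hugging Face Hub model identifier.
pretrained_model_name_or_path: sberbank-ai/rugpt3large_based_on_gpt2
```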
## Model

**Path:** `conf/model`

**Default:** `default`

**Description:** Creates a model.

**Options:**

- `default` - loads an `AutoLMHeadModel` based on the backbone option.
- `gpt` - the same as `default`, but loads a `GPT2LMHeadModel`.

**Option format:**

An instantiable config returning an instance of a pretrained model:

```yaml
_target_: <module>.<callable>
arg1: value1
arg2: value2
```
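As an illustration, the `gpt` option could be expressed roughly like this (a sketch of a hypothetical `conf/model/gpt.yaml`; the interpolation path `${backbone.pretrained_model_name_or_path}` is an assumption about how the backbone group is wired in):

```yaml
# Hypothetical conf/model/gpt.yaml:
# instantiates a GPT2LMHeadModel from the checkpoint named by the backbone group.
_target_: transformers.GPT2LMHeadModel.from_pretrained
pretrained_model_name_or_path: ${backbone.pretrained_model_name_or_path}
```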
## Tokenizer

**Path:** `conf/tokenizer`

**Default:** `autotokenizer`

**Description:** Creates a tokenizer.

**Options:**

- `autotokenizer` - loads a tokenizer based on the backbone option.
- `rugpt3` - the same as `autotokenizer`, but also adds missing special tokens.

**Option format:**

An instantiable config returning an instance of a pretrained tokenizer:

```yaml
_target_: <module>.<callable>
arg1: value1
arg2: value2
```
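A sketch of what the `autotokenizer` option could look like (hypothetical `conf/tokenizer/autotokenizer.yaml`; the interpolation path is an assumption about how the backbone group is wired in):

```yaml
# Hypothetical conf/tokenizer/autotokenizer.yaml:
# loads the tokenizer matching the backbone checkpoint.
_target_: transformers.AutoTokenizer.from_pretrained
pretrained_model_name_or_path: ${backbone.pretrained_model_name_or_path}
```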
## Dataset

**Path:** `conf/dataset`

**Default:** `default`

**Description:** Loads a dataset dict containing at least `train` and `validation` datasets.

**Options:**

- `default` - loads a dataset dict with the `datasets.load_dataset` function.
- `from_jsonl` - inherits from `default`; allows loading the dataset dict from JSON files. Required fields: `data_files.train` and `data_files.validation`. Usage example: `dataset=from_jsonl data_files.train=/path/to/train.jsonl data_files.validation=/path/to/validation.jsonl`.

**Option format:**

An instantiable config returning an instance of a dataset dict:

```yaml
_target_: <module>.<callable>
arg1: value1
arg2: value2
```
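A `from_jsonl`-style config could look roughly like this (a sketch of a hypothetical `conf/dataset/from_jsonl.yaml`; `path: json` refers to the generic JSON loader of `datasets.load_dataset`, and `???` is Hydra's marker for a mandatory value):

```yaml
# Hypothetical conf/dataset/from_jsonl.yaml:
# loads train/validation splits from JSON Lines files.
_target_: datasets.load_dataset
path: json
data_files:
  train: ???        # set via data_files.train=/path/to/train.jsonl
  validation: ???   # set via data_files.validation=/path/to/validation.jsonl
```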
## Preprocessing

**Path:** `conf/preprocessing`

**Default:** `text2text`

**Description:** Returns an instance of a preprocessor.

**Options:**

- `text2text` - creates an instance of `Text2TextPreprocessor`. Required fields match the parameters of the target class.

**Option format:**

An instantiable config returning an instance of a preprocessor:

```yaml
_target_: <module>.<callable>
arg1: value1
arg2: value2
```
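Using the field names from the task example at the end of this page, a `text2text` override could be sketched as follows (the module path `ruprompts.preprocessing` and the concrete values are assumptions):

```yaml
# Hypothetical conf/preprocessing/text2text.yaml:
# fields are forwarded to Text2TextPreprocessor's constructor.
_target_: ruprompts.preprocessing.Text2TextPreprocessor
target_field: polite      # field containing the target text
truncation_field: toxic   # field truncated when the sample is too long
max_tokens: 1792
```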
## Prompt Format

**Path:** `conf/prompt_format`

**Default:** `default`

**Description:** Defines the prompt format.

**Options:**

- `default` - creates an instance of `PromptFormat`.

**Option format:**

An instantiable config returning an instance of a prompt format:

```yaml
_target_: <module>.<callable>
arg1: value1
arg2: value2
```
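For illustration, a prompt format config might look like this (the module path is an assumption; the template value is borrowed from the task example at the end of this page):

```yaml
# Hypothetical conf/prompt_format/default.yaml;
# the template mixes trainable prompt tokens with a data field placeholder.
_target_: ruprompts.PromptFormat
template: "<P*60>{toxic}<P*20>"
```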
## Prompt Provider

**Path:** `conf/prompt_provider`

**Default:** `tensor`

**Description:** Defines the prompt provider.

**Options:**

- `tensor` - creates an instance of `TensorPromptProvider`.
- `lstm` - creates an instance of `LSTMPromptProvider`.

**Option format:**

An instantiable config returning an instance of a prompt provider:

```yaml
_target_: <module>.<callable>
arg1: value1
arg2: value2
```
## Optimizer

**Path:** `conf/optimizer`

**Default:** `adamw`

**Description:** Defines the optimizer.

**Options:**

- `adamw` - creates an instance of the `AdamW` optimizer.

**Option format:**

An instantiable config returning an instance of a torch optimizer:

```yaml
_target_: <module>.<callable>
arg1: value1
arg2: value2
```
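An `adamw` config could be sketched as follows (the hyperparameter values are placeholders; the model parameters to optimize are presumably supplied by the trainer at instantiation time):

```yaml
# Hypothetical conf/optimizer/adamw.yaml:
# hyperparameters only; the trainer supplies the parameters to optimize.
_target_: torch.optim.AdamW
lr: 1.0e-3           # placeholder value
weight_decay: 0.01   # placeholder value
```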
## Scheduler

**Path:** `conf/scheduler`

**Default:** `adamw`

**Description:** Defines the learning rate schedule.

**Options:**

- `linear_schedule_with_warmup` - creates a linear schedule.
- `constant_schedule_with_warmup` - creates a constant schedule.

**Option format:**

An instantiable config returning an instance of a torch lr scheduler:

```yaml
_target_: <module>.<callable>
arg1: value1
arg2: value2
```
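A `linear_schedule_with_warmup` config could be sketched as follows (assuming it wraps `transformers.get_linear_schedule_with_warmup`; the optimizer and the total number of training steps are presumably injected by the trainer):

```yaml
# Hypothetical conf/scheduler/linear_schedule_with_warmup.yaml:
# warmup length only; optimizer and total step count come from the trainer.
_target_: transformers.get_linear_schedule_with_warmup
num_warmup_steps: 100   # placeholder value
```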
## Training Arguments

**Path:** `conf/training`

**Default:** `default`

**Description:** Defines the training arguments.

**Options:**

- `default` - creates an instance of `TrainingArguments`.

**Option format:**

No other options are assumed.
## Callbacks

**Path:** `conf/callbacks`

**Default:**

- `freeze_transformer_unfreeze_prompt`
- `reduce_checkpoint`
- `save_pretrained_prompt`
- `wb_log_hydra_config`

**Description:** Selects the trainer callbacks.

**Options:**

- `freeze_transformer_unfreeze_prompt` - creates an instance of `FreezeTransformerUnfreezePrompt`. Freezes the pretrained transformer and unfreezes the prompt provider before training.
- `reduce_checkpoint` - creates an instance of `ReduceCheckpoint`. After each save, reduces the size of the saved checkpoint by removing all weights except those of the prompt provider.
- `save_pretrained_prompt` - creates an instance of `SavePretrainedPrompt`. Saves the trained prompt with `Prompt.save_pretrained` at each checkpoint.
- `wb_log_hydra_config` - creates an instance of `WBLogHydraConfig`. Logs the composed Hydra config to Weights and Biases before training.

**Option format:**

An instantiable config returning an instance of `TrainerCallback`:

```yaml
_target_: <module>.<callable>
arg1: value1
arg2: value2
```
## Task

**Path:** `conf/task`

**Default:** `default`

**Description:** Overrides the parameters of other groups.

**Options:**

- `text2text` - selects `model=gpt`, `dataset=default` and `preprocessing=text2text`.
- Other configs that inherit from `text2text` should define the required group parameters in their bodies.

**Option format:**

```yaml
task_name: detoxification
defaults:
  - text2text
  - /dataset: from_jsonl

dataset:
  data_files:
    train: /path/to/train.jsonl
    validation: /path/to/validation.jsonl

prompt_format:
  template: "<P*60>{toxic}<P*20>"

preprocessing:
  target_field: "polite"
  truncation_field: "toxic"
  max_tokens: 1792