Preprocessing

dataclass ruprompts.preprocessing.Text2TextPreprocessor(prompt_format: BasePromptFormat, tokenizer: PreTrainedTokenizerBase, target_field: str, max_tokens: Optional[int] = None, truncation_field: Optional[str] = None) -> None

Carries out preprocessing for text2text tasks.

For each dataset item, applies the prompt format, appends the target sequence, then tokenizes and truncates the result.

Examples:

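>>> from ruprompts import PromptFormat
>>> from ruprompts.preprocessing import Text2TextPreprocessor
>>> prompt_format = PromptFormat("<P*20>{text}<P*10>")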
>>> preprocessor = Text2TextPreprocessor(
...     prompt_format=prompt_format,
...     tokenizer=tokenizer,
...     target_field="summary",
...     max_tokens=1024,
...     truncation_field="text"
... )
>>> dataset = dataset.map(preprocessor)
>>> Trainer(..., train_dataset=dataset, ...)

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| prompt_format | BasePromptFormat | Prompt format to be applied to dataset items. | required |
| tokenizer | PreTrainedTokenizerBase | Tokenizer used to tokenize dataset items. | required |
| target_field | str | Target dataset field. | required |
| max_tokens | Optional[int] | Max sequence length in tokens. | None |
| truncation_field | Optional[str] | Field to be truncated when sequence length exceeds max_tokens. | None |
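
To make the interaction between max_tokens and truncation_field concrete, here is a minimal sketch of the four steps described above (format, append target, tokenize, truncate). It is not the ruprompts implementation: `toy_tokenize` and `toy_preprocess` are hypothetical helpers, the whitespace tokenizer stands in for a real `PreTrainedTokenizerBase`, a plain `str.format` template stands in for a prompt format, and raising an error when no truncation field is set is an assumption made for the sketch.

```python
from typing import Dict, List, Optional


def toy_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer: splits on whitespace (real tokenizers use subwords)."""
    return text.split()


def toy_preprocess(
    item: Dict[str, str],
    template: str,
    target_field: str,
    max_tokens: Optional[int] = None,
    truncation_field: Optional[str] = None,
) -> List[str]:
    """Hypothetical sketch: format the item, append the target, tokenize,
    and shorten one field if the sequence is too long."""

    def build(fields: Dict[str, str]) -> List[str]:
        # Apply the template, then append the target sequence.
        return toy_tokenize(template.format(**fields) + " " + fields[target_field])

    tokens = build(item)
    if max_tokens is None or len(tokens) <= max_tokens:
        return tokens
    if truncation_field is None:
        # Assumption for this sketch only; the library may behave differently.
        raise ValueError("sequence exceeds max_tokens and no truncation_field is set")

    # Drop just enough tokens from the end of the truncation field
    # so that the whole sequence fits into max_tokens.
    overflow = len(tokens) - max_tokens
    field_tokens = toy_tokenize(item[truncation_field])
    kept = field_tokens[: max(len(field_tokens) - overflow, 0)]
    return build({**item, truncation_field: " ".join(kept)})


item = {"text": "a b c d e f", "summary": "x y"}
result = toy_preprocess(
    item, "summarize: {text} =>", "summary", max_tokens=7, truncation_field="text"
)
print(result)
```

In this sketch only the field named by truncation_field loses tokens; the template text and the target stay intact, which is the reason to name a truncation field instead of cutting the end of the full sequence.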