Preprocessing
dataclass
class ruprompts.preprocessing.Text2TextPreprocessor(prompt_format: BasePromptFormat, tokenizer: PreTrainedTokenizerBase, target_field: str, max_tokens: Optional[int] = None, truncation_field: Optional[str] = None) -> None
Carries out preprocessing for text2text tasks.
Applies prompt format, appends target sequence, tokenizes and truncates each dataset item.
Examples:
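>>> from ruprompts import PromptFormat
>>> from ruprompts.preprocessing import Text2TextPreprocessor
>>> prompt_format = PromptFormat("<P*20>{text}<P*10>")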
>>> preprocessor = Text2TextPreprocessor(
... prompt_format=prompt_format,
... tokenizer=tokenizer,
... target_field="summary",
... max_tokens=1024,
... truncation_field="text"
... )
>>> dataset = dataset.map(preprocessor)
>>> Trainer(..., train_dataset=dataset, ...)
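The steps named in the description (apply the prompt format, append the target sequence, tokenize, truncate) can be pictured with a small standalone sketch. This is not the library's implementation: `toy_tokenize`, `preprocess_item`, and the plain `str.format` template are hypothetical stand-ins, the whitespace tokenizer replaces a real `PreTrainedTokenizerBase`, trainable `<P*N>` tokens are left out, and trimming from the end of the truncation field is an assumption made for illustration.

```python
from typing import Dict, List, Optional


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: splits on whitespace."""
    return text.split()


def preprocess_item(
    item: Dict[str, str],
    template: str,
    target_field: str,
    max_tokens: Optional[int] = None,
    truncation_field: Optional[str] = None,
) -> List[str]:
    """Format one item, append its target, tokenize, and trim one field to fit."""
    fields = dict(item)  # work on a copy so the caller's item is untouched
    target_tokens = toy_tokenize(fields.pop(target_field))
    tokens = toy_tokenize(template.format(**fields)) + target_tokens

    too_long = max_tokens is not None and len(tokens) > max_tokens
    if too_long and truncation_field is not None:
        overflow = len(tokens) - max_tokens
        field_tokens = toy_tokenize(fields[truncation_field])
        # Assumption: drop the overflow from the end of the chosen field only,
        # leaving the rest of the prompt and the target intact.
        kept = field_tokens[: max(len(field_tokens) - overflow, 0)]
        fields[truncation_field] = " ".join(kept)
        tokens = toy_tokenize(template.format(**fields)) + target_tokens
    return tokens


item = {"text": "a b c d e f", "summary": "x y"}
tokens = preprocess_item(
    item,
    template="Summarize: {text} =>",
    target_field="summary",
    max_tokens=7,
    truncation_field="text",
)
print(tokens)
```

In this sketch only the `text` field loses tokens when the sequence is too long, so the surrounding prompt and the `summary` target survive truncation. That is the reason a preprocessor of this kind takes a dedicated `truncation_field` rather than cutting the tail of the whole sequence.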
Parameters:
Name | Type | Description | Default |
---|---|---|---|
prompt_format | BasePromptFormat | Prompt format to be applied to dataset items. | required |
tokenizer | PreTrainedTokenizerBase | Tokenizer used to tokenize dataset items. | required |
target_field | str | Target dataset field. | required |
max_tokens | Optional[int] | Max sequence length in tokens. | None |
truncation_field | Optional[str] | Field to be truncated when sequence length exceeds max_tokens. | None |