# HuggingFace Compatibility

HuggingFace compatibility lets you use HuggingFace model data files with ONNXRuntime-Extensions for pre- and post-processing.

## HuggingFace Tokenizer

A HuggingFace tokenizer typically ships with `tokenizer.json` and `tokenizer_config.json` files.
The following fields in `tokenizer_config.json` are supported in onnxruntime-extensions:

- `model_max_length`: the maximum length of the tokenized sequence.
- `bos_token`: the beginning-of-sequence token; both `string` and `object` types are supported.
- `eos_token`: the end-of-sequence token; both `string` and `object` types are supported.
- `unk_token`: the unknown token; both `string` and `object` types are supported.
- `pad_token`: the padding token; both `string` and `object` types are supported.
- `clean_up_tokenization_spaces`: whether to clean up tokenization spaces.
- `tokenizer_class`: the tokenizer class.

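The "both `string` and `object` types" note can be illustrated with a small sketch. It assumes the object form follows the usual HuggingFace `AddedToken` layout, where the token text lives under `content`. The helper `token_text` and the sample values are hypothetical and are not part of the onnxruntime-extensions API.

```python
import json


def token_text(value):
    """Return the token string for a field that is a string, an object, or absent."""
    if value is None:
        return None
    if isinstance(value, str):
        return value
    if isinstance(value, dict):
        # Object form (assumed layout): {"__type": "AddedToken", "content": "<s>", ...}
        return value.get("content")
    raise TypeError(f"unexpected token field type: {type(value).__name__}")


# Illustrative tokenizer_config.json content; the values are made up.
config = json.loads("""
{
  "model_max_length": 2048,
  "bos_token": "<s>",
  "eos_token": {"__type": "AddedToken", "content": "</s>", "lstrip": false,
                "normalized": true, "rstrip": false, "single_word": false},
  "clean_up_tokenization_spaces": false,
  "tokenizer_class": "LlamaTokenizer"
}
""")

# Normalize all four special-token fields to plain strings (or None if missing).
special_tokens = {
    name: token_text(config.get(name))
    for name in ("bos_token", "eos_token", "unk_token", "pad_token")
}
print(special_tokens)
```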
The following fields in `tokenizer.json` are supported in onnxruntime-extensions:

- `add_bos_token`: whether to add the beginning-of-sequence token.
- `add_eos_token`: whether to add the end-of-sequence token.
- `added_tokens`: the list of added tokens.
- `normalizer`: the normalizer; only two normalizers are supported: `Replace` and `precompiled_charsmap`.
- `pre_tokenizer`: not supported.
- `post_processor`: post-processes the tokenized sequence; only adding the BOS/EOS tokens in the post-processor is supported.
- `decoder/decoders`: the decoders; only the `Replace` decoder step is supported.
- `model/type`: the type of the model; only `BPE` is supported.
- `model/vocab`: the vocabulary of the model.
- `model/merges`: the merges of the model.
- `model/end_of_word_suffix`: the end-of-word suffix.
- `model/continuing_subword_prefix`: the continuing subword prefix.
- `model/byte_fallback`: not supported.
- `model/unk_token_id`: the ID of the unknown token.

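As a rough pre-flight check, the list above can be turned into a small script that flags settings marked as unsupported before a `tokenizer.json` is handed to onnxruntime-extensions. This is only a sketch: `find_unsupported` is a hypothetical helper, and whether the library ignores or rejects these fields is not stated here, so the messages merely restate the list.

```python
def find_unsupported(tokenizer_json):
    """Return human-readable notes for settings the list marks as unsupported."""
    problems = []
    model = tokenizer_json.get("model") or {}

    # model/type: only BPE is listed as supported.
    if model.get("type") != "BPE":
        problems.append(f"model/type is {model.get('type')!r}; only 'BPE' is listed as supported")

    # model/byte_fallback: listed as not supported.
    if model.get("byte_fallback"):
        problems.append("model/byte_fallback is enabled but listed as not supported")

    # pre_tokenizer: listed as not supported.
    if tokenizer_json.get("pre_tokenizer") is not None:
        problems.append("pre_tokenizer is set but listed as not supported")

    return problems


# Illustrative input; a real file would be loaded with json.load().
sample = {
    "added_tokens": [],
    "normalizer": None,
    "pre_tokenizer": {"type": "ByteLevel"},
    "model": {"type": "WordPiece", "vocab": {}, "byte_fallback": False},
}

for note in find_unsupported(sample):
    print(note)
```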
`tokenizer_module.json` is an optional file, defined by onnxruntime-extensions, that contains user-customized Python module information for the tokenizer. The following fields are supported:

- `tiktoken_file`: the path to the tiktoken vocab file, which is base64-encoded.
- `added_tokens`: same as in `tokenizer.json`. If `tokenizer.json` does not contain `added_tokens`, or the file does not exist, the user can supply this field here.

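A minimal sketch of producing such a file from Python follows. The file name `vocab.tiktoken`, the token entry, and the assumption that the path is resolved relative to the tokenizer directory are all illustrative. The `added_tokens` entry uses the field names commonly found in a HuggingFace `tokenizer.json`.

```python
import json
import os
import tempfile

module_info = {
    # Hypothetical path to a base64-encoded tiktoken vocab file.
    "tiktoken_file": "vocab.tiktoken",
    # Same shape as `added_tokens` in tokenizer.json (illustrative entry).
    "added_tokens": [
        {
            "id": 100257,
            "content": "<|endoftext|>",
            "single_word": False,
            "lstrip": False,
            "rstrip": False,
            "normalized": False,
            "special": True,
        }
    ],
}

# Write into a scratch directory; in practice this would be the tokenizer directory.
out_dir = tempfile.mkdtemp()
path = os.path.join(out_dir, "tokenizer_module.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(module_info, f, indent=2)

# Read the file back to confirm it round-trips as valid JSON.
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)
print(sorted(loaded))
```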