microsoft/onnxruntime-extensions

docs/huggingface_compatibility.md

# HuggingFace Compatibility

HuggingFace compatibility is a feature that allows you to use HuggingFace model data files with ONNXRuntime-Extensions for pre-/post-processing.

## HuggingFace Tokenizer

A HuggingFace tokenizer always contains `tokenizer.json` and `tokenizer_config.json` files.
The following fields in `tokenizer_config.json` are supported in onnxruntime-extensions:

- `model_max_length`: the maximum length of the tokenized sequence.
- `bos_token`: the beginning-of-sequence token; both `string` and `object` types are supported.
- `eos_token`: the end-of-sequence token; both `string` and `object` types are supported.
- `unk_token`: the unknown token; both `string` and `object` types are supported.
- `pad_token`: the padding token; both `string` and `object` types are supported.
- `clean_up_tokenization_spaces`: whether to clean up the tokenization spaces.
- `tokenizer_class`: the tokenizer class.

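To illustrate what "both `string` and `object` types" means for the token fields of `tokenizer_config.json`, here is a minimal sketch that reads those fields from an already-parsed config. It is not the onnxruntime-extensions implementation. The helper names `token_content` and `summarize_tokenizer_config` are hypothetical, the sample values are made up, and the sketch assumes that an object-typed token stores its text under a `content` key, as HuggingFace-exported configs commonly do.

```python
import json


def token_content(value):
    """Return the token text for a string- or object-typed token field."""
    if value is None:
        return None
    if isinstance(value, str):
        return value
    # Assumption: object-typed tokens keep their text under "content".
    return value["content"]


def summarize_tokenizer_config(config):
    """Collect the fields listed above from a parsed tokenizer_config.json."""
    summary = {
        "model_max_length": config.get("model_max_length"),
        "clean_up_tokenization_spaces": config.get("clean_up_tokenization_spaces"),
        "tokenizer_class": config.get("tokenizer_class"),
    }
    for name in ("bos_token", "eos_token", "unk_token", "pad_token"):
        summary[name] = token_content(config.get(name))
    return summary


# Illustrative config only; the values are not taken from a real model.
example = json.loads("""
{
  "model_max_length": 2048,
  "bos_token": "<s>",
  "eos_token": {"content": "</s>", "lstrip": false, "rstrip": false},
  "unk_token": "<unk>",
  "tokenizer_class": "LlamaTokenizer"
}
""")
summary = summarize_tokenizer_config(example)
print(summary)
```

Fields that are missing from the config, such as `pad_token` in this sample, come back as `None` rather than raising an error.
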
The following fields in `tokenizer.json` are supported in onnxruntime-extensions:

- `add_bos_token`: whether to add the beginning-of-sequence token.
- `add_eos_token`: whether to add the end-of-sequence token.
- `added_tokens`: the list of added tokens.
- `normalizer`: the normalizer; only two normalizers are supported: `Replace` and `precompiled_charsmap`.
- `pre_tokenizer`: not supported.
- `post_processor`: post-processes the tokenized sequence; only adding the bos/eos token in the post-processor is supported.
- `decoder/decoders`: the decoders; only the `Replace` decoder step is supported.
- `model/type`: the type of the model; only `BPE` is supported.
- `model/vocab`: the vocabulary of the model.
- `model/merges`: the merges of the model.
- `model/end_of_word_suffix`: the end-of-word suffix.
- `model/continuing_subword_prefix`: the continuing subword prefix.
- `model/byte_fallback`: not supported.
- `model/unk_token_id`: the id of the unknown token.

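The support limits listed for `tokenizer.json` can be turned into a quick pre-flight check. The sketch below is only an illustration of those limits, not how onnxruntime-extensions validates the file. The function names `flatten_steps` and `report_unsupported` are hypothetical. It also assumes the usual HuggingFace layout in which a `Sequence` normalizer or decoder nests its steps under `normalizers` or `decoders`, and it treats any normalizer step carrying a `precompiled_charsmap` key as supported.

```python
def flatten_steps(node, list_key):
    """Expand a Sequence node into its child steps; wrap a single step in a list."""
    if node is None:
        return []
    if node.get("type") == "Sequence":
        return list(node.get(list_key, []))
    return [node]


def report_unsupported(tokenizer):
    """Return human-readable notes about parts of a parsed tokenizer.json
    that fall outside the support list above."""
    problems = []
    model = tokenizer.get("model", {})
    if model.get("type") != "BPE":
        problems.append(f"model/type {model.get('type')!r} is not supported (only BPE)")
    if model.get("byte_fallback"):
        problems.append("model/byte_fallback is not supported")
    if tokenizer.get("pre_tokenizer") is not None:
        problems.append("pre_tokenizer is not supported")
    for step in flatten_steps(tokenizer.get("normalizer"), "normalizers"):
        # Assumption: a precompiled normalizer is recognised by its charsmap field.
        if step.get("type") != "Replace" and "precompiled_charsmap" not in step:
            problems.append(f"normalizer step {step.get('type')!r} is not supported")
    for step in flatten_steps(tokenizer.get("decoder"), "decoders"):
        if step.get("type") != "Replace":
            problems.append(f"decoder step {step.get('type')!r} is not supported")
    return problems


# Illustrative input only; not taken from a real model.
tokenizer = {
    "normalizer": {
        "type": "Sequence",
        "normalizers": [
            {"type": "Replace", "pattern": {"String": " "}, "content": "\u2581"},
            {"type": "NFKC"},
        ],
    },
    "pre_tokenizer": None,
    "decoder": {
        "type": "Sequence",
        "decoders": [
            {"type": "Replace", "pattern": {"String": "\u2581"}, "content": " "},
            {"type": "ByteFallback"},
        ],
    },
    "model": {"type": "BPE", "vocab": {}, "merges": [], "byte_fallback": False},
}
problems = report_unsupported(tokenizer)
for line in problems:
    print(line)
```

The check returns notes instead of raising, because the list above does not say whether an unsupported field is rejected or silently ignored at load time.
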
`tokenizer_module.json` is an optional file, defined by onnxruntime-extensions, that contains the user-customized Python module information of the tokenizer. The following fields are supported:

- `tiktoken_file`: the path to the tiktoken vocabulary file, which is base64 encoded.
- `added_tokens`: same as in `tokenizer.json`. If `tokenizer.json` does not contain `added_tokens`, or the file does not exist, the user can supply this field here.

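As a rough illustration of how the two `tokenizer_module.json` fields fit together, the sketch below writes a tiny tiktoken-style vocabulary file and a matching `tokenizer_module.json`, then reads both back. Several details are assumptions rather than documented behavior: the file name `vocab.tiktoken`, the vocabulary layout of one base64-encoded token and its integer rank per line (the layout used by tiktoken's published vocabulary files), the path being resolved relative to the model directory, and the shape of the `added_tokens` entry, which mirrors typical `tokenizer.json` entries.

```python
import base64
import json
import tempfile
from pathlib import Path


def write_tiktoken_file(path, ranks):
    """Write one '<base64 token> <rank>' line per vocabulary entry (assumed layout)."""
    lines = [
        f"{base64.b64encode(token).decode('ascii')} {rank}"
        for token, rank in ranks.items()
    ]
    path.write_text("\n".join(lines) + "\n", encoding="ascii")


def read_tiktoken_file(path):
    """Parse the file back into a {token_bytes: rank} mapping."""
    ranks = {}
    for line in path.read_text(encoding="ascii").splitlines():
        if not line.strip():
            continue
        token_b64, rank = line.split()
        ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks


with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp)
    write_tiktoken_file(
        model_dir / "vocab.tiktoken", {b"he": 0, b"llo": 1, b" world": 2}
    )

    # Hypothetical tokenizer_module.json content for this toy vocabulary.
    module_config = {
        "tiktoken_file": "vocab.tiktoken",
        "added_tokens": [{"id": 3, "content": "<|endoftext|>", "special": True}],
    }
    config_path = model_dir / "tokenizer_module.json"
    config_path.write_text(json.dumps(module_config, indent=2), encoding="utf-8")

    loaded = json.loads(config_path.read_text(encoding="utf-8"))
    ranks = read_tiktoken_file(model_dir / loaded["tiktoken_file"])

print(loaded["tiktoken_file"], len(ranks), "tokens")
```

Keeping `added_tokens` next to `tiktoken_file` reflects the note above: a tiktoken vocabulary has no `tokenizer.json` to carry the added tokens, so they can be supplied here.
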
