microsoft/onnxruntime-extensions
docs/huggingface_compatibility.md
# HuggingFace Compatibility

HuggingFace compatibility is a feature that allows you to use HuggingFace model data files with ONNXRuntime-Extensions for pre-/post-processing.

## HuggingFace Tokenizer

A HuggingFace tokenizer always contains the `tokenizer.json` and `tokenizer_config.json` files.
The following fields in `tokenizer_config.json` are supported in onnxruntime-extensions:

- `model_max_length`: the maximum length of the tokenized sequence.
- `bos_token`: the beginning-of-sequence token; both `string` and `object` types are supported.
- `eos_token`: the end-of-sequence token; both `string` and `object` types are supported.
- `unk_token`: the unknown token; both `string` and `object` types are supported.
- `pad_token`: the padding token; both `string` and `object` types are supported.
- `clean_up_tokenization_spaces`: whether to clean up the tokenization spaces.
- `tokenizer_class`: the tokenizer class.

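Because the special-token fields may be either a plain string or an object, a reader of `tokenizer_config.json` has to handle both shapes. The following is a minimal sketch in plain Python, not the onnxruntime-extensions implementation. It assumes the object form stores the token text under a `content` key, as in HuggingFace's serialized `AddedToken`; the helper names `token_text` and `read_special_tokens` and the sample values are hypothetical.

```python
import json

# The four special-token fields listed above.
SPECIAL_TOKEN_FIELDS = ("bos_token", "eos_token", "unk_token", "pad_token")


def token_text(value):
    """Return the token string for a field given as a string or an object.

    Assumption: the object form keeps the text under the "content" key.
    """
    if value is None:
        return None
    if isinstance(value, str):
        return value
    if isinstance(value, dict):
        return value.get("content")
    raise TypeError(f"unexpected special token value: {value!r}")


def read_special_tokens(config):
    """Map each special-token field to its text, or None when absent."""
    return {name: token_text(config.get(name)) for name in SPECIAL_TOKEN_FIELDS}


# Hypothetical tokenizer_config.json content mixing both shapes.
example_config = json.loads("""
{
  "model_max_length": 2048,
  "bos_token": "<s>",
  "eos_token": {"__type": "AddedToken", "content": "</s>", "lstrip": false},
  "unk_token": {"content": "<unk>"},
  "tokenizer_class": "LlamaTokenizer"
}
""")

special_tokens = read_special_tokens(example_config)
print(special_tokens)
```

Normalizing both shapes to a plain string up front keeps the rest of a pre-processing script independent of how a given model's configuration was exported.
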
The following fields in `tokenizer.json` are supported in onnxruntime-extensions:

- `add_bos_token`: whether to add the beginning-of-sequence token.
- `add_eos_token`: whether to add the end-of-sequence token.
- `added_tokens`: the list of added tokens.
- `normalizer`: the normalizer; only two normalizers are supported: `Replace` and `precompiled_charsmap`.
- `pre_tokenizer`: not supported.
- `post_processor`: post-processes the tokenized sequence; only adding the BOS/EOS tokens in the post-processor is supported.
- `decoder/decoders`: the decoder steps; only the `Replace` decoder step is supported.
- `model/type`: the type of the model; only `BPE` is supported.
- `model/vocab`: the vocabulary of the model.
- `model/merges`: the merges of the model.
- `model/end_of_word_suffix`: the end-of-word suffix.
- `model/continuing_subword_prefix`: the continuing-subword prefix.
- `model/byte_fallback`: not supported.
- `model/unk_token_id`: the ID of the unknown token.

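The list above can be turned into a quick pre-flight check before a `tokenizer.json` is handed to onnxruntime-extensions. The sketch below is only an illustration in plain Python: the function name `compatibility_notes` and the sample data are hypothetical, and it only reports fields the list marks as unsupported. How the library itself reacts to such fields (ignoring them or raising an error) is not stated here, so the function returns notes instead of failing.

```python
def compatibility_notes(tokenizer):
    """Collect notes about tokenizer.json fields listed as unsupported.

    `tokenizer` is the parsed content of tokenizer.json (a dict).
    """
    notes = []
    model = tokenizer.get("model") or {}

    # Only the BPE model type is listed as supported.
    model_type = model.get("type")
    if model_type != "BPE":
        notes.append(f"model/type is {model_type!r}; only 'BPE' is listed as supported")

    # pre_tokenizer is listed as not supported.
    if tokenizer.get("pre_tokenizer") is not None:
        notes.append("pre_tokenizer is set; it is listed as not supported")

    # model/byte_fallback is listed as not supported.
    if model.get("byte_fallback"):
        notes.append("model/byte_fallback is enabled; it is listed as not supported")

    return notes


# Hypothetical, heavily trimmed tokenizer.json content.
example_tokenizer = {
    "added_tokens": [{"id": 0, "content": "<s>", "special": True}],
    "pre_tokenizer": {"type": "ByteLevel"},
    "model": {
        "type": "BPE",
        "vocab": {"<s>": 0, "h": 1, "i": 2, "hi": 3},
        "merges": ["h i"],
        "byte_fallback": True,
    },
}

notes = compatibility_notes(example_tokenizer)
for note in notes:
    print(note)
```

A real check would also need to inspect the `normalizer`, `post_processor`, and `decoder` entries. They are left out because the exact JSON shapes the library accepts are not described in this document.
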
`tokenizer_module.json` is an optional file, defined by onnxruntime-extensions, that contains the user-customized Python module information of the tokenizer. The following fields are supported:

- `tiktoken_file`: the path of the tiktoken vocab file, which is base64 encoded.
- `added_tokens`: same as `added_tokens` in `tokenizer.json`. If `tokenizer.json` does not contain `added_tokens`, or the file does not exist, the user can supply this field here.

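To make the two fields concrete, here is a small sketch of what a `tokenizer_module.json` might contain and how a base64-encoded tiktoken vocab can be read. It rests on several assumptions: the file name `vocab.tiktoken`, the sample tokens, and the helper `parse_tiktoken_text` are hypothetical, and the vocab is assumed to follow the usual tiktoken layout of one `<base64 token> <rank>` pair per line. The file may also accept fields beyond the two documented above.

```python
import base64
import json


def parse_tiktoken_text(text):
    """Parse tiktoken vocab text into a {token_bytes: rank} mapping.

    Assumption: each non-empty line is "<base64 token> <rank>".
    """
    ranks = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        token_b64, rank = line.split()
        ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks


# Build a tiny in-memory vocab instead of reading a real file.
sample_tokens = [b"hello", b" world"]
sample_text = "\n".join(
    f"{base64.b64encode(token).decode('ascii')} {rank}"
    for rank, token in enumerate(sample_tokens)
)
ranks = parse_tiktoken_text(sample_text)

# Hypothetical tokenizer_module.json using only the two documented fields.
module_config = {
    "tiktoken_file": "vocab.tiktoken",
    "added_tokens": [
        {"id": 2, "content": "<|endoftext|>", "special": True},
    ],
}
module_json = json.dumps(module_config, indent=2)
print(module_json)
```

Supplying `added_tokens` here is mainly useful for tiktoken-based models, which often have no `tokenizer.json` from which the added tokens could be read.
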