microsoft/onnxruntime-extensions
Publicmirrored fromhttps://github.com/microsoft/onnxruntime-extensionsAvailable
docs/c_api.md
70lines · modecode
| 1 | # ONNXRuntime Extensions C ABI |
| 2 | |
| 3 | ## Introduction |
| 4 | |
| 5 | <span style="color:red">The C APIs in onnxruntime-extensions are experimental and subject to change.</span> |
| 6 | |
| 7 | ONNXRuntime Extensions provides a C-style ABI for pre-processing. It offers support for tokenization, image processing, speech feature extraction, and more. You can compile the ONNXRuntime Extensions as either a static library or a dynamic library to access these APIs. |
| 8 | |
| 9 | The C ABI header files are named `ortx_*.h` and can be found in the include folder. There are three types of data processing APIs available: |
| 10 | |
| 11 | - [`ortx_tokenizer.h`](../include/ortx_tokenizer.h): Provides tokenization for LLM models. |
| 12 | - [`ortx_processor.h`](../include/ortx_processor.h): Offers image processing APIs for multimodel models. |
| 13 | - [`ortx_extraction.h`](../include/ortx_extractor.h): Provides speech feature extraction for audio data processing to assist the Whisper model. |
| 14 | |
| 15 | ## ABI QuickStart |
| 16 | |
| 17 | Most APIs accept raw data inputs such as audio, image compressed binary formats, or UTF-8 encoded text for tokenization. |
| 18 | |
| 19 | **Tokenization:** You can create a tokenizer object using `OrtxCreateTokenizer` and then use the object to tokenize a text or decode the token ID into the text. A C-style code snippet is available [here](../test/pp_api_test/test_tokenizer.cc#L448). |
| 20 | |
| 21 | **Image processing:** `OrtxCreateProcessor` can create an image processor object from a pre-defined workflow in JSON format to process image files into a tensor-like data type. An example code snippet can be found [here](../test/pp_api_test/test_processor.cc#L16). |
| 22 | |
| 23 | **Audio feature extraction:** `OrtxCreateSpeechFeatureExtractor` creates a speech feature extractor to obtain log mel spectrum data as input for the Whisper model. An example code snippet can be found [here](../test/pp_api_test/test_feature_extraction.cc#L15). |
| 24 | |
| 25 | **Note:** To build onnxruntime-extensions as a shared library with full functionality, ensure the `OCOS_ENABLE_AUDIO` and `OCOS_ENABLE_GPT2_TOKENIZER` build flags are enabled. For a minimal build with selected operators, you can use the static library version and disable the shared library build by setting `-DOCOS_BUILD_SHARED_LIB=OFF`. |
| 26 | |
| 27 | **Note:** A simple Python wrapper for these C APIs is available in [pp_api](../onnxruntime_extensions/pp_api.py), but it is not included in the default build. To enable it, use the extra build option `--config-settings "ortx-user-option=pp-api,no-opencv"`. For example, you can install it with the following command: `python3 -m pip install --config-settings "ortx-user-option=pp-api,no-opencv" git+https://github.com/microsoft/onnxruntime-extensions.git`. The following code demonstrates how to use the Python API to validate the tokenizer output. |
| 28 | |
| 29 | ```Python |
| 30 | from onnxruntime_extensions.pp_api import Tokenizer |
| 31 | # the name can be the same one used by Huggingface transformers.AutoTokenizer |
| 32 | pp_tok = Tokenizer('google/gemma-2-2b') |
| 33 | print(pp_tok.tokenize("what are you? \n 给 weiss ich, über was los ist \n")) |
| 34 | ``` |
| 35 | |
| 36 | ## Hugging Face Tokenizer Data Compatibility |
| 37 | |
| 38 | In the C API build, onnxruntime-extensions can directly load Hugging Face tokenizer data. Typically, a Hugging Face tokenizer includes `tokenizer.json` and `tokenizer_config.json` files, unless the model author has fully customized the tokenizer. The onnxruntime-extensions can seamlessly load these files. The following sections describe the supported fields in `tokenizer_config.json` and `tokenizer.json`. |
| 39 | |
| 40 | 1) The fields in `tokenizer_config.json` can affect the results of the onnxruntime-extensions tokenizer: |
| 41 | |
| 42 | - `model_max_length`: the maximum length of the tokenized sequence. |
| 43 | - `bos_token`: the beginning of the sequence token, both `string` and `object` types are supported. |
| 44 | - `eos_token`: the end of the sequence token, both `string` and `object` types are supported. |
| 45 | - `unk_token`: the unknown token, both `string` and `object` types are supported. |
| 46 | - `pad_token`: the padding token, both `string` and `object` types are supported. |
| 47 | - `clean_up_tokenization_spaces`: whether to clean up the tokenization spaces. |
| 48 | - `tokenizer_class`: the tokenizer class. |
| 49 | |
| 50 | 2) The fields in `tokenizer.json` can affect the results of the onnxruntime-extensions tokenizer: |
| 51 | |
| 52 | - `add_bos_token`: whether to add the beginning of the sequence token. |
| 53 | - `add_eos_token`: whether to add the end of the sequence token. |
| 54 | - `added_tokens`: the list of added tokens. |
| 55 | - `normalizer`: the normalizer, only 2 normalizers are supported, `Replace` and `precompiled_charsmap`. |
| 56 | - `pre_tokenizer`: Not directly used, but some properties can be inferred from other fields such as `decoders`. |
| 57 | - `post_processor`: `add_bos_token` and `add_eos_token` can be inferred from the `post_processor` field. |
| 58 | - `decoder/decoders`: the decoders, only `Replace` and `Strip` decoder steps are checked. |
| 59 | - `model/type`: the type of the model. If the type is missing, it will be treated as a Unigram model. Otherwise, the value of `model/type`. |
| 60 | - `model/vocab`: the vocabulary of the model. |
| 61 | - `model/merges`: the merges of the model. |
| 62 | - `model/end_of_word_suffix`: the end of the word suffix. |
| 63 | - `model/continuing_subword_prefix`: the continuing subword prefix. |
| 64 | - `model/byte_fallback`: Not supported. |
| 65 | - `model/unk_token_id`: the id of the unknown token. |
| 66 | |
| 67 | 3) `tokenizer_module.json` is a file that contains the user customized Python module information of the tokenizer, which is defined by onnxruntime-extensions, which is optional. The following fields are supported: |
| 68 | |
| 69 | - `tiktoken_file`: the path of the tiktoken file base64 encoded vocab file, which can be loaded by `OrtxCreateTokenizer` too. |
| 70 | - `added_tokens`: same as `tokenizer.json`. If `tokenizer.json` does not contain `added_tokens` or the file does not exist, this field can be input by the user. |
| 71 | |