microsoft/onnxruntime-extensions

Public

Mirrored from https://github.com/microsoft/onnxruntime-extensions

9edf572de9b3e5eb261ca06060e6b2e4ab4012df

docs/custom_ops.md


# Operators


## Natural language operators

### BertTokenizer

<details>
<summary>BertTokenizer details</summary>

BertTokenizer replicates the `encode_plus` function of [BertTokenizer (Hugging Face version)](https://huggingface.co/transformers/_modules/transformers/models/bert/tokenization_bert.html#BertTokenizer).

#### Inputs

***text: tensor(string)*** The string tensor for tokenization

#### Attributes

***vocab_file: string***

The content of the vocabulary file, in the same format as the Hugging Face vocabulary.

***do_lower_case: int64_t*** (default is 1, 1 represents True, 0 represents False)

Whether or not to lowercase the input when tokenizing.

***do_basic_tokenize: int64_t*** (default is 1, 1 represents True, 0 represents False)

Whether or not to do basic tokenization before WordPiece.

***unk_token: string***

The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.

***sep_token: string***

The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.

***pad_token: string***

The token used for padding, for example when batching sequences of different lengths.

***cls_token: string***

The classifier token, which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.

***mask_token: string***

The token used for masking values. This is the token used when training this model with masked language modeling, and it is the token which the model will try to predict.

***tokenize_chinese_chars: int64_t*** (default is 1, 1 represents True, 0 represents False)

Whether or not to tokenize Chinese characters.

***strip_accents: int64_t*** (default is 1, 1 represents True, 0 represents False)

Whether or not to strip all accents. If this option is not specified, it is determined by the value of `do_lower_case` (as in the original BERT).

***tokenize_punctuation: int64_t*** (default is 0, 1 represents True, 0 represents False)

Splits punctuation on a piece of text.

***remove_control_chars: int64_t*** (default is 0, 1 represents True, 0 represents False)

Removes control characters (such as NUL, BEL) from the text.

***truncation_strategy_name: string***

The name of the truncation strategy. It can be `longest_first`, `only_first`, `only_second`, or `longest_from_back`.

#### Outputs

***input_ids: tensor(int64_t)***

List of token ids.

***token_type_ids: tensor(int64_t)***

List of token type ids.

***attention_mask: tensor(int64_t)***

List of indices specifying which tokens should be attended to by the model.


#### Examples

```python
import numpy as np
import onnx
import transformers

bert_cased_tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-cased')

node = onnx.helper.make_node(
    'BertTokenizer',
    inputs=['text'],
    outputs=['input_ids', 'token_type_ids', 'attention_mask'],
)

text = "Hello world louder"
inputs = np.array([text], dtype=object)

# encode_plus returns the three fields that the operator outputs
bert_tokenize_result = bert_cased_tokenizer.encode_plus(text)

input_ids = np.array(bert_tokenize_result['input_ids'], dtype=np.int64)
token_type_ids = np.array(bert_tokenize_result['token_type_ids'], dtype=np.int64)
attention_mask = np.array(bert_tokenize_result['attention_mask'], dtype=np.int64)

# `expect` is the test helper used by the examples in this document; it is not defined here
expect(node, inputs=[inputs],
       outputs=[input_ids, token_type_ids, attention_mask], name='test_bert_tokenizer')
```
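
To make the three outputs more concrete, here is a minimal pure-Python sketch of how a BERT-style encoder lays out a single sequence or a sequence pair once the text has been converted to ids. The helper name `build_bert_inputs` and the default ids (101 for `[CLS]`, 102 for `[SEP]`) are assumptions chosen for illustration; the operator reads the real ids from `vocab_file`, and its handling of padding and truncation is not covered here.

```python
def build_bert_inputs(ids_a, ids_b=None, cls_id=101, sep_id=102):
    """Assemble input_ids, token_type_ids and attention_mask (hypothetical helper)."""
    # Single sequence layout: [CLS] A [SEP]
    input_ids = [cls_id] + list(ids_a) + [sep_id]
    token_type_ids = [0] * len(input_ids)
    if ids_b is not None:
        # Pair layout: [CLS] A [SEP] B [SEP], with the second segment marked as 1
        input_ids += list(ids_b) + [sep_id]
        token_type_ids += [1] * (len(ids_b) + 1)
    # No padding is added in this sketch, so every position is attended to
    attention_mask = [1] * len(input_ids)
    return input_ids, token_type_ids, attention_mask
```

Calling `build_bert_inputs([7, 8])` yields three lists of length four, one entry per token including the two special tokens.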
</details>

### BertTokenizerDecoder

<details>
<summary>BertTokenizerDecoder details</summary>

BertTokenizerDecoder replicates the `decode` function of [BertTokenizer (Hugging Face version)](https://huggingface.co/transformers/_modules/transformers/models/bert/tokenization_bert.html#BertTokenizer).

#### Inputs

***token_ids: tensor(int64)***

List of tokenized input ids.

***indices: tensor(int64)***

List of `[start_position, end_position]` pairs indicating which segments of the input ids should be decoded. This input is only enabled when the attribute `use_indices` is 1.

Usually, it is used to decode the slot in the text.

#### Attributes

***vocab_file: string***

The content of the vocabulary file, in the same format as the Hugging Face vocabulary.

***unk_token: string***

The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.

***sep_token: string***

The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.

***pad_token: string***

The token used for padding, for example when batching sequences of different lengths.

***cls_token: string***

The classifier token, which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.

***mask_token: string***

The token used for masking values. This is the token used when training this model with masked language modeling, and it is the token which the model will try to predict.

***suffix_indicator: string***

The suffix indicator.

***use_indices: int64_t***

Whether to use the second input (`indices`).

***skip_special_tokens: int64_t***

Whether or not to remove special tokens in the decoding.

***clean_up_tokenization_spaces: int64_t***

Whether or not to clean up the tokenization spaces.

#### Outputs

***sentences: tensor(string)***

The decoded sentences.

#### Examples


```python
import numpy as np
import onnx
import transformers

def get_file_content(path):
    with open(path, "rb") as file:
        return file.read()

bert_cased_tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-cased')
# save_vocabulary('.', 'bert') writes ./bert-vocab.txt
bert_cased_tokenizer.save_vocabulary('.', 'bert')


node = onnx.helper.make_node(
    'BertTokenizerDecoder',
    inputs=['token_ids'],
    outputs=['sentences'],
    vocab_file=get_file_content("bert-vocab.txt")
)

text = "Hello world louder"
# The decoder consumes token ids, so convert the word pieces to ids first
tokens = bert_cased_tokenizer.tokenize(text)
token_ids = np.array(bert_cased_tokenizer.convert_tokens_to_ids(tokens), dtype=np.int64)
sentences = np.array([text], dtype=object)


expect(node, inputs=[token_ids],
       outputs=[sentences], name='test_bert_tokenizer_decoder')
```
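
As a rough illustration of what decoding involves, the sketch below joins word pieces back into text: pieces that start with the suffix indicator are glued to the previous word, and special tokens are dropped. The function name `decode_wordpieces` and the default special-token list are assumptions for this example; it works on token strings instead of ids and does not model `use_indices` or `clean_up_tokenization_spaces`.

```python
def decode_wordpieces(tokens, suffix_indicator="##",
                      special_tokens=("[CLS]", "[SEP]", "[PAD]")):
    """Join word pieces into a sentence (hypothetical helper, not the operator itself)."""
    words = []
    for token in tokens:
        if token in special_tokens:
            continue  # mimics skip_special_tokens=1
        if token.startswith(suffix_indicator) and words:
            # A continuation piece extends the previous word
            words[-1] += token[len(suffix_indicator):]
        else:
            words.append(token)
    return " ".join(words)
```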
</details>



### GPT2Tokenizer

<details>
<summary>GPT2Tokenizer details</summary>

GPT2Tokenizer performs byte-level BPE tokenization on the input tensor, based on the [Hugging Face version](https://huggingface.co/transformers/_modules/transformers/tokenization_gpt2.html).

#### Attributes

***vocab***

The **content** of the vocabulary file, in the same format as the [Hugging Face file](https://huggingface.co/gpt2/resolve/main/vocab.json).

***merges***

The **content** of the merges file, in the same format as the [Hugging Face file](https://huggingface.co/gpt2/resolve/main/merges.txt).

***padding_length(optional)***

When the input is a batch of queries, the tokenized result is a ragged tensor, so it has to be padded into a dense tensor. `padding_length` selects the padding strategy. When `padding_length` equals -1, every row is padded to the length of the longest row. When `padding_length` is greater than 0, every row is padded to `padding_length`.

The default value of `padding_length` is -1.
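
The two padding modes can be sketched in plain Python as follows. This is an illustration only: the helper name `pad_ragged`, the pad id of 0, the mask convention (1 for real tokens, 0 for padding) and the decision to reject rows longer than `padding_length` are all assumptions, since the text above does not specify them.

```python
def pad_ragged(rows, padding_length=-1, pad_id=0):
    """Pad ragged id rows into a dense batch and build a matching mask (hypothetical helper)."""
    if padding_length == -1:
        # Pad to the longest row in the batch
        target = max((len(row) for row in rows), default=0)
    elif padding_length > 0:
        # Pad to a fixed length
        target = padding_length
    else:
        raise ValueError("padding_length must be -1 or a positive integer")

    input_ids, attention_mask = [], []
    for row in rows:
        if len(row) > target:
            # Truncation behavior is not documented, so this sketch refuses to guess
            raise ValueError("row is longer than padding_length")
        pad = target - len(row)
        input_ids.append(list(row) + [pad_id] * pad)
        attention_mask.append([1] * len(row) + [0] * pad)
    return input_ids, attention_mask
```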

#### Inputs

***data: tensor(string)***

The string tensor for tokenization

#### Outputs

***input_ids: tensor(int64)***

The tokenized ids of the input

***attention_mask: tensor(int64)***

A tensor indicating which part of `input_ids` is padding.

#### Examples


```python
import numpy as np
import onnx

def get_file_content(path):
    with open(path, "rb") as file:
        return file.read()

# vocabulary_file and merges_file are paths to local copies of vocab.json and merges.txt
node = onnx.helper.make_node(
    'GPT2Tokenizer',
    inputs=['x'],
    outputs=['y'],
    vocab=get_file_content(vocabulary_file),
    merges=get_file_content(merges_file)
)

x = np.array(["hey cortana"], dtype=object)
y = np.array([20342, 12794, 2271], dtype=np.int64)

expect(node, inputs=[x], outputs=[y],
       name='test_gpt2_tokenizer')
```
</details>

### WordpieceTokenizer

<details>
<summary>WordpieceTokenizer details</summary>


WordpieceTokenizer performs WordPiece tokenization on the input tensor,
based on the [Hugging Face version](https://huggingface.co/transformers/model_doc/bert.html#WordpieceTokenizer).
[WordpieceTokenizer](https://github.com/tensorflow/text/blob/master/docs/api_docs/python/text/WordpieceTokenizer.md)
from *tensorflow_text* can be implemented by a pair of nodes:
*StringRegexSplitWithOffsets* followed by *WordpieceTokenizer*.

#### Attributes

***vocab***

The **content** of the vocabulary file, in the same format as the
[Hugging Face file](https://huggingface.co/gpt2/resolve/main/vocab.json).

***suffix_indicator***

Suffix added to a token that is not in the first position before looking it up in the vocabulary.

***unk_token***

Unknown token. Every token not found in the vocabulary is replaced by this one.

***max_input_chars_per_word***

Maximum number of characters per token (optional, defaults to 200).

#### Inputs

***data: tensor(string)***

The string tensor for tokenization

***row_indices: tensor(int64)*** Empty, or the indices of every first token of the input sentences.
`indices[i+1] - indices[i]` is the number of tokens in input `i`.

[WordpieceTokenizer](https://github.com/tensorflow/text/blob/master/docs/api_docs/python/text/WordpieceTokenizer.md)
includes two steps. The first one splits sentences into words and the second one splits
every word into tokens. This operator only implements the second step.
The first one can be done with the operator *StringRegexSplitWithOffsets*.
This parameter can either be empty or it can be the third output
of the operator *StringRegexSplitWithOffsets*.

#### Outputs

***tokens: tensor(string)*** Every token.

***token_indices: tensor(int32)*** Indices of each token. -1 means a token outside the vocabulary.

***row_indices: tensor(int64)*** Indices of every first token of the input sentences.
`indices[i+1] - indices[i]` is the number of tokens in input `i`.
These are the updated row indices given as inputs, or new ones if the second input is empty.

#### Examples


```python
import json

import numpy as np
from onnx import helper, onnx_pb as onnx_proto

words = ["want", "##want",
         "##ed", "wa", "un", "runn", "##ing"]
vocab = {w: i + 10 for i, w in enumerate(words)}
st = json.dumps(vocab)
domain = 'ai.onnx.contrib'
mkv = helper.make_tensor_value_info
reg = helper.make_tensor(
    "pattern", onnx_proto.TensorProto.STRING, [1, ], ["(\\s)".encode('ascii')])
reg_empty = helper.make_tensor(
    "keep_pattern", onnx_proto.TensorProto.STRING, [0, ], [])

nodes = [
    helper.make_node(
        'StringRegexSplitWithOffsets',
        inputs=['text', 'pattern', 'keep_pattern'],
        outputs=['words', 'begin_end', 'indices'],
        name='StringRegexSplitOpName',
        domain=domain),
    helper.make_node(
        'WordpieceTokenizer',
        inputs=['words', 'indices'],
        outputs=['out0', 'out1', 'out2'],
        name='WordpieceTokenizerOpName',
        domain=domain,
        vocab=st.encode('utf-8'),
        suffix_indicator="##",
        unk_token="[UNK]")
]
inputs = [mkv('text', onnx_proto.TensorProto.STRING, [None])]
graph = helper.make_graph(
    nodes, 'test0', inputs, [
        mkv('out0', onnx_proto.TensorProto.STRING, [None]),
        mkv('out1', onnx_proto.TensorProto.INT32, [None]),
        mkv('out2', onnx_proto.TensorProto.INT64, [None]),
        mkv('words', onnx_proto.TensorProto.STRING, [None]),
        mkv('indices', onnx_proto.TensorProto.INT64, [None])],
    [reg, reg_empty])
model = helper.make_model(
    graph, opset_imports=[helper.make_operatorsetid(domain, 1)])

text = np.array(["unwanted running", "unwantedX running"], dtype=object)
tokens = np.array(['un', '##want', '##ed', 'runn', '##ing', 'un', '##want', '##ed',
                   '[UNK]', 'runn', '##ing'], dtype=object)
indices = np.array([14, 11, 12, 15, 16, 14, 11, 12, -1, 15, 16], dtype=np.int32)
row_indices = np.array([0, 5, 11], dtype=np.int64)

expect(model, inputs=[text], outputs=[tokens, indices, row_indices],
       name='test_bert_tokenizer')
```
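
The second step (splitting each word into pieces) is a greedy longest-match-first search. The pure-Python sketch below is written to reproduce the expected outputs of the example above, including the `[UNK]` piece emitted after `un`, `##want`, `##ed` for `unwantedX`. It is an illustration under assumptions, not the operator's implementation: the helper names are hypothetical, words are split on whitespace only, and a word longer than `max_input_chars_per_word` is assumed to map to a single unknown token.

```python
def wordpiece(word, vocab, suffix_indicator="##", unk_token="[UNK]",
              max_input_chars_per_word=200):
    """Greedy longest-match-first split of one word; returns (tokens, ids)."""
    if len(word) > max_input_chars_per_word:
        return [unk_token], [-1]
    tokens, ids = [], []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = suffix_indicator + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # Nothing matches the remainder: emit the unknown token and stop
            tokens.append(unk_token)
            ids.append(-1)
            break
        tokens.append(piece)
        ids.append(vocab[piece])
        start = end
    return tokens, ids


def tokenize_sentences(sentences, vocab):
    """Tokenize whitespace-separated sentences and track where each sentence starts."""
    tokens, ids, row_indices = [], [], [0]
    for sentence in sentences:
        for word in sentence.split():
            word_tokens, word_ids = wordpiece(word, vocab)
            tokens += word_tokens
            ids += word_ids
        row_indices.append(len(tokens))
    return tokens, ids, row_indices
```

With the vocabulary from the example, `tokenize_sentences(["unwanted running", "unwantedX running"], vocab)` returns the same tokens, indices and row indices that the example expects.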

</details>

### SentencepieceTokenizer

<details>
<summary>SentencepieceTokenizer details</summary>

SentencepieceTokenizer replicates [SentencepieceTokenizer](https://github.com/tensorflow/text/blob/master/docs/api_docs/python/text/SentencepieceTokenizer.md).

#### Inputs

***data: tensor(string)*** The string tensor for tokenization

***nbest_size: tensor(int64)*** A scalar for sampling. `nbest_size = {0,1}`: no sampling is performed (default).
`nbest_size > 1`: samples from the nbest_size results. `nbest_size < 0`: assumes that
nbest_size is infinite and samples from all hypotheses (lattice) using the
forward-filtering-and-backward-sampling algorithm.

***alpha: tensor(float)*** A scalar for a smoothing parameter. Inverse temperature for probability rescaling.

***reverse: tensor(bool)*** Reverses the tokenized sequence (Default = false)

***add_bos: tensor(bool)*** Add beginning of sentence token to the result (Default = false)

***add_eos: tensor(bool)*** Add end of sentence token to the result (Default = false).
When reverse=True, beginning/end of sentence tokens are added after reversing.

#### Attributes

***model: string*** The sentencepiece model serialized proto, stored as a string.

#### Outputs

***tokens: tensor(int32)*** Indices of each token (the tokenized result of the input).

***indices: tensor(int64)*** Indices of every first token of the input sentences.
`indices[i+1] - indices[i]` is the number of tokens in input `i`.

#### Examples


```python
import base64
import urllib.request

import numpy as np
import onnx

url = "https://github.com/microsoft/ort-customops/raw/main/test/data/test_sentencepiece_ops_model__6.txt"
with urllib.request.urlopen(url) as f:
    content = f.read()
# f.read() already returns bytes; decode the base64 payload into the serialized model
model = base64.decodebytes(content)

node = onnx.helper.make_node(
    'SentencepieceTokenizer',
    inputs=['inputs', 'nbest_size', 'alpha', 'add_bos', 'add_eos', 'reverse'],
    outputs=['tokens', 'indices'],
    model=model,
    domain='ai.onnx.contrib'
)

inputs = np.array(["Hello world", "Hello world louder"], dtype=object)
nbest_size = np.array([0], dtype=np.int64)
alpha = np.array([0], dtype=np.float32)
add_bos = np.array([0], dtype=np.bool_)
add_eos = np.array([0], dtype=np.bool_)
reverse = np.array([0], dtype=np.bool_)

tokens = np.array([17486, 1017, 17486, 1017, 155, 21869], dtype=np.int32)
indices = np.array([0, 2, 6], dtype=np.int64)

expect(node, inputs=[inputs, nbest_size, alpha, add_bos, add_eos, reverse],
       outputs=[tokens, indices], name='sp')
```
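
Because the tokens of all sentences come back as one flat tensor, the `indices` output is what lets a caller recover the per-sentence grouping. A minimal sketch of that step in plain Python, with a hypothetical helper name `split_rows`:

```python
def split_rows(tokens, indices):
    """Group a flat token list into one list per input sentence."""
    # indices[i] is where sentence i starts; indices[i + 1] is where it ends
    return [list(tokens[indices[i]:indices[i + 1]]) for i in range(len(indices) - 1)]
```

Applied to the expected outputs above, `split_rows(tokens, indices)` gives two lists with two and four token ids respectively.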
</details>


### BasicTokenizer

<details>
<summary>BasicTokenizer details</summary>

TODO: is this still supported?

BasicTokenizer performs basic tokenization on the input string tensor, based on the [basic tokenizer in BertTokenizer (Hugging Face version)](https://huggingface.co/transformers/_modules/transformers/models/bert/tokenization_bert.html#BertTokenizer).

#### Inputs

***text: tensor(string)*** The string tensor for tokenization

#### Attributes

***do_lower_case: int64_t*** (default is 1, 1 represents True, 0 represents False)

Whether or not to lowercase the input when tokenizing.

***tokenize_chinese_chars: int64_t*** (default is 1, 1 represents True, 0 represents False)

Whether or not to tokenize Chinese characters.

***strip_accents: int64_t*** (default is 1, 1 represents True, 0 represents False)

Whether or not to strip all accents. If this option is not specified, it is determined by the value of `do_lower_case` (as in the original BERT).

***tokenize_punctuation: int64_t*** (default is 0, 1 represents True, 0 represents False)

Splits punctuation on a piece of text.

***remove_control_chars: int64_t*** (default is 0, 1 represents True, 0 represents False)

Removes control characters (such as NUL, BEL) from the text.

#### Outputs

***tokens: tensor(string)*** Tokenized tokens.

#### Examples

```python
import numpy as np
import onnx
import transformers

tokenizer = transformers.BasicTokenizer()

node = onnx.helper.make_node(
    'BasicTokenizer',
    inputs=['text'],
    outputs=['tokens'],
)

text = "Hello world louder"
inputs = np.array([text], dtype=object)
# BasicTokenizer exposes `tokenize`, which returns a list of string tokens
tokens = np.array(tokenizer.tokenize(text), dtype=object)

expect(node, inputs=[inputs],
       outputs=[tokens], name='test_basic_tokenizer')
```
</details>


## String operators

### StringEqual

<details>
<summary>StringEqual details</summary>

Compares two strings and returns true if they are equal and false if not.

#### Inputs

***x: tensor(string)***

The first string input

***y: tensor(string)***

The second string input

#### Outputs

***z: tensor(bool)***

True where the two strings are equal, false otherwise.

</details>


### StringHash

<details>
<summary>StringHash details</summary>


Hashes the input string based on the number of buckets.

#### Inputs

***input: tensor(string)***

The string to hash

***num_buckets: tensor(int64)***

The number of buckets (must be equal to 1?)

#### Outputs

***name: tensor(int64)***

The hash value of the string

</details>


### StringHashFast

<details>
<summary>StringHashFast details</summary>


A faster implementation of StringHash.

</details>


### StringJoin

<details>
<summary>StringJoin details</summary>


Joins an array of strings.

#### Inputs

***input_X: tensor(string)***

The input array of strings

***input_sep: tensor(string)***

The string separator for the resulting join

***input_axis: tensor(int64)***

The axis along which to join

#### Outputs

***out: tensor(string)***

The resulting joined string

#### Examples


```python
input_X = [["a", "b", "c"], ["aa", "bb", ""]]
input_sep = ";"
input_axis = 1

out = ["a;b;c", "aa;bb;"]

input_axis = 0

out = ['a;aa', 'b;bb', 'c;']
```
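
The axis semantics shown above can be reproduced with a few lines of plain Python. This is a sketch for the two-dimensional case only, and the function name `string_join` is hypothetical:

```python
def string_join(input_x, sep, axis):
    """Join a 2-D list of strings along `axis` (0 or 1)."""
    if axis == 1:
        # Join the elements of each row
        return [sep.join(row) for row in input_x]
    if axis == 0:
        # Join the elements of each column
        return [sep.join(column) for column in zip(*input_x)]
    raise ValueError("this sketch only handles axis 0 or 1 of a 2-D input")
```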
</details>


### StringRegexReplace

<details>
<summary>StringRegexReplace details</summary>


String replacement based on [Re2-format](https://github.com/google/re2/wiki/Syntax) regular expressions.

#### Inputs

***text: tensor(string)***

String tensor in which matches are replaced.

***pattern: tensor(string)***

Pattern of the regular expression.

***rewrite: tensor(string)***

Replacement.

#### Attributes

***global_replace: int64*** (default is 1)

Whether to replace all substrings matching the pattern (1) or only the first one (0).

#### Outputs

***output: tensor(string)***

String with replacements.

#### Examples

```python
import numpy as np
import onnx

node = onnx.helper.make_node(
    'StringRegexReplace',
    inputs=['text', 'pattern', 'rewrite'],
    outputs=['y'],
)

text = np.array([['def myfunc():'], ['def dummy():']])
pattern = np.array([r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):'])
rewrite = np.array([r'static PyObject* py_\1(void) {'])
y = [['static PyObject* py_myfunc(void) {'],
     ['static PyObject* py_dummy(void) {']]

expect(node, inputs=[text, pattern, rewrite], outputs=[y],
       name='test_string_regex_replace')
```
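
For intuition, the same replacement can be approximated with Python's `re` module. This is only a sketch: Python's regex syntax is close to RE2 for simple patterns like this one but is not identical, the helper name `regex_replace` is hypothetical, and mapping `global_replace=0` to "first match only" is an assumption based on the attribute description.

```python
import re


def regex_replace(texts, pattern, rewrite, global_replace=1):
    """Apply one regex replacement to every string in `texts`."""
    # In re.sub, count=0 means "replace every match"
    count = 0 if global_replace else 1
    compiled = re.compile(pattern)
    return [compiled.sub(rewrite, text, count=count) for text in texts]
```

Both the pattern and the `\1` group reference from the example above work unchanged with this sketch.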

</details>

### StringECMARegexReplace

<details>
<summary>StringECMARegexReplace details</summary>

String replacement based on [ECMA-format](https://en.cppreference.com/w/cpp/regex/ecmascript) regular expressions.

#### Inputs

***text: tensor(string)***

String tensor in which matches are replaced.

***pattern: tensor(string)***

Pattern of the regular expression.

***rewrite: tensor(string)***

Replacement.

#### Attributes

***global_replace: int64*** (default is 1)

Whether to replace all substrings matching the pattern (1) or only the first one (0).


***ignore_case: int64*** (default is 0)

Whether to ignore case when matching the pattern.

#### Outputs

***output: tensor(string)***

String with replacements.

#### Examples


```python
import numpy as np
import onnx

node = onnx.helper.make_node(
    'StringECMARegexReplace',
    inputs=['text', 'pattern', 'rewrite'],
    outputs=['y'],
)

text = np.array([['def myfunc():'], ['def dummy():']])
pattern = np.array([r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):'])
# ECMAScript syntax refers to capture groups with $1 instead of \1
rewrite = np.array([r'static PyObject* py_$1(void) {'])
y = [['static PyObject* py_myfunc(void) {'],
     ['static PyObject* py_dummy(void) {']]

expect(node, inputs=[text, pattern, rewrite], outputs=[y],
       name='test_string_ecma_regex_replace')
```

</details>



### StringSplit

TODO

### StringUpper

TODO

### StringLower

TODO

### StringLength

<details>
<summary>StringLength details</summary>

Gets the length of each string element in the input tensor, similar to the function `len("abcde")` in Python.

#### Inputs

***data: tensor(string)***

String tensor whose elements are measured.

#### Outputs

***output: tensor(int64)***

Data length tensor.

#### Examples


```python
import numpy as np
import onnx

node = onnx.helper.make_node(
    'StringLength',
    inputs=['x'],
    outputs=['y']
)

x = ["abcdef", "hijkl"]
y = np.array([len(x[0]), len(x[1])], dtype=np.int64)


expect(node, inputs=[x], outputs=[y],
       name='test_string_length')
```
</details>

### StringConcat

<details>
<summary>StringConcat details</summary>

Concatenates the corresponding strings of two string tensors. The two input tensors must have the same shape.

```python
output = []
shape = input1.shape
input1 = input1.flatten()
input2 = input2.flatten()
for i in range(len(input1)):
    output.append(input1[i] + input2[i])
output = np.array(output).reshape(shape)
```

#### Inputs

***input_1: tensor(string)***

The first string tensor.

***input_2: tensor(string)***

The second string tensor.


#### Outputs

***output: tensor(string)***

The result.

#### Examples


```python
import numpy as np
import onnx

node = onnx.helper.make_node(
    'StringConcat',
    inputs=['x', 'y'],
    outputs=['result'],
)

x = np.array(["abcd", "efgh"])
y = np.array(["wxyz", "stuv"])
result = np.array([x[0] + y[0], x[1] + y[1]])

expect(node, inputs=[x, y], outputs=[result],
       name='test_string_concat')
```

</details>

### StringRegexSplitWithOffsets

<details>
<summary>StringRegexSplitWithOffsets details</summary>

Splits strings based on regular expressions.

#### Inputs

***text: tensor(string)***

String tensor to split.

***delim_regex_pattern: tensor(string)***

Splitting pattern of the regular expression.

***keep_delim_regex_pattern: tensor(string)***

By default, delimiters are not included in the split string results. Delimiters may be included by specifying a regex pattern `keep_delim_regex_pattern`.

#### Outputs

***words: tensor(string)*** Tensor of words.

***offsets: tensor(int64)*** 2D tensor with 3 columns:
sentence index, position of the first character, position of the last one (excluded)

***row_indices: tensor(int64)*** Indices of every first token of the input sentences.
`row_indices[i+1] - row_indices[i]` is the number of tokens in input `i`.
These are the updated row indices given as inputs, or new ones if the second input is empty.


#### Examples


```python
import numpy as np
import onnx

node = onnx.helper.make_node(
    'StringRegexSplitWithOffsets',
    inputs=['text', 'pattern', 'keep_pattern'],
    outputs=['words', 'offsets', 'row_indices'],
)

text = np.array(["hello there"])
pattern = np.array([r'\s'])
# The same pattern is passed as keep pattern, so the delimiter " " stays in the output
keep_pattern = np.array([r'\s'])
words = np.array(["hello", " ", "there"])
offsets = np.array([[0, 0, 5],
                    [0, 5, 6],
                    [0, 6, 11]], dtype=np.int64)
# One input sentence that produces three tokens
row_indices = np.array([0, 3], dtype=np.int64)

expect(node, inputs=[text, pattern, keep_pattern], outputs=[words, offsets, row_indices],
       name='test_string_regex_split_with_offsets')
```
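
The relationship between the three outputs can be sketched with Python's `re` module. Treat this as an approximation under assumptions: the helper name `regex_split_with_offsets` is hypothetical, Python's regex dialect stands in for the operator's, a delimiter is kept only when the keep pattern matches it entirely, and an empty keep pattern keeps nothing.

```python
import re


def regex_split_with_offsets(texts, delim_pattern, keep_delim_pattern=""):
    """Split each text on a delimiter regex; return (words, offsets, row_indices)."""
    words, offsets, row_indices = [], [], [0]
    delim = re.compile(delim_pattern)
    keep = re.compile(keep_delim_pattern) if keep_delim_pattern else None
    for row, text in enumerate(texts):
        pos = 0
        for match in delim.finditer(text):
            if match.end() == match.start():
                continue  # ignore empty matches
            if match.start() > pos:
                # Text between the previous delimiter and this one
                words.append(text[pos:match.start()])
                offsets.append([row, pos, match.start()])
            if keep is not None and keep.fullmatch(match.group()):
                # The delimiter itself is kept as a token
                words.append(match.group())
                offsets.append([row, match.start(), match.end()])
            pos = match.end()
        if pos < len(text):
            words.append(text[pos:])
            offsets.append([row, pos, len(text)])
        row_indices.append(len(words))
    return words, offsets, row_indices
```

For `"hello there"` with `\s` as both patterns, the sketch returns three words, three offset rows of the form `[sentence, begin, end)`, and row indices whose difference is the number of tokens in the sentence.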

</details>


### StringECMARegexSplitWithOffsets

TODO

### VectorToString

<details>
<summary>VectorToString details</summary>

VectorToString is the inverse operation of `StringToVector`; they share the same mapping table format:

    <string>\t<scalar_1>\s<scalar_2>\s<scalar_3>...<scalar_n>

An unmapped vector outputs the value of the attribute `unk`.

Example:

*Attributes:*

- `map`:
  ```
  a 0 0 1 2
  b 0 1 2 3
  d 0 1 3 4
  ```

- `unk`: "unknown_word"

*Inputs:*
- data: [[0,0,1,2],[0,1,3,4],[0,0,0,0]]

*Outputs:*
- output: ["a", "d", "unknown_word" ]

#### Attributes

***map***

The formatted mapping table.

***unk***

The result returned when a vector isn't found in the map.

#### Inputs

***data: tensor(T)***

Input tensor

#### Outputs

***output: tensor(string)***

The mapping result of the input

#### Type Constraints
***T:tensor(uint8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(int8), tensor(int16), tensor(int32), tensor(int64), tensor(bfloat16), tensor(float16), tensor(float), tensor(double), tensor(bool)***

Constrain input and output types to numerical tensors.


#### Examples


```python
import numpy as np
import onnx

# One entry per line: <string>\t<scalar_1> <scalar_2> ... <scalar_n>
mapping_table = "a\t0 0 1 2\nb\t0 1 2 3\nd\t0 1 3 4"

node = onnx.helper.make_node(
    'VectorToString',
    inputs=['x'],
    outputs=['y'],
    map=mapping_table,
    unk="unknown_word"
)


x = np.array([[0,0,1,2],[0,1,3,4],[0,0,0,0]], dtype=np.int64)
y = ["a", "d", "unknown_word"]


expect(node, inputs=[x], outputs=[y],
       name='test_vector_to_string')
```
</details>


### StringToVector

<details>
<summary>StringToVector details</summary>

StringToVector maps each string element in the input to the corresponding vector according to the mapping file. The mapping file is a UTF-8 encoded text file in TSV format:

    <string>\t<scalar_1>\s<scalar_2>\s<scalar_3>...<scalar_n>

An unmapped string outputs the value of the attribute `unmapping_value`.

Example:

*Attributes:*

- `mapping_file_name`: vocabulary.txt
  ```
  a 0 0 1 2
  b 0 1 2 3
  d 0 1 3 4
  ```

- `unmapping_value`: [0 0 0 0]

*Inputs:*
- data: ["a", "d", "e"]

*Outputs:*
- output: [[0,0,1,2],[0,1,3,4],[0,0,0,0]]

#### Attributes

***mapping_file_name:string***

The name of your string to vector mapping file.

***unmapping_value:list(int)***

Mapping result for an unmapped string

#### Inputs

***data: tensor(string)***

Input tensor

#### Outputs

***output: tensor(T)***

The mapping result of the input

#### Type Constraints
***T:tensor(uint8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(int8), tensor(int16), tensor(int32), tensor(int64), tensor(bfloat16), tensor(float16), tensor(float), tensor(double), tensor(bool)***

Constrain input and output types to numerical tensors.

#### Examples


```python
import numpy as np
import onnx

# The content of vocabulary.txt, one entry per line:
# <string>\t<scalar_1> <scalar_2> ... <scalar_n>
mapping_table = "a\t0 0 1 2\nb\t0 1 2 3\nd\t0 1 3 4"

node = onnx.helper.make_node(
    'StringToVector',
    inputs=['x'],
    outputs=['y'],
    mapping_table=mapping_table,
    unmapping_value=[0,0,0,0]
)


x = ["a", "d", "e"]
y = np.array([[0,0,1,2],[0,1,3,4],[0,0,0,0]], dtype=np.int64)


expect(node, inputs=[x], outputs=[y],
       name='test_string_to_vector')
```
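
To show how the mapping table drives both `StringToVector` and its inverse `VectorToString`, here is a small pure-Python sketch. It assumes integer scalars, a tab between the string and its vector, and whitespace between scalars; the helper names are hypothetical and the operators themselves support more element types.

```python
def parse_mapping_table(table):
    """Parse '<string>\\t<scalar_1> <scalar_2> ...' lines into a dict."""
    mapping = {}
    for line in table.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        key, values = line.split("\t")
        mapping[key] = [int(v) for v in values.split()]
    return mapping


def string_to_vector(strings, mapping, unmapping_value):
    """Look up each string; unknown strings get `unmapping_value`."""
    return [mapping.get(s, list(unmapping_value)) for s in strings]


def vector_to_string(vectors, mapping, unk):
    """Inverse lookup; unknown vectors get `unk`."""
    reverse = {tuple(v): k for k, v in mapping.items()}
    return [reverse.get(tuple(v), unk) for v in vectors]
```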

</details>



### StringSlice

<details>
<summary>StringSlice details</summary>

Applies a slice operation to each string element in the input tensor, similar to string slicing in Python:

```python
a = "abcdef"
b = a[1:2]
c = a[3:1:-1]
```

#### Inputs

***data: tensor(string)***

String tensor to extract slices from.

***starts: tensor(int64/int32)***

The tensor of starting indices for the corresponding strings in data. It has the same shape as data.

***ends: tensor(int64/int32)***

The tensor of ending indices for the corresponding strings in data. It has the same shape as data.

***steps(optional): tensor(int64/int32)***

The tensor of slice steps for the corresponding strings in data. It has the same shape as data. If steps is an empty tensor, the default value 1 is used for each string.

#### Outputs

***output: tensor(string)***

Sliced data tensor.

#### Examples


```python
import numpy as np
import onnx

node = onnx.helper.make_node(
    'StringSlice',
    inputs=['x', 'starts', 'ends', 'steps'],
    outputs=['y'],
)

x = np.array(["abcdef", "hijkl"])
y = np.array([x[0][1:3:1], x[1][3:1:-1]])
starts = np.array([1, 3], dtype=np.int64)
ends = np.array([3, 1], dtype=np.int64)
# The second string is sliced backwards, so its step is -1
steps = np.array([1, -1], dtype=np.int64)

expect(node, inputs=[x, starts, ends, steps], outputs=[y],
       name='test_string_slice')
```
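
The element-wise behavior described above maps directly onto Python's own slicing. A minimal sketch with a hypothetical helper name `string_slice`, where an empty or missing `steps` falls back to 1 for every string:

```python
def string_slice(data, starts, ends, steps=None):
    """Slice every string with its own start, end and step."""
    if not steps:
        # Mirrors the documented default of 1 for each string
        steps = [1] * len(data)
    return [s[b:e:k] for s, b, e, k in zip(data, starts, ends, steps)]
```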
1173
1174</details>
1175
1176
1177### MaskedFill
1178
1179<details>
1180<summary>MaskedFill details</summary>
1181
1182
1183Fills elements of self tensor with value where mask is True. The operator is similar with [`Tensor.masked_fill_`](https://pytorch.org/docs/stable/generated/torch.Tensor.masked_fill_.html#torch.Tensor.masked_fill_) in pytorch.
1184
1185
1186#### Inputs
1187
1188***value: tensor(string)***
1189
1190The value to fill in with, currently we only support string type and vector&scalar dimension.
1191
1192***mask: tensor(bool)***
1193
1194The boolean mask, the dimension of mask tensor should be same with value.
1195
1196#### Outputs
1197
1198***output: tensor(string)***
1199
1200The filled output of input tensor.
1201
1202
1203#### Examples
1204
1205
```python
import numpy as np
import onnx

node = onnx.helper.make_node(
    'MaskedFill',
    inputs=['value', 'mask'],
    outputs=['output']
)

value = np.array(["a", "b", "c", "d"])
mask = np.array([True, False, True, False], dtype=bool)
# Only the elements of `value` whose mask entry is True appear in the output.
output = np.array(["a", "c"])

# `expect` is a test helper (not defined here).
expect(node, inputs=[value, mask], outputs=[output],
       name='test_masked_fill')
```
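
Going by the example above, the output keeps only the elements of `value` whose mask entry is True. The sketch below reproduces that behavior in plain Python. `masked_select_reference` is a hypothetical name, and the logic is inferred from the example rather than taken from the operator's implementation.

```python
def masked_select_reference(value, mask):
    """Keep the entries of `value` whose mask entry is True (hypothetical helper)."""
    if len(value) != len(mask):
        raise ValueError("value and mask must have the same length")
    return [v for v, m in zip(value, mask) if m]


selected = masked_select_reference(["a", "b", "c", "d"], [True, False, True, False])
print(selected)
```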
</details>


### StringRaggedTensorToDense

TODO

### StringMapping

TODO

## Math operators


### Inverse

TODO

### NegPos

TODO

### SegmentExtraction

TODO

### SegmentSum

TODO

## Tensor operators

### RaggedTensorToSparse

TODO

### RaggedTensorToDense

TODO

### Template

<details>
<summary>Template details</summary>

Description

#### Inputs

***name: tensor(type)***

Description

#### Outputs

***name: tensor(type)***

Description

#### Examples

```python

node = onnx.helper.make_node(
    'StringRegexReplace',
    inputs=['text', 'pattern', 'rewrite'],
    outputs=['y'],
)

text = np.array([['def myfunc():'], ['def dummy():']])
pattern = np.array([r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):'])
rewrite = np.array([r'static PyObject* py_\1(void) {'])
y = [['static PyObject* py_myfunc(void) {'],
     ['static PyObject* py_dummy(void) {']]

expect(node, inputs=[text, pattern, rewrite], outputs=[y],
       name='test_string_regex_replace')
```

</details>


## Azure operators

Starting from onnxruntime-extensions 0.12, these Azure operators will be removed from the official onnxruntime-extensions packages. However, they can still be built from source using `cmake -DOCOS_ENABLE_AZURE=ON ...`.

### OpenAIAudioToText

<details>
<summary>OpenAIAudioToText details</summary>


The OpenAIAudioToText operator talks to [OpenAI audio](https://platform.openai.com/docs/api-reference/audio) endpoints.


#### Attributes

***model_uri:string***

Endpoint URI, like "https://api.openai.com/v1/audio/transcriptions".

***audio_format:string***

The format of the audio, by default "wav".

#### Inputs

***auth_token: tensor(string)***

An access token that comes with an OpenAI subscription.

***model_name: tensor(string)***

Model name to send to the endpoint, such as "whisper-1".

***response_format: tensor(string)***

Expected format of the response, either "text" or "json".

***audio_blob: tensor(uint8)***

A byte array containing raw data from the audio file.

#### Outputs

***transcriptions: tensor(string)***


#### Examples

Note: the OpenAIAudioToText operator composes a request from the last part of each input and output name, split by "/".

Input names must therefore follow this format:
- auth_token: "whatever-name-you-want-to-use"
- model_name: ".../.../.../model_name"
- response_format: ".../.../.../response_format"
- audio_blob: ".../.../.../file"

The output name must follow this format:
- transcriptions: ".../.../.../transcriptions"

Hence a model can contain multiple OpenAIAudioToText operators that accept different inputs and give different outputs.

The sample code below shows a complete model that follows this naming format.

```python

import os
import numpy as np

import onnx
from onnx import helper, TensorProto
from onnxruntime import InferenceSession, SessionOptions
from onnxruntime_extensions import get_library_path


openai_model_uri = os.getenv('URI', '')  # read URI from env
openai_auth_token = os.getenv('AUTH', '')  # read auth token from env
test_data_dir = '.'  # assumption: directory containing test16.wav; adjust as needed


def create_openai_audio_model():
    auth_token = helper.make_tensor_value_info('auth_token', TensorProto.STRING, [1])
    model_name = helper.make_tensor_value_info('node_1/model_name', TensorProto.STRING, [1])
    response_format = helper.make_tensor_value_info('node_1/response_format', TensorProto.STRING, [-1])
    file = helper.make_tensor_value_info('node_1/file', TensorProto.UINT8, [-1])
    transcriptions = helper.make_tensor_value_info('node_1/transcriptions', TensorProto.STRING, [-1])

    invoker = helper.make_node('OpenAIAudioToText',
                               ['auth_token', 'node_1/model_name', 'node_1/response_format', 'node_1/file'],  # names must follow the format
                               ['node_1/transcriptions'],  # names must follow the format
                               domain='com.microsoft.extensions',
                               name='audio_invoker',
                               model_uri=openai_model_uri,
                               audio_format='wav')

    graph = helper.make_graph([invoker], 'graph', [auth_token, model_name, response_format, file], [transcriptions])
    model = helper.make_model(graph,
                              opset_imports=[helper.make_operatorsetid('com.microsoft.extensions', 1)])

    onnx.save(model, 'openai_audio.onnx')


create_openai_audio_model()
opt = SessionOptions()
opt.register_custom_ops_library(get_library_path())
# Load the model from where create_openai_audio_model() saved it.
sess = InferenceSession('openai_audio.onnx',
                        opt, providers=["CPUExecutionProvider", "AzureExecutionProvider"])
auth_token = np.array([openai_auth_token])
model = np.array(['whisper-1'])
response_format = np.array(['text'])

with open(os.path.join(test_data_dir, "test16.wav"), "rb") as _f:
    audio_blob = np.asarray(list(_f.read()), dtype=np.uint8)
    ort_inputs = {
        "auth_token": auth_token,
        "node_1/model_name": model,
        "node_1/response_format": response_format,
        "node_1/file": audio_blob,
    }
    out = sess.run(None, ort_inputs)[0]
```
</details>


### AzureTextToText

<details>
<summary>AzureTextToText details</summary>


AzureTextToText talks to a GPT model hosted by [Azure OpenAI service](https://learn.microsoft.com/en-us/azure/ai-services/openai/).


#### Attributes

***model_uri:string***

Endpoint URI, like "https://myname-aoai-test.openai.azure.com/openai/deployments/mydeploy/chat/completions?api-version=2023-05-15".

#### Inputs

***auth_token: tensor(string)***

An access token that comes with an Azure OpenAI subscription.

***chat: tensor(string)***

A JSON string in the requested [format](https://learn.microsoft.com/en-us/azure/ai-services/openai/chatgpt-quickstart?tabs=command-line&pivots=rest-api).

#### Outputs

***response_format: tensor(string)***

A JSON string as the response.


#### Examples

```python

import os
import numpy as np

import onnx
from onnx import helper, TensorProto
from onnxruntime import InferenceSession, SessionOptions
from onnxruntime_extensions import get_library_path


azure_model_uri = os.getenv('URI', '')  # read URI from env
azure_auth_token = os.getenv('AUTH', '')  # read auth token from env


def create_azure_chat_model():
    auth_token = helper.make_tensor_value_info('auth_token', TensorProto.STRING, [-1])
    chat = helper.make_tensor_value_info('chat', TensorProto.STRING, [-1])
    response = helper.make_tensor_value_info('response', TensorProto.STRING, [-1])

    invoker = helper.make_node('AzureTextToText', ['auth_token', 'chat'], ['response'],
                               domain='com.microsoft.extensions',
                               name='chat_invoker',
                               model_uri=azure_model_uri)

    graph = helper.make_graph([invoker], 'graph', [auth_token, chat], [response])
    model = helper.make_model(graph,
                              opset_imports=[helper.make_operatorsetid('com.microsoft.extensions', 1)])

    onnx.save(model, 'azure_chat.onnx')


create_azure_chat_model()
opt = SessionOptions()
opt.register_custom_ops_library(get_library_path())
# Load the model from where create_azure_chat_model() saved it.
sess = InferenceSession('azure_chat.onnx', opt, providers=["CPUExecutionProvider", "AzureExecutionProvider"])
auth_token = np.array([azure_auth_token])
chat = np.array([r'{"messages":[{"role": "system", "content": "You are a helpful assistant."},{"role": "user", "content": "Does Azure OpenAI support customer managed keys?"},{"role": "assistant", "content": "Yes, customer managed keys are supported by Azure OpenAI."},{"role": "user", "content": "Do other Azure AI services support this too?"}]}'])
ort_inputs = {
    "auth_token": auth_token,
    "chat": chat,
}
out = sess.run(None, ort_inputs)[0]
```
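
The `chat` input above is written as a hand-typed JSON literal, which is easy to get wrong once messages contain quotes or newlines. A minimal sketch of building the same kind of string with the standard `json` module follows. `build_chat_payload` is a hypothetical helper, and the `messages` / `role` / `content` layout is copied from the example above rather than from a schema.

```python
import json


def build_chat_payload(messages):
    """Serialize (role, content) pairs into a chat JSON string (hypothetical helper)."""
    return json.dumps(
        {"messages": [{"role": role, "content": content} for role, content in messages]}
    )


chat_json = build_chat_payload([
    ("system", "You are a helpful assistant."),
    ("user", 'Does the service handle "quoted" text?'),
])
print(chat_json)
```

The resulting string can be wrapped as `np.array([chat_json])` and passed as the `chat` input in the same way as the literal above.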
</details>


### AzureTritonInvoker

<details>
<summary>AzureTritonInvoker details</summary>


AzureTritonInvoker talks to [Azure Machine Learning Triton services](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-deploy-with-triton?view=azureml-api-2&tabs=azure-cli%2Cendpoint).


#### Attributes

***model_uri:string***

Endpoint URI, like "https://endpoint-12345678.westus.inference.ml.azure.com".

***model_name:string***

***model_version:string***

A version string, like "1", or "2".

#### Inputs

***auth_token: tensor(string)***

An access token that comes with an Azure Machine Learning model deployment.

***inputs: tensor(variadic)***

Tensors of any supported ONNX data type.

#### Outputs

***outputs: tensor(variadic)***

Tensors of any supported ONNX data type.


#### Examples

```python

import os
import numpy as np

import onnx
from onnx import helper, TensorProto
from onnxruntime import InferenceSession, SessionOptions
from onnxruntime_extensions import get_library_path


triton_uri = os.getenv('URI', '')  # read URI from env
triton_auth_token = os.getenv('AUTH', '')  # read auth token from env


def createAddf():
    auth_token = helper.make_tensor_value_info('auth_token', TensorProto.STRING, [-1])
    X = helper.make_tensor_value_info('X', TensorProto.FLOAT, [-1])
    Y = helper.make_tensor_value_info('Y', TensorProto.FLOAT, [-1])
    Z = helper.make_tensor_value_info('Z', TensorProto.FLOAT, [-1])
    invoker = helper.make_node('AzureTritonInvoker', ['auth_token', 'X', 'Y'], ['Z'],
                               domain='com.microsoft.extensions', name='triton_invoker',
                               model_uri=triton_uri,
                               model_name='addf', model_version='1')
    graph = helper.make_graph([invoker], 'graph', [auth_token, X, Y], [Z])
    model = helper.make_model(graph,
                              opset_imports=[helper.make_operatorsetid('com.microsoft.extensions', 1)])
    onnx.save(model, 'triton_addf.onnx')


def run_add_f():
    opt = SessionOptions()
    opt.register_custom_ops_library(get_library_path())
    # Load the model from where createAddf() saved it.
    sess = InferenceSession('triton_addf.onnx',
                            opt, providers=["CPUExecutionProvider", "AzureExecutionProvider"])
    auth_token = np.array([triton_auth_token])
    x = np.array([1, 2, 3, 4]).astype(np.float32)
    y = np.array([4, 3, 2, 1]).astype(np.float32)
    ort_inputs = {
        "auth_token": auth_token,
        "X": x,
        "Y": y
    }
    out = sess.run(None, ort_inputs)[0]


# Build the model first, then run it against the endpoint.
createAddf()
run_add_f()
```
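
The examples in this section read the endpoint URI and the access token from the `URI` and `AUTH` environment variables and fall back to an empty string, so a missing variable only shows up later as a failed request. A minimal sketch of failing early instead is shown below. `require_env` is a hypothetical helper, not part of onnxruntime-extensions.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name, '')
    if not value:
        raise RuntimeError(f"environment variable {name} is not set")
    return value


try:
    triton_uri = require_env('URI')
    triton_auth_token = require_env('AUTH')
except RuntimeError as err:
    # Report the missing setting before any model is built or any request is sent.
    print(err)
```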
</details>