# Operators


## Natural language operators

### BertTokenizer

<details>
<summary>BertTokenizer details</summary>

BertTokenizer replicates the `encode_plus` function of [BertTokenizer (Hugging Face version)](https://huggingface.co/transformers/_modules/transformers/models/bert/tokenization_bert.html#BertTokenizer).

#### Inputs

***text: tensor(string)*** The string tensor for tokenization.

#### Attributes

***vocab_file: string***

The content of the vocabulary file, in the same format as the Hugging Face vocabulary.

***do_lower_case: int64_t*** (default is 1, 1 represents True, 0 represents False)

Whether or not to lowercase the input when tokenizing.

***do_basic_tokenize: int64_t*** (default is 1, 1 represents True, 0 represents False)

Whether or not to do basic tokenization before WordPiece.

***unk_token: string***

The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to this
token instead.

***sep_token: string***

The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
sequence classification or a text and a question for question answering. It is also used as the last
token of a sequence built with special tokens.

***pad_token: string***

The token used for padding, for example when batching sequences of different lengths.

***cls_token: string***

The classifier token, which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.

***mask_token: string***

The token used for masking values. This is the token used when training this model with masked language modeling, and it is the token the model will try to predict.

***tokenize_chinese_chars: int64_t*** (default is 1, 1 represents True, 0 represents False)

Whether or not to tokenize Chinese characters.

***strip_accents: int64_t*** (default is 1, 1 represents True, 0 represents False)

Whether or not to strip all accents. If this option is not specified, it is determined by the
value of `do_lower_case` (as in the original BERT).

***tokenize_punctuation: int64_t*** (default is 0, 1 represents True, 0 represents False)

Splits punctuation on a piece of text.

***remove_control_chars: int64_t*** (default is 0, 1 represents True, 0 represents False)

Removes control characters (such as NUL, BEL) from the text.

***truncation_strategy_name: string***

The name of the truncation strategy. It can be `longest_first`, `only_first`, `only_second`, or `longest_from_back`.

#### Outputs

***input_ids: tensor(int64_t)***

List of token ids.

***token_type_ids: tensor(int64_t)***

List of token type ids.

***attention_mask: tensor(int64_t)***

List of indices specifying which tokens should be attended to by the model.


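To make the relationship between the three outputs concrete, here is a minimal pure-Python sketch for a single, already-tokenized sentence. It is an illustration only, not the operator's implementation: the helper `encode_single` and the toy vocabulary are hypothetical, and it assumes one segment (so `token_type_ids` is all zeros) and no padding (so `attention_mask` is all ones).

```python
def encode_single(tokens, vocab, cls_token="[CLS]", sep_token="[SEP]", unk_token="[UNK]"):
    """Build input_ids, token_type_ids and attention_mask for one sentence."""
    pieces = [cls_token] + list(tokens) + [sep_token]
    # Tokens missing from the vocabulary fall back to the id of unk_token.
    input_ids = [vocab.get(p, vocab[unk_token]) for p in pieces]
    token_type_ids = [0] * len(input_ids)  # single segment: all zeros
    attention_mask = [1] * len(input_ids)  # no padding: all ones
    return input_ids, token_type_ids, attention_mask


# Hypothetical toy vocabulary, only for this illustration.
toy_vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "hello": 3, "world": 4}
ids, type_ids, mask = encode_single(["hello", "world", "louder"], toy_vocab)
print(ids, type_ids, mask)
```

Here "louder" is not in the toy vocabulary, so it is mapped to the id of `[UNK]`.
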
#### Examples

```python
import numpy as np
import onnx
import transformers

bert_cased_tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-cased')

node = onnx.helper.make_node(
    'BertTokenizer',
    inputs=['text'],
    outputs=['input_ids', 'token_type_ids', 'attention_mask'],
)

text = "Hello world louder"
inputs = np.array([text], dtype=object)

# encode_plus returns a dict-like object holding the three expected outputs
bert_tokenize_result = bert_cased_tokenizer.encode_plus(text)

input_ids = np.array(bert_tokenize_result['input_ids'])
token_type_ids = np.array(bert_tokenize_result['token_type_ids'])
attention_mask = np.array(bert_tokenize_result['attention_mask'])

# `expect` is the test helper used throughout these examples; it is not defined here
expect(node, inputs=[inputs],
       outputs=[input_ids, token_type_ids, attention_mask], name='test_bert_tokenizer')
```
</details>

### BertTokenizerDecoder

<details>
<summary>BertTokenizerDecoder details</summary>

BertTokenizerDecoder replicates the `decode` function of [BertTokenizer (Hugging Face version)](https://huggingface.co/transformers/_modules/transformers/models/bert/tokenization_bert.html#BertTokenizer).

#### Inputs

***token_ids: tensor(int64)***

List of tokenized input ids.

***indices: tensor(int64)***

List of `[start_position, end_position]` pairs indicating which segments of the input ids should be decoded. This input is only enabled when the attribute `use_indices` is 1.

Usually, it is used to decode a slot in the text.

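The following pure-Python sketch illustrates the idea of decoding only selected segments. It is not the operator's implementation: the function names and the toy `id_to_token` table are hypothetical, `end_position` is assumed to be exclusive, and special-token handling and space clean-up are left out.

```python
def decode_wordpieces(tokens, suffix_indicator="##"):
    """Join WordPiece tokens; continuation pieces attach to the previous word."""
    words = []
    for tok in tokens:
        if tok.startswith(suffix_indicator) and words:
            words[-1] += tok[len(suffix_indicator):]
        else:
            words.append(tok)
    return " ".join(words)


def decode_with_indices(token_ids, id_to_token, indices):
    """Decode one string per [start, end) pair (end assumed exclusive)."""
    return [
        decode_wordpieces([id_to_token[i] for i in token_ids[start:end]])
        for start, end in indices
    ]


# Hypothetical toy vocabulary, only for this illustration.
id_to_token = {0: "[CLS]", 1: "book", 2: "a", 3: "flight", 4: "to",
               5: "se", 6: "##attle", 7: "[SEP]"}
token_ids = [0, 1, 2, 3, 4, 5, 6, 7]
slots = decode_with_indices(token_ids, id_to_token, [[5, 7], [1, 4]])
print(slots)
```
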
#### Attributes

***vocab_file: string***

The content of the vocabulary file, in the same format as the Hugging Face vocabulary.

***unk_token: string***

The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to this
token instead.

***sep_token: string***

The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
sequence classification or a text and a question for question answering. It is also used as the last
token of a sequence built with special tokens.

***pad_token: string***

The token used for padding, for example when batching sequences of different lengths.

***cls_token: string***

The classifier token, which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.

***mask_token: string***

The token used for masking values. This is the token used when training this model with masked language modeling, and it is the token the model will try to predict.

***suffix_indicator: string***

The suffix indicator.

***use_indices: int64_t***

Whether to use the second input (`indices`).

***skip_special_tokens: int64_t***

Whether or not to remove special tokens in the decoding.

***clean_up_tokenization_spaces: int64_t***

Whether or not to clean up the tokenization spaces.

#### Outputs

***sentences: tensor(string)***

The decoded sentences.

#### Examples


```python
import numpy as np
import onnx
import transformers

def get_file_content(path):
    with open(path, "rb") as file:
        return file.read()

bert_cased_tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-cased')
# Writes the vocabulary to ./bert-vocab.txt
bert_cased_tokenizer.save_vocabulary('.', 'bert')


node = onnx.helper.make_node(
    'BertTokenizerDecoder',
    inputs=['token_ids'],
    outputs=['sentences'],
    vocab_file=get_file_content("bert-vocab.txt")
)

text = "Hello world louder"
# The operator takes token ids, so convert the tokens to ids first
token_ids = np.array(
    bert_cased_tokenizer.convert_tokens_to_ids(bert_cased_tokenizer.tokenize(text)),
    dtype=np.int64)
sentences = np.array([text], dtype=object)


expect(node, inputs=[token_ids],
       outputs=[sentences], name='test_bert_tokenizer_decoder')
```
</details>



### GPT2Tokenizer

<details>
<summary>GPT2Tokenizer details</summary>

GPT2Tokenizer performs byte-level BPE tokenization on the input tensor, based on the [Hugging Face version](https://huggingface.co/transformers/_modules/transformers/tokenization_gpt2.html).

#### Attributes

***vocab***

The **content** of the vocabulary file, in the same format as the [Hugging Face one](https://huggingface.co/gpt2/resolve/main/vocab.json).

***merges***

The **content** of the merges file, in the same format as the [Hugging Face one](https://huggingface.co/gpt2/resolve/main/merges.txt).

***padding_length(optional)***

When the input is a batch of queries, the tokenized result is a ragged tensor, so it must be padded into a regular tensor; `padding_length` selects the padding strategy. When `padding_length` is -1, every row is padded to the length of the longest row. When `padding_length` is greater than 0, every row is padded to `padding_length`.

The default value of `padding_length` is -1.

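The two padding strategies can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the operator's code: `pad_ragged` is a hypothetical helper, the pad id `0` is arbitrary, the mask uses 1 for real tokens and 0 for padding, and truncating rows longer than `padding_length` is an assumption.

```python
def pad_ragged(rows, padding_length=-1, pad_id=0):
    """Pad a ragged list of token-id rows and build the matching mask."""
    # -1 means "pad to the longest row"; otherwise pad to the requested length.
    target = max(len(r) for r in rows) if padding_length == -1 else padding_length
    input_ids, attention_mask = [], []
    for row in rows:
        kept = list(row[:target])  # assumption: longer rows are cut to target
        n_pad = target - len(kept)
        input_ids.append(kept + [pad_id] * n_pad)
        attention_mask.append([1] * len(kept) + [0] * n_pad)
    return input_ids, attention_mask


ragged = [[20342, 12794, 2271], [20342]]
ids, mask = pad_ragged(ragged)                         # pad to the longest row
ids5, mask5 = pad_ragged(ragged, padding_length=5)     # pad to a fixed length
print(ids, mask)
```
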
#### Inputs

***data: tensor(string)***

The string tensor for tokenization.

#### Outputs

***input_ids: tensor(int64)***

The tokenized ids of the input.

***attention_mask: tensor(int64)***

A tensor that indicates which part of `input_ids` is padding.

#### Examples


```python
import numpy as np
import onnx

def get_file_content(path):
    with open(path, "rb") as file:
        return file.read()

# vocabulary_file and merges_file are paths to the GPT-2 vocab.json and merges.txt
node = onnx.helper.make_node(
    'GPT2Tokenizer',
    inputs=['x'],
    outputs=['y'],
    vocab=get_file_content(vocabulary_file),
    merges=get_file_content(merges_file)
)

x = np.array(["hey cortana"], dtype=object)
y = np.array([20342, 12794, 2271], dtype=np.int64)

expect(node, inputs=[x], outputs=[y],
       name='test_gpt2_tokenizer')
```
</details>

### WordpieceTokenizer

<details>
<summary>WordpieceTokenizer details</summary>


WordpieceTokenizer performs WordPiece tokenization on the input tensor,
based on the [Hugging Face version](https://huggingface.co/transformers/model_doc/bert.html#WordpieceTokenizer).
[WordpieceTokenizer](https://github.com/tensorflow/text/blob/master/docs/api_docs/python/text/WordpieceTokenizer.md)
from *tensorflow_text* can be implemented by a pair of nodes:
*StringRegexSplitWithOffsets* followed by *WordpieceTokenizer*.

#### Attributes

***vocab***

The **content** of the vocabulary file, in the same format as the
[Hugging Face one](https://huggingface.co/gpt2/resolve/main/vocab.json).

***suffix_indicator***

Suffix added to a token that is not in the first position before looking it up in the vocabulary.

***unk_token***

The unknown token. Every token not found in the vocabulary is replaced by this one.

***max_input_chars_per_word***

Maximum number of characters per token (optional, defaults to 200).

#### Inputs

***data: tensor(string)***

The string tensor for tokenization.

***row_indices: tensor(int64)*** Empty, or the indices of every first token of the input sentences.
`row_indices[i+1] - row_indices[i]` is the number of tokens in input `i`.

[WordpieceTokenizer](https://github.com/tensorflow/text/blob/master/docs/api_docs/python/text/WordpieceTokenizer.md)
includes two steps. The first one splits sentences into words, and the second one splits
every word into tokens. This operator only implements the second step.
The first one can be done with the operator *StringRegexSplitWithOffsets*.
This input can either be empty or be the third output
of the operator *StringRegexSplitWithOffsets*.

#### Outputs

***tokens: tensor(string)*** Every token.

***token_indices: tensor(int32)*** Indices of each token. -1 means a token outside the vocabulary.

***row_indices: tensor(int64)*** Indices of every first token of the input sentences.
`row_indices[i+1] - row_indices[i]` is the number of tokens in input `i`.
These are the updated row indices given as input, or new ones if the second input is empty.

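As a rough illustration of the second step and of how `row_indices` is built, here is a pure-Python sketch of greedy longest-match-first WordPiece splitting. It is not the operator's implementation: the function names are hypothetical, a plain whitespace split stands in for the first step, `max_input_chars_per_word` is ignored, and emitting `unk_token` for the unmatched remainder of a word is an assumption chosen to mirror the expected output of the example in this section.

```python
def wordpiece_word(word, vocab, suffix_indicator="##", unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into (token, index) pairs."""
    out, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = suffix_indicator + piece  # non-initial pieces get the suffix marker
            if piece in vocab:
                break
            end -= 1
        if end == start:
            # Nothing matched here: emit unk_token with index -1 and stop (assumption).
            out.append((unk_token, -1))
            break
        out.append((piece, vocab[piece]))
        start = end
    return out


def wordpiece_sentences(sentences, vocab):
    """Tokenize every sentence and record where each sentence's tokens start."""
    tokens, token_indices, row_indices = [], [], [0]
    for sentence in sentences:
        for word in sentence.split():  # stand-in for the regex split step
            for tok, idx in wordpiece_word(word, vocab):
                tokens.append(tok)
                token_indices.append(idx)
        row_indices.append(len(tokens))
    return tokens, token_indices, row_indices


words = ["want", "##want", "##ed", "wa", "un", "runn", "##ing"]
vocab = {w: i + 10 for i, w in enumerate(words)}
tokens, token_indices, row_indices = wordpiece_sentences(
    ["unwanted running", "unwantedX running"], vocab)
print(tokens)
print(token_indices)
print(row_indices)
```
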
#### Examples


```python
import json
import numpy as np
from onnx import helper, onnx_pb as onnx_proto

words = ["want", "##want",
         "##ed", "wa", "un", "runn", "##ing"]
vocab = {w: i + 10 for i, w in enumerate(words)}
st = json.dumps(vocab)
domain = 'ai.onnx.contrib'  # custom-op domain shared by both nodes
mkv = helper.make_tensor_value_info
reg = helper.make_tensor(
    "pattern", onnx_proto.TensorProto.STRING, [1, ], ["(\\s)".encode('ascii')])
reg_empty = helper.make_tensor(
    "keep_pattern", onnx_proto.TensorProto.STRING, [0, ], [])

nodes = [
    helper.make_node(
        'StringRegexSplitWithOffsets',
        inputs=['text', 'pattern', 'keep_pattern'],
        outputs=['words', 'begin_end', 'indices'],
        name='StringRegexSplitOpName',
        domain=domain),
    helper.make_node(
        'WordpieceTokenizer',
        inputs=['words', 'indices'],
        outputs=['out0', 'out1', 'out2'],
        name='WordpieceTokenizerOpName',
        domain=domain,
        vocab=st.encode('utf-8'),
        suffix_indicator="##",
        unk_token="[UNK]")
]
inputs = [mkv('text', onnx_proto.TensorProto.STRING, [None])]
graph = helper.make_graph(
    nodes, 'test0', inputs, [
        mkv('out0', onnx_proto.TensorProto.STRING, [None]),
        mkv('out1', onnx_proto.TensorProto.INT32, [None]),
        mkv('out2', onnx_proto.TensorProto.INT64, [None]),
        mkv('words', onnx_proto.TensorProto.STRING, [None]),
        mkv('indices', onnx_proto.TensorProto.INT64, [None])],
    [reg, reg_empty])
model = helper.make_model(
    graph, opset_imports=[helper.make_operatorsetid(domain, 1)])

text = np.array(["unwanted running", "unwantedX running"], dtype=object)
tokens = np.array(['un', '##want', '##ed', 'runn', '##ing', 'un', '##want', '##ed',
                   '[UNK]', 'runn', '##ing'], dtype=object)
indices = np.array([14, 11, 12, 15, 16, 14, 11, 12, -1, 15, 16], dtype=np.int32)
row_indices = np.array([0, 5, 11], dtype=np.int64)

expect(model, inputs=[text], outputs=[tokens, indices, row_indices],
       name='test_wordpiece_tokenizer')
```

</details>

### SentencepieceTokenizer

<details>
<summary>SentencepieceTokenizer details</summary>

SentencepieceTokenizer replicates [SentencepieceTokenizer](https://github.com/tensorflow/text/blob/master/docs/api_docs/python/text/SentencepieceTokenizer.md).

#### Inputs

***data: tensor(string)*** The string tensor for tokenization.

***nbest_size: tensor(int64)*** A scalar for sampling.

- `nbest_size = {0, 1}`: no sampling is performed (default).
- `nbest_size > 1`: samples from the `nbest_size` results.
- `nbest_size < 0`: assumes that `nbest_size` is infinite and samples from all hypotheses (lattice) using the
  forward-filtering-and-backward-sampling algorithm.

***alpha: tensor(float)*** A scalar for a smoothing parameter. Inverse temperature for probability rescaling.

***reverse: tensor(bool)*** Reverses the tokenized sequence (default = false).

***add_bos: tensor(bool)*** Adds a beginning-of-sentence token to the result (default = false).

***add_eos: tensor(bool)*** Adds an end-of-sentence token to the result (default = false).
When `reverse` is true, the beginning/end-of-sentence tokens are added after reversing.

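The ordering rule for `reverse`, `add_bos` and `add_eos` can be pictured with a small pure-Python sketch. It only illustrates the sentence above and is not the operator's code: `postprocess` is a hypothetical helper, and the ids `1` and `2` for the beginning/end-of-sentence tokens are made up (the real ids come from the sentencepiece model).

```python
def postprocess(ids, reverse=False, add_bos=False, add_eos=False, bos_id=1, eos_id=2):
    """Apply reversal first, then add the sentence markers."""
    out = list(reversed(ids)) if reverse else list(ids)
    if add_bos:
        out = [bos_id] + out  # marker is added after reversing, so it stays in front
    if add_eos:
        out = out + [eos_id]  # likewise, the end marker stays at the end
    return out


plain = postprocess([10, 11, 12], add_bos=True, add_eos=True)
flipped = postprocess([10, 11, 12], reverse=True, add_bos=True, add_eos=True)
print(plain, flipped)
```
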
#### Attributes

***model: string*** The serialized sentencepiece model proto, stored as a string.

#### Outputs

***tokens: tensor(int32)*** Indices of each token in the tokenized input.

***indices: tensor(int64)*** Indices of every first token of the input sentences.
`indices[i+1] - indices[i]` is the number of tokens in input `i`.

#### Examples


```python
import base64
import urllib.request

import numpy as np
import onnx

url = "https://github.com/microsoft/ort-customops/raw/main/test/data/test_sentencepiece_ops_model__6.txt"
with urllib.request.urlopen(url) as f:
    content = f.read()
# The downloaded file holds the base64-encoded serialized model
model = base64.decodebytes(content)

node = onnx.helper.make_node(
    'SentencepieceTokenizer',
    inputs=['inputs', 'nbest_size', 'alpha', 'add_bos', 'add_eos', 'reverse'],
    outputs=['tokens', 'indices'],
    model=model
)

inputs = np.array(["Hello world", "Hello world louder"], dtype=object)
nbest_size = np.array([0], dtype=np.int64)
alpha = np.array([0], dtype=np.float32)
add_bos = np.array([0], dtype=np.bool_)
add_eos = np.array([0], dtype=np.bool_)
reverse = np.array([0], dtype=np.bool_)

tokens = np.array([17486, 1017, 17486, 1017, 155, 21869], dtype=np.int32)
indices = np.array([0, 2, 6], dtype=np.int64)

expect(node, inputs=[inputs, nbest_size, alpha, add_bos, add_eos, reverse],
       outputs=[tokens, indices], name='sp')
```
</details>


### BasicTokenizer

<details>
<summary>BasicTokenizer details</summary>

TODO: is this still supported?

BasicTokenizer performs basic tokenization on the input string tensor, based on the [basic tokenizer in BertTokenizer (Hugging Face version)](https://huggingface.co/transformers/_modules/transformers/models/bert/tokenization_bert.html#BertTokenizer).

#### Inputs

***text: tensor(string)*** The string tensor for tokenization.

#### Attributes

***do_lower_case: int64_t*** (default is 1, 1 represents True, 0 represents False)

Whether or not to lowercase the input when tokenizing.

***tokenize_chinese_chars: int64_t*** (default is 1, 1 represents True, 0 represents False)

Whether or not to tokenize Chinese characters.

***strip_accents: int64_t*** (default is 1, 1 represents True, 0 represents False)

Whether or not to strip all accents. If this option is not specified, it is determined by the
value of `do_lower_case` (as in the original BERT).

***tokenize_punctuation: int64_t*** (default is 0, 1 represents True, 0 represents False)

Splits punctuation on a piece of text.

***remove_control_chars: int64_t*** (default is 0, 1 represents True, 0 represents False)

Removes control characters (such as NUL, BEL) from the text.

#### Outputs

***tokens: tensor(string)*** Tokenized tokens.

#### Examples

```python
import numpy as np
import onnx
import transformers

tokenizer = transformers.BasicTokenizer()

node = onnx.helper.make_node(
    'BasicTokenizer',
    inputs=['text'],
    outputs=['tokens'],
)

text = "Hello world louder"
inputs = np.array([text], dtype=object)
# BasicTokenizer.tokenize returns a list of string tokens
tokens = np.array(tokenizer.tokenize(text), dtype=object)

expect(node, inputs=[inputs],
       outputs=[tokens], name='test_basic_tokenizer')
```
</details>


### BlingFireSentenceBreaker

TODO

### BpeTokenizer

TODO


## String operators

### StringEqual

<details>
<summary>StringEqual details</summary>

Compares two strings and returns true if they are equal and false if not.

#### Inputs

***x: tensor(string)***

The first string input.

***y: tensor(string)***

The second string input.

#### Outputs

***z: tensor(bool)***

True where the two strings are equal, false otherwise.

</details>


### StringHash

<details>
<summary>StringHash details</summary>


Hashes the input string based on the number of buckets.

#### Inputs

***input: tensor(string)***

The string to hash.

***num_buckets: tensor(int64)***

The number of buckets (must be equal to 1?)

#### Outputs

***name: tensor(int64)***

The hash value of the string.

</details>


### StringHashFast

<details>
<summary>StringHashFast details</summary>


A faster implementation of StringHash.

</details>


### StringJoin

<details>
<summary>StringJoin details</summary>


Joins an array of strings.

#### Inputs

***input_X: tensor(string)***

The input array of strings.

***input_sep: tensor(string)***

The string separator for the resulting join.

***input_axis: tensor(int64)***

The axis along which to join.

#### Outputs

***out: tensor(string)***

The resulting joined string.

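For a 2-D input, joining along an axis can be sketched in plain Python as below. This is only an illustration, not the operator's implementation; `string_join` is a hypothetical helper that handles nested lists with exactly two dimensions.

```python
def string_join(x, sep, axis):
    """Join a 2-D list of strings along the given axis (0 or 1)."""
    if axis == 1:
        # Collapse each row into one string.
        return [sep.join(row) for row in x]
    if axis == 0:
        # Collapse each column into one string.
        return [sep.join(col) for col in zip(*x)]
    raise ValueError("this sketch only supports axis 0 or 1")


data = [["a", "b", "c"], ["aa", "bb", ""]]
by_rows = string_join(data, ";", 1)
by_cols = string_join(data, ";", 0)
print(by_rows, by_cols)
```
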
#### Examples


```python

input_X = [["a", "b", "c"], ["aa", "bb", ""]]
input_sep = ";"
input_axis = 1

out = ["a;b;c", "aa;bb;"]

input_axis = 0

out = ['a;aa', 'b;bb', 'c;']
```

</details>


### StringRegexReplace

<details>
<summary>StringRegexReplace details</summary>


String replacement based on [Re2-format](https://github.com/google/re2/wiki/Syntax) regular expressions.

#### Inputs

***text: tensor(string)***

String tensor to apply the replacements to.

***pattern: tensor(string)***

Pattern of the regular expression.

***rewrite: tensor(string)***

Replacement.

#### Attributes

***global_replace: int64*** (default is 1)

If 1, replace every match of the pattern; if 0, replace only the first match.

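The effect of `global_replace` can be approximated with Python's standard `re` module, as in the sketch below. This is an analogy rather than the operator's implementation: `regex_replace` is a hypothetical helper, and Python's regex dialect is not identical to RE2, although both use `\1` for the first capture group in the replacement.

```python
import re


def regex_replace(texts, pattern, rewrite, global_replace=1):
    """Replace all matches (global_replace=1) or only the first one (0)."""
    count = 0 if global_replace else 1  # in re.sub, count=0 means "replace all"
    return [re.sub(pattern, rewrite, t, count=count) for t in texts]


all_matches = regex_replace(["a1b22c"], r"\d+", "#")
first_only = regex_replace(["a1b22c"], r"\d+", "#", global_replace=0)
with_group = regex_replace(
    ["def myfunc():"],
    r"def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):",
    r"static PyObject* py_\1(void) {",
)
print(all_matches, first_only, with_group)
```
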
#### Outputs

***output: tensor(string)***

String with replacements.

#### Examples

```python
import numpy as np
import onnx

node = onnx.helper.make_node(
    'StringRegexReplace',
    inputs=['text', 'pattern', 'rewrite'],
    outputs=['y'],
)

text = np.array([['def myfunc():'], ['def dummy():']])
pattern = np.array([r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):'])
rewrite = np.array([r'static PyObject* py_\1(void) {'])
y = [['static PyObject* py_myfunc(void) {'],
     ['static PyObject* py_dummy(void) {']]

expect(node, inputs=[text, pattern, rewrite], outputs=[y],
       name='test_string_regex_replace')
```

</details>

### StringECMARegexReplace

<details>
<summary>StringECMARegexReplace details</summary>

String replacement based on [ECMA-format](https://en.cppreference.com/w/cpp/regex/ecmascript) regular expressions.

#### Inputs

***text: tensor(string)***

String tensor to apply the replacements to.

***pattern: tensor(string)***

Pattern of the regular expression.

***rewrite: tensor(string)***

Replacement.

#### Attributes

***global_replace: int64*** (default is 1)

If 1, replace every match of the pattern; if 0, replace only the first match.


***ignore_case: int64*** (default is 0)

Whether to ignore case when matching the pattern.

#### Outputs

***output: tensor(string)***

String with replacements.

#### Examples


```python
import numpy as np
import onnx

node = onnx.helper.make_node(
    'StringECMARegexReplace',
    inputs=['text', 'pattern', 'rewrite'],
    outputs=['y'],
)

text = np.array([['def myfunc():'], ['def dummy():']])
pattern = np.array([r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):'])
# ECMAScript syntax refers to the first capture group as $1
rewrite = np.array([r'static PyObject* py_$1(void) {'])
y = [['static PyObject* py_myfunc(void) {'],
     ['static PyObject* py_dummy(void) {']]

expect(node, inputs=[text, pattern, rewrite], outputs=[y],
       name='test_string_ecma_regex_replace')
```

</details>



### StringSplit

TODO

### StringUpper

TODO

### StringLower

TODO

### StringLength

<details>
<summary>StringLength details</summary>

Gets the length of each string element in the input tensor. Similar to the function `len("abcde")` in Python.

#### Inputs

***data: tensor(string)***

String tensor whose element lengths are computed.

#### Outputs

***output: tensor(int64)***

Data length tensor.

#### Examples


```python
import numpy as np
import onnx

node = onnx.helper.make_node(
    'StringLength',
    inputs=['x'],
    outputs=['y']
)

x = ["abcdef", "hijkl"]
y = np.array([len(x[0]), len(x[1])], dtype=np.int64)


expect(node, inputs=[x], outputs=[y],
       name='test_string_length')
```
</details>

### StringConcat

<details>
<summary>StringConcat details</summary>

Concatenates the corresponding strings of two string tensors. The two input tensors must have the same shape.

```python
output = []
shape = input1.shape
input1 = input1.flatten()
input2 = input2.flatten()
for i in range(len(input1)):
    output.append(input1[i] + input2[i])
output = np.array(output).reshape(shape)
```

#### Inputs

***input_1: tensor(string)***

The first string tensor.

***input_2: tensor(string)***

The second string tensor.


#### Outputs

***output: tensor(string)***

The result.

#### Examples


```python
import numpy as np
import onnx

node = onnx.helper.make_node(
    'StringConcat',
    inputs=['x', 'y'],
    outputs=['result'],
)

x = np.array(["abcd", "efgh"])
y = np.array(["wxyz", "stuv"])
result = np.array([x[0] + y[0], x[1] + y[1]])

expect(node, inputs=[x, y], outputs=[result],
       name='test_string_concat')
```

</details>

### StringRegexSplitWithOffsets

<details>
<summary>StringRegexSplitWithOffsets details</summary>

Splits strings based on regular expressions.

#### Inputs

***text: tensor(string)***

String tensor to split.

***delim_regex_pattern: tensor(string)***

Splitting pattern of the regular expression.

***keep_delim_regex_pattern: tensor(string)***

By default, delimiters are not included in the split string results. Delimiters may be included by specifying a regex pattern `keep_delim_regex_pattern`.

#### Outputs

***words: tensor(string)*** Tensor of words.

***offsets: tensor(int64)*** 2D tensor with 3 columns:
sentence index, position of the first character, position of the last one (excluded).

***row_indices: tensor(int64)*** Indices of every first token of the input sentences.
`row_indices[i+1] - row_indices[i]` is the number of tokens in input `i`.


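The three outputs described above can be sketched with Python's `re` module. This is an approximation under stated assumptions, not the operator's implementation: `regex_split_with_offsets` is a hypothetical helper, Python's regex dialect stands in for the one the operator uses, a delimiter is kept only when it fully matches the keep pattern, and empty matches are ignored.

```python
import re


def regex_split_with_offsets(texts, delim_pattern, keep_delim_pattern=None):
    """Return (words, offsets, row_indices) for a list of input strings."""
    delim = re.compile(delim_pattern)
    keep = re.compile(keep_delim_pattern) if keep_delim_pattern else None
    words, offsets, row_indices = [], [], [0]
    for row, text in enumerate(texts):
        pos = 0
        for m in delim.finditer(text):
            if m.end() == m.start():
                continue  # ignore empty matches
            if m.start() > pos:
                words.append(text[pos:m.start()])
                offsets.append([row, pos, m.start()])
            if keep is not None and keep.fullmatch(m.group()):
                words.append(m.group())  # delimiter kept as its own piece
                offsets.append([row, m.start(), m.end()])
            pos = m.end()
        if pos < len(text):
            words.append(text[pos:])
            offsets.append([row, pos, len(text)])
        row_indices.append(len(words))  # start of the next sentence's pieces
    return words, offsets, row_indices


kept = regex_split_with_offsets(["hello there"], r"\s", r"\s")
dropped = regex_split_with_offsets(["hello there"], r"\s")
print(kept)
print(dropped)
```
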
#### Examples


```python
import numpy as np
import onnx

node = onnx.helper.make_node(
    'StringRegexSplitWithOffsets',
    inputs=['text', 'pattern', 'keep_pattern'],
    outputs=['y', 'begin_end', 'indices'],
)

text = np.array(["hello there"])
pattern = np.array([r'\s'])
keep_pattern = np.array([r'\s'])
y = np.array(["hello", " ", "there"])
z1 = np.array([[0, 0, 5],
               [0, 5, 6],
               [0, 6, 11]], dtype=np.int64)
# Input 0 yields 3 pieces, so by the row_indices definition above the range is [0, 3]
z2 = np.array([0, 3], dtype=np.int64)

expect(node, inputs=[text, pattern, keep_pattern], outputs=[y, z1, z2],
       name='test_string_regex_split_with_offsets')
```

</details>


### StringECMARegexSplitWithOffsets

TODO

### VectorToString

<details>
<summary>VectorToString details</summary>

VectorToString is the inverse of `StringToVector`; both share the same mapping-table format:

    <string>\t<scalar_1>\s<scalar_2>\s<scalar_3>...<scalar_n>

An unmapped vector outputs the value of the attribute `unk`.

Example:

*Attributes:*

- `map`:
  ```
  a 0 0 1 2
  b 0 1 2 3
  d 0 1 3 4
  ```

- `unk`: "unknown_word"

*Inputs:*
- data: [[0,0,1,2],[0,1,3,4],[0,0,0,0]]

*Outputs:*
- output: ["a", "d", "unknown_word" ]

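The lookup in the example above can be sketched in pure Python. This is an illustration, not the operator's implementation: `parse_vector_map` and `vector_to_string` are hypothetical helpers, and only integer scalars are handled.

```python
def parse_vector_map(table):
    """Parse '<string>\\t<scalars separated by spaces>' lines into {vector: string}."""
    mapping = {}
    for line in table.strip().splitlines():
        key, _, values = line.strip().partition("\t")
        mapping[tuple(int(v) for v in values.split())] = key
    return mapping


def vector_to_string(data, table, unk):
    """Map every row of `data` to its string, or to `unk` when it is not in the table."""
    mapping = parse_vector_map(table)
    return [mapping.get(tuple(row), unk) for row in data]


table = "a\t0 0 1 2\nb\t0 1 2 3\nd\t0 1 3 4"
result = vector_to_string([[0, 0, 1, 2], [0, 1, 3, 4], [0, 0, 0, 0]], table, "unknown_word")
print(result)
```
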
#### Attributes

***map***

The formatted mapping table.

***unk***

The result returned when a vector isn't found in the map.

#### Inputs

***data: tensor(T)***

Input tensor.

#### Outputs

***output: tensor(string)***

The mapping result of the input.

#### Type Constraints
***T:tensor(uint8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(int8), tensor(int16), tensor(int32), tensor(int64), tensor(bfloat16), tensor(float16), tensor(float), tensor(double), tensor(bool)***

Constrain input and output types to numerical tensors.


#### Examples


```python
import numpy as np
import onnx

# One "<string>\t<scalars separated by spaces>" entry per line
mapping_table = "a\t0 0 1 2\nb\t0 1 2 3\nd\t0 1 3 4"

node = onnx.helper.make_node(
    'VectorToString',
    inputs=['x'],
    outputs=['y'],
    map=mapping_table,
    unk="unknown_word"
)


x = np.array([[0,0,1,2],[0,1,3,4],[0,0,0,0]], dtype=np.int64)
y = ["a", "d", "unknown_word"]


expect(node, inputs=[x], outputs=[y],
       name='test_vector_to_string')
```
</details>


### StringToVector

<details>
<summary>StringToVector details</summary>

StringToVector maps each string element in the input to the corresponding vector according to the mapping file. The mapping file is a UTF-8 encoded text file in TSV format:

    <string>\t<scalar_1>\s<scalar_2>\s<scalar_3>...<scalar_n>

An unmapped string outputs the value of the attribute `unmapping_value`.

Example:

*Attributes:*

- `mapping_file_name`: vocabulary.txt
  ```
  a 0 0 1 2
  b 0 1 2 3
  d 0 1 3 4
  ```

- `unmapping_value`: [0 0 0 0]

*Inputs:*
- data: ["a", "d", "e"]

*Outputs:*
- output: [[0,0,1,2],[0,1,3,4],[0,0,0,0]]

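The same example, written as a small pure-Python sketch. It is an illustration only, not the operator's implementation: `string_to_vector` is a hypothetical helper, the table is passed as text rather than read from a file, and only integer scalars are handled.

```python
def string_to_vector(data, table, unmapping_value):
    """Map every string to its vector, or to `unmapping_value` when it is unmapped."""
    mapping = {}
    for line in table.strip().splitlines():
        key, _, values = line.partition("\t")
        mapping[key] = [int(v) for v in values.split()]
    # Copy the fallback so every unmapped entry gets its own list.
    return [mapping.get(s, list(unmapping_value)) for s in data]


table = "a\t0 0 1 2\nb\t0 1 2 3\nd\t0 1 3 4"
vectors = string_to_vector(["a", "d", "e"], table, [0, 0, 0, 0])
print(vectors)
```
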
#### Attributes

***mapping_file_name:string***

The name of your string-to-vector mapping file.

***unmapping_value:list(int)***

The mapping result for an unmapped string.

#### Inputs

***data: tensor(string)***

Input tensor.

#### Outputs

***output: tensor(T)***

The mapping result of the input.

#### Type Constraints
***T:tensor(uint8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(int8), tensor(int16), tensor(int32), tensor(int64), tensor(bfloat16), tensor(float16), tensor(float), tensor(double), tensor(bool)***

Constrain input and output types to numerical tensors.

#### Examples


```python
import numpy as np
import onnx

# what's in vocabulary.txt

mapping_table = \
"""
a 0 0 1 2
b 0 1 2 3
d 0 1 3 4
"""

node = onnx.helper.make_node(
    'StringToVector',
    inputs=['x'],
    outputs=['y'],
    mapping_table=mapping_table,
    unmapping_value=[0,0,0,0]
)


x = ["a", "d", "e"]
y = np.array([[0,0,1,2],[0,1,3,4],[0,0,0,0]], dtype=np.int64)


expect(node, inputs=[x], outputs=[y],
       name='test_string_to_vector')
```

</details>



### StringSlice

<details>
<summary>StringSlice details</summary>

Applies a slice operation to each string element in the input tensor, similar to string slicing in Python:

```python
a = "abcdef"
b = a[1:2]
c = a[3:1:-1]
```

#### Inputs

***data: tensor(string)***

String tensor to extract slices from.

***starts: tensor(int64/int32)***

The tensor of starting indices for the corresponding strings in data. It has the same shape as data.

***ends: tensor(int64/int32)***

The tensor of ending indices for the corresponding strings in data. It has the same shape as data.

***steps(optional): tensor(int64/int32)***

The tensor of slice steps for the corresponding strings in data. It has the same shape as data. If steps is an empty tensor, the default value 1 is used for each string.

#### Outputs

***output: tensor(string)***

Sliced data tensor.

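Element-wise slicing with per-string `starts`, `ends` and `steps` can be sketched in plain Python as follows. This is an illustration for flat lists only, not the operator's implementation; `string_slice` is a hypothetical helper.

```python
def string_slice(data, starts, ends, steps=None):
    """Slice each string with its own start, end and step."""
    if not steps:
        steps = [1] * len(data)  # an empty or missing steps input means step 1
    return [s[b:e:k] for s, b, e, k in zip(data, starts, ends, steps)]


forward_and_backward = string_slice(["abcdef", "hijkl"], [1, 3], [3, 1], [1, -1])
default_steps = string_slice(["abcdef"], [1], [3])
print(forward_and_backward, default_steps)
```
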
#### Examples


```python
import numpy as np
import onnx

node = onnx.helper.make_node(
    'StringSlice',
    inputs=['x', 'starts', 'ends', 'steps'],
    outputs=['y'],
)

x = np.array(["abcdef", "hijkl"])
y = np.array([x[0][1:3:1], x[1][3:1:-1]])
starts = np.array([1, 3], dtype=np.int64)
ends = np.array([3, 1], dtype=np.int64)
# steps must match the slices used to build y: 1 for the first string, -1 for the second
steps = np.array([1, -1], dtype=np.int64)

expect(node, inputs=[x, starts, ends, steps], outputs=[y],
       name='test_string_slice')
```

</details>


### MaskedFill

<details>
<summary>MaskedFill details</summary>


Fills elements of the input tensor with value where mask is True. The operator is similar to [`Tensor.masked_fill_`](https://pytorch.org/docs/stable/generated/torch.Tensor.masked_fill_.html#torch.Tensor.masked_fill_) in PyTorch.


#### Inputs

***value: tensor(string)***

The value to fill in with. Currently only the string type and vector or scalar shapes are supported.

***mask: tensor(bool)***

The boolean mask. The shape of the mask tensor should be the same as that of value.

#### Outputs

***output: tensor(string)***

The filled output of the input tensor.


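Going by the example in this section, where `["a", "b", "c", "d"]` with mask `[True, False, True, False]` yields `["a", "c"]`, the operator appears to keep the elements whose mask entry is true. The sketch below reproduces only that observed behavior in pure Python; `masked_select` is a hypothetical helper and the interpretation is an assumption, not a statement about the implementation.

```python
def masked_select(value, mask):
    """Keep the elements of `value` whose mask entry is true (assumed behavior)."""
    if len(value) != len(mask):
        raise ValueError("value and mask must have the same length")
    return [v for v, m in zip(value, mask) if m]


selected = masked_select(["a", "b", "c", "d"], [True, False, True, False])
print(selected)
```
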
1211#### Examples
1212
1213
```python
import numpy as np
import onnx

node = onnx.helper.make_node(
    'MaskedFill',
    inputs=['value', 'mask'],
    outputs=['output']
)

value = np.array(["a", "b", "c", "d"])
mask = np.array([True, False, True, False], dtype=bool)
# Expected output: the elements of value at the positions where mask is True.
output = np.array(["a", "c"])

# expect() is a test helper that is not defined in this snippet.
expect(node, inputs=[value, mask], outputs=[output],
       name='test_masked_fill')
```
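
In the MaskedFill example, the expected output holds the elements of `value` at the positions where `mask` is True. The same result can be reproduced with NumPy boolean indexing, which makes a convenient cross-check when writing tests. This is only a sketch that mirrors the example's expected output; it is not the operator's implementation.

```python
import numpy as np

value = np.array(["a", "b", "c", "d"])
mask = np.array([True, False, True, False], dtype=bool)

# Boolean indexing keeps the entries of `value` whose mask entry is True.
expected = value[mask]
print(expected.tolist())  # ['a', 'c']
```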

</details>


### StringRaggedTensorToDense

TODO

### StringMapping

TODO

## Math operators


### Inverse

TODO

### NegPos

TODO

### SegmentExtraction

TODO

### SegmentSum

TODO

## Tensor operators

### RaggedTensorToSparse

TODO

### RaggedTensorToDense

TODO

### Template

<details>
<summary>Template details</summary>

Description

#### Inputs

***name: tensor(type)***

Description

#### Outputs

***name: tensor(type)***

Description

#### Examples


```python

node = onnx.helper.make_node(
    'StringRegexReplace',
    inputs=['text', 'pattern', 'rewrite'],
    outputs=['y'],
)

text = np.array([['def myfunc():'], ['def dummy():']])
pattern = np.array([r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):'])
rewrite = np.array([r'static PyObject* py_\1(void) {'])
y = [['static PyObject* py_myfunc(void) {'],
     ['static PyObject* py_dummy(void) {']]

expect(node, inputs=[text, pattern, rewrite], outputs=[y],
       name='test_string_regex_replace')
```

</details>


## Azure operators

### OpenAIAudioToText

<details>
<summary>OpenAIAudioToText details</summary>


The OpenAIAudioToText operator talks to [OpenAI audio](https://platform.openai.com/docs/api-reference/audio) endpoints.


#### Attributes

***model_uri:string***

Endpoint URI, like "https://api.openai.com/v1/audio/transcriptions".

***audio_format:string***

The format of the audio, "wav" by default.

#### Inputs

***auth_token: tensor(string)***

An access token that comes with an OpenAI subscription.

***model_name: tensor(string)***

Model name to send to the endpoint, such as "whisper-1".

***response_format: tensor(string)***

Expected format of the response, either "text" or "json".

***audio_blob: tensor(uint8)***

A byte array containing raw data from the audio file.

#### Outputs

***transcriptions: tensor(string)***


#### Examples

Note: the OpenAIAudioToText operator composes a request based on the last part of the input and output names, split by "/".

This means the input names must be of the format:
- auth_token: "whatever-name-you-want-to-use"
- model_name: ".../.../.../model_name"
- response_format: ".../.../.../response_format"
- audio_blob: ".../.../.../file"

The output name must be of the format:
- transcriptions: ".../.../.../transcriptions"

Hence a model can contain multiple OpenAIAudioToText operators that accept different inputs and give different outputs.
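
To make the naming rule concrete, the sketch below extracts the last "/"-separated part of each tensor name, which is the part the note says the operator uses when composing the request. `request_field` is a hypothetical helper for illustration only; the actual parsing happens inside the operator and may differ in detail.

```python
def request_field(tensor_name: str) -> str:
    """Return the last "/"-separated part of a tensor name."""
    return tensor_name.rsplit("/", 1)[-1]


# Nodes can use different prefixes and still map to the same field names.
names = [
    "node_1/model_name",
    "node_1/response_format",
    "node_1/file",
    "node_2/transcriptions",
]
fields = [request_field(name) for name in names]
print(fields)  # ['model_name', 'response_format', 'file', 'transcriptions']
```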

The sample code below builds and runs a model whose input and output names follow this format.


```python
import os
import numpy as np

import onnx
from onnx import helper, TensorProto
from onnxruntime import InferenceSession, SessionOptions
from onnxruntime_extensions import get_library_path


openai_model_uri = os.getenv('URI', '')  # read uri from env
openai_auth_token = os.getenv('AUTH', '')  # read auth token from env
test_data_dir = '.'  # assumption: the directory that contains test16.wav


def create_openai_audio_model():
    auth_token = helper.make_tensor_value_info('auth_token', TensorProto.STRING, [1])
    model_name = helper.make_tensor_value_info('node_1/model_name', TensorProto.STRING, [1])
    response_format = helper.make_tensor_value_info('node_1/response_format', TensorProto.STRING, [-1])
    file = helper.make_tensor_value_info('node_1/file', TensorProto.UINT8, [-1])
    transcriptions = helper.make_tensor_value_info('node_1/transcriptions', TensorProto.STRING, [-1])

    invoker = helper.make_node('OpenAIAudioToText',
                               ['auth_token', 'node_1/model_name', 'node_1/response_format', 'node_1/file'],  # names must follow the format
                               ['node_1/transcriptions'],  # names must follow the format
                               domain='com.microsoft.extensions',
                               name='audio_invoker',
                               model_uri=openai_model_uri,
                               audio_format='wav')

    graph = helper.make_graph([invoker], 'graph', [auth_token, model_name, response_format, file], [transcriptions])
    model = helper.make_model(graph,
                              opset_imports=[helper.make_operatorsetid('com.microsoft.extensions', 1)])

    onnx.save(model, 'openai_audio.onnx')


create_openai_audio_model()
opt = SessionOptions()
opt.register_custom_ops_library(get_library_path())
# Load the model from the path it was saved to above.
sess = InferenceSession('openai_audio.onnx',
                        opt, providers=["CPUExecutionProvider", "AzureExecutionProvider"])
auth_token = np.array([openai_auth_token])
model = np.array(['whisper-1'])
response_format = np.array(['text'])

with open(os.path.join(test_data_dir, "test16.wav"), "rb") as _f:
    audio_blob = np.asarray(list(_f.read()), dtype=np.uint8)
    ort_inputs = {
        "auth_token": auth_token,
        "node_1/model_name": model,
        "node_1/response_format": response_format,
        "node_1/file": audio_blob,
    }
    out = sess.run(None, ort_inputs)[0]
```
</details>


### AzureTextToText

<details>
<summary>AzureTextToText details</summary>


AzureTextToText talks to a GPT model hosted by the [Azure OpenAI service](https://learn.microsoft.com/en-us/azure/ai-services/openai/).


#### Attributes

***model_uri:string***

Endpoint URI, like "https://myname-aoai-test.openai.azure.com/openai/deployments/mydeploy/chat/completions?api-version=2023-05-15".

#### Inputs

***auth_token: tensor(string)***

An access token that comes with an Azure OpenAI subscription.

***chat: tensor(string)***

A JSON string in the requested [format](https://learn.microsoft.com/en-us/azure/ai-services/openai/chatgpt-quickstart?tabs=command-line&pivots=rest-api).

#### Outputs

***response: tensor(string)***

A JSON string containing the response.


#### Examples


```python
import os
import numpy as np

import onnx
from onnx import helper, TensorProto
from onnxruntime import InferenceSession, SessionOptions
from onnxruntime_extensions import get_library_path


azure_model_uri = os.getenv('URI', '')  # read uri from env
azure_auth_token = os.getenv('AUTH', '')  # read auth token from env


def create_azure_chat_model():
    auth_token = helper.make_tensor_value_info('auth_token', TensorProto.STRING, [-1])
    chat = helper.make_tensor_value_info('chat', TensorProto.STRING, [-1])
    response = helper.make_tensor_value_info('response', TensorProto.STRING, [-1])

    invoker = helper.make_node('AzureTextToText', ['auth_token', 'chat'], ['response'],
                               domain='com.microsoft.extensions',
                               name='chat_invoker',
                               model_uri=azure_model_uri)

    graph = helper.make_graph([invoker], 'graph', [auth_token, chat], [response])
    model = helper.make_model(graph,
                              opset_imports=[helper.make_operatorsetid('com.microsoft.extensions', 1)])

    onnx.save(model, 'azure_chat.onnx')


create_azure_chat_model()
opt = SessionOptions()
opt.register_custom_ops_library(get_library_path())
# Load the model from the path it was saved to above.
sess = InferenceSession('azure_chat.onnx', opt, providers=["CPUExecutionProvider", "AzureExecutionProvider"])
auth_token = np.array([azure_auth_token])
chat = np.array([r'{"messages":[{"role": "system", "content": "You are a helpful assistant."},{"role": "user", "content": "Does Azure OpenAI support customer managed keys?"},{"role": "assistant", "content": "Yes, customer managed keys are supported by Azure OpenAI."},{"role": "user", "content": "Do other Azure AI services support this too?"}]}'])
ort_inputs = {
    "auth_token": auth_token,
    "chat": chat,
}
out = sess.run(None, ort_inputs)[0]
```
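
The `chat` input in the AzureTextToText example is written as one long raw string. As an alternative sketch, the same payload can be assembled from Python objects with the standard `json` module, which avoids quoting mistakes. The message contents are copied from the example; only the way the string is built differs.

```python
import json

import numpy as np

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Does Azure OpenAI support customer managed keys?"},
    {"role": "assistant", "content": "Yes, customer managed keys are supported by Azure OpenAI."},
    {"role": "user", "content": "Do other Azure AI services support this too?"},
]

# Serialize the request body and wrap it in a 1-D string tensor for the `chat` input.
chat = np.array([json.dumps({"messages": messages})])

# Round-trip check: the tensor element parses back into the same structure.
assert json.loads(chat[0])["messages"] == messages
```

The resulting `chat` array can be passed to `sess.run` in place of the hand-written string.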
</details>


### AzureTritonInvoker

<details>
<summary>AzureTritonInvoker details</summary>


AzureTritonInvoker talks to [Azure Machine Learning Triton services](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-deploy-with-triton?view=azureml-api-2&tabs=azure-cli%2Cendpoint).


#### Attributes

***model_uri:string***

Endpoint URI, like "https://endpoint-12345678.westus.inference.ml.azure.com".

***model_name:string***

***model_version:string***

A version string, like "1" or "2".

#### Inputs

***auth_token: tensor(string)***

An access token that comes with an Azure Machine Learning model deployment.

***inputs: tensor(variadic)***

Tensors of any supported ONNX data type.

#### Outputs

***outputs: tensor(variadic)***

Tensors of any supported ONNX data type.


#### Examples


```python
import os
import numpy as np

import onnx
from onnx import helper, TensorProto
from onnxruntime import InferenceSession, SessionOptions
from onnxruntime_extensions import get_library_path


triton_uri = os.getenv('URI', '')  # read uri from env
triton_auth_token = os.getenv('AUTH', '')  # read auth token from env


def createAddf():
    auth_token = helper.make_tensor_value_info('auth_token', TensorProto.STRING, [-1])
    X = helper.make_tensor_value_info('X', TensorProto.FLOAT, [-1])
    Y = helper.make_tensor_value_info('Y', TensorProto.FLOAT, [-1])
    Z = helper.make_tensor_value_info('Z', TensorProto.FLOAT, [-1])
    invoker = helper.make_node('AzureTritonInvoker', ['auth_token', 'X', 'Y'], ['Z'],
                               domain='com.microsoft.extensions', name='triton_invoker',
                               model_uri=triton_uri,
                               model_name='addf', model_version='1')
    graph = helper.make_graph([invoker], 'graph', [auth_token, X, Y], [Z])
    model = helper.make_model(graph,
                              opset_imports=[helper.make_operatorsetid('com.microsoft.extensions', 1)])
    onnx.save(model, 'triton_addf.onnx')


def run_add_f():
    opt = SessionOptions()
    opt.register_custom_ops_library(get_library_path())
    # Load the model from the path createAddf() saved it to.
    sess = InferenceSession('triton_addf.onnx',
                            opt, providers=["CPUExecutionProvider", "AzureExecutionProvider"])
    auth_token = np.array([triton_auth_token])
    x = np.array([1, 2, 3, 4]).astype(np.float32)
    y = np.array([4, 3, 2, 1]).astype(np.float32)
    ort_inputs = {
        "auth_token": auth_token,
        "X": x,
        "Y": y
    }
    out = sess.run(None, ort_inputs)[0]
    return out


createAddf()
run_add_f()
```
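
The AzureTritonInvoker example does not state what the `addf` model computes. Assuming, from its name only, that it adds its two float inputs element-wise, the expected output for the inputs used in the example can be computed locally with NumPy and compared against `out`. Treat this as a hypothetical cross-check, not as documented behaviour of the deployed model.

```python
import numpy as np

x = np.array([1, 2, 3, 4]).astype(np.float32)
y = np.array([4, 3, 2, 1]).astype(np.float32)

# Assumption: the deployed 'addf' model returns x + y element-wise.
expected = x + y
print(expected.tolist())  # [5.0, 5.0, 5.0, 5.0]

# With a live endpoint, the result of sess.run could be compared like this:
# np.testing.assert_allclose(out, expected)
```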
</details>