1# Operators
2
3
4## Natural language operators
5
6### BertTokenizer
7
8<details>
9<summary>BertTokenizer details</summary>
10
BertTokenizer replicates the `encode_plus` function of the [Hugging Face BertTokenizer](https://huggingface.co/transformers/_modules/transformers/models/bert/tokenization_bert.html#BertTokenizer).
12
13#### Inputs
14
15***text: tensor(string)*** The string tensor for tokenization
16
17#### Attributes
18
19***vocab_file: string***
20
The content of the vocabulary file, in the same format as the Hugging Face vocabulary file.
22
23***do_lower_case: int64_t*** (default is 1, 1 represents True, 0 represents False)
24
25Whether or not to lowercase the input when tokenizing.
26
27***do_basic_tokenize: int64_t*** (default is 1, 1 represents True, 0 represents False)
28
29Whether or not to do basic tokenization before WordPiece.
30
31***unk_token: string***
32
33The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
34token instead.
35
36***sep_token: string***
37
38The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
39sequence classification or for a text and a question for question answering. It is also used as the last
40token of a sequence built with special tokens.
41
42***pad_token: string***
43
44The token used for padding, for example when batching sequences of different lengths.
45
46***cls_token: string***
47
48The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.
49
50***mask_token: string***
51
52The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict.
53
54***tokenize_chinese_chars: int64_t*** (default is 1, 1 represents True, 0 represents False)
55
56Whether or not to tokenize Chinese characters.
57
58***strip_accents: int64_t*** (default is 1, 1 represents True, 0 represents False)
59
Whether or not to strip all accents. If this option is not specified, it is determined by the
value of `do_lower_case` (as in the original BERT).
62
63***tokenize_punctuation: int64_t*** (default is 0, 1 represents True, 0 represents False)
64
65Splits punctuation on a piece of text.
66
67***remove_control_chars: int64_t*** (default is 0, 1 represents True, 0 represents False)
68
Removes control characters (such as NUL, BEL) from the text.
70
71***truncation_strategy_name: string***
72
The name of the truncation strategy. It can be `longest_first`, `only_first`, `only_second`, or `longest_from_back`.
74
75#### Outputs
76
***input_ids: tensor(int64_t)***

List of token ids.

***token_type_ids: tensor(int64_t)***

List of token type ids.

***attention_mask: tensor(int64_t)***

List of indices specifying which tokens should be attended to by the model.
89
90
91#### Examples
92
```python
import numpy as np
import onnx
import transformers

bert_cased_tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-cased')

node = onnx.helper.make_node(
    'BertTokenizer',
    inputs=['text'],
    outputs=['input_ids', 'token_type_ids', 'attention_mask'],
)

text = "Hello world louder"
inputs = np.array([text], dtype=object)

# encode_plus returns a dict-like object holding the three expected outputs.
bert_tokenize_result = bert_cased_tokenizer.encode_plus(text)

input_ids = np.array(bert_tokenize_result['input_ids'], dtype=np.int64)
token_type_ids = np.array(bert_tokenize_result['token_type_ids'], dtype=np.int64)
attention_mask = np.array(bert_tokenize_result['attention_mask'], dtype=np.int64)

# `expect` is the test helper used by the examples in this document; it is not defined here.
expect(node, inputs=[inputs],
       outputs=[input_ids, token_type_ids, attention_mask], name='test_bert_tokenizer')
```
116</details>
117
118### BertTokenizerDecoder
119
120<details>
121<summary>BertTokenizerDecoder details</summary>
122
BertTokenizerDecoder replicates the `decode` function of the [Hugging Face BertTokenizer](https://huggingface.co/transformers/_modules/transformers/models/bert/tokenization_bert.html#BertTokenizer).
124
125#### Inputs
126
127***token_ids: tensor(int64)***
128
129List of tokenized input ids.
130
131***indices: tensor(int64)***
132
List of `[start_position, end_position]` pairs indicating which segments of the input ids should be decoded. This input is only used when the attribute `use_indices` is 1.
134
135Usually, it is used to decode the slot in the text.
136
137#### Attributes
138
139***vocab_file: string***
140
The content of the vocabulary file, in the same format as the Hugging Face vocabulary file.
142
143***unk_token: string***
144
145The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
146token instead.
147
148***sep_token: string***
149
150The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
151sequence classification or for a text and a question for question answering. It is also used as the last
152token of a sequence built with special tokens.
153
154***pad_token: string***
155
156The token used for padding, for example when batching sequences of different lengths.
157
158***cls_token: string***
159
160The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.
161
162***mask_token: string***
163
164The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict.
165
166***suffix_indicator: string***
167
168The suffix indicator.
169
170***use_indices: int64_t***
171
Whether to use the second input (`indices`).
173
174***skip_special_tokens: int64_t***
175
176Whether or not to remove special tokens in the decoding.
177
178***clean_up_tokenization_spaces: int64_t***
179
180Whether or not to clean up the tokenization spaces.
181
182#### Outputs
183
***sentences: tensor(string)***
185
186The decoded sentences.
187
188#### Examples
189
190
```python
import numpy as np
import onnx
import transformers

def get_file_content(path):
    with open(path, "rb") as file:
        return file.read()

bert_cased_tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-cased')
# save_vocabulary writes the vocabulary to ./bert-vocab.txt
bert_cased_tokenizer.save_vocabulary('.', 'bert')

node = onnx.helper.make_node(
    'BertTokenizerDecoder',
    inputs=['token_ids'],
    outputs=['sentences'],
    vocab_file=get_file_content("bert-vocab.txt")
)

text = "Hello world louder"
# The decoder consumes token ids, so convert the tokens before building the input.
token_ids = np.array(
    bert_cased_tokenizer.convert_tokens_to_ids(bert_cased_tokenizer.tokenize(text)),
    dtype=np.int64)
sentences = np.array([text], dtype=object)

expect(node, inputs=[token_ids],
       outputs=[sentences], name='test_bert_tokenizer_decoder')
```
217</details>
218
219
220
221### GPT2Tokenizer
222
223<details>
224<summary>GPT2Tokenizer details</summary>
225
GPT2Tokenizer performs byte-level BPE tokenization on the input tensor, based on the [Hugging Face version](https://huggingface.co/transformers/_modules/transformers/tokenization_gpt2.html).
227
228#### Attributes
229
230***vocab***
231
The **content** of the vocabulary file, in the same format as the [Hugging Face vocabulary file](https://huggingface.co/gpt2/resolve/main/vocab.json).

***merges***

The **content** of the merges file, in the same format as the [Hugging Face merges file](https://huggingface.co/gpt2/resolve/main/merges.txt).
237
238***padding_length(optional)***
239
When the input is a batch of queries, the tokenized result is a ragged tensor, so it must be padded into a rectangular tensor. `padding_length` selects the padding strategy: when it equals -1, every row is padded to the length of the longest row; when it is greater than 0, every row is padded to `padding_length`.
241
242The default value of `padding_length` is -1.
243
244#### Inputs
245
246***data: tensor(string)***
247
248The string tensor for tokenization
249
250#### Outputs
251
252***input_ids: tensor(int64)***
253
254The tokenized ids of input
255
256***attention_mask: tensor(int64)***
257
A tensor indicating which positions of `input_ids` are padding.
259
260#### Examples
261
262
```python
import numpy as np
import onnx

def get_file_content(path):
    with open(path, "rb") as file:
        return file.read()

# vocabulary_file and merges_file are paths to local copies of vocab.json and merges.txt.
node = onnx.helper.make_node(
    'GPT2Tokenizer',
    inputs=['x'],
    outputs=['y'],
    vocab=get_file_content(vocabulary_file),
    merges=get_file_content(merges_file)
)

x = np.array(["hey cortana"], dtype=object)
y = np.array([20342, 12794, 2271], dtype=np.int64)

expect(node, inputs=[x], outputs=[y],
       name='test_gpt2_tokenizer')
```
282</details>
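
The `padding_length` rule described above can be sketched in plain Python. This is a minimal illustration, not the operator's implementation. The helper name `pad_ragged`, the pad id of 0, the mask convention (1 for real tokens, 0 for padding), and the truncation of rows longer than a positive `padding_length` are all assumptions that the text above does not state.

```python
def pad_ragged(rows, padding_length=-1, pad_id=0):
    """Pad ragged token-id rows into a rectangular batch plus a mask."""
    # -1 means "pad to the longest row"; a positive value is a fixed width.
    if padding_length == -1:
        width = max((len(r) for r in rows), default=0)
    else:
        width = padding_length
    input_ids, attention_mask = [], []
    for row in rows:
        kept = list(row[:width])  # assumption: over-long rows are truncated
        pad = width - len(kept)
        input_ids.append(kept + [pad_id] * pad)
        attention_mask.append([1] * len(kept) + [0] * pad)
    return input_ids, attention_mask


ids, mask = pad_ragged([[20342, 12794, 2271], [20342]])
print(ids)
print(mask)
```

With the default `padding_length=-1`, the shorter row is padded to the length of the longer one, and the mask marks the padded positions with 0.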
283
284### WordpieceTokenizer
285
286<details>
287<summary>WordpieceTokenizer details</summary>
288
289
WordpieceTokenizer performs WordPiece tokenization on the input tensor,
based on the [Hugging Face version](https://huggingface.co/transformers/model_doc/bert.html#WordpieceTokenizer).
[WordpieceTokenizer](https://github.com/tensorflow/text/blob/master/docs/api_docs/python/text/WordpieceTokenizer.md)
from *tensorflow_text* can be implemented by a pair of nodes:
*StringRegexSplitWithOffsets* followed by *WordpieceTokenizer*.
296
297#### Attributes
298
299***vocab***
300
The **content** of the vocabulary file, in the same format as the
[Hugging Face vocabulary file](https://huggingface.co/gpt2/resolve/main/vocab.json).
303
304***suffix_indicator***
305
306Suffix added to token not in the first position before looking into the vocabulary.
307
308***unk_token***
309
310Unknown tokens. Every token not found in the vocabulary is replaced by this one.
311
312***max_input_chars_per_word***
313
314Maximum number of characters per token (optional, defaults to 200).
315
316#### Inputs
317
318***data: tensor(string)***
319
320The string tensor for tokenization
321
***row_indices: tensor(int64)*** Empty, or the indices of the first token of every input sentence.
`indices[i+1] - indices[i]` is the number of tokens in input `i`.

[WordpieceTokenizer](https://github.com/tensorflow/text/blob/master/docs/api_docs/python/text/WordpieceTokenizer.md)
involves two steps: the first splits sentences into words, and the second splits
every word into tokens. This operator only implements the second step.
The first one can be done with the operator *StringRegexSplitWithOffsets*.
This input can either be empty or be the third output
of the operator *StringRegexSplitWithOffsets*.
331
332#### Outputs
333
334***tokens: tensor(string)*** Every token.
335
336***token_indices: tensor(int32)*** Indices of each token. -1 means a token outside the vocabulary.
337
***row_indices: tensor(int64)*** Indices of the first token of every input sentence.
`indices[i+1] - indices[i]` is the number of tokens in input `i`.
These are the updated row indices given as input, or new ones if the second input is empty.
341
342#### Examples
343
344
```python
import json

import numpy as np
from onnx import helper, onnx_pb as onnx_proto

words = ["want", "##want",
         "##ed", "wa", "un", "runn", "##ing"]
vocab = {w: i + 10 for i, w in enumerate(words)}
st = json.dumps(vocab)
domain = 'ai.onnx.contrib'
mkv = helper.make_tensor_value_info
reg = helper.make_tensor(
    "pattern", onnx_proto.TensorProto.STRING, [1, ], ["(\\s)".encode('ascii')])
reg_empty = helper.make_tensor(
    "keep_pattern", onnx_proto.TensorProto.STRING, [0, ], [])

nodes = [
    # Step 1: split every sentence into words.
    helper.make_node(
        'StringRegexSplitWithOffsets',
        inputs=['text', 'pattern', 'keep_pattern'],
        outputs=['words', 'begin_end', 'indices'],
        name='StringRegexSplitOpName',
        domain=domain),
    # Step 2: split every word into WordPiece tokens.
    helper.make_node(
        'WordpieceTokenizer',
        inputs=['words', 'indices'],
        outputs=['out0', 'out1', 'out2'],
        name='WordpieceTokenizerOpName',
        domain=domain,
        vocab=st.encode('utf-8'),
        suffix_indicator="##",
        unk_token="[UNK]")
]
inputs = [mkv('text', onnx_proto.TensorProto.STRING, [None])]
graph = helper.make_graph(
    nodes, 'test0', inputs, [
        mkv('out0', onnx_proto.TensorProto.STRING, [None]),
        mkv('out1', onnx_proto.TensorProto.INT32, [None]),
        mkv('out2', onnx_proto.TensorProto.INT64, [None]),
        mkv('words', onnx_proto.TensorProto.STRING, [None]),
        mkv('indices', onnx_proto.TensorProto.INT64, [None])],
    [reg, reg_empty])
model = helper.make_model(
    graph, opset_imports=[helper.make_operatorsetid(domain, 1)])

text = np.array(["unwanted running", "unwantedX running"], dtype=object)
tokens = np.array(['un', '##want', '##ed', 'runn', '##ing', 'un', '##want', '##ed',
                   '[UNK]', 'runn', '##ing'], dtype=object)
indices = np.array([14, 11, 12, 15, 16, 14, 11, 12, -1, 15, 16], dtype=np.int32)
row_indices = np.array([0, 5, 11], dtype=np.int64)

expect(model, inputs=[text], outputs=[tokens, indices, row_indices],
       name='test_bert_tokenizer')
```
395
396</details>
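
The second step (splitting each word into pieces) can be sketched in plain Python as a greedy longest-match-first search. This is an illustration under assumptions, not the operator's code: the function names are hypothetical, whitespace splitting stands in for the regex split step, and the handling of an unmatched remainder (emit the unknown token once and stop) is inferred from the expected output of the example above.

```python
def wordpiece_word(word, vocab, suffix_indicator="##", unk_token="[UNK]",
                   max_input_chars_per_word=200):
    """Split one word into WordPiece tokens with greedy longest-match-first."""
    if len(word) > max_input_chars_per_word:
        return [unk_token], [-1]
    tokens, ids = [], []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = suffix_indicator + piece  # non-initial pieces get the suffix marker
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # Assumption taken from the example: the rest of the word becomes one unknown token.
            tokens.append(unk_token)
            ids.append(-1)
            break
        tokens.append(match)
        ids.append(vocab[match])
        start = end
    return tokens, ids


def wordpiece_batch(sentences, vocab):
    """Tokenize sentences and build row indices marking each sentence's first token."""
    tokens, token_indices, row_indices = [], [], [0]
    for sentence in sentences:
        for word in sentence.split():  # stand-in for the regex split step
            word_tokens, word_ids = wordpiece_word(word, vocab)
            tokens += word_tokens
            token_indices += word_ids
        row_indices.append(len(tokens))
    return tokens, token_indices, row_indices


words = ["want", "##want", "##ed", "wa", "un", "runn", "##ing"]
vocab = {w: i + 10 for i, w in enumerate(words)}
tokens, token_indices, row_indices = wordpiece_batch(
    ["unwanted running", "unwantedX running"], vocab)
print(tokens)
print(token_indices)
print(row_indices)
```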
397
398### SentencepieceTokenizer
399
400<details>
401<summary>SentencepieceTokenizer details</summary>
402
403SentencepieceTokenizer replicates [SentencepieceTokenizer](https://github.com/tensorflow/text/blob/master/docs/api_docs/python/text/SentencepieceTokenizer.md).
404
405#### Inputs
406
407***data: tensor(string)*** The string tensor for tokenization
408
***nbest_size: tensor(int64)*** A scalar for sampling.

- `nbest_size = {0,1}`: no sampling is performed (default).
- `nbest_size > 1`: samples from the `nbest_size` best results.
- `nbest_size < 0`: treats `nbest_size` as infinite and samples from all hypotheses (lattice) using the forward-filtering-and-backward-sampling algorithm.
413
414***alpha: tensor(float)*** A scalar for a smoothing parameter. Inverse temperature for probability rescaling.
415
416***reverse: tensor(bool)*** Reverses the tokenized sequence (Default = false)
417
418***add_bos: tensor(bool)*** Add beginning of sentence token to the result (Default = false)
419
420***add_eos: tensor(bool)*** Add end of sentence token to the result (Default = false).
421When reverse=True beginning/end of sentence tokens are added after reversing.
422
423#### Attributes
424
***model: string*** The serialized SentencePiece model proto, stored as a string.
426
427#### Outputs
428
429***tokens: tensor(int32)*** Indices of each token.
430
431***indices: tensor(int64)*** Indices of every first token of input sentences.
432`indices[i+1] - indices[i]` is the number of tokens in input `i`.
433
434Tokenized result of the input
435
436#### Examples
437
438
```python
import base64
import urllib.request

import numpy as np
import onnx

url = "https://github.com/microsoft/ort-customops/raw/main/test/data/test_sentencepiece_ops_model__6.txt"
with urllib.request.urlopen(url) as f:
    content = f.read()
# f.read() returns bytes holding the base64 text, so it can be decoded directly.
model = np.array(list(base64.decodebytes(content)), dtype=np.uint8)

node = onnx.helper.make_node(
    'SentencepieceTokenizer',
    inputs=['inputs', 'nbest_size', 'alpha', 'add_bos', 'add_eos', 'reverse'],
    outputs=['tokens', 'indices'],
    model=model
)

inputs = np.array(["Hello world", "Hello world louder"], dtype=object)
nbest_size = np.array([0], dtype=np.int64)
alpha = np.array([0], dtype=np.float32)
add_bos = np.array([0], dtype=np.bool_)
add_eos = np.array([0], dtype=np.bool_)
reverse = np.array([0], dtype=np.bool_)

tokens = np.array([17486, 1017, 17486, 1017, 155, 21869], dtype=np.int32)
indices = np.array([0, 2, 6], dtype=np.int64)

expect(node, inputs=[inputs, nbest_size, alpha, add_bos, add_eos, reverse],
       outputs=[tokens, indices], name='sp')
```
468</details>
469
470
471### BasicTokenizer
472
473<details>
474<summary>BasicTokenizer details</summary>
475
476TODO: is this still supported?
477
BasicTokenizer performs basic tokenization on the input string tensor, based on the [basic tokenizer in the Hugging Face BertTokenizer](https://huggingface.co/transformers/_modules/transformers/models/bert/tokenization_bert.html#BertTokenizer).
479
480#### Inputs
481
482***text: tensor(string)*** The string tensor for tokenization
483
484#### Attributes
485
486***do_lower_case: int64_t*** (default is 1, 1 represents True, 0 represents False)
487
488Whether or not to lowercase the input when tokenizing.
489
490***tokenize_chinese_chars: int64_t*** (default is 1, 1 represents True, 0 represents False)
491
492Whether or not to tokenize Chinese characters.
493
494***strip_accents: int64_t*** (default is 1, 1 represents True, 0 represents False)
495
Whether or not to strip all accents. If this option is not specified, it is determined by the
value of `do_lower_case` (as in the original BERT).
498
499***tokenize_punctuation: int64_t*** (default is 0, 1 represents True, 0 represents False)
500
501Splits punctuation on a piece of text.
502
503***remove_control_chars: int64_t*** (default is 0, 1 represents True, 0 represents False)
504
Removes control characters (such as NUL, BEL) from the text.
506
507#### Outputs
508
509***tokens: tensor(string)*** Tokenized tokens.
510
511#### Examples
512
```python
import numpy as np
import onnx
import transformers

tokenizer = transformers.BasicTokenizer()

node = onnx.helper.make_node(
    'BasicTokenizer',
    inputs=['text'],
    outputs=['tokens'],
)

text = "Hello world louder"
inputs = np.array([text], dtype=object)
# BasicTokenizer.tokenize returns a list of string tokens.
tokens = np.array(tokenizer.tokenize(text), dtype=object)

expect(node, inputs=[inputs],
       outputs=[tokens], name='test_basic_tokenizer')
```
530</details>
531
532
533### BlingFireSentenceBreaker
534
535TODO
536
537### BpeTokenizer
538
539TODO
540
541
542## String operators
543
544### StringEqual
545
546<details>
547<summary>StringEqual details</summary>
548
549Compares two strings and returns true if they are equal and false if not.
550
551#### Inputs
552
***x: tensor(string)***

The first string input.

***y: tensor(string)***

The second string input.

#### Outputs

***z: tensor(bool)***

Boolean tensor that is true where the two strings are equal and false otherwise.
566
567</details>
568
569
570### StringHash
571
572<details>
573<summary>StringHash details</summary>
574
575
576Hashes the input string based on the number of buckets
577
578#### Inputs
579
580***input: tensor(string)***
581
582The string to hash
583
584***num_buckets: tensor(int64)***
585
586The number of buckets (must be equal to 1?)
587
588#### Outputs
589
590***name: tensor(int64)***
591
592The hash value of the string
593
594</details>
595
596
597### StringHashFast
598
599<details>
600<summary>StringHashFast details</summary>
601
602
603A faster implementation of StringHash.
604
605</details>
606
607
608### StringJoin
609
610<details>
611<summary>StringJoin details</summary>
612
613
614Join an array of strings
615
616#### Inputs
617
618***input_X: tensor(string)***
619
620The input array of strings
621
622***input_sep: tensor(string)***
623
The string separator used in the resulting join.

***input_axis: tensor(int64)***

The axis along which to join.
629
630#### Outputs
631
632***out: tensor(string)***
633
634The resulting joined string
635
636#### Examples
637
638
```python

input_X = [["a", "b", "c"], ["aa", "bb", ""]]
input_sep = ";"
input_axis = 1

out = ["a;b;c", "aa;bb;"]

input_axis = 0

out = ['a;aa', 'b;bb', 'c;']
```

652</details>
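
The two axis cases shown in the StringJoin example can be reproduced with a short plain-Python sketch. The helper name `string_join_2d` is hypothetical, and the sketch only covers 2-D inputs represented as nested lists.

```python
def string_join_2d(rows, sep, axis):
    """Join a 2-D list of strings along the given axis."""
    if axis == 1:
        # Join the elements within each row.
        return [sep.join(row) for row in rows]
    if axis == 0:
        # Join the elements within each column.
        return [sep.join(col) for col in zip(*rows)]
    raise ValueError("this sketch only handles axis 0 or 1")


input_X = [["a", "b", "c"], ["aa", "bb", ""]]
print(string_join_2d(input_X, ";", 1))
print(string_join_2d(input_X, ";", 0))
```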
653
654
655### StringRegexReplace
656
657<details>
658<summary>StringRegexReplace details</summary>
659
660
661String replacement based on [Re2-format](https://github.com/google/re2/wiki/Syntax) regular expressions.
662
663#### Inputs
664
665***text: tensor(string)***
666
String tensor in which matches are replaced.
668
669***pattern: tensor(string)***
670
671Pattern of the regular expression.
672
673***rewrite: tensor(string)***
674
675Replacement.
676
677#### Attributes
678
679***global_replace: int64*** (default is 1)
680
If 1, replaces every match of the pattern; if 0, replaces only the first match.
682
683#### Outputs
684
685***output: tensor(string)***
686
687String with replacements.
688
689#### Examples
690
```python
import numpy as np
import onnx

node = onnx.helper.make_node(
    'StringRegexReplace',
    inputs=['text', 'pattern', 'rewrite'],
    outputs=['y'],
)

text = np.array([['def myfunc():'], ['def dummy():']])
pattern = np.array([r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):'])
# \1 refers to the first capture group (RE2 rewrite syntax).
rewrite = np.array([r'static PyObject* py_\1(void) {'])
y = [['static PyObject* py_myfunc(void) {'],
     ['static PyObject* py_dummy(void) {']]

expect(node, inputs=[text, pattern, rewrite], outputs=[y],
       name='test_string_regex_replace')
```
708
709</details>
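
For a quick check of a pattern and rewrite string outside ONNX, the behavior can be approximated with Python's `re` module. This is only a sketch: Python's `re` is not RE2, so the two engines agree only on the syntax they share (which includes this example), and the helper name `string_regex_replace` is hypothetical.

```python
import re


def string_regex_replace(texts, pattern, rewrite, global_replace=1):
    """Apply a regex replacement to every string in a list."""
    # re.sub treats count=0 as "replace every match" and count=1 as "first match only".
    count = 0 if global_replace else 1
    return [re.sub(pattern, rewrite, text, count=count) for text in texts]


result = string_regex_replace(
    ["def myfunc():", "def dummy():"],
    r"def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):",
    r"static PyObject* py_\1(void) {",
)
print(result)
```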
710
711### StringECMARegexReplace
712
713<details>
714<summary>StringECMARegexReplace details</summary>
715
716String replacement based on [ECMA-format](https://en.cppreference.com/w/cpp/regex/ecmascript) regular expressions.
717
718#### Inputs
719
720***text: tensor(string)***
721
String tensor in which matches are replaced.
723
724***pattern: tensor(string)***
725
726Pattern of the regular expression.
727
728***rewrite: tensor(string)***
729
730Replacement.
731
732#### Attributes
733
734***global_replace: int64*** (default is 1)
735
If 1, replaces every match of the pattern; if 0, replaces only the first match.

***ignore_case: int64*** (default is 0)

If 1, the pattern is matched case-insensitively.
742
743#### Outputs
744
745***output: tensor(string)***
746
747String with replacements.
748
749#### Examples
750
751
```python
import numpy as np
import onnx

node = onnx.helper.make_node(
    'StringECMARegexReplace',
    inputs=['text', 'pattern', 'rewrite'],
    outputs=['y'],
)

text = np.array([['def myfunc():'], ['def dummy():']])
pattern = np.array([r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):'])
# $1 refers to the first capture group (ECMAScript rewrite syntax).
rewrite = np.array([r'static PyObject* py_$1(void) {'])
y = [['static PyObject* py_myfunc(void) {'],
     ['static PyObject* py_dummy(void) {']]

expect(node, inputs=[text, pattern, rewrite], outputs=[y],
       name='test_string_ecma_regex_replace')
```
769
770</details>
771
772
773
774### StringSplit
775
776TODO
777
778### StringUpper
779
780TODO
781
782### StringLower
783
784TODO
785
786### StringLength
787
788<details>
<summary>StringLength details</summary>

Gets the length of each string element in the input tensor, similar to `len("abcde")` in Python.
792
793#### Inputs
794
795***data: tensor(string)***
796
String tensor whose element lengths are computed.
798
799#### Outputs
800
801***output: tensor(int64)***
802
803Data length tensor.
804
805#### Examples
806
807
```python
import numpy as np
import onnx

node = onnx.helper.make_node(
    'StringLength',
    inputs=['x'],
    outputs=['y']
)

x = np.array(["abcdef", "hijkl"])
y = np.array([len(x[0]), len(x[1])], dtype=np.int64)

expect(node, inputs=[x], outputs=[y],
       name='test_string_length')
```
823</details>
824
825### StringConcat
826
827<details>
828<summary>StringConcat details</summary>
829
Concatenates the corresponding strings of two string tensors. The two input tensors must have the same shape.

```python
output = []
shape = input1.shape
input1 = input1.flatten()
input2 = input2.flatten()
for i in range(len(input1)):
    output.append(input1[i] + input2[i])
output = np.array(output).reshape(shape)
```
841
842#### Inputs
843
844***input_1: tensor(string)***
845
846The first string tensor.
847
848***input_2: tensor(string)***
849
850The second string tensor.
851
852
853#### Outputs
854
855***output: tensor(string)***
856
857The result.
858
859#### Examples
860
861
```python
import numpy as np
import onnx

node = onnx.helper.make_node(
    'StringConcat',
    inputs=['x', 'y'],
    outputs=['result'],
)

x = np.array(["abcd", "efgh"])
y = np.array(["wxyz", "stuv"])
result = np.array([x[0] + y[0], x[1] + y[1]])

expect(node, inputs=[x, y], outputs=[result],
       name='test_string_concat')
```
877
878</details>
879
880### StringRegexSplitWithOffsets
881
882<details>
883<summary>StringRegexSplitWithOffsets details</summary>
884
885Splits string based on regular expressions.
886
887#### Inputs
888
889***text: tensor(string)***
890
891String tensor to extract slices from.
892
893***delim_regex_pattern: tensor(string)***
894
Splitting pattern of the regular expression.
896
897***keep_delim_regex_pattern: tensor(string)***
898
899By default, delimiters are not included in the split string results. Delimiters may be included by specifying a regex pattern keep_delim_regex_pattern.
900
901#### Outputs
902
903***words: tensor(string)*** Tensor of words.
904
905***offsets: tensor(int64)*** 2D tensor with 3 columns:
906sentence index, position of the first character, position of the last one (excluded)
907
908***row_indices: tensor(int64)*** Indices of every first token of input sentences.
909`row_indices[i+1] - row_indices[i]` is the number of tokens in input `i`.
910These are updates row indices given as inputs or new ones if the second input is empty.
911
912
913#### Examples
914
915
```python
import numpy as np
import onnx

node = onnx.helper.make_node(
    'StringRegexSplitWithOffsets',
    inputs=['text', 'pattern', 'keep_pattern'],
    outputs=['y', 'begin_end', 'indices'],
)

text = np.array(["hello there"])
pattern = np.array([r'\s'])
# Delimiters matching keep_pattern are kept in the output.
keep_pattern = np.array([r'\s'])
y = np.array(["hello", " ", "there"])
z1 = np.array([[0, 0, 5],
               [0, 5, 6],
               [0, 6, 11]], dtype=np.int64)
z2 = np.array([0, 2], dtype=np.int64)

expect(node, inputs=[text, pattern, keep_pattern], outputs=[y, z1, z2],
       name='test_string_regex_split_with_offsets')
```
936
937</details>
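
The `words` and `offsets` outputs can be approximated in plain Python to see how the delimiter and keep patterns interact. This is a sketch under assumptions: it uses Python's `re` rather than the operator's regex engine, the helper name is hypothetical, empty pieces between adjacent delimiters are dropped, and the `row_indices` output is not computed.

```python
import re


def regex_split_with_offsets(texts, delim_pattern, keep_delim_pattern=""):
    """Split strings on a delimiter regex, returning words and [row, begin, end] offsets."""
    delim = re.compile(delim_pattern)
    keep = re.compile(keep_delim_pattern) if keep_delim_pattern else None
    words, offsets = [], []
    for row, text in enumerate(texts):
        pos = 0
        for m in delim.finditer(text):
            if m.start() > pos:  # text between two delimiters
                words.append(text[pos:m.start()])
                offsets.append([row, pos, m.start()])
            # A delimiter is emitted only if it also matches the keep pattern.
            if keep is not None and m.end() > m.start() and keep.fullmatch(m.group()):
                words.append(m.group())
                offsets.append([row, m.start(), m.end()])
            pos = m.end()
        if pos < len(text):  # trailing piece after the last delimiter
            words.append(text[pos:])
            offsets.append([row, pos, len(text)])
    return words, offsets


split_words, split_offsets = regex_split_with_offsets(["hello there"], r"\s", r"\s")
print(split_words)
print(split_offsets)
```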
938
939
940### StringECMARegexSplitWithOffsets
941
942TODO
943
944### VectorToString
945
946<details>
947<summary>VectorToString details</summary>
948
VectorToString is the inverse of `StringToVector`; both share the same mapping table format:

    <string>\t<scalar_1>\s<scalar_2>\s<scalar_3>...<scalar_n>

A vector that is not in the mapping table outputs the value of the attribute `unk`.
954
955Example:
956
957*Attributes:*
958
959- `map`:
960 ```
961 a 0 0 1 2
962 b 0 1 2 3
963 d 0 1 3 4
964 ```
965
966- `unk`: "unknown_word"
967
968*Inputs:*
969- data: [[0,0,1,2],[0,1,3,4],[0,0,0,0]]
970
*Outputs:*
972- output: ["a", "d", "unknown_word" ]
973
974#### Attributes
975
***map***

The mapping table, in the format described above.

***unk***

The result returned when a vector is not found in the map.
983
984#### Inputs
985
986***data: tensor(T)***
987
988Input tensor
989
990#### Outputs
991
992***output: tensor(string)***
993
994The mapping result of the input
995
996#### Type Constraints
997***T:tensor(uint8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(int8), tensor(int16), tensor(int32), tensor(int64), tensor(bfloat16), tensor(float16), tensor(float), tensor(double), tensor(bool)***
998
999Constrain input and output types to numerical tensors.
1000
1001
1002#### Examples
1003
1004
```python
import numpy as np
import onnx

mapping_table = \
"""
a 0 0 1 2
b 0 1 2 3
d 0 1 3 4
"""

node = onnx.helper.make_node(
    'VectorToString',
    inputs=['x'],
    outputs=['y'],
    map=mapping_table,
    unk="unknown_word"
)

x = np.array([[0,0,1,2],[0,1,3,4],[0,0,0,0]], dtype=np.int64)
y = ["a", "d", "unknown_word"]

expect(node, inputs=[x], outputs=[y],
       name='test_vector_to_string')
```
1029</details>
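
The lookup that VectorToString performs can be sketched in plain Python. The helper names are hypothetical, and the parser splits each line on any whitespace, which assumes the mapped strings themselves contain no spaces or tabs.

```python
def parse_vector_map(table):
    """Parse '<string> <scalar_1> ... <scalar_n>' lines into a vector -> string dict."""
    mapping = {}
    for line in table.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        mapping[tuple(int(p) for p in parts[1:])] = parts[0]
    return mapping


def vector_to_string(vectors, table, unk):
    """Map each vector to its string, falling back to `unk` when unmapped."""
    mapping = parse_vector_map(table)
    return [mapping.get(tuple(v), unk) for v in vectors]


table = """
a 0 0 1 2
b 0 1 2 3
d 0 1 3 4
"""
strings = vector_to_string([[0, 0, 1, 2], [0, 1, 3, 4], [0, 0, 0, 0]], table, "unknown_word")
print(strings)
```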
1030
1031
1032### StringToVector
1033
1034<details>
1035<summary>StringToVector details</summary>
1036
StringToVector maps each string element of the input to the corresponding vector according to the mapping file. The mapping file is a UTF-8 encoded text file in TSV format:
1038
1039 <string>\t<scalar_1>\s<scalar_2>\s<scalar_3>...<scalar_n>
1040
1041Unmapped string will output the value of the attribute `unmapping_value`.
1042
1043Example:
1044
1045*Attributes:*
1046
1047- `mapping_file_name`: vocabulary.txt
1048 ```
1049 a 0 0 1 2
1050 b 0 1 2 3
1051 d 0 1 3 4
1052 ```
1053
1054- `unmapping_value`: [0 0 0 0]
1055
1056*Inputs:*
1057- data: ["a", "d", "e"]
1058
*Outputs:*
1060- output: [[0,0,1,2],[0,1,3,4],[0,0,0,0]]
1061
1062#### Attributes
1063
1064***mapping_file_name:string***
1065
1066The name of your string to vector mapping file.
1067
1068***unmapping_value:list(int)***
1069
1070Mapping result for unmapped string
1071
1072#### Inputs
1073
1074***data: tensor(string)***
1075
1076Input tensor
1077
1078#### Outputs
1079
1080***output: tensor(T)***
1081
1082The mapping result of the input
1083
1084#### Type Constraints
1085***T:tensor(uint8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(int8), tensor(int16), tensor(int32), tensor(int64), tensor(bfloat16), tensor(float16), tensor(float), tensor(double), tensor(bool)***
1086
1087Constrain input and output types to numerical tensors.
1088
1089#### Examples
1090
1091
```python
import numpy as np
import onnx

# What's in vocabulary.txt (shown here for reference only):
mapping_table = \
"""
a 0 0 1 2
b 0 1 2 3
d 0 1 3 4
"""

node = onnx.helper.make_node(
    'StringToVector',
    inputs=['x'],
    outputs=['y'],
    mapping_file_name='vocabulary.txt',
    unmapping_value=[0,0,0,0]
)

x = ["a", "d", "e"]
y = np.array([[0,0,1,2],[0,1,3,4],[0,0,0,0]], dtype=np.int64)

expect(node, inputs=[x], outputs=[y],
       name='test_string_to_vector')
```
1118
1119</details>
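
The string-to-vector lookup can be sketched in plain Python. The helper name `string_to_vector` is hypothetical, the mapping content is passed in as a string rather than read from a file, and lines are split on any whitespace, which assumes the mapped strings contain no spaces or tabs.

```python
def string_to_vector(strings, table, unmapping_value):
    """Map each string to its vector, falling back to `unmapping_value` when unmapped."""
    mapping = {}
    for line in table.splitlines():
        parts = line.split()
        if parts:  # skip blank lines
            mapping[parts[0]] = [int(p) for p in parts[1:]]
    return [mapping.get(s, list(unmapping_value)) for s in strings]


vocabulary = """
a 0 0 1 2
b 0 1 2 3
d 0 1 3 4
"""
vectors = string_to_vector(["a", "d", "e"], vocabulary, [0, 0, 0, 0])
print(vectors)
```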
1120
1121
1122
1123### StringSlice
1124
1125<details>
1126<summary>StringSlice details</summary>
1127
Slices each string element of the input tensor, similar to string slicing in Python:

```python
a = "abcdef"
b = a[1:2]     # "b"
c = a[3:1:-1]  # "dc"
```
1135
1136#### Inputs
1137
1138***data: tensor(string)***
1139
1140String tensor to extract slices from.
1141
1142***starts: tensor(int64/int32)***
1143
The tensor of starting indices, one per string in `data`. It has the same shape as `data`.

***ends: tensor(int64/int32)***

The tensor of ending indices, one per string in `data`. It has the same shape as `data`.

***steps(optional): tensor(int64/int32)***

The tensor of slice steps, one per string in `data`. It has the same shape as `data`. If `steps` is an empty tensor, the default value 1 is used for every string.
1153
1154#### Outputs
1155
1156***output: tensor(string)***
1157
1158Sliced data tensor.
1159
1160#### Examples
1161
1162
```python
import numpy as np
import onnx

node = onnx.helper.make_node(
    'StringSlice',
    inputs=['x', 'starts', 'ends', 'steps'],
    outputs=['y'],
)

x = np.array(["abcdef", "hijkl"])
y = np.array([x[0][1:3:1], x[1][3:1:-1]])
starts = np.array([1, 3], dtype=np.int64)
ends = np.array([3, 1], dtype=np.int64)
# The second string is sliced backwards, so its step is -1.
steps = np.array([1, -1], dtype=np.int64)

expect(node, inputs=[x, starts, ends, steps], outputs=[y],
       name='test_string_slice')
```
1181
1182</details>
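
The per-element slicing, including the default step of 1 when `steps` is empty, can be sketched in plain Python. The helper name `string_slice` is hypothetical and the sketch handles 1-D inputs only.

```python
def string_slice(data, starts, ends, steps=None):
    """Slice every string with its own start, end, and step."""
    if not steps:
        steps = [1] * len(data)  # default step when `steps` is omitted or empty
    return [s[b:e:k] for s, b, e, k in zip(data, starts, ends, steps)]


sliced = string_slice(["abcdef", "hijkl"], [1, 3], [3, 1], [1, -1])
print(sliced)
```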
1183
1184
1185### MaskedFill
1186
1187<details>
1188<summary>MaskedFill details</summary>
1189
1190
Fills elements of the input tensor with a value where the mask is True. The operator is similar to [`Tensor.masked_fill_`](https://pytorch.org/docs/stable/generated/torch.Tensor.masked_fill_.html#torch.Tensor.masked_fill_) in PyTorch.
1192
1193
1194#### Inputs
1195
1196***value: tensor(string)***
1197
The value to fill in with. Currently only the string type is supported, with scalar or vector (1-D) shape.

***mask: tensor(bool)***

The boolean mask. Its shape must match that of `value`.
1203
1204#### Outputs
1205
1206***output: tensor(string)***
1207
1208The filled output of input tensor.
1209
1210
1211#### Examples
1212
1213
```python
import numpy as np
import onnx

node = onnx.helper.make_node(
    'MaskedFill',
    inputs=['value', 'mask'],
    outputs=['output']
)

value = np.array(["a", "b", "c", "d"])
mask = np.array([True, False, True, False], dtype=bool)
output = np.array(["a", "c"])

expect(node, inputs=[value, mask], outputs=[output],
       name='test_masked_fill')
```
1231</details>
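
Going by the MaskedFill example (rather than the `masked_fill_` comparison), the output keeps the elements of `value` at positions where `mask` is true. A plain-Python sketch of that reading, with a hypothetical helper name:

```python
def masked_select(value, mask):
    """Keep the elements of `value` whose mask entry is true (reading taken from the example)."""
    return [v for v, keep in zip(value, mask) if keep]


selected = masked_select(["a", "b", "c", "d"], [True, False, True, False])
print(selected)
```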
1232
1233### StringRaggedTensorToDense
1234
1235TODO
1236
1237### StringMapping
1238
1239TODO
1240
1241## Math operators
1242
1243
1244### Inverse
1245
1246TODO
1247
1248### NegPos
1249
1250TODO
1251
1252### SegmentExtraction
1253
1254TODO
1255
1256### SegmentSum
1257
1258TODO
1259
1260## Tensor operators
1261
1262### RaggedTensorToSparse
1263
1264TODO
1265
1266### RaggedTensorToDense
1267
1268TODO
1269
1270### Template
1271
1272<details>
1273<summary>Template details</summary>
1274
1275Description
1276
1277#### Inputs
1278
1279***name: tensor(type)***
1280
1281Description
1282
1283#### Outputs
1284
1285***name: tensor(type)***
1286
1287Description
1288
1289#### Examples
1290
1291
1292```python
1293
1294node = onnx.helper.make_node(
1295 'StringRegexReplace',
1296 inputs=['text', 'pattern', 'rewrite'],
1297 outputs=['y'],
1298)
1299
1300text = np.array([['def myfunc():'], ['def dummy():']])
1301pattern = np.array([r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):'])
1302rewrite = np.array([r'static PyObject* py_\1(void) {'])
1303y = [['static PyObject* py_myfunc(void) {'],
1304 ['static PyObject* py_dummy(void) {']]
1305
1306expect(node, inputs=[text, pattern, rewrite], outputs=[y],
1307 name='test_string_regex_replace')
1308```
1309
1310</details>
1311