microsoft/onnxruntime-extensions

Mirrored from https://github.com/microsoft/onnxruntime-extensions

5e44a7c3c90cb9dc29f78c3322a60aa869dcf837

1	`# Operators`
2
3
4	`## Natural language operators`
5
6	`### BertTokenizer`
7
8	`<details>`
9	`<summary>BertTokenizer details</summary>`
10
BertTokenizer replicates the `encode_plus` function of the [Hugging Face BertTokenizer](https://huggingface.co/transformers/_modules/transformers/models/bert/tokenization_bert.html#BertTokenizer).
12
13	`#### Inputs`
14
15	`*text: tensor(string)* The string tensor for tokenization`
16
17	`#### Attributes`
18
*vocab_file: string*

The content of the vocabulary file, in the same format as the Hugging Face vocabulary file.
22
23	`*do_lower_case: int64_t* (default is 1, 1 represents True, 0 represents False)`
24
25	`Whether or not to lowercase the input when tokenizing.`
26
27	`*do_basic_tokenize: int64_t* (default is 1, 1 represents True, 0 represents False)`
28
29	`Whether or not to do basic tokenization before WordPiece.`
30
31	`*unk_token: string*`
32
33	`The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this`
34	`token instead.`
35
36	`*sep_token: string*`
37
38	`The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for`
39	`sequence classification or for a text and a question for question answering. It is also used as the last`
40	`token of a sequence built with special tokens.`
41
42	`*pad_token: string*`
43
44	`The token used for padding, for example when batching sequences of different lengths.`
45
46	`*cls_token: string*`
47
48	`The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.`
49
50	`*mask_token: string*`
51
52	`The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict.`
53
54	`*tokenize_chinese_chars: int64_t* (default is 1, 1 represents True, 0 represents False)`
55
56	`Whether or not to tokenize Chinese characters.`
57
*strip_accents: int64_t* (default is 1, 1 represents True, 0 represents False)

Whether or not to strip all accents. If this option is not specified, it is determined by the value of `do_lower_case` (as in the original BERT).
62
63	`*tokenize_punctuation: int64_t* (default is 0, 1 represents True, 0 represents False)`
64
65	`Splits punctuation on a piece of text.`
66
67	`*remove_control_chars: int64_t* (default is 0, 1 represents True, 0 represents False)`
68
Removes control characters (such as NUL and BEL) from the text.
70
71	`*truncation_strategy_name: string*`
72
The name of the truncation strategy. It can be `longest_first`, `only_first`, `only_second`, or `longest_from_back`.
74
75	`#### Outputs`
76
77	`*input_ids: tensor(int64_t)*`
78
79	`List of token ids.`
80
*token_type_ids: tensor(int64_t)*

List of token type ids.

*attention_mask: tensor(int64_t)*

List of indices specifying which tokens should be attended to by the model.
89
90
91	`#### Examples`
92
```python
import numpy as np
import onnx
import transformers

bert_cased_tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-cased')

node = onnx.helper.make_node(
    'BertTokenizer',
    inputs=['text'],
    outputs=['input_ids', 'token_type_ids', 'attention_mask'],
)

text = "Hello world louder"
inputs = np.array([text], dtype=object)

# encode_plus returns input_ids, token_type_ids and attention_mask.
bert_tokenize_result = bert_cased_tokenizer.encode_plus(text)

input_ids = np.array(bert_tokenize_result['input_ids'], dtype=np.int64)
token_type_ids = np.array(bert_tokenize_result['token_type_ids'], dtype=np.int64)
attention_mask = np.array(bert_tokenize_result['attention_mask'], dtype=np.int64)

# `expect` is the test helper used throughout this document; it is not defined here.
expect(node, inputs=[inputs],
       outputs=[input_ids, token_type_ids, attention_mask], name='test_bert_tokenizer')
```
116	`</details>`
117
118	`### BertTokenizerDecoder`
119
120	`<details>`
121	`<summary>BertTokenizerDecoder details</summary>`
122
BertTokenizerDecoder replicates the `decode` function of the [Hugging Face BertTokenizer](https://huggingface.co/transformers/_modules/transformers/models/bert/tokenization_bert.html#BertTokenizer).
124
125	`#### Inputs`
126
127	`*token_ids: tensor(int64)*`
128
129	`List of tokenized input ids.`
130
131	`*indices: tensor(int64)*`
132
List of `[start_position, end_position]` pairs indicating which segments of the input ids should be decoded. This input is only used when the attribute `use_indices` is 1.

It is usually used to decode a slot in the text.
136
137	`#### Attributes`
138
139	`*vocab_file: string*`
140
The content of the vocabulary file, in the same format as the Hugging Face vocabulary file.
142
143	`*unk_token: string*`
144
145	`The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this`
146	`token instead.`
147
148	`*sep_token: string*`
149
150	`The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for`
151	`sequence classification or for a text and a question for question answering. It is also used as the last`
152	`token of a sequence built with special tokens.`
153
154	`*pad_token: string*`
155
156	`The token used for padding, for example when batching sequences of different lengths.`
157
158	`*cls_token: string*`
159
160	`The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.`
161
162	`*mask_token: string*`
163
164	`The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict.`
165
166	`*suffix_indicator: string*`
167
168	`The suffix indicator.`
169
170	`*use_indices: int64_t*`
171
Whether to use the second input (`indices`).
173
174	`*skip_special_tokens: int64_t*`
175
176	`Whether or not to remove special tokens in the decoding.`
177
178	`*clean_up_tokenization_spaces: int64_t*`
179
180	`Whether or not to clean up the tokenization spaces.`
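
As a rough illustration of how `suffix_indicator` and `skip_special_tokens` can affect decoding, the sketch below joins WordPiece tokens back into text. It is a plain-Python approximation and not the operator's implementation; the helper name `join_wordpieces`, the default special-token list, and the sample tokens are hypothetical.

```python
def join_wordpieces(tokens, suffix_indicator="##",
                    special_tokens=("[CLS]", "[SEP]", "[PAD]", "[MASK]"),
                    skip_special_tokens=True):
    """Join WordPiece tokens into a sentence (illustrative sketch only)."""
    words = []
    for token in tokens:
        if skip_special_tokens and token in special_tokens:
            continue
        if token.startswith(suffix_indicator) and words:
            # A continuation piece is glued to the previous word.
            words[-1] += token[len(suffix_indicator):]
        else:
            words.append(token)
    return " ".join(words)


decoded = join_wordpieces(["[CLS]", "Hello", "world", "loud", "##er", "[SEP]"])
print(decoded)  # Hello world louder
```

The real operator works on token ids and looks the tokens up in `vocab_file` first; that lookup is left out here to keep the sketch short.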
181
182	`#### Outputs`
183
*sentences: tensor(string)*
185
186	`The decoded sentences.`
187
188	`#### Examples`
189
190
```python
import numpy as np
import onnx
import transformers

def get_file_content(path):
    with open(path, "rb") as file:
        return file.read()

bert_cased_tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-cased')
# save_vocabulary writes the vocabulary to ./bert-vocab.txt
bert_cased_tokenizer.save_vocabulary('.', 'bert')

node = onnx.helper.make_node(
    'BertTokenizerDecoder',
    inputs=['token_ids'],
    outputs=['sentences'],
    vocab_file=get_file_content("bert-vocab.txt")
)

text = "Hello world louder"
# Convert the text to token ids; the decoder turns them back into text.
tokens = bert_cased_tokenizer.tokenize(text)
token_ids = np.array(bert_cased_tokenizer.convert_tokens_to_ids(tokens), dtype=np.int64)
sentences = np.array([text], dtype=object)

expect(node, inputs=[token_ids],
       outputs=[sentences], name='test_bert_tokenizer_decoder')
```
217	`</details>`
218
219
220
221	`### GPT2Tokenizer`
222
223	`<details>`
224	`<summary>GPT2Tokenizer details</summary>`
225
GPT2Tokenizer performs byte-level BPE tokenization on the input tensor, based on the [hugging face version](https://huggingface.co/transformers/_modules/transformers/tokenization_gpt2.html).
227
228	`#### Attributes`
229
230	`*vocab*`
231
The content of the vocabulary file, in the same format as the [hugging face](https://huggingface.co/gpt2/resolve/main/vocab.json) one.

*merges*

The content of the merges file, in the same format as the [hugging face](https://huggingface.co/gpt2/resolve/main/merges.txt) one.
237
*padding_length (optional)*

When the input is a batch of queries, the tokenized result is a ragged tensor, so it must be padded into a regular tensor. `padding_length` selects the padding strategy: when it equals -1, each row is padded to the length of the longest row; when it is greater than 0, each row is padded to `padding_length`.

The default value of `padding_length` is -1.
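
The two padding strategies can be sketched in plain Python as follows. This is only an illustration of the description above: the helper name `pad_ragged`, the pad id of 0, and the truncation of rows longer than `padding_length` are assumptions, not confirmed behaviour of the operator.

```python
def pad_ragged(rows, padding_length=-1, pad_id=0):
    """Pad a list of token-id lists to a rectangular shape (sketch)."""
    # -1 means "pad to the longest row"; a positive value is the target length.
    target = max(len(r) for r in rows) if padding_length == -1 else padding_length
    input_ids, attention_mask = [], []
    for row in rows:
        row = row[:target]  # assumption: longer rows are truncated
        pad = target - len(row)
        input_ids.append(row + [pad_id] * pad)
        # 1 marks a real token, 0 marks padding (assumed convention).
        attention_mask.append([1] * len(row) + [0] * pad)
    return input_ids, attention_mask


ids, mask = pad_ragged([[20342, 12794, 2271], [20342]])
print(ids)   # [[20342, 12794, 2271], [20342, 0, 0]]
print(mask)  # [[1, 1, 1], [1, 0, 0]]
```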
243
244	`#### Inputs`
245
246	`*data: tensor(string)*`
247
248	`The string tensor for tokenization`
249
250	`#### Outputs`
251
252	`*input_ids: tensor(int64)*`
253
254	`The tokenized ids of input`
255
256	`*attention_mask: tensor(int64)*`
257
A tensor that indicates which part of `input_ids` is padding.
259
260	`#### Examples`
261
262
```python
import numpy as np
import onnx

def get_file_content(path):
    with open(path, "rb") as file:
        return file.read()

# vocabulary_file and merges_file are the paths of the vocab.json and merges.txt files.
node = onnx.helper.make_node(
    'GPT2Tokenizer',
    inputs=['x'],
    outputs=['y'],
    vocab=get_file_content(vocabulary_file),
    merges=get_file_content(merges_file)
)

x = np.array(["hey cortana"], dtype=object)
y = np.array([20342, 12794, 2271], dtype=np.int64)

expect(node, inputs=[x], outputs=[y],
       name='test_gpt2_tokenizer')
```
282	`</details>`
283
284	`### WordpieceTokenizer`
285
286	`<details>`
287	`<summary>WordpieceTokenizer details</summary>`
288
289
WordpieceTokenizer performs WordPiece tokenization on the input tensor,
based on the [hugging face version](https://huggingface.co/transformers/model_doc/bert.html#WordpieceTokenizer).
[WordpieceTokenizer](https://github.com/tensorflow/text/blob/master/docs/api_docs/python/text/WordpieceTokenizer.md)
from tensorflow_text can be implemented by a pair of nodes:
StringRegexSplitWithOffsets followed by WordpieceTokenizer.
296
297	`#### Attributes`
298
299	`*vocab*`
300
The content of the vocabulary file, in the same format as the
[hugging face](https://huggingface.co/gpt2/resolve/main/vocab.json) one.
303
304	`*suffix_indicator*`
305
306	`Suffix added to token not in the first position before looking into the vocabulary.`
307
308	`*unk_token*`
309
310	`Unknown tokens. Every token not found in the vocabulary is replaced by this one.`
311
312	`*max_input_chars_per_word*`
313
314	`Maximum number of characters per token (optional, defaults to 200).`
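
To show how the attributes above interact, here is a minimal sketch of greedy longest-match-first WordPiece splitting for a single word, following the Hugging Face reference behaviour. The function name `wordpiece` is hypothetical, and the operator's handling of words that match only partially may differ from this sketch.

```python
def wordpiece(word, vocab, suffix_indicator="##", unk_token="[UNK]",
              max_input_chars_per_word=200):
    """Greedy longest-match-first WordPiece split of one word (sketch)."""
    if len(word) > max_input_chars_per_word:
        return [unk_token]
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces after the first position get the suffix indicator.
                candidate = suffix_indicator + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # Reference behaviour: the whole word becomes the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


vocab = {"want", "##want", "##ed", "wa", "un", "runn", "##ing"}
print(wordpiece("unwanted", vocab))  # ['un', '##want', '##ed']
print(wordpiece("running", vocab))   # ['runn', '##ing']
```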
315
316	`#### Inputs`
317
318	`*data: tensor(string)*`
319
320	`The string tensor for tokenization`
321
*row_indices: tensor(int64)* Empty, or the indices of every first token of input sentences.
`indices[i+1] - indices[i]` is the number of tokens in input `i`.

[WordpieceTokenizer](https://github.com/tensorflow/text/blob/master/docs/api_docs/python/text/WordpieceTokenizer.md)
includes two steps. The first one splits sentences into words and the second one splits
every word into tokens. This operator only implements the second step.
The first one can be done with the operator StringRegexSplitWithOffsets.
This parameter can either be empty or be the third output
of the operator StringRegexSplitWithOffsets.
331
332	`#### Outputs`
333
334	`*tokens: tensor(string)* Every token.`
335
336	`*token_indices: tensor(int32)* Indices of each token. -1 means a token outside the vocabulary.`
337
*row_indices: tensor(int64)* Indices of every first token of input sentences.
`indices[i+1] - indices[i]` is the number of tokens in input `i`.
These are the updated row indices given as inputs, or new ones if the second input is empty.
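
The `row_indices` convention can be read as a list of offsets into the flat token list. The short sketch below regroups the tokens per input sentence; the helper name `split_by_rows` is hypothetical, and the values are the ones used in the example of this section.

```python
def split_by_rows(tokens, row_indices):
    """Group a flat token list into one list per input sentence."""
    return [tokens[row_indices[i]:row_indices[i + 1]]
            for i in range(len(row_indices) - 1)]


tokens = ['un', '##want', '##ed', 'runn', '##ing',
          'un', '##want', '##ed', '[UNK]', 'runn', '##ing']
row_indices = [0, 5, 11]
sentences = split_by_rows(tokens, row_indices)
print([len(s) for s in sentences])  # [5, 6]
```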
341
342	`#### Examples`
343
344
```python
import json

import numpy as np
from onnx import helper, onnx_pb as onnx_proto

domain = 'ai.onnx.contrib'
words = ["want", "##want",
         "##ed", "wa", "un", "runn", "##ing"]
vocab = {w: i + 10 for i, w in enumerate(words)}
st = json.dumps(vocab)
mkv = helper.make_tensor_value_info
reg = helper.make_tensor(
    "pattern", onnx_proto.TensorProto.STRING, [1, ], ["(\\s)".encode('ascii')])
reg_empty = helper.make_tensor(
    "keep_pattern", onnx_proto.TensorProto.STRING, [0, ], [])

nodes = [
    helper.make_node(
        'StringRegexSplitWithOffsets',
        inputs=['text', 'pattern', 'keep_pattern'],
        outputs=['words', 'begin_end', 'indices'],
        name='StringRegexSplitOpName',
        domain=domain),
    helper.make_node(
        'WordpieceTokenizer',
        inputs=['words', 'indices'],
        outputs=['out0', 'out1', 'out2'],
        name='WordpieceTokenizerOpName',
        domain=domain,
        vocab=st.encode('utf-8'),
        suffix_indicator="##",
        unk_token="[UNK]")
]
inputs = [mkv('text', onnx_proto.TensorProto.STRING, [None])]
graph = helper.make_graph(
    nodes, 'test0', inputs, [
        mkv('out0', onnx_proto.TensorProto.STRING, [None]),
        mkv('out1', onnx_proto.TensorProto.INT32, [None]),
        mkv('out2', onnx_proto.TensorProto.INT64, [None]),
        mkv('words', onnx_proto.TensorProto.STRING, [None]),
        mkv('indices', onnx_proto.TensorProto.INT64, [None])],
    [reg, reg_empty])
model = helper.make_model(
    graph, opset_imports=[helper.make_operatorsetid(domain, 1)])

text = np.array(["unwanted running", "unwantedX running"], dtype=object)
tokens = np.array(['un', '##want', '##ed', 'runn', '##ing', 'un', '##want', '##ed',
                   '[UNK]', 'runn', '##ing'], dtype=object)
indices = np.array([14, 11, 12, 15, 16, 14, 11, 12, -1, 15, 16], dtype=np.int32)
row_indices = np.array([0, 5, 11], dtype=np.int64)

expect(model, inputs=[text], outputs=[tokens, indices, row_indices],
       name='test_wordpiece_tokenizer')
```
395
396	`</details>`
397
398	`### SentencepieceTokenizer`
399
400	`<details>`
401	`<summary>SentencepieceTokenizer details</summary>`
402
403	`SentencepieceTokenizer replicates [SentencepieceTokenizer](https://github.com/tensorflow/text/blob/master/docs/api_docs/python/text/SentencepieceTokenizer.md).`
404
405	`#### Inputs`
406
407	`*data: tensor(string)* The string tensor for tokenization`
408
*nbest_size: tensor(int64)* A scalar for sampling. nbest_size = {0,1}: no sampling is performed (default).
nbest_size > 1: samples from the nbest_size results. nbest_size < 0: assumes that
nbest_size is infinite and samples from all hypotheses (lattice) using the
forward-filtering-and-backward-sampling algorithm.
413
414	`*alpha: tensor(float)* A scalar for a smoothing parameter. Inverse temperature for probability rescaling.`
415
416	`*reverse: tensor(bool)* Reverses the tokenized sequence (Default = false)`
417
418	`*add_bos: tensor(bool)* Add beginning of sentence token to the result (Default = false)`
419
420	`*add_eos: tensor(bool)* Add end of sentence token to the result (Default = false).`
421	`When reverse=True beginning/end of sentence tokens are added after reversing.`
422
423	`#### Attributes`
424
425	`*model: string* The sentencepiece model serialized proto as stored as a string.`
426
427	`#### Outputs`
428
*tokens: tensor(int32)* Indices of each token, i.e. the tokenized result of the input.

*indices: tensor(int64)* Indices of every first token of input sentences.
`indices[i+1] - indices[i]` is the number of tokens in input `i`.
435
436	`#### Examples`
437
438
```python
import base64
import urllib.request

import numpy as np
import onnx

url = "https://github.com/microsoft/ort-customops/raw/main/test/data/test_sentencepiece_ops_model__6.txt"
with urllib.request.urlopen(url) as f:
    content = f.read()
# urlopen returns bytes, so the base64 payload can be decoded directly.
model = base64.decodebytes(content)

node = onnx.helper.make_node(
    'SentencepieceTokenizer',
    inputs=['inputs', 'nbest_size', 'alpha', 'add_bos', 'add_eos', 'reverse'],
    outputs=['tokens', 'indices'],
    model=model
)

inputs = np.array(["Hello world", "Hello world louder"], dtype=object)
nbest_size = np.array([0], dtype=np.int64)
alpha = np.array([0], dtype=np.float32)
add_bos = np.array([0], dtype=np.bool_)
add_eos = np.array([0], dtype=np.bool_)
reverse = np.array([0], dtype=np.bool_)

tokens = np.array([17486, 1017, 17486, 1017, 155, 21869], dtype=np.int32)
indices = np.array([0, 2, 6], dtype=np.int64)

expect(node, inputs=[inputs, nbest_size, alpha, add_bos, add_eos, reverse],
       outputs=[tokens, indices], name='sp')
```
468	`</details>`
469
470
471	`### BasicTokenizer`
472
473	`<details>`
474	`<summary>BasicTokenizer details</summary>`
475
476	`TODO: is this still supported?`
477
BasicTokenizer performs basic tokenization on the input string tensor, based on the [basic tokenizer in BertTokenizer (hugging face version)](https://huggingface.co/transformers/_modules/transformers/models/bert/tokenization_bert.html#BertTokenizer).
479
480	`#### Inputs`
481
482	`*text: tensor(string)* The string tensor for tokenization`
483
484	`#### Attributes`
485
486	`*do_lower_case: int64_t* (default is 1, 1 represents True, 0 represents False)`
487
488	`Whether or not to lowercase the input when tokenizing.`
489
490	`*tokenize_chinese_chars: int64_t* (default is 1, 1 represents True, 0 represents False)`
491
492	`Whether or not to tokenize Chinese characters.`
493
*strip_accents: int64_t* (default is 1, 1 represents True, 0 represents False)

Whether or not to strip all accents. If this option is not specified, it is determined by the value of `do_lower_case` (as in the original BERT).
498
499	`*tokenize_punctuation: int64_t* (default is 0, 1 represents True, 0 represents False)`
500
501	`Splits punctuation on a piece of text.`
502
503	`*remove_control_chars: int64_t* (default is 0, 1 represents True, 0 represents False)`
504
Removes control characters (such as NUL and BEL) from the text.
506
507	`#### Outputs`
508
509	`*tokens: tensor(string)* Tokenized tokens.`
510
511	`#### Examples`
512
```python
import numpy as np
import onnx
import transformers

tokenizer = transformers.BasicTokenizer()

node = onnx.helper.make_node(
    'BasicTokenizer',
    inputs=['text'],
    outputs=['tokens'],
)

text = "Hello world louder"
inputs = np.array([text], dtype=object)
# BasicTokenizer.tokenize returns a list of string tokens.
tokens = np.array(tokenizer.tokenize(text), dtype=object)

expect(node, inputs=[inputs],
       outputs=[tokens], name='test_basic_tokenizer')
```
530	`</details>`
531
532
533	`### BlingFireSentenceBreaker`
534
535	`TODO`
536
537	`### BpeTokenizer`
538
539	`TODO`
540
541
542	`## String operators`
543
544	`### StringEqual`
545
546	`<details>`
547	`<summary>StringEqual details</summary>`
548
549	`Compares two strings and returns true if they are equal and false if not.`
550
551	`#### Inputs`
552
*x: tensor(string)*

The first string input

*y: tensor(string)*

The second string input

#### Outputs

*z: tensor(boolean)*

A boolean tensor that is true where the two strings are equal and false otherwise.
566
567	`</details>`
568
569
570	`### StringHash`
571
572	`<details>`
573	`<summary>StringHash details</summary>`
574
575
576	`Hashes the input string based on the number of buckets`
577
578	`#### Inputs`
579
580	`*input: tensor(string)*`
581
582	`The string to hash`
583
584	`*num_buckets: tensor(int64)*`
585
586	`The number of buckets (must be equal to 1?)`
587
588	`#### Outputs`
589
590	`*name: tensor(int64)*`
591
592	`The hash value of the string`
593
594	`</details>`
595
596
597	`### StringHashFast`
598
599	`<details>`
600	`<summary>StringHashFast details</summary>`
601
602
603	`A faster implementation of StringHash.`
604
605	`</details>`
606
607
608	`### StringJoin`
609
610	`<details>`
611	`<summary>StringJoin details</summary>`
612
613
614	`Join an array of strings`
615
616	`#### Inputs`
617
618	`*input_X: tensor(string)*`
619
620	`The input array of strings`
621
622	`*input_sep: tensor(string)*`
623
The string separator used for the join

*input_axis: tensor(int64)*

The axis along which to join
629
630	`#### Outputs`
631
632	`*out: tensor(string)*`
633
634	`The resulting joined string`
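
The effect of `input_axis` can be sketched in plain Python for a 2-D input. This is an illustration only; the helper name `string_join_2d` is hypothetical and it handles nothing beyond axis 0 and 1 of a nested list.

```python
def string_join_2d(rows, sep, axis):
    """Join a 2-D list of strings along axis 0 or 1 (sketch)."""
    if axis == 1:
        return [sep.join(row) for row in rows]
    if axis == 0:
        # zip(*rows) iterates over the columns.
        return [sep.join(col) for col in zip(*rows)]
    raise ValueError("this sketch only handles axis 0 or 1")


input_X = [["a", "b", "c"], ["aa", "bb", ""]]
print(string_join_2d(input_X, ";", 1))  # ['a;b;c', 'aa;bb;']
print(string_join_2d(input_X, ";", 0))  # ['a;aa', 'b;bb', 'c;']
```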
635
636	`#### Examples`
637
638
```python
input_X = [["a", "b", "c"], ["aa", "bb", ""]]
input_sep = ";"
input_axis = 1

out = ["a;b;c", "aa;bb;"]

input_axis = 0

out = ['a;aa', 'b;bb', 'c;']
```

652	`</details>`
653
654
655	`### StringRegexReplace`
656
657	`<details>`
658	`<summary>StringRegexReplace details</summary>`
659
660
661	`String replacement based on [Re2-format](https://github.com/google/re2/wiki/Syntax) regular expressions.`
662
663	`#### Inputs`
664
665	`*text: tensor(string)*`
666
String tensor in which to apply the replacements.
668
669	`*pattern: tensor(string)*`
670
671	`Pattern of the regular expression.`
672
673	`*rewrite: tensor(string)*`
674
675	`Replacement.`
676
677	`#### Attributes`
678
679	`*global_replace: int64* (default is 1)`
680
If 1, replace all matches of the pattern; if 0, replace only the first one.
682
683	`#### Outputs`
684
685	`*output: tensor(string)*`
686
687	`String with replacements.`
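
The effect of `global_replace` can be approximated with Python's standard `re` module. Note that this is only a sketch: Python's regular expression syntax is not identical to RE2, and the helper name `regex_replace` is hypothetical.

```python
import re


def regex_replace(texts, pattern, rewrite, global_replace=1):
    """Apply a regex replacement to every string (sketch using Python's re)."""
    # count=0 replaces every match, count=1 only the first one.
    count = 0 if global_replace else 1
    return [re.sub(pattern, rewrite, t, count=count) for t in texts]


print(regex_replace(["a-b-c"], r"-", "+"))                    # ['a+b+c']
print(regex_replace(["a-b-c"], r"-", "+", global_replace=0))  # ['a+b-c']
```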
688
689	`#### Examples`
690
```python
node = onnx.helper.make_node(
    'StringRegexReplace',
    inputs=['text', 'pattern', 'rewrite'],
    outputs=['y'],
)

text = np.array([['def myfunc():'], ['def dummy():']])
pattern = np.array([r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):'])
rewrite = np.array([r'static PyObject* py_\1(void) {'])
y = [['static PyObject* py_myfunc(void) {'],
     ['static PyObject* py_dummy(void) {']]

expect(node, inputs=[text, pattern, rewrite], outputs=[y],
       name='test_string_regex_replace')
```
708
709	`</details>`
710
711	`### StringECMARegexReplace`
712
713	`<details>`
714	`<summary>StringECMARegexReplace details</summary>`
715
716	`String replacement based on [ECMA-format](https://en.cppreference.com/w/cpp/regex/ecmascript) regular expressions.`
717
718	`#### Inputs`
719
720	`*text: tensor(string)*`
721
String tensor in which to apply the replacements.
723
724	`*pattern: tensor(string)*`
725
726	`Pattern of the regular expression.`
727
728	`*rewrite: tensor(string)*`
729
730	`Replacement.`
731
732	`#### Attributes`
733
734	`*global_replace: int64* (default is 1)`
735
If 1, replace all matches of the pattern; if 0, replace only the first one.


*ignore_case: int64* (default is 0)

Whether to ignore case when matching the pattern.
742
743	`#### Outputs`
744
745	`*output: tensor(string)*`
746
747	`String with replacements.`
748
749	`#### Examples`
750
751
```python
node = onnx.helper.make_node(
    'StringECMARegexReplace',
    inputs=['text', 'pattern', 'rewrite'],
    outputs=['y'],
)

text = np.array([['def myfunc():'], ['def dummy():']])
pattern = np.array([r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):'])
# ECMAScript syntax refers to the first capture group as $1.
rewrite = np.array([r'static PyObject* py_$1(void) {'])
y = [['static PyObject* py_myfunc(void) {'],
     ['static PyObject* py_dummy(void) {']]

expect(node, inputs=[text, pattern, rewrite], outputs=[y],
       name='test_string_ecma_regex_replace')
```
769
770	`</details>`
771
772
773
774	`### StringSplit`
775
776	`TODO`
777
778	`### StringUpper`
779
780	`TODO`
781
782	`### StringLower`
783
784	`TODO`
785
786	`### StringLength`
787
788	`<details>`
<summary>StringLength details</summary>

Gets the length of each string element in the input tensor. Similar to the function `len("abcde")` in Python.

#### Inputs

*data: tensor(string)*

String tensor whose element lengths are computed.
798
799	`#### Outputs`
800
801	`*output: tensor(int64)*`
802
803	`Data length tensor.`
804
805	`#### Examples`
806
807
```python
node = onnx.helper.make_node(
    'StringLength',
    inputs=['x'],
    outputs=['y']
)

x = np.array(["abcdef", "hijkl"])
y = np.array([len(x[0]), len(x[1])], dtype=np.int64)

expect(node, inputs=[x], outputs=[y],
       name='test_string_length')
```
823	`</details>`
824
825	`### StringConcat`
826
827	`<details>`
828	`<summary>StringConcat details</summary>`
829
Concatenates the corresponding strings of two string tensors. The two input tensors must have the same dimensions.

```python
output = []
shape = input1.shape
input1 = input1.flatten()
input2 = input2.flatten()
for i in range(len(input1)):
    output.append(input1[i] + input2[i])
output = np.array(output).reshape(shape)
```
841
842	`#### Inputs`
843
844	`*input_1: tensor(string)*`
845
846	`The first string tensor.`
847
848	`*input_2: tensor(string)*`
849
850	`The second string tensor.`
851
852
853	`#### Outputs`
854
855	`*output: tensor(string)*`
856
857	`The result.`
858
859	`#### Examples`
860
861
```python
node = onnx.helper.make_node(
    'StringConcat',
    inputs=['x', 'y'],
    outputs=['result'],
)

x = np.array(["abcd", "efgh"])
y = np.array(["wxyz", "stuv"])
result = np.array([x[0] + y[0], x[1] + y[1]])

expect(node, inputs=[x, y], outputs=[result],
       name='test_string_concat')
```
877
878	`</details>`
879
880	`### StringRegexSplitWithOffsets`
881
882	`<details>`
883	`<summary>StringRegexSplitWithOffsets details</summary>`
884
885	`Splits string based on regular expressions.`
886
887	`#### Inputs`
888
889	`*text: tensor(string)*`
890
891	`String tensor to extract slices from.`
892
893	`*delim_regex_pattern: tensor(string)*`
894
Splitting pattern of the regular expression.
896
897	`*keep_delim_regex_pattern: tensor(string)*`
898
899	`By default, delimiters are not included in the split string results. Delimiters may be included by specifying a regex pattern keep_delim_regex_pattern.`
900
901	`#### Outputs`
902
903	`*words: tensor(string)* Tensor of words.`
904
905	`*offsets: tensor(int64)* 2D tensor with 3 columns:`
906	`sentence index, position of the first character, position of the last one (excluded)`
907
908	`*row_indices: tensor(int64)* Indices of every first token of input sentences.`
909	`row_indices[i+1] - row_indices[i]` is the number of tokens in input `i`.
910	`These are updates row indices given as inputs or new ones if the second input is empty.`
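
A rough approximation of how `words` and `offsets` are produced, written with Python's `re` module. This is a sketch under assumptions: the helper name `regex_split_with_offsets` is hypothetical, offsets are counted in Python characters (the operator may count bytes), and Python's regex syntax is not identical to the one used by the operator.

```python
import re


def regex_split_with_offsets(texts, delim_pattern, keep_delim_pattern=""):
    """Split strings on a delimiter regex, recording offsets (sketch)."""
    words, offsets = [], []
    for row, text in enumerate(texts):
        pos = 0
        for m in re.finditer(delim_pattern, text):
            if m.end() == m.start():
                continue  # ignore empty matches
            if m.start() > pos:
                words.append(text[pos:m.start()])
                offsets.append([row, pos, m.start()])
            # Keep the delimiter only if it matches keep_delim_pattern.
            if keep_delim_pattern and re.fullmatch(keep_delim_pattern, m.group()):
                words.append(m.group())
                offsets.append([row, m.start(), m.end()])
            pos = m.end()
        if pos < len(text):
            words.append(text[pos:])
            offsets.append([row, pos, len(text)])
    return words, offsets


words, offsets = regex_split_with_offsets(["hello there"], r"\s", r"\s")
print(words)    # ['hello', ' ', 'there']
print(offsets)  # [[0, 0, 5], [0, 5, 6], [0, 6, 11]]
```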
911
912
913	`#### Examples`
914
915
```python
node = onnx.helper.make_node(
    'StringRegexSplitWithOffsets',
    inputs=['text', 'pattern', 'keep_pattern'],
    outputs=['y', 'begin_end', 'indices'],
)

text = np.array(["hello there"])
pattern = np.array([r'\s'])
keep_pattern = np.array([r'\s'])
y = np.array(["hello", " ", "there"])
z1 = np.array([[0, 0, 5],
               [0, 5, 6],
               [0, 6, 11]], dtype=np.int64)
# One input sentence containing three tokens.
z2 = np.array([0, 3], dtype=np.int64)

expect(node, inputs=[text, pattern, keep_pattern], outputs=[y, z1, z2],
       name='test_string_regex_split_with_offsets')
```
936
937	`</details>`
938
939
940	`### StringECMARegexSplitWithOffsets`
941
942	`TODO`
943
944	`### VectorToString`
945
946	`<details>`
947	`<summary>VectorToString details</summary>`
948
VectorToString is the inverse operation of `StringToVector`; they share the same mapping table format:

`<string>\t<scalar_1>\s<scalar_2>\s<scalar_3>...<scalar_n>`

Unmapped vectors output the value of the attribute `unk`.
954
955	`Example:`
956
957	`Attributes:`
958
959	- `map`:
960	```
961	`a 0 0 1 2`
962	`b 0 1 2 3`
963	`d 0 1 3 4`
964	```
965
966	- `unk`: "unknown_word"
967
968	`Inputs:`
969	`- data: [[0,0,1,2],[0,1,3,4],[0,0,0,0]]`
970
Outputs:
972	`- output: ["a", "d", "unknown_word" ]`
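
The lookup described above can be sketched in plain Python. This is an illustration only: the helper names are hypothetical, the scalars are assumed to be integers, and each line is assumed to use a tab between the string and its space-separated scalars.

```python
def parse_mapping_table(table):
    """Parse tab-separated mapping lines into a vector-to-string dict."""
    mapping = {}
    for line in table.strip().splitlines():
        key, values = line.split("\t")
        mapping[tuple(int(v) for v in values.split())] = key
    return mapping


def vector_to_string(vectors, table, unk):
    """Map each vector to its string; unknown vectors map to unk."""
    mapping = parse_mapping_table(table)
    return [mapping.get(tuple(v), unk) for v in vectors]


table = "a\t0 0 1 2\nb\t0 1 2 3\nd\t0 1 3 4"
result = vector_to_string([[0, 0, 1, 2], [0, 1, 3, 4], [0, 0, 0, 0]],
                          table, "unknown_word")
print(result)  # ['a', 'd', 'unknown_word']
```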
973
974	`#### Attributes`
975
*map*

The mapping table, in the format described above.

*unk*

The result returned when a vector is not found in the map.
983
984	`#### Inputs`
985
986	`*data: tensor(T)*`
987
988	`Input tensor`
989
990	`#### Outputs`
991
992	`*output: tensor(string)*`
993
994	`The mapping result of the input`
995
996	`#### Type Constraints`
997	`*T:tensor(uint8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(int8), tensor(int16), tensor(int32), tensor(int64), tensor(bfloat16), tensor(float16), tensor(float), tensor(double), tensor(bool)*`
998
999	`Constrain input and output types to numerical tensors.`
1000
1001
1002	`#### Examples`
1003
1004
```python
mapping_table = \
"""
a 0 0 1 2
b 0 1 2 3
d 0 1 3 4
"""

node = onnx.helper.make_node(
    'VectorToString',
    inputs=['x'],
    outputs=['y'],
    map=mapping_table,
    unk="unknown_word"
)

x = np.array([[0, 0, 1, 2], [0, 1, 3, 4], [0, 0, 0, 0]], dtype=np.int64)
y = ["a", "d", "unknown_word"]

expect(node, inputs=[x], outputs=[y],
       name='test_vector_to_string')
```
1029	`</details>`
1030
1031
1032	`### StringToVector`
1033
1034	`<details>`
1035	`<summary>StringToVector details</summary>`
1036
StringToVector maps each string element in the input to the corresponding vector according to the mapping file. The mapping file is a UTF-8 encoded text file in TSV format:
1038
1039	`<string>\t<scalar_1>\s<scalar_2>\s<scalar_3>...<scalar_n>`
1040
1041	Unmapped string will output the value of the attribute `unmapping_value`.
1042
1043	`Example:`
1044
1045	`Attributes:`
1046
1047	- `mapping_file_name`: vocabulary.txt
1048	```
1049	`a 0 0 1 2`
1050	`b 0 1 2 3`
1051	`d 0 1 3 4`
1052	```
1053
1054	- `unmapping_value`: [0 0 0 0]
1055
1056	`Inputs:`
1057	`- data: ["a", "d", "e"]`
1058
Outputs:
1060	`- output: [[0,0,1,2],[0,1,3,4],[0,0,0,0]]`
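
The same example can be reproduced with a small plain-Python sketch. The helper name `string_to_vector` is hypothetical, the scalars are assumed to be integers, and each line is assumed to use a tab between the string and its space-separated scalars.

```python
def string_to_vector(strings, table, unmapping_value):
    """Map each string to its vector; unknown strings get unmapping_value."""
    mapping = {}
    for line in table.strip().splitlines():
        key, values = line.split("\t")
        mapping[key] = [int(v) for v in values.split()]
    return [mapping.get(s, list(unmapping_value)) for s in strings]


table = "a\t0 0 1 2\nb\t0 1 2 3\nd\t0 1 3 4"
vectors = string_to_vector(["a", "d", "e"], table, [0, 0, 0, 0])
print(vectors)  # [[0, 0, 1, 2], [0, 1, 3, 4], [0, 0, 0, 0]]
```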
1061
1062	`#### Attributes`
1063
1064	`*mapping_file_name:string*`
1065
1066	`The name of your string to vector mapping file.`
1067
1068	`*unmapping_value:list(int)*`
1069
1070	`Mapping result for unmapped string`
1071
1072	`#### Inputs`
1073
1074	`*data: tensor(string)*`
1075
1076	`Input tensor`
1077
1078	`#### Outputs`
1079
1080	`*output: tensor(T)*`
1081
1082	`The mapping result of the input`
1083
1084	`#### Type Constraints`
1085	`*T:tensor(uint8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(int8), tensor(int16), tensor(int32), tensor(int64), tensor(bfloat16), tensor(float16), tensor(float), tensor(double), tensor(bool)*`
1086
1087	`Constrain input and output types to numerical tensors.`
1088
1089	`#### Examples`
1090
1091
```python
# what's in vocabulary.txt

mapping_table = \
"""
a 0 0 1 2
b 0 1 2 3
d 0 1 3 4
"""

node = onnx.helper.make_node(
    'StringToVector',
    inputs=['x'],
    outputs=['y'],
    mapping_table=mapping_table,
    unmapping_value=[0, 0, 0, 0]
)

x = ["a", "d", "e"]
y = np.array([[0, 0, 1, 2], [0, 1, 3, 4], [0, 0, 0, 0]], dtype=np.int64)

expect(node, inputs=[x], outputs=[y],
       name='test_string_to_vector')
```
1118
1119	`</details>`
1120
1121
1122
1123	`### StringSlice`
1124
1125	`<details>`
1126	`<summary>StringSlice details</summary>`
1127
Applies a slice operation to each string element in the input tensor, similar to string slicing in Python:
1129
1130	```python
1131	`a = "abcdef"`
1132	`b = a[1:2]`
1133	`c = a[3:1:-1]`
1134	```
1135
1136	`#### Inputs`
1137
1138	`*data: tensor(string)*`
1139
1140	`String tensor to extract slices from.`
1141
1142	`*starts: tensor(int64/int32)*`
1143
1144	`The tensor of starting indices of corresponding string in data, which has same dimension of data.`
1145
1146	`*ends: tensor(int64/int32)*`
1147
1148	`The tensor of ending indices of corresponding string in data, which has same dimension of data.`
1149
1150	`*steps(optional): tensor(int64/int32)*`
1151
The tensor of slice steps for the corresponding strings in data; it has the same dimensions as data. If steps is an empty tensor, the default value 1 is used for each string.
1153
1154	`#### Outputs`
1155
1156	`*output: tensor(string)*`
1157
1158	`Sliced data tensor.`
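
For a 1-D input, the per-element slicing can be sketched in plain Python as follows. The helper name `string_slice` is hypothetical and the sketch ignores multi-dimensional inputs.

```python
def string_slice(data, starts, ends, steps=None):
    """Slice each string with its own start, end and step (sketch)."""
    if not steps:
        steps = [1] * len(data)  # empty steps means a step of 1 everywhere
    return [s[b:e:k] for s, b, e, k in zip(data, starts, ends, steps)]


print(string_slice(["abcdef", "hijkl"], [1, 3], [3, 1], [1, -1]))  # ['bc', 'kj']
```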
1159
1160	`#### Examples`
1161
1162
```python
node = onnx.helper.make_node(
    'StringSlice',
    inputs=['x', 'starts', 'ends', 'steps'],
    outputs=['y'],
)

x = np.array(["abcdef", "hijkl"])
y = np.array([x[0][1:3:1], x[1][3:1:-1]])
starts = np.array([1, 3], dtype=np.int64)
ends = np.array([3, 1], dtype=np.int64)
# The second string is sliced backwards, so its step is -1.
steps = np.array([1, -1], dtype=np.int64)

expect(node, inputs=[x, starts, ends, steps], outputs=[y],
       name='test_string_slice')
```
1181
1182	`</details>`
1183
1184
1185	`### MaskedFill`
1186
1187	`<details>`
1188	`<summary>MaskedFill details</summary>`
1189
1190
Fills elements of self tensor with value where mask is True. The operator is similar to [`Tensor.masked_fill_`](https://pytorch.org/docs/stable/generated/torch.Tensor.masked_fill_.html#torch.Tensor.masked_fill_) in PyTorch.
1192
1193
1194	`#### Inputs`
1195
1196	`*value: tensor(string)*`
1197
1198	`The value to fill in with, currently we only support string type and vector&scalar dimension.`
1199
1200	`*mask: tensor(bool)*`
1201
1202	`The boolean mask, the dimension of mask tensor should be same with value.`
1203
1204	`#### Outputs`
1205
1206	`*output: tensor(string)*`
1207
1208	`The filled output of input tensor.`
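
Going by the example in this section, the output keeps only the elements of `value` whose mask entry is true. A minimal plain-Python sketch of that selection follows; the helper name `masked_select` is hypothetical, and this is an interpretation of the example rather than the operator's code.

```python
def masked_select(value, mask):
    """Keep the elements of value whose mask entry is true (sketch)."""
    return [v for v, m in zip(value, mask) if m]


selected = masked_select(["a", "b", "c", "d"], [True, False, True, False])
print(selected)  # ['a', 'c']
```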
1209
1210
1211	`#### Examples`
1212
1213
```python
node = onnx.helper.make_node(
    'MaskedFill',
    inputs=['value', 'mask'],
    outputs=['output']
)

value = np.array(["a", "b", "c", "d"])
mask = np.array([True, False, True, False], dtype=bool)
output = np.array(["a", "c"])

expect(node, inputs=[value, mask], outputs=[output],
       name='test_masked_fill')
```
1231	`</details>`
1232
1233	`### StringRaggedTensorToDense`
1234
1235	`TODO`
1236
1237	`### StringMapping`
1238
1239	`TODO`
1240
1241	`## Math operators`
1242
1243
1244	`### Inverse`
1245
1246	`TODO`
1247
1248	`### NegPos`
1249
1250	`TODO`
1251
1252	`### SegmentExtraction`
1253
1254	`TODO`
1255
1256	`### SegmentSum`
1257
1258	`TODO`
1259
1260	`## Tensor operators`
1261
1262	`### RaggedTensorToSparse`
1263
1264	`TODO`
1265
1266	`### RaggedTensorToDense`
1267
1268	`TODO`
1269
1270	`### Template`
1271
1272	`<details>`
1273	`<summary>Template details</summary>`
1274
1275	`Description`
1276
1277	`#### Inputs`
1278
1279	`*name: tensor(type)*`
1280
1281	`Description`
1282
1283	`#### Outputs`
1284
1285	`*name: tensor(type)*`
1286
1287	`Description`
1288
1289	`#### Examples`
1290
1291
```python
node = onnx.helper.make_node(
    'StringRegexReplace',
    inputs=['text', 'pattern', 'rewrite'],
    outputs=['y'],
)

text = np.array([['def myfunc():'], ['def dummy():']])
pattern = np.array([r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):'])
rewrite = np.array([r'static PyObject* py_\1(void) {'])
y = [['static PyObject* py_myfunc(void) {'],
     ['static PyObject* py_dummy(void) {']]

expect(node, inputs=[text, pattern, rewrite], outputs=[y],
       name='test_string_regex_replace')
```
1309
1310	`</details>`