## Operator Schemas

### Auxiliary String Operator

|**Operator**|**Support State**|
|------------|-----------------|
|StringEqual | Supported |
|StringHash | Supported |
|StringToHashBucketFast|Supported|
|StringJoin | Supported |
|StringRegexReplace| Supported |
|StringRegexSplit| Supported |
|StringSplit | Supported |
|StringUpper | Supported |
|StringLength | Supported |
|StringConcat | Supported |
|StringRegexSplitWithOffsets| Supported |
|VectorToString| Supported |
|StringToVector| Supported|
|StringSlice | Under development|

### Tokenizer

|**Operator**|**Support State**|
|------------|-----------------|
|GPT2Tokenizer| Supported |
|WordpieceTokenizer| Supported |
|XLNetTokenizer| Under development |
|SentencepieceTokenizer| Supported |

## Auxiliary String Operator

[TODO: Add existing operators]

### <a name="StringRegexReplace"></a><a name="StringRegexReplace">**StringRegexReplace**</a>

String replacement based on regular expressions.

#### Inputs

***text: tensor(string)***

String tensor in which to make the replacements.

***pattern: tensor(string)***

Pattern of the regular expression.

***rewrite: tensor(string)***

Replacement string. It may refer to capture groups of the pattern, for example `\1`.

#### Attributes

***global_replace: int64*** (default is 1)

If 1, replace every match of the pattern; if 0, replace only the first match.

#### Outputs

***output: tensor(string)***

String tensor with the replacements applied.

#### Examples

<details>
<summary>StringRegexReplace</summary>

```python
import numpy as np
import onnx

node = onnx.helper.make_node(
    'StringRegexReplace',
    inputs=['text', 'pattern', 'rewrite'],
    outputs=['y'],
)

text = np.array([['def myfunc():'], ['def dummy():']])
pattern = np.array([r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):'])
rewrite = np.array([r'static PyObject* py_\1(void) {'])
y = [['static PyObject* py_myfunc(void) {'],
     ['static PyObject* py_dummy(void) {']]

expect(node, inputs=[text, pattern, rewrite], outputs=[y],
       name='test_string_regex_replace')
```

</details>
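
The behaviour described above can be approximated in plain Python. The sketch below is only an illustration, not the operator's implementation: the helper name `string_regex_replace` is hypothetical, and it assumes that Python's `re` syntax agrees with the operator's regular expression engine for the patterns used here.

```python
import re

import numpy as np


def string_regex_replace(text, pattern, rewrite, global_replace=1):
    """Hypothetical reference helper: apply one pattern/rewrite pair to every element."""
    regex = re.compile(pattern[0])
    # re.sub replaces every match when count == 0 and only the first when count == 1.
    count = 0 if global_replace else 1
    flat = [regex.sub(rewrite[0], s, count=count) for s in text.ravel().tolist()]
    return np.array(flat).reshape(text.shape)


text = np.array([['def myfunc():'], ['def dummy():']])
pattern = np.array([r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):'])
rewrite = np.array([r'static PyObject* py_\1(void) {'])

result = string_regex_replace(text, pattern, rewrite)
# result.tolist() == [['static PyObject* py_myfunc(void) {'],
#                     ['static PyObject* py_dummy(void) {']]

first_only = string_regex_replace(np.array(['a a']), np.array(['a']),
                                  np.array(['b']), global_replace=0)
# first_only.tolist() == ['b a']
```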

### <a name="StringRegexSplit"></a><a name="StringRegexSplit">**StringRegexSplit**</a>

Splits strings based on regular expressions.

#### Inputs

***text: tensor(string)***

String tensor to split.

***delim_regex_pattern: tensor(string)***

Splitting pattern of the regular expression.

***keep_delim_regex_pattern: tensor(string)***

By default, delimiters are not included in the split string results. Delimiters may be included by specifying a regex pattern `keep_delim_regex_pattern`.

#### Outputs

***words: tensor(string)*** Tensor of words.

***offsets: tensor(int64)*** 2D tensor with 3 columns:
sentence index, position of the first character, position of the last one (excluded).

***row_indices: tensor(int64)*** Indices of every first token of input sentences.
`row_indices[i+1] - row_indices[i]` is the number of tokens in input `i`.

#### Examples

<details>
<summary>StringRegexSplit</summary>

```python
import numpy as np
import onnx

node = onnx.helper.make_node(
    'StringRegexSplit',
    inputs=['text', 'pattern', 'keep_pattern'],
    outputs=['y', 'begin_end', 'indices'],
)

text = np.array(["hello there"])
pattern = np.array([r'\s'])
# Keep the whitespace delimiter as a token of its own.
keep_pattern = np.array([r'\s'])
y = np.array(["hello", " ", "there"])
z1 = np.array([[0, 0, 5],
               [0, 5, 6],
               [0, 6, 11]], dtype=np.int64)
# One input sentence holding three tokens.
z2 = np.array([0, 3], dtype=np.int64)

expect(node, inputs=[text, pattern, keep_pattern], outputs=[y, z1, z2],
       name='test_string_regex_split')
```

</details>
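
To make the three outputs concrete, here is a minimal pure-Python sketch of the splitting logic. It is an illustration under assumptions, not the operator's code: the helper name `regex_split_with_offsets` is hypothetical, and it assumes that a delimiter is kept as a token when it fully matches the keep pattern.

```python
import re

import numpy as np


def regex_split_with_offsets(text, delim_pattern, keep_pattern=""):
    """Hypothetical helper returning (words, offsets, row_indices)."""
    delim = re.compile(delim_pattern)
    keep = re.compile(keep_pattern) if keep_pattern else None
    words, offsets, row_indices = [], [], [0]
    for i, sentence in enumerate(text):
        pos = 0
        for m in delim.finditer(sentence):
            if m.start() > pos:
                # Text between two delimiters becomes a word.
                words.append(sentence[pos:m.start()])
                offsets.append([i, pos, m.start()])
            if keep is not None and m.end() > m.start() and keep.fullmatch(m.group()):
                # Assumption: a delimiter matching the keep pattern is emitted as a token.
                words.append(m.group())
                offsets.append([i, m.start(), m.end()])
            pos = m.end()
        if pos < len(sentence):
            words.append(sentence[pos:])
            offsets.append([i, pos, len(sentence)])
        # row_indices[i + 1] - row_indices[i] is the number of tokens of sentence i.
        row_indices.append(len(words))
    return (np.array(words, dtype=object),
            np.array(offsets, dtype=np.int64).reshape(-1, 3),
            np.array(row_indices, dtype=np.int64))


words, offsets, row_indices = regex_split_with_offsets(["hello there"], r"\s", r"\s")
```

Passing an empty keep pattern drops the delimiters, so the same sentence then yields only the two words `hello` and `there`.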

### <a name="StringConcat"></a><a name="StringConcat">**StringConcat**</a>

Concatenates the two input string tensors element by element. Both input tensors must have the same shape.

```python
import numpy as np

def string_concat(input_1, input_2):
    output = []
    shape = input_1.shape
    input_1 = input_1.flatten()
    input_2 = input_2.flatten()
    for i in range(len(input_1)):
        output.append(input_1[i] + input_2[i])
    return np.array(output).reshape(shape)
```

#### Inputs

***input_1: tensor(string)***

The first string tensor.

***input_2: tensor(string)***

The second string tensor.

#### Outputs

***output: tensor(string)***

The result.

#### Examples

<details>
<summary>StringConcat</summary>

```python
import numpy as np
import onnx

node = onnx.helper.make_node(
    'StringConcat',
    inputs=['x', 'y'],
    outputs=['result'],
)

x = np.array(["abcd", "efgh"])
y = np.array(["wxyz", "stuv"])
result = np.array([x[0] + y[0], x[1] + y[1]])

expect(node, inputs=[x, y], outputs=[result],
       name='test_string_concat')
```

</details>

### <a name="StringSlice"></a><a name="StringSlice">**StringSlice**</a>

Slices each string element of the input tensor, in the same way as Python string slicing:

```python
a = "abcdef"
b = a[1:2]     # "b"
c = a[3:1:-1]  # "dc"
```

#### Inputs

***data: tensor(string)***

String tensor to extract slices from.

***starts: tensor(int64/int32)***

The start index for each string in `data`. It has the same shape as `data`.

***ends: tensor(int64/int32)***

The end index for each string in `data`. It has the same shape as `data`.

***steps(optional): tensor(int64/int32)***

The slice step for each string in `data`. It has the same shape as `data`. If `steps` is an empty tensor, the default value 1 is used for every string.

#### Outputs

***output: tensor(string)***

Sliced data tensor.

#### Examples

<details>
<summary>string_slice</summary>

```python
import numpy as np
import onnx

node = onnx.helper.make_node(
    'StringSlice',
    inputs=['x', 'starts', 'ends', 'steps'],
    outputs=['y'],
)

x = np.array(["abcdef", "hijkl"])
y = np.array([x[0][1:3:1], x[1][3:1:-1]])
starts = np.array([1, 3], dtype=np.int64)
ends = np.array([3, 1], dtype=np.int64)
# The steps match the slices used to build y above.
steps = np.array([1, -1], dtype=np.int64)

expect(node, inputs=[x, starts, ends, steps], outputs=[y],
       name='test_string_slice')
```

</details>
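
The per-element semantics can be sketched with NumPy and ordinary Python slicing. This is an illustration only, since the operator is still under development: the helper name `string_slice` is hypothetical, and treating a missing `steps` argument like an empty tensor is an assumption.

```python
import numpy as np


def string_slice(data, starts, ends, steps=None):
    """Hypothetical helper: slice every string with its own start, end and step."""
    if steps is None or len(steps) == 0:
        # An empty steps tensor means a step of 1 for every string.
        steps = np.ones(data.shape, dtype=np.int64)
    flat = [
        s[b:e:st]
        for s, b, e, st in zip(data.ravel().tolist(), starts.ravel().tolist(),
                               ends.ravel().tolist(), steps.ravel().tolist())
    ]
    return np.array(flat).reshape(data.shape)


x = np.array(["abcdef", "hijkl"])
y = string_slice(x, np.array([1, 3]), np.array([3, 1]), np.array([1, -1]))
# y.tolist() == ['bc', 'kj']
```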

### <a name="StringLength"></a><a name="StringLength">**StringLength**</a>

Gets the length of each string element in the input tensor, similar to the function `len("abcde")` in Python.

#### Inputs

***data: tensor(string)***

String tensor whose elements are measured.

#### Outputs

***output: tensor(int64)***

Data length tensor.

#### Examples

<details>
<summary>string_length</summary>

```python
import numpy as np
import onnx

node = onnx.helper.make_node(
    'StringLength',
    inputs=['x'],
    outputs=['y']
)

x = ["abcdef", "hijkl"]
y = np.array([len(x[0]), len(x[1])], dtype=np.int64)

expect(node, inputs=[x], outputs=[y],
       name='test_string_length')
```

</details>

### <a name="StringToVector"></a><a name="StringToVector">**StringToVector**</a>

StringToVector maps each string element in the input to the corresponding vector according to the mapping file. The mapping file is a UTF-8 encoded text file in TSV format:

    <string>\t<scalar_1>\s<scalar_2>\s<scalar_3>...<scalar_n>

An unmapped string produces the value of the attribute `unmapping_value`.

Example:

*Attributes:*

- `mapping_file_name`: vocabulary.txt
  ```
  a 0 0 1 2
  b 0 1 2 3
  d 0 1 3 4
  ```

- `unmapping_value`: [0 0 0 0]

*Inputs:*
- data: ["a", "d", "e"]

*Outputs:*
- output: [[0,0,1,2],[0,1,3,4],[0,0,0,0]]

#### Attributes

***mapping_file_name:string***

The name of your string to vector mapping file.

***unmapping_value:list(int)***

The mapping result for an unmapped string.

#### Inputs

***data: tensor(string)***

Input tensor.

#### Outputs

***output: tensor(T)***

The mapping result of the input.

#### Type Constraints
***T:tensor(uint8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(int8), tensor(int16), tensor(int32), tensor(int64), tensor(bfloat16), tensor(float16), tensor(float), tensor(double), tensor(bool)***

Constrain the output type to numerical tensors.

#### Examples

<details>
<summary>string_to_vector</summary>

```python
import numpy as np
import onnx

# what's in vocabulary.txt

mapping_table = \
"""
a 0 0 1 2
b 0 1 2 3
d 0 1 3 4
"""

node = onnx.helper.make_node(
    'StringToVector',
    inputs=['x'],
    outputs=['y'],
    mapping_table=mapping_table,
    unmapping_value=[0,0,0,0]
)


x = ["a", "d", "e"]
y = np.array([[0,0,1,2],[0,1,3,4],[0,0,0,0]], dtype=np.int64)


expect(node, inputs=[x], outputs=[y],
       name='test_string_to_vector')
```

</details>
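
As a rough illustration of the lookup, the sketch below parses a table in the TSV format given above and maps strings to vectors. The helper names `parse_mapping_table` and `string_to_vector` are hypothetical, and the table is written with an explicit `\t` between the key and its space-separated scalars, as that format requires.

```python
import numpy as np


def parse_mapping_table(table):
    """Hypothetical parser for lines of the form '<string>\\t<scalar_1> <scalar_2> ...'."""
    mapping = {}
    for line in table.strip().splitlines():
        key, values = line.split("\t")
        mapping[key] = [int(v) for v in values.split(" ")]
    return mapping


def string_to_vector(data, mapping, unmapping_value):
    # Strings missing from the table fall back to unmapping_value.
    return np.array([mapping.get(s, unmapping_value) for s in data], dtype=np.int64)


mapping_table = "a\t0 0 1 2\nb\t0 1 2 3\nd\t0 1 3 4\n"
mapping = parse_mapping_table(mapping_table)
output = string_to_vector(["a", "d", "e"], mapping, [0, 0, 0, 0])
# output.tolist() == [[0, 0, 1, 2], [0, 1, 3, 4], [0, 0, 0, 0]]
```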

### <a name="VectorToString"></a><a name="VectorToString">**VectorToString**</a>

VectorToString is the inverse operation of `StringToVector`; both share the same mapping table format:

    <string>\t<scalar_1>\s<scalar_2>\s<scalar_3>...<scalar_n>

An unmapped vector produces the value of the attribute `unk`.

Example:

*Attributes:*

- `map`:
  ```
  a 0 0 1 2
  b 0 1 2 3
  d 0 1 3 4
  ```

- `unk`: "unknown_word"

*Inputs:*
- data: [[0,0,1,2],[0,1,3,4],[0,0,0,0]]

*Outputs:*
- output: ["a", "d", "unknown_word" ]

#### Attributes

***map***

The mapping table, in the format described above.

***unk***

The result returned when a vector is not found in the map.

#### Inputs

***data: tensor(T)***

Input tensor.

#### Outputs

***output: tensor(string)***

The mapping result of the input.

#### Type Constraints
***T:tensor(uint8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(int8), tensor(int16), tensor(int32), tensor(int64), tensor(bfloat16), tensor(float16), tensor(float), tensor(double), tensor(bool)***

Constrain the input type to numerical tensors.

#### Examples

<details>
<summary>vector_to_string</summary>

```python
import numpy as np
import onnx

mapping_table = \
  """
  a 0 0 1 2
  b 0 1 2 3
  d 0 1 3 4
  """

node = onnx.helper.make_node(
    'VectorToString',
    inputs=['x'],
    outputs=['y'],
    map=mapping_table,
    unk="unknown_word"
)


x = np.array([[0,0,1,2],[0,1,3,4],[0,0,0,0]], dtype=np.int64)
y = ["a", "d", "unknown_word"]


expect(node, inputs=[x], outputs=[y],
       name='test_vector_to_string')
```

</details>
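
The reverse lookup can be sketched in the same spirit. The helper name `vector_to_string` is hypothetical and the code only illustrates the mapping described above; it assumes a tab between the string and its scalars, and uses each row of the input as a dictionary key.

```python
import numpy as np


def vector_to_string(data, table, unk):
    """Hypothetical helper: map every row of `data` back to its string."""
    reverse = {}
    for line in table.strip().splitlines():
        key, values = line.strip().split("\t")
        # Tuples are hashable, so a whole vector can serve as the lookup key.
        reverse[tuple(int(v) for v in values.split())] = key
    return [reverse.get(tuple(row), unk) for row in np.asarray(data).tolist()]


mapping_table = "a\t0 0 1 2\nb\t0 1 2 3\nd\t0 1 3 4\n"
x = np.array([[0, 0, 1, 2], [0, 1, 3, 4], [0, 0, 0, 0]], dtype=np.int64)
y = vector_to_string(x, mapping_table, "unknown_word")
# y == ['a', 'd', 'unknown_word']
```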

## Tokenizer

### <a name="GPT2Tokenizer"></a><a name="GPT2Tokenizer">**GPT2Tokenizer**</a>

GPT2Tokenizer performs byte-level BPE tokenization on the input tensor, based on the [Hugging Face version](https://huggingface.co/transformers/_modules/transformers/tokenization_gpt2.html).

#### Attributes

***vocab***

The **content** of the vocabulary file. Its format is the same as that of [Hugging Face](https://huggingface.co/gpt2/resolve/main/vocab.json).

***merges***

The **content** of the merges file. Its format is the same as that of [Hugging Face](https://huggingface.co/gpt2/resolve/main/merges.txt).

***padding_length(optional)***

When the input is a batch of queries, the tokenized result is a ragged tensor, so it must be padded into a rectangular tensor; `padding_length` selects the padding strategy. When `padding_length` is -1, every row is padded to the length of the longest row. When `padding_length` is greater than 0, every row is padded to `padding_length`.

The default value of `padding_length` is -1.

#### Inputs

***data: tensor(string)***

The string tensor for tokenization.

#### Outputs

***input_ids: tensor(int64)***

The tokenized ids of the input.

***attention_mask: tensor(int64)***

A tensor that indicates which positions of `input_ids` are padding.

#### Examples

<details>
<summary>gpt2tokenizer</summary>

```python
import numpy as np
import onnx

def get_file_content(path):
    with open(path, "rb") as file:
        return file.read()

# vocabulary_file and merges_file are paths to local copies of
# vocab.json and merges.txt.
node = onnx.helper.make_node(
    'GPT2Tokenizer',
    inputs=['x'],
    outputs=['y'],
    vocab=get_file_content(vocabulary_file),
    merges=get_file_content(merges_file)
)

x = ["hey cortana"]
y = np.array([20342, 12794, 2271], dtype=np.int64)

expect(node, inputs=[x], outputs=[y],
       name='test_gpt2_tokenizer')
```

</details>
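
The effect of `padding_length` on a batch can be pictured with the sketch below. It is not the operator's implementation: the helper name `pad_token_ids` is hypothetical, and the pad id of 0, the mask convention (1 for real tokens, 0 for padding) and the truncation of rows longer than `padding_length` are all assumptions made for illustration.

```python
import numpy as np


def pad_token_ids(rows, padding_length=-1, pad_id=0):
    """Hypothetical helper: turn ragged token ids into a rectangular batch plus a mask."""
    # -1 pads to the longest row; a positive value pads to that fixed width.
    width = max(len(r) for r in rows) if padding_length == -1 else padding_length
    input_ids = np.full((len(rows), width), pad_id, dtype=np.int64)
    attention_mask = np.zeros((len(rows), width), dtype=np.int64)
    for i, row in enumerate(rows):
        n = min(len(row), width)  # assumption: longer rows are truncated
        input_ids[i, :n] = row[:n]
        attention_mask[i, :n] = 1  # assumption: 1 marks a real token
    return input_ids, attention_mask


# The first row reuses the ids of "hey cortana"; the second is a shorter row.
rows = [[20342, 12794, 2271], [20342]]
input_ids, attention_mask = pad_token_ids(rows)
# input_ids.tolist()      == [[20342, 12794, 2271], [20342, 0, 0]]
# attention_mask.tolist() == [[1, 1, 1], [1, 0, 0]]
```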


### <a name="WordpieceTokenizer"></a><a name="WordpieceTokenizer">**WordpieceTokenizer**</a>

WordpieceTokenizer performs WordPiece tokenization on the input tensor,
based on the [Hugging Face version](https://huggingface.co/transformers/model_doc/bert.html#WordpieceTokenizer).
[WordpieceTokenizer](https://github.com/tensorflow/text/blob/master/docs/api_docs/python/text/WordpieceTokenizer.md)
from *tensorflow_text* can be implemented by a pair of nodes:
*StringRegexSplitWithOffsets* followed by *WordpieceTokenizer*.

#### Attributes

***vocab***

The **content** of the vocabulary file. Its format is the same as that of
[Hugging Face](https://huggingface.co/gpt2/resolve/main/vocab.json).

***suffix_indicator***

String prepended to every piece that is not at the start of a word before it is looked up in the vocabulary.

***unk_token***

Unknown token. Every token not found in the vocabulary is replaced by this one.

***max_input_chars_per_word***

Maximum number of characters per token (optional, defaults to 200).

#### Inputs

***data: tensor(string)***

The string tensor for tokenization.

***row_indices: tensor(int64)*** Empty, or the indices of every first token of input sentences.
`row_indices[i+1] - row_indices[i]` is the number of tokens in input `i`.

[WordpieceTokenizer](https://github.com/tensorflow/text/blob/master/docs/api_docs/python/text/WordpieceTokenizer.md)
includes two steps. The first one splits sentences into words; the second splits
every word into tokens. This operator only implements the second step.
The first one can be done with the operator *StringRegexSplit*.
This parameter can either be empty or be the third output
of the operator *StringRegexSplit*.

#### Outputs

***tokens: tensor(string)*** Every token.

***token_indices: tensor(int32)*** Indices of each token. -1 means a token outside the vocabulary.

***row_indices: tensor(int64)*** Indices of every first token of input sentences.
`row_indices[i+1] - row_indices[i]` is the number of tokens in input `i`.
These are the row indices given as input, updated for the new tokens, or new ones if the second input is empty.

#### Examples

<details>
<summary>word_piece_tokenizer</summary>

```python
import json

import numpy as np
from onnx import helper, onnx_pb as onnx_proto

words = ["want", "##want",
         "##ed", "wa", "un", "runn", "##ing"]
vocab = {w: i + 10 for i, w in enumerate(words)}
st = json.dumps(vocab)
mkv = helper.make_tensor_value_info
reg = helper.make_tensor(
    "pattern", onnx_proto.TensorProto.STRING, [1, ], ["(\\s)".encode('ascii')])
reg_empty = helper.make_tensor(
    "keep_pattern", onnx_proto.TensorProto.STRING, [0, ], [])

nodes = [
    helper.make_node(
        'StringRegexSplitWithOffsets',
        inputs=['text', 'pattern', 'keep_pattern'],
        outputs=['words', 'begin_end', 'indices'],
        name='StringRegexSplitOpName',
        domain='ai.onnx.contrib'),
    helper.make_node(
        'WordpieceTokenizer',
        inputs=['words', 'indices'],
        outputs=['out0', 'out1', 'out2'],
        name='WordpieceTokenizerOpName',
        domain='ai.onnx.contrib',
        vocab=st.encode('utf-8'),
        suffix_indicator="##",
        unk_token="[UNK]")
]
inputs = [mkv('text', onnx_proto.TensorProto.STRING, [None])]
graph = helper.make_graph(
    nodes, 'test0', inputs, [
        mkv('out0', onnx_proto.TensorProto.STRING, [None]),
        mkv('out1', onnx_proto.TensorProto.INT32, [None]),
        mkv('out2', onnx_proto.TensorProto.INT64, [None]),
        mkv('words', onnx_proto.TensorProto.STRING, [None]),
        mkv('indices', onnx_proto.TensorProto.INT64, [None])],
    [reg, reg_empty])
model = helper.make_model(
    graph, opset_imports=[helper.make_operatorsetid('ai.onnx.contrib', 1)])

text = np.array(["unwanted running", "unwantedX running"], dtype=object)
tokens = np.array(['un', '##want', '##ed', 'runn', '##ing', 'un', '##want', '##ed',
                   '[UNK]', 'runn', '##ing'], dtype=object)
indices = np.array([14, 11, 12, 15, 16, 14, 11, 12, -1, 15, 16], dtype=np.int32)
row_indices = np.array([0, 5, 11], dtype=np.int64)

expect(model, inputs=[text], outputs=[tokens, indices, row_indices],
       name='test_bert_tokenizer')
```

</details>
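
The second step, splitting each word into pieces, can be sketched as a greedy longest-match-first search. The function name `wordpiece_tokenize_word` is hypothetical. The handling of an unmatched remainder (emit the unknown token once and stop processing the word) is an assumption inferred from the expected output of the example above, not from the operator's source.

```python
def wordpiece_tokenize_word(word, vocab, suffix_indicator="##",
                            unk_token="[UNK]", max_input_chars_per_word=200):
    """Hypothetical greedy longest-match-first split of a single word."""
    if len(word) > max_input_chars_per_word:
        return [unk_token], [-1]
    tokens, ids = [], []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = suffix_indicator + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # Assumption: the unmatched remainder becomes one unknown token.
            tokens.append(unk_token)
            ids.append(-1)
            break
        tokens.append(match)
        ids.append(vocab[match])
        start = end
    return tokens, ids


pieces = ["want", "##want", "##ed", "wa", "un", "runn", "##ing"]
vocab = {w: i + 10 for i, w in enumerate(pieces)}

tokens, token_indices, row_indices = [], [], [0]
for sentence in ["unwanted running", "unwantedX running"]:
    for word in sentence.split():
        t, i = wordpiece_tokenize_word(word, vocab)
        tokens += t
        token_indices += i
    # One entry per sentence boundary, as in the row_indices output.
    row_indices.append(len(tokens))
```

With this vocabulary the loop yields 11 tokens and `row_indices == [0, 5, 11]`, the same values as in the example.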

### <a name="SentencepieceTokenizer"></a><a name="SentencepieceTokenizer">**SentencepieceTokenizer**</a>

SentencepieceTokenizer replicates [SentencepieceTokenizer](https://github.com/tensorflow/text/blob/master/docs/api_docs/python/text/SentencepieceTokenizer.md).

#### Inputs

***data: tensor(string)*** The string tensor for tokenization.

***nbest_size: tensor(int64)*** A scalar for sampling.

- `nbest_size = {0,1}`: no sampling is performed (default).
- `nbest_size > 1`: samples from the `nbest_size` results.
- `nbest_size < 0`: assumes that `nbest_size` is infinite and samples from all hypotheses (lattice) using the forward-filtering-and-backward-sampling algorithm.

***alpha: tensor(float)*** A scalar for a smoothing parameter. Inverse temperature for probability rescaling.

***reverse: tensor(bool)*** Reverses the tokenized sequence (Default = false).

***add_bos: tensor(bool)*** Add beginning of sentence token to the result (Default = false).

***add_eos: tensor(bool)*** Add end of sentence token to the result (Default = false).
When reverse=True, beginning/end of sentence tokens are added after reversing.

#### Attributes

***model: string*** The serialized sentencepiece model proto, stored as a string.

#### Outputs

***tokens: tensor(int32)*** Indices of each token.

***indices: tensor(int64)*** Indices of every first token of input sentences.
`indices[i+1] - indices[i]` is the number of tokens in input `i`.

Together, these two outputs hold the tokenized result of the input.

#### Examples

<details>
<summary>example 1</summary>

```python
import base64
import urllib.request

import numpy as np
import onnx

url = "https://github.com/microsoft/ort-customops/raw/main/test/data/test_sentencepiece_ops_model__6.txt"
with urllib.request.urlopen(url) as f:
    content = f.read()  # bytes holding the base64-encoded model
model = np.array(list(base64.decodebytes(content)), dtype=np.uint8)

node = onnx.helper.make_node(
    'SentencepieceTokenizer',
    inputs=['inputs', 'nbest_size', 'alpha', 'add_bos', 'add_eos', 'reverse'],
    outputs=['tokens', 'indices'],
    model=model
)

inputs = np.array(["Hello world", "Hello world louder"], dtype=object)
nbest_size = np.array([0], dtype=np.int64)
alpha = np.array([0], dtype=np.float32)
add_bos = np.array([0], dtype=np.bool_)
add_eos = np.array([0], dtype=np.bool_)
reverse = np.array([0], dtype=np.bool_)

tokens = np.array([17486, 1017, 17486, 1017, 155, 21869], dtype=np.int32)
indices = np.array([0, 2, 6], dtype=np.int64)

expect(node, inputs=[inputs, nbest_size, alpha, add_bos, add_eos, reverse],
       outputs=[tokens, indices], name='sp')
```

</details>
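
Because `tokens` is flat, the per-sentence token ids have to be recovered with `indices`. A minimal sketch, using the values from the example above and a hypothetical helper name `split_rows`:

```python
import numpy as np


def split_rows(tokens, indices):
    """Hypothetical helper: cut the flat token tensor at every sentence boundary."""
    return [tokens[b:e].tolist() for b, e in zip(indices[:-1], indices[1:])]


tokens = np.array([17486, 1017, 17486, 1017, 155, 21869], dtype=np.int32)
indices = np.array([0, 2, 6], dtype=np.int64)

per_sentence = split_rows(tokens, indices)
# per_sentence == [[17486, 1017], [17486, 1017, 155, 21869]]
```

The first sentence ("Hello world") owns the first two ids and the second ("Hello world louder") owns the remaining four, matching `indices[i+1] - indices[i]`.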

### <a name="XLNetTokenizer"></a><a name="XLNetTokenizer">**XLNetTokenizer**</a>

XLNetTokenizer performs SentencePiece tokenization on the input tensor, based on the [Hugging Face version](https://huggingface.co/transformers/model_doc/xlnet.html#xlnettokenizer).

#### Inputs

***data: tensor(string)***

The string tensor for tokenization.

#### Outputs

***output: tensor(int64)***

Tokenized result of the input.

#### Examples

<details>
<summary>xlnet_tokenizer</summary>

```python

```

</details>