microsoft/onnxruntime-extensions

Mirrored from https://github.com/microsoft/onnxruntime-extensions

bfbfa5a3044ec8d1312f3782c78ea3b9246bf667

docs/custom_text_ops.md


## Operator Schemas

### Auxiliary String Operator

|**Operator**|**Support State**|
|------------|-----------------|
|StringEqual | Supported |
|StringHash | Supported |
|StringToHashBucketFast|Supported|
|StringJoin | Supported |
|StringRegexReplace| Supported |
|StringECMARegexReplace| Supported|
|StringSplit | Supported |
|StringUpper | Supported |
|StringLength | Supported |
|StringConcat | Supported |
|StringRegexSplitWithOffsets| Supported |
|StringECMARegexSplitWithOffsets| Supported|
|VectorToString| Supported |
|StringToVector| Supported|
|StringSlice | Under development|
|MaskedFill | Supported|

### Tokenizer

|**Operator**|**Support State**|
|------------|-----------------|
|GPT2Tokenizer| Supported |
|WordpieceTokenizer| Supported |
|SentencepieceTokenizer| Supported |
|BasicTokenizer| Supported |
|BertTokenizer| Supported |
|BertTokenizerDecoder| Supported |


## Auxiliary String Operator

[TODO: Add existing operators]

### <a name="StringRegexReplace"></a><a name="StringRegexReplace">**StringRegexReplace**</a>

String replacement based on [Re2-format](https://github.com/google/re2/wiki/Syntax) regular expressions.

#### Inputs

***text: tensor(string)***

String tensor in which matches are replaced.

***pattern: tensor(string)***

Pattern of the regular expression.

***rewrite: tensor(string)***

Replacement.

#### Attributes

***global_replace: int64*** (default is 1)

If 1, replace every match of the pattern; if 0, replace only the first match.

#### Outputs

***output: tensor(string)***

String with replacements.

#### Examples

<details>
<summary>StringRegexReplace</summary>

```python
node = onnx.helper.make_node(
    'StringRegexReplace',
    inputs=['text', 'pattern', 'rewrite'],
    outputs=['y'],
)

text = np.array([['def myfunc():'], ['def dummy():']])
pattern = np.array([r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):'])
rewrite = np.array([r'static PyObject* py_\1(void) {'])
y = [['static PyObject* py_myfunc(void) {'],
     ['static PyObject* py_dummy(void) {']]

expect(node, inputs=[text, pattern, rewrite], outputs=[y],
       name='test_string_regex_replace')
```

</details>

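The effect of `global_replace` can be sketched with Python's `re` module. This is only an illustration: the operator uses RE2, whose syntax differs from Python's in places, and `regex_replace` is a hypothetical helper, not part of the library.

```python
import re

def regex_replace(texts, pattern, rewrite, global_replace=1):
    """Replace matches in every string of a flat list (hypothetical helper)."""
    # In re.sub, count=0 means "replace all matches"; count=1 means "first only".
    count = 0 if global_replace else 1
    return [re.sub(pattern, rewrite, t, count=count) for t in texts]

all_matches = regex_replace(["a-b-c"], "-", "+")
first_only = regex_replace(["a-b-c"], "-", "+", global_replace=0)
print(all_matches, first_only)
```
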
### <a name="StringECMARegexReplace"></a><a name="StringECMARegexReplace">**StringECMARegexReplace**</a>

String replacement based on [ECMA-format](https://en.cppreference.com/w/cpp/regex/ecmascript) regular expressions.

#### Inputs

***text: tensor(string)***

String tensor in which matches are replaced.

***pattern: tensor(string)***

Pattern of the regular expression.

***rewrite: tensor(string)***

Replacement.

#### Attributes

***global_replace: int64*** (default is 1)

If 1, replace every match of the pattern; if 0, replace only the first match.

***ignore_case: int64*** (default is 0)

If 1, match the pattern case-insensitively.

#### Outputs

***output: tensor(string)***

String with replacements.

#### Examples

<details>
<summary>StringECMARegexReplace</summary>

```python
node = onnx.helper.make_node(
    'StringECMARegexReplace',
    inputs=['text', 'pattern', 'rewrite'],
    outputs=['y'],
)

text = np.array([['def myfunc():'], ['def dummy():']])
pattern = np.array([r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):'])
# ECMA-format rewrites reference capture groups as $1, $2, ...
rewrite = np.array([r'static PyObject* py_$1(void) {'])
y = [['static PyObject* py_myfunc(void) {'],
     ['static PyObject* py_dummy(void) {']]

expect(node, inputs=[text, pattern, rewrite], outputs=[y],
       name='test_string_ecma_regex_replace')
```

</details>

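The difference from `StringRegexReplace` shown in the example is the `$1` style of group reference, plus the `ignore_case` attribute. A rough Python sketch of both follows. It assumes the rewrite string contains no backslashes, and `ecma_regex_replace` is a hypothetical name used only for illustration.

```python
import re

def ecma_regex_replace(text, pattern, rewrite, global_replace=1, ignore_case=0):
    """Emulate $n group references and ignore_case with Python's re module."""
    # Translate ECMAScript group references ($1, $2, ...) to Python's \g<n> form.
    py_rewrite = re.sub(r"\$(\d+)", r"\\g<\1>", rewrite)
    flags = re.IGNORECASE if ignore_case else 0
    count = 0 if global_replace else 1
    return re.sub(pattern, py_rewrite, text, count=count, flags=flags)

print(ecma_regex_replace("Hello HELLO", "hello", "x", ignore_case=1))
```
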
### <a name="StringRegexSplitWithOffsets"></a><a name="StringRegexSplitWithOffsets">**StringRegexSplitWithOffsets**</a>

Splits string based on regular expressions.

#### Inputs

***text: tensor(string)***

String tensor to split.

***delim_regex_pattern: tensor(string)***

Splitting pattern of the regular expression.

***keep_delim_regex_pattern: tensor(string)***

By default, delimiters are not included in the split string results. Delimiters may be included by specifying a regex pattern `keep_delim_regex_pattern`.

#### Outputs

***words: tensor(string)*** Tensor of words.

***offsets: tensor(int64)*** 2D tensor with 3 columns:
sentence index, position of the first character, position of the last one (excluded)

***row_indices: tensor(int64)*** Indices of every first token of input sentences.
`row_indices[i+1] - row_indices[i]` is the number of tokens in input `i`.


#### Examples

<details>
<summary>StringRegexSplitWithOffsets</summary>

```python
node = onnx.helper.make_node(
    'StringRegexSplitWithOffsets',
    inputs=['text', 'pattern', 'keep_pattern'],
    outputs=['y', 'begin_end', 'indices'],
)

text = np.array(["hello there"])
pattern = np.array([r'\s'])
keep_pattern = np.array([r'\s'])
y = np.array(["hello", " ", "there"])
z1 = np.array([[0, 0, 5],
               [0, 5, 6],
               [0, 6, 11]], dtype=np.int64)
# one sentence holding three tokens, so the row indices are [0, 3]
z2 = np.array([0, 3], dtype=np.int64)

expect(node, inputs=[text, pattern, keep_pattern], outputs=[y, z1, z2],
       name='test_string_regex_split_with_offsets')
```

</details>

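The three outputs described above can be reproduced with a small pure-Python sketch. It follows the output definitions in this section and is not the operator's implementation: `regex_split_with_offsets` is a hypothetical helper, it uses Python's `re` syntax, and it measures offsets in Python characters, which may differ from the operator for non-ASCII text.

```python
import re

def regex_split_with_offsets(texts, delim_pattern, keep_delim_pattern=""):
    """Return (words, offsets, row_indices) for a flat list of sentences."""
    delim = re.compile(delim_pattern)
    keep = re.compile(keep_delim_pattern) if keep_delim_pattern else None
    words, offsets, row_indices = [], [], [0]
    for row, text in enumerate(texts):
        pos = 0
        for m in delim.finditer(text):
            if m.start() > pos:  # non-empty piece before the delimiter
                words.append(text[pos:m.start()])
                offsets.append([row, pos, m.start()])
            if keep is not None and keep.fullmatch(m.group()):
                words.append(m.group())  # delimiter kept as its own token
                offsets.append([row, m.start(), m.end()])
            pos = m.end()
        if pos < len(text):  # trailing piece after the last delimiter
            words.append(text[pos:])
            offsets.append([row, pos, len(text)])
        row_indices.append(len(words))
    return words, offsets, row_indices

print(regex_split_with_offsets(["hello there"], r"\s", r"\s"))
```
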
### <a name="StringConcat"></a><a name="StringConcat">**StringConcat**</a>

Concatenates the corresponding strings of the two input tensors element-wise. The two input tensors must have the same shape.

```python
output = []
shape = input1.shape
input1 = input1.flatten()
input2 = input2.flatten()
for i in range(len(input1)):
    output.append(input1[i] + input2[i])
output = np.array(output).reshape(shape)
```

#### Inputs

***input_1: tensor(string)***

The first string tensor.

***input_2: tensor(string)***

The second string tensor.


#### Outputs

***output: tensor(string)***

The result.

#### Examples

<details>
<summary>StringConcat</summary>

```python
node = onnx.helper.make_node(
    'StringConcat',
    inputs=['x', 'y'],
    outputs=['result'],
)

x = np.array(["abcd", "efgh"])
y = np.array(["wxyz", "stuv"])
result = np.array([x[0] + y[0], x[1] + y[1]])

expect(node, inputs=[x, y], outputs=[result],
       name='test_string_concat')
```

</details>

### <a name="StringSlice"></a><a name="StringSlice">**StringSlice**</a>

Applies a slice operation to each string element of the input tensor, similar to string slicing in Python:

```python
a = "abcdef"
b = a[1:2]
c = a[3:1:-1]
```

#### Inputs

***data: tensor(string)***

String tensor to extract slices from.

***starts: tensor(int64/int32)***

The starting index for each string in `data`. It has the same shape as `data`.

***ends: tensor(int64/int32)***

The ending index for each string in `data`. It has the same shape as `data`.

***steps(optional): tensor(int64/int32)***

The slice step for each string in `data`. It has the same shape as `data`. If `steps` is an empty tensor, the default step 1 is used for every string.

#### Outputs

***output: tensor(string)***

Sliced data tensor.

#### Examples

<details>
<summary>string_slice</summary>

```python
node = onnx.helper.make_node(
    'StringSlice',
    inputs=['x', 'starts', 'ends', 'steps'],
    outputs=['y'],
)

x = np.array(["abcdef", "hijkl"])
y = np.array([x[0][1:3:1], x[1][3:1:-1]])
starts = np.array([1, 3], dtype=np.int64)
ends = np.array([3, 1], dtype=np.int64)
# steps match the slices used to build y: 1 for x[0], -1 for x[1]
steps = np.array([1, -1], dtype=np.int64)

expect(node, inputs=[x, starts, ends, steps], outputs=[y],
       name='test_string_slice')
```

</details>

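The per-element behavior described above amounts to applying Python slicing to each string with its own start, end and step. A minimal sketch for flat lists follows; `string_slice` is a hypothetical helper and not the operator itself, which is still marked as under development.

```python
def string_slice(data, starts, ends, steps=None):
    """Slice every string with its own (start, end, step) triple."""
    if not steps:  # empty or missing steps: default step 1 for every string
        steps = [1] * len(data)
    return [s[b:e:st] for s, b, e, st in zip(data, starts, ends, steps)]

print(string_slice(["abcdef", "hijkl"], [1, 3], [3, 1], [1, -1]))
```
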
### <a name="StringLength"></a><a name="StringLength">**StringLength**</a>

Gets the length of each string element in the input tensor, similar to `len("abcde")` in Python.

#### Inputs

***data: tensor(string)***

String tensor whose element lengths are computed.

#### Outputs

***output: tensor(int64)***

Data length tensor.

#### Examples

<details>
<summary>string_length</summary>

```python
node = onnx.helper.make_node(
    'StringLength',
    inputs=['x'],
    outputs=['y']
)

x = ["abcdef", "hijkl"]
y = np.array([len(x[0]), len(x[1])], dtype=np.int64)


expect(node, inputs=[x], outputs=[y],
       name='test_string_length')
```
</details>


### <a name="StringToVector"></a><a name="StringToVector">**StringToVector**</a>

StringToVector maps each string element of the input to the corresponding vector according to the mapping file. The mapping file is a UTF-8 encoded text file in TSV format:

    <string>\t<scalar_1>\s<scalar_2>\s<scalar_3>...<scalar_n>

An unmapped string outputs the value of the attribute `unmapping_value`.

Example:

*Attributes:*

- `mapping_file_name`: vocabulary.txt
  ```
  a 0 0 1 2
  b 0 1 2 3
  d 0 1 3 4
  ```

- `unmapping_value`: [0 0 0 0]

*Inputs:*
- data: ["a", "d", "e"]

*Outputs:*
- output: [[0,0,1,2],[0,1,3,4],[0,0,0,0]]

#### Attributes

***mapping_file_name:string***

The name of your string to vector mapping file.

***unmapping_value:list(int)***

Mapping result for an unmapped string.

#### Inputs

***data: tensor(string)***

Input tensor

#### Outputs

***output: tensor(T)***

The mapping result of the input

#### Type Constraints
***T:tensor(uint8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(int8), tensor(int16), tensor(int32), tensor(int64), tensor(bfloat16), tensor(float16), tensor(float), tensor(double), tensor(bool)***

Constrain input and output types to numerical tensors.

#### Examples

<details>
<summary>string_to_vector</summary>

```python
# what's in vocabulary.txt

mapping_table = \
"""
a 0 0 1 2
b 0 1 2 3
d 0 1 3 4
"""

node = onnx.helper.make_node(
    'StringToVector',
    inputs=['x'],
    outputs=['y'],
    mapping_table=mapping_table,
    unmapping_value=[0,0,0,0]
)


x = ["a", "d", "e"]
y = np.array([[0,0,1,2],[0,1,3,4],[0,0,0,0]], dtype=np.int64)


expect(node, inputs=[x], outputs=[y],
       name='test_string_to_vector')
```

</details>

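The lookup described above can be sketched in plain Python. The sketch assumes the table uses a tab between the string and its vector, spaces between scalars, and integer scalars. `parse_mapping` and `string_to_vector` are hypothetical helper names, not library functions.

```python
def parse_mapping(table):
    """Parse '<string>\\t<scalar_1> <scalar_2> ...' lines into a dict."""
    mapping = {}
    for line in table.splitlines():
        if not line.strip():
            continue  # skip blank lines
        key, values = line.split("\t")
        mapping[key] = [int(v) for v in values.split()]
    return mapping

def string_to_vector(data, mapping, unmapping_value):
    """Map each string to its vector, falling back to unmapping_value."""
    return [mapping.get(s, unmapping_value) for s in data]

table = "a\t0 0 1 2\nb\t0 1 2 3\nd\t0 1 3 4\n"
print(string_to_vector(["a", "d", "e"], parse_mapping(table), [0, 0, 0, 0]))
```
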
### <a name="VectorToString"></a><a name="VectorToString">**VectorToString**</a>

VectorToString is the inverse operation of `StringToVector`. Both share the same mapping table format:

    <string>\t<scalar_1>\s<scalar_2>\s<scalar_3>...<scalar_n>

An unmapped vector outputs the value of the attribute `unk`.

Example:

*Attributes:*

- `map`:
  ```
  a 0 0 1 2
  b 0 1 2 3
  d 0 1 3 4
  ```

- `unk`: "unknown_word"

*Inputs:*
- data: [[0,0,1,2],[0,1,3,4],[0,0,0,0]]

*Outputs:*
- output: ["a", "d", "unknown_word" ]

#### Attributes

***mapping_file_name***

The mapping table, in the format described above.

***unmapping_value***

The result returned when a vector is not found in the map.

#### Inputs

***data: tensor(T)***

Input tensor

#### Outputs

***output: tensor(string)***

The mapping result of the input

#### Type Constraints
***T:tensor(uint8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(int8), tensor(int16), tensor(int32), tensor(int64), tensor(bfloat16), tensor(float16), tensor(float), tensor(double), tensor(bool)***

Constrain input and output types to numerical tensors.


#### Examples

<details>
<summary>vector_to_string</summary>

```python
mapping_table = \
"""
a 0 0 1 2
b 0 1 2 3
d 0 1 3 4
"""

node = onnx.helper.make_node(
    'VectorToString',
    inputs=['x'],
    outputs=['y'],
    map=mapping_table,
    unk="unknown_word"
)


x = np.array([[0,0,1,2],[0,1,3,4],[0,0,0,0]], dtype=np.int64)
y = ["a", "d", "unknown_word"]


expect(node, inputs=[x], outputs=[y],
       name='test_vector_to_string')
```
</details>

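The reverse lookup can be sketched the same way, keyed by the vector instead of the string. As before, this assumes a tab-separated table with integer scalars, and `vector_to_string` is a hypothetical helper used only for illustration.

```python
def vector_to_string(data, table, unk):
    """Map each vector back to its string, falling back to unk."""
    lookup = {}
    for line in table.splitlines():
        if line.strip():
            key, values = line.split("\t")
            # tuples are hashable, so they can serve as dictionary keys
            lookup[tuple(int(v) for v in values.split())] = key
    return [lookup.get(tuple(vec), unk) for vec in data]

table = "a\t0 0 1 2\nb\t0 1 2 3\nd\t0 1 3 4\n"
print(vector_to_string([[0, 0, 1, 2], [0, 1, 3, 4], [0, 0, 0, 0]], table, "unknown_word"))
```
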
### <a name="MaskedFill"></a><a name="MaskedFill">**MaskedFill**</a>

Fills elements of self tensor with value where mask is True. The operator is similar to [`Tensor.masked_fill_`](https://pytorch.org/docs/stable/generated/torch.Tensor.masked_fill_.html#torch.Tensor.masked_fill_) in PyTorch.


#### Inputs

***value: tensor(string)***

The value to fill in with. Currently only string tensors that are scalars or vectors are supported.

***mask: tensor(bool)***

The boolean mask. It must have the same shape as `value`.

#### Outputs

***output: tensor(string)***

The filled output of input tensor.


#### Examples

<details>
<summary>masked_fill</summary>

```python
node = onnx.helper.make_node(
    'MaskedFill',
    inputs=['value', 'mask'],
    outputs=['output']
)


value = np.array(["a", "b", "c", "d"])
mask = np.array([True, False, True, False], dtype=bool)
output = np.array(["a", "c"])


expect(node, inputs=[value, mask], outputs=[output],
       name='test_masked_fill')
```
</details>

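Judging from the example above, the output keeps only the elements of `value` whose mask entry is True. A pure-Python sketch of that reading follows. It is an interpretation of the example rather than the operator's source, and `masked_select` is a hypothetical name.

```python
def masked_select(value, mask):
    """Keep the elements of value at positions where mask is True."""
    if len(value) != len(mask):
        raise ValueError("value and mask must have the same length")
    return [v for v, keep in zip(value, mask) if keep]

print(masked_select(["a", "b", "c", "d"], [True, False, True, False]))
```
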
## Tokenizer

### <a name="GPT2Tokenizer"></a><a name="GPT2Tokenizer">**GPT2Tokenizer**</a>

GPT2Tokenizer performs byte-level BPE tokenization on the input tensor, based on the [hugging face version](https://huggingface.co/transformers/_modules/transformers/tokenization_gpt2.html).

#### Attributes

***vocab***

The **content** of the vocabulary file. Its format is the same as [hugging face](https://huggingface.co/gpt2/resolve/main/vocab.json).

***merges***

The **content** of the merges file. Its format is the same as [hugging face](https://huggingface.co/gpt2/resolve/main/merges.txt).

***padding_length(optional)***

When the input is a batch of queries, the tokenized result is a ragged tensor, so it must be padded into a regular tensor; `padding_length` selects the padding strategy. When `padding_length` equals -1, rows are padded to the length of the longest row. When `padding_length` is greater than 0, rows are padded to length `padding_length`.

The default value of `padding_length` is -1.

#### Inputs

***data: tensor(string)***

The string tensor for tokenization

#### Outputs

***input_ids: tensor(int64)***

The tokenized ids of input

***attention_mask: tensor(int64)***

A tensor that indicates which part of `input_ids` is padded.

#### Examples

<details>
<summary>gpt2tokenizer</summary>

```python
def get_file_content(path):
    with open(path, "rb") as file:
        return file.read()

node = onnx.helper.make_node(
    'GPT2Tokenizer',
    inputs=['x'],
    outputs=['y'],
    vocab=get_file_content(vocabulary_file),
    merges=get_file_content(merges_file)
)

x = ["hey cortana"]
y = np.array([20342, 12794, 2271], dtype=np.int64)

expect(node, inputs=[x], outputs=[y],
       name='test_gpt2_tokenizer')
```
</details>


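The two `padding_length` strategies can be sketched on plain lists of token ids. The sketch makes several assumptions that this document does not state: padding uses id 0, the attention mask is 1 for real tokens and 0 for padding, and no row is longer than a positive `padding_length`. `pad_ragged` is a hypothetical helper.

```python
def pad_ragged(rows, padding_length=-1, pad_id=0):
    """Pad ragged rows of token ids and build the matching attention mask."""
    # -1 pads to the longest row; a positive value pads to that fixed width.
    width = max(len(r) for r in rows) if padding_length == -1 else padding_length
    input_ids = [r + [pad_id] * (width - len(r)) for r in rows]
    attention_mask = [[1] * len(r) + [0] * (width - len(r)) for r in rows]
    return input_ids, attention_mask

ids, mask = pad_ragged([[1, 2, 3], [4]])
print(ids, mask)
```
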
### <a name="WordpieceTokenizer"></a><a name="WordpieceTokenizer">**WordpieceTokenizer**</a>

WordpieceTokenizer performs WordPiece tokenization on the input tensor,
based on the [hugging face version](https://huggingface.co/transformers/model_doc/bert.html#WordpieceTokenizer).
[WordpieceTokenizer](https://github.com/tensorflow/text/blob/master/docs/api_docs/python/text/WordpieceTokenizer.md)
from *tensorflow_text* can be implemented by a pair of nodes
*StringRegexSplitWithOffsets* followed by *WordpieceTokenizer*.

#### Attributes

***vocab***

The **content** of the vocabulary file. Its format is the same as
[hugging face](https://huggingface.co/gpt2/resolve/main/vocab.json).

***suffix_indicator***

Suffix added to a token that is not in the first position before it is looked up in the vocabulary.

***unk_token***

Unknown token. Every token not found in the vocabulary is replaced by this one.

***max_input_chars_per_word***

Maximum number of characters per token (optional, defaults to 200).

#### Inputs

***data: tensor(string)***

The string tensor for tokenization

***row_indices: tensor(int64)*** Empty or the indices of every first token of input sentences.
`indices[i+1] - indices[i]` is the number of tokens in input `i`.

[WordpieceTokenizer](https://github.com/tensorflow/text/blob/master/docs/api_docs/python/text/WordpieceTokenizer.md)
includes two steps. The first one splits sentences into words and the second one splits
every word into tokens. This operator only implements the second step.
The first one can be done with operator *StringRegexSplitWithOffsets*.
This parameter can either be empty or it can be the third output
of operator *StringRegexSplitWithOffsets*.

#### Outputs

***tokens: tensor(string)*** Every token.

***token_indices: tensor(int32)*** Indices of each token. -1 means a token outside the vocabulary.

***row_indices: tensor(int64)*** Indices of every first token of input sentences.
`indices[i+1] - indices[i]` is the number of tokens in input `i`.
These are the updated row indices given as input, or new ones if the second input is empty.

#### Examples

<details>
<summary>word_piece_tokenizer</summary>

```python
import json
import numpy as np
from onnx import helper, onnx_pb as onnx_proto

words = ["want", "##want",
         "##ed", "wa", "un", "runn", "##ing"]
vocab = {w: i + 10 for i, w in enumerate(words)}
st = json.dumps(vocab)
mkv = helper.make_tensor_value_info
reg = helper.make_tensor(
    "pattern", onnx_proto.TensorProto.STRING, [1, ], ["(\\s)".encode('ascii')])
reg_empty = helper.make_tensor(
    "keep_pattern", onnx_proto.TensorProto.STRING, [0, ], [])

nodes = [
    helper.make_node(
        'StringRegexSplitWithOffsets',
        inputs=['text', 'pattern', 'keep_pattern'],
        outputs=['words', 'begin_end', 'indices'],
        name='StringRegexSplitOpName',
        domain='ai.onnx.contrib'),
    helper.make_node(
        'WordpieceTokenizer',
        inputs=['words', 'indices'],
        outputs=['out0', 'out1', 'out2'],
        name='WordpieceTokenizerOpName',
        domain='ai.onnx.contrib',
        vocab=st.encode('utf-8'),
        suffix_indicator="##",
        unk_token="[UNK]")
]
inputs = [mkv('text', onnx_proto.TensorProto.STRING, [None])]
graph = helper.make_graph(
    nodes, 'test0', inputs, [
        mkv('out0', onnx_proto.TensorProto.STRING, [None]),
        mkv('out1', onnx_proto.TensorProto.INT32, [None]),
        mkv('out2', onnx_proto.TensorProto.INT64, [None]),
        mkv('words', onnx_proto.TensorProto.STRING, [None]),
        mkv('indices', onnx_proto.TensorProto.INT64, [None])],
    [reg, reg_empty])
model = helper.make_model(
    graph, opset_imports=[helper.make_operatorsetid('ai.onnx.contrib', 1)])

text = np.array(["unwanted running", "unwantedX running"], dtype=object)
tokens = np.array(['un', '##want', '##ed', 'runn', '##ing', 'un', '##want', '##ed',
                   '[UNK]', 'runn', '##ing'], dtype=object)
indices = np.array([14, 11, 12, 15, 16, 14, 11, 12, -1, 15, 16], dtype=np.int32)
row_indices = np.array([0, 5, 11], dtype=np.int64)

expect(model, inputs=[text], outputs=[tokens, indices, row_indices],
       name='test_wordpiece_tokenizer')
```

</details>

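The second step, splitting one word into tokens, can be sketched as a greedy longest-match-first search over the vocabulary. The sketch is written to reproduce the example output above: when no piece matches, it emits the unknown token and stops processing that word. Other WordPiece implementations may instead replace the whole word. The handling of `max_input_chars_per_word` is likewise an assumption, and `wordpiece` is a hypothetical helper.

```python
def wordpiece(word, vocab, suffix_indicator="##", unk_token="[UNK]",
              max_input_chars_per_word=200):
    """Split one word into (tokens, ids); -1 marks a token outside the vocabulary."""
    if len(word) > max_input_chars_per_word:
        return [unk_token], [-1]
    tokens, ids, start = [], [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:  # try the longest remaining substring first
            candidate = word[start:end]
            if start > 0:
                candidate = suffix_indicator + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:  # nothing matched: emit unk and stop this word
            tokens.append(unk_token)
            ids.append(-1)
            break
        tokens.append(piece)
        ids.append(vocab[piece])
        start = end
    return tokens, ids

words = ["want", "##want", "##ed", "wa", "un", "runn", "##ing"]
vocab = {w: i + 10 for i, w in enumerate(words)}
print(wordpiece("unwantedX", vocab))
```
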
### <a name="SentencepieceTokenizer"></a><a name="SentencepieceTokenizer">**SentencepieceTokenizer**</a>

SentencepieceTokenizer replicates [SentencepieceTokenizer](https://github.com/tensorflow/text/blob/master/docs/api_docs/python/text/SentencepieceTokenizer.md).

#### Inputs

***data: tensor(string)*** The string tensor for tokenization

***nbest_size: tensor(int64)*** A scalar for sampling. nbest_size = {0,1}: no sampling is performed (default).
nbest_size > 1: samples from the nbest_size results. nbest_size < 0: assumes that
nbest_size is infinite and samples from all hypotheses (lattice) using the
forward-filtering-and-backward-sampling algorithm.

***alpha: tensor(float)*** A scalar for a smoothing parameter. Inverse temperature for probability rescaling.

***reverse: tensor(bool)*** Reverses the tokenized sequence (Default = false)

***add_bos: tensor(bool)*** Add beginning of sentence token to the result (Default = false)

***add_eos: tensor(bool)*** Add end of sentence token to the result (Default = false).
When reverse=True, beginning/end of sentence tokens are added after reversing.

#### Attributes

***model: string*** The serialized sentencepiece model proto, stored as a string.

#### Outputs

***tokens: tensor(int32)*** Indices of each token.

***indices: tensor(int64)*** Indices of every first token of input sentences.
`indices[i+1] - indices[i]` is the number of tokens in input `i`.

Tokenized result of the input

#### Examples

<details>
<summary>example 1</summary>

```python
import base64
import urllib.request

url = "https://github.com/microsoft/ort-customops/raw/main/test/data/test_sentencepiece_ops_model__6.txt"
with urllib.request.urlopen(url) as f:
    content = f.read()
# the downloaded bytes are base64-decoded into the serialized model
model = base64.decodebytes(content)

node = onnx.helper.make_node(
    'SentencepieceTokenizer',
    inputs=['inputs', 'nbest_size', 'alpha', 'add_bos', 'add_eos', 'reverse'],
    outputs=['tokens', 'indices'],
    model=model
)

inputs = np.array(["Hello world", "Hello world louder"], dtype=object)
nbest_size = np.array([0], dtype=np.int64)
alpha = np.array([0], dtype=np.float32)
add_bos = np.array([0], dtype=np.bool_)
add_eos = np.array([0], dtype=np.bool_)
reverse = np.array([0], dtype=np.bool_)

tokens = np.array([17486, 1017, 17486, 1017, 155, 21869], dtype=np.int32)
indices = np.array([0, 2, 6], dtype=np.int64)

expect(node, inputs=[inputs, nbest_size, alpha, add_bos, add_eos, reverse],
       outputs=[tokens, indices], name='sp')
```
</details>

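The `indices` output is a row-splits vector over the flat `tokens` output. A short sketch of how a caller might recover the per-sentence token lists from the two outputs (`split_rows` is a hypothetical helper):

```python
def split_rows(tokens, indices):
    """Cut the flat token list into one list per input sentence."""
    return [tokens[indices[i]:indices[i + 1]] for i in range(len(indices) - 1)]

tokens = [17486, 1017, 17486, 1017, 155, 21869]
indices = [0, 2, 6]
print(split_rows(tokens, indices))
```

With the values from the example above, the first sentence owns two tokens and the second owns four.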
### <a name="BasicTokenizer"></a><a name="BasicTokenizer">**BasicTokenizer**</a>

BasicTokenizer performs basic tokenization on the input string tensor, based on the [basic tokenizer in BertTokenizer (hugging face version)](https://huggingface.co/transformers/_modules/transformers/models/bert/tokenization_bert.html#BertTokenizer).

#### Inputs

***text: tensor(string)*** The string tensor for tokenization

#### Attributes

***do_lower_case: int64_t*** (default is 1, 1 represents True, 0 represents False)

Whether or not to lowercase the input when tokenizing.

***tokenize_chinese_chars: int64_t*** (default is 1, 1 represents True, 0 represents False)

Whether or not to tokenize Chinese characters.

***strip_accents: int64_t*** (default is 1, 1 represents True, 0 represents False)

Whether or not to strip all accents. If this option is not specified, then it will be determined by the
value for `do_lower_case` (as in the original BERT).

***tokenize_punctuation: int64_t*** (default is 0, 1 represents True, 0 represents False)

Splits punctuation on a piece of text.

***remove_control_chars: int64_t*** (default is 0, 1 represents True, 0 represents False)

Removes control characters (such as NUL, BEL) from the text.

#### Outputs

***tokens: tensor(string)*** Tokenized tokens.

#### Examples

<details>
<summary>example 1</summary>

```python
import transformers

tokenizer = transformers.BasicTokenizer()

node = onnx.helper.make_node(
    'BasicTokenizer',
    inputs=['text'],
    outputs=['tokens'],
)

text = "Hello world louder"
inputs = np.array([text], dtype=object)
# BasicTokenizer.tokenize returns a list of string tokens
tokens = np.array(tokenizer.tokenize(text), dtype=object)

expect(node, inputs=[inputs],
       outputs=[tokens], name='test_basic_tokenizer')
```
</details>

### <a name="BertTokenizer"></a><a name="BertTokenizer">**BertTokenizer**</a>

BertTokenizer replicates the `encode_plus` function of [BertTokenizer (huggingface version)](https://huggingface.co/transformers/_modules/transformers/models/bert/tokenization_bert.html#BertTokenizer).

#### Inputs

***text: tensor(string)*** The string tensor for tokenization

#### Attributes

***vocab_file: string***

The content of the vocabulary file, in the same format as the huggingface one.

***do_lower_case: int64_t*** (default is 1, 1 represents True, 0 represents False)

Whether or not to lowercase the input when tokenizing.

***do_basic_tokenize: int64_t*** (default is 1, 1 represents True, 0 represents False)

Whether or not to do basic tokenization before WordPiece.

***unk_token: string***

The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead.

***sep_token: string***

The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
sequence classification or for a text and a question for question answering. It is also used as the last
token of a sequence built with special tokens.

***pad_token: string***

The token used for padding, for example when batching sequences of different lengths.

***cls_token: string***

The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.

***mask_token: string***

The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict.

***tokenize_chinese_chars: int64_t*** (default is 1, 1 represents True, 0 represents False)

Whether or not to tokenize Chinese characters.

***strip_accents: int64_t*** (default is 1, 1 represents True, 0 represents False)

Whether or not to strip all accents. If this option is not specified, then it will be determined by the
value for `do_lower_case` (as in the original BERT).

***tokenize_punctuation: int64_t*** (default is 0, 1 represents True, 0 represents False)

Splits punctuation on a piece of text.

***remove_control_chars: int64_t*** (default is 0, 1 represents True, 0 represents False)

Removes control characters (such as NUL, BEL) from the text.

***truncation_strategy_name: string***

The name of the truncation strategy. It can be `longest_first`, `only_first`, `only_second` or `longest_from_back`.

#### Outputs

***input_ids: tensor(int64)***

List of token ids.

***token_type_ids: tensor(int64)***

List of token type ids

***attention_mask: tensor(int64)***

List of indices specifying which tokens should be attended to by the model


#### Examples

<details>
<summary>example 1</summary>

```python
import transformers

bert_cased_tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-cased')

node = onnx.helper.make_node(
    'BertTokenizer',
    inputs=['text'],
    outputs=['input_ids', 'token_type_ids', 'attention_mask'],
)

text = "Hello world louder"
inputs = np.array([text], dtype=object)

# encode_plus returns a mapping that holds the three expected outputs
bert_tokenize_result = bert_cased_tokenizer.encode_plus(text)

input_ids = np.array(bert_tokenize_result['input_ids'])
token_type_ids = np.array(bert_tokenize_result['token_type_ids'])
attention_mask = np.array(bert_tokenize_result['attention_mask'])

expect(node, inputs=[inputs],
       outputs=[input_ids, token_type_ids, attention_mask], name='test_bert_tokenizer')
```
</details>


### <a name="BertTokenizerDecoder"></a><a name="BertTokenizerDecoder">**BertTokenizerDecoder**</a>

BertTokenizerDecoder replicates the `decode` function of [BertTokenizer (huggingface version)](https://huggingface.co/transformers/_modules/transformers/models/bert/tokenization_bert.html#BertTokenizer).

#### Inputs

***token_ids: tensor(int64)***

List of tokenized input ids.

***indices: tensor(int64)***

List of `[start_position, end_position]` pairs indicating which segments of the input ids should be decoded. This input is only enabled when the attribute `use_indices`=1.

Usually, it is used to decode the slot in the text.

#### Attributes

***vocab_file: string***

The content of the vocabulary file, in the same format as the huggingface one.

***unk_token: string***

The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead.

***sep_token: string***

The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
sequence classification or for a text and a question for question answering. It is also used as the last
token of a sequence built with special tokens.

***pad_token: string***

The token used for padding, for example when batching sequences of different lengths.

***cls_token: string***

The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.

***mask_token: string***

The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict.

***suffix_indicator: string***

The suffix indicator.

***use_indices: int64_t***

Whether to use the second input (`indices`).

***skip_special_tokens: int64_t***

Whether or not to remove special tokens in the decoding.

***clean_up_tokenization_spaces: int64_t***

Whether or not to clean up the tokenization spaces.

#### Outputs

***sentences: tensor(string)***

The decoded sentences.

#### Examples

<details>
<summary>example 1</summary>

```python
import transformers

def get_file_content(path):
    with open(path, "rb") as file:
        return file.read()

bert_cased_tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-cased')
# save_vocabulary writes the vocabulary to ./bert-vocab.txt
bert_cased_tokenizer.save_vocabulary('.', 'bert')


node = onnx.helper.make_node(
    'BertTokenizerDecoder',
    inputs=['token_ids'],
    outputs=['sentences'],
    vocab_file=get_file_content("bert-vocab.txt")
)

text = "Hello world louder"
# the decoder consumes token ids, so convert the string tokens first
token_ids = np.array(
    bert_cased_tokenizer.convert_tokens_to_ids(bert_cased_tokenizer.tokenize(text)),
    dtype=np.int64)
sentences = np.array(text)


expect(node, inputs=[token_ids],
       outputs=[sentences], name='test_bert_tokenizer_decoder')
```
</details>

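The role of `suffix_indicator` and of the optional `indices` input can be illustrated with a sketch that joins WordPiece token strings back into text. It simplifies in several ways. It works on token strings rather than ids. It treats `end_position` as exclusive, which this document does not specify. It ignores special tokens and space clean-up. `decode_pieces` is a hypothetical helper.

```python
def decode_pieces(tokens, suffix_indicator="##", span=None):
    """Join WordPiece tokens into text, optionally restricted to one span."""
    if span is not None:
        start, end = span  # assumed half-open: [start, end)
        tokens = tokens[start:end]
    text = ""
    for tok in tokens:
        if tok.startswith(suffix_indicator):
            text += tok[len(suffix_indicator):]  # glue onto the previous piece
        else:
            text += (" " if text else "") + tok
    return text

pieces = ["un", "##want", "##ed", "runn", "##ing"]
print(decode_pieces(pieces))
print(decode_pieces(pieces, span=(3, 5)))
```
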