microsoft/onnxruntime-extensions

1	`## Operator Schemas`
2
3	`### Auxiliary String Operator`
4
|Operator|Support State|
|------------|-----------------|
|StringEqual | Supported |
|StringHash | Supported |
|StringToHashBucketFast| Supported |
|StringJoin | Supported |
|StringRegexReplace| Supported |
|StringECMARegexReplace| Supported |
|StringSplit | Supported |
|StringUpper | Supported |
|StringLength | Supported |
|StringConcat | Supported |
|StringRegexSplitWithOffsets| Supported |
|StringECMARegexSplitWithOffsets| Supported |
|VectorToString| Supported |
|StringToVector| Supported |
|StringSlice | Under development |
|MaskedFill | Supported |
23
24	`### Tokenizer`
25
|Operator|Support State|
|------------|-----------------|
|GPT2Tokenizer| Supported |
|WordpieceTokenizer| Supported |
|SentencepieceTokenizer| Supported |
|BasicTokenizer| Supported |
|BertTokenizer| Supported |
|BertTokenizerDecoder| Supported |
34
35
36	`## Auxiliary String Operator`
37
38	`[TODO: Add existing operators]`
39
40	`### <a name="StringRegexReplace"></a><a name="StringRegexReplace">StringRegexReplace</a>`
41
42	`String replacement based on [Re2-format](https://github.com/google/re2/wiki/Syntax) regular expressions.`
43
44	`#### Inputs`
45
46	`*text: tensor(string)*`
47
String tensor in which matches are replaced.
49
50	`*pattern: tensor(string)*`
51
52	`Pattern of the regular expression.`
53
54	`*rewrite: tensor(string)*`
55
56	`Replacement.`
57
58	`#### Attributes`
59
60	`*global_replace: int64* (default is 1)`
61
Whether to replace every match of the pattern (1) or only the first one (0).
63
64	`#### Outputs`
65
66	`*output: tensor(string)*`
67
68	`String with replacements.`
69
70	`#### Examples`
71
72	`<details>`
73	`<summary>StringRegexReplace</summary>`
74
```python
import numpy as np
import onnx

node = onnx.helper.make_node(
    'StringRegexReplace',
    inputs=['text', 'pattern', 'rewrite'],
    outputs=['y'],
)

text = np.array([['def myfunc():'], ['def dummy():']])
# Capture the function name; \( and \) match the literal parentheses.
pattern = np.array([r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):'])
rewrite = np.array([r'static PyObject* py_\1(void) {'])
y = [['static PyObject* py_myfunc(void) {'],
     ['static PyObject* py_dummy(void) {']]

# `expect` is the test helper used by the examples in this document.
expect(node, inputs=[text, pattern, rewrite], outputs=[y],
       name='test_string_regex_replace')
```
92
93	`</details>`
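
The effect of `global_replace` can be illustrated with a small reference sketch. It uses Python's `re` module as a stand-in for RE2 (the two syntaxes overlap for simple patterns but are not identical), and the function name `string_regex_replace` is hypothetical, not part of the library.

```python
import re
import numpy as np

def string_regex_replace(text, pattern, rewrite, global_replace=1):
    """Reference sketch: apply one pattern/rewrite pair to every element of `text`."""
    regex = re.compile(pattern)
    count = 0 if global_replace else 1  # for re.sub, count=0 means "replace all"
    flat = [regex.sub(rewrite, s, count=count) for s in text.ravel()]
    return np.array(flat, dtype=object).reshape(text.shape)

text = np.array([["def myfunc():"], ["def dummy():"]], dtype=object)
pattern = r"def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):"
rewrite = r"static PyObject* py_\1(void) {"
result = string_regex_replace(text, pattern, rewrite)

# With global_replace=0 only the first match in each string is rewritten.
first_only = string_regex_replace(
    np.array(["a-b-c"], dtype=object), "-", "+", global_replace=0)
```

The output keeps the shape of `text`, which is how the example above expects a 2x1 result.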
94
95	`### <a name="StringECMARegexReplace"></a><a name="StringECMARegexReplace">StringECMARegexReplace</a>`
96
97	`String replacement based on [ECMA-format](https://en.cppreference.com/w/cpp/regex/ecmascript) regular expressions.`
98
99	`#### Inputs`
100
101	`*text: tensor(string)*`
102
String tensor in which matches are replaced.
104
105	`*pattern: tensor(string)*`
106
107	`Pattern of the regular expression.`
108
109	`*rewrite: tensor(string)*`
110
111	`Replacement.`
112
113	`#### Attributes`
114
115	`*global_replace: int64* (default is 1)`
116
Whether to replace every match of the pattern (1) or only the first one (0).

*ignore_case: int64* (default is 0)

Whether to match the pattern case-insensitively (1) or not (0).
123
124	`#### Outputs`
125
126	`*output: tensor(string)*`
127
128	`String with replacements.`
129
130	`#### Examples`
131
132	`<details>`
<summary>StringECMARegexReplace</summary>

```python
import numpy as np
import onnx

node = onnx.helper.make_node(
    'StringECMARegexReplace',
    inputs=['text', 'pattern', 'rewrite'],
    outputs=['y'],
)

text = np.array([['def myfunc():'], ['def dummy():']])
pattern = np.array([r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):'])
# ECMAScript syntax refers to the first capture group as $1.
rewrite = np.array([r'static PyObject* py_$1(void) {'])
y = [['static PyObject* py_myfunc(void) {'],
     ['static PyObject* py_dummy(void) {']]

expect(node, inputs=[text, pattern, rewrite], outputs=[y],
       name='test_string_ecma_regex_replace')
```
152
153	`</details>`
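
The two differences from StringRegexReplace visible in this section, the `$1` group syntax and the `ignore_case` attribute, can be approximated in plain Python. This is only a sketch: Python's `re` is not an ECMAScript engine, and `ecma_style_replace` is a hypothetical helper name.

```python
import re

def ecma_style_replace(text, pattern, rewrite, global_replace=1, ignore_case=0):
    """Approximate ECMAScript-style replacement with Python's `re` module."""
    flags = re.IGNORECASE if ignore_case else 0
    # ECMAScript writes group references as $1, $2, ...; Python uses \g<1>.
    # Backslashes are escaped first so they stay literal in the replacement.
    py_rewrite = re.sub(r"\$(\d+)", r"\\g<\1>", rewrite.replace("\\", "\\\\"))
    count = 0 if global_replace else 1  # count=0 means "replace all"
    return re.sub(pattern, py_rewrite, text, count=count, flags=flags)

converted = ecma_style_replace(
    "def myfunc():",
    r"def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):",
    "static PyObject* py_$1(void) {",
)
case_blind = ecma_style_replace("Hello HELLO", "hello", "bye", ignore_case=1)
case_aware = ecma_style_replace("Hello HELLO", "hello", "bye", ignore_case=0)
```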
154
155
156	`### <a name="StringRegexSplitWithOffsets"></a><a name="StringRegexSplitWithOffsets">StringRegexSplitWithOffsets</a>`
157
158	`Splits string based on regular expressions.`
159
160	`#### Inputs`
161
162	`*text: tensor(string)*`
163
String tensor to split.

*delim_regex_pattern: tensor(string)*

Regular expression that matches the delimiters to split on.

*keep_delim_regex_pattern: tensor(string)*

By default, delimiters are not included in the split results. Delimiters that match the regular expression `keep_delim_regex_pattern` are kept as separate words.

#### Outputs

*words: tensor(string)* Tensor of words.

*offsets: tensor(int64)* 2D tensor with 3 columns:
sentence index, position of the first character, position of the last one (excluded).

*row_indices: tensor(int64)* Indices of every first token of input sentences.
`row_indices[i+1] - row_indices[i]` is the number of tokens in input `i`.
183	`These are updates row indices given as inputs or new ones if the second input is empty.`
184
185
186	`#### Examples`
187
188	`<details>`
<summary>StringRegexSplitWithOffsets</summary>

```python
import numpy as np
import onnx

node = onnx.helper.make_node(
    'StringRegexSplitWithOffsets',
    inputs=['text', 'pattern', 'keep_pattern'],
    outputs=['words', 'offsets', 'row_indices'],
)

text = np.array(["hello there"])
pattern = np.array([r'\s'])
keep_pattern = np.array([r'\s'])  # keep the whitespace delimiter as a word
y = np.array(["hello", " ", "there"])
z1 = np.array([[0, 0, 5],
               [0, 5, 6],
               [0, 6, 11]], dtype=np.int64)
# Three tokens in input 0, following the row_indices definition above.
z2 = np.array([0, 3], dtype=np.int64)

expect(node, inputs=[text, pattern, keep_pattern], outputs=[y, z1, z2],
       name='test_string_regex_split_with_offsets')
```
211
212	`</details>`
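
How the three outputs relate to each other may be easier to see in a pure-Python sketch. It follows the output descriptions in this section, uses Python's `re` instead of the operator's regex engine, and the function name `regex_split_with_offsets` is hypothetical; the real kernel may differ in edge cases such as empty matches.

```python
import re
import numpy as np

def regex_split_with_offsets(texts, delim_pattern, keep_delim_pattern=""):
    """Split each string on `delim_pattern`; keep delimiters matching `keep_delim_pattern`."""
    delim = re.compile(delim_pattern)
    keep = re.compile(keep_delim_pattern) if keep_delim_pattern else None
    words, offsets, row_indices = [], [], [0]
    for row, text in enumerate(texts):
        pos = 0
        for m in delim.finditer(text):
            if m.start() > pos:  # text before the delimiter
                words.append(text[pos:m.start()])
                offsets.append([row, pos, m.start()])
            if keep is not None and m.end() > m.start() and keep.fullmatch(m.group()):
                words.append(m.group())  # delimiter kept as its own word
                offsets.append([row, m.start(), m.end()])
            pos = m.end()
        if pos < len(text):  # trailing text after the last delimiter
            words.append(text[pos:])
            offsets.append([row, pos, len(text)])
        row_indices.append(len(words))  # first token index of the next row
    return (np.array(words, dtype=object),
            np.array(offsets, dtype=np.int64).reshape(-1, 3),
            np.array(row_indices, dtype=np.int64))

words, offsets, row_indices = regex_split_with_offsets(["hello there"], r"\s", r"\s")
```

Each row of `offsets` is `[sentence index, start, end)`, and consecutive entries of `row_indices` delimit the words that belong to one input string.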
213
214	`### <a name="StringConcat"></a><a name="StringConcat">StringConcat</a>`
215
Concatenates the corresponding strings of two string tensors element-wise. The two input tensors must have the same shape.
217
```python
output = []
shape = input1.shape
input1 = input1.flatten()
input2 = input2.flatten()
for i in range(len(input1)):
    output.append(input1[i] + input2[i])
output = np.array(output).reshape(shape)
```
227
228	`#### Inputs`
229
230	`*input_1: tensor(string)*`
231
232	`The first string tensor.`
233
234	`*input_2: tensor(string)*`
235
236	`The second string tensor.`
237
238
239	`#### Outputs`
240
241	`*output: tensor(string)*`
242
243	`The result.`
244
245	`#### Examples`
246
247	`<details>`
248	`<summary>StringConcat</summary>`
249
250	```python
251
252	`node = onnx.helper.make_node(`
253	`'StringConcat',`
254	`inputs=['x', 'y'],`
255	`outputs=['result'],`
256	`)`
257
258	`x = np.array(["abcd", "efgh"])`
259	`y = np.array(["wxyz", "stuv"])`
260	`result = np.array([x[0] + y[0], x[1] + y[1]])`
261
262	`expect(node, inputs=[x, y], outputs=[result],`
263	`name='test_string_concat')`
264	```
265
266	`</details>`
267
268	`### <a name="StringSlice"></a><a name="StringSlice">StringSlice</a>`
269
Slices each string element of the input tensor, similar to string slicing in Python:
271
272	```python
273	`a = "abcdef"`
274	`b = a[1:2]`
275	`c = a[3:1:-1]`
276	```
277
278	`#### Inputs`
279
280	`*data: tensor(string)*`
281
282	`String tensor to extract slices from.`
283
284	`*starts: tensor(int64/int32)*`
285
The tensor of start indices, one per string in `data`; it has the same shape as `data`.

*ends: tensor(int64/int32)*

The tensor of end indices, one per string in `data`; it has the same shape as `data`.

*steps(optional): tensor(int64/int32)*

The tensor of slice steps, one per string in `data`; it has the same shape as `data`. If `steps` is an empty tensor, the default step 1 is used for every string.
295
296	`#### Outputs`
297
298	`*output: tensor(string)*`
299
300	`Sliced data tensor.`
301
302	`#### Examples`
303
304	`<details>`
305	`<summary>string_slice</summary>`
306
```python
import numpy as np
import onnx

node = onnx.helper.make_node(
    'StringSlice',
    inputs=['x', 'starts', 'ends', 'steps'],
    outputs=['y'],
)

x = np.array(["abcdef", "hijkl"])
y = np.array([x[0][1:3:1], x[1][3:1:-1]])
starts = np.array([1, 3], dtype=np.int64)
ends = np.array([3, 1], dtype=np.int64)
steps = np.array([1, -1], dtype=np.int64)  # must match the steps used to build y

expect(node, inputs=[x, starts, ends, steps], outputs=[y],
       name='test_string_slice')
```
325
326	`</details>`
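
The per-element semantics described above can be sketched with NumPy. This is a reference sketch under the assumption that the operator mirrors Python slicing exactly; `string_slice` is a hypothetical helper, not the library API, and the operator itself is listed as under development.

```python
import numpy as np

def string_slice(data, starts, ends, steps=None):
    """Slice every string with its own start/end/step, like Python's s[start:end:step]."""
    if steps is None or steps.size == 0:
        steps = np.ones(data.shape, dtype=np.int64)  # default step of 1
    out = [s[int(b):int(e):int(k)]
           for s, b, e, k in zip(data.ravel(), starts.ravel(),
                                 ends.ravel(), steps.ravel())]
    return np.array(out, dtype=object).reshape(data.shape)

x = np.array(["abcdef", "hijkl"], dtype=object)
y = string_slice(x, np.array([1, 3]), np.array([3, 1]), np.array([1, -1]))
y_default = string_slice(x, np.array([1, 3]), np.array([3, 5]), np.array([]))
```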
327
328	`### <a name="StringLength"></a><a name="StringLength">StringLength</a>`
329
Gets the length of each string element in the input tensor, similar to the function `len("abcde")` in Python.
331
332	`#### Inputs`
333
334	`*data: tensor(string)*`
335
String tensor whose elements are measured.
337
338	`#### Outputs`
339
340	`*output: tensor(int64)*`
341
342	`Data length tensor.`
343
344	`#### Examples`
345
346	`<details>`
347	`<summary>string_length</summary>`
348
349	```python
350
351	`node = onnx.helper.make_node(`
352	`'StringLength',`
353	`inputs=['x'],`
354	`outputs=['y']`
355	`)`
356
357	`x = ["abcdef", "hijkl"]`
358	`y = np.array([len(x[0]), len(x[1])], dtype=np.int64)`
359
360
361	`expect(node, inputs=[x], outputs=[y],`
362	`name='test_string_length')`
363	```
364	`</details>`
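
A minimal NumPy sketch of the same computation follows. It counts Unicode code points, as Python's `len` does; whether the operator counts code points or bytes for non-ASCII input is not stated in this document, so treat that as an assumption. `string_length` is a hypothetical helper name.

```python
import numpy as np

def string_length(data):
    """Length of every element, counted like Python's len(), keeping the input shape."""
    lengths = [len(s) for s in data.ravel()]
    return np.array(lengths, dtype=np.int64).reshape(data.shape)

lengths = string_length(np.array(["abcdef", "hijkl"], dtype=object))
```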
365
366
367	`### <a name="StringToVector"></a><a name="StringToVector">StringToVector</a>`
368
StringToVector maps each string element of the input to the corresponding vector according to the mapping file. The mapping file is a UTF-8 encoded text file in TSV format:

`<string>\t<scalar_1>\s<scalar_2>\s<scalar_3>...<scalar_n>`

An unmapped string outputs the value of the attribute `unmapping_value`.
374
375	`Example:`
376
377	`Attributes:`
378
379	- `mapping_file_name`: vocabulary.txt
380	```
381	`a 0 0 1 2`
382	`b 0 1 2 3`
383	`d 0 1 3 4`
384	```
385
386	- `unmapping_value`: [0 0 0 0]
387
388	`Inputs:`
389	`- data: ["a", "d", "e"]`
390
Outputs:
392	`- output: [[0,0,1,2],[0,1,3,4],[0,0,0,0]]`
393
394	`#### Attributes`
395
396	`*mapping_file_name:string*`
397
398	`The name of your string to vector mapping file.`
399
400	`*unmapping_value:list(int)*`
401
402	`Mapping result for unmapped string`
403
404	`#### Inputs`
405
406	`*data: tensor(string)*`
407
408	`Input tensor`
409
410	`#### Outputs`
411
412	`*output: tensor(T)*`
413
414	`The mapping result of the input`
415
416	`#### Type Constraints`
417	`*T:tensor(uint8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(int8), tensor(int16), tensor(int32), tensor(int64), tensor(bfloat16), tensor(float16), tensor(float), tensor(double), tensor(bool)*`
418
419	`Constrain input and output types to numerical tensors.`
420
421	`#### Examples`
422
423	`<details>`
424	`<summary>string_to_vector</summary>`
425
426	```python
427	`# what's in vocabulary.txt`
428
429	`mapping_table = \`
430	`"""`
431	`a 0 0 1 2`
432	`b 0 1 2 3`
433	`d 0 1 3 4`
434	`"""`
435
436	`node = onnx.helper.make_node(`
437	`'StringToVector',`
438	`inputs=['x'],`
439	`outputs=['y'],`
440	`mapping_table=mapping_table,`
441	`unmapping_value=[0,0,0,0]`
442	`)`
443
444
445	`x = ["a", "d", "e"]`
y = np.array([[0,0,1,2],[0,1,3,4],[0,0,0,0]], dtype=np.int64)
447
448
449	`expect(node, inputs=[x], outputs=[y],`
450	`name='test_string_to_vector')`
451	```
452
453	`</details>`
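
The lookup described in this section can be sketched in a few lines of Python. The helper names `parse_mapping` and `string_to_vector` are hypothetical, and the parser assumes integer scalars and exactly the `<string>\t<scalars>` layout given above.

```python
import numpy as np

def parse_mapping(table):
    """Parse lines of the form '<string><TAB><scalar_1> <scalar_2> ...' into a dict."""
    mapping = {}
    for line in table.splitlines():
        if not line.strip():
            continue  # skip blank lines
        key, values = line.split("\t", 1)
        mapping[key] = [int(v) for v in values.split()]
    return mapping

def string_to_vector(data, mapping, unmapping_value):
    """Look up each string; unmapped strings yield `unmapping_value`."""
    return np.array([mapping.get(s, unmapping_value) for s in data], dtype=np.int64)

table = "a\t0 0 1 2\nb\t0 1 2 3\nd\t0 1 3 4\n"
output = string_to_vector(["a", "d", "e"], parse_mapping(table), [0, 0, 0, 0])
```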
454
455	`### <a name="VectorToString"></a><a name="VectorToString">VectorToString</a>`
456
VectorToString is the inverse operation of `StringToVector`; they share the same mapping table format:

`<string>\t<scalar_1>\s<scalar_2>\s<scalar_3>...<scalar_n>`

An unmapped vector outputs the value of the attribute `unk`.
462
463	`Example:`
464
465	`Attributes:`
466
467	- `map`:
468	```
469	`a 0 0 1 2`
470	`b 0 1 2 3`
471	`d 0 1 3 4`
472	```
473
474	- `unk`: "unknown_word"
475
476	`Inputs:`
477	`- data: [[0,0,1,2],[0,1,3,4],[0,0,0,0]]`
478
Outputs:
480	`- output: ["a", "d", "unknown_word" ]`
481
482	`#### Attributes`
483
*map*

The formatted mapping table.

*unk*

The result returned when a vector isn't found in the map.
491
492	`#### Inputs`
493
494	`*data: tensor(T)*`
495
496	`Input tensor`
497
498	`#### Outputs`
499
500	`*output: tensor(string)*`
501
502	`The mapping result of the input`
503
504	`#### Type Constraints`
505	`*T:tensor(uint8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(int8), tensor(int16), tensor(int32), tensor(int64), tensor(bfloat16), tensor(float16), tensor(float), tensor(double), tensor(bool)*`
506
507	`Constrain input and output types to numerical tensors.`
508
509
510	`#### Examples`
511
512	`<details>`
513	`<summary>vector_to_string</summary>`
514
515	```python
516	`mapping_table = \`
517	`"""`
518	`a 0 0 1 2`
519	`b 0 1 2 3`
520	`d 0 1 3 4`
521	`"""`
522
523	`node = onnx.helper.make_node(`
524	`'VectorToString',`
525	`inputs=['x'],`
526	`outputs=['y'],`
527	`map=mapping_table,`
528	`unk="unknown_word"`
529	`)`
530
531
x = np.array([[0,0,1,2],[0,1,3,4],[0,0,0,0]], dtype=np.int64)
533	`y = ["a", "d", "unknown_word"]`
534
535
536	`expect(node, inputs=[x], outputs=[y],`
537	`name='test_vector_to_string')`
538	```
539	`</details>`
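
The reverse lookup can be sketched the same way. `vector_to_string` is a hypothetical helper; it assumes integer scalars and a tab between the string and its vector, as in the format line above.

```python
import numpy as np

def vector_to_string(data, table, unk):
    """Map each row vector back to its string; unknown vectors yield `unk`."""
    reverse = {}
    for line in table.splitlines():
        if line.strip():
            key, values = line.split("\t", 1)
            reverse[tuple(int(v) for v in values.split())] = key
    return np.array(
        [reverse.get(tuple(int(v) for v in row), unk) for row in data],
        dtype=object)

table = "a\t0 0 1 2\nb\t0 1 2 3\nd\t0 1 3 4\n"
x = np.array([[0, 0, 1, 2], [0, 1, 3, 4], [0, 0, 0, 0]], dtype=np.int64)
y = vector_to_string(x, table, "unknown_word")
```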
540
541	`### <a name="MaskedFill"></a><a name="MaskedFill">MaskedFill</a>`
542
Fills elements of the input tensor with value where mask is True. The operator is similar to [`Tensor.masked_fill_`](https://pytorch.org/docs/stable/generated/torch.Tensor.masked_fill_.html#torch.Tensor.masked_fill_) in PyTorch.
544
545
546	`#### Inputs`
547
548	`*value: tensor(string)*`
549
The value to fill in with. Currently only the string type is supported, with scalar or vector (1-D) shape.

*mask: tensor(bool)*

The boolean mask. Its shape must be the same as that of `value`.
555
556	`#### Outputs`
557
558	`*output: tensor(string)*`
559
560	`The filled output of input tensor.`
561
562
563	`#### Examples`
564
565	`<details>`
<summary>masked_fill</summary>
567
568	```python
569
570	`node = onnx.helper.make_node(`
571	`'MaskedFill',`
572	`inputs=['value', 'mask'],`
573	`outputs=['output']`
574	`)`
575
576
577	`value = np.array(["a", "b", "c", "d"])`
578	`mask = np.array([True, False, True, False], dtype=bool)`
579	`output = np.array(["a", "c"])`
580
581
582	`expect(node, inputs=[value, mask], outputs=[output],`
583	`name='test_masked_fill')`
584	```
585	`</details>`
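
Judging only by the example above, the output contains the elements of `value` whose mask entry is True. Under that assumption, the NumPy equivalent is boolean indexing; this is a sketch of the example's behavior, not a statement about the kernel's implementation.

```python
import numpy as np

value = np.array(["a", "b", "c", "d"])
mask = np.array([True, False, True, False], dtype=bool)

# Boolean indexing keeps the elements where mask is True, in order.
selected = value[mask]
```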
586
587	`## Tokenizer`
588
589	`### <a name="GPT2Tokenizer"></a><a name="GPT2Tokenizer">GPT2Tokenizer</a>`
590
GPT2Tokenizer performs byte-level BPE tokenization on the input tensor, based on the [Hugging Face version](https://huggingface.co/transformers/_modules/transformers/tokenization_gpt2.html).

#### Attributes

*vocab*

The content of the vocabulary file, in the same format as the [Hugging Face one](https://huggingface.co/gpt2/resolve/main/vocab.json).

*merges*

The content of the merges file, in the same format as the [Hugging Face one](https://huggingface.co/gpt2/resolve/main/merges.txt).

*padding_length(optional)*

When the input is a batch of queries, the tokenized result is a ragged tensor, so it must be padded into a rectangular tensor; `padding_length` selects the padding strategy. When `padding_length` equals -1, every row is padded to the length of the longest row. When `padding_length` is greater than 0, every row is padded to `padding_length`.

The default value of `padding_length` is -1.
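
The padding strategy can be sketched as follows. The pad id of 0, the truncation of rows longer than `padding_length`, and the helper name `pad_token_ids` are assumptions made for illustration; the document does not specify them.

```python
import numpy as np

def pad_token_ids(rows, padding_length=-1, pad_id=0):
    """Pad ragged token-id rows into a rectangular batch plus an attention mask."""
    # -1 means "pad to the longest row"; otherwise use the fixed length.
    width = max(len(r) for r in rows) if padding_length == -1 else padding_length
    input_ids = np.full((len(rows), width), pad_id, dtype=np.int64)
    attention_mask = np.zeros((len(rows), width), dtype=np.int64)
    for i, row in enumerate(rows):
        n = min(len(row), width)  # assumption: longer rows are truncated
        input_ids[i, :n] = row[:n]
        attention_mask[i, :n] = 1  # 1 marks real tokens, 0 marks padding
    return input_ids, attention_mask

ids, mask = pad_token_ids([[1, 2, 3], [4]])
ids5, mask5 = pad_token_ids([[1, 2, 3], [4]], padding_length=5)
```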
608
609	`#### Inputs`
610
611	`*data: tensor(string)*`
612
613	`The string tensor for tokenization`
614
615	`#### Outputs`
616
617	`*input_ids: tensor(int64)*`
618
619	`The tokenized ids of input`
620
621	`*attention_mask: tensor(int64)*`
622
A tensor indicating which part of `input_ids` is padding.
624
625	`#### Examples`
626
627	`<details>`
628	`<summary>gpt2tokenizer</summary>`
629
```python
import numpy as np
import onnx

def get_file_content(path):
    with open(path, "rb") as file:
        return file.read()

# vocabulary_file and merges_file are paths to the vocabulary and merges files.
node = onnx.helper.make_node(
    'GPT2Tokenizer',
    inputs=['x'],
    outputs=['y'],
    vocab=get_file_content(vocabulary_file),
    merges=get_file_content(merges_file)
)

x = ["hey cortana"]
y = np.array([20342, 12794, 2271], dtype=np.int64)

expect(node, inputs=[x], outputs=[y],
       name='test_gpt2_tokenizer')
```
649	`</details>`
650
651
652	`### <a name="WordpieceTokenizer"></a><a name="WordpieceTokenizer">WordpieceTokenizer</a>`
653
WordpieceTokenizer performs WordPiece tokenization on the input tensor,
based on the [Hugging Face version](https://huggingface.co/transformers/model_doc/bert.html#WordpieceTokenizer).
[WordpieceTokenizer](https://github.com/tensorflow/text/blob/master/docs/api_docs/python/text/WordpieceTokenizer.md)
from tensorflow_text can be implemented by a pair of nodes:
StringRegexSplitWithOffsets followed by WordpieceTokenizer.
660
661	`#### Attributes`
662
663	`*vocab*`
664
The content of the vocabulary file, in the same format as the
[Hugging Face one](https://huggingface.co/gpt2/resolve/main/vocab.json).

*suffix_indicator*

Marker added to every token that is not in the first position of a word before it is looked up in the vocabulary.

*unk_token*

The unknown token. Every token not found in the vocabulary is replaced by this one.
675
676	`*max_input_chars_per_word*`
677
678	`Maximum number of characters per token (optional, defaults to 200).`
679
680	`#### Inputs`
681
682	`*data: tensor(string)*`
683
684	`The string tensor for tokenization`
685
*row_indices: tensor(int64)* Empty, or the indices of every first token of input sentences.
`indices[i+1] - indices[i]` is the number of tokens in input `i`.

[WordpieceTokenizer](https://github.com/tensorflow/text/blob/master/docs/api_docs/python/text/WordpieceTokenizer.md)
includes two steps. The first one splits sentences into words, and the second one splits
every word into tokens. This operator only implements the second step.
The first one can be done with operator StringRegexSplitWithOffsets.
This parameter can either be empty or it can be the third output
of operator StringRegexSplitWithOffsets.
695
696	`#### Outputs`
697
698	`*tokens: tensor(string)* Every token.`
699
700	`*token_indices: tensor(int32)* Indices of each token. -1 means a token outside the vocabulary.`
701
702	`*row_indices: tensor(int64)* Indices of every first token of input sentences.`
703	`indices[i+1] - indices[i]` is the number of tokens in input `i`.
These are the updated row indices given as input, or new ones if the second input is empty.
705
706	`#### Examples`
707
708	`<details>`
709	`<summary>word_piece_tokenizer</summary>`
710
```python
import json
import numpy as np
from onnx import helper, onnx_pb as onnx_proto

words = ["want", "##want",
         "##ed", "wa", "un", "runn", "##ing"]
vocab = {w: i + 10 for i, w in enumerate(words)}
st = json.dumps(vocab)
domain = 'ai.onnx.contrib'  # custom-op domain shared by both nodes
mkv = helper.make_tensor_value_info
reg = helper.make_tensor(
    "pattern", onnx_proto.TensorProto.STRING, [1, ], ["(\\s)".encode('ascii')])
reg_empty = helper.make_tensor(
    "keep_pattern", onnx_proto.TensorProto.STRING, [0, ], [])

nodes = [
    helper.make_node(
        'StringRegexSplitWithOffsets',
        inputs=['text', 'pattern', 'keep_pattern'],
        outputs=['words', 'begin_end', 'indices'],
        name='StringRegexSplitOpName',
        domain=domain),
    helper.make_node(
        'WordpieceTokenizer',
        inputs=['words', 'indices'],
        outputs=['out0', 'out1', 'out2'],
        name='WordpieceTokenizerOpName',
        domain=domain,
        vocab=st.encode('utf-8'),
        suffix_indicator="##",
        unk_token="[UNK]")
]
inputs = [mkv('text', onnx_proto.TensorProto.STRING, [None])]
graph = helper.make_graph(
    nodes, 'test0', inputs, [
        mkv('out0', onnx_proto.TensorProto.STRING, [None]),
        mkv('out1', onnx_proto.TensorProto.INT32, [None]),
        mkv('out2', onnx_proto.TensorProto.INT64, [None]),
        mkv('words', onnx_proto.TensorProto.STRING, [None]),
        mkv('indices', onnx_proto.TensorProto.INT64, [None])],
    [reg, reg_empty])
model = helper.make_model(
    graph, opset_imports=[helper.make_operatorsetid(domain, 1)])

text = np.array(["unwanted running", "unwantedX running"], dtype=object)
tokens = np.array(['un', '##want', '##ed', 'runn', '##ing', 'un', '##want', '##ed',
                   '[UNK]', 'runn', '##ing'], dtype=object)
indices = np.array([14, 11, 12, 15, 16, 14, 11, 12, -1, 15, 16], dtype=np.int32)
row_indices = np.array([0, 5, 11], dtype=np.int64)

expect(model, inputs=[text], outputs=[tokens, indices, row_indices],
       name='test_wordpiece_tokenizer')
```
761
762	`</details>`
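
The second step, splitting one word into vocabulary pieces, is commonly done with a greedy longest-match-first search. The sketch below is an illustration of that technique, not the operator's source. The handling of an unmatched remainder (emit `unk_token` and stop) is inferred from the example output above, where `unwantedX` yields `un`, `##want`, `##ed`, `[UNK]`; the Hugging Face reference instead replaces the whole word. The function name `wordpiece` is hypothetical.

```python
def wordpiece(word, vocab, suffix_indicator="##", unk_token="[UNK]",
              max_input_chars_per_word=200):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    if len(word) > max_input_chars_per_word:
        return [unk_token]
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:  # try the longest remaining substring first
            candidate = word[start:end]
            if start > 0:
                candidate = suffix_indicator + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # Assumption inferred from the example in this document.
            pieces.append(unk_token)
            break
        pieces.append(piece)
        start = end
    return pieces

vocab = {"want", "##want", "##ed", "wa", "un", "runn", "##ing"}
known = wordpiece("unwanted", vocab)
partial = wordpiece("unwantedX", vocab)
```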
763
764	`### <a name="SentencepieceTokenizer"></a><a name="SentencepieceTokenizer">SentencepieceTokenizer</a>`
765
766	`SentencepieceTokenizer replicates [SentencepieceTokenizer](https://github.com/tensorflow/text/blob/master/docs/api_docs/python/text/SentencepieceTokenizer.md).`
767
768	`#### Inputs`
769
770	`*data: tensor(string)* The string tensor for tokenization`
771
*nbest_size: tensor(int64)* A scalar for sampling. nbest_size = {0,1}: no sampling is performed (default).
nbest_size > 1: samples from the nbest_size results. nbest_size < 0: assumes that
nbest_size is infinite and samples from all hypotheses (lattice) using the
forward-filtering-and-backward-sampling algorithm.
776
777	`*alpha: tensor(float)* A scalar for a smoothing parameter. Inverse temperature for probability rescaling.`
778
779	`*reverse: tensor(bool)* Reverses the tokenized sequence (Default = false)`
780
781	`*add_bos: tensor(bool)* Add beginning of sentence token to the result (Default = false)`
782
783	`*add_eos: tensor(bool)* Add end of sentence token to the result (Default = false).`
784	`When reverse=True beginning/end of sentence tokens are added after reversing.`
785
786	`#### Attributes`
787
*model: string* The serialized sentencepiece model proto, stored as a string.
789
790	`#### Outputs`
791
792	`*tokens: tensor(int32)* Indices of each token.`
793
794	`*indices: tensor(int64)* Indices of every first token of input sentences.`
795	`indices[i+1] - indices[i]` is the number of tokens in input `i`.
796
797	`Tokenized result of the input`
798
799	`#### Examples`
800
801	`<details>`
802	`<summary>example 1</summary>`
803
```python
import base64
import urllib.request
import numpy as np
import onnx

url = "https://github.com/microsoft/ort-customops/raw/main/test/data/test_sentencepiece_ops_model__6.txt"
with urllib.request.urlopen(url) as f:
    content = f.read()  # bytes holding the base64-encoded model
model = np.array(list(base64.decodebytes(content)), dtype=np.uint8)

node = onnx.helper.make_node(
    'SentencepieceTokenizer',
    inputs=['inputs', 'nbest_size', 'alpha', 'add_bos', 'add_eos', 'reverse'],
    outputs=['tokens', 'indices'],
    model=model
)

inputs = np.array(["Hello world", "Hello world louder"], dtype=object)
nbest_size = np.array([0], dtype=np.int64)
alpha = np.array([0], dtype=np.float32)
add_bos = np.array([0], dtype=np.bool_)
add_eos = np.array([0], dtype=np.bool_)
reverse = np.array([0], dtype=np.bool_)

tokens = np.array([17486, 1017, 17486, 1017, 155, 21869], dtype=np.int32)
indices = np.array([0, 2, 6], dtype=np.int64)

expect(node, inputs=[inputs, nbest_size, alpha, add_bos, add_eos, reverse],
       outputs=[tokens, indices], name='sp')
```
833	`</details>`
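
Because `tokens` is a flat tensor, the `indices` output is what recovers the per-sentence grouping. A short NumPy sketch, using the values from the example above:

```python
import numpy as np

tokens = np.array([17486, 1017, 17486, 1017, 155, 21869], dtype=np.int32)
indices = np.array([0, 2, 6], dtype=np.int64)

# Sentence i owns tokens[indices[i]:indices[i + 1]].
rows = [tokens[start:end].tolist()
        for start, end in zip(indices[:-1], indices[1:])]
# The number of tokens per sentence is the difference of consecutive indices.
lengths = np.diff(indices).tolist()
```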
834
835	`### <a name="BasicTokenizer"></a><a name="BasicTokenizer">BasicTokenizer</a>`
836
BasicTokenizer performs basic tokenization on the input string tensor, based on the [basic tokenizer in BertTokenizer (Hugging Face version)](https://huggingface.co/transformers/_modules/transformers/models/bert/tokenization_bert.html#BertTokenizer).
838
839	`#### Inputs`
840
841	`*text: tensor(string)* The string tensor for tokenization`
842
843	`#### Attributes`
844
845	`*do_lower_case: int64_t* (default is 1, 1 represents True, 0 represents False)`
846
847	`Whether or not to lowercase the input when tokenizing.`
848
849	`*tokenize_chinese_chars: int64_t* (default is 1, 1 represents True, 0 represents False)`
850
851	`Whether or not to tokenize Chinese characters.`
852
853	`*strip_accents: int64_t* (default is 1, 1 represents True, 0 represents False)`
854
Whether or not to strip all accents. If this option is not specified, then it will be determined by the
value of `do_lower_case` (as in the original BERT).
857
858	`*tokenize_punctuation: int64_t* (default is 0, 1 represents True, 0 represents False)`
859
860	`Splits punctuation on a piece of text.`
861
862	`*remove_control_chars: int64_t* (default is 0, 1 represents True, 0 represents False)`
863
Removes control characters (such as NUL, BEL) from the text.
865
866	`#### Outputs`
867
868	`*tokens: tensor(string)* Tokenized tokens.`
869
870	`#### Examples`
871
872	`<details>`
873	`<summary>example 1</summary>`
874
```python
import numpy as np
import onnx
import transformers

tokenizer = transformers.BasicTokenizer()

node = onnx.helper.make_node(
    'BasicTokenizer',
    inputs=['text'],
    outputs=['tokens'],
)

text = "Hello world louder"
inputs = np.array([text], dtype=object)
# BasicTokenizer.tokenize returns a list of token strings.
tokens = np.array(tokenizer.tokenize(text), dtype=object)

expect(node, inputs=[inputs],
       outputs=[tokens], name='test_basic_tokenizer')
```
892	`</details>`
893
894	`### <a name="BertTokenizer"></a><a name="BertTokenizer">BertTokenizer</a>`
895
BertTokenizer replicates the `encode_plus` function of [BertTokenizer (Hugging Face version)](https://huggingface.co/transformers/_modules/transformers/models/bert/tokenization_bert.html#BertTokenizer).

#### Inputs
898
899	`*text: tensor(string)* The string tensor for tokenization`
900
901	`#### Attributes`
902
903	`*vocab_file: string*`
904
The content of the vocabulary file, in the same format as the Hugging Face one.
906
907	`*do_lower_case: int64_t* (default is 1, 1 represents True, 0 represents False)`
908
909	`Whether or not to lowercase the input when tokenizing.`
910
911	`*do_basic_tokenize: int64_t* (default is 1, 1 represents True, 0 represents False)`
912
913	`Whether or not to do basic tokenization before WordPiece.`
914
915	`*unk_token: string*`
916
917	`The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this`
918	`token instead.`
919
920	`*sep_token: string*`
921
922	`The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for`
923	`sequence classification or for a text and a question for question answering. It is also used as the last`
924	`token of a sequence built with special tokens.`
925
926	`*pad_token: string*`
927
928	`The token used for padding, for example when batching sequences of different lengths.`
929
930	`*cls_token: string*`
931
932	`The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.`
933
934	`*mask_token: string*`
935
936	`The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict.`
937
938	`*tokenize_chinese_chars: int64_t* (default is 1, 1 represents True, 0 represents False)`
939
940	`Whether or not to tokenize Chinese characters.`
941
942	`*strip_accents: int64_t* (default is 1, 1 represents True, 0 represents False)`
943
Whether or not to strip all accents. If this option is not specified, then it will be determined by the
value of `do_lower_case` (as in the original BERT).
946
947	`*tokenize_punctuation: int64_t* (default is 0, 1 represents True, 0 represents False)`
948
949	`Splits punctuation on a piece of text.`
950
951	`*remove_control_chars: int64_t* (default is 0, 1 represents True, 0 represents False)`
952
Removes control characters (such as NUL, BEL) from the text.
954
955	`*truncation_strategy_name: string*`
956
The name of the truncation strategy. It can be `longest_first`, `only_first`, `only_second`, or `longest_from_back`.
958
959	`#### Outputs`
960
*input_ids: tensor(int64_t)*

List of token ids.

*token_type_ids: tensor(int64_t)*

List of token type ids.

*attention_mask: tensor(int64_t)*

List of indices specifying which tokens should be attended to by the model.
973
974
975	`#### Examples`
976
977	`<details>`
978	`<summary>example 1</summary>`
979
```python
import numpy as np
import onnx
import transformers

bert_cased_tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-cased')

node = onnx.helper.make_node(
    'BertTokenizer',
    inputs=['text'],
    outputs=['input_ids', 'token_type_ids', 'attention_mask'],
)

text = "Hello world louder"
inputs = np.array([text], dtype=object)

# encode_plus returns a dict-like object holding the three fields below.
bert_tokenize_result = bert_cased_tokenizer.encode_plus(text)

input_ids = np.array(bert_tokenize_result['input_ids'], dtype=np.int64)
token_type_ids = np.array(bert_tokenize_result['token_type_ids'], dtype=np.int64)
attention_mask = np.array(bert_tokenize_result['attention_mask'], dtype=np.int64)

expect(node, inputs=[inputs],
       outputs=[input_ids, token_type_ids, attention_mask], name='test_bert_tokenizer')
```
1003	`</details>`
1004
1005
1006	`### <a name="BertTokenizerDecoder"></a><a name="BertTokenizerDecoder">BertTokenizerDecoder</a>`
1007
BertTokenizerDecoder replicates the `decode` function of [BertTokenizer (Hugging Face version)](https://huggingface.co/transformers/_modules/transformers/models/bert/tokenization_bert.html#BertTokenizer).

#### Inputs
1010
1011	`*token_ids: tensor(int64)*`
1012
1013	`List of tokenized input ids.`
1014
1015	`*indices: tensor(int64)*`
1016
List of `[start_position, end_position]` pairs indicating which segments of the input ids should be decoded. This input is only enabled when the attribute `use_indices`=1.
1018
1019	`Usually, it is used to decode the slot in the text.`
1020
1021	`#### Attributes`
1022
1023	`*vocab_file: string*`
1024
The content of the vocabulary file, in the same format as the Hugging Face one.
1026
1027	`*unk_token: string*`
1028
1029	`The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this`
1030	`token instead.`
1031
1032	`*sep_token: string*`
1033
1034	`The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for`
1035	`sequence classification or for a text and a question for question answering. It is also used as the last`
1036	`token of a sequence built with special tokens.`
1037
1038	`*pad_token: string*`
1039
1040	`The token used for padding, for example when batching sequences of different lengths.`
1041
1042	`*cls_token: string*`
1043
1044	`The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.`
1045
1046	`*mask_token: string*`
1047
1048	`The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict.`
1049
1050	`*suffix_indicator: string*`
1051
1052	`The suffix indicator.`
1053
1054	`*use_indices: int64_t*`
1055
Whether to use the second input (`indices`).
1057
1058	`*skip_special_tokens: int64_t*`
1059
1060	`Whether or not to remove special tokens in the decoding.`
1061
1062	`*clean_up_tokenization_spaces: int64_t*`
1063
1064	`Whether or not to clean up the tokenization spaces.`
1065
1066	`#### Outputs`
1067
*sentences: tensor(string)*
1069
1070	`The decoded sentences.`
1071
1072	`#### Examples`
1073
1074	`<details>`
1075	`<summary>example 1</summary>`
1076
```python
import numpy as np
import onnx
import transformers

def get_file_content(path):
    with open(path, "rb") as file:
        return file.read()

bert_cased_tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-cased')
# Writes the vocabulary to ./bert-vocab.txt.
bert_cased_tokenizer.save_vocabulary('.', 'bert')

node = onnx.helper.make_node(
    'BertTokenizerDecoder',
    inputs=['token_ids'],
    outputs=['sentences'],
    vocab_file=get_file_content("bert-vocab.txt")
)

text = "Hello world louder"
# The decoder consumes token ids, so convert the token strings first.
tokens = bert_cased_tokenizer.tokenize(text)
token_ids = np.array(bert_cased_tokenizer.convert_tokens_to_ids(tokens), dtype=np.int64)
sentences = np.array([text])

expect(node, inputs=[token_ids],
       outputs=[sentences], name='test_bert_tokenizer_decoder')
```
1103	`</details>`
1104
