microsoft/onnxruntime-extensions

v0.4.2

docs/custom_text_ops.md

## Operator Schemas

### Auxiliary String Operator

|Operator|Support State|
|------------|-----------------|
|StringEqual|Supported|
|StringHash|Supported|
|StringToHashBucketFast|Supported|
|StringJoin|Supported|
|StringRegexReplace|Supported|
|StringRegexSplit|Supported|
|StringSplit|Supported|
|StringUpper|Supported|
|StringLength|Supported|
|StringConcat|Supported|
|StringRegexSplitWithOffsets|Supported|
|VectorToString|Supported|
|StringToVector|Supported|
|StringSlice|Under development|

### Tokenizer

|Operator|Support State|
|------------|-----------------|
|GPT2Tokenizer|Supported|
|WordpieceTokenizer|Supported|
|XLNetTokenizer|Under development|
|SentencepieceTokenizer|Supported|

## Auxiliary String Operator

[TODO: Add existing operators]

### <a name="StringRegexReplace"></a><a name="StringRegexReplace">StringRegexReplace</a>

Replaces substrings that match a regular expression.

#### Inputs

*text: tensor(string)*

String tensor in which the replacements are performed.

*pattern: tensor(string)*

Pattern of the regular expression.

*rewrite: tensor(string)*

Replacement.

#### Attributes

*global_replace: int64* (default is 1)

Whether to replace every match of the pattern (1, the default) or only the first match (0).

#### Outputs

*output: tensor(string)*

String with replacements.

#### Examples

<details>
<summary>StringRegexReplace</summary>

```python
import numpy as np
import onnx

node = onnx.helper.make_node(
    'StringRegexReplace',
    inputs=['text', 'pattern', 'rewrite'],
    outputs=['y'],
)

text = np.array([['def myfunc():'], ['def dummy():']])
# The capture group holds the function name, referenced as \1 in the rewrite.
pattern = np.array([r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):'])
rewrite = np.array([r'static PyObject* py_\1(void) {'])
y = [['static PyObject* py_myfunc(void) {'],
     ['static PyObject* py_dummy(void) {']]

# `expect` is assumed to be a test helper defined elsewhere.
expect(node, inputs=[text, pattern, rewrite], outputs=[y],
       name='test_string_regex_replace')
```

</details>
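
To make the role of `global_replace` concrete, the following is a minimal sketch of the semantics described above, written with Python's `re` module. The helper name `string_regex_replace` is hypothetical, and Python's regular expression dialect is only assumed to be close enough to the operator's for simple patterns like this one.

```python
import re

import numpy as np


def string_regex_replace(text, pattern, rewrite, global_replace=1):
    """Hypothetical reference helper, not the operator's implementation."""
    regex = re.compile(pattern)
    # In re.sub, count=0 replaces every match and count=1 only the first one.
    count = 0 if global_replace else 1
    flat = [regex.sub(rewrite, s, count=count) for s in text.ravel()]
    # Restore the shape of the input tensor.
    return np.array(flat, dtype=object).reshape(text.shape)


text = np.array([["def myfunc():"], ["def dummy():"]], dtype=object)
result = string_regex_replace(
    text,
    r"def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):",
    r"static PyObject* py_\1(void) {",
)
first_only = string_regex_replace(
    np.array(["a-a-a"], dtype=object), "a", "b", global_replace=0)
```

Here `result` holds the two rewritten declarations, and `first_only` shows that only the first `a` is replaced when `global_replace` is 0.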

### <a name="StringRegexSplit"></a><a name="StringRegexSplit">StringRegexSplit</a>

Splits strings based on regular expressions.

#### Inputs

*text: tensor(string)*

String tensor to split.

*delim_regex_pattern: tensor(string)*

Regular expression that matches the delimiters to split on.

*keep_delim_regex_pattern: tensor(string)*

By default, delimiters are not included in the split results. Delimiters that match the regular expression `keep_delim_regex_pattern` are included.

#### Outputs

*words: tensor(string)* Tensor of words.

*offsets: tensor(int64)* 2D tensor with 3 columns:
sentence index, position of the first character, position of the last one (excluded)

*row_indices: tensor(int64)* Indices of every first token of input sentences.
`row_indices[i+1] - row_indices[i]` is the number of tokens in input `i`.
These are the updated row indices given as inputs or new ones if the second input is empty.

#### Examples

<details>
<summary>StringRegexSplit</summary>

```python
import numpy as np
import onnx

node = onnx.helper.make_node(
    'StringRegexSplit',
    inputs=['text', 'pattern', 'keep_pattern'],
    outputs=['y', 'begin_end', 'indices'],
)

text = np.array(["hello there"])
pattern = np.array([r'\s'])
# Delimiters matching this pattern are kept as tokens of their own.
keep_pattern = np.array([r'\s'])
y = np.array(["hello", " ", "there"])
z1 = np.array([[0, 0, 5],
               [0, 5, 6],
               [0, 6, 11]], dtype=np.int64)
# One input sentence holding three tokens.
z2 = np.array([0, 3], dtype=np.int64)

# `expect` is assumed to be a test helper defined elsewhere.
expect(node, inputs=[text, pattern, keep_pattern], outputs=[y, z1, z2],
       name='test_string_regex_split')
```

</details>
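
The three outputs described above can be sketched in plain Python as follows. This is an illustration only: `regex_split_with_offsets` is a hypothetical helper, it uses Python's `re` module instead of the operator's regex engine, and it assumes the delimiter pattern never matches an empty string.

```python
import re

import numpy as np


def regex_split_with_offsets(sentences, delim_pattern, keep_pattern=""):
    """Hypothetical reference helper returning (words, offsets, row_indices)."""
    delim = re.compile(delim_pattern)
    keep = re.compile(keep_pattern) if keep_pattern else None
    words, offsets, row_indices = [], [], [0]
    for row, sentence in enumerate(sentences):
        pos = 0
        for match in delim.finditer(sentence):
            if match.start() > pos:  # text between two delimiters
                words.append(sentence[pos:match.start()])
                offsets.append([row, pos, match.start()])
            if keep is not None and keep.fullmatch(match.group()):
                # The delimiter matches the keep pattern: emit it as a token.
                words.append(match.group())
                offsets.append([row, match.start(), match.end()])
            pos = match.end()
        if pos < len(sentence):  # trailing text after the last delimiter
            words.append(sentence[pos:])
            offsets.append([row, pos, len(sentence)])
        # Index of the first token of the next sentence.
        row_indices.append(len(words))
    return (words,
            np.array(offsets, dtype=np.int64).reshape(-1, 3),
            np.array(row_indices, dtype=np.int64))


words, offsets, rows = regex_split_with_offsets(["hello there"], r"\s", r"\s")
```

Each row of `offsets` is `[sentence index, start, end (excluded)]`, and consecutive entries of `rows` delimit the tokens that belong to one input sentence.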

### <a name="StringConcat"></a><a name="StringConcat">StringConcat</a>

Concatenates the corresponding strings of two string tensors. The two input tensors must have the same shape.

```python
output = []
shape = input1.shape
input1 = input1.flatten()
input2 = input2.flatten()
for i in range(len(input1)):
    output.append(input1[i] + input2[i])
output = np.array(output).reshape(shape)
```

#### Inputs

*input_1: tensor(string)*

The first string tensor.

*input_2: tensor(string)*

The second string tensor.

#### Outputs

*output: tensor(string)*

The concatenated strings.

#### Examples

<details>
<summary>StringConcat</summary>

```python
import numpy as np
import onnx

node = onnx.helper.make_node(
    'StringConcat',
    inputs=['x', 'y'],
    outputs=['result'],
)

x = np.array(["abcd", "efgh"])
y = np.array(["wxyz", "stuv"])
result = np.array([x[0] + y[0], x[1] + y[1]])

# `expect` is assumed to be a test helper defined elsewhere.
expect(node, inputs=[x, y], outputs=[result],
       name='test_string_concat')
```

</details>

### <a name="StringSlice"></a><a name="StringSlice">StringSlice</a>

Performs a slice operation on each string element of the input tensor, similar to string slicing in Python:

```python
a = "abcdef"
b = a[1:2]
c = a[3:1:-1]
```

#### Inputs

*data: tensor(string)*

String tensor to extract slices from.

*starts: tensor(int64/int32)*

The start index for each string in `data`. It has the same shape as `data`.

*ends: tensor(int64/int32)*

The end index for each string in `data`. It has the same shape as `data`.

*steps(optional): tensor(int64/int32)*

The slice step for each string in `data`. It has the same shape as `data`. If `steps` is an empty tensor, the default step of 1 is used for each string.

#### Outputs

*output: tensor(string)*

Sliced data tensor.

#### Examples

<details>
<summary>string_slice</summary>

```python
import numpy as np
import onnx

node = onnx.helper.make_node(
    'StringSlice',
    inputs=['x', 'starts', 'ends', 'steps'],
    outputs=['y'],
)

x = np.array(["abcdef", "hijkl"])
y = np.array([x[0][1:3:1], x[1][3:1:-1]])
starts = np.array([1, 3], dtype=np.int64)
ends = np.array([3, 1], dtype=np.int64)
# The second string is sliced backwards, matching x[1][3:1:-1] above.
steps = np.array([1, -1], dtype=np.int64)

# `expect` is assumed to be a test helper defined elsewhere.
expect(node, inputs=[x, starts, ends, steps], outputs=[y],
       name='test_string_slice')
```

</details>
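
As a rough illustration of the element-wise behaviour, the sketch below applies Python slicing to every string with its own start, end and step. `string_slice` is a hypothetical helper written for this document, and the operator itself is still listed as under development, so details may differ.

```python
import numpy as np


def string_slice(data, starts, ends, steps=None):
    """Hypothetical reference helper: data[i][starts[i]:ends[i]:steps[i]]."""
    if steps is None or steps.size == 0:
        # An empty `steps` tensor means a step of 1 for every string.
        steps = np.ones_like(starts)
    sliced = [
        s[b:e:k]
        for s, b, e, k in zip(data.ravel(), starts.ravel(),
                              ends.ravel(), steps.ravel())
    ]
    return np.array(sliced, dtype=object).reshape(data.shape)


x = np.array(["abcdef", "hijkl"], dtype=object)
starts = np.array([1, 3], dtype=np.int64)
ends = np.array([3, 1], dtype=np.int64)
steps = np.array([1, -1], dtype=np.int64)
y = string_slice(x, starts, ends, steps)
```

With a negative step the start index must be greater than the end index, exactly as in Python; otherwise the slice is empty.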

### <a name="StringLength"></a><a name="StringLength">StringLength</a>

Gets the length of each string element in the input tensor, similar to `len("abcde")` in Python.

#### Inputs

*data: tensor(string)*

String tensor whose elements are measured.

#### Outputs

*output: tensor(int64)*

Data length tensor.

#### Examples

<details>
<summary>string_length</summary>

```python
import numpy as np
import onnx

node = onnx.helper.make_node(
    'StringLength',
    inputs=['x'],
    outputs=['y']
)

x = ["abcdef", "hijkl"]
y = np.array([len(x[0]), len(x[1])], dtype=np.int64)

# `expect` is assumed to be a test helper defined elsewhere.
expect(node, inputs=[x], outputs=[y],
       name='test_string_length')
```
</details>


### <a name="StringToVector"></a><a name="StringToVector">StringToVector</a>

StringToVector maps each string element in the input to the corresponding vector according to the mapping file. The mapping file is a UTF-8 encoded text file in TSV format:

`<string>\t<scalar_1>\s<scalar_2>\s<scalar_3>...<scalar_n>`

An unmapped string outputs the value of the attribute `unmapping_value`.

Example:

Attributes:

- `mapping_file_name`: vocabulary.txt
```
a 0 0 1 2
b 0 1 2 3
d 0 1 3 4
```

- `unmapping_value`: [0 0 0 0]

Inputs:
- data: ["a", "d", "e"]

Outputs:
- output: [[0,0,1,2],[0,1,3,4],[0,0,0,0]]

#### Attributes

*mapping_file_name:string*

The name of your string to vector mapping file.

*unmapping_value:list(int)*

The mapping result for an unmapped string.

#### Inputs

*data: tensor(string)*

Input tensor

#### Outputs

*output: tensor(T)*

The mapping result of the input

#### Type Constraints
*T:tensor(uint8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(int8), tensor(int16), tensor(int32), tensor(int64), tensor(bfloat16), tensor(float16), tensor(float), tensor(double), tensor(bool)*

Constrain input and output types to numerical tensors.

#### Examples

<details>
<summary>string_to_vector</summary>

```python
import numpy as np
import onnx

# what's in vocabulary.txt

mapping_table = \
"""
a 0 0 1 2
b 0 1 2 3
d 0 1 3 4
"""

node = onnx.helper.make_node(
    'StringToVector',
    inputs=['x'],
    outputs=['y'],
    mapping_table=mapping_table,
    unmapping_value=[0,0,0,0]
)

x = ["a", "d", "e"]
y = np.array([[0,0,1,2],[0,1,3,4],[0,0,0,0]], dtype=np.int64)

# `expect` is assumed to be a test helper defined elsewhere.
expect(node, inputs=[x], outputs=[y],
       name='test_string_to_vector')
```

</details>
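
The lookup described above can be sketched in a few lines of Python. Both helpers below (`parse_mapping_table`, `string_to_vector`) are hypothetical names introduced for illustration; the sketch assumes integer vectors and a tab between the key and its space-separated values, as in the format line above.

```python
import numpy as np


def parse_mapping_table(table):
    """Parse '<string>\\t<v1> <v2> ...' lines into a dict (hypothetical helper)."""
    mapping = {}
    for line in table.splitlines():
        if not line.strip():
            continue  # skip empty lines
        key, values = line.split("\t", 1)
        mapping[key] = [int(v) for v in values.split()]
    return mapping


def string_to_vector(data, mapping, unmapping_value):
    """Map each string to its vector, falling back to `unmapping_value`."""
    return np.array([mapping.get(s, unmapping_value) for s in data],
                    dtype=np.int64)


table = "a\t0 0 1 2\nb\t0 1 2 3\nd\t0 1 3 4\n"
vectors = string_to_vector(["a", "d", "e"], parse_mapping_table(table),
                           [0, 0, 0, 0])
```

The string `"e"` is absent from the table, so its row in `vectors` is the `unmapping_value`.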

### <a name="VectorToString"></a><a name="VectorToString">VectorToString</a>

VectorToString is the inverse operation of `StringToVector`; both share the same mapping table format:

`<string>\t<scalar_1>\s<scalar_2>\s<scalar_3>...<scalar_n>`

An unmapped vector outputs the value of the attribute `unk`.

Example:

Attributes:

- `map`:
```
a 0 0 1 2
b 0 1 2 3
d 0 1 3 4
```

- `unk`: "unknown_word"

Inputs:
- data: [[0,0,1,2],[0,1,3,4],[0,0,0,0]]

Outputs:
- output: ["a", "d", "unknown_word" ]

#### Attributes

*map*

The mapping table, in the format described above.

*unk*

The result returned when a vector is not found in the map.

#### Inputs

*data: tensor(T)*

Input tensor

#### Outputs

*output: tensor(string)*

The mapping result of the input

#### Type Constraints
*T:tensor(uint8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(int8), tensor(int16), tensor(int32), tensor(int64), tensor(bfloat16), tensor(float16), tensor(float), tensor(double), tensor(bool)*

Constrain input and output types to numerical tensors.

#### Examples

<details>
<summary>vector_to_string</summary>

```python
import numpy as np
import onnx

mapping_table = \
"""
a 0 0 1 2
b 0 1 2 3
d 0 1 3 4
"""

node = onnx.helper.make_node(
    'VectorToString',
    inputs=['x'],
    outputs=['y'],
    map=mapping_table,
    unk="unknown_word"
)

x = np.array([[0,0,1,2],[0,1,3,4],[0,0,0,0]], dtype=np.int64)
y = ["a", "d", "unknown_word"]

# `expect` is assumed to be a test helper defined elsewhere.
expect(node, inputs=[x], outputs=[y],
       name='test_vector_to_string')
```
</details>
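
The reverse lookup can be sketched the same way, keyed on the vector instead of the string. `parse_vector_map` and `vector_to_string` are hypothetical helpers, and the sketch again assumes integer vectors and a tab between the string and its values.

```python
import numpy as np


def parse_vector_map(table):
    """Build a {vector tuple: string} lookup (hypothetical helper)."""
    lookup = {}
    for line in table.splitlines():
        if not line.strip():
            continue  # skip empty lines
        word, values = line.split("\t", 1)
        # Tuples are hashable, so they can serve as dictionary keys.
        lookup[tuple(int(v) for v in values.split())] = word
    return lookup


def vector_to_string(data, lookup, unk):
    """Map each row of `data` to its string, falling back to `unk`."""
    return [lookup.get(tuple(int(v) for v in row), unk) for row in data]


table = "a\t0 0 1 2\nb\t0 1 2 3\nd\t0 1 3 4\n"
x = np.array([[0, 0, 1, 2], [0, 1, 3, 4], [0, 0, 0, 0]], dtype=np.int64)
strings = vector_to_string(x, parse_vector_map(table), "unknown_word")
```

The last row has no entry in the table, so it maps to the `unk` value.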

## Tokenizer

### <a name="GPT2Tokenizer"></a><a name="GPT2Tokenizer">GPT2Tokenizer</a>

GPT2Tokenizer performs byte-level BPE tokenization on the input tensor, based on the [Hugging Face version](https://huggingface.co/transformers/_modules/transformers/tokenization_gpt2.html).

#### Attributes

*vocab*

The content of the vocabulary file, in the same format as the [Hugging Face vocabulary file](https://huggingface.co/gpt2/resolve/main/vocab.json).

*merges*

The content of the merges file, in the same format as the [Hugging Face merges file](https://huggingface.co/gpt2/resolve/main/merges.txt).

*padding_length(optional)*

When the input is a batch of queries, the tokenized result is a ragged tensor, so it must be padded into a regular tensor; `padding_length` selects the padding strategy. When `padding_length` equals -1, the tensor is padded to the length of the longest row. When `padding_length` is greater than 0, the tensor is padded to length `padding_length`.

The default value of `padding_length` is -1.

#### Inputs

*data: tensor(string)*

The string tensor for tokenization

#### Outputs

*input_ids: tensor(int64)*

The tokenized ids of input

*attention_mask: tensor(int64)*

A tensor that indicates which parts of `input_ids` are padding.

#### Examples

<details>
<summary>gpt2tokenizer</summary>

```python
import numpy as np
import onnx

def get_file_content(path):
    with open(path, "rb") as file:
        return file.read()

# `vocabulary_file` and `merges_file` are the paths of the vocab and merges files.
node = onnx.helper.make_node(
    'GPT2Tokenizer',
    inputs=['x'],
    outputs=['y'],
    vocab=get_file_content(vocabulary_file),
    merges=get_file_content(merges_file)
)

x = ["hey cortana"]
y = np.array([20342, 12794, 2271], dtype=np.int64)

# `expect` is assumed to be a test helper defined elsewhere.
expect(node, inputs=[x], outputs=[y],
       name='test_gpt2_tokenizer')
```
</details>
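
The padding strategy selected by `padding_length` can be illustrated with a short sketch. Everything here is an assumption made for illustration: the helper name `pad_token_ids`, the pad id of 0, the mask convention (1 for real tokens, 0 for padding), and the truncation of rows longer than `padding_length`, none of which is specified above.

```python
import numpy as np


def pad_token_ids(rows, padding_length=-1, pad_id=0):
    """Pad ragged token rows into a 2D tensor (hypothetical helper)."""
    # -1 pads to the longest row; a positive value pads to that length.
    width = max(len(r) for r in rows) if padding_length == -1 else padding_length
    input_ids = np.full((len(rows), width), pad_id, dtype=np.int64)
    attention_mask = np.zeros((len(rows), width), dtype=np.int64)
    for i, row in enumerate(rows):
        kept = row[:width]  # assumption: longer rows are truncated
        input_ids[i, :len(kept)] = kept
        attention_mask[i, :len(kept)] = 1
    return input_ids, attention_mask


# The second row reuses an id from the first one purely for illustration.
ids, mask = pad_token_ids([[20342, 12794, 2271], [20342]])
```

With the default `padding_length` of -1, both rows are padded to three columns and `mask` marks the two padded positions of the second row with zeros.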


### <a name="WordpieceTokenizer"></a><a name="WordpieceTokenizer">WordpieceTokenizer</a>

WordpieceTokenizer performs WordPiece tokenization on the input tensor,
based on the [Hugging Face version](https://huggingface.co/transformers/model_doc/bert.html#WordpieceTokenizer).
[WordpieceTokenizer](https://github.com/tensorflow/text/blob/master/docs/api_docs/python/text/WordpieceTokenizer.md)
from tensorflow_text can be implemented by a pair of nodes:
StringRegexSplitWithOffsets followed by WordpieceTokenizer.

#### Attributes

*vocab*

The content of the vocabulary file, in the same format as the
[Hugging Face vocabulary file](https://huggingface.co/gpt2/resolve/main/vocab.json).

*suffix_indicator*

Suffix added to a token that is not in the first position before it is looked up in the vocabulary.

*unk_token*

Unknown token. Every token not found in the vocabulary is replaced by this one.

*max_input_chars_per_word*

Maximum number of characters per token (optional, defaults to 200).

#### Inputs

*data: tensor(string)*

The string tensor for tokenization

*row_indices: tensor(int64)* Empty or the indices of every first token of input sentences.
`row_indices[i+1] - row_indices[i]` is the number of tokens in input `i`.

[WordpieceTokenizer](https://github.com/tensorflow/text/blob/master/docs/api_docs/python/text/WordpieceTokenizer.md)
includes two steps. The first one splits sentences into words and the second one splits
every word into tokens. This operator only implements the second step.
The first one can be done with operator StringRegexSplit.
This parameter can either be empty or it can be the third output
of operator StringRegexSplit.

#### Outputs

*tokens: tensor(string)* Every token.

*token_indices: tensor(int32)* Indices of each token. -1 means a token outside the vocabulary.

*row_indices: tensor(int64)* Indices of every first token of input sentences.
`row_indices[i+1] - row_indices[i]` is the number of tokens in input `i`.
These are the updated row indices given as inputs or new ones if the second input is empty.

#### Examples

<details>
<summary>word_piece_tokenizer</summary>

```python
import json

import numpy as np
from onnx import helper, onnx_pb as onnx_proto

words = ["want", "##want",
         "##ed", "wa", "un", "runn", "##ing"]
vocab = {w: i + 10 for i, w in enumerate(words)}
st = json.dumps(vocab)
domain = 'ai.onnx.contrib'
mkv = helper.make_tensor_value_info
reg = helper.make_tensor(
    "pattern", onnx_proto.TensorProto.STRING, [1, ], ["(\\s)".encode('ascii')])
reg_empty = helper.make_tensor(
    "keep_pattern", onnx_proto.TensorProto.STRING, [0, ], [])

# First node splits sentences into words, second node splits words into tokens.
nodes = [
    helper.make_node(
        'StringRegexSplitWithOffsets',
        inputs=['text', 'pattern', 'keep_pattern'],
        outputs=['words', 'begin_end', 'indices'],
        name='StringRegexSplitOpName',
        domain=domain),
    helper.make_node(
        'WordpieceTokenizer',
        inputs=['words', 'indices'],
        outputs=['out0', 'out1', 'out2'],
        name='WordpieceTokenizerOpName',
        domain=domain,
        vocab=st.encode('utf-8'),
        suffix_indicator="##",
        unk_token="[UNK]")
]
inputs = [mkv('text', onnx_proto.TensorProto.STRING, [None])]
graph = helper.make_graph(
    nodes, 'test0', inputs, [
        mkv('out0', onnx_proto.TensorProto.STRING, [None]),
        mkv('out1', onnx_proto.TensorProto.INT32, [None]),
        mkv('out2', onnx_proto.TensorProto.INT64, [None]),
        mkv('words', onnx_proto.TensorProto.STRING, [None]),
        mkv('indices', onnx_proto.TensorProto.INT64, [None])],
    [reg, reg_empty])
model = helper.make_model(
    graph, opset_imports=[helper.make_operatorsetid(domain, 1)])

text = np.array(["unwanted running", "unwantedX running"], dtype=object)
tokens = np.array(['un', '##want', '##ed', 'runn', '##ing', 'un', '##want', '##ed',
                   '[UNK]', 'runn', '##ing'], dtype=object)
indices = np.array([14, 11, 12, 15, 16, 14, 11, 12, -1, 15, 16], dtype=np.int32)
row_indices = np.array([0, 5, 11], dtype=np.int64)

# `expect` is assumed to be a test helper defined elsewhere.
expect(model, inputs=[text], outputs=[tokens, indices, row_indices],
       name='test_bert_tokenizer')
```

</details>
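
For readers unfamiliar with WordPiece, the sketch below shows the usual greedy longest-match-first splitting of a single word, in the style of the BERT reference tokenizer. It is not taken from this operator's source: the function `wordpiece` is hypothetical, and it returns a single unknown token when any part of a word cannot be matched, whereas the example above lists `un`, `##want`, `##ed`, `[UNK]` for `unwantedX`, so the operator's handling of unmatched remainders may differ.

```python
def wordpiece(word, vocab, suffix_indicator="##", unk_token="[UNK]",
              max_input_chars_per_word=200):
    """Greedy longest-match-first WordPiece split of one word (sketch)."""
    if len(word) > max_input_chars_per_word:
        return [unk_token]
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces after the first carry the suffix indicator.
                candidate = suffix_indicator + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # Assumption: the whole word becomes unknown, as in BERT.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


vocab = {"want", "##want", "##ed", "wa", "un", "runn", "##ing"}
pieces = [wordpiece(w, vocab) for w in ["unwanted", "running", "xyz"]]
```

The vocabulary is the one used in the example above; `unwanted` and `running` split into known pieces while `xyz` has no match at all.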

### <a name="SentencepieceTokenizer"></a><a name="SentencepieceTokenizer">SentencepieceTokenizer</a>

SentencepieceTokenizer replicates [SentencepieceTokenizer](https://github.com/tensorflow/text/blob/master/docs/api_docs/python/text/SentencepieceTokenizer.md).

#### Inputs

*data: tensor(string)* The string tensor for tokenization

*nbest_size: tensor(int64)* A scalar for sampling.

- `nbest_size = {0,1}` (default): no sampling is performed.
- `nbest_size > 1`: samples from the `nbest_size` results.
- `nbest_size < 0`: assumes that `nbest_size` is infinite and samples from all hypotheses (lattice) using the forward-filtering-and-backward-sampling algorithm.

*alpha: tensor(float)* A scalar for a smoothing parameter. Inverse temperature for probability rescaling.

*reverse: tensor(bool)* Reverses the tokenized sequence (Default = false)

*add_bos: tensor(bool)* Add beginning of sentence token to the result (Default = false)

*add_eos: tensor(bool)* Add end of sentence token to the result (Default = false).
When reverse=True, beginning/end of sentence tokens are added after reversing.

#### Attributes

*model: string* The serialized sentencepiece model proto, stored as a string.

#### Outputs

*tokens: tensor(int32)* Indices of each token.

*indices: tensor(int64)* Indices of every first token of input sentences.
`indices[i+1] - indices[i]` is the number of tokens in input `i`.

Tokenized result of the input

#### Examples

<details>
<summary>example 1</summary>

```python
import base64
import urllib.request

import numpy as np
import onnx

url = "https://github.com/microsoft/ort-customops/raw/main/test/data/test_sentencepiece_ops_model__6.txt"
with urllib.request.urlopen(url) as f:
    content = f.read()
# `f.read()` already returns bytes, so it is decoded directly.
model = np.array(list(base64.decodebytes(content)), dtype=np.uint8)

node = onnx.helper.make_node(
    'SentencepieceTokenizer',
    inputs=['inputs', 'nbest_size', 'alpha', 'add_bos', 'add_eos', 'reverse'],
    outputs=['tokens', 'indices'],
    model=model
)

inputs = np.array(["Hello world", "Hello world louder"], dtype=object)
nbest_size = np.array([0], dtype=np.int64)
alpha = np.array([0], dtype=np.float32)
add_bos = np.array([0], dtype=np.bool_)
add_eos = np.array([0], dtype=np.bool_)
reverse = np.array([0], dtype=np.bool_)

tokens = np.array([17486, 1017, 17486, 1017, 155, 21869], dtype=np.int32)
indices = np.array([0, 2, 6], dtype=np.int64)

# `expect` is assumed to be a test helper defined elsewhere.
expect(node, inputs=[inputs, nbest_size, alpha, add_bos, add_eos, reverse],
       outputs=[tokens, indices], name='sp')
```
</details>
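
Because the tokens of all sentences come back in one flat tensor, the `indices` output is needed to recover the tokens of each sentence. A minimal sketch, using the expected values from the example above and a hypothetical helper `split_by_row_indices`:

```python
import numpy as np


def split_by_row_indices(tokens, indices):
    """Group a flat token tensor into one list per input sentence."""
    # indices[i] is the position of the first token of sentence i.
    return [tokens[indices[i]:indices[i + 1]].tolist()
            for i in range(len(indices) - 1)]


tokens = np.array([17486, 1017, 17486, 1017, 155, 21869], dtype=np.int32)
indices = np.array([0, 2, 6], dtype=np.int64)
per_sentence = split_by_row_indices(tokens, indices)
```

`per_sentence` then holds two lists, of 2 and 4 token ids, matching `indices[i+1] - indices[i]` for each input.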

### <a name="XLNetTokenizer"></a><a name="XLNetTokenizer">XLNetTokenizer</a>

XLNetTokenizer performs SentencePiece tokenization on the input tensor, based on the [Hugging Face version](https://huggingface.co/transformers/model_doc/xlnet.html#xlnettokenizer).

#### Inputs

*data: tensor(string)*

The string tensor for tokenization

#### Outputs

*output: tensor(int64)*

Tokenized result of the input

#### Examples

<details>
<summary>xlnet_tokenizer</summary>

```python

```
</details>
