## Operator Schemas

### Auxiliary String Operator

|**Operator**|**Support State**|
|------------|-----------------|
|StringEqual | Supported |
|StringHash | Supported |
|StringToHashBucketFast| Supported |
|StringJoin | Supported |
|StringRegexReplace| Supported |
|StringSplit | Supported |
|StringUpper | Supported |
|StringSlice | Under development |
|StringLength | Under development |
|StringToVector| Under development |
|VectorToString| Under development |

### Tokenizer

|**Operator**|**Support State**|
|------------|-----------------|
|GPT2Tokenizer| Supported |
|BertTokenizer| Under development |
|XLNetTokenizer| Under development |

## Auxiliary String Operator

[TODO: Add existing operators]

### <a name="StringSlice"></a><a name="StringSlice">**StringSlice**</a>

Slices each string element of the input tensor, with the same semantics as string slicing in Python:

```python
a = "abcdef"
b = a[1:2]     # "b"
c = a[3:1:-1]  # "dc"
```

#### Inputs

***data: tensor(string)***
<dd>String tensor to extract slices from.</dd>

***starts: tensor(int64/int32)***
<dd>Start index for each string in data. Has the same dimensions as data.</dd>

***ends: tensor(int64/int32)***
<dd>End index for each string in data. Has the same dimensions as data.</dd>

***steps(optional): tensor(int64/int32)***
<dd>Slice step for each string in data. Has the same dimensions as data. If steps is an empty tensor, the default value 1 is used for every string.</dd>

#### Outputs

***output: tensor(string)***
<dd>Sliced data tensor.</dd>

#### Examples

<details>
<summary>string_slice</summary>

```python
import numpy as np
import onnx

# `expect` is assumed to be an ONNX-style test helper; it is not defined here.
node = onnx.helper.make_node(
    'StringSlice',
    inputs=['x', 'starts', 'ends', 'steps'],
    outputs=['y'],
)

x = ["abcdef", "hijkl"]
y = [x[0][1:3:1], x[1][3:1:-1]]
starts = np.array([1, 3], dtype=np.int64)
ends = np.array([3, 1], dtype=np.int64)
# The second string is sliced backwards, so its step is -1.
steps = np.array([1, -1], dtype=np.int64)

expect(node, inputs=[x, starts, ends, steps], outputs=[y],
       name='test_string_slice')
```
</details>

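The per-element behaviour described for StringSlice can be sketched in plain Python with NumPy. This is only an illustration of the documented semantics, not the operator's implementation: `string_slice_reference` is a hypothetical helper name, and the sketch assumes that `starts`, `ends`, and `steps` have the same shape as `data` and that a missing or empty `steps` means a step of 1.

```python
import numpy as np


def string_slice_reference(data, starts, ends, steps=None):
    """Hypothetical reference: slice each string like Python's s[start:end:step]."""
    data = np.asarray(data, dtype=object)
    starts = np.asarray(starts, dtype=np.int64)
    ends = np.asarray(ends, dtype=np.int64)
    # Assumption: an omitted or empty `steps` means a step of 1 for every element.
    if steps is None or np.size(steps) == 0:
        steps = np.ones(data.shape, dtype=np.int64)
    else:
        steps = np.asarray(steps, dtype=np.int64)

    out = np.empty(data.shape, dtype=object)
    for idx in np.ndindex(data.shape):
        out[idx] = data[idx][int(starts[idx]):int(ends[idx]):int(steps[idx])]
    return out


result = string_slice_reference(
    ["abcdef", "hijkl"], starts=[1, 3], ends=[3, 1], steps=[1, -1]
)
print(result.tolist())  # ['bc', 'kj']
```

Iterating with `np.ndindex` keeps the sketch valid for inputs of any rank, since each string is paired with the start, end, and step at the same index.
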
### <a name="StringLength"></a><a name="StringLength">**StringLength**</a>

Returns the length of each string element in the input tensor, similar to `len("abcde")` in Python.

#### Inputs

***data: tensor(string)***
<dd>String tensor whose element lengths are computed.</dd>

#### Outputs

***output: tensor(int64)***
<dd>Tensor of string lengths.</dd>

#### Examples

<details>
<summary>string_length</summary>

```python
import numpy as np
import onnx

# `expect` is assumed to be an ONNX-style test helper; it is not defined here.
node = onnx.helper.make_node(
    'StringLength',
    inputs=['x'],
    outputs=['y']
)

x = ["abcdef", "hijkl"]
y = np.array([len(x[0]), len(x[1])], dtype=np.int64)

expect(node, inputs=[x], outputs=[y],
       name='test_string_length')
```
</details>

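As a rough illustration of StringLength, the sketch below applies Python's `len` to every element and returns an `int64` array of the same shape. `string_length_reference` is a hypothetical name. Python's `len` counts Unicode code points; whether the operator counts code points or bytes is not stated here, so treat that detail as an assumption.

```python
import numpy as np


def string_length_reference(data):
    """Hypothetical reference: length of each string, as Python's len() reports it."""
    data = np.asarray(data, dtype=object)
    out = np.empty(data.shape, dtype=np.int64)
    for idx in np.ndindex(data.shape):
        # Assumption: length is measured in Unicode code points, as len() does.
        out[idx] = len(data[idx])
    return out


lengths = string_length_reference([["abcdef", "hijkl"], ["", "\u00e9"]])
print(lengths.tolist())  # [[6, 5], [0, 1]]
```
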
### <a name="StringToVector"></a><a name="StringToVector">**StringToVector**</a>

StringToVector maps each string element in the input to its corresponding vector according to a mapping file. The mapping file is a UTF-8 encoded text file in TSV format:

    <string>\t<scalar_1>\s<scalar_2>\s<scalar_3>...<scalar_n>

Unmapped strings produce the value of the attribute `unmapping_value`.

Example:

*Attributes:*

- `mapping_file_name`: vocabulary.txt
  ```
  a 0 0 1 2
  b 0 1 2 3
  d 0 1 3 4
  ```

- `unmapping_value`: [0 0 0 0]

*Inputs:*
- data: ["a", "d", "e"]

*Outputs:*
- output: [[0,0,1,2],[0,1,3,4],[0,0,0,0]]

#### Attributes

***mapping_file_name:string***
<dd>The name of the string-to-vector mapping file.</dd>

***unmapping_value:list(int)***
<dd>Mapping result for unmapped strings.</dd>

#### Inputs

***data: tensor(string)***
<dd>Input tensor.</dd>

#### Outputs

***output: tensor(T)***
<dd>The mapping result of the input.</dd>

#### Type Constraints
***T:tensor(uint8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(int8), tensor(int16), tensor(int32), tensor(int64), tensor(bfloat16), tensor(float16), tensor(float), tensor(double), tensor(bool)***
<dd>Constrain output types to numerical tensors.</dd>

#### Examples

<details>
<summary>string_to_vector</summary>

```python
import numpy as np
import onnx

# what's in vocabulary.txt

# a 0 0 1 2
# b 0 1 2 3
# d 0 1 3 4

# `expect` is assumed to be an ONNX-style test helper; it is not defined here.
node = onnx.helper.make_node(
    'StringToVector',
    inputs=['x'],
    outputs=['y'],
    mapping_file_name='vocabulary.txt',
    unmapping_value=[0,0,0,0]
)

x = ["a", "d", "e"]
y = np.array([[0,0,1,2],[0,1,3,4],[0,0,0,0]], dtype=np.int64)

expect(node, inputs=[x], outputs=[y],
       name='test_string_to_vector')
```
</details>

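The lookup that StringToVector describes can be sketched as a dictionary built from the mapping file. The helper names `load_mapping` and `string_to_vector_reference` are hypothetical, and the sketch assumes integer scalars, a tab between the string and its scalars, whitespace between scalars, and a one-dimensional input. The vocabulary is inlined as a string so the example runs without a file on disk.

```python
import numpy as np


def load_mapping(text):
    """Parse '<string>\\t<scalar_1> <scalar_2> ...' lines into {string: [scalars]}."""
    mapping = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        key, values = line.split("\t", 1)
        # Assumption: scalars are integers separated by whitespace.
        mapping[key] = [int(v) for v in values.split()]
    return mapping


def string_to_vector_reference(data, mapping, unmapping_value):
    """Hypothetical reference: look up each string, fall back to unmapping_value."""
    return np.array(
        [mapping.get(s, unmapping_value) for s in data], dtype=np.int64
    )


# Same entries as the vocabulary.txt example, assuming a tab after each string.
vocabulary = "a\t0 0 1 2\nb\t0 1 2 3\nd\t0 1 3 4\n"
mapping = load_mapping(vocabulary)
vectors = string_to_vector_reference(["a", "d", "e"], mapping, [0, 0, 0, 0])
print(vectors.tolist())  # [[0, 0, 1, 2], [0, 1, 3, 4], [0, 0, 0, 0]]
```
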
### <a name="VectorToString"></a><a name="VectorToString">**VectorToString**</a>

VectorToString is the inverse of `StringToVector`. Both operators share the same mapping file format:

    <string>\t<scalar_1>\s<scalar_2>\s<scalar_3>...<scalar_n>

Unmapped vectors produce the value of the attribute `unmapping_value`.

Example:

*Attributes:*

- `mapping_file_name`: vocabulary.txt
  ```
  a 0 0 1 2
  b 0 1 2 3
  d 0 1 3 4
  ```

- `unmapping_value`: "unknown_word"

*Inputs:*
- data: [[0,0,1,2],[0,1,3,4],[0,0,0,0]]

*Outputs:*
- output: ["a", "d", "unknown_word"]

#### Attributes

***mapping_file_name***
<dd>The name of the mapping file (same format as for StringToVector).</dd>

***unmapping_value***
<dd>Mapping result for unmapped vectors.</dd>

#### Inputs

***data: tensor(T)***
<dd>Input tensor.</dd>

#### Outputs

***output: tensor(string)***
<dd>The mapping result of the input.</dd>

#### Type Constraints
***T:tensor(uint8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(int8), tensor(int16), tensor(int32), tensor(int64), tensor(bfloat16), tensor(float16), tensor(float), tensor(double), tensor(bool)***
<dd>Constrain input types to numerical tensors.</dd>

#### Examples

<details>
<summary>vector_to_string</summary>

```python
import numpy as np
import onnx

# what's in vocabulary.txt

# a 0 0 1 2
# b 0 1 2 3
# d 0 1 3 4

# `expect` is assumed to be an ONNX-style test helper; it is not defined here.
node = onnx.helper.make_node(
    'VectorToString',
    inputs=['x'],
    outputs=['y'],
    mapping_file_name='vocabulary.txt',
    unmapping_value="unknown_word"
)

x = np.array([[0,0,1,2],[0,1,3,4],[0,0,0,0]], dtype=np.int64)
y = ["a", "d", "unknown_word"]

expect(node, inputs=[x], outputs=[y],
       name='test_vector_to_string')
```
</details>

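The reverse lookup for VectorToString can be sketched by keying a dictionary on each vector, stored as a tuple. `load_reverse_mapping` and `vector_to_string_reference` are hypothetical names, and the sketch assumes integer scalars, a tab after each string in the mapping file, and a two-dimensional input where each row is one vector.

```python
import numpy as np


def load_reverse_mapping(text):
    """Parse '<string>\\t<scalar_1> <scalar_2> ...' lines into {vector tuple: string}."""
    reverse = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        key, values = line.split("\t", 1)
        reverse[tuple(int(v) for v in values.split())] = key
    return reverse


def vector_to_string_reference(data, reverse, unmapping_value):
    """Hypothetical reference: map each row to its string, else unmapping_value."""
    return [
        reverse.get(tuple(int(v) for v in row), unmapping_value) for row in data
    ]


# Same entries as the vocabulary.txt example, assuming a tab after each string.
vocabulary = "a\t0 0 1 2\nb\t0 1 2 3\nd\t0 1 3 4\n"
reverse = load_reverse_mapping(vocabulary)
x = np.array([[0, 0, 1, 2], [0, 1, 3, 4], [0, 0, 0, 0]], dtype=np.int64)
words = vector_to_string_reference(x, reverse, "unknown_word")
print(words)  # ['a', 'd', 'unknown_word']
```
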
## Tokenizer

### <a name="GPT2Tokenizer"></a><a name="GPT2Tokenizer">**GPT2Tokenizer**</a>

GPT2Tokenizer performs byte-level BPE tokenization on the input tensor, based on the [Hugging Face version](https://huggingface.co/transformers/_modules/transformers/tokenization_gpt2.html).

#### Inputs

***data: tensor(string)***
<dd>The string tensor to tokenize.</dd>

#### Outputs

***output: tensor(int64)***
<dd>The tokenized result of the input.</dd>

#### Examples

<details>
<summary>gpt2tokenizer</summary>

```python
import numpy as np
import onnx

# `expect` is assumed to be an ONNX-style test helper; it is not defined here.
node = onnx.helper.make_node(
    'GPT2Tokenizer',
    inputs=['x'],
    outputs=['y'],
)

x = ["hey cortana"]
y = np.array([20342, 12794, 2271], dtype=np.int64)

expect(node, inputs=[x], outputs=[y],
       name='test_gpt2_tokenizer')
```
</details>

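For readers unfamiliar with BPE, the core merge loop can be sketched as follows. This is a toy illustration only: the merge table and vocabulary are invented, `bpe_merge` is a hypothetical helper, and the sketch skips the regex pre-tokenization and the byte-to-printable-character mapping that the Hugging Face GPT-2 tokenizer applies. The token IDs it prints are unrelated to real GPT-2 IDs.

```python
def bpe_merge(symbols, merge_ranks):
    """Repeatedly merge the adjacent pair with the lowest rank until none applies."""
    symbols = list(symbols)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: merge_ranks.get(p, float("inf")))
        if best not in merge_ranks:
            break  # no learned merge applies to any adjacent pair
        merged = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Toy merge table and vocabulary, invented for illustration only.
toy_merges = {("h", "e"): 0, ("l", "l"): 1, ("he", "ll"): 2}
toy_vocab = {"hell": 0, "o": 1, "he": 2, "ll": 3, "h": 4, "e": 5, "l": 6}

# Byte-level: start from the UTF-8 bytes of the text, one symbol per byte.
start = [chr(b) for b in "hello".encode("utf-8")]
tokens = bpe_merge(start, toy_merges)
ids = [toy_vocab[t] for t in tokens]
print(tokens, ids)  # ['hell', 'o'] [0, 1]
```

Lower ranks are merged first, which mirrors how a learned merge list is applied in order of priority.
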
### <a name="BertTokenizer"></a><a name="BertTokenizer">**BertTokenizer**</a>

BertTokenizer performs WordPiece tokenization on the input tensor, based on the [Hugging Face version](https://huggingface.co/transformers/model_doc/bert.html#berttokenizer).

#### Inputs

***data: tensor(string)***
<dd>The string tensor to tokenize.</dd>

#### Outputs

***output: tensor(int64)***
<dd>The tokenized result of the input.</dd>

#### Examples

<details>
<summary>word_piece_tokenizer</summary>

```python
```
</details>

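Since BertTokenizer is still under development, the sketch below only illustrates the greedy longest-match-first idea behind WordPiece. It is not the operator's implementation. The vocabulary is invented, `wordpiece_tokenize` is a hypothetical helper, and the `##` continuation prefix and `[UNK]` token follow the conventions of the Hugging Face BERT tokenizer.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # mark a non-initial piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary, invented for illustration only.
toy_vocab = {"[UNK]": 0, "un": 1, "##aff": 2, "##able": 3, "hello": 4}
pieces = wordpiece_tokenize("unaffable", toy_vocab)
ids = [toy_vocab[p] for p in pieces]
print(pieces, ids)  # ['un', '##aff', '##able'] [1, 2, 3]
```
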
### <a name="XLNetTokenizer"></a><a name="XLNetTokenizer">**XLNetTokenizer**</a>

XLNetTokenizer performs SentencePiece tokenization on the input tensor, based on the [Hugging Face version](https://huggingface.co/transformers/model_doc/xlnet.html#xlnettokenizer).

#### Inputs

***data: tensor(string)***
<dd>The string tensor to tokenize.</dd>

#### Outputs

***output: tensor(int64)***
<dd>The tokenized result of the input.</dd>

#### Examples

<details>
<summary>word_piece_tokenizer</summary>

```python

```
</details>
