microsoft/onnxruntime-extensions

Mirrored from https://github.com/microsoft/onnxruntime-extensions

docs/custom_text_ops.md

## Operator Schemas

### Auxiliary String Operator

|Operator|Support State|
|------------|-----------------|
|StringEqual | Supported |
|StringHash | Supported |
|StringToHashBucketFast| Supported |
|StringJoin | Supported |
|StringRegexReplace| Supported |
|StringSplit | Supported |
|StringUpper | Supported |
|StringSlice | Under development |
|StringLength | Under development |
|StringToVector| Under development |
|VectorToString| Under development |

### Tokenizer

|Operator|Support State|
|------------|-----------------|
|GPT2Tokenizer| Supported |
|BertTokenizer| Under development |
|XLNetTokenizer| Under development |

## Auxiliary String Operator

[TODO: Add existing operators]

### <a name="StringSlice"></a><a name="StringSlice">StringSlice</a>

Slices each string element of the input tensor, following the semantics of Python string slicing:

```python
a = "abcdef"
b = a[1:2]     # "b"
c = a[3:1:-1]  # "dc"
```

#### Inputs

*data: tensor(string)*
<dd>String tensor to extract slices from.</dd>

*starts: tensor(int64/int32)*
<dd>Starting index for each string in data. Must have the same dimensions as data.</dd>

*ends: tensor(int64/int32)*
<dd>Ending index for each string in data. Must have the same dimensions as data.</dd>

*steps (optional): tensor(int64/int32)*
<dd>Slice step for each string in data. Must have the same dimensions as data. If steps is an empty tensor, the default step of 1 is used for every string.</dd>

#### Outputs

*output: tensor(string)*
<dd>Sliced data tensor.</dd>

#### Examples

<details>
<summary>string_slice</summary>

```python
import numpy as np
import onnx

# `expect` is assumed to be the node-test helper from the ONNX test suite;
# it is not defined in this snippet.
node = onnx.helper.make_node(
    'StringSlice',
    inputs=['x', 'starts', 'ends', 'steps'],
    outputs=['y'],
)

x = ["abcdef", "hijkl"]
y = [x[0][1:3:1], x[1][3:1:-1]]  # ["bc", "kj"]
starts = np.array([1, 3], dtype=np.int64)
ends = np.array([3, 1], dtype=np.int64)
steps = np.array([1, -1], dtype=np.int64)  # must match the steps used to build y

expect(node, inputs=[x, starts, ends, steps], outputs=[y],
       name='test_string_slice')
```
</details>

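The element-wise behaviour described above can be sketched in plain NumPy. This is only an illustrative reference under stated assumptions, not the operator's actual kernel: the function name `string_slice` is hypothetical, and it assumes that an absent or empty `steps` falls back to 1 as the input description says.

```python
import numpy as np


def string_slice(data, starts, ends, steps=None):
    """Slice every string in `data` with its own start, end, and step (hypothetical helper)."""
    data = np.asarray(data, dtype=object)
    starts = np.asarray(starts, dtype=np.int64)
    ends = np.asarray(ends, dtype=np.int64)
    # An absent or empty `steps` tensor falls back to a step of 1 for every string.
    if steps is None or np.size(steps) == 0:
        steps = np.ones(data.shape, dtype=np.int64)
    else:
        steps = np.asarray(steps, dtype=np.int64)

    out = np.empty(data.shape, dtype=object)
    for idx in np.ndindex(data.shape):
        # Delegate to Python slicing so negative steps behave as in the examples.
        out[idx] = data[idx][int(starts[idx]):int(ends[idx]):int(steps[idx])]
    return out


result = string_slice(["abcdef", "hijkl"], [1, 3], [3, 1], [1, -1])
print(result.tolist())  # ['bc', 'kj']
```

Iterating with `np.ndindex` keeps the sketch independent of the tensor rank, so the same code covers 1-D and higher-dimensional inputs.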
### <a name="StringLength"></a><a name="StringLength">StringLength</a>

Gets the length of each string element in the input tensor, similar to `len("abcde")` in Python.

#### Inputs

*data: tensor(string)*
<dd>String tensor whose element lengths are computed.</dd>

#### Outputs

*output: tensor(int64)*
<dd>Tensor holding the length of each string in data.</dd>

#### Examples

<details>
<summary>string_length</summary>

```python
import numpy as np
import onnx

# `expect` is assumed to be the node-test helper from the ONNX test suite;
# it is not defined in this snippet.
node = onnx.helper.make_node(
    'StringLength',
    inputs=['x'],
    outputs=['y']
)

x = ["abcdef", "hijkl"]
y = np.array([len(x[0]), len(x[1])], dtype=np.int64)  # [6, 5]

expect(node, inputs=[x], outputs=[y],
       name='test_string_length')
```
</details>

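As a rough reference for the expected output, the computation can be sketched in NumPy. The helper name `string_length` is hypothetical, and the sketch counts Python characters (code points); the text above does not say whether the operator counts characters or bytes, so treat that as an assumption.

```python
import numpy as np


def string_length(data):
    """Return an int64 tensor with the length of each string (hypothetical helper)."""
    data = np.asarray(data, dtype=object)
    out = np.empty(data.shape, dtype=np.int64)
    for idx in np.ndindex(data.shape):
        # len() counts code points here, which is an assumption about the operator.
        out[idx] = len(data[idx])
    return out


lengths = string_length([["abcdef", "hijkl"], ["", "xy"]])
print(lengths.tolist())  # [[6, 5], [0, 2]]
```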
### <a name="StringToVector"></a><a name="StringToVector">StringToVector</a>

StringToVector maps each string element in the input to its corresponding vector according to the mapping file. The mapping file is a UTF-8 encoded text file in TSV format:

    <string>\t<scalar_1>\s<scalar_2>\s<scalar_3>...<scalar_n>

A string with no entry in the mapping file yields the value of the attribute `unmapping_value`.

Example:

Attributes:

- `mapping_file_name`: vocabulary.txt
  ```
  a 0 0 1 2
  b 0 1 2 3
  d 0 1 3 4
  ```

- `unmapping_value`: [0 0 0 0]

Inputs:
- data: ["a", "d", "e"]

Outputs:
- output: [[0,0,1,2],[0,1,3,4],[0,0,0,0]]

#### Attributes

*mapping_file_name:string*
<dd>The name of your string-to-vector mapping file.</dd>

*unmapping_value:list(int)*
<dd>Mapping result for an unmapped string.</dd>

#### Inputs

*data: tensor(string)*
<dd>Input tensor.</dd>

#### Outputs

*output: tensor(T)*
<dd>The mapping result of the input.</dd>

#### Type Constraints

*T:tensor(uint8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(int8), tensor(int16), tensor(int32), tensor(int64), tensor(bfloat16), tensor(float16), tensor(float), tensor(double), tensor(bool)*
<dd>Constrain the output type to numerical tensors.</dd>

#### Examples

<details>
<summary>string_to_vector</summary>

```python
import numpy as np
import onnx

# what's in vocabulary.txt

# a 0 0 1 2
# b 0 1 2 3
# d 0 1 3 4

# `expect` is assumed to be the node-test helper from the ONNX test suite;
# it is not defined in this snippet.
node = onnx.helper.make_node(
    'StringToVector',
    inputs=['x'],
    outputs=['y'],
    mapping_file_name='vocabulary.txt',
    unmapping_value=[0, 0, 0, 0]
)

x = ["a", "d", "e"]
y = np.array([[0, 0, 1, 2], [0, 1, 3, 4], [0, 0, 0, 0]], dtype=np.int64)

expect(node, inputs=[x], outputs=[y],
       name='test_string_to_vector')
```
</details>

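The lookup described in this section can be sketched in a few lines of Python. The helper names `parse_mapping` and `string_to_vector` are hypothetical, the sketch handles only 1-D input with integer scalars, and it reads the mapping from an in-memory string rather than a file to stay self-contained.

```python
import numpy as np


def parse_mapping(text):
    """Parse lines of the form '<string> TAB <scalars separated by whitespace>' into a dict."""
    mapping = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        key, values = line.split("\t", 1)
        mapping[key] = [int(v) for v in values.split()]
    return mapping


def string_to_vector(data, mapping, unmapping_value):
    """Map each string to its vector, falling back to `unmapping_value`."""
    rows = [mapping.get(s, unmapping_value) for s in data]
    return np.array(rows, dtype=np.int64)


# Same content as the vocabulary.txt example, with a tab after each key.
vocabulary = "a\t0 0 1 2\nb\t0 1 2 3\nd\t0 1 3 4\n"
mapping = parse_mapping(vocabulary)
y = string_to_vector(["a", "d", "e"], mapping, [0, 0, 0, 0])
print(y.tolist())  # [[0, 0, 1, 2], [0, 1, 3, 4], [0, 0, 0, 0]]
```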
### <a name="VectorToString"></a><a name="VectorToString">VectorToString</a>

VectorToString is the inverse operation of `StringToVector`; the two share the same mapping file format:

    <string>\t<scalar_1>\s<scalar_2>\s<scalar_3>...<scalar_n>

A vector with no entry in the mapping file yields the value of the attribute `unmapping_value`.

Example:

Attributes:

- `mapping_file_name`: vocabulary.txt
  ```
  a 0 0 1 2
  b 0 1 2 3
  d 0 1 3 4
  ```

- `unmapping_value`: "unknown_word"

Inputs:
- data: [[0,0,1,2],[0,1,3,4],[0,0,0,0]]

Outputs:
- output: ["a", "d", "unknown_word"]

#### Attributes

*mapping_file_name:string*
<dd>The name of your string-to-vector mapping file.</dd>

*unmapping_value:string*
<dd>Mapping result for an unmapped vector.</dd>

#### Inputs

*data: tensor(T)*
<dd>Input tensor.</dd>

#### Outputs

*output: tensor(string)*
<dd>The mapping result of the input.</dd>

#### Type Constraints

*T:tensor(uint8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(int8), tensor(int16), tensor(int32), tensor(int64), tensor(bfloat16), tensor(float16), tensor(float), tensor(double), tensor(bool)*
<dd>Constrain the input type to numerical tensors.</dd>

#### Examples

<details>
<summary>vector_to_string</summary>

```python
import numpy as np
import onnx

# what's in vocabulary.txt

# a 0 0 1 2
# b 0 1 2 3
# d 0 1 3 4

# `expect` is assumed to be the node-test helper from the ONNX test suite;
# it is not defined in this snippet.
node = onnx.helper.make_node(
    'VectorToString',
    inputs=['x'],
    outputs=['y'],
    mapping_file_name='vocabulary.txt',
    unmapping_value="unknown_word"
)

x = np.array([[0, 0, 1, 2], [0, 1, 3, 4], [0, 0, 0, 0]], dtype=np.int64)
y = ["a", "d", "unknown_word"]

expect(node, inputs=[x], outputs=[y],
       name='test_vector_to_string')
```
</details>

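The reverse lookup can be sketched the same way. The helper name `vector_to_string` is hypothetical, and the mapping is written out as a dict here instead of being read from vocabulary.txt; the sketch assumes each row of a 2-D input is matched as a whole against the mapping file's vectors.

```python
import numpy as np


def vector_to_string(data, mapping, unmapping_value):
    """Map each row vector back to its string, falling back to `unmapping_value`."""
    # Invert string -> vector into vector (as a hashable tuple) -> string.
    reverse = {tuple(vec): s for s, vec in mapping.items()}
    return [reverse.get(tuple(int(v) for v in row), unmapping_value) for row in data]


# Same entries as the vocabulary.txt example above.
mapping = {"a": [0, 0, 1, 2], "b": [0, 1, 2, 3], "d": [0, 1, 3, 4]}
x = np.array([[0, 0, 1, 2], [0, 1, 3, 4], [0, 0, 0, 0]], dtype=np.int64)
print(vector_to_string(x, mapping, "unknown_word"))  # ['a', 'd', 'unknown_word']
```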
## Tokenizer

### <a name="GPT2Tokenizer"></a><a name="GPT2Tokenizer">GPT2Tokenizer</a>

GPT2Tokenizer performs byte-level BPE tokenization on the input tensor, based on the [Hugging Face version](https://huggingface.co/transformers/_modules/transformers/tokenization_gpt2.html).

#### Inputs

*data: tensor(string)*
<dd>The string tensor to tokenize.</dd>

#### Outputs

*output: tensor(int64)*
<dd>The tokenized result of the input.</dd>

#### Examples

<details>
<summary>gpt2tokenizer</summary>

```python
import numpy as np
import onnx

# `expect` is assumed to be the node-test helper from the ONNX test suite;
# it is not defined in this snippet.
node = onnx.helper.make_node(
    'GPT2Tokenizer',
    inputs=['x'],
    outputs=['y'],
)

x = ["hey cortana"]
y = np.array([20342, 12794, 2271], dtype=np.int64)

expect(node, inputs=[x], outputs=[y],
       name='test_gpt2_tokenizer')
```
</details>

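To illustrate what byte-level BPE means, here is a toy sketch of the merge loop. It is not the operator's implementation: the merge table below is hypothetical, and the sketch leaves out the pre-tokenization and vocabulary-id lookup that a real GPT-2 tokenizer performs.

```python
def bpe_merge(symbols, ranks):
    """Repeatedly merge the adjacent pair with the lowest rank until none applies."""
    symbols = list(symbols)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no known merge is left
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge table: a lower rank means the pair is merged earlier.
ranks = {(b"h", b"e"): 0, (b"l", b"l"): 1, (b"he", b"ll"): 2}
# Byte-level: start from the UTF-8 bytes of the text, one symbol per byte.
symbols = [bytes([b]) for b in "hello".encode("utf-8")]
tokens = bpe_merge(symbols, ranks)
print(tokens)  # [b'hell', b'o']
```

Starting from bytes rather than characters means any input string can be represented without an unknown token, which is the main design point of the byte-level variant.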
### <a name="BertTokenizer"></a><a name="BertTokenizer">BertTokenizer</a>

BertTokenizer performs WordPiece tokenization on the input tensor, based on the [Hugging Face version](https://huggingface.co/transformers/model_doc/bert.html#berttokenizer).

#### Inputs

*data: tensor(string)*
<dd>The string tensor to tokenize.</dd>

#### Outputs

*output: tensor(int64)*
<dd>The tokenized result of the input.</dd>

#### Examples

<details>
<summary>word_piece_tokenizer</summary>

```python
```
</details>

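WordPiece splits a word by greedy longest-match-first lookup against a vocabulary. The sketch below illustrates that idea only: the function name `wordpiece`, the tiny vocabulary, and the `[UNK]` and `##` defaults are assumptions for illustration, and it returns token strings rather than the int64 ids the operator outputs.

```python
def wordpiece(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one word into the longest vocabulary pieces, left to right."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # mark a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical vocabulary for illustration.
vocab = {"un", "##aff", "##able", "play", "##ing"}
print(wordpiece("unaffable", vocab))  # ['un', '##aff', '##able']
print(wordpiece("playing", vocab))    # ['play', '##ing']
print(wordpiece("xyz", vocab))        # ['[UNK]']
```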
### <a name="XLNetTokenizer"></a><a name="XLNetTokenizer">XLNetTokenizer</a>

XLNetTokenizer performs SentencePiece tokenization on the input tensor, based on the [Hugging Face version](https://huggingface.co/transformers/model_doc/xlnet.html#xlnettokenizer).

#### Inputs

*data: tensor(string)*
<dd>The string tensor to tokenize.</dd>

#### Outputs

*output: tensor(int64)*
<dd>The tokenized result of the input.</dd>

#### Examples

<details>
<summary>sentence_piece_tokenizer</summary>

```python

```
</details>

