openai/openai-python

mirrored from https://github.com/openai/openai-python

v0.15.0

examples/embeddings/Regression.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Regression using the embeddings\n",
    "\n",
    "Regression means predicting a number rather than a category. We will predict the score based on the embedding of the review's text. We split the dataset into a training set and a test set so we can realistically evaluate performance on unseen data. The dataset is created in the [Obtain_dataset Notebook](Obtain_dataset.ipynb).\n",
    "\n",
    "We're predicting the score of the review, which is a number between 1 and 5 (1-star being negative and 5-star positive)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Babbage similarity embedding performance on 1k Amazon reviews: mse=0.38, mae=0.39\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "from ast import literal_eval\n",
    "\n",
    "from sklearn.ensemble import RandomForestRegressor\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import mean_squared_error, mean_absolute_error\n",
    "\n",
    "df = pd.read_csv('output/embedded_1k_reviews.csv')\n",
    "# Embeddings are stored in the CSV as stringified lists; parse them back into arrays\n",
    "df['babbage_similarity'] = df.babbage_similarity.apply(literal_eval).apply(np.array)\n",
    "\n",
    "# Hold out 20% of the reviews for evaluation\n",
    "X_train, X_test, y_train, y_test = train_test_split(list(df.babbage_similarity.values), df.Score, test_size=0.2, random_state=42)\n",
    "\n",
    "rfr = RandomForestRegressor(n_estimators=100)\n",
    "rfr.fit(X_train, y_train)\n",
    "preds = rfr.predict(X_test)\n",
    "\n",
    "mse = mean_squared_error(y_test, preds)\n",
    "mae = mean_absolute_error(y_test, preds)\n",
    "\n",
    "print(f\"Babbage similarity embedding performance on 1k Amazon reviews: mse={mse:.2f}, mae={mae:.2f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dummy mean prediction performance on Amazon reviews: mse=1.77, mae=1.04\n"
     ]
    }
   ],
   "source": [
    "# Baseline: always predict the mean score of the test set\n",
    "bmse = mean_squared_error(y_test, np.repeat(y_test.mean(), len(y_test)))\n",
    "bmae = mean_absolute_error(y_test, np.repeat(y_test.mean(), len(y_test)))\n",
    "print(f\"Dummy mean prediction performance on Amazon reviews: mse={bmse:.2f}, mae={bmae:.2f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that the embeddings predict the scores with a mean absolute error of 0.39 per prediction, compared with 1.04 for the dummy mean baseline. This is roughly equivalent to predicting 2 out of 3 reviews perfectly and being off by one star on the remaining 1 out of 3."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You could also train a classifier to predict the label, or use the embeddings within an existing ML model to encode free-text features."
   ]
  }
 ],
 "metadata": {
  "interpreter": {
   "hash": "be4b5d5b73a21c599de40d6deb1129796d12dc1cc33a738f7bac13269cfcafe8"
  },
  "kernelspec": {
   "display_name": "Python 3.7.3 64-bit ('base': conda)",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
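
The notebook's closing remark, that an error of about 0.39 resembles getting two in three reviews exactly right and missing the rest by one star, can be checked with a little arithmetic. The sketch below is only an illustration: the scores are made up (they are not taken from the review dataset), and the helper names `mean_abs_err` and `mean_sq_err` are hypothetical stand-ins written in plain Python rather than the scikit-learn functions the notebook uses.

```python
def mean_abs_err(y_true, y_pred):
    """Average absolute difference between true and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_sq_err(y_true, y_pred):
    """Average squared difference between true and predicted scores."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical star ratings, invented for this example only.
y_true = [5, 4, 1, 5, 3, 2]
# Four predictions are exact and two are off by one star (2 in 3 exact, 1 in 3 off by one).
y_pred = [5, 4, 2, 5, 3, 1]

mae = mean_abs_err(y_true, y_pred)
mse = mean_sq_err(y_true, y_pred)

print(f"mae={mae:.2f}, mse={mse:.2f}")  # prints "mae=0.33, mse=0.33"
```

Because a one-star miss contributes 1 to both the absolute and the squared error, this error pattern gives 1/3 for both metrics, which is in the same range as the mse=0.38 and mae=0.39 reported in the notebook output. The match is approximate, as the notebook's own wording ("roughly") suggests.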