crawles · September 16, 2016 18:48
diff --git a/text-analytics-service-pcf.ipynb b/text-analytics-service-pcf.ipynb
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook implements a pre-trained sentiment analysis pipeline including a regex pre-processing step, tokenization, n-gram computation, and logistic regreression model as a RESTful API."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2016-08-24T11:16:17.753340",
     "start_time": "2016-08-24T11:16:17.733692"
    },
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import cPickle\n",
    "import json\n",
    "\n",
    "import pandas as pd\n",
    "import sklearn\n",
    "import requests"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Import the trained model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2016-08-24T14:20:40.129088",
     "start_time": "2016-08-24T14:20:40.044663"
    },
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "resp = requests.get(\"https://raw.githubusercontent.com/crawles/gpdb_sentiment_analysis_twitter_model/master/twitter_sentiment_model.pkl\")\n",
    "resp.raise_for_status()\n",
    "cl = cPickle.loads(resp.content)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2016-09-16T11:57:34.885062",
     "start_time": "2016-09-16T11:57:34.874853"
    }
   },
   "source": [
    "### Import data pre-processing function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def regex_preprocess(raw_tweets):\n",
    "    pp_text = pd.Series(raw_tweets)\n",
    "    \n",
    "    user_pat = '(?<=^|(?<=[^a-zA-Z0-9-_\\.]))@([A-Za-z]+[A-Za-z0-9]+)'\n",
    "    http_pat = '(https?:\\/\\/(?:www\\.|(?!www))[^\\s\\.]+\\.[^\\s]{2,}|www\\.[^\\s]+\\.[^\\s]{2,})'\n",
    "    repeat_pat, repeat_repl = \"(.)\\\\1\\\\1+\",'\\\\1\\\\1'\n",
    "\n",
    "    pp_text = pp_text.str.replace(pat = user_pat, repl = 'USERNAME')\n",
    "    pp_text = pp_text.str.replace(pat = http_pat, repl = 'URL')\n",
    "    pp_text.str.replace(pat = repeat_pat, repl = repeat_repl)\n",
    "    return pp_text"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Setup the API"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Jupyter Kernel Gateway utilizes a global REQUEST JSON string that will be replaced on each invocation of the API."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "REQUEST = json.dumps({\n",
    "    'path' : {},\n",
    "    'args' : {}\n",
    "})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Compute sentiment using trained model and serve using POST"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using the kernel gateway, a cell is created as an HTTP handler using a single line comment. The handler supports common HTTP verbs (GET, POST, DELETE, etc). For more information, view the <a href=\"https://jupyter-kernel-gateway.readthedocs.io/en/latest/http-mode.html\">docs</a>."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2016-08-22T17:52:15.653020",
     "start_time": "2016-08-22T17:52:15.636712"
    },
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# POST /polarity_compute\n",
    "req = json.loads(REQUEST)\n",
    "tweets = req['body']['data']\n",
    "print(cl.predict_proba(regex_preprocess(tweets))[:][:,1])"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
 }
	{
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"This notebook implements a pre-trained sentiment analysis pipeline including a regex pre-processing step, tokenization, n-gram computation, and logistic regreression model as a RESTful API."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"ExecuteTime": {
	"end_time": "2016-08-24T11:16:17.753340",
	"start_time": "2016-08-24T11:16:17.733692"
	},
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"import cPickle\n",
	"import json\n",
	"\n",
	"import pandas as pd\n",
	"import sklearn\n",
	"import requests"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Import the trained model"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"ExecuteTime": {
	"end_time": "2016-08-24T14:20:40.129088",
	"start_time": "2016-08-24T14:20:40.044663"
	},
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"resp = requests.get(\"https://raw.githubusercontent.com/crawles/gpdb_sentiment_analysis_twitter_model/master/twitter_sentiment_model.pkl\")\n",
	"resp.raise_for_status()\n",
	"cl = cPickle.loads(resp.content)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"ExecuteTime": {
	"end_time": "2016-09-16T11:57:34.885062",
	"start_time": "2016-09-16T11:57:34.874853"
	}
	},
	"source": [
	"### Import data pre-processing function"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"def regex_preprocess(raw_tweets):\n",
	" pp_text = pd.Series(raw_tweets)\n",
	" \n",
	" user_pat = '(?<=^\|(?<=[^a-zA-Z0-9-_\\.]))@([A-Za-z]+[A-Za-z0-9]+)'\n",
	" http_pat = '(https?:\\/\\/(?:www\\.\|(?!www))[^\\s\\.]+\\.[^\\s]{2,}\|www\\.[^\\s]+\\.[^\\s]{2,})'\n",
	" repeat_pat, repeat_repl = \"(.)\\\\1\\\\1+\",'\\\\1\\\\1'\n",
	"\n",
	" pp_text = pp_text.str.replace(pat = user_pat, repl = 'USERNAME')\n",
	" pp_text = pp_text.str.replace(pat = http_pat, repl = 'URL')\n",
	" pp_text.str.replace(pat = repeat_pat, repl = repeat_repl)\n",
	" return pp_text"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# Setup the API"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Jupyter Kernel Gateway utilizes a global REQUEST JSON string that will be replaced on each invocation of the API."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"REQUEST = json.dumps({\n",
	" 'path' : {},\n",
	" 'args' : {}\n",
	"})"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Compute sentiment using trained model and serve using POST"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Using the kernel gateway, a cell is created as an HTTP handler using a single line comment. The handler supports common HTTP verbs (GET, POST, DELETE, etc). For more information, view the <a href=\"https://jupyter-kernel-gateway.readthedocs.io/en/latest/http-mode.html\">docs</a>."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"ExecuteTime": {
	"end_time": "2016-08-22T17:52:15.653020",
	"start_time": "2016-08-22T17:52:15.636712"
	},
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"# POST /polarity_compute\n",
	"req = json.loads(REQUEST)\n",
	"tweets = req['body']['data']\n",
	"print(cl.predict_proba(regex_preprocess(tweets))[:][:,1])"
	]
	}
	],
	"metadata": {
	"anaconda-cloud": {},
	"kernelspec": {
	"display_name": "Python 2",
	"language": "python",
	"name": "python2"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 2
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython2",
	"version": "2.7.12"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 0
	}