alanzchen · March 31, 2017 13:09
diff --git a/LDA_Profiling_Draft.ipynb b/LDA_Profiling_Draft.ipynb
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Generating Stock Authors (Reviewer) LDA-based Profile\n",
    "\n",
    "Each author has her own research interests. By analyising her previous works (papers in this case), we can estimates her position in a **interest space**.\n",
    "\n",
    "In the LDA model implementation, a *interest space* is a n-dimensional space spanned by n-*topics*. Given a bag of words (BOW), a vector in the space can then be generating. Each coordinate of the vector is the *confidence* that the LDA model believes the BOW belongs to a particular *topic*.\n",
    "\n",
    "To get a vector of a paper, we preproccess the abstract of a paper and convert it to a BOW, then feed it to a trained LDA model. By summing up all the vectors of previous works of an author, we can get the author's position in the *interest space*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\Anaconda3\\lib\\site-packages\\gensim\\utils.py:855: UserWarning: detected Windows; aliasing chunkize to chunkize_serial\n",
      "  warnings.warn(\"detected Windows; aliasing chunkize to chunkize_serial\")\n"
     ]
    }
   ],
   "source": [
    "import logging\n",
    "import pandas as pd\n",
    "import sqlite3\n",
    "import gensim\n",
    "import nltk\n",
    "import json\n",
    "from gensim.corpora import BleiCorpus\n",
    "from gensim import corpora\n",
    "from nltk.corpus import stopwords\n",
    "from textblob import TextBlob\n",
    "from gensim.corpora import Dictionary\n",
    "from gensim.models import LdaModel\n",
    "import numpy as np\n",
    "import pickle\n",
    "con = sqlite3.connect(\"fintime50.sqlite\")\n",
    "db_documents = pd.read_sql_query(\"SELECT * from documents\", con)\n",
    "db_authors = pd.read_sql_query(\"SELECT * from authors\", con)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Preproccessing\n",
    "\n",
    "The first thing is to tokenise the raw text, then turn it into a BOW.\n",
    "\n",
    "We already have a trained LDA model and its dictionary. Now we can use the dictionary to generate a BOW given the tokenised text. To filter out (roughly) irrevelent terms, we only leave noun phrases in our BOW."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def tokenise2vec(text):\n",
    "    if text:\n",
    "        return dictionary.doc2bow(TextBlob(text.lower()).noun_phrases)\n",
    "    else:\n",
    "        return []"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def save_pkl(target_object, filename):\n",
    "    with open(filename, \"wb\") as file:\n",
    "        pickle.dump(target_object, file, protocol=2, fix_imports=True)\n",
    "        \n",
    "def load_pkl(filename):\n",
    "    return pickle.load(open(filename, \"rb\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "model = LdaModel.load(\"fintime50_300.ldamodel\")\n",
    "dictionary = Dictionary.load(\"fintime50_300.ldamodel.dictionary\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "models = {}\n",
    "dictionaries = {}\n",
    "documents = {}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Fitting (Predicting) Topics Distribution From Raw Text\n",
    "\n",
    "`predict` function will predict the topics distributions from a given raw text. The result is a pandas dataframe, with topics ids and confidence thereof."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def predict(sometext):\n",
    "    vec = tokenise2vec(sometext)\n",
    "    dtype = [('topic_id', int), ('confidence', float)]\n",
    "    topics = np.array(model[vec], dtype=dtype)\n",
    "    topics.sort(order=\"confidence\")\n",
    "#     for topic in topics[::-1]:\n",
    "#         print(\"--------\")\n",
    "#         print(topic[1], topic[0])\n",
    "#         print(model.print_topic(topic[0]))\n",
    "    return pd.DataFrame(topics)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>topic_id</th>\n",
       "      <th>confidence</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>121</td>\n",
       "      <td>0.143161</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>162</td>\n",
       "      <td>0.143166</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>259</td>\n",
       "      <td>0.571198</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   topic_id  confidence\n",
       "0       121    0.143161\n",
       "1       162    0.143166\n",
       "2       259    0.571198"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "predict(\"null values are interpreted as unknown value or inapplicable value. This paper proposes a new approach for solving the unknown value problems with Implicit Predicate (IP). The IP serves as a descriptor corresponding to a set of the unknown values, thereby expressing the semantics of them. In this paper, we demonstrate that the IP is capable of (1) enhancing the semantic expressiveness of the unknown values, (2) entering incomplete information into database and (3) exploiting the information and a variety of inference rules in database to reduce the uncertainties of the unknown values.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generate a Author's Topic Vector\n",
    "\n",
    "The vector is a topic confidence vector for the author. The length of the vector should be the number of topics in the LDA model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def update_author_vector(vec, doc_vec):\n",
    "    for topic_id, confidence in zip(doc_vec['topic_id'], doc_vec['confidence']):\n",
    "        vec[topic_id] += confidence\n",
    "    return vec"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For a author, we first get all his previous papers in our database. For each paper we get, we generate a paper's vector. At last, the sum of all vectors will be the vector (aka the position) in the *interest space*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def profile_author(author_id, model_topics_num=model.num_topics):\n",
    "    author_vec = np.array([1.0 for i in range(model_topics_num)])\n",
    "    # Discuss: Should the vector be initialised with 1s or 0s?\n",
    "    paper_list = pd.read_sql_query(\"SELECT * FROM documents_authors WHERE authors_id=\" + str(author_id), con)['documents_id']\n",
    "    for paper_id in paper_list:\n",
    "        try:\n",
    "            abstract = db_documents['abstract'][paper_id]\n",
    "        except KeyError:\n",
    "            print(\"KeyError occurred on paper id \" + str(paper_id))\n",
    "        vec = predict(abstract)\n",
    "        author_vec = update_author_vector(author_vec, vec)\n",
    "    return author_vec"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def profile_all_authors():\n",
    "    authors = {}\n",
    "    for author_id in db_authors['id']:\n",
    "        authors[author_id] = profile_author(author_id)\n",
    "        # print(author_id)\n",
    "        # uncomment the above line to track the progress\n",
    "    return authors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "authors_lib = profile_all_authors()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "save_pkl(authors_lib, \"fintime50_300.ldamodel.pkl\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "authors_lib = load_pkl(\"fintime50_300.ldamodel.pkl\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "300"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(authors_lib[1])\n",
    "# the length of a author's vector. It should always be equal to the number of topics of the trained model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Helpers\n",
    "\n",
    "This function will get the top *k* highest confidence topics for a author. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def get_author_top_topics(author_id, top=10):\n",
    "    author = authors_lib[author_id]\n",
    "    top_topics = []\n",
    "    for topic_id, confidence in enumerate(author):\n",
    "        if confidence > 1:\n",
    "            top_topics.append([topic_id, (confidence - 1) * 100])\n",
    "    top_topics.sort(key=lambda tup: tup[1], reverse=True)\n",
    "    return top_topics[:top]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[229, 22.237712805080356],\n",
      " [48, 11.165327246751723],\n",
      " [130, 11.138433115183544],\n",
      " [283, 11.137873330881032],\n",
      " [211, 11.13752201495004],\n",
      " [23, 11.1373962659568],\n",
      " [231, 11.111419280714063]]\n"
     ]
    }
   ],
   "source": [
    "topics = get_author_top_topics(12345)\n",
    "from pprint import pprint\n",
    "pprint(topics)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def get_topic_in_list(model, topic_id):\n",
    "    return [term.strip().split('*') for term in model.print_topic(topic_id).split(\"+\")]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def get_topic_in_string(model, topic_id, top=5):\n",
    "    topic_list = get_topic_in_list(model, topic_id)\n",
    "    topic_string = \" / \".join([i[1] for i in topic_list][:top])\n",
    "    return topic_string"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'\"political system\" / \"complex process\" / \"law enforcement\" / \"portfolio selection\" / \"decision-making situations\"'"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "get_topic_in_string(model, 5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def get_topics_in_string(model, topics, confidence=False):\n",
    "    if confidence:\n",
    "        topics_list = []\n",
    "        for topic in topics:\n",
    "            topic_map = {\n",
    "                \"topic_id\": topic[0],\n",
    "                \"string\": get_topic_in_string(model, topic[0]),\n",
    "                \"confidence\": topic[1]\n",
    "            }\n",
    "            topics_list.append(topic_map)\n",
    "    else:\n",
    "        topics_list = []\n",
    "        for topic_id in topics:\n",
    "            topic_map = {\n",
    "                \"topic_id\": topic_id,\n",
    "                \"string\": get_topic_in_string(model, topic_id),\n",
    "            }\n",
    "            topics_list.append(topic_map)\n",
    "    return topics_list"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'confidence': 22.237712805080356,\n",
       "  'string': '\"analytical results\" / \"major questions\" / \"recent development\" / \"important area\" / \"public ownership\"',\n",
       "  'topic_id': 229},\n",
       " {'confidence': 11.165327246751723,\n",
       "  'string': '\"top executives\" / \"negative effect\" / \"financial measures\" / \"cultural differences\" / \"consistent evidence\"',\n",
       "  'topic_id': 48},\n",
       " {'confidence': 11.138433115183544,\n",
       "  'string': '\"field studies\" / \"alternative explanation\" / \"different perspective\" / \"traditional approaches\" / \"experimental data\"',\n",
       "  'topic_id': 130},\n",
       " {'confidence': 11.137873330881032,\n",
       "  'string': '\"present research\" / \"performance outcomes\" / \"potential customers\" / \"strategic capabilities\" / \"preliminary evidence\"',\n",
       "  'topic_id': 283},\n",
       " {'confidence': 11.13752201495004,\n",
       "  'string': '\"major source\" / \"basic assumptions\" / \"numerical experiments\" / \"analyst coverage\" / \"social norms\"',\n",
       "  'topic_id': 211},\n",
       " {'confidence': 11.1373962659568,\n",
       "  'string': '\"different conclusions\" / \"basic elements\" / \"consumer surplus\" / \"opportunistic behavior\" / \"alternative view\"',\n",
       "  'topic_id': 23},\n",
       " {'confidence': 11.111419280714063,\n",
       "  'string': '\"positive effect\" / \"top management\" / \"research suggests\" / \"study demonstrates\" / \"conflict resolution\"',\n",
       "  'topic_id': 231}]"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "get_topics_in_string(model, topics, confidence=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Matching Authors (or Reviewers)\n",
    "\n",
    "With a library of authors' vectors on hand, we can then try to measure the similarity (or distance) between any two authors.\n",
    "\n",
    "To match a author A, a simple way is just to compare her with every authors in the library and get the best match (smallest distance or largest similarity)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Defining Similarity\n",
    "\n",
    "The simplest measure is just the geometric distance. The squared distance is defined by the sum of squares of the difference of two vectors.\n",
    "\n",
    "$$ \\sum_{i=0}^{n\\_topics} (vec^A_i - vec^B_i)^2 $$\n",
    "\n",
    "Then the similarity would simply be the reciprocal of the sum."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def similarity_distance(vec1, vec2):\n",
    "    diff = vec1 - vec2\n",
    "    sum_of_squares = sum([pow(i, 2) for i in diff])\n",
    "    if sum_of_squares == 0:\n",
    "        return 0\n",
    "    else:\n",
    "        return 1/sum_of_squares"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def search_by_sim(author_vec, authors_lib, sim_algorithm):\n",
    "    result = []\n",
    "    for author_id_ in authors_lib:\n",
    "        result.append([author_id_, sim_algorithm(authors_lib[author_id_], author_vec)])\n",
    "    result.sort(key=lambda tup: tup[1], reverse=True)\n",
    "    return result\n",
    "#     result = np.array(result, dtype=dtype)\n",
    "#     return topics.sort(order=\"similarity\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": false,
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "result = search_by_sim(authors_lib[999], authors_lib, similarity_distance)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is an example of lda matching algorithm:\n",
    "We match Professor Jane Davison (https://pure.royalholloway.ac.uk/portal/en/persons/jane-davison(7bcbd8ce-a4db-4464-84de-49419cbd77d4).html) (id 999 in our database) with all authors in our database to find the best matching.\n",
    "\n",
    "Her research intersts, as stated on her page, are:\n",
    "\n",
    "- Accounting and the visual\n",
    "- Accounting narratives\n",
    "- Literary, fine art and philosophical approaches to communication issues in accounting\n",
    "- Reporting of intangibles/intellectual capital\n",
    "- Myth and accounting\n",
    "\n",
    "Here is our top 10 results:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[[8194, 3.3363940617663648],\n",
       " [8196, 3.3363940617663648],\n",
       " [17226, 3.3363940617663648],\n",
       " [17239, 3.3363940617663648],\n",
       " [28280, 3.3363940617663648],\n",
       " [28281, 3.3363940617663648],\n",
       " [42131, 3.3363940617663648],\n",
       " [2249, 2.9744371121018638],\n",
       " [50815, 2.9446532132279937],\n",
       " [43658, 2.930374285586681]]"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "id                                                          8195\n",
       "email                                                       None\n",
       "institution                           University of Pennsylvania\n",
       "last_name                                                  Lanen\n",
       "first_name                                            William N.\n",
       "middle_name                                                 None\n",
       "avatar                                                      None\n",
       "address        University of Pennsylvania, Philadelphia, PA 1...\n",
       "vitae                                                       None\n",
       "Name: 8194, dtype: object"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "db_authors.iloc[8194]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The top result is the author with id 8194, [William N. Lanen](http://www.bus.umich.edu/facultybios/cv/lanen.pdf) from the Northeastern University.\n",
    "\n",
    "His research interests found in the PDF file are:\n",
    "- Performance Measurement and Compensation\n",
    "- Cost Management\n",
    "- Performance Measurement and Compensation\n",
    "- Management Control Systems in Firms in Transitional Economies\n",
    "- Environmental Accounting\n",
    "\n",
    "As we can see, at least the result isn't too bad.\n",
    "\n",
    "Note that from the 2nd to the 7th result, the confidence stays the same as the first result. This suggests that our database may have only one papers coauthored by theses authors, resulting a exactly the same vector.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Merging Authors\n",
    "\n",
    "An author may be misidentified as sepearte authors. Here is a tool to help you manually merge several authors as a single author."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def merge_author(author_ids, authors_lib):\n",
    "    # merge_author merges several authors together and returns a merged author vector.\n",
    "    # Extremely useful when the same author are identified differently in the database.\n",
    "    from copy import deepcopy\n",
    "    merged = deepcopy(authors_lib[author_ids[0]])\n",
    "    for author_id in author_ids[1:]:\n",
    "        merged += authors_lib[author_id]\n",
    "    return merged"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For example, let's find Wei Xiong's matching."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "xiongwei_vec = merge_author([14711, 15255, 17090, 42122], authors_lib) - 4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.16699482,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.49951348,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.33348477,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.49956615,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.33360545,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.16698115,  0.66589136,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.16702132,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.49948726,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.33353439,  0.1669846 ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.16700858,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.33348728,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,\n",
       "        0.        ,  0.        ,  0.        ,  0.        ,  0.        ])"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "xiongwei_vec"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "xw_result = search_by_sim(xiongwei_vec, authors_lib, similarity_distance)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[[20, 0.0034116970217480508],\n",
       " [21, 0.0034116970217480508],\n",
       " [24, 0.0034116970217480508],\n",
       " [26, 0.0034116970217480508],\n",
       " [29, 0.0034116970217480508],\n",
       " [34, 0.0034116970217480508],\n",
       " [38, 0.0034116970217480508],\n",
       " [39, 0.0034116970217480508],\n",
       " [40, 0.0034116970217480508],\n",
       " [41, 0.0034116970217480508],\n",
       " [42, 0.0034116970217480508],\n",
       " [44, 0.0034116970217480508],\n",
       " [45, 0.0034116970217480508],\n",
       " [61, 0.0034116970217480508],\n",
       " [66, 0.0034116970217480508],\n",
       " [67, 0.0034116970217480508],\n",
       " [68, 0.0034116970217480508],\n",
       " [77, 0.0034116970217480508],\n",
       " [79, 0.0034116970217480508],\n",
       " [81, 0.0034116970217480508],\n",
       " [83, 0.0034116970217480508],\n",
       " [85, 0.0034116970217480508],\n",
       " [94, 0.0034116970217480508],\n",
       " [95, 0.0034116970217480508],\n",
       " [96, 0.0034116970217480508],\n",
       " [100, 0.0034116970217480508],\n",
       " [102, 0.0034116970217480508],\n",
       " [107, 0.0034116970217480508],\n",
       " [108, 0.0034116970217480508],\n",
       " [161, 0.0034116970217480508],\n",
       " [169, 0.0034116970217480508],\n",
       " [171, 0.0034116970217480508],\n",
       " [172, 0.0034116970217480508],\n",
       " [173, 0.0034116970217480508],\n",
       " [175, 0.0034116970217480508],\n",
       " [178, 0.0034116970217480508],\n",
       " [179, 0.0034116970217480508],\n",
       " [180, 0.0034116970217480508],\n",
       " [181, 0.0034116970217480508],\n",
       " [182, 0.0034116970217480508],\n",
       " [183, 0.0034116970217480508],\n",
       " [198, 0.0034116970217480508],\n",
       " [200, 0.0034116970217480508],\n",
       " [201, 0.0034116970217480508],\n",
       " [202, 0.0034116970217480508],\n",
       " [203, 0.0034116970217480508],\n",
       " [212, 0.0034116970217480508],\n",
       " [213, 0.0034116970217480508],\n",
       " [218, 0.0034116970217480508],\n",
       " [225, 0.0034116970217480508]]"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "xw_result[:50]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Such weird result is due to insufficient paper data these authors. Authors with exactly the same similarity score are highly likely to share exactly the same portfolio of papers in our database. Thus, their author_vec(s) are exactly the same too.\n",
    "\n",
    "Now let's look what Wei Xiong is up to:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\"interest rates\" / \"market price\" / \"bad news\" / \"good news\" / \"social relationships\"\n",
      "\"paper studies\" / \"paper demonstrates\" / \"contextual factors\" / \"paper considers\" / \"future events\"\n",
      "\"new technologies\" / \"past decade\" / \"primary focus\" / \"public policy\" / \"contingency approaches\"\n",
      "\"commercial banks\" / \"cost function\" / \"relative strength\" / \"various scenarios\" / \"standard setters\"\n",
      "\"present study\" / \"firm size\" / \"human resources\" / \"study attempts\" / \"financial statements\"\n",
      "\"empirical analysis\" / \"important implications\" / \"paper tests\" / \"external environment\" / \"new opportunities\"\n",
      "\"evidence consistent\" / \"research questions\" / \"current state\" / \"sensitivity analyses\" / \"solution procedures\"\n",
      "\"world 's\" / \"market failure\" / \"public goods\" / \"competitive pressures\" / \"group decision\"\n",
      "\"paper investigates\" / \"important role\" / \"utility functions\" / \"major reasons\" / \"network providers\"\n",
      "\"economic performance\" / \"institutional investors\" / \"explanatory power\" / \"direct effect\" / \"opportunity cost\"\n",
      "\"paper suggests\" / \"corporate governance\" / \"corporate performance\" / \"internal process\" / \"article investigates\"\n",
      "\"theoretical model\" / \"relative performance\" / \"cognitive biases\" / \"qualitative research\" / \"unique setting\"\n",
      "\"poor performance\" / \"reward systems\" / \"common use\" / \"cycle time\" / \"major role\"\n"
     ]
    }
   ],
   "source": [
    "top_topics = []\n",
    "for topic_id, confidence in enumerate(xiongwei_vec):\n",
    "    if confidence > 0:\n",
    "        top_topics.append([topic_id, confidence * 100])\n",
    "top_topics.sort(key=lambda tup: tup[1], reverse=True)\n",
    "for i in top_topics:\n",
    "    print(get_topic_in_string(model, topic_id=i[0]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [conda root]",
   "language": "python",
   "name": "conda-root-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }