{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A tutorial on using tree fragments for text classification\n",
"----------------------------------------------------------\n",
"\n",
"Tree fragments are arbitrarly sized connected subgraphs of parse trees. For a reference see e.g. http://dare.uva.nl/record/371504\n",
"\n",
"As input any set of parse trees can be used, obtained by for example the Charniak & Johnson parser (my recommendation, see http://github.com/BLLIP/bllip-parser ), the Stanford Parser, or the Berkeley Parser.\n",
"\n",
"This assumes you have successfully installed the disco-dop parser, which contains the code for fragment extraction. See http://github.com/andreasvc/disco-dop\n",
"\n",
"For the machine learning part we rely on scikit-learn, see http://scikit-learn.org/"
]
},
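{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the notation concrete, here is a small hand-written example (an illustration, not taken from the corpus used below) of a bracketed parse tree and one fragment of it, written in the same string notation that the fragment extraction below produces; frontier non-terminals are left without children."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hand-written illustration: a fragment is a connected subgraph of a parse tree;\n",
"# frontier non-terminals (here the VBD and the object NP) are left without children.\n",
"tree = '(S (NP (DT The) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))'\n",
"fragment = '(VP (VBD ) (PP (IN on) (NP )))'"
]
},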
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import glob\n",
"from collections import defaultdict\n",
"from discodop import treebank, treetransforms, fragments\n",
"from sklearn import linear_model, preprocessing, feature_extraction, model_selection\n",
"vectorizer = feature_extraction.DictVectorizer(sparse=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Read the trees. Here we read only the first 1000 parse trees from a single novel from the Gutenberg project.\n",
"\n",
"Trees need to be binarized for fragment extraction. There are many parameters for binarization, but the most important are the ones related to Markovization. See Klein & Manning (2003), Accurate unlexicalized parsing."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"text = treebank.BracketCorpusReader('1027.txt.mrg.gz')\n",
"trees = [treetransforms.binarize(item.tree, horzmarkov=1, vertmarkov=1)\n",
" for _, item in text.itertrees(0, 1000)]\n",
"sents = [item.sent for _, item in text.itertrees(0, 1000)]"
]
},
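{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check, a single tree can be compared before and after binarization to see the intermediate labels introduced by Markovization. This is a sketch under assumptions: that `itertrees()` yields `(key, item)` pairs as above, that the corpus reader returns a fresh copy of the tree, that `binarize()` returns the transformed tree, and that printing a tree shows its bracketed form."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Compare one tree before and after binarization (sketch; see assumptions above).\n",
"_, item = next(iter(text.itertrees(0, 1)))\n",
"print(item.tree)\n",
"print(treetransforms.binarize(item.tree, horzmarkov=1, vertmarkov=1))"
]
},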
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run the fragment extraction. When running on a machine with multiple cores, the numproc parameter can be increased to run multiple processes in parallel."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"result = fragments.recurringfragments(trees, sents, numproc=1, disc=False, maxdepth=1)"
]
},
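{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sanity check, count how many distinct recurring fragments were found; the exact number depends on the corpus and the binarization settings (`result` is assumed to be dictionary-like, as its `.items()` method is used below)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Number of distinct recurring fragments (depends on corpus and settings).\n",
"len(result)"
]
},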
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The results are fragments in string form, along with a dictionary of all the sentence numbers where the given fragment occurs. A summation reduces this to a simple occurrence count."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3135\t(SINV|<''> ('' '') (SINV|<VP> (VP (VBD )) (SINV|<NP> (NP (NNP )) (SINV|<,> ))))\n",
"553\t(IN near)\n",
"2331\t(VP (AUX ) (VP|<RB> (RB n't) (NP )))\n",
"1776\t(JJ likely)\n",
"16143\t(VP (VBN ) (PP (IN ) (NP )))\n"
]
}
],
"source": [
"for a, b in list(result.items())[:5]:\n",
" print('%3d\\t%s' % (sum(b), a))"
]
},
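{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see the raw form of a single entry, display the value stored for one fragment; per the description above, it records the sentence numbers in which that fragment occurs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The raw value for one fragment: the sentence numbers in which it occurs.\n",
"next(iter(result.values()))"
]
},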
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To use the fragments for a machine learning problem, we want to have a feature mapping for each sentence (or document). "
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"tmp = [defaultdict(int) for _ in range(1000)]\n",
"for a, b in result.items():\n",
" for n in b:\n",
" tmp[n][a] += 1"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# Convert list of dicts to a sparse matrix\n",
"vectorizer = feature_extraction.DictVectorizer(sparse=True)\n",
"X = vectorizer.fit_transform(tmp)"
]
},
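{
"cell_type": "markdown",
"metadata": {},
"source": [
"It can be useful to inspect the dimensions of the resulting matrix: one row per sentence, one column per distinct fragment (a quick check; the exact number of columns depends on the fragments extracted above)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# One row per sentence, one column per distinct fragment.\n",
"X.shape"
]
},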
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# Trivial machine learning objective: detect long sentences\n",
"target = ['long' if len(sent) > 20 else 'short' for sent in sents]\n",
"y = preprocessing.LabelEncoder().fit_transform(target)"
]
},
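{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before training, it is worth checking the class distribution of this trivial target, since heavily skewed classes would make accuracy hard to interpret."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Class distribution of the trivial long/short target.\n",
"from collections import Counter\n",
"Counter(target)"
]
},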
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 0.88118812, 0.88118812, 0.89108911, 0.91089109, 0.87 ,\n",
" 0.9 , 0.91919192, 0.87878788, 0.84848485, 0.86868687])"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Use an SVM-like classifier and 10-fold crossvalidation for evaluation\n",
"classifier = linear_model.SGDClassifier(loss='hinge', penalty='elasticnet', max_iter=5, tol=None)\n",
"cv = model_selection.StratifiedKFold(n_splits=10, shuffle=True, random_state=42)\n",
"model_selection.cross_val_score(classifier, X, y, cv=cv)"
]
},
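{
"cell_type": "markdown",
"metadata": {},
"source": [
"For context, the scores above can be compared against a majority-class baseline. This is a sketch using scikit-learn's `DummyClassifier` (from `sklearn.dummy`) with the `most_frequent` strategy and the same cross-validation splits."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Majority-class baseline, evaluated with the same cross-validation splits.\n",
"from sklearn.dummy import DummyClassifier\n",
"baseline = model_selection.cross_val_score(\n",
"    DummyClassifier(strategy='most_frequent'), X, y, cv=cv)\n",
"baseline.mean()"
]
},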
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To further analyze the machine learning results, consult the sci-kit learn documentation: http://scikit-learn.org/stable/documentation.html\n",
"\n",
"Also see my notebook on text classification with bag-of-word models, which shows how to list difficult to classify documents, and find the most important features: http://nbviewer.ipython.org/gist/andreasvc/5d9b17fb981ee2a8b728"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.3"
}
},
"nbformat": 4,
"nbformat_minor": 1
}