dmcguire81 · December 3, 2022 19:57
diff --git a/3-deeper-text-analysis.ipynb b/3-deeper-text-analysis.ipynb
 {
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/gist/dmcguire81/db6800b32a833426d212d431b6736a45/3-deeper-text-analysis.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "%%capture\n",
        "%pip install https://github.com/illinois/metapy/releases/download/v0.2.14/metapy-0.2.14-cp38-cp38-manylinux_2_24_x86_64.whl"
      ],
      "metadata": {
        "id": "tZxgJ0O2lUtQ"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "4fYb2j4ch7Ty"
      },
      "source": [
        "First, we'll import the `metapy` python bindings."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": true,
        "deletable": true,
        "editable": true,
        "id": "ajRA5CtMh7T1"
      },
      "outputs": [],
      "source": [
        "import metapy"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "cNQyuzTTh7T3"
      },
      "source": [
        "Now, let's create a document with some content."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": true,
        "deletable": true,
        "editable": true,
        "id": "E3MhcrnIh7Ud"
      },
      "outputs": [],
      "source": [
        "doc = metapy.index.Document()\n",
        "doc.content(\"I said that I can't believe that it only costs $19.95!\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "9umBdaKah7Uh"
      },
      "source": [
        "MeTA provides a stream-based interface for performing document tokenization. Each stream starts off with a Tokenizer object, and in most cases you should use the [Unicode standard aware](http://site.icu-project.org) `ICUTokenizer`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": true,
        "deletable": true,
        "editable": true,
        "id": "cg5GFw-Wh7Ui"
      },
      "outputs": [],
      "source": [
        "tok = metapy.analyzers.ICUTokenizer()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "3tWFN1syh7Uk"
      },
      "source": [
        "Tokenizers operate on raw text and provide an Iterable that spits out the individual text tokens. Let's try running just the `ICUTokenizer` to see what it does."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "e2y-ylash7Um",
        "outputId": "4050abe1-b124-4798-8bed-1fd00a2b2805",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "['<s>',\n",
              " 'I',\n",
              " 'said',\n",
              " 'that',\n",
              " 'I',\n",
              " \"can't\",\n",
              " 'believe',\n",
              " 'that',\n",
              " 'it',\n",
              " 'only',\n",
              " 'costs',\n",
              " '$',\n",
              " '19.95',\n",
              " '!',\n",
              " '</s>']"
            ]
          },
          "metadata": {},
          "execution_count": 11
        }
      ],
      "source": [
        "tok.set_content(doc.content()) # this could be any string\n",
        "[token for token in tok]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "TCNypbjRh7Uo"
      },
      "source": [
        "One thing that you likely immediately notice is the insertion of these pseudo-XML looking `<s>` and `</s>` tags. These are called \"sentence boundary tags\". As a side-effect, a default-construted `ICUTokenizer` discovers the sentences in a document by delimiting them with the sentence boundary tags. Let's try tokenizing a multi-sentence document to see what that looks like."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "-gAp_uwRh7VP",
        "outputId": "e40e744c-fd0f-4d55-ae09-c04271337053",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "['<s>',\n",
              " 'I',\n",
              " 'said',\n",
              " 'that',\n",
              " 'I',\n",
              " \"can't\",\n",
              " 'believe',\n",
              " 'that',\n",
              " 'it',\n",
              " 'only',\n",
              " 'costs',\n",
              " '$',\n",
              " '19.95',\n",
              " '!',\n",
              " '</s>',\n",
              " '<s>',\n",
              " 'I',\n",
              " 'could',\n",
              " 'only',\n",
              " 'find',\n",
              " 'it',\n",
              " 'for',\n",
              " 'more',\n",
              " 'than',\n",
              " '$',\n",
              " '30',\n",
              " 'before',\n",
              " '.',\n",
              " '</s>']"
            ]
          },
          "metadata": {},
          "execution_count": 12
        }
      ],
      "source": [
        "doc.content(\"I said that I can't believe that it only costs $19.95! I could only find it for more than $30 before.\")\n",
        "tok.set_content(doc.content())\n",
        "[token for token in tok]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "FJh8fpJhh7VS"
      },
      "source": [
        "Most of the information retrieval techniques you have likely been learning about in this class don't need to concern themselves with finding the boundaries between separate sentences in a document, but later today we'll explore a scenario where this might matter more.\n",
        "\n",
        "Let's pass a flag to the `ICUTokenizer` constructor to disable sentence boundary tags for now."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "dnRIWL5ih7VU",
        "outputId": "1ddd9346-bb06-4c80-ba42-af6d8903440f",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "['I',\n",
              " 'said',\n",
              " 'that',\n",
              " 'I',\n",
              " \"can't\",\n",
              " 'believe',\n",
              " 'that',\n",
              " 'it',\n",
              " 'only',\n",
              " 'costs',\n",
              " '$',\n",
              " '19.95',\n",
              " '!',\n",
              " 'I',\n",
              " 'could',\n",
              " 'only',\n",
              " 'find',\n",
              " 'it',\n",
              " 'for',\n",
              " 'more',\n",
              " 'than',\n",
              " '$',\n",
              " '30',\n",
              " 'before',\n",
              " '.']"
            ]
          },
          "metadata": {},
          "execution_count": 13
        }
      ],
      "source": [
        "tok = metapy.analyzers.ICUTokenizer(suppress_tags=True)\n",
        "tok.set_content(doc.content())\n",
        "[token for token in tok]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "I3fEMqrRh7VW"
      },
      "source": [
        "I mentioned earlier that MeTA treats tokenization as a *streaming* process, and that it *starts* with a tokenizer. As you've learned, for optimal search performance it's often beneficial to modify the raw underlying tokens of a document, and thus change its representation, before adding it to an inverted index structure for searching.\n",
        "\n",
        "The \"intermediate\" steps in the tokenization stream are represented with objects called Filters. Each filter consumes the content of a previous filter (or a tokenizer) and modifies the tokens coming out of the stream in some way.\n",
        "\n",
        "Let's start by using a simple filter that can help eliminate a lot of noise that we might encounter when tokenizing web documents: a `LengthFilter`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "RsEHE1O5h7VX",
        "outputId": "bc3f8d64-1054-4b02-aea0-1a2eb90df206",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "['said',\n",
              " 'that',\n",
              " \"can't\",\n",
              " 'believe',\n",
              " 'that',\n",
              " 'it',\n",
              " 'only',\n",
              " 'costs',\n",
              " '19.95',\n",
              " 'could',\n",
              " 'only',\n",
              " 'find',\n",
              " 'it',\n",
              " 'for',\n",
              " 'more',\n",
              " 'than',\n",
              " '30',\n",
              " 'before']"
            ]
          },
          "metadata": {},
          "execution_count": 14
        }
      ],
      "source": [
        "tok = metapy.analyzers.LengthFilter(tok, min=2, max=30)\n",
        "tok.set_content(doc.content())\n",
        "[token for token in tok]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "Elr_WlQ3h7VZ"
      },
      "source": [
        "Here, we can see that the `LengthFilter` is consuming our original `ICUTokenizer`. It modifies the token stream by only emitting tokens that are of a minimum length of 2 and a maximum length of 30. This can get rid of a lot of punctuation tokens, but also excessively long tokens such as URLs."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "B3fuuqENh7VZ"
      },
      "source": [
        "Another common trick is to remove stopwords. (Can anyone tell me what a stopword is?) In MeTA, this is done using a `ListFilter`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ElDf9nLph7Va"
      },
      "outputs": [],
      "source": [
        "%%capture\n",
        "!wget -nc https://raw.githubusercontent.com/meta-toolkit/meta/master/data/lemur-stopwords.txt"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "lJQ_MTPxh7WE",
        "outputId": "b0f36002-ee98-4163-a079-50581d1f608e",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[\"can't\", 'believe', 'costs', '19.95', 'find', '30']"
            ]
          },
          "metadata": {},
          "execution_count": 16
        }
      ],
      "source": [
        "tok = metapy.analyzers.ListFilter(tok, \"lemur-stopwords.txt\", metapy.analyzers.ListFilter.Type.Reject)\n",
        "tok.set_content(doc.content())\n",
        "[token for token in tok]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "K7gNGl96h7WE"
      },
      "source": [
        "Here we've downloaded a common list of stopwords obtained from the [Lemur project](http://lemurproject.org) and created a `ListFilter` to reject any tokens that occur in that list of words.\n",
        "\n",
        "You can see how much of a difference removing stopwords can make on the size of a document's token stream! This translates to a lot of space savings in the inverted index as well."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "QFaj8LNWh7WG"
      },
      "source": [
        "Another common filter that people use is called a stemmer, or lemmatizer. This kind of filter tries to modify individual tokens in such a way that different inflected forms of a word all reduce to the same representation. This lets you, for example, find documents about a \"run\" when you search \"running\" or \"runs\". A common stemmer is the [Porter2 Stemmer](http://snowball.tartarus.org/algorithms/english/stemmer.html), which MeTA has an implementation of. Let's try it!"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "gcMHZcdMh7WH",
        "outputId": "19640070-6dda-4348-e265-6f07425711c2",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[\"can't\", 'believ', 'cost', '19.95', 'find', '30']"
            ]
          },
          "metadata": {},
          "execution_count": 17
        }
      ],
      "source": [
        "tok = metapy.analyzers.Porter2Filter(tok)\n",
        "tok.set_content(doc.content())\n",
        "[token for token in tok]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "02_h-2iPh7WH"
      },
      "source": [
        "Notice how \"believe\" becomes \"believ\" and \"costs\" becomes \"cost\". Stemming can help search by allowing queries to return more matched documents by relaxing what it means for a document to match a query term. Note that it's important to ensure that queries are tokenized in the *exact same way* as your documents were before indexing them. If you ignore this, your query is unlikely to contain the raw token \"believ\" and you'll miss a lot of results."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "J6h4ppqoh7WH"
      },
      "source": [
        "Finally, after you've got the token stream configured the way you'd like, it's time to analyze the document by consuming each token from its token stream and performing some actions based on these tokens. In the simplest case, which often is enough for \"good enough\" search results, our action can simply be counting how many times these tokens occur.\n",
        "\n",
        "For clarity, let's switch back to a simpler token stream first. Write me a token stream that tokenizes using the Unicode standard, and then lowercases each token. (Hint: `help(metapy.analyzers)`.)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "4KiiZbQRh7WI",
        "outputId": "67a2c292-1c7f-45db-c147-d9a0b9855b59",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Help on module metapy.metapy.analyzers in metapy.metapy:\n",
            "\n",
            "NAME\n",
            "    metapy.metapy.analyzers\n",
            "\n",
            "CLASSES\n",
            "    pybind11_builtins.pybind11_object(builtins.object)\n",
            "        Analyzer\n",
            "            MultiAnalyzer\n",
            "            NGramPOSAnalyzer\n",
            "            NGramWordAnalyzer\n",
            "            TreeAnalyzer\n",
            "        TokenStream\n",
            "            AlphaFilter\n",
            "            CharacterTokenizer\n",
            "            EmptySentenceFilter\n",
            "            EnglishNormalizer\n",
            "            ICUFilter\n",
            "            ICUTokenizer\n",
            "            LengthFilter\n",
            "            ListFilter\n",
            "            LowercaseFilter\n",
            "            PennTreebankNormalizer\n",
            "            Porter2Filter\n",
            "            SentenceBoundaryAdder\n",
            "        TreeFeaturizer\n",
            "            BranchFeaturizer\n",
            "            DepthFeaturizer\n",
            "            SemiSkeletonFeaturizer\n",
            "            SkeletonFeaturizer\n",
            "            SubtreeFeaturizer\n",
            "            TagFeaturizer\n",
            "    \n",
            "    class AlphaFilter(TokenStream)\n",
            "     |  Method resolution order:\n",
            "     |      AlphaFilter\n",
            "     |      TokenStream\n",
            "     |      pybind11_builtins.pybind11_object\n",
            "     |      builtins.object\n",
            "     |  \n",
            "     |  Methods defined here:\n",
            "     |  \n",
            "     |  __init__(...)\n",
            "     |      __init__(self: metapy.metapy.analyzers.AlphaFilter, arg0: metapy.metapy.analyzers.TokenStream) -> None\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Methods inherited from TokenStream:\n",
            "     |  \n",
            "     |  __bool__(...)\n",
            "     |      __bool__(self: metapy.metapy.analyzers.TokenStream) -> bool\n",
            "     |  \n",
            "     |  __deepcopy__(...)\n",
            "     |      __deepcopy__(self: metapy.metapy.analyzers.TokenStream, arg0: dict) -> metapy.metapy.analyzers.TokenStream\n",
            "     |  \n",
            "     |  __iter__(...)\n",
            "     |      __iter__(self: object) -> py_token_stream_iterator\n",
            "     |  \n",
            "     |  next(...)\n",
            "     |      next(self: metapy.metapy.analyzers.TokenStream) -> str\n",
            "     |  \n",
            "     |  set_content(...)\n",
            "     |      set_content(self: metapy.metapy.analyzers.TokenStream, arg0: str) -> None\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Data and other attributes inherited from TokenStream:\n",
            "     |  \n",
            "     |  Iterator = <class 'metapy.metapy.analyzers.TokenStream.Iterator'>\n",
            "     |  \n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Static methods inherited from pybind11_builtins.pybind11_object:\n",
            "     |  \n",
            "     |  __new__(*args, **kwargs) from pybind11_builtins.pybind11_type\n",
            "     |      Create and return a new object.  See help(type) for accurate signature.\n",
            "    \n",
            "    class Analyzer(pybind11_builtins.pybind11_object)\n",
            "     |  Method resolution order:\n",
            "     |      Analyzer\n",
            "     |      pybind11_builtins.pybind11_object\n",
            "     |      builtins.object\n",
            "     |  \n",
            "     |  Methods defined here:\n",
            "     |  \n",
            "     |  __init__(...)\n",
            "     |      __init__(self: metapy.metapy.analyzers.Analyzer) -> None\n",
            "     |  \n",
            "     |  analyze(...)\n",
            "     |      analyze(self: metapy.metapy.analyzers.Analyzer, arg0: metapy.metapy.index.Document) -> dict<str, int>\n",
            "     |  \n",
            "     |  featurize(...)\n",
            "     |      featurize(self: metapy.metapy.analyzers.Analyzer, arg0: metapy.metapy.index.Document) -> dict<str, float>\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Static methods inherited from pybind11_builtins.pybind11_object:\n",
            "     |  \n",
            "     |  __new__(*args, **kwargs) from pybind11_builtins.pybind11_type\n",
            "     |      Create and return a new object.  See help(type) for accurate signature.\n",
            "    \n",
            "    class BranchFeaturizer(TreeFeaturizer)\n",
            "     |  Method resolution order:\n",
            "     |      BranchFeaturizer\n",
            "     |      TreeFeaturizer\n",
            "     |      pybind11_builtins.pybind11_object\n",
            "     |      builtins.object\n",
            "     |  \n",
            "     |  Methods defined here:\n",
            "     |  \n",
            "     |  __init__(...)\n",
            "     |      __init__(self: metapy.metapy.analyzers.BranchFeaturizer) -> None\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Methods inherited from TreeFeaturizer:\n",
            "     |  \n",
            "     |  tree_tokenize(...)\n",
            "     |      tree_tokenize(self: metapy.metapy.analyzers.TreeFeaturizer, arg0: meta::parser::parse_tree, arg1: meta::analyzers::featurizer) -> None\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Static methods inherited from pybind11_builtins.pybind11_object:\n",
            "     |  \n",
            "     |  __new__(*args, **kwargs) from pybind11_builtins.pybind11_type\n",
            "     |      Create and return a new object.  See help(type) for accurate signature.\n",
            "    \n",
            "    class CharacterTokenizer(TokenStream)\n",
            "     |  Method resolution order:\n",
            "     |      CharacterTokenizer\n",
            "     |      TokenStream\n",
            "     |      pybind11_builtins.pybind11_object\n",
            "     |      builtins.object\n",
            "     |  \n",
            "     |  Methods defined here:\n",
            "     |  \n",
            "     |  __init__(...)\n",
            "     |      __init__(self: metapy.metapy.analyzers.CharacterTokenizer) -> None\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Methods inherited from TokenStream:\n",
            "     |  \n",
            "     |  __bool__(...)\n",
            "     |      __bool__(self: metapy.metapy.analyzers.TokenStream) -> bool\n",
            "     |  \n",
            "     |  __deepcopy__(...)\n",
            "     |      __deepcopy__(self: metapy.metapy.analyzers.TokenStream, arg0: dict) -> metapy.metapy.analyzers.TokenStream\n",
            "     |  \n",
            "     |  __iter__(...)\n",
            "     |      __iter__(self: object) -> py_token_stream_iterator\n",
            "     |  \n",
            "     |  next(...)\n",
            "     |      next(self: metapy.metapy.analyzers.TokenStream) -> str\n",
            "     |  \n",
            "     |  set_content(...)\n",
            "     |      set_content(self: metapy.metapy.analyzers.TokenStream, arg0: str) -> None\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Data and other attributes inherited from TokenStream:\n",
            "     |  \n",
            "     |  Iterator = <class 'metapy.metapy.analyzers.TokenStream.Iterator'>\n",
            "     |  \n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Static methods inherited from pybind11_builtins.pybind11_object:\n",
            "     |  \n",
            "     |  __new__(*args, **kwargs) from pybind11_builtins.pybind11_type\n",
            "     |      Create and return a new object.  See help(type) for accurate signature.\n",
            "    \n",
            "    class DepthFeaturizer(TreeFeaturizer)\n",
            "     |  Method resolution order:\n",
            "     |      DepthFeaturizer\n",
            "     |      TreeFeaturizer\n",
            "     |      pybind11_builtins.pybind11_object\n",
            "     |      builtins.object\n",
            "     |  \n",
            "     |  Methods defined here:\n",
            "     |  \n",
            "     |  __init__(...)\n",
            "     |      __init__(self: metapy.metapy.analyzers.DepthFeaturizer) -> None\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Methods inherited from TreeFeaturizer:\n",
            "     |  \n",
            "     |  tree_tokenize(...)\n",
            "     |      tree_tokenize(self: metapy.metapy.analyzers.TreeFeaturizer, arg0: meta::parser::parse_tree, arg1: meta::analyzers::featurizer) -> None\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Static methods inherited from pybind11_builtins.pybind11_object:\n",
            "     |  \n",
            "     |  __new__(*args, **kwargs) from pybind11_builtins.pybind11_type\n",
            "     |      Create and return a new object.  See help(type) for accurate signature.\n",
            "    \n",
            "    class EmptySentenceFilter(TokenStream)\n",
            "     |  Method resolution order:\n",
            "     |      EmptySentenceFilter\n",
            "     |      TokenStream\n",
            "     |      pybind11_builtins.pybind11_object\n",
            "     |      builtins.object\n",
            "     |  \n",
            "     |  Methods defined here:\n",
            "     |  \n",
            "     |  __init__(...)\n",
            "     |      __init__(self: metapy.metapy.analyzers.EmptySentenceFilter, arg0: metapy.metapy.analyzers.TokenStream) -> None\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Methods inherited from TokenStream:\n",
            "     |  \n",
            "     |  __bool__(...)\n",
            "     |      __bool__(self: metapy.metapy.analyzers.TokenStream) -> bool\n",
            "     |  \n",
            "     |  __deepcopy__(...)\n",
            "     |      __deepcopy__(self: metapy.metapy.analyzers.TokenStream, arg0: dict) -> metapy.metapy.analyzers.TokenStream\n",
            "     |  \n",
            "     |  __iter__(...)\n",
            "     |      __iter__(self: object) -> py_token_stream_iterator\n",
            "     |  \n",
            "     |  next(...)\n",
            "     |      next(self: metapy.metapy.analyzers.TokenStream) -> str\n",
            "     |  \n",
            "     |  set_content(...)\n",
            "     |      set_content(self: metapy.metapy.analyzers.TokenStream, arg0: str) -> None\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Data and other attributes inherited from TokenStream:\n",
            "     |  \n",
            "     |  Iterator = <class 'metapy.metapy.analyzers.TokenStream.Iterator'>\n",
            "     |  \n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Static methods inherited from pybind11_builtins.pybind11_object:\n",
            "     |  \n",
            "     |  __new__(*args, **kwargs) from pybind11_builtins.pybind11_type\n",
            "     |      Create and return a new object.  See help(type) for accurate signature.\n",
            "    \n",
            "    class EnglishNormalizer(TokenStream)\n",
            "     |  Method resolution order:\n",
            "     |      EnglishNormalizer\n",
            "     |      TokenStream\n",
            "     |      pybind11_builtins.pybind11_object\n",
            "     |      builtins.object\n",
            "     |  \n",
            "     |  Methods defined here:\n",
            "     |  \n",
            "     |  __init__(...)\n",
            "     |      __init__(self: metapy.metapy.analyzers.EnglishNormalizer, arg0: metapy.metapy.analyzers.TokenStream) -> None\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Methods inherited from TokenStream:\n",
            "     |  \n",
            "     |  __bool__(...)\n",
            "     |      __bool__(self: metapy.metapy.analyzers.TokenStream) -> bool\n",
            "     |  \n",
            "     |  __deepcopy__(...)\n",
            "     |      __deepcopy__(self: metapy.metapy.analyzers.TokenStream, arg0: dict) -> metapy.metapy.analyzers.TokenStream\n",
            "     |  \n",
            "     |  __iter__(...)\n",
            "     |      __iter__(self: object) -> py_token_stream_iterator\n",
            "     |  \n",
            "     |  next(...)\n",
            "     |      next(self: metapy.metapy.analyzers.TokenStream) -> str\n",
            "     |  \n",
            "     |  set_content(...)\n",
            "     |      set_content(self: metapy.metapy.analyzers.TokenStream, arg0: str) -> None\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Data and other attributes inherited from TokenStream:\n",
            "     |  \n",
            "     |  Iterator = <class 'metapy.metapy.analyzers.TokenStream.Iterator'>\n",
            "     |  \n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Static methods inherited from pybind11_builtins.pybind11_object:\n",
            "     |  \n",
            "     |  __new__(*args, **kwargs) from pybind11_builtins.pybind11_type\n",
            "     |      Create and return a new object.  See help(type) for accurate signature.\n",
            "    \n",
            "    class ICUFilter(TokenStream)\n",
            "     |  Method resolution order:\n",
            "     |      ICUFilter\n",
            "     |      TokenStream\n",
            "     |      pybind11_builtins.pybind11_object\n",
            "     |      builtins.object\n",
            "     |  \n",
            "     |  Methods defined here:\n",
            "     |  \n",
            "     |  __init__(...)\n",
            "     |      __init__(self: metapy.metapy.analyzers.ICUFilter, arg0: metapy.metapy.analyzers.TokenStream, arg1: str) -> None\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Methods inherited from TokenStream:\n",
            "     |  \n",
            "     |  __bool__(...)\n",
            "     |      __bool__(self: metapy.metapy.analyzers.TokenStream) -> bool\n",
            "     |  \n",
            "     |  __deepcopy__(...)\n",
            "     |      __deepcopy__(self: metapy.metapy.analyzers.TokenStream, arg0: dict) -> metapy.metapy.analyzers.TokenStream\n",
            "     |  \n",
            "     |  __iter__(...)\n",
            "     |      __iter__(self: object) -> py_token_stream_iterator\n",
            "     |  \n",
            "     |  next(...)\n",
            "     |      next(self: metapy.metapy.analyzers.TokenStream) -> str\n",
            "     |  \n",
            "     |  set_content(...)\n",
            "     |      set_content(self: metapy.metapy.analyzers.TokenStream, arg0: str) -> None\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Data and other attributes inherited from TokenStream:\n",
            "     |  \n",
            "     |  Iterator = <class 'metapy.metapy.analyzers.TokenStream.Iterator'>\n",
            "     |  \n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Static methods inherited from pybind11_builtins.pybind11_object:\n",
            "     |  \n",
            "     |  __new__(*args, **kwargs) from pybind11_builtins.pybind11_type\n",
            "     |      Create and return a new object.  See help(type) for accurate signature.\n",
            "    \n",
            "    class ICUTokenizer(TokenStream)\n",
            "     |  Method resolution order:\n",
            "     |      ICUTokenizer\n",
            "     |      TokenStream\n",
            "     |      pybind11_builtins.pybind11_object\n",
            "     |      builtins.object\n",
            "     |  \n",
            "     |  Methods defined here:\n",
            "     |  \n",
            "     |  __init__(...)\n",
            "     |      __init__(self: metapy.metapy.analyzers.ICUTokenizer, suppress_tags: bool = False) -> None\n",
            "     |      \n",
            "     |      Creates a tokenizer using the UTF text segmentation standard\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Methods inherited from TokenStream:\n",
            "     |  \n",
            "     |  __bool__(...)\n",
            "     |      __bool__(self: metapy.metapy.analyzers.TokenStream) -> bool\n",
            "     |  \n",
            "     |  __deepcopy__(...)\n",
            "     |      __deepcopy__(self: metapy.metapy.analyzers.TokenStream, arg0: dict) -> metapy.metapy.analyzers.TokenStream\n",
            "     |  \n",
            "     |  __iter__(...)\n",
            "     |      __iter__(self: object) -> py_token_stream_iterator\n",
            "     |  \n",
            "     |  next(...)\n",
            "     |      next(self: metapy.metapy.analyzers.TokenStream) -> str\n",
            "     |  \n",
            "     |  set_content(...)\n",
            "     |      set_content(self: metapy.metapy.analyzers.TokenStream, arg0: str) -> None\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Data and other attributes inherited from TokenStream:\n",
            "     |  \n",
            "     |  Iterator = <class 'metapy.metapy.analyzers.TokenStream.Iterator'>\n",
            "     |  \n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Static methods inherited from pybind11_builtins.pybind11_object:\n",
            "     |  \n",
            "     |  __new__(*args, **kwargs) from pybind11_builtins.pybind11_type\n",
            "     |      Create and return a new object.  See help(type) for accurate signature.\n",
            "    \n",
            "    class LengthFilter(TokenStream)\n",
            "     |  Method resolution order:\n",
            "     |      LengthFilter\n",
            "     |      TokenStream\n",
            "     |      pybind11_builtins.pybind11_object\n",
            "     |      builtins.object\n",
            "     |  \n",
            "     |  Methods defined here:\n",
            "     |  \n",
            "     |  __init__(...)\n",
            "     |      __init__(self: metapy.metapy.analyzers.LengthFilter, source: metapy.metapy.analyzers.TokenStream, min: int, max: int) -> None\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Methods inherited from TokenStream:\n",
            "     |  \n",
            "     |  __bool__(...)\n",
            "     |      __bool__(self: metapy.metapy.analyzers.TokenStream) -> bool\n",
            "     |  \n",
            "     |  __deepcopy__(...)\n",
            "     |      __deepcopy__(self: metapy.metapy.analyzers.TokenStream, arg0: dict) -> metapy.metapy.analyzers.TokenStream\n",
            "     |  \n",
            "     |  __iter__(...)\n",
            "     |      __iter__(self: object) -> py_token_stream_iterator\n",
            "     |  \n",
            "     |  next(...)\n",
            "     |      next(self: metapy.metapy.analyzers.TokenStream) -> str\n",
            "     |  \n",
            "     |  set_content(...)\n",
            "     |      set_content(self: metapy.metapy.analyzers.TokenStream, arg0: str) -> None\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Data and other attributes inherited from TokenStream:\n",
            "     |  \n",
            "     |  Iterator = <class 'metapy.metapy.analyzers.TokenStream.Iterator'>\n",
            "     |  \n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Static methods inherited from pybind11_builtins.pybind11_object:\n",
            "     |  \n",
            "     |  __new__(*args, **kwargs) from pybind11_builtins.pybind11_type\n",
            "     |      Create and return a new object.  See help(type) for accurate signature.\n",
            "    \n",
            "    class ListFilter(TokenStream)\n",
            "     |  Method resolution order:\n",
            "     |      ListFilter\n",
            "     |      TokenStream\n",
            "     |      pybind11_builtins.pybind11_object\n",
            "     |      builtins.object\n",
            "     |  \n",
            "     |  Methods defined here:\n",
            "     |  \n",
            "     |  __init__(...)\n",
            "     |      __init__(self: metapy.metapy.analyzers.ListFilter, arg0: metapy.metapy.analyzers.TokenStream, arg1: str, arg2: metapy.metapy.analyzers.ListFilter.Type) -> None\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Data and other attributes defined here:\n",
            "     |  \n",
            "     |  Type = <class 'metapy.metapy.analyzers.ListFilter.Type'>\n",
            "     |      Members:\n",
            "     |      \n",
            "     |      Accept\n",
            "     |      \n",
            "     |      Reject\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Methods inherited from TokenStream:\n",
            "     |  \n",
            "     |  __bool__(...)\n",
            "     |      __bool__(self: metapy.metapy.analyzers.TokenStream) -> bool\n",
            "     |  \n",
            "     |  __deepcopy__(...)\n",
            "     |      __deepcopy__(self: metapy.metapy.analyzers.TokenStream, arg0: dict) -> metapy.metapy.analyzers.TokenStream\n",
            "     |  \n",
            "     |  __iter__(...)\n",
            "     |      __iter__(self: object) -> py_token_stream_iterator\n",
            "     |  \n",
            "     |  next(...)\n",
            "     |      next(self: metapy.metapy.analyzers.TokenStream) -> str\n",
            "     |  \n",
            "     |  set_content(...)\n",
            "     |      set_content(self: metapy.metapy.analyzers.TokenStream, arg0: str) -> None\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Data and other attributes inherited from TokenStream:\n",
            "     |  \n",
            "     |  Iterator = <class 'metapy.metapy.analyzers.TokenStream.Iterator'>\n",
            "     |  \n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Static methods inherited from pybind11_builtins.pybind11_object:\n",
            "     |  \n",
            "     |  __new__(*args, **kwargs) from pybind11_builtins.pybind11_type\n",
            "     |      Create and return a new object.  See help(type) for accurate signature.\n",
            "    \n",
            "    class LowercaseFilter(TokenStream)\n",
            "     |  Method resolution order:\n",
            "     |      LowercaseFilter\n",
            "     |      TokenStream\n",
            "     |      pybind11_builtins.pybind11_object\n",
            "     |      builtins.object\n",
            "     |  \n",
            "     |  Methods defined here:\n",
            "     |  \n",
            "     |  __init__(...)\n",
            "     |      __init__(self: metapy.metapy.analyzers.LowercaseFilter, arg0: metapy.metapy.analyzers.TokenStream) -> None\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Methods inherited from TokenStream:\n",
            "     |  \n",
            "     |  __bool__(...)\n",
            "     |      __bool__(self: metapy.metapy.analyzers.TokenStream) -> bool\n",
            "     |  \n",
            "     |  __deepcopy__(...)\n",
            "     |      __deepcopy__(self: metapy.metapy.analyzers.TokenStream, arg0: dict) -> metapy.metapy.analyzers.TokenStream\n",
            "     |  \n",
            "     |  __iter__(...)\n",
            "     |      __iter__(self: object) -> py_token_stream_iterator\n",
            "     |  \n",
            "     |  next(...)\n",
            "     |      next(self: metapy.metapy.analyzers.TokenStream) -> str\n",
            "     |  \n",
            "     |  set_content(...)\n",
            "     |      set_content(self: metapy.metapy.analyzers.TokenStream, arg0: str) -> None\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Data and other attributes inherited from TokenStream:\n",
            "     |  \n",
            "     |  Iterator = <class 'metapy.metapy.analyzers.TokenStream.Iterator'>\n",
            "     |  \n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Static methods inherited from pybind11_builtins.pybind11_object:\n",
            "     |  \n",
            "     |  __new__(*args, **kwargs) from pybind11_builtins.pybind11_type\n",
            "     |      Create and return a new object.  See help(type) for accurate signature.\n",
            "    \n",
            "    class MultiAnalyzer(Analyzer)\n",
            "     |  Method resolution order:\n",
            "     |      MultiAnalyzer\n",
            "     |      Analyzer\n",
            "     |      pybind11_builtins.pybind11_object\n",
            "     |      builtins.object\n",
            "     |  \n",
            "     |  Methods defined here:\n",
            "     |  \n",
            "     |  __init__(self, /, *args, **kwargs)\n",
            "     |      Initialize self.  See help(type(self)) for accurate signature.\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Methods inherited from Analyzer:\n",
            "     |  \n",
            "     |  analyze(...)\n",
            "     |      analyze(self: metapy.metapy.analyzers.Analyzer, arg0: metapy.metapy.index.Document) -> dict<str, int>\n",
            "     |  \n",
            "     |  featurize(...)\n",
            "     |      featurize(self: metapy.metapy.analyzers.Analyzer, arg0: metapy.metapy.index.Document) -> dict<str, float>\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Static methods inherited from pybind11_builtins.pybind11_object:\n",
            "     |  \n",
            "     |  __new__(*args, **kwargs) from pybind11_builtins.pybind11_type\n",
            "     |      Create and return a new object.  See help(type) for accurate signature.\n",
            "    \n",
            "    class NGramPOSAnalyzer(Analyzer)\n",
            "     |  Method resolution order:\n",
            "     |      NGramPOSAnalyzer\n",
            "     |      Analyzer\n",
            "     |      pybind11_builtins.pybind11_object\n",
            "     |      builtins.object\n",
            "     |  \n",
            "     |  Methods defined here:\n",
            "     |  \n",
            "     |  __init__(...)\n",
            "     |      __init__(self: metapy.metapy.analyzers.NGramPOSAnalyzer, arg0: int, arg1: metapy.metapy.analyzers.TokenStream, arg2: str) -> None\n",
            "     |  \n",
            "     |  analyze(...)\n",
            "     |      analyze(self: metapy.metapy.analyzers.NGramPOSAnalyzer, arg0: metapy.metapy.index.Document) -> object\n",
            "     |  \n",
            "     |  featurize(...)\n",
            "     |      featurize(self: metapy.metapy.analyzers.NGramPOSAnalyzer, arg0: metapy.metapy.index.Document) -> object\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Static methods inherited from pybind11_builtins.pybind11_object:\n",
            "     |  \n",
            "     |  __new__(*args, **kwargs) from pybind11_builtins.pybind11_type\n",
            "     |      Create and return a new object.  See help(type) for accurate signature.\n",
            "    \n",
            "    class NGramWordAnalyzer(Analyzer)\n",
            "     |  Method resolution order:\n",
            "     |      NGramWordAnalyzer\n",
            "     |      Analyzer\n",
            "     |      pybind11_builtins.pybind11_object\n",
            "     |      builtins.object\n",
            "     |  \n",
            "     |  Methods defined here:\n",
            "     |  \n",
            "     |  __init__(...)\n",
            "     |      __init__(self: metapy.metapy.analyzers.NGramWordAnalyzer, arg0: int, arg1: metapy.metapy.analyzers.TokenStream) -> None\n",
            "     |  \n",
            "     |  analyze(...)\n",
            "     |      analyze(self: metapy.metapy.analyzers.NGramWordAnalyzer, arg0: metapy.metapy.index.Document) -> object\n",
            "     |  \n",
            "     |  featurize(...)\n",
            "     |      featurize(self: metapy.metapy.analyzers.NGramWordAnalyzer, arg0: metapy.metapy.index.Document) -> object\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Static methods inherited from pybind11_builtins.pybind11_object:\n",
            "     |  \n",
            "     |  __new__(*args, **kwargs) from pybind11_builtins.pybind11_type\n",
            "     |      Create and return a new object.  See help(type) for accurate signature.\n",
            "    \n",
            "    class PennTreebankNormalizer(TokenStream)\n",
            "     |  Method resolution order:\n",
            "     |      PennTreebankNormalizer\n",
            "     |      TokenStream\n",
            "     |      pybind11_builtins.pybind11_object\n",
            "     |      builtins.object\n",
            "     |  \n",
            "     |  Methods defined here:\n",
            "     |  \n",
            "     |  __init__(...)\n",
            "     |      __init__(self: metapy.metapy.analyzers.PennTreebankNormalizer, arg0: metapy.metapy.analyzers.TokenStream) -> None\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Methods inherited from TokenStream:\n",
            "     |  \n",
            "     |  __bool__(...)\n",
            "     |      __bool__(self: metapy.metapy.analyzers.TokenStream) -> bool\n",
            "     |  \n",
            "     |  __deepcopy__(...)\n",
            "     |      __deepcopy__(self: metapy.metapy.analyzers.TokenStream, arg0: dict) -> metapy.metapy.analyzers.TokenStream\n",
            "     |  \n",
            "     |  __iter__(...)\n",
            "     |      __iter__(self: object) -> py_token_stream_iterator\n",
            "     |  \n",
            "     |  next(...)\n",
            "     |      next(self: metapy.metapy.analyzers.TokenStream) -> str\n",
            "     |  \n",
            "     |  set_content(...)\n",
            "     |      set_content(self: metapy.metapy.analyzers.TokenStream, arg0: str) -> None\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Data and other attributes inherited from TokenStream:\n",
            "     |  \n",
            "     |  Iterator = <class 'metapy.metapy.analyzers.TokenStream.Iterator'>\n",
            "     |  \n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Static methods inherited from pybind11_builtins.pybind11_object:\n",
            "     |  \n",
            "     |  __new__(*args, **kwargs) from pybind11_builtins.pybind11_type\n",
            "     |      Create and return a new object.  See help(type) for accurate signature.\n",
            "    \n",
            "    class Porter2Filter(TokenStream)\n",
            "     |  Method resolution order:\n",
            "     |      Porter2Filter\n",
            "     |      TokenStream\n",
            "     |      pybind11_builtins.pybind11_object\n",
            "     |      builtins.object\n",
            "     |  \n",
            "     |  Methods defined here:\n",
            "     |  \n",
            "     |  __init__(...)\n",
            "     |      __init__(self: metapy.metapy.analyzers.Porter2Filter, arg0: metapy.metapy.analyzers.TokenStream) -> None\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Methods inherited from TokenStream:\n",
            "     |  \n",
            "     |  __bool__(...)\n",
            "     |      __bool__(self: metapy.metapy.analyzers.TokenStream) -> bool\n",
            "     |  \n",
            "     |  __deepcopy__(...)\n",
            "     |      __deepcopy__(self: metapy.metapy.analyzers.TokenStream, arg0: dict) -> metapy.metapy.analyzers.TokenStream\n",
            "     |  \n",
            "     |  __iter__(...)\n",
            "     |      __iter__(self: object) -> py_token_stream_iterator\n",
            "     |  \n",
            "     |  next(...)\n",
            "     |      next(self: metapy.metapy.analyzers.TokenStream) -> str\n",
            "     |  \n",
            "     |  set_content(...)\n",
            "     |      set_content(self: metapy.metapy.analyzers.TokenStream, arg0: str) -> None\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Data and other attributes inherited from TokenStream:\n",
            "     |  \n",
            "     |  Iterator = <class 'metapy.metapy.analyzers.TokenStream.Iterator'>\n",
            "     |  \n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Static methods inherited from pybind11_builtins.pybind11_object:\n",
            "     |  \n",
            "     |  __new__(*args, **kwargs) from pybind11_builtins.pybind11_type\n",
            "     |      Create and return a new object.  See help(type) for accurate signature.\n",
            "    \n",
            "    class SemiSkeletonFeaturizer(TreeFeaturizer)\n",
            "     |  Method resolution order:\n",
            "     |      SemiSkeletonFeaturizer\n",
            "     |      TreeFeaturizer\n",
            "     |      pybind11_builtins.pybind11_object\n",
            "     |      builtins.object\n",
            "     |  \n",
            "     |  Methods defined here:\n",
            "     |  \n",
            "     |  __init__(...)\n",
            "     |      __init__(self: metapy.metapy.analyzers.SemiSkeletonFeaturizer) -> None\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Methods inherited from TreeFeaturizer:\n",
            "     |  \n",
            "     |  tree_tokenize(...)\n",
            "     |      tree_tokenize(self: metapy.metapy.analyzers.TreeFeaturizer, arg0: meta::parser::parse_tree, arg1: meta::analyzers::featurizer) -> None\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Static methods inherited from pybind11_builtins.pybind11_object:\n",
            "     |  \n",
            "     |  __new__(*args, **kwargs) from pybind11_builtins.pybind11_type\n",
            "     |      Create and return a new object.  See help(type) for accurate signature.\n",
            "    \n",
            "    class SentenceBoundaryAdder(TokenStream)\n",
            "     |  Method resolution order:\n",
            "     |      SentenceBoundaryAdder\n",
            "     |      TokenStream\n",
            "     |      pybind11_builtins.pybind11_object\n",
            "     |      builtins.object\n",
            "     |  \n",
            "     |  Methods defined here:\n",
            "     |  \n",
            "     |  __init__(...)\n",
            "     |      __init__(self: metapy.metapy.analyzers.SentenceBoundaryAdder, arg0: metapy.metapy.analyzers.TokenStream) -> None\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Methods inherited from TokenStream:\n",
            "     |  \n",
            "     |  __bool__(...)\n",
            "     |      __bool__(self: metapy.metapy.analyzers.TokenStream) -> bool\n",
            "     |  \n",
            "     |  __deepcopy__(...)\n",
            "     |      __deepcopy__(self: metapy.metapy.analyzers.TokenStream, arg0: dict) -> metapy.metapy.analyzers.TokenStream\n",
            "     |  \n",
            "     |  __iter__(...)\n",
            "     |      __iter__(self: object) -> py_token_stream_iterator\n",
            "     |  \n",
            "     |  next(...)\n",
            "     |      next(self: metapy.metapy.analyzers.TokenStream) -> str\n",
            "     |  \n",
            "     |  set_content(...)\n",
            "     |      set_content(self: metapy.metapy.analyzers.TokenStream, arg0: str) -> None\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Data and other attributes inherited from TokenStream:\n",
            "     |  \n",
            "     |  Iterator = <class 'metapy.metapy.analyzers.TokenStream.Iterator'>\n",
            "     |  \n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Static methods inherited from pybind11_builtins.pybind11_object:\n",
            "     |  \n",
            "     |  __new__(*args, **kwargs) from pybind11_builtins.pybind11_type\n",
            "     |      Create and return a new object.  See help(type) for accurate signature.\n",
            "    \n",
            "    class SkeletonFeaturizer(TreeFeaturizer)\n",
            "     |  Method resolution order:\n",
            "     |      SkeletonFeaturizer\n",
            "     |      TreeFeaturizer\n",
            "     |      pybind11_builtins.pybind11_object\n",
            "     |      builtins.object\n",
            "     |  \n",
            "     |  Methods defined here:\n",
            "     |  \n",
            "     |  __init__(...)\n",
            "     |      __init__(self: metapy.metapy.analyzers.SkeletonFeaturizer) -> None\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Methods inherited from TreeFeaturizer:\n",
            "     |  \n",
            "     |  tree_tokenize(...)\n",
            "     |      tree_tokenize(self: metapy.metapy.analyzers.TreeFeaturizer, arg0: meta::parser::parse_tree, arg1: meta::analyzers::featurizer) -> None\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Static methods inherited from pybind11_builtins.pybind11_object:\n",
            "     |  \n",
            "     |  __new__(*args, **kwargs) from pybind11_builtins.pybind11_type\n",
            "     |      Create and return a new object.  See help(type) for accurate signature.\n",
            "    \n",
            "    class SubtreeFeaturizer(TreeFeaturizer)\n",
            "     |  Method resolution order:\n",
            "     |      SubtreeFeaturizer\n",
            "     |      TreeFeaturizer\n",
            "     |      pybind11_builtins.pybind11_object\n",
            "     |      builtins.object\n",
            "     |  \n",
            "     |  Methods defined here:\n",
            "     |  \n",
            "     |  __init__(...)\n",
            "     |      __init__(self: metapy.metapy.analyzers.SubtreeFeaturizer) -> None\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Methods inherited from TreeFeaturizer:\n",
            "     |  \n",
            "     |  tree_tokenize(...)\n",
            "     |      tree_tokenize(self: metapy.metapy.analyzers.TreeFeaturizer, arg0: meta::parser::parse_tree, arg1: meta::analyzers::featurizer) -> None\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Static methods inherited from pybind11_builtins.pybind11_object:\n",
            "     |  \n",
            "     |  __new__(*args, **kwargs) from pybind11_builtins.pybind11_type\n",
            "     |      Create and return a new object.  See help(type) for accurate signature.\n",
            "    \n",
            "    class TagFeaturizer(TreeFeaturizer)\n",
            "     |  Method resolution order:\n",
            "     |      TagFeaturizer\n",
            "     |      TreeFeaturizer\n",
            "     |      pybind11_builtins.pybind11_object\n",
            "     |      builtins.object\n",
            "     |  \n",
            "     |  Methods defined here:\n",
            "     |  \n",
            "     |  __init__(...)\n",
            "     |      __init__(self: metapy.metapy.analyzers.TagFeaturizer) -> None\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Methods inherited from TreeFeaturizer:\n",
            "     |  \n",
            "     |  tree_tokenize(...)\n",
            "     |      tree_tokenize(self: metapy.metapy.analyzers.TreeFeaturizer, arg0: meta::parser::parse_tree, arg1: meta::analyzers::featurizer) -> None\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Static methods inherited from pybind11_builtins.pybind11_object:\n",
            "     |  \n",
            "     |  __new__(*args, **kwargs) from pybind11_builtins.pybind11_type\n",
            "     |      Create and return a new object.  See help(type) for accurate signature.\n",
            "    \n",
            "    class TokenStream(pybind11_builtins.pybind11_object)\n",
            "     |  Method resolution order:\n",
            "     |      TokenStream\n",
            "     |      pybind11_builtins.pybind11_object\n",
            "     |      builtins.object\n",
            "     |  \n",
            "     |  Methods defined here:\n",
            "     |  \n",
            "     |  __bool__(...)\n",
            "     |      __bool__(self: metapy.metapy.analyzers.TokenStream) -> bool\n",
            "     |  \n",
            "     |  __deepcopy__(...)\n",
            "     |      __deepcopy__(self: metapy.metapy.analyzers.TokenStream, arg0: dict) -> metapy.metapy.analyzers.TokenStream\n",
            "     |  \n",
            "     |  __init__(...)\n",
            "     |      __init__(self: metapy.metapy.analyzers.TokenStream) -> None\n",
            "     |  \n",
            "     |  __iter__(...)\n",
            "     |      __iter__(self: object) -> py_token_stream_iterator\n",
            "     |  \n",
            "     |  next(...)\n",
            "     |      next(self: metapy.metapy.analyzers.TokenStream) -> str\n",
            "     |  \n",
            "     |  set_content(...)\n",
            "     |      set_content(self: metapy.metapy.analyzers.TokenStream, arg0: str) -> None\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Data and other attributes defined here:\n",
            "     |  \n",
            "     |  Iterator = <class 'metapy.metapy.analyzers.TokenStream.Iterator'>\n",
            "     |  \n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Static methods inherited from pybind11_builtins.pybind11_object:\n",
            "     |  \n",
            "     |  __new__(*args, **kwargs) from pybind11_builtins.pybind11_type\n",
            "     |      Create and return a new object.  See help(type) for accurate signature.\n",
            "    \n",
            "    class TreeAnalyzer(Analyzer)\n",
            "     |  Method resolution order:\n",
            "     |      TreeAnalyzer\n",
            "     |      Analyzer\n",
            "     |      pybind11_builtins.pybind11_object\n",
            "     |      builtins.object\n",
            "     |  \n",
            "     |  Methods defined here:\n",
            "     |  \n",
            "     |  __init__(...)\n",
            "     |      __init__(self: metapy.metapy.analyzers.TreeAnalyzer, arg0: metapy.metapy.analyzers.TokenStream, arg1: str, arg2: str) -> None\n",
            "     |  \n",
            "     |  add(...)\n",
            "     |      add(self: metapy.metapy.analyzers.TreeAnalyzer, arg0: metapy.metapy.analyzers.TreeFeaturizer) -> None\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Methods inherited from Analyzer:\n",
            "     |  \n",
            "     |  analyze(...)\n",
            "     |      analyze(self: metapy.metapy.analyzers.Analyzer, arg0: metapy.metapy.index.Document) -> dict<str, int>\n",
            "     |  \n",
            "     |  featurize(...)\n",
            "     |      featurize(self: metapy.metapy.analyzers.Analyzer, arg0: metapy.metapy.index.Document) -> dict<str, float>\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Static methods inherited from pybind11_builtins.pybind11_object:\n",
            "     |  \n",
            "     |  __new__(*args, **kwargs) from pybind11_builtins.pybind11_type\n",
            "     |      Create and return a new object.  See help(type) for accurate signature.\n",
            "    \n",
            "    class TreeFeaturizer(pybind11_builtins.pybind11_object)\n",
            "     |  Method resolution order:\n",
            "     |      TreeFeaturizer\n",
            "     |      pybind11_builtins.pybind11_object\n",
            "     |      builtins.object\n",
            "     |  \n",
            "     |  Methods defined here:\n",
            "     |  \n",
            "     |  __init__(self, /, *args, **kwargs)\n",
            "     |      Initialize self.  See help(type(self)) for accurate signature.\n",
            "     |  \n",
            "     |  tree_tokenize(...)\n",
            "     |      tree_tokenize(self: metapy.metapy.analyzers.TreeFeaturizer, arg0: meta::parser::parse_tree, arg1: meta::analyzers::featurizer) -> None\n",
            "     |  \n",
            "     |  ----------------------------------------------------------------------\n",
            "     |  Static methods inherited from pybind11_builtins.pybind11_object:\n",
            "     |  \n",
            "     |  __new__(*args, **kwargs) from pybind11_builtins.pybind11_type\n",
            "     |      Create and return a new object.  See help(type) for accurate signature.\n",
            "\n",
            "FUNCTIONS\n",
            "    load(...) method of builtins.PyCapsule instance\n",
            "        load(arg0: str) -> metapy.metapy.analyzers.Analyzer\n",
            "    \n",
            "    register_filter(...) method of builtins.PyCapsule instance\n",
            "        register_filter(arg0: object) -> None\n",
            "\n",
            "FILE\n",
            "    (built-in)\n",
            "\n",
            "\n"
          ]
        }
      ],
      "source": [
        "help(metapy.analyzers)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "_EklwX8Nh7WK",
        "outputId": "b1aeef38-13cf-41fd-85e8-a7a7a84b87c5",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "['i',\n",
              " 'said',\n",
              " 'that',\n",
              " 'i',\n",
              " \"can't\",\n",
              " 'believe',\n",
              " 'that',\n",
              " 'it',\n",
              " 'only',\n",
              " 'costs',\n",
              " '$',\n",
              " '19.95',\n",
              " '!',\n",
              " 'i',\n",
              " 'could',\n",
              " 'only',\n",
              " 'find',\n",
              " 'it',\n",
              " 'for',\n",
              " 'more',\n",
              " 'than',\n",
              " '$',\n",
              " '30',\n",
              " 'before',\n",
              " '.']"
            ]
          },
          "metadata": {},
          "execution_count": 19
        }
      ],
      "source": [
        "tok = metapy.analyzers.ICUTokenizer(suppress_tags=True)\n",
        "tok = metapy.analyzers.LowercaseFilter(tok)\n",
        "tok.set_content(doc.content())\n",
        "[token for token in tok]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "xsbIsQUWh7WL"
      },
      "source": [
        "Now, let's count how often each individual token appears in the stream. You might have called this representation the \"bag of words\" representation, but it is also often called \"unigram word counts\". In MeTA, classes that consume a token stream and emit a document representation are called Analyzers."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "deletable": true,
        "editable": true,
        "scrolled": false,
        "id": "9v_FoSwkh7WL",
        "outputId": "482fcd10-3758-4933-8134-3cf5b3b878b8",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "I said that I can't believe that it only costs $19.95! I could only find it for more than $30 before.\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "{'for': 1,\n",
              " 'said': 1,\n",
              " '!': 1,\n",
              " 'only': 2,\n",
              " 'before': 1,\n",
              " 'i': 3,\n",
              " 'find': 1,\n",
              " \"can't\": 1,\n",
              " '19.95': 1,\n",
              " '.': 1,\n",
              " 'it': 2,\n",
              " 'that': 2,\n",
              " '30': 1,\n",
              " 'than': 1,\n",
              " 'costs': 1,\n",
              " 'more': 1,\n",
              " '$': 2,\n",
              " 'believe': 1,\n",
              " 'could': 1}"
            ]
          },
          "metadata": {},
          "execution_count": 20
        }
      ],
      "source": [
        "ana = metapy.analyzers.NGramWordAnalyzer(1, tok)\n",
        "print(doc.content())\n",
        "ana.analyze(doc)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "DKZ5xP4Zh7WM"
      },
      "source": [
        "If you noticed the name of the analyzer, you might have realized that you can count not just individual tokens, but groups of them. \"Unigram\" means \"1-gram\", and we count individual tokens. \"Bigram\" means \"2-gram\", and we count adjacent tokens together as a group. Let's try that now."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "BHrGxx2Kh7WM",
        "outputId": "172b43f3-65dc-4644-e00e-fdf6dbdadce3",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "{('30', 'before'): 1,\n",
              " ('costs', '$'): 1,\n",
              " ('$', '19.95'): 1,\n",
              " ('before', '.'): 1,\n",
              " ('$', '30'): 1,\n",
              " ('than', '$'): 1,\n",
              " ('only', 'find'): 1,\n",
              " ('that', 'it'): 1,\n",
              " ('find', 'it'): 1,\n",
              " ('it', 'for'): 1,\n",
              " ('for', 'more'): 1,\n",
              " ('could', 'only'): 1,\n",
              " ('only', 'costs'): 1,\n",
              " ('said', 'that'): 1,\n",
              " ('19.95', '!'): 1,\n",
              " (\"can't\", 'believe'): 1,\n",
              " ('more', 'than'): 1,\n",
              " ('i', 'could'): 1,\n",
              " ('that', 'i'): 1,\n",
              " ('i', 'said'): 1,\n",
              " ('i', \"can't\"): 1,\n",
              " ('!', 'i'): 1,\n",
              " ('it', 'only'): 1,\n",
              " ('believe', 'that'): 1}"
            ]
          },
          "metadata": {},
          "execution_count": 21
        }
      ],
      "source": [
        "ana = metapy.analyzers.NGramWordAnalyzer(2, tok)\n",
        "ana.analyze(doc)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "diGWb54Zh7WN"
      },
      "source": [
        "Now the individual \"tokens\" we're counting are pairs of tokens. You can analyze any n-gram of tokens you would like to in this way (and this is a simple way to attempt to support phrase search). Note, however, that as you increase the size of the n-grams you are counting, you are also increasing (exponentially!) the number of possible n-grams you could observe, so there's no free lunch here.\n",
        "\n",
        "This analysis pipeline feeds both the creation of the `InvertedIndex`, which is used for search applications, and the `ForwardIndex`, which is used for topic modeling and classification applications. For classification, sometimes looking at n-grams of characters is useful."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "KgQOtYONh7WN",
        "outputId": "752ee285-9b6b-4a18-adaa-827e5879814c",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "{('d', ' ', 'i', 't'): 1,\n",
              " ('v', 'e', ' ', 't'): 1,\n",
              " ('i', 't', ' ', 'o'): 1,\n",
              " ('f', 'o', 'r', 'e'): 1,\n",
              " ('$', '1', '9', '.'): 1,\n",
              " ('d', ' ', 'o', 'n'): 1,\n",
              " (' ', 'o', 'n', 'l'): 2,\n",
              " ('h', 'a', 'n', ' '): 1,\n",
              " ('o', 's', 't', 's'): 1,\n",
              " (' ', 'i', 't', ' '): 2,\n",
              " (' ', 'b', 'e', 'f'): 1,\n",
              " ('o', 'n', 'l', 'y'): 2,\n",
              " ('3', '0', ' ', 'b'): 1,\n",
              " (' ', 'I', ' ', 'c'): 2,\n",
              " ('a', 'n', ' ', '$'): 1,\n",
              " ('f', 'o', 'r', ' '): 1,\n",
              " ('n', 'l', 'y', ' '): 2,\n",
              " ('o', 'r', 'e', '.'): 1,\n",
              " ('t', ' ', 'i', 't'): 1,\n",
              " ('c', 'a', 'n', \"'\"): 1,\n",
              " ('s', 'a', 'i', 'd'): 1,\n",
              " ('t', 'h', 'a', 'n'): 1,\n",
              " ('i', 'd', ' ', 't'): 1,\n",
              " ('a', 't', ' ', 'i'): 1,\n",
              " ('e', ' ', 't', 'h'): 2,\n",
              " ('c', 'o', 's', 't'): 1,\n",
              " ('9', '5', '!', ' '): 1,\n",
              " ('l', 'y', ' ', 'c'): 1,\n",
              " ('$', '3', '0', ' '): 1,\n",
              " ('0', ' ', 'b', 'e'): 1,\n",
              " ('s', ' ', '$', '1'): 1,\n",
              " ('m', 'o', 'r', 'e'): 1,\n",
              " (' ', 'c', 'o', 'u'): 1,\n",
              " ('l', 'i', 'e', 'v'): 1,\n",
              " ('n', 'd', ' ', 'i'): 1,\n",
              " (\"'\", 't', ' ', 'b'): 1,\n",
              " ('i', 'e', 'v', 'e'): 1,\n",
              " ('n', \"'\", 't', ' '): 1,\n",
              " ('t', ' ', 'o', 'n'): 1,\n",
              " ('e', 'v', 'e', ' '): 1,\n",
              " ('e', 'l', 'i', 'e'): 1,\n",
              " ('y', ' ', 'c', 'o'): 1,\n",
              " ('t', ' ', 'I', ' '): 1,\n",
              " ('9', '.', '9', '5'): 1,\n",
              " (' ', '$', '1', '9'): 1,\n",
              " (' ', 'f', 'i', 'n'): 1,\n",
              " ('s', 't', 's', ' '): 1,\n",
              " (' ', 'f', 'o', 'r'): 1,\n",
              " ('I', ' ', 's', 'a'): 1,\n",
              " ('.', '9', '5', '!'): 1,\n",
              " ('a', 'i', 'd', ' '): 1,\n",
              " (' ', 'c', 'a', 'n'): 1,\n",
              " (' ', 's', 'a', 'i'): 1,\n",
              " ('o', 'r', 'e', ' '): 1,\n",
              " (' ', 'm', 'o', 'r'): 1,\n",
              " ('r', ' ', 'm', 'o'): 1,\n",
              " (' ', 't', 'h', 'a'): 3,\n",
              " ('a', 'n', \"'\", 't'): 1,\n",
              " ('t', 's', ' ', '$'): 1,\n",
              " ('a', 't', ' ', 'I'): 1,\n",
              " ('I', ' ', 'c', 'a'): 1,\n",
              " ('5', '!', ' ', 'I'): 1,\n",
              " ('l', 'y', ' ', 'f'): 1,\n",
              " ('!', ' ', 'I', ' '): 1,\n",
              " (' ', '$', '3', '0'): 1,\n",
              " (' ', 'c', 'o', 's'): 1,\n",
              " ('c', 'o', 'u', 'l'): 1,\n",
              " ('b', 'e', 'f', 'o'): 1,\n",
              " ('t', 'h', 'a', 't'): 2,\n",
              " ('t', ' ', 'f', 'o'): 1,\n",
              " ('l', 'd', ' ', 'o'): 1,\n",
              " ('f', 'i', 'n', 'd'): 1,\n",
              " ('t', ' ', 'b', 'e'): 1,\n",
              " ('u', 'l', 'd', ' '): 1,\n",
              " ('n', ' ', '$', '3'): 1,\n",
              " (' ', 'b', 'e', 'l'): 1,\n",
              " ('r', 'e', ' ', 't'): 1,\n",
              " ('o', 'r', ' ', 'm'): 1,\n",
              " ('i', 't', ' ', 'f'): 1,\n",
              " ('d', ' ', 't', 'h'): 1,\n",
              " ('o', 'u', 'l', 'd'): 1,\n",
              " ('e', 'f', 'o', 'r'): 1,\n",
              " ('I', ' ', 'c', 'o'): 1,\n",
              " ('b', 'e', 'l', 'i'): 1,\n",
              " ('1', '9', '.', '9'): 1,\n",
              " ('i', 'n', 'd', ' '): 1,\n",
              " ('h', 'a', 't', ' '): 2,\n",
              " ('y', ' ', 'f', 'i'): 1}"
            ]
          },
          "metadata": {},
          "execution_count": 22
        }
      ],
      "source": [
        "tok = metapy.analyzers.CharacterTokenizer()\n",
        "ana = metapy.analyzers.NGramWordAnalyzer(4, tok)\n",
        "ana.analyze(doc)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "DtrZbXbYh7W2"
      },
      "source": [
        "Different analyzers can be combined together to create document representations that have many unique perspectives. Once things start to get more complicated, we recommend using a configuration file to specify each of the analyzers you wish to combine for your document representation."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "lxpvrDFch7W4"
      },
      "source": [
        "Now, let's explore something a little bit different. MeTA also has a natural language processing (NLP) component, which currently supports two major NLP tasks: part-of-speech tagging and syntactic parsing.\n",
        "\n",
        "(Does anyone know what part-of-speech tagging is?) POS tagging is a task in NLP that involves identifying a type for each word in a sentence. For example, POS tagging can be used to identify all of the nouns in a sentence, or all of the verbs, or adjectives, or... This is useful as first step towards developing an understanding of the meaning of a particular sentence."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "2g9G80tCh7W6"
      },
      "source": [
        "MeTA places its POS tagging component in its \"sequences\" library. Let's play with some sequences first to get an idea of how they work. We'll start of by creating a sequence."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": true,
        "deletable": true,
        "editable": true,
        "id": "waIzgMIqh7W7"
      },
      "outputs": [],
      "source": [
        "seq = metapy.sequence.Sequence()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "uX_TZGxwh7W8"
      },
      "source": [
        "Now, we can add individual words to this sequence. Sequences consist of a list of `Observation`s, which are essentially (word, tag) pairs. If we don't yet know the tags for a `Sequence`, we can just add individual words and leave the tags unset. Words are called \"symbols\" in the library terminology."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "R4-hSfekh7W8",
        "outputId": "141c472a-4419-4b44-c932-1d29c86ceebc",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "(The, ???), (dog, ???), (ran, ???), (across, ???), (the, ???), (park, ???), (., ???)\n"
          ]
        }
      ],
      "source": [
        "for word in [\"The\", \"dog\", \"ran\", \"across\", \"the\", \"park\", \".\"]:\n",
        "    seq.add_symbol(word)\n",
        "print(seq)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "vHCjkiJgh7W8"
      },
      "source": [
        "The printed form of the sequence shows that we do not yet know the tags for each word. Let's fill them in by using a pre-trained POS-tagger model that's distributed with MeTA."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "c8iUNLQnh7W9"
      },
      "outputs": [],
      "source": [
        "%%capture\n",
        "!wget -nc https://github.com/meta-toolkit/meta/releases/download/v3.0.1/greedy-perceptron-tagger.tar.gz\n",
        "!tar xvf greedy-perceptron-tagger.tar.gz"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": true,
        "deletable": true,
        "editable": true,
        "id": "YsAqsTrHh7W-"
      },
      "outputs": [],
      "source": [
        "tagger = metapy.sequence.PerceptronTagger(\"perceptron-tagger/\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "eZOEf18Ih7W-"
      },
      "source": [
        "Now let's fill in the missing tags in our sentence based on the best guess this model has."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "7J8VIBlLh7W_",
        "outputId": "d2b21b40-a061-43bf-ac6c-0adeda56bd2f",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "(The, DT), (dog, NN), (ran, VBD), (across, IN), (the, DT), (park, NN), (., .)\n"
          ]
        }
      ],
      "source": [
        "tagger.tag(seq)\n",
        "print(seq)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "RAj-NSgfh7W_"
      },
      "source": [
        "Each tag indicates the type of a word, and this particular tagger was trained to output the tags present in the [Penn Treebank tagset](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).\n",
        "\n",
        "But what if we want to POS-tag a document?"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "FQVefwXah7XA",
        "outputId": "b3fecbcf-a31e-4ea8-e2ac-6b376fe8bd31",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "I said that I can't believe that it only costs $19.95! I could only find it for more than $30 before.\n"
          ]
        }
      ],
      "source": [
        "print(doc.content())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "67F0POfJh7XA"
      },
      "source": [
        "We need a way of going from a document to a list of `Sequence`s, each representing an individual sentence. I'll get you started."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "taDSmQqlh7XA",
        "outputId": "d243a6e4-8a5f-4ec3-af24-d0020f4cb8e4",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "['<s>',\n",
              " 'I',\n",
              " 'said',\n",
              " 'that',\n",
              " 'I',\n",
              " 'ca',\n",
              " \"n't\",\n",
              " 'believe',\n",
              " 'that',\n",
              " 'it',\n",
              " 'only',\n",
              " 'costs',\n",
              " '$',\n",
              " '19.95',\n",
              " '!',\n",
              " '</s>',\n",
              " '<s>',\n",
              " 'I',\n",
              " 'could',\n",
              " 'only',\n",
              " 'find',\n",
              " 'it',\n",
              " 'for',\n",
              " 'more',\n",
              " 'than',\n",
              " '$',\n",
              " '30',\n",
              " 'before',\n",
              " '.',\n",
              " '</s>']"
            ]
          },
          "metadata": {},
          "execution_count": 29
        }
      ],
      "source": [
        "tok = metapy.analyzers.ICUTokenizer() # keep sentence boundaries!\n",
        "tok = metapy.analyzers.PennTreebankNormalizer(tok)\n",
        "tok.set_content(doc.content())\n",
        "[token for token in tok]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "Wa6Kq18Ah7XB"
      },
      "source": [
        "(Notice that the `PennTreebankNormalizer` modifies some tokens to better match the conventions of the Penn Treebank training data. This should help improve performance a little.)\n",
        "\n",
        "Now, write me a function that can take a token stream that contains sentence boundary tags and returns a list of `Sequence` objects. Don't include the sentence boundary tags in the actual `Sequence` objects."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": true,
        "deletable": true,
        "editable": true,
        "id": "QN3jO2h4h7Xm"
      },
      "outputs": [],
      "source": [
        "def extract_sequences(tok):\n",
        "    sequences = []\n",
        "    for token in tok:\n",
        "        if token == '<s>':\n",
        "            sequences.append(metapy.sequence.Sequence())\n",
        "        elif token != '</s>':\n",
        "            sequences[-1].add_symbol(token)            \n",
        "    return sequences"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "nRSh3WPhh7Xr",
        "outputId": "38b01b58-2144-4b0e-aa22-0a38dc4c71c3",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "(I, PRP), (said, VBD), (that, IN), (I, PRP), (ca, MD), (n't, RB), (believe, VB), (that, IN), (it, PRP), (only, RB), (costs, VBZ), ($, $), (19.95, CD), (!, .)\n",
            "(I, PRP), (could, MD), (only, RB), (find, VB), (it, PRP), (for, IN), (more, JJR), (than, IN), ($, $), (30, CD), (before, IN), (., .)\n"
          ]
        }
      ],
      "source": [
        "tok.set_content(doc.content())\n",
        "for seq in extract_sequences(tok):\n",
        "    tagger.tag(seq)\n",
        "    print(seq)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "XaeY6oS3h7Xs"
      },
      "source": [
        "This is still a rather shallow understanding of these sentences. The next major leap is to parse these sequences of POS-tagged words to obtain a tree for each sentence. These trees, in our case, will represent the hierarchical phrase structure of a single sentence by grouping together tokens that belong to one phrase together, and showing how small phrases combine into larger phrases, and eventually a sentence.\n",
        "\n",
        "Let's try parsing the sentences in our document using a pre-tranned constituency parser that's distributed with MeTA."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "GOGvWfS1h7Xu"
      },
      "outputs": [],
      "source": [
        "%%capture\n",
        "!wget -nc https://github.com/meta-toolkit/meta/releases/download/v3.0.1/greedy-constituency-parser.tar.gz\n",
        "!tar xvf greedy-constituency-parser.tar.gz"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "GNX3_bxFh7Xv"
      },
      "outputs": [],
      "source": [
        "parser = metapy.parser.Parser(\"parser/\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "WMKW0q6zh7Xv",
        "outputId": "b2b33a8e-cd07-420c-e44a-6e1799205a34",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "I could only find it for more than $ 30 before .\n",
            "(I, PRP), (could, MD), (only, RB), (find, VB), (it, PRP), (for, IN), (more, JJR), (than, IN), ($, $), (30, CD), (before, IN), (., .)\n",
            "(ROOT\n",
            "  (S\n",
            "    (NP (PRP I))\n",
            "    (VP\n",
            "      (MD could)\n",
            "      (ADVP (RB only))\n",
            "      (VP\n",
            "        (VB find)\n",
            "        (NP (PRP it))\n",
            "        (PP\n",
            "          (IN for)\n",
            "          (NP\n",
            "            (QP\n",
            "              (JJR more)\n",
            "              (IN than)\n",
            "              ($ $)\n",
            "              (CD 30))))\n",
            "        (ADVP (IN before))))\n",
            "    (. .)))\n",
            "\n"
          ]
        }
      ],
      "source": [
        "print(' '.join([obs.symbol for obs in seq]))\n",
        "print(seq)\n",
        "tree = parser.parse(seq)\n",
        "print(tree.pretty_str())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "25V4jV7mh7Xx"
      },
      "source": [
        "(You can also play with this with a [prettier online demo](https://meta-toolkit.org/nlp-demo.html).)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "fW9hlnfch7Xx"
      },
      "source": [
        "We can now parse all of the sentences in our document."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "vqvXaKB5h7Xx",
        "outputId": "f7fd5338-abe6-44eb-d633-55e981a63d65",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "(ROOT\n",
            "  (S\n",
            "    (NP (PRP I))\n",
            "    (VP\n",
            "      (VBD said)\n",
            "      (SBAR\n",
            "        (IN that)\n",
            "        (S\n",
            "          (NP (PRP I))\n",
            "          (VP\n",
            "            (MD ca)\n",
            "            (RB n't)\n",
            "            (VP\n",
            "              (VB believe)\n",
            "              (SBAR\n",
            "                (IN that)\n",
            "                (S\n",
            "                  (NP (PRP it))\n",
            "                  (ADVP (RB only))\n",
            "                  (VP\n",
            "                    (VBZ costs)\n",
            "                    (NP\n",
            "                      ($ $)\n",
            "                      (CD 19.95))))))))))\n",
            "    (. !)))\n",
            "\n",
            "(ROOT\n",
            "  (S\n",
            "    (NP (PRP I))\n",
            "    (VP\n",
            "      (MD could)\n",
            "      (ADVP (RB only))\n",
            "      (VP\n",
            "        (VB find)\n",
            "        (NP (PRP it))\n",
            "        (PP\n",
            "          (IN for)\n",
            "          (NP\n",
            "            (QP\n",
            "              (JJR more)\n",
            "              (IN than)\n",
            "              ($ $)\n",
            "              (CD 30))))\n",
            "        (ADVP (IN before))))\n",
            "    (. .)))\n",
            "\n"
          ]
        }
      ],
      "source": [
        "tok.set_content(doc.content())\n",
        "for seq in extract_sequences(tok):\n",
        "    tagger.tag(seq)\n",
        "    print(parser.parse(seq).pretty_str())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "VaG3Keylh7YY"
      },
      "source": [
        "Now that we know how to build these phrase structure trees from POS-tagged sentences extracted from raw text, let's explore a simple way we might be able to exploit this knowledge to help a downstream task.\n",
        "\n",
        "Our goal is going to be to extract the Subject-Verb-Object triples from some simple sentences. This will allow us to understand who is doing what to whom, which is knowledge that might be useful for lots of downstream tasks as diverse as question answering to stock market prediction. We should be able to extract these from our constituency parses. (This, of course, isn't the only way, and this method is quite naive. However, the implementation is simple enough that I think you should be able to grasp it in a single lecture.)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "vAxphjfFh7YY"
      },
      "source": [
        "First, let's grab our sample data. This is a collection of BBC news headlines that will serve as our \"simple\" sentences."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "rn5_OKPCh7YZ"
      },
      "outputs": [],
      "source": [
        "%%capture\n",
        "!wget -nc https://meta-toolkit.org/data/2017-03-27/headlines.tar.gz # please be nice!\n",
        "!tar xvf headlines.tar.gz"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "lDIEB2Lqh7Yb",
        "outputId": "570b5feb-222c-4a6d-f734-5ddc3c63b4d0",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "\n",
            "README:\n",
            "http://mlg.ucd.ie/datasets/bbc.html\n",
            "\n",
            "Exactracted first sentence of each doc from this dataset.\n"
          ]
        }
      ],
      "source": [
        "!echo \"\" && echo \"README:\"\n",
        "!cat headlines/README.md"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "qjMKao6Rh7Yb"
      },
      "source": [
        "Let's look at the first headline of the business category."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "dEGL_gBhh7Yb",
        "outputId": "a3c7abf3-1670-4024-d010-62c09f568299",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 36
        }
      },
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "'Brazil approves bankruptcy reform'"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            }
          },
          "metadata": {},
          "execution_count": 38
        }
      ],
      "source": [
        "with open(\"headlines/business.txt\") as f:\n",
        "    business = f.readlines()\n",
        "business[0].strip()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "YIuGQb6rh7Yc"
      },
      "source": [
        "This looks simple enough. Let's see how it gets tagged."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "t1RbCECph7Yc",
        "outputId": "4830666b-5e25-4036-a6e9-1e6245356d18",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "(Brazil, NNP), (approves, VBZ), (bankruptcy, NN), (reform, NN)\n"
          ]
        }
      ],
      "source": [
        "tok.set_content(business[0].strip())\n",
        "sequence = extract_sequences(tok)[0]\n",
        "tagger.tag(sequence)\n",
        "print(sequence)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "-9BRWMeuh7Ye"
      },
      "source": [
        "Let's also parse it."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "ckVNpNhzh7Yf",
        "outputId": "40664218-c6ca-43f0-dec6-d9ba8f51a76e",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "(ROOT\n",
            "  (S\n",
            "    (NP (NNP Brazil))\n",
            "    (VP\n",
            "      (VBZ approves)\n",
            "      (NP\n",
            "        (NN bankruptcy)\n",
            "        (NN reform)))))\n",
            "\n"
          ]
        }
      ],
      "source": [
        "tree = parser.parse(sequence)\n",
        "print(tree.pretty_str())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "Lq9yoFxAh7Yf"
      },
      "source": [
        "Great. We can now start to develop our technique. We can see that the subject here is the first noun phrase (NP), the verb is the first verb-like token in the VP, and the object is the NP within that VP.\n",
        "\n",
        "We're going to need to traverse this tree to extract what we want. MeTA supports this by exploiting the [Visitor pattern](https://en.wikipedia.org/wiki/Visitor_pattern), so the easiest way for us to get at what we're looking for is to write some classes that encapsulate the traversal we want to perform and keep track of things within this tree that we are interested in.\n",
        "\n",
        "Let's write our first simple visitor that traverses the tree to find the first NP node, at which point it will stop and store the root of that subtree."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "Qx65zLpKh7Yg",
        "outputId": "ff0a86aa-66ed-4465-8592-75bd55089d2b",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Help on class Visitor in module metapy.metapy.parser:\n",
            "\n",
            "class Visitor(pybind11_builtins.pybind11_object)\n",
            " |  Method resolution order:\n",
            " |      Visitor\n",
            " |      pybind11_builtins.pybind11_object\n",
            " |      builtins.object\n",
            " |  \n",
            " |  Methods defined here:\n",
            " |  \n",
            " |  __init__(...)\n",
            " |      __init__(self: metapy.metapy.parser.Visitor) -> None\n",
            " |  \n",
            " |  visit_internal(...)\n",
            " |      visit_internal(self: metapy.metapy.parser.Visitor, arg0: metapy.metapy.parser.InternalNode) -> object\n",
            " |  \n",
            " |  visit_leaf(...)\n",
            " |      visit_leaf(self: metapy.metapy.parser.Visitor, arg0: metapy.metapy.parser.LeafNode) -> object\n",
            " |  \n",
            " |  ----------------------------------------------------------------------\n",
            " |  Static methods inherited from pybind11_builtins.pybind11_object:\n",
            " |  \n",
            " |  __new__(*args, **kwargs) from pybind11_builtins.pybind11_type\n",
            " |      Create and return a new object.  See help(type) for accurate signature.\n",
            "\n"
          ]
        }
      ],
      "source": [
        "help(metapy.parser.Visitor)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": true,
        "deletable": true,
        "editable": true,
        "id": "fbvaxV34h7Yh"
      },
      "outputs": [],
      "source": [
        "class NounPhraseFinder(metapy.parser.Visitor):\n",
        "    def __init__(self):\n",
        "        self.node = None\n",
        "        super(NounPhraseFinder, self).__init__() # required; invoke base class __init__\n",
        "        \n",
        "    def visit_leaf(self, node):\n",
        "        pass # we don't care about leaf nodes\n",
        "    \n",
        "    def visit_internal(self, node):\n",
        "        if self.node:\n",
        "            return\n",
        "\n",
        "        # we do care about internal nodes; check if it is an NP\n",
        "        if node.category() == 'NP':\n",
        "            # store this node and stop the traversal\n",
        "            self.node = node\n",
        "        else:\n",
        "            # continue traversing by visiting all of the child nodes\n",
        "            node.each_child(lambda child: child.accept(self))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "qUA6nfnth7Yi",
        "outputId": "6e09e3a4-1661-42cb-a976-ccd4b26c2042",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "NP with 1 child(ren)\n"
          ]
        }
      ],
      "source": [
        "npf = NounPhraseFinder()\n",
        "tree.visit(npf)\n",
        "print(\"{} with {} child(ren)\".format(npf.node.category(), npf.node.num_children()))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "poJmag_xh7Yj"
      },
      "source": [
        "Now that we have that working, we should be able to make a more generic PhraseFinder that finds the first internal node that matches a specific node category. We'll need one for finding the first NP and one for finding the first VP anyway, so this will be helpful."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": true,
        "deletable": true,
        "editable": true,
        "id": "tgK2UT7dh7Yj"
      },
      "outputs": [],
      "source": [
        "class PhraseFinder(metapy.parser.Visitor):\n",
        "    def __init__(self, category):\n",
        "        super(PhraseFinder, self).__init__()\n",
        "        self.node = None\n",
        "        self.category = category\n",
        "        \n",
        "    def visit_leaf(self, node):\n",
        "        pass # we don't care about leaf nodes\n",
        "    \n",
        "    def visit_internal(self, node):\n",
        "        if self.node:\n",
        "            return\n",
        "        \n",
        "        if node.category() == self.category:\n",
        "            self.node = node\n",
        "        else:\n",
        "            node.each_child(lambda child: child.accept(self))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "Js67ckbhh7Yk",
        "outputId": "49a7c32c-454f-44ac-8c51-c342bbd784ee",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "NP with 1 child(ren)\n",
            "VP with 2 child(ren)\n"
          ]
        }
      ],
      "source": [
        "npf = PhraseFinder('NP')\n",
        "vpf = PhraseFinder('VP')\n",
        "tree.visit(npf)\n",
        "tree.visit(vpf)\n",
        "for node in [npf.node, vpf.node]:\n",
        "    print(\"{} with {} child(ren)\".format(node.category(), node.num_children()))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "93h_wQjfh7ZK"
      },
      "source": [
        "Now that we can find the first internal node matching a category label, we need to set about extracting the actual leaf nodes we care about. Fortunately there is already a visitor that can extract all leaf nodes from a subtree, so we can use that to get started.\n",
        "\n",
        "From the first noun phrase, we want to extract all leaf nodes that are noun-like tags and join them together to make up our subject."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "5-QRK0Zmh7ZL",
        "outputId": "35e05515-32be-4293-a1f8-f7257ba4f79c",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Brazil\n"
          ]
        }
      ],
      "source": [
        "noun_tags = set(['NN', 'NNS', 'NNP', 'NNPS'])\n",
        "lnf = metapy.parser.LeafNodeFinder()\n",
        "npf.node.accept(lnf)\n",
        "subject = ' '.join([leaf.word() for leaf in lnf.leaves() if leaf.category() in noun_tags])\n",
        "print(subject)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "r8QbrGDNh7ZM"
      },
      "source": [
        "And from the first verb phrase, we want to extract (1) the first verb-like leaf node to be the verb and (2) the noun-like tags in the first NP that occurs within that VP. We should be able to re-use some existing code we've already written."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "iWW_uPdlh7ZO",
        "outputId": "3ac6a399-fb9c-4b58-838c-a1e400c55dee",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "approves\n"
          ]
        }
      ],
      "source": [
        "verb_tags = set(['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'])\n",
        "lnf = metapy.parser.LeafNodeFinder()\n",
        "vpf.node.accept(lnf)\n",
        "verb = next(leaf.word() for leaf in lnf.leaves() if leaf.category() in verb_tags)\n",
        "print(verb)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "6rYCJIEEh7ZP",
        "outputId": "c83b284d-cf66-4925-818c-99f38ae74362",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "bankruptcy reform\n"
          ]
        }
      ],
      "source": [
        "np_finder = PhraseFinder('NP')\n",
        "vpf.node.accept(np_finder)\n",
        "lnf = metapy.parser.LeafNodeFinder()\n",
        "np_finder.node.accept(lnf)\n",
        "obj = ' '.join([leaf.word() for leaf in lnf.leaves() if leaf.category() in noun_tags])\n",
        "print(obj)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "iDPBzbbyh7ZP",
        "outputId": "f3e3f5b7-ba32-47ad-e0d0-a2e341956ec4",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "SUBJ: Brazil VERB: approves OBJ: bankruptcy reform\n"
          ]
        }
      ],
      "source": [
        "print(\"SUBJ: {} VERB: {} OBJ: {}\".format(subject, verb, obj))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "ac2Ek_h1h7ZQ"
      },
      "source": [
        "Putting this all together, we can write a visitor to extract (SUBJ, VERB, OBJ) triples."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": true,
        "deletable": true,
        "editable": true,
        "id": "D2XC0K8qh7ZQ"
      },
      "outputs": [],
      "source": [
        "class SVOExtractor(metapy.parser.Visitor):\n",
        "    noun_tags = set(['NN', 'NNS', 'NNP', 'NNPS'])\n",
        "    verb_tags = set(['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'])    \n",
        "    \n",
        "    def __init__(self):\n",
        "        super(SVOExtractor, self).__init__()\n",
        "        self.subject = self.verb = self.object = None\n",
        "        \n",
        "    def extract_noun_tagged_words(self, node):\n",
        "        lnf = metapy.parser.LeafNodeFinder()\n",
        "        node.accept(lnf)\n",
        "        return ' '.join([leaf.word() for leaf in lnf.leaves() if leaf.category() in noun_tags])\n",
        "        \n",
        "    def visit_leaf(self, node):\n",
        "        pass # don't care about leaf nodes\n",
        "    \n",
        "    def visit_internal(self, node):\n",
        "        # find and handle the first NP\n",
        "        first_np = PhraseFinder('NP')        \n",
        "        node.accept(first_np)\n",
        "        if first_np.node:\n",
        "            self.subject = self.extract_noun_tagged_words(first_np.node)\n",
        "        \n",
        "        # find and handle the first VP\n",
        "        first_vp = PhraseFinder('VP')\n",
        "        node.accept(first_vp)\n",
        "        \n",
        "        if first_vp.node:\n",
        "            # find the first NP within the first VP\n",
        "            vp_first_np = PhraseFinder('NP')\n",
        "            first_vp.node.accept(vp_first_np)\n",
        "            \n",
        "            if vp_first_np.node:\n",
        "                self.object = self.extract_noun_tagged_words(vp_first_np.node)\n",
        "            \n",
        "            lnf = metapy.parser.LeafNodeFinder()\n",
        "            first_vp.node.accept(lnf)\n",
        "            for leaf in lnf.leaves():\n",
        "                if leaf.category() in verb_tags:\n",
        "                    self.verb = leaf.word()\n",
        "                    break\n",
        "        "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "deletable": true,
        "editable": true,
        "id": "dN7mO_0Nh7ZR",
        "outputId": "321c493a-6ca3-497d-debe-32d611a1912f",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Brazil approves bankruptcy reform\n",
            "SUBJ: Brazil VERB: approves OBJ: bankruptcy reform\n",
            "German business confidence slides\n",
            "SUBJ: business confidence slides VERB: None OBJ: None\n",
            "Dollar slides ahead of New Year\n",
            "SUBJ: Dollar slides New Year VERB: None OBJ: None\n",
            "Aviation firms eye booming India\n",
            "SUBJ: Aviation firms VERB: eye OBJ: India\n",
            "Metlife buys up Citigroup insurer\n",
            "SUBJ: Metlife VERB: buys OBJ: Citigroup insurer\n",
            "US economy still growing says Fed\n",
            "SUBJ:  VERB: says OBJ: None\n",
            "Russia WTO talks 'make progress'\n",
            "SUBJ: Russia WTO talks make progress VERB: None OBJ: None\n",
            "Deadline nears for Fiat-GM deal\n",
            "SUBJ: Deadline VERB: nears OBJ: Fiat GM deal\n",
            "Five million Germans out of work\n",
            "SUBJ: Germans work VERB: None OBJ: None\n",
            "Jobs go at Oracle after takeover\n",
            "SUBJ: Jobs VERB: go OBJ: Oracle\n",
            "Asian banks halt dollar's slide\n",
            "SUBJ: banks VERB: halt OBJ: dollar slide\n",
            "Markets signal Brazilian recovery\n",
            "SUBJ: Markets VERB: signal OBJ: recovery\n",
            "GE sees 'excellent' world economy\n",
            "SUBJ: GE VERB: sees OBJ: world economy\n",
            "Q&A: Malcolm Glazer and Man Utd\n",
            "SUBJ: Q A Malcolm Glazer Man Utd VERB: None OBJ: None\n",
            "China continues rapid growth\n",
            "SUBJ: China VERB: continues OBJ: growth\n",
            "M&S cuts prices by average of 24%\n",
            "SUBJ: M S cuts prices average % VERB: None OBJ: None\n",
            "Trial begins of Spain's top banker\n",
            "SUBJ: Trial VERB: begins OBJ: Spain banker\n",
            "Malaysia lifts Islamic bank limit\n",
            "SUBJ: Malaysia VERB: lifts OBJ: Islamic bank limit\n",
            "Giant waves damage S Asia economy\n",
            "SUBJ: Giant VERB: waves OBJ: damage S Asia economy\n",
            "Europe asks Asia for euro help\n",
            "SUBJ: Europe VERB: asks OBJ: Asia\n",
            "Troubled Marsh under SEC scrutiny\n",
            "SUBJ: Marsh SEC scrutiny VERB: None OBJ: None\n",
            "US to probe airline travel chaos\n",
            "SUBJ: airline travel chaos VERB: probe OBJ: airline travel chaos\n",
            "China's Shanda buys stake in Sina\n",
            "SUBJ: China Shanda VERB: buys OBJ: stake Sina\n",
            "Strong demand triggers oil rally\n",
            "SUBJ: demand VERB: triggers OBJ: oil rally\n",
            "Karachi stocks hit historic high\n",
            "SUBJ: Karachi stocks VERB: hit OBJ: None\n",
            "Senior Fannie Mae bosses resign\n",
            "SUBJ: Fannie Mae bosses VERB: resign OBJ: None\n",
            "Bank opts to leave rates on hold\n",
            "SUBJ: Bank VERB: opts OBJ: rates\n",
            "Standard Life cuts policy bonuses\n",
            "SUBJ: Life cuts policy bonuses VERB: None OBJ: None\n",
            "Ukraine revisits state sell-offs\n",
            "SUBJ: Ukraine VERB: revisits OBJ: state\n",
            "Venezuela reviews foreign deals\n",
            "SUBJ: Venezuela VERB: reviews OBJ: deals\n",
            "Fed chief warning on US deficit\n",
            "SUBJ: Fed chief VERB: deficit OBJ: None\n",
            "US industrial output growth eases\n",
            "SUBJ: output growth eases VERB: None OBJ: None\n",
            "IMF 'cuts' German growth estimate\n",
            "SUBJ: IMF cuts growth estimate VERB: None OBJ: None\n",
            "Electronics firms eye plasma deal\n",
            "SUBJ: Electronics firms VERB: eye OBJ: plasma deal\n",
            "Ethiopia's crop production up 24%\n",
            "SUBJ: Ethiopia crop production % VERB: None OBJ: None\n",
            "India widens access to telecoms\n",
            "SUBJ: India VERB: widens OBJ: access telecoms\n",
            "'Standoff' on Deutsche's LSE bid\n",
            "SUBJ: Standoff Deutsche LSE bid VERB: None OBJ: None\n",
            "JP Morgan admits US slavery links\n",
            "SUBJ: JP Morgan VERB: admits OBJ: slavery links\n",
            "Home loan approvals rising again\n",
            "SUBJ: Home loan approvals VERB: rising OBJ: None\n",
            "Bank payout to Pinochet victims\n",
            "SUBJ: Bank payout Pinochet victims VERB: None OBJ: None\n",
            "India's Reliance family feud heats up\n",
            "SUBJ: India Reliance family feud VERB: heats OBJ: None\n",
            "Tsunami cost hits Jakarta shares\n",
            "SUBJ: Tsunami cost VERB: hits OBJ: Jakarta shares\n",
            "ECB holds rates amid growth fears\n",
            "SUBJ: ECB VERB: holds OBJ: rates\n",
            "Ad sales boost Time Warner profit\n",
            "SUBJ: Ad sales boost Time Warner profit VERB: None OBJ: None\n",
            "Deutsche attacks Yukos case\n",
            "SUBJ: Deutsche attacks Yukos case VERB: None OBJ: None\n",
            "Latin America sees strong growth\n",
            "SUBJ: Latin America VERB: sees OBJ: growth\n",
            "Chinese exports rise 25% in 2004\n",
            "SUBJ: exports VERB: rise OBJ: %\n",
            "Mixed signals from French economy\n",
            "SUBJ: signals economy VERB: None OBJ: None\n",
            "Parmalat to return to stockmarket\n",
            "SUBJ: Parmalat VERB: return OBJ: None\n",
            "Krispy Kreme shares hit\n",
            "SUBJ: Krispy Kreme shares VERB: hit OBJ: None\n",
            "Watchdog probes Vivendi bond sale\n",
            "SUBJ: Watchdog VERB: probes OBJ: Vivendi bond sale\n",
            "Khodorkovsky ally denies charges\n",
            "SUBJ: Khodorkovsky VERB: denies OBJ: charges\n",
            "Profits jump at China's top bank\n",
            "SUBJ: Profits VERB: jump OBJ: China bank\n",
            "Parmalat bank barred from suing\n",
            "SUBJ: Parmalat bank VERB: barred OBJ: \n",
            "Germany nears 1990 jobless level\n",
            "SUBJ: Germany VERB: nears OBJ: level\n",
            "Peugeot deal boosts Mitsubishi\n",
            "SUBJ: Peugeot deal boosts Mitsubishi VERB: None OBJ: None\n",
            "WorldCom director admits lying\n",
            "SUBJ: WorldCom director VERB: admits OBJ: None\n",
            "Ask Jeeves tips online ad revival\n",
            "SUBJ: Jeeves tips VERB: Ask OBJ: Jeeves tips\n",
            "Euro firms miss out on optimism\n",
            "SUBJ: Euro firms VERB: miss OBJ: optimism\n",
            "Business fears over sluggish EU economy\n",
            "SUBJ: Business fears EU economy VERB: None OBJ: None\n",
            "Bush to get 'tough' on deficit\n",
            "SUBJ: Bush VERB: get OBJ: deficit\n",
            "SEC to rethink post-Enron rules\n",
            "SUBJ: SEC VERB: rethink OBJ: post\n",
            "EU 'too slow' on economic reforms\n",
            "SUBJ: EU reforms VERB: None OBJ: None\n",
            "Amex shares up on spin-off news\n",
            "SUBJ: Amex shares spin news VERB: None OBJ: None\n",
            "US bank boss hails 'genius' Smith\n",
            "SUBJ: bank boss genius Smith VERB: hails OBJ: genius Smith\n",
            "Mystery surrounds new Yukos owner\n",
            "SUBJ: Mystery VERB: surrounds OBJ: Yukos owner\n",
            "Mexican in US send $16bn home\n",
            "SUBJ:  VERB: send OBJ: home\n",
            "Euronext 'poised to make LSE bid'\n",
            "SUBJ: LSE bid VERB: None OBJ: None\n",
            "Parmalat founder offers apology\n",
            "SUBJ: Parmalat founder VERB: offers OBJ: apology\n",
            "Turkey-Iran mobile deal 'at risk'\n",
            "SUBJ: Turkey Iran deal risk VERB: None OBJ: None\n",
            "Brazil plays down Varig rescue\n",
            "SUBJ: Brazil VERB: plays OBJ: Varig rescue\n",
            "High fuel prices hit BA's profits\n",
            "SUBJ: fuel prices VERB: hit OBJ: BA profits\n",
            "Bombardier chief to leave company\n",
            "SUBJ: Bombardier chief company VERB: leave OBJ: company\n",
            "Christmas shoppers flock to tills\n",
            "SUBJ: Christmas shoppers VERB: flock OBJ: tills\n",
            "Jobs growth still slow in the US\n",
            "SUBJ: Jobs growth VERB: None OBJ: \n",
            "Winter freeze keeps oil above $50\n",
            "SUBJ: Winter freeze VERB: keeps OBJ: oil\n",
            "US prepares for hybrid onslaught\n",
            "SUBJ:  VERB: prepares OBJ: hybrid onslaught\n",
            "Indonesians face fuel price rise\n",
            "SUBJ: Indonesians VERB: face OBJ: fuel price rise\n",
            "Car giant hit by Mercedes slump\n",
            "SUBJ: Car giant Mercedes slump VERB: hit OBJ: Mercedes slump\n",
            "Bank voted 8-1 for no rate change\n",
            "SUBJ: Bank VERB: voted OBJ: rate change\n",
            "Brazil buy boosts Belgium's Inbev\n",
            "SUBJ: Brazil VERB: buy OBJ: boosts Belgium Inbev\n",
            "Tsunami slows Sri Lanka's growth\n",
            "SUBJ: Tsunami VERB: slows OBJ: Sri Lanka growth\n",
            "Japanese growth grinds to a halt\n",
            "SUBJ: growth VERB: grinds OBJ: halt\n",
            "Shares rise on new Man Utd offer\n",
            "SUBJ: Shares VERB: rise OBJ: Man Utd offer\n",
            "Takeover rumour lifts Exel shares\n",
            "SUBJ: Takeover rumour VERB: lifts OBJ: Exel shares\n",
            "Executive trio leave Aer Lingus\n",
            "SUBJ: Executive trio VERB: leave OBJ: Aer Lingus\n",
            "Businesses fail to plan for HIV\n",
            "SUBJ: Businesses VERB: fail OBJ: HIV\n",
            "Iranian MPs threaten mobile deal\n",
            "SUBJ: MPs VERB: threaten OBJ: deal\n",
            "Air passengers win new EU rights\n",
            "SUBJ: Air passengers VERB: win OBJ: EU rights\n",
            "Yukos drops banks from court bid\n",
            "SUBJ: Yukos drops banks court bid VERB: None OBJ: None\n",
            "Yukos seeks court action on sale\n",
            "SUBJ: Yukos VERB: seeks OBJ: court action sale\n",
            "Lesotho textile workers lose jobs\n",
            "SUBJ: Lesotho textile workers VERB: lose OBJ: jobs\n",
            "Oil rebounds from weather effect\n",
            "SUBJ: Oil rebounds weather effect VERB: None OBJ: None\n",
            "Cairn shares up on new oil find\n",
            "SUBJ: Cairn shares oil find VERB: None OBJ: None\n",
            "Huge rush for Jet Airways shares\n",
            "SUBJ: rush Jet Airways shares VERB: None OBJ: None\n",
            "Brewers' profits lose their fizz\n",
            "SUBJ: Brewers profits VERB: lose OBJ: fizz\n",
            "Winemaker rejects Foster's offer\n",
            "SUBJ: Winemaker VERB: rejects OBJ: Foster offer\n",
            "No seasonal lift for house market\n",
            "SUBJ: lift house market VERB: None OBJ: None\n",
            "Unilever shake up as profit slips\n",
            "SUBJ: Unilever VERB: shake OBJ: profit slips\n",
            "Japan's ageing workforce: built to last\n",
            "SUBJ: Japan workforce VERB: built OBJ: None\n",
            "US interest rates increased to 2%\n",
            "SUBJ: interest rates % VERB: increased OBJ: %\n",
            "Train strike grips Buenos Aires\n",
            "SUBJ: Train strike grips Buenos Aires VERB: None OBJ: None\n",
            "Axa Sun Life cuts bonus payments\n",
            "SUBJ: Axa Sun Life cuts bonus payments VERB: None OBJ: None\n",
            "Russian oil merger excludes Yukos\n",
            "SUBJ: oil merger VERB: excludes OBJ: Yukos\n",
            "Wembley firm won't make a profit\n",
            "SUBJ: Wembley firm VERB: make OBJ: profit\n",
            "Millions 'to lose textile jobs'\n",
            "SUBJ: Millions textile jobs VERB: lose OBJ: textile jobs\n",
            "BMW reveals new models pipeline\n",
            "SUBJ: BMW VERB: reveals OBJ: models pipeline\n",
            "French boss to leave EADS\n",
            "SUBJ: boss EADS VERB: leave OBJ: EADS\n",
            "Irish markets reach all-time high\n",
            "SUBJ: markets VERB: reach OBJ: time\n",
            "Argentina, Venezuela in oil deal\n",
            "SUBJ: Argentina Venezuela oil deal VERB: None OBJ: None\n",
            "Japan economy slides to recession\n",
            "SUBJ: Japan economy VERB: slides OBJ: recession\n",
            "Green reports shun supply chain\n",
            "SUBJ: reports VERB: shun OBJ: supply chain\n",
            "Georgia plans hidden asset pardon\n",
            "SUBJ: Georgia VERB: plans OBJ: asset pardon\n",
            "India's Deccan gets more planes\n",
            "SUBJ: India Deccan VERB: gets OBJ: planes\n",
            "Yukos unit fetches $9bn at auction\n",
            "SUBJ: Yukos unit VERB: fetches OBJ: \n",
            "Bargain calls widen Softbank loss\n",
            "SUBJ: Bargain VERB: calls OBJ: Softbank loss\n",
            "Lufthansa may sue over Bush visit\n",
            "SUBJ: Lufthansa VERB: sue OBJ: Bush visit\n",
            "Stormy year for property insurers\n",
            "SUBJ: Stormy year property insurers VERB: None OBJ: None\n",
            "G7 backs Africa debt relief plan\n",
            "SUBJ: G7 VERB: backs OBJ: Africa debt relief plan\n",
            "Boeing unveils new 777 aircraft\n",
            "SUBJ: Boeing VERB: unveils OBJ: aircraft\n",
            "Steady job growth continues in US\n",
            "SUBJ: job growth VERB: continues OBJ: \n",
            "China suspends 26 power projects\n",
            "SUBJ: China VERB: suspends OBJ: power projects\n",
            "UK firm faces Venezuelan land row\n",
            "SUBJ: UK firm VERB: faces OBJ: land row\n",
            "J&J agrees $25bn Guidant deal\n",
            "SUBJ: J J VERB: agrees OBJ: Guidant deal\n",
            "S Korean consumers spending again\n",
            "SUBJ:  VERB: None OBJ: None\n",
            "Industrial revival hope for Japan\n",
            "SUBJ: revival hope Japan VERB: None OBJ: None\n",
            "UK young top Euro earnings league\n",
            "SUBJ: UK top Euro earnings league VERB: None OBJ: None\n",
            "Dollar hits new low versus euro\n",
            "SUBJ: Dollar VERB: hits OBJ: versus euro\n",
            "Turkey knocks six zeros off lira\n",
            "SUBJ: Turkey VERB: knocks OBJ: zeros lira\n",
            "House prices drop as sales slow\n",
            "SUBJ: House prices VERB: drop OBJ: sales\n",
            "Pension hitch for long-living men\n",
            "SUBJ: Pension hitch living men VERB: None OBJ: None\n",
            "Ban on forced retirement under 65\n",
            "SUBJ: Ban retirement VERB: None OBJ: None\n",
            "UK Coal plunges into deeper loss\n",
            "SUBJ: UK Coal VERB: plunges OBJ: loss\n",
            "US Airways staff agree to pay cut\n",
            "SUBJ: Airways staff cut VERB: agree OBJ: cut\n",
            "US firm pulls out of Iraq\n",
            "SUBJ:  VERB: firm OBJ: Iraq\n",
            "Card fraudsters 'targeting web'\n",
            "SUBJ: Card fraudsters targeting web VERB: None OBJ: None\n",
            "Putin backs state grab for Yukos\n",
            "SUBJ: Putin VERB: backs OBJ: state grab Yukos\n",
            "Iraq to invite phone licence bids\n",
            "SUBJ: Iraq phone licence bids VERB: invite OBJ: phone licence bids\n",
            "Go-ahead for Balkan oil pipeline\n",
            "SUBJ: oil pipeline VERB: Go OBJ: oil pipeline\n",
            "Mixed Christmas for US retailers\n",
            "SUBJ: Christmas US retailers VERB: None OBJ: None\n",
            "Irish duo could block Man Utd bid\n",
            "SUBJ: duo VERB: block OBJ: Man Utd bid\n",
            "Nortel in $300m profit revision\n",
            "SUBJ: Nortel profit revision VERB: None OBJ: None\n",
            "China now top trader with Japan\n",
            "SUBJ: China trader Japan VERB: None OBJ: None\n",
            "Hyundai to build new India plant\n",
            "SUBJ: Hyundai India plant VERB: build OBJ: India plant\n",
            "House prices suffer festive fall\n",
            "SUBJ: House prices VERB: suffer OBJ: fall\n",
            "Yukos bankruptcy 'not US matter'\n",
            "SUBJ: Yukos bankruptcy VERB: None OBJ: None\n",
            "Millions go missing at China bank\n",
            "SUBJ: Millions VERB: go OBJ: China bank\n",
            "Markets fall on weak dollar fears\n",
            "SUBJ: Markets VERB: fall OBJ: dollar fears\n",
            "German bidder in talks with LSE\n",
            "SUBJ: bidder talks LSE VERB: None OBJ: None\n",
            "Khodorkovsky quits Yukos shares\n",
            "SUBJ: Khodorkovsky VERB: quits OBJ: Yukos shares\n",
            "AstraZeneca hit by drug failure\n",
            "SUBJ: AstraZeneca drug failure VERB: hit OBJ: drug failure\n",
            "BT offers equal access to rivals\n",
            "SUBJ: BT VERB: offers OBJ: access rivals\n",
            "US consumer confidence up\n",
            "SUBJ:  VERB: consumer OBJ: confidence\n",
            "BA to suspend two Saudi services\n",
            "SUBJ: BA Saudi services VERB: suspend OBJ: Saudi services\n",
            "Honda wins China copyright ruling\n",
            "SUBJ: Honda VERB: wins OBJ: China copyright ruling\n",
            "Ebbers denies WorldCom fraud\n",
            "SUBJ: Ebbers VERB: denies OBJ: WorldCom fraud\n",
            "Parmalat sues 45 banks over crash\n",
            "SUBJ: Parmalat VERB: sues OBJ: banks\n",
            "Yukos accused of lying to court\n",
            "SUBJ: Yukos VERB: accused OBJ: court\n",
            "Making your office work for you\n",
            "SUBJ: office work VERB: Making OBJ: office work\n",
            "Economy 'strong' in election year\n",
            "SUBJ: Economy election year VERB: None OBJ: None\n",
            "Survey confirms property slowdown\n",
            "SUBJ: Survey VERB: confirms OBJ: property slowdown\n",
            "The 'ticking budget' facing the US\n",
            "SUBJ: ticking budget VERB: facing OBJ: \n",
            "Ukraine strikes Turkmen gas deal\n",
            "SUBJ: Ukraine VERB: strikes OBJ: Turkmen gas deal\n",
            "Mitsubishi in Peugeot link talks\n",
            "SUBJ: Mitsubishi Peugeot link talks VERB: None OBJ: None\n",
            "Golden rule 'intact' says ex-aide\n",
            "SUBJ: Golden rule aide VERB: says OBJ: aide\n",
            "Irish company hit by Iraqi report\n",
            "SUBJ: company Iraqi report VERB: hit OBJ: Iraqi report\n",
            "Fiat chief takes steering wheel\n",
            "SUBJ: Fiat chief VERB: takes OBJ: steering wheel\n",
            "Bat spit drug firm goes to market\n",
            "SUBJ: Bat spit drug firm VERB: goes OBJ: market\n",
            "Sales 'fail to boost High Street'\n",
            "SUBJ: Sales fail High Street VERB: boost OBJ: High Street\n",
            "Indy buys into India paper\n",
            "SUBJ: buys India paper VERB: None OBJ: None\n",
            "Industrial output falls in Japan\n",
            "SUBJ: output VERB: falls OBJ: Japan\n",
            "VW considers opening Indian plant\n",
            "SUBJ: VW VERB: considers OBJ: plant\n",
            "US retail sales surge in December\n",
            "SUBJ: sales surge December VERB: None OBJ: None\n",
            "LSE 'sets date for takeover deal'\n",
            "SUBJ: LSE sets date takeover deal VERB: None OBJ: None\n",
            "Glazer makes new Man Utd approach\n",
            "SUBJ: Glazer VERB: makes OBJ: Man Utd approach\n",
            "French suitor holds LSE meeting\n",
            "SUBJ: suitor VERB: holds OBJ: LSE meeting\n",
            "Booming markets shed few tears\n",
            "SUBJ: markets tears VERB: Booming OBJ: markets tears\n",
            "McDonald's boss Bell dies aged 44\n",
            "SUBJ: McDonald boss Bell VERB: dies OBJ: \n",
            "Wall Street cheers Bush victory\n",
            "SUBJ: Wall Street cheers Bush victory VERB: None OBJ: None\n",
            "US Ahold suppliers face charges\n",
            "SUBJ: Ahold suppliers charges VERB: face OBJ: charges\n",
            "Italy to get economic action plan\n",
            "SUBJ: Italy VERB: get OBJ: action plan\n",
            "India calls for fair trade rules\n",
            "SUBJ: India VERB: calls OBJ: trade rules\n",
            "US trade gap hits record in 2004\n",
            "SUBJ:  VERB: trade OBJ: gap hits record\n",
            "Britannia members' £42m windfall\n",
            "SUBJ: Britannia members £ windfall VERB: None OBJ: None\n",
            "Manufacturing recovery 'slowing'\n",
            "SUBJ: recovery VERB: Manufacturing OBJ: recovery\n",
            "Asia shares defy post-quake gloom\n",
            "SUBJ: Asia shares VERB: defy OBJ: quake gloom\n",
            "Weak dollar hits Reuters\n",
            "SUBJ: dollar hits Reuters VERB: None OBJ: None\n",
            "Cannabis hopes for drug firm\n",
            "SUBJ: Cannabis VERB: hopes OBJ: drug firm\n",
            "India's rupee hits five-year high\n",
            "SUBJ: India rupee VERB: hits OBJ: year\n",
            "Yangtze Electric's profits double\n",
            "SUBJ: Yangtze Electric profits VERB: None OBJ: None\n",
            "Dollar drops on reserves concerns\n",
            "SUBJ: Dollar drops reserves concerns VERB: None OBJ: None\n",
            "Worldcom ex-boss launches defence\n",
            "SUBJ: Worldcom boss launches defence VERB: None OBJ: None\n",
            "Google shares fall as staff sell\n",
            "SUBJ: Google shares VERB: fall OBJ: staff sell\n",
            "Absa and Barclays talks continue\n",
            "SUBJ: Absa Barclays talks VERB: continue OBJ: None\n",
            "US regulator to rule on pain drug\n",
            "SUBJ:  VERB: regulator OBJ: pain drug\n",
            "Hariri killing hits Beirut shares\n",
            "SUBJ: Hariri hits Beirut shares VERB: killing OBJ: hits Beirut shares\n",
            "Jobs growth still slow in the US\n",
            "SUBJ: Jobs growth VERB: None OBJ: \n",
            "S Korea spending boost to economy\n",
            "SUBJ: S Korea spending boost economy VERB: None OBJ: None\n",
            "Marsh executive in guilty plea\n",
            "SUBJ: Marsh executive plea VERB: None OBJ: None\n",
            "India seeks to boost construction\n",
            "SUBJ: India VERB: seeks OBJ: construction\n",
            "Tokyo says deflation 'controlled'\n",
            "SUBJ: Tokyo VERB: says OBJ: deflation\n",
            "Stock market eyes Japan recovery\n",
            "SUBJ: Stock market eyes Japan recovery VERB: None OBJ: None\n",
            "MCI shares climb on takeover bid\n",
            "SUBJ: MCI shares VERB: climb OBJ: takeover bid\n",
            "UK homes hit £3.3 trillion total\n",
            "SUBJ: UK homes VERB: hit OBJ: £\n",
            "US adds more jobs than expected\n",
            "SUBJ:  VERB: adds OBJ: jobs\n",
            "Buyers snap up Jet Airways' shares\n",
            "SUBJ: Buyers VERB: snap OBJ: Jet Airways shares\n",
            "Electrolux to export Europe jobs\n",
            "SUBJ: Electrolux Europe jobs VERB: export OBJ: Europe jobs\n",
            "Crude oil prices back above $50\n",
            "SUBJ: oil prices VERB: None OBJ: None\n",
            "Madagascar completes currency switch\n",
            "SUBJ: Madagascar VERB: completes OBJ: currency switch\n",
            "Tsunami 'to hit Sri Lanka banks'\n",
            "SUBJ: Tsunami Sri Lanka banks VERB: hit OBJ: Sri Lanka banks\n",
            "India and Iran in gas export deal\n",
            "SUBJ: India Iran gas export deal VERB: None OBJ: None\n",
            "Rank 'set to sell off film unit'\n",
            "SUBJ: Rank set film unit VERB: sell OBJ: film unit\n",
            "Iraqi voters turn to economic issues\n",
            "SUBJ: voters VERB: turn OBJ: issues\n",
            "Fed warns of more US rate rises\n",
            "SUBJ: Fed VERB: warns OBJ: rate rises\n",
            "Standard Life concern at LSE bid\n",
            "SUBJ: Standard Life concern LSE bid VERB: None OBJ: None\n",
            "Ukraine trims privatisation check\n",
            "SUBJ: Ukraine VERB: trims OBJ: privatisation check\n",
            "US data sparks inflation worries\n",
            "SUBJ:  VERB: sparks OBJ: inflation worries\n",
            "Optimism remains over UK housing\n",
            "SUBJ: Optimism VERB: remains OBJ: UK housing\n",
            "UK 'risks breaking golden rule'\n",
            "SUBJ: UK risks rule VERB: breaking OBJ: rule\n",
            "Call centre users 'lose patience'\n",
            "SUBJ: Call centre users lose patience VERB: None OBJ: None\n",
            "'Strong dollar' call halts slide\n",
            "SUBJ: dollar call halts VERB: slide OBJ: None\n",
            "Criminal probe on Citigroup deals\n",
            "SUBJ: probe Citigroup deals VERB: None OBJ: None\n",
            "EU aiming to fuel development aid\n",
            "SUBJ: EU VERB: aiming OBJ: development aid\n",
            "Why few targets are better than many\n",
            "SUBJ: targets VERB: are OBJ: \n",
            "Chinese dam firm 'defies Beijing'\n",
            "SUBJ: dam firm defies Beijing VERB: None OBJ: None\n",
            "Beijingers fume over parking fees\n",
            "SUBJ: Beijingers VERB: fume OBJ: parking fees\n",
            "EU-US seeking deal on air dispute\n",
            "SUBJ: EU VERB: seeking OBJ: deal air dispute\n",
            "WMC profits up amid bid criticism\n",
            "SUBJ: WMC VERB: profits OBJ: bid criticism\n",
            "DaimlerChrysler's 2004 sales rise\n",
            "SUBJ: DaimlerChrysler sales rise VERB: None OBJ: None\n",
            "Weak data buffets French economy\n",
            "SUBJ: Weak data buffets economy VERB: None OBJ: None\n",
            "Rover deal 'may cost 2,000 jobs'\n",
            "SUBJ: Rover deal jobs VERB: cost OBJ: jobs\n",
            "Split-caps pay £194m compensation\n",
            "SUBJ: Split caps compensation VERB: pay OBJ: compensation\n",
            "Cairn Energy in Indian gas find\n",
            "SUBJ: Cairn Energy Indian gas find VERB: None OBJ: None\n",
            "Kraft cuts snack ads for children\n",
            "SUBJ: Kraft VERB: cuts OBJ: snack ads\n",
            "EU to probe Alitalia 'state aid'\n",
            "SUBJ: EU Alitalia state aid VERB: probe OBJ: Alitalia state aid\n",
            "Oil prices reach three-month low\n",
            "SUBJ: Oil prices VERB: reach OBJ: month\n",
            "Minister hits out at Yukos sale\n",
            "SUBJ: Minister VERB: hits OBJ: Yukos sale\n",
            "Continental 'may run out of cash'\n",
            "SUBJ: Continental VERB: run OBJ: cash\n",
            "BMW drives record sales in Asia\n",
            "SUBJ: BMW VERB: drives OBJ: record sales Asia\n",
            "ID theft surge hits US consumers\n",
            "SUBJ: theft surge VERB: hits OBJ: consumers\n",
            "Pernod takeover talk lifts Domecq\n",
            "SUBJ: Pernod takeover talk VERB: lifts OBJ: Domecq\n",
            "Wal-Mart fights back at accusers\n",
            "SUBJ: Wal Mart accusers VERB: fights OBJ: accusers\n",
            "Saab to build Cadillacs in Sweden\n",
            "SUBJ: Saab VERB: build OBJ: Cadillacs\n",
            "Police detain Chinese milk bosses\n",
            "SUBJ: Police detain milk bosses VERB: None OBJ: None\n",
            "Libya takes $1bn in unfrozen funds\n",
            "SUBJ: Libya VERB: takes OBJ: \n",
            "Singapore growth at 8.1% in 2004\n",
            "SUBJ: Singapore growth % VERB: None OBJ: None\n",
            "Yukos unit buyer faces loan claim\n",
            "SUBJ: Yukos unit buyer VERB: faces OBJ: loan claim\n",
            "India opens skies to competition\n",
            "SUBJ: India VERB: opens OBJ: skies competition\n",
            "LSE doubts boost bidders' shares\n",
            "SUBJ: LSE doubts VERB: boost OBJ: bidders shares\n",
            "Profits stall at China's Lenovo\n",
            "SUBJ: Profits VERB: stall OBJ: China Lenovo\n",
            "Profits slide at India's Dr Reddy\n",
            "SUBJ: Profits VERB: slide OBJ: India Dr Reddy\n",
            "Newest EU members underpin growth\n",
            "SUBJ: Newest EU members VERB: underpin OBJ: growth\n",
            "S Korean lender faces liquidation\n",
            "SUBJ: lender liquidation VERB: faces OBJ: liquidation\n",
            "GM, Ford cut output as sales fall\n",
            "SUBJ: GM Ford VERB: cut OBJ: output\n",
            "Giving financial gifts to children\n",
            "SUBJ: gifts VERB: Giving OBJ: gifts\n",
            "US bank 'loses' customer details\n",
            "SUBJ: bank customer details VERB: None OBJ: None\n",
            "Fiat mulls Ferrari market listing\n",
            "SUBJ: Fiat VERB: mulls OBJ: Ferrari market\n",
            "Lloyd's of London head chides FSA\n",
            "SUBJ: Lloyd London head chides FSA VERB: None OBJ: None\n",
            "Qwest may spark MCI bidding war\n",
            "SUBJ: Qwest VERB: spark OBJ: MCI bidding war\n",
            "EC calls truce in deficit battle\n",
            "SUBJ: EC VERB: calls OBJ: truce deficit battle\n",
            "Umbro profits lifted by Euro 2004\n",
            "SUBJ: Umbro profits Euro VERB: lifted OBJ: Euro\n",
            "US crude prices surge above $53\n",
            "SUBJ:  VERB: crude OBJ: prices\n",
            "China keeps tight rein on credit\n",
            "SUBJ: China VERB: keeps OBJ: rein\n",
            "Mixed reaction to Man Utd offer\n",
            "SUBJ: reaction Man Utd offer VERB: None OBJ: None\n",
            "Soaring oil 'hits world economy'\n",
            "SUBJ: oil VERB: Soaring OBJ: oil\n",
            "India's Deccan seals $1.8bn deal\n",
            "SUBJ: India Deccan VERB: seals OBJ: deal\n",
            "Japan bank shares up on link talk\n",
            "SUBJ: Japan bank shares link talk VERB: None OBJ: None\n",
            "Laura Ashley chief stepping down\n",
            "SUBJ: Laura Ashley chief VERB: stepping OBJ: None\n",
            "Chinese wine tempts Italy's Illva\n",
            "SUBJ: wine VERB: tempts OBJ: Italy Illva\n",
            "Macy's owner buys rival for $11bn\n",
            "SUBJ: Macy owner VERB: buys OBJ: \n",
            "Japanese banking battle at an end\n",
            "SUBJ: banking battle end VERB: None OBJ: None\n",
            "GM issues 2005 profits warning\n",
            "SUBJ: GM issues profits VERB: None OBJ: None\n",
            "Warning over US pensions deficit\n",
            "SUBJ: pensions VERB: Warning OBJ: pensions\n",
            "Russia gets investment blessing\n",
            "SUBJ: Russia VERB: gets OBJ: investment blessing\n",
            "Brazil jobless rate hits new low\n",
            "SUBJ: Brazil rate VERB: hits OBJ: None\n",
            "Small firms 'hit by rising costs'\n",
            "SUBJ: firms costs VERB: hit OBJ: costs\n",
            "Alfa Romeos 'to get GM engines'\n",
            "SUBJ: Alfa Romeos GM engines VERB: get OBJ: GM engines\n",
            "BP surges ahead on high oil price\n",
            "SUBJ: BP VERB: surges OBJ: oil price\n",
            "McDonald's to sponsor MTV show\n",
            "SUBJ: McDonald MTV VERB: sponsor OBJ: MTV\n",
            "Worldcom boss 'left books alone'\n",
            "SUBJ: Worldcom boss VERB: left OBJ: books\n",
            "Egypt to sell off state-owned bank\n",
            "SUBJ: Egypt state bank VERB: sell OBJ: state\n",
            "Insurance bosses plead guilty\n",
            "SUBJ: Insurance bosses VERB: plead OBJ: None\n",
            "Cactus diet deal for Phytopharm\n",
            "SUBJ: Cactus diet deal Phytopharm VERB: None OBJ: None\n",
            "Strong quarterly growth for Nike\n",
            "SUBJ: growth Nike VERB: None OBJ: None\n",
            "Euronext joins bid battle for LSE\n",
            "SUBJ: joins VERB: bid OBJ: battle LSE\n",
            "US to rule on Yukos refuge call\n",
            "SUBJ:  VERB: rule OBJ: Yukos refuge call\n",
            "Yukos loses US bankruptcy battle\n",
            "SUBJ: Yukos VERB: loses OBJ: bankruptcy battle\n",
            "Battered dollar hits another low\n",
            "SUBJ: dollar VERB: hits OBJ: \n",
            "Yukos sues four firms for $20bn\n",
            "SUBJ: Yukos VERB: sues OBJ: firms\n",
            "Delta cuts fares in survival plan\n",
            "SUBJ: Delta cuts fares survival plan VERB: None OBJ: None\n",
            "Salary scandal in Cameroon\n",
            "SUBJ: Salary scandal Cameroon VERB: None OBJ: None\n",
            "Bank set to leave rates on hold\n",
            "SUBJ: Bank VERB: set OBJ: rates\n",
            "Sluggish economy hits German jobs\n",
            "SUBJ: economy VERB: hits OBJ: jobs\n",
            "Wipro beats forecasts once again\n",
            "SUBJ: Wipro VERB: beats OBJ: forecasts\n",
            "SA unveils 'more for all' budget\n",
            "SUBJ: SA unveils VERB: None OBJ: None\n",
            "Renault boss hails 'great year'\n",
            "SUBJ: Renault boss year VERB: None OBJ: None\n",
            "Mild winter drives US oil down 6%\n",
            "SUBJ: Mild winter drives % VERB: oil OBJ: %\n",
            "Egypt and Israel seal trade deal\n",
            "SUBJ: Egypt Israel seal trade deal VERB: None OBJ: None\n",
            "Iraq and Afghanistan in WTO talks\n",
            "SUBJ: Iraq Afghanistan WTO talks VERB: None OBJ: None\n",
            "China had role in Yukos split-up\n",
            "SUBJ: China VERB: had OBJ: role Yukos split\n",
            "Venezuela identifies 'idle' farms\n",
            "SUBJ: Venezuela farms VERB: None OBJ: None\n",
            "Bush budget seeks deep cutbacks\n",
            "SUBJ: Bush budget VERB: seeks OBJ: cutbacks\n",
            "French wine gets 70m euro top-up\n",
            "SUBJ: wine VERB: gets OBJ: euro top\n",
            "Ryanair in $4bn Boeing plane deal\n",
            "SUBJ: Ryanair Boeing plane deal VERB: None OBJ: None\n",
            "Japan turns to beer alternatives\n",
            "SUBJ: Japan VERB: turns OBJ: beer alternatives\n",
            "MCI shareholder sues to stop bid\n",
            "SUBJ: MCI shareholder VERB: sues OBJ: bid\n",
            "Novartis hits acquisition trail\n",
            "SUBJ: Novartis VERB: hits OBJ: acquisition trail\n",
            "SEC to rethink post-Enron rules\n",
            "SUBJ: SEC VERB: rethink OBJ: post\n",
            "BBC poll indicates economic gloom\n",
            "SUBJ: BBC poll VERB: indicates OBJ: gloom\n",
            "WMC says Xstrata bid is too low\n",
            "SUBJ: WMC VERB: says OBJ: Xstrata bid\n",
            "Japanese mogul arrested for fraud\n",
            "SUBJ: mogul VERB: arrested OBJ: fraud\n",
            "Fannie Mae 'should restate books'\n",
            "SUBJ: Fannie Mae VERB: restate OBJ: books\n",
            "US trade gap ballooned in October\n",
            "SUBJ:  VERB: trade OBJ: gap October\n",
            "Nasdaq planning $100m-share sale\n",
            "SUBJ: Nasdaq share sale VERB: planning OBJ: \n",
            "Oil prices fall back from highs\n",
            "SUBJ: Oil prices VERB: fall OBJ: highs\n",
            "French consumer spending rising\n",
            "SUBJ: consumer spending VERB: rising OBJ: None\n",
            "Saudi ministry to employ women\n",
            "SUBJ: Saudi ministry women VERB: employ OBJ: women\n",
            "Telegraph newspapers axe 90 jobs\n",
            "SUBJ: Telegraph newspapers VERB: axe OBJ: jobs\n",
            "UK interest rates held at 4.75%\n",
            "SUBJ: interest rates VERB: held OBJ: %\n",
            "US budget deficit to reach $368bn\n",
            "SUBJ:  VERB: budget OBJ: deficit\n",
            "UK house prices dip in November\n",
            "SUBJ: UK house prices VERB: dip OBJ: November\n",
            "Verizon 'seals takeover of MCI'\n",
            "SUBJ: Verizon seals takeover MCI VERB: None OBJ: None\n",
            "Cars pull down US retail figures\n",
            "SUBJ: Cars VERB: pull OBJ: figures\n",
            "Christmas sales worst since 1981\n",
            "SUBJ: Christmas sales VERB: None OBJ: None\n",
            "Orange colour clash set for court\n",
            "SUBJ: Orange colour clash court VERB: set OBJ: court\n",
            "Steady job growth continues in US\n",
            "SUBJ: job growth VERB: continues OBJ: \n",
            "Fresh hope after Argentine crisis\n",
            "SUBJ: hope crisis VERB: None OBJ: None\n",
            "France Telecom gets Orange boost\n",
            "SUBJ: France Telecom VERB: gets OBJ: Orange boost\n",
            "Tate & Lyle boss bags top award\n",
            "SUBJ: Tate Lyle boss bags award VERB: None OBJ: None\n",
            "GSK aims to stop Aids profiteers\n",
            "SUBJ: GSK VERB: aims OBJ: Aids profiteers\n",
            "GM in crunch talks on Fiat future\n",
            "SUBJ: GM crunch talks Fiat future VERB: None OBJ: None\n",
            "News Corp eyes video games market\n",
            "SUBJ: News Corp eyes VERB: video OBJ: games market\n",
            "Market unfazed by Aurora setback\n",
            "SUBJ: Market Aurora setback VERB: None OBJ: None\n",
            "US gives foreign firms extra time\n",
            "SUBJ:  VERB: gives OBJ: firms\n",
            "US economy shows solid GDP growth\n",
            "SUBJ: economy GDP growth VERB: shows OBJ: GDP growth\n",
            "India power shares jump on debut\n",
            "SUBJ: India power shares VERB: jump OBJ: debut\n",
            "Liberian economy starts to grow\n",
            "SUBJ: economy VERB: starts OBJ: None\n",
            "Tobacco giants hail court ruling\n",
            "SUBJ: Tobacco giants VERB: hail OBJ: court ruling\n",
            "Bad weather hits Nestle sales\n",
            "SUBJ: weather VERB: hits OBJ: Nestle sales\n",
            "EU ministers to mull jet fuel tax\n",
            "SUBJ: EU ministers jet fuel tax VERB: mull OBJ: jet fuel tax\n",
            "Indian oil firm eyes Yukos assets\n",
            "SUBJ: oil firm VERB: eyes OBJ: Yukos assets\n",
            "Consumers drive French economy\n",
            "SUBJ: Consumers VERB: drive OBJ: economy\n",
            "Troubled Marsh under SEC scrutiny\n",
            "SUBJ: Marsh SEC scrutiny VERB: None OBJ: None\n",
            "Bank holds interest rate at 4.75%\n",
            "SUBJ: Bank VERB: holds OBJ: interest rate\n",
            "Qantas considers offshore option\n",
            "SUBJ: Qantas VERB: considers OBJ: offshore option\n",
            "Steel firm 'to cut' 45,000 jobs\n",
            "SUBJ: Steel firm jobs VERB: None OBJ: None\n",
            "Borussia Dortmund near bust\n",
            "SUBJ: Borussia Dortmund bust VERB: None OBJ: None\n",
            "Economy 'strong' in election year\n",
            "SUBJ: Economy election year VERB: None OBJ: None\n",
            "China bans new tobacco factories\n",
            "SUBJ: China VERB: bans OBJ: tobacco factories\n",
            "India and Russia in energy talks\n",
            "SUBJ: India Russia energy talks VERB: None OBJ: None\n",
            "Arsenal 'may seek full share listing'\n",
            "SUBJ: Arsenal VERB: seek OBJ: share listing\n",
            "German jobless rate at new record\n",
            "SUBJ: rate record VERB: None OBJ: None\n",
            "US company admits Benin bribery\n",
            "SUBJ: company Benin bribery VERB: admits OBJ: Benin bribery\n",
            "Ailing EuroDisney vows turnaround\n",
            "SUBJ: EuroDisney vows turnaround VERB: Ailing OBJ: EuroDisney vows turnaround\n",
            "Record year for Chilean copper\n",
            "SUBJ: Record year Chilean copper VERB: None OBJ: None\n",
            "UK economy ends year with spurt\n",
            "SUBJ: UK economy VERB: ends OBJ: year spurt\n",
            "India-Pakistan peace boosts trade\n",
            "SUBJ: India Pakistan peace boosts VERB: None OBJ: None\n",
            "High fuel costs hit US airlines\n",
            "SUBJ: fuel costs VERB: hit OBJ: \n",
            "$1m payoff for former Shell boss\n",
            "SUBJ: payoff Shell boss VERB: None OBJ: None\n",
            "Palestinian economy in decline\n",
            "SUBJ: economy decline VERB: None OBJ: None\n",
            "Air China in $1bn London listing\n",
            "SUBJ: Air China London listing VERB: None OBJ: None\n",
            "Air Jamaica back in state control\n",
            "SUBJ: Air Jamaica state control VERB: None OBJ: None\n",
            "German growth goes into reverse\n",
            "SUBJ: growth VERB: goes OBJ: reverse\n",
            "Yukos owner sues Russia for $28bn\n",
            "SUBJ: Yukos VERB: owner OBJ: sues Russia\n",
            "Weak dollar trims Cadbury profits\n",
            "SUBJ: dollar trims Cadbury profits VERB: None OBJ: None\n",
            "'Post-Christmas lull' in lending\n",
            "SUBJ: Post Christmas lull lending VERB: None OBJ: Post Christmas lull lending\n",
            "Barclays shares up on merger talk\n",
            "SUBJ: Barclays shares merger talk VERB: None OBJ: None\n",
            "Soros group warns of Kazakh close\n",
            "SUBJ: Soros group VERB: warns OBJ: Kazakh\n",
            "Dollar hovers around record lows\n",
            "SUBJ: Dollar hovers record lows VERB: None OBJ: None\n",
            "WorldCom trial starts in New York\n",
            "SUBJ: WorldCom trial VERB: starts OBJ: New York\n",
            "Singapore growth at 8.1% in 2004\n",
            "SUBJ: Singapore growth % VERB: None OBJ: None\n",
            "US interest rate rise expected\n",
            "SUBJ:  VERB: rise OBJ: None\n",
            "Ex-Boeing director gets jail term\n",
            "SUBJ: Ex Boeing director jail term VERB: gets OBJ: jail term\n",
            "Glaxo aims high after profit fall\n",
            "SUBJ: Glaxo VERB: aims OBJ: profit fall\n",
            "Vodafone appoints new Japan boss\n",
            "SUBJ: Vodafone VERB: appoints OBJ: Japan boss\n",
            "WorldCom bosses' $54m payout\n",
            "SUBJ: WorldCom VERB: None OBJ: None\n",
            "Ebbers 'aware' of WorldCom fraud\n",
            "SUBJ: Ebbers WorldCom fraud VERB: None OBJ: None\n",
            "Wall Street cool to eBay's profit\n",
            "SUBJ: Wall Street VERB: cool OBJ: eBay profit\n",
            "Could Yukos be a blessing in disguise?\n",
            "SUBJ: Yukos VERB: be OBJ: blessing disguise\n",
            "Budget Aston takes on Porsche\n",
            "SUBJ: Budget Aston VERB: takes OBJ: Porsche\n",
            "Cash gives way to flexible friend\n",
            "SUBJ: Cash VERB: gives OBJ: way\n",
            "Asia quake increases poverty risk\n",
            "SUBJ: Asia quake increases poverty risk VERB: None OBJ: None\n",
            "Parmalat boasts doubled profits\n",
            "SUBJ: Parmalat VERB: boasts OBJ: profits\n",
            "Burren awarded Egyptian contracts\n",
            "SUBJ: Burren VERB: awarded OBJ: contracts\n",
            "Germany calls for EU reform\n",
            "SUBJ: Germany VERB: calls OBJ: EU reform\n",
            "Asia shares defy post-quake gloom\n",
            "SUBJ: Asia shares VERB: defy OBJ: quake gloom\n",
            "EMI shares hit by profit warning\n",
            "SUBJ: EMI shares profit warning VERB: hit OBJ: profit warning\n",
            "Takeover offer for Sunderland FC\n",
            "SUBJ: Takeover offer Sunderland FC VERB: None OBJ: None\n",
            "Banker loses sexism claim\n",
            "SUBJ: Banker VERB: loses OBJ: sexism claim\n",
            "News Corp makes $5.4bn Fox offer\n",
            "SUBJ: News Corp VERB: makes OBJ: Fox offer\n",
            "India's Maruti sees profits jump\n",
            "SUBJ: India Maruti VERB: sees OBJ: profits\n",
            "Fosters buys stake in winemaker\n",
            "SUBJ: Fosters VERB: buys OBJ: stake winemaker\n",
            "Nasdaq planning $100m share sale\n",
            "SUBJ: Nasdaq share sale VERB: planning OBJ: share sale\n",
            "World leaders gather to face uncertainty\n",
            "SUBJ: World leaders VERB: gather OBJ: uncertainty\n",
            "Ore costs hit global steel firms\n",
            "SUBJ: Ore VERB: costs OBJ: steel firms\n",
            "Golden rule boost for Chancellor\n",
            "SUBJ: Golden rule boost Chancellor VERB: None OBJ: None\n",
            "Swiss cement firm in buying spree\n",
            "SUBJ: cement firm spree VERB: buying OBJ: spree\n",
            "Qantas sees profits fly to record\n",
            "SUBJ: Qantas VERB: sees OBJ: profits\n",
            "House prices rebound says Halifax\n",
            "SUBJ: House prices VERB: rebound OBJ: Halifax\n",
            "Circuit City gets takeover offer\n",
            "SUBJ: Circuit City VERB: gets OBJ: takeover offer\n",
            "Trade gap narrows as exports rise\n",
            "SUBJ: Trade gap VERB: narrows OBJ: exports\n",
            "Turkey turns on the economic charm\n",
            "SUBJ: Turkey VERB: turns OBJ: charm\n",
            "Qatar and Shell in $6bn gas deal\n",
            "SUBJ: Qatar Shell gas deal VERB: None OBJ: None\n",
            "Worldcom director ends evidence\n",
            "SUBJ: Worldcom director VERB: ends OBJ: evidence\n",
            "Disney settles disclosure charges\n",
            "SUBJ: Disney VERB: settles OBJ: disclosure charges\n",
            "S Korean credit card firm rescued\n",
            "SUBJ: credit card firm VERB: rescued OBJ: None\n",
            "Consumer spending lifts US growth\n",
            "SUBJ: Consumer spending VERB: lifts OBJ: \n",
            "Argentina closes $102.6bn debt swap\n",
            "SUBJ: Argentina VERB: closes OBJ: debt swap\n",
            "Building giant in asbestos payout\n",
            "SUBJ: giant VERB: Building OBJ: giant\n",
            "US seeks new $280bn smoker ruling\n",
            "SUBJ:  VERB: seeks OBJ: smoker ruling\n",
            "Gaming firm to sell UK dog tracks\n",
            "SUBJ: firm UK dog tracks VERB: Gaming OBJ: firm UK dog tracks\n",
            "FAO warns on impact of subsidies\n",
            "SUBJ: FAO VERB: warns OBJ: impact subsidies\n",
            "Beer giant swallows Russian firm\n",
            "SUBJ: Beer giant VERB: swallows OBJ: firm\n",
            "Further rise in UK jobless total\n",
            "SUBJ: rise UK total VERB: None OBJ: None\n",
            "Japan narrowly escapes recession\n",
            "SUBJ: Japan VERB: escapes OBJ: recession\n",
            "Low-cost airlines hit Eurotunnel\n",
            "SUBJ: cost airlines VERB: hit OBJ: Eurotunnel\n",
            "UK economy facing 'major risks'\n",
            "SUBJ: UK economy risks VERB: None OBJ: None\n",
            "Barclays profits hit record level\n",
            "SUBJ: Barclays profits VERB: hit OBJ: record level\n",
            "MG Rover China tie-up 'delayed'\n",
            "SUBJ: MG Rover China tie VERB: None OBJ: None\n",
            "Asian quake hits European shares\n",
            "SUBJ: quake VERB: hits OBJ: shares\n",
            "SBC plans post-takeover job cuts\n",
            "SUBJ: SBC VERB: plans OBJ: post takeover job cuts\n",
            "Safety alert as GM recalls cars\n",
            "SUBJ: Safety GM cars VERB: recalls OBJ: cars\n",
            "Two Nigerian banks set to merge\n",
            "SUBJ: banks VERB: set OBJ: None\n",
            "India unveils anti-poverty budget\n",
            "SUBJ: India VERB: unveils OBJ: poverty budget\n",
            "Jarvis sells Tube stake to Spain\n",
            "SUBJ: Jarvis VERB: sells OBJ: Tube stake\n",
            "UK bank seals South Korean deal\n",
            "SUBJ: UK bank seals deal VERB: None OBJ: None\n",
            "US bank in $515m SEC settlement\n",
            "SUBJ:  VERB: None OBJ: \n",
            "Boeing secures giant Japan order\n",
            "SUBJ: Boeing secures Japan order VERB: None OBJ: None\n",
            "US trade deficit widens sharply\n",
            "SUBJ:  VERB: trade OBJ: deficit\n",
            "Lufthansa flies back to profit\n",
            "SUBJ: Lufthansa VERB: flies OBJ: profit\n",
            "HealthSouth ex-boss goes on trial\n",
            "SUBJ: HealthSouth VERB: goes OBJ: boss trial\n",
            "South African car demand surges\n",
            "SUBJ: car demand VERB: surges OBJ: None\n",
            "Share boost for feud-hit Reliance\n",
            "SUBJ: Share boost feud hit Reliance VERB: None OBJ: None\n",
            "GM pays $2bn to evade Fiat buyout\n",
            "SUBJ: GM VERB: pays OBJ: \n",
            "Nissan names successor to Ghosn\n",
            "SUBJ: Nissan names successor Ghosn VERB: None OBJ: None\n",
            "S&N extends Indian beer venture\n",
            "SUBJ: S N VERB: extends OBJ: beer venture\n",
            "Israeli economy picking up pace\n",
            "SUBJ: economy VERB: picking OBJ: pace\n",
            "Ukraine steel sell-off 'illegal'\n",
            "SUBJ: Ukraine steel VERB: sell OBJ: \n",
            "Dutch bank to lay off 2,850 staff\n",
            "SUBJ: bank staff VERB: lay OBJ: staff\n",
            "Ad firm WPP's profits surge 15%\n",
            "SUBJ: Ad firm WPP profits VERB: surge OBJ: %\n",
            "Algeria hit by further gas riots\n",
            "SUBJ: Algeria gas riots VERB: hit OBJ: gas riots\n",
            "US in EU tariff chaos trade row\n",
            "SUBJ: EU tariff chaos trade row VERB: None OBJ: None\n",
            "Crossrail link 'to get go-ahead'\n",
            "SUBJ: Crossrail link VERB: get OBJ: None\n",
            "Israel looks to US for bank chief\n",
            "SUBJ: Israel VERB: looks OBJ: bank chief\n",
            "Rescue hope for Borussia Dortmund\n",
            "SUBJ: Rescue hope Borussia Dortmund VERB: None OBJ: None\n",
            "Shares hit by MS drug suspension\n",
            "SUBJ: Shares MS drug suspension VERB: hit OBJ: MS drug suspension\n",
            "S Korea spending boost to economy\n",
            "SUBJ: S Korea spending boost economy VERB: None OBJ: None\n",
            "Australia rates at four year high\n",
            "SUBJ: Australia VERB: rates OBJ: year\n",
            "China continues breakneck growth\n",
            "SUBJ: China VERB: continues OBJ: breakneck growth\n",
            "Iran budget seeks state sell-offs\n",
            "SUBJ: Iran budget VERB: seeks OBJ: offs\n",
            "Deutsche Boerse boosts dividend\n",
            "SUBJ: Deutsche Boerse VERB: boosts OBJ: dividend\n",
            "IMF agrees fresh Turkey funding\n",
            "SUBJ: IMF VERB: agrees OBJ: Turkey funding\n",
            "Rich grab half Colombia poor fund\n",
            "SUBJ: grab half Colombia fund VERB: None OBJ: None\n",
            "Tsunami to cost Sri Lanka $1.3bn\n",
            "SUBJ: Tsunami VERB: cost OBJ: Sri Lanka\n",
            "Diageo to buy US wine firm\n",
            "SUBJ: Diageo wine firm VERB: buy OBJ: wine firm\n",
            "European losses hit GM's profits\n",
            "SUBJ: losses VERB: hit OBJ: GM profits\n",
            "Water firm Suez in Argentina row\n",
            "SUBJ: Water firm Suez Argentina row VERB: None OBJ: None\n",
            "Gold falls on IMF sale concerns\n",
            "SUBJ: Gold VERB: falls OBJ: IMF sale concerns\n",
            "Venezuela and China sign oil deal\n",
            "SUBJ: Venezuela China sign oil deal VERB: None OBJ: None\n",
            "Dollar gains on Greenspan speech\n",
            "SUBJ: Dollar gains Greenspan speech VERB: None OBJ: None\n",
            "Lacroix label bought by US firm\n",
            "SUBJ: Lacroix label firm VERB: bought OBJ: firm\n",
            "Reliance unit loses Anil Ambani\n",
            "SUBJ: Reliance unit VERB: loses OBJ: Anil Ambani\n",
            "Durex maker SSL awaits firm bid\n",
            "SUBJ: Durex maker SSL VERB: awaits OBJ: firm bid\n",
            "Call to save manufacturing jobs\n",
            "SUBJ: manufacturing jobs VERB: Call OBJ: manufacturing jobs\n",
            "German economy rebounds\n",
            "SUBJ: economy VERB: rebounds OBJ: None\n",
            "Saudi investor picks up the Savoy\n",
            "SUBJ: investor VERB: picks OBJ: Savoy\n",
            "Nigeria to boost cocoa production\n",
            "SUBJ: Nigeria cocoa production VERB: boost OBJ: cocoa production\n",
            "Cairn shares slump on oil setback\n",
            "SUBJ: Cairn shares VERB: slump OBJ: oil setback\n",
            "Wal-Mart to pay $14m in gun suit\n",
            "SUBJ: Wal Mart gun suit VERB: pay OBJ: gun suit\n",
            "Deutsche Telekom sees mobile gain\n",
            "SUBJ: Deutsche Telekom VERB: sees OBJ: gain\n",
            "Gazprom 'in $36m back-tax claim'\n",
            "SUBJ: Gazprom VERB: None OBJ: None\n",
            "Brussels raps mobile call charges\n",
            "SUBJ: Brussels VERB: raps OBJ: call charges\n",
            "Man Utd to open books to Glazer\n",
            "SUBJ: Man Utd books Glazer VERB: open OBJ: books\n",
            "Retail sales show festive fervour\n",
            "SUBJ: sales VERB: show OBJ: fervour\n",
            "BMW cash to fuel Mini production\n",
            "SUBJ: BMW cash Mini production VERB: fuel OBJ: Mini production\n",
            "US insurer Marsh cuts 2,500 jobs\n",
            "SUBJ:  VERB: insurer OBJ: Marsh cuts jobs\n",
            "Ford gains from finance not cars\n",
            "SUBJ: Ford VERB: gains OBJ: finance\n",
            "Feta cheese battle reaches court\n",
            "SUBJ: Feta cheese battle VERB: reaches OBJ: court\n",
            "Monsanto fined $1.5m for bribery\n",
            "SUBJ: Monsanto VERB: fined OBJ: \n",
            "China Aviation seeks rescue deal\n",
            "SUBJ: China Aviation VERB: seeks OBJ: rescue deal\n",
            "Quake's economic costs emerging\n",
            "SUBJ: Quake costs VERB: emerging OBJ: None\n",
            "Saudi NCCI's shares soar\n",
            "SUBJ: Saudi NCCI shares VERB: soar OBJ: None\n",
            "Yukos heading back to US courts\n",
            "SUBJ: Yukos courts VERB: heading OBJ: courts\n",
            "News Corp eyes video games market\n",
            "SUBJ: News Corp eyes VERB: video OBJ: games market\n",
            "Firms pump billions into pensions\n",
            "SUBJ: Firms VERB: pump OBJ: billions\n",
            "US firm 'bids for Lacroix label'\n",
            "SUBJ:  VERB: None OBJ: None\n",
            "US manufacturing expands\n",
            "SUBJ:  VERB: manufacturing OBJ: expands\n",
            "Weak end-of-year sales hit Next\n",
            "SUBJ: end year sales VERB: hit OBJ: None\n",
            "Business confidence dips in Japan\n",
            "SUBJ: Business confidence dips Japan VERB: None OBJ: None\n",
            "BMW to recall faulty diesel cars\n",
            "SUBJ: BMW diesel cars VERB: recall OBJ: diesel cars\n",
            "Quiksilver moves for Rossignol\n",
            "SUBJ: Quiksilver VERB: moves OBJ: Rossignol\n",
            "House prices show slight increase\n",
            "SUBJ: House prices VERB: show OBJ: increase\n",
            "Winn-Dixie files for bankruptcy\n",
            "SUBJ: Winn Dixie files bankruptcy VERB: None OBJ: None\n",
            "Deutsche Boerse set to 'woo' LSE\n",
            "SUBJ: Deutsche Boerse VERB: set OBJ: woo LSE\n",
            "Ericsson sees earnings improve\n",
            "SUBJ: Ericsson VERB: sees OBJ: earnings\n",
            "Aids and climate top Davos agenda\n",
            "SUBJ: Aids climate Davos agenda VERB: None OBJ: None\n",
            "Indonesia 'declines debt freeze'\n",
            "SUBJ: Indonesia declines debt freeze VERB: None OBJ: None\n",
            "Oil companies get Russian setback\n",
            "SUBJ: Oil companies VERB: get OBJ: setback\n",
            "Economy 'stronger than forecast'\n",
            "SUBJ: Economy forecast VERB: None OBJ: None\n",
            "Bush to outline 'toughest' budget\n",
            "SUBJ: Bush VERB: outline OBJ: budget\n",
            "Disaster claims 'less than $10bn'\n",
            "SUBJ: Disaster claims VERB: None OBJ: None\n",
            "Virgin Blue shares plummet 20%\n",
            "SUBJ: Virgin Blue shares VERB: plummet OBJ: %\n",
            "Cuba winds back economic clock\n",
            "SUBJ: Cuba VERB: winds OBJ: clock\n",
            "FBI agent colludes with analyst\n",
            "SUBJ: FBI agent VERB: colludes OBJ: analyst\n",
            "Court rejects $280bn tobacco case\n",
            "SUBJ: Court VERB: rejects OBJ: tobacco case\n",
            "Enron bosses in $168m payout\n",
            "SUBJ: Enron VERB: bosses OBJ: \n",
            "'Golden economic period' to end\n",
            "SUBJ: period VERB: end OBJ: period\n",
            "Call to overhaul UK state pension\n",
            "SUBJ: UK state pension VERB: Call OBJ: UK state pension\n",
            "Slowdown hits US factory growth\n",
            "SUBJ: Slowdown VERB: hits OBJ: factory growth\n",
            "Europe blames US over weak dollar\n",
            "SUBJ: Europe VERB: blames OBJ: \n"
          ]
        }
      ],
      "source": [
        "for line in business:\n",
        "    tok.set_content(line.strip())\n",
        "    seq = extract_sequences(tok)[0]\n",
        "    \n",
        "    tagger.tag(seq)\n",
        "    tree = parser.parse(seq)\n",
        "    \n",
        "    extractor = SVOExtractor()\n",
        "    tree.visit(extractor)\n",
        "    print(line.strip())\n",
        "    print(\"SUBJ: {} VERB: {} OBJ: {}\".format(extractor.subject, extractor.verb, extractor.object))"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.0"
    },
    "colab": {
      "provenance": [],
      "include_colab_link": true
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
 }