balouf · May 20, 2020 20:07
diff --git a/InputOutput-Part_I.ipynb b/InputOutput-Part_I.ipynb
 {
  "cells": [
    {
      "metadata": {
        "ExecuteTime": {
          "end_time": "2020-05-20T08:52:16.800805Z",
          "start_time": "2020-05-20T08:52:16.772570Z"
        },
        "slideshow": {
          "slide_type": "skip"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "from IPython.core.display import display, HTML\ndisplay(HTML(\"\"\"<style>\n.prompt_container { display: none !important; }\n.prompt { display: none !important; }\n.run_this_cell { display: none !important; }\n</style>\"\"\"))",
      "execution_count": 1,
      "outputs": [
        {
          "data": {
            "text/html": "<style>\n.prompt_container { display: none !important; }\n.prompt { display: none !important; }\n.run_this_cell { display: none !important; }\n</style>",
            "text/plain": "<IPython.core.display.HTML object>"
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ]
    },
    {
      "metadata": {
        "slideshow": {
          "slide_type": "slide"
        }
      },
      "cell_type": "markdown",
      "source": "# Input, Output, and the Internet"
    },
    {
      "metadata": {
        "slideshow": {
          "slide_type": "slide"
        }
      },
      "cell_type": "markdown",
      "source": "## Introduction"
    },
    {
      "metadata": {
        "slideshow": {
          "slide_type": "slide"
        }
      },
      "cell_type": "markdown",
      "source": "### Objectives"
    },
    {
      "metadata": {
        "slideshow": {
          "slide_type": ""
        }
      },
      "cell_type": "markdown",
      "source": "- Show simple things that facilitates dealing with files\n- Some you may know (François / Marco workshops)\n- Brief overview, references, examples\n- This is how *I* do these things *today* (probably not perfect)"
    },
    {
      "metadata": {
        "slideshow": {
          "slide_type": "slide"
        }
      },
      "cell_type": "markdown",
      "source": "### Roadmap\n\n- Part I: Local files\n - Main things to know\n - Side things to know\n - Examples\n- Part II: The Internet\n - http & html\n - requests\n - BeautifulSoup\n - Selenium\n - Examples   "
    },
    {
      "metadata": {
        "slideshow": {
          "slide_type": "slide"
        }
      },
      "cell_type": "markdown",
      "source": "## Part I: Local files"
    },
    {
      "metadata": {
        "slideshow": {
          "slide_type": "slide"
        }
      },
      "cell_type": "markdown",
      "source": "### Main things to know"
    },
    {
      "metadata": {
        "slideshow": {
          "slide_type": "subslide"
        }
      },
      "cell_type": "markdown",
      "source": "# Bytes vs strings"
    },
    {
      "metadata": {
        "cell_style": "split"
      },
      "cell_type": "markdown",
      "source": "- Byte objects are 0's and 1's by groups of 8.\n- Meaning depends on a convention (MP3, JPG, UTF8).\n- Strings are sequence of characters.\n- Strings need encoding before which they can be stored."
    },
    {
      "metadata": {
        "cell_style": "split"
      },
      "cell_type": "markdown",
      "source": "<img src=\"https://media.geeksforgeeks.org/wp-content/uploads/string-vs-byte-in-python.png\">"
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "https://www.geeksforgeeks.org/byte-objects-vs-string-python/"
    },
    {
      "metadata": {
        "slideshow": {
          "slide_type": "subslide"
        }
      },
      "cell_type": "markdown",
      "source": "# Bytes vs String"
    },
    {
      "metadata": {
        "ExecuteTime": {
          "end_time": "2020-05-20T09:15:13.210148Z",
          "start_time": "2020-05-20T09:15:13.192806Z"
        },
        "cell_style": "split",
        "trusted": true
      },
      "cell_type": "code",
      "source": "txt = \"Élise\"\nprint(f\"{txt} has length {len(txt)}\")\n\nraw = bytes(txt, encoding='utf8')\nprint(f\"{raw} has length {len(raw)}\")\n\nraw = bytes(txt, encoding='latin_1')\nprint(f\"{raw} has length {len(raw)}\")",
      "execution_count": 2,
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": "Élise has length 5\nb'\\xc3\\x89lise' has length 6\nb'\\xc9lise' has length 5\n"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "end_time": "2020-05-20T09:17:20.036628Z",
          "start_time": "2020-05-20T09:17:20.024204Z"
        },
        "cell_style": "split",
        "trusted": true
      },
      "cell_type": "code",
      "source": "try:\n    bytes(txt, encoding='ASCII')\nexcept Exception as e: print(e)\n\ntry:\n    bytes(txt, encoding='latin_1').decode('utf8')\nexcept Exception as e: print(e)\n\nrecode = bytes(txt, encoding='utf8').decode('latin_1')\nprint(f\"{recode} has length {len(raw)}\")",
      "execution_count": 3,
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": "'ascii' codec can't encode character '\\xc9' in position 0: ordinal not in range(128)\n'utf-8' codec can't decode byte 0xc9 in position 0: invalid continuation byte\nÃlise has length 5\n"
        }
      ]
    },
    {
      "metadata": {
        "slideshow": {
          "slide_type": "subslide"
        }
      },
      "cell_type": "markdown",
      "source": "# Bytes vs String"
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "- Some things need bytes, other need strings\n- 95% of errors can be avoided if you understand the difference\n- Stick to utf-8, try to tell your encoding in your file (e.g. `% !TeX encoding = UTF-8`)\n- In Python, files can be opened in binary mode (`b`) or text mode (`t`).\n - Text mode transparently makes the bytes <-> string conversion\n - Default is **System dependent**, can be specified with `encoding` parameter"
    },
    {
      "metadata": {
        "slideshow": {
          "slide_type": "subslide"
        }
      },
      "cell_type": "markdown",
      "source": "# Bytes vs String"
    },
    {
      "metadata": {
        "ExecuteTime": {
          "end_time": "2020-05-20T09:21:11.014851Z",
          "start_time": "2020-05-20T09:21:10.998321Z"
        },
        "cell_style": "split",
        "trusted": true
      },
      "cell_type": "code",
      "source": "f = open('test.txt', 'wt')\nf.write('Élise')\nf.close()\nf = open('test.txt', 'rb')\nraw = f.read()\nf.close()\nprint(f\"Bytes of file are {raw}\")",
      "execution_count": 4,
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": "Bytes of file are b'\\xc9lise'\n"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "end_time": "2020-05-20T09:21:52.289689Z",
          "start_time": "2020-05-20T09:21:52.276221Z"
        },
        "cell_style": "split",
        "trusted": true
      },
      "cell_type": "code",
      "source": "f = open('test.txt', 'wt', encoding='utf8')\nf.write('Élise')\nf.close()\nf = open('test.txt', 'rb')\nraw = f.read()\nf.close()\nprint(f\"Bytes of file are {raw},\" \n      f\"text is {raw.decode('utf8')}\")\n\n\n",
      "execution_count": 5,
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": "Bytes of file are b'\\xc3\\x89lise',text is Élise\n"
        }
      ]
    },
    {
      "metadata": {
        "slideshow": {
          "slide_type": "subslide"
        }
      },
      "cell_type": "markdown",
      "source": "# The `pathlib` module"
    },
    {
      "metadata": {
        "cell_style": "split"
      },
      "cell_type": "markdown",
      "source": "Possible issues when dealing with files:\n- OS conventions (e.g. `\\` vs `/`)\n- Absolute vs relative\n- Search for specific file(s)\n- Concatenation:\n - ``dir+files``?\n - ``dir+\"/\"+files``?\n - ``dir+\"\\\"+files``?\n - ``dir+\"\\\\\"+files``?"
    },
    {
      "metadata": {
        "cell_style": "split"
      },
      "cell_type": "markdown",
      "source": "pathlib removes most of these issues\n- Introduced in 3.4\n- Replace previous modules like os, glob\n- OS independant"
    },
    {
      "metadata": {
        "slideshow": {
          "slide_type": "subslide"
        }
      },
      "cell_type": "markdown",
      "source": "# The `pathlib` module"
    },
    {
      "metadata": {
        "ExecuteTime": {
          "end_time": "2020-05-20T09:25:51.607491Z",
          "start_time": "2020-05-20T09:25:51.600044Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "from pathlib import Path\nd = Path('.')\nfile = Path('test.txt')",
      "execution_count": 7,
      "outputs": []
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "Why use path?\n- Bunch of useful methods: exists, is_file, is_dir, unlink, stem, suffix...\n- Simple construction: `Path('temp') / Path('tempfile')`\n- Easy to turn string-based code into path-based code\n - All common methods that accept a string accept a path\n - `Path` is idempotent: `Path(Path(s))==Path(s)`\n"
    },
    {
      "metadata": {
        "slideshow": {
          "slide_type": "subslide"
        }
      },
      "cell_type": "markdown",
      "source": "# The `pathlib` module"
    },
    {
      "metadata": {
        "ExecuteTime": {
          "end_time": "2020-05-20T09:27:18.353712Z",
          "start_time": "2020-05-20T09:27:18.338804Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "d = Path('.')\nfor file in d.rglob('*ipynb*'):\n    if file.is_file():\n        print(f\"{file} is a file.\")\n    elif file.is_dir():\n        print(f\"{file} is a dir.\")",
      "execution_count": 8,
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": ".ipynb_checkpoints is a dir.\nInputOutput.ipynb is a file.\nPython Academy.ipynb is a file.\nUntitled.ipynb is a file.\n.ipynb_checkpoints\\InputOutput-checkpoint.ipynb is a file.\n.ipynb_checkpoints\\Python Academy-checkpoint.ipynb is a file.\n"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "end_time": "2020-05-20T09:28:03.624866Z",
          "start_time": "2020-05-20T09:28:03.612903Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "file = Path('test.txt')\nprint(file.exists())\nfile = Path('tets.txt')\nprint(file.exists())",
      "execution_count": 9,
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": "True\nFalse\n"
        }
      ]
    },
    {
      "metadata": {
        "slideshow": {
          "slide_type": "subslide"
        }
      },
      "cell_type": "markdown",
      "source": "# Context managers"
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "- Principle: some things require some cleaning after they are used\n- Context managers allow to do this implicitly\n- Python's `open` can be used as a CM\n- Many other function/classes do:\n - Temporary directory\n - Internet session\n - You can very easily write your own CM, cf https://docs.python.org/3.7/reference/datamodel.html#context-managers"
    },
    {
      "metadata": {
        "slideshow": {
          "slide_type": "subslide"
        }
      },
      "cell_type": "markdown",
      "source": "# Context managers"
    },
    {
      "metadata": {
        "ExecuteTime": {
          "end_time": "2020-05-20T09:33:16.266375Z",
          "start_time": "2020-05-20T09:33:16.251796Z"
        },
        "cell_style": "split",
        "trusted": true
      },
      "cell_type": "code",
      "source": "def get_txt():\n    f = open('test.txt', 'rb')\n    txt = f.read().decode('utf8')\n    f.close()\n    return txt\nget_txt()",
      "execution_count": 10,
      "outputs": [
        {
          "data": {
            "text/plain": "'Élise'"
          },
          "execution_count": 10,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "end_time": "2020-05-20T09:33:19.086772Z",
          "start_time": "2020-05-20T09:33:19.074923Z"
        },
        "cell_style": "split",
        "trusted": true
      },
      "cell_type": "code",
      "source": "def get_txt():\n    with open('test.txt', 'rb') as f:\n        return f.read().decode('utf8')\nget_txt()",
      "execution_count": 11,
      "outputs": [
        {
          "data": {
            "text/plain": "'Élise'"
          },
          "execution_count": 11,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "Advantages:\n- Mess (opened file descriptor) is automatically cleaned\n- Even in case of Error!\n- Indentation shows the validity of file access\n- Sligthly shorter code"
    },
    {
      "metadata": {
        "slideshow": {
          "slide_type": "subslide"
        }
      },
      "cell_type": "markdown",
      "source": "# Context managers"
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "Always use it if you can!\n\nCases where the old way may be better:\n    - File spans multiple cells in a notebook\n    - doctest with multiple asserts\n    - too much indentation already (maybe consider use of subfunctions?)"
    },
    {
      "metadata": {
        "slideshow": {
          "slide_type": "subslide"
        }
      },
      "cell_type": "markdown",
      "source": "# write vs json vs numpy vs dill vs pickle"
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "read/write are the basic methods to access a file, but they are others:"
    },
    {
      "metadata": {
        "cell_style": "split"
      },
      "cell_type": "markdown",
      "source": "- `json` (dump/load): for list, dict, ...\n - Human-readable\n - Only works for json\n- `numpy` (save/load and others): for numpy objects\n - Multiple objects and other stuff\n - Only work for numpy"
    },
    {
      "metadata": {
        "cell_style": "split"
      },
      "cell_type": "markdown",
      "source": "- `pickle` (dump/load):\n - Included in the standard library\n - Trouble with complex objects\n- `dill` (dump/load):\n - Encode almost anything\n - not included in the standard library"
    },
    {
      "metadata": {
        "slideshow": {
          "slide_type": "subslide"
        }
      },
      "cell_type": "markdown",
      "source": "# write vs json vs numpy vs dill vs pickle"
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "My choice: \n- `json` if I am sure my object is human-readable (e.g. no 1,000,000 X 1,000,000 matrix inside)\n- `dill as pickle` otherwise."
    },
    {
      "metadata": {
        "slideshow": {
          "slide_type": "slide"
        }
      },
      "cell_type": "markdown",
      "source": "### Side things to know"
    },
    {
      "metadata": {
        "slideshow": {
          "slide_type": "subslide"
        }
      },
      "cell_type": "markdown",
      "source": "# Compression"
    },
    {
      "metadata": {
        "cell_style": "split"
      },
      "cell_type": "markdown",
      "source": "- `gzip`: one size fits all compression for single files\n- `zipfile`: for including multiple files in one `zip` archive\n- `zlib`: in-memory `gzip`"
    },
    {
      "metadata": {
        "cell_style": "split"
      },
      "cell_type": "markdown",
      "source": "- Common mistakes: uncompress, suppress compressed, read uncompressed, or write, compress, suppress uncompress\n- Use gzip.open in place of open to do it all in one go\n"
    },
    {
      "metadata": {
        "slideshow": {
          "slide_type": "subslide"
        }
      },
      "cell_type": "markdown",
      "source": "# Compression"
    },
    {
      "metadata": {
        "ExecuteTime": {
          "end_time": "2020-05-20T09:46:52.897754Z",
          "start_time": "2020-05-20T09:46:52.879630Z"
        },
        "cell_style": "split",
        "trusted": true
      },
      "cell_type": "code",
      "source": "file = Path(\"test.txt.gz\")\nimport gzip\ntxt = \"All work and no play makes Jack a dull boy.\\n\"*20\nprint(f\"Size of text is {len(txt)} characters.\")\nwith gzip.open(file, \"wt\") as f:\n    f.write(txt)\n    print(f\"Size of virtual uncompressed file is {f.tell()} bytes.\")\nprint(f\"Size of compressed file is {file.stat().st_size} bytes.\")",
      "execution_count": 12,
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": "Size of text is 880 characters.\nSize of virtual uncompressed file is 900 bytes.\nSize of compressed file is 88 bytes.\n"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "end_time": "2020-05-20T09:48:52.944712Z",
          "start_time": "2020-05-20T09:48:52.932852Z"
        },
        "cell_style": "split",
        "trusted": true
      },
      "cell_type": "code",
      "source": "with gzip.open(\"test.txt.gz\", \"rt\") as f:\n    print(f.read())",
      "execution_count": 13,
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": "All work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\n\n"
        }
      ]
    },
    {
      "metadata": {
        "slideshow": {
          "slide_type": "subslide"
        }
      },
      "cell_type": "markdown",
      "source": "# Compression"
    },
    {
      "metadata": {
        "ExecuteTime": {
          "end_time": "2020-05-20T09:49:31.807550Z",
          "start_time": "2020-05-20T09:49:31.796656Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "import zlib\nwith open(\"test.txt.gz\", \"rb\") as f:\n    raw = f.read()\nprint(f\"Raw binary data: {raw}\")\nprint(zlib.decompress(raw, 15+32).decode())",
      "execution_count": 14,
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": "Raw binary data: b'\\x1f\\x8b\\x08\\x08\\x8c\\xfc\\xc4^\\x02\\xfftest.txt\\x00r\\xcc\\xc9Q(\\xcf/\\xcaVH\\xccKQ\\xc8\\xcbW(\\xc8I\\xacT\\xc8M\\xccN-V\\xf0JL\\x06\\n+\\xa4\\x94\\x02\\x95$\\xe5W\\xea\\xf1r9\\x8e*\\x1eU<\\xaa\\x98\\xda\\x8a\\x01\\x00\\x00\\x00\\xff\\xff\\x03\\x00.\\x97\\xb4\\x80\\x84\\x03\\x00\\x00'\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\n\n"
        }
      ]
    },
    {
      "metadata": {
        "slideshow": {
          "slide_type": "subslide"
        }
      },
      "cell_type": "markdown",
      "source": "# tempfile"
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "You can use the tempfile module to create temporary files and directory *somewhere*. If you use `with`, they are deleted afterwards.\n\nExamples:\n- https://gismo.readthedocs.io/en/latest/tutorials/tutorial_IO.html\n- https://gismo.readthedocs.io/en/latest/reference.html#gismo.datasets.dblp.Dblp.build"
    },
    {
      "metadata": {
        "slideshow": {
          "slide_type": "subslide"
        }
      },
      "cell_type": "markdown",
      "source": "# tempfile"
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "Remarks:\n- `with tempfile.TemporaryDirectory() as tmpdirname`: `tmpdirname` is the `.name` attribute, not the object itself\n- If used without `with`, clean with the `.cleanup()` method instead of the usual `.close()`"
    },
    {
      "metadata": {
        "slideshow": {
          "slide_type": "subslide"
        }
      },
      "cell_type": "markdown",
      "source": "# IO"
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "`io` can be used to make in-memory variables behave like files."
    },
    {
      "metadata": {
        "ExecuteTime": {
          "end_time": "2020-05-20T09:54:07.409921Z",
          "start_time": "2020-05-20T09:54:07.402272Z"
        },
        "cell_style": "split",
        "trusted": true
      },
      "cell_type": "code",
      "source": "import io\nwith io.StringIO(txt) as f:\n    print(f.read())",
      "execution_count": 15,
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": "All work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\n\n"
        }
      ]
    },
    {
      "metadata": {
        "ExecuteTime": {
          "end_time": "2020-05-20T09:55:08.215134Z",
          "start_time": "2020-05-20T09:55:08.206054Z"
        },
        "cell_style": "split",
        "trusted": true
      },
      "cell_type": "code",
      "source": "with io.BytesIO(raw) as f:\n    with gzip.open(f, 'rt') as g:\n        print(g.read())",
      "execution_count": 16,
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": "All work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\n\n"
        }
      ]
    },
    {
      "metadata": {
        "slideshow": {
          "slide_type": "slide"
        }
      },
      "cell_type": "markdown",
      "source": "### Examples"
    },
    {
      "metadata": {
        "slideshow": {
          "slide_type": "slide"
        }
      },
      "cell_type": "markdown",
      "source": "# Save/load results"
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "Imagine you have a function that makes a huge computation."
    },
    {
      "metadata": {
        "ExecuteTime": {
          "end_time": "2020-05-20T09:56:30.237991Z",
          "start_time": "2020-05-20T09:56:30.227409Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "def compute_machin_bidule():\n    print(\"This is a very complex function that takes a lot of time.\")\n    return \"Machin Bidule!\"\ncompute_machin_bidule()",
      "execution_count": 17,
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": "This is a very complex function that takes a lot of time.\n"
        },
        {
          "data": {
            "text/plain": "'Machin Bidule!'"
          },
          "execution_count": 17,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "Objective: write a function that loads result from file if it exists, compute and save otherwise."
    },
    {
      "metadata": {
        "slideshow": {
          "slide_type": "subslide"
        }
      },
      "cell_type": "markdown",
      "source": "# Old vs news"
    },
    {
      "metadata": {
        "ExecuteTime": {
          "end_time": "2020-05-20T12:42:01.250761Z",
          "start_time": "2020-05-20T12:42:01.244417Z"
        },
        "cell_style": "split",
        "trusted": true
      },
      "cell_type": "code",
      "source": "# Almost real code by the old me!\ndef get_machin_bidule(filename='machin.txt', \n                      directory='./'):\n    file = directory+filename\n    try:\n        f = open(file)\n        content = f.read()\n        f.close()\n        return content\n    except IOError:\n        content = compute_machin_bidule()\n        f = open(file, \"w\")\n        f.write(content)\n        f.close()\n        return content",
      "execution_count": 25,
      "outputs": []
    },
    {
      "metadata": {
        "ExecuteTime": {
          "end_time": "2020-05-20T12:43:12.773255Z",
          "start_time": "2020-05-20T12:43:12.765260Z"
        },
        "cell_style": "split",
        "trusted": true
      },
      "cell_type": "code",
      "source": "# How I do this today\nfrom pathlib import Path\nimport dill as pickle\ndef get_machin_bidule(filename='machin.pkl', \n                      directory='.'):\n    file = Path(directory) / Path(filename)\n    if file.exists():\n        with open(file, 'rb') as f:\n            return pickle.load(f)\n    else:        \n        content = compute_machin_bidule()\n        with open(file, 'wb') as f:\n            pickle.dump(content, f)\n        return content",
      "execution_count": 26,
      "outputs": []
    },
    {
      "metadata": {
        "ExecuteTime": {
          "end_time": "2020-05-20T09:57:25.511127Z",
          "start_time": "2020-05-20T09:57:25.498180Z"
        },
        "trusted": true
      },
      "cell_type": "code",
      "source": "get_machin_bidule()",
      "execution_count": 21,
      "outputs": [
        {
          "data": {
            "text/plain": "'Machin Bidule!'"
          },
          "execution_count": 21,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {
        "slideshow": {
          "slide_type": "subslide"
        }
      },
      "cell_type": "markdown",
      "source": "# Save/load results"
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "Can be used as a decorator or as a MixIn.\n\nCf for example https://gismo.readthedocs.io/en/latest/reference.html#io"
    },
    {
      "metadata": {
        "slideshow": {
          "slide_type": "slide"
        }
      },
      "cell_type": "markdown",
      "source": "# Video recompression"
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "Use case: recompress a bunch of avi files to gain space.\n- ffmpeg will do the actual job (we assume ffmpeg is installed on your system https://ffmpeg.org/download.html)\n- pathlib will do the rest"
    },
    {
      "metadata": {
        "slideshow": {
          "slide_type": "subslide"
        }
      },
      "cell_type": "markdown",
      "source": "# Video recompression"
    },
    {
      "metadata": {
        "ExecuteTime": {
          "end_time": "2020-05-20T10:04:04.231373Z",
          "start_time": "2020-05-20T10:04:04.219179Z"
        },
        "cell_style": "split",
        "trusted": true
      },
      "cell_type": "code",
      "source": "from pathlib import Path\nimport subprocess\nimport json\n\ndef recompress(filepath):\n    print(f\"Recompressing {filepath.name}.\")\n    target = filepath.with_suffix(\".mkv\")\n    cmd = (f\"ffmpeg -y -i \\\"{filepath}\\\" \"\n           f\"-c:v libx265 -c:a copy \\\"{target}\\\"\")\n    c = subprocess.run(cmd)\n    if c.returncode != 0:\n        print(f\"Error for {filepath.name}!!!\")\n        if target.exists():\n            target.unlink()\n    else:\n        old_s = filepath.stat().st_size\n        new_s = Path(target).stat().st_size\n        print(f\"Relative size of new file: \"\n              f\"{100*new_s/old_s:.2f}%\")\n        filepath.unlink()",
      "execution_count": 23,
      "outputs": []
    },
    {
      "metadata": {
        "ExecuteTime": {
          "end_time": "2020-05-20T10:04:42.833384Z",
          "start_time": "2020-05-20T10:04:05.754103Z"
        },
        "cell_style": "split",
        "trusted": true
      },
      "cell_type": "code",
      "source": "# This is where I stored videos for the talk\n# You can try this with your own\n# !!! original videos will be removed if ffmpeg succeeds !!!\nd = Path(\"../../../../../Datasets/Videos\")\nfor file in d.rglob('*.avi'):\n    recompress(file)",
      "execution_count": 24,
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": "Recompressing Erdos - 3D.avi.\nRelative size of new file: 1.79%\nRecompressing not_a_video.avi.\nError for not_a_video.avi!!!\nRecompressing init-random-ok.avi.\nRelative size of new file: 24.12%\n"
        }
      ]
    }
  ],
  "metadata": {
    "celltoolbar": "Slideshow",
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3",
      "language": "python"
    },
    "language_info": {
      "name": "python",
      "version": "3.7.7",
      "mimetype": "text/x-python",
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "pygments_lexer": "ipython3",
      "nbconvert_exporter": "python",
      "file_extension": ".py"
    },
    "rise": {
      "enable_chalkboard": true
    },
    "toc": {
      "nav_menu": {},
      "number_sections": true,
      "sideBar": true,
      "skip_h1_title": true,
      "base_numbering": 1,
      "title_cell": "Table of Contents",
      "title_sidebar": "Contents",
      "toc_cell": false,
      "toc_position": {},
      "toc_section_display": true,
      "toc_window_display": false
    },
    "gist": {
      "id": "",
      "data": {
        "description": "InputOutput-Part_I.ipynb",
        "public": true
      }
    }
  },
  "nbformat": 4,
  "nbformat_minor": 4
 }