Created
May 20, 2020 20:07
InputOutput-Part_I.ipynb
{ | |
"cells": [ | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"end_time": "2020-05-20T08:52:16.800805Z", | |
"start_time": "2020-05-20T08:52:16.772570Z" | |
}, | |
"slideshow": { | |
"slide_type": "skip" | |
}, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "from IPython.core.display import display, HTML\ndisplay(HTML(\"\"\"<style>\n.prompt_container { display: none !important; }\n.prompt { display: none !important; }\n.run_this_cell { display: none !important; }\n</style>\"\"\"))", | |
"execution_count": 1, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": "<style>\n.prompt_container { display: none !important; }\n.prompt { display: none !important; }\n.run_this_cell { display: none !important; }\n</style>", | |
"text/plain": "<IPython.core.display.HTML object>" | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "slide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "# Input, Output, and the Internet" | |
}, | |
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "slide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "## Introduction" | |
}, | |
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "slide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "### Objectives" | |
}, | |
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "- Show simple things that facilitates dealing with files\n- Some you may know (François / Marco workshops)\n- Brief overview, references, examples\n- This is how *I* do these things *today* (probably not perfect)" | |
}, | |
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "slide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "### Roadmap\n\n- Part I: Local files\n - Main things to know\n - Side things to know\n - Examples\n- Part II: The Internet\n - http & html\n - requests\n - BeautifulSoup\n - Selenium\n - Examples " | |
}, | |
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "slide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "## Part I: Local files" | |
}, | |
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "slide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "### Main things to know" | |
}, | |
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "subslide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "# Bytes vs strings" | |
}, | |
{ | |
"metadata": { | |
"cell_style": "split" | |
}, | |
"cell_type": "markdown", | |
"source": "- Byte objects are 0's and 1's by groups of 8.\n- Meaning depends on a convention (MP3, JPG, UTF8).\n- Strings are sequence of characters.\n- Strings need encoding before which they can be stored." | |
}, | |
{ | |
"metadata": { | |
"cell_style": "split" | |
}, | |
"cell_type": "markdown", | |
"source": "<img src=\"https://media.geeksforgeeks.org/wp-content/uploads/string-vs-byte-in-python.png\">" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "https://www.geeksforgeeks.org/byte-objects-vs-string-python/" | |
}, | |
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "subslide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "# Bytes vs String" | |
}, | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"end_time": "2020-05-20T09:15:13.210148Z", | |
"start_time": "2020-05-20T09:15:13.192806Z" | |
}, | |
"cell_style": "split", | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "txt = \"Élise\"\nprint(f\"{txt} has length {len(txt)}\")\n\nraw = bytes(txt, encoding='utf8')\nprint(f\"{raw} has length {len(raw)}\")\n\nraw = bytes(txt, encoding='latin_1')\nprint(f\"{raw} has length {len(raw)}\")", | |
"execution_count": 2, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": "Élise has length 5\nb'\\xc3\\x89lise' has length 6\nb'\\xc9lise' has length 5\n" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"end_time": "2020-05-20T09:17:20.036628Z", | |
"start_time": "2020-05-20T09:17:20.024204Z" | |
}, | |
"cell_style": "split", | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "try:\n bytes(txt, encoding='ASCII')\nexcept Exception as e: print(e)\n\ntry:\n bytes(txt, encoding='latin_1').decode('utf8')\nexcept Exception as e: print(e)\n\nrecode = bytes(txt, encoding='utf8').decode('latin_1')\nprint(f\"{recode} has length {len(raw)}\")", | |
"execution_count": 3, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": "'ascii' codec can't encode character '\\xc9' in position 0: ordinal not in range(128)\n'utf-8' codec can't decode byte 0xc9 in position 0: invalid continuation byte\nÃlise has length 5\n" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "subslide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "# Bytes vs String" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "- Some things need bytes, other need strings\n- 95% of errors can be avoided if you understand the difference\n- Stick to utf-8, try to tell your encoding in your file (e.g. `% !TeX encoding = UTF-8`)\n- In Python, files can be opened in binary mode (`b`) or text mode (`t`).\n - Text mode transparently makes the bytes <-> string conversion\n - Default is **System dependent**, can be specified with `encoding` parameter" | |
}, | |
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "subslide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "# Bytes vs String" | |
}, | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"end_time": "2020-05-20T09:21:11.014851Z", | |
"start_time": "2020-05-20T09:21:10.998321Z" | |
}, | |
"cell_style": "split", | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "f = open('test.txt', 'wt')\nf.write('Élise')\nf.close()\nf = open('test.txt', 'rb')\nraw = f.read()\nf.close()\nprint(f\"Bytes of file are {raw}\")", | |
"execution_count": 4, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": "Bytes of file are b'\\xc9lise'\n" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"end_time": "2020-05-20T09:21:52.289689Z", | |
"start_time": "2020-05-20T09:21:52.276221Z" | |
}, | |
"cell_style": "split", | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "f = open('test.txt', 'wt', encoding='utf8')\nf.write('Élise')\nf.close()\nf = open('test.txt', 'rb')\nraw = f.read()\nf.close()\nprint(f\"Bytes of file are {raw},\" \n f\"text is {raw.decode('utf8')}\")\n\n\n", | |
"execution_count": 5, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": "Bytes of file are b'\\xc3\\x89lise',text is Élise\n" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "subslide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "# The `pathlib` module" | |
}, | |
{ | |
"metadata": { | |
"cell_style": "split" | |
}, | |
"cell_type": "markdown", | |
"source": "Possible issues when dealing with files:\n- OS conventions (e.g. `\\` vs `/`)\n- Absolute vs relative\n- Search for specific file(s)\n- Concatenation:\n - ``dir+files``?\n - ``dir+\"/\"+files``?\n - ``dir+\"\\\"+files``?\n - ``dir+\"\\\\\"+files``?" | |
}, | |
{ | |
"metadata": { | |
"cell_style": "split" | |
}, | |
"cell_type": "markdown", | |
"source": "pathlib removes most of these issues\n- Introduced in 3.4\n- Replace previous modules like os, glob\n- OS independant" | |
}, | |
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "subslide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "# The `pathlib` module" | |
}, | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"end_time": "2020-05-20T09:25:51.607491Z", | |
"start_time": "2020-05-20T09:25:51.600044Z" | |
}, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "from pathlib import Path\nd = Path('.')\nfile = Path('test.txt')", | |
"execution_count": 7, | |
"outputs": [] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Why use path?\n- Bunch of useful methods: exists, is_file, is_dir, unlink, stem, suffix...\n- Simple construction: `Path('temp') / Path('tempfile')`\n- Easy to turn string-based code into path-based code\n - All common methods that accept a string accept a path\n - `Path` is idempotent: `Path(Path(s))==Path(s)`\n" | |
}, | |
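As a rough sketch of the points above, the snippet below builds a path with `/` and reads a few of its attributes. The names `temp` and `report.tar.gz` are made up for illustration, and `PurePosixPath` is used (an assumption of this sketch) so that the printed separators are the same on every OS; nothing is read from or written to disk.

```python
from pathlib import Path, PurePosixPath

# Hypothetical names, used only for illustration
p = PurePosixPath("temp") / "report.tar.gz"

print(p.name)    # report.tar.gz
print(p.stem)    # report.tar
print(p.suffix)  # .gz
print(p.parent)  # temp
print(p.with_suffix(".zip"))  # temp/report.tar.zip

# Path is idempotent: wrapping a Path in Path changes nothing
assert Path(Path("temp")) == Path("temp")
```

In everyday code a plain `Path` is usually enough; the pure variant only keeps this example deterministic.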
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "subslide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "# The `pathlib` module" | |
}, | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"end_time": "2020-05-20T09:27:18.353712Z", | |
"start_time": "2020-05-20T09:27:18.338804Z" | |
}, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "d = Path('.')\nfor file in d.rglob('*ipynb*'):\n if file.is_file():\n print(f\"{file} is a file.\")\n elif file.is_dir():\n print(f\"{file} is a dir.\")", | |
"execution_count": 8, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": ".ipynb_checkpoints is a dir.\nInputOutput.ipynb is a file.\nPython Academy.ipynb is a file.\nUntitled.ipynb is a file.\n.ipynb_checkpoints\\InputOutput-checkpoint.ipynb is a file.\n.ipynb_checkpoints\\Python Academy-checkpoint.ipynb is a file.\n" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"end_time": "2020-05-20T09:28:03.624866Z", | |
"start_time": "2020-05-20T09:28:03.612903Z" | |
}, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "file = Path('test.txt')\nprint(file.exists())\nfile = Path('tets.txt')\nprint(file.exists())", | |
"execution_count": 9, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": "True\nFalse\n" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "subslide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "# Context managers" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "- Principle: some things require some cleaning after they are used\n- Context managers allow to do this implicitly\n- Python's `open` can be used as a CM\n- Many other function/classes do:\n - Temporary directory\n - Internet session\n - You can very easily write your own CM, cf https://docs.python.org/3.7/reference/datamodel.html#context-managers" | |
}, | |
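To illustrate the "write your own CM" point, here is a minimal sketch built on `contextlib.contextmanager` from the standard library. The `tracked` helper and the `events` list are hypothetical names invented for this example; they only record what happens so that the cleanup step is visible, including when the body raises.

```python
from contextlib import contextmanager

events = []  # records what happens, for illustration only

@contextmanager
def tracked(name):
    """Hypothetical context manager that logs setup and cleanup."""
    events.append(f"open {name}")
    try:
        yield name
    finally:
        # Runs even if the body of the `with` block raises
        events.append(f"close {name}")

try:
    with tracked("resource") as r:
        events.append(f"use {r}")
        raise ValueError("boom")
except ValueError:
    pass

print(events)  # ['open resource', 'use resource', 'close resource']
```

The `try`/`finally` around `yield` is what guarantees the cleanup; a class with `__enter__` and `__exit__` would be an equivalent alternative.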
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "subslide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "# Context managers" | |
}, | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"end_time": "2020-05-20T09:33:16.266375Z", | |
"start_time": "2020-05-20T09:33:16.251796Z" | |
}, | |
"cell_style": "split", | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "def get_txt():\n f = open('test.txt', 'rb')\n txt = f.read().decode('utf8')\n f.close()\n return txt\nget_txt()", | |
"execution_count": 10, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": "'Élise'" | |
}, | |
"execution_count": 10, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"end_time": "2020-05-20T09:33:19.086772Z", | |
"start_time": "2020-05-20T09:33:19.074923Z" | |
}, | |
"cell_style": "split", | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "def get_txt():\n with open('test.txt', 'rb') as f:\n return f.read().decode('utf8')\nget_txt()", | |
"execution_count": 11, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": "'Élise'" | |
}, | |
"execution_count": 11, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Advantages:\n- Mess (opened file descriptor) is automatically cleaned\n- Even in case of Error!\n- Indentation shows the validity of file access\n- Sligthly shorter code" | |
}, | |
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "subslide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "# Context managers" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Always use it if you can!\n\nCases where the old way may be better:\n - File spans multiple cells in a notebook\n - doctest with multiple asserts\n - too much indentation already (maybe consider use of subfunctions?)" | |
}, | |
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "subslide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "# write vs json vs numpy vs dill vs pickle" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "read/write are the basic methods to access a file, but they are others:" | |
}, | |
{ | |
"metadata": { | |
"cell_style": "split" | |
}, | |
"cell_type": "markdown", | |
"source": "- `json` (dump/load): for list, dict, ...\n - Human-readable\n - Only works for json\n- `numpy` (save/load and others): for numpy objects\n - Multiple objects and other stuff\n - Only work for numpy" | |
}, | |
{ | |
"metadata": { | |
"cell_style": "split" | |
}, | |
"cell_type": "markdown", | |
"source": "- `pickle` (dump/load):\n - Included in the standard library\n - Trouble with complex objects\n- `dill` (dump/load):\n - Encode almost anything\n - not included in the standard library" | |
}, | |
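The trade-off between `json` and `pickle` can be sketched with in-memory round trips. This example uses `dumps`/`loads` (the string/bytes variants of `dump`/`load`) so that no file is needed; the sample dictionaries are invented for illustration.

```python
import json
import pickle

data = {"name": "Élise", "scores": [1, 2, 3]}

# json: human-readable text, limited to basic types
as_text = json.dumps(data, ensure_ascii=False)
print(as_text)
assert json.loads(as_text) == data

# pickle: opaque bytes, but handles more Python types (here a set and a tuple)
rich = {"tags": {"a", "b"}, "point": (1, 2)}
as_bytes = pickle.dumps(rich)
assert pickle.loads(as_bytes) == rich

# json cannot serialize a set
json_failed = False
try:
    json.dumps(rich)
except TypeError as e:
    json_failed = True
    print(f"json failed: {e}")
```

`dill` exposes the same `dumps`/`loads` interface, which is why it can be imported as a drop-in replacement for `pickle`.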
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "subslide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "# write vs json vs numpy vs dill vs pickle" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "My choice: \n- `json` if I am sure my object is human-readable (e.g. no 1,000,000 X 1,000,000 matrix inside)\n- `dill as pickle` otherwise." | |
}, | |
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "slide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "### Side things to know" | |
}, | |
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "subslide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "# Compression" | |
}, | |
{ | |
"metadata": { | |
"cell_style": "split" | |
}, | |
"cell_type": "markdown", | |
"source": "- `gzip`: one size fits all compression for single files\n- `zipfile`: for including multiple files in one `zip` archive\n- `zlib`: in-memory `gzip`" | |
}, | |
{ | |
"metadata": { | |
"cell_style": "split" | |
}, | |
"cell_type": "markdown", | |
"source": "- Common mistakes: uncompress, suppress compressed, read uncompressed, or write, compress, suppress uncompress\n- Use gzip.open in place of open to do it all in one go\n" | |
}, | |
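For the multi-file case, a minimal `zipfile` sketch is given here. The archive is built in an `io.BytesIO` buffer rather than on disk, and the member names `a.txt` and `b.txt` are made up for illustration.

```python
import io
import zipfile

buffer = io.BytesIO()  # in-memory stand-in for a .zip file on disk

# Write two members into one archive
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("a.txt", "first file")
    zf.writestr("b.txt", "second file")

# Read them back without extracting anything to disk
with zipfile.ZipFile(buffer) as zf:
    names = zf.namelist()
    content = zf.read("a.txt").decode("utf8")

print(names)    # ['a.txt', 'b.txt']
print(content)  # first file
```

Passing a file name or a `Path` instead of the buffer gives the usual on-disk archive.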
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "subslide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "# Compression" | |
}, | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"end_time": "2020-05-20T09:46:52.897754Z", | |
"start_time": "2020-05-20T09:46:52.879630Z" | |
}, | |
"cell_style": "split", | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "file = Path(\"test.txt.gz\")\nimport gzip\ntxt = \"All work and no play makes Jack a dull boy.\\n\"*20\nprint(f\"Size of text is {len(txt)} characters.\")\nwith gzip.open(file, \"wt\") as f:\n f.write(txt)\n print(f\"Size of virtual uncompressed file is {f.tell()} bytes.\")\nprint(f\"Size of compressed file is {file.stat().st_size} bytes.\")", | |
"execution_count": 12, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": "Size of text is 880 characters.\nSize of virtual uncompressed file is 900 bytes.\nSize of compressed file is 88 bytes.\n" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"end_time": "2020-05-20T09:48:52.944712Z", | |
"start_time": "2020-05-20T09:48:52.932852Z" | |
}, | |
"cell_style": "split", | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "with gzip.open(\"test.txt.gz\", \"rt\") as f:\n print(f.read())", | |
"execution_count": 13, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": "All work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\n\n" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "subslide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "# Compression" | |
}, | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"end_time": "2020-05-20T09:49:31.807550Z", | |
"start_time": "2020-05-20T09:49:31.796656Z" | |
}, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "import zlib\nwith open(\"test.txt.gz\", \"rb\") as f:\n raw = f.read()\nprint(f\"Raw binary data: {raw}\")\nprint(zlib.decompress(raw, 15+32).decode())", | |
"execution_count": 14, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": "Raw binary data: b'\\x1f\\x8b\\x08\\x08\\x8c\\xfc\\xc4^\\x02\\xfftest.txt\\x00r\\xcc\\xc9Q(\\xcf/\\xcaVH\\xccKQ\\xc8\\xcbW(\\xc8I\\xacT\\xc8M\\xccN-V\\xf0JL\\x06\\n+\\xa4\\x94\\x02\\x95$\\xe5W\\xea\\xf1r9\\x8e*\\x1eU<\\xaa\\x98\\xda\\x8a\\x01\\x00\\x00\\x00\\xff\\xff\\x03\\x00.\\x97\\xb4\\x80\\x84\\x03\\x00\\x00'\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\n\n" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "subslide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "# tempfile" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "You can use the tempfile module to create temporary files and directory *somewhere*. If you use `with`, they are deleted afterwards.\n\nExamples:\n- https://gismo.readthedocs.io/en/latest/tutorials/tutorial_IO.html\n- https://gismo.readthedocs.io/en/latest/reference.html#gismo.datasets.dblp.Dblp.build" | |
}, | |
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "subslide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "# tempfile" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Remarks:\n- `with tempfile.TemporaryDirectory() as tmpdirname`: `tmpdirname` is the `.name` attribute, not the object itself\n- If used without `with`, clean with the `.cleanup()` method instead of the usual `.close()`" | |
}, | |
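The two remarks above can be sketched as follows. The file name `scratch.txt` is hypothetical; the rest uses only `tempfile` and `pathlib` from the standard library.

```python
import tempfile
from pathlib import Path

# With `with`: the directory and its content are removed on exit
with tempfile.TemporaryDirectory() as tmpdirname:
    # tmpdirname is a plain string (the .name attribute), so wrap it in Path
    tmp = Path(tmpdirname)
    file = tmp / "scratch.txt"
    file.write_text("temporary content", encoding="utf8")
    existed_inside = file.exists()

exists_after = tmp.exists()
print(existed_inside, exists_after)  # True False

# Without `with`: keep the object and clean up explicitly
td = tempfile.TemporaryDirectory()
manual = Path(td.name)
td.cleanup()
print(manual.exists())  # False
```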
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "subslide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "# IO" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "`io` can be used to make in-memory variables behave like files." | |
}, | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"end_time": "2020-05-20T09:54:07.409921Z", | |
"start_time": "2020-05-20T09:54:07.402272Z" | |
}, | |
"cell_style": "split", | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "import io\nwith io.StringIO(txt) as f:\n print(f.read())", | |
"execution_count": 15, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": "All work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\n\n" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"end_time": "2020-05-20T09:55:08.215134Z", | |
"start_time": "2020-05-20T09:55:08.206054Z" | |
}, | |
"cell_style": "split", | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "with io.BytesIO(raw) as f:\n with gzip.open(f, 'rt') as g:\n print(g.read())", | |
"execution_count": 16, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": "All work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\n\n" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "slide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "### Examples" | |
}, | |
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "slide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "# Save/load results" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Imagine you have a function that makes a huge computation." | |
}, | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"end_time": "2020-05-20T09:56:30.237991Z", | |
"start_time": "2020-05-20T09:56:30.227409Z" | |
}, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "def compute_machin_bidule():\n print(\"This is a very complex function that takes a lot of time.\")\n return \"Machin Bidule!\"\ncompute_machin_bidule()", | |
"execution_count": 17, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": "This is a very complex function that takes a lot of time.\n" | |
}, | |
{ | |
"data": { | |
"text/plain": "'Machin Bidule!'" | |
}, | |
"execution_count": 17, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Objective: write a function that loads result from file if it exists, compute and save otherwise." | |
}, | |
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "subslide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "# Old vs news" | |
}, | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"end_time": "2020-05-20T12:42:01.250761Z", | |
"start_time": "2020-05-20T12:42:01.244417Z" | |
}, | |
"cell_style": "split", | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "# Almost real code by the old me!\ndef get_machin_bidule(filename='machin.txt', \n directory='./'):\n file = directory+filename\n try:\n f = open(file)\n content = f.read()\n f.close()\n return content\n except IOError:\n content = compute_machin_bidule()\n f = open(file, \"w\")\n f.write(content)\n f.close()\n return content", | |
"execution_count": 25, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"end_time": "2020-05-20T12:43:12.773255Z", | |
"start_time": "2020-05-20T12:43:12.765260Z" | |
}, | |
"cell_style": "split", | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "# How I do this today\nfrom pathlib import Path\nimport dill as pickle\ndef get_machin_bidule(filename='machin.pkl', \n directory='.'):\n file = Path(directory) / Path(filename)\n if file.exists():\n with open(file, 'rb') as f:\n return pickle.load(f)\n else: \n content = compute_machin_bidule()\n with open(file, 'wb') as f:\n pickle.dump(content, f)\n return content", | |
"execution_count": 26, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"end_time": "2020-05-20T09:57:25.511127Z", | |
"start_time": "2020-05-20T09:57:25.498180Z" | |
}, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "get_machin_bidule()", | |
"execution_count": 21, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": "'Machin Bidule!'" | |
}, | |
"execution_count": 21, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "subslide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "# Save/load results" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Can be used as a decorator or as a MixIn.\n\nCf for example https://gismo.readthedocs.io/en/latest/reference.html#io" | |
}, | |
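One possible decorator version of the load-or-compute pattern is sketched below. This is not the gismo implementation: the `cached` name is hypothetical, the standard `pickle` is used instead of `dill` to keep the example dependency-free, and only functions without arguments are handled. A temporary directory hosts the cache file so the example leaves nothing behind.

```python
import pickle
import tempfile
from functools import wraps
from pathlib import Path

def cached(file):
    """Hypothetical decorator: load the result from `file` if present,
    otherwise compute it and save it."""
    file = Path(file)

    def decorator(func):
        @wraps(func)
        def wrapper():
            if file.exists():
                with open(file, "rb") as f:
                    return pickle.load(f)
            result = func()
            with open(file, "wb") as f:
                pickle.dump(result, f)
            return result
        return wrapper
    return decorator

calls = []  # tracks how many times the real computation runs

with tempfile.TemporaryDirectory() as tmp:
    @cached(Path(tmp) / "machin.pkl")
    def compute_machin_bidule():
        calls.append("computed")
        return "Machin Bidule!"

    first = compute_machin_bidule()   # computes and saves
    second = compute_machin_bidule()  # loads from file

print(first, second, calls)
```

Supporting arguments would require deriving the file name from them, which is left out of this sketch.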
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "slide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "# Video recompression" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Use case: recompress a bunch of avi files to gain space.\n- ffmpeg will do the actual job (we assume ffmpeg is installed on your system https://ffmpeg.org/download.html)\n- pathlib will do the rest" | |
}, | |
{ | |
"metadata": { | |
"slideshow": { | |
"slide_type": "subslide" | |
} | |
}, | |
"cell_type": "markdown", | |
"source": "# Video recompression" | |
}, | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"end_time": "2020-05-20T10:04:04.231373Z", | |
"start_time": "2020-05-20T10:04:04.219179Z" | |
}, | |
"cell_style": "split", | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "from pathlib import Path\nimport subprocess\nimport json\n\ndef recompress(filepath):\n print(f\"Recompressing {filepath.name}.\")\n target = filepath.with_suffix(\".mkv\")\n cmd = (f\"ffmpeg -y -i \\\"{filepath}\\\" \"\n f\"-c:v libx265 -c:a copy \\\"{target}\\\"\")\n c = subprocess.run(cmd)\n if c.returncode != 0:\n print(f\"Error for {filepath.name}!!!\")\n if target.exists():\n target.unlink()\n else:\n old_s = filepath.stat().st_size\n new_s = Path(target).stat().st_size\n print(f\"Relative size of new file: \"\n f\"{100*new_s/old_s:.2f}%\")\n filepath.unlink()", | |
"execution_count": 23, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"end_time": "2020-05-20T10:04:42.833384Z", | |
"start_time": "2020-05-20T10:04:05.754103Z" | |
}, | |
"cell_style": "split", | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "# This is where I stored videos for the talk\n# You can try this with your own\n# !!! original videos will be removed if ffmpeg succeeds !!!\nd = Path(\"../../../../../Datasets/Videos\")\nfor file in d.rglob('*.avi'):\n recompress(file)", | |
"execution_count": 24, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": "Recompressing Erdos - 3D.avi.\nRelative size of new file: 1.79%\nRecompressing not_a_video.avi.\nError for not_a_video.avi!!!\nRecompressing init-random-ok.avi.\nRelative size of new file: 24.12%\n" | |
} | |
] | |
} | |
], | |
"metadata": { | |
"celltoolbar": "Slideshow", | |
"kernelspec": { | |
"name": "python3", | |
"display_name": "Python 3", | |
"language": "python" | |
}, | |
"language_info": { | |
"name": "python", | |
"version": "3.7.7", | |
"mimetype": "text/x-python", | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"pygments_lexer": "ipython3", | |
"nbconvert_exporter": "python", | |
"file_extension": ".py" | |
}, | |
"rise": { | |
"enable_chalkboard": true | |
}, | |
"toc": { | |
"nav_menu": {}, | |
"number_sections": true, | |
"sideBar": true, | |
"skip_h1_title": true, | |
"base_numbering": 1, | |
"title_cell": "Table of Contents", | |
"title_sidebar": "Contents", | |
"toc_cell": false, | |
"toc_position": {}, | |
"toc_section_display": true, | |
"toc_window_display": false | |
}, | |
"gist": { | |
"id": "", | |
"data": { | |
"description": "InputOutput-Part_I.ipynb", | |
"public": true | |
} | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 4 | |
} |