@balouf
Created May 20, 2020 20:07
InputOutput-Part_I.ipynb
{
"cells": [
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-20T08:52:16.800805Z",
"start_time": "2020-05-20T08:52:16.772570Z"
},
"slideshow": {
"slide_type": "skip"
},
"trusted": true
},
"cell_type": "code",
"source": "from IPython.core.display import display, HTML\ndisplay(HTML(\"\"\"<style>\n.prompt_container { display: none !important; }\n.prompt { display: none !important; }\n.run_this_cell { display: none !important; }\n</style>\"\"\"))",
"execution_count": 1,
"outputs": [
{
"data": {
"text/html": "<style>\n.prompt_container { display: none !important; }\n.prompt { display: none !important; }\n.run_this_cell { display: none !important; }\n</style>",
"text/plain": "<IPython.core.display.HTML object>"
},
"metadata": {},
"output_type": "display_data"
}
]
},
{
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"cell_type": "markdown",
"source": "# Input, Output, and the Internet"
},
{
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"cell_type": "markdown",
"source": "## Introduction"
},
{
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"cell_type": "markdown",
"source": "### Objectives"
},
{
"metadata": {
"slideshow": {
"slide_type": ""
}
},
"cell_type": "markdown",
"source": "- Show simple things that facilitates dealing with files\n- Some you may know (François / Marco workshops)\n- Brief overview, references, examples\n- This is how *I* do these things *today* (probably not perfect)"
},
{
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"cell_type": "markdown",
"source": "### Roadmap\n\n- Part I: Local files\n - Main things to know\n - Side things to know\n - Examples\n- Part II: The Internet\n - http & html\n - requests\n - BeautifulSoup\n - Selenium\n - Examples "
},
{
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"cell_type": "markdown",
"source": "## Part I: Local files"
},
{
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"cell_type": "markdown",
"source": "### Main things to know"
},
{
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"cell_type": "markdown",
"source": "# Bytes vs strings"
},
{
"metadata": {
"cell_style": "split"
},
"cell_type": "markdown",
"source": "- Byte objects are 0's and 1's by groups of 8.\n- Meaning depends on a convention (MP3, JPG, UTF8).\n- Strings are sequence of characters.\n- Strings need encoding before which they can be stored."
},
{
"metadata": {
"cell_style": "split"
},
"cell_type": "markdown",
"source": "<img src=\"https://media.geeksforgeeks.org/wp-content/uploads/string-vs-byte-in-python.png\">"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "https://www.geeksforgeeks.org/byte-objects-vs-string-python/"
},
{
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"cell_type": "markdown",
"source": "# Bytes vs String"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-20T09:15:13.210148Z",
"start_time": "2020-05-20T09:15:13.192806Z"
},
"cell_style": "split",
"trusted": true
},
"cell_type": "code",
"source": "txt = \"Élise\"\nprint(f\"{txt} has length {len(txt)}\")\n\nraw = bytes(txt, encoding='utf8')\nprint(f\"{raw} has length {len(raw)}\")\n\nraw = bytes(txt, encoding='latin_1')\nprint(f\"{raw} has length {len(raw)}\")",
"execution_count": 2,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "Élise has length 5\nb'\\xc3\\x89lise' has length 6\nb'\\xc9lise' has length 5\n"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-20T09:17:20.036628Z",
"start_time": "2020-05-20T09:17:20.024204Z"
},
"cell_style": "split",
"trusted": true
},
"cell_type": "code",
"source": "try:\n bytes(txt, encoding='ASCII')\nexcept Exception as e: print(e)\n\ntry:\n bytes(txt, encoding='latin_1').decode('utf8')\nexcept Exception as e: print(e)\n\nrecode = bytes(txt, encoding='utf8').decode('latin_1')\nprint(f\"{recode} has length {len(raw)}\")",
"execution_count": 3,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "'ascii' codec can't encode character '\\xc9' in position 0: ordinal not in range(128)\n'utf-8' codec can't decode byte 0xc9 in position 0: invalid continuation byte\nÉlise has length 5\n"
}
]
},
{
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"cell_type": "markdown",
"source": "# Bytes vs String"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "- Some things need bytes, other need strings\n- 95% of errors can be avoided if you understand the difference\n- Stick to utf-8, try to tell your encoding in your file (e.g. `% !TeX encoding = UTF-8`)\n- In Python, files can be opened in binary mode (`b`) or text mode (`t`).\n - Text mode transparently makes the bytes <-> string conversion\n - Default is **System dependent**, can be specified with `encoding` parameter"
},
{
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"cell_type": "markdown",
"source": "# Bytes vs String"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-20T09:21:11.014851Z",
"start_time": "2020-05-20T09:21:10.998321Z"
},
"cell_style": "split",
"trusted": true
},
"cell_type": "code",
"source": "f = open('test.txt', 'wt')\nf.write('Élise')\nf.close()\nf = open('test.txt', 'rb')\nraw = f.read()\nf.close()\nprint(f\"Bytes of file are {raw}\")",
"execution_count": 4,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "Bytes of file are b'\\xc9lise'\n"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-20T09:21:52.289689Z",
"start_time": "2020-05-20T09:21:52.276221Z"
},
"cell_style": "split",
"trusted": true
},
"cell_type": "code",
"source": "f = open('test.txt', 'wt', encoding='utf8')\nf.write('Élise')\nf.close()\nf = open('test.txt', 'rb')\nraw = f.read()\nf.close()\nprint(f\"Bytes of file are {raw},\" \n f\"text is {raw.decode('utf8')}\")\n\n\n",
"execution_count": 5,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "Bytes of file are b'\\xc3\\x89lise',text is Élise\n"
}
]
},
{
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"cell_type": "markdown",
"source": "# The `pathlib` module"
},
{
"metadata": {
"cell_style": "split"
},
"cell_type": "markdown",
"source": "Possible issues when dealing with files:\n- OS conventions (e.g. `\\` vs `/`)\n- Absolute vs relative\n- Search for specific file(s)\n- Concatenation:\n - ``dir+files``?\n - ``dir+\"/\"+files``?\n - ``dir+\"\\\"+files``?\n - ``dir+\"\\\\\"+files``?"
},
{
"metadata": {
"cell_style": "split"
},
"cell_type": "markdown",
"source": "pathlib removes most of these issues\n- Introduced in 3.4\n- Replace previous modules like os, glob\n- OS independant"
},
{
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"cell_type": "markdown",
"source": "# The `pathlib` module"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-20T09:25:51.607491Z",
"start_time": "2020-05-20T09:25:51.600044Z"
},
"trusted": true
},
"cell_type": "code",
"source": "from pathlib import Path\nd = Path('.')\nfile = Path('test.txt')",
"execution_count": 7,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Why use path?\n- Bunch of useful methods: exists, is_file, is_dir, unlink, stem, suffix...\n- Simple construction: `Path('temp') / Path('tempfile')`\n- Easy to turn string-based code into path-based code\n - All common methods that accept a string accept a path\n - `Path` is idempotent: `Path(Path(s))==Path(s)`\n"
},
{
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"cell_type": "markdown",
"source": "# The `pathlib` module"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-20T09:27:18.353712Z",
"start_time": "2020-05-20T09:27:18.338804Z"
},
"trusted": true
},
"cell_type": "code",
"source": "d = Path('.')\nfor file in d.rglob('*ipynb*'):\n if file.is_file():\n print(f\"{file} is a file.\")\n elif file.is_dir():\n print(f\"{file} is a dir.\")",
"execution_count": 8,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": ".ipynb_checkpoints is a dir.\nInputOutput.ipynb is a file.\nPython Academy.ipynb is a file.\nUntitled.ipynb is a file.\n.ipynb_checkpoints\\InputOutput-checkpoint.ipynb is a file.\n.ipynb_checkpoints\\Python Academy-checkpoint.ipynb is a file.\n"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-20T09:28:03.624866Z",
"start_time": "2020-05-20T09:28:03.612903Z"
},
"trusted": true
},
"cell_type": "code",
"source": "file = Path('test.txt')\nprint(file.exists())\nfile = Path('tets.txt')\nprint(file.exists())",
"execution_count": 9,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "True\nFalse\n"
}
]
},
{
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"cell_type": "markdown",
"source": "# Context managers"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "- Principle: some things require some cleaning after they are used\n- Context managers allow to do this implicitly\n- Python's `open` can be used as a CM\n- Many other function/classes do:\n - Temporary directory\n - Internet session\n - You can very easily write your own CM, cf https://docs.python.org/3.7/reference/datamodel.html#context-managers"
},
{
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"cell_type": "markdown",
"source": "# Context managers"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-20T09:33:16.266375Z",
"start_time": "2020-05-20T09:33:16.251796Z"
},
"cell_style": "split",
"trusted": true
},
"cell_type": "code",
"source": "def get_txt():\n f = open('test.txt', 'rb')\n txt = f.read().decode('utf8')\n f.close()\n return txt\nget_txt()",
"execution_count": 10,
"outputs": [
{
"data": {
"text/plain": "'Élise'"
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-20T09:33:19.086772Z",
"start_time": "2020-05-20T09:33:19.074923Z"
},
"cell_style": "split",
"trusted": true
},
"cell_type": "code",
"source": "def get_txt():\n with open('test.txt', 'rb') as f:\n return f.read().decode('utf8')\nget_txt()",
"execution_count": 11,
"outputs": [
{
"data": {
"text/plain": "'Élise'"
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Advantages:\n- Mess (opened file descriptor) is automatically cleaned\n- Even in case of Error!\n- Indentation shows the validity of file access\n- Sligthly shorter code"
},
{
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"cell_type": "markdown",
"source": "# Context managers"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Always use it if you can!\n\nCases where the old way may be better:\n - File spans multiple cells in a notebook\n - doctest with multiple asserts\n - too much indentation already (maybe consider use of subfunctions?)"
},
{
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"cell_type": "markdown",
"source": "# write vs json vs numpy vs dill vs pickle"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "read/write are the basic methods to access a file, but they are others:"
},
{
"metadata": {
"cell_style": "split"
},
"cell_type": "markdown",
"source": "- `json` (dump/load): for list, dict, ...\n - Human-readable\n - Only works for json\n- `numpy` (save/load and others): for numpy objects\n - Multiple objects and other stuff\n - Only work for numpy"
},
{
"metadata": {
"cell_style": "split"
},
"cell_type": "markdown",
"source": "- `pickle` (dump/load):\n - Included in the standard library\n - Trouble with complex objects\n- `dill` (dump/load):\n - Encode almost anything\n - not included in the standard library"
},
{
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"cell_type": "markdown",
"source": "# write vs json vs numpy vs dill vs pickle"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "My choice: \n- `json` if I am sure my object is human-readable (e.g. no 1,000,000 X 1,000,000 matrix inside)\n- `dill as pickle` otherwise."
},
{
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"cell_type": "markdown",
"source": "### Side things to know"
},
{
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"cell_type": "markdown",
"source": "# Compression"
},
{
"metadata": {
"cell_style": "split"
},
"cell_type": "markdown",
"source": "- `gzip`: one size fits all compression for single files\n- `zipfile`: for including multiple files in one `zip` archive\n- `zlib`: in-memory `gzip`"
},
{
"metadata": {
"cell_style": "split"
},
"cell_type": "markdown",
"source": "- Common mistakes: uncompress, suppress compressed, read uncompressed, or write, compress, suppress uncompress\n- Use gzip.open in place of open to do it all in one go\n"
},
{
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"cell_type": "markdown",
"source": "# Compression"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-20T09:46:52.897754Z",
"start_time": "2020-05-20T09:46:52.879630Z"
},
"cell_style": "split",
"trusted": true
},
"cell_type": "code",
"source": "file = Path(\"test.txt.gz\")\nimport gzip\ntxt = \"All work and no play makes Jack a dull boy.\\n\"*20\nprint(f\"Size of text is {len(txt)} characters.\")\nwith gzip.open(file, \"wt\") as f:\n f.write(txt)\n print(f\"Size of virtual uncompressed file is {f.tell()} bytes.\")\nprint(f\"Size of compressed file is {file.stat().st_size} bytes.\")",
"execution_count": 12,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "Size of text is 880 characters.\nSize of virtual uncompressed file is 900 bytes.\nSize of compressed file is 88 bytes.\n"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-20T09:48:52.944712Z",
"start_time": "2020-05-20T09:48:52.932852Z"
},
"cell_style": "split",
"trusted": true
},
"cell_type": "code",
"source": "with gzip.open(\"test.txt.gz\", \"rt\") as f:\n print(f.read())",
"execution_count": 13,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "All work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\n\n"
}
]
},
{
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"cell_type": "markdown",
"source": "# Compression"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-20T09:49:31.807550Z",
"start_time": "2020-05-20T09:49:31.796656Z"
},
"trusted": true
},
"cell_type": "code",
"source": "import zlib\nwith open(\"test.txt.gz\", \"rb\") as f:\n raw = f.read()\nprint(f\"Raw binary data: {raw}\")\nprint(zlib.decompress(raw, 15+32).decode())",
"execution_count": 14,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "Raw binary data: b'\\x1f\\x8b\\x08\\x08\\x8c\\xfc\\xc4^\\x02\\xfftest.txt\\x00r\\xcc\\xc9Q(\\xcf/\\xcaVH\\xccKQ\\xc8\\xcbW(\\xc8I\\xacT\\xc8M\\xccN-V\\xf0JL\\x06\\n+\\xa4\\x94\\x02\\x95$\\xe5W\\xea\\xf1r9\\x8e*\\x1eU<\\xaa\\x98\\xda\\x8a\\x01\\x00\\x00\\x00\\xff\\xff\\x03\\x00.\\x97\\xb4\\x80\\x84\\x03\\x00\\x00'\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\nAll work and no play makes Jack a dull boy.\r\n\n"
}
]
},
{
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"cell_type": "markdown",
"source": "# tempfile"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "You can use the tempfile module to create temporary files and directory *somewhere*. If you use `with`, they are deleted afterwards.\n\nExamples:\n- https://gismo.readthedocs.io/en/latest/tutorials/tutorial_IO.html\n- https://gismo.readthedocs.io/en/latest/reference.html#gismo.datasets.dblp.Dblp.build"
},
{
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"cell_type": "markdown",
"source": "# tempfile"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Remarks:\n- `with tempfile.TemporaryDirectory() as tmpdirname`: `tmpdirname` is the `.name` attribute, not the object itself\n- If used without `with`, clean with the `.cleanup()` method instead of the usual `.close()`"
},
{
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"cell_type": "markdown",
"source": "# IO"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "`io` can be used to make in-memory variables behave like files."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-20T09:54:07.409921Z",
"start_time": "2020-05-20T09:54:07.402272Z"
},
"cell_style": "split",
"trusted": true
},
"cell_type": "code",
"source": "import io\nwith io.StringIO(txt) as f:\n print(f.read())",
"execution_count": 15,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "All work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\n\n"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-20T09:55:08.215134Z",
"start_time": "2020-05-20T09:55:08.206054Z"
},
"cell_style": "split",
"trusted": true
},
"cell_type": "code",
"source": "with io.BytesIO(raw) as f:\n with gzip.open(f, 'rt') as g:\n print(g.read())",
"execution_count": 16,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "All work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\nAll work and no play makes Jack a dull boy.\n\n"
}
]
},
{
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"cell_type": "markdown",
"source": "### Examples"
},
{
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"cell_type": "markdown",
"source": "# Save/load results"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Imagine you have a function that makes a huge computation."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-20T09:56:30.237991Z",
"start_time": "2020-05-20T09:56:30.227409Z"
},
"trusted": true
},
"cell_type": "code",
"source": "def compute_machin_bidule():\n print(\"This is a very complex function that takes a lot of time.\")\n return \"Machin Bidule!\"\ncompute_machin_bidule()",
"execution_count": 17,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "This is a very complex function that takes a lot of time.\n"
},
{
"data": {
"text/plain": "'Machin Bidule!'"
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Objective: write a function that loads result from file if it exists, compute and save otherwise."
},
{
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"cell_type": "markdown",
"source": "# Old vs news"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-20T12:42:01.250761Z",
"start_time": "2020-05-20T12:42:01.244417Z"
},
"cell_style": "split",
"trusted": true
},
"cell_type": "code",
"source": "# Almost real code by the old me!\ndef get_machin_bidule(filename='machin.txt', \n directory='./'):\n file = directory+filename\n try:\n f = open(file)\n content = f.read()\n f.close()\n return content\n except IOError:\n content = compute_machin_bidule()\n f = open(file, \"w\")\n f.write(content)\n f.close()\n return content",
"execution_count": 25,
"outputs": []
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-20T12:43:12.773255Z",
"start_time": "2020-05-20T12:43:12.765260Z"
},
"cell_style": "split",
"trusted": true
},
"cell_type": "code",
"source": "# How I do this today\nfrom pathlib import Path\nimport dill as pickle\ndef get_machin_bidule(filename='machin.pkl', \n directory='.'):\n file = Path(directory) / Path(filename)\n if file.exists():\n with open(file, 'rb') as f:\n return pickle.load(f)\n else: \n content = compute_machin_bidule()\n with open(file, 'wb') as f:\n pickle.dump(content, f)\n return content",
"execution_count": 26,
"outputs": []
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-20T09:57:25.511127Z",
"start_time": "2020-05-20T09:57:25.498180Z"
},
"trusted": true
},
"cell_type": "code",
"source": "get_machin_bidule()",
"execution_count": 21,
"outputs": [
{
"data": {
"text/plain": "'Machin Bidule!'"
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"cell_type": "markdown",
"source": "# Save/load results"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Can be used as a decorator or as a MixIn.\n\nCf for example https://gismo.readthedocs.io/en/latest/reference.html#io"
},
{
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"cell_type": "markdown",
"source": "# Video recompression"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Use case: recompress a bunch of avi files to gain space.\n- ffmpeg will do the actual job (we assume ffmpeg is installed on your system https://ffmpeg.org/download.html)\n- pathlib will do the rest"
},
{
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"cell_type": "markdown",
"source": "# Video recompression"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-20T10:04:04.231373Z",
"start_time": "2020-05-20T10:04:04.219179Z"
},
"cell_style": "split",
"trusted": true
},
"cell_type": "code",
"source": "from pathlib import Path\nimport subprocess\nimport json\n\ndef recompress(filepath):\n print(f\"Recompressing {filepath.name}.\")\n target = filepath.with_suffix(\".mkv\")\n cmd = (f\"ffmpeg -y -i \\\"{filepath}\\\" \"\n f\"-c:v libx265 -c:a copy \\\"{target}\\\"\")\n c = subprocess.run(cmd)\n if c.returncode != 0:\n print(f\"Error for {filepath.name}!!!\")\n if target.exists():\n target.unlink()\n else:\n old_s = filepath.stat().st_size\n new_s = Path(target).stat().st_size\n print(f\"Relative size of new file: \"\n f\"{100*new_s/old_s:.2f}%\")\n filepath.unlink()",
"execution_count": 23,
"outputs": []
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2020-05-20T10:04:42.833384Z",
"start_time": "2020-05-20T10:04:05.754103Z"
},
"cell_style": "split",
"trusted": true
},
"cell_type": "code",
"source": "# This is where I stored videos for the talk\n# You can try this with your own\n# !!! original videos will be removed if ffmpeg succeeds !!!\nd = Path(\"../../../../../Datasets/Videos\")\nfor file in d.rglob('*.avi'):\n recompress(file)",
"execution_count": 24,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "Recompressing Erdos - 3D.avi.\nRelative size of new file: 1.79%\nRecompressing not_a_video.avi.\nError for not_a_video.avi!!!\nRecompressing init-random-ok.avi.\nRelative size of new file: 24.12%\n"
}
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"name": "python3",
"display_name": "Python 3",
"language": "python"
},
"language_info": {
"name": "python",
"version": "3.7.7",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
},
"rise": {
"enable_chalkboard": true
},
"toc": {
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": true,
"base_numbering": 1,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
},
"gist": {
"id": "",
"data": {
"description": "InputOutput-Part_I.ipynb",
"public": true
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}