Created
December 17, 2020 22:23
Demonstration of extracting from a Backblaze Restore
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Using rclone to \"extract\" Backblaze Zip Snapshots and Reupload to B2\n", | |
"\n", | |
"This is a ~~guide~~ demonstration of how I use rclone to expand the contents of a Backblaze snapshot on B2 into another B2 bucket.\n", | |
"\n", | |
"## Questions and Answers\n", | |
"\n", | |
"### Who is this for?\n", | |
"\n", | |
"Me! Seriously, I wrote this for my own recall/notes in the future, but I thought I'd share it.\n",
"\n",
"To really answer the question, this is for people who want to do something similar and can use this as a guide. It is not a \"tool\" per se, and it is not designed to be an easy or user-friendly process.\n",
"\n",
"I use Python to do it on a VPS. Python is super readable, so it should be easy enough to (lightly) customize even if you don't know Python. This demonstration is for people who are willing to play around and learn; it is *not* turn-key.\n",
"\n", | |
"### Can I use Windows?\n", | |
"\n", | |
"No idea! I am using a Debian VPS and my restore was from a macOS backup. I suspect any setup would work.\n",
"\n", | |
"### What software do you need?\n",
"\n",
"You need rclone and FUSE so that you can use `rclone mount`. This is not a guide to either of those.\n",
"\n",
"I am also assuming you've **already** set up rclone with B2 and/or an additional remote.\n",
"\n",
"I also use the awesome `tqdm` library, but you can skip it if you don't want it.\n",
"\n", | |
"### Will this cost money?\n",
"\n",
"Yes! You will be downloading from B2, so you pay egress. It is also *very inefficient*. I have no idea how bad, but I'd imagine it isn't great! So expect to pay for more egress than you're actually using.\n",
"\n", | |
"### Why do this vs downloading the entire zip file?\n", | |
"\n", | |
"My test restore is small, but my main use is for a 200+ GB restore. I want to use my VPS's bandwidth, but my VPS is small (~10 GB free)! So while I am paying more for egress than, say, downloading the restore (and especially more than if I were to request a USB drive or download directly from Backblaze Personal), it saves me my local bandwidth.\n",
"\n", | |
"### What is this for?\n",
"\n",
"Besides just putting the contents into their own B2 bucket, this process is useful for seeding a different backup tool (including rclone, but really any).\n",
"\n", | |
"### Can I filter it?\n",
"\n",
"Yes! There are two places. The first and best is to filter which `files` you include. The second is with rclone filters, but I do *not* suggest that since you still waste the time and expense of extracting the files.\n",
"\n", | |
"### This could be done better\n", | |
"\n", | |
"I bet! Please share. I like learning new things. This is just what I worked out!" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import os,sys\n", | |
"import shutil\n", | |
"import time\n", | |
"import subprocess\n", | |
"import operator\n", | |
"import signal\n", | |
"from pathlib import Path\n", | |
"from zipfile import ZipFile\n", | |
"\n", | |
"from tqdm import tqdm # This is 3rd party. $ python -m pip install tqdm" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"rclone v1.53.3\n", | |
"- os/arch: linux/amd64\n", | |
"- go version: go1.15.5\n", | |
"\n", | |
"3.8.3 (default, Jul 2 2020, 16:21:59)\n", | |
"[GCC 7.3.0]", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"print(subprocess.check_output(['rclone','version']).decode())\n", | |
"print(sys.version)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Mount the restore bucket\n", | |
"\n", | |
"Here we mount the restore bucket. Note: do **not** add any caching unless you have the scratch space. Since my restore is bigger than my free space, I do not! This is basically a super vanilla rclone mount. In fact, when I tested with different advanced options, it failed.\n",
"\n",
"There are two ways to do this. The first is to use a new terminal and create the mount there. That works fine, but I will instead do it all within Python and `subprocess`. With subprocess, the arguments are passed as a list. This is actually really great since you do not have to deal with escaping. And it's easier to comment! If you do run it in a separate terminal, `screen` is your friend."
] | |
}, | |
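{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you would rather run the mount in a separate terminal (under `screen`), this should be the equivalent one-liner, using the same bucket and mount point as the cells below:\n",
"\n",
"```\n",
"rclone -vv mount --read-only b2:b2-snapshots-7f7799daad93/ ~/mount\n",
"```"
]
},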
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"mountdir = Path('~/mount').expanduser()\n", | |
"mountdir.mkdir(exist_ok=True)\n", | |
"\n", | |
"rclone_remote = 'b2:b2-snapshots-7f7799daad93/' # already set up B2. Found the bucket with `rclone lsf b2:`\n", | |
"restore_zip = 'bzsnapshot_2020-12-17-07-06-19.zip' # found with `rclone lsf b2:b2-snapshots-7f7799daad93/`" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"cmd = ['rclone',\n", | |
" '-vv', # Optional but may be useful later\n", | |
" 'mount',rclone_remote,str(mountdir),\n", | |
" '--read-only',]\n", | |
"stdout,stderr = open('stdout','wb'),open('stderr','wb') # writable, bytes mode. I usually use context managers, but these need to stay open\n",
"mount_proc = subprocess.Popen(cmd,stdout=stdout,stderr=stderr)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Make sure it mounted. This is optional."
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Waiting for mount \n", | |
"......mounted\n" | |
] | |
} | |
], | |
"source": [ | |
"print('Waiting for mount ',flush=True)\n", | |
"for ii in range(10):\n", | |
" if os.path.ismount(mountdir):\n", | |
" break\n", | |
" if mount_proc.poll() is not None:\n", | |
" raise ValueError('did not mount')\n", | |
" time.sleep(1)\n", | |
" print('.',end='',flush=True)\n", | |
"else:\n", | |
" print('ERROR: Mount did not activate. Kill proc and exiting',file=sys.stderr,flush=True)\n", | |
" mount_proc.kill()\n", | |
" sys.exit(2)\n", | |
"print('mounted')" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Browse the Zip\n", | |
"\n", | |
"Python's `zipfile` will **not** read the entire file just to get a listing, or even to extract some random file inside. Don't believe me? See the bottom!\n",
"\n",
"What we need to do now is get a list of the files and manually inspect it to decide what to cut. Backblaze stores each file under its full path (from the root of the drive)."
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"with ZipFile(mountdir/restore_zip) as zf:\n", | |
" files = zf.infolist() # could also do namelist() but we will want the sizes later" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"2149" | |
] | |
}, | |
"execution_count": 7, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"len(files)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Pick a random file to get the path. We will use this later."
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"'Macintosh HD/Users/jwinkMAC/PyFiSync/Papers/Sorted/2010/2018/melchers2018structural.pdf'" | |
] | |
}, | |
"execution_count": 8, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"files[1000].filename" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Identify and save the prefix you want removed."
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"restore_prefix = 'Macintosh HD/Users/jwinkMAC/' # We will need this later for the reupload"
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Restore a single file!\n",
"\n",
"This is actually super easy! Just search through `files` to find the file you want. Let's still assume it is `files[1000]`."
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"restore_file = files[1000]\n", | |
"\n", | |
"restore_dir = Path('~/restore').expanduser()\n", | |
"with ZipFile(mountdir/restore_zip) as zf:\n", | |
" zf.extract(restore_file,path=str(restore_dir))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Inside the zip file, each entry keeps its full prefixed path (from the root). I don't want that."
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"PosixPath('/home/jwink3101/restore/PyFiSync/Papers/Sorted/2010/2018/melchers2018structural.pdf')" | |
] | |
}, | |
"execution_count": 11, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Optional. Remove prefix\n", | |
"src = restore_dir / restore_file.filename\n", | |
"dst = restore_dir / os.path.relpath(src,restore_dir / restore_prefix)\n", | |
"dst.parent.mkdir(parents=True,exist_ok=True)\n", | |
"shutil.move(src,dst)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Extract and Upload\n", | |
"\n", | |
"Now, this could almost *certainly* be improved. We will do the following:\n",
"\n",
"- Gather files into batches up to the max batch size. Then, for each batch:\n",
"    - Extract the batched files to a scratch directory\n",
"    - Do an rclone `move` (*not* `sync`) to push those files and clear them from scratch\n",
"        - Use the `restore_prefix` subdirectory of scratch as the source so we do not keep that junk\n",
"\n",
"Note that we may be able to optimize this by better backfilling the batches, but I am not sure there is any advantage to sequential reading, so I will go one file after the other. It may be moot."
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Tool to gather the files into batches\n", | |
"def group_to_size(seq,maxsize,key=None):\n", | |
" \"\"\"\n", | |
" Group seq by size up to but not to exceed \n", | |
" maxsize (unless a single item does)\n", | |
" \n", | |
" Example:\n", | |
" >>> list(group_to_size([10,20,10,90,40,50,99,2,101,0,30,90,11],100))\n", | |
" [(10, 20, 10), (90,), (40, 50), (99,), (2,), (101,), (0, 30), (90,), (11,)]\n", | |
" \n", | |
" \"\"\"\n", | |
" s = 0\n", | |
" curr = []\n", | |
" for item in seq:\n", | |
" s0 = key(item) if callable(key) else item\n", | |
"        if curr and s + s0 > maxsize: # Yield if this item would push us over (but never yield an empty batch when a single item exceeds maxsize)\n",
" yield tuple(curr)\n", | |
" curr = []\n", | |
" s = 0\n", | |
" s += s0\n", | |
" curr.append(item)\n", | |
" if curr: \n", | |
" yield tuple(curr) # Anything remaining" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"maxsize = 512 * 1024 * 1024 # 512 MiB, i.e. 536870912 bytes\n",
"\n", | |
"# dest_remote = 'b2:mynewbuckets/whatever'\n", | |
"dest_remote = '/home/jwink3101/restore/tmp/'\n", | |
"\n", | |
"scratch = Path('~/scratch').expanduser().absolute()\n", | |
"scratch.mkdir(parents=True,exist_ok=True)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# This is where you can filter which files get extracted, e.g.\n",
"# filtered = [f for f in files if f.filename.startswith(restore_prefix + 'Documents/')] # hypothetical example path\n",
"\n",
"filtered = files # No filter"
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 15, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
" 0%| | 0/185 [00:00<?, ?it/s]" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"batch 0 # files 185\n" | |
] | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"100%|██████████| 185/185 [00:30<00:00, 6.16it/s]\n", | |
" 0%| | 0/146 [00:00<?, ?it/s]" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"calling rclone\n", | |
"batch 1 # files 146\n" | |
] | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"100%|██████████| 146/146 [00:32<00:00, 4.47it/s]\n", | |
" 0%| | 0/97 [00:00<?, ?it/s]" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"calling rclone\n", | |
"batch 2 # files 97\n" | |
] | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"100%|██████████| 97/97 [00:29<00:00, 3.24it/s]\n" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"calling rclone\n" | |
] | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
" 1%| | 2/187 [00:00<00:12, 15.09it/s]" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"batch 3 # files 187\n" | |
] | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"100%|██████████| 187/187 [00:31<00:00, 5.87it/s]\n", | |
" 0%| | 0/126 [00:00<?, ?it/s]" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"calling rclone\n", | |
"batch 4 # files 126\n" | |
] | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"100%|██████████| 126/126 [00:28<00:00, 4.39it/s]\n", | |
" 0%| | 0/141 [00:00<?, ?it/s]" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"calling rclone\n", | |
"batch 5 # files 141\n" | |
] | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"100%|██████████| 141/141 [00:27<00:00, 5.05it/s]\n" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"calling rclone\n" | |
] | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
" 0%| | 0/78 [00:00<?, ?it/s]" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"batch 6 # files 78\n" | |
] | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"100%|██████████| 78/78 [00:15<00:00, 4.90it/s]\n", | |
" 0%| | 0/54 [00:00<?, ?it/s]" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"calling rclone\n", | |
"batch 7 # files 54\n" | |
] | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"100%|██████████| 54/54 [00:29<00:00, 1.84it/s]\n", | |
" 0%| | 0/138 [00:00<?, ?it/s]" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"calling rclone\n", | |
"batch 8 # files 138\n" | |
] | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"100%|██████████| 138/138 [00:29<00:00, 4.68it/s]\n", | |
" 0%| | 0/730 [00:00<?, ?it/s]" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"calling rclone\n", | |
"batch 9 # files 730\n" | |
] | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"100%|██████████| 730/730 [00:29<00:00, 24.90it/s]\n", | |
" 0%| | 0/101 [00:00<?, ?it/s]" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"calling rclone\n", | |
"batch 10 # files 101\n" | |
] | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"100%|██████████| 101/101 [00:30<00:00, 3.30it/s]\n", | |
" 0%| | 0/137 [00:00<?, ?it/s]" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"calling rclone\n", | |
"batch 11 # files 137\n" | |
] | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"100%|██████████| 137/137 [00:28<00:00, 4.84it/s]\n", | |
" 0%| | 0/29 [00:00<?, ?it/s]" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"calling rclone\n", | |
"batch 12 # files 29\n" | |
] | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"100%|██████████| 29/29 [00:11<00:00, 2.58it/s]" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"calling rclone\n" | |
] | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"batches = group_to_size(filtered,maxsize,key=operator.attrgetter('file_size'))\n", | |
"with ZipFile(mountdir/restore_zip) as zf:\n", | |
" for ib,batchfiles in enumerate(batches):\n", | |
" print('batch',ib,'# files',len(batchfiles))\n", | |
" # Extract all of the files\n", | |
" for file in tqdm(batchfiles):\n", | |
" zf.extract(file,path=str(scratch))\n", | |
" \n", | |
" print('calling rclone')\n", | |
" \n", | |
" cmd = ['rclone',\n", | |
" 'move', # use move so they get deleted\n", | |
" str(scratch / restore_prefix), dest_remote,\n", | |
" '--transfers','20', # and/or other flags. all optional.\n", | |
" ]\n", | |
" subprocess.check_call(cmd)\n", | |
" " | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Unmount" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 16, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"mount_proc.send_signal(signal.SIGINT)\n", | |
"mount_proc.wait() # Hopefully this works. Otherwise you may need to kill it manually\n", | |
"stdout.close()\n", | |
"stderr.close()" | |
] | |
}, | |
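{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the SIGINT does not cleanly stop the mount, on Linux you can usually unmount it manually (assuming the standard FUSE tools are installed) and then kill the process:\n",
"\n",
"```\n",
"fusermount -u ~/mount\n",
"```"
]
},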
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Additional Notes\n", | |
"\n", | |
"### ZipFile\n", | |
"\n", | |
"Python's `ZipFile` can read from inside a zip file without reading the entire file. It does need to \"seek\" within the file, hence the mount, but rclone handles that like a champ.\n",
"\n",
"How do I know I'm not downloading the entire file? Well, you could look at the rclone logs. The other way is to make a file object that is verbose about what's going on. Note that `ZipFile` takes either a filename *or* a file-like object."
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 17, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import io\n", | |
"class VerboseFile(io.FileIO):\n", | |
" def read(self,*args,**kwargs):\n", | |
" print('read',*args,**kwargs)\n", | |
" r = super(VerboseFile,self).read(*args,**kwargs)\n", | |
" print(' len:',len(r))\n", | |
" return r\n", | |
" def seek(self,*args,**kwargs):\n", | |
" print('seek',*args,**kwargs)\n", | |
" return super(VerboseFile,self).seek(*args,**kwargs)\n", | |
" def close(self,*args,**kwargs):\n", | |
" print('close')\n", | |
" return super(VerboseFile,self).close(*args,**kwargs)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Then, instead of\n",
"\n", | |
"```python\n", | |
"with ZipFile(mountdir/restore_zip) as zf:\n", | |
" ...\n", | |
"```\n", | |
"do\n", | |
"```python\n", | |
"with ZipFile(VerboseFile(mountdir/restore_zip)) as zf:\n", | |
" ...\n", | |
"``` \n", | |
"and you'll be able to see everything."
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.8.3" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 4 | |
} |