201-vision-monocular-depth-estimation
{"cells": [{"cell_type": "markdown", "id": "amino-disclosure", "metadata": {"id": "moved-collapse"}, "source": "# MONODEPTH on OpenVINO IR Model\n\nThis notebook demonstrates Monocular Depth Estimation with MidasNet in OpenVINO. Model information: https://github.com/openvinotoolkit/open_model_zoo/blob/master/models/public/midasnet/midasnet.md\n\nTHIS IS A WORK IN PROGRESS NOTEBOOK. IT IS NOT FOR PUBLIC RELEASE. See the [README](README.md) for instructions on how to run this notebook on your own computer."}, {"id": "1a5425b6", "cell_type": "markdown", "source": "## Preparation\n\nInstall the requirements and download the files that are necessary for running this notebook.", "metadata": {}}, {"id": "6b3dd628", "cell_type": "code", "metadata": {}, "execution_count": null, "source": "# Install required Python packages\nimport pip\npip_arguments = None if pip.__version__ < '20.3' else ' --use-deprecated=legacy-resolver'\n!pip install $pip_argumentsopenvino-dev numpy==1.18.5 ipykernel==5.5.0 matplotlib opencv-python-headless==4.2.0.32", "outputs": []}, {"id": "ab7df814", "cell_type": "code", "metadata": {}, "execution_count": null, "source": "# Download image and model files\nimport os\nimport pip\nimport urllib.parse\nimport urllib.request\nfrom pathlib import Path\n\nurls = ['https://raw.githubusercontent.com/helena-intel/openvino-notebooks/develop/201-vision-monocular-depth-estimation/monodepth.gif', 'https://raw.githubusercontent.com/helena-intel/openvino-notebooks/develop/201-vision-monocular-depth-estimation/install_and_launch_monodepth.bat', 'https://raw.githubusercontent.com/helena-intel/openvino-notebooks/develop/201-vision-monocular-depth-estimation/requirements.txt', 'https://raw.githubusercontent.com/helena-intel/openvino-notebooks/develop/201-vision-monocular-depth-estimation/requirements-image.txt', 'https://raw.githubusercontent.com/helena-intel/openvino-notebooks/develop/201-vision-monocular-depth-estimation/coco_bike.jpg', 'https://raw.githubusercontent.com/helena-intel/openvino-notebooks/develop/201-vision-monocular-depth-estimation/videos/Coco Walking in Berkeley.mp4', 'https://raw.githubusercontent.com/helena-intel/openvino-notebooks/develop/201-vision-monocular-depth-estimation/models/MiDaS_small.bin', 'https://raw.githubusercontent.com/helena-intel/openvino-notebooks/develop/201-vision-monocular-depth-estimation/models/MiDaS_small.xml']\n\nfor url in urls:\n save_path = Path(url).relative_to(fr\"https:/raw.githubusercontent.com/helena-intel/openvino-notebooks/develop/201-vision-monocular-depth-estimation\")\n os.makedirs(save_path.parent, exist_ok=True)\n safe_url = urllib.parse.quote(url, safe=\":/\")\n\n urllib.request.urlretrieve(safe_url, save_path.as_posix())", "outputs": []}, {"cell_type": "markdown", "id": "determined-debut", "metadata": {}, "source": "<img src=\"monodepth.gif\">"}, {"cell_type": "markdown", "id": "fixed-biotechnology", "metadata": {}, "source": "### What is Monodepth?\nMonocular Depth Estimation is the task of estimating scene depth using a single image. It has many potential applications in robotics, 3D reconstruction, medical imaging and autonomous systems. For this demo, we use a neural network model called [MiDaS](https://github.com/intel-isl/MiDaS) which was developed by the Intelligent Systems Lab at Intel. Check out their research paper to learn more. \n\nR. Ranftl, K. Lasinger, D. Hafner, K. Schindler and V. 
Koltun, [\"Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer,\"](https://ieeexplore.ieee.org/document/9178977) in IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2020.3019967."}, {"cell_type": "markdown", "id": "vulnerable-thread", "metadata": {"id": "creative-cisco"}, "source": "## Preparation "}, {"cell_type": "markdown", "id": "legitimate-timber", "metadata": {"id": "faced-honolulu"}, "source": "### Imports"}, {"cell_type": "code", "execution_count": null, "id": "placed-savage", "metadata": {"id": "ahead-spider"}, "outputs": [], "source": "import os\nimport time\nimport urllib\nfrom pathlib import Path\n\nimport cv2\nimport matplotlib.cm\nimport matplotlib.pyplot as plt\nimport numpy as np\nfrom IPython.display import FileLink, HTML, Pretty, ProgressBar, Video, clear_output, display\nfrom openvino.inference_engine import IECore"}, {"cell_type": "markdown", "id": "exposed-brush", "metadata": {"id": "contained-office"}, "source": "### Settings"}, {"cell_type": "code", "execution_count": null, "id": "essential-faith", "metadata": {"id": "amber-lithuania"}, "outputs": [], "source": "DEVICE = \"CPU\"\n# MODEL_URL = \"models/midasnet.xml\" # Larger model that is slower, but gives better results\nMODEL_FILE = \"models/MiDaS_small.xml\" # Small model that is fast and gives good results on some kinds of data\n\nmodel_name = os.path.basename(MODEL_FILE)\nmodel_xml_path = Path(MODEL_FILE).with_suffix(\".xml\")"}, {"cell_type": "markdown", "id": "wired-resistance", "metadata": {}, "source": "## Functions"}, {"cell_type": "code", "execution_count": null, "id": "acute-preview", "metadata": {"id": "endangered-constraint"}, "outputs": [], "source": "def normalize_minmax(data):\n \"\"\"Normalizes the values in `data` between 0 and 1\"\"\"\n return (data - data.min()) / (data.max() - data.min())"}, {"cell_type": "code", "execution_count": null, "id": "threatened-neutral", "metadata": {}, "outputs": [], "source": "def load_image(path: str):\n \"\"\"\n Loads an image from `path` and returns it as BGR numpy array. `path` should point to an image file,\n either a local filename or an url.\n \"\"\"\n if path.startswith(\"http\"):\n # Set User-Agent to Mozilla because some websites block requests with User-Agent Python\n request = urllib.request.Request(path, headers={\"User-Agent\": \"Mozilla/5.0\"})\n response = urllib.request.urlopen(request)\n array = np.asarray(bytearray(response.read()), dtype=\"uint8\")\n image = cv2.imdecode(array, -1) # Loads the image as BGR\n else:\n image = cv2.imread(path)\n return image"}, {"cell_type": "code", "execution_count": null, "id": "selective-annotation", "metadata": {}, "outputs": [], "source": "def convert_result_to_image(result, colormap=\"viridis\"):\n \"\"\"\n Convert network result of floating point numbers to an RGB image with integer values from 0-255\n by applying a colormap.\n\n `result` is expected to be a single network result in 1,H,W shape\n `colormap` is a matplotlib colormap. 
```python
def load_image(path: str):
    """
    Loads an image from `path` and returns it as a BGR numpy array. `path`
    should point to an image file, either a local filename or a URL.
    """
    if path.startswith("http"):
        # Set User-Agent to Mozilla because some websites block requests
        # with the default Python User-Agent.
        request = urllib.request.Request(path, headers={"User-Agent": "Mozilla/5.0"})
        response = urllib.request.urlopen(request)
        array = np.asarray(bytearray(response.read()), dtype="uint8")
        image = cv2.imdecode(array, -1)  # Loads the image as BGR
    else:
        image = cv2.imread(path)
    return image
```

```python
def convert_result_to_image(result, colormap="viridis"):
    """
    Convert a network result of floating point numbers to an RGB image with
    integer values from 0-255 by applying a colormap.

    `result` is expected to be a single network result in 1,H,W shape
    `colormap` is a matplotlib colormap.
    See https://matplotlib.org/stable/tutorials/colors/colormaps.html
    """
    cmap = matplotlib.cm.get_cmap(colormap)
    result = result.squeeze(0)
    result = normalize_minmax(result)
    # The colormap returns RGBA values in the 0-1 range; keep the RGB channels
    # and scale them to 0-255.
    result = cmap(result)[:, :, :3] * 255
    result = result.astype(np.uint8)
    return result
```

## Load model and get model information

Load the model in Inference Engine with `ie.read_network` and load it to the specified device with `ie.load_network`.

```python
ie = IECore()
net = ie.read_network(str(model_xml_path), str(model_xml_path.with_suffix(".bin")))
exec_net = ie.load_network(network=net, device_name=DEVICE)

input_key = list(exec_net.input_info)[0]
output_key = list(exec_net.outputs.keys())[0]

network_input_shape = exec_net.input_info[input_key].tensor_desc.dims
network_image_height, network_image_width = network_input_shape[2:]
```
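If you are unsure what to set `DEVICE` to, the check below (an addition to the notebook) lists the inference devices that OpenVINO detects on your system, and prints the input shape the network expects.

```python
# Optional: list the devices OpenVINO detects (for example "CPU" or "GPU")
# and show the network input shape determined above.
print(f"Available devices: {ie.available_devices}")
print(f"Network input shape (N,C,H,W): {network_input_shape}")
```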
## Monodepth on Image

### Load, resize and reshape input image

The input image is read with OpenCV, resized to the network input size, and reshaped to (N,C,H,W) (N=number of images, C=number of channels, H=height, W=width).

```python
# Download and load an image
# Image source (CC license): https://storage.googleapis.com/openimages/web/visualizer/index.html?set=train&type=segmentation&r=false&c=%2Fm%2F02rgn06&id=470c2f96cb938855
IMAGE_FILE = "coco_bike.jpg"
image = load_image(IMAGE_FILE)
# Resize to the input shape for the network. cv2.resize expects the target
# size as (width, height).
resized_image = cv2.resize(image, (network_image_width, network_image_height))
# Reshape the image from (H,W,C) to the network input shape (N,C,H,W)
input_image = np.expand_dims(np.transpose(resized_image, (2, 0, 1)), 0)
```

### Do inference on image

Do the inference, convert the result to an image, and resize it to the original image shape.

```python
result = exec_net.infer(inputs={input_key: input_image})[output_key]
# Convert the network result (a disparity map) to an image that shows
# distance as colors
result_image = convert_result_to_image(result)
# Resize back to the original image shape. cv2.resize expects the shape in
# (width, height); [::-1] reverses the (height, width) shape to match this.
result_image = cv2.resize(result_image, image.shape[:2][::-1])
```

### Display monodepth image

```python
fig, ax = plt.subplots(1, 2, figsize=(20, 15))
ax[0].imshow(image[:, :, (2, 1, 0)])  # (2,1,0) converts the image from BGR to RGB
ax[1].imshow(result_image);
```
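If you want to keep the colorized depth map, the sketch below (an addition; the output filename is arbitrary) saves it to disk. Note that `convert_result_to_image` returns RGB, while `cv2.imwrite` expects BGR, so the channel order is converted first.

```python
# Optional: save the colorized depth map. cv2.imwrite expects BGR, and
# result_image is RGB, so convert the channel order before saving.
cv2.imwrite("monodepth_result.png", cv2.cvtColor(result_image, cv2.COLOR_RGB2BGR))
```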
## Monodepth on Video

By default, only the first 100 frames are processed, in order to quickly check that everything works. Change `NUM_FRAMES` in the cell below to modify this. Set `NUM_FRAMES` to 0 to process the whole video.

### Download and load video

```python
# Video source: https://www.youtube.com/watch?v=fu1xcQdJRws (Public Domain)
VIDEO_FILE = "videos/Coco Walking in Berkeley.mp4"
NUM_FRAMES = 100  # Number of video frames to process. Set to 0 to process all frames.
# Create Path objects for the input video and the resulting video
video_path = Path(VIDEO_FILE)
result_video_path = video_path.with_name(f"{video_path.stem}_monodepth.mp4")
```

```python
cap = cv2.VideoCapture(str(video_path))
ret, image = cap.read()
if not ret:
    raise ValueError(f"The video at {video_path} cannot be read.")
FPS = cap.get(cv2.CAP_PROP_FPS)
FRAME_HEIGHT, FRAME_WIDTH = image.shape[:2]
# The format to use for video encoding. VP90 is slow, but it works on most
# systems. Try the THEO encoding if you have FFMPEG installed.
FOURCC = cv2.VideoWriter_fourcc(*"VP90")
# FOURCC = cv2.VideoWriter_fourcc(*"THEO")

cap.release()
print(f"The input video has a frame width of {FRAME_WIDTH}, frame height of {FRAME_HEIGHT} and runs at {FPS} fps")
```

### Do inference on video and create monodepth video

```python
# frame_nr counts the number of frames that have been processed, so the
# progress reports and the final summary stay consistent.
frame_nr = 0
start_time = time.perf_counter()
total_inference_duration = 0

cap = cv2.VideoCapture(str(video_path))
out_video = cv2.VideoWriter(
    str(result_video_path),
    FOURCC,
    FPS,
    (FRAME_WIDTH * 2, FRAME_HEIGHT),
)

total_frames = cap.get(cv2.CAP_PROP_FRAME_COUNT) if NUM_FRAMES == 0 else NUM_FRAMES
progress_bar = ProgressBar(total=total_frames)
progress_bar.display()

try:
    while cap.isOpened():
        ret, image = cap.read()
        if not ret:
            cap.release()
            break

        if frame_nr == total_frames:
            break

        # Prepare the frame for inference: resize to the network input shape
        # (cv2.resize expects (width, height)) and reshape to NCHW
        resized_image = cv2.resize(image, (network_image_width, network_image_height))
        input_image = np.expand_dims(np.transpose(resized_image, (2, 0, 1)), 0)

        # Do inference
        inference_start_time = time.perf_counter()
        result = exec_net.infer(inputs={input_key: input_image})[output_key]
        inference_stop_time = time.perf_counter()
        inference_duration = inference_stop_time - inference_start_time
        total_inference_duration += inference_duration

        # Transform the network result to an image. convert_result_to_image
        # returns RGB; (2,1,0) reverses the channels to BGR for OpenCV
        result_frame = convert_result_to_image(result)[:, :, (2, 1, 0)]
        # Resize to the original image shape
        result_frame = cv2.resize(result_frame, (FRAME_WIDTH, FRAME_HEIGHT))
        # Put the image and the result side by side
        stacked_frame = np.hstack((image, result_frame))
        # Save the frame to the video
        out_video.write(stacked_frame)

        frame_nr = frame_nr + 1
        progress_bar.progress = frame_nr
        progress_bar.update()

        if frame_nr % 10 == 0:
            clear_output(wait=True)
            progress_bar.display()
            display(
                Pretty(f"Processed frame {frame_nr}. Inference time: {inference_duration:.2f} seconds ({1/inference_duration:.2f} FPS)")
            )

except KeyboardInterrupt:
    print("Processing interrupted.")
finally:
    out_video.release()
    cap.release()
    end_time = time.perf_counter()
    duration = end_time - start_time
    clear_output()
    print(f"Monodepth Video saved to '{str(result_video_path)}'.")
    print(
        f"Processed {frame_nr} frames in {duration:.2f} seconds. Total FPS (including video processing): {frame_nr/duration:.2f}. Inference FPS: {frame_nr/total_inference_duration:.2f}"
    )
```
### Display monodepth video

```python
video = Video(result_video_path, width=800, embed=True)
if not result_video_path.exists():
    # Fall back to showing the last processed frame. The frame is BGR;
    # (2,1,0) converts it to RGB for matplotlib.
    plt.imshow(stacked_frame[:, :, (2, 1, 0)])
    raise ValueError("OpenCV was unable to write the video file. Showing one video frame.")
else:
    print(f"Showing monodepth video {result_video_path.resolve()}")
    display(video)
    video_link = FileLink(result_video_path)
    display(HTML(f'If you cannot see the video in your browser, please right click on the following link to download the video {video_link._repr_html_()}'))
```
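The VP9-encoded `.mp4` may not play in every browser. A possible workaround, sketched below as an addition (it assumes `ffmpeg` is available on your PATH, and the output filename is arbitrary), is to re-encode the result to H.264:

```python
# Optional: re-encode the VP9 result to H.264 for broader playback support.
# Requires ffmpeg on the PATH; "monodepth_h264.mp4" is an arbitrary name.
!ffmpeg -y -i "{result_video_path}" -c:v libx264 -pix_fmt yuv420p monodepth_h264.mp4
```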