Puriney · May 15, 2017 19:54
diff --git a/pandas_sparse_matrix.ipynb b/pandas_sparse_matrix.ipynb
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Working with sparse matrices in Python\n",
    "\n",
    "Sparse matrices are commonly stored in the Matrix Market format (<http://math.nist.gov/MatrixMarket/formats.html>).\n",
    "This post shows how to read a Matrix Market file and turn it into a Pandas data frame.\n",
    "\n",
    "*Note: this requires Pandas 0.20 or later, and a version earlier than 1.0, because `pd.SparseDataFrame` (used in Step-3) was removed in Pandas 1.0.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.20.1\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import scipy.io\n",
    "import scipy.sparse\n",
    "\n",
    "print(pd.__version__)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "foo.mtx\n",
      "%%MatrixMarket matrix coordinate real general\n",
      "%=================================================================================\n",
      "%\n",
      "% This ASCII file represents a sparse MxN matrix with L \n",
      "% nonzeros in the following Matrix Market format:\n",
      "%\n",
      "% +----------------------------------------------+\n",
      "% |%%MatrixMarket matrix coordinate real general | <--- header line\n",
      "% |%                                             | <--+\n",
      "% |% comments                                    |    |-- 0 or more comment lines\n",
      "% |%                                             | <--+         \n",
      "% |    M  N  L                                   | <--- rows, columns, entries\n",
      "% |    I1  J1  A(I1, J1)                         | <--+\n",
      "% |    I2  J2  A(I2, J2)                         |    |\n",
      "% |    I3  J3  A(I3, J3)                         |    |-- L lines\n",
      "% |        . . .                                 |    |\n",
      "% |    IL JL  A(IL, JL)                          | <--+\n",
      "% +----------------------------------------------+   \n",
      "%\n",
      "% Indices are 1-based, i.e. A(1,1) is the first element.\n",
      "%\n",
      "%=================================================================================\n",
      "  5  5  8\n",
      "    1     1   1.000e+00\n",
      "    2     2   1.050e+01\n",
      "    3     3   1.500e-02\n",
      "    1     4   6.000e+00\n",
      "    4     2   2.505e+02\n",
      "    4     4  -2.800e+02\n",
      "    4     5   3.332e+01\n",
      "    5     5   1.200e+01"
     ]
    }
   ],
   "source": [
    "!ls *mtx\n",
    "!cat foo.mtx"
   ]
  },
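  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step-1: Read the mtx file with scipy\n",
    "\n",
    "`scipy.io.mmread` reads a Matrix Market file. For a coordinate-format file such as `foo.mtx`, it returns a scipy sparse matrix in `coo` (coordinate) format, with the 1-based indices of the file converted to 0-based."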
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  (0, 0)\t1.0\n",
      "  (1, 1)\t10.5\n",
      "  (2, 2)\t0.015\n",
      "  (0, 3)\t6.0\n",
      "  (3, 1)\t250.5\n",
      "  (3, 3)\t-280.0\n",
      "  (3, 4)\t33.32\n",
      "  (4, 4)\t12.0\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<5x5 sparse matrix of type '<class 'numpy.float64'>'\n",
       "\twith 8 stored elements in COOrdinate format>"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "coo_mat = scipy.io.mmread('foo.mtx')\n",
    "print(coo_mat)\n",
    "coo_mat"
   ]
  },
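  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The cell below is a minimal sketch, not part of the original analysis. It writes a small matrix with `scipy.io.mmwrite` and reads it back with `scipy.io.mmread`, to show that the 1-based indices in the file become 0-based indices in the `coo` matrix. The matrix `demo`, the file name `demo.mtx`, and the temporary directory are assumptions made for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import tempfile\n",
    "\n",
    "import scipy.io\n",
    "import scipy.sparse\n",
    "\n",
    "# Hypothetical 3x3 matrix with two stored entries (0-based indices)\n",
    "demo = scipy.sparse.coo_matrix(([1.5, -2.0], ([0, 2], [1, 2])), shape=(3, 3))\n",
    "\n",
    "with tempfile.TemporaryDirectory() as tmp_dir:\n",
    "    path = os.path.join(tmp_dir, 'demo.mtx')  # hypothetical file name\n",
    "    scipy.io.mmwrite(path, demo)\n",
    "    with open(path) as handle:\n",
    "        print(handle.read())  # entries appear with 1-based indices\n",
    "    round_trip = scipy.io.mmread(path)\n",
    "\n",
    "# After reading, the row/col arrays are 0-based again\n",
    "print(round_trip.row, round_trip.col, round_trip.data)"
   ]
  },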
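  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step-2: `coo` to `csr`/`csc`\n",
    "\n",
    "A scipy matrix in `coo` format can be converted to `csr` (compressed sparse row) or `csc` (compressed sparse column) format with `tocsr()` and `tocsc()`. The stored values are the same; only the order in which they are laid out changes."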
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  (0, 0)\t1.0\n",
      "  (1, 1)\t10.5\n",
      "  (3, 1)\t250.5\n",
      "  (2, 2)\t0.015\n",
      "  (0, 3)\t6.0\n",
      "  (3, 3)\t-280.0\n",
      "  (3, 4)\t33.32\n",
      "  (4, 4)\t12.0\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<5x5 sparse matrix of type '<class 'numpy.float64'>'\n",
       "\twith 8 stored elements in Compressed Sparse Column format>"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "csc_mat = coo_mat.tocsc()\n",
    "print(csc_mat)\n",
    "csc_mat"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  (0, 0)\t1.0\n",
      "  (0, 3)\t6.0\n",
      "  (1, 1)\t10.5\n",
      "  (2, 2)\t0.015\n",
      "  (3, 1)\t250.5\n",
      "  (3, 3)\t-280.0\n",
      "  (3, 4)\t33.32\n",
      "  (4, 4)\t12.0\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<5x5 sparse matrix of type '<class 'numpy.float64'>'\n",
       "\twith 8 stored elements in Compressed Sparse Row format>"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "csr_mat = coo_mat.tocsr(copy=True)\n",
    "print(csr_mat)\n",
    "csr_mat"
   ]
  },
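  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough sketch of what the conversion changes, the cell below rebuilds the example matrix from the entries in `foo.mtx` and prints the three arrays behind the `csr` layout. The names `rows`, `cols`, `vals`, and `demo_csr` are introduced here for illustration and are not part of the original notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import scipy.sparse\n",
    "\n",
    "# Entries of foo.mtx, shifted from 1-based to 0-based indices\n",
    "rows = [0, 1, 2, 0, 3, 3, 3, 4]\n",
    "cols = [0, 1, 2, 3, 1, 3, 4, 4]\n",
    "vals = [1.0, 10.5, 0.015, 6.0, 250.5, -280.0, 33.32, 12.0]\n",
    "demo_csr = scipy.sparse.coo_matrix((vals, (rows, cols)), shape=(5, 5)).tocsr()\n",
    "\n",
    "# Row i owns the slice indptr[i]:indptr[i + 1] of indices and data\n",
    "print(demo_csr.indptr)\n",
    "print(demo_csr.indices)\n",
    "print(demo_csr.data)\n",
    "\n",
    "# Row slicing is efficient in the csr layout\n",
    "print(demo_csr[3].toarray())"
   ]
  },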
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step-3: `csr` to Pandas sparse data frame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     0      1      2      3      4\n",
      "0  1.0    NaN    NaN    6.0    NaN\n",
      "1  NaN   10.5    NaN    NaN    NaN\n",
      "2  NaN    NaN  0.015    NaN    NaN\n",
      "3  NaN  250.5    NaN -280.0  33.32\n",
      "4  NaN    NaN    NaN    NaN  12.00\n"
     ]
    }
   ],
   "source": [
    "# Entries that are not stored in the sparse matrix show up as NaN\n",
    "sp_df = pd.SparseDataFrame(csr_mat)\n",
    "print(sp_df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     0      1      2      3      4\n",
      "0  1.0    0.0  0.000    6.0   0.00\n",
      "1  0.0   10.5  0.000    0.0   0.00\n",
      "2  0.0    0.0  0.015    0.0   0.00\n",
      "3  0.0  250.5  0.000 -280.0  33.32\n",
      "4  0.0    0.0  0.000    0.0  12.00\n"
     ]
    }
   ],
   "source": [
    "# Fill the unstored entries with 0 to recover the usual matrix view\n",
    "sp_df = pd.SparseDataFrame(csr_mat).fillna(0)\n",
    "print(sp_df)"
   ]
  },
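  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "On Pandas 0.25 or later, `pd.DataFrame.sparse.from_spmatrix` gives a roughly equivalent result and keeps working after the removal of `pd.SparseDataFrame` in Pandas 1.0. The cell below is a sketch under that assumption, not the method used in this notebook. It rebuilds the matrix so that it can run on its own, and the names `csr_mat_new` and `sp_df_new` are introduced here for illustration. Unstored entries are treated as 0 rather than NaN, so no `fillna(0)` is needed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import scipy.sparse\n",
    "\n",
    "# Entries of foo.mtx, shifted from 1-based to 0-based indices\n",
    "rows = [0, 1, 2, 0, 3, 3, 3, 4]\n",
    "cols = [0, 1, 2, 3, 1, 3, 4, 4]\n",
    "vals = [1.0, 10.5, 0.015, 6.0, 250.5, -280.0, 33.32, 12.0]\n",
    "csr_mat_new = scipy.sparse.coo_matrix((vals, (rows, cols)), shape=(5, 5)).tocsr()\n",
    "\n",
    "# Assumes Pandas >= 0.25; each column becomes a sparse array with fill value 0\n",
    "sp_df_new = pd.DataFrame.sparse.from_spmatrix(csr_mat_new)\n",
    "print(sp_df_new)\n",
    "print(sp_df_new.dtypes)\n",
    "\n",
    "# Fraction of entries that are actually stored\n",
    "print(sp_df_new.sparse.density)"
   ]
  },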
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
 }