Skip to content

Instantly share code, notes, and snippets.

@searbe
Created August 7, 2012 09:48
Show Gist options
  • Save searbe/3284011 to your computer and use it in GitHub Desktop.
Save searbe/3284011 to your computer and use it in GitHub Desktop.
Parse simple XLSX in PHP with SimpleXML and ZipArchive
<?php
/**
* I had to parse an XLSX spreadsheet (which should damn well have been a CSV!)
* but the usual tools were hitting the memory limit pretty quick. I found that
* manually parsing the XML worked pretty well. Note that this, most likely,
* won't work if cells contain anything more than text or a number (so formulas,
* graphs, etc ..., I don't know what'd happen).
*/
$inputFile = '/path/to/spreadsheet.xlsx';
$dir = '/path/to/tmp/dir';
// Unzip
$zip = new ZipArchive();
$zip->open($inputFile);
$zip->extractTo($dir);
// Open up shared strings & the first worksheet
$strings = simplexml_load_file($dir . '/xl/sharedStrings.xml');
$sheet = simplexml_load_file($dir . '/xl/worksheets/sheet1.xml');
// Parse the rows
$xlrows = $sheet->sheetData->row;
foreach ($xlrows as $xlrow) {
$arr = array();
// In each row, grab it's value
foreach ($xlrow->c as $cell) {
$v = (string) $cell->v;
// If it has a "t" (type?) of "s" (string?), use the value to look up string value
if (isset($cell['t']) && $cell['t'] == 's') {
$s = array();
$si = $strings->si[(int) $v];
// Register & alias the default namespace or you'll get empty results in the xpath query
$si->registerXPathNamespace('n', 'http://schemas.openxmlformats.org/spreadsheetml/2006/main');
// Cat together all of the 't' (text?) node values
foreach($si->xpath('.//n:t') as $t) {
$s[] = (string) $t;
}
$v = implode($s);
}
$arr[] = $v;
}
// Assuming the first row are headers, stick them in the headers array
if (count($headers) == 0) {
$headers = $arr;
} else {
// Combine the row with the headers - make sure we have the same column count
$values = array_pad($arr, count($headers), '');
$row = array_combine($headers, $values);
/**
* Here, do whatever you like with the [header => value] assoc array in $row.
* It might be useful just to run this script without any code here, to watch
* memory usage simply iterating over your spreadsheet.
*/
}
}
@unlink($dir);
@unlink($inputFile);
@macinnir
Copy link

This is most helpful. Thank you.

Personally, I wanted to see the columns in an array for each row -- small adjustment:

            $arr = array()
            $row = 0;
            foreach ($xlrows as $xlrow) {

                // In each row, grab it's value
                foreach ($xlrow->c as $cell) {
                    $v = (string) $cell->v;

                    // If it has a "t" (type?) of "s" (string?), use the value to look up string value
                    if (isset($cell['t']) && $cell['t'] == 's') {
                        $s  = array();
                        $si = $strings->si[(int) $v];

                        // Register & alias the default namespace or you'll get empty results in the xpath query
                        $si->registerXPathNamespace('n', 'http://schemas.openxmlformats.org/spreadsheetml/2006/main');

                        // Cat together all of the 't' (text?) node values
                        foreach($si->xpath('.//n:t') as $t) {

                            $s[] = (string) $t;

                        }
                       $v = implode($s);

                    }

                    $arr[$row][] = $v;

                }

                    $row++;
              }

@fabianledefyl
Copy link

I use this script and it worked properly!, but I have a problem with dates, for example if I have a date 20/12/2012 in the xlsx document, worksheet 41263 write me, and I did not realize that conversion does.

anyone have any idea how I can fix this.

@iliyakolev
Copy link

Hi.
Thanks for the great script.
Sometimes when there are empty cells they are missing in the xml which makes the script misplace the cell.
Example:
<row r="1" spans="1:12" x14ac:dyDescent="0.2"> <c r="A1" s="4" t="s"> <v>179</v> </c> <c r="B1" s="4" t="s"> <v>4568</v> </c> <c r="C1" s="4" t="s"> <v>180</v> </c> <c r="D1" s="4" t="s"> <v>181</v> </c> <c r="E1" s="6" t="s"> <v>5</v> </c> <c r="F1" s="6" t="s"> <v>6</v> </c> <c r="G1" s="6" t="s"> <v>7</v> </c> <c r="H1" s="6" t="s"> <v>8</v> </c> <c r="I1" s="6" t="s"> <v>4565</v> </c> <c r="J1" s="6" t="s"> <v>4566</v> </c> <c r="K1" s="7" t="s"> <v>4567</v> </c> <c r="L1" s="6" t="s"> <v>4569</v> </c> </row> <row r="2" spans="1:12" x14ac:dyDescent="0.2"> <c r="A2" s="7"> <v>1</v> </c> <c r="C2" s="7"> <v>1</v> </c> <c r="D2" s="7" t="s"> <v>13</v> </c> <c r="E2" s="8"/> <c r="F2" s="8"/> <c r="G2" s="8"/> <c r="H2" s="8"/> <c r="J2" s="6" t="s"> <v>104</v> </c> <c r="K2" s="6"/> </row> <row r="3" spans="1:12" x14ac:dyDescent="0.2"> <c r="A3" s="7"> <v>2</v> </c> <c r="C3" s="7"> <v>0</v> </c> <c r="D3" s="7" t="s"> <v>182</v> </c> <c r="E3" s="6" t="s"> <v>104</v> </c> <c r="F3" s="6" t="s"> <v>104</v> </c> <c r="G3" s="8"/> <c r="H3" s="6" t="s"> <v>104</v> </c> <c r="I3" s="6" t="s"> <v>104</v> </c> <c r="J3" s="6" t="s"> <v>104</v> </c> <c r="K3" s="6" t="s"> <v>104</v> </c> </row>

The first row are the headers. All the cells are present there. The second row is missing B, I and L cells. L is filled with "" by the script, but everything after B is misplaced by one cell and after I by two. The third row is missing is missing B & L - again everything after B in moved one up and the "$values = array_pad($arr, count($headers), '');" is filling the wholes at the end.
I've made a small fix which suits my needs, but it will work only up to 26 columns. Maybe you can think of something else.
Fix:
`// In each row, grab it's value
$alphabet = range('A', 'Z');//set the alphabet in array//Put this one outside the first loop
$ic = 0;
foreach ($xlrow->c as $cell) {
$cellLetter = trim(str_replace(range(0, 9), '', $cell['r']));//removing the numeric from R
if ($cellLetter != $alphabet[$ic]) {//cell is missing between this one and the last proccesed
$arr[] = '';//fill with empty
$ic++;
}
$v = (string)$cell->v;

            // If it has a "t" (type?) of "s" (string?), use the value to look up string value
            if (isset($cell['t']) && $cell['t'] == 's') {
                $s = array();
                $si = $strings->si[(int)$v];

                // Register & alias the default namespace or you'll get empty results in the xpath query
                $si->registerXPathNamespace('n', 'http://schemas.openxmlformats.org/spreadsheetml/2006/main');
                // Cat together all of the 't' (text?) node values
                foreach ($si->xpath('.//n:t') as $t) {
                    $s[] = (string)$t;
                }
                $v = implode($s);
            }

            $arr[] = $v;
            $ic++;
        }`

Regards!

@darkworks
Copy link

what about comments and text strikes etc

@lubosdz
Copy link

lubosdz commented Nov 8, 2016

@fabianledefyl

Dates in excel are numbers representing the number of days since 01.01.1900.
Thus 41263 in a date cell is something like 113 years and 18 days since 01.01.1900 = 18.01.2013.

Here are two PHP functions to convert Excel dates into days, months, years:

    // EXCEL serialDATE conversions based on Julian <-> Gregorian calendars
    protected static function excel_D2DMY($days) {
        if ($days < 1){
            return '';
        }
        if ($days == 60)  {
            return array('day'=>29, 'month'=>2, 'year'=>1900);
        } else {
            if ($days < 60)  {
                // Because of the 29-02-1900 bug, any serial date
                // under 60 is one off... Compensate.
                ++$days;
            }
            // Modified Julian to DMY calculation with an addition of 2415019
            $l = $days + 68569 + 2415019;
            $n = floor(( 4 * $l ) / 146097);
            $l = $l - floor(( 146097 * $n + 3 ) / 4);
            $i = floor(( 4000 * ( $l + 1 ) ) / 1461001);
            $l = $l - floor(( 1461 * $i ) / 4) + 31;
            $j = floor(( 80 * $l ) / 2447);
            $nDay = $l - floor(( 2447 * $j ) / 80);
            $l = floor($j / 11);
            $nMonth = $j + 2 - ( 12 * $l );
            $nYear = 100 * ( $n - 49 ) + $i + $l;
            $ret = array('day'=>$nDay, 'month'=>$nMonth, 'year'=>$nYear);
            return $ret;
        }
    }

    /**
    * @desc Returns Excel serialDate constant calculated from gregorian date
    * original javascript found @ http://wwwmacho.mcmaster.ca/JAVA/JD.html
    *
    * tested compatibility with EXCEL dates between 01/01/1900..31.12.2099,
    * year given eventually as 0 (=1900)..199 (=2099)
    * adds also days if supplied argument more than real number of days
    * in particular month (same applies to months)
    * as for years bellow 1900: algorythm is not exact, therefore limitation
    * for years 1900 - 2099
    */
    protected static function excel_DMY2D($d, $m, $y, $uh=0, $um=0, $us=0) {
        if($y < 1900){
            if($y<0 || $y>199){
                // Invalid year, must be between 1900 - 2099 or 0 - 199
                return false;
            }else{
                $y += 1900;
            }
        }
        $extra = 100.0*$y + $m - 190002.5;
        $rjd = 367.0*$y;
        $rjd -= floor(7.0*($y+floor(($m+9.0)/12.0))/4.0);
        $rjd += floor(275.0*$m/9.0);
        $rjd += $d;
        $rjd += ($uh + ($um + $us/60.0)/60.)/24.0;
        $rjd += 1721013.5;
        $rjd -= 0.5*$extra/abs($extra);
        $rjd += 0.5;
        $rjd -= 2415020.5; // JD correction constant for base 01/01/1900
        $rjd += ($rjd > 60) ? 2 : 1; // adjust to inheritet EXCEL/LOTUS bug (2000 was NOT leap year, but LOTUS/EXCEL treated it is if it was)
        return $rjd;
    }

@makantayebi
Copy link

nice work! On my version of PHP the $headers = array(); in the beginning is necessary. would be a tiny update to this. I suggest allowing pull requests.

@towony
Copy link

towony commented Jan 4, 2018

I want to be very grateful. I wanted a very simple one. You've already made it. I needed a little modification. thanks again

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment