-
-
Save searbe/3284011 to your computer and use it in GitHub Desktop.
<?php | |
/** | |
* I had to parse an XLSX spreadsheet (which should damn well have been a CSV!) | |
* but the usual tools were hitting the memory limit pretty quick. I found that | |
* manually parsing the XML worked pretty well. Note that this, most likely, | |
* won't work if cells contain anything more than text or a number (so formulas, | |
* graphs, etc ..., I don't know what'd happen). | |
*/ | |
$inputFile = '/path/to/spreadsheet.xlsx'; | |
$dir = '/path/to/tmp/dir'; | |
// Unzip | |
$zip = new ZipArchive(); | |
$zip->open($inputFile); | |
$zip->extractTo($dir); | |
// Open up shared strings & the first worksheet | |
$strings = simplexml_load_file($dir . '/xl/sharedStrings.xml'); | |
$sheet = simplexml_load_file($dir . '/xl/worksheets/sheet1.xml'); | |
// Parse the rows | |
$xlrows = $sheet->sheetData->row; | |
foreach ($xlrows as $xlrow) { | |
$arr = array(); | |
// In each row, grab it's value | |
foreach ($xlrow->c as $cell) { | |
$v = (string) $cell->v; | |
// If it has a "t" (type?) of "s" (string?), use the value to look up string value | |
if (isset($cell['t']) && $cell['t'] == 's') { | |
$s = array(); | |
$si = $strings->si[(int) $v]; | |
// Register & alias the default namespace or you'll get empty results in the xpath query | |
$si->registerXPathNamespace('n', 'http://schemas.openxmlformats.org/spreadsheetml/2006/main'); | |
// Cat together all of the 't' (text?) node values | |
foreach($si->xpath('.//n:t') as $t) { | |
$s[] = (string) $t; | |
} | |
$v = implode($s); | |
} | |
$arr[] = $v; | |
} | |
// Assuming the first row are headers, stick them in the headers array | |
if (count($headers) == 0) { | |
$headers = $arr; | |
} else { | |
// Combine the row with the headers - make sure we have the same column count | |
$values = array_pad($arr, count($headers), ''); | |
$row = array_combine($headers, $values); | |
/** | |
* Here, do whatever you like with the [header => value] assoc array in $row. | |
* It might be useful just to run this script without any code here, to watch | |
* memory usage simply iterating over your spreadsheet. | |
*/ | |
} | |
} | |
@unlink($dir); | |
@unlink($inputFile); |
I use this script and it worked properly!, but I have a problem with dates, for example if I have a date 20/12/2012 in the xlsx document, worksheet 41263 write me, and I did not realize that conversion does.
anyone have any idea how I can fix this.
Hi.
Thanks for the great script.
Sometimes when there are empty cells they are missing in the xml which makes the script misplace the cell.
Example:
<row r="1" spans="1:12" x14ac:dyDescent="0.2"> <c r="A1" s="4" t="s"> <v>179</v> </c> <c r="B1" s="4" t="s"> <v>4568</v> </c> <c r="C1" s="4" t="s"> <v>180</v> </c> <c r="D1" s="4" t="s"> <v>181</v> </c> <c r="E1" s="6" t="s"> <v>5</v> </c> <c r="F1" s="6" t="s"> <v>6</v> </c> <c r="G1" s="6" t="s"> <v>7</v> </c> <c r="H1" s="6" t="s"> <v>8</v> </c> <c r="I1" s="6" t="s"> <v>4565</v> </c> <c r="J1" s="6" t="s"> <v>4566</v> </c> <c r="K1" s="7" t="s"> <v>4567</v> </c> <c r="L1" s="6" t="s"> <v>4569</v> </c> </row> <row r="2" spans="1:12" x14ac:dyDescent="0.2"> <c r="A2" s="7"> <v>1</v> </c> <c r="C2" s="7"> <v>1</v> </c> <c r="D2" s="7" t="s"> <v>13</v> </c> <c r="E2" s="8"/> <c r="F2" s="8"/> <c r="G2" s="8"/> <c r="H2" s="8"/> <c r="J2" s="6" t="s"> <v>104</v> </c> <c r="K2" s="6"/> </row> <row r="3" spans="1:12" x14ac:dyDescent="0.2"> <c r="A3" s="7"> <v>2</v> </c> <c r="C3" s="7"> <v>0</v> </c> <c r="D3" s="7" t="s"> <v>182</v> </c> <c r="E3" s="6" t="s"> <v>104</v> </c> <c r="F3" s="6" t="s"> <v>104</v> </c> <c r="G3" s="8"/> <c r="H3" s="6" t="s"> <v>104</v> </c> <c r="I3" s="6" t="s"> <v>104</v> </c> <c r="J3" s="6" t="s"> <v>104</v> </c> <c r="K3" s="6" t="s"> <v>104</v> </c> </row>
The first row are the headers. All the cells are present there. The second row is missing B, I and L cells. L is filled with "" by the script, but everything after B is misplaced by one cell and after I by two. The third row is missing is missing B & L - again everything after B in moved one up and the "$values = array_pad($arr, count($headers), '');" is filling the wholes at the end.
I've made a small fix which suits my needs, but it will work only up to 26 columns. Maybe you can think of something else.
Fix:
`// In each row, grab it's value
$alphabet = range('A', 'Z');//set the alphabet in array//Put this one outside the first loop
$ic = 0;
foreach ($xlrow->c as $cell) {
$cellLetter = trim(str_replace(range(0, 9), '', $cell['r']));//removing the numeric from R
if ($cellLetter != $alphabet[$ic]) {//cell is missing between this one and the last proccesed
$arr[] = '';//fill with empty
$ic++;
}
$v = (string)$cell->v;
// If it has a "t" (type?) of "s" (string?), use the value to look up string value
if (isset($cell['t']) && $cell['t'] == 's') {
$s = array();
$si = $strings->si[(int)$v];
// Register & alias the default namespace or you'll get empty results in the xpath query
$si->registerXPathNamespace('n', 'http://schemas.openxmlformats.org/spreadsheetml/2006/main');
// Cat together all of the 't' (text?) node values
foreach ($si->xpath('.//n:t') as $t) {
$s[] = (string)$t;
}
$v = implode($s);
}
$arr[] = $v;
$ic++;
}`
Regards!
what about comments and text strikes etc
Dates in excel are numbers representing the number of days since 01.01.1900.
Thus 41263
in a date cell is something like 113 years and 18 days since 01.01.1900 = 18.01.2013
.
Here are two PHP functions to convert Excel dates into days, months, years:
// EXCEL serialDATE conversions based on Julian <-> Gregorian calendars
protected static function excel_D2DMY($days) {
if ($days < 1){
return '';
}
if ($days == 60) {
return array('day'=>29, 'month'=>2, 'year'=>1900);
} else {
if ($days < 60) {
// Because of the 29-02-1900 bug, any serial date
// under 60 is one off... Compensate.
++$days;
}
// Modified Julian to DMY calculation with an addition of 2415019
$l = $days + 68569 + 2415019;
$n = floor(( 4 * $l ) / 146097);
$l = $l - floor(( 146097 * $n + 3 ) / 4);
$i = floor(( 4000 * ( $l + 1 ) ) / 1461001);
$l = $l - floor(( 1461 * $i ) / 4) + 31;
$j = floor(( 80 * $l ) / 2447);
$nDay = $l - floor(( 2447 * $j ) / 80);
$l = floor($j / 11);
$nMonth = $j + 2 - ( 12 * $l );
$nYear = 100 * ( $n - 49 ) + $i + $l;
$ret = array('day'=>$nDay, 'month'=>$nMonth, 'year'=>$nYear);
return $ret;
}
}
/**
* @desc Returns Excel serialDate constant calculated from gregorian date
* original javascript found @ http://wwwmacho.mcmaster.ca/JAVA/JD.html
*
* tested compatibility with EXCEL dates between 01/01/1900..31.12.2099,
* year given eventually as 0 (=1900)..199 (=2099)
* adds also days if supplied argument more than real number of days
* in particular month (same applies to months)
* as for years bellow 1900: algorythm is not exact, therefore limitation
* for years 1900 - 2099
*/
protected static function excel_DMY2D($d, $m, $y, $uh=0, $um=0, $us=0) {
if($y < 1900){
if($y<0 || $y>199){
// Invalid year, must be between 1900 - 2099 or 0 - 199
return false;
}else{
$y += 1900;
}
}
$extra = 100.0*$y + $m - 190002.5;
$rjd = 367.0*$y;
$rjd -= floor(7.0*($y+floor(($m+9.0)/12.0))/4.0);
$rjd += floor(275.0*$m/9.0);
$rjd += $d;
$rjd += ($uh + ($um + $us/60.0)/60.)/24.0;
$rjd += 1721013.5;
$rjd -= 0.5*$extra/abs($extra);
$rjd += 0.5;
$rjd -= 2415020.5; // JD correction constant for base 01/01/1900
$rjd += ($rjd > 60) ? 2 : 1; // adjust to inheritet EXCEL/LOTUS bug (2000 was NOT leap year, but LOTUS/EXCEL treated it is if it was)
return $rjd;
}
nice work! On my version of PHP the $headers = array(); in the beginning is necessary. would be a tiny update to this. I suggest allowing pull requests.
I want to be very grateful. I wanted a very simple one. You've already made it. I needed a little modification. thanks again
This is most helpful. Thank you.
Personally, I wanted to see the columns in an array for each row -- small adjustment: