-
-
Save stojg/3045663 to your computer and use it in GitHub Desktop.
<?php | |
// An example on how to parse massive XML files with PHP by chunking it up to avoid running out of memory | |
// Open the XML | |
$handle = fopen('file.xml', 'r'); | |
// Get the nodestring incrementally from the xml file by defining a callback | |
// In this case using a anon function. | |
nodeStringFromXMLFile($handle, '<item>', '</item>', function($nodeText){ | |
// Transform the XMLString into an array and | |
print_r(getArrayFromXMLString($nodeText)); | |
}); | |
fclose($handle); | |
/** | |
* For every node that starts with $startNode and ends with $endNode call $callback | |
* with the string as an argument | |
* | |
* Note: Sometimes it returns two nodes instead of a single one, this could easily be | |
* handled by the callback though. This function primary job is to split a large file | |
* into manageable XML nodes. | |
* | |
* the callback will receive one parameter, the XML node(s) as a string | |
* | |
* @param resource $handle - a file handle | |
* @param string $startNode - what is the start node name e.g <item> | |
* @param string $endNode - what is the end node name e.g </item> | |
* @param callable $callback - an anonymous function | |
*/ | |
function nodeStringFromXMLFile($handle, $startNode, $endNode, $callback=null) { | |
$cursorPos = 0; | |
while(true) { | |
// Find start position | |
$startPos = getPos($handle, $startNode, $cursorPos); | |
// We reached the end of the file or an error | |
if($startPos === false) { | |
break; | |
} | |
// Find where the node ends | |
$endPos = getPos($handle, $endNode, $startPos) + mb_strlen($endNode); | |
// Jump back to the start position | |
fseek($handle, $startPos); | |
// Read the data | |
$data = fread($handle, ($endPos-$startPos)); | |
// pass the $data into the callback | |
$callback($data); | |
// next iteration starts reading from here | |
$cursorPos = ftell($handle); | |
} | |
} | |
/** | |
* This function will return the first string it could find in a resource that matches the $string. | |
* | |
* By using a $startFrom it recurses and seeks $chunk bytes at a time to avoid reading the | |
* whole file at once. | |
* | |
* @param resource $handle - typically a file handle | |
* @param string $string - what string to search for | |
* @param int $startFrom - strpos to start searching from | |
* @param int $chunk - chunk to read before rereading again | |
* @return int|bool - Will return false if there are EOL or errors | |
*/ | |
function getPos($handle, $string, $startFrom=0, $chunk=1024, $prev='') { | |
// Set the file cursor on the startFrom position | |
fseek($handle, $startFrom, SEEK_SET); | |
// Read data | |
$data = fread($handle, $chunk); | |
// Try to find the search $string in this chunk | |
$stringPos = mb_strpos($prev.$data, $string); | |
// We found the string, return the position | |
if($stringPos !== false ) { | |
return $stringPos+$startFrom - mb_strlen($prev); | |
} | |
// We reached the end of the file | |
if(feof($handle)) { | |
return false; | |
} | |
// Recurse to read more data until we find the search $string it or run out of disk | |
return getPos($handle, $string, $chunk+$startFrom, $chunk, $data); | |
} | |
/** | |
* Turn a string version of XML and turn it into an array by using the | |
* SimpleXML | |
* | |
* @param string $nodeAsString - a string representation of a XML node | |
* @return array | |
*/ | |
function getArrayFromXMLString($nodeAsString) { | |
$simpleXML = simplexml_load_string($nodeAsString); | |
if(libxml_get_errors()) { | |
user_error('Libxml throws some errors.', implode(',', libxml_get_errors())); | |
} | |
return simplexml2array($simpleXML); | |
} | |
/** | |
* Turns a SimpleXMLElement into an array | |
* | |
* @param SimpleXMLelem $xml | |
* @return array | |
*/ | |
function simplexml2array($xml) { | |
if(is_object($xml) && get_class($xml) == 'SimpleXMLElement') { | |
$attributes = $xml->attributes(); | |
foreach($attributes as $k=>$v) { | |
$a[$k] = (string) $v; | |
} | |
$x = $xml; | |
$xml = get_object_vars($xml); | |
} | |
if(is_array($xml)) { | |
if(count($xml) == 0) { | |
return (string) $x; | |
} | |
$r = array(); | |
foreach($xml as $key=>$value) { | |
$r[$key] = simplexml2array($value); | |
} | |
// Ignore attributes | |
if (isset($a)) { | |
$r['@attributes'] = $a; | |
} | |
return $r; | |
} | |
return (string) $xml; | |
} |
HI,
I tried to use this parser, but I modified the getArrayFromXMLString to this:
`
function getArrayFromXMLString($nodeAsString) {
$simpleXML = simplexml_load_string($nodeAsString);
if($simpleXML){
echo "yes";
}else{
echo "no";
}
if(libxml_get_errors()) {
user_error('Libxml throws some errors.', implode(',', libxml_get_errors()));
}
return simplexml2array($simpleXML);
}
`
And always, I got "no". My XML file is around 745 KB, with 18k lines. You have any idea ? Thanks a lot.
You are the best. Worked smoothly on 2GB xml datasets.
I found this mis-read some entries and calculated the $endPos incorrectly with my XML. In my case, when the error occured, it was because the last 2 characters of the tag were cut off in the string before being parsed. I haxed the one of the functions to check for the missing "t>" and add it when needed.
hi , my file is of 140 mb , tried with your script but got the error: Fatal error: Maximum function nesting level of '256' reached, aborting!
increase nested level in php.in
this is a great start for me. thanks so much
Error "simplexml_load_string(): namespace error : Namespace prefix commons on preference-order is not defined" is showing how to fix it?
Hi , how i can use nodeStringFromXMLFile with xml atributes like an ID? Thanks a lot.
Example:
"<reservation id="60613"><reservationNumber>38058</reservationNumber></reservation>"
I also have internal error and go to logs and i can't see it. Can you help me please?
Thank you so much, I have been stuck for almost a week now, and I really could not work, and with just this now I can start working.
Thank you, you are an inspiration.
What I want to do is to grab the xml using curl (file will be around 32mb) and display the parsed data on screen without any lagging. Memory limit is 32mb and and there are million of records
Is there any way to parse a large xml file using "yield" in php?