@Wimpje
Last active February 8, 2023 12:46
Powershell, split large XML files on node name, with offset support
param(
    [string]$file = $(throw "file is required"),
    $matchesPerSplit = 50,
    $maxFiles = [Int32]::MaxValue,
    $splitOnNode = $(throw "splitOnNode is required"),
    $offset = 0
)
# with a little help of https://gist.github.com/awayken/5861923
$ErrorActionPreference = "Stop";
trap {
    $ErrorActionPreference = "Continue"
    write-error "Script failed: $_ `r`n $($_.ScriptStackTrace)"
    exit (1);
}
$file = (resolve-path $file).path
$fileNameExt = [IO.Path]::GetExtension($file)
$fileNameWithoutExt = [IO.Path]::GetFileNameWithoutExtension($file)
$fileNameDirectory = [IO.Path]::GetDirectoryName($file)
$reader = [System.Xml.XmlReader]::Create($file)
$matchesCount = $idx = 0
try {
    "Splitting $file on node name='$splitOnNode', with a max of $matchesPerSplit matches per file. Max of $maxFiles files will be generated."
    $result = $reader.ReadToFollowing($splitOnNode)
    $hasNextSibling = $true
    while (-not($reader.EOF) -and $result -and $hasNextSibling -and ($idx -lt $maxFiles + $offset)) {
        if ($matchesCount -lt $matchesPerSplit) {
            if ($offset -gt $idx) {
                $idx++
                continue
            }
            $to = [IO.Path]::Combine($fileNameDirectory, "$fileNameWithoutExt.$($idx - $offset)$fileNameExt")
            "Writing to $to"
            $toXml = New-Object System.Xml.XmlTextWriter($to, $null)
            $toXml.Formatting = 'Indented'
            $toXml.Indentation = 2
            try {
                $toXml.WriteStartElement("split")
                $toXml.WriteAttributeString("cnt", $null, "$idx")
                do {
                    # ReadOuterXml() also moves the reader past the element it returns
                    $toXml.WriteRaw($reader.ReadOuterXml())
                    $matchesCount++;
                    $hasNextSibling = $reader.ReadToNextSibling($splitOnNode)
                } while ($hasNextSibling -and ($matchesCount -lt $matchesPerSplit))
                $toXml.WriteEndElement();
            }
            finally {
                $toXml.Flush()
                $toXml.Close()
            }
            $idx++
            $matchesCount = 0;
        }
    }
}
finally {
    $reader.Close()
}
@Wimpje
Author

Wimpje commented Jun 5, 2019

Hi! Thanks for the feedback, always those pesky off-by-one errors... I must say I only used the script for sanity-checking large files, so I didn't run into the issue. I will fix it later this week, when I'm on my Windows machine :)

@Wimpje
Author

Wimpje commented Jun 11, 2019

Should be fixed now!

@domOrielton

domOrielton commented Oct 1, 2019

Thank you for the excellent code. Not sure if you've seen this issue before, but when I process a very large file (>500 MB) with no line breaks, the output only ever totals approximately 280 MB (~295,964,333 bytes) and the script then completes with no errors. Judging by the size, a lot of the entries must be missing, and I can't work out why. All I can think of is that all the text being on a single line somehow causes an issue. It makes no difference what I set matchesPerSplit to; the output always maxes out at around 280 MB.

Update: I can confirm this issue occurs because the large file has no line breaks. If I split the file into smaller files, add line breaks, and then join the files again, the script works just fine on a >500 MB file.
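
One way to confirm that entries are missing, rather than inferring it from file sizes, is to count the matching elements in the source file and in the split files. The following is a minimal sketch; `Get-XmlElementCount` is a hypothetical helper that is not part of the gist, and the paths in the commented usage lines are placeholders.

```powershell
# Hypothetical helper (not part of the gist): streams through a file and
# counts the elements with the given name, without loading the file into memory.
function Get-XmlElementCount([string]$path, [string]$nodeName) {
    $reader = [System.Xml.XmlReader]::Create($path)
    $count = 0
    try {
        while ($reader.Read()) {
            if ($reader.NodeType -eq [System.Xml.XmlNodeType]::Element -and $reader.Name -eq $nodeName) {
                $count++
            }
        }
    }
    finally {
        $reader.Close()
    }
    $count
}

# Usage with placeholder paths (pass full paths, as the script itself does via resolve-path):
# $inSource = Get-XmlElementCount 'D:\data\big.xml' 'item'
# $inSplits = (Get-ChildItem 'D:\data\big.*.xml' |
#     ForEach-Object { Get-XmlElementCount $_.FullName 'item' } |
#     Measure-Object -Sum).Sum
# "source: $inSource, splits: $inSplits"
```

If the two numbers differ, nodes were dropped during the split.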

@vikjon0

vikjon0 commented Jul 28, 2022

The code does not work on all files.
According to the docs, ReadOuterXml will advance the reader to the next tag. What I don't understand is why it sometimes works.
https://docs.microsoft.com/en-us/dotnet/api/system.xml.xmlreader.readouterxml?view=net-6.0

This workaround seems to work. I have not been able to find a better solution.

It seems to work in both situations, which I also cannot explain:
if ($reader.Name -eq $splitOnNode) {
    $hasNextSibling = $true
} else {
    $hasNextSibling = $reader.ReadToNextSibling($splitOnNode)
}
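
A minimal sketch of why whitespace makes the difference, assuming the default XmlReader settings (under which whitespace nodes are reported). `Get-NodeAfterOuterXml` is a hypothetical helper, not part of the gist; it prints the node the reader is positioned on immediately after ReadOuterXml().

```powershell
# Hypothetical helper (not part of the gist): returns "NodeType:Name" for the
# node the reader sits on right after ReadOuterXml() has consumed one element.
function Get-NodeAfterOuterXml([string]$xml, [string]$nodeName) {
    $stringReader = New-Object System.IO.StringReader -ArgumentList $xml
    $reader = [System.Xml.XmlReader]::Create($stringReader)
    try {
        [void]$reader.ReadToFollowing($nodeName)   # move to the first matching element
        [void]$reader.ReadOuterXml()               # consume it; the reader moves on
        "$($reader.NodeType):$($reader.Name)"
    }
    finally {
        $reader.Close()
    }
}

# No whitespace between siblings: the reader is already on the next <item>.
Get-NodeAfterOuterXml '<l><item>A</item><item>B</item></l>' 'item'      # Element:item

# A line break between siblings: the reader is on the whitespace node instead.
Get-NodeAfterOuterXml "<l><item>A</item>`n<item>B</item></l>" 'item'    # Whitespace:
```

In the compact case the reader already sits on the next element, so a following ReadToNextSibling call moves past it and every second node is skipped. With whitespace between the elements, the reader sits on the whitespace node and ReadToNextSibling lands on the next element as intended.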

@vikjon0

vikjon0 commented Jul 29, 2022

I think the script only works correctly on XML that has a "CR" (line break) between the elements?

Please compare the output of sample1 and sample2:

$global:ErrorActionPreference = "Stop"

$content1 = '<test><list_items><item><id>A</id></item><item><id>B</id></item></list_items></test>' | out-file -Force -filepath D:\test\sample1.xml
$content2 = '<test>' + [char]13 + '<list_items>' + [char]13 +'<item> '+ [char]13 +'<id>A</id>' + [char]13 + '</item>' + [char]13 + '<item>' + [char]13 + '<id>B</id>' + [char]13 + '</item>' + [char]13 + '</list_items>' + [char]13 + '</test>' | out-file -Force -filepath D:\test\sample2.xml

$file = (resolve-path D:\test\sample2.xml).path
#$file = (resolve-path D:\test\sample1.xml).path

$reader = [System.Xml.XmlReader]::Create($file) 

$matchesCount = $idx = 0

try {
  
    $result = $reader.ReadToFollowing("item")
    $hasNextSibling = $true
    while (-not($reader.EOF) -and $result -and $hasNextSibling) {  #JONVIK
          write-host $reader.ReadOuterXml()
          $hasNextSibling = $reader.ReadToNextSibling("item")
    }

}
finally {
    $reader.Close()
}
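
For comparison, here is a sketch of a read loop that should return both items for either sample. It checks whether the reader already sits on a matching element before calling ReadToNextSibling. `Get-EachOuterXml` is a hypothetical helper that works on in-memory strings to keep the example self-contained; the gist itself reads from a file.

```powershell
# Hypothetical helper (not part of the gist): emits the outer XML of every
# sibling element named $nodeName, with or without whitespace between them.
function Get-EachOuterXml([string]$xml, [string]$nodeName) {
    $stringReader = New-Object System.IO.StringReader -ArgumentList $xml
    $reader = [System.Xml.XmlReader]::Create($stringReader)
    try {
        $found = $reader.ReadToFollowing($nodeName)
        while ($found) {
            $reader.ReadOuterXml()   # emits the element; the reader moves past it
            if ($reader.NodeType -eq [System.Xml.XmlNodeType]::Element -and $reader.Name -eq $nodeName) {
                $found = $true       # already on the next sibling, do not advance
            } else {
                $found = $reader.ReadToNextSibling($nodeName)
            }
        }
    }
    finally {
        $reader.Close()
    }
}

$compact = '<test><list_items><item><id>A</id></item><item><id>B</id></item></list_items></test>'
$spaced  = $compact -replace '><', ">`r<"   # same document with a CR between tags

(Get-EachOuterXml $compact 'item').Count   # 2
(Get-EachOuterXml $spaced 'item').Count    # 2
```

The extra NodeType check is an addition to the workaround above; it avoids treating an end tag that happens to share the name as a match.'
$s = $c -replace '><', ">`r<"
$rc = @(Get-EachOuterXml $c 'item')
$rs = @(Get-EachOuterXml $s 'item')
if ($rc.Count -ne 2) { throw "compact count: $($rc.Count)" }
if ($rs.Count -ne 2) { throw "spaced count: $($rs.Count)" }
if ($rc[0] -notlike '*<id>A</id>*') { throw "first compact item wrong" }
if ($rc[1] -notlike '*<id>B</id>*') { throw "second compact item wrong" }
</test>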

@AndreiPosto

Hi, many thanks for this code, brilliant!
Could you help me with a line of code that extracts the id of the node being split on, so that I can use that id in the file name instead of the incremental $idx, please?
Many thanks!

@vikjon0

vikjon0 commented Feb 8, 2023

I don't have time to test right now, but the node should be in here:
$reader.ReadOuterXml()

Not sure if you can extract it directly or if you need to load the content into another object first. It needs to be done without moving the reader's position.
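
A minimal sketch of that idea, untested against the full script: call ReadOuterXml() once, keep the returned string, and parse that string separately, so the reader is not moved a second time. `Get-NodeId` is a hypothetical helper, and it assumes the id is a direct child element named `id` with no XML namespace, as in the sample files above.

```powershell
# Hypothetical helper (not part of the gist): extracts the text of a direct
# child element from one node's outer XML and makes it safe for a file name.
function Get-NodeId([string]$outerXml, [string]$idElement = 'id') {
    $doc = New-Object System.Xml.XmlDocument
    $doc.LoadXml($outerXml)
    $idNode = $doc.DocumentElement.SelectSingleNode($idElement)
    if ($null -eq $idNode) { return $null }
    # Replace characters that are not allowed in file names with '_'
    $invalid = [IO.Path]::GetInvalidFileNameChars()
    -join ($idNode.InnerText.Trim().ToCharArray() |
        ForEach-Object { if ($invalid -contains $_) { '_' } else { $_ } })
}

# In the script this string would come from: $outerXml = $reader.ReadOuterXml()
$outerXml = '<item><id>A/1</id></item>'
Get-NodeId $outerXml   # A_1
```

The same string can then be passed to WriteRaw. Since the output file is named before it is written, this fits most naturally with one node per file (matchesPerSplit set to 1).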
