Created June 25, 2015 00:59
Strip the domain from each URL in a sitemap XML file, keeping only the path.
def path_strip(input_file, domain, output_file)
  raise "domain must be string form" unless domain.is_a? String
  raise "invalid input file name" unless input_file.is_a? String
  raise "invalid output file name" unless output_file.is_a? String

  file = File.open(input_file, "r")
  data = file.read
  file.close
  data_lines = data.lines
  cleared_arr = []

  result_file = File.open(output_file, "w")
  num = 0

  data_lines.each do |line|
    if line =~ /<loc>http:\/\/#{Regexp.quote(domain)}\/.*<\/loc>/
      num += 1
      result_file.puts "link_#{num},#{line[5..-8].sub("http://#{domain}/", '')}"
    end
  end
  result_file.close
  p num
end

# example:
path_strip "./sitemap.xml", "hunj.github.io", "./result.csv"
Calling `File.open` twice is alright. You can refactor lines 6 to 9 (open the input file, read it, close it, split it into lines) to this:
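A minimal sketch of such a refactor, assuming `File.readlines` is an acceptable replacement for the four statements. The temporary file and its contents are made up for illustration so that the snippet runs on its own:

```ruby
require "tempfile"

# Made-up sample input standing in for sitemap.xml.
sample = Tempfile.new(["sitemap", ".xml"])
sample.write("<loc>http://hunj.github.io/a</loc>\n<loc>http://hunj.github.io/b</loc>\n")
sample.close

# One call opens the file, reads it, closes it, and splits it into lines.
data_lines = File.readlines(sample.path)
```

Each element of `data_lines` keeps its trailing newline, just like `data.lines` does, so the rest of the method should not need to change.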
I guess you could prevent creating unnecessary variables by doing something like this:
(A shorter version)
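One way this might look, as a sketch only: chain the calls so that `file`, `data`, and `data_lines` never need names. The helper name `loc_lines` is hypothetical and not part of the gist.

```ruby
# Hypothetical helper: returns the sitemap lines whose <loc> URL is on `domain`.
# No intermediate variables for the file handle or its raw contents.
def loc_lines(input_file, domain)
  File.readlines(input_file).grep(%r{<loc>http://#{Regexp.quote(domain)}/.*</loc>})
end
```

`grep` keeps only the lines that match the pattern, which is the same test the `if line =~ ...` check performs inside the loop.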
By the way, with reference to your code (L12, L21), what you're doing is: open a File, process the lines one by one, then close the File instance. Only do that if you have a lot of lines (and you think that the lines will take up a lot of memory).
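If you do stay with the line-by-line approach, here is a sketch that uses the block form of `File.open` together with `File.foreach`, so the output file is closed automatically and only one input line is held in memory at a time. The method name `stream_paths` is hypothetical, and the regexp capture is an assumed substitute for the `line[5..-8]` slice so that indented `<loc>` lines also work:

```ruby
# Hypothetical streaming variant: reads and writes one line at a time.
def stream_paths(input_file, domain, output_file)
  pattern = %r{<loc>http://#{Regexp.quote(domain)}/(.*)</loc>}
  count = 0
  File.open(output_file, "w") do |out|   # closed when the block exits
    File.foreach(input_file) do |line|   # yields one line at a time
      path = line[pattern, 1]            # capture group 1, or nil if no match
      next unless path
      count += 1
      out.puts "link_#{count},#{path}"
    end
  end
  count
end
```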
An alternative would be to store the processed lines in a `String` (concatenate) or `Array`, and write it to the `output_file` all in one go. This method would be faster, but you need memory to store your data.

Cheers,
Jay
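The write-everything-in-one-go alternative could be sketched like this. The method name `batch_paths` is hypothetical, and it assumes the whole sitemap fits comfortably in memory:

```ruby
# Hypothetical batch variant: collect every row in an Array, then write once.
def batch_paths(input_file, domain, output_file)
  pattern = %r{<loc>http://#{Regexp.quote(domain)}/(.*)</loc>}
  # Extract the path from each matching line; non-matching lines become nil.
  paths = File.readlines(input_file).map { |line| line[pattern, 1] }.compact
  rows = paths.each_with_index.map { |path, i| "link_#{i + 1},#{path}" }
  # Single write call for the whole result.
  File.write(output_file, rows.map { |row| "#{row}\n" }.join)
  rows.length
end
```

The trade-off is the one described above: a single write instead of one `puts` per line, at the cost of holding all rows in memory until the end.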