"""
A script to take all of the LineString information out of a very large KML file. It formats it into a CSV file so
that you can import the information into the NDB of Google App Engine using the Python standard library. I ran this
script locally to generate the CSV. It processed a ~70 MB KML down to a ~36 MB CSV in about 8 seconds.

The KML had coordinates ordered by
    [Lon, Lat, Alt, ' ', Lon, Lat, Alt, ' ',...]   (' ' is a space)
The script removes the altitude to put the coordinates in a single CSV row ordered by
    [Lat,Lon,Lat,Lon,...]

Dependencies:
 - Beautiful Soup 4
 - lxml

I found a little bit of help online for using BeautifulSoup to process a KML file. I put this online to serve as
another example. Some things I learned:
 - the BeautifulSoup parser *needs* to be 'xml'. I spent too much time debugging why the default one wasn't working, and
   it was because the default is an HTML parser, not XML.

tl;dr
KML --> CSV so that GAE can go CSV --> NDB
"""

from bs4 import BeautifulSoup
import csv


def process_coordinate_string(str):
    """
    Take the coordinate string from the KML file, and break it up into [Lat,Lon,Lat,Lon...] for a CSV row
    """
    space_splits = str.split(" ")
    ret = []
    # There was a space in between <coordinates>" "-80.123...... hence the [1:]
    for split in space_splits[1:]:
        comma_split = split.split(',')
        ret.append(comma_split[1])  # lat
        ret.append(comma_split[0])  # lng
    return ret


def main():
    """
    Open the KML. Read the KML. Open a CSV file. Process a coordinate string to be a CSV row.
    """
    with open('doc.kml', 'r') as f:
        s = BeautifulSoup(f, 'xml')
        # 'wb' is the Python 2 idiom; on Python 3 use open('out.csv', 'w', newline='')
        with open('out.csv', 'wb') as csvfile:
            writer = csv.writer(csvfile)
            for coords in s.find_all('coordinates'):
                writer.writerow(process_coordinate_string(coords.string))


if __name__ == "__main__":
    main()
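For files much larger than the ~70 MB one described in the docstring, a streaming parse may need less memory than loading the whole document into BeautifulSoup. The following is only a minimal sketch that swaps in the standard library's `xml.etree.ElementTree.iterparse` in place of Beautiful Soup. It is not part of the original script, the function name `iter_coordinate_rows` is hypothetical, and it assumes the same `lon,lat,alt` tuple layout.

```python
import xml.etree.ElementTree as ET


def iter_coordinate_rows(source):
    """Yield one [lat, lon, lat, lon, ...] list per <coordinates> element.

    `source` is a filename or a binary file object containing KML.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        # KML tags are usually namespaced, e.g.
        # '{http://www.opengis.net/kml/2.2}coordinates',
        # so compare only the local part of the tag name.
        if elem.tag.rsplit("}", 1)[-1] == "coordinates":
            row = []
            for chunk in (elem.text or "").split():  # split on any whitespace
                parts = chunk.split(",")
                if len(parts) >= 2:
                    row.extend([parts[1], parts[0]])  # lat, then lon
            yield row
            elem.clear()  # drop the coordinate text that has already been used
```

Each yielded list can be passed straight to `csv.writer(...).writerow`.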
@mciantrye Thanks for this. I'm using Python 2.7 and have installed BS4 and lxml. I get an out.csv file but it's 80 empty lines with no error. Is there something obvious I'm missing? Thanks, Doug
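One possible cause of the empty rows, offered as a guess since the KML itself isn't shown: the script skips the first space-separated chunk with `[1:]`, so a `<coordinates>` element that holds a single tuple with no leading space produces an empty row. A minimal sketch of a more tolerant variant (the name `process_coordinate_string_tolerant` is hypothetical) splits on any whitespace instead:

```python
def process_coordinate_string_tolerant(text):
    """Return [lat, lon, lat, lon, ...] from a KML coordinate string."""
    ret = []
    # str.split() with no argument splits on runs of spaces, tabs and newlines
    # and never yields empty chunks, so no [1:] slice is needed.
    for chunk in (text or "").split():
        parts = chunk.split(",")
        if len(parts) >= 2:  # skip anything that is not lon,lat[,alt]
            ret.append(parts[1])  # lat
            ret.append(parts[0])  # lng
    return ret
```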
Thanks guys. Your code helped me today.
from bs4 import BeautifulSoup
import csv

path = 'doc.kml'  # input KML file; not defined in the posted snippet, so this value is an assumption


def process_coordinate_string(str):
    """
    Take the first coordinate tuple from the KML coordinate string and return it as [Lat, Lon] for a CSV row
    """
    comma_split = str.split(',')
    return [comma_split[1].strip(), comma_split[0].strip()]


def main():
    """
    Open the KML. Read the KML. Open a CSV file. Process a coordinate string to be a CSV row.
    """
    with open(path, 'r') as f:
        s = BeautifulSoup(f, 'xml')
    with open('out.csv', 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        for coords in s.find_all('coordinates'):
            writer.writerow(process_coordinate_string(coords.string))


if __name__ == "__main__":
    main()
I got the error below while trying to run the script:
line 25, in <module>
from bs4 import BeautifulSoup
ModuleNotFoundError: No module named 'bs4'
@akintunero you need to install the dependencies (the 'xml' parser also needs lxml):
pip install beautifulsoup4 lxml
NOTE: This gave me an error under Python 3:
"a bytes-like object is required, not 'str'"
This can be solved by changing the mode on line 48 of the code, where the output file is opened, from 'wb' to 'w'.
Anyway, great code, thanks a lot.
Thanks so much!
I modified it to work with KML files created with google earth:
https://github.com/miasolodky/Google-Earth-KML-to-CSV/blob/master/kmltocsv.ipynb
TypeError: a bytes-like object is required, not 'str'
I get this error in Python 3.8 and I don't know why.
My input list to be written is [19.33482579812116, 77.01685649730679, 19.33477755271189, 77.01738131023461, 19.33423333079384, 77.0173191392798, 19.33418091607818, 77.01764031166668, 19.33537616602636, 77.01780075660297, 19.33543133527809, 77.01679442455995, 19.33482579812116, 77.01685649730679], yet it is still being treated as str.
Refer to my previous comment: when opening the file, change 'wb' to 'w'. With 'wb' you are telling Python to write in binary mode.
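As for why this happens: in Python 3, `csv.writer` formats every value as text and calls the file's `write` with a `str`, and a file opened in binary mode accepts only bytes. The type of the items in the list does not matter. A minimal sketch of the Python 3 form of the output step follows; the helper name `write_rows` is made up for illustration, and `newline=''` is what the `csv` module documentation recommends when opening the file.

```python
import csv


def write_rows(path, rows):
    """Write each list in `rows` as one CSV row (Python 3)."""
    # Text mode ('w'), not binary ('wb'): csv.writer emits str in Python 3.
    # newline='' lets the csv module control line endings itself.
    with open(path, "w", newline="") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerows(rows)
```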
Hi,
Your code helped me a lot, as I had never worked with bs4! Thanks!
I needed to tabulate the placemarks I had made in Google Earth (around 500).
I have added some code to save the lat/long data that your code extracts, along with the names and descriptions of the placemarks, which I needed for my code. This works in Python 3. I hope it helps somebody.
https://gist.github.com/mohitsingh2806/deee300a2f5bdd2768967116bd209019
EDIT: Looking at my code again, I realised that after a lot of small incremental changes it is now almost completely different from the original code you shared. Nonetheless, your code helped me a lot, and thank you again for it.
Thank you, helped me complete that task.
This helped me, thanks! I needed something more pandas-friendly, so I edited it a little. I'm sharing it in this thread in case someone else needs it (Python 3.x):
import pandas as pd
from bs4 import BeautifulSoup


def process_coordinate_string(str):
    """
    Take the coordinate string from the KML file, and break it up into [Lat,Lon,Lat,Lon...] for a CSV row
    """
    space_splits = str.split(" ")
    ret = []
    # There was a space in between <coordinates>" "-80.123...... hence the [1:]
    for split in space_splits[1:]:
        comma_split = split.split(',')
        ret.append(comma_split[1])  # lat
        ret.append(comma_split[0])  # lng
    return ret


def main():
    """
    Open the KML. Read the KML. Collect the coordinates into a DataFrame. Write the DataFrame to a CSV file.
    """
    with open('input.kml', 'r') as f:
        s = BeautifulSoup(f, 'xml')
    lats, lons = [], []
    # Accumulate across every <coordinates> element so that later elements do not overwrite earlier ones
    for coords in s.find_all('coordinates'):
        data = process_coordinate_string(coords.string)
        lats.extend(float(x) for x in data[0::2])  # even positions are latitudes
        lons.extend(float(x) for x in data[1::2])  # odd positions are longitudes
    df = pd.DataFrame({'Lat': lats, 'Lon': lons})
    df.to_csv("kml_to_df.csv", index=False)
A slight modification of WxBDM's version, as I had some issues with a lack of standardization in the generated KML files. It also reads from a kml folder and writes to a csv folder using the same base filename, to allow for mass conversion. The function is now called with the KML filename as an argument:
kml2csv('test.kml')
from bs4 import BeautifulSoup
import pandas as pd


def process_coordinate_string(str):
    """
    Take the coordinate string from the KML file, and break it up into [Lat,Lon,Lat,Lon...] for a CSV row
    """
    space_splits = str.split(" ")
    ret = []
    # There was a space in between <coordinates>" "-80.123...... hence the [1:]
    for split in space_splits[1:]:
        comma_split = split.split(',')
        # Checks for len on the split, because depending on kml file generator you might get an empty
        # string (which would be misinterpreted as a coordinate)
        if len(comma_split) == 3:
            ret.append(comma_split[1])  # lat
            ret.append(comma_split[0])  # lng
    return ret


def kml2csv(fname):
    """
    Open the KML. Read the KML. Collect the coordinates into a DataFrame. Write the DataFrame to a CSV file.
    Input: Filename with extension ('example.kml'), located in 'kml' folder.
    Output: File with the same name as input, but in .csv format, located in 'csv' folder.
    """
    out_fname = fname.split('.kml')[0] + '.csv'
    with open('kml/' + fname, 'r') as f:
        s = BeautifulSoup(f, 'xml')
    lats, lons = [], []
    # Accumulate across every <coordinates> element so that later elements do not overwrite earlier ones
    for coords in s.find_all('coordinates'):
        data = process_coordinate_string(coords.string)
        lats.extend(float(x) for x in data[0::2])  # even positions are latitudes
        lons.extend(float(x) for x in data[1::2])  # odd positions are longitudes
    df = pd.DataFrame({'Lat': lats, 'Lon': lons})
    df.to_csv("csv/" + out_fname, index=False)
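For the mass conversion mentioned above, something has to call `kml2csv` once per file. A minimal sketch of such a driver, using only the standard library, is below. The name `convert_all` and its parameters are hypothetical, and it assumes the `csv` output folder already exists.

```python
from pathlib import Path


def convert_all(convert, kml_dir="kml"):
    """Call `convert(filename)` for every .kml file in `kml_dir`.

    Returns the list of bare filenames that were handed to `convert`.
    """
    names = sorted(p.name for p in Path(kml_dir).glob("*.kml"))
    for name in names:
        convert(name)  # e.g. kml2csv, which expects a bare filename
    return names
```

With the function from this comment, the call would be `convert_all(kml2csv)`. The default `kml_dir` matches the folder that `kml2csv` reads from.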
I am using the above examples but I only get the first and last coordinate in a csv file. It is as if it is not looping, however since I am getting the first and last coordinate I have to assume that it is reading the coordinates list.
`def process_coordinate_string(str):
    # Take the coordinate string from the KML file, and break it up into [Lat,Lon,Lat,Lon...] for a CSV row
    ret = []
    comma_split = str.split(',')
    return [comma_split[1], comma_split[0]]

def main():
    # Open the KML. Read the KML. Open a CSV file. Process a coordinate string to be a CSV row.
    with open('61956195-6202689-a300234067548720_2022-02-23-16-15-48.kml', 'r') as f:
        s = BeautifulSoup(f, 'xml')
    with open('trajectory-6195.csv', 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        for coords in s.find_all('coordinates'):
            writer.writerow(process_coordinate_string(coords.string))

if __name__ == "__main__":
    main()`
I am a relative beginner. Any reason as to why that may be happening?
Thanks
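A possible explanation, offered tentatively since the KML isn't shown: this version of `process_coordinate_string` splits the whole string on commas and returns only elements 1 and 0, so each `<coordinates>` element contributes just its first lat/lon pair, however many points it holds. If the file has separate start and end placemarks alongside the track, that could look like "first and last only". A minimal sketch that returns one row per point instead (the name `coordinate_rows` is hypothetical, and the `lon,lat[,alt]` layout is assumed):

```python
def coordinate_rows(text):
    """Return [[lat, lon], ...], one inner list per point in a KML coordinate string."""
    rows = []
    for chunk in (text or "").split():  # one chunk per 'lon,lat[,alt]' tuple
        parts = chunk.split(",")
        if len(parts) >= 2:
            rows.append([parts[1], parts[0]])  # lat, lon
    return rows
```

In `main()`, replacing the `writerow` call with `writer.writerows(coordinate_rows(coords.string))` would then write every point of the trajectory on its own line.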
Nice work guys, thanks a lot for sharing! I needed to take some more columns out of my kml file (name, description, and add some custom columns) along with the coordinates, so I used your code and created a new gist (works in Python 3): https://gist.github.com/anamariakantar/a0c154a3df92a0ee7adc7f7a78061623