I wrote these instructions on how to install and use xsv – a powerful CSV-handling command-line tool, because someone asked how to deal with a data file that was too big to open in Excel or even Notepad. I didn't know how familiar the person was with installing/running downloadable .exe files or with Powershell, so I've tried to include some general instructions that hopefully are useful to even novices.
This mini-guide is not at all meant to be exhaustive as it basically shows just one of xsv's many useful functions. But if you're new to the idea of using command-line tools to do things, hopefully this can be a friendly intro to it.
Here's an example of a CSV that, at 3 million rows, is too big for Excel to open: https://burntsushi.net/stuff/worldcitiespop.csv
If you want to skip the tedious/convoluted setup info, jump to the section titled Using xsv split to see the easy xsv one-liner.
Or to save you a click, here it is:
.\xsv.exe split -s 900000 outputdir worldcitiespop.csv
Caveat: if these instructions seem dumb, it's because I'm dumb and haven't really spent a lot of time learning Win10 beyond the bare minimum to do what little I do in Windows. Obviously do what you know to be better
Go to the xsv releases/downloads page to find a list of zipped, downloadable binaries:
https://github.com/BurntSushi/xsv/releases/tag/0.13.0
I'm not sure what the differences are between the different versions, but I picked the first pc-windows binary – xsv-0.13.0-i686-pc-windows-gnu.zip, and it worked just fine on my macOS bootcamp install of Windows 10 Home edition:
https://github.com/BurntSushi/xsv/releases/download/0.13.0/xsv-0.13.0-i686-pc-windows-gnu.zip
Download and open the zip file. You'll find that it contains a single executable file named xsv.exe. Extract that into your Downloads folder.
Note: I actually recommend making a subfolder in your Downloads folder to do things in. For the example and screenshots I show in this guide, I made a subfolder with path
Downloads\foo
and putxsv.exe
and the example CSV file into it.
The rest of this guide assumes you too have a folder named
Downloads\foo
, so substitute accordingly
The proper way to install xsv.exe
so that it's easily usable on your system is to put it in your system path, e.g. C:\Windows\System32
. I'll assume you don't know or care about doing that for now, but you can read more info about it on this StackOverflow answer
The xsv homepage also mentions the possibility of installing cargo (Rust's package manager), which then makes installing xsv as easy as running this Powershell command:
cargo install xsv
(I have no idea how easy it is to install cargo
on Windows, but it works pretty well on the Mac side!)
For the purposes of this guide, I'll assume you've installed
xsv
the "simple" (i.e. limited) way.
So double-clicking xsv.exe
to run it won't work, as xsv is a command-line application, which means you need to run it from a command line interface (i.e CLI). Assuming you're on a relatively modern version of Windows, you should have a built-in CLI called Powershell.
The easiest way to get to Powershell is to open your Start Menu and start typing powersh
in the search bar.
Opening it should result in a window and text-based interface that looks like this:
Assuming you installed xsv.exe
the "simple" way, and xsv.exe
is in Downloads\foo
(or whatever you named your subfolder), you now have to change to that directory in Powershell, i.e. make Downloads\foo
your "working directory". Do this via the cd
(i.e. "change directory") command.
This should work:
cd ~\Downloads\foo
Now finally we can run xsv.exe
:
.\xsv.exe
And you should get output that looks like:
If that's what you see, that means xsv.exe "works" on your machine, and now you can use it to split your data file.
For the purposes of this example, let's use a publicly available large CSV file. Click and download the following file into your working directory alongside xsv.exe
– e.g. in Downloads\foo
https://burntsushi.net/stuff/worldcitiespop.csv (150MB/3 million rows)
(I'm assuming the file was downloaded and saved as worldcitiespop.csv
)
As we saw when invoking .\xsv.exe
, xsv.exe doesn't do anything when run by itself, other than print out a big help message. In general, it expects to be run with a subcommand – e.g. something like count
, headers
, split
– followed by the name of an input file, e.g. worldcitiespop.csv
So, running the count
subcommand like this:
.\xsv.exe count worldcitiespop.csv
Will result in xsv printing out the total number of rows:
The following use of headers
will list the column names (i.e. the first row):
.\xsv.exe headers worldcitiespop.csv
count
and headers
are simple subcommands to inspect a data file. You don't need to run them before running split
, but it's a good idea to, as it'll at least verify you have a properly formatted CSV file.
Without getting into the details of xsv's options, or how command-line things work in general, it's enough to say that the split
subcommand at a minimum requires a few extra parameters from us, namely:
- the number of records per split file
- a subdirectory to dump the split files (the original input file, e.g. worldcitiespop.csv, will be unaltered)
You can read the details in the xsv split documentation. But it's probably easy enough to show you an example command that will just work:
.\xsv.exe split -s 900000 outputdir worldcitiespop.csv
The above command tells xsv to split worldcitiespop.csv
file into smaller files – each no bigger than 900,000 records – and to put those smaller files into a new subdirectory named outputdir.
If it works (and it took less than a second to complete on my machine), this is what the relevant files in your working directory (e.g. Downloads\foo
) should look like:
And this is what that outputdir
subfolder should contain: 4 smaller CSV files, each named after the record number that the file begins with:
In Excel, you'll find that each of those files has the exact same header and structure:
And that's the minimum info needed on how to use xsv to split a big file into smaller chunks. However, that's only one of xsv's very useful features. And if you're a relatively experienced data wrangler, you'll find that xsv has a great list of features that can likely replace a lot of common tasks that you've been doing in clunky SQL/pandas/R.
-30-
(ignore the list of full size images in this Gistfile below, I have no idea if I can just delete them without deleting them from the actual guide text)