I received a question about how to randomize column order in a text file. I came up with a method that uses common Unix command line tools (sort, sed, tr, join) and the Bash shell. It has been tested with the BSD command line tools on macOS 12 (running zsh) and with the GNU command line tools on CentOS (running bash).
NOTE: this does not handle quoted commas in the CSV. The only commas should be the delimiters.
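Since quoted commas would silently break the column split, it may be worth checking the input first. The following is a rough sketch only: the helper name has_quoted_fields and the two demo files are made up for illustration, and the check simply assumes that any double quote in the file signals a quoted field.

```shell
# Hypothetical helper (not part of the script below): succeeds if the file
# contains a double quote anywhere, which suggests quoted CSV fields.
has_quoted_fields() {
    grep -q '"' "$1"
}

# Small demonstration on two throwaway files.
printf '1,2,3\n' > plain_demo.csv
printf '1,"2,5",3\n' > quoted_demo.csv

for f in plain_demo.csv quoted_demo.csv; do
    if has_quoted_fields "$f"; then
        echo "$f: quoted fields found, splitting on commas is not safe" >&2
    else
        echo "$f: no quotes found"
    fi
done
rm plain_demo.csv quoted_demo.csv
```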
NUM_COLS=10
NUM_RANDOMIZED_OUTPUT=5
INPUT_FILE=input.csv
OUTPUT_PREFIX=output
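# One column number per line: 1..NUM_COLS
seq 1 ${NUM_COLS} > cols
# Write one shuffled field list (e.g. 1.3,1.8,...) per output file
for x in `seq 1 ${NUM_RANDOMIZED_OUTPUT}`; do
    sort --random-sort cols | sed -r 's/^/1./' | tr "\n" , | sed -r 's/,$//'
    echo
done >randomized_col_order
# Join the input with itself, printing the fields in each shuffled order
for y in `seq 1 ${NUM_RANDOMIZED_OUTPUT}`; do
    current_order=`sed -n ${y}p randomized_col_order`
    join -t, -o "${current_order}" ${INPUT_FILE} ${INPUT_FILE} > ${OUTPUT_PREFIX}_${y}
done
# Clean up the temporary files
rm cols randomized_col_order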
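The step that does the reordering is join with the -o option, which takes a list of FILE.FIELD pairs and prints the fields in exactly that order. Joining a file with itself is just a way to get at that option. A minimal sketch of the mechanism, using a made-up one-line file demo.csv and a fixed order instead of a random one:

```shell
# A one-line demo file with three comma-separated columns.
printf 'a,b,c\n' > demo.csv

# Joining the file with itself matches the line on its first field ("a").
# -o lists the output fields as FILE.FIELD, so 1.3,1.1,1.2 prints
# columns 3, 1 and 2 of the first file.
reordered=$(join -t, -o 1.3,1.1,1.2 demo.csv demo.csv)
echo "${reordered}"   # prints c,a,b
rm demo.csv
```

Because join expects both inputs to be sorted on the join field and pairs every line with every line that shares its key, this trick seems best suited to inputs whose first column is sorted and unique, such as a single-line file.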
input.csv:
1,2,3,4,5,6,7,8,9,10
Output files:
ls output_*
output_1
output_2
output_3
output_4
output_5
cat output_*
3,8,1,9,2,4,6,5,10,7
3,1,7,4,10,9,6,2,8,5
6,1,10,8,7,3,5,2,4,9
5,3,2,4,7,6,1,9,8,10
6,4,3,2,7,9,8,1,5,10
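The example input has a single line. For files with many rows, join matches lines by their first field and expects sorted input, so unsorted or repeated first-column values may not come through unchanged. As an alternative sketch (a different technique from the join method above), awk can print the fields of every row in a chosen order. The helper name reorder_columns and the file rows_demo.csv are made up for illustration, and the same restriction on quoted commas applies.

```shell
# Hypothetical helper: print FILE with its columns in the order given by
# ORDER, a comma-separated list of 1-based column numbers such as "3,1,2".
reorder_columns() {
    awk -F, -v OFS=, -v order="$1" '
        BEGIN { n = split(order, idx, ",") }
        {
            line = $(idx[1])
            for (i = 2; i <= n; i++) line = line OFS $(idx[i])
            print line
        }' "$2"
}

# Demo: a two-row file whose rows share the same first field.
printf 'x,1,2\nx,3,4\n' > rows_demo.csv

# Build one random permutation of 1..3 and apply it to every row.
order=$(seq 1 3 | sort --random-sort | tr '\n' ',' | sed 's/,$//')
reorder_columns "${order}" rows_demo.csv
rm rows_demo.csv
```

Every row is rearranged with the same permutation, so the columns stay aligned across rows.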
This version is modified so that the first column stays in place and only the remaining columns are randomized. This is useful when the first column is an identifier column.
There are two changes to the code: seq 1 ${NUM_COLS} > cols becomes seq 2 ${NUM_COLS} > cols, and echo -n "1.1," is added so that each line of randomized_col_order begins with 1.1, (the first column).
NUM_COLS=10
NUM_RANDOMIZED_OUTPUT=5
INPUT_FILE=input.csv
OUTPUT_PREFIX=output
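# Column numbers 2..NUM_COLS; column 1 is left out of the shuffle
seq 2 ${NUM_COLS} > cols
# Each field list starts with 1.1 so the first column stays first
for x in `seq 1 ${NUM_RANDOMIZED_OUTPUT}`; do
    echo -n "1.1,"
    sort --random-sort cols | sed -r 's/^/1./' | tr "\n" , | sed -r 's/,$//'
    echo
done >randomized_col_order
# Join the input with itself, printing the fields in each shuffled order
for y in `seq 1 ${NUM_RANDOMIZED_OUTPUT}`; do
    current_order=`sed -n ${y}p randomized_col_order`
    join -t, -o "${current_order}" ${INPUT_FILE} ${INPUT_FILE} > ${OUTPUT_PREFIX}_${y}
done
# Clean up the temporary files
rm cols randomized_col_order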
cat output_*
1,3,4,6,2,5,10,9,7,8
1,5,7,10,8,4,6,2,9,3
1,2,4,7,3,5,10,9,8,6
1,6,2,3,7,4,5,8,9,10
1,6,3,7,9,4,5,2,8,10
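Both versions hard-code NUM_COLS. If that is inconvenient, the count could be read from the first line of the input instead. A small sketch, assuming every line has the same number of fields and using a made-up file count_demo.csv in place of the real input:

```shell
# Demo file with the same layout as the example input.
printf '1,2,3,4,5,6,7,8,9,10\n' > count_demo.csv

# Count the comma-separated fields on the first line only.
# awk prints a bare number, with no padding to strip.
NUM_COLS=$(head -n 1 count_demo.csv | awk -F, '{ print NF }')
echo "${NUM_COLS}"   # prints 10
rm count_demo.csv
```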