@huntc
Last active January 1, 2020 10:04
Some Rust to read a CSV file and fold over its lines - seems significantly slower than its Scala counterpart without having optimised it
use std::env;
use std::fs::File;
use std::io::{BufRead, BufReader};
use std::os::raw::c_double;

use chrono::{Duration, NaiveDateTime};

struct State {
    x0: Option<(NaiveDateTime, c_double)>,
    x1: Option<(NaiveDateTime, c_double)>,
    y: Option<(NaiveDateTime, c_double)>,
}

impl State {
    fn empty() -> State {
        State {
            x0: None,
            x1: None,
            y: None,
        }
    }
}

fn main() {
    let args: Vec<String> = env::args().collect();
    let filename = &args[1];
    let file = File::open(filename).unwrap();
    let reader = BufReader::new(file);
    reader
        .lines()
        .enumerate()
        .fold(State::empty(), |last_state, (_, line)| {
            let line = line.unwrap();
            let cols: Vec<&str> = line.split(',').collect();
            // The timestamp column is epoch seconds with a fractional part;
            // split it on the decimal point.
            let timestamp_components: Vec<_> = cols[0].split('.').collect();
            let time = NaiveDateTime::from_timestamp(
                timestamp_components[0].parse::<i64>().unwrap(),
                timestamp_components[1].parse::<u32>().unwrap() * 1000,
            );
            let x: c_double = cols[1].parse().unwrap();
            let y: c_double = cols[2].parse().unwrap();
            let new_state = match last_state {
                // ... Omitted as this is an algorithm I don't want to share,
                // but I don't believe that it is the cause
            };
            // Emit and reset once all three fields are populated.
            if let State {
                x0: Some((new_x0_time, new_x0)),
                x1: Some((_, new_x1)),
                y: Some((_, new_y)),
            } = new_state
            {
                println!("{}, {}, {}, {}", new_x0_time, new_x0, new_x1, new_y);
                State::empty()
            } else {
                new_state
            }
        });
}
@huntc (Author) commented Dec 19, 2019

Relevant Scala comparison:

  val s = Source.fromFile(args(0))
  try {
    s.getLines().foldLeft(State.empty) {
      case (lastState, line) =>
        val cols = line.split(",")
        val time = Instant.ofEpochMilli((cols(0).toDouble * 1000L).toLong)
        val x = cols(1).toDouble
        val y = cols(2).toDouble
        val newState = lastState match { ...

@mmstick commented Dec 20, 2019

That's a lot of allocation: each of those collect calls builds a fresh Vec on every iteration.

Also check out the lexical crate.

let mut cols = line.split(',');

// Split the timestamp column on the decimal point (not on a comma).
let mut timestamp = cols.next().unwrap().split('.');

let secs: i64 = lexical::parse(timestamp.next().unwrap()).unwrap();
let frac: u32 = lexical::parse(timestamp.next().unwrap()).unwrap();
let time = NaiveDateTime::from_timestamp(secs, frac * 1000);

let x: c_double = lexical::parse(cols.next().unwrap()).unwrap();
let y: c_double = lexical::parse(cols.next().unwrap()).unwrap();

@huntc (Author) commented Dec 20, 2019

Thanks! I’ll give it a go.

@mmstick commented Dec 20, 2019

If you really do want to collect, you can convert your closure into a move closure (move || {}), and above it create some vectors that you reuse between iterations; just make sure to clear them. That said, there's really no need to allocate here.
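
A minimal sketch of that suggestion, with one caveat: a reused Vec<&str> borrowing from each iteration's line won't pass the borrow checker, so this version stores byte ranges instead (col_spans and the manual splitting are illustrative, not from the gist):

use std::ops::Range;

// Allocate the scratch vector once and move it into the closure;
// clear() empties it each iteration while keeping the allocation.
let mut col_spans: Vec<Range<usize>> = Vec::with_capacity(3);

reader.lines().fold(State::empty(), move |last_state, line| {
    let line = line.unwrap();
    col_spans.clear();
    let mut start = 0;
    for (i, b) in line.bytes().enumerate() {
        if b == b',' {
            col_spans.push(start..i);
            start = i + 1;
        }
    }
    col_spans.push(start..line.len());
    let x: c_double = line[col_spans[1].clone()].parse().unwrap();
    let y: c_double = line[col_spans[2].clone()].parse().unwrap();
    // ... rest of the fold body as in the gist
    let _ = (x, y);
    last_state
});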

@huntc (Author) commented Dec 20, 2019

Actually, my colleague @longshorej also came up with:

let line = line.unwrap();
let mut cols = line.split(',');
let col_0 = cols.next().unwrap();
let col_1 = cols.next().unwrap();
let col_2 = cols.next().unwrap();
let mut timestamp_components = col_0.split('.');
let ts_0 = timestamp_components.next().unwrap();
let ts_1 = timestamp_components.next().unwrap();
let time = NaiveDateTime::from_timestamp(
    ts_0.parse::<i64>().unwrap(),
    ts_1.parse::<u32>().unwrap() * 1000,
);
let x: c_double = col_1.parse().unwrap();
let y: c_double = col_2.parse().unwrap();

...which should reduce the allocations. I didn't notice much of a saving here though either.

@mmstick commented Dec 20, 2019

If you're doing any sort of float parsing (c_double), note that Rust's default parser is optimized for precision rather than speed, so you should see a significant improvement in parsing times from lexical's speed-optimized parsers.

Additionally, unfortunately, the .lines() method on BufReader allocates a new String for each line read. You can eliminate the allocations with:

let mut line = String::new();

loop {
    line.clear(); // read_line appends, so reset the buffer each iteration
    let read = reader.read_line(&mut line).unwrap();
    if read == 0 {
        break; // EOF
    }

    let line = line.as_str();

    // Code here
}

@mmstick commented Dec 20, 2019

The linereader crate offers an alternative line reader with a .for_each() method for processing lines in an allocation-free way.
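
A minimal sketch of reading with that crate, via the next_line() API that also appears in a comment below (the file name and the UTF-8 handling here are illustrative assumptions):

use std::fs::File;

use linereader::LineReader;

fn main() {
    let file = File::open("data.csv").unwrap();
    let mut reader = LineReader::new(file);
    // next_line() yields &[u8] slices into an internal buffer,
    // so no per-line String is allocated.
    while let Some(line) = reader.next_line() {
        let line = line.unwrap();
        // Lines include the trailing delimiter; trim it before parsing.
        let line = std::str::from_utf8(line).unwrap().trim_end();
        let _ = line; // parse `line` here
    }
}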

@huntc (Author) commented Dec 20, 2019

Additionally, unfortunately, the .lines() method on BufReader allocates a new String for each line read. You can eliminate the allocations with:

Ah, I had wondered about that. I'll bet that's my problem.

@huntc (Author) commented Dec 20, 2019

linereader does offer an alternative line reader with a .for_each() method for interacting with lines in an allocation-free way.

Interesting. Again, thanks. Having to seek alternatives to core Rust is starting to put me off though...

@samsieber commented Dec 31, 2019

So, this was posted on Reddit, and I hope you don't mind the intrusion. I just have two ideas:

  • I assume you're running this in release mode, so I'd be curious to see what adding -C target-cpu=native via the RUSTFLAGS environment variable when compiling does. You'd have to compile it on the same machine that will be running it, though.

  • The println! macro needs to lock stdout (which is behind a mutex, I think?). You can lock it once at the top and use writeln! instead; a minimal sketch follows this list. I'm not as familiar with the performance implications on the JVM side, but I'm guessing that it gets optimized out in a single-threaded context.
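
A minimal sketch of the locking suggestion (the numeric loop stands in for the gist's actual output):

use std::io::{self, Write};

fn main() {
    let stdout = io::stdout();
    // Take the lock once; println! would lock and unlock on every call.
    let mut out = stdout.lock();

    for i in 0..10 {
        writeln!(out, "{}", i).unwrap();
    }
}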

@mmstick commented Dec 31, 2019

I still say that relying on Rust's native float parser, as opposed to using lexical, is going to be responsible for significant CPU costs, since Rust's float parser is optimized for precision, not speed.

@mariusmuja

As mentioned above, try to reduce the number of allocations. Two easy changes:

  • Use ArrayVec for stack allocation of cols. Instead of:

let cols: Vec<&str> = line.split(',').collect();

use:

let cols: ArrayVec<[&str; 3]> = line.split(',').take(3).collect();

  • Use LineReader to avoid allocating a String for each line:

let mut reader = LineReader::new(File::open(filename).unwrap());
while let Some(line) = reader.next_line() {
    // ...
}

@mariusmuja

Or you could use itertools's .collect_tuple() for a nicer API:

use itertools::Itertools;

// collect_tuple() returns an Option, so the result needs unwrapping.
let (timestamp_str, x_str, y_str) = line.split(',').take(3).collect_tuple().unwrap();
let (secs, nanos) = timestamp_str.split('.').take(2).collect_tuple().unwrap();

@neoeinstein

Interesting. Again, thanks. Having to seek alternatives to core Rust is starting to put me off though...

It should be noted that, when working at Rust's level, there are often many ways to do things, and the "right" way depends on the tradeoffs you want to make. One example is the float parser, as already mentioned (do you want the most precise parsing of the float value, or a good fast approximation?). Rust's core is, by design, relatively thin. It provides a good many things, done in a way that is acceptable for a good number of use cases. Depending on what you want to optimize for, there are ways to help you do that (lexical, linereader, and holding a lock on stdout, as examples).

Also, if you're worried about speed, you definitely want release mode (cargo run --release). And, if you're only going to use the binary on your own machine, using RUSTFLAGS="-C target-cpu=native" is a pretty good way to eke out a bit more performance from your SIMD registers and instructions.

@auterium commented Jan 1, 2020

I think println! locks and unlocks stdout on each call; try locking once and writing to the lock. Not sure how much speed improvement you can get, though.

Another possible improvement is to parallelize reading the file in chunks with rayon (see the sketch below). If the source file can fit at least twice in memory, you can buffer the parsed chunks, return them from each parallel iteration, and then join them for one single write (rayon guarantees the order of the returns). If the file is too large, you can write each parsed chunk to an individual temp file and then join them into your final output.
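
A minimal sketch of the in-memory variant, assuming the per-line work can be made independent (the gist's fold carries state across lines, so chunking only applies if that state can be split); parse_line and the chunk size are illustrative:

use rayon::prelude::*;

// Illustrative stand-in for the real per-line parsing.
fn parse_line(line: &str) -> (f64, f64) {
    let mut cols = line.split(',');
    let _timestamp = cols.next().unwrap();
    (
        cols.next().unwrap().parse().unwrap(),
        cols.next().unwrap().parse().unwrap(),
    )
}

fn main() {
    let contents = std::fs::read_to_string("data.csv").unwrap();
    let lines: Vec<&str> = contents.lines().collect();

    // par_chunks() hands fixed-size chunks to the thread pool; collect()
    // on an indexed parallel iterator preserves the original order.
    let parsed: Vec<Vec<(f64, f64)>> = lines
        .par_chunks(10_000)
        .map(|chunk| chunk.iter().map(|l| parse_line(l)).collect())
        .collect();

    // ... single ordered write of `parsed` here
    let _ = parsed;
}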

@masklinn commented Jan 1, 2020

I think println! locks and unlocks stdout on each call; try locking once and writing to the lock. Not sure how much speed improvement you can get, though.

Possibly some. The major issue is likely that in Rust stdout is line-buffered (always IIRC).

I'm assuming the JVM uses the C runtime behaviour of line-buffering stdout when it's connected to a tty but fully buffering it when connected to a pipe (we don't have a benchmark script, but this would usually be piped to /dev/null), so it's only going to flush stdout every 4 kB or so, whereas Rust is going to flush on every iteration.

Here's a loop just println-ing 10 million integers to stdout on my machine:

cargo run --release > /dev/null  9.00s user 5.56s system 99% cpu 14.583 total

locking and writeln-ing:

cargo run --release > /dev/null  8.42s user 6.20s system 99% cpu 14.733 total

wrapping StdoutLock into a BufWriter:

cargo run --release > /dev/null  1.05s user 0.03s system 99% cpu 1.089 total

Of course, the issue there is that you lose line buffering in "interactive" mode, so ideally you'd use something like atty to check whether your stdout is a tty, and probably box your writer as a trait object, e.g.:

use std::io::{self, BufWriter, Write};
use atty::Stream;

let stdout = io::stdout();
let mut out: Box<dyn Write> = if atty::is(Stream::Stdout) {
    Box::new(stdout.lock())                 // interactive: keep line buffering
} else {
    Box::new(BufWriter::new(stdout.lock())) // piped: batch the flushes
};

Here's using socat to connect stdout to a tty without being slowed down by the actual terminal:

cargo run --release > /dev/null  1.16s user 0.04s system 91% cpu 1.322 total
socat - EXEC:'cargo run --release',pty,setsid,ctty > /dev/null  17.15s user 24.01s system 194% cpu 21.205 total
