
@colinsurprenant
Last active August 29, 2015 14:04
binary encoded strings problem in JRuby<->Java.
# encoding: utf-8
require "java"
require "bundler/setup"
require "protobuf" # gem "protobuf"
require "benchmark"
java_import "java.util.ArrayList"
class Message < Protobuf::Message
  required :string, :foo, 1
end
INPUT = "foo κόσμε"
COUNT = 1_000_000
# Take a protocol buffer serialized to a binary string, pass it to the Java world (into an ArrayList) and read it
# back, forcing a JRuby String <-> Java String conversion. The following 2 examples fail because the value comes
# back from the Java String as UTF-8 instead of the original binary bytes.
serialized = Message.encode(:foo => INPUT) # outputs ASCII-8BIT encoding
values = ArrayList.new([serialized])
begin
  Message.decode(values[0])
rescue
  # do nothing, we expected to fail
else
  # raised from the else branch so the bare rescue above cannot swallow it
  raise("it worked but we expected to fail")
end
serialized = Message.encode(:foo => INPUT) # outputs ASCII-8BIT encoding
values = ArrayList.new([serialized])
begin
  Message.decode(values[0].force_encoding("ASCII-8BIT"))
rescue
  # do nothing, we expected to fail
else
  # raised from the else branch so the bare rescue above cannot swallow it
  raise("it worked but we expected to fail")
end
# benchmark 3 alternative ways to pass binary-encoded strings between JRuby and Java
Benchmark.bmbm(30) do |b|
  b.report("ByteList") do
    COUNT.times do
      serialized = Message.encode(:foo => INPUT)
      values = ArrayList.new([Java::OrgJrubyUtil.ByteList.new(serialized.to_java_bytes)])
      value = values[0]
      protobuf = Message.decode(value)
      raise unless protobuf[:foo] == INPUT
    end
  end
end
Benchmark.bmbm(30) do |b|
  b.report("get_bytes") do
    COUNT.times do
      serialized = Message.encode(:foo => INPUT)
      values = ArrayList.new([serialized])
      value = values[0].to_java.get_bytes("ISO-8859-1")
      protobuf = Message.decode(String.from_java_bytes(value))
      raise unless protobuf[:foo] == INPUT
    end
  end
end
Benchmark.bmbm(30) do |b|
  b.report("to_java_object") do
    COUNT.times do
      serialized = Message.encode(:foo => INPUT)
      values = ArrayList.new([serialized.to_java_bytes.to_java_object])
      value = values[0]
      protobuf = Message.decode(value.to_s)
      raise unless protobuf[:foo] == INPUT
    end
  end
end
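The failure mode above can be reproduced without JRuby or the protobuf gem. The sketch below is only a model, not JRuby's actual conversion code: it assumes the binary string is widened byte-by-byte into chars on the Java side (an ISO-8859-1 view, which is what the get_bytes("ISO-8859-1") workaround suggests) and comes back encoded as UTF-8. The helpers encode_foo, simulate_java_round_trip and recover_bytes are hypothetical names introduced for illustration.

```ruby
# encoding: utf-8

# Hand-rolled protobuf wire bytes for `required :string, :foo, 1`:
# tag byte (field 1, wire type 2), a one-byte length, then the UTF-8 payload.
# Hypothetical helper; only handles payloads shorter than 128 bytes.
def encode_foo(text)
  payload = text.dup.force_encoding(Encoding::BINARY)
  raise ArgumentError, "payload too long for a one-byte length" if payload.bytesize > 127
  [(1 << 3) | 2, payload.bytesize].pack("C2") + payload
end

# ASSUMED model of the Ruby -> Java -> Ruby trip: every byte becomes one char
# (ISO-8859-1 view) and the chars come back encoded as UTF-8.
def simulate_java_round_trip(binary)
  binary.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# Inverse of the model above: map every char back to a single byte.
def recover_bytes(mangled)
  mangled.encode(Encoding::ISO_8859_1).force_encoding(Encoding::BINARY)
end

serialized = encode_foo("foo κόσμε")
mangled    = simulate_java_round_trip(serialized)
recovered  = recover_bytes(mangled)

puts "original bytesize:  #{serialized.bytesize}"
puts "mangled bytesize:   #{mangled.bytesize}"
puts "recovered == original: #{recovered == serialized}"
```

Under this model every byte above 0x7F grows to two bytes, so the length prefix written by the encoder no longer matches the payload, and relabeling the mangled string with force_encoding cannot repair it because the bytes themselves have changed.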
source "https://rubygems.org"
gem "protobuf"
Rehearsal ------------------------------------------------------------------
ByteList                       21.890000   0.160000  22.050000 ( 17.456000)
-------------------------------------------------------- total: 22.050000sec

                                     user     system      total        real
ByteList                       15.070000   0.080000  15.150000 ( 14.566000)

Rehearsal ------------------------------------------------------------------
get_bytes                      17.650000   0.090000  17.740000 ( 16.680000)
-------------------------------------------------------- total: 17.740000sec

                                     user     system      total        real
get_bytes                      16.710000   0.080000  16.790000 ( 16.314000)

Rehearsal ------------------------------------------------------------------
to_java_object                 14.070000   0.060000  14.130000 ( 13.698000)
-------------------------------------------------------- total: 14.130000sec

                                     user     system      total        real
to_java_object                 13.900000   0.060000  13.960000 ( 13.762000)
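The three results above come from three separate Benchmark.bmbm calls, each with its own rehearsal. As a minimal sketch, several reports can also share one bmbm block so they are rehearsed together and printed in a single table. The workloads below are pure-Ruby placeholders, not the gist's JRuby conversions, and the names N and DATA are assumptions made for the example.

```ruby
# encoding: utf-8
require "benchmark"

N    = 50_000          # placeholder iteration count, far smaller than COUNT
DATA = "foo κόσμε".b   # binary (ASCII-8BIT) copy of the sample input

# One bmbm block with two reports: both are rehearsed first, then measured,
# and the results share a single table.
results = Benchmark.bmbm(30) do |b|
  b.report("relabel") do
    N.times { DATA.dup.force_encoding(Encoding::UTF_8) }
  end
  b.report("transcode") do
    N.times { DATA.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8) }
  end
end

puts "reports measured: #{results.size}"
```

bmbm returns one Benchmark::Tms per report, which makes it possible to compare the real times programmatically rather than reading them off separate tables.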
@ismith

ismith commented Aug 11, 2014

Another way to do this seems to be calling s.encode("ASCII-8BIT") before decoding.

I.e., in the Storm context we've been discussing, put message.to_s into a tuple, and then on the other side call Message.decode(tuple[:field].encode('ASCII-8BIT')).
