-
-
Save igrigorik/83334277835625916cd6 to your computer and use it in GitHub Desktop.
require 'open-uri' | |
require 'zlib' | |
require 'yajl' | |
# References | |
# - https://developers.google.com/bigquery/preparing-data-for-bigquery#dataformats | |
# - https://developers.google.com/bigquery/docs/data#nested | |
# | |
def type(t) | |
return 'FLOAT' if t.is_a?(Float) | |
return 'INTEGER' if t.is_a?(Integer) | |
return 'STRING' if t.is_a?(String) | |
return 'BOOLEAN' if t.is_a?(TrueClass) || t.is_a?(FalseClass) | |
return 'RECORD' if t.is_a?(Hash) | |
return type(t.first) if t.is_a?(Array) | |
puts "Unknown type for #{t}, #{t.class}" | |
raise Exception | |
end | |
def mode(e) | |
if e.is_a? Array | |
'REPEATED' | |
else | |
'NULLABLE' | |
end | |
end | |
def traverse(target, event) | |
event.each_pair do |k,v| | |
desc = target.find {|e| e['name'] == k} || {} | |
target << desc if desc.empty? | |
desc['name'] = k | |
# Note: we skip empty REPEATED fields until we encounter a non-empty one. | |
# This may result in empty REPEATED declarations, which will be rejected | |
# by BigQuery... You'll have to handle this on your own. | |
next if v.nil? || (v.is_a?(Array) && v.first.nil?) | |
desc['type'] = type(v) | |
desc['mode'] = mode(v) | |
if desc['type'] == 'RECORD' | |
desc['fields'] ||= [] | |
v = [v] if desc['mode'] != 'REPEATED' | |
v.each do |e| | |
traverse(desc['fields'], e) unless e.nil? | |
end | |
end | |
end | |
end | |
@fields = [] | |
file = ARGV[0] | |
data = open(file) | |
if File.extname(file) == '.gz' | |
data = Zlib::GzipReader.new(StringIO.new(data.read)).read | |
end | |
Yajl::Parser.parse(data) do |event| | |
traverse(@fields, event) | |
end | |
def check(target) | |
target.each do |field| | |
if !(field.has_key?('name') && !field['name'].nil? && | |
field.has_key?('type') && !field['type'].nil?) | |
STDERR.puts "Warning: #{field} has an unknown type." | |
field['type'] = 'STRING' | |
field['mode'] = 'NULLABLE' | |
end | |
if field['fields'] | |
check(field['fields']) | |
end | |
end | |
end | |
check(@fields) | |
puts Yajl::Encoder.encode(@fields, :pretty => true) |
Great. I created something similar in Javascript:
https://gist.github.com/oyvindholmstad/f5ecd3dcf4e471f77907
Here is a quick one - a little less sophisticated for python:
https://gist.github.com/danielecook/3175c578c8a0118ead35
python bigquery_schema.py <filename>
Works with gzipped files too!
hi,
homes-MacBook-Air:bigquery-fluentd-docker-sample office_$ ruby json-bq-schema-generator.rb testbg.log
json-bq-schema-generator.rb:31:in traverse': undefined method
each_pair' for [[{"n"=>"def", "v"=>"3000"}]]:Array (NoMethodError)
from json-bq-schema-generator.rb:50:in block (2 levels) in traverse' from json-bq-schema-generator.rb:49:in
each'
from json-bq-schema-generator.rb:49:in block in traverse' from json-bq-schema-generator.rb:31:in
each_pair'
from json-bq-schema-generator.rb:31:in traverse' from json-bq-schema-generator.rb:65:in
block in
from /Library/Ruby/Gems/2.0.0/gems/yajl-ruby-1.2.1/lib/yajl.rb:37:in
call' from /Library/Ruby/Gems/2.0.0/gems/yajl-ruby-1.2.1/lib/yajl.rb:37:in
parse'from /Library/Ruby/Gems/2.0.0/gems/yajl-ruby-1.2.1/lib/yajl.rb:37:in
parse' from json-bq-schema-generator.rb:64:in
'homes-MacBook-Air:bigquery-fluentd-docker-sample office_$
Any idea? Thanks !!!
Hi,
I am facing this issue when trying to run it. I have installed yajl, yajl-ruby. May I know, what I am missing here, thanks
$ruby json-bq-schema-generator.rb data1.json
json-bq-schema-generator.rb:64:in `
satrox28, I got the same problem, for me it worked changing:
require 'yajl'
by
require 'yajl/yajl'
require 'yajl'
¯_(ツ)_/¯
Having the same problem with Yajl::Paser. require 'yajl/yajl' does not work for me. Please advise.
I have created a JSON to BQ schema generator json2bqschema
(https://github.com/ahsandar/json2bqschema) packaged in a docker container using this gist for use from CLI
. It has some changes , it is suppose to work on a single JSON object and not zip data sets. Also its using JSON
module rather than yajl
as the container size was 5x just to use yajl
. As it is for running on a single json object the performance gains are not an issue. Anyone is more than welcome to fork and extend this package. thanks to @igrigorik for this gist
This is super helpful thank you very much!