@raggi
Last active March 30, 2017 05:50
Example plain text gem indexer
specs.4.8
prerelease_specs.4.8
versions.list
names.list
specs
deps
gems

Prototype Gem Indexer

  • indexer.rb pulls down specs from the current servers, then builds a full example of the proposed formats.
  • example_*.rb are example reader programs. As you can see, example_read_versions.rb has comparable performance to example_read_marshal.rb under Ruby 1.9+.
  • example_read_deps.rb contains an example parser for the dependency format implemented by indexer. This format is particularly designed for resolver style use cases.
  • Some comments are also in the indexer itself.

Formats

names.list

This is just a line separated list of names, trivial to consume with xargs(1), and any other tool.

versions.list

This is a key -> values list mapping each name to its version-platform entries. While this duplicates platform information, removing that duplication may not be worth the extra format complexity, as platform-specific entries are much rarer than the omitted default ruby platform. The list is designed to be easily joined using tools like awk(1) to produce standard name-version[-platform].gem file names. It is also easy and quick to parse with any modern Ruby (see example_read_versions.rb).
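
As a rough illustration of the join described above, the following Ruby sketch expands one versions.list line into gem file names. The helper name gem_file_names and the sample line are made up for illustration; neither is part of the indexer.

```ruby
# Hypothetical helper: expand one versions.list line into
# name-version[-platform].gem file names.
def gem_file_names(line)
  name, versions = line.chomp.split(' ', 2)
  return [] if versions.nil?
  # The default ruby platform is omitted in the index, so every entry can be
  # appended to the name unchanged.
  versions.split(',').map { |v| "#{name}-#{v}.gem" }
end

# Illustrative line only, not taken from a real index.
puts gem_file_names("example 1.0.0,1.0.0-java,1.1.0")
```

Because the default platform is left out of the index, no special casing is needed when building the file name.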

deps/*

These files have a slightly more complex format, but for consistency they begin with an outer level format that is similar to versions.list, with a space separated key -> values list. Note that spaces are allowed in values, but there is only one key -> values list per line (so split(' ', 2)).

The values format is more complex and presently non-standard, but it is a textual representation of runtime only Gem::Dependency objects separated by commas. The csv value format is:

dependency_name:requirement1&requirement2

This makes some assumptions: there are no names or requirements containing : or & characters. I believe this is valid today. These characters could be exchanged for non-printable characters or path characters if necessary, although the relative human readability is a nice to have.
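
To make the assumption about reserved characters concrete, here is a minimal sketch of a writer that refuses to emit a line if a name or requirement contains one of the separators. The names RESERVED, encode_dependency and encode_line are hypothetical, and the sample dependencies are illustrative rather than taken from a real spec.

```ruby
require 'rubygems'

# Characters the line format relies on as separators (assumed set).
RESERVED = /[,:&\n]/

# Hypothetical helper: encode one dependency as name:req1&req2, raising if
# any part would collide with a separator.
def encode_dependency(dep)
  parts = [dep.name, *dep.requirements_list]
  bad = parts.find { |part| part =~ RESERVED }
  raise ArgumentError, "reserved character in #{bad.inspect}" if bad
  "#{dep.name}:#{dep.requirements_list.join('&')}"
end

# Hypothetical helper: build one line of a deps/<name> file.
def encode_line(version, deps)
  "#{version} #{deps.map { |d| encode_dependency(d) }.join(',')}"
end

deps = [Gem::Dependency.new('rack', '~> 1.4'), Gem::Dependency.new('i18n', '>= 0.6.1')]
puts encode_line('3.2.12', deps)
```

A check like this could run during a full index so that a colliding name is caught before it reaches clients.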

Performance

TL;DR: as good as or better than the existing format.

Full index from cached specs: ruby indexer.rb 136.29s user 78.71s system 117% cpu 3:03.19 total

Old version index: ruby example_read_marshal.rb 0.70s user 0.12s system 95% cpu 0.858 total

New version index: ruby example_read_versions.rb 0.70s user 0.13s system 99% cpu 0.838 total

Parsing the rails dependency index: ruby example_read_deps.rb 0.08s user 0.04s system 96% cpu 0.126 total

Bandwidth: After generic gzip transport compression, versions.list is half the size of the old specs.4.8 marshal format. The deps entries are about 20% the size of the specs files (202MB vs 1.1GB), and a gzip tarball of all deps files is only 2.7MB, vs 20MB for all the marshalled specs. It may be sensible to periodically preseed clients with these tarballs, followed by periodic incremental updates of single files. In all cases, it's important to note that these lightweight ASCII formats compress better through gzip, and also lend themselves to progressive updates and efficient syncing.

General:

After caching all the gemspecs, on my Macbook Air, generation of these indices for the whole of the rubygems.org gems data set took 3 minutes. Because the reader examples coerce all file formats into a Hash, they can merge progressive data as it arrives, which could result in near instantaneous usage for general rubygems servers and clients. It is possible to operate in an append-only manner in normal operation, with periodic full rebuilds, although that may require slightly more advanced client semantics. The idea would be to use an HTTP Range query to fetch any file data past a certain size (alongside a conditional fetch). The only additional complexity in the Range approach is that a checksum would be required in order to detect corruption, with a full refetch following any mismatch. These additional HTTP semantics are still to be fleshed out, but it should be noted that they are also very well suited to rsync based mirroring systems.
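
The append-only idea could be sketched roughly as below. This is an assumption-laden outline, not a worked-out protocol: the helper names tail_request and append_tail are hypothetical, and how the expected checksum reaches the client is left open because no checksums are generated yet.

```ruby
require 'net/http'
require 'uri'
require 'digest/sha1'

# Hypothetical helper: build a request for everything past the bytes we
# already hold locally.
def tail_request(uri, local_size)
  req = Net::HTTP::Get.new(uri.request_uri)
  req['Range'] = "bytes=#{local_size}-"
  req
end

# Hypothetical helper: append the fetched tail, then verify the whole file
# against a checksum obtained separately. Returns nil when verification
# fails, signalling that a full refetch is needed.
def append_tail(local_data, tail, expected_sha1)
  merged = local_data + tail
  Digest::SHA1.hexdigest(merged) == expected_sha1 ? merged : nil
end

# Illustrative data only.
local = "a 1.0\n"
tail  = "b 2.0\n"
sha   = Digest::SHA1.hexdigest(local + tail)
p append_tail(local, tail, sha)
p append_tail(local, "garbage", sha)
```

A periodic full rebuild on the server changes the earlier bytes of the file, so the checksum comparison is what tells a client that its local prefix is stale.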

Read performance for the versions + platforms list is almost exactly equal to that of specs.4.8 on Ruby 1.9+, and is actually faster on JRuby than the marshal format. On 1.8, performance is significantly worse than marshal, but not insurmountable. Coupled with the fact that 1.8 is on the way out, this is likely to be acceptable. The reader may be able to be further optimized.

There is some difference between specs.4.8 and versions.list, specifically that the versions in versions.list are plain text, whereas they are Gem::Version objects in specs.4.8. This may lead to some other performance issues later on in a working pipeline, however, as Gem::Version construction is presently lazy, the real world impact of this change is minimal, other than decoupling and reducing the size of the data format quite significantly.
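
As a small illustration of deferring Gem::Version construction, the sketch below keeps entries as plain text and only builds version objects at the point where an ordering is needed. The helper name latest_version and the sample entries are hypothetical.

```ruby
require 'rubygems'

# Hypothetical helper: pick the newest release from plain-text
# "version[-platform]" entries, constructing Gem::Version only to compare.
def latest_version(entries)
  entries.map { |e| e.split('-', 2).first }
         .uniq
         .max_by { |v| Gem::Version.new(v) }
end

# A plain string sort would wrongly place 1.9.0 above 1.10.0.
puts latest_version(%w[1.9.0 1.10.0 1.10.0-java 1.2.0])
```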

Client performance for clients such as Bundler that consume the deps indexes should improve in several ways. No application server is required for dependency lists. Dependency lists can be cached on disk, and refetched with conditional and potentially range queries. The files are not currently compressed, as this would cause potentially unnecessary disruption for progressive updates and prevent range queries. By contrast, using HTTP transport compression appropriately, and potentially enabling precompression as an HTTP server side only optimization, should both simplify client code and still enable fast and efficient data transfers. On disk size for clients may increase slightly; however, this is probably negligible.
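
A conditional refetch of a cached deps file might look roughly like this. It assumes the server returns an ETag header that the client stored alongside the cached body; the helper names conditional_request and resolve_body are hypothetical.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: ask for a deps file only if it has changed since the
# copy we cached, identified by the ETag the server sent last time.
def conditional_request(uri, cached_etag)
  req = Net::HTTP::Get.new(uri.request_uri)
  req['If-None-Match'] = cached_etag if cached_etag
  req
end

# Hypothetical helper: a 304 Not Modified response carries no body, so the
# cached copy is reused; any other status uses the fresh body.
def resolve_body(status, fresh_body, cached_body)
  status.to_i == 304 ? cached_body : fresh_body
end

req = conditional_request(URI('http://example.invalid/deps/rails'), '"abc123"')
puts req['If-None-Match']
```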

Shortcomings & TODOs

  • There's still one big index for versions. This may want to be split out.
  • The ASCII format may be brittle to extend, but it's probably no harder to change than Marshal.
  • Some example client implementations are needed, especially to demonstrate efficient HTTP semantics for fetching and caching multiple dependency files.
  • There are no checksums being generated for the files yet; .md5 and .sha1 files should potentially be generated alongside the indices.
  • There is no consideration yet for distribution platform signatures. These may be able to be added as files alongside, like the checksums, or they may want to be embedded. Embedding may result in more complex progressive updates and more complex parsing, which is not desired.
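
For the checksum item above, one possible shape is a small helper that writes digest files next to each index. This is only a sketch: the helper name write_checksums and the one-hex-digest-per-file layout are assumptions, not something the indexer does today.

```ruby
require 'digest/md5'
require 'digest/sha1'
require 'tmpdir'

# Hypothetical helper: write <path>.md5 and <path>.sha1 next to an index file,
# each containing the hex digest of the file's contents.
def write_checksums(path)
  { 'md5' => Digest::MD5, 'sha1' => Digest::SHA1 }.each do |ext, digest|
    File.open("#{path}.#{ext}", 'w') { |io| io.puts digest.file(path).hexdigest }
  end
end

# Demonstrate against a throwaway file rather than a real index.
Dir.mktmpdir do |dir|
  index = File.join(dir, 'versions.list')
  File.open(index, 'w') { |io| io.puts 'example 1.0.0' }
  write_checksums(index)
  puts File.read("#{index}.sha1")
end
```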
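# example_read_deps.rb
require 'rubygems'

# gem name => version => [Gem::Dependency, ...]
reqs_hash = Hash.new { |h, n| h[n] = Hash.new { |rh, v| rh[v] = [] } }
name = 'rails'
open("deps/#{name}") do |io|
  io.each_line do |line|
    line.chomp!
    version, deps = line.split(' ', 2)
    # A version without runtime dependencies has nothing after the space.
    deps = deps.to_s.split(',')
    deps.map! do |str|
      # Use dep_name here: assigning to `name` would clobber the outer gem
      # name and file the results under the last dependency instead.
      dep_name, reqs = str.split(':', 2)
      reqs = reqs.split('&')
      Gem::Dependency.new(dep_name, reqs)
    end
    reqs_hash[name][version].concat deps
  end
end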
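# example_read_marshal.rb
# Baseline reader: load the existing Marshal index and group it by gem name.
specs = Marshal.load(File.read('specs.4.8'))
spec_hash = Hash.new { |h, n| h[n] = [] }
specs.each do |n, v, p|
  spec_hash[n] << [v, p]
end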
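# example_read_versions.rb
# gem name => [[version, platform], ...]
specs_hash = Hash.new { |h, n| h[n] = [] }
open('versions.list') do |io|
  io.each_line do |line|
    # Strip the newline so it does not end up in the last version or platform.
    line.chomp!
    name, versions = line.split(' ', 2)
    versions = versions.split(',')
    versions.map! { |v| v.split('-', 2) }
    # Entries without a platform suffix are the default ruby platform.
    versions.each { |v| v[1] ||= 'ruby' }
    specs_hash[name].concat versions
  end
end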
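#!/usr/bin/env ruby
# indexer.rb
require 'rubygems'
require 'open-uri'

specs = []
%w[specs.4.8 prerelease_specs.4.8].each do |idx|
  unless File.exist? idx
    begin
      open(idx, 'wb+') do |io|
        # URI.open is needed on Ruby 3.0+, where Kernel#open no longer
        # accepts URLs; it is available from Ruby 2.5.
        io << Gem.gunzip(URI.open("http://production.cf.rubygems.org/#{idx}.gz").read)
      end
    rescue
      # Remove the partial file so the next run retries the download.
      File.unlink idx
      raise
    end
  end
  specs.concat Marshal.load(File.read(idx))
end

# Group "version[-platform]" strings by gem name; the default ruby platform
# is omitted.
spec_hash = Hash.new { |h, n| h[n] = [] }
specs.each do |n, v, p|
  if p == 'ruby'
    spec_hash[n] << v.to_s
  else
    spec_hash[n] << "#{v}-#{p}"
  end
end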

# names.list is useful for first-stage discovery in mirrors, and also for the
# command line metaphone/hamming distance helpers.
open('names.list', 'w+') { |io| io.puts(*spec_hash.keys) }

# versions.list is useful for second stage discovery in mirrors, and also for
# single-gem command line installations or progressive (unresolved)
# installations. It is also useful for platform gem discovery. In general this
# would be kept up to date similarly to specs.4.8, but maybe a future design
# could split this out by alphanumerics, or use append & checksum semantics in
# order to allow for HTTP Range queries in updates. This would work well for
# incrementals. See example_read_versions.rb for an example of how this might
# be consumed in a progressive safe way. Periodic complete rewrites for data
# efficiency would generally not affect the system in an adverse manner, and
# could be daily/weekly/etc.
open('versions.list', 'w+') do |io|
  spec_hash.each do |name, versions|
    io.puts "#{name} #{versions.join(",")}"
  end
end

require 'fileutils'
require 'thread'

FileUtils.mkdir_p 'specs'
FileUtils.mkdir_p 'deps'

$print_mutex ||= Mutex.new
$total = specs.size * 2
$count = 0

# Thread-safe progress counter; prints a status line every 1000 items.
def progress
  $print_mutex.synchronize do
    $count += 1
    return unless $count % 1000 == 0
    print "\r#{$count}/#{$total} #{(($count/$total.to_f) * 100).to_i} %"
  end
end

require 'net/http/persistent'
http = Net::HTTP::Persistent.new
spec_q = Queue.new
deps_q = Queue.new

# Fetch all the specs, so we can get dependencies. This isn't really required if
# we had the gemcutter DB to hand, but it also serves as an example of how to do
# this. If we modified the rubygems http client code to support HTTP transport
# compression, then we could have raw .gemspec files in this kind of filesystem
# layout, and rely on transport compressors (mirrors that care about efficiency
# can obviously precompress in the standard manners).
spec_ts = Array.new(50) do
  Thread.new do
    # A nil entry on the queue tells the worker to stop.
    while nvp = spec_q.pop
      progress
      n, v, p = nvp
      file = "#{n}-#{v}#{'-' + p unless p == 'ruby'}.gemspec"
      if File.exist?("specs/#{file}")
        deps_q << file
        next
      end
      uri = "http://production.cf.rubygems.org/quick/Marshal.4.8/#{file}.rz"
      http.request(URI uri) do |res|
        case res
        when Net::HTTPSuccess
          open("specs/#{file}", "wb+") { |o| o << Gem.inflate(res.body) }
        else
          $stderr.puts "\nFailed download: specs/#{file}"
        end
      end
      deps_q << file
    end
  end
end

specs.each do |nvp|
  spec_q << nvp
end
spec_ts.size.times { spec_q << nil }

# Generate dependency files. These dependency files have a slightly different
# format, as there is currently no good built-in ascii format for
# Gem::Dependency. In order to keep the first order parsing similar to that of
# versions.list, the outer format is `name csv`, and the inner format of csv is
# then `dependency name:requirement1&requirement2`. We will need to ensure that
# there are never any collisions with these characters, or consider moving the
# format to something that will not collide (e.g. non-printable characters). The
# human readable format is convenient for debugging and non-ruby interaction,
# however, so it's a nice to have.
deps_ts = Array.new(50) do
  Thread.new do
    while specfile = deps_q.pop
      progress
      # Skip specs whose download failed above.
      next unless File.exist?("specs/#{specfile}")
      spec = begin
        Marshal.load(File.read("specs/#{specfile}"))
      rescue
        $stderr.puts "\nCorrupt spec: #{specfile}"
        File.unlink("specs/#{specfile}")
        next
      end
      file = "deps/#{spec.name}"
      # Append one line per version to the gem's deps file.
      open(file, "a+") do |io|
        deps = spec.dependencies.select { |d| d.respond_to?(:type) ? d.type == :runtime : true }
        deps.map! { |d| d.kind_of?(Array) ? "#{d.first} #{d[1]}" : "#{d.name}:#{d.requirements_list.join("&")}" }
        io.puts "#{spec.version.to_s} #{deps.join(",")}"
      end
    end
  end
end

# Wait for the spec workers to drain their queue before telling the deps
# workers to stop.
until spec_q.empty?
  spec_ts.each { |t| t.join(0.01) }
  sleep 1
end
spec_ts.each { |t| t.join }

deps_ts.size.times { deps_q << nil }
until deps_q.empty?
  deps_ts.each { |t| t.join(0.01) }
  sleep 1
end
deps_ts.each { |t| t.join }
puts

raggi commented Feb 20, 2013

  • Drop names.list
  • Use a single separator for all tuples on a line
  • Validate out the separators from Specification and check during full index
  • Implement incremental indexing
  • Move back to pre-gzip
  • Consider a long-term growth rate model
