@skade
Created November 26, 2014 11:01
A quick way to break the input handling of MongoDB using umlauts
# encoding: UTF-8
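# Uses the legacy 1.x Ruby driver API (MongoClient, ensure_index, insert, remove).
require 'mongo'
include Mongo
# Connects to the driver's default host and port.
mongo_client = MongoClient.new
db = mongo_client.db("test")
coll = db.collection("txt")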
# Create a text index on `content`, with German as the default language.
coll.ensure_index(
{ content: "text" },
{ default_language: "de" }
)
# Start from an empty collection, then insert the three test documents.
coll.remove({})
coll.insert({ content: "Test" })
coll.insert({ content: "test" })
coll.insert({ content: "Überlin" })
puts "contents"
# Print each stored document; an outer `puts` here would also print the return value of `each`.
coll.find.each { |d| puts d.inspect }
puts "search tests"
puts "-- without Umlaut"
puts "---- uppercase query"
print "number of results: "
puts coll.find("$text" => { "$search" => "Test", "$language" => "de" }).count
puts "---- lowercase query"
print "number of results: "
puts coll.find("$text" => { "$search" => "test", "$language" => "de" }).count
puts "-- with Umlaut"
puts "---- uppercase query"
print "number of results: "
puts coll.find("$text" => { "$search" => "Überlin", "$language" => "de" }).count
puts "---- lowercase query"
print "number of results: "
puts coll.find("$text" => { "$search" => "überlin", "$language" => "de" }).count
# Decomposed form: base letter (U+0055 or U+0075) followed by the combining diaeresis U+0308.
puts "-- decomposed Umlaut"
puts "---- uppercase query"
print "number of results: "
puts coll.find("$text" => { "$search" => "\u0055\u0308berlin", "$language" => "de" }).count
puts "---- lowercase query"
print "number of results: "
puts coll.find("$text" => { "$search" => "\u0075\u0308berlin", "$language" => "de" }).count
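
The decomposed queries above send a different code point sequence than the precomposed "Ü" stored in the collection. The following is a minimal sketch, assuming Ruby 2.2 or later (for `String#unicode_normalize`), of bringing a search term into NFC before it is passed to `$search`. The helper name `normalize_query` is hypothetical and not part of the original script or the mongo driver. Whether normalizing changes the result counts reported by MongoDB is not something this sketch verifies.

```ruby
# encoding: UTF-8

# Hypothetical helper (an assumption, not from the original gist):
# returns the search term in Unicode Normalization Form C (NFC).
def normalize_query(term)
  term.unicode_normalize(:nfc)
end

composed   = "\u00DCberlin"        # precomposed U+00DC, then "berlin"
decomposed = "\u0055\u0308berlin"  # "U" + combining diaeresis U+0308

# Ruby compares strings by their bytes, so the two forms are not equal ...
puts composed == decomposed                    # prints "false"
# ... but they are equal once the decomposed form is normalized to NFC.
puts composed == normalize_query(decomposed)   # prints "true"
```

The normalized string could then be used in place of the raw term, for example `"$search" => normalize_query(term)`, so that both spellings of the query reach the server as the same bytes.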