@tsabat
Created August 20, 2012 01:25

###Background

My table looks like this:

id (int), hash_value (text)

with the following settings:

InnoDB
utf8
utf8_unicode_ci
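
Concretely, that works out to DDL along these lines (a sketch; columns beyond id and hash_value are omitted):

CREATE TABLE hash_seeds (
  id int(11) NOT NULL AUTO_INCREMENT,
  hash_value text,
  PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;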

The data I store in the hash_seeds table looks like this:

id  hash_value
--  ---------------------
1   ab87g:nGsqr:7y9H4
2   ab8hu:ngsqr:88hYt

The hash values later become unique identifiers for rows in the pens table. For example, in the table

id  name    slug
--  ------  -------------
77  duder   nGsqr

the slug nGsqr is one segment of a hash_value entry (row 1 above).

###Problem

We expect to do lookups in a case-sensitive way, but the collation may prevent that. For example, the pens table will look like this someday:

id  name    slug
--  ------  -------------
77  duder   nGsqr
78  friend  ngsqr

We look up rows in the pens table by slug, and I think we'll have problems if we update or select with the wrong collation. Since the collation is utf8_unicode_ci (the _ci suffix means case-insensitive), we could conceivably update or delete the wrong row.

For example, in the hash_seeds table above, the query

SELECT hash_value FROM hash_seeds WHERE hash_value LIKE '%nGsqr%'

returns rows 1 and 2. However, the query

SELECT hash_value FROM hash_seeds WHERE hash_value LIKE _utf8 '%nGsqr%' COLLATE utf8_bin

returns only row 1.
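
The same per-query workaround would apply to the pens lookup itself, for example (a sketch):

SELECT id, name FROM pens WHERE slug = _utf8 'nGsqr' COLLATE utf8_bin

which would match only row 77, not row 78.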

###Question

I've tried resetting the collation of that table to utf8_bin, but without success. The query

SELECT hash_value FROM hash_seeds WHERE hash_value LIKE '%nGsqr%'

still returns both rows.
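
For context, the two forms of "resetting" I know of look like this (a sketch; the first only changes the table default used for columns added later, while the second converts existing columns, including hash_value, in place):

-- changes only the table-level default character set and collation
ALTER TABLE hash_seeds DEFAULT CHARACTER SET utf8 COLLATE utf8_bin;

-- converts every existing column to utf8 / utf8_bin
ALTER TABLE hash_seeds CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin;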

So, is there something I need to do to force the table to honor the new collation, or do I have to edit each query to include the _utf8 introducer and COLLATE clause?


tsabat commented Sep 3, 2012

OK, so I've made some progress. I've set the server defaults to character set utf8 and collation utf8_bin. The server looks like this now:

mysql> show variables like 'collation%';
+----------------------+----------+
| Variable_name        | Value    |
+----------------------+----------+
| collation_connection | utf8_bin |
| collation_database   | utf8_bin |
| collation_server     | utf8_bin |
+----------------------+----------+
3 rows in set (0.00 sec)

mysql> show variables like 'char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | utf8                       |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+

I followed the instructions on Stack Overflow for the my.cnf file.

I did have to add the skip-character-set-client-handshake directive to get all of the collation variables to read utf8_bin. Is that bad?
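
(For comparison, the per-connection alternative to forcing this server-side would be for each client to run something like the following after connecting — a sketch:)

-- sets character_set_client/connection/results and collation_connection for this session
SET NAMES utf8 COLLATE utf8_bin;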


tsabat commented Sep 3, 2012

For the move, we'll mysqldump the data and load the files using the collation settings above. I'll pass them through the following sed to change the collation on the columns:

sed 's/utf8_unicode_ci/utf8_bin/g' codepen.sql > codepen_new.sql

Do you see any problems with that? Here's an example table:

-- ----------------------------
--  Table structure for `hash_seeds`
-- ----------------------------
DROP TABLE IF EXISTS `hash_seeds`;
CREATE TABLE `hash_seeds` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `hash_value` text CHARACTER SET utf8 COLLATE utf8_bin,
  `created_at` datetime NOT NULL,
  `updated_at` datetime NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=60234 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
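
After loading, my sanity check (a sketch) will be to confirm that the column collation stuck and that the earlier query now matches only the exact-case row:

-- the Collation column should read utf8_bin for hash_value
SHOW FULL COLUMNS FROM hash_seeds;

-- should now return only row 1
SELECT hash_value FROM hash_seeds WHERE hash_value LIKE '%nGsqr%';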
