MySQL Character Set

I've had a lot of trouble with character encodings and MySQL lately. It seems MySQL uses Latin-1 by default (in MySQL cli tool, enter show variables like "character%";) , but many characters that are used in Drupal databases are utf-8. This issue arises when using MS Word to write down content initially and then copying and pasting it into the CMS.

You can reproduce this issue by inserting a node (or any field in any MySQL table) with a title like: ‘A title with smartquotes’, and trying to output it after fetching it from the database. Most frameworks and CMS already address the issue, and so you'd have to do it on a standalone script to see it.

This is something you will notice if you try outputting nodes from Drupal into a PHP page outside of Drupal, for instance, and those special characters (smart quotes, etc, ...) won't be displayed correctly. After some frustrating research, I have found the solution is simple. In PHP you must use the mysql_set_charset function, e.g.:

    <?php
    $conn = mysql_connect('localhost','user','xxx');
    mysql_set_charset('utf8', $conn);

This will ensure characters are treated as utf8, and all special characters should be back to normal on page.

More recently, I've run into the exact same problem with Python while using MySQLdb. Every time one of my MySQL fields would contain a smart quote, I'd get a UnicodeEncodeError. Despite some advice to switch to using ourSQL or even to switch all together to PostgreSQL, it turns out the solution with MySQLdb is as straightforward as the one for PHP. Instead of:

    db = MySQLdb.connect(host = "localhost", 
                       user = "user", 
                       passwd = "pass", 
                       db = "db_name")

You should add two parameters, use_unicode and charset:

    db = MySQLdb.connect(host = "localhost", 
                       user = "user", 
                       passwd = "pass", 
                       db = "db_name",
                       use_unicode = True,
                       charset = "utf8")

After that, you won't have any more error, and all your characters will be displayed properly.