Skip Navigation

Why would a UTF-8 MySQL backup contain invalid UTF-8 characters?

I’ve been running into several problems with restoring MySQL backups. Namely, the backups come from an environment other than the one I’m working in and I’m forced to remove superuser commands contained in the backups.

The problem is when trying to remove those commands I’m constantly getting UTF-8 encoding errors because there are loads of invalid character sequences.

Why would MySQL encode a backup as UTF-8 if the data isn’t actually UTF-8? This feels like bad design to me.

18 comments
  • MySQL doesn’t actually check if all data is valid UTF-8 during the dump. It simply wraps the raw content in UTF-8

    What versions are you using?

    • Yet as a developer I’m expected to deal with crazy stuff like ASCII, weird encoding standards for email, Punycode, etc. but MySQL developers couldn’t figure out how to encode characters properly while dumping the database?

      • I'd love to be a superior asshole because I use ⭐Microsoft SQL⭐ hair flip but they charge you for convenience and call it the industry standard which I find kind of abhorrent

        sorry to trauma dump

  • Encoding is hard. Especially when your data comes from web forms or CSV files. And MySQL needed three tries to get UTF-8 right and you need DB admins and often programmers as well who know this. So not everything MySQL calls UTF-8 actually is.

    And often enough it took a long while for something to actually reach UTF-8 status. And idiots not converting the data leads to databases with a mixture of encodings.

    • I guess what gets me is that it’s writing to a UTF-8 file so you’d expect that file to contain UTF-8 only. Hell, I’d take UTF-8 with Base64 encoded data for binary data over the hodgepodge .sql file coming out of the thing.

      • There is no such thing* as a UTF-8 file. It's just text encoded in some way. It's only a UTF-8 file if everything is encoded as UTF-8 which it's evidently not.

        You can even tell MySQL to export perfectly valid UTF-8 text encoded as ISO 8859-1 to import into a UTF-8 table without any troubles (maybe apart from stuff that could not be encoded in ISO 8859-1).

        *Yes, technically there could be a BOM at the beginning but almost no tool uses that and most get confused by it. And it would still not force any data written to it to be UTF-8.

  • Might be best to setup an identical environment to the one it was backed up from.

    encoding can be inconsistent across platforms.

18 comments