Migrating MySQL to utf8mb4 Safely With AI

MySQL’s utf8 character set is one of the great traps in the database world: it’s not real UTF-8. It only stores up to three bytes per character, which means it silently chokes on anything in the four-byte range — most emoji, some CJK characters, plenty of mathematical and historical symbols. The real, full Unicode encoding is utf8mb4. If your tables are still on the old utf8 (now aliased to utf8mb3), then somewhere a user has typed an emoji into a comment field and gotten either a truncated string or an Incorrect string value error, and you’ve probably got a support ticket about it.

Migrating to utf8mb4 is conceptually simple and operationally full of sharp edges: index length limits, collation choices, connection charset, and the fact that converting a big table rewrites it. I use AI to plan the migration, generate the conversion statements, and flag the gotchas specific to my schema. Then I run the whole thing on a replica and verify data round-trips correctly before production. The model plans the migration; I prove it’s safe on real data.

Find what’s actually on the old charset

You can’t convert what you haven’t inventoried. First, see what’s there:

SELECT table_schema, table_name, table_collation
FROM information_schema.tables
WHERE table_collation LIKE 'utf8\_%'
  AND table_schema NOT IN ('mysql','information_schema','performance_schema','sys');

SELECT table_schema, table_name, column_name, character_set_name, collation_name
FROM information_schema.columns
WHERE character_set_name = 'utf8'
  AND table_schema NOT IN ('mysql','information_schema','performance_schema','sys');

Charset can be set at the column level, overriding the table default, so I check both. I paste the inventory into AI and ask it to group the work and flag risks:

Here’s an inventory of tables and columns still on utf8 (utf8mb3). Group them into a migration plan, flag any columns where converting to utf8mb4 risks hitting the index key-length limit, and tell me which collation (utf8mb4_0900_ai_ci vs utf8mb4_unicode_ci) fits MySQL 8.0. Note anything that requires care.

Mind the index length limit

The classic failure: a 4-byte charset makes each character take more bytes, so an indexed VARCHAR(255) that fit under the old key-length limit can blow past it. With innodb_large_prefix and the modern DYNAMIC row format this is mostly a non-issue on current MySQL, but on older configs an index on VARCHAR(255) utf8mb4 exceeds the 767-byte legacy limit (255 × 4 = 1020). A good AI plan calls this out:

Three columns are indexed VARCHAR(255) and will exceed the 767-byte index prefix limit under utf8mb4 if your tables use the old COMPACT/REDUNDANT row format. Confirm innodb_default_row_format = DYNAMIC and innodb_large_prefix (default on 8.0) before converting. If you can’t change row format, shorten the indexed prefix to VARCHAR(191) or use a prefix index KEY (col(191)).

That 191 number (the largest VARCHAR that indexes cleanly at 767 bytes / 4) is the famous workaround, and I want AI to surface it for the right columns rather than blanket-applying it.

Convert the right way

There are two halves: the table data and the connection. For each table, convert both the default charset and the existing columns:

ALTER TABLE comments
  CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;

CONVERT TO CHARACTER SET rewrites the table and converts column data, which is what you want. But on a large table this is a blocking rewrite — so for anything big I drive the same ALTER through gh-ost to keep it online, exactly as I would for any other heavy schema change. I have AI generate both the direct ALTER for small tables and the gh-ost invocation for large ones, and I tell it the row counts so it picks correctly.

The connection charset matters just as much. If the app connects as utf8, four-byte characters get mangled in transit no matter what the table stores. Set it server-side and confirm the app’s driver agrees:

[mysqld]
character_set_server = utf8mb4
collation_server     = utf8mb4_0900_ai_ci

SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';

I make AI include the connection-string change for whatever driver the app uses, because a perfect table migration with a utf8 connection is still broken. The MySQL category covers the gh-ost mechanics in a separate guide.

Verify the round-trip on a replica

This is the step that proves the migration worked. On a replica, after converting, I insert and read back genuine four-byte content:

INSERT INTO comments (body) VALUES ('rocket 🚀 and 𝕄𝕒𝕥𝕙 symbols 𠀀');
SELECT body, HEX(body) FROM comments WHERE id = LAST_INSERT_ID();

If the emoji and the four-byte CJK character come back intact — and the HEX shows the full byte sequence rather than a 3F (?) replacement — the migration is real. If you see question marks, something in the chain is still utf8: the connection, a column, or the client. I also compare row counts and a checksum before and after conversion to confirm CONVERT TO CHARACTER SET didn’t silently lose or alter rows. Only after this round-trip passes on a replica do I run the production migration.

Guardrails

Convert connection and storage together. Either alone leaves the system broken. Verify both with SHOW VARIABLES.
Watch the index key length. On older row formats, VARCHAR(255) indexes break under utf8mb4. Confirm DYNAMIC row format or shorten indexed prefixes.
Use gh-ost for big tables. CONVERT TO CHARACTER SET rewrites the table and blocks; do it online for anything large.
Pick the collation deliberately. utf8mb4_0900_ai_ci is the modern MySQL 8.0 default with better Unicode handling, but mixing collations across joined columns causes errors and index-less comparisons. Keep them consistent.
Prove the round-trip. Insert real four-byte characters on a replica and confirm they come back byte-for-byte before going to production.

I keep the utf8mb4-migration prompt saved with the index-length and connection-charset checks baked in — there’s a starter at /prompts/. AI inventories the work, generates the conversion statements, and flags the index-length and collation traps faster than I’d track them down by hand. But the emoji that round-trips intact on a replica is the only thing that proves the migration actually works. Plan with the model, verify on real data, and you fix the silent-truncation bug without introducing a new one.

Find what’s actually on the old charset

Mind the index length limit

Convert the right way

Verify the round-trip on a replica

Guardrails

Download the Free 500-Prompt DevOps AI Toolkit