Skip to content

MDEV-38260 applier hang due to sequence table access#5134

Merged
janlindstrom merged 1 commit into
10.11from
10.11-MDEV-38260
May 29, 2026
Merged

MDEV-38260 applier hang due to sequence table access#5134
janlindstrom merged 1 commit into
10.11from
10.11-MDEV-38260

Conversation

@sjaakola
Copy link
Copy Markdown
Contributor

Avoiding to mark sequence table write as a DDL transaction for galera applying.

Skipping commit for DDL marked GTID log event in applying as this would lead to double commit, if the transaction has also a real transaction. Such scnenario could happen if transaction has sequence table writes together with other innodb table writes.

The commit contains a new mtr test galera.MDEV-38260 for testing applying of a transaction having sequence table access when using GTID mode

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses a replication hang issue in Galera when using sequence tables (MDEV-38260) by skipping marking transactions as DDL during sequence table inserts when applying wsrep events. It also adds a new test case and debug logging. The review feedback highlights a critical issue in sql/wsrep_schema.cc where replacing the transaction commit logic with thd->transaction->cleanup() for DDL events can cause transaction leaks, lock conflicts, and GTID loss. Additionally, a format specifier mismatch (%d instead of %llu for ulonglong) was identified in the debug log added to sql/wsrep_applier.cc.

Comment thread sql/wsrep_schema.cc
Comment on lines +1667 to 1676
/* gtid slave state recording above has started a trasaction and called for
statemet commit. Resetting here the transaction state for the actual DDL
execution to happen by following events
*/
thd->transaction->cleanup();
MYSQL_COMMIT_TRANSACTION(thd->m_transaction_psi);
thd->m_transaction_psi= NULL;
thd->server_status&=
~(SERVER_STATUS_IN_TRANS | SERVER_STATUS_IN_TRANS_READONLY);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

Reverting the transaction commit logic to thd->transaction->cleanup() for in_ddl is highly problematic and dangerous.

When in_ddl is true (which is the case for all real DDL events), the GTID event writes to the mysql.gtid_slave_pos table (which is typically an InnoDB transactional table). By calling thd->transaction->cleanup() instead of committing, the SQL layer discards the transaction state (including clearing the registered storage engines) without committing or rolling back the changes in InnoDB.

This results in:

  1. Transaction Leak: An active, uncommitted transaction is left dangling in InnoDB, holding locks on the mysql.gtid_slave_pos table.
  2. Lock Conflicts: Subsequent transactions attempting to update the GTID position will block and eventually hit lock wait timeouts.
  3. GTID Loss: The GTID position update is never committed and may be rolled back, breaking replication consistency.

Since the root cause of the sequence table issue is already addressed in ha_sequence.cc (by preventing sequence writes from being marked as DDL), in_ddl will no longer be incorrectly set to true for sequence writes. Therefore, this workaround in wsrep_schema.cc is both unnecessary and harmful. We should revert this change and restore the original commit logic.

(Note: You can also remove the unused #include <mysql/psi/mysql_transaction.h> from the top of this file.)

    // Commit transaction if this GTID is part of DDL-clause because
    // DDL causes implicit commit assuming there is no multi statement
    // transaction ongoing.
    if((error= trans_commit_stmt(thd)))
      goto out;

    (void)trans_commit(thd);
  }

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The modified part in the wsrep_schema changes the refactoring done in commit: 133e26f which is fix for MDEV-34924. This code path is reached when applying of a GTID log event has FL_DDL flag, which should happen only for DDL statements. However, it has turned out that a write to sequence table can also have this flag, and this has caused hanging in some use cases.

This patch changes wsrep_schema transaction commit to happen only in the storage engine and to not change the wsrep-lib transaction state. If wsrep-lib transaction is committed at this point, it will cause double commit to happen at the end of the actual transaction processing. Note that the GTID log event, is the first event to be applied for a replicated write set.

Comment thread sql/wsrep_applier.cc
}

typ= ev->get_type_code();
WSREP_DEBUG("applying event %d, type %d, buf_len %zu", event, typ, buf_len);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The event variable is of type ulonglong (which is unsigned long long). Using %d as the format specifier is incorrect and can lead to compilation warnings/errors (especially with -Wformat enabled) or print corrupted/truncated values on 64-bit platforms.

Please use %llu instead of %d for printing ulonglong values.

    WSREP_DEBUG("applying event %llu, type %d, buf_len %zu", event, typ, buf_len);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

event is of type int, not unsigned long long, as gemini suggests

Copy link
Copy Markdown
Contributor

@janlindstrom janlindstrom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

Avoiding to mark sequence table write as a DDL transaction
for galera applying.

Skipping commit for DDL marked GTID log event in applying as this
would lead to double commit, if the transaction has also a real
transaction. Such scnenario could happen if transaction has sequence
table writes together with other innodb table writes.

The commit contains a new mtr test galera.MDEV-38260 for testing
applying of a transaction having sequence table access when using
GTID mode
@janlindstrom janlindstrom enabled auto-merge (rebase) May 29, 2026 06:10
@janlindstrom janlindstrom merged commit 3c435b8 into 10.11 May 29, 2026
16 of 18 checks passed
@grooverdan grooverdan deleted the 10.11-MDEV-38260 branch May 29, 2026 06:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

3 participants