Skip to content

pt-table-sync

NAME

pt-table-sync - Synchronize MySQL table data efficiently.

SYNOPSIS

Usage

pt-table-sync [OPTIONS] DSN [DSN]

pt-table-sync synchronizes data efficiently between MySQL tables.

This tool changes data, so for maximum safety, you should back up your data before using it. When synchronizing a server that is a replication replica with the --replicate or --sync-to-source methods, it always makes the changes on the replication source, never the replication replica directly. This is in general the only safe way to bring a replica back in sync with its source; changes to the replica are usually the source of the problems in the first place. However, the changes it makes on the source should be no-op changes that set the data to their current values, and actually affect only the replica.

Sync db.tbl on host1 to host2:

pt-table-sync --execute h=host1,D=db,t=tbl h=host2

Sync all tables on host1 to host2 and host3:

pt-table-sync --execute host1 host2 host3

Make replica1 have the same data as its replication source:

pt-table-sync --execute --sync-to-source replica1

Resolve differences that pt-table-checksum found on all replicas of source1:

pt-table-sync --execute --replicate percona.checksum source1

Same as above but only resolve differences on replica1:

pt-table-sync --execute --replicate percona.checksum \
  --sync-to-source replica1

Sync source2 in a source-source replication configuration, where source2’s copy of db.tbl is known or suspected to be incorrect:

pt-table-sync --execute --sync-to-source h=source2,D=db,t=tbl

Note that in the source-source configuration, the following will NOT do what you want, because it will make changes directly on source2, which will then flow through replication and change source1’s data:

# Don't do this in a source-source setup!
pt-table-sync --execute h=source1,D=db,t=tbl source2

RISKS

WARNING: pt-table-sync changes data! Before using this tool, please:

  • Read the tool’s documentation

  • Review the tool’s known “BUGS”

  • Test the tool on a non-production server

  • Backup your production server and verify the backups

pt-table-sync is mature, proven in the real world, and well tested, but if used improperly it can have adverse consequences. Always test syncing first with --dry-run and --print.

DESCRIPTION

pt-table-sync does one-way and bidirectional synchronization of table data. It does not synchronize table structures, indexes, or any other schema objects. The following describes one-way synchronization. “BIDIRECTIONAL SYNCING” is described later.

This tool is complex and functions in several different ways. To use it safely and effectively, you should understand three things: the purpose of --replicate, finding differences, and specifying hosts. These three concepts are closely related and determine how the tool will run. The following is the abbreviated logic:

if DSN has a t part, sync only that table:
   if 1 DSN:
      if --sync-to-source:
         The DSN is a replica.  Connect to its source and sync.
   if more than 1 DSN:
      The first DSN is the source.  Sync each DSN in turn.
else if --replicate:
   if --sync-to-source:
      The DSN is a replica.  Connect to its source, find records
      of differences, and fix.
   else:
      The DSN is the source.  Find replicas and connect to each,
      find records of differences, and fix.
else:
   if only 1 DSN and --sync-to-source:
      The DSN is a replica.  Connect to its source, find tables and
      filter with --databases etc, and sync each table to the source.
   else:
      find tables, filtering with --databases etc, and sync each
      DSN to the first.

pt-table-sync can run in one of two ways: with --replicate or without. The default is to run without --replicate which causes pt-table-sync to automatically find differences efficiently with one of several algorithms (see “ALGORITHMS”). Alternatively, the value of --replicate, if specified, causes pt-table-sync to use the differences already found by having previously ran pt-table-checksum with its own --replicate option. Strictly speaking, you don’t need to use --replicate because pt-table-sync can find differences, but many people use --replicate if, for example, they checksum regularly using pt-table-checksum then fix differences as needed with pt-table-sync. If you’re unsure, read each tool’s documentation carefully and decide for yourself, or consult with an expert.

Regardless of whether --replicate is used or not, you need to specify which hosts to sync. There are two ways: with --sync-to-source or without. Specifying --sync-to-source makes pt-table-sync expect one and only replica DSN on the command line. The tool will automatically discover the replica’s source and sync it so that its data is the same as its source. This is accomplished by making changes on the source which then flow through replication and update the replica to resolve its differences. Be careful though: although this option specifies and syncs a single replica, if there are other replicas on the same source, they will receive via replication the changes intended for the replica that you’re trying to sync.

Alternatively, if you do not specify --sync-to-source, the first DSN given on the command line is the source host. There is only ever one source host. If you do not also specify --replicate, then you must specify at least one other DSN as the destination host. There can be one or more destination hosts. Source and destination hosts must be independent; they cannot be in the same replication topology. pt-table-sync will die with an error if it detects that a destination host is a replica because changes are written directly to destination hosts (and it’s not safe to write directly to replicas). Or, if you specify --replicate (but not --sync-to-source) then pt-table-sync expects one and only one source DSN on the command line. The tool will automatically discover all the source’s replicas and sync them to the source. This is the only way to sync several (all) replicas at once (because --sync-to-source only specifies one replica).

Each host on the command line is specified as a DSN. The first DSN (or only DSN for cases like --sync-to-source) provides default values for other DSNs, whether those other DSNs are specified on the command line or auto-discovered by the tool. So in this example,

pt-table-sync --execute h=host1,u=msandbox,p=msandbox h=host2

the host2 DSN inherits the u and p DSN parts from the host1 DSN. Use the --explain-hosts option to see how pt-table-sync will interpret the DSNs given on the command line.

LIMITATIONS

Replicas using row-based replication

pt-table-sync requires statement-based replication when used with the --sync-to-source or --replicate option. Therefore it will set binlog_format=STATEMENT on the source for its session if required. To do this user must have SUPER privilege.

OUTPUT

If you specify the --verbose option, you’ll see information about the differences between the tables. There is one row per table. Each server is printed separately. For example,

# Syncing h=host1,D=test,t=test1
# DELETE REPLACE INSERT UPDATE ALGORITHM START    END      EXIT DATABASE.TABLE
#      0       0      3      0 Chunk     13:00:00 13:00:17 2    test.test1

Table test.test1 on host1 required 3 INSERT statements to synchronize and it used the Chunk algorithm (see “ALGORITHMS”). The sync operation for this table started at 13:00:00 and ended 17 seconds later (times taken from NOW() on the source host). Because differences were found, its “EXIT STATUS” was 2.

If you specify the --print option, you’ll see the actual SQL statements that the script uses to synchronize the table if --execute is also specified.

If you want to see the SQL statements that pt-table-sync is using to select chunks, nibbles, rows, etc., then specify --print once and --verbose twice. Be careful though: this can print a lot of SQL statements.

There are cases where no combination of INSERT, UPDATE or DELETE statements can resolve differences without violating some unique key. For example, suppose there’s a primary key on column a and a unique key on column b. Then there is no way to sync these two tables with straightforward UPDATE statements:

+---+---+  +---+---+
| a | b |  | a | b |
+---+---+  +---+---+
| 1 | 2 |  | 1 | 1 |
| 2 | 1 |  | 2 | 2 |
+---+---+  +---+---+

The tool rewrites queries to DELETE and REPLACE in this case. This is automatically handled after the first index violation, so you don’t have to worry about it.

Be careful when using pt-table-sync in any source-source setup. Source-source replication is inherently tricky, and it’s easy to make mistakes. You need to be sure you’re using the tool correctly for source-source replication. See the “SYNOPSIS” for the overview of the correct usage.

Also be careful with tables that have foreign key constraints with ON DELETE or ON UPDATE definitions because these might cause unintended changes on the child tables. See --[no]check-child-tables.

In general, this tool is best suited when your tables have a primary key or unique index. Although it can synchronize data in tables lacking a primary key or unique index, it might be best to synchronize that data by another means.

REPLICATION SAFETY

Synchronizing a replication source and replica safely is a non-trivial problem, in general. There are all sorts of issues to think about, such as other processes changing data, trying to change data on the replica, whether the destination and source are a source-source pair, and much more.

In general, the safe way to do it is to change the data on the source, and let the changes flow through replication to the replica like any other changes. However, this works only if it’s possible to REPLACE into the table on the source. REPLACE works only if there’s a unique index on the table (otherwise it just acts like an ordinary INSERT).

If your table has unique keys, you should use the --sync-to-source and/or --replicate options to sync a replica to its source. This will generally do the right thing. When there is no unique key on the table, there is no choice but to change the data on the replica, and pt-table-sync will detect that you’re trying to do so. It will complain and die unless you specify --no-check-replica (see --[no]check-replica).

If you’re syncing a table without a primary or unique key on a source-source pair, you must change the data on the destination server. Therefore, you need to specify --no-bin-log for safety (see --[no]bin-log). If you don’t, the changes you make on the destination server will replicate back to the source server and change the data there!

The generally safe thing to do on a source-source pair is to use the --sync-to-source option so you don’t change the data on the destination server. You will also need to specify --no-check-replica to keep pt-table-sync from complaining that it is changing data on a replica.

ALGORITHMS

pt-table-sync has a generic data-syncing framework which uses different algorithms to find differences. The tool automatically chooses the best algorithm for each table based on indexes, column types, and the algorithm preferences specified by --algorithms. The following algorithms are available, listed in their default order of preference:

Chunk

Finds an index whose first column is numeric (including date and time types), and divides the column’s range of values into chunks of approximately --chunk-size rows. Syncs a chunk at a time by checksumming the entire chunk. If the chunk differs on the source and destination, checksums each chunk’s rows individually to find the rows that differ.

It is efficient when the column has sufficient cardinality to make the chunks end up about the right size.

The initial per-chunk checksum is quite small and results in minimal network traffic and memory consumption. If a chunk’s rows must be examined, only the primary key columns and a checksum are sent over the network, not the entire row. If a row is found to be different, the entire row will be fetched, but not before.

Note that this algorithm will not work if chunking a char column where all the values start with the same character. In that case, the tool will exit and suggest picking a different algorithm.

Nibble

Finds an index and ascends the index in fixed-size nibbles of --chunk-size rows, using a non-backtracking algorithm (see pt-archiver for more on this algorithm). It is very similar to “Chunk”, but instead of pre-calculating the boundaries of each piece of the table based on index cardinality, it uses LIMIT to define each nibble’s upper limit, and the previous nibble’s upper limit to define the lower limit.

It works in steps: one query finds the row that will define the next nibble’s upper boundary, and the next query checksums the entire nibble. If the nibble differs between the source and destination, it examines the nibble row-by-row, just as “Chunk” does.

GroupBy

Selects the entire table grouped by all columns, with a COUNT(*) column added. Compares all columns, and if they’re the same, compares the COUNT(*) column’s value to determine how many rows to insert or delete into the destination. Works on tables with no primary key or unique index.

Stream

Selects the entire table in one big stream and compares all columns. Selects all columns. Much less efficient than the other algorithms, but works when there is no suitable index for them to use.

BIDIRECTIONAL SYNCING

Bidirectional syncing is an experimental feature. To make it work reliably there are a number of strict limitations:

* only works when syncing one server to other independent servers
* does not work in any way with replication
* requires that the table(s) are chunkable with the Chunk algorithm
* is not N-way, only bidirectional between two servers at a time
* does not handle DELETE changes

For example, suppose we have three servers: c1, r1, r2. c1 is the central server, a pseudo-source to the other servers (viz. r1 and r2 are not replicas to c1). r1 and r2 are remote servers. Rows in table foo are updated and inserted on all three servers and we want to synchronize all the changes between all the servers. Table foo has columns:

id    int PRIMARY KEY
ts    timestamp auto updated
name  varchar

Auto-increment offsets are used so that new rows from any server do not create conflicting primary key (id) values. In general, newer rows, as determined by the ts column, take precedence when a same but differing row is found during the bidirectional sync. “Same but differing” means that two rows have the same primary key (id) value but different values for some other column, like the name column in this example. Same but differing conflicts are resolved by a “conflict”. A conflict compares some column of the competing rows to determine a “winner”. The winning row becomes the source and its values are used to update the other row.

There are subtle differences between three columns used to achieve bidirectional syncing that you should be familiar with: chunk column (--chunk-column), comparison column(s) (--columns), and conflict column (--conflict-column). The chunk column is only used to chunk the table; e.g. “WHERE id >= 5 AND id < 10”. Chunks are checksummed and when chunk checksums reveal a difference, the tool selects the rows in that chunk and checksums the --columns for each row. If a column checksum differs, the rows have one or more conflicting column values. In a traditional unidirectional sync, the conflict is a moot point because it can be resolved simply by updating the entire destination row with the source row’s values. In a bidirectional sync, however, the --conflict-column (in accordance with other --conflict-* options list below) is compared to determine which row is “correct” or “authoritative”; this row becomes the “source”.

To sync all three servers completely, two runs of pt-table-sync are required. The first run syncs c1 and r1, then syncs c1 and r2 including any changes from r1. At this point c1 and r2 are completely in sync, but r1 is missing any changes from r2 because c1 didn’t have these changes when it and r1 were synced. So a second run is needed which syncs the servers in the same order, but this time when c1 and r1 are synced r1 gets r2’s changes.

The tool does not sync N-ways, only bidirectionally between the first DSN given on the command line and each subsequent DSN in turn. So the tool in this example would be ran twice like:

pt-table-sync --bidirectional h=c1 h=r1 h=r2

The --bidirectional option enables this feature and causes various sanity checks to be performed. You must specify other options that tell pt-table-sync how to resolve conflicts for same but differing rows. These options are:

* --conflict-column
* --conflict-comparison
* --conflict-value
* --conflict-threshold
* --conflict-error  (optional)

Use --print to test this option before --execute. The printed SQL statements will have comments saying on which host the statement would be executed if you used --execute.

Technical side note: the first DSN is always the “left” server and the other DSNs are always the “right” server. Since either server can become the source or destination it’s confusing to think of them as “src” and “dst”. Therefore, they’re generically referred to as left and right. It’s easy to remember this because the first DSN is always to the left of the other server DSNs on the command line.

EXIT STATUS

The following are the exit statuses (also called return values, or return codes) when pt-table-sync finishes and exits.

STATUS  MEANING
======  =======================================================
0       Success.
1       Internal error.
2       At least one table differed on the destination.
3       Combination of 1 and 2.

OPTIONS

Specify at least one of --print, --execute, or --dry-run.

--where and --replicate are mutually exclusive.

This tool accepts additional command-line arguments. Refer to the “SYNOPSIS” and usage information for details.

--algorithms

type: string; default: Chunk,Nibble,GroupBy,Stream

Algorithm to use when comparing the tables, in order of preference.

For each table, pt-table-sync will check if the table can be synced with the given algorithms in the order that they’re given. The first algorithm that can sync the table is used. See “ALGORITHMS”.

--ask-pass

Prompt for a password when connecting to MySQL.

--bidirectional

Enable bidirectional sync between first and subsequent hosts.

See “BIDIRECTIONAL SYNCING” for more information.

--[no]bin-log

default: yes

Log to the binary log (SET SQL_LOG_BIN=1).

Specifying --no-bin-log will SET SQL_LOG_BIN=0.

--buffer-in-mysql

Instruct MySQL to buffer queries in its memory.

This option adds the SQL_BUFFER_RESULT option to the comparison queries. This causes MySQL to execute the queries and place them in a temporary table internally before sending the results back to pt-table-sync. The advantage of this strategy is that pt-table-sync can fetch rows as desired without using a lot of memory inside the Perl process, while releasing locks on the MySQL table (to reduce contention with other queries). The disadvantage is that it uses more memory on the MySQL server instead.

You probably want to leave --[no]buffer-to-client enabled too, because buffering into a temp table and then fetching it all into Perl’s memory is probably a silly thing to do. This option is most useful for the GroupBy and Stream algorithms, which may fetch a lot of data from the server.

--[no]buffer-to-client

default: yes

Fetch rows one-by-one from MySQL while comparing.

This option enables mysql_use_result which causes MySQL to hold the selected rows on the server until the tool fetches them. This allows the tool to use less memory but may keep the rows locked on the server longer.

If this option is disabled by specifying --no-buffer-to-client then mysql_store_result is used which causes MySQL to send all selected rows to the tool at once. This may result in the results “cursor” being held open for a shorter time on the server, but if the tables are large, it could take a long time anyway, and use all your memory.

For most non-trivial data sizes, you want to leave this option enabled.

This option is disabled when --bidirectional is used.

--channel

type: string

Channel name used when connected to a server using replication channels. Suppose you have two sources, source_a at port 12345, source_b at port 1236 and a replica connected to both sources using channels chan_source_a and chan_source_b. If you want to run pt-table-sync to synchronize the replica against source_a, pt-table-sync won’t be able to determine what’s the correct source since SHOW REPLICA STATUS will return 2 rows. In this case, you can use –channel=chan_source_a to specify the channel name to use in the SHOW REPLICA STATUS command.

--charset

short form: -A; type: string

Default character set. If the value is utf8, sets Perl’s binmode on STDOUT to utf8, passes the mysql_enable_utf8 option to DBD::mysql, and runs SET NAMES UTF8 after connecting to MySQL. Any other value sets binmode on STDOUT without the utf8 layer, and runs SET NAMES after connecting to MySQL.

--[no]check-child-tables

default: yes

Check if --execute will adversely affect child tables. When --replace, --replicate, or --sync-to-source is specified, the tool may sync tables using REPLACE statements. If a table being synced has child tables with ON DELETE CASCADE, ON UPDATE CASCADE, or ON UPDATE SET NULL, the tool prints an error and skips the table because REPLACE becomes DELETE then INSERT, so the DELETE will cascade to the child table and delete its rows. In the worst case, this can delete all rows in child tables!

Specify --no-check-child-tables to disable this check. To completely avoid affecting child tables, also specify --no-foreign-key-checks so MySQL will not cascade any operations from the parent to child tables.

This check is only preformed if --execute and one of --replace, --replicate, or --sync-to-source is specified. --print does not check child tables.

The error message only prints the first child table found with an ON DELETE CASCADE, ON UPDATE CASCADE, or ON UPDATE SET NULL foreign key constraint. There could be other affected child tables.

--[no]check-master

default: yes

This option is deprecated and will be removed in future releases. Use --[no]check-source instead.

--[no]check-slave

This option is deprecated and will be removed in future releases. Use --[no]check-replica instead.

default: yes

--[no]check-source

default: yes

With --sync-to-source, try to verify that the detected source is the real source.

--[no]check-replica

default: yes

Check whether the destination server is a replica.

If the destination server is a replica, it’s generally unsafe to make changes on it. However, sometimes you have to; --replace won’t work unless there’s a unique index, for example, so you can’t make changes on the source in that scenario. By default pt-table-sync will complain if you try to change data on a replica. Specify --no-check-replica to disable this check. Use it at your own risk.

--[no]check-triggers

default: yes

Check that no triggers are defined on the destination table.

Triggers were introduced in MySQL v5.0.2, so for older versions this option has no effect because triggers will not be checked.

--chunk-column

type: string

Chunk the table on this column.

--chunk-index

type: string

Chunk the table using this index.

--chunk-size

type: string; default: 1000

Number of rows or data size per chunk.

The size of each chunk of rows for the “Chunk” and “Nibble” algorithms. The size can be either a number of rows, or a data size. Data sizes are specified with a suffix of k=kibibytes, M=mebibytes, G=gibibytes. Data sizes are converted to a number of rows by dividing by the average row length.

--columns

short form: -c; type: array

Compare this comma-separated list of columns.

--config

type: Array

Read this comma-separated list of config files; if specified, this must be the first option on the command line.

--conflict-column

type: string

Compare this column when rows conflict during a --bidirectional sync.

When a same but differing row is found the value of this column from each row is compared according to --conflict-comparison, --conflict-value and --conflict-threshold to determine which row has the correct data and becomes the source. The column can be any type for which there is an appropriate --conflict-comparison (this is almost all types except, for example, blobs).

This option only works with --bidirectional. See “BIDIRECTIONAL SYNCING” for more information.

--conflict-comparison

type: string

Choose the --conflict-column with this property as the source.

The option affects how the --conflict-column values from the conflicting rows are compared. Possible comparisons are one of these MAGIC_comparisons:

newest|oldest|greatest|least|equals|matches

COMPARISON  CHOOSES ROW WITH
==========  =========================================================
newest      Newest temporal --conflict-column value
oldest      Oldest temporal --conflict-column value
greatest    Greatest numerical "--conflict-column value
least       Least numerical --conflict-column value
equals      --conflict-column value equal to --conflict-value
matches     --conflict-column value matching Perl regex pattern
            --conflict-value

This option only works with --bidirectional. See “BIDIRECTIONAL SYNCING” for more information.

--conflict-error

type: string; default: warn

How to report unresolvable conflicts and conflict errors

This option changes how the user is notified when a conflict cannot be resolved or causes some kind of error. Possible values are:

* warn: Print a warning to STDERR about the unresolvable conflict
* die:  Die, stop syncing, and print a warning to STDERR

This option only works with --bidirectional. See “BIDIRECTIONAL SYNCING” for more information.

--conflict-threshold

type: string

Amount by which one --conflict-column must exceed the other.

The --conflict-threshold prevents a conflict from being resolved if the absolute difference between the two --conflict-column values is less than this amount. For example, if two --conflict-column have timestamp values “2009-12-01 12:00:00” and “2009-12-01 12:05:00” the difference is 5 minutes. If --conflict-threshold is set to “5m” the conflict will be resolved, but if --conflict-threshold is set to “6m” the conflict will fail to resolve because the difference is not greater than or equal to 6 minutes. In this latter case, --conflict-error will report the failure.

This option only works with --bidirectional. See “BIDIRECTIONAL SYNCING” for more information.

--conflict-value

type: string

Use this value for certain --conflict-comparison.

This option gives the value for equals and matches --conflict-comparison.

This option only works with --bidirectional. See “BIDIRECTIONAL SYNCING” for more information.

--databases

short form: -d; type: hash

Sync only this comma-separated list of databases.

A common request is to sync tables from one database with tables from another database on the same or different server. This is not yet possible. --databases will not do it, and you can’t do it with the D part of the DSN either because in the absence of a table name it assumes the whole server should be synced and the D part controls only the connection’s default database.

--defaults-file

short form: -F; type: string

Only read mysql options from the given file. You must give an absolute pathname.

--dry-run

Analyze, decide the sync algorithm to use, print and exit.

Implies --verbose so you can see the results. The results are in the same output format that you’ll see from actually running the tool, but there will be zeros for rows affected. This is because the tool actually executes, but stops before it compares any data and just returns zeros. The zeros do not mean there are no changes to be made.

--engines

short form: -e; type: hash

Sync only this comma-separated list of storage engines.

--execute

Execute queries to make the tables have identical data.

This option makes pt-table-sync actually sync table data by executing all the queries that it created to resolve table differences. Therefore, the tables will be changed! And unless you also specify --verbose, the changes will be made silently. If this is not what you want, see --print or --dry-run.

--explain-hosts

Print connection information and exit.

Print out a list of hosts to which pt-table-sync will connect, with all the various connection options, and exit.

--float-precision

type: int

Precision for FLOAT and DOUBLE number-to-string conversion. Causes FLOAT and DOUBLE values to be rounded to the specified number of digits after the decimal point, with the ROUND() function in MySQL. This can help avoid checksum mismatches due to different floating-point representations of the same values on different MySQL versions and hardware. The default is no rounding; the values are converted to strings by the CONCAT() function, and MySQL chooses the string representation. If you specify a value of 2, for example, then the values 1.008 and 1.009 will be rounded to 1.01, and will checksum as equal.

--[no]foreign-key-checks

default: yes

Enable foreign key checks (SET FOREIGN_KEY_CHECKS=1).

Specifying --no-foreign-key-checks will SET FOREIGN_KEY_CHECKS=0.

--function

type: string

Which hash function you’d like to use for checksums.

The default is CRC32. Other good choices include MD5 and SHA1. If you have installed the FNV_64 user-defined function, pt-table-sync will detect it and prefer to use it, because it is much faster than the built-ins. You can also use MURMUR_HASH if you’ve installed that user-defined function. Both of these are distributed with Percona Server. See pt-table-checksum for more information and benchmarks.

--help

Show help and exit.

--[no]hex-blob

default: yes

HEX() BLOB, TEXT and BINARY columns.

When row data from the source is fetched to create queries to sync the data (i.e. the queries seen with --print and executed by --execute), binary columns are wrapped in HEX() so the binary data does not produce an invalid SQL statement. You can disable this option but you probably shouldn’t.

--host

short form: -h; type: string

Connect to host.

--ignore-columns

type: Hash

Ignore this comma-separated list of column names in comparisons.

This option causes columns not to be compared. However, if a row is determined to differ between tables, all columns in that row will be synced, regardless. (It is not currently possible to exclude columns from the sync process itself, only from the comparison.)

--ignore-databases

type: Hash

Ignore this comma-separated list of databases.

(system databases such as information_schema and performance_schema are ignored by default)

--ignore-engines

type: Hash; default: FEDERATED,MRG_MyISAM

Ignore this comma-separated list of storage engines.

--ignore-tables

type: Hash

Ignore this comma-separated list of tables.

Table names may be qualified with the database name.

--ignore-tables-regex

type: string; group: Filter

Ignore tables whose names match the Perl regex.

--[no]index-hint

default: yes

Add FORCE/USE INDEX hints to the chunk and row queries.

By default pt-table-sync adds a FORCE/USE INDEX hint to each SQL statement to coerce MySQL into using the index chosen by the sync algorithm or specified by --chunk-index. This is usually a good thing, but in rare cases the index may not be the best for the query so you can suppress the index hint by specifying --no-index-hint and let MySQL choose the index.

This does not affect the queries printed by --print; it only affects the chunk and row queries that pt-table-sync uses to select and compare rows.

--lock

type: int

Lock tables: 0=none, 1=per sync cycle, 2=per table, or 3=globally.

This uses LOCK TABLES. This can help prevent tables being changed while you’re examining them. The possible values are as follows:

VALUE  MEANING
=====  =======================================================
0      Never lock tables.
1      Lock and unlock one time per sync cycle (as implemented
       by the syncing algorithm).  This is the most granular
       level of locking available.  For example, the Chunk
       algorithm will lock each chunk of C<N> rows, and then
       unlock them if they are the same on the source and the
       destination, before moving on to the next chunk.
2      Lock and unlock before and after each table.
3      Lock and unlock once for every server (DSN) synced, with
       C<FLUSH TABLES WITH READ LOCK>.

A replication replica is never locked if --replicate or --sync-to-source is specified, since in theory locking the table on the source should prevent any changes from taking place. (You are not changing data on your replica, right?) If --wait is given, the source (source) is locked and then the tool waits for the replica to catch up to the source before continuing.

If --transaction is specified, LOCK TABLES is not used. Instead, lock and unlock are implemented by beginning and committing transactions. The exception is if --lock is 3.

If --no-transaction is specified, then LOCK TABLES is used for any value of --lock. See --[no]transaction.

--lock-and-rename

Lock the source and destination table, sync, then swap names. This is useful as a less-blocking ALTER TABLE, once the tables are reasonably in sync with each other (which you may choose to accomplish via any number of means, including dump and reload or even something like pt-archiver). It requires exactly two DSNs and assumes they are on the same server, so it does no waiting for replication or the like. Tables are locked with LOCK TABLES.

--password

short form: -p; type: string

Password to use when connecting. If password contains commas they must be escaped with a backslash: “exam,ple”

--pid

type: string

Create the given PID file. The tool won’t start if the PID file already exists and the PID it contains is different than the current PID. However, if the PID file exists and the PID it contains is no longer running, the tool will overwrite the PID file with the current PID. The PID file is removed automatically when the tool exits.

--port

short form: -P; type: int

Port number to use for connection.

--print

Print queries that will resolve differences.

If you don’t trust pt-table-sync, or just want to see what it will do, this is a good way to be safe. These queries are valid SQL and you can run them yourself if you want to sync the tables manually.

--recursion-method

type: array; default: processlist,hosts

Preferred recursion method used to find replicas.

Possible methods are:

METHOD       USES
===========  ==================
processlist  SHOW PROCESSLIST
hosts        SHOW REPLICAS (SHOW SLAVE HOSTS before MySQL 8.1)
none         Do not find replicas

The processlist method is preferred because SHOW REPLICAS is not reliable. However, the hosts method is required if the server uses a non-standard port (not 3306). Usually pt-table-sync does the right thing and finds the replicas, but you may give a preferred method and it will be used first. If it doesn’t find any replicas, the other methods will be tried.

--replace

Write all INSERT and UPDATE statements as REPLACE.

This is automatically switched on as needed when there are unique index violations.

--replicate

type: string

Sync tables listed as different in this table.

Specifies that pt-table-sync should examine the specified table to find data that differs. The table is exactly the same as the argument of the same name to pt-table-checksum. That is, it contains records of which tables (and ranges of values) differ between the source and replica.

For each table and range of values that shows differences between the source and replica, pt-table-sync will sync that table, with the appropriate WHERE clause, to its source.

This automatically sets --wait to 60 and causes changes to be made on the source instead of the replica.

If --sync-to-source is specified, the tool will assume the server you specified is the replica, and connect to the source as usual to sync.

Otherwise, it will try to use SHOW PROCESSLIST to find replicas of the server you specified. If it is unable to find any replicas via SHOW PROCESSLIST, it will inspect SHOW REPLICAS instead. You must configure each replica’s report-host, report-port and other options for this to work right. After finding replicas, it will inspect the specified table on each replica to find data that needs to be synced, and sync it.

The tool examines the source’s copy of the table first, assuming that the source is potentially a replica as well. Any table that shows differences there will NOT be synced on the replica(s). For example, suppose your replication is set up as A->B, B->C, B->D. Suppose you use this argument and specify server B. The tool will examine server B’s copy of the table. If it looks like server B’s data in table test.tbl1 is different from server A’s copy, the tool will not sync that table on servers C and D.

--slave-user

type: string

This option is deprecated and will be removed in future releases. Use --replica-user instead.

--slave-password

type: string

This option is deprecated and will be removed in future releases. Use --replica-password instead.

--replica-user

type: string

Sets the user to be used to connect to the replicas. This parameter allows you to have a different user with less privileges on the replicas but that user must exist on all replicas.

--replica-password

type: string

Sets the password to be used to connect to the replicas. It can be used with –replica-user and the password for the user must be the same on all replicas.

--set-vars

type: Array

Set the MySQL variables in this comma-separated list of variable=value pairs.

By default, the tool sets:

wait_timeout=10000

Variables specified on the command line override these defaults. For example, specifying --set-vars wait_timeout=500 overrides the defaultvalue of 10000.

The tool prints a warning and continues if a variable cannot be set.

--socket

short form: -S; type: string

Socket file to use for connection.

--sync-to-master

This option is deprecated and will be removed in future releases. Use --sync-to-source instead.

--sync-to-source

Treat the DSN as a replica and sync it to its source.

Treat the server you specified as a replica. Inspect SHOW REPLICA STATUS, connect to the server’s source, and treat the source as the source and the replica as the destination. Causes changes to be made on the source. Sets --wait to 60 by default, sets --lock to 1 by default, and disables --[no]transaction by default. See also --replicate, which changes this option’s behavior.

--tables

short form: -t; type: hash

Sync only this comma-separated list of tables.

Table names may be qualified with the database name.

--timeout-ok

Keep going if --wait fails.

If you specify --wait and the replica doesn’t catch up to the source’s position before the wait times out, the default behavior is to abort. This option makes the tool keep going anyway. Warning: if you are trying to get a consistent comparison between the two servers, you probably don’t want to keep going after a timeout.

--[no]transaction

Use transactions instead of LOCK TABLES.

The granularity of beginning and committing transactions is controlled by --lock. This is enabled by default, but since --lock is disabled by default, it has no effect.

Most options that enable locking also disable transactions by default, so if you want to use transactional locking (via LOCK IN SHARE MODE and FOR UPDATE, you must specify --transaction explicitly.

If you don’t specify --transaction explicitly pt-table-sync will decide on a per-table basis whether to use transactions or table locks. It currently uses transactions on InnoDB tables, and table locks on all others.

If --no-transaction is specified, then pt-table-sync will not use transactions at all (not even for InnoDB tables) and locking is controlled by --lock.

When enabled, either explicitly or implicitly, the transaction isolation level is set REPEATABLE READ and transactions are started WITH CONSISTENT SNAPSHOT.

--trim

TRIM() VARCHAR columns in BIT_XOR and ACCUM modes. Helps when comparing MySQL 4.1 to >= 5.0.

This is useful when you don’t care about the trailing space differences between MySQL versions which vary in their handling of trailing spaces. MySQL 5.0 and later all retain trailing spaces in VARCHAR, while previous versions would remove them.

--[no]unique-checks

default: yes

Enable unique key checks (SET UNIQUE_CHECKS=1).

Specifying --no-unique-checks will SET UNIQUE_CHECKS=0.

--user

short form: -u; type: string

User for login if not current user.

--verbose

short form: -v; cumulative: yes

Print results of sync operations.

See “OUTPUT” for more details about the output.

--version

Show version and exit.

--[no]version-check

default: yes

Check for the latest version of Percona Toolkit, MySQL, and other programs.

This is a standard “check for updates automatically” feature, with two additional features. First, the tool checks its own version and also the versions of the following software: operating system, Percona Monitoring and Management (PMM), MySQL, Perl, MySQL driver for Perl (DBD::mysql), and Percona Toolkit. Second, it checks for and warns about versions with known problems. For example, MySQL 5.5.25 had a critical bug and was re-released as 5.5.25a.

A secure connection to Percona’s Version Check database server is done to perform these checks. Each request is logged by the server, including software version numbers and unique ID of the checked system. The ID is generated by the Percona Toolkit installation script or when the Version Check database call is done for the first time.

Any updates or known problems are printed to STDOUT before the tool’s normal output. This feature should never interfere with the normal operation of the tool.

For more information, visit https://www.percona.com/doc/percona-toolkit/LATEST/version-check.html.

--wait

short form: -w; type: time

How long to wait for replicas to catch up to their source.

Make the source wait for the replica to catch up in replication before comparing the tables. The value is the number of seconds to wait before timing out (see also --timeout-ok). Sets --lock to 1 and --[no]transaction to 0 by default. If you see an error such as the following,

SOURCE_POS_WAIT returned -1

It means the timeout was exceeded and you need to increase it.

The default value of this option is influenced by other options. To see what value is in effect, run with --help.

To disable waiting entirely (except for locks), specify --wait 0. This helps when the replica is lagging on tables that are not being synced.

--where

type: string

WHERE clause to restrict syncing to part of the table.

--[no]zero-chunk

default: yes

Add a chunk for rows with zero or zero-equivalent values. The only has an effect when --chunk-size is specified. The purpose of the zero chunk is to capture a potentially large number of zero values that would imbalance the size of the first chunk. For example, if a lot of negative numbers were inserted into an unsigned integer column causing them to be stored as zeros, then these zero values are captured by the zero chunk instead of the first chunk and all its non-zero values.

DSN OPTIONS

These DSN options are used to create a DSN. Each option is given like option=value. The options are case-sensitive, so P and p are not the same option. There cannot be whitespace before or after the = and if the value contains whitespace it must be quoted. DSN options are comma-separated. See the percona-toolkit manpage for full details.

  • A

dsn: charset; copy: yes

Default character set.

  • D

dsn: database; copy: yes

Database containing the table to be synced.

  • F

dsn: mysql_read_default_file; copy: yes

Only read default options from the given file

  • h

dsn: host; copy: yes

Connect to host.

  • p

dsn: password; copy: yes

Password to use when connecting. If password contains commas they must be escaped with a backslash: “exam,ple”

  • P

dsn: port; copy: yes

Port number to use for connection.

  • S

dsn: mysql_socket; copy: yes

Socket file to use for connection.

  • t

copy: yes

Table to be synced.

  • u

dsn: user; copy: yes

User for login if not current user.

  • s

dsn: mysql_ssl; copy: yes

Create SSL connection

ENVIRONMENT

The environment variable PTDEBUG enables verbose debugging output to STDERR. To enable debugging and capture all output to a file, run the tool like:

PTDEBUG=1 pt-table-sync ... > FILE 2>&1

Be careful: debugging output is voluminous and can generate several megabytes of output.

ATTENTION

Using <PTDEBUG> might expose passwords. When debug is enabled, all command line parameters are shown in the output.

SYSTEM REQUIREMENTS

You need Perl, DBI, DBD::mysql, and some core packages that ought to be installed in any reasonably new version of Perl.

BUGS

For a list of known bugs, see https://jira.percona.com/projects/PT/issues.

Please report bugs at https://jira.percona.com/projects/PT. Include the following information in your bug report:

  • Complete command-line used to run the tool

  • Tool --version

  • MySQL version of all servers involved

  • Output from the tool including STDERR

  • Input files (log/dump/config files, etc.)

If possible, include debugging output by running the tool with PTDEBUG; see “ENVIRONMENT”.

DOWNLOADING

Visit http://www.percona.com/software/percona-toolkit/ to download the latest release of Percona Toolkit. Or, get the latest release from the command line:

wget percona.com/get/percona-toolkit.tar.gz

wget percona.com/get/percona-toolkit.rpm

wget percona.com/get/percona-toolkit.deb

You can also get individual tools from the latest release:

wget percona.com/get/TOOL

Replace TOOL with the name of any tool.

AUTHORS

Baron Schwartz

ACKNOWLEDGMENTS

My work is based in part on Giuseppe Maxia’s work on distributed databases, http://www.sysadminmag.com/articles/2004/0408/ and code derived from that article. There is more explanation, and a link to the code, at http://www.perlmonks.org/?node_id=381053.

Another programmer extended Maxia’s work even further. Fabien Coelho changed and generalized Maxia’s technique, introducing symmetry and avoiding some problems that might have caused too-frequent checksum collisions. This work grew into pg_comparator, http://www.coelho.net/pg_comparator/. Coelho also explained the technique further in a paper titled “Remote Comparison of Database Tables” (http://cri.ensmp.fr/classement/doc/A-375.pdf).

This existing literature mostly addressed how to find the differences between the tables, not how to resolve them once found. I needed a tool that would not only find them efficiently, but would then resolve them. I first began thinking about how to improve the technique further with my article https://web.archive.org/web/20071018105253/http://www.xaprb.com/blog/2007/03/05/an-algorithm-to-find-and-resolve-data-differences-between-mysql-tables/, where I discussed a number of problems with the Maxia/Coelho “bottom-up” algorithm. After writing that article, I began to write this tool. I wanted to actually implement their algorithm with some improvements so I was sure I understood it completely. I discovered it is not what I thought it was, and is considerably more complex than it appeared to me at first. Fabien Coelho was kind enough to address some questions over email.

The first versions of this tool implemented a version of the Coelho/Maxia algorithm, which I called “bottom-up”, and my own, which I called “top-down.” Those algorithms are considerably more complex than the current algorithms and I have removed them from this tool, and may add them back later. The improvements to the bottom-up algorithm are my original work, as is the top-down algorithm. The techniques to actually resolve the differences are also my own work.

Another tool that can synchronize tables is the SQLyog Job Agent from webyog. Thanks to Rohit Nadhani, SJA’s author, for the conversations about the general techniques. There is a comparison of pt-table-sync and SJA at https://web.archive.org/web/20070919024435/http://www.xaprb.com/blog/2007/04/05/mysql-table-sync-vs-sqlyog-job-agent/

Thanks to the following people and organizations for helping in many ways:

The Rimm-Kaufman Group http://www.rimmkaufman.com/, MySQL AB http://www.mysql.com/, Blue Ridge InternetWorks http://www.briworks.com/, Percona http://www.percona.com/, Fabien Coelho, Giuseppe Maxia and others at MySQL AB, Kristian Koehntopp (MySQL AB), Rohit Nadhani (WebYog), The helpful monks at Perlmonks, And others too numerous to mention.

ABOUT PERCONA TOOLKIT

This tool is part of Percona Toolkit, a collection of advanced command-line tools for MySQL developed by Percona. Percona Toolkit was forked from two projects in June, 2011: Maatkit and Aspersa. Those projects were created by Baron Schwartz and primarily developed by him and Daniel Nichter. Visit http://www.percona.com/software/ to learn about other free, open-source software from Percona.

VERSION

pt-table-sync 3.6.0