Chronicle of fixing broken replication, and a question
on 26.07.2011 16:18:15 by Todd Lyons
Yeah, it happens to all of us. Some master->slave replication setup
breaks and goes unnoticed. In my case, I had a monitoring script, but
it was being called with a --verbose flag, internally that
"verbose" was being mapped to "debug", and the routine that notifies
sysadmins of broken replication doesn't fire when debug is set, so the
warning emails were never sent out. You might ask, why is this a
problem? Just restart replication, skipping problem statements.
Well, the big issue was that I have automatic purging of binlogs enabled.
Over the course of a few days, the binlog purge passed the point where
replication had stopped. Damn...
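The failure mode boils down to something like this hypothetical check
script (not the real one; the names, addresses and the exact check are
made up), where the alert path is guarded by the same flag that
--verbose turns on:
#!/bin/bash
# Hypothetical reconstruction of the broken monitoring check.
DEBUG=0
[ "$1" = "--verbose" ] && DEBUG=1     # "verbose" silently becomes "debug"
# Count "Yes" answers for the SQL thread; 0 means replication is broken.
RUNNING=$(mysql -e 'SHOW SLAVE STATUS\G' | grep -c 'Slave_SQL_Running: Yes')
if [ "$RUNNING" -eq 0 ]; then
    if [ "$DEBUG" -eq 0 ]; then
        echo "replication is broken" | mail -s "slave down" sysadmins@example.com
    else
        echo "DEBUG: would have mailed sysadmins"   # the alert dies here
    fi
fi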
This is a MySQL 5.0.77 (CentOS 5.x) installation with the maatkit and
xtrabackup utilities available. So there's really only one thing to
do: xtrabackup to the rescue!
1. I did a full copy of the running master database server using
xtrabackup to a backup server via nfs (commands sketched after this
list). It took 2 hours, of which the last 15 minutes held a write
lock on the entire server as it copied over the *.frm files and the
few myisam tables. This was the most troublesome part as it was
visible to both users and website owners :-(
2. On the slave, I ran the apply-log step (reading via nfs from the
backup server), taking another 2 hours.
3. I renamed the old directory on the slave server and created a new
mysql directory, owned by the correct user.
4. I copied the backup data from the backup server to this new directory.
5. I deleted the ib_logfile* files so that mysql would recreate them
on startup based on the sizes set in the my.cnf.
6. I started it up. It still wanted to do a crash recovery since the
ibdata and ib_logfile files didn't match, but that was expected. After
a few minutes of scanning (we use innodb_file_per_table, so the
tablespace is spread out over lots of files), it started up and was
accepting connections. Looking good so far.
7. I did 'cat xtrabackup_slave_info'. For some reason it had this:
CHANGE MASTER TO MASTER_LOG_FILE='', MASTER_LOG_POS=
*UH OH* *CRAP* *$#^@%&!*
** See note below
8. Calm down, think about things a bit. Obviously I didn't
compensate for the fact that I was doing the original backup on the
master but the apply-log on the slave (and the --slave-info option
doesn't have that data available when the backup is taken on the
master, which is why the file above is empty).
9. I knew roughly what time I started the backup. I also knew that
this is a database server dedicated to ecommerce shopping carts, so
there is a steady flow of incoming data as customers and web robots
access the shopping carts.
10. Knowing #9, I looked at the statements in the master's binlog
with mysqlbinlog, using --start-datetime="2011-07-25 12:00:00". A
quick query of the data in that database on the
newly-recovered-but-not-yet-replicating slave found me a last
timestamp of 13:16:49.
11. I adjusted my timestamp and traced the data from that point until
I found a row that didn't get inserted into the database by the
replication process (meaning, this was when I started the backup). I
determined that the timestamp of the data not yet in the database was
13:16:52.
12. Comments in mysqlbinlog output at that timestamp indicated it was
at position 487727085.
13. I repeated the mysqlbinlog using --start-position 487727085 and
verified that it was the same output as #12.
14. I ran a CHANGE MASTER statement on the slave setting the position
to 487727085 (see the sketch after this list), started replication,
and it caught up after about 20 minutes.
15. If #14 had run into issues, I did have mk-slave-restart
available to force it to skip problem statements and restart
replication.
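For reference, steps 1 and 2 amount to roughly the following,
assuming the innobackupex wrapper that ships with xtrabackup, an NFS
mount at /mnt/backup, and the timestamped directory the wrapper
creates; the paths and credentials here are illustrative, not the
exact commands I ran:
# Step 1, on the master: full backup onto the NFS-mounted backup server.
innobackupex --user=root --password=xxxx --slave-info /mnt/backup/
# Step 2, on the slave (same NFS share): prepare the backup so the InnoDB
# files are consistent before copying them into the new mysql directory.
innobackupex --apply-log /mnt/backup/2011-07-25_12-00-00/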
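Steps 10 through 14 boil down to something like this; the binlog name
and position are the real ones from above, everything else is
illustrative:
# Steps 10-11, on the master: walk the binlog from around the backup
# start time and compare timestamps against what is already on the slave.
mysqlbinlog --start-datetime="2011-07-25 12:00:00" mysqld-bin.000259 | less
# Step 13: re-dump from the candidate offset to verify it begins with the
# first statement that is missing on the slave.
mysqlbinlog --start-position=487727085 mysqld-bin.000259 | head -50
# Step 14, on the slave: point replication at that file/position and go.
mysql -e "CHANGE MASTER TO
  MASTER_LOG_FILE='mysqld-bin.000259',
  MASTER_LOG_POS=487727085;
  START SLAVE;"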
So everything is alright. There are no issues. But there is a question.
Is there a better way to do this? Is there something automated that
will "compare THIS binlog to THAT server and tell me where the data
was no longer being inserted" ? I'm looking to see how others deal
with a complete restart of a running system like this.
** It turns out that xtrabackup gave me confidence in the log position
I had determined. Running the backup on the master, it also made this
file:
# cat xtrabackup_binlog_info
mysqld-bin.000259 487727085
which of course confirmed the log position I determined in step #12.
To be honest, the biggest step in the sequence above is likely #8.
The ability to assess a situation without freaking out (for too long)
is a big part of a sysadmin's job, IMHO.
Regards... Todd
--
If Americans could eliminate sugary beverages, potatoes, white bread,
pasta, white rice and sugary snacks, we would wipe out almost all the
problems we have with weight and diabetes and other metabolic
diseases. -- Dr. Walter Willett, Harvard School of Public Health
Re: Chronicle of fixing broken replication, and a question
on 26.07.2011 17:18:11 by Reindl Harald
On 26.07.2011 16:18, Todd Lyons wrote:
> 1. I did a full copy of the running master database server using
> xtrabackup to a backup server via nfs. It took 2 hours, of which the
> last 15 minutes did a write lock of the entire server as it copied
> over the *.frm files and the few myisam tables. This was the most
> troublesome part as it was visible to both users and website owners
> :-(
why are you not using two rsync runs?
the first while mysqld is running
the second directly after stopping mysqld
this way you can be 100% sure that you can start the replication
from scratch, and your downtime is only a few seconds. It is best if
you have enough space to put this target on the master machine,
because while you take the slow way over the network the master is
already running with a fresh binlog
#!/bin/bash
# First rsync while mysqld is still running; moves the bulk of the data.
rsync --times --perms --owner --group --recursive --delete-after /mysql_data/ /mysql_backup/
date
/sbin/service mysqld stop
cd /mysql_data/
# Remove the old binlogs so the master starts over with a fresh one.
rm -f /mysql_data/bin*
# Second rsync with mysqld stopped only transfers what changed since the first.
rsync --times --perms --owner --group --recursive --delete-after /mysql_data/ /mysql_backup/
/sbin/service mysqld start
Re: Chronicle of fixing broken replication, and a question
on 26.07.2011 19:13:17 by Todd Lyons
On Tue, Jul 26, 2011 at 8:18 AM, Reindl Harald wrote:
>> 1. I did a full copy of the running master database server using
>> xtrabackup to a backup server via nfs. It took 2 hours, of which the
>> last 15 minutes held a write lock on the entire server as it copied
>> over the *.frm files and the few myisam tables. This was the most
>> troublesome part as it was visible to both users and website owners
> why are you not using two rsync runs?
> the first while mysqld is running
> the second directly after stopping mysqld
>
> this way you can be 100% sure that you can start the replication
> from scratch, and your downtime is only a few seconds. It is best if
43 GB is more than a few seconds.
> you have enough space to put this target on the master machine,
> because while you take the slow way over the network the master is
> already running with a fresh binlog
I tested a slightly modified version of your quickie script, first
using the nfs share:
Starting first rsync
real 37m48.201s
Stopping MySQL: [ OK ]
Starting second rsync
real 4m24.536s
Starting MySQL: [ OK ]
Then I ran it using local spindles:
Starting first rsync
real 26m10.747s
Stopping MySQL: [ OK ]
Starting second rsync
real 3m11.945s
Starting MySQL: [ OK ]
So I could have lowered the amount of time mysql was not available by
quite a bit doing it that way. Plus I would have removed the
requirement to apply logs (due to not copying innodb files while they
were open). In the end, xtrabackup worked as designed, but my large
number of databases and tables combined with innodb_file_per_table
slows things down tremendously, so it isn't the best fit in this case.
....Todd
Re: Chronicle of fixing broken replication, and a question
on 26.07.2011 23:58:59 by Reindl Harald
On 26.07.2011 19:13, Todd Lyons wrote:
> On Tue, Jul 26, 2011 at 8:18 AM, Reindl Harald wrote:
>>> 1. I did a full copy of the running master database server using
>>> xtrabackup to a backup server via nfs. It took 2 hours, of which the
>>> last 15 minutes held a write lock on the entire server as it copied
>>> over the *.frm files and the few myisam tables. This was the most
>>> troublesome part as it was visible to both users and website owners
>> why are you not using two rsync runs?
>> the first while mysqld is running
>> the second directly after stopping mysqld
>>
>> this way you can be 100% sure that you can start the replication
>> from scratch, and your downtime is only a few seconds. It is best if
>
> 43 GB is more than a few seconds.
well, but overall it is faster and without any pitfalls
>> you have enough space to put this target on the master machine,
>> because while you take the slow way over the network the master is
>> already running with a fresh binlog
>
> I tested a slightly modified version of your quickie script, first
> using the nfs share:
> Starting first rsync
> real 37m48.201s
> Stopping MySQL: [ OK ]
> Starting second rsync
> real 4m24.536s
> Starting MySQL: [ OK ]
this is a good value for remote
> Then I ran it using local spindles:
> Starting first rsync
> real 26m10.747s
> Stopping MySQL: [ OK ]
> Starting second rsync
> real 3m11.945s
> Starting MySQL: [ OK ]
well, 3 minutes is an acceptable downtime, with the benefit
that the slave will be binary-identical and all the old logs
are gone
> So I could have lowered the amount of time mysql was not available by
> quite a bit doing it that way. Plus I would have removed the
> requirement to apply logs (due to not copying innodb files while they
> were open). In the end, xtrabackup worked as designed, but my large
> number of databases and tables combined with innodb_file_per_table
> slows things down tremendously, so it isn't the best fit in this case.
with smaller databases it is the same; our main webserver has
only 800 MB of database files, and there we are really talking
about a few seconds for the second rsync. The only thing left to do
is re-init the replication after that - I really find no case where
the rsync does not win in the end :-)
I have been doing this for years for all sorts of backups (for
example, stopping the slave briefly to make a consistent
off-site backup), because I do not trust any tool that makes a
hot backup and then tries to sync with the binary logs
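A minimal sketch of that re-init on the slave, assuming the binlog
basename is mysqld-bin and that the master's old binlogs were removed
during the copy so it starts a fresh one (the host is a placeholder,
and is only needed if it is not already configured):
# On the slave, once the copied data directory is in place and mysqld is up:
mysql -e "CHANGE MASTER TO
  MASTER_HOST='master.example.com',
  MASTER_LOG_FILE='mysqld-bin.000001',
  MASTER_LOG_POS=4;
  START SLAVE;"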