MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, whatever)

on 09.11.2006 05:23:48 by alf

Hi,

Is it possible, due to an OS crash, a crash of mysql itself, or e.g. a
SCSI failure, to lose all the data stored in a table (let's say a million
1KB rows)? In other words, what is the worst case scenario for the MyISAM
backend?


Also, is it possible not to lose the data outright but to get it corrupted?


Thx, Andy

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, whatever)

on 09.11.2006 05:55:36 by gordonb.ybl0v

>is it possible that due to OS crash or mysql itself crash or some e.g.
>SCSI failure to lose all the data stored in the table (let's say million
>of 1KB rows).

It is always possible that your computer system will catch fire and
lose all data EVEN IF IT'S POWERED OFF. And the same nuclear attack
might take out all your backups, too. And you and all your employees.
Or the whole thing could just be stolen.

Managing to smash just one sector - the sector containing the data
file's inode - or worse, the sector containing the data file, index
file, AND table definition inodes, could pretty well kill a table.
I have had the experience of a hard disk controller that sometimes
flipped some bits in the sectors before writing them. It took weeks
to discover this.

>In other words what is the worst case scenario for MyISAM
>backend?

Probably, total loss of data and hardware.

>Also is it possible to not to lose data but get them corrupted?

I call that 'lost'. But yes, it is possible to end up with a bunch
of data that's bad and you don't realize it until things have gotten
much worse.

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, whatever)

on 09.11.2006 07:04:21 by alf

Gordon Burditt wrote:
>>is it possible that due to OS crash or mysql itself crash or some e.g.
>>SCSI failure to lose all the data stored in the table (let's say million
>>of 1KB rows).
>
> Managing to smash just one sector, the sector containing the data
> file inode, or worse, the sector containing the data file, index
> file, AND table definition inodes, could pretty well kill a table.
> I have had the experience of a hard disk controller that sometimes
> flipped some bits in the sectors before writing them. It took weeks
> to discover this.
>
>
>>In other words what is the worst case scenario for MyISAM
>>backend?
>
>
> Probably, total loss of data and hardware.
>

Well, let's narrow it down to a mysql bug causing it to crash. Or
better, to all the situations where the transactional capabilities of
InnoDB can easily take care of a recovery (to the last committed trx).

I wonder whether, due to the internal structure of the MyISAM backend,
it is possible to lose an entire table, where even the recovery tools
give up.

Would using ext3 help?


Thx in advance, Andy

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, whatever)

on 09.11.2006 13:37:25 by Jerry Stuckle

alf wrote:
> Gordon Burditt wrote:
>
>>> is it possible that due to OS crash or mysql itself crash or some e.g.
>>> SCSI failure to lose all the data stored in the table (let's say million
>>> of 1KB rows).
>>
>>
>> Managing to smash just one sector, the sector containing the data
>> file inode, or worse, the sector containing the data file, index
>> file, AND table definition inodes, could pretty well kill a table.
>> I have had the experience of a hard disk controller that sometimes
>> flipped some bits in the sectors before writing them. It took weeks
>> to discover this.
>>
>>
>>> In other words what is the worst case scenario for MyISAM
>>> backend?
>>
>>
>>
>> Probably, total loss of data and hardware.
>>
>
> well, let's narrow it down to the mysql bug causing it to crash. Or
> better to the all situations where trx's capabilities of InnoDB can
> easily take care of a recovery (to the last committed trx).
>
> I wonder if there is a possibility due to internal structure of MyISAM
> backend to lose entire table where even recovery tools give up.
>
> Would using ext3 help?
>
>
> Thx in advance, Andy

As Gordon said - anything's possible.

I don't see why ext3 would help. It knows nothing about the internal
format of the tables, and that's what is most likely to get screwed up
in a database crash. I would think it would be almost impossible to
recover to a consistent point in the database unless you have a very
detailed knowledge of the internal format of the files. And even then
it might be impossible if your system is very busy.

The best strategy is to keep regular backups of the database.
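
For MyISAM that can be as simple as locking the tables and copying the
files. A minimal sketch (the lock has to stay held in the same session
while the copy runs; the details of where you copy to are up to you):

   -- flush dirty blocks to disk and block writers for the duration
   FLUSH TABLES WITH READ LOCK;
   -- while the lock is held, copy the .frm/.MYD/.MYI files out of the
   -- data directory (or run mysqldump from another shell)
   UNLOCK TABLES;

mysqldump with --lock-tables does roughly the same thing for you on a
per-database basis.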

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, whatever)

on 09.11.2006 16:15:35 by Toby

Gordon Burditt wrote:
> >is it possible that due to OS crash or mysql itself crash or some e.g.
> >SCSI failure to lose all the data stored in the table (let's say million
> >of 1KB rows).
>
> It is always possible that your computer system will catch fire and
> lose all data EVEN IF IT'S POWERED OFF. And the same nuclear attack
> might take up all your backups, too. And you and all your employees.
> Or the whole thing could just be stolen.
>
> Managing to smash just one sector, the sector containing the data
> file inode, or worse, the sector containing the data file, index
> file, AND table definition inodes, could pretty well kill a table.
> I have had the experience of a hard disk controller that sometimes
> flipped some bits in the sectors before writing them. It took weeks
> to discover this.

I spent weeks on a similar problem too - turned out to be bad RAM. The
only filesystem that I know of which can handle such hardware failures
is Sun's ZFS:
http://blogs.sun.com/bonwick/entry/zfs_end_to_end_data

>
> >In other words what is the worst case scenario for MyISAM
> >backend?
>
> Probably, total loss of data and hardware.
>
> >Also is it possible to not to lose data but get them corrupted?
>
> I call that 'lost'. But yes, it is possible to end up with a bunch
> of data that's bad and you don't realize it until things have gotten
> much worse.

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, whatever)

on 09.11.2006 16:17:27 by alf

Jerry Stuckle wrote:

> I don't see why ext3 would help.


only so as not to get the file system corrupted.



> It knows nothing about the internal
> format of the tables, and that's what is most likely to get screwed up
> in a database crash. I would think it would be almost impossible to
> recover to a consistent point in the database unless you have a very
> detailed knowledge of the internal format of the files.


Well, mysql's recovery procedures do have that knowledge. There are
different levels of disaster. My assumption is that the file system
survives.


>
> The best strategy is to keep regular backups of the database.
>

In my case it is a bit different. There are millions of rows which get
inserted, live for a few minutes or hours, and then get deleted. A
backup is not even feasible. While I can afford some (1-5%) data loss
due to a crash, I still must not lose the entire table. I wonder if
mysql's recovery procedures can ensure that.

--
alf

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, whatever)

on 09.11.2006 20:27:10 by Jerry Stuckle

alf wrote:
> Jerry Stuckle wrote:
>
>> I don't see why ext3 would help.
>
>
>
> only to not to get the file system corrupted.
>
>

That doesn't mean the tables themselves can't be corrupted - for instance,
if MySQL crashes in the middle of a large write operation. There is nothing
the file system can do to prevent that from happening. And you would have
to know exactly where to stop the file system restore to recover the
data - which would require a good knowledge of the MySQL table structure.

>
>> It knows nothing about the internal format of the tables, and that's
>> what is most likely to get screwed up in a database crash. I would
>> think it would be almost impossible to recover to a consistent point
>> in the database unless you have a very detailed knowledge of the
>> internal format of the files.
>
>
>
> Well, mysql recovery procedures does have that knowledge. There are
> different levels of disaster. My assumption is that the file system
> survives.
>

Yes, it does. That's its job, after all. But if the tables themselves
are corrupted, nothing the file system will do will help that. And if
MySQL can't recover the data because of this, which file system you use
doesn't make any difference.
>
>>
>> The best strategy is to keep regular backups of the database.
>>
>
> in my case it is a bit different. There are millions of rows which get
> inserted, live for a few minutes or hours and then they get deleted. the
> backup is not even feasible. While I can afford some (1-5%) data loss
> due to crash, I still must not lose entire table. Wonder if mysql
> recovery procedures can ensure that.
>

Backups are ALWAYS feasible. And critical if you want to keep your data
safe. There is no replacement.

You can get some help by using INNODB tables and enabling the binary
log. That will allow MySQL to recover from the last good backup by
rolling the logs forward. There should be little or no loss of data.

But you still need the backups. There's no way to feasibly roll forward
a year's worth of data, for instance.
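
Just to illustrate the moving parts (this assumes log-bin has been turned
on in my.cnf; the statements below only inspect the logs, they change
nothing):

   SHOW MASTER STATUS;   -- current binary log file and position, worth
                         -- recording at backup time
   SHOW BINARY LOGS;     -- the log files available for rolling forward

After restoring the last good backup, the binary logs from that point on
are replayed with the mysqlbinlog utility piped into mysql.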

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, whatever)

on 09.11.2006 22:47:21 by alf

Jerry Stuckle wrote:

> That doesn't mean the tables themselves can't be corrupted. For instance
> if MySQL crashes in the middle of large write operation. Nothing the
> file system can do to prevent that from happening. And you would have
> to know exactly where to stop the file system restore to recover the
> data - which would require a good knowledge of MySQL table structure.

I understand that.


> Yes, it does. That's its job, after all. But if the tables themselves
> are corrupted, nothing the file system will do will help that. And if
> MySQL can't recover the data because of this, which file system you use
> doesn't make any difference.

Not sure I agree. ext3 enables a quick recovery because there is a
trxlog of the file system itself. In ext2 you can lose files. So there
is a small step forward.


>
> Backups are ALWAYS feasible. And critical if you want to keep your data
> safe. There is no replacement.

In my case backups get outdated every minute or so. There is a lot of
data coming into the DB and leaving it. Also, losing the data from the
last minute or so is not as critical (as opposed to banking systems).
What is critical is losing something like 5%. I know the system is just
different.


> You can get some help by using INNODB tables and enabling the binary
> log. That will allow MySQL to recover from the last good backup by
> rolling the logs forward. There should be little or no loss of data.


For some other reasons InnoDB is not an option. My job is to find out
whether crashing mysql, or the actual hardware mysql is running on, can
lead to a significant amount of data (more than 5%) being lost. From what
I understand here, it can.

Thx a lot, A.

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, whatever)

on 10.11.2006 04:47:48 by Jerry Stuckle

alf wrote:
> Jerry Stuckle wrote:
>
>> That doesn't mean the tables themselves can't be corrupted. For
>> instance if MySQL crashes in the middle of large write operation.
>> Nothing the file system can do to prevent that from happening. And
>> you would have to know exactly where to stop the file system restore
>> to recover the data - which would require a good knowledge of MySQL
>> table structure.
>
>
> I understand that.
>
>
>> Yes, it does. That's its job, after all. But if the tables
>> themselves are corrupted, nothing the file system will do will help
>> that. And if MySQL can't recover the data because of this, which file
>> system you use doesn't make any difference.
>
>
> Not sure I agree. ext3 enables a quick recovery because there is a
> trxlog of the file system itself. In ext2 you can lose files. So there
> is a small step froward.
>

So? If the file itself is corrupted, all it will do is recover a
corrupted file. What's the gain there?

>
>>
>> Backups are ALWAYS feasible. And critical if you want to keep your
>> data safe. There is no replacement.
>
>
> In my case backups get outdated every minute or so. There is a lot of
> data coming into DB and leaving it. Also losing the data from last
> minute or so is not as critical (as opposed to banking systems).
> Critical is losing like 5%. I know the system is just different.
>

Without backups or logs/journals, I don't think ANY RDB can provide the
recovery you want.

>
>> You can get some help by using INNODB tables and enabling the binary
>> log. That will allow MySQL to recover from the last good backup by
>> rolling the logs forward. There should be little or no loss of data.
>
>
>
> For some other reasons INNODB is not an option. My job is to find out if
> crashing the mysql or the actual hardware the mysql is running on can
> lead that significant amount of data (more then 5%) is lost. From what
> I understand from here it is.
>
> Thx a lot, A.
>
>

You have a problem. The file system will be able to recover a file, but
it won't be able to fix a corrupted file. And without backups and
logs/journals, neither MySQL nor any other RDB will be able to guarantee
even 1% recovery - much less 95%.

Let's say MySQL starts to completely rewrite a 100MB table. 10 bytes
into it, MySQL crashes. Your file system will see a 10 byte file and
recover that much. The other 99.99999MB will be lost. And without a
backup and binary logs, MySQL will not be able to recover.

Sure, you might be able to roll forward the file system journal. But
you'll have to know *exactly* where to stop or your database will be
inconsistent. And even if you do figure out *exactly* where to stop,
the database may still not be consistent.

You have the wrong answer to your problem. The RDB must do the
logging/journaling. For MySQL that means INNODB. MSSQL, Oracle, DB2,
etc. all have their versions of logging/journaling, also. And they
still require a backup to start.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, whatever)

on 10.11.2006 12:45:18 by Axel Schwenke

Jerry Stuckle wrote:
> alf wrote:
>>
>> Not sure I agree. ext3 enables a quick recovery because there is a
>> trxlog of the file system itself. In ext2 you can lose files. So there
>> is a small step froward.
>
> So? If the file itself is corrupted, all it will do is recover a
> corrupted file. What's the gain there?

The gain is that you have a chance to recover at all. With no files,
there is *no* way to recover.

However, that's not a real problem. MySQL never touches the datafile
itself once it is created. Only exception: REPAIR TABLE. This will
recreate the datafile (as a new file with the extension .TMD) and then
rename the files.

DELETE just marks a record as deleted (1 bit). INSERT writes a new
record at the end of the datafile (or into a hole, if one exists).
UPDATE is done either in place or as INSERT + DELETE.
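
You can watch this happen with a throwaway table (names here are just an
illustration):

   CREATE TABLE scratch (id INT, payload CHAR(100)) ENGINE=MyISAM;
   INSERT INTO scratch VALUES (1,'a'), (2,'b'), (3,'c');
   DELETE FROM scratch WHERE id = 2;      -- the row is only flagged as deleted
   SHOW TABLE STATUS LIKE 'scratch';      -- Data_free now reflects the hole
   INSERT INTO scratch VALUES (4,'d');    -- reuses the hole instead of growing the file
   OPTIMIZE TABLE scratch;                -- rewrites the datafile via a temp copy, then renames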

Most file operations on MyISAM tables are easier, faster and less
risky, if the table uses fixed length records. Then there is no need to
collapse adjacent unused records into one, UPDATE can be done in place,
there will be no fragmentation and such.

The MyISAM engine is quite simple. Data and index are held in separate
files. Data is structured in records. Whenever a record is modified,
it's written to disk immediately (however the operating system might
cache this). MyISAM never touches records without need. So if mysqld
goes down while in normal operation, only those records can be damaged
that were in use by active UPDATE, DELETE or INSERT operations.

There are two exceptions: REPAIR TABLE and OPTIMIZE TABLE. Both
recreate the datafile with a new name and then switch by renaming.
There is still no chance to lose *both* files.

Indexes are different, though. Indexes are organized in pages and
heavily cached. You can even instruct mysqld to never flush modified
index pages to disk (except at shutdown or cache restructuring).
However indexes can be rebuilt from scratch, without losing data.
The only thing lost is the time needed for recovery.
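
So after a crash the usual drill looks roughly like this (the table name
is just an example; myisamchk or the --myisam-recover server option do
the same job from outside mysqld or automatically at open time):

   CHECK TABLE mytable;            -- detects a crashed/corrupted table
   REPAIR TABLE mytable QUICK;     -- QUICK rebuilds only the index file,
                                   -- the datafile is left untouched
   REPAIR TABLE mytable;           -- full repair if the datafile itself is damaged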


HTH, XL
--
Axel Schwenke, Senior Software Developer, MySQL AB

Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
MySQL User Forums: http://forums.mysql.com/

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, whatever)

on 10.11.2006 13:58:20 by Jerry Stuckle

Axel Schwenke wrote:
> Jerry Stuckle wrote:
>
>>alf wrote:
>>
>>>Not sure I agree. ext3 enables a quick recovery because there is a
>>>trxlog of the file system itself. In ext2 you can lose files. So there
>>>is a small step froward.
>>
>>So? If the file itself is corrupted, all it will do is recover a
>>corrupted file. What's the gain there?
>
>
> The gain is, that you have a chance to recover at all. With no files,
> there is *no* way to recover.
>

What you don't get is that it's not the presence or absence of the files
- it's the CONTENTS of the files that matters. There is very little
chance you will lose the files completely in the case of a crash. There
is a much bigger chance (although admittedly still small) that the files
will be corrupted. And, if you have more than one table, a huge chance
that your database will be inconsistent.

> However, thats not a real problem. MySQL never touches the datafile
> itself once it is created. Only exception: REPAIR TABLE. This will
> recreate the datafile (as new file with extension .TMD) and then
> rename files.
>

Excuse me? MySQL ALWAYS touches the data file. That's where the
information is stored! And it is constantly rewriting the files to disk.

> DELETE just marks a record as deleted (1 bit). INSERT writes a new
> record at the end of the datafile (or into a hole, if one exists).
> UPDATE is done either in place or as INSERT + DELETE.
>

Yes, I know exactly how MySQL works. Yep, and it has to rewrite a
portion of the file to do all of this.

> Most file operations on MyISAM tables are easier, faster and less
> risky, if the table uses fixed length records. Then there is no need to
> collapse adjacent unused records into one, UPDATE can be done in place,
> there will be no fragmentation and such.
>

No, no fragmentation. But what happens if the row spans a disk block
boundary and the system crashes between writes, for instance? Depending
on exactly where the block was split, you could completely screw up that
row, but it could be very difficult to detect. Sure, it's only one row.
But data corruption like this can be much worse than just losing a row.
The latter is easier to determine.

> The MyISAM engine is quite simple. Data and index are held in separate
> files. Data is structured in records. Whenever a record is modified,
> it's written to disk immediately (however the operation system might
> cache this). MyISAM never touches records without need. So if mysqld
> goes down while in normal operation, only those records can be damaged
> that were in use by active UPDATE, DELETE or INSERT operations.
>

But the caching is all too important. It's not unusual to have hundreds
of MB of disk cache in a busy system. That's a lot of data which can be
lost.

> There are two exceptions: REPAIR TABLE and OPTIMIZE TABLE. Both
> recreate the datafile with new name and then switch by renaming.
> There is still no chance to lose *both* files.
>

True - but these are so seldom used it's almost not worth talking about.
And even then it's a good idea to back up the database before repairing
or optimizing it.

> Indexes are different, though. Indexes are organized in pages and
> heavily cached. You can even instruct mysqld to never flush modified
> index pages to disk (except at shutdown or cache restructuring).
> However indexes can be rebuilt from scratch, without losing data.
> The only thing lost is the time needed for recovery.
>

True. But that's not a big concern, is it?

>
> HTH, XL
> --
> Axel Schwenke, Senior Software Developer, MySQL AB
>
> Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
> MySQL User Forums: http://forums.mysql.com/


--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, whatever)

on 10.11.2006 15:21:28 by Axel Schwenke

Jerry Stuckle wrote:
> Axel Schwenke wrote:
>> Jerry Stuckle wrote:
>>
>>>So? If the file itself is corrupted, all it will do is recover a
>>>corrupted file. What's the gain there?
>>
>> The gain is, that you have a chance to recover at all. With no files,
>> there is *no* way to recover.
>
> What you don't get it that it's not the presence or absence of the files
> - it's the CONTENTS of the files that matters.

Agreed. But Alf worried he could lose whole tables aka files.

> There is very little
> chance you will lose the files completely in the case of a crash. There
> is a much bigger (although admittedly still small) that the files will
> be corrupted. And a huge chance if you have more than one table your
> database will be inconsistent.
>
>> However, thats not a real problem. MySQL never touches the datafile
>> itself once it is created. Only exception: REPAIR TABLE. This will
>> recreate the datafile (as new file with extension .TMD) and then
>> rename files.
>
> Excuse me? MySQL ALWAYS touches the data file.

Sorry, I didn't express myself clearly here: MyISAM never touches the
metadata for a data file. The file itself is created with CREATE TABLE.
Later on, data is appended to the file or some block inside the
file is modified. But the file itself stays there and there is
virtually no chance to lose it. So indeed there is no gain from using
a filesystem with metadata journaling (in fact most "journaling"
filesystems use the journal only for metadata).

> And it is constantly rewriting the files to disk.
....
> Yes, I know exactly how MySQL works. Yep, and it has to rewrite a
> portion of the file to do all of this.

What do you call "rewrite"?

Of course MySQL writes modified data. MySQL never reads an otherwise
unmodified record and rewrites it somewhere else.

>> Most file operations on MyISAM tables are easier, faster and less
>> risky, if the table uses fixed length records. Then there is no need to
>> collapse adjacent unused records into one, UPDATE can be done in place,
>> there will be no fragmentation and such.
>
> ... what happens if the row spans a disk and the
> system crashes between writes, for instance? Depending on exactly where
> the block was split, you could completely screw up that row, bug be very
> difficult to detect. Sure, it's only one row. But data corruption like
> this can be much worse than just losing a row. The latter is easier to
> determine.

Agreed. But then again I don't know how *exactly* MyISAM does those
nonatomic writes. One could imagine that the record is first written
with a "this record is invalid" flag set. As soon as the complete
record has been written successfully, this flag is cleared in an atomic
write. I know Monty is very fond of atomic operations.

But still there is no difference to what I said: If mysqld crashes,
there is a good chance that all records that mysqld was writing to
are damaged. Either incomplete or lost or such.

However, there is only very little chance to lose data that was not
written to at the time of the crash.

Dynamic vs. fixed format: Dynamic row format is susceptible to the
following problem: imagine there is a hole between two records that
will be filled by INSERT. The new record contains information about
its used and unused length. While writing the record, mysqld crashes
and garbles the length information. Now this record could look longer
than the original hole and shadow one or more of the following
(otherwise untouched) records. This would be hard to spot. Similar
problems exist with merging holes.

Fixed length records don't have this problem and are therefore more
robust.
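
A sketch of what that looks like in practice (table and column names are
made up): a table declared with only fixed-width columns ends up in Fixed
format, which you can verify afterwards:

   CREATE TABLE events (
     id      INT NOT NULL,
     created TIMESTAMP NOT NULL,
     msg     CHAR(200) NOT NULL
   ) ENGINE=MyISAM ROW_FORMAT=FIXED;

   SHOW TABLE STATUS LIKE 'events';   -- the Row_format column should say "Fixed"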

>> The MyISAM engine is quite simple. Data and index are held in separate
>> files. Data is structured in records. Whenever a record is modified,
>> it's written to disk immediately (however the operation system might
>> cache this). MyISAM never touches records without need. So if mysqld
>> goes down while in normal operation, only those records can be damaged
>> that were in use by active UPDATE, DELETE or INSERT operations.
>
> But the caching is all too important. It's not unusual to have hundreds
> of MB of disk cache in a busy system. That's a lot of data which can be
> lost.

Sure. But this problem was out of scope. We didn't talk about what
happens if the whole machine goes down, only what happens if mysqld
crashes.

Having the whole system crash is also hard for "real" database
engines. I remember several passages in the InnoDB manual about
certain operating systems ignoring O_DIRECT for the tx log. Also
there may be "hidden" caches in disk controllers and in the disks.


XL
--
Axel Schwenke, Senior Software Developer, MySQL AB

Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
MySQL User Forums: http://forums.mysql.com/

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, whatever)

on 10.11.2006 16:59:18 by Toby

Axel Schwenke wrote:
> Jerry Stuckle wrote:
> > Axel Schwenke wrote:
> >> Jerry Stuckle wrote:
> >> ...
> >> The MyISAM engine is quite simple. Data and index are held in separate
> >> files. Data is structured in records. Whenever a record is modified,
> >> it's written to disk immediately (however the operation system might
> >> cache this). MyISAM never touches records without need. So if mysqld
> >> goes down while in normal operation, only those records can be damaged
> >> that were in use by active UPDATE, DELETE or INSERT operations.
> >
> > But the caching is all too important. It's not unusual to have hundreds
> > of MB of disk cache in a busy system. That's a lot of data which can be
> > lost.
>
> Sure. But this problem was out of scope. We didn't talk about what
> happens if the whole machine goes down, only what happens if mysqld
> crashes.
>
> Having the whole system crashing is also hard for "real" database
> engines. I remember several passages in the InnoDB manual about
> certain operating systems ignoring O_DIRECT for the tx log. Also
> there may be "hidden" caches in disk controllers and in the disks.

Indeed. Some references here:
http://groups.google.com/group/comp.unix.solaris/msg/4817a85b71816f98

>
>
> XL
> --
> Axel Schwenke, Senior Software Developer, MySQL AB
>
> Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
> MySQL User Forums: http://forums.mysql.com/

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, whatever)

on 10.11.2006 18:52:43 by Jerry Stuckle

Hi, Axel,

Comments below.

Axel Schwenke wrote:
> Jerry Stuckle wrote:
>
>>Axel Schwenke wrote:
>>
>>>Jerry Stuckle wrote:
>>>
>>>
>>>>So? If the file itself is corrupted, all it will do is recover a
>>>>corrupted file. What's the gain there?
>>>
>>>The gain is, that you have a chance to recover at all. With no files,
>>>there is *no* way to recover.
>>
>>What you don't get it that it's not the presence or absence of the files
>>- it's the CONTENTS of the files that matters.
>
>
> Agreed. But Alf worried he could lose whole tables aka files.
>
>
>>There is very little
>>chance you will lose the files completely in the case of a crash. There
>>is a much bigger (although admittedly still small) that the files will
>>be corrupted. And a huge chance if you have more than one table your
>>database will be inconsistent.
>>
>>
>>>However, thats not a real problem. MySQL never touches the datafile
>>>itself once it is created. Only exception: REPAIR TABLE. This will
>>>recreate the datafile (as new file with extension .TMD) and then
>>>rename files.
>>
>>Excuse me? MySQL ALWAYS touches the data file.
>
>
> Sorry, I didn't express myself clear here: MyISAM never touches the
> metadata for a data file. The file itself is created with CREATE TABLE.
> Later on there is data appended to the file or some block inside the
> file is modified. But the file itself stays there and there is
> virtually no chance to lose it. So indeed there is no gain from using
> a filesystem with metadata journaling (in fact most "journaling"
> filesystems use the journal only for metadata).
>
>
>>And it is constantly rewriting the files to disk.
>
> ...
>
>>Yes, I know exactly how MySQL works. Yep, and it has to rewrite a
>>portion of the file to do all of this.
>
>
> What do you call "rewrite"?
>
> Of cource MySQL writes modified data. MySQL never reads an otherwise
> unmodified record and rewrites it somewhere else.
>

Just what you are calling it. It reads in a block of data and writes it
back out to disk.

Even with variable-length rows, where the new row is longer than the old
one and MySQL appends it to the end of the file, MySQL has to go back
and rewrite the original row to mark it as invalid.

>
>>>Most file operations on MyISAM tables are easier, faster and less
>>>risky, if the table uses fixed length records. Then there is no need to
>>>collapse adjacent unused records into one, UPDATE can be done in place,
>>>there will be no fragmentation and such.
>>
>>... what happens if the row spans a disk and the
>>system crashes between writes, for instance? Depending on exactly where
>>the block was split, you could completely screw up that row, bug be very
>>difficult to detect. Sure, it's only one row. But data corruption like
>>this can be much worse than just losing a row. The latter is easier to
>>determine.
>
>
> Agreed. But then again I don't know how *exactly* MyISAM does those
> nonatomic writes. One could imagine that the record is first written
> with a "this record is invalid" flag set. As soon as the complete
> record was written successfully, this flag is cleared in an atomic
> write. I know Monty is very fond of atomic operations.
>

Part of it is MyISAM. But part of it is the OS, also. For instance,
what happens if the row spans two physical blocks of data which are not
contiguous? In that case the OS has to write the first block, seek to
the next one and write that one.

There isn't anything Monty can do about that, unfortunately.

> But still there is no difference to what I said: If mysqld crashes,
> there is a good chance that all records that mysqld was writing to
> are damaged. Either incomplete or lost or such.
>

That is true.

> However, there is only very little chance to lose data that was not
> written to at the time of the crash.
>

Actually, you would lose all data which wasn't written to the disk.

> Dynamic vs. fixed format: Dynamic row format is susceptible to the
> following problem: imagine there is a hole between two records that
> will be filled by INSERT. The new record contains information about
> its used and unused length. While writing the record, mysqld crashes
> and garbles the length information. Now this record could look longer
> than the original hole and shadow one or more of the following
> (otherwise untouched) records. This would be hard to spot. Similar
> problems exist with merging holes.
>

Yep, a serious problem.

> Fixed length records don't have this problem and are therefore more
> robust.
>

I agree there. But there can be other problems as I noted before. And
a single corrupted row may be worse than a completely crashed dataset
because it's so difficult to find that row. For instance - let's say we
have a bank account number which is a string and spans two blocks.
Someone makes a $10M deposit to your account. In the middle MySQL
crashes. The account number is now incorrect - the first 1/2 has been
written to one block but the 2nd 1/2 never made it out. So it credited
the deposit to my account.

Wait a sec - I LIKE that idea! :-)

>
>>>The MyISAM engine is quite simple. Data and index are held in separate
>>>files. Data is structured in records. Whenever a record is modified,
>>>it's written to disk immediately (however the operation system might
>>>cache this). MyISAM never touches records without need. So if mysqld
>>>goes down while in normal operation, only those records can be damaged
>>>that were in use by active UPDATE, DELETE or INSERT operations.
>>
>>But the caching is all too important. It's not unusual to have hundreds
>>of MB of disk cache in a busy system. That's a lot of data which can be
>>lost.
>
>
> Sure. But this problem was out of scope. We didn't talk about what
> happens if the whole machine goes down, only what happens if mysqld
> crashes.
>
> Having the whole system crashing is also hard for "real" database
> engines. I remember several passages in the InnoDB manual about
> certain operating systems ignoring O_DIRECT for the tx log. Also
> there may be "hidden" caches in disk controllers and in the disks.
>
Agreed it's a problem. Most databases handle this with a log/journal
which writes directly to the file system and doesn't return until the
record is written. Once that is done, the real data is written
asynchronously to the tables.

In that way a crash loses at most the last record written (in the case
of an incomplete journal entry). But it still needs a consistent point
(i.e. a backup) to roll forward the log from.
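
In MySQL terms the knobs that control how hard the server pushes those
log writes to disk are (shown here only for illustration, not as a
recommendation):

   SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';  -- 1 = flush the
                                                          -- redo log at every commit
   SHOW VARIABLES LIKE 'sync_binlog';                     -- 1 = sync the binary
                                                          -- log after every write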

But, as you pointed out, not all OS's support this. They should,
however, for critical data.

And BTW - some even have an option to have their own file system which
is not dependent on the OS at all. They are just provided with a space
on the disk (i.e. a partition) and handle their own I/O completely.
This, obviously, is the most secure because the RDB can handle corrupted
files - they know both the external and internal format for the data.
It's also the most efficient. But it's the hardest to implement.


>
> XL
> --
> Axel Schwenke, Senior Software Developer, MySQL AB
>
> Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
> MySQL User Forums: http://forums.mysql.com/


--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, whatever)

on 10.11.2006 20:04:22 by Toby

Jerry Stuckle wrote:
> Hi, Alex,
>
> Comments below.
>
> Axel Schwenke wrote:
> > Jerry Stuckle wrote:
> >
> >>Axel Schwenke wrote:
> >>
> >>>Jerry Stuckle wrote:
> >>>
> >>>
> >>>>So? If the file itself is corrupted, all it will do is recover a
> >>>>corrupted file. What's the gain there?
> >>>
> >>>The gain is, that you have a chance to recover at all. With no files,
> >>>there is *no* way to recover.
> >>
> >>What you don't get it that it's not the presence or absence of the files
> >>- it's the CONTENTS of the files that matters.
> >
> >
> > Agreed. But Alf worried he could lose whole tables aka files.
> >
> >
> >>There is very little
> >>chance you will lose the files completely in the case of a crash. There
> >>is a much bigger (although admittedly still small) that the files will
> >>be corrupted. And a huge chance if you have more than one table your
> >>database will be inconsistent.
> >>
> >>
> >>>However, thats not a real problem. MySQL never touches the datafile
> >>>itself once it is created. Only exception: REPAIR TABLE. This will
> >>>recreate the datafile (as new file with extension .TMD) and then
> >>>rename files.
> >>
> >>Excuse me? MySQL ALWAYS touches the data file.
> >
> >
> > Sorry, I didn't express myself clear here: MyISAM never touches the
> > metadata for a data file. The file itself is created with CREATE TABLE.
> > Later on there is data appended to the file or some block inside the
> > file is modified. But the file itself stays there and there is
> > virtually no chance to lose it. So indeed there is no gain from using
> > a filesystem with metadata journaling (in fact most "journaling"
> > filesystems use the journal only for metadata).
> >
> >
> >>And it is constantly rewriting the files to disk.
> >
> > ...
> >
> >>Yes, I know exactly how MySQL works. Yep, and it has to rewrite a
> >>portion of the file to do all of this.
> >
> >
> > What do you call "rewrite"?
> >
> > Of cource MySQL writes modified data. MySQL never reads an otherwise
> > unmodified record and rewrites it somewhere else.
> >
>
> Just what you are calling it. It reads in a block of data and writes it
> back out to disk.

Note the words "otherwise unmodified" - i.e. not affected by current
operation.

>
> Even in variable length rows where the new row is longer than the old
> one and MySQL appends it to the end of the file, MySQL has to go back
> and rewrite the original row to mark it as invalid.
>
> >
> >>>Most file operations on MyISAM tables are easier, faster and less
> >>>risky, if the table uses fixed length records. Then there is no need to
> >>>collapse adjacent unused records into one, UPDATE can be done in place,
> >>>there will be no fragmentation and such.
> >>
> >>... what happens if the row spans a disk and the
> >>system crashes between writes, for instance? ...
> >
> >
> > Agreed. But then again I don't know how *exactly* MyISAM does those
> > nonatomic writes. ...
> >
>
> Part of it is MyISAM. But part of it is the OS, also. For instance,
> what happens if the row spans two physical blocks of data which are not
> contiguous? In that case the OS has to write the first block, seek to
> the next one and write that one.
>
> There isn't anything Monty can do about that, unfortunately.
>

MyISAM doesn't claim to be transactional.

> > However, there is only very little chance to lose data that was not
> > written to at the time of the crash.
> >
>
> Actually, you would lose all data which wasn't written to the disk.

Axel means, data *already* written which is not being changed, i.e.
other records.

>
> > Dynamic vs. fixed format: Dynamic row format is susceptible to the
> > following problem: ...
> > Having the whole system crashing is also hard for "real" database
> > engines. I remember several passages in the InnoDB manual about
> > certain operating systems ignoring O_DIRECT for the tx log. Also
> > there may be "hidden" caches in disk controllers and in the disks.
> >
> Agreed it's a problem. Most databases handle this with a log/journal
> which writes directly to the file system and doesn't return until the
> record is written. Once that is done, the real data is written
> asynchronously to the tables.

Yes, but how is this relevant to MyISAM?

> ...
>
>
> >
> > XL
> > --
> > Axel Schwenke, Senior Software Developer, MySQL AB
> >
> > Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
> > MySQL User Forums: http://forums.mysql.com/
>
>
> --
> ==================
> Remove the "x" from my email address
> Jerry Stuckle
> JDS Computer Training Corp.
> jstucklex@attglobal.net
> ==================

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, whatever)

on 10.11.2006 21:28:54 by Jerry Stuckle

toby wrote:
> Jerry Stuckle wrote:
>
>>Hi, Alex,
>>
>>Comments below.
>>
>>Axel Schwenke wrote:
>>
>>>Jerry Stuckle wrote:
>>>
>>>
>>>>Axel Schwenke wrote:
>>>>
>>>>
>>>>>Jerry Stuckle wrote:
>>>>>
>>>>>
>>>>>
>>>>>>So? If the file itself is corrupted, all it will do is recover a
>>>>>>corrupted file. What's the gain there?
>>>>>
>>>>>The gain is, that you have a chance to recover at all. With no files,
>>>>>there is *no* way to recover.
>>>>
>>>>What you don't get it that it's not the presence or absence of the files
>>>>- it's the CONTENTS of the files that matters.
>>>
>>>
>>>Agreed. But Alf worried he could lose whole tables aka files.
>>>
>>>
>>>
>>>>There is very little
>>>>chance you will lose the files completely in the case of a crash. There
>>>>is a much bigger (although admittedly still small) that the files will
>>>>be corrupted. And a huge chance if you have more than one table your
>>>>database will be inconsistent.
>>>>
>>>>
>>>>
>>>>>However, thats not a real problem. MySQL never touches the datafile
>>>>>itself once it is created. Only exception: REPAIR TABLE. This will
>>>>>recreate the datafile (as new file with extension .TMD) and then
>>>>>rename files.
>>>>
>>>>Excuse me? MySQL ALWAYS touches the data file.
>>>
>>>
>>>Sorry, I didn't express myself clear here: MyISAM never touches the
>>>metadata for a data file. The file itself is created with CREATE TABLE.
>>>Later on there is data appended to the file or some block inside the
>>>file is modified. But the file itself stays there and there is
>>>virtually no chance to lose it. So indeed there is no gain from using
>>>a filesystem with metadata journaling (in fact most "journaling"
>>>filesystems use the journal only for metadata).
>>>
>>>
>>>
>>>>And it is constantly rewriting the files to disk.
>>>
>>>...
>>>
>>>
>>>>Yes, I know exactly how MySQL works. Yep, and it has to rewrite a
>>>>portion of the file to do all of this.
>>>
>>>
>>>What do you call "rewrite"?
>>>
>>>Of cource MySQL writes modified data. MySQL never reads an otherwise
>>>unmodified record and rewrites it somewhere else.
>>>
>>
>>Just what you are calling it. It reads in a block of data and writes it
>>back out to disk.
>
>
> Note the words "otherwise unmodified" - i.e. not affected by current
> operation.
>

Depends on your definition of "otherwise unmodified". That sounds like
something different than "unmodified", doesn't it? "Otherwise
unmodified" indicates *something* has changed.

Now - if you just say "MySQL never reads an unmodified record and
rewrites it somewhere else", I will agree.

>
>>Even in variable length rows where the new row is longer than the old
>>one and MySQL appends it to the end of the file, MySQL has to go back
>>and rewrite the original row to mark it as invalid.
>>
>>
>>>>>Most file operations on MyISAM tables are easier, faster and less
>>>>>risky, if the table uses fixed length records. Then there is no need to
>>>>>collapse adjacent unused records into one, UPDATE can be done in place,
>>>>>there will be no fragmentation and such.
>>>>
>>>>... what happens if the row spans a disk and the
>>>>system crashes between writes, for instance? ...
>>>
>>>
>>>Agreed. But then again I don't know how *exactly* MyISAM does those
>>>nonatomic writes. ...
>>>
>>
>>Part of it is MyISAM. But part of it is the OS, also. For instance,
>>what happens if the row spans two physical blocks of data which are not
>>contiguous? In that case the OS has to write the first block, seek to
>>the next one and write that one.
>>
>>There isn't anything Monty can do about that, unfortunately.
>>
>
>
> MyISAM doesn't claim to be transactional.
>

Nope, and I never said it did. But this has nothing to do with
transactions. It has to do with a single row - or even a single column
in one row - being corrupted.

Transactional has to do with multiple operations (generally including
modification of the data) in which all or none must complete. That's
not the case here.

>
>>>However, there is only very little chance to lose data that was not
>>>written to at the time of the crash.
>>>
>>
>>Actually, you would lose all data which wasn't written to the disk.
>
>
> Axel means, data *already* written which is not being changed, i.e.
> other records.
>

Could be. But that's not what he said. He said "not written to...".

Now - if he means data which was not overwritten (or in the process of
being overwritten), then I will agree.

>
>>>Dynamic vs. fixed format: Dynamic row format is susceptible to the
>>>following problem: ...
>>>Having the whole system crashing is also hard for "real" database
>>>engines. I remember several passages in the InnoDB manual about
>>>certain operating systems ignoring O_DIRECT for the tx log. Also
>>>there may be "hidden" caches in disk controllers and in the disks.
>>>
>>
>>Agreed it's a problem. Most databases handle this with a log/journal
>>which writes directly to the file system and doesn't return until the
>>record is written. Once that is done, the real data is written
>>asynchronously to the tables.
>
>
> Yes, but how is this relevant to MyISAM?
>

It goes back to the crux of the original poster's problem. He wants to
use an access method which is not crash-safe and is trying to ensure the
integrity of his data - or at least a major portion of it.

>
>>...
>>
>>
>>
>>>XL
>>>--
>>>Axel Schwenke, Senior Software Developer, MySQL AB
>>>
>>>Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
>>>MySQL User Forums: http://forums.mysql.com/
>>
>>



--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, whatever)

on 10.11.2006 21:33:33 by gordon

>> > Agreed. But Alf worried he could lose whole tables aka files.
>> >
>> >
>> >>There is very little
>> >>chance you will lose the files completely in the case of a crash. There

Ok, I assume here we are talking about a mysqld crash, NOT an OS crash,
a power failure, or a hardware crash, or a hardware malfunction such
as a disk controller that writes on the wrong sectors or writes random
crap to the correct sectors.

WHY did mysqld crash? One plausible scenario is that it has gone
completely bonkers, e.g. because of a buffer-overflow virus attack
or coding error. Scribbled-on code can do anything. It's even more
likely to do something bad if the buffer-overflow was intentional.

So, you have to assume that mysqld can do anything a rogue user-level
process running with the same privileges will do: such as deleting
all the tables, or interpreting SELECT * FROM ... as DELETE FROM
.... Bye, bye, data. Any time you write data, there is a chance
of writing crap instead (buggy daemon code, buggy OS, buggy hardware,
etc.). Any time you write data, there is a chance of its being
written in the wrong place.

The worst case is considerably less ugly if you assume that mysqld
crashes because someone did a kill -9 on the daemon (it suddenly
stops with correct behavior up to the stopping point) and it is
otherwise bug-free.

The worst case is still very bad but the average case is a lot less
ugly if you assume a "clean" interruption of power: writes to the
hard disk just stop at an arbitrary point. (I have one system where
a particular disk partition usually acquires an unreadable sector
if the system crashes due to power interruption, even though 99% of
the time it's sitting there not accessing the disk, read or write).


>> >>is a much bigger (although admittedly still small) that the files will
>> >>be corrupted. And a huge chance if you have more than one table your
>> >>database will be inconsistent.
>> >>
>> >>
>> >>>However, thats not a real problem. MySQL never touches the datafile
>> >>>itself once it is created. Only exception: REPAIR TABLE. This will
>> >>>recreate the datafile (as new file with extension .TMD) and then
>> >>>rename files.

I believe this is incorrect. OPTIMIZE TABLE and ALTER TABLE (under
some circumstances, such as actually changing the schema) will also
do this. But these aren't used very often.

Now consider what happens when you attempt doing this WITH INSUFFICIENT
DISK SPACE for temporarily having two copies. I believe I have
managed to lose a table this way, although it was a scratch table
and not particularly important anyway. And this scenario has usually
"failed cleanly", although it usually leaves the partition out of
disk space so nothing much else works.
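
It's worth getting a rough idea of how much scratch space one of those
operations will need before kicking it off, e.g.:

   SHOW TABLE STATUS LIKE 'mytable';   -- Data_length + Index_length is about
                                       -- the size of the temporary copy that
                                       -- gets written next to the table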

As far as I know there are very few places where MySQL chops a file and
then attempts to re-write it, and these are places where it's re-creating
the file from scratch, with the data already stored in another file
(REPAIR TABLE, OPTIMIZE TABLE, ALTER TABLE, DROP TABLE/CREATE TABLE).
It won't do that for things like mass UPDATE. It may leave some more
unused space in the data file which may be usable later when data is
INSERTed.

>> >>Excuse me? MySQL ALWAYS touches the data file.
>> >
>> >
>> > Sorry, I didn't express myself clear here: MyISAM never touches the
>> > metadata for a data file. The file itself is created with CREATE TABLE.

Writing on a file changes the change-time metadata for the file.
Writing on a file to extend it likely changes the list of blocks
used by a file (if it is extended by enough to add more blocks).

>> > Later on there is data appended to the file or some block inside the
>> > file is modified. But the file itself stays there and there is
>> > virtually no chance to lose it. So indeed there is no gain from using
>> > a filesystem with metadata journaling (in fact most "journaling"
>> > filesystems use the journal only for metadata).
>> >
>> >
>> >>And it is constantly rewriting the files to disk.
>> >
>> > ...
>> >
>> >>Yes, I know exactly how MySQL works. Yep, and it has to rewrite a
>> >>portion of the file to do all of this.
>> >
>> >
>> > What do you call "rewrite"?
>> >
>> > Of cource MySQL writes modified data. MySQL never reads an otherwise
>> > unmodified record and rewrites it somewhere else.

I don't think this is true for operations that copy rows of tables.
But that won't corrupt the source table.

>> >
>>
>> Just what you are calling it. It reads in a block of data and writes it
>> back out to disk.
>
>Note the words "otherwise unmodified" - i.e. not affected by current
>operation.
>
>>
>> Even in variable length rows where the new row is longer than the old
>> one and MySQL appends it to the end of the file, MySQL has to go back
>> and rewrite the original row to mark it as invalid.
>>
>> >
>> >>>Most file operations on MyISAM tables are easier, faster and less
>> >>>risky, if the table uses fixed length records. Then there is no need to
>> >>>collapse adjacent unused records into one, UPDATE can be done in place,
>> >>>there will be no fragmentation and such.
>> >>
>> >>... what happens if the row spans a disk and the
>> >>system crashes between writes, for instance? ...
>> >
>> >
>> > Agreed. But then again I don't know how *exactly* MyISAM does those
>> > nonatomic writes. ...
>> >
>>
>> Part of it is MyISAM. But part of it is the OS, also. For instance,
>> what happens if the row spans two physical blocks of data which are not
>> contiguous? In that case the OS has to write the first block, seek to
>> the next one and write that one.
>>
>> There isn't anything Monty can do about that, unfortunately.
>>
>
>MyISAM doesn't claim to be transactional.
>
>> > However, there is only very little chance to lose data that was not
>> > written to at the time of the crash.
>> >
>>
>> Actually, you would lose all data which wasn't written to the disk.
>
>Axel means, data *already* written which is not being changed, i.e.
>other records.
>
>>
>> > Dynamic vs. fixed format: Dynamic row format is susceptible to the
>> > following problem: ...
>> > Having the whole system crashing is also hard for "real" database
>> > engines. I remember several passages in the InnoDB manual about
>> > certain operating systems ignoring O_DIRECT for the tx log. Also
>> > there may be "hidden" caches in disk controllers and in the disks.
>> >
>> Agreed it's a problem. Most databases handle this with a log/journal
>> which writes directly to the file system and doesn't return until the
>> record is written. Once that is done, the real data is written
>> asynchronously to the tables.
>
>Yes, but how is this relevant to MyISAM?

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, whatever)

on 10.11.2006 21:44:31 by Toby

Jerry Stuckle wrote:
> toby wrote:
> > Jerry Stuckle wrote:
> >
> >>Hi, Alex,
> >>
> >>Comments below.
> >>
> >>Axel Schwenke wrote:
> >>
> >>>Jerry Stuckle wrote:
> >>>
> >>>
> >>>>Axel Schwenke wrote:
> >>>>... MyISAM never touches the
> >>>metadata for a data file. The file itself is created with CREATE TABLE.
> >>>Later on there is data appended to the file or some block inside the
> >>>file is modified. But the file itself stays there and there is
> >>>virtually no chance to lose it. So indeed there is no gain from using
> >>>a filesystem with metadata journaling (in fact most "journaling"
> >>>filesystems use the journal only for metadata).
> >>>
> >>>
> >>>
> >>>>And it is constantly rewriting the files to disk.
> >>>
> >>>...
> >>>
> >>>
> >>>>Yes, I know exactly how MySQL works. Yep, and it has to rewrite a
> >>>>portion of the file to do all of this.
> >>>
> >>>
> >>>What do you call "rewrite"?
> >>>
> >>>Of cource MySQL writes modified data. MySQL never reads an otherwise
> >>>unmodified record and rewrites it somewhere else.
> >>>
> >>
> >>Just what you are calling it. It reads in a block of data and writes it
> >>back out to disk.
> >
> >
> > Note the words "otherwise unmodified" - i.e. not affected by current
> > operation.
> >
>
> Depends on your definition of "otherwise unmodified". That sounds like
> something different than "unmodified", doesn't it? "Otherwise
> unmodified" indicates *something* has changed.
>
> Now - if you just say "MySQL never reads an unmodified record and
> rewrites it somewhere else", I will agree.

I think that's exactly what Axel meant, yes.

>
> >
> >>Even in variable length rows where the new row is longer than the old
> >>one and MySQL appends it to the end of the file, MySQL has to go back
> >>and rewrite the original row to mark it as invalid.
> >>
> >>
> >>>>>Most file operations on MyISAM tables are easier, faster and less
> >>>>>risky, if the table uses fixed length records. Then there is no need to
> >>>>>collapse adjacent unused records into one, UPDATE can be done in place,
> >>>>>there will be no fragmentation and such.
> >>>>
> >>>>... what happens if the row spans a disk and the
> >>>>system crashes between writes, for instance? ...
> >>>
> >>>
> >>>Agreed. But then again I don't know how *exactly* MyISAM does those
> >>>nonatomic writes. ...
> >>>
> >>
> >>Part of it is MyISAM. But part of it is the OS, also. For instance,
> >>what happens if the row spans two physical blocks of data which are not
> >>contiguous? In that case the OS has to write the first block, seek to
> >>the next one and write that one.
> >>
> >>There isn't anything Monty can do about that, unfortunately.
> >>
> >
> >
> > MyISAM doesn't claim to be transactional.
> >
>
> Nope, and I never said it did. But this has nothing to do with
> transactions. It has to do with a single row - or even a single column
> in one row - being corrupted.
>
> Transactional has to do with multiple operations (generally including
> modification of the data) in which all or none must complete. That's
> not the case here.

The problem you describe is solved by transactional engines.

>
> >
> >>>However, there is only very little chance to lose data that was not
> >>>written to at the time of the crash.
> >>>
> >>
> >>Actually, you would lose all data which wasn't written to the disk.
> >
> >
> > Axel means, data *already* written which is not being changed, i.e.
> > other records.
> >
>
> Could be. But that's not what he said. He said "not written to...".
>
> Now - if he means data which was not overwritten (or in the progress of
> being overwritten), then I will agree.

Again, I think that's what he meant.

>
> >
> >>>Dynamic vs. fixed format: Dynamic row format is susceptible to the
> >>>following problem: ...
> >>>Having the whole system crashing is also hard for "real" database
> >>>engines. I remember several passages in the InnoDB manual about
> >>>certain operating systems ignoring O_DIRECT for the tx log. Also
> >>>there may be "hidden" caches in disk controllers and in the disks.
> >>>
> >>
> >>Agreed it's a problem. Most databases handle this with a log/journal
> >>which writes directly to the file system and doesn't return until the
> >>record is written. Once that is done, the real data is written
> >>asynchronously to the tables.
> >
> >
> > Yes, but how is this relevant to MyISAM?
> >
>
> It goes back to the crux of the original poster's problem. He wants to
> use an access method which is not crash-safe and is trying to ensure the
> integrity of his data - or at least a major portion of it.

I guess you/Axel have covered some of the points where this just isn't
possible. OP really ought to consider a different engine, no?

>
> >
> >>...
> >>
> >>
> >>
> >>>XL
> >>>--
> >>>Axel Schwenke, Senior Software Developer, MySQL AB
> >>>
> >>>Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
> >>>MySQL User Forums: http://forums.mysql.com/
> >>
> >>
>
>
>
> --
> ==================
> Remove the "x" from my email address
> Jerry Stuckle
> JDS Computer Training Corp.
> jstucklex@attglobal.net
> ==================

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S,hardware, whatever)

am 10.11.2006 22:08:49 von Jerry Stuckle

toby wrote:
> Jerry Stuckle wrote:
>
>>toby wrote:
>>
>>>Jerry Stuckle wrote:
>>>
>>>
>>>>Hi, Alex,
>>>>
>>>>Comments below.
>>>>
>>>>Axel Schwenke wrote:
>>>>
>>>>
>>>>>Jerry Stuckle wrote:
>>>>>
>>>>>
>>>>>
>>>>>>Axel Schwenke wrote:
>>>>>>... MyISAM never touches the
>>>>>
>>>>>metadata for a data file. The file itself is created with CREATE TABLE.
>>>>>Later on there is data appended to the file or some block inside the
>>>>>file is modified. But the file itself stays there and there is
>>>>>virtually no chance to lose it. So indeed there is no gain from using
>>>>>a filesystem with metadata journaling (in fact most "journaling"
>>>>>filesystems use the journal only for metadata).
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>And it is constantly rewriting the files to disk.
>>>>>
>>>>>...
>>>>>
>>>>>
>>>>>
>>>>>>Yes, I know exactly how MySQL works. Yep, and it has to rewrite a
>>>>>>portion of the file to do all of this.
>>>>>
>>>>>
>>>>>What do you call "rewrite"?
>>>>>
>>>>>Of cource MySQL writes modified data. MySQL never reads an otherwise
>>>>>unmodified record and rewrites it somewhere else.
>>>>>
>>>>
>>>>Just what you are calling it. It reads in a block of data and writes it
>>>>back out to disk.
>>>
>>>
>>>Note the words "otherwise unmodified" - i.e. not affected by current
>>>operation.
>>>
>>
>>Depends on your definition of "otherwise unmodified". That sounds like
>>something different than "unmodified", doesn't it? "Otherwise
>>unmodified" indicates *something* has changed.
>>
>>Now - if you just say "MySQL never reads an unmodified record and
>>rewrites it somewhere else", I will agree.
>
>
> I think that's exactly what Axel meant, yes.
>
>
>>>>Even in variable length rows where the new row is longer than the old
>>>>one and MySQL appends it to the end of the file, MySQL has to go back
>>>>and rewrite the original row to mark it as invalid.
>>>>
>>>>
>>>>
>>>>>>>Most file operations on MyISAM tables are easier, faster and less
>>>>>>>risky, if the table uses fixed length records. Then there is no need to
>>>>>>>collapse adjacent unused records into one, UPDATE can be done in place,
>>>>>>>there will be no fragmentation and such.
>>>>>>
>>>>>>... what happens if the row spans a disk and the
>>>>>>system crashes between writes, for instance? ...
>>>>>
>>>>>
>>>>>Agreed. But then again I don't know how *exactly* MyISAM does those
>>>>>nonatomic writes. ...
>>>>>
>>>>
>>>>Part of it is MyISAM. But part of it is the OS, also. For instance,
>>>>what happens if the row spans two physical blocks of data which are not
>>>>contiguous? In that case the OS has to write the first block, seek to
>>>>the next one and write that one.
>>>>
>>>>There isn't anything Monty can do about that, unfortunately.
>>>>
>>>
>>>
>>>MyISAM doesn't claim to be transactional.
>>>
>>
>>Nope, and I never said it did. But this has nothing to do with
>>transactions. It has to do with a single row - or even a single column
>>in one row - being corrupted.
>>
>>Transactional has to do with multiple operations (generally including
>>modification of the data) in which all or none must complete. That's
>>not the case here.
>
>
> The problem you describe is solved by transactional engines.
>

Yes, it is solved by "transactional engines". But you don't
necessarily need to explicitly use transactions for it. For instance,
InnoDB can protect against that even if you are using autocommit
(which otherwise effectively negates explicit transaction handling).
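
To make that concrete, here is a minimal sketch of what the original
poster could run to get that protection (the table name is invented
for illustration, it is not from this thread):

  -- Convert an existing MyISAM table to InnoDB; from then on every
  -- statement against it is covered by InnoDB's redo log and crash
  -- recovery, even with autocommit left on.
  ALTER TABLE events ENGINE = InnoDB;

  -- Confirm which engine the table is using now.
  SHOW TABLE STATUS LIKE 'events';

Bear in mind the ALTER copies the whole table, so on a million-row
table it takes a while and needs roughly that much free disk space
again while it runs.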

>
>>>>>However, there is only very little chance to lose data that was not
>>>>>written to at the time of the crash.
>>>>>
>>>>
>>>>Actually, you would lose all data which wasn't written to the disk.
>>>
>>>
>>>Axel means, data *already* written which is not being changed, i.e.
>>>other records.
>>>
>>
>>Could be. But that's not what he said. He said "not written to...".
>>
>>Now - if he means data which was not overwritten (or in the progress of
>>being overwritten), then I will agree.
>
>
> Again, I think that's what he meant.
>

It could be. I can only go by what he said. And sometimes English is
not the best language, especially when discussing technical topics.

>
>>>>>Dynamic vs. fixed format: Dynamic row format is susceptible to the
>>>>>following problem: ...
>>>>>Having the whole system crashing is also hard for "real" database
>>>>>engines. I remember several passages in the InnoDB manual about
>>>>>certain operating systems ignoring O_DIRECT for the tx log. Also
>>>>>there may be "hidden" caches in disk controllers and in the disks.
>>>>>
>>>>
>>>>Agreed it's a problem. Most databases handle this with a log/journal
>>>>which writes directly to the file system and doesn't return until the
>>>>record is written. Once that is done, the real data is written
>>>>asynchronously to the tables.
>>>
>>>
>>>Yes, but how is this relevant to MyISAM?
>>>
>>
>>It goes back to the crux of the original poster's problem. He wants to
>>use an access method which is not crash-safe and is trying to ensure the
>>integrity of his data - or at least a major portion of it.
>
>
> I guess you/Axel have covered some of the points where this just isn't
> possible. OP really ought to consider a different engine, no?
>

I agree completely.

Of course, with the additional integrity comes additional overhead.
TANSTAAFL.

>
>>>>...
>>>>
>>>>
>>>>
>>>>
>>>>>XL
>>>>>--
>>>>>Axel Schwenke, Senior Software Developer, MySQL AB
>>>>>
>>>>>Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
>>>>>MySQL User Forums: http://forums.mysql.com/
>>>>
>>>>
>>
>>
>>--
>>==================
>>Remove the "x" from my email address
>>Jerry Stuckle
>>JDS Computer Training Corp.
>>jstucklex@attglobal.net
>>==================
>
>


--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, whatever)

am 10.11.2006 22:14:34 von Toby

Jerry Stuckle wrote:
> toby wrote:
> > Jerry Stuckle wrote:
> >
> >>toby wrote:
> >>
> >>>Jerry Stuckle wrote:
> >>>
> >>>
> >>>>Hi, Alex,
> >>>>
> >>>>Comments below.
> >>>>
> >>>>Axel Schwenke wrote:
> >>>>
> >>>>
> >>>>>Jerry Stuckle wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>>>Axel Schwenke wrote:
> >>>>>>... MyISAM never touches the
> >>>>>
> >>>>>metadata for a data file. The file itself is created with CREATE TABLE.
> >>>>>Later on there is data appended to the file or some block inside the
> >>>>>file is modified. But the file itself stays there and there is
> >>>>>virtually no chance to lose it. So indeed there is no gain from using
> >>>>>a filesystem with metadata journaling (in fact most "journaling"
> >>>>>filesystems use the journal only for metadata).
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>>And it is constantly rewriting the files to disk.
> >>>>>
> >>>>>...
> >>>>>
> >>>>>
> >>>>>
> >>>>>>Yes, I know exactly how MySQL works. Yep, and it has to rewrite a
> >>>>>>portion of the file to do all of this.
> >>>>>
> >>>>>
> >>>>>What do you call "rewrite"?
> >>>>>
> >>>>>Of cource MySQL writes modified data. MySQL never reads an otherwise
> >>>>>unmodified record and rewrites it somewhere else.
> >>>>>
> >>>>
> >>>>Just what you are calling it. It reads in a block of data and writes it
> >>>>back out to disk.
> >>>
> >>>
> >>>Note the words "otherwise unmodified" - i.e. not affected by current
> >>>operation.
> >>>
> >>
> >>Depends on your definition of "otherwise unmodified". That sounds like
> >>something different than "unmodified", doesn't it? "Otherwise
> >>unmodified" indicates *something* has changed.
> >>
> >>Now - if you just say "MySQL never reads an unmodified record and
> >>rewrites it somewhere else", I will agree.
> >
> >
> > I think that's exactly what Axel meant, yes.
> >
> >
> >>>>Even in variable length rows where the new row is longer than the old
> >>>>one and MySQL appends it to the end of the file, MySQL has to go back
> >>>>and rewrite the original row to mark it as invalid.
> >>>>
> >>>>
> >>>>
> >>>>>>>Most file operations on MyISAM tables are easier, faster and less
> >>>>>>>risky, if the table uses fixed length records. Then there is no need to
> >>>>>>>collapse adjacent unused records into one, UPDATE can be done in place,
> >>>>>>>there will be no fragmentation and such.
> >>>>>>
> >>>>>>... what happens if the row spans a disk and the
> >>>>>>system crashes between writes, for instance? ...
> >>>>>
> >>>>>
> >>>>>Agreed. But then again I don't know how *exactly* MyISAM does those
> >>>>>nonatomic writes. ...
> >>>>>
> >>>>
> >>>>Part of it is MyISAM. But part of it is the OS, also. For instance,
> >>>>what happens if the row spans two physical blocks of data which are not
> >>>>contiguous? In that case the OS has to write the first block, seek to
> >>>>the next one and write that one.
> >>>>
> >>>>There isn't anything Monty can do about that, unfortunately.
> >>>>
> >>>
> >>>
> >>>MyISAM doesn't claim to be transactional.
> >>>
> >>
> >>Nope, and I never said it did. But this has nothing to do with
> >>transactions. It has to do with a single row - or even a single column
> >>in one row - being corrupted.
> >>
> >>Transactional has to do with multiple operations (generally including
> >>modification of the data) in which all or none must complete. That's
> >>not the case here.
> >
> >
> > The problem you describe is solved by transactional engines.
> >
>
> Yes, it is solved by by "transactional engines". But you don't
> necessarily need to explicitly use transactions for it. For instance,
> INNODB can protect against that, even if you are using autocommit
> (effectively otherwise negating transactional operations).

An autocommitted statement is no different from any other transaction,
so it benefits from the same machinery, yes.

>
> >
> >>>>>However, there is only very little chance to lose data that was not
> >>>>>written to at the time of the crash.
> >>>>>
> >>>>
> >>>>Actually, you would lose all data which wasn't written to the disk.
> >>>
> >>>
> >>>Axel means, data *already* written which is not being changed, i.e.
> >>>other records.
> >>>
> >>
> >>Could be. But that's not what he said. He said "not written to...".
> >>
> >>Now - if he means data which was not overwritten (or in the progress of
> >>being overwritten), then I will agree.
> >
> >
> > Again, I think that's what he meant.
> >
>
> It could be. I can only go by what he said. And sometimes English is
> not the best language, especially when discussing technical topics.

You apparently had more trouble deciphering his intended meaning than I
did.

>
> >
> >>>>>Dynamic vs. fixed format: Dynamic row format is susceptible to the
> >>>>>following problem: ...
> >>>>>Having the whole system crashing is also hard for "real" database
> >>>>>engines. I remember several passages in the InnoDB manual about
> >>>>>certain operating systems ignoring O_DIRECT for the tx log. Also
> >>>>>there may be "hidden" caches in disk controllers and in the disks.
> >>>>>
> >>>>
> >>>>Agreed it's a problem. Most databases handle this with a log/journal
> >>>>which writes directly to the file system and doesn't return until the
> >>>>record is written. Once that is done, the real data is written
> >>>>asynchronously to the tables.
> >>>
> >>>
> >>>Yes, but how is this relevant to MyISAM?
> >>>
> >>
> >>It goes back to the crux of the original poster's problem. He wants to
> >>use an access method which is not crash-safe and is trying to ensure the
> >>integrity of his data - or at least a major portion of it.
> >
> >
> > I guess you/Axel have covered some of the points where this just isn't
> > possible. OP really ought to consider a different engine, no?
> >
>
> I agree completely.
>
> Of course, with the additional integrity comes additional overhead.
> TANSTAAFL.

Well, each of the engines has a different sweet spot (BDB, Solid, PBXT,
Falcon) and we don't even know if the OP has a performance problem. I
think he only mentioned an integrity problem?

>
> >
> >>>>...
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>>XL
> >>>>>--
> >>>>>Axel Schwenke, Senior Software Developer, MySQL AB
> >>>>>
> >>>>>Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
> >>>>>MySQL User Forums: http://forums.mysql.com/
> >>>>
> >>>>
> >>
> >>
> >>--
> >>==================
> >>Remove the "x" from my email address
> >>Jerry Stuckle
> >>JDS Computer Training Corp.
> >>jstucklex@attglobal.net
> >>==================
> >
> >
>
>
> --
> ==================
> Remove the "x" from my email address
> Jerry Stuckle
> JDS Computer Training Corp.
> jstucklex@attglobal.net
> ==================

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, whatever)

am 11.11.2006 00:20:08 von Axel Schwenke

Guys, could you please try to cut your quotes to a minimum?
Thanks!

Jerry Stuckle wrote:
> toby wrote:
>> Jerry Stuckle wrote:
>>>>>>
>>>>>>Of cource MySQL writes modified data. MySQL never reads an otherwise
>>>>>>unmodified record and rewrites it somewhere else.
>>>>>
>>>>>Just what you are calling it. It reads in a block of data and writes it
>>>>>back out to disk.
>>>>
>>>>Note the words "otherwise unmodified" - i.e. not affected by current
>>>>operation.
>>>
>>>Depends on your definition of "otherwise unmodified". That sounds like
>>>something different than "unmodified", doesn't it? "Otherwise
>>>unmodified" indicates *something* has changed.
>>>
>>>Now - if you just say "MySQL never reads an unmodified record and
>>>rewrites it somewhere else", I will agree.
>>
>> I think that's exactly what Axel meant, yes.

I can confirm that's just what I meant.

>>>>MyISAM doesn't claim to be transactional.
>>>
>>>Nope, and I never said it did. But this has nothing to do with
>>>transactions. It has to do with a single row - or even a single column
>>>in one row - being corrupted.
>>>
>>>Transactional has to do with multiple operations (generally including
>>>modification of the data) in which all or none must complete. That's
>>>not the case here.
>>
>> The problem you describe is solved by transactional engines.
>
> Yes, it is solved by by "transactional engines". But you don't
> necessarily need to explicitly use transactions for it. For instance,
> INNODB can protect against that, even if you are using autocommit
> (effectively otherwise negating transactional operations).

There is no "non-transactional" operation mode if you use InnoDB. If
autocommit is on, each DML statement is one implicit transaction.
And of course each modification of the InnoDB table space is tracked
by the InnoDB TX log.
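
A small illustration of the point, using an invented InnoDB table, in
case it helps the OP:

  -- With autocommit on, this single statement is its own transaction:
  -- InnoDB logs it at commit, so after a crash it is either completely
  -- there or completely rolled back.
  UPDATE accounts SET balance = balance - 100 WHERE id = 42;

  -- An explicit transaction is only needed to make *several* statements
  -- atomic as a group:
  START TRANSACTION;
  UPDATE accounts SET balance = balance - 100 WHERE id = 42;
  UPDATE accounts SET balance = balance + 100 WHERE id = 43;
  COMMIT;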

>>>>>>However, there is only very little chance to lose data that was not
>>>>>>written to at the time of the crash.
>>>>>
>>>>>Actually, you would lose all data which wasn't written to the disk.
>>>>
>>>>Axel means, data *already* written which is not being changed, i.e.
>>>>other records.
>>>
>>>Could be. But that's not what he said. He said "not written to...".
>>>
>>>Now - if he means data which was not overwritten (or in the progress of
>>>being overwritten), then I will agree.
>>
>> Again, I think that's what he meant.

Confirmed again.

/me starts considering that Jerry does not understand what /me means

> It could be. I can only go by what he said. And sometimes English is
> not the best language, especially when discussing technical topics.

Let's switch to German then :-)

>>>>Yes, but how is this relevant to MyISAM?
>>>
>>>It goes back to the crux of the original poster's problem. He wants to
>>>use an access method which is not crash-safe and is trying to ensure the
>>>integrity of his data - or at least a major portion of it.
>>
>> I guess you/Axel have covered some of the points where this just isn't
>> possible. OP really ought to consider a different engine, no?

No.

Alf said he could afford to lose some data. Not 100% of course, but up
to 5% (he said so in <2IednRVjiZF3PM7YnZ2dnUVZ_umdnZ2d@comcast.com>).

So - maybe - MyISAM could be "good enough" for his needs.
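
If he does stay with MyISAM, the usual damage control after an unclean
shutdown is roughly the following (table name invented; note that
REPAIR may silently drop rows it cannot salvage, which is exactly
where his "up to 5%" budget gets spent):

  -- Look for corruption after a crash:
  CHECK TABLE logdata;

  -- If errors are reported, try to repair; unrecoverable rows are lost:
  REPAIR TABLE logdata;

  -- Slower but more thorough variant that rebuilds the index row by row:
  REPAIR TABLE logdata EXTENDED;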


XL
--
Axel Schwenke, Senior Software Developer, MySQL AB

Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
MySQL User Forums: http://forums.mysql.com/

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S,hardware, whatever)

am 11.11.2006 02:57:43 von Jerry Stuckle

Axel Schwenke wrote:
> Guys, could you please try to cut your quotes to a minimum?
> Thanks!
>
> Jerry Stuckle wrote:
>
>
>
> No.
>
> Alf said he could afford to lose some data. Not 100% of course, but up
> to 5% (he said so in <2IednRVjiZF3PM7YnZ2dnUVZ_umdnZ2d@comcast.com>).
>
> So - maybe - MyISAM could be "good enough" for his needs.
>
>

And he also said he could NOT afford to lose all the data. Not in the
case of a MySQL crash, an OS crash, a hardware problem, whatever.

You can't guarantee that with MYISAM.

So *maybe* he can get by. But I wouldn't bet my job on it. And I
wouldn't recommend it to one of my customers in the same situation. If
I did, I would be negligent in my duties as a consultant.

> XL
> --
> Axel Schwenke, Senior Software Developer, MySQL AB
>
> Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
> MySQL User Forums: http://forums.mysql.com/


--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, whatever)

am 11.11.2006 04:19:51 von Toby

Jerry Stuckle wrote:
> Axel Schwenke wrote:
> > Guys, could you please try to cut your quotes to a minimum?
> > Thanks!
> >
> > Jerry Stuckle wrote:
> >
> >
> >
> > No.
> >
> > Alf said he could afford to lose some data. Not 100% of course, but up
> > to 5% (he said so in <2IednRVjiZF3PM7YnZ2dnUVZ_umdnZ2d@comcast.com>).
> >
> > So - maybe - MyISAM could be "good enough" for his needs.
> >
> >
>
> And he also said he could NOT afford to lose all the data. Not in the
> case of a MySQL crash, an OS crash, a hardware problem, whatever.
>
> You can't guarantee that with MYISAM.

Faced with OS or hardware problem, you can't guarantee it with any
engine.

>
> So *maybe* he can get by. But I wouldn't bet my job on it. And I
> wouldn't recommend it to one of my customers in the same situation. If
> I did, I would be negligent in my duties as a consultant.
>
> > XL
> > --
> > Axel Schwenke, Senior Software Developer, MySQL AB
> >
> > Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
> > MySQL User Forums: http://forums.mysql.com/
>
>
> --
> ==================
> Remove the "x" from my email address
> Jerry Stuckle
> JDS Computer Training Corp.
> jstucklex@attglobal.net
> ==================

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S,hardware, whatever)

am 11.11.2006 04:35:08 von Jerry Stuckle

toby wrote:
> Jerry Stuckle wrote:
>
>>Axel Schwenke wrote:
>>
>>>Guys, could you please try to cut your quotes to a minimum?
>>>Thanks!
>>>
>>>Jerry Stuckle wrote:
>>>
>>>
>>>
>>>No.
>>>
>>>Alf said he could afford to lose some data. Not 100% of course, but up
>>>to 5% (he said so in <2IednRVjiZF3PM7YnZ2dnUVZ_umdnZ2d@comcast.com>).
>>>
>>>So - maybe - MyISAM could be "good enough" for his needs.
>>>
>>>
>>
>>And he also said he could NOT afford to lose all the data. Not in the
>>case of a MySQL crash, an OS crash, a hardware problem, whatever.
>>
>>You can't guarantee that with MYISAM.
>
>
> Faced with OS or hardware problem, you can't guarantee it with any
> engine.
>

Actually, for an OS problem, you can. Use an RDB which journals and
take regular backups. Rolling forward from the last valid backup will
restore all committed transactions.

Hardware failure is a little more difficult. At the least you need to
have your database and journal on two different disks with two different
adapters. Better is to also mirror the database and journal with
something like RAID-1 or RAID-10.

It can be guaranteed. Critical databases all use these techniques.
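
In MySQL terms the closest analogue to that journal-plus-backup scheme
is a regular backup plus the binary log (with InnoDB adding its own
redo log underneath). A rough sketch of the SQL side of taking a
consistent snapshot - assuming the binary log is already enabled in
the server configuration:

  -- Block writes and flush tables so the files on disk are consistent:
  FLUSH TABLES WITH READ LOCK;

  -- Note the binary log file and position; a later roll-forward
  -- replays the binary log from this point on top of the backup:
  SHOW MASTER STATUS;

  -- ... copy the data files / run the backup tool here ...

  -- Let normal traffic resume:
  UNLOCK TABLES;

The actual replay after restoring the backup happens outside SQL, with
the mysqlbinlog tool.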

>
>>So *maybe* he can get by. But I wouldn't bet my job on it. And I
>>wouldn't recommend it to one of my customers in the same situation. If
>>I did, I would be negligent in my duties as a consultant.
>>
>>
>>>XL
>>>--
>>>Axel Schwenke, Senior Software Developer, MySQL AB
>>>
>>>Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
>>>MySQL User Forums: http://forums.mysql.com/
>>
>>
>>--
>>==================
>>Remove the "x" from my email address
>>Jerry Stuckle
>>JDS Computer Training Corp.
>>jstucklex@attglobal.net
>>==================
>
>


--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, whatever)

am 11.11.2006 04:49:57 von Toby

Jerry Stuckle wrote:
> toby wrote:
> > Jerry Stuckle wrote:
> >>And he also said he could NOT afford to lose all the data. Not in the
> >>case of a MySQL crash, an OS crash, a hardware problem, whatever.
> >>
> >>You can't guarantee that with MYISAM.
> >
> >
> > Faced with OS or hardware problem, you can't guarantee it with any
> > engine.
> >
>
> Actually, for an OS problem, you can. Use an RDB which journals and
> take regular backups. Rolling forward from the last valid backup will
> restore all committed transactions.

Nope, OS/hardware issue could mean "no journal".

>
> Hardware failure is a little more difficult. At the least you need to
> have your database and journal on two different disks with two different
> adapters. Better is to also mirror the database and journal with
> something like RAID-1 or RAID-10.

Neither of which can protect against certain hardware failures -
everyone has a story about the bulletproof RAID setup which was
scribbled over by a bad controller, or bad cable, or bad power. ZFS
buys a lot more safety (end to end verification).

>
> It can be guaranteed. Critical databases all use these techniques.

I don't trust that word "guaranteed". You need backups in any case. :)

>
> >
> >>So *maybe* he can get by. But I wouldn't bet my job on it. And I
> >>wouldn't recommend it to one of my customers in the same situation. If
> >>I did, I would be negligent in my duties as a consultant.
> >>
> >>
> >>>XL
> >>>--
> >>>Axel Schwenke, Senior Software Developer, MySQL AB
> >>>
> >>>Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
> >>>MySQL User Forums: http://forums.mysql.com/
> >>
> >>
> >>--
> >>==================
> >>Remove the "x" from my email address
> >>Jerry Stuckle
> >>JDS Computer Training Corp.
> >>jstucklex@attglobal.net
> >>==================
> >
> >
>
>
> --
> ==================
> Remove the "x" from my email address
> Jerry Stuckle
> JDS Computer Training Corp.
> jstucklex@attglobal.net
> ==================

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S,hardware, whatever)

am 11.11.2006 05:10:48 von Jerry Stuckle

toby wrote:
> Jerry Stuckle wrote:
>
>>toby wrote:
>>
>>>Jerry Stuckle wrote:
>>>
>>>>And he also said he could NOT afford to lose all the data. Not in the
>>>>case of a MySQL crash, an OS crash, a hardware problem, whatever.
>>>>
>>>>You can't guarantee that with MYISAM.
>>>
>>>
>>>Faced with OS or hardware problem, you can't guarantee it with any
>>>engine.
>>>
>>
>>Actually, for an OS problem, you can. Use an RDB which journals and
>>take regular backups. Rolling forward from the last valid backup will
>>restore all committed transactions.
>
>
> Nope, OS/hardware issue could mean "no journal".
>

Journals are written synchronously, before data is written to the
database. Also, they are preallocated - so there is no change in the
allocation units on the disk. Even an OS problem can't break that. The
worst which can happen is the last transaction isn't completely written
to disk (i.e. the database crashed in the middle of the write).
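
For what it's worth, the MySQL-side knobs that correspond to "the
journal is written synchronously before the data" look like this -
assuming a server where these are dynamic variables (otherwise they go
in my.cnf), and assuming the disks and controllers don't lie about
completed writes, which is the caveat raised elsewhere in this thread:

  -- Flush and fsync the InnoDB log at every transaction commit:
  SET GLOBAL innodb_flush_log_at_trx_commit = 1;

  -- Sync the binary log to disk after every write to it:
  SET GLOBAL sync_binlog = 1;

  -- Double-check the values in effect:
  SHOW GLOBAL VARIABLES LIKE 'innodb_flush_log_at_trx_commit';
  SHOW GLOBAL VARIABLES LIKE 'sync_binlog';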

And RAID-1 and RAID-10 are fault-tolerant mirrored systems. You would
have to lose multiple disks and adapters at exactly the same time to
lose the journal.

>
>>Hardware failure is a little more difficult. At the least you need to
>>have your database and journal on two different disks with two different
>>adapters. Better is to also mirror the database and journal with
>>something like RAID-1 or RAID-10.
>
>
> Neither of which can protect against certain hardware failures -
> everyone has a story about the bulletproof RAID setup which was
> scribbled over by a bad controller, or bad cable, or bad power. ZFS
> buys a lot more safety (end to end verification).
>

I don't know of anyone who has "a story" about these systems where data
was lost on RAID-1 or RAID-10.

These systems duplicate everything. They have multiple controllers.
Separate cables. Even separate power supplies in the most critical
cases. Even a power failure just powers down the device (and takes the
system down).

Also, ZFS doesn't protect against a bad disk, for instance. All it does
is guarantee the data was written properly. A failing controller can
easily overwrite the data at some later time. RAID-1 and RAID-10 could
still have that happen, but what are the chances of two separate
controllers having exactly the same failure at the same time?

I have in the past been involved in some very critical databases. They
all use various RAID devices. And the most critical use RAID-1 or RAID-10.
>
>>It can be guaranteed. Critical databases all use these techniques.
>
>
> I don't trust that word "guaranteed". You need backups in any case. :)
>
>
>>>>So *maybe* he can get by. But I wouldn't bet my job on it. And I
>>>>wouldn't recommend it to one of my customers in the same situation. If
>>>>I did, I would be negligent in my duties as a consultant.
>>>>
>>>>
>>>>
>>>>>XL
>>>>>--
>>>>>Axel Schwenke, Senior Software Developer, MySQL AB
>>>>>
>>>>>Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
>>>>>MySQL User Forums: http://forums.mysql.com/
>>>>


--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, whatever)

am 11.11.2006 06:28:39 von Toby

Jerry Stuckle wrote:
> toby wrote:
> > Jerry Stuckle wrote:
> >
> >>toby wrote:
> >>
> >>>Jerry Stuckle wrote:
> >>>
> >>>>And he also said he could NOT afford to lose all the data. Not in the
> >>>>case of a MySQL crash, an OS crash, a hardware problem, whatever.
> >>>>
> >>>>You can't guarantee that with MYISAM.
> >>>
> >>>
> >>>Faced with OS or hardware problem, you can't guarantee it with any
> >>>engine.
> >>>
> >>
> >>Actually, for an OS problem, you can. Use an RDB which journals and
> >>take regular backups. Rolling forward from the last valid backup will
> >>restore all committed transactions.
> >
> >
> > Nope, OS/hardware issue could mean "no journal".
> >
>
> Journals are written synchronously, before data is written to the
> database. Also, they are preallocated - so there is no change in the
> allocation units on the disk. Even an OS problem can't break that. The
> worst which can happen is the last transaction isn't completely written
> to disk (i.e. the database crashed in the middle of the write).
>
> And RAID-1 and RAID-10 are fault tolerant mirrored systems. You would
> have to lose multiple disks and adapters at exactly the same time to
> loose the journal.

As long as there is a single point of failure (software or firmware bug
for instance)...

>
> >
> >>Hardware failure is a little more difficult. At the least you need to
> >>have your database and journal on two different disks with two different
> >>adapters. Better is to also mirror the database and journal with
> >>something like RAID-1 or RAID-10.
> >
> >
> > Neither of which can protect against certain hardware failures -
> > everyone has a story about the bulletproof RAID setup which was
> > scribbled over by a bad controller, or bad cable, or bad power. ZFS
> > buys a lot more safety (end to end verification).
> >
>
> I don't know of anyone who has "a story" about these systems where data
> was lost on RAID-1 or RAID-10.

It hasn't happened to me either, but it has happened to many others.

>
> These systems duplicate everything. They have multiple controllers.
> Separate cables. Even separate power supplies in the most critical
> cases. Even a power failure just powers down the device (and take the
> system down).
>
> Also, ZFS doesn't protect against a bad disk, for instance. All it does
> is guarantee the data was written properly.

It does considerably better than RAID-1 here, in several ways - by
verifying writes; verifying reads; by healing immediately a data error
is found; and by (optionally) making scrubbing passes to reduce the
possibility of undetected loss (this also works for conventional RAID
of course, subject to error detection limitations).

> A failing controller can
> easily overwrite the data at some later time. RAID-1 and RAID-10 could
> still have that happen, but what are the chances of two separate
> controllers having exactly the same failure at the same time?

The difference is that ZFS will see the problem (checksum) and
automatically salvage the data from the good side, while RAID-1 will
not discover the damage (only reads from one side of the mirror).
Obviously checksumming is the critical difference; RAID-1 is entirely
dependent on the drive correctly signalling errors (correctable or
not); it cannot independently verify data integrity and remains
vulnerable to latent data loss.

>
> I have in the past been involved in some very critical databases. They
> all use various RAID devices. And the most critical use RAID-1 or RAID-10.

We can do even better these days.

Related links of interest:
http://blogs.sun.com/bonwick/
http://blogs.sun.com/relling/entry/zfs_from_a_ras_point
https://www.gelato.unsw.edu.au/archives/comp-arch/2006-September/003008.html
http://www.lockss.org/locksswiki/files/3/30/Eurosys2006.pdf [A Fresh
Look at the Reliability of Long-term Digital Storage, 2006]
http://www.ecsl.cs.sunysb.edu/tr/rpe19.pdf [Challenges of Long-Term
Digital Archiving: A Survey, 2006]
http://www.cs.wisc.edu/~vijayan/vijayan-thesis.pdf [IRON File Systems,
2006]
http://www.tcs.hut.fi/~hhk/phd/phd_Hannu_H_Kari.pdf [Latent Sector
Faults and Reliability of Disk Arrays, 1997]

> >
> >>It can be guaranteed. Critical databases all use these techniques.
> >
> >
> > I don't trust that word "guaranteed". You need backups in any case. :)
> >
> >
> >>>>So *maybe* he can get by. But I wouldn't bet my job on it. And I
> >>>>wouldn't recommend it to one of my customers in the same situation. If
> >>>>I did, I would be negligent in my duties as a consultant.
> >>>>
> >>>>
> >>>>
> >>>>>XL
> >>>>>--
> >>>>>Axel Schwenke, Senior Software Developer, MySQL AB
> >>>>>
> >>>>>Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
> >>>>>MySQL User Forums: http://forums.mysql.com/
> >>>>
>
>
> --
> ==================
> Remove the "x" from my email address
> Jerry Stuckle
> JDS Computer Training Corp.
> jstucklex@attglobal.net
> ==================

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, whatever)

am 11.11.2006 06:33:33 von Toby

toby wrote:
> Jerry Stuckle wrote:
> > ...
> > A failing controller can
> > easily overwrite the data at some later time. RAID-1 and RAID-10 could
> > still have that happen, but what are the chances of two separate
> > controllers having exactly the same failure at the same time?
>
> The difference is that ZFS will see the problem (checksum) and
> automatically salvage the data from the good side, while RAID-1 will
> not discover the damage

I should have added - you don't need *two* failures. You only need *one
silent error* to cause data loss with RAID-1. ZFS is proof against
silent errors, although of course it's still susceptible to multiple
failures (such as both mirrors suffering a whole disk failure without
repair).

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S,hardware, whatever)

am 11.11.2006 15:58:10 von Jerry Stuckle

toby wrote:
> Jerry Stuckle wrote:
>
>>
>>Journals are written synchronously, before data is written to the
>>database. Also, they are preallocated - so there is no change in the
>>allocation units on the disk. Even an OS problem can't break that. The
>>worst which can happen is the last transaction isn't completely written
>>to disk (i.e. the database crashed in the middle of the write).
>>
>>And RAID-1 and RAID-10 are fault tolerant mirrored systems. You would
>>have to lose multiple disks and adapters at exactly the same time to
>>loose the journal.
>
>
> As long as there is a single point of failure (software or firmware bug
> for instance)...
>

They will also handle hardware failures. I have never heard of any loss
of data due to hardware failures on RAID-1 or RAID-10. Can you point to
even one instance?

>
>>>>Hardware failure is a little more difficult. At the least you need to
>>>>have your database and journal on two different disks with two different
>>>>adapters. Better is to also mirror the database and journal with
>>>>something like RAID-1 or RAID-10.
>>>
>>>
>>>Neither of which can protect against certain hardware failures -
>>>everyone has a story about the bulletproof RAID setup which was
>>>scribbled over by a bad controller, or bad cable, or bad power. ZFS
>>>buys a lot more safety (end to end verification).
>>>
>>
>>I don't know of anyone who has "a story" about these systems where data
>>was lost on RAID-1 or RAID-10.
>
>
> It hasn't happened to me either, but it has happened to many others.
>

Specifics? Using RAID-1 or RAID-10?

>
>>These systems duplicate everything. They have multiple controllers.
>>Separate cables. Even separate power supplies in the most critical
>>cases. Even a power failure just powers down the device (and take the
>>system down).
>>
>>Also, ZFS doesn't protect against a bad disk, for instance. All it does
>>is guarantee the data was written properly.
>
>
> It does considerably better than RAID-1 here, in several ways - by
> verifying writes; verifying reads; by healing immediately a data error
> is found; and by (optionally) making scrubbing passes to reduce the
> possibility of undetected loss (this also works for conventional RAID
> of course, subject to error detection limitations).
>

And how does it recover from a disk crash? Or what happens if the data
goes bad after being written and read back?

Additionally, it depends on the software correctly detecting and
signaling a data error.

>
>>A failing controller can
>>easily overwrite the data at some later time. RAID-1 and RAID-10 could
>>still have that happen, but what are the chances of two separate
>>controllers having exactly the same failure at the same time?
>
>
> The difference is that ZFS will see the problem (checksum) and
> automatically salvage the data from the good side, while RAID-1 will
> not discover the damage (only reads from one side of the mirror).
> Obviously checksumming is the critical difference; RAID-1 is entirely
> dependent on the drive correctly signalling errors (correctable or
> not); it cannot independently verify data integrity and remains
> vulnerable to latent data loss.
>

If it's a single sector. But if the entire disk crashes - i.e. an
electronics failure?

But all data is mirrored. And part of the drive's job is to signal
errors. One which doesn't do that correctly isn't much good, is it?

>
>>I have in the past been involved in some very critical databases. They
>>all use various RAID devices. And the most critical use RAID-1 or RAID-10.
>
>
> We can do even better these days.
>
> Related links of interest:
> http://blogs.sun.com/bonwick/
> http://blogs.sun.com/relling/entry/zfs_from_a_ras_point
> https://www.gelato.unsw.edu.au/archives/comp-arch/2006-September/003008.html
> http://www.lockss.org/locksswiki/files/3/30/Eurosys2006.pdf [A Fresh
> Look at the Reliability of Long-term Digital Storage, 2006]
> http://www.ecsl.cs.sunysb.edu/tr/rpe19.pdf [Challenges of Long-Term
> Digital Archiving: A Survey, 2006]
> http://www.cs.wisc.edu/~vijayan/vijayan-thesis.pdf [IRON File Systems,
> 2006]
> http://www.tcs.hut.fi/~hhk/phd/phd_Hannu_H_Kari.pdf [Latent Sector
> Faults and Reliability of Disk Arrays, 1997]
>

So? I don't see anything in any of these articles which affects this
discussion. We're not talking about long term digital storage, for
instance.

I'm just curious. How many critical database systems have you actually
been involved with? I've lost count. When I worked for IBM, we had
banks, insurance companies, etc., all with critical databases as
customers. Probably the largest I ever worked with was a major U.S.
airline reservation system.

These systems are critical to their business. The airline database
averaged tens of thousands of operations per second. This is a critical
system. Can you imagine what would happen if they lost even 2 minutes
of reservations? Especially with today's electronic ticketing systems?
And *never* have I seen (or heard of) a loss of data other than what
was being currently processed.

BTW - NONE of them use zfs - because these are mainframe systems, not
Linux. But they all use the mainframe versions of RAID-1 or RAID-10.

In any case - this is way off topic for this newsgroup. The original
question was "Can I prevent the loss of a significant portion of my data
in the case of a MySQL, OS or hardware failure, when using MyISAM?".

The answer is no.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S,hardware, whatever)

am 11.11.2006 16:01:20 von Jerry Stuckle

toby wrote:
> toby wrote:
>
>>Jerry Stuckle wrote:
>>
>>>...
>>>A failing controller can
>>>easily overwrite the data at some later time. RAID-1 and RAID-10 could
>>>still have that happen, but what are the chances of two separate
>>>controllers having exactly the same failure at the same time?
>>
>>The difference is that ZFS will see the problem (checksum) and
>>automatically salvage the data from the good side, while RAID-1 will
>>not discover the damage
>
>
> I should have added - you don't need *two* failures. You only need *one
> silent error* to cause data loss with RAID-1. ZFS is proof against
> silent errors, although of course it's still susceptible to multiple
> failures (such as both mirrors suffering a whole disk failure without
> repair).
>

ZFS is not proof against silent errors - they can still occur. It is
possible for it to miss an error, also. Plus it is not proof against
data decaying after it is written to disk. And, as you note, it doesn't
handle a disk crash.

But when properly implemented, RAID-1 and RAID-10 will detect and
correct even more errors than ZFS will.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, whatever)

am 11.11.2006 17:31:42 von Toby

Jerry Stuckle wrote:
> toby wrote:
> > Jerry Stuckle wrote:
> >
> >>
> >>Journals are written synchronously, ...
> >>
> >>And RAID-1 and RAID-10 are fault tolerant mirrored systems. You would
> >>have to lose multiple disks and adapters at exactly the same time to
> >>loose the journal.
> >
> >
> > As long as there is a single point of failure (software or firmware bug
> > for instance)...
> >
>
> They will also handle hardware failures. I have never heard of any loss
> of data due to hardware failures on RAID-1 or RAID-10. Can you point to
> even one instance?

There are several examples of such hardware failures in the links
cited, but I'll crosspost this to comp.arch.storage - I'll eat my hat
if no-one there has seen a RAID data loss.

>
> >
> >>...
> >>I don't know of anyone who has "a story" about these systems where data
> >>was lost on RAID-1 or RAID-10.
> >
> >
> > It hasn't happened to me either, but it has happened to many others.
> >
>
> Specifics? Using RAID-1 or RAID-10?
>
> >
> >>These systems duplicate everything. They have multiple controllers.
> >>Separate cables. Even separate power supplies in the most critical
> >>cases. Even a power failure just powers down the device (and take the
> >>system down).
> >>
> >>Also, ZFS doesn't protect against a bad disk, for instance. All it does
> >>is guarantee the data was written properly.
> >
> >
> > It does considerably better than RAID-1 here, in several ways - by
> > verifying writes; verifying reads; by healing immediately a data error
> > is found; and by (optionally) making scrubbing passes to reduce the
> > possibility of undetected loss (this also works for conventional RAID
> > of course, subject to error detection limitations).
> >
>
> And how does it recover from a disk crash? Or what happens if the data
> goes bad after being written and read back?

You use the redundancy to repair it. RAID-1 does not do this.

>
> Additionally, it depends on the software correctly detecting and
> signaling a data error.

Which RAID-1 cannot do at all.

>
> >
> >>A failing controller can
> >>easily overwrite the data at some later time. RAID-1 and RAID-10 could
> >>still have that happen, but what are the chances of two separate
> >>controllers having exactly the same failure at the same time?
> >
> >
> > The difference is that ZFS will see the problem (checksum) and
> > automatically salvage the data from the good side, while RAID-1 will
> > not discover the damage (only reads from one side of the mirror).
> > Obviously checksumming is the critical difference; RAID-1 is entirely
> > dependent on the drive correctly signalling errors (correctable or
> > not); it cannot independently verify data integrity and remains
> > vulnerable to latent data loss.
> >
>
> If it's a single sector. But if the entire disk crashes - i.e. an
> electronics failure?

That's right, it cannot bring a dead disk back to life...

>
> But all data is mirrored. And part of the drive's job is to signal
> errors. One which doesn't do that correctly isn't much good, is it/

You're right that RAID-1 is built on the assumption that drives
perfectly report errors. ZFS isn't.

As Richard Elling writes, "We don't have to rely on a parity protected
SCSI bus, or a bug-free disk firmware (I've got the scars) to ensure
that what is on persistent storage is what we get in memory. ... by
distrusting everything in the storage data path we will build in the
reliability and redundancy into the file system."

>
> >
> >>I have in the past been involved in some very critical databases. They
> >>all use various RAID devices. And the most critical use RAID-1 or RAID-10.
> >
> >
> > We can do even better these days.
> >
> > Related links of interest:
> > http://blogs.sun.com/bonwick/
> > http://blogs.sun.com/relling/entry/zfs_from_a_ras_point
> > https://www.gelato.unsw.edu.au/archives/comp-arch/2006-September/003008.html
> > http://www.lockss.org/locksswiki/files/3/30/Eurosys2006.pdf [A Fresh
> > Look at the Reliability of Long-term Digital Storage, 2006]
> > http://www.ecsl.cs.sunysb.edu/tr/rpe19.pdf [Challenges of Long-Term
> > Digital Archiving: A Survey, 2006]
> > http://www.cs.wisc.edu/~vijayan/vijayan-thesis.pdf [IRON File Systems,
> > 2006]
> > http://www.tcs.hut.fi/~hhk/phd/phd_Hannu_H_Kari.pdf [Latent Sector
> > Faults and Reliability of Disk Arrays, 1997]
> >
>
> So? I don't see anything in any of these articles which affects this
> discussion. We're not talking about long term digital storage, for
> instance.

I think that's quite relevant to many "business critical" database
systems. Databases are even evolving in response to changing
*regulatory* requirements: MySQL's ARCHIVE engine, for instance...

> I'm just curious. How many critical database systems have you actually
> been involved with? I've lost count. ...
> These systems are critical to their business. ...

None of this is relevant to what I'm trying to convey, which is simply:
What ZFS does beyond RAID.

Why are you taking the position that they are equivalent? There are
innumerable failure modes that RAID(-1) cannot handle, which ZFS does.

>
> BTW - NONE of them use zfs - because these are mainframe systems, not
> Linux. But they all use the mainframe versions of RAID-1 or RAID-10.

I still claim - along with Sun - that you can, using more modern
software, improve on the integrity and availability guarantees of
RAID-1. This applies equally to the small systems I specify (say, a
small mirrored disk server storing POS account data) as to their
humongous storage arrays.

>
> In any case - this is way off topic for this newsgroup. The original
> question was "Can I prevent the loss of a significant portion of my data
> in the case of a MySQL, OS or hardware failure, when using MyISAM?".
>
> The answer is no.
>
> --
> ==================
> Remove the "x" from my email address
> Jerry Stuckle
> JDS Computer Training Corp.
> jstucklex@attglobal.net
> ==================

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S,hardware, whatever)

am 11.11.2006 19:55:25 von Bill Todd

Jerry Stuckle wrote:

....

> ZFS is not proof against silent errors - they can still occur.

Of course they can, but they will be caught by the background
verification scrubbing before much time passes (i.e., within a time
window that radically reduces the likelihood that another disk will fail
before the error is caught and corrected), unlike the case with
conventional RAID (where they aren't caught at all, and rise up to bite
you - with non-negligible probability these days - if the good copy then
dies).

And ZFS *is* proof against silent errors in the sense that data thus
mangled will not be returned to an application (i.e., it will be caught
when read if the background integrity validation has not yet reached it)
- again, unlike the case with conventional mirroring, where there's a
good chance that it will be returned to the application as good.

> It is possible for it to miss an error, also.

It is also possible for all the air molecules in your room to decide -
randomly - to congregate in the corner, and for you to be asphyxiated.
Most people needn't worry about probabilities of these magnitudes.

> Plus it is not proof against data decaying after it is written to disk.

No - but, again, it will catch it before long, even in cases where
conventional disk scrubbing would not.

> And, as you note, it doesn't handle a disk crash.

It handles it with resilience comparable to RAID-1, but is more flexible
in that it can then use distributed free space to restore the previous
level of redundancy (whereas RAID-1/RAID-10 cannot unless the number of
configured hot spare disks equals the number of failed disks).

>
> But when properly implemented, RAID-1 and RAID-10 will detect and
> correct even more errors than ZFS will.

Please name even one.

- bill

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S,hardware, whatever)

am 11.11.2006 19:57:18 von gordonb.kap3m

>>>Actually, for an OS problem, you can. Use an RDB which journals and
>>>take regular backups. Rolling forward from the last valid backup will
>>>restore all committed transactions.
>>
>>
>> Nope, OS/hardware issue could mean "no journal".

OS issue could mean "no disk writes", PERIOD.

>Journals are written synchronously, before data is written to the
>database. Also, they are preallocated - so there is no change in the
>allocation units on the disk. Even an OS problem can't break that.

Yes, it can. An OS can decide not to write data at all. (consider
an anti-virus program that monitors disk writes hooked into the
OS). Or, at any time, it can erase all the data. (Consider
accidentally zeroing out a sector containing inodes, including a
file someone else was using and the journal and some table .MYD
files. Oh, yes, remember READING a file modifies the inode (accessed
time)). Or it could write any data over the last sector read (which
might be part of the mysqld executable).

>The
>worst which can happen is the last transaction isn't completely written
>to disk (i.e. the database crashed in the middle of the write).

When you think worst-case OS failure, think VIRUS. When you think
worst-case hardware failure, think EXPLOSION. Or FIRE. Or disk
head crash. When you think worst-case power-failure situation,
think burned-out circuits. Or erased track.


>And RAID-1 and RAID-10 are fault tolerant mirrored systems. You would
>have to lose multiple disks and adapters at exactly the same time to
>loose the journal.

Only if the OS writes it in the first place, and the RAID controller
isn't broken, and the OS doesn't scribble over it later.

>>>Hardware failure is a little more difficult. At the least you need to
>>>have your database and journal on two different disks with two different
>>>adapters.

On different planets.

>>>Better is to also mirror the database and journal with
>>>something like RAID-1 or RAID-10.
>>
>>
>> Neither of which can protect against certain hardware failures -
>> everyone has a story about the bulletproof RAID setup which was
>> scribbled over by a bad controller, or bad cable, or bad power. ZFS
>> buys a lot more safety (end to end verification).
>>
>
>I don't know of anyone who has "a story" about these systems where data
>was lost on RAID-1 or RAID-10.

It's certainly possible to rapidly lose data when you type in an
UPDATE or DELETE query and accidentally hit semicolon and ENTER just
before you were going to type the WHERE clause. RAID (or MySQL's
replication setup) does just what it's supposed to do and updates
all the copies with the bad data.

I'm not trying to discourage use of RAID. It can save your butt in
lots of situations. But it doesn't work miracles.

>These systems duplicate everything. They have multiple controllers.
>Separate cables. Even separate power supplies in the most critical
>cases. Even a power failure just powers down the device (and take the
>system down).

>Also, ZFS doesn't protect against a bad disk, for instance. All it does
>is guarantee the data was written properly. A failing controller can
>easily overwrite the data at some later time. RAID-1 and RAID-10 could
>still have that happen, but what are the chances of two separate
>controllers having exactly the same failure at the same time?
>
>I have in the past been involved in some very critical databases. They
>all use various RAID devices. And the most critical use RAID-1 or RAID-10.
>>
>>>It can be guaranteed. Critical databases all use these techniques.
>>
>>
>> I don't trust that word "guaranteed". You need backups in any case. :)

ZFS vs RAID, was Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, what

am 11.11.2006 20:09:04 von Toby

Jerry Stuckle wrote:
> toby wrote:
> > toby wrote:
> >
> >>Jerry Stuckle wrote:
> >>
> >>>...
> >>>A failing controller can
> >>>easily overwrite the data at some later time. RAID-1 and RAID-10 could
> >>>still have that happen, but what are the chances of two separate
> >>>controllers having exactly the same failure at the same time?
> >>
> >>The difference is that ZFS will see the problem (checksum) and
> >>automatically salvage the data from the good side, while RAID-1 will
> >>not discover the damage
> >
> >
> > I should have added - you don't need *two* failures. You only need *one
> > silent error* to cause data loss with RAID-1. ZFS is proof against
> > silent errors, although of course it's still susceptible to multiple
> > failures (such as both mirrors suffering a whole disk failure without
> > repair).
> >
>
> ZFS is not proof against silent errors - they can still occur. It is
> possible for it to miss an error, also. Plus it is not proof against
> data decaying after it is written to disk.

Actually both capabilities are among its strongest features.

Clearly you haven't read or understood any of the publicly available
information about it, so I'm not going to pursue this any further
beyond relating an analogy:
You will likely live longer if you look both ways before crossing
the road, rather than walking straight across without looking because
"cars will stop".

> ...
> But when properly implemented, RAID-1 and RAID-10 will detect and
> correct even more errors than ZFS will.

I'll let those with more patience refute this.

>
> --
> ==================
> Remove the "x" from my email address
> Jerry Stuckle
> JDS Computer Training Corp.
> jstucklex@attglobal.net
> ==================

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in caseof crash (mysql, O/S, hardware, w

am 11.11.2006 20:22:25 von Rich Teer

On Sat, 11 Nov 2006, toby wrote:

> Jerry Stuckle wrote:
> > ...
> > But when properly implemented, RAID-1 and RAID-10 will detect and
> > correct even more errors than ZFS will.
>
> I'll let those with more patience refute this.

Jerry, what are you smoking? Do you actually know what ZFS is? And if
so, what happens when, in the context of your assertion I quoted above,
ZFS is used to implement RAID 1 and RAID 10 (which, incidentally, it is
VERY frequently used to do)?

I agree with Toby: you need to read a bit more about ZFS. If you're
a storage nut (meant in a non-disparaging way!), I think you'll like
what you read.

--
Rich Teer, SCSA, SCNA, SCSECA, OpenSolaris CAB member

President,
Rite Online Inc.

Voice: +1 (250) 979-1638
URL: http://www.rite-group.com/rich

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S,hardware, whatever)

am 12.11.2006 03:21:36 von Jerry Stuckle

Bill Todd wrote:
> Jerry Stuckle wrote:
>
> ...
>
>> ZFS is not proof against silent errors - they can still occur.
>
>
> Of course they can, but they will be caught by the background
> verification scrubbing before much time passes (i.e., within a time
> window that radically reduces the likelihood that another disk will fail
> before the error is caught and corrected), unlike the case with
> conventional RAID (where they aren't caught at all, and rise up to bite
> you - with non-negligible probability these days - if the good copy then
> dies).
>
> And ZFS *is* proof against silent errors in the sense that data thus
> mangled will not be returned to an application (i.e., it will be caught
> when read if the background integrity validation has not yet reached it)
> - again, unlike the case with conventional mirroring, where there's a
> good chance that it will be returned to the application as good.
>


The same is true with RAID-1 and RAID-10. An error on the disk will be
detected by the hardware and reported to the OS.


>> It is possible for it to miss an error, also.
>
>
> It is also possible for all the air molecules in your room to decide -
> randomly - to congregate in the corner, and for you to be asphyxiated.
> Most people needn't worry about probabilities of these magnitudes.
>

About the same chances of ZFS missing an error as RAID-1 or RAID-10.
The big difference being ZFS is done in software, which requires CPU
cycles and other resources. It's also open to corruption. RAID-1 and
RAID-10 are implemented in hardware/firmware which cannot be corrupted
(read-only memory) and require no CPU cycles.

>> Plus it is not proof against data decaying after it is written to disk.
>
>
> No - but, again, it will catch it before long, even in cases where
> conventional disk scrubbing would not.
>

So do RAID-1 and RAID-10.

>> And, as you note, it doesn't handle a disk crash.
>
>
> It handles it with resilience comparable to RAID-1, but is more flexible
> in that it can then use distributed free space to restore the previous
> level of redundancy (whereas RAID-1/RAID-10 cannot unless the number of
> configured hot spare disks equals the number of failed disks).
>

And for a critical system you have that redundancy and more.

>>
>> But when properly implemented, RAID-1 and RAID-10 will detect and
>> correct even more errors than ZFS will.
>

A complete disk crash, for instance. Even Toby admitted ZFS cannot
recover from a disk crash.

ZFS is good. But it's a cheap software implementation of an expensive
hardware recovery system. And there is no way software can do it as
well as hardware does.

That's why all critical systems use hardware systems such as RAID-1 and
RAID-10.


>
> Please name even one.
>
> - bill


--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, w

am 12.11.2006 03:30:46 von Jerry Stuckle

toby wrote:
> Jerry Stuckle wrote:
>
>>toby wrote:
>>
>>>toby wrote:
>>>
>>>
>>>>Jerry Stuckle wrote:
>>>>
>>>>
>>>>>...
>>>>>A failing controller can
>>>>>easily overwrite the data at some later time. RAID-1 and RAID-10 could
>>>>>still have that happen, but what are the chances of two separate
>>>>>controllers having exactly the same failure at the same time?
>>>>
>>>>The difference is that ZFS will see the problem (checksum) and
>>>>automatically salvage the data from the good side, while RAID-1 will
>>>>not discover the damage
>>>
>>>
>>>I should have added - you don't need *two* failures. You only need *one
>>>silent error* to cause data loss with RAID-1. ZFS is proof against
>>>silent errors, although of course it's still susceptible to multiple
>>>failures (such as both mirrors suffering a whole disk failure without
>>>repair).
>>>
>>
>>ZFS is not proof against silent errors - they can still occur. It is
>>possible for it to miss an error, also. Plus it is not proof against
>>data decaying after it is written to disk.
>
>
> Actually both capabilities are among its strongest features.
>
> Clearly you haven't read or understood any of the publicly available
> information about it, so I'm not going to pursue this any further
> beyond relating an analogy:
> You will likely live longer if you look both ways before crossing
> the road, rather than walking straight across without looking because
> "cars will stop".
>

Actually, I understand quite a bit about ZFS. However, unlike you, I
also understand its shortcomings. That's because I started working on
fault-tolerant drive systems in 1977 as a hardware CE for IBM, working on
large mainframes. I've watched the technology grow over the years. And
as an EE major, I also understand the hardware and its strengths and
weaknesses - in detail.

And as a CS major (dual majors) and programmer since 1867, including
working on system software for IBM in the 1980s, I have a thorough
understanding of the software end.

And it's obvious from your statements you have no real understanding of
either, other than sales literature.

>
>>...
>>But when properly implemented, RAID-1 and RAID-10 will detect and
>>correct even more errors than ZFS will.
>
>
> I'll let those with more patience refute this.
>

And more knowledge of the real facts?

BTW - I took out all those extra newsgroups you added. If I wanted to
discuss things there I would have added them myself.


But I'm also not going to discuss this any more with you. I'd really
rather have discussions with someone who really knows the internals - of
both systems.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, w

am 12.11.2006 03:33:54 von Jerry Stuckle

Rich Teer wrote:
> On Sat, 11 Nov 2006, toby wrote:
>
>
>>Jerry Stuckle wrote:
>>
>>>...
>>>But when properly implemented, RAID-1 and RAID-10 will detect and
>>>correct even more errors than ZFS will.
>>
>>I'll let those with more patience refute this.
>
>
> Jerry, what are you smoking? Do you actually know what ZFS is, and
> if so what if, in the context of your assertion I quoted above, ZFS
> is used to implement RAID 1 and RAID 10 (which, incidentally, it is
> VERY frequently used to do)?
>
> I agree with Toby: you need to read a bit more about ZFS. If you're
> a storage nut (meant in a non-disparaging way!), I think you'll like
> what you read.
>

I'm not smoking anything.

REAL RAID-1 and RAID-10 are implemented in hardware/firmware, not a
software system such as ZFS.

Of course, there are some systems out there which CLAIM to be RAID-1 or
RAID-10, but implement them in software such as ZFS. What they really
are is RAID-1/RAID-10 compliant.

And BTW - I've taken out the extra newsgroups. They have nothing to do
with this discussion.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware,

am 12.11.2006 03:34:10 von Toby

Jerry Stuckle wrote:
> toby wrote:
> > Jerry Stuckle wrote:
> >
> >>toby wrote:
> >>
> >>>toby wrote:
> >>>
> >>>
> >>>>Jerry Stuckle wrote:
> >>>>
> >>>>
> >>>>>...
> >>>>>A failing controller can
> >>>>>easily overwrite the data at some later time. RAID-1 and RAID-10 could
> >>>>>still have that happen, but what are the chances of two separate
> >>>>>controllers having exactly the same failure at the same time?
> >>>>
> >>>>The difference is that ZFS will see the problem (checksum) and
> >>>>automatically salvage the data from the good side, while RAID-1 will
> >>>>not discover the damage
> >>>
> >>>
> >>>I should have added - you don't need *two* failures. You only need *one
> >>>silent error* to cause data loss with RAID-1. ZFS is proof against
> >>>silent errors, although of course it's still susceptible to multiple
> >>>failures (such as both mirrors suffering a whole disk failure without
> >>>repair).
> >>>
> >>
> >>ZFS is not proof against silent errors - they can still occur. It is
> >>possible for it to miss an error, also. Plus it is not proof against
> >>data decaying after it is written to disk.
> >
> >
> > Actually both capabilities are among its strongest features.
> >
> > Clearly you haven't read or understood any of the publicly available
> > information about it, so I'm not going to pursue this any further
> > beyond relating an analogy:
> > You will likely live longer if you look both ways before crossing
> > the road, rather than walking straight across without looking because
> > "cars will stop".
> >
>
> Actually, I understand quite a bit about ZFS. However, unlike you, I
> also understand its shortcomings. That's because I started working on
> fault-tolerant drive systems starting in 1977 as a hardware CE for IBM,
> working on large mainframes. I've watched it grow over the years. And
> as a EE major, I also understand the hardware and it's strengths and
> weaknesses - in detail.
>
> And as a CS major (dual majors) and programmer since 1867, including
> working on system software for IBM in the 1980's I have a thorough
> understanding of the software end.
>
> And it's obvious from your statements you have no real understanding or
> either, other than sales literature.

This isn't about a battle of the egos. I was challenging what seemed to
be factual misunderstandings of ZFS relative to RAID. Perhaps we're
talking at cross purposes; you had trouble getting Axel's point also...

>
> >
> >>...
> >>But when properly implemented, RAID-1 and RAID-10 will detect and
> >>correct even more errors than ZFS will.
> >
> >
> > I'll let those with more patience refute this.
> >
>
> And more knowledge of the real facts?
>
> BTW - I took out all those extra newsgroups you added. If I wanted to
> discuss things there I would have added them myself.
>
>
> But I'm also not going to discuss this any more with you. I'd really
> rather have discussions with someone who really knows the internals - of
> both systems.

You'll find them in the newsgroups you snipped, not here. I'm sorry
things degenerated to this point, but I stand by my corrections of your
strange views on ZFS' capabilities.

>
> --
> ==================
> Remove the "x" from my email address
> Jerry Stuckle
> JDS Computer Training Corp.
> jstucklex@attglobal.net
> ==================

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S,hardware, whatever)

am 12.11.2006 03:51:41 von Jerry Stuckle

toby wrote:
> Jerry Stuckle wrote:
>
>>toby wrote:
>>
>>>Jerry Stuckle wrote:
>>>
>>>
>>>>Journals are written synchronously, ...
>>>>
>>>>And RAID-1 and RAID-10 are fault tolerant mirrored systems. You would
>>>>have to lose multiple disks and adapters at exactly the same time to
>>>>loose the journal.
>>>
>>>
>>>As long as there is a single point of failure (software or firmware bug
>>>for instance)...
>>>
>>
>>They will also handle hardware failures. I have never heard of any loss
>>of data due to hardware failures on RAID-1 or RAID-10. Can you point to
>>even one instance?
>
>
> There are several examples of such hardware failures in the links
> cited, but I'll crosspost this to comp.arch.storage - I'll eat my hat
> if no-one there has seen a RAID data loss.
>

I've seen those links. I have yet to see where there was any loss of
data proven. Some conjectures in blogs, for instance. But I want to
see documented facts.

And I've removed the cross-post. If I want a discussion in
comp.arch.storage, I will post in it.

>
>>>>...
>>>>I don't know of anyone who has "a story" about these systems where data
>>>>was lost on RAID-1 or RAID-10.
>>>
>>>
>>>It hasn't happened to me either, but it has happened to many others.
>>>
>>
>>Specifics? Using RAID-1 or RAID-10?
>>
>>
>>>>These systems duplicate everything. They have multiple controllers.
>>>>Separate cables. Even separate power supplies in the most critical
>>>>cases. Even a power failure just powers down the device (and take the
>>>>system down).
>>>>
>>>>Also, ZFS doesn't protect against a bad disk, for instance. All it does
>>>>is guarantee the data was written properly.
>>>
>>>
>>>It does considerably better than RAID-1 here, in several ways - by
>>>verifying writes; verifying reads; by healing immediately a data error
>>>is found; and by (optionally) making scrubbing passes to reduce the
>>>possibility of undetected loss (this also works for conventional RAID
>>>of course, subject to error detection limitations).
>>>
>>
>>And how does it recover from a disk crash? Or what happens if the data
>>goes bad after being written and read back?
>
>
> You use the redundancy to repair it. RAID-1 does not do this.
>

No, RAID-1 has complete mirrors of the data. And if it detects an error
on the primary disk it can correct the error from the mirror, automatically.

>
>>Additionally, it depends on the software correctly detecting and
>>signaling a data error.
>
>
> Which RAID-1 cannot do at all.
>

Actually, RAID-1 does do it. In case you aren't aware, all sectors on
the disks are checksummed. If there is a failure, the hardware will
detect it, long before it even gets to the software. The hardware can
even retry the operation, or it can go straight to the mirror.
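A toy sketch of that error path (Python; CRC32 standing in for the
drive's per-sector ECC, dictionaries standing in for disks - an
illustration of the control flow only, not how any controller is built):

    import zlib

    def write_sector(disk, lba, payload):
        disk[lba] = (payload, zlib.crc32(payload))

    def read_sector(primary, mirror, lba, retries=1):
        for _ in range(retries + 1):
            payload, stored = primary[lba]
            if zlib.crc32(payload) == stored:
                return payload               # sector verified, return it
        # verification kept failing: reported error, go to the mirror
        payload, stored = mirror[lba]
        assert zlib.crc32(payload) == stored
        return payload

    primary, mirror = {}, {}
    for d in (primary, mirror):
        write_sector(d, 0, b"journal record 42")

    # simulate a detected media error on the primary copy: the payload
    # changes but the stored checksum does not, so the verify fails
    bad = bytearray(primary[0][0]); bad[0] ^= 0xFF
    primary[0] = (bytes(bad), primary[0][1])

    print(read_sector(primary, mirror, 0))   # b'journal record 42'

The point of the toy is only the sequence: verify, retry, then fall over
to the mirror once the error is reported.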

>
>>>>A failing controller can
>>>>easily overwrite the data at some later time. RAID-1 and RAID-10 could
>>>>still have that happen, but what are the chances of two separate
>>>>controllers having exactly the same failure at the same time?
>>>
>>>
>>>The difference is that ZFS will see the problem (checksum) and
>>>automatically salvage the data from the good side, while RAID-1 will
>>>not discover the damage (only reads from one side of the mirror).
>>>Obviously checksumming is the critical difference; RAID-1 is entirely
>>>dependent on the drive correctly signalling errors (correctable or
>>>not); it cannot independently verify data integrity and remains
>>>vulnerable to latent data loss.
>>>
>>
>>If it's a single sector. But if the entire disk crashes - i.e. an
>>electronics failure?
>
>
> That's right, it cannot bring a dead disk back to life...
>

Nope, but the mirror still contains the data.

>
>>But all data is mirrored. And part of the drive's job is to signal
>>errors. One which doesn't do that correctly isn't much good, is it/
>
>
> You're right that RAID-1 is built on the assumption that drives
> perfectly report errors. ZFS isn't.
>

Do you really understand how drives work? I mean the actual electronics
of it? Could you read a schematic, scope a failing drive down to the
bad component? Do you have that level of knowledge?

If not, please don't make statements you have no real understanding of.
I can do that, and more. And I have done it.

> As Richard Elling writes, "We don't have to rely on a parity protected
> SCSI bus, or a bug-free disk firmware (I've got the scars) to ensure
> that what is on persistent storage is what we get in memory. ... by
> distrusting everything in the storage data path we will build in the
> reliability and redundancy into the file system."
>

So, you read a few statements and argue your point without any real
technical knowledge of what goes on behind the scenes?

Can you tell me the chances of having an undetected problem on a parity
protected SCSI bus? Or even a non-parity protected one? And can you
give me the details of the most common causes of those? I thought not.
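For anyone who wants the arithmetic, a back-of-the-envelope sketch
(Python, assuming independent bit flips at a made-up rate p per bit -
real bus faults are bursty and correlated, so treat this purely as an
illustration):

    from math import comb

    def undetected(p, bits=9):   # 8 data bits + 1 parity bit per transfer
        # simple parity misses any pattern with an even number of flips
        return sum(comb(bits, k) * p**k * (1 - p)**(bits - k)
                   for k in range(2, bits + 1, 2))

    for p in (1e-9, 1e-12):
        print(f"p={p:g}: undetected per byte ~ {undetected(p):.3e}")

At flip rates that small the result is dominated by the two-flip term,
roughly comb(9, 2) * p**2 per byte transferred.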

And bug-free disk firmware? Disk firmware is a LOT more bug free than
any OS software I've ever seen, including Linux. That's because it has
to do a limited number of operations through a limited interface.

Unlike a file system which has to handle many additional operations on
different disk types and configurations.

And BTW - how many disk firmware bugs have you heard about recently? I
don't say they can't occur. But the reliable disk manufacturers check,
double-check and triple-check their code before it goes out. Then they
test it again.

>
>>>>I have in the past been involved in some very critical databases. They
>>>>all use various RAID devices. And the most critical use RAID-1 or RAID-10.
>>>
>>>
>>>We can do even better these days.
>>>
>>>Related links of interest:
>>>http://blogs.sun.com/bonwick/
>>>http://blogs.sun.com/relling/entry/zfs_from_a_ras_point
>>>https://www.gelato.unsw.edu.au/archives/comp-arch/2006-September/003008.html
>>>http://www.lockss.org/locksswiki/files/3/30/Eurosys2006.pdf [A Fresh
>>>Look at the Reliability of Long-term Digital Storage, 2006]
>>>http://www.ecsl.cs.sunysb.edu/tr/rpe19.pdf [Challenges of Long-Term
>>>Digital Archiving: A Survey, 2006]
>>>http://www.cs.wisc.edu/~vijayan/vijayan-thesis.pdf [IRON File Systems,
>>>2006]
>>>http://www.tcs.hut.fi/~hhk/phd/phd_Hannu_H_Kari.pdf [Latent Sector
>>>Faults and Reliability of Disk Arrays, 1997]
>>>
>>
>>So? I don't see anything in any of these articles which affects this
>>discussion. We're not talking about long term digital storage, for
>>instance.
>
>
> I think that's quite relevant to many "business critical" database
> systems. Databases are even evolving in response to changing
> *regulatory* requirements: MySQL's ARCHIVE engine, for instance...
>

What does MySQL's ARCHIVE engine have to do with "regulatory
requirements"? In case you haven't noticed, MySQL is NOT a US company
(although they do have a U.S. subsidiary).

>
>>I'm just curious. How many critical database systems have you actually
>>been involved with? I've lost count. ...
>>These systems are critical to their business. ...
>
>
> None of this is relevant to what I'm trying to convey, which is simply:
> What ZFS does beyond RAID.
>
> Why are you taking the position that they are equivalent? There are
> innumerable failure modes that RAID(-1) cannot handle, which ZFS does.
>

I'm not taking the position they are equivalent. I'm taking the
position that ZFS is an inferior substitute for a true RAID-1 or RAID-10
implementation.

>
>>BTW - NONE of them use zfs - because these are mainframe systems, not
>>Linux. But they all use the mainframe versions of RAID-1 or RAID-10.
>
>
> I still claim - along with Sun - that you can, using more modern
> software, improve on the integrity and availability guarantees of
> RAID-1. This applies equally to the small systems I specify (say, a
> small mirrored disk server storing POS account data) as to their
> humongous storage arrays.
>

OK, you can maintain it. But a properly configured and operating RAID-1
or RAID-10 array needs no such assistance.

>
>>In any case - this is way off topic for this newsgroup. The original
>>question was "Can I prevent the loss of a significant portion of my data
>>in the case of a MySQL, OS or hardware failure, when using MyISAM?".
>>
>>The answer is no.
>>
>>--
>>==================
>>Remove the "x" from my email address
>>Jerry Stuckle
>>JDS Computer Training Corp.
>>jstucklex@attglobal.net
>>==================
>
>


--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S,hardware, whatever)

am 12.11.2006 04:12:10 von Jerry Stuckle

Gordon Burditt wrote:
>>>>Actually, for an OS problem, you can. Use an RDB which journals and
>>>>take regular backups. Rolling forward from the last valid backup will
>>>>restore all committed transactions.
>>>
>>>
>>>Nope, OS/hardware issue could mean "no journal".
>
>
> OS issue could mean "no disk writes", PERIOD.
>

True. But synchronous writes to the journal will fail with an error if
the OS is doing its job.
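A minimal sketch of that ordering (Python; the file names are made up
for the example, and this is the generic write-ahead pattern, not
MyISAM's or InnoDB's actual code):

    import os

    def journal_then_apply(record: bytes):
        # journal first, synchronously: fsync raises OSError if the OS
        # cannot or will not make the record durable
        with open("journal.log", "ab") as j:
            j.write(record + b"\n")
            j.flush()
            os.fsync(j.fileno())
        # only after the journal is durable do we touch the data file
        with open("table.dat", "ab") as t:
            t.write(record + b"\n")
            t.flush()
            os.fsync(t.fileno())

    journal_then_apply(b"UPDATE accounts SET balance=100 WHERE id=1")

If the OS silently drops the write instead of failing it, no amount of
ordering helps - which is Gordon's point.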

>
>>Journals are written synchronously, before data is written to the
>>database. Also, they are preallocated - so there is no change in the
>>allocation units on the disk. Even an OS problem can't break that.
>
>
> Yes, it can. An OS can decide not to write data at all. (consider
> an anti-virus program that monitors disk writes hooked into the
> OS). Or, at any time, it can erase all the data. (Consider
> accidentally zeroing out a sector containing inodes, including a
> file someone else was using and the journal and some table .MYD
> files. Oh, yes, remember READING a file modifies the inode (accessed
> time)). Or it could write any data over the last sector read (which
> might be part of the mysqld executable).
>

True. The OS has to perform the operations demanded of it. But in that
case nothing will help - not ZFS, not RAID, nothing.

However, at the same time, an OS which does that won't be running for
long, so it's really a moot point.

And BTW - when you're talking inodes, etc., you're discussing
Unix-specific implementations of one file system (actually a couple more
than that - but they are all basically related). There are other file
systems out there.

>
>>The
>>worst which can happen is the last transaction isn't completely written
>>to disk (i.e. the database crashed in the middle of the write).
>
>
> When you think worst-case OS failure, think VIRUS. When you think
> worst-case hardware failure, think EXPLOSION. Or FIRE. Or disk
> head crash. When you think worst-case power-failure situation,
> think burned-out circuits. Or erased track.
>

A critical system will have virus protection. If you aren't taking even
minimal steps to protect your system, you deserve everything you get.

>
>
>>And RAID-1 and RAID-10 are fault tolerant mirrored systems. You would
>>have to lose multiple disks and adapters at exactly the same time to
>>loose the journal.
>
>
> Only if the OS writes it in the first place, and the RAID controller
> isn't broken, and the OS doesn't scribble over it later.
>
Yep. Take precautions to protect your system.

>
>>>>Hardware failure is a little more difficult. At the least you need to
>>>>have your database and journal on two different disks with two different
>>>>adapters.
>
>
> On different planets.
>

Right. Get real here.

>
>>>>Better is to also mirror the database and journal with
>>>>something like RAID-1 or RAID-10.
>>>
>>>
>>>Neither of which can protect against certain hardware failures -
>>>everyone has a story about the bulletproof RAID setup which was
>>>scribbled over by a bad controller, or bad cable, or bad power. ZFS
>>>buys a lot more safety (end to end verification).
>>>
>>
>>I don't know of anyone who has "a story" about these systems where data
>>was lost on RAID-1 or RAID-10.
>
>
> It's certainly possible to rapidly lose data when you type in a
> UPDATE or DELETE query and accidentally type a semicolon instead
> of ENTER just before you were going to type WHERE. RAID (or MySQL's
> replication setup) does just what it's supposed to do and updates
> all the copies with the bad data.
>
> I'm not trying to discourage use of RAID. It can save your butt in
> lots of situations. But it doesn't work miracles.
>

This has nothing to do with the integrity of the database. Of course
it's possible to do something stupid on any system.

Why not just:

rm -r /

>
>>These systems duplicate everything. They have multiple controllers.
>>Separate cables. Even separate power supplies in the most critical
>>cases. Even a power failure just powers down the device (and take the
>>system down).
>
>
>>Also, ZFS doesn't protect against a bad disk, for instance. All it does
>>is guarantee the data was written properly. A failing controller can
>>easily overwrite the data at some later time. RAID-1 and RAID-10 could
>>still have that happen, but what are the chances of two separate
>>controllers having exactly the same failure at the same time?
>>
>>I have in the past been involved in some very critical databases. They
>>all use various RAID devices. And the most critical use RAID-1 or RAID-10.
>>
>>>>It can be guaranteed. Critical databases all use these techniques.
>>>
>>>
>>>I don't trust that word "guaranteed". You need backups in any case. :)


--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, whatever)

am 12.11.2006 04:13:23 von Toby

Jerry Stuckle wrote:
> toby wrote:
> > Jerry Stuckle wrote:
> >
> >>toby wrote:
> >>
> >>>Jerry Stuckle wrote:
> >>>
> >>>
> >>>>Journals are written synchronously, ...
> >>>>
> >>>>And RAID-1 and RAID-10 are fault tolerant mirrored systems. You would
> >>>>have to lose multiple disks and adapters at exactly the same time to
> >>>>loose the journal.
> >>>
> >>>
> >>>As long as there is a single point of failure (software or firmware bug
> >>>for instance)...
> >>>
> >>
> >>They will also handle hardware failures. I have never heard of any loss
> >>of data due to hardware failures on RAID-1 or RAID-10. Can you point to
> >>even one instance?
> >
> >
> > There are several examples of such hardware failures in the links
> > cited, but I'll crosspost this to comp.arch.storage - I'll eat my hat
> > if no-one there has seen a RAID data loss.
> >
>
> I've seen those links. I have yet to see where there was any loss of
> data proven. Some conjectures in blogs, for instance. But I want to
> see documented facts.

Jerry, I'm having trouble believing that you can't come up with a data
loss scenario for conventional RAID-1.

>
> And I've removed the cross-post. If I want a discussion in
> comp.arch.storage, I will post in it.
>
> >
> >>>>...
> >>>>I don't know of anyone who has "a story" about these systems where data
> >>>>was lost on RAID-1 or RAID-10.
> >>>
> >>>
> >>>It hasn't happened to me either, but it has happened to many others.
> >>>
> >>
> >>Specifics? Using RAID-1 or RAID-10?
> >>
> >>
> >>>>These systems duplicate everything. They have multiple controllers.
> >>>>Separate cables. Even separate power supplies in the most critical
> >>>>cases. Even a power failure just powers down the device (and take the
> >>>>system down).
> >>>>
> >>>>Also, ZFS doesn't protect against a bad disk, for instance. All it does
> >>>>is guarantee the data was written properly.
> >>>
> >>>
> >>>It does considerably better than RAID-1 here, in several ways - by
> >>>verifying writes; verifying reads; by healing immediately a data error
> >>>is found; and by (optionally) making scrubbing passes to reduce the
> >>>possibility of undetected loss (this also works for conventional RAID
> >>>of course, subject to error detection limitations).
> >>>
> >>
> >>And how does it recover from a disk crash? Or what happens if the data
> >>goes bad after being written and read back?
> >
> >
> > You use the redundancy to repair it. RAID-1 does not do this.
> >
>
> No, RAID-1 has complete mirrors of the data. And if it detects an error
> on the primary disk it can correct the error from the mirror, automatically.

In fact, it does not. It reads from only one side of the mirror. Yes,
*if the drive reports an error* it can fix from the other side. ZFS
does not depend on the drive (or any subsystem) reliably reporting
errors. (I'm not inventing this, I'm only describing.)
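A minimal sketch of the difference being described (Python; two
dictionaries stand in for the two sides of a mirror, and the checksum
plays the role ZFS's block checksums play - an illustration of the
principle, not of either implementation):

    import hashlib

    def store(block):
        return {"data": block, "sum": hashlib.sha256(block).hexdigest()}

    side_a = store(b"customer balance = 100")
    side_b = store(b"customer balance = 100")

    # silent corruption on side A: the data changes, no error is signalled
    side_a["data"] = b"customer balance = 900"

    # plain mirror read: one side is read and returned as-is
    print(side_a["data"])                      # wrong data, no complaint

    # checksummed read: verify, fall back to the good side, self-heal
    def checked_read(primary, mirror):
        if hashlib.sha256(primary["data"]).hexdigest() != primary["sum"]:
            primary["data"] = mirror["data"]   # repair from the good copy
        return primary["data"]

    print(checked_read(side_a, side_b))        # b'customer balance = 100'

If the drive *does* report the error, both schemes recover; the extra
checksum only matters in the case where nothing downstream complains.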

>
> >
> >>Additionally, it depends on the software correctly detecting and
> >>signaling a data error.
> >
> >
> > Which RAID-1 cannot do at all.
> >
>
> Actually, RAID-1 does do it. In case you aren't aware, all sectors on
> the disks are checksummed.

Are you referring to disk internals? If so, it's not relevant to a
comparison between RAID-1 and ZFS, since the mechanism applies in both
cases. ZFS applies a further level of checksumming as you know.

> If there is a failure, the hardware will
> detect it, long before it even gets to the software. The hardware can
> even retry the operation, or it can go straight to the mirror.
>
> >
> >>>>A failing controller can
> >>>>easily overwrite the data at some later time. RAID-1 and RAID-10 could
> >>>>still have that happen, but what are the chances of two separate
> >>>>controllers having exactly the same failure at the same time?
> >>>
> >>>
> >>>The difference is that ZFS will see the problem (checksum) and
> >>>automatically salvage the data from the good side, while RAID-1 will
> >>>not discover the damage (only reads from one side of the mirror).
> >>>Obviously checksumming is the critical difference; RAID-1 is entirely
> >>>dependent on the drive correctly signalling errors (correctable or
> >>>not); it cannot independently verify data integrity and remains
> >>>vulnerable to latent data loss.
> >>>
> >>
> >>If it's a single sector. But if the entire disk crashes - i.e. an
> >>electronics failure?
> >
> >
> > That's right, it cannot bring a dead disk back to life...
> >
>
> Nope, but the mirror still contains the data.
>
> >
> >>But all data is mirrored. And part of the drive's job is to signal
> >>errors. One which doesn't do that correctly isn't much good, is it/
> >
> >
> > You're right that RAID-1 is built on the assumption that drives
> > perfectly report errors. ZFS isn't.
> >
>
> Do you really understand how drives work? I mean the actual electronics
> of it? Could you read a schematic, scope a failing drive down to the
> bad component? Do you have that level of knowledge?
>
> If not, please don't make statements you have no real understanding of.
> I can do that, and more. And I have done it.

Is that actually relevant here?

My statement was, ZFS does not assume drives, controllers, drivers or
any level of the stack faithfully reports errors. I'm not inventing
that. Its design principle is, as Richard writes, distrust of the
entire I/O stack (a.k.a. Bonwick's "end-to-end"). You may not like to
hear the words from me (since you've decided I'm not worth listening
to), but there it is.
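As a sketch of what "end-to-end" means in practice (an assumed layout
for illustration, not ZFS's actual on-disk format): the checksum of each
block lives in its parent rather than next to the data, so a misdirected
or phantom write cannot take out data and checksum together.

    import hashlib

    def csum(b):
        return hashlib.sha256(b).digest()

    leaf_a, leaf_b = b"row 1", b"row 2"
    parent = {"children": [leaf_a, leaf_b],
              "sums": [csum(leaf_a), csum(leaf_b)]}
    root_sum = csum(b"".join(parent["sums"]))   # kept at the top of the tree

    def verify(parent, root_sum):
        ok = csum(b"".join(parent["sums"])) == root_sum
        for child, s in zip(parent["children"], parent["sums"]):
            ok = ok and csum(child) == s
        return ok

    print(verify(parent, root_sum))             # True
    parent["children"][0] = b"row X"            # silent damage to one block
    print(verify(parent, root_sum))             # False - caught at the top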

>
> > As Richard Elling writes, "We don't have to rely on a parity protected
> > SCSI bus, or a bug-free disk firmware (I've got the scars) to ensure
> > that what is on persistent storage is what we get in memory. ... by
> > distrusting everything in the storage data path we will build in the
> > reliability and redundancy into the file system."
> >
>
> So, you read a few statements and argue your point without any real
> technical knowledge of what goes on behind the scenes?
>
> Can you tell me the chances of having an undetected problem on a parity
> protected SCSI bus? Or even a non-parity protected one? And can you
> give me the details of the most common causes of those? I thought not.

OK. Seems you're pretty angry about something...

>
> And bug-free disk firmware? Disk firmware is a LOT more bug free than
> any OS software I've ever seen, including Linux. That's because it has
> to do a limited amount of operations with a limited interface.
>
> Unlike a file system which has to handle many additional operations on
> different disk types and configurations.
>
> And BTW - how many disk firmware bugs have you heard about recently? I
> don't say they can't occur. But the reliable disk manufacturers check,
> double-check and triple-check their code before it goes out. Then they
> test it again.
>
> >
> >>>>I have in the past been involved in some very critical databases. They
> >>>>all use various RAID devices. And the most critical use RAID-1 or RAID-10.
> >>>
> >>>
> >>>We can do even better these days.
> >>>
> >>>Related links of interest:
> >>>http://blogs.sun.com/bonwick/
> >>>http://blogs.sun.com/relling/entry/zfs_from_a_ras_point
> >>>https://www.gelato.unsw.edu.au/archives/comp-arch/2006-September/003008.html
> >>>http://www.lockss.org/locksswiki/files/3/30/Eurosys2006.pdf [A Fresh
> >>>Look at the Reliability of Long-term Digital Storage, 2006]
> >>>http://www.ecsl.cs.sunysb.edu/tr/rpe19.pdf [Challenges of Long-Term
> >>>Digital Archiving: A Survey, 2006]
> >>>http://www.cs.wisc.edu/~vijayan/vijayan-thesis.pdf [IRON File Systems,
> >>>2006]
> >>>http://www.tcs.hut.fi/~hhk/phd/phd_Hannu_H_Kari.pdf [Latent Sector
> >>>Faults and Reliability of Disk Arrays, 1997]
> >>>
> >>
> >>So? I don't see anything in any of these articles which affects this
> >>discussion. We're not talking about long term digital storage, for
> >>instance.
> >
> >
> > I think that's quite relevant to many "business critical" database
> > systems. Databases are even evolving in response to changing
> > *regulatory* requirements: MySQL's ARCHIVE engine, for instance...
> >
>
> What does MySQL's ARCHIVE engine have to do with "regulatory
> requirements"? In case you haven't noticed, MySQL is NOT a US company
> (although they do have a U.S. subsidiary).

It was a subtle point. Don't sweat it.

>
> >
> >>I'm just curious. How many critical database systems have you actually
> >>been involved with? I've lost count. ...
> >>These systems are critical to their business. ...
> >
> >
> > None of this is relevant to what I'm trying to convey, which is simply:
> > What ZFS does beyond RAID.
> >
> > Why are you taking the position that they are equivalent? There are
> > innumerable failure modes that RAID(-1) cannot handle, which ZFS does.
> >
>
> I'm not taking the position they are equivalent. I'm taking the
> position that ZFS is an inferior substitute for a true RAID-1 or RAID-10
> implementation.

I don't believe that is the case. We'll have to agree to disagree.

>
> >
> >>BTW - NONE of them use zfs - because these are mainframe systems, not
> >>Linux. But they all use the mainframe versions of RAID-1 or RAID-10.
> >
> >
> > I still claim - along with Sun - that you can, using more modern
> > software, improve on the integrity and availability guarantees of
> > RAID-1. This applies equally to the small systems I specify (say, a
> > small mirrored disk server storing POS account data) as to their
> > humongous storage arrays.
> >
>
> OK, you can maintain it. But a properly configured and operating RAID-1
> or RAID-10 array needs no such assistance.

But there are numerous failure modes they can't handle. Any unreported
data error on disk, for instance.

Btw, if you want information from "more qualified sources" than myself
on ZFS, you should continue to post in comp.unix.solaris. My resume
isn't as long as yours, as we have established several times, and you
clearly have decided I have nothing useful to contribute. Oh well.

>
> >
> >>In any case - this is way off topic for this newsgroup. The original
> >>question was "Can I prevent the loss of a significant portion of my data
> >>in the case of a MySQL, OS or hardware failure, when using MyISAM?".
> >>
> >>The answer is no.
> >>
> >>--
> >>==================
> >>Remove the "x" from my email address
> >>Jerry Stuckle
> >>JDS Computer Training Corp.
> >>jstucklex@attglobal.net
> >>==================
> >
> >
>
>
> --
> ==================
> Remove the "x" from my email address
> Jerry Stuckle
> JDS Computer Training Corp.
> jstucklex@attglobal.net
> ==================

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware,

am 12.11.2006 04:30:25 von Toby

Jerry Stuckle wrote:
> Rich Teer wrote:
> > On Sat, 11 Nov 2006, toby wrote:
> >
> >
> >>Jerry Stuckle wrote:
> >>
> >>>...
> >>>But when properly implemented, RAID-1 and RAID-10 will detect and
> >>>correct even more errors than ZFS will.
> >>
> >>I'll let those with more patience refute this.
> >
> >
> > Jerry, what are you smoking? Do you actually know what ZFS is, and
> > if so what if, in the context of your assertion I quoted above, ZFS
> > is used to implement RAID 1 and RAID 10 (which, incidentally, it is
> > VERY frequently used to do)?
> >
> > I agree with Toby: you need to read a bit more about ZFS. If you're
> > a storage nut (meant in a non-disparaging way!), I think you'll like
> > what you read.
> >
>
> I'm not smoking anything.
>
> REAL RAID-1 and RAID-10 are implemented in hardware/firmware, not a
> software system such as ZFS.

Which is actually a weak point, because you then have to trust the
controller, cables, and so on that interface the "reliable" storage.
Sure, you can have two controllers, and so on, but your application
still has no assurance that the data is good. ZFS is designed to
provide that assurance. The fact that it is part of the operating
system and not a hardware-isolated module makes this possible. Don't
take my word for it, read Bonwick, he's much smarter than I am (which
is why I use his system):
http://blogs.sun.com/bonwick/entry/zfs_end_to_end_data

Btw, I have restored the crosspost to comp.unix.solaris, because ZFS is
a Solaris 10 filesystem.

>
> Of course, there are some systems out there which CLAIM to be RAID-1 or
> RAID-10, but implement them in software such as ZFS. What they are are
> really RAID-1/RAID-10 compliant.
>
> And BTW - I've taken out the extra newsgroups. They have nothing to do
> with this discussion.
>
> --
> ==================
> Remove the "x" from my email address
> Jerry Stuckle
> JDS Computer Training Corp.
> jstucklex@attglobal.net
> ==================

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, w

am 12.11.2006 05:37:19 von Jerry Stuckle

toby wrote:
> Jerry Stuckle wrote:
>
>>Rich Teer wrote:
>>
>>>On Sat, 11 Nov 2006, toby wrote:
>>>
>>>
>>>
>>>>Jerry Stuckle wrote:
>>>>
>>>>
>>>>>...
>>>>>But when properly implemented, RAID-1 and RAID-10 will detect and
>>>>>correct even more errors than ZFS will.
>>>>
>>>>I'll let those with more patience refute this.
>>>
>>>
>>>Jerry, what are you smoking? Do you actually know what ZFS is, and
>>>if so what if, in the context of your assertion I quoted above, ZFS
>>>is used to implement RAID 1 and RAID 10 (which, incidentally, it is
>>>VERY frequently used to do)?
>>>
>>>I agree with Toby: you need to read a bit more about ZFS. If you're
>>>a storage nut (meant in a non-disparaging way!), I think you'll like
>>>what you read.
>>>
>>
>>I'm not smoking anything.
>>
>>REAL RAID-1 and RAID-10 are implemented in hardware/firmware, not a
>>software system such as ZFS.
>
>
> Which is actually a weak point, because you then have to trust the
> controller, cables, and so on that interface the "reliable" storage.
> Sure, you can have two controllers, and so on, but your application
> still has no assurance that the data is good. ZFS is designed to
> provide that assurance. The fact that it is part of the operating
> system and not a hardware-isolated module makes this possible. Don't
> take my word for it, read Bonwick, he's much smarter than I am (which
> is why I use his system):
> http://blogs.sun.com/bonwick/entry/zfs_end_to_end_data
>

Believe me - I trust the hardware a LOT farther than the software!

And yes, I've read Bonwick. He's a great proponent of ZFS. However, I
don't think he has any idea how the hardware works. At least I haven't
seen any indication of it.

> Btw, I have restored the crosspost to comp.unix.solaris, because ZFS is
> a Solaris 10 filesystem.
>

And I have removed it again.

But that's OK. I'm not going to respond to you any further. It's
obvious you've bought a bill of goods hook, line and sinker. And you
aren't willing to listen to anything else.

Bye.

>
>>Of course, there are some systems out there which CLAIM to be RAID-1 or
>>RAID-10, but implement them in software such as ZFS. What they are are
>>really RAID-1/RAID-10 compliant.
>>
>>And BTW - I've taken out the extra newsgroups. They have nothing to do
>>with this discussion.
>>
>>--
>>==================
>>Remove the "x" from my email address
>>Jerry Stuckle
>>JDS Computer Training Corp.
>>jstucklex@attglobal.net
>>==================
>
>


--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, w

am 12.11.2006 05:42:20 von Jerry Stuckle

toby wrote:
> Jerry Stuckle wrote:
>
>>toby wrote:
>>
>>>Jerry Stuckle wrote:
>>>
>>>
>>>>toby wrote:
>>>>
>>>>
>>>>>toby wrote:
>>>>>
>>>>>
>>>>>
>>>>>>Jerry Stuckle wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>>...
>>>>>>>A failing controller can
>>>>>>>easily overwrite the data at some later time. RAID-1 and RAID-10 could
>>>>>>>still have that happen, but what are the chances of two separate
>>>>>>>controllers having exactly the same failure at the same time?
>>>>>>
>>>>>>The difference is that ZFS will see the problem (checksum) and
>>>>>>automatically salvage the data from the good side, while RAID-1 will
>>>>>>not discover the damage
>>>>>
>>>>>
>>>>>I should have added - you don't need *two* failures. You only need *one
>>>>>silent error* to cause data loss with RAID-1. ZFS is proof against
>>>>>silent errors, although of course it's still susceptible to multiple
>>>>>failures (such as both mirrors suffering a whole disk failure without
>>>>>repair).
>>>>>
>>>>
>>>>ZFS is not proof against silent errors - they can still occur. It is
>>>>possible for it to miss an error, also. Plus it is not proof against
>>>>data decaying after it is written to disk.
>>>
>>>
>>>Actually both capabilities are among its strongest features.
>>>
>>>Clearly you haven't read or understood any of the publicly available
>>>information about it, so I'm not going to pursue this any further
>>>beyond relating an analogy:
>>> You will likely live longer if you look both ways before crossing
>>>the road, rather than walking straight across without looking because
>>>"cars will stop".
>>>
>>
>>Actually, I understand quite a bit about ZFS. However, unlike you, I
>>also understand its shortcomings. That's because I started working on
>>fault-tolerant drive systems starting in 1977 as a hardware CE for IBM,
>>working on large mainframes. I've watched it grow over the years. And
>>as a EE major, I also understand the hardware and it's strengths and
>>weaknesses - in detail.
>>
>>And as a CS major (dual majors) and programmer since 1867, including
>>working on system software for IBM in the 1980's I have a thorough
>>understanding of the software end.
>>
>>And it's obvious from your statements you have no real understanding or
>>either, other than sales literature.
>
>
> This isn't about a battle of the egos. I was challenging what seemed to
> be factual misunderstandings of ZFS relative to RAID. Perhaps we're
> talking at cross purposes; you had trouble getting Axel's point also...
>

It's not about a battle of egos with me, either. It's about correcting
some misconceptions of a close-minded individual who has no real idea of
the technical issues involved.

I suspect I understand both ZFS and RAID-1 and RAID-10 a whole lot more
than you do - because I have a thorough understanding of the underlying
hardware and its operation, as well as the programming involved.

>
>>>>...
>>>>But when properly implemented, RAID-1 and RAID-10 will detect and
>>>>correct even more errors than ZFS will.
>>>
>>>
>>>I'll let those with more patience refute this.
>>>
>>
>>And more knowledge of the real facts?
>>
>>BTW - I took out all those extra newsgroups you added. If I wanted to
>>discuss things there I would have added them myself.
>>
>>
>>But I'm also not going to discuss this any more with you. I'd really
>>rather have discussions with someone who really knows the internals - of
>>both systems.
>
>
> You'll find them in the newsgroups you snipped, not here. I'm sorry
> things degenerated to this point, but I stand by my corrections of your
> strange views on ZFS' capabilities.
>

And it's snipped again because I really don't give a damn what bill of
goods you've bought. And I'm finished with this conversation.

Bye.

>
>>--
>>==================
>>Remove the "x" from my email address
>>Jerry Stuckle
>>JDS Computer Training Corp.
>>jstucklex@attglobal.net
>>==================
>
>


--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware,

am 12.11.2006 05:54:01 von Toby

toby wrote:
> Jerry Stuckle wrote:
> > ...
> > Actually, I understand quite a bit about ZFS. However, unlike you, I
> > also understand its shortcomings.

This group and I would very much like to hear about those shortcomings,
if you would elucidate.

> > That's because I started working on
> > fault-tolerant drive systems starting in 1977 as a hardware CE for IBM,
> > working on large mainframes. I've watched it grow over the years. And
> > as a EE major, I also understand the hardware and it's strengths and
> > weaknesses - in detail.
> >
> > And as a CS major (dual majors) and programmer since 1867, including
> > working on system software for IBM in the 1980's I have a thorough
> > understanding of the software end.
> >
> > And it's obvious from your statements you have no real understanding or
> > either, other than sales literature.

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware,

am 12.11.2006 05:55:40 von Toby

Jerry Stuckle wrote:
> toby wrote:
> > Jerry Stuckle wrote:
> >
> >>Rich Teer wrote:
> >>
> >>>On Sat, 11 Nov 2006, toby wrote:
> >>>
> >>>
> >>>
> >>>>Jerry Stuckle wrote:
> >>>>
> >>>>
> >>>>>...
> >>>>>But when properly implemented, RAID-1 and RAID-10 will detect and
> >>>>>correct even more errors than ZFS will.
> >>>>
> >>>>I'll let those with more patience refute this.
> >>>
> >>>
> >>>Jerry, what are you smoking? Do you actually know what ZFS is, and
> >>>if so what if, in the context of your assertion I quoted above, ZFS
> >>>is used to implement RAID 1 and RAID 10 (which, incidentally, it is
> >>>VERY frequently used to do)?
> >>>
> >>>I agree with Toby: you need to read a bit more about ZFS. If you're
> >>>a storage nut (meant in a non-disparaging way!), I think you'll like
> >>>what you read.
> >>>
> >>
> >>I'm not smoking anything.
> >>
> >>REAL RAID-1 and RAID-10 are implemented in hardware/firmware, not a
> >>software system such as ZFS.
> >
> >
> > Which is actually a weak point, because you then have to trust the
> > controller, cables, and so on that interface the "reliable" storage.
> > Sure, you can have two controllers, and so on, but your application
> > still has no assurance that the data is good. ZFS is designed to
> > provide that assurance. The fact that it is part of the operating
> > system and not a hardware-isolated module makes this possible. Don't
> > take my word for it, read Bonwick, he's much smarter than I am (which
> > is why I use his system):
> > http://blogs.sun.com/bonwick/entry/zfs_end_to_end_data
> >
>
> Believe me - I trust the hardware a LOT farther than the software!
>
> And yes, I've read bonwick. He's a great proponent of zfs. However, I
> don't think he has any idea how the hardware works. At least I haven't
> seen any indication of it.
>
> > Btw, I have restored the crosspost to comp.unix.solaris, because ZFS is
> > a Solaris 10 filesystem.
> >
>
> And I have removed it again.
>
> But that's OK. I'm not going to respond to you any further. It's
> obvious you've bought a bill of goods hook, line and sinker. And you
> aren't willing to listen to anything else.

Au contraire. I have asked a question in the relevant group, about what
you have identified as ZFS' shortcomings, and I would genuinely like to
hear the answer.

--Toby

>
> Bye.
>
> >
> >>Of course, there are some systems out there which CLAIM to be RAID-1 or
> >>RAID-10, but implement them in software such as ZFS. What they are are
> >>really RAID-1/RAID-10 compliant.
> >>
> >>And BTW - I've taken out the extra newsgroups. They have nothing to do
> >>with this discussion.
> >>
> >>--
> >>==================
> >>Remove the "x" from my email address
> >>Jerry Stuckle
> >>JDS Computer Training Corp.
> >>jstucklex@attglobal.net
> >>==================
> >
> >
>
>
> --
> ==================
> Remove the "x" from my email address
> Jerry Stuckle
> JDS Computer Training Corp.
> jstucklex@attglobal.net
> ==================

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S,hardware, whatever)

am 12.11.2006 06:02:34 von Jerry Stuckle

toby wrote:
> Jerry Stuckle wrote:
>
>>toby wrote:
>>
>>>Jerry Stuckle wrote:
>>>
>>>
>>>>toby wrote:
>>>>
>>>>
>>>>>Jerry Stuckle wrote:
>>>>>
>>>>>
>>>>>
>>>>>>Journals are written synchronously, ...
>>>>>>
>>>>>>And RAID-1 and RAID-10 are fault tolerant mirrored systems. You would
>>>>>>have to lose multiple disks and adapters at exactly the same time to
>>>>>>loose the journal.
>>>>>
>>>>>
>>>>>As long as there is a single point of failure (software or firmware bug
>>>>>for instance)...
>>>>>
>>>>
>>>>They will also handle hardware failures. I have never heard of any loss
>>>>of data due to hardware failures on RAID-1 or RAID-10. Can you point to
>>>>even one instance?
>>>
>>>
>>>There are several examples of such hardware failures in the links
>>>cited, but I'll crosspost this to comp.arch.storage - I'll eat my hat
>>>if no-one there has seen a RAID data loss.
>>>
>>
>>I've seen those links. I have yet to see where there was any loss of
>>data proven. Some conjectures in blogs, for instance. But I want to
>>see documented facts.
>
>
> Jerry, I'm having trouble believing that you can't come up with a data
> loss scenario for conventional RAID-1.
>

What do you mean ME coming up with a data loss scenario? YOU come up
with a single point of failure which loses data on a real RAID-1. A
point of failure which isn't reflected back to the OS, of course.

>
>>And I've removed the cross-post. If I want a discussion in
>>comp.arch.storage, I will post in it.
>>
>>
>>>>>>...
>>>>>>I don't know of anyone who has "a story" about these systems where data
>>>>>>was lost on RAID-1 or RAID-10.
>>>>>
>>>>>
>>>>>It hasn't happened to me either, but it has happened to many others.
>>>>>
>>>>
>>>>Specifics? Using RAID-1 or RAID-10?
>>>>
>>>>
>>>>
>>>>>>These systems duplicate everything. They have multiple controllers.
>>>>>>Separate cables. Even separate power supplies in the most critical
>>>>>>cases. Even a power failure just powers down the device (and take the
>>>>>>system down).
>>>>>>
>>>>>>Also, ZFS doesn't protect against a bad disk, for instance. All it does
>>>>>>is guarantee the data was written properly.
>>>>>
>>>>>
>>>>>It does considerably better than RAID-1 here, in several ways - by
>>>>>verifying writes; verifying reads; by healing immediately a data error
>>>>>is found; and by (optionally) making scrubbing passes to reduce the
>>>>>possibility of undetected loss (this also works for conventional RAID
>>>>>of course, subject to error detection limitations).
>>>>>
>>>>
>>>>And how does it recover from a disk crash? Or what happens if the data
>>>>goes bad after being written and read back?
>>>
>>>
>>>You use the redundancy to repair it. RAID-1 does not do this.
>>>
>>
>>No, RAID-1 has complete mirrors of the data. And if it detects an error
>>on the primary disk it can correct the error from the mirror, automatically.
>
>
> In fact, it does not. It reads from only one side of the mirror. Yes,
> *if the drive reports an error* it can fix from the other side. ZFS
> does not depend on the drive (or any subsystem) reliably reporting
> errors. (I'm not inventing this, I'm only describing.)
>

Maybe not the implementations you're familiar with. True fault-tolerant
ones will detect a failure on one side and automatically correct it by
fetching the data from the other side.

And tell me exactly under what conditions the drive will fail to report
an error that ZFS would have caught. Specific details, please.

>
>>>>Additionally, it depends on the software correctly detecting and
>>>>signaling a data error.
>>>
>>>
>>>Which RAID-1 cannot do at all.
>>>
>>
>>Actually, RAID-1 does do it. In case you aren't aware, all sectors on
>>the disks are checksummed.
>
>
> Are you referring to disk internals? If so, it's not relevant to a
> comparison between RAID-1 and ZFS, since the mechanism applies in both
> cases. ZFS applies a further level of checksumming as you know.
>

Which makes ZFS's checksum unnecessary and irrelevant - unless you're
using cheap drives, that is.

>
>>If there is a failure, the hardware will
>>detect it, long before it even gets to the software. The hardware can
>>even retry the operation, or it can go straight to the mirror.
>>
>>
>>>>>>A failing controller can
>>>>>>easily overwrite the data at some later time. RAID-1 and RAID-10 could
>>>>>>still have that happen, but what are the chances of two separate
>>>>>>controllers having exactly the same failure at the same time?
>>>>>
>>>>>
>>>>>The difference is that ZFS will see the problem (checksum) and
>>>>>automatically salvage the data from the good side, while RAID-1 will
>>>>>not discover the damage (only reads from one side of the mirror).
>>>>>Obviously checksumming is the critical difference; RAID-1 is entirely
>>>>>dependent on the drive correctly signalling errors (correctable or
>>>>>not); it cannot independently verify data integrity and remains
>>>>>vulnerable to latent data loss.
>>>>>
>>>>
>>>>If it's a single sector. But if the entire disk crashes - i.e. an
>>>>electronics failure?
>>>
>>>
>>>That's right, it cannot bring a dead disk back to life...
>>>
>>
>>Nope, but the mirror still contains the data.
>>
>>
>>>>But all data is mirrored. And part of the drive's job is to signal
>>>>errors. One which doesn't do that correctly isn't much good, is it/
>>>
>>>
>>>You're right that RAID-1 is built on the assumption that drives
>>>perfectly report errors. ZFS isn't.
>>>
>>
>>Do you really understand how drives work? I mean the actual electronics
>>of it? Could you read a schematic, scope a failing drive down to the
>>bad component? Do you have that level of knowledge?
>>
>>If not, please don't make statements you have no real understanding of.
>> I can do that, and more. And I have done it.
>
>
> Is that actually relevant here?
>

You're making technical claims. Provide the technical support to back
up your claims.

> My statement was, ZFS does not assume drives, controllers, drivers or
> any level of the stack faithfully reports errors. I'm not inventing
> that. Its design principle is, as Richard writes, distrust of the
> entire I/O stack (a.k.a. Bonwick's "end-to-end"). You may not like to
> hear the words from me (since you've decided I'm not worth listening
> to), but there it is.
>

Sure you should distrust the I/O stack. It can be overwritten in so many
ways by the software.

Unlike the controller - where it can't be overwritten.

I'll talk to you. But please don't insult me by repeating technical
claims when you don't understand the background behind them. As I said
before - I've read the links you provided, and quite frankly don't agree
with a number of their claims.

>
>>>As Richard Elling writes, "We don't have to rely on a parity protected
>>>SCSI bus, or a bug-free disk firmware (I've got the scars) to ensure
>>>that what is on persistent storage is what we get in memory. ... by
>>>distrusting everything in the storage data path we will build in the
>>>reliability and redundancy into the file system."
>>>
>>
>>So, you read a few statements and argue your point without any real
>>technical knowledge of what goes on behind the scenes?
>>
>>Can you tell me the chances of having an undetected problem on a parity
>>protected SCSI bus? Or even a non-parity protected one? And can you
>>give me the details of the most common causes of those? I thought not.
>
>
> OK. Seems you're pretty angry about something...
>

Not angry at all. Just trying to find out if you understand what you're
talking about.

>
>>And bug-free disk firmware? Disk firmware is a LOT more bug free than
>>any OS software I've ever seen, including Linux. That's because it has
>>to do a limited amount of operations with a limited interface.
>>
>>Unlike a file system which has to handle many additional operations on
>>different disk types and configurations.
>>
>>And BTW - how many disk firmware bugs have you heard about recently? I
>>don't say they can't occur. But the reliable disk manufacturers check,
>>double-check and triple-check their code before it goes out. Then they
>>test it again.
>>
>>
>>>>>>I have in the past been involved in some very critical databases. They
>>>>>>all use various RAID devices. And the most critical use RAID-1 or RAID-10.
>>>>>
>>>>>
>>>>>We can do even better these days.
>>>>>
>>>>>Related links of interest:
>>>>>http://blogs.sun.com/bonwick/
>>>>>http://blogs.sun.com/relling/entry/zfs_from_a_ras_point
>>>>>https://www.gelato.unsw.edu.au/archives/comp-arch/2006-September/003008.html
>>>>>http://www.lockss.org/locksswiki/files/3/30/Eurosys2006.pdf [A Fresh
>>>>>Look at the Reliability of Long-term Digital Storage, 2006]
>>>>>http://www.ecsl.cs.sunysb.edu/tr/rpe19.pdf [Challenges of Long-Term
>>>>>Digital Archiving: A Survey, 2006]
>>>>>http://www.cs.wisc.edu/~vijayan/vijayan-thesis.pdf [IRON File Systems,
>>>>>2006]
>>>>>http://www.tcs.hut.fi/~hhk/phd/phd_Hannu_H_Kari.pdf [Latent Sector
>>>>>Faults and Reliability of Disk Arrays, 1997]
>>>>>
>>>>
>>>>So? I don't see anything in any of these articles which affects this
>>>>discussion. We're not talking about long term digital storage, for
>>>>instance.
>>>
>>>
>>>I think that's quite relevant to many "business critical" database
>>>systems. Databases are even evolving in response to changing
>>>*regulatory* requirements: MySQL's ARCHIVE engine, for instance...
>>>
>>
>>What does MySQL's ARCHIVE engine have to do with "regulatory
>>requirements"? In case you haven't noticed, MySQL is NOT a US company
>>(although they do have a U.S. subsidiary).
>
>
> It was a subtle point. Don't sweat it.
>

Then why even bring it up? Because it's irrelevant?

>
>>>>I'm just curious. How many critical database systems have you actually
>>>>been involved with? I've lost count. ...
>>>>These systems are critical to their business. ...
>>>
>>>
>>>None of this is relevant to what I'm trying to convey, which is simply:
>>>What ZFS does beyond RAID.
>>>
>>>Why are you taking the position that they are equivalent? There are
>>>innumerable failure modes that RAID(-1) cannot handle, which ZFS does.
>>>
>>
>>I'm not taking the position they are equivalent. I'm taking the
>>position that ZFS is an inferior substitute for a true RAID-1 or RAID-10
>>implementation.
>
>
> I don't believe that is the case. We'll have to agree to disagree.
>

The difference is I don't just accept what someone claims. Rather, I
analyze and determine just how accurate the statements are.

>
>>>>BTW - NONE of them use zfs - because these are mainframe systems, not
>>>>Linux. But they all use the mainframe versions of RAID-1 or RAID-10.
>>>
>>>
>>>I still claim - along with Sun - that you can, using more modern
>>>software, improve on the integrity and availability guarantees of
>>>RAID-1. This applies equally to the small systems I specify (say, a
>>>small mirrored disk server storing POS account data) as to their
>>>humongous storage arrays.
>>>
>>
>>OK, you can maintain it. But a properly configured and operating RAID-1
>>or RAID-10 array needs no such assistance.
>
>
> But there are numerous failure modes they can't handle. Any unreported
> data error on disk, for instance.
>

And exactly how can you get an unreported data error from a disk?

> Btw, if you want information from "more qualified sources" than myself
> on ZFS, you should continue to post in comp.unix.solaris. My resume
> isn't as long as yours, as we have established several times, and you
> clearly have decided I have nothing useful to contribute. Oh well.
>

Not really. You butted into this conversation and discussed zfs -
which, BTW, is a UNIX-only file system. And in case you haven't figured
out, UNIX is NOT the only OS out there. Even MySQL recognizes that.

I'm just refuting your wild claims. But you're not interested in
discussing hard facts - you make claims about "unreported data errors",
for instance, but have no idea how they can happen, how often they
happen or the odds of them happening.

All you have is a sales pitch you've bought.

Thanks, I have better things to do with my time. Bye.

>
>>>>In any case - this is way off topic for this newsgroup. The original
>>>>question was "Can I prevent the loss of a significant portion of my data
>>>>in the case of a MySQL, OS or hardware failure, when using MyISAM?".
>>>>
>>>>The answer is no.
>>>>
>>>>--
>>>>==================
>>>>Remove the "x" from my email address
>>>>Jerry Stuckle
>>>>JDS Computer Training Corp.
>>>>jstucklex@attglobal.net
>>>>==================
>>>
>>>
>>
>>--
>>==================
>>Remove the "x" from my email address
>>Jerry Stuckle
>>JDS Computer Training Corp.
>>jstucklex@attglobal.net
>>==================
>
>


--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, whatever)

am 12.11.2006 06:16:33 von Toby

Jerry Stuckle wrote:
> toby wrote:
> > Jerry Stuckle wrote:
> >
> >>toby wrote:
> >>
> >>>Jerry Stuckle wrote:
> >>>
> >>>
> >>>>toby wrote:
> >>>>
> >>>>
> >>>>>Jerry Stuckle wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>>>Journals are written synchronously, ...
> >>>>>>
> >>>>>>And RAID-1 and RAID-10 are fault tolerant mirrored systems. You would
> >>>>>>have to lose multiple disks and adapters at exactly the same time to
> >>>>>>lose the journal.
> >>>>>
> >>>>>
> >>>>>As long as there is a single point of failure (software or firmware bug
> >>>>>for instance)...
> >>>>>
> >>>>
> >>>>They will also handle hardware failures. I have never heard of any loss
> >>>>of data due to hardware failures on RAID-1 or RAID-10. Can you point to
> >>>>even one instance?
> >>>
> >>>
> >>>There are several examples of such hardware failures in the links
> >>>cited, but I'll crosspost this to comp.arch.storage - I'll eat my hat
> >>>if no-one there has seen a RAID data loss.
> >>>
> >>
> >>I've seen those links. I have yet to see where there was any loss of
> >>data proven. Some conjectures in blogs, for instance. But I want to
> >>see documented facts.
> >
> >
> > Jerry, I'm having trouble believing that you can't come up with a data
> > loss scenario for conventional RAID-1.
> >
>
> What do you mean ME coming up with a data loss scenario? YOU come up
> with a single point of failure which loses data on a real RAID-1. A
> point of failure which isn't reflected back to the OS, of course.
>
> >
> >>And I've removed the cross-post. If I want a discussion in
> >>comp.arch.storage, I will post in it.
> >>
> >>
> >>>>>>...
> >>>>>>I don't know of anyone who has "a story" about these systems where data
> >>>>>>was lost on RAID-1 or RAID-10.
> >>>>>
> >>>>>
> >>>>>It hasn't happened to me either, but it has happened to many others.
> >>>>>
> >>>>
> >>>>Specifics? Using RAID-1 or RAID-10?
> >>>>
> >>>>
> >>>>
> >>>>>>These systems duplicate everything. They have multiple controllers.
> >>>>>>Separate cables. Even separate power supplies in the most critical
> >>>>>>cases. Even a power failure just powers down the device (and take the
> >>>>>>system down).
> >>>>>>
> >>>>>>Also, ZFS doesn't protect against a bad disk, for instance. All it does
> >>>>>>is guarantee the data was written properly.
> >>>>>
> >>>>>
> >>>>>It does considerably better than RAID-1 here, in several ways - by
> >>>>>verifying writes; verifying reads; by healing immediately a data error
> >>>>>is found; and by (optionally) making scrubbing passes to reduce the
> >>>>>possibility of undetected loss (this also works for conventional RAID
> >>>>>of course, subject to error detection limitations).
> >>>>>
> >>>>
> >>>>And how does it recover from a disk crash? Or what happens if the data
> >>>>goes bad after being written and read back?
> >>>
> >>>
> >>>You use the redundancy to repair it. RAID-1 does not do this.
> >>>
> >>
> >>No, RAID-1 has complete mirrors of the data. And if it detects an error
> >>on the primary disk it can correct the error from the mirror, automatically.
> >
> >
> > In fact, it does not. It reads from only one side of the mirror. Yes,
> > *if the drive reports an error* it can fix from the other side. ZFS
> > does not depend on the drive (or any subsystem) reliably reporting
> > errors. (I'm not inventing this, I'm only describing.)
> >
>
> Maybe not the implementations you're familiar with. True fault tolerant
> ones will detect a failure on one side and automatically correct it by
> fetching the data from the other side.
>
> And tell me exactly under what conditions the drive would not report an
> error but ZFS would have caught it. Specific details, please.
>
> >
> >>>>Additionally, it depends on the software correctly detecting and
> >>>>signaling a data error.
> >>>
> >>>
> >>>Which RAID-1 cannot do at all.
> >>>
> >>
> >>Actually, RAID-1 does do it. In case you aren't aware, all sectors on
> >>the disks are checksummed.
> >
> >
> > Are you referring to disk internals? If so, it's not relevant to a
> > comparison between RAID-1 and ZFS, since the mechanism applies in both
> > cases. ZFS applies a further level of checksumming as you know.
> >
>
> Which makes ZFS's checksum unnecessary and irrelevant - unless you're
> using cheap drives, that is.
>
> >
> >>If there is a failure, the hardware will
> >>detect it, long before it even gets to the software. The hardware can
> >>even retry the operation, or it can go straight to the mirror.
> >>
> >>
> >>>>>>A failing controller can
> >>>>>>easily overwrite the data at some later time. RAID-1 and RAID-10 could
> >>>>>>still have that happen, but what are the chances of two separate
> >>>>>>controllers having exactly the same failure at the same time?
> >>>>>
> >>>>>
> >>>>>The difference is that ZFS will see the problem (checksum) and
> >>>>>automatically salvage the data from the good side, while RAID-1 will
> >>>>>not discover the damage (only reads from one side of the mirror).
> >>>>>Obviously checksumming is the critical difference; RAID-1 is entirely
> >>>>>dependent on the drive correctly signalling errors (correctable or
> >>>>>not); it cannot independently verify data integrity and remains
> >>>>>vulnerable to latent data loss.
> >>>>>
> >>>>
> >>>>If it's a single sector. But if the entire disk crashes - i.e. an
> >>>>electronics failure?
> >>>
> >>>
> >>>That's right, it cannot bring a dead disk back to life...
> >>>
> >>
> >>Nope, but the mirror still contains the data.
> >>
> >>
> >>>>But all data is mirrored. And part of the drive's job is to signal
> >>>>errors. One which doesn't do that correctly isn't much good, is it?
> >>>
> >>>
> >>>You're right that RAID-1 is built on the assumption that drives
> >>>perfectly report errors. ZFS isn't.
> >>>
> >>
> >>Do you really understand how drives work? I mean the actual electronics
> >>of it? Could you read a schematic, scope a failing drive down to the
> >>bad component? Do you have that level of knowledge?
> >>
> >>If not, please don't make statements you have no real understanding of.
> >> I can do that, and more. And I have done it.
> >
> >
> > Is that actually relevant here?
> >
>
> You're making technical claims. Provide the technical support to back
> up your claims.
>
> > My statement was, ZFS does not assume drives, controllers, drivers or
> > any level of the stack faithfully reports errors. I'm not inventing
> > that. Its design principle is, as Richard writes, distrust of the
> > entire I/O stack (a.k.a. Bonwick's "end-to-end"). You may not like to
> > hear the words from me (since you've decided I'm not worth listening
> > to), but there it is.
> >
>
> Sure you should distrust the I/O stack. It can be overwritten so many
> ways by the software.
>
> Unlike the controller - where it can't be overwritten.
>
> I'll talk to you. But please don't insult me by repeating technical
> claims when you don't understand the background behind them. As I said
> before - I've read the links you provided, and quite frankly don't agree
> with a number of their claims.

Sure, let's talk. Would you humour me with a reply in comp.unix.solaris
with details on the ZFS shortcomings you were talking about - a genuine
request, because some of us *have* invested in that technology.
Tomorrow I'll think over the points you question above.

>
> >
> >>>As Richard Elling writes, "We don't have to rely on a parity protected
> >>>SCSI bus, or a bug-free disk firmware (I've got the scars) to ensure
> >>>that what is on persistent storage is what we get in memory. ... by
> >>>distrusting everything in the storage data path we will build in the
> >>>reliability and redundancy into the file system."
> >>>
> >>
> >>So, you read a few statements and argue your point without any real
> >>technical knowledge of what goes on behind the scenes?
> >>
> >>Can you tell me the chances of having an undetected problem on a parity
> >>protected SCSI bus? Or even a non-parity protected one? And can you
> >>give me the details of the most common causes of those? I thought not.
> >
> >
> > OK. Seems you're pretty angry about something...
> >
>
> Not angry at all. Just trying to find out if you understand what you're
> talking about.

You've decided I don't. But let's press on while it remains civil.

>
> >
> >>And bug-free disk firmware? Disk firmware is a LOT more bug free than
> >>any OS software I've ever seen, including Linux. That's because it has
> >>to do a limited amount of operations with a limited interface.
> >>
> >>Unlike a file system which has to handle many additional operations on
> >>different disk types and configurations.
> >>
> >>And BTW - how many disk firmware bugs have you heard about recently? I
> >>don't say they can't occur. But the reliable disk manufacturers check,
> >>double-check and triple-check their code before it goes out. Then they
> >>test it again.
> >>
> >>
> >>>>>>I have in the past been involved in some very critical databases. They
> >>>>>>all use various RAID devices. And the most critical use RAID-1 or RAID-10.
> >>>>>
> >>>>>
> >>>>>We can do even better these days.
> >>>>>
> >>>>>Related links of interest:
> >>>>>http://blogs.sun.com/bonwick/
> >>>>>http://blogs.sun.com/relling/entry/zfs_from_a_ras_point
> >>>>>https://www.gelato.unsw.edu.au/archives/comp-arch/2006-September/003008.html
> >>>>>http://www.lockss.org/locksswiki/files/3/30/Eurosys2006.pdf [A Fresh
> >>>>>Look at the Reliability of Long-term Digital Storage, 2006]
> >>>>>http://www.ecsl.cs.sunysb.edu/tr/rpe19.pdf [Challenges of Long-Term
> >>>>>Digital Archiving: A Survey, 2006]
> >>>>>http://www.cs.wisc.edu/~vijayan/vijayan-thesis.pdf [IRON File Systems,
> >>>>>2006]
> >>>>>http://www.tcs.hut.fi/~hhk/phd/phd_Hannu_H_Kari.pdf [Latent Sector
> >>>>>Faults and Reliability of Disk Arrays, 1997]
> >>>>>
> >>>>
> >>>>So? I don't see anything in any of these articles which affects this
> >>>>discussion. We're not talking about long term digital storage, for
> >>>>instance.
> >>>
> >>>
> >>>I think that's quite relevant to many "business critical" database
> >>>systems. Databases are even evolving in response to changing
> >>>*regulatory* requirements: MySQL's ARCHIVE engine, for instance...
> >>>
> >>
> >>What does MySQL's ARCHIVE engine have to do with "regulatory
> >>requirements"? In case you haven't noticed, MySQL is NOT a US company
> >>(although they do have a U.S. subsidiary).
> >
> >
> > It was a subtle point. Don't sweat it.
> >
>
> Then why even bring it up? Because it's irrelevant?

I don't think "long term data storage" is irrelevant to databases and
data integrity. It *is* irrelevant to the OP's question, of course :)

>
> >
> >>>>I'm just curious. How many critical database systems have you actually
> >>>>been involved with? I've lost count. ...
> >>>>These systems are critical to their business. ...
> >>>
> >>>
> >>>None of this is relevant to what I'm trying to convey, which is simply:
> >>>What ZFS does beyond RAID.
> >>>
> >>>Why are you taking the position that they are equivalent? There are
> >>>innumerable failure modes that RAID(-1) cannot handle, which ZFS does.
> >>>
> >>
> >>I'm not taking the position they are equivalent. I'm taking the
> >>position that ZFS is an inferior substitute for a true RAID-1 or RAID-10
> >>implementation.
> >
> >
> > I don't believe that is the case. We'll have to agree to disagree.
> >
>
> The difference is I don't just accept what someone claims. Rather, I
> analyze and determine just how accurate the statements are.

Please don't assume I have done none of my own thinking.

>
> >
> >>>>BTW - NONE of them use zfs - because these are mainframe systems, not
> >>>>Linux. But they all use the mainframe versions of RAID-1 or RAID-10.
> >>>
> >>>
> >>>I still claim - along with Sun - that you can, using more modern
> >>>software, improve on the integrity and availability guarantees of
> >>>RAID-1. This applies equally to the small systems I specify (say, a
> >>>small mirrored disk server storing POS account data) as to their
> >>>humongous storage arrays.
> >>>
> >>
> >>OK, you can maintain it. But a properly configured and operating RAID-1
> >>or RAID-10 array needs no such assistance.
> >
> >
> > But there are numerous failure modes they can't handle. Any unreported
> > data error on disk, for instance.
> >
>
> And exactly how can you get an unreported data error from a disk?

If the error is introduced in the cable, the controller, RAM, and so
on, the disk never sees it and so has nothing to report. I have seen
this myself.

>
> > Btw, if you want information from "more qualified sources" than myself
> > on ZFS, you should continue to post in comp.unix.solaris. My resume
> > isn't as long as yours, as we have established several times, and you
> > clearly have decided I have nothing useful to contribute. Oh well.
> >
>
> Not really. You butted into this conversation and discussed zfs -
> which, BTW, is a UNIX-only file system. And in case you haven't figured
> out, UNIX is NOT the only OS out there. Even MySQL recognizes that.

Yes, I was talking about specific capabilities of ZFS. The fact it's
UNIX-specific isn't really important to those principles.

>
> I'm just refuting your wild claims. But you're not interested in
> discussing hard facts - you make claims about "unreported data errors",
> for instance, but have no idea how they can happen, how often they
> happen or the odds of them happening.

I'm not sure I've made any wild claims other than "I think ZFS can
guarantee more than conventional RAID", due to concepts which underpin
ZFS' design. That's not very "wild". There is some inductive thinking
involved, not sheer speculation. If you calmed down, we could talk it
over. You've said you distrust Bonwick -- so I'd like to hear why he's
wrong (in the appropriate forum).

>
> All you have is a sales pitch you've bought.
>
> Thanks, I have better things to do with my time. Bye.
>
> >
> >>>>In any case - this is way off topic for this newsgroup. The original
> >>>>question was "Can I prevent the loss of a significant portion of my data
> >>>>in the case of a MySQL, OS or hardware failure, when using MyISAM?".
> >>>>
> >>>>The answer is no.
> >>>>
> >>>>--
> >>>>==================
> >>>>Remove the "x" from my email address
> >>>>Jerry Stuckle
> >>>>JDS Computer Training Corp.
> >>>>jstucklex@attglobal.net
> >>>>==================
> >>>
> >>>
> >>
> >>--
> >>==================
> >>Remove the "x" from my email address
> >>Jerry Stuckle
> >>JDS Computer Training Corp.
> >>jstucklex@attglobal.net
> >>==================
> >
> >
>
>
> --
> ==================
> Remove the "x" from my email address
> Jerry Stuckle
> JDS Computer Training Corp.
> jstucklex@attglobal.net
> ==================

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S,hardware, whatever)

am 12.11.2006 06:36:06 von Bill Todd

Jerry Stuckle wrote:
> Bill Todd wrote:
>> Jerry Stuckle wrote:
>>
>> ...
>>
>>> ZFS is not proof against silent errors - they can still occur.
>>
>>
>> Of course they can, but they will be caught by the background
>> verification scrubbing before much time passes (i.e., within a time
>> window that radically reduces the likelihood that another disk will
>> fail before the error is caught and corrected), unlike the case with
>> conventional RAID (where they aren't caught at all, and rise up to
>> bite you - with non-negligible probability these days - if the good
>> copy then dies).
>>
>> And ZFS *is* proof against silent errors in the sense that data thus
>> mangled will not be returned to an application (i.e., it will be
>> caught when read if the background integrity validation has not yet
>> reached it) - again, unlike the case with conventional mirroring,
>> where there's a good chance that it will be returned to the
>> application as good.
>>
>
>
> The same is true with RAID-1 and RAID-10. An error on the disk will be
> detected and returned by the hardware to the OS.

I'd think that someone as uninformed as you are would have thought twice
about appending an ad for his services to his Usenet babble. But formal
studies have shown that the least competent individuals seem to be the
most confident of their opinions (because they just don't know enough to
understand how clueless they really are).

Do you even know what a silent error is? It's an error that the disk
does not notice, and hence cannot report.

Duh.

In some of your other recent drivel you've seemed to suggest that this
simply does not happen. Well, perhaps not in your own extremely limited
experience, but you really shouldn't generalize from that.

A friend of mine at DEC investigated this about a decade ago and found
that the (high-end) disk subsystems of some (high-end) Alpha platforms
were encountering undetected errors on average every few TB (i.e., what
they read back was, very rarely, not quite what they had written in,
with no indication of error). That may be better today (that's more
like the uncorrectable error rate now), but it still happens. The
causes are well known to people reasonably familiar with the technology:
the biggies are writes that report successful completion but in fact
do nothing, writes that go to the wrong target sector(s) (whether or not
they report success), and errors that the sector checksums just don't
catch (those used to be about three orders of magnitude rarer than
uncorrectable errors, but that was before the rush toward higher density
and longer checksums to catch the significantly-increased raw error
rates - disk manufacturers no longer report the undetected error rate,
but I suspect that it's considerably closer to the uncorrectable error
rate now). There are also a few special cases - e.g., the disk that
completes a sector update while power is failing, not knowing that the
transfer from memory got clamped part-way through and returned zeros
rather than whatever it was supposed to (so as far as the disk knows
they're valid).

IBM, Unisys, NetApp, and EMC (probably not an exhaustive list, but
they're the ones that spring immediately to mind) all use non-standard
disk sector sizes in some of their systems to hold additional validation
information (maintained by software or firmware well above the disk
level) aimed at catching some (but in most cases not all) of these
unreported errors.

Silent errors are certainly rare, but they happen. ZFS catches them.
RAID does not. End of story.

....

> The big difference being ZFS is done in software, which requires CPU
> cycles and other resources.

Since when was this discussion about use of resources rather than
integrity (not that ZFS's use of resources for implementing its own
RAID-1/RAID-10 facilities is significant anyway)?

> It's also open to corruption.

No more than the data that some other file system gives to a hardware
RAID implementation would be: it all comes from the same place (main
memory).

However, because ZFS subsequently checks what it wrote against a
*separate* checksum, if it *was* corrupted below the request-submission
level ZFS is very likely to find out, whereas a conventional RAID
implementation (and the higher layers built on top of it) won't: they
just write what (they think) they're told to, with no additional check.

> RAID-1 and RAID-10 are implemented in hardware/firmware which cannot be
> corrupted (Read only memory) and require no CPU cycles.

If your operating system and file system have been corrupted, you've got
problems regardless of how faithfully your disk hardware transfers this
corruption to its platters: this alleged deficiency compared with a
hardware implementation is just not an issue.

You've also suggested elsewhere that a hardware implementation is less
likely to contain bugs, which at least in this particular instance is
nonsense: ZFS's RAID-1/10 implementation benefits from the rest of its
design such that it's likely *far* simpler than any high-performance
hardware implementation (with its controller-level cache management and
deferred write-back behavior) is, and hence if anything likely *less* buggy.

>
>> Plus it is not proof against
>>
>>> data decaying after it is written to disk.
>>
>>
>> No - but, again, it will catch it before long, even in cases where
>> conventional disk scrubbing would not.
>>
>
> So do RAID-1 and RAID-10.

No, they typically do not: they may scrub to ensure that sectors can be
read successfully (and without checksum errors), but they do not compare
one copy with the other (and even if they did, if they found that the
copies differed they'd have no idea which one was the right one - but
ZFS knows).
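
As a sketch only (an illustration of the idea, not Sun's code): compare
a readability-only scrub with a checksum-directed scrub over a two-way
mirror. Only the latter can say which copy is the good one and rewrite
the bad side; the dictionaries below are invented stand-ins for real
disks.

import hashlib

def cksum(b):
    return hashlib.sha256(b).hexdigest()

# Two-way mirror: each "disk" maps block number -> contents.
disk_a = {0: b"customer record, version 2"}
disk_b = {0: b"customer record, version 1"}   # stale/decayed, but readable
block_cksums = {0: cksum(disk_a[0])}          # known-good checksums, out-of-band

def raid_scrub():
    # Conventional scrub: prove every sector is readable, nothing more.
    for blk in block_cksums:
        _ = disk_a[blk], disk_b[blk]          # no read error -> "healthy"
    return "no errors found"

def checksum_scrub():
    repaired = []
    for blk, good in block_cksums.items():
        a_ok = cksum(disk_a[blk]) == good
        b_ok = cksum(disk_b[blk]) == good
        if a_ok and not b_ok:
            disk_b[blk] = disk_a[blk]; repaired.append(("disk_b", blk))
        elif b_ok and not a_ok:
            disk_a[blk] = disk_b[blk]; repaired.append(("disk_a", blk))
    return repaired

print(raid_scrub())        # -> no errors found (the stale copy slips through)
print(checksum_scrub())    # -> [('disk_b', 0)]  bad side found and rewritten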

>
>> And, as you note, it doesn't
>>
>>> handle a disk crash.
>>
>>
>> It handles it with resilience comparable to RAID-1, but is more
>> flexible in that it can then use distributed free space to restore the
>> previous level of redundancy (whereas RAID-1/RAID-10 cannot unless the
>> number of configured hot spare disks equals the number of failed disks).
>>
>
> And for a critical system you have that redundancy and more.

So, at best, RAID-1/10 matches ZFS in this specific regard (though of
course it can't leverage the additional bandwidth and IOPS of its spare
space, unlike ZFS). Whoopee.

>
>>>
>>> But when properly implemented, RAID-1 and RAID-10 will detect and
>>> correct even more errors than ZFS will.
>>
>
> A complete disk crash, for instance. Even Toby admitted ZFS cannot
> recover from a disk crash.
>
> ZFS is good. But it's a cheap software implementation of an expensive
> hardware recovery system. And there is no way software can do it as
> well as hardware does.

You at least got that right: ZFS does it considerably better, not
merely 'as well'. And does so at significantly lower cost (so you got
that part right too).

The one advantage that a good hardware RAID-1/10 implementation has over
ZFS relates to performance, primarily small-synchronous-write latency:
while ZFS can group small writes to achieve competitive throughput (in
fact, superior throughput in some cases), it can't safely report
synchronous write completion until the data is on the disk platters,
whereas a good RAID controller will contain mirrored NVRAM that can
guarantee persistence in microseconds rather than milliseconds (and then
destage the writes to the platters lazily).

Now, ZFS does have an 'intent log' for small writes, and does have the
capability of placing this log on (mirrored) NVRAM to achieve equivalent
small-synchronous-write latency - but that's a hardware option, not part
and parcel of ZFS itself.
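
A rough sketch of that trade-off (generic Python, not the actual intent
log implementation; the "NVRAM log" below is just a list standing in
for a persistent device): a synchronous write can only be acknowledged
once the bits are somewhere persistent - either the platters, which
costs milliseconds, or a mirrored NVRAM log that is destaged to the
main pool later.

import os

def sync_write_to_platters(path, data):
    # Acknowledge only after the OS has pushed the data to the device.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, data)
        os.fsync(fd)               # wait for stable storage: milliseconds
    finally:
        os.close(fd)

def sync_write_via_intent_log(nvram_log, destage_queue, data):
    # With a persistent intent-log device, the caller is acknowledged as
    # soon as the log entry is durable (microseconds on NVRAM); the write
    # to the main pool happens lazily in the background.
    nvram_log.append(data)         # assumed non-volatile in this sketch
    destage_queue.append(data)     # flushed to the platters later

nvram_log, destage_queue = [], []
sync_write_to_platters("/tmp/demo_commit_log", b"txn 17 committed\n")
sync_write_via_intent_log(nvram_log, destage_queue, b"txn 18 committed\n")
print("log entries awaiting destage:", len(destage_queue))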

....

>> Please name even one.

Why am I not surprised that you dodged that challenge?

Now, as far as credentials go, some people (who aren't sufficiently
familiar with this subject to know just how incompetent you really are
to discuss it) might find yours impressive (at least you appear to think
they might, since you made some effort to trot them out). I must admit
that I can't match your claim to have been programming since "1867", but
I have been designing and writing system software since 1976 (starting
with 11 years at DEC), and had a stint at EMC designing high-end storage
firmware in the early '90s. I specialize in designing and implementing
high-performance, high-availability distributed file, object, and
database systems, and have personally created significant portions of
several such; in this pursuit, I've kept current on the state of the art
both in academia and in the commercial arena.

And I say you're full of shit. Christ, you've never even heard of
people losing mirrored data at all - not from latent errors only
discovered at rebuild time, not from correlated failures of mirror pairs
from the same batch (or even not from the same batch - with a large
enough RAID-10 array there's a modest probability that some pair won't
recover from a simple power outage, and - though this may be news to you
- even high-end UPSs are *not* infallible)...

Sheesh.

- bill

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S,hardware, whatever)

am 12.11.2006 15:30:14 von Jerry Stuckle

Bill Todd wrote:
> Jerry Stuckle wrote:
>
>> Bill Todd wrote:
>>
>>> Jerry Stuckle wrote:
>>>
>>> ...
>>>
>>>> ZFS is not proof against silent errors - they can still occur.
>>>
>>>
>>>
>>> Of course they can, but they will be caught by the background
>>> verification scrubbing before much time passes (i.e., within a time
>>> window that radically reduces the likelihood that another disk will
>>> fail before the error is caught and corrected), unlike the case with
>>> conventional RAID (where they aren't caught at all, and rise up to
>>> bite you - with non-negligible probability these days - if the good
>>> copy then dies).
>>>
>>> And ZFS *is* proof against silent errors in the sense that data thus
>>> mangled will not be returned to an application (i.e., it will be
>>> caught when read if the background integrity validation has not yet
>>> reached it) - again, unlike the case with conventional mirroring,
>>> where there's a good chance that it will be returned to the
>>> application as good.
>>>
>>
>>
>> The same is true with RAID-1 and RAID-10. An error on the disk will
>> be detected and returned by the hardware to the OS.
>
>
> I'd think that someone as uninformed as you are would have thought twice
> about appending an ad for his services to his Usenet babble. But formal
> studies have shown that the least competent individuals seem to be the
> most confident of their opinions (because they just don't know enough to
> understand how clueless they really are).
>

How you wish. I suspect I have many years more experience and knowledge
than you. And am more familiar with fault tolerant systems.


> Do you even know what a silent error is? It's an error that the disk
> does not notice, and hence cannot report.
>

Yep. And please tell me EXACTLY how this can occur.

> Duh.
>
> In some of your other recent drivel you've seemed to suggest that this
> simply does not happen. Well, perhaps not in your own extremely limited
> experience, but you really shouldn't generalize from that.
>

I didn't say it CAN'T happen.  My statement was that it is SO UNLIKELY
to happen that it can be virtually ignored.

> A friend of mine at DEC investigated this about a decade ago and found
> that the (high-end) disk subsystems of some (high-end) Alpha platforms
> were encountering undetected errors on average every few TB (i.e., what
> they read back was, very rarely, not quite what they had written in,
> with no indication of error). That may be better today (that's more
> like the uncorrectable error rate now), but it still happens. The
> causes are well known to people reasonably familiar with the technology:
> the biggies are writes that report successful completion but in fact do
> nothing, writes that go to the wrong target sector(s) (whether or not
> they report success), and errors that the sector checksums just don't
> catch (those used to be about three orders of magnitude rarer than
> uncorrectable errors, but that was before the rush toward higher density
> and longer checksums to catch the significantly-increased raw error
> rates - disk manufacturers no longer report the undetected error rate,
> but I suspect that it's considerably closer to the uncorrectable error
> rate now). There are also a few special cases - e.g., the disk that
> completes a sector update while power is failing, not knowing that the
> transfer from memory got clamped part-way through and returned zeros
> rather than whatever it was supposed to (so as far as the disk knows
> they're valid).
>

That was a decade ago. What are the figures TODAY? Do you even know?
Do you even know why they happen?


> IBM, Unisys, NetApp, and EMC (probably not an exhaustive list, but
> they're the ones that spring immediately to mind) all use non-standard
> disk sector sizes in some of their systems to hold additional validation
> information (maintained by software or firmware well above the disk
> level) aimed at catching some (but in most cases not all) of these
> unreported errors.
>

That's one way things are done.

> Silent errors are certainly rare, but they happen. ZFS catches them.
> RAID does not. End of story.
>

And what about data which is corrupted once it's placed in the ZFS
buffer? ZFS buffers are in RAM, and can be overwritten at any time.

And ZFS itself can be corrupted - telling the disk to write to the wrong
sector, for instance. It is subject to viruses. It runs only on UNIX.

The list goes on. But these are things which can occur much more easily
than the hardware errors you mention. And they are conveniently ignored
by those who have bought the ZFS hype.

> ...
>
> The
>
>> big difference being ZFS if done in software, which requires CPU
>> cycles and other resources.
>
>
> Since when was this discussion about use of resources rather than
> integrity (not that ZFS's use of resources for implementing its own
> RAID-1/RAID-10 facilities is significant anyway)?
>

It's always about performance. 100% integrity is no good if you need
100% of the system resources to handle it.

>> It's also open to corruption.
>
>
> No more than the data that some other file system gives to a hardware
> RAID implementation would be: it all comes from the same place (main
> memory).
>

Ah, but this is much more likely than once it's been passed off to the
hardware, no?

> However, because ZFS subsequently checks what it wrote against a
> *separate* checksum, if it *was* corrupted below the request-submission
> level ZFS is very likely to find out, whereas a conventional RAID
> implementation (and the higher layers built on top of it) won't: they
> just write what (they think) they're told to, with no additional check.
>

So? If the buffer is corrupted, the checksum will be, also. And if the
data is written to the wrong sector, the checksum will still be correct.

> RAID-1 and RAID-10
>
>> are implemented in hardware/firmware which cannot be corrupted (Read
>> only memory) and require no CPU cycles.
>
>
> If your operating system and file system have been corrupted, you've got
> problems regardless of how faithfully your disk hardware transfers this
> corruption to its platters: this alleged deficiency compared with a
> hardware implementation is just not an issue.
>

The hardware is MUCH MORE RELIABLE than the software.

> You've also suggested elsewhere that a hardware implementation is less
> likely to contain bugs, which at least in this particular instance is
> nonsense: ZFS's RAID-1/10 implementation benefits from the rest of its
> design such that it's likely *far* simpler than any high-performance
> hardware implementation (with its controller-level cache management and
> deferred write-back behavior) is, and hence if anything likely *less*
> buggy.
>

Yea, right. Keep believing it. Because when talking about software
implementations, you have to also consider the OS and other software
running at the time.

>>
>>> Plus it is not proof against
>>>
>>>> data decaying after it is written to disk.
>>>
>>>
>>>
>>> No - but, again, it will catch it before long, even in cases where
>>> conventional disk scrubbing would not.
>>>
>>
>> So do RAID-1 and RAID-10.
>
>
> No, they typically do not: they may scrub to ensure that sectors can be
> read successfully (and without checksum errors), but they do not compare
> one copy with the other (and even if they did, if they found that the
> copies differed they'd have no idea which one was the right one - but
> ZFS knows).
>

No, they don't. But the odds of an incorrect read generating a valid
checksum with current algorithms (assuming high quality drives -
different manufacturers use different techniques) are now so low as to
be negligible. You're more likely to have something overwritten in
memory than a silent error.

>>
>>> And, as you note, it doesn't
>>>
>>>> handle a disk crash.
>>>
>>>
>>>
>>> It handles it with resilience comparable to RAID-1, but is more
>>> flexible in that it can then use distributed free space to restore
>>> the previous level of redundancy (whereas RAID-1/RAID-10 cannot
>>> unless the number of configured hot spare disks equals the number of
>>> failed disks).
>>>
>>
>> And for a critical system you have that redundancy and more.
>
>
> So, at best, RAID-1/10 matches ZFS in this specific regard (though of
> course it can't leverage the additional bandwidth and IOPS of its spare
> space, unlike ZFS). Whoopee.
>

And it does it with more integrity and better performance, as indicated
above.

>>
>>>>
>>>> But when properly implemented, RAID-1 and RAID-10 will detect and
>>>> correct even more errors than ZFS will.
>>>
>>>
>>
>> A complete disk crash, for instance. Even Toby admitted ZFS cannot
>> recover from a disk crash.
>>
>> ZFS is good. But it's a cheap software implementation of an expensive
>> hardware recovery system. And there is no way software can do it as
>> well as hardware does.
>
>
> You at least got that right: ZFS does it considerably better, not
> merely 'as well'. And does so at significantly lower cost (so you got
> that part right too).
>

The fact that you even try to claim that ZFS is better than a RAID-1 or
RAID-10 system shows just how little you understand critical systems,
and how much you've bought into the ZFS hype.

> The one advantage that a good hardware RAID-1/10 implementation has over
> ZFS relates to performance, primarily small-synchronous-write latency:
> while ZFS can group small writes to achieve competitive throughput (in
> fact, superior throughput in some cases), it can't safely report
> synchronous write completion until the data is on the disk platters,
> whereas a good RAID controller will contain mirrored NVRAM that can
> guarantee persistence in microseconds rather than milliseconds (and then
> destage the writes to the platters lazily).
>

That's one advantage, yes.

> Now, ZFS does have an 'intent log' for small writes, and does have the
> capability of placing this log on (mirrored) NVRAM to achieve equivalent
> small-synchronous-write latency - but that's a hardware option, not part
> and parcel of ZFS itself.
>

Oh, so you're now saying that synchronous writes may not be truly
synchronous with ZFS? That's something I didn't know. I thought ZFS
was smarter than that.

> ...
>
>>> Please name even one.
>
>
> Why am I not surprised that you dodged that challenge?
>

Because I'm not the one making the claim. You make a claim? Don't
expect me to do your work backing it up for you.

> Now, as far as credentials go, some people (who aren't sufficiently
> familiar with this subject to know just how incompetent you really are
> to discuss it) might find yours impressive (at least you appear to think
> they might, since you made some effort to trot them out). I must admit
> that I can't match your claim to have been programming since "1867", but
> I have been designing and writing system software since 1976 (starting
> with 11 years at DEC), and had a stint at EMC designing high-end storage
> firmware in the early '90s. I specialize in designing and implementing
> high-performance, high-availability distributed file, object, and
> database systems, and have personally created significant portions of
> several such; in this pursuit, I've kept current on the state of the art
> both in academia and in the commercial arena.
>

That was 1967, obviously a typo.

OK, well, when you get an electronics background, you can start talking
with intelligence about just how all of those hardware problems occur.


> And I say you're full of shit. Christ, you've never even heard of
> people losing mirrored data at all - not from latent errors only
> discovered at rebuild time, not from correlated failures of mirror pairs
> from the same batch (or even not from the same batch - with a large
> enough RAID-10 array there's a modest probability that some pair won't
> recover from a simple power outage, and - though this may be news to you
> - even high-end UPSs are *not* infallible)...
>
> Sheesh.
>
> - bill

ROFLMAO! Just like a troll. Jump into the middle of a discussion
uninvited. Doesn't have any real knowledge, but is an expert on
everything. Then makes personal attacks against the other person to
cover for this deficiency.

Hell, you'd probably have to look up Ohm's Law. And you're lecturing me
on how much more reliable software is than hardware?

Go back into your little hole, troll.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, whatever)

am 12.11.2006 15:31:33 von Jerry Stuckle

toby wrote:
> toby wrote:
>
>>Jerry Stuckle wrote:
>>
>>>...
>>>Actually, I understand quite a bit about ZFS. However, unlike you, I
>>>also understand its shortcomings.
>
>
> This group and I would very much like to hear about those shortcomings,
> if you would elucidate.
>
>
>>>That's because I started working on
>>>fault-tolerant drive systems starting in 1977 as a hardware CE for IBM,
>>>working on large mainframes. I've watched it grow over the years. And
>>>as a EE major, I also understand the hardware and it's strengths and
>>>weaknesses - in detail.
>>>
>>>And as a CS major (dual majors) and programmer since 1867, including
>>>working on system software for IBM in the 1980's I have a thorough
>>>understanding of the software end.
>>>
>>>And it's obvious from your statements you have no real understanding or
>>>either, other than sales literature.
>
>

You would. This group would not. You want to find out, you go to the
relevant groups. Don't bring your garbage here. It is not appropriate
for this group.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, whatever)

am 12.11.2006 17:47:42 von Frank Cusack

On 11 Nov 2006 19:30:25 -0800 "toby" wrote:
> Jerry Stuckle wrote:
>> REAL RAID-1 and RAID-10 are implemented in hardware/firmware, not a
>> software system such as ZFS.

This simple statement shows a fundamental misunderstanding of the basics,
let alone zfs.

-frank

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S,hardware, whatever)

am 12.11.2006 18:31:23 von Bill Todd

Jerry Stuckle wrote:
> Bill Todd wrote:
>> Jerry Stuckle wrote:
>>
>>> Bill Todd wrote:
>>>
>>>> Jerry Stuckle wrote:
>>>>
>>>> ...
>>>>
>>>>> ZFS is not proof against silent errors - they can still occur.
>>>>
>>>>
>>>>
>>>> Of course they can, but they will be caught by the background
>>>> verification scrubbing before much time passes (i.e., within a time
>>>> window that radically reduces the likelihood that another disk will
>>>> fail before the error is caught and corrected), unlike the case with
>>>> conventional RAID (where they aren't caught at all, and rise up to
>>>> bite you - with non-negligible probability these days - if the good
>>>> copy then dies).
>>>>
>>>> And ZFS *is* proof against silent errors in the sense that data thus
>>>> mangled will not be returned to an application (i.e., it will be
>>>> caught when read if the background integrity validation has not yet
>>>> reached it) - again, unlike the case with conventional mirroring,
>>>> where there's a good chance that it will be returned to the
>>>> application as good.
>>>>
>>>
>>>
>>> The same is true with RAID-1 and RAID-10. An error on the disk will
>>> be detected and returned by the hardware to the OS.
>>
>>
>> I'd think that someone as uninformed as you are would have thought
>> twice about appending an ad for his services to his Usenet babble.
>> But formal studies have shown that the least competent individuals
>> seem to be the most confident of their opinions (because they just
>> don't know enough to understand how clueless they really are).
>>
>
> How you wish. I suspect I have many years more experience and knowledge
> than you.

You obviously suspect a great deal. Too bad that you don't have any
real clue. Even more too bad that you insist on parading that fact so
persistently.

> And am more familiar with fault tolerant systems.

Exactly how many have you yourself actually architected and built,
rather than simply using the fault-tolerant hardware and software that
others have provided? I've been centrally involved in several.

>
>
>> Do you even know what a silent error is? It's an error that the disk
>> does not notice, and hence cannot report.
>>
>
> Yep. And please tell me EXACTLY how this can occur.

I already did, but it seems that you need things spelled out in simpler
words: mostly, due to bugs in firmware in seldom-used recovery paths
that the vagaries of handling electro-mechanical devices occasionally
require. The proof is in the observed failures: as I said, end of
story (if you're not aware of the observed failures, it's time you
educated yourself in that area rather than kept babbling on
incompetently about the matter).

>
>> Duh.
>>
>> In some of your other recent drivel you've seemed to suggest that this
>> simply does not happen. Well, perhaps not in your own extremely
>> limited experience, but you really shouldn't generalize from that.
>>
>
> I didn't say it CAN'T happen.  My statement was that it is SO UNLIKELY
> to happen that it can be virtually ignored.

My word - you can't even competently discuss what you yourself have so
recently (and so incorrectly) stated.

For example, in response to my previous statement that ZFS detected
errors that RAID-1/10 did not, you said "The same is true with RAID-1
and RAID-10. An error on the disk will be detected and returned by the
hardware to the OS" - no probabilistic qualification there at all.

And on the subject of "data decaying after it is written to disk" (which
includes erroneous over-writes), when I asserted that ZFS "will catch it
before long, even in cases where conventional disk scrubbing would not"
you responded "So do RAID-1 and RAID-10" - again, no probabilistic
qualification whatsoever (leaving aside your incorrect assertion about
RAID's ability to catch those instances that disk-scrubbing does not
reveal).

You even offered up an example *yourself* of such a firmware failure
mode: "A failing controller can easily overwrite the data at some later
time." Easily, Jerry? That doesn't exactly sound like a failure mode
that 'can be virtually ignored' to me. (And, of course, you accompanied
that pearl of wisdom with another incompetent assertion to the effect
that ZFS would not catch such a failure, when of course that's
*precisely* the kind of failure that ZFS is *designed* to catch.)

Usenet is unfortunately rife with incompetent blowhards like you - so
full of themselves that they can't conceive of someone else knowing more
than they do about anything that they mistakenly think they understand,
and so insistent on preserving that self-image that they'll continue
spewing erroneous statements forever (despite their repeated promises to
stop: "I'm not going to respond to you any further", "I'm also not
going to discuss this any more with you", "I'm finished with this
conversation" all in separate responses to toby last night - yet here
you are this morning responding to him yet again).

I'm not fond of blowhards, nor of their ability to lead others astray
technically if they're not confronted. Besides, sticking pins in such
over-inflated balloons is kind of fun.

....

>> IBM, Unisys, NetApp, and EMC (probably not an exhaustive list, but
>> they're the ones that spring immediately to mind) all use non-standard
>> disk sector sizes in some of their systems to hold additional
>> validation information (maintained by software or firmware well above
>> the disk level) aimed at catching some (but in most cases not all) of
>> these unreported errors.
>>
>
> That's one way things are done.

You're awfully long on generalizations and short on specifics, Jerry -
typical Usenet loud-mouthed ignorance. But at least you seem to
recognize that some rather significant industry players consider this
kind of error sufficiently important to take steps to catch them - as
ZFS does but RAID per se does not (that being the nub of this
discussion, in case you had forgotten).

Now, by all means tell us some of the *other* ways that such 'things are
done'.

>
>> Silent errors are certainly rare, but they happen. ZFS catches them.
>> RAID does not. End of story.
>>
>
> And what about data which is corrupted once it's placed in the ZFS
> buffer? ZFS buffers are in RAM, and can be overwritten at any time.

All file data comes from some RAM buffer, Jerry - even that handed to a
firmware RAID. So if it can be corrupted in system RAM, firmware RAID
is no cure at all.

>
> And ZFS itself can be corrupted - telling the disk to write to the wrong
> sector, for instance. It is subject to viruses.

All file data comes from some software OS environment, Jerry - same
comment as above.

> It runs only on UNIX.

My, you do wander: what does this have to do with a discussion about
the value of different approaches to ensuring data integrity?

>
> The list goes on.

Perhaps in your own fevered imagination.

> But these are things which can occur much more easily than the hardware
> errors you mention.

It really does depend on the environment: some system software
environments are wide-open to corruption, while others are so
well-protected from external attacks and so internally bullet-proof that
they often have up-times of a decade or more (and in those the
likelihood of disk firmware errors is sufficiently higher than the kind
of software problems that you're talking about that, by George, their
vendors find it worthwhile to take the steps I mentioned to guard
against them).

But, once again, in the cases where your OS integrity *is* a significant
problem, then firmware RAID isn't going to save you anyway.

> And they are conveniently ignored by those who have bought the ZFS hype.

I really don't know what you've got against ZFS, Jerry, save for the
fact that discussing it has so clearly highlighted your own
incompetence. The only 'hype' that I've noticed around ZFS involves its
alleged 128-bitness (when its files only reach 64 - or is it 63? - bits
in size, and the need for more than 70 - 80 bits of total file system
size within the next few decades is rather difficult to justify).

But its ability to catch the same errors that far more expensive
products from the likes of IBM, NetApp, and EMC are designed to catch is
not hype: it's simple fact.

>
>> ...
>>
>> The
>>
>>> big difference being ZFS if done in software, which requires CPU
>>> cycles and other resources.
>>
>>
>> Since when was this discussion about use of resources rather than
>> integrity (not that ZFS's use of resources for implementing its own
>> RAID-1/RAID-10 facilities is significant anyway)?
>>
>
> It's always about performance. 100% integrity is no good if you need
> 100% of the system resources to handle it.

Horseshit. It's only 'about performance' when the performance impact is
significant. In the case of ZFS's mirroring implementation, it isn't of
any significance at all (let alone any *real* drag on the system).

>
>>> It's also open to corruption.
>>
>>
>> No more than the data that some other file system gives to a hardware
>> RAID implementation would be: it all comes from the same place (main
>> memory).
>>
>
> Ah, but this is much more likely than once it's been passed off to the
> hardware, no?

No: once there's any noticeable likelihood of corruption in system RAM,
then it really doesn't matter how reliable the rest of the system is.

>
>> However, because ZFS subsequently checks what it wrote against a
>> *separate* checksum, if it *was* corrupted below the
>> request-submission level ZFS is very likely to find out, whereas a
>> conventional RAID implementation (and the higher layers built on top
>> of it) won't: they just write what (they think) they're told to, with
>> no additional check.
>>
>
> So? If the buffer is corrupted, the checksum will be, also.

No: in many (possibly all - I'd have to check the code to make sure)
cases ZFS establishes the checksum when the data is moved *into* the
buffer (and IIRC performs any compression and/or encryption at that
point as well: it's a hell of a lot less expensive to do all these at
once as the data is passing through the CPU cache on the way to the
buffer than to fetch it back again later).
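
A tiny sketch of that point (illustrative only; the real work happens
in C inside ZFS): computing the checksum while the data is being copied
into the buffer costs one pass while the data is cache-hot, instead of
a copy now and a separate checksum pass over the buffer later.

import hashlib

def copy_with_checksum(src, chunk=8192):
    buf = bytearray()
    h = hashlib.sha256()
    for off in range(0, len(src), chunk):
        piece = src[off:off + chunk]
        h.update(piece)        # checksummed as it moves through the CPU cache
        buf += piece           # ...on its way into the buffer
    return bytes(buf), h.hexdigest()

data, cks = copy_with_checksum(b"x" * 100000)
print(len(data), cks[:16])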

> And if the data is written to the wrong sector, the checksum will still
> be correct.

No: if the data is written to the wrong sector, any subsequent read
targeting the correct sector will find a checksum mismatch (as will any
read to the sector which was incorrectly written).
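
Again purely as a sketch of the mechanism (the structures are invented,
not ZFS's on-disk format): because the pointer to a block carries both
the block's address and its expected checksum, a write that lands on
the wrong sector is caught from either direction - the intended sector
still holds stale data that no longer matches its pointer, and the
clobbered sector no longer matches *its* pointer.

import hashlib

def cksum(b):
    return hashlib.sha256(b).hexdigest()

disk = {100: b"old contents of block 100", 200: b"innocent block 200"}

# "Block pointers": (address, expected checksum), kept in parent metadata.
pointers = {"fileA": (100, cksum(disk[100])),
            "fileB": (200, cksum(disk[200]))}

# Intend to rewrite fileA's block, but the write is misdirected to sector 200.
new_data = b"new contents of block 100"
pointers["fileA"] = (100, cksum(new_data))   # metadata reflects the new write
disk[200] = new_data                         # ...but the data landed elsewhere

def read(name):
    addr, expect = pointers[name]
    if cksum(disk[addr]) == expect:
        return "OK"
    return "CHECKSUM MISMATCH - go to the mirror"

print("fileA:", read("fileA"))   # stale data at 100 -> mismatch detected
print("fileB:", read("fileB"))   # clobbered data at 200 -> mismatch detected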

>
>> RAID-1 and RAID-10
>>
>>> are implemented in hardware/firmware which cannot be corrupted (Read
>>> only memory) and require no CPU cycles.
>>
>>
>> If your operating system and file system have been corrupted, you've
>> got problems regardless of how faithfully your disk hardware transfers
>> this corruption to its platters: this alleged deficiency compared
>> with a hardware implementation is just not an issue.
>>
>
> The hardware is MUCH MORE RELIABLE than the software.

Once again, nitwit: if the OS-level software is not reliable, it
*doesn't matter* how reliable the hardware is.

>
>> You've also suggested elsewhere that a hardware implementation is less
>> likely to contain bugs, which at least in this particular instance is
>> nonsense: ZFS's RAID-1/10 implementation benefits from the rest of
>> its design such that it's likely *far* simpler than any
>> high-performance hardware implementation (with its controller-level
>> cache management and deferred write-back behavior) is, and hence if
>> anything likely *less* buggy.
>>
>
> Yea, right. Keep believing it.

As will anyone else remotely well-acquainted with system hardware and
software. 'Firmware' is just software that someone has committed to
silicon, after all: it is just as prone to bugs as the system-level
software that you keep disparaging - more so, when it's more complex
than the system-software implementation.

....

> You're more likely to have something overwritten in memory than a
> silent error.

Then (since you seem determined to keep ignoring this point) why on
Earth do you suppose that entirely reputable companies like IBM, NetApp,
and EMC go to such lengths to catch them? If it makes sense for them
(and for their customers), then it's really difficult to see why ZFS's
abilities in that area wouldn't be significant.

....

>>>>> But when properly implemented, RAID-1 and RAID-10 will detect and
>>>>> correct even more errors than ZFS will.
>>>>
>>>>
>>>
>>> A complete disk crash, for instance. Even Toby admitted ZFS cannot
>>> recover from a disk crash.
>>>
>>> ZFS is good. But it's a cheap software implementation of an
>>> expensive hardware recovery system. And there is no way software can
>>> do it as well as hardware does.
>>
>>
>> You at least got that right: ZFS does it considerably better, not
>> merely 'as well'. And does so at significantly lower cost (so you got
>> that part right too).
>>
>
> The fact that you even try to claim that ZFS is better than a RAID-1 or
> RAID-10 system shows just how little you understand critical systems,
> and how much you've bought into the ZFS hype.

The fact that you so consistently misrepresent ZFS as being something
*different* from RAID-1/10 shows that you don't even understand the
definition of RAID: the ZFS back-end *is* RAID-1/10 - it just leverages
its implementation in software to improve its reliability (because
writing directly from system RAM to disk without an intermediate step
through a common controller buffer significantly improves the odds that
*one* of the copies will be correct - and the checksum enables ZFS to
determine which one that is).
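
A toy illustration of that last point (assumed structures, not ZFS
internals): if both mirror copies are staged through one shared
controller buffer, a single corruption there damages both copies
identically; if each copy travels its own path from system RAM, the
same fault damages only one side, and the checksum says which.

import hashlib

def cksum(b):
    return hashlib.sha256(b).hexdigest()

def corrupt(data):
    out = bytearray(data)
    out[0] ^= 0x01          # one flipped bit somewhere in the write path
    return bytes(out)

data = b"mirror me"
good = cksum(data)          # end-to-end checksum held by the file system

# Shared-buffer mirroring: stage once, copy the same (bad) bytes to both disks.
staged = corrupt(data)
shared_mirror = (staged, staged)

# Independent mirroring: each copy is written separately from system RAM.
independent_mirror = (corrupt(data), data)

for name, copies in (("shared buffer", shared_mirror),
                     ("independent  ", independent_mirror)):
    survivors = sum(1 for c in copies if cksum(c) == good)
    print(name, "-> recoverable copies:", survivors)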

>
>> The one advantage that a good hardware RAID-1/10 implementation has
>> over ZFS relates to performance, primarily small-synchronous-write
>> latency: while ZFS can group small writes to achieve competitive
>> throughput (in fact, superior throughput in some cases), it can't
>> safely report synchronous write completion until the data is on the
>> disk platters, whereas a good RAID controller will contain mirrored
>> NVRAM that can guarantee persistence in microseconds rather than
>> milliseconds (and then destage the writes to the platters lazily).
>>
>
> That's one advantage, yes.
>
>> Now, ZFS does have an 'intent log' for small writes, and does have the
>> capability of placing this log on (mirrored) NVRAM to achieve
>> equivalent small-synchronous-write latency - but that's a hardware
>> option, not part and parcel of ZFS itself.
>>
>
> Oh, so you're now saying that synchronous writes may not be truly
> synchronous with ZFS? That's something I didn't know. I thought ZFS
> was smarter than that.

ZFS is, of course, smarter than that - too bad that you aren't.

I said nothing whatsoever to suggest that ZFS did not honor requests to
write synchronously: reread what I wrote until you understand it (and
while you're at it, reread what, if anything, you have read about ZFS
until you understand that as well: your ignorant bluster is becoming
more tiresome than amusing by this point).

>
>> ...
>>
>>>> Please name even one.
>>
>>
>> Why am I not surprised that you dodged that challenge?
>>
>
> Because I'm not the one making the claim.

My - you're either an out-right liar or even more abysmally incompetent
than even I had thought.

Let me refresh your memory: the exchange went

[quote]

> But when properly implemented, RAID-1 and RAID-10 will detect and
correct even more errors than ZFS will.

Please name even one.

[end quote]

At the risk of being repetitive (since reading comprehension does not
appear to be your strong suit), the specific claim (yours, quoted above)
was that "when properly implemented, RAID-1 and RAID-10 will detect and
correct even more errors than ZFS will."

I challenged you to name even one such - and I'm still waiting.

....

> ROFLMAO! Just like a troll. Jump into the middle of a discussion
> uninvited.

I can understand why you'd like to limit the discussion to people who
have even less of a clue than you do (and thus why you keep cropping out
newsgroups where more knowledgeable people might be found), but toby
invited those in comp.arch.storage to participate by cross-posting there.

Since I tend to feel that discussions should continue where they
started, and since it seemed appropriate to respond directly to your
drivel rather than through toby's quoting in his c.a.storage post, I
came over here - hardly uninvited. Rich Teer (who I suspect also
qualifies as significantly more knowledgeable than you) chose to
continue in c.a.storage; people like Jeff Bonwick probably just aren't
interested (as I said, deflating incompetent blowhards is kind of a
hobby of mine, plus something of a minor civic duty - otherwise, I
wouldn't bother with you either).

- bill

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S,hardware, whatever)

am 12.11.2006 20:52:18 von gordonb.wg5hl

>> Do you even know what a silent error is? It's an error that the disk
>> does not notice, and hence cannot report.
>>
>
>Yep. And please tell me EXACTLY how this can occur.

Some drives will accept a sector write into an on-drive buffer and
indicate completion of the write before even attempting it. This
speeds things up. A subsequent discovery of a problem with the
sector-header would not be reported *on that write*. (I don't know
how stuff like this does get reported, possibly on a later write
by a completely different program, but in any case, it's likely too
late to report it to the caller at the user-program level).

Such drives *might* still be able to write data in the buffer cache
(assuming no bad sectors) even if the power fails: something about
using the momentum of the spinning drive to generate power for a
few milliseconds needed. Or maybe just a big capacitor on the
drive.

Drives like this shouldn't be used in a RAID setup, or the option
to indicate completion should be turned off. In the case of SCSI,
the RAID controller probably knows how to do this. In the case of
IDE, it might be manufacturer-specific.

There's a reason that some RAID setups require drives with modified
firmware.


>> In some of your other recent drivel you've seemed to suggest that this
>> simply does not happen. Well, perhaps not in your own extremely limited
>> experience, but you really shouldn't generalize from that.
>>
>
>I didn't say it CAN'T happen. My statement was that it is SO UNLIKELY
>to happen that it can be virtually ignored.

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S,hardware, whatever)

am 12.11.2006 22:12:07 von Jerry Stuckle

Bill Todd wrote:
> Jerry Stuckle wrote:
>
>> Bill Todd wrote:
>>
>>> Jerry Stuckle wrote:
>>>
>>>> Bill Todd wrote:
>>>>
>>>>> Jerry Stuckle wrote:
>>>>>
>>>>> ...
>>>>>
>>>>>> ZFS is not proof against silent errors - they can still occur.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Of course they can, but they will be caught by the background
>>>>> verification scrubbing before much time passes (i.e., within a time
>>>>> window that radically reduces the likelihood that another disk will
>>>>> fail before the error is caught and corrected), unlike the case
>>>>> with conventional RAID (where they aren't caught at all, and rise
>>>>> up to bite you - with non-negligible probability these days - if
>>>>> the good copy then dies).
>>>>>
>>>>> And ZFS *is* proof against silent errors in the sense that data
>>>>> thus mangled will not be returned to an application (i.e., it will
>>>>> be caught when read if the background integrity validation has not
>>>>> yet reached it) - again, unlike the case with conventional
>>>>> mirroring, where there's a good chance that it will be returned to
>>>>> the application as good.
>>>>>
>>>>
>>>>
>>>> The same is true with RAID-1 and RAID-10. An error on the disk will
>>>> be detected and returned by the hardware to the OS.
>>>
>>>
>>>
>>> I'd think that someone as uninformed as you are would have thought
>>> twice about appending an ad for his services to his Usenet babble.
>>> But formal studies have shown that the least competent individuals
>>> seem to be the most confident of their opinions (because they just
>>> don't know enough to understand how clueless they really are).
>>>
>>
>> How you wish. I suspect I have many years more experience and
>> knowledge than you.
>
>
> You obviously suspect a great deal. Too bad that you don't have any
> real clue. Even more too bad that you insist on parading that fact so
> persistently.
>
>> And am more familiar with fault tolerant systems.
>
>
> Exactly how many have you yourself actually architected and built,
> rather than simply using the fault-tolerant hardware and software that
> others have provided? I've been centrally involved in several.
>

Disk drive systems? I admit, none. My design experience has been more
in the digital arena - although I have done some analog design -
balanced amplifiers, etc.

How many disk drive systems have you actually had to troubleshoot?
Locate and replace a failing head, for example? Or a bad op amp in a
read amplifier? Again, none, I suspect. I've done quite a few in my
years.

And from your comments you show absolutely no knowledge of the
underlying electronics, much less the firmware involved. Yet you claim
you've been "centrally involved." Doing what - assembling the pieces?
All you've done is assemble the pieces.

>>
>>
>>> Do you even know what a silent error is? It's an error that the disk
>>> does not notice, and hence cannot report.
>>>
>>
>> Yep. And please tell me EXACTLY how this can occur.
>
>
> I already did, but it seems that you need things spelled out in simpler
> words: mostly, due to bugs in firmware in seldom-used recovery paths
> that the vagaries of handling electro-mechanical devices occasionally
> require. The proof is in the observed failures: as I said, end of
> story (if you're not aware of the observed failures, it's time you
> educated yourself in that area rather than kept babbling on
> incompetently about the matter).
>

OK, and exactly how many of these bugs are there? Disk drive and
similar firmware is some of the most specialized and most heavily tested
firmware on the planet.

And show me hard facts on the failure. Otherwise you're just spewing
marketing bullshit like all trolls - overstating the weaknesses of
the other methods, while maximizing your product's strengths and ignoring
its weaknesses.

You have made claims about how bad disk drives are without ZFS. It's
amazing that computers work at all with all those errors you claim exist!

>>
>>> Duh.
>>>
>>> In some of your other recent drivel you've seemed to suggest that
>>> this simply does not happen. Well, perhaps not in your own extremely
>>> limited experience, but you really shouldn't generalize from that.
>>>
>>
>> I didn't say it CAN'T happen. My statement was that it is SO UNLIKELY
>> to happen that it can be virtually ignored.
>
>
> My word - you can't even competently discuss what you yourself have so
> recently (and so incorrectly) stated.
>
> For example, in response to my previous statement that ZFS detected
> errors that RAID-1/10 did not, you said "The same is true with RAID-1
> and RAID-10. An error on the disk will be detected and returned by the
> hardware to the OS" - no probabilistic qualification there at all.
>

The probabilities are much higher that you will be killed by a meteor in
the next 10 years.

Drive electronics detect errors all the time. They automatically mark
bad spots on the disk. They correct read errors, and if using
verification, they correct write errors.

> And on the subject of "data decaying after it is written to disk" (which
> includes erroneous over-writes), when I asserted that ZFS "will catch it
> before long, even in cases where conventional disk scrubbing would not"
> you responded "So do RAID-1 and RAID-10" - again, no probabilistic
> qualification whatsoever (leaving aside your incorrect assertion about
> RAID's ability to catch those instances that disk-scrubbing does not
> reveal).
>

You never made any probabilistic qualification, so neither did I. If you
want probabilities, you need to supply them.


> You even offered up an example *yourself* of such a firmware failure
> mode: "A failing controller can easily overwrite the data at some later
> time." Easily, Jerry? That doesn't exactly sound like a failure mode
> that 'can be virtually ignored' to me. (And, of course, you accompanied
> that pearl of wisdom with another incompetent assertion to the effect
> that ZFS would not catch such a failure, when of course that's
> *precisely* the kind of failure that ZFS is *designed* to catch.)
>

I didn't say it could be ignored. I did say it could be handled by a
properly configured RAID-1 or RAID-10 array.

> Usenet is unfortunately rife with incompetent blowhards like you - so
> full of themselves that they can't conceive of someone else knowing more
> than they do about anything that they mistakenly think they understand,
> and so insistent on preserving that self-image that they'll continue
> spewing erroneous statements forever (despite their repeated promises to
> stop: "I'm not going to respond to you any further", "I'm also not
> going to discuss this any more with you", "I'm finished with this
> conversation" all in separate responses to toby last night - yet here
> you are this morning responding to him yet again).
>

And unfortunately, it's full of trolls like you who jump unwanted into
conversations where they have no experience, blow assertions out their
asses, and then attack the other person.

I've seen assholes like you before. You're a dime a dozen.

> I'm not fond of blowhards, nor of their ability to lead others astray
> technically if they're not confronted. Besides, sticking pins in such
> over-inflated balloons is kind of fun.
>

Then you should learn to keep your fat mouth shut about things you know
nothing.

> ...
>
>>> IBM, Unisys, NetApp, and EMC (probably not an exhaustive list, but
>>> they're the ones that spring immediately to mind) all use
>>> non-standard disk sector sizes in some of their systems to hold
>>> additional validation information (maintained by software or firmware
>>> well above the disk level) aimed at catching some (but in most cases
>>> not all) of these unreported errors.
>>>
>>
>> That's one way things are done.
>
>
> You're awfully long on generalizations and short on specifics, Jerry -
> typical Usenet loud-mouthed ignorance. But at least you seem to
> recognize that some rather significant industry players consider this
> kind of error sufficiently important to take steps to catch them - as
> ZFS does but RAID per se does not (that being the nub of this
> discussion, in case you had forgotten).
>
> Now, by all means tell us some of the *other* ways that such 'things are
> done'.
>

Get me someone competent in the hardware/firmware end and I can talk all
the specifics you want.

But you've made a hell of a bunch of claims and had no specifics on your
own - other than "ZFS is great and RAID sucks". ROFLMAO!


>>
>>> Silent errors are certainly rare, but they happen. ZFS catches them.
>>> RAID does not. End of story.
>>>
>>
>> And what about data which is corrupted once it's placed in the ZFS
>> buffer? ZFS buffers are in RAM, and can be overwritten at any time.
>
>
> All file data comes from some RAM buffer, Jerry - even that handed to a
> firmware RAID. So if it can be corrupted in system RAM, firmware RAID
> is no cure at all.
>

Ah, but let's see ZFS correct for that! Oh, sorry - I found a failure
mode your beloved file system doesn't handle, didn't I?

>>
>> And ZFS itself can be corrupted - telling the disk to write to the
>> wrong sector, for instance. It is subject to viruses.
>
>
> All file data comes from some software OS environment, Jerry - same
> comment as above.
>

So you admit it can be corrupted.

>> It runs only on UNIX.
>
>
> My, you do wander: what does this have to do with a discussion about
> the value of different approaches to ensuring data integrity?
>

Because your beloved ZFS isn't worth a damn on any other system, that's
why. Let's see it run on MVS/XE, for instance. It doesn't work.
RAID-1/RAID-10 does.

>>
>> The list goes on.
>
>
> Perhaps in your own fevered imagination.
>
> But these are things which can occur much more easily
>

Not at all. I'm just not going to waste my time listing all the
possibilities.

>> than the hardware errors you mention.
>
>
> It really does depend on the environment: some system software
> environments are wide-open to corruption, while others are so
> well-protected from external attacks and so internally bullet-proof that
> they often have up-times of a decade or more (and in those the
> likelihood of disk firmware errors is sufficiently higher than the kind
> of software problems that you're talking about that, by George, their
> vendors find it worthwhile to take the steps I mentioned to guard
> against them).
>
> But, once again, in the cases where your OS integrity *is* a significant
> problem, then firmware RAID isn't going to save you anyway.
>
> And they are conveniently ignored
>

And you conveniently ignore how ZFS can be corrupted. In fact, it is
much more easily corrupted than basic file systems using RAID-1/RAID-10
arrays - if for no other reason than it contains a lot more code and
needs to do more work.

>> by those who have bought the ZFS hype.
>
>
> I really don't know what you've got against ZFS, Jerry, save for the
> fact that discussing it has so clearly highlighted your own
> incompetence. The only 'hype' that I've noticed around ZFS involves its
> alleged 128-bitness (when its files only reach 64 - or is it 63? - bits
> in size, and the need for more than 70 - 80 bits of total file system
> size within the next few decades is rather difficult to justify).
>
> But its ability to catch the same errors that far more expensive
> products from the likes of IBM, NetApp, and EMC are designed to catch is
> not hype: it's simple fact.
>

I don't have anything against ZFS. What I don't like is blowhards like
you who pop in with a bunch of marketing hype but no real facts nor
knowledge of what you speak.

>>
>>> ...
>>>
>>> The
>>>
>>>> big difference being ZFS if done in software, which requires CPU
>>>> cycles and other resources.
>>>
>>>
>>>
>>> Since when was this discussion about use of resources rather than
>>> integrity (not that ZFS's use of resources for implementing its own
>>> RAID-1/RAID-10 facilities is significant anyway)?
>>>
>>
>> It's always about performance. 100% integrity is no good if you need
>> 100% of the system resources to handle it.
>
>
> Horseshit. It's only 'about performance' when the performance impact is
> significant. In the case of ZFS's mirroring implementation, it isn't of
> any significance at all (let alone any *real* drag on the system).
>

Keep believing that. It will help you to justify your statements in
your mind.

>>
>>>> It's also open to corruption.
>>>
>>>
>>>
>>> No more than the data that some other file system gives to a hardware
>>> RAID implementation would be: it all comes from the same place (main
>>> memory).
>>>
>>
>> Ah, but this is much more likely than once it's been passed off to the
>> hardware, no?
>
>
> No: once there's any noticeable likelihood of corruption in system RAM,
> then it really doesn't matter how reliable the rest of the system is.
>

And ZFS can be corrupted more easily than more basic file systems. A
point you conveniently ignore.

>>
>>> However, because ZFS subsequently checks what it wrote against a
>>> *separate* checksum, if it *was* corrupted below the
>>> request-submission level ZFS is very likely to find out, whereas a
>>> conventional RAID implementation (and the higher layers built on top
>>> of it) won't: they just write what (they think) they're told to,
>>> with no additional check.
>>>
>>
>> So? If the buffer is corrupted, the checksum will be, also.
>
>
> No: in many (possibly all - I'd have to check the code to make sure)
> cases ZFS establishes the checksum when the data is moved *into* the
> buffer (and IIRC performs any compression and/or encryption at that
> point as well: it's a hell of a lot less expensive to do all these at
> once as the data is passing through the CPU cache on the way to the
> buffer than to fetch it back again later).
>

Gee, someone who can actually read code? WOW!

> And if the
>
>> data is written to the wrong sector, the checksum will still be correct.
>
>
> No: if the data is written to the wrong sector, any subsequent read
> targeting the correct sector will find a checksum mismatch (as will any
> read to the sector which was incorrectly written).
>

So pray tell - how is it going to do that? The data was written just as
it was checksummed.

>>
>>> RAID-1 and RAID-10
>>>
>>>> are implemented in hardware/firmware which cannot be corrupted (Read
>>>> only memory) and require no CPU cycles.
>>>
>>>
>>>
>>> If your operating system and file system have been corrupted, you've
>>> got problems regardless of how faithfully your disk hardware
>>> transfers this corruption to its platters: this alleged deficiency
>>> compared with a hardware implementation is just not an issue.
>>>
>>
>> The hardware is MUCH MORE RELIABLE than the software.
>
>
> Once again, nitwit: if the OS-level software is not reliable, it
> *doesn't matter* how reliable the hardware is.
>

Ah, more personal attacks. Brilliant!

>>
>>> You've also suggested elsewhere that a hardware implementation is
>>> less likely to contain bugs, which at least in this particular
>>> instance is nonsense: ZFS's RAID-1/10 implementation benefits from
>>> the rest of its design such that it's likely *far* simpler than any
>>> high-performance hardware implementation (with its controller-level
>>> cache management and deferred write-back behavior) is, and hence if
>>> anything likely *less* buggy.
>>>
>>
>> Yea, right. Keep believing it.
>
>
> As will anyone else remotely well-acquainted with system hardware and
> software. 'Firmware' is just software that someone has committed to
> silicon, after all: it is just as prone to bugs as the system-level
> software that you keep disparaging - more so, when it's more complex
> than the system-software implementation.
>

That right there shows how little you understand disk technology today.
Firmware is less prone to bugs because it is analyzed and tested so
much more thoroughly than software, both by humans and machines.

After all - a recall on disks with a firmware bug would cost any disk
company at least tens of millions of dollars - if it didn't bankrupt the
company. It's very cheap in comparison to spend a few million
analyzing, testing, retesting, etc. all the firmware.

Additionally, being a hardware interface, it has limited actions
required of it. And those functions can easily be emulated by system
test sets, which can duplicate both ends of the controller. They have
an equivalent to the system bus for commands, and a replacement for the
disk electronics to test the other end. Many have even gone to
simulating the signals to/from the R/W heads themselves. With such test
sets they can automatically simulate virtually every possible failure
mode of the disk, validating all of the hardware and firmware.
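
The same validate-by-simulation idea can be sketched in software, for what it's
worth: inject a fault into a block and confirm that the detection path flags it.
The sketch below uses a plain CRC32 purely as an illustrative stand-in for
whatever checks real drive or controller firmware uses; the block size and the
whole setup are invented for illustration, not taken from any actual test set.

import os
import random
import zlib

BLOCK_SIZE = 512  # illustrative sector size, not any particular drive's

def flip_random_bit(block: bytes) -> bytes:
    """Simulate a single-bit fault somewhere in the block."""
    data = bytearray(block)
    bit = random.randrange(len(data) * 8)
    data[bit // 8] ^= 1 << (bit % 8)
    return bytes(data)

def detects_fault(block: bytes) -> bool:
    """Return True if a CRC32 over the block catches the injected fault."""
    good_crc = zlib.crc32(block)
    corrupted = flip_random_bit(block)
    return zlib.crc32(corrupted) != good_crc

if __name__ == "__main__":
    block = os.urandom(BLOCK_SIZE)
    trials = 10_000
    caught = sum(detects_fault(block) for _ in range(trials))
    print(f"{caught}/{trials} injected single-bit faults detected")

A single-bit flip is always caught by a CRC; the point of such harnesses is to
exercise the *handling* path for every simulated failure, not just the detection.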

But of course, if you were as smart as you claim, you would know this.
And you wouldn't be making the asinine claims about bugs in firmware
that you are.

> ...
>
> You're more likely to have something overwritten in
>
>> memory than a silent error.
>
>
> Then (since you seem determined to keep ignoring this point) why on
> Earth do you suppose that entirely reputable companies like IBM, NetApp,
> and EMC go to such lengths to catch them? If it makes sense for them
> (and for their customers), then it's really difficult to see why ZFS's
> abilities in that area wouldn't be significant.
>

Another claim without any proof - and you accuse me of making claims.
Another typical troll behavior.

I don't know about NetApp or EMC, but I still have contacts in IBM. And
they do not "go to such lengths" to catch silent errors.

> ...
>
>>>>>> But when properly implemented, RAID-1 and RAID-10 will detect and
>>>>>> correct even more errors than ZFS will.
>>>>>
>>>>>
>>>>>
>>>>
>>>> A complete disk crash, for instance. Even Toby admitted ZFS cannot
>>>> recover from a disk crash.
>>>>
>>>> ZFS is good. But it's a cheap software implementation of an
>>>> expensive hardware recovery system. And there is no way software
>>>> can do it as well as hardware does.
>>>
>>>
>>>
>>> You at least got that right: ZFS does it considerably better, not
>>> merely 'as well'. And does so at significantly lower cost (so you
>>> got that part right too).
>>>
>>
>> The fact that you even try to claim that ZFS is better than a RAID-1
>> or RAID-10 system shows just how little you understand critical
>> systems, and how much you've bought into the ZFS hype.
>
>
> The fact that you so consistently misrepresent ZFS as being something
> *different* from RAID-1/10 shows that you don't even understand the
> definition of RAID: the ZFS back-end *is* RAID-1/10 - it just leverages
> its implementation in software to improve its reliability (because
> writing directly from system RAM to disk without an intermediate step
> through a common controller buffer significantly improves the odds that
> *one* of the copies will be correct - and the checksum enables ZFS to
> determine which one that is).
>

And you're saying it's the same? You really don't understand what
RAID-1 or RAID-10 is.

No, ZFS is just a cheap software replacement for an expensive hardware
system.

>>
>>> The one advantage that a good hardware RAID-1/10 implementation has
>>> over ZFS relates to performance, primarily small-synchronous-write
>>> latency: while ZFS can group small writes to achieve competitive
>>> throughput (in fact, superior throughput in some cases), it can't
>>> safely report synchronous write completion until the data is on the
>>> disk platters, whereas a good RAID controller will contain mirrored
>>> NVRAM that can guarantee persistence in microseconds rather than
>>> milliseconds (and then destage the writes to the platters lazily).
>>>
>>
>> That's one advantage, yes.
>>
>>> Now, ZFS does have an 'intent log' for small writes, and does have
>>> the capability of placing this log on (mirrored) NVRAM to achieve
>>> equivalent small-synchronous-write latency - but that's a hardware
>>> option, not part and parcel of ZFS itself.
>>>
>>
>> Oh, so you're now saying that synchronous writes may not be truly
>> synchronous with ZFS? That's something I didn't know. I thought ZFS
>> was smarter than that.
>
>
> ZFS is, of course, smarter than that - too bad that you aren't.
>
> I said nothing whatsoever to suggest that ZFS did not honor requests to
> write synchronously: reread what I wrote until you understand it (and
> while you're at it, reread what, if anything, you have read about ZFS
> until you understand that as well: your ignorant bluster is becoming
> more tiresome than amusing by this point).
>

No, I'm just trying to understand your statement.

>>
>>> ...
>>>
>>>>> Please name even one.
>>>
>>>
>>>
>>> Why am I not surprised that you dodged that challenge?
>>>
>>
>> Because I'm not the one making the claim.
>
>
> My - you're either an out-right liar or even more abysmally incompetent
> than even I had thought.
>
> Let me refresh your memory: the exchange went
>
> [quote]
>
> > But when properly implemented, RAID-1 and RAID-10 will detect and
> correct even more errors than ZFS will.
>
> Please name even one.
>
> [end quote]
>
> At the risk of being repetitive (since reading comprehension does not
> appear to be your strong suit), the specific claim (yours, quoted above)
> was that "when properly implemented, RAID-1 and RAID-10 will detect and
> correct even more errors than ZFS will."
>

But trolling does seem to be your strong suit.

> I challenged you to name even one such - and I'm still waiting.
>

A corruption in the ZFS buffer between writes, where different data is
written to one disk than to the other.

Errors where ZFS itself is corrupted.

> ...
>
>> ROFLMAO! Just like a troll. Jump into the middle of a discussion
>> uninvited.
>
>
> I can understand why you'd like to limit the discussion to people who
> have even less of a clue than you do (and thus why you keep cropping out
> newsgroups where more knowledgeable people might be found), but toby
> invited those in comp.arch.storage to participate by cross-posting there.
>
> Since I tend to feel that discussions should continue where they
> started, and since it seemed appropriate to respond directly to your
> drivel rather than through toby's quoting in his c.a.storage post, I
> came over here - hardly uninvited. Rich Teer (who I suspect also
> qualifies as significantly more knowledgeable than you) chose to
> continue in c.a.storage; people like Jeff Bonwick probably just aren't
> interested (as I said, deflating incompetent blowhards is kind of a
> hobby of mine, plus something of a minor civic duty - otherwise, I
> wouldn't bother with you either).
>
> - bill


No, actually, I'd much rather be discussing this with someone who has
some real knowledge, not blowhard trolls like you.

So I'm not even going to bother to respond to you any more. I prefer to
carry out intelligent conversations with intelligent people.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S,hardware, whatever)

am 12.11.2006 22:22:13 von Jerry Stuckle

Gordon Burditt wrote:
>>>Do you even know what a silent error is? It's an error that the disk
>>>does not notice, and hence cannot report.
>>>
>>
>>Yep. And please tell me EXACTLY how this can occur.
>
>
> Some drives will accept a sector write into an on-drive buffer and
> indicate completion of the write before even attempting it. This
> speeds things up. A subsequent discovery of a problem with the
> sector-header would not be reported *on that write*. (I don't know
> how stuff like this does get reported, possibly on a later write
> by a completely different program, but in any case, it's likely too
> late to report it to the caller at the user-program level).
>

Yes, it's very common for drives to buffer data like this. But also,
drives have a "write-through" command which forces synchronous writing.
The drive doesn't return from such a write until the data is
physically on the drive. And they usually even have a verify flag,
which rereads the data after it has been written and compares it to what
was written.
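
From the host side, the rough analogue of those two drive features can be
sketched as: open with O_SYNC (so completion is not reported until the data is
claimed to be on stable storage) and then read the range back and compare. A
minimal Python sketch under those assumptions, for a Unix-like system - the
file name and block size are made up, and note that the read-back may well be
served from the OS cache, so this only approximates a true drive-level verify:

import os

BLOCK_SIZE = 4096  # illustrative block size

def write_through_and_verify(path: str, offset: int, data: bytes) -> bool:
    """Write a block with O_SYNC, then read it back and compare."""
    # O_SYNC: the write should not return until the data has reached
    # stable storage, as far as the OS and drive are willing to promise.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o600)
    try:
        os.pwrite(fd, data, offset)
    finally:
        os.close(fd)

    # 'Verify' pass: re-read the same range and compare byte-for-byte.
    fd = os.open(path, os.O_RDONLY)
    try:
        readback = os.pread(fd, len(data), offset)
    finally:
        os.close(fd)
    return readback == data

if __name__ == "__main__":
    ok = write_through_and_verify("scratch.bin", 0, b"\xa5" * BLOCK_SIZE)
    print("verified" if ok else "mismatch")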

> Such drives *might* still be able to write data in the buffer cache
> (assuming no bad sectors) even if the power fails: something about
> using the momentum of the spinning drive to generate power for a
> few milliseconds needed. Or maybe just a big capacitor on the
> drive.
>

I don't know of any which are able to do anything more than complete the
current operation in case of a power failure. Using the drive as a
generator would brake it too quickly, and it would take a huge capacitor
to handle the current requirements for a seek (seeks require huge
current spikes - several amps - for a very short time).

> Drives like this shouldn't be used in a RAID setup, or the option
> to indicate completion should be turned off. In the case of SCSI,
> the RAID controller probably knows how to do this. In the case of
> IDE, it might be manufacturer-specific.
>

Actually, they can under certain conditions.

Although the drive itself couldn't have a big enough capacitor, the
power supply could keep it up for a few hundred milliseconds.

First of all, the drive typically writes any buffered data pretty
quickly, anyway (usually < 100 ms) when it is idle. Of course, a
heavily loaded disk will slow this down.

But in the case of a power failure, the power supply needs to
immediately raise a "power fail" condition to the drive. The drive
should then not accept any new operations and immediately complete any
which are in progress. Properly designed power supplies will take this
into consideration and have enough storage to keep the drives going for
a minimum time.

> There's a reason that some RAID setups require drives with modified
> firmware.
>

Yep, among them being drives which are not set up as above.
>
>
>>>In some of your other recent drivel you've seemed to suggest that this
>>>simply does not happen. Well, perhaps not in your own extremely limited
>>>experience, but you really shouldn't generalize from that.
>>>
>>
>>I didn't say it CAN'T happen. My statement was that it is SO UNLIKELY
>>to happen that it can be virtually ignored.
>
>


--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, w

am 12.11.2006 22:26:50 von Jerry Stuckle

Frank Cusack wrote:
> On 11 Nov 2006 19:30:25 -0800 "toby" wrote:
>
>>Jerry Stuckle wrote:
>>
>>>REAL RAID-1 and RAID-10 are implemented in hardware/firmware, not a
>>>software system such as ZFS.
>
>
> This simple statement shows a fundamental misunderstanding of the basics,
> let alone zfs.
>
> -frank

Not at all. RAID-1 and RAID-10 devices are file system neutral. Just
like disk systems are file-system neutral.

And anyone who thinks otherwise doesn't understand real RAID
implementations - only cheap ones which use software for all or part of
their implementation.

Real RAID arrays are not cheap. $100-500/GB is not out of the question.
And you won't find them at COMP-USA or other retailers.

But you don't see those very often on PC's. Most of the time you see
cheap implementations where some of the work is done in software.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S,hardware, whatever)

am 13.11.2006 03:00:02 von Bill Todd

Jerry Stuckle wrote:

....

I suspect I have many years more experience and
>>> knowledge than you.
>>
>>
>> You obviously suspect a great deal. Too bad that you don't have any
>> real clue. Even more too bad that you insist on parading that fact so
>> persistently.
>>
>>> And am more familiar with fault tolerant systems.
>>
>>
>> Exactly how many have you yourself actually architected and built,
>> rather than simply using the fault-tolerant hardware and software that
>> others have provided? I've been centrally involved in several.
>>
>
> Disk drive systems? I admit, none.

So no surprise there, but at least a little credit for honesty.

My design experience has been more
> in the digital arena - although I have done some analog design -
> balanced amplifiers, etc.

Gee, whiz - and somehow you think that qualifies you to make pompous and
inaccurate statements about other areas that you know so much less
about. There's a lesson there, but I doubt your ability to learn it.

>
> How many disk drive systems have you actually had to troubleshoot?
> Locate and replace a failing head, for example? Or a bad op amp in a
> read amplifier? Again, none, I suspect. I've done quite a few in my
> years.

As I suspected, a tech with inflated delusions of competence. You
really ought to learn the difference between being able to troubleshoot
a problem (that's what VCR repair guys do: it's not exactly rocket
science) and being able to design (or even really understand) the system
that exhibits it.

>
> And from your comments you show absolutely no knowledge of the
> underlying electronics, much less the firmware involved.

My job at EMC was designing exactly such firmware for a new high-end
disk array, Jerry. And I was working closely with people who had
already been through that exercise for their existing Symmetrix product.
You seem to be laboring under the illusion that 'firmware' is somehow
significantly different from software.

Yet you claim
> you've been "centrally involved." Doing what - assembling the pieces?
> All you've done is assemble the pieces.

The difference, Jerry, (since you seem to be ignorant of it, though
that's apparently only one small drop in the vast ocean of your
ignorance) is that the pieces you're talking about are *not*
fault-tolerant - they don't even reliably report the faults which they
encounter.

Building a fault-tolerant system involves understanding the limits of
such underlying pieces and designing ways to compensate for them -
exactly the kind of thing that ZFS does with its separate checksumming
(and IBM, NetApp, and EMC do, though not always as effectively, with
their additional in-sector information that contains higher-level
sanity-checks than the disk checksums have).

....

> show me hard facts on the failure.

As you said recently to me, I'm not going to do your homework for you:
educating you (assuming that it is possible at all) is not specifically
part of my agenda, just making sure that anyone who might otherwise take
your bombastic certainty as evidence of actual knowledge understands the
shallowness of your understanding.

The fact that the vendors whom I cited above take this kind of failure
seriously enough to guard against it should be evidence enough for
anyone who does not have both eyes firmly squeezed shut. If disk
manufacturers were more forthcoming (as they were a few years ago) about
providing information about undetected error rates the information
wouldn't be as elusive now - though even then I suspect that it only
related to checksum strength rather than to errors caused by firmware bugs.

....

> You have made claims about how bad disk drives are without ZFS.

Were you an even half-competent reader, you would know that I have made
no such claims: I've only observed that ZFS catches certain classes of
errors that RAID per se cannot - and that these classes are sufficiently
important that other major vendors take steps to catch them as well.

It's
> amazing that computers work at all with all those errors you claim exist!

Hey, they work without any redundancy at all - most of the time. The
usual question is, just how important is your data compared with the
cost of protecting it better? ZFS has just significantly changed the
balance in that area, which is why it's interesting.

....

> The probabilities are much higher that you will be killed by a meteor in
> the next 10 years.

That's rather difficult to evaluate. On the one hand, *no one* in
recorded history has been killed by a meteor, which would suggest that
the probability of such is rather low indeed. On the other, the
probability of a large impact that would kill a significant percentage
of the Earth's population (and thus with non-negligible probability
include me) could be high enough to worry about.

But of course that's not the real issue anyway. A single modest-sized
server (3+ raw TB, which only requires 5 disks these days to achieve)
contains more disk sectors than there are people on the Earth, and even
a single error in maintaining those 6 billion sectors leads to
corruption unless it's reliably caught. Large installations can be
1,000 times this size. So while the probability that *any given sector*
will be corrupted is very small, the probability that *some* sector will
be corrupted is sufficiently disturbing that reputable vendors protect
against it, and enjoy considerable economic success doing so.
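
The arithmetic behind that point is easy to make concrete. With an assumed
(purely illustrative) per-sector, per-year silent-corruption probability p, the
chance that at least one of N sectors is hit is 1 - (1 - p)^N, which stops
being negligible long before p itself looks alarming:

# Illustrative numbers only: the per-sector rate below is an assumption,
# not a measured figure for any particular drive.
p = 1e-12          # assumed chance a given sector is silently corrupted per year
for n_sectors in (6e9, 6e12):          # ~3 TB server vs. a ~1,000x larger installation
    p_any = 1 - (1 - p) ** n_sectors   # chance that at least one sector is hit
    print(f"{n_sectors:.0e} sectors -> P(at least one silent error) ~ {p_any:.3%}")

With those assumed numbers the small server comes out around half a percent per
year and the large installation is all but certain to see at least one hit.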

>
> Drive electronics detect errors all the time.

No, Jerry: they just detect errors *almost* all of the time - and
that's the problem, in a nutshell. Try to wrap what passes for your
mind around that, and you might learn something.

....

>> You even offered up an example *yourself* of such a firmware failure
>> mode: "A failing controller can easily overwrite the data at some
>> later time." Easily, Jerry? That doesn't exactly sound like a
>> failure mode that 'can be virtually ignored' to me. (And, of course,
>> you accompanied that pearl of wisdom with another incompetent
>> assertion to the effect that ZFS would not catch such a failure, when
>> of course that's *precisely* the kind of failure that ZFS is
>> *designed* to catch.)
>>
>
> I didn't say it could be ignored. I did say it could be handled by a
> properly configured RAID-1 or RAID-10 array.

And yet have been conspicuously silent when challenged to explain
exactly how - because, of course, no RAID can do what you assert it can
above: the most it could do (if it actually performed background
comparisons between data copies rather than just scrubbed them to ensure
that they could be read without error) would be to determine that they
did not match - it would have no way to determine which one was correct,
because that information is higher level in nature.

Once again, that's exactly the kind of thing that features such as ZFS's
separate checksums and supplementary in-sector sanity-checks from the
likes of EMC are for. But of course they aren't part of RAID per se at
all: you really should read the original Berkeley papers if you're
still confused about that.

>
>> Usenet is unfortunately rife with incompetent blowhards like you - so
>> full of themselves that they can't conceive of someone else knowing
>> more than they do about anything that they mistakenly think they
>> understand, and so insistent on preserving that self-image that
>> they'll continue spewing erroneous statements forever (despite their
>> repeated promises to stop: "I'm not going to respond to you any
>> further", "I'm also not going to discuss this any more with you", "I'm
>> finished with this conversation" all in separate responses to toby
>> last night - yet here you are this morning responding to him yet again).
>>
>
> And unfortunately, it's full of trolls like you who jump unwanted into
> conversations

As I already observed, I completely understand why you wouldn't want
people around who could easily demonstrate just how incompetent you
really are. Unfortunately for you, this does not appear to be a forum
where you can exercise moderator control to make that happen - so tough
tooties.

....

>>>> IBM, Unisys, NetApp, and EMC (probably not an exhaustive list, but
>>>> they're the ones that spring immediately to mind) all use
>>>> non-standard disk sector sizes in some of their systems to hold
>>>> additional validation information (maintained by software or
>>>> firmware well above the disk level) aimed at catching some (but in
>>>> most cases not all) of these unreported errors.
>>>>
>>>
>>> That's one way things are done.
>>
>>
>> You're awfully long on generalizations and short on specifics, Jerry -
>> typical Usenet loud-mouthed ignorance. But at least you seem to
>> recognize that some rather significant industry players consider this
>> kind of error sufficiently important to take steps to catch them - as
>> ZFS does but RAID per se does not (that being the nub of this
>> discussion, in case you had forgotten).
>>
>> Now, by all means tell us some of the *other* ways that such 'things
>> are done'.
>>
>
> Get me someone competent in the hardware/firmware end and I can talk all
> the specifics you want.

Sure, Jerry: bluster away and duck the specifics again - wouldn't want
to spoil your perfect record there.

....

>>>> Silent errors are certainly rare, but they happen. ZFS catches
>>>> them. RAID does not. End of story.
>>>>
>>>
>>> And what about data which is corrupted once it's placed in the ZFS
>>> buffer? ZFS buffers are in RAM, and can be overwritten at any time.
>>
>>
>> All file data comes from some RAM buffer, Jerry - even that handed to
>> a firmware RAID. So if it can be corrupted in system RAM, firmware
>> RAID is no cure at all.
>>
>
> Ah, but let's see ZFS correct for that! Oh, sorry - I found a failure
> mode your beloved file system doesn't handle, didn't I?

The point, imbecile, is not that ZFS (or anything else) catches
*everything*: it's that ZFS catches the same kinds of errors that
conventional RAID-1/10 catches (because at the back end it *is*
conventional RAID-1/10), plus other kinds that conventional RAID-1/10
misses.

[massive quantity of drivel snipped - just not worthy of comment at all]

>> And if the
>>
>>> data is written to the wrong sector, the checksum will still be correct.
>>
>>
>> No: if the data is written to the wrong sector, any subsequent read
>> targeting the correct sector will find a checksum mismatch (as will
>> any read to the sector which was incorrectly written).
>>
>
> So pray tell - how is it going to do that?

Careful there, Jerry: when you ask a question, you risk getting a real
answer that will even further expose the depths of your ignorance.

The data was written just as
> it was checksummed.

Try reading what I said again: it really shouldn't be *that* difficult
to understand (unless you really don't know *anything* about how ZFS works).

When ZFS writes data, it doesn't over-write an existing copy if there is
one: it writes the new copy into free space and garbage-collects the
existing copy (unless it has to retain it temporarily for a snapshot)
after the new data is on disk. This means that it updates the metadata
to point to that new copy rather than to the old one, and when doing so
it includes a checksum so that later on, when it reads the data back in,
it can determine with a very high degree of confidence that this data is
indeed what it previously wrote (the feature that conventional RAID
completely lacks).

So if the disk misdirects the write, when the correct sector is later
read in the updated metadata checksum won't match its contents - and if
the incorrectly-overwritten sector is later read through its own
metadata path, that checksum won't match either: in both cases, the
correct information is then read from the other copy and used to update
the corrupted one, something which RAID-1/10 per se simply cannot do
(because it has no way to know which copy is the correct one even if it
did detect the difference by comparing them, though that's also not part
of the standard definition of RAID).
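
A toy model may make that mechanism easier to follow: each 'block pointer'
records the checksum of the block it points to, so a misdirected or corrupted
copy fails verification on read and the other mirror copy (which does verify)
is used to repair it. A minimal Python sketch, with every name invented for
illustration - this is not ZFS code, just the shape of the idea:

import hashlib

def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

class MirroredStore:
    """Two 'disks' (dicts of sector -> bytes); checksums live in the pointer."""
    def __init__(self):
        self.disks = [dict(), dict()]

    def write(self, sector: int, data: bytes) -> dict:
        for disk in self.disks:
            disk[sector] = data
        # The block pointer lives in the (itself checksummed) parent metadata,
        # *separate* from the data it describes.
        return {"sector": sector, "cksum": checksum(data)}

    def read(self, ptr: dict) -> bytes:
        copies = [disk.get(ptr["sector"], b"") for disk in self.disks]
        good = next(c for c in copies if checksum(c) == ptr["cksum"])
        # Self-heal: rewrite any copy that failed verification.
        for disk, copy in zip(self.disks, copies):
            if checksum(copy) != ptr["cksum"]:
                disk[ptr["sector"]] = good
        return good

if __name__ == "__main__":
    store = MirroredStore()
    ptr = store.write(7, b"payload")
    store.disks[0][7] = b"garbage"      # simulate a misdirected/overwritten copy
    assert store.read(ptr) == b"payload"    # bad copy caught via the pointer checksum
    assert store.disks[0][7] == b"payload"  # and repaired from the good mirror
    print("corruption detected and self-healed")

The key design point is exactly the one argued above: because the checksum is
held outside the data it covers, 'which copy is right' is decidable, which a
bare mirror comparison is not.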

....

'Firmware' is just software that someone has committed to
>> silicon, after all: it is just as prone to bugs as the system-level
>> software that you keep disparaging - more so, when it's more complex
>> than the system-software implementation.
>>
>
> That right there shows how little you understand disk technology today.

You're confused and babbling yet again: the firmware in question here
is not disk firmware (though the fact that disk firmware can and does
have bugs is the reason why sanity checks beyond those inherent in RAID
are desirable): it's RAID firmware, since your contention was that it
somehow magically was less bug-prone than RAID software.


> Firmware is less prone to bugs because it is analyzed and tested so
> much more thoroughly than software, both by humans and machines.

Firmware *is* software. And RAID software can be (and often is) checked
just as thoroughly, because, of course, it's just as important (and for
that matter has the same interface, since you attempted to present that
later as a difference between them).

....

(since you seem determined to keep ignoring this point) why on
>> Earth do you suppose that entirely reputable companies like IBM,
>> NetApp, and EMC go to such lengths to catch them? If it makes sense
>> for them (and for their customers), then it's really difficult to see
>> why ZFS's abilities in that area wouldn't be significant.
>>
>
> Another claim without any proof - and you accuse me of making claim.
> Another typical troll behavior.
>
> I don't know about NetApp or EMC,

You obviously don't know about much at all, but that doesn't seem to
inhibit you from pontificating incompetently.

but I still have contacts in IBM. And
> they do not "go to such lengths" to catch silent errors.

Yes, they do - in particular, in their i-series boxes, where they use
non-standard sector sizes (520 or 528 bytes, I forget which) to include
exactly the kind of additional sanity-checks that I described. Either
your 'contacts in IBM' are as incompetent as you are, or (probably more
likely) you phrased the question incorrectly.

I strongly suspect that IBM uses similar mechanisms in their mainframe
storage, but haven't followed that as closely.

....

>> The fact that you so consistently misrepresent ZFS as being something
>> *different* from RAID-1/10 shows that you don't even understand the
>> definition of RAID: the ZFS back-end *is* RAID-1/10 - it just
>> leverages its implementation in software to improve its reliability
>> (because writing directly from system RAM to disk without an
>> intermediate step through a common controller buffer significantly
>> improves the odds that *one* of the copies will be correct - and the
>> checksum enables ZFS to determine which one that is).
>>
>
> And you're saying it's the same? You really don't understand what
> RAID-1 or RAID-10 is.

Nor, apparently, does anyone else who has bothered to respond to you
here: everyone's out of step but you.

Sure, Jerry. Do you actually make a living in this industry? If so, I
truly pity your customers.

....

>>>>>> Please name even one.
>>>>
>>>>
>>>>
>>>> Why am I not surprised that you dodged that challenge?
>>>>
>>>
>>> Because I'm not the one making the claim.
>>
>>
>> My - you're either an out-right liar or even more abysmally
>> incompetent than even I had thought.
>>
>> Let me refresh your memory: the exchange went
>>
>> [quote]
>>
>> > But when properly implemented, RAID-1 and RAID-10 will detect and
>> correct even more errors than ZFS will.
>>
>> Please name even one.
>>
>> [end quote]
>>
>> At the risk of being repetitive (since reading comprehension does not
>> appear to be your strong suit), the specific claim (yours, quoted
>> above) was that "when properly implemented, RAID-1 and RAID-10 will
>> detect and correct even more errors than ZFS will."
>>
>
> But trolling does seem to be your strong suit.

Ah - still ducking and weaving frantically, I see.

>
>> I challenged you to name even one such - and I'm still waiting.
>>
>
> A corruption in the ZFS buffer between writes, where different data is
> written to one disk than the other.
>
> Errors where ZFS itself is corrupted.

Tsk, tsk. These alleged issues have nothing to do with RAID's ability
to 'detect and correct even more errors than ZFS' - in fact, they have
nothing whatsoever to do with RAID detecting or correcting *anything*.
They're just hypothetical exposures (rather than established problems)
that you've propped up to try to suggest deficiencies in ZFS compared
with moving some of its facilities into firmware.

Come on, Jerry: surely you can come up with *one* kind of error that a
firmware-based RAID can 'detect and correct' that ZFS would miss - or
were you just blowing smoke out of your ass on that one, as in so many
others?

....

> I'm not even going to bother to respond to you any more.

O frabjous day! But wait: can we believe this any more than your
similar statements to toby last night?

Inquiring minds want to know...

- bill

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S,hardware, whatever)

am 13.11.2006 03:35:58 von Jerry Stuckle

Bill Todd wrote:
> Jerry Stuckle wrote:
>
> ...
>
> I suspect I have many years more experience and
>
>>>> knowledge than you.
>>>
>>>
>>>
>>> You obviously suspect a great deal. Too bad that you don't have any
>>> real clue. Even more too bad that you insist on parading that fact
>>> so persistently.
>>>
>>>> And am more familiar with fault tolerant systems.
>>>
>>>
>>>
>>> Exactly how many have you yourself actually architected and built,
>>> rather than simply using the fault-tolerant hardware and software
>>> that others have provided? I've been centrally involved in several.
>>>
>>
>> Disk drive systems? I admit, none.
>
>
> So no surprise there, but at least a little credit for honesty.
>
> My design experience has been more
>
>> in the digital arena - although I have done some analog design -
>> balanced amplifiers, etc.
>
>
> Gee, whiz - and somehow you think that qualifies you to make pompous and
> inaccurate statements about other areas that you know so much less
> about. There's a lesson there, but I doubt your ability to learn it.
>
>>
>> How many disk drive systems have you actually had to troubleshoot?
>> Locate and replace a failing head, for example? Or a bad op amp in a
>> read amplifier? Again, none, I suspect. I've done quite a few in my
>> years.
>
>
> As I suspected, a tech with inflated delusions of competence. You
> really ought to learn the difference between being able to troubleshoot
> a problem (that's what VCR repair guys do: it's not exactly rocket
> science) and being able to design (or even really understand) the system
> that exhibits it.
>

No, an EE graduate with years of design experience before I got into
programming. Sorry, sucker.

And I've snipped the rest of your post. It's obvious you're only an
average programmer (if that) with no real knowledge of the electronics.
All you do is take a set of specs and write code to meet them. Anyone
with six months of experience can do that.

Sorry, troll. The rest of your post isn't even worth reading. Go crawl
back in your hole.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S,hardware, whatever)

am 13.11.2006 05:45:25 von Bill Todd

Jerry Stuckle wrote:

....

>> As I suspected, a tech with inflated delusions of competence. You
>> really ought to learn the difference between being able to
>> troubleshoot a problem (that's what VCR repair guys do: it's not
>> exactly rocket science) and being able to design (or even really
>> understand) the system that exhibits it.
>>
>
> No, an EE graduate

Well, I guess some schools will graduate just about anybody.

....

It's obvious you're only an
> average programmer (if that)

Wow - now you're such an expert on programming that you can infer such
conclusions from a discussion which barely touches on the subject.
That's pretty indicative of your level of understanding in general,
though - so once more, no surprises here.

> All you do is take a set of specs and write code to meet them.

Well, I guess you could say that I take imperfect, real-world hardware
and surrounding environments and (after doing the necessary research,
high-level architecting, and intermediate-level designing) write the
code that creates considerably-less-imperfect systems from them. And
since you probably aren't capable of even beginning to understand the
difference between that and your own statement, I guess we can leave it
there.

....

> The rest of your post isn't even worth reading.

No doubt especially the part where I wondered whether you'd stick by
your promise not to respond again. You're so predictable that you'd be
boring just for that - if you weren't already boring for so many other
reasons.

But I think that my job here is done: I doubt that there's anyone left
wondering whether you might be someone worth listening to on this subject.

- bill

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware,

am 13.11.2006 08:45:02 von Robert Milkowski

Jerry Stuckle wrote:
> Frank Cusack wrote:
> > On 11 Nov 2006 19:30:25 -0800 "toby" wrote:
> >
> >>Jerry Stuckle wrote:
> >>
> >>>REAL RAID-1 and RAID-10 are implemented in hardware/firmware, not a
> >>>software system such as ZFS.
> >
> >
> > This simple statement shows a fundamental misunderstanding of the basics,
> > let alone zfs.
> >
> > -frank
>
> Not at all. RAID-1 and RAID-10 devices are file system neutral. Just
> like disk systems are file-system neutral.
>
> And anyone who thinks otherwise doesn't understand real RAID
> implementations - only cheap ones which use software for all or part of
> their implementation.

Well, I would say that it's actually you who do not understand ZFS at all.
You claim you read Bonwick's blog entry - I believe you just do not want to
understand it.

> Real RAID arrays are not cheap. $100-500/GB is not out of the question.
> And you won't find them at COMP-USA or other retailers.
>
> But you don't see those very often on PC's. Most of the time you see
> cheap implementations where some of the work is done in software.

So? I use ZFS with cheap drives and also with storage like EMC Symmetrix and
several vendors' midrange arrays. In some workloads I get, for example, better
performance when RAID-10 is done completely by ZFS and not by the hardware itself.

Also, recently one such hardware RAID actually did generate data corruption
without reporting it, and ZFS handled it properly. And we do have to
fsck UFS file systems from time to time on those arrays for no apparent reason.

ps. IBM's "hardware" RAID arrays can also lose data - you'll even be informed
by that "hardware" that it did so, how convenient

btw: when you talk about hardware RAID - there is actually software running
on an array's hardware, in case you didn't know

--
Robert Milkowski
rmilkowskiDDDD@wp-sa.pl
http://milek.blogspot.com

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, whatever)

am 13.11.2006 09:28:25 von Robert Milkowski

Jerry Stuckle wrote:
> Bill Todd wrote:
> >
>
> That was a decade ago. What are the figures TODAY? Do you even know?
> Do you even know why they happen?

I don't care that much for figures - what matters is that I can observe
it in my environment, with lots of data and storage arrays. Not daily, of course,
but still. And ZFS has already detected many data corruptions generated
by arrays with HW RAID.

> > Silent errors are certainly rare, but they happen. ZFS catches them.
> > RAID does not. End of story.
> >
>
> And what about data which is corrupted once it's placed in the ZFS
> buffer? ZFS buffers are in RAM, and can be overwritten at any time.

That kind of problem doesn't disappear with RAID done in HW.
If the buffer in the OS is corrupted before the data is sent to the array,
then a HW array won't help either.

Now if you have uncorrectable memory problems on your server, and your
server and OS can't cope with that, then you've got a much bigger problem
anyway, and RAID won't help you.


> And ZFS itself can be corrupted - telling the disk to write to the wrong
> sector, for instance. It is subject to viruses. It runs only on UNIX.

The beauty of ZFS is that even if ZFS itself writes data to the wrong sector,
then in a redundant config ZFS can still detect it, recover, and provide
the application correct data.

I really encourage you to read about ZFS internals, as it's really great
technology with features you can't find anywhere else.

http://opensolaris.org/os/community/zfs/

ps. viruses.... :))))) ok, if you have a VIRUS in your OS which
is capable of corrupting data, then HW RAID also won't help


> >> big difference being ZFS if done in software, which requires CPU
> >> cycles and other resources.

That's of course true. There are definitely environments where, due to CPU load,
doing RAID in ZFS will be slower than in HW - you're right.
However, in most environments disk performance is actually the limiting
factor, not CPU. Also, in many cases it's much easier and cheaper to add
CPU power to the system than to increase disk performance.


> It's always about performance. 100% integrity is no good if you need
> 100% of the system resources to handle it.

You are wrong. What good is raw performance if your data is corrupted?
Actually you need a balance between the two, otherwise people would use
only striping and forget about the other RAID levels, right?

And while people worry that ZFS can consume a lot of CPU due to checksum
calculations, in real life it seems that this is offset by other features
(like RAID and FS integration, etc.), so in the end, in many cases you
actually get better performance than doing RAID in HW.

I did actual tests. Also I have "tested" it in production.
Have you?

ps. see my blog and ZFS list at opensolaris.org for more info.


> > However, because ZFS subsequently checks what it wrote against a
> > *separate* checksum, if it *was* corrupted below the request-submission
> > level ZFS is very likely to find out, whereas a conventional RAID
> > implementation (and the higher layers built on top of it) won't: they
> > just write what (they think) they're told to, with no additional check.
> >
>
> So? If the buffer is corrupted, the checksum will be, also. And if the
> data is written to the wrong sector, the checksum will still be correct.

If the buffer is corrupted before the OS sends data to the array, then you've got
a problem regardless of whether you use software or hardware RAID.

Now, even if ZFS writes data to the wrong sector, it can still detect and correct it.
This is due to the fact that ZFS does NOT store the checksum with the data block itself.
The checksum is stored in the metadata block pointing to the data block. That metadata
block is itself checksummed, and its checksum is stored in its parent metadata block,
and so on. So if ZFS, due to a bug, wrote data to the wrong location, the overwritten
blocks would have their checksums stored in a different location, and ZFS would detect
it, correct it, and still return good data.
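
To illustrate the "and so on" part - the checksums chaining all the way up - here
is a toy Merkle-style tree walk (invented names and layout, not ZFS's actual
on-disk format): each parent stores the checksums of its children, so any block
that was overwritten out-of-band fails verification somewhere on the path down
from the root.

import hashlib

def cksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

# A 'block' is either raw data (bytes) or a metadata node: a list of
# (child_checksum, child_block) pairs.  The parent holds each child's checksum.
def make_node(children):
    return [(cksum(serialize(c)), c) for c in children]

def serialize(block) -> bytes:
    if isinstance(block, bytes):
        return block
    # A metadata node is serialized as the concatenation of its child checksums,
    # so corrupting a child also breaks what the parent expects of this node.
    return b"".join(c for c, _ in block)

def verify(block, expected: bytes) -> bool:
    """Walk down from the root, checking every level against its parent's checksum."""
    if cksum(serialize(block)) != expected:
        return False
    if isinstance(block, bytes):
        return True
    return all(verify(child, child_cksum) for child_cksum, child in block)

if __name__ == "__main__":
    leaf_a, leaf_b = b"data block A", b"data block B"
    root = make_node([make_node([leaf_a]), make_node([leaf_b])])
    root_cksum = cksum(serialize(root))
    assert verify(root, root_cksum)

    # Simulate a stray write clobbering the data block under node B,
    # without updating any of the checksums above it:
    node_b = root[1][1]
    node_b[0] = (node_b[0][0], b"overwritten by a stray write")
    assert not verify(root, root_cksum)   # the chain of checksums catches it
    print("out-of-band overwrite detected via parent checksums")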

Really, read something about ZFS before you express your opinions on it.


> The hardware is MUCH MORE RELIABLE than the software.

1. you still have to use an application/OS to make any use of that hardware.

2. your hardware runs software anyway

3. your hardware returns corrupted data (sometimes)


> The fact that you even try to claim that ZFS is better than a RAID-1 or
> RAID-10 system shows just how little you understand critical systems,
> and how much you've bought into the ZFS hype.
>

I would rather say that you are completely ignorant and have never actually
read about ZFS with understanding. Also, it appears you've never gotten data
corruption from HW arrays - how lucky you are; or maybe you just didn't realize
it was an array which corrupted your data.

It also seems you don't understand that ZFS does RAID-1 and/or RAID-10 as well.


> > The one advantage that a good hardware RAID-1/10 implementation has over
> > ZFS relates to performance, primarily small-synchronous-write latency:
> > while ZFS can group small writes to achieve competitive throughput (in
> > fact, superior throughput in some cases), it can't safely report
> > synchronous write completion until the data is on the disk platters,
> > whereas a good RAID controller will contain mirrored NVRAM that can
> > guarantee persistence in microseconds rather than milliseconds (and then
> > destage the writes to the platters lazily).
> >
>
> That's one advantage, yes.

That's why the combination of ZFS + RAID with large caches is so compelling
in many cases. And yes, I do have such configs.
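
Just as a back-of-the-envelope illustration of why that NVRAM ack matters
(round numbers I'm assuming for the example, not measurements):

    # A single client doing serial synchronous commits (one fsync per transaction):
    platter_latency_s = 0.008    # ~8 ms to get data onto rotating platters
    nvram_latency_s   = 0.0001   # ~0.1 ms to mirrored controller NVRAM

    print(1 / platter_latency_s)   # ~125 synchronous commits/second
    print(1 / nvram_latency_s)     # ~10000 synchronous commits/second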

> Oh, so you're now saying that synchronous writes may not be truly
> synchronous with ZFS? That's something I didn't know. I thought ZFS
> was smarter than that.

Please, stop trolling. Of course they are synchronous.


--
Robert Milkowski
rmilkowski@wp-sa.pl
http://milek.blogspot.com

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, whatever)

am 13.11.2006 10:00:49 von Robert Milkowski

Jerry Stuckle wrote:
> Bill Todd wrote:
>
> OK, and exactly how many of these bugs are there? Disk drive and
> similar firmware is some of the most specialized and most heavily tested
> firmware on the planet.

What? How many arrays do you manage?
How many times did you have to upgrade disk firmware or
RAID controller firmware on them? I have, many times.

However, I must say that arrays are reliable. But from time to time it just happens.
We've had to fsck or do other magic to get our data working - not that often, but still.

Recently I used two SCSI JBODs (ok, not an array) connected
via two SCSI adapters to a host, with RAID-10 done in ZFS between the JBODs.
Well, during a data copy one of the controllers reported some warnings
but kept operating. It actually did corrupt data - fortunately
ZFS handled it properly, and we replaced the adapter. With a traditional
file system we would have been in trouble.


> > All file data comes from some RAM buffer, Jerry - even that handed to a
> > firmware RAID. So if it can be corrupted in system RAM, firmware RAID
> > is no cure at all.
> >
>
> Ah, but let's see ZFS correct for that! Oh, sorry - I found a failure
> mode your beloved file system doesn't handle, didn't I?

So? Nobody claims ZFS protects you from ALL possible data corruption.
Only that it protects you from many more data corruption scenarios than when RAID
is done only on the array. And it's not theoretical - it's the actual
experience of many sysadmins.

> Because your beloved ZFS isn't worth a damn on any other system, that's
> why. Let's see it run on MVS/XE, for instance. It doesn't work.
> RAID-1/RAID-10 does.

If you have to use MVS then you're right - you can't use ZFS and you
have to live with it.

> And you conveniently ignore how ZFS can be corrupted. In fact, it is
> much more easily corrupted than basic file systems using RAID-1/RAID-10
> arrays - if for no other reason than it contains a lot more code and
> needs to to more work.

Well, actually ZFS has less code than UFS, for example.
See http://blogs.sun.com/eschrock/entry/ufs_svm_vs_zfs_code

First check your assumptions before posting them.
But I don't blame you - when I first heard about ZFS my first
reaction was: it's too good to be true. Well, later I started using
it and after over two years of using it (also in production) it still
amazes me how wonderful it is. It also has its weak points, and it had/has
some bugs, but after using it for more than two years I've never lost data.


> > Horseshit. It's only 'about performance' when the performance impact is
> > significant. In the case of ZFS's mirroring implementation, it isn't of
> > any significance at all (let alone any *real* drag on the system).
> >
>
> Keep believing that. It will help you to justify your statements in
> your mind.

Have you checked it? I DID. And in MY environment ZFS delivered
better performance than HW RAID.


> A corruption in the ZFS buffer between writes, where different data is
> written to one disk than the other.

Actually, as soon as you read that data, ZFS will detect the corruption, correct it,
and return correct data to the application.
You can also run a scrub process in the background from time to time,
so even if you never read that data back again, ZFS will check
all the data and correct any problems it finds.

So in the case you described above, ZFS will actually detect the corruption
and repair it.
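
On Solaris that's literally just 'zpool scrub <pool>' and then 'zpool status'
to see the result. Conceptually the scrub is nothing more than walking every
block pointer and running the same verify-and-repair logic, roughly like this
(a simplified Python sketch with a made-up pool API, not the real implementation):

    import hashlib

    def scrub(pool):
        """Verify every mirror copy against the checksum stored in its parent
        block pointer; rewrite any copy that does not match."""
        errors = repaired = 0
        for ptr in pool.walk_block_pointers():        # hypothetical pool API
            copies = [dev.read(ptr.address, ptr.size) for dev in pool.mirror_devices]
            good = next((c for c in copies
                         if hashlib.sha256(c).digest() == ptr.expected), None)
            for dev, data in zip(pool.mirror_devices, copies):
                if hashlib.sha256(data).digest() != ptr.expected:
                    errors += 1
                    if good is not None:
                        dev.write(ptr.address, good)  # self-heal the bad copy
                        repaired += 1
        return errors, repaired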


--
Robert Milkowski
rmilkowskiXXX@wp-sa.pl
http://milek.blogspot.com

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S,hardware, whatever)

am 13.11.2006 12:40:46 von Jerry Stuckle

Bill Todd wrote:
> Jerry Stuckle wrote:
>
> ...
>
>>> As I suspected, a tech with inflated delusions of competence. You
>>> really ought to learn the difference between being able to
>>> troubleshoot a problem (that's what VCR repair guys do: it's not
>>> exactly rocket science) and being able to design (or even really
>>> understand) the system that exhibits it.
>>>
>>
>> No, a EE graduate
>
>
> Well, I guess some schools will graduate just about anybody.
>

Right, troll.

> ...
>
> It's obvious you're only an
>
>> average programmer (if that)
>
>
> Wow - now you're such an expert on programming that you can infer such
> conclusions from a discussion which barely touches on the subject.
> That's pretty indicative of your level of understanding in general,
> though - so once more, no surprises here.
>

Well, with almost 40 years of programming, I can spot a large-mouthed
asshole when I see one.

Of course, you're such an expert on my EE experience.

>> All you do is take a set of specs and write code to meet them.
>
>
> Well, I guess you could say that I take imperfect, real-world hardware
> and surrounding environments and (after doing the necessary research,
> high-level architecting, and intermediate-level designing) write the
> code that creates considerably-less-imperfect systems from them. And
> since you probably aren't capable of even beginning to understand the
> difference between that and your own statement, I guess we can leave it
> there.
>

ROFLMAO! "Imperfect, real-world hardware" is a hell of a lot more
reliable than your programming! Then you "improve them" by making them
even less perfect! That is just too great.

You're a troll - and the worst one I've ever seen on Usenet. Hell, you
can't even succeed as a troll. You don't understand the hardware you're
supposedly writing to. It's pretty obvious your claims are out your
ass. You have no idea what you're talking about, and no idea how the
disk drive manufacturers write the firmware.

> ...
>
>> The rest of your post isn't even worth reading.
>
>
> No doubt especially the part where I wondered whether you'd stick by
> your promise not to respond again. You're so predictable that you'd be
> boring just for that - if you weren't already boring for so many other
> reasons.
>
> But I think that my job here is done: I doubt that there's anyone left
> wondering whether you might be someone worth listening to on this subject.
>
> - bill

I respond to assholes when they make even bigger assholes of themselves,
as you just did.

Go back to your Tinker Toys, little boy. You're a troll, nothing more,
nothing less.

You may claim you're a programmer. And maybe you've even written a few
lines of code in your lifetime. And there's even a slight chance you
got it to work, with some help.

But your claim that you write disk drive controller firmware is full of
shit. Your completely inane claims have proven that.

Troll.


--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S,hardware, whatever)

am 13.11.2006 12:53:17 von Jerry Stuckle

Robert Milkowski wrote:
> Jerry Stuckle wrote:
>
>>Bill Todd wrote:
>>
>>OK, and exactly how many of these bugs are there? Disk drive and
>>similar firmware is some of the most specialized and most heavily tested
>>firmware on the planet.
>
>
> What? How many arrays do you manage?
> How many times did you have to upgrade disk firmware or
> RAID controllers firmware on them? I did many times.
>

Robert,

I've lost count over the years of how many I've managed.

As for upgrading disk firmware? Never. RAID firmware? Once, but that
was on recommendation from the manufacturer, not because we had a problem.

But the RAID devices I'm talking about aren't at COMP-USA. They are high
end arrays attached to minis and mainframes. Starting cost is probably
$500K or more. And they are reliable.

> However I must say that arrays are reliable. But from time to time it just happens.
> We did fsck or other magic to get our data working, not that often but still.
>

Never had to do it.

> Recently I did use two SCSI JBODs (ok, it's not array) connected
> via two SCSI adapters to a host. RAID-10 done in ZFS between JBODS.
> Well, during data copy one of the controllers reported some warnings,
> but keep operational. Well, it actually did corrupt data - fortunately
> ZFS did handle it properly, and we replaced the adapter. With traditional
> file systems we would be in trouble.
>
>

Gee, with good hardware that wouldn't have happened. And with a real
RAID array it wouldn't have happened, either.

>
>>>All file data comes from some RAM buffer, Jerry - even that handed to a
>>>firmware RAID. So if it can be corrupted in system RAM, firmware RAID
>>>is no cure at all.
>>>
>>
>>Ah, but let's see ZFS correct for that! Oh, sorry - I found a failure
>>mode your beloved file system doesn't handle, didn't I?
>
>
> So? Nobody claims ZFS protects you from ALL possible data corruption.
> Only that it protects you from much more data corruptions than when RAID
> is done only on the array. It's also not theoretical but actually it's an
> experience of many sys admins.
>

Ah, but that's what some of the people in this thread have claimed,
Robert. Check back.

>
>>Because your beloved ZFS isn't worth a damn on any other system, that's
>>why. Let's see it run on MVS/XE, for instance. It doesn't work.
>>RAID-1/RAID-10 does.
>
>
> If you have to use MVS then you're right - you can't use ZFS and you
> have to live with it.
>

Or even Windows. Or Mac. Or any of several other OS's.

>
>>And you conveniently ignore how ZFS can be corrupted. In fact, it is
>>much more easily corrupted than basic file systems using RAID-1/RAID-10
>>arrays - if for no other reason than it contains a lot more code and
>>needs to to more work.
>
>
> Well, actually ZFS has less code than UFS, for example.
> See http://blogs.sun.com/eschrock/entry/ufs_svm_vs_zfs_code
>
> First check your assumptions before posting them.
> But I don't blame you - when I first heard about ZFS my first
> reaction was: it's too good to be true. Well, later I started using
> it and after over two years of using it (also in production) it still
> amazes me how wonderful it is. It has also its weak points, it had/has
> some bugs but after using it for more than two years I've never loose data.
>

I have checked my assumptions. Note that I never said ZFS is bad. Just
that it isn't the magic cure-all that others in this thread are
claiming. And it's just a cheap replacement for proper RAID devices.
>
>
>>>Horseshit. It's only 'about performance' when the performance impact is
>>>significant. In the case of ZFS's mirroring implementation, it isn't of
>>>any significance at all (let alone any *real* drag on the system).
>>>
>>
>>Keep believing that. It will help you to justify your statements in
>>your mind.
>
>
> Have you checked it? I DID. And in MY environment ZFS delivered
> better performance than HW RAID.
>
>

What RAID did you get? Did it have its own drivers, or did it use the
system drivers? If the former, a lot of the work was done in software,
which is common in less expensive systems. Did it have dual
controllers, or did it use one controller for both drives? I could go on.

With cheap controllers you get cheap performance.

>
>>A corruption in the ZFS buffer between writes, where different data is
>>written to one disk than the other.
>
>
> Actually as soon as you will read those data ZFS will detect it and correct,
> also will return correct data to an application.
> Also you can run SCRUB process in a background from time to time,
> so even if you do not read those data back again ZFS will check
> all data and correct problems if it finds any.
>
> So in above case you described ZFS will actually detect corruption
> and repair.
>
>

I never said ZFS couldn't correct and repair some problems. But it does
NOT do everything, like some people here have indicated.


--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S,hardware, whatever)

am 13.11.2006 13:05:08 von Jerry Stuckle

Robert Milkowski wrote:
> Jerry Stuckle wrote:
>
>>Bill Todd wrote:
>>
>>That was a decade ago. What are the figures TODAY? Do you even know?
>>Do you even know why they happen?
>
>
I don't care that much for figures - what matters is that I can observe
it in my environment, with lots of data and storage arrays. Not daily, of course,
but still. And ZFS has already detected many data corruptions generated
by arrays with HW RAID.
>

Or, at least ZFS claims to have detected those corruptions. What proof
do you have that there really were errors? What kinds of errors were these?

If they were "silent errors", I would be very suspicious of the
reporting, unless you have a cheap array.

>
>>>Silent errors are certainly rare, but they happen. ZFS catches them.
>>>RAID does not. End of story.
>>>
>>
>>And what about data which is corrupted once it's placed in the ZFS
>>buffer? ZFS buffers are in RAM, and can be overwritten at any time.
>
>
> That kind of problem doesn't disappear with RAID done in HW.
> If buffer in OS is corrupted before data are sent to an array
> then HW array also won't help.
>

Actually, it does. For instance, it will sit in the ZFS buffer for a
lot longer, leaving it open to corruption longer. It has to be in the
buffer at least as long as it takes for the first write to complete and
the command to be sent off to the second drive. With RAID, the data are
protected as soon as they are sent to the array with the first write.

With ZFS the data will be in the buffer for several ms or longer, even
with no other load on the system or disks. With RAID devices, data will
be there typically less than a ms. And as the system load gets heavier,
this difference increases.

There is much more opportunity for data to be corrupted in ZFS than RAID.

> Now if you have uncorrectable memory problems on your server and your
> server and OS can't cope with that then you've got much bigger problem
> anyway and RAID won't help you.
>

I didn't say anything about an uncorrectable memory problem.
>
>
>>And ZFS itself can be corrupted - telling the disk to write to the wrong
>>sector, for instance. It is subject to viruses. It runs only on UNIX.
>
>
> The beauty of ZFS is that even if ZFS itself write data to wrong sector
> then in redundand config ZFS can still detect it, recover and provide
> application correct data.
>

Yes, it *can* detect it. But there is no guarantee it *will* detect it.
And how is it going to provide application correct data if that data
was overwritten?

> I really encourage you to read about ZFS internals as it's realy great
> technology with features you can't find anywhere else.
>
> http://opensolaris.org/os/community/zfs/
>
> ps. viruses.... :))))) ok, if you have an VIRUS in your OS which
> is capable of corrupting data then HW RAID also won't help
>
I've read a lot about ZFS. I'm not saying it's a bad system. I'm just
saying there is a hell of a lot of marketing hype people have succumbed
to. And I have yet to find anyone with any technical background who can
support those claims.

>
>
>>>>big difference being ZFS if done in software, which requires CPU
>>>>cycles and other resources.
>
>
> That's of course true. There're definitely environments when due to CPU
> doing RAID in ZFS will be slower than in HW, you're right.
> However in most environments disk performance is actually the limiting
> factor not CPU. Also in many cases it's much easier and cheaper to add
> CPU power to the system than to increase disk performance.
>
>
>
>>It's always about performance. 100% integrity is no good if you need
>>100% of the system resources to handle it.
>
>
> You are wrong. What's good from rock performance if your data is corrupted?
> Actually you need an balance between two, otherwise people would use
> only stripe and forget about other RAIDs, right?
>
> And while people are worrying that ZFS can consume much CPU due to checksum
> calculations in real life it seems that this is offseted by other features
> (like RAID and FS integration, etc.) so at the end in many cases you
> actually get better performance that doing RAID in HW.
>
> I did actual tests. Also I have "tested" it in production.
> Have you?
>
> ps. see my blog and ZFS list at opensolaris.org for more info.
>
>
>
>>>However, because ZFS subsequently checks what it wrote against a
>>>*separate* checksum, if it *was* corrupted below the request-submission
>>>level ZFS is very likely to find out, whereas a conventional RAID
>>>implementation (and the higher layers built on top of it) won't: they
>>>just write what (they think) they're told to, with no additional check.
>>>
>>
>>So? If the buffer is corrupted, the checksum will be, also. And if the
>>data is written to the wrong sector, the checksum will still be correct.
>
>
> If buffer is corrupted before OS sends data to the array then you've got problem
> regardles of using software or hardware RAID.
>
> Now even if ZFS writes data to wrong sector it can still detect it and correct.
> This is due to fact that ZFS does NOT store checksum with data block itself.
> Checksum is stored in metadata block pointing to data block. Also meta data
> block is checksumed and its checksum is stored in its parent meta block, and so
> on. So if ZFS due to bug would write data to wrong location, overwritten blocks
> have checksums stored in different location and ZFS would detect it, correct and
> still return good data.
>
> Really, read something about ZFS before you express your opinions on it.
>
>
>
>>The hardware is MUCH MORE RELIABLE than the software.
>
>
> 1. you still have to use Application/OS to make any use of that hardware.
>
> 2. your hardware runs sotware anyway
>
> 3. your hardware returns corrupted data (sometimes)
>
>
>
>>The fact that you even try to claim that ZFS is better than a RAID-1 or
>>RAID-10 system shows just how little you understand critical systems,
>>and how much you've bought into the ZFS hype.
>>
>
>
> I would rather say that you are complete ignorant and never have actually
> read with understanding about ZFS. Also it appears you've never got data
> corruption from HW arrays - how lucky you are, or maybe you didn't realize
> it was an array which corrupted your data.
>
> Also it seems you don't understand that ZFS does also RAID-1 and/or RAID-10.
>
>
>
>>>The one advantage that a good hardware RAID-1/10 implementation has over
>>>ZFS relates to performance, primarily small-synchronous-write latency:
>>>while ZFS can group small writes to achieve competitive throughput (in
>>>fact, superior throughput in some cases), it can't safely report
>>>synchronous write completion until the data is on the disk platters,
>>>whereas a good RAID controller will contain mirrored NVRAM that can
>>>guarantee persistence in microseconds rather than milliseconds (and then
>>>destage the writes to the platters lazily).
>>>
>>
>>That's one advantage, yes.
>
>
> That's why the combination of ZFS+RAID with large caches is so compeling
> in many cases. And yes, I do have such configs.
>
>
>>Oh, so you're now saying that synchronous writes may not be truly
>>synchronous with ZFS? That's something I didn't know. I thought ZFS
>>was smarter than that.
>
>
> Please, stop trolling. Of course they are synchronous.
>
>


--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in caseof crash (mysql, O/S, hardware, w

am 13.11.2006 13:23:56 von Jerry Stuckle

Robert Milkowski wrote:
> Jerry Stuckle wrote:
>
>>Frank Cusack wrote:
>>
>>>On 11 Nov 2006 19:30:25 -0800 "toby" wrote:
>>>
>>>
>>>>Jerry Stuckle wrote:
>>>>
>>>>
>>>>>REAL RAID-1 and RAID-10 are implemented in hardware/firmware, not a
>>>>>software system such as ZFS.
>>>
>>>
>>>This simple statement shows a fundamental misunderstanding of the basics,
>>>let alone zfs.
>>>
>>>-frank
>>
>>Not at all. RAID-1 and RAID-10 devices are file system neutral. Just
>>like disk systems are file-system neutral.
>>
>>And anyone who things otherwise doesn't understand real RAID
>>implementations - only cheap ones which use software for all or part of
>>their implementation.
>
>
> Well, I would say that it's actually you who do not understand ZFS at all.
> You claim you read Bonwick blog entry - I belive you just do not want to understand
> it.
>

No, I read it. The difference is I have enough technical background to
separate the facts from the hype.

>
>>Real RAID arrays are not cheap. $100-500/GB is not out of the question.
>> And you won't find them at COMP-USA or other retailers.
>>
>>But you don't see those very often on PC's. Most of the time you see
>>cheap implementations where some of the work is done in software.
>
>
> So? I use ZFS with cheap drives and also with storage like EMC Symmetrix and
> several vendors midrange arrays. In some workloads I get for example better
> performance when RAID-10 is done completely by ZFS and not by hardware itself.
>

If you use cheap drives, you need something like ZFS. But if you depend
on cheap drives, your data isn't very critical.

Let's take a real-life example of a critical system - a major airline
where losing one minute of reservations will cost millions of dollars.
And if the system is down for 12 hours the entire company can go under.

Or losing a single hour's worth of flight information could bankrupt the
company. Even losing a single flight could cost millions of dollars,
not to mention the bad PR.

BTW, this airline not only has RAID devices, they have duplicate data
centers, and the databases on those RAID devices are synchronized constantly.

Or a bank, where lost transactions can cause account balances to be
incorrect, and bad data sent to the Federal Reserve System can cost
millions of dollars. Even if they recover all the data, the time it
takes can cost huge losses - banks are on a schedule to send tapes to
the Federal Reserve every night, and missing a deadline can easily cost
$100K per hour in fines.

> Also recently one such hardware RAID actually did generate data corruption
> without reporting it and ZFS did manage it properly. And we happen to have to
> fsck UFS file systems from time to time on those arrays for no apparent reason.
>
> ps. IBM's "hardware" RAID arrays can also loose data, you'll be even informed
> by that "hardware" that it did so, how convinient
>

I never said that hardware RAID systems can't lose data. My comments were:

1) With good drives, unreported ("silent") errors occur so seldom that
they can be ignored,
2) Virtually all other errors can be corrected by the hardware, and
3) ZFS cannot correct for all those errors.

ZFS makes some great claims. But Bonwick makes a great marketing piece
in the way he magnifies the possibilities of hardware problems and
minimizes potential problems in ZFS. He also magnifies the good things
about ZFS, but minimizes the positives of RAID devices.

The whole thing is a great exercise in marketing hype.

> btw: when you talk about hardware RAID - there is actually software running
> on a array's hardware, in case you didn't know
>

Of course I understand that. But I also understand it's isolated from
the system software, and not subject to viruses and other nasty things.
It's also not subject to corruption by other programs.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S,hardware, whatever)

am 13.11.2006 15:12:45 von Bill Todd

Jerry Stuckle wrote:

....

> If they were "silent errors", I would be very suspicious of the
> reporting, unless you have a cheap array.

Of course you would, Jerry: you've apparently never been one to let the
facts get in the way of a good preconception.

....

> For instance, it will sit in the ZFS buffer for a
> lot longer, leaving it open to corruption longer. It has to be in the
> buffer at least as long as it takes for the first write to complete and
> the command sent off to the second drive.

'Fraid not, moron: haven't you ever programmed an asynchronous system
before? The two writes are sent in parallel (if you don't understand
why, or how, well - that's pretty much on a par with the rest of your
ignorance).
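
For anyone following along, 'in parallel' just means both mirror legs are
queued without waiting on each other, so the buffer is exposed for
max(t_a, t_b) rather than t_a + t_b. A trivial illustration (plain Python
with os.pwrite, obviously not ZFS's actual I/O path):

    import os
    from concurrent.futures import ThreadPoolExecutor

    def mirrored_write(fd_a, fd_b, offset, buf):
        # Issue both legs of the mirror concurrently; report completion
        # only after both have landed.
        with ThreadPoolExecutor(max_workers=2) as pool:
            legs = [pool.submit(os.pwrite, fd, buf, offset) for fd in (fd_a, fd_b)]
            for leg in legs:
                leg.result()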

....

> Yes, it *can* detect it. But there is no guarantee it *will* detect it.
> And how is it going to provide application correct data if that data
> was overwritten?

From the other, uncorrupted copy, idiot.

....

> I've read a lot about ZFS.

Then the problem clearly resides in your inability to understand what
you've (allegedly) read: again, no surprise here.

- bill

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in caseof crash (mysql, O/S, hardware, w

am 13.11.2006 15:21:44 von Bill Todd

Jerry Stuckle wrote:

....

> But Bonwick makes a great marketing piece
> in the way he magnifies the possibilities of hardware problems

Sure, Jerry: it's all just hype - ZFS's separate checksums, IBM's,
NetApp's, and EMC's similar (though not always as effective) ancillary
in-line sanity-checks, Oracle's 'Hardware Assisted Resilient Data'
initiative (again, not as fully end-to-end as ZFS's mechanism, but at
least it verifies that what it wrote is what gets down to the individual
disk - and all the major hardware vendors have supported it)...

And only you understand this, of course, due to your extensive
technician-level experience in component repair: everyone else here,
despite their actual experiences with ZFS and hardware, doesn't have a clue.

Are you also one of those people who hears voices in your head when no
one else is around?

- bill

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S,hardware, whatever)

am 13.11.2006 15:37:40 von Jerry Stuckle

Bill Todd wrote:
> Jerry Stuckle wrote:
>
> ...
>
>> If they were "silent errors", I would be very suspicious of the
>> reporting, unless you have a cheap array.
>
>
> Of course you would, Jerry: you've apparently never been one to let the
> facts get in the way of a good preconception.
>

Yep, you sure do, Bill. You are so convinced that ZFS is the best thing
since sliced bread you can't see the obvious.

> ...
>
> For instance, it will sit in the ZFS buffer for a
>
>> lot longer, leaving it open to corruption longer. It has to be in the
>> buffer at least as long as it takes for the first write to complete
>> and the command sent off to the second drive.
>
>
> 'Fraid not, moron: haven't you ever programmed an asynchronous system
> before? The two writes are sent in parallel (if you don't understand
> why, or how, well - that's pretty much on a par with the rest of your
> ignorance).
>
> ...
>
>> Yes, it *can* detect it. But there is no guarantee it *will* detect
>> it. And how is it going to provide application correct data if that
>> data was overwritten?
>
>
> From the other, uncorrupted copy, idiot.
>
> ...
>
>> I've read a lot about ZFS.
>
>
> Then the problem clearly resides in your inability to understand what
> you've (allegedly) read: again, no surprise here.
>
> - bill


Or your inability to understand basic facts.

Troll.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in caseof crash (mysql, O/S, hardware, w

am 13.11.2006 15:39:07 von Jerry Stuckle

Bill Todd wrote:
> Jerry Stuckle wrote:
>
> ...
>
> But Bonwick makes a great marketing piece
>
>> in the way he magnifies the possibilities of hardware problems
>
>
> Sure, Jerry: it's all just hype - ZFS's separate checksums, IBM's,
> NetApp's, and EMC's similar (though not always as effective) ancillary
> in-line sanity-checks, Oracle's 'Hardware Assisted Resilient Data'
> initiative (again, not as fully end-to-end as ZFS's mechanism, but at
> least it verifies that what it wrote is what gets down to the individual
> disk - and all the major hardware vendors have supported it)...
>
> And only you understand this, of course, due to your extensive
> technician-level experience in component repair: everyone else here,
> despite their actual experiences with ZFS and hardware, doesn't have a
> clue.
>
> Are you also one of those people who hears voices in your head when no
> one else is around?
>
> - bill

And digital design - something you've never attempted, nor are you
capable of attempting.

You're just pissed off because you found someone with more knowledge
than you who is challenging your bullshit. And you can't stand it, so
you try the personal attacks.

Troll.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, whatever)

am 13.11.2006 16:06:20 von Robert Milkowski

Jerry Stuckle wrote:
> Robert Milkowski wrote:
> > Jerry Stuckle wrote:
> >
> >>Bill Todd wrote:
> >>
> >>OK, and exactly how many of these bugs are there? Disk drive and
> >>similar firmware is some of the most specialized and most heavily tested
> >>firmware on the planet.
> >
> >
> > What? How many arrays do you manage?
> > How many times did you have to upgrade disk firmware or
> > RAID controllers firmware on them? I did many times.
> >
>
> Robert,
>
> I've lost count over the years of how many I've managed.
>
> As for upgrading disk firmware? Never. RAID firmware? Once, but that
> was on recommendation from the manufacturer, not because we had a problem.
>
> But the RAID devices I'm talking about are at COMP-USA. They are high
> end arrays attached to minis and mainframes. Starting cost probably
> $500K or more. And they are reliable.

And if you have managed them for some time, then you definitely upgraded their
firmware more than once, including disk firmware. Well, maybe not you,
but an EMC engineer did it for you :)

I have upgraded, for example, Symmetrix firmware (ok, an EMC engineer did)
more than once. And this is the kind of array you are talking about, I guess.


> > Recently I did use two SCSI JBODs (ok, it's not array) connected
> > via two SCSI adapters to a host. RAID-10 done in ZFS between JBODS.
> > Well, during data copy one of the controllers reported some warnings,
> > but keep operational. Well, it actually did corrupt data - fortunately
> > ZFS did handle it properly, and we replaced the adapter. With traditional
> > file systems we would be in trouble.
> >
> >
>
> Gee, with good hardware that wouldn't have happened. And with a real
> RAID array it wouldn't have happened, either.

Really? Well, it did - more than once (with well-known vendors).

> >>>Horseshit. It's only 'about performance' when the performance impact is
> >>>significant. In the case of ZFS's mirroring implementation, it isn't of
> >>>any significance at all (let alone any *real* drag on the system).
> >>>
> >>
> >>Keep believing that. It will help you to justify your statements in
> >>your mind.
> >
> >
> > Have you checked it? I DID. And in MY environment ZFS delivered
> > better performance than HW RAID.
> >
> >
>
> What RAID did you get? Did it have it's own drivers, or did it use the
> system drivers? If the former, a lot of the work was done in software,
> which is common in less expensive systems. Did it have dual
> controllers, or did it use one controller for both drives? I could go on.

Dual FC links, RAID-10, etc.....

> I never said ZFS couldn't correct and repair some problems. But it does
> NOT do everything, like some people here have indicated.

Of course. But it does protect you from more data corruption scenarios than ANY
HW RAID can.


--
Robert Milkowski
rmilkowskiZZZ@wp-sa.pl
http://milek.blogspot.com

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in caseof crash (mysql, O/S, hardware, w

am 13.11.2006 16:46:17 von Bill Todd

Jerry Stuckle wrote:

....

> You're just pissed off because you found someone with more knowledge
> than you

And, obviously, more knowledge than anyone else, whether here (those
with actual experience of the errors you claim don't exist in noticeable
quantities) or in the rest of the industry (such as those who actually
implemented the mechanisms that you claim are just hype without even
beginning to understand them).

You also appear to have a rather loose grasp on reality, at least when
it comes to presenting utter drivel as fact.

Are you familiar with the concept of 'delusional megalomania' , Jerry?
If not, perhaps you ought to become acquainted with it.

- bill

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, whatever)

am 13.11.2006 17:48:13 von Good Man

alf wrote in news:q_ydnRdMUfzIMM_YnZ2dnUVZ_rmdnZ2d@comcast.com:

> Hi,
>
> is it possible that due to OS crash or mysql itself crash or some e.g.
> SCSI failure to lose all the data stored in the table (let's say million
> of 1KB rows). In other words what is the worst case scenario for MyISAM
> backend?

Hi everyone

Thanks for contributing... for the most part, it was great to see very
knowledgable people discuss the intriciacies of data safety and management.

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S,hardware, whatever)

am 13.11.2006 18:57:30 von Jerry Stuckle

Robert Milkowski wrote:
> Jerry Stuckle wrote:
>
>>Robert Milkowski wrote:
>>
>>>Jerry Stuckle wrote:
>>>
>>>
>>>>Bill Todd wrote:
>>>>
>>>>OK, and exactly how many of these bugs are there? Disk drive and
>>>>similar firmware is some of the most specialized and most heavily tested
>>>>firmware on the planet.
>>>
>>>
>>>What? How many arrays do you manage?
>>>How many times did you have to upgrade disk firmware or
>>>RAID controllers firmware on them? I did many times.
>>>
>>
>>Robert,
>>
>>I've lost count over the years of how many I've managed.
>>
>>As for upgrading disk firmware? Never. RAID firmware? Once, but that
>>was on recommendation from the manufacturer, not because we had a problem.
>>
>>But the RAID devices I'm talking about are at COMP-USA. They are high
>>end arrays attached to minis and mainframes. Starting cost probably
>>$500K or more. And they are reliable.
>
>
> And if you have managed them for some time then you definitely upgraded their
> firmware more than once, including disk firmware. Well, maybe not you
> but EMC engeneer did it for you :)
>
> I have upgraded for example (ok, EMC engeener) Symmetrix firmware
> more than once. And this is the array you are talkin about I guess.
>

No, I'm not talking Symmetrix. If we're talking the same company, they
are a software supplier (quite good software, I must add), not a RAID
array manufacturer. They may, however, have software to run RAID
arrays; if so I'm not familiar with that particular product.

>
>
>>>Recently I did use two SCSI JBODs (ok, it's not array) connected
>>>via two SCSI adapters to a host. RAID-10 done in ZFS between JBODS.
>>>Well, during data copy one of the controllers reported some warnings,
>>>but keep operational. Well, it actually did corrupt data - fortunately
>>>ZFS did handle it properly, and we replaced the adapter. With traditional
>>>file systems we would be in trouble.
>>>
>>>
>>
>>Gee, with good hardware that wouldn't have happened. And with a real
>>RAID array it wouldn't have happened, either.
>
>
> Really? Well it did more than once (well known vendors).
>

McDonalds is also well known. But I wouldn't equate that to quality food.

>
>>>>>Horseshit. It's only 'about performance' when the performance impact is
>>>>>significant. In the case of ZFS's mirroring implementation, it isn't of
>>>>>any significance at all (let alone any *real* drag on the system).
>>>>>
>>>>
>>>>Keep believing that. It will help you to justify your statements in
>>>>your mind.
>>>
>>>
>>>Have you checked it? I DID. And in MY environment ZFS delivered
>>>better performance than HW RAID.
>>>
>>>
>>
>>What RAID did you get? Did it have it's own drivers, or did it use the
>>system drivers? If the former, a lot of the work was done in software,
>>which is common in less expensive systems. Did it have dual
>>controllers, or did it use one controller for both drives? I could go on.
>
>
> Dual FC links, RAID-10, etc.....
>

But you didn't answer my questions. Did it have its own drivers? Did
it have dual controllers?

>
>>I never said ZFS couldn't correct and repair some problems. But it does
>>NOT do everything, like some people here have indicated.
>
>
> Of course. But it does protect you for more data corruption scenarios than ANY
> HW RAID can.
>
>


But good HW RAID will detect and, if at all possible, correct data
corruption. And if it's not possible, it's because the data is lost -
i.e. completely scrambled and/or overwritten on both drives. Even ZFS
can't handle that.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in caseof crash (mysql, O/S, hardware, w

am 13.11.2006 19:02:24 von Jerry Stuckle

Bill Todd wrote:
> Jerry Stuckle wrote:
>
> ...
>
>> You're just pissed off because you found someone with more knowledge
>> than you
>
>
> And, obviously, more knowledge than anyone else, whether here (those
> with actual experience of the errors you claim don't exist in noticeable
> quantities) or in the rest of the industry (such as those who actually
> implemented the mechanisms that you claim are just hype without even
> beginning to understand them).
>

Again, just another troll response. You can't dispute the facts, so you
make personal attacks on the messenger.

For the record, I have more knowledge of the hardware and internals than
anyone here has shown. And I have yet to see anything in the ZFS
references provided to indicate that ANY of the people there have more
than a cursory knowledge of the hardware and firmware behind disk drives
themselves, much less an in depth knowledge. Yet they spew "facts" like
they are experts.

I have no argument with their programming skills. Merely their lack of
knowledge of disk hardware.

> You also appear to have a rather loose grasp on reality, at least when
> it comes to presenting utter drivel as fact.
>

ROFLMAO! Because I give you the true reality, and not some hype?

> Are you familiar with the concept of 'delusional megalomania' , Jerry?
> If not, perhaps you ought to become acquainted with it.
>
> - bill

You seem to be quite familiar with it, Bill. How many times have you
been diagnosed with it?

Just another troll.


--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, whatever)

am 13.11.2006 19:24:17 von Robert Milkowski

Jerry Stuckle wrote:
>
> But good HW RAID will detect and, if at all possible, correct data
> corruption. And if it's not possible, it's because the data is lost -
> i.e. completely scrambled and/or overwritten on both drives. Even ZFS
> can't handle that.

Ok, it doesn't make sense to reason with you.
You live in a world of your own - fine, keep dreaming.


--
Robert Milkowski
rmilkowskiXXXX@wp-sa.pl
http://milek.blogspot.com

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in caseof crash (mysql, O/S, hardware, w

am 13.11.2006 19:27:22 von Bill Todd

Jerry Stuckle wrote:

....

>> Are you familiar with the concept of 'delusional megalomania' , Jerry?
>> If not, perhaps you ought to become acquainted with it.
>>
>> - bill
>
> You seem to be quite familiar with it, Bill. How many times have you
> been diagnosed with it?

None, but thirty-odd years ago I did work full-time for three years in a
mental hospital, treating disturbed adolescents.

Now, unlike you, I'm not prone to making sweeping assertions far outside
my area of professional expertise and with little or no solid
foundation, but it doesn't take a degree in psychology to know how
clearly your behavior here reminds me of them. And the more I've
noticed that, the more I've begun to feel that perhaps you were more
deserving of pity than of scorn.

But of course, since my experience in this area is purely practical
(though obtained working closely with people who *were* professionals in
this area), I could be wrong: as Freud might have said, sometimes an
asshole is simply an asshole, rather than mentally ill.

- bill

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware,

am 13.11.2006 19:34:00 von Robert Milkowski

Jerry Stuckle wrote:
> Robert Milkowski wrote:
> > Well, I would say that it's actually you who do not understand ZFS at all.
> > You claim you read Bonwick blog entry - I belive you just do not want to understand
> > it.
> >
>
> No, I read it. The difference is I have enough technical background to
> separate the facts from the hype.

So you keep saying...
But your posts indicate you are actually ignorant when it comes
to technical details. All you've presented so far is wishful thinking
and the belief that if something costs lots of money then it will automagically
solve all problems. Well, in reality that's not the case. No matter how much money
you put into a HW RAID, it won't detect some data corruptions which would otherwise
be easily detected and corrected by ZFS.

I think you should stay in your wonderland and we should not waste our time anymore.


--
Robert Milkowski
rmilkowskiZZZ@wp-sa.pl
http://milek.blogspot.com

Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S,hardware, whatever)

am 13.11.2006 22:53:23 von Jerry Stuckle

Robert Milkowski wrote:
> Jerry Stuckle wrote:
>
>>But good HW RAID will detect and, if at all possible, correct data
>>corruption. And if it's not possible, it's because the data is lost -
>>i.e. completely scrambled and/or overwritten on both drives. Even ZFS
>>can't handle that.
>
>
> Ok, it doesn't make sense to reason with you.
> You live in a world of your own - fine, keep dreaming.
>
>

And you ignore the facts. Good luck.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in caseof crash (mysql, O/S, hardware, w

am 13.11.2006 22:55:12 von Jerry Stuckle

Bill Todd wrote:
> Jerry Stuckle wrote:
>
> ...
>
>>> Are you familiar with the concept of 'delusional megalomania' ,
>>> Jerry? If not, perhaps you ought to become acquainted with it.
>>>
>>> - bill
>>
>>
>> You seem to be quite familiar with it, Bill. How many times have you
>> been diagnosed with it?
>
>
> None, but thirty-odd years ago I did work full-time for three years in a
> mental hospital, treating disturbed adolescents.
>

I more suspect you were a patient.

> Now, unlike you, I'm not prone to making sweeping assertions far outside
> my area of professional expertise and with little or no solid
> foundation, but it doesn't take a degree in psychology to know how
> clearly your behavior here reminds me of them. And the more I've
> noticed that, the more I've begun to feel that perhaps you were more
> deserving of pity than of scorn.
>

Hmmm, it seems you've made some sweeping statements in this thread. And
unlike me, you don't have the hardware background to support your
statements. And even your software background is questionable.

> But of course, since my experience in this area is purely practical
> (though obtained working closely with people who *were* professionals in
> this area), I could be wrong: as Freud might have said, sometimes an
> asshole is simply an asshole, rather than mentally ill.
>
> - bill

Right. What did you do - empty their trash cans for them?


--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in caseof crash (mysql, O/S, hardware, w

am 13.11.2006 23:18:27 von Jerry Stuckle

Robert Milkowski wrote:
> Jerry Stuckle wrote:
>
>>Robert Milkowski wrote:
>>
>>>Well, I would say that it's actually you who do not understand ZFS at all.
>>>You claim you read Bonwick blog entry - I belive you just do not want to understand
>>>it.
>>>
>>
>>No, I read it. The difference is I have enough technical background to
>>separate the facts from the hype.
>
>
> So you keep saying...
> But your posts indicate you are actually ignorant when it comes
> to technical details. All you presented so far is whishful thinking
> and belive that if something costs lots of monay then it will automagicaly
> solve all problems. Well, in reality it's not a case. No matter how much money
> you put in a HW RAID it won't detect some data corruptions which would otherwise
> be easily detected and corrected by zfs.
>
> I think you should stay in your wonderland and we should not waste our time anymore.
>
>

No, a truly fault-tolerant hardware RAID is VERY expensive to develop
and manufacture. You don't take $89 100GB disk drives off the shelf,
tack them onto an EIDE controller and add some software to the system.

You first have to start with high quality disk drives. The electronic
components are also higher quality, with complicated circuits to detect
marginal signal strength off of the platter, determine when a signal is
marginal, change the sensing parameters in an attempt to reread the data
correctly, and so on.

The firmware must be able to work with this hardware to handle read
errors and change those parameters, automatically mark marginal sectors
bad before they become totally wiped out, and if the data cannot be read,
automatically retry from the mirror. And if the retry occurs, the
firmware must mark the original track bad and rewrite it with the good data.

Also, with two or more controllers, the controllers talk to each other
directly, generally over a dedicated bus. They keep each other informed
of their status and constantly run diagnostics on themselves and each
other when the system is idle. These tests include reading and writing
test cylinders on the disks to verify proper operation.

Additionally, in the more expensive RAID devices, checksums are
typically at least 32 bits long (your off-the-shelf drive typically uses
a 16 bit checksum), and the checksum is built in hardware - much more
expensive, but much faster than doing it in firmware. Checksum
comparisons are done in hardware, also.

Plus, with verified writes, the firmware has to go back and reread the
data the next time the sector comes around and compare it with the
contents of the buffer. Again, this is often done in hardware on the
high end RAID systems.
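
In software terms a verified write boils down to roughly the following (an
illustrative sketch only - as I said, the high-end boxes do this in dedicated
hardware, and a real software version would have to bypass the page cache,
e.g. with O_DIRECT, so the reread actually hits the platter):

    import os

    def verified_write(fd, offset, buf):
        os.pwrite(fd, buf, offset)
        os.fsync(fd)                               # push it to the medium
        readback = os.pread(fd, len(buf), offset)  # reread on a later revolution
        if readback != buf:
            raise IOError("write-verify mismatch at offset %d" % offset)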

And, most of these RAID devices use custom chip sets - not something off
the shelf. Designing the chipsets themselves is in itself quite
expensive, and due to the relatively limited run and high density of the
chipsets, they are quite expensive to produce.

There's a lot more to it. But the final result is these devices have a
lot more hardware and software, a lot more internal communications, and
a lot more firmware. And it costs a lot of money to design and
manufacture these devices. That's why you won't find them at your
local computer store.

Some of this can be emulated in software. But the software cannot
detect when a signal is getting marginal (it's either "good" or "bad"),
adjust the r/w head parameters, or do similar things. Yes, it can
checksum the data coming back and read from the mirror drive if
necessary. It might even be able to tell the controller to run a
self-check (most controllers do have that capability) during idle times.
But it can't do a lot more than that. The controller interface isn't
smart enough to do a lot more.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in caseof crash (mysql, O/S, hardware, w

am 13.11.2006 23:24:13 von Bill Todd

Jerry Stuckle wrote:

....

>> None, but thirty-odd years ago I did work full-time for three years in
>> a mental hospital, treating disturbed adolescents.

....

> Right. What did you do - empty their trash cans for them?

I fully understand how limited your reading skills are, but exactly what
part of the end of the above sentence surpassed even your meager ability
to read?

One of the many things I learned there was that some people are simply
beyond help - professional or otherwise. You appear to be one of them:
the fact that not a single person here has defended you, but rather
uniformly told you how deluded you are, doesn't faze you in the slightest.

Fortunately, that's in no way my problem, nor of any concern to me. So
have a nice life in your own private little fantasy world - just don't
be surprised that no one else subscribes to it.

- bill

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware,

am 14.11.2006 00:09:51 von Robert Milkowski

Jerry Stuckle wrote:
> Robert Milkowski wrote:
>
> No, a truly fault-tolerant hardware RAID is VERY expensive to develop
> and manufacture. You don't take $89 100GB disk drives off the shelf,
> tack them onto an EIDE controller and add some software to the system.
>
> You first have to start with high quality disk drives. The electronic
> components are also higher quality, with complicated circuits to detect
> marginal signal strength off of the platter, determine when a signal is
> marginal, change the sensing parameters in an attempt to reread the data
> correctly, and so on.
>
> The firmware must be able to work with this hardware to handle read
> errors and change those parameters, automatically mark marginal sectors
> bad before the become totally wiped out, and if the data cannot be read,
> automatically retry from the mirror. And if the retry occurs, the
> firmware must mark the original track bad and rewrite it with the good data.
>
> Also, with two or more controllers, the controllers talk to each other
> directly, generally over a dedicated bus. They keep each other informed
> of their status and constantly run diagnostics on themselves and each
> other when the system is idle. These tests include reading and writing
> test cylinders on the disks to verify proper operation.
>
> Additionally, in the more expensive RAID devices, checksums are
> typically at least 32 bits long (your off-the-shelf drive typically uses
> a 16 bit checksum), and the checksum is built in hardware - much more
> expensive, but much faster than doing it in firmware. Checksum
> comparisons are done in hardware, also.
>
> Plus, with verified writes, the firmware has to go back and reread the
> data the next time the sector comes around and compare it with the
> contents of the buffer. Again, this is often done in hardware on the
> high end RAID systems.
>
> And, most of these RAID devices use custom chip sets - not something off
> the shelf. Designing the chipsets themselves is in itself quite
> expensive, and due to the relatively limited run and high density of the
> chipsets, they are quite expensive to produce.
>
> There's a lot more to it. But the final result is these devices have a
> lot more hardware and software, a lot more internal communications, and
> a lot more firmware. And it costs a lot of money to design and
> manufacture these devices. That's why you won't find them at your
> local computer store.
>
> Some of this can be emulated in software. But the software cannot
> detect when a signal is getting marginal (it's either "good" or "bad",
> adjust the r/w head parameters, and similar things. Yes, it can
> checksum the data coming back and read from the mirror drive if
> necessary. It might even be able to tell the controller to run a
> self-check (most controllers do have that capability) during idle times.
> But it can't do a lot more than that. The controller interface isn't
> smart enough to do a lot more.
>

The point is that you can still use such an array and put ZFS on top of it,
for many reasons - easier management is one; another is better
data protection than with a classic file system.

--
Robert Milkowski
rmilkowskiZZZZ@wp-sa.pl
http://milek.blogspot.com

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in caseof crash (mysql, O/S, hardware, w

am 14.11.2006 01:57:17 von Bill Todd

Dear me - I just bade you a fond farewell, and here you've at last come
up with something at least vaguely technical (still somewhat mistaken,
but at least technical). So I'll respond to it in kind:

Jerry Stuckle wrote:

....

> a truly fault-tolerant hardware RAID is VERY expensive to develop
> and manufacture.

That's true, and one of the reasons why it makes a lot more sense to do
the work in software instead (as long as small-update latency is not
critical). One of the reasons for the rise of high-end, high-cost
hardware RAID systems was the lag in development of system software in
that area. Another was the ability of the hardware approach to bundle
in NVRAM write acceleration that was considerably more difficult to add
(and then use) as a special system device (the venerable Prestoserve
product comes to mind), plus large amounts of additional cache that
systems may not even have been able to support at all due to
address-space limitations: the ability to look like a plain old disk
(no system or application software changes required at all) but offer
far higher reliability (through redundancy) and far better small-update
and/or read-caching performance helped make the sale.

But time marches on. Most serious operating systems now support (either
natively or via extremely reputable decade-old, thoroughly-tested
third-party system software products from people like Veritas) software
RAID, and as much cache memory as you can afford (no more address-space
limitations there) - plus (with products like ZFS) are at least starting
to address synchronous small-update throughput (though when synchronous
small-update *latency* is critical there's still no match for NVRAM).

> You don't take $89 100GB disk drives off the shelf
> tack them onto an EIDE controller and add some software to the system.

Actually, you can do almost *precisely* that, as long as the software
handles the situation appropriately - and that's part of what ZFS is
offering (and what you so obviously completely fail to be able to grasp).

No disk or firmware is completely foolproof. Not one. No matter how
expensive and well-designed. So the question isn't whether the disks
and firmware are unreliable, but just the degree and manner in which
they are.

There is, to be sure, no way that you can make a pair of inexpensive
SATA drives just as reliable as a pair of Cheetahs, all other things
being equal. But it is *eminently* possible, using appropriate software
(or firmware), to make *three or four* inexpensive SATA drives *more*
reliable than a pair of Cheetahs that cost far more - and to obtain
better performance in many areas in the bargain.
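
A crude way to see why - with annual failure rates I'm simply assuming for
illustration, and ignoring rebuild windows and correlated failures:

    afr_sata    = 0.05   # assumed annual failure rate, inexpensive SATA drive
    afr_cheetah = 0.02   # assumed annual failure rate, high-end FC/SCSI drive

    p_lose_2way_cheetah = afr_cheetah ** 2   # both Cheetahs die:   0.0004
    p_lose_3way_sata    = afr_sata ** 3      # all three SATAs die: 0.000125

    print(p_lose_2way_cheetah, p_lose_3way_sata)
    # Even with drives 2.5x as failure-prone, the 3-way mirror comes out
    # roughly 3x less likely to lose all copies in this crude model.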

Do you buy Brand X drives off the back of a truck in an alley? Of
course not: you buy from Seagate (or someone you think has similar
credibility - and perhaps not their newly-acquired Maxtor drives for a
while yet), and for 24/7 use you buy their 'near line' drives (which
aren't much more expensive than their desktop versions). For *really*
hard 24/7 seek-intensive pounding, your only real SATA choice is Western
Digital's Raptor series - unless you just throw so many lesser drives at
the workload and distribute it across them sufficiently evenly that
there's no longer any real pounding on any given drive (which in fact is
not an unrealistic possibility, though one which must be approached with
due care).

Such reputable SATA drives aren't the equal of their high-end FC
cousins, but neither are they crap: in both cases, as long as your
expectations are realistic, you compensate for their limitations, and
you don't abuse them, they won't let you down.

And you don't attach them through Brand X SATA controllers, either:
ideally, you attach them directly (since you no longer need any
intermediate RAID hardware), using the same quality electronics you have
on the rest of your system board (so the SATA connection won't
constitute a weak link). And by virtue of being considerably simpler
hardware/firmware than a RAID implementation, that controller may well
be *more* reliable.

If you've got a lot of disks to attach, quality SATA port multipliers,
SAS connections, and fibre-channel-to-SATA links are available.

>
> You first have to start with high quality disk drives. The electronic
> components are also higher quality, with complicated circuits to detect
> marginal signal strength off of the platter, determine when a signal is
> marginal, change the sensing parameters in an attempt to reread the data
> correctly, and so on.

That's all very nice, but that actually (while as explained above being
an eminently debatable question in its own right) hasn't been the main
subject under discussion here: it's been whether hardware *RAID* is any
more reliable than software RAID (not what kind of disks one should use
after having made that RAID choice).

>
> The firmware must be able to work with this hardware to handle read
> errors and change those parameters, automatically mark marginal sectors
> bad before they become totally wiped out,

Whether you're aware of it or not, modern SATA drives (and even
not-too-old ATA drives) do *all* the things that you just described in
your last one-and-a-half paragraphs.

> and if the data cannot be read,
> automatically retry from the mirror.

It really doesn't matter whether that's done in hardware or in software.

> And if the retry occurs, the
> firmware must mark the original track bad and rewrite it with the good
> data.

Modern disks (both FC/SCSI and ATA/SATA) do that themselves, without
waiting for instructions from a higher level. They report any failure
up so that the higher level (again, doesn't matter whether it's firmware
or software) can correct the data if a good copy can be found elsewhere.
If its internal retry succeeds, the disk doesn't report an error, but
does log it internally such that any interested higher-level firmware or
software can see whether such successful retries are starting to become
alarmingly frequent and act accordingly.
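
As an aside, one way an interested higher level can watch those
internally-logged events is through the drive's SMART counters. A rough
sketch using the smartmontools CLI from Python - the device path is
hypothetical and the attribute names vary by drive and firmware:

    import subprocess

    def smart_attributes(device: str) -> dict:
        # Parse `smartctl -A` output into {attribute_name: raw_value}.
        out = subprocess.run(["smartctl", "-A", device],
                             capture_output=True, text=True, check=True).stdout
        attrs = {}
        for line in out.splitlines():
            fields = line.split()
            if len(fields) >= 10 and fields[0].isdigit():
                attrs[fields[1]] = fields[9]
        return attrs

    attrs = smart_attributes("/dev/sda")   # hypothetical device path
    for name in ("Reallocated_Sector_Ct", "Current_Pending_Sector",
                 "Hardware_ECC_Recovered"):
        print(name, attrs.get(name, "n/a"))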

>
> Also, with two or more controllers, the controllers talk to each other
> directly, generally over a dedicated bus. They keep each other informed
> of their status and constantly run diagnostics on themselves and each
> other when the system is idle.

Which is only necessary because they're doing things like capturing
updates in NVRAM (updates that must survive controller failure and thus
need to be mirrored in NVRAM at the other controller): if you eliminate
that level of function, you lose any need for that level of complexity
(not to mention eliminating a complete layer of complex hardware with
its own potential to fail).

As I said at the outset, hardware RAID *does* have some *performance*
advantages (though new software approaches to handling data continue to
erode them). But there's no intrinsic *reliability* advantage: if you
don't need that NVRAM mirrored between controllers for performance
reasons, it adds nothing to (and may actually subtract from) your
system's reliability compared with a software approach.

Having multiple paths to each disk isn't all that critical in RAID-1/10
configurations, since you can split the copies across two controllers to
ensure that one copy remains available if a controller dies (not that
frequent an occurrence - arguably, no more likely than that your system
board will experience some single point of *complete* failure). SATA
port selectors allow system fail-over, as does use of SAS or FC
connectivity to the disks (and the latter two support multiple paths to
each disk as well, should you want them).

> These tests include reading and writing
> test cylinders on the disks to verify proper operation.

The background disk scrubbing which both hardware and software RAID
approaches should be doing covers that (and if there's really *no*
writing going on in the system for long periods of time, the software
can exercise that as well once in a while).

>
> Additionally, in the more expensive RAID devices, checksums are
> typically at least 32 bits long (your off-the-shelf drive typically uses
> a 16 bit checksum), and the checksum is built in hardware - much more
> expensive, but much faster than doing it in firmware. Checksum
> comparisons are done in hardware, also.

Your hand-waving just got a bit fast to follow there.

1. Disks certainly use internal per-sector error-correction codes when
transferring data to and from their platters. They are hundreds
(perhaps by now *many* hundreds) of bits long.

2. Disks use cyclic redundancy checks on the data that they accept from
and distribute to the outside world (old IDE disks did not, but ATA
disks do and SATA disks do as well - IIRC the width is 32 bits).

3. I'd certainly expect any RAID hardware to use those CRCs to
communicate with both disks and host systems: that hardly qualifies as
anything unusual. If you were talking about some *other* kind of
checksum, it would have to have been internal to the RAID, since the
disks wouldn't know anything about it (a host using special driver
software potentially could, but it would add nothing of obvious value to
the CRC mechanisms that the host already uses to communicate directly
with disks, so I'd just expect the RAID box to emulate a disk for such
communication).

4. Thus data going from system memory to disk platter and back goes (in
each direction) through several interfaces and physical connectors and
multiple per-hop checks, and the probability of some undetected failure,
while very small for any given interface, connector, or hop, is not
quite as small for the sum of all of them (as well as there being some
errors, such as misdirected or lost writes, that none of those checks
can catch). What ZFS provides (that by definition hardware RAID cannot,
since it must emulate a standard block-level interface to the host) is
an end-to-end checksum that verifies data from the time it is created in
main memory to the time it has been fetched back into main memory from
disk. IBM, NetApp, and EMC use somewhat analogous supplementary
checksums to protect data: in the i-series case I believe that they are
created and checked in main memory at the driver level and are thus
comparably strong, while in NetApp's and EMC's cases they are created
and checked in the main memory of the file server or hardware box but
then must get to and from client main memory across additional
interfaces, connectors, and hops which have their own individual checks
and are thus not comparably end-to-end in nature - though if the NetApp
data is accessed through a file-level protocol that includes an
end-to-end checksum that is created and checked in client and server
main memory rather than, e.g., in some NIC hardware accelerator it could
be *almost* comparable in strength.
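
To illustrate the end-to-end idea in the abstract (this is not ZFS's
actual algorithm or on-disk format, and sha256 is just a stand-in for
whatever checksum the storage layer uses): the checksum is computed while
the data is still in main memory, kept out-of-band rather than next to the
block, and re-verified only once the block is back in main memory, so
corruption introduced anywhere along the path is caught:

    import hashlib

    def write_block(path: str, data: bytes) -> str:
        # Checksum is computed while the data is still in main memory ...
        digest = hashlib.sha256(data).hexdigest()
        with open(path, "wb") as f:
            f.write(data)
        # ... and kept out-of-band (ZFS stores it in the parent block pointer).
        return digest

    def read_block(path: str, expected_digest: str) -> bytes:
        with open(path, "rb") as f:
            data = f.read()
        # The data is trusted only after it is back in main memory and verified;
        # on a mismatch the caller would fetch the block from a mirror instead.
        if hashlib.sha256(data).hexdigest() != expected_digest:
            raise IOError("checksum mismatch reading " + path)
        return data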

>
> Plus, with verified writes, the firmware has to go back and reread the
> data the next time the sector comes around and compare it with the
> contents of the buffer. Again, this is often done in hardware on the
> high end RAID systems.

And can just as well be done in system software (indeed, this is often a
software option in high-end systems).

>
> And, most of these RAID devices use custom chip sets - not something off
> the shelf.

That in itself is a red flag: they are far more complex and also get
far less thoroughly exercised out in the field than more standard
components - regardless of how diligently they're tested.

As others have pointed out, high-end RAID firmware updates are *not*
infrequent. And they don't just do them for fun.

> Designing the chipsets themselves is in itself quite
> expensive, and due to the relatively limited run and high density of the
> chipsets, they are quite expensive to produce.

As I observed at the outset, another reason to do the work in system
software.

>
> There's a lot more to it. But the final result is these devices have a
> lot more hardware and software, a lot more internal communications, and
> a lot more firmware. And it costs a lot of money to design and
> manufacture these devices.

And all those things are *disadvantages*, not recommendations.

They can also significantly limit their utility. For example, VMS
clusters support synchronous operation at separations up to 500 miles
(actually, more, but beyond that it starts to get into needs for special
tweaking) - but using host-based software mirroring rather than hardware
mirroring (because most hardware won't mirror synchronously at anything
like that distance - not to mention requiring a complete second
connection at *any* distance, whereas the normal cluster LAN or WAN can
handle software mirroring activity).

> That's why you won't find them at your
> local computer store.

I seriously doubt that anyone who's been talking with you (or at least
trying to) about hardware RAID solutions has been talking about any that
you'd find at CompUSA. EMC's Symmetrix, for example, was the gold
standard of enterprise-level hardware RAID for most of the '90s - only
relatively recently did IBM claw back substantial market share in that
area (along with HDS).

>
> Some of this can be emulated in software.

*All* of the RAID part can be.

> But the software cannot
> detect when a signal is getting marginal (it's either "good" or "bad"),
> adjust the r/w head parameters, and similar things.

And neither can hardware RAID: those things happen strictly internally
at the disk (for that matter, by definition *anything* that the disk
externalizes can be handled by software as well as by RAID hardware).

> Yes, it can
> checksum the data coming back and read from the mirror drive if
> necessary.

Yup.

Now, that *used* to be at least something of a performance issue - being
able to offload that into firmware was measurably useful. But today's
processor and memory bandwidth makes it eminently feasible - even in
cases where it's not effectively free (if you have to move the data, or
have to compress/decompress or encrypt/decrypt it, you can generate the
checksum as it's passing through and pay virtually no additional cost at
all).

That's still only a wash when conventional checksum mechanisms are used.
But when you instead use an end-to-end checksum like ZFS's (which you
can do *only* when the data is in main memory, hence can't offload) you
get a significant benefit from it.
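
A generic sketch of that "effectively free while the data is passing
through anyway" point - a single pass over the buffer feeds both the
compressor and the checksum, so no extra traversal of the data is needed
(illustrative only, not ZFS code):

    import hashlib
    import zlib

    def compress_and_checksum(data: bytes, chunk_size: int = 64 * 1024):
        # One pass over the data: each chunk is fed to the checksum and to
        # the compressor while it is already at hand.
        h = hashlib.sha256()
        compressor = zlib.compressobj()
        out = []
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            h.update(chunk)
            out.append(compressor.compress(chunk))
        out.append(compressor.flush())
        return b"".join(out), h.hexdigest()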

> It might even be able to tell the controller to run a
> self-check (most controllers do have that capability) during idle times.

If there were any reason to - but without the complexity of RAID
firmware to worry about, any need for checks beyond what the simpler
controller should probably be doing on its own becomes questionable.

> But it can't do a lot more than that. The controller interface isn't
> smart enough to do a lot more.

And without having to handle RAID management, it doesn't have to be.

- bill

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, w

am 14.11.2006 02:35:46 von Jerry Stuckle

Bill Todd wrote:
> Jerry Stuckle wrote:
>
> ...
>
>>> None, but thirty-odd years ago I did work full-time for three years
>>> in a mental hospital, treating disturbed adolescents.
>
>
> ...
>
>> Right. What did you do - empty their trash cans for them?
>
>
> I fully understand how limited your reading skills are, but exactly what
> part of the end of the above sentence surpassed even your meager ability
> to read?
>
> One of the many things I learned there was that some people are simply
> beyond help - professional or otherwise. You appear to be one of them:
> the fact that not a single person here has defended you, but rather
> uniformly told you how deluded you are, doesn't faze you in the slightest.
>
> Fortunately, that's in no way my problem, nor of any concern to me. So
> have a nice life in your own private little fantasy world - just don't
> be surprised that no one else subscribes to it.
>
> - bill

Yes, I agree. I would suggest you go back through this thread and see
who needs the help, but it's obviously beyond your comprehension.

A quick refresher. You came up with some "facts" but had nothing other
than a couple of blogs to back them up. You have no technical
background, and are incapable of understanding even the basic
electronics about which you espouse "facts". Yet you regard them as the
ultimate truths.

And when I came back and shot down your arguments one by one, you
started the personal attacks. You have yet to refute any of the facts I
gave you, other than to repeat your hype and drivel (as if that makes
them even more factual) and more personal attacks.

Go away, little troll. Your mommy is calling you.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, w

am 14.11.2006 02:42:58 von Jerry Stuckle

Robert Milkowski wrote:
> Jerry Stuckle wrote:
>
>>Robert Milkowski wrote:
>>
>>No, a truly fault-tolerant hardware RAID is VERY expensive to develop
>>and manufacture. You don't take $89 100GB disk drives off the shelf,
>>tack them onto an EIDE controller and add some software to the system.
>>
>>You first have to start with high quality disk drives. The electronic
>>components are also higher quality, with complicated circuits to detect
>>marginal signal strength off of the platter, determine when a signal is
>>marginal, change the sensing parameters in an attempt to reread the data
>>correctly, and so on.
>>
>>The firmware must be able to work with this hardware to handle read
>>errors and change those parameters, automatically mark marginal sectors
>>bad before they become totally wiped out, and if the data cannot be read,
>>automatically retry from the mirror. And if the retry occurs, the
>>firmware must mark the original track bad and rewrite it with the good data.
>>
>>Also, with two or more controllers, the controllers talk to each other
>>directly, generally over a dedicated bus. They keep each other informed
>>of their status and constantly run diagnostics on themselves and each
>>other when the system is idle. These tests include reading and writing
>>test cylinders on the disks to verify proper operation.
>>
>>Additionally, in the more expensive RAID devices, checksums are
>>typically at least 32 bits long (your off-the-shelf drive typically uses
>>a 16 bit checksum), and the checksum is built in hardware - much more
>>expensive, but much faster than doing it in firmware. Checksum
>>comparisons are done in hardware, also.
>>
>>Plus, with verified writes, the firmware has to go back and reread the
>>data the next time the sector comes around and compare it with the
>>contents of the buffer. Again, this is often done in hardware on the
>>high end RAID systems.
>>
>>And, most of these RAID devices use custom chip sets - not something off
>>the shelf. Designing the chipsets themselves is in itself quite
>>expensive, and due to the relatively limited run and high density of the
>>chipsets, they are quite expensive to produce.
>>
>>There's a lot more to it. But the final result is these devices have a
>>lot more hardware and software, a lot more internal communications, and
>>a lot more firmware. And it costs a lot of money to design and
>>manufacture these devices. That's why you won't find them at your
>>local computer store.
>>
>>Some of this can be emulated in software. But the software cannot
>>detect when a signal is getting marginal (it's either "good" or "bad"),
>>adjust the r/w head parameters, and similar things. Yes, it can
>>checksum the data coming back and read from the mirror drive if
>>necessary. It might even be able to tell the controller to run a
>>self-check (most controllers do have that capability) during idle times.
>> But it can't do a lot more than that. The controller interface isn't
>>smart enough to do a lot more.
>>
>
>
> The point is that you can still use such array and put on top of it ZFS
> for many reasons - easier management is one of reasons, another is better
> data protection than if you use classic file system.
>

The point is that such an array makes ZFS unnecessary. Sure, you *can*
use it (if you're using Linux - most of these systems do not). There is
nothing for ZFS to "manage" - configuration is done through utilities
(and sometimes an Ethernet port or similar). There is no management
interface for the file system - it all looks like a single disk (or
several disks, depending on the configuration).

As for data protection - if the RAID array can't read the data, it's
lost far beyond what ZFS or any other file system can do - unless you
have another complete RAID being run by ZFS. And if that's the case,
it's cheaper to have multiple mirrors.

There are a few very high-end systems that use 3 drives and compare
everything (2 out of 3 wins). But these are very, very rare, and only used
for the absolutely most critical data (e.g. space missions, where they
can't be repaired/replaced easily).

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, w

am 14.11.2006 03:48:07 von Jerry Stuckle

Bill Todd wrote:
> Dear me - I just bade you a fond farewell, and here you've at last come
> up with something at least vaguely technical (still somewhat mistaken,
> but at least technical). So I'll respond to it in kind:
>
> Jerry Stuckle wrote:
>
> ...
>
> a truly fault-tolerant hardware RAID is VERY expensive to develop
>
>> and manufacture.
>
>
> That's true, and one of the reasons why it makes a lot more sense to do
> the work in software instead (as long as small-update latency is not
> critical). One of the reasons for the rise of high-end, high-cost
> hardware RAID systems was the lag in development of system software in
> that area. Another was the ability of the hardware approach to bundle
> in NVRAM write acceleration that was considerably more difficult to add
> (and then use) as a special system device (the venerable Prestoserve
> product comes to mind), plus large amounts of additional cache that
> systems may not even have been able to support at all due to
> address-space limitations: the ability to look like a plain old disk
> (no system or application software changes required at all) but offer
> far higher reliability (through redundancy) and far better small-update
> and/or read-caching performance helped make the sale.
>

Yes, and software implementations are poor replacements for a truly
fault-tolerant system. And the high end RAID devices do not require
special software - they look like any other disk device attached to the
system.

As for bundling write acceleration in NVRAM - again, meaningless because
good RAID devices aren't loaded as a "special system device".

Prestoserve was one of the first lower-end RAID products made. However,
there were a huge number of them before that. But you wouldn't find
them on a PC. They were primarily medium and large system devices.
Prestoserve took some of the ideas and moved much of the hardware
handling into software. Unfortunately, when they did it, they lost the
ability to handle problems at a low-level (i.e. read head biasing,
etc.). It did make the arrays a lot cheaper, but at a price.

And in the RAID devices, system address space was never a problem -
because the data was transferred to RAID cache immediately. This did
not come out of the system pool; the controllers have their own cache.

I remember 64MB caches in the controllers way back in the mid 80's.
It's in the GB, now. No address space limitations on the system because
it didn't use system memory.

> But time marches on. Most serious operating systems now support (either
> natively or via extremely reputable decade-old, thoroughly-tested
> third-party system software products from people like Veritas) software
> RAID, and as much cache memory as you can afford (no more address-space
> limitations there) - plus (with products like ZFS) are at least starting
> to address synchronous small-update throughput (though when synchronous
> small-update *latency* is critical there's still no match for NVRAM).
>

Sure, you can get software RAID. But it's not as reliable as a good
hardware RAID.

> You don't take $89 100GB disk drives off the shelf,
>
>> tack them onto an EIDE controller and add some software to the system.
>
>
> Actually, you can do almost *precisely* that, as long as the software is
> handles the situation appropriately - and that's part of what ZFS is
> offering (and what you so obviously completely fail to be able to grasp).
>

In the cheap RAID devices, sure. But not in the good ones. You're
talking cheap. I'm talking quality.

> No disk or firmware is completely foolproof. Not one. No matter how
> expensive and well-designed. So the question isn't whether the disks
> and firmware are unreliable, but just the degree and manner in which
> they are.
>

I never said they were 100% foolproof. Rather, I said they are amongst
the most tested software made. Probably the only software tested more
thoroughly is the microcode on CPU's. And they are as reliable as
humanly possible.

Of course, the same thing goes for ZFS and any file system. They're not
completely foolproof, either, are they?

> There is, to be sure, no way that you can make a pair of inexpensive
> SATA drives just as reliable as a pair of Cheetahs, all other things
> being equal. But it is *eminently* possible, using appropriate software
> (or firmware), to make *three or four* inexpensive SATA drives *more*
> reliable than a pair of Cheetahs that cost far more - and to obtain
> better performance in many areas in the bargain.
>

And there is no way to make a pair of Cheetahs as reliable as drives
made strictly for high end RAID devices. Some of these drives still
sell for $30-60/GB (or more).

> Do you buy Brand X drives off the back of a truck in an alley? Of
> course not: you buy from Seagate (or someone you think has similar
> credibility - and perhaps not their newly-acquired Maxtor drives for a
> while yet), and for 24/7 use you buy their 'near line' drives (which
> aren't much more expensive than their desktop versions). For *really*
> hard 24/7 seek-intensive pounding, your only real SATA choice is Western
> Digital's Raptor series - unless you just throw so many lesser drives at
> the workload and distribute it across them sufficiently evenly that
> there's no long any real pounding on any given drive (which in fact is
> not an unrealistic possibility, though one which must be approached with
> due care).
>

Or RAID drives not available as single units - other than as replacement
parts for their specific RAID arrays.

> Such reputable SATA drives aren't the equal of their high-end FC
> cousins, but neither are they crap: in both cases, as long as your
> expectations are realistic, you compensate for their limitations, and
> you don't abuse them, they won't let you down.
>

No, I didn't say ANY drive was "crap". They're good drives, when used
for what they are designed. But drives made for RAID arrays are in a
class by themselves. And they can do things that standard drives can't
(like dynamically adjust amplifiers and slew rates when reading and
writing data).

> And you don't attach them through Brand X SATA controllers, either:
> ideally, you attach them directly (since you no longer need any
> intermediate RAID hardware), using the same quality electronics you have
> on the rest of your system board (so the SATA connection won't
> constitute a weak link). And by virtue of being considerably simpler
> hardware/firmware than a RAID implementation, that controller may well
> be *more* reliable.
>

There is no way this is more reliable than a good RAID system. If you
had ever used one, you wouldn't even try to make that claim.

> If you've got a lot of disks to attach, quality SATA port multipliers,
> SAS connections, and fibre-channel-to-SATA links are available.
>

Sure. But they still don't do the things RAID drives can do.

>>
>> You first have to start with high quality disk drives. The electronic
>> components are also higher quality, with complicated circuits to
>> detect marginal signal strength off of the platter, determine when a
>> signal is marginal, change the sensing parameters in an attempt to
>> reread the data correctly, and so on.
>
>
> That's all very nice, but that actually (while as explained above being
> an eminently debatable question in its own right) hasn't been the main
> subject under discussion here: it's been whether hardware *RAID* is any
> more reliable than software RAID (not what kind of disks one should use
> after having made that RAID choice).
>

And the disk drive is a part of hardware RAID. Only a total idiot would
ignore the disk drive quality when discussing RAID reliability.

>>
>> The firmware must be able to work with this hardware to handle read
>> errors and change those parameters, automatically mark marginal
>> sectors bad before they become totally wiped out,
>
>
> Whether you're aware of it or not, modern SATA drives (and even
> not-too-old ATA drives) do *all* the things that you just described in
> your last one-and-a-half paragraphs.
>

And let's see those drives do things like dynamically adjust the
electronics - such as amp gain, bias, slew rate... They can't do it.

> and if the data cannot be read,
>
>> automatically retry from the mirror.
>
>
> It really doesn't matter whether that's done in hardware or in software.
>

Spoken by someone who truly has no idea what he's talking about. Anyone
who has worked with high performance, critical systems knows there is a
*huge* difference between doing it in hardware and software.

> And if the retry occurs, the
>
>> firmware must mark the original track bad and rewrite it with the good
>> data.
>
>
> Modern disks (both FC/SCSI and ATA/SATA) do that themselves, without
> waiting for instructions from a higher level. They report any failure
> up so that the higher level (again, doesn't matter whether it's firmware
> or software) can correct the data if a good copy can be found elsewhere.
> If its internal retry succeeds, the disk doesn't report an error, but
> does log it internally such that any interested higher-level firmware or
> software can see whether such successful retries are starting to become
> alarmingly frequent and act accordingly.
>

Yes, they report total failure on a read. But they can't go back and
try to reread the sector with different parms to the read amps, for
instance. And a good RAID controller will make decisions based in part
on what parameters it takes to read the data.

>>
>> Also, with two or more controllers, the controllers talk to each other
>> directly, generally over a dedicated bus. They keep each other
>> informed of their status and constantly run diagnostics on themselves
>> and each other when the system is idle.
>
>
> Which is only necessary because they're doing things like capturing
> updates in NVRAM (updates that must survive controller failure and thus
> need to be mirrored in NVRAM at the other controller): if you eliminate
> that level of function, you lose any need for that level of complexity
> (not to mention eliminating a complete layer of complex hardware with
> its own potential to fail).
>

This has nothing to do with updates in NVRAM. This has everything to do
with processing the data, constant self-checks, etc. This is critical
in high-reliability systems.

> As I said at the outset, hardware RAID *does* have some *performance*
> advantages (though new software approaches to handling data continue to
> erode them). But there's no intrinsic *reliability* advantage: if you
> don't need that NVRAM mirrored between controllers for performance
> reasons, it adds nothing to (and may actually subtract from) your
> system's reliability compared with a software approach.
>

Again, you make a generalization about which you know nothing. How many
$500K+ RAID arrays have you actually worked on? For that matter, how
many $50K arrays? $5K?

> Having multiple paths to each disk isn't all that critical in RAID-1/10
> configurations, since you can split the copies across two controllers to
> ensure that one copy remains available if a controller dies (not that
> frequent an occurrence - arguably, no more likely than that your system
> board will experience some single point of *complete* failure). SATA
> port selectors allow system fail-over, as does use of SAS or FC
> connectivity to the disks (and the latter two support multiple paths to
> each disk as well, should you want them).
>

I don't believe I ever said anything about multiple paths to each disk.
But you're correct, some RAID arrays have them.

> These tests include reading and writing
>
>> test cylinders on the disks to verify proper operation.
>
>
> The background disk scrubbing which both hardware and software RAID
> approaches should be doing covers that (and if there's really *no*
> writing going on in the system for long periods of time, the software
> can exercise that as well once in a while).
>

No, it doesn't. For instance, these tests include things like writing
with a lower-level signal than normal and trying to read it back. It
helps catch potential problems in the heads and electronics. The same
is true for writing with stronger than normal currents - and trying to
read them back. Also checking adjacent tracks for "bit bleed". And a
lot of other things.

These are things again no software implementation can do.

>>
>> Additionally, in the more expensive RAID devices, checksums are
>> typically at least 32 bits long (your off-the-shelf drive typically
>> uses a 16 bit checksum), and the checksum is built in hardware - much
>> more expensive, but much faster than doing it in firmware. Checksum
>> comparisons are done in hardware, also.
>
>
> Your hand-waving just got a bit fast to follow there.
>
> 1. Disks certainly use internal per-sector error-correction codes when
> transferring data to and from their platters. They are hundreds
> (perhaps by now *many* hundreds) of bits long.

Actually, not. Sectors are still 512 bytes. And the checksums (or ECC,
if they use them) are still only 16 or 32 bits. And even if they use
ECC, 32 bits can only correct up to 3 bad bits out of the 512
bytes. None use "many hundreds of bits". It would waste too much disk
space.

>
> 2. Disks use cyclic redundancy checks on the data that they accept from
> and distribute to the outside world (old IDE disks did not, but ATA
> disks do and SATA disks do as well - IIRC the width is 32 bits).
>

See above. And even the original IDE drives used a 16-bit checksum.

> 3. I'd certainly expect any RAID hardware to use those CRCs to
> communicate with both disks and host systems: that hardly qualifies as
> anything unusual. If you were talking about some *other* kind of
> checksum, it would have to have been internal to the RAID, since the
> disks wouldn't know anything about it (a host using special driver
> software potentially could, but it would add nothing of obvious value to
> the CRC mechanisms that the host already uses to communicate directly
> with disks, so I'd just expect the RAID box to emulate a disk for such
> communication).
>

CRC's are not transferred to the host system, either in RAID or non-RAID
drives. Yes, some drives have that capability for diagnostic purposes.
But as a standard practice, transferring 512 bytes is 512 bytes of
data - no more, no less.

> 4. Thus data going from system memory to disk platter and back goes (in
> each direction) through several interfaces and physical connectors and
> multiple per-hop checks, and the probability of some undetected failure,
> while very small for any given interface, connector, or hop, is not
> quite as small for the sum of all of them (as well as there being some
> errors, such as misdirected or lost writes, that none of those checks
> can catch). What ZFS provides (that by definition hardware RAID cannot,
> since it must emulate a standard block-level interface to the host) is
> an end-to-end checksum that verifies data from the time it is created in
> main memory to the time it has been fetched back into main memory from
> disk. IBM, NetApp, and EMC use somewhat analogous supplementary
> checksums to protect data: in the i-series case I believe that they are
> created and checked in main memory at the driver level and are thus
> comparably strong, while in NetApp's and EMC's cases they are created
> and checked in the main memory of the file server or hardware box but
> then must get to and from client main memory across additional
> interfaces, connectors, and hops which have their own individual checks
> and are thus not comparably end-to-end in nature - though if the NetApp
> data is accessed through a file-level protocol that includes an
> end-to-end checksum that is created and checked in client and server
> main memory rather than, e.g., in some NIC hardware accelerator it could
> be *almost* comparable in strength.
>

Yes, ZFS can correct for errors like bad connectors and cables. And I
guess you need it if you use cheap connectors or cables. But even if
they do fail - it's not going to be a one-time occurrence. Chances are
your system will crash within a few hundred ms.

I don't know about NetApp, but IBM doesn't work this way at all. The
channel itself is parity checked by hardware on both ends. Any parity
check brings the system to an immediate halt.

>>
>> Plus, with verified writes, the firmware has to go back and reread the
>> data the next time the sector comes around and compare it with the
>> contents of the buffer. Again, this is often done in hardware on the
>> high end RAID systems.
>
>
> And can just as well be done in system software (indeed, this is often a
> software option in high-end systems).
>

Sure, it *can* be done with software, at a price.

>>
>> And, most of these RAID devices use custom chip sets - not something
>> off the shelf.
>
>
> That in itself is a red flag: they are far more complex and also get
> far less thoroughly exercised out in the field than more standard
> components - regardless of how diligently they're tested.
>
Gotten a cell phone lately? Chances are the chips in your phone are
custom-made. Each manufacturer creates its own. Or an X-BOX, Nintendo,
PlayStation, etc.? Most of those have custom chips. And the same is
true for microwaves, TV sets and more.

The big difference is that Nokia can make 10M custom chips for its
phones; for a high-end RAID device, 100K is a big run.


> As others have pointed out, high-end RAID firmware updates are *not*
> infrequent. And they don't just do them for fun.
>
> Designing the chipsets themselves is in itself quite
>
>> expensive, and due to the relatively limited run and high density of
>> the chipsets, they are quite expensive to produce.
>
>
> As I observed at the outset, another reason to do the work in system
> software.
>

And data in the system and system software can be corrupted. Once the
data is in the RAID device, it cannot.
>>
>> There's a lot more to it. But the final result is these devices have
>> a lot more hardware and software, a lot more internal communications,
>> and a lot more firmware. And it costs a lot of money to design and
>> manufacture these devices.
>
>
> And all those things are *disadvantages*, not recommendations.
>

And all of these are advantages. They increase reliability and integrity.

You seem to think software is the way to go. Just tell me one thing.
When was the last time you had to have your computer fixed because of a
hardware problem? And how many times have you had to reboot due to a
software problem?

And you say software is as reliable?

> They can also significantly limit their utility. For example, VMS
> clusters support synchronous operation at separations up to 500 miles
> (actually, more, but beyond that it starts to get into needs for special
> tweaking) - but using host-based software mirroring rather than hardware
> mirroring (because most hardware won't mirror synchronously at anything
> like that distance - not to mention requiring a complete second
> connection at *any* distance, whereas the normal cluster LAN or WAN can
> handle software mirroring activity).
>

Who's talking about mirroring for 500 miles? Not me. And none of the
systems I know about do this for data integrity reasons.

Some do it for off-site backup, but that has nothing to do with RAID.

> That's why you won't find them at your
>

>> local computer store.
>
>
> I seriously doubt that anyone who's been talking with you (or at least
> trying to) about hardware RAID solutions has been talking about any that
> you'd find at CompUSA. EMC's Symmetrix, for example, was the gold
> standard of enterprise-level hardware RAID for most of the '90s - only
> relatively recently did IBM claw back substantial market share in that
> area (along with HDS).
>

Actually, Symmetrix grew big in the small and medium systems, but IBM
never lost the lead in the top end RAID solutions. But they also were
(and still are) quite a bit more expensive than EMC's.


>>
>> Some of this can be emulated in software.
>
>
> *All* of the RAID part can be.
>

Let's see you do things like adjust drive electronics in software. And
what you can do - let's see you do it without any impact on the system.

> But the software cannot
>
>> detect when a signal is getting marginal (it's either "good" or "bad"),
>> adjust the r/w head parameters, and similar things.
>
>
> And neither can hardware RAID: those things happen strictly internally
> at the disk (for that matter, by definition *anything* that the disk
> externalizes can be handled by software as well as by RAID hardware).
>

And here you show you know nothing about what you're talking about. RAID
drives are specially built to work with their controllers. And RAID
controllers are made to be able to do these things. This is very low-level
stuff - not things which are available outside the drive/controller.

Effectively, the RAID controller and the disk controller become one
unit. Separate, but one.

> Yes, it can
>
>> checksum the data coming back and read from the mirror drive if
>> necessary.
>
>
> Yup.
>
> Now, that *used* to be at least something of a performance issue - being
> able to offload that into firmware was measurably useful. But today's
> processor and memory bandwidth makes it eminently feasible - even in
> cases where it's not effectively free (if you have to move the data, or
> have to compress/decompress or encrypt/decrypt it, you can generate the
> checksum as it's passing through and pay virtually no additional cost at
> all).
>

Sorry, Bill, this statement is really off the wall.

Then why do all the high-end disk controllers use DMA to transfer data?
Because it's faster and takes fewer CPU cycles than doing it in software,
that's why. And computing checksums for 512 bytes takes significantly
longer than actually transferring the data to/from memory via software.

Also, instead of allocating 512-byte buffers, the OS would have to
allocate 514- or 516-byte buffers. This removes a lot of the optimization
possible when the system is using buffers during operations.

Additionally, different disk drives internally use different checksums.

Plus there is no way to tell the disk what to write for a checksum.
This is hard-coded into the disk controller.
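
For what it's worth, the per-sector checksum cost being argued about here
is easy to measure on any particular machine; a rough timing sketch, with
zlib's CRC32 standing in for whatever checksum is actually in play:

    import time
    import zlib

    block = bytes(512)            # one 512-byte sector's worth of data
    iterations = 1_000_000

    start = time.perf_counter()
    for _ in range(iterations):
        zlib.crc32(block)
    elapsed = time.perf_counter() - start

    # Per-sector cost and implied throughput of a software CRC32 on this box.
    print("%.2f us per 512-byte CRC32, ~%.0f MB/s"
          % (elapsed / iterations * 1e6, 512 * iterations / elapsed / 1e6))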

> That's still only a wash when conventional checksum mechanisms are used.
> But when you instead use an end-to-end checksum like ZFS's (which you
> can do *only* when the data is in main memory, hence can't offload) you
> get a significant benefit from it.
>

Sure, if there's a hardware failure. But I repeat - how often do you
get hardware errors? How often do you get software errors? Which is
more reliable?

> It might even be able to tell the controller to run a
>
>> self-check (most controllers do have that capability) during idle times.
>
>
> If there were any reason to - but without the complexity of RAID
> firmware to worry about, any need for checks beyond what the simpler
> controller should probably be doing on its own becomes questionable.
>
>> But it can't do a lot more than that. The controller interface isn't
>> smart enough to do a lot more.
>
>
> And without having to handle RAID management, it doesn't have to be.
>
> - bill
>

Nope, and it's too bad, also.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware,

am 14.11.2006 05:24:52 von Robert Milkowski

Jerry Stuckle wrote:
> Robert Milkowski wrote:
> > The point is that you can still use such array and put on top of it ZFS
> > for many reasons - easier management is one of reasons, another is better
> > data protection than if you use classic file system.
> >
>
> The point is that such an array makes ZFS unnecessary. Sure, you *can*
> use it (if you're using Linux - most of these systems do not). There is
> nothing for ZFS to "manage" - configuration is done through utilities
> (and sometimes an Ethernet port or similar). There is no management
> interface for the file system - it all looks like a single disk (or
> several disks, depending on the configuration).
>
> As for data protection - if the RAID array can't read the data, it's
> lost far beyond what ZFS or any other file system can do - unless you
> have another complete RAID being run by ZFS. And if that's the case,
> it's cheaper to have multiple mirrors.

This is what you don't understand. ZFS protects me from a bad driver, a bad
FC switch, or a bad FC/SCSI/ESCON/... adapter corrupting data - as one SCSI
adapter unfortunately did a few weeks ago. So it's not only the array.
And while I admit that I haven't (yet) seen ZFS detect data corruption on
Symmetrix boxes, I did on other arrays; it could be because I put ZFS on
Symmetrix boxes not that long ago, and compared to the other arrays there
isn't that much storage under ZFS on Symmetrix here. So statistically I may
just have been lucky. And of course I expect Symmetrix to be more reliable
than a JBOD or a mid-range array.

Now, when it comes to manageability - well, it's actually the manageability
of ZFS that caught my attention first. Features like pooled storage, many
file systems, shrinking/growing on the fly, etc. make ZFS just rock,
especially in fast-changing environments. When you've got to manage lots of
fast-changing data for MANY clients, and all of it keeps changing, with ZFS
it's no problem at all - you create another filesystem in a second, all the
available storage is there for it, and it doesn't really matter which file
system is being consumed faster.

Then there are other features which make ZFS quite compelling. In our
environment, with lots of small random writes which ZFS turns into mainly
sequential writes, the write speed-up is considerable. It helps even with
Symmetrix boxes with 16GB or more of cache, not to mention smaller arrays.
In some tests ZFS was actually quicker with write-through on the array than
traditional file systems were with write-back cache. But the most important
"test" is production - and ZFS is faster here.

Then you get basically free snapshots with no performance impact, no need
for extra-sliced storage, etc., so you get used to making them automatically
on a daily basis. And if you have file systems with tens of millions of
small files, doing backups with the zfs tools instead of the standard tools
(Legato, Tivoli) can be 10-15x faster, not to mention needing much less IO
to complete the work. Sometimes it's the difference between a backup taking
several days and taking hours.


Want to create several virtual environments, each with its own file system,
but you don't know exactly how many of them you'll end up with or how much
disk space each of them will consume? With ZFS such problems just don't
exist.

Then you've got dynamic block size, which also helps, especially when over
the years your mean file size changes considerably and your file size
distribution ends up with lots of small and lots of large files.

Then ZFS keeps all file system information within the pool - so I don't
have to put entries in any system config files; even NFS shares can be
managed by zfs. It means I can take a freshly installed Solaris box, connect
it to the SAN and just import the ZFS pool with all its config - no backup
needed, no manual config - and I get the same parameters for all the file
systems in the pool within seconds.

Then, if my old SPARC box is becoming slow, I can just import the pool on
an x64 box, or vice versa, and everything just works (tested it myself)
without any conversion, data migration, etc. It just works.

Then, in our devel environments, I sometimes need to make a writable copy
of a file system, test some changes, etc. With zfs, regardless of file
system size (several TBs, sometimes more), I get a WRITABLE copy in one
second, without copying data and without any need for more space. When I'm
done I just delete the clone. Need to clone an entire virtual machine in one
second, with no additional disk space needed, and run some tests? I did -
works great. Need an entire database copy on a devel machine in 1s? A
writable copy? Regardless of database size? With no performance impact on
the original database? No problem.
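
For readers who haven't used ZFS, the workflow described above is just a
snapshot plus a clone. A minimal sketch driving the standard zfs commands
from Python - the pool/dataset names (tank/db) are hypothetical:

    import subprocess

    def zfs(*args: str) -> None:
        # Thin wrapper around the zfs command-line tool; raises on failure.
        subprocess.run(("zfs",) + args, check=True)

    # Hypothetical dataset name: tank/db is the production file system.
    zfs("snapshot", "tank/db@test")               # point-in-time snapshot, near-instant
    zfs("clone", "tank/db@test", "tank/db_test")  # writable clone, no data copied up front
    # ... run tests against /tank/db_test ...
    zfs("destroy", "tank/db_test")                # drop the clone when done,
    zfs("destroy", "tank/db@test")                # then the snapshot behind it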

You've got new equipment and want to test different RAID configs with your
application to see which config performs best. So you set up a 50TB RAID-10
config and run tests. Then you set up a 50TB RAID-5 config and run tests.
Then RAID-6. Then some combination (dynamic striping of RAID-6?). How much
time does it take just to build the RAID-5 on the array? Well, sometimes
even two days just to run one test, then wait another day or two for the
next one. ZFS creates its RAID configurations within seconds, with no
background synchronization, etc., so the disks are immediately ready to use.
Again, you've saved a lot of time.

You need RAID-5 or RAID-6 and you're doing lots of small writes with lots
of concurrent streams? Your performance generally sucks, regardless of the
cache size in your Symmetrix or other array. Then you create RAID-5 (or
RAID-6) using zfs and suddenly you get N times the array's performance on
the same hardware. Well, sometimes you just can't walk on by.

You've got an application which is disk-IO constrained, there's plenty of
CPU power left, and you're running out of disk space. Well, just turn on
compression in ZFS on the fly and all new writes are compressed. The end
effect is that free disk space rises, and performance is not worse but
actually better. I did exactly that some time ago.
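
The on-the-fly change described here is a single property flip on the
dataset (again, the dataset name is hypothetical); existing blocks are left
alone and only new writes get compressed:

    import subprocess

    # Enable compression for future writes to the (hypothetical) dataset tank/app.
    subprocess.run(["zfs", "set", "compression=on", "tank/app"], check=True)

    # Later, see how much space the new writes are actually saving.
    subprocess.run(["zfs", "get", "compressratio", "tank/app"], check=True)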

I could go on - all of the above is from my own experience with ZFS in
production environments over the last two years.

Like it or not, ZFS makes an admin's life MUCH easier, solves many problems,
and in many cases saves your data when your array screws up.

--
Robert Milkowski
rmilkowskiSSSS@wp-sa.pl
http://milek.blogspot.com

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware,

am 14.11.2006 06:49:09 von Robert Milkowski

Jerry Stuckle wrote:
> Bill Todd wrote:
>
> > But time marches on. Most serious operating systems now support (either
> > natively or via extremely reputable decade-old, thoroughly-tested
> > third-party system software products from people like Veritas) software
> > RAID, and as much cache memory as you can afford (no more address-space
> > limitations there) - plus (with products like ZFS) are at least starting
> > to address synchronous small-update throughput (though when synchronous
> > small-update *latency* is critical there's still no match for NVRAM).
> >
>
> Sure, you can get software RAID. But it's not as reliable as a good
> hardware RAID.

Not true. Actually, in some environments what you do is software mirroring
between two enterprise arrays, with your Oracle on top of it. That way you
get a more reliable config.

Also, when you use your high-end array without ZFS, you basically get less
reliability than when you use the same array with ZFS.


> And let's see those drives do things like dynamically adjust the
> electronics - such as amp gain, bias, slew rate... They can't do it.

Again, you're missing the point. You still get all of this, because with
ZFS you do not throw away your array - you use it. ZFS is great, you know,
but it doesn't make storage out of air molecules. So with ZFS, among other
features, you get additional protection which HW RAID itself cannot offer.

> who has worked with high performance, critical systems knows there is a
> *huge* difference between doing it in hardware and software.

Really? Actually, depending on workload and hardware specifics, I can see
HW being faster than software, and the opposite.

In some cases a clever combination of both gives the best results.


> Yes, ZFS can correct for errors like bad connectors and cables. And I
> guess you need it if you use cheap connectors or cables. But even if
> they do fail - it's not going to be a one-time occurrance. Chances are
> your system will crash within a few hundred ms.

Geez... I don't know how you configure your systems, but my systems won't
crash just because of a bad cable or connector. They will use another link.
These are the basics of HA storage management and I'm surprised you don't
know how to do it. And now, thanks to ZFS, if an FC switch, HBA or something
else corrupts data, ZFS will detect and correct it.

> I dont' know about NetApp, but IBM doesn't work this way at all. The
> channel itself is parity checked by hardware on both ends. Any parity
> check brings the system to an immediate halt.

What???? Just because you get some errors on a link you halt the entire
system? Well, just switch to a good link.
I don't believe they actually do that.

> And data in the system and system software can be corrupted. Once the
> data is in the RAID device, it cannot.

Really? Unfortunately for your claims, it happens.
And you know, even one of your beloved IBM arrays lost some data here.
The array even warned us about it :) It wasn't a Shark, but it also wasn't
low-end among IBM's arrays. And it did so more than once.


>
> You seem to think software is the way to go. Just tell me one thing.
> When was the last time you had to have your computer fixed because of a
> hardware problem? And how many times have you had to reboot due to a
> software problem?

And how many times have you had to reboot an entire array for some upgrade
or correction? Even high-end arrays? Including IBM's arrays? I've had to do
it many times because I work with them. What about you? Maybe your
environment isn't as demanding?


> Actually, Symmetrix grew big in the small and medium systems, but IBM
> never lost the lead in the top end RAID solutions. But they also were
> (and still are) quite a bit more expensive than EMC's.
>

What IBM array are you talking about? Shark? Or maybe for years they've had
something top secret that only you know about?


In one minute I found some links for you.
As it seems you're fond of IBM, let's start with them.

http://www-03.ibm.com/systems/storage/network/software/snapvalidator/
"
The challenge: the risk of data corruption is inherent in data transfers

Organizations of any size that rely heavily on the integrity of Oracle data need to safeguard against data corruption. Because database servers and storage devices reside at opposite ends of the I/O path, corruption can occur as each data block transfer passes through a series of logical layers involving hardware and software from multiple vendors. Other factors, such as application anomalies and human error, present additional risk. As a result, data corruption can occur at any stage of the process, even with the protection inherent in the most robust storage systems. The impact of these corruptions can cause considerable disruption to business continuity, which can be time consuming and costly to resolve.
The solution: end-to-end data validation

IBM System Storage N series with SnapValidator software is designed to provide a high level of protection for Oracle data, helping you to detect potential data corruption before it occurs. By adding intelligence and database awareness to modular storage systems - across iSCSI SAN, FC SAN and NAS protocols - the software can help extend the advantages of checksum functionality to a greater variety of organizations."

Of course it's not truly end-to-end and it's only for writes, but at least
IBM recognizes that data integrity is a problem even with enterprise RAID
arrays.


Then something similar from EMC
http://www.emc.com/products/software/checksum.jsp

or Oracle itself
http://www.oracle.com/technology/deploy/availability/htdocs/hardf.html


Other major vendors also recognize data corruption as a problem, and all of
them know RAID isn't the complete answer. So they develop half-baked
solutions like the above. Of course it's better than nothing.

Then comes ZFS and completely changes the game. They (Sun) did something
which is really ahead of the competition and is innovative. And whether you
like it or not, and whether in your mind enterprise arrays are reliable or
not, data corruption happens and ZFS greatly protects against it. Even more
- ZFS does its job excellently both on enterprise storage and on cheap
commodity disks. Which is great, as for many environments you can actually
build a reliable solution at orders-of-magnitude lower cost.

Now I understand why IBM doesn't like it :)


--
Robert Milkowski
rmilkowskiXXXX@wp-sa.pl
http://milek.blogspot.com

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, w

am 14.11.2006 12:44:20 von Jerry Stuckle

Robert Milkowski wrote:
> Jerry Stuckle wrote:
>
>>Robert Milkowski wrote:
>>
>>>The point is that you can still use such array and put on top of it ZFS
>>>for many reasons - easier management is one of reasons, another is better
>>>data protection than if you use classic file system.
>>>
>>
>>The point is that such an array makes ZFS unnecessary. Sure, you *can*
>>use it (if you're using Linux - most of these systems do not). There is
>>nothing for ZFS to "manage" - configuration is done through utilities
>>(and sometimes an Ethernet port or similar). There is no management
>>interface for the file system - it all looks like a single disk (or
>>several disks, depending on the configuration).
>>
>>As for data protection - if the RAID array can't read the data, it's
>>lost far beyond what ZFS or any other file system can do - unless you
>>have another complete RAID being run by ZFS. And if that's the case,
>>it's cheaper to have multiple mirrors.
>
>
> It's here you don't understand. ZFS protect me from bad driver, bad FC switch,
> bad FC/SCSI/ESCON/... adapter corrupting data. As one SCSI adapter unfortunately
> did few weeks ago. So it's not only the array.
> And while I admit that I haven't seen (yet) ZFS detecting data corruption
> on Symmetrix boxes, but I did on another arrays, it could be due to fact
> I put ZFS on Symmetrix boxes not that long ago and comparing to other
> arrays it's not that much storage under ZFS on Symmetrix here. So statisticaly
> it can be that I'm just more lucky. And of course I expect Symmetrix to be
> more reliable than JBOD or medium array.
>

Immaterial.

A bad driver won't let the system run - at least not for long. Same
with a bad FC switch, etc. And how long did your system run with a bad
SCSI adapter?

And yes, as I've stated before - like anything else, you get what you
paid for. Get a good quality RAID and you won't get data corruption issues.

> Now when it comes to manageability - well, it's actually the manageability
> of ZFS that drew my attention at first. It's because of features like
> pooled storage, many file systems, shrinking/growing on the fly, etc.,
> which make ZFS just rock, especially in fast-changing environments.
> When you've got to manage lots of fast-changing data for MANY clients, and all of
> this is changing, with ZFS it's no problem at all - you create another
> filesystem in a second, it has all available storage, and it doesn't really
> matter which file system is being consumed faster, etc.
>

This has nothing to do with the reliability issues being discussed.

> Then you've got other features which make ZFS quite compelling. In our environment,
> with lots of small random writes which ZFS turns into mainly sequential writes,
> the write speed-up is considerable. It helps even with Symmetrix boxes with 16 GB or more
> of cache, not to mention smaller arrays. Well, in some tests ZFS was actually
> quicker with write-through on the array than traditional file systems were
> with a write-back cache. But the most important "test" is production - and ZFS
> is faster here.
>

Again, nothing to do with the reliability issues.

> Then you get basically free snapshots with no impact on performance, no need
> for extra-sliced storage, etc. So you get used to making them automatically on
> a daily basis. Then if you have file systems with tens of millions of small files,
> doing backups using zfs tools instead of standard tools (Legato, Tivoli) gives
> you even 10-15x shorter times, not to mention much less IO needed to complete the work.
> Well, it's sometimes the difference between a backup taking several days and
> doing it in hours here.
>

Ditto.

>
> Want to create several virtual environments, each with its own file system?
> But you don't know exactly how many of them you'll end up with, and you
> don't know how much disk space each of them will consume. With ZFS such
> problems just don't exist.
>

When you going to get back to reliability - which is the issue here?

> Then you've got dynamic block size, which also helps, especially when over the
> years your mean file size changes considerably and your file size distribution
> is lots of small files plus lots of large files.
>
> Then ZFS keeps all file system information within the pool - so I don't have
> to put entries in any system config files; even NFS shares can be managed
> by zfs. It means I can take a freshly installed Solaris box, connect it
> to the SAN and just import the ZFS pool with all its config - no backup needed, no
> manual config - I get all the same parameters for all file systems in a
> pool within seconds.
>
> Then if my old SPARC box is becoming slow I can just import the pool on an x64 box,
> or vice versa, and everything just works (tested it myself) without any
> conversion, data migration, etc. It just works.
>

Ho Hum... I'm falling asleep.

> Then in our devel environments I sometimes need to make a writable
> copy of a file system, test some changes, etc. With zfs, regardless of file
> system size (several TBs, sometimes more), I get a WRITABLE copy in one
> second, without copying data, without any need for more space. When I'm
> done I just delete the clone. Well, need to clone an entire virtual machine
> in one second, with no additional disk space needed, and make some tests?
> I did; it works great. Need an entire database copy on a devel machine in 1s?
> A writable copy? Regardless of database size? With no performance impact
> on the original database? No problem.
>
> You've got new equipment and want to test different RAID configs with
> your application to see which config performs best. So you set up
> a 50TB RAID-10 config and run tests. Then you set up a 50TB RAID-5 config
> and run tests. Then RAID-6. Then some combination (dynamic striping
> of RAID-6?). How much time does it take just to build RAID-5 on the array?
> Well, sometimes even 2 days just to make one test, then wait another
> day or two for the next test. ZFS creates RAIDs within seconds with no
> background synchronization, etc., so disks are immediately ready to use.
> Again, you've saved a lot of time.
>

One correction. ZFS does not "create RAIDS". It EMULATES RAIDS. A big
difference. But still no discussion about reliability.

> You need RAID-5 or RAID-6 and you're doing lots of small writes with
> lots of concurrent streams? Your performance generally sucks regardless
> of the cache size in your Symmetrix or other array. Then you create
> RAID-5 (or RAID-6) using zfs and suddenly you get N times the performance
> of the array on the same hardware. Well, sometimes you just can't walk by.
>
> You've got an application which is disk IO constrained and there's plenty
> of CPU power left. And you're running out of disk space. Well, just
> turn on compression in ZFS on the fly and all new writes are compressed.
> The end effect is that free disk space rises, and performance is not worse but
> even better. Well, I did exactly that some time ago.
>
> I could go on, and all of the above is from my own experience on production environments
> with ZFS over the last two years.
>
> Like it or not, ZFS makes an admin's life MUCH easier, solves many problems,
> and in many cases saves your data when your array screws up.
>

And what does any of this have to do with the discussion at hand - which
is data reliability? You seem to have a penchant for changing the
subject when you can't refute the facts.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, w

am 14.11.2006 13:10:39 von Jerry Stuckle

Robert Milkowski wrote:
> Jerry Stuckle wrote:
>
>>Bill Todd wrote:
>>
>>
>>>But time marches on. Most serious operating systems now support (either
>>>natively or via extremely reputable decade-old, thoroughly-tested
>>>third-party system software products from people like Veritas) software
>>>RAID, and as much cache memory as you can afford (no more address-space
>>>limitations there) - plus (with products like ZFS) are at least starting
>>>to address synchronous small-update throughput (though when synchronous
>>>small-update *latency* is critical there's still no match for NVRAM).
>>>
>>
>>Sure, you can get software RAID. But it's not as reliable as a good
>>hardware RAID.
>
>
> Not true. Actually in some environments what you do is software
> mirroring between two enterprise arrays, and you put your Oracle on top
> of it. That way you get a more reliable config.
>

That just means it's more reliable than putting something on one array.
No surprises there. And putting it on 100 arrays is even more reliable.

And prove to me how doing it in software is more reliable than doing it
in hardware.

> Also when you're using your high-end array without ZFS, basically
> you get less reliability than when you use the same array with ZFS.
>

Proof? Statistics?

>
>
>>And let's see those drives do things like dynamically adjust the
>>electronics - such as amp gain, bias, slew rate... They can't do it.
>
>
> Again, you're missing the point. You still get all of this, as with ZFS you do
> not throw away your array - you use it. ZFS is great, you know, but it
> doesn't make storage out of air molecules. So with ZFS, among other
> features, you get additional protection which HW RAID itself cannot offer.
>

This is where you've gone off the deep end, Robert, and it proves you have
no idea what you're talking about.

ZFS cannot adjust amp gains. It cannot change the bias. It cannot
tweak the slew rates. And a lot more. These are all very low level
operations available only to the disk controller. And they are much of
what makes a difference between a high-end drive and a throw-away drive
(platter coating being another major cost difference).

None of this is available at any level to the OS.

>
>>who has worked with high performance, critical systems knows there is a
>>*huge* difference between doing it in hardware and software.
>
>
> Really? Actually depending on workload specifics and hardware specifics
> I can see HW being faster than software, and the opposite.
>
> In some cases clever combination of both gives best results.
>

Wrong! Data transfer in hardware is ALWAYS faster than in software.
Hardware can transfer 4 bytes every clock cycle. Software requires a
loop and about 7 clock cycles to transfer the same 4 bytes.

And you can't have a "combination of both". Either the hardware does
the transfer, or it waits for the software to do it.

>
>
>>Yes, ZFS can correct for errors like bad connectors and cables. And I
>>guess you need it if you use cheap connectors or cables. But even if
>>they do fail - it's not going to be a one-time occurrence. Chances are
>>your system will crash within a few hundred ms.
>
>
> Geez... I don't know how you configure your systems, but my systems won't
> crash just because of a bad cable or connector. They will use another
> link. These are basics in HA storage management and I'm surprised you
> don't know how to do it. And now, thanks to ZFS, if an FC switch, HBA
> or something else corrupts data, ZFS will detect and correct it.
>

Yep, and you need it if you use cheap cables and connectors.
Personally, I haven't seen a bad disk cable or connector in quite a
number of years. In fact, the only one I can remember seeing in the
past 10+ years was a display cable on a laptop - but that was because of
the flexing from opening and closing the lid.

So how often do YOU get bad cables or connectors?

Also, you're telling me you can go into your system while it's running
and just unplug any cable you want and it will keep running? Gee,
you've accomplished something computer manufacturers have dreamed about
for decades!

>
>>I dont' know about NetApp, but IBM doesn't work this way at all. The
>>channel itself is parity checked by hardware on both ends. Any parity
>>check brings the system to an immediate halt.
>
>
> What???? Just because you get some errors on a link you halt the entire system?
> Well, just switch to a good link.
> I don't believe they are actually doing that.
>

Yep. It sure does. And it happens to a system about once every 10
years. And in all of my years in hardware, I had exactly ONE time it
was a cable or connector. And that was because someone tried to force
it into the socket. Any other failure was caused by electronics.

>
>>And data in the system and system software can be corrupted. Once the
>>data is in the RAID device, it cannot.
>
>
> Really? Unfortunately for your claims it happens.
> And you know, even your beloved IBM's array lost some data here.
> The array even warned us about it :) It wasn't Shark, but also
> not low-end in IBMs arrays. And it did more than once.
>

Not with good quality RAIDS. And obviously your claim of having IBM's
array is as full of manure as your earlier claims in this thread.

>
>
>>You seem to think software is the way to go. Just tell me one thing.
>>When was the last time you had to have your computer fixed because of a
>>hardware problem? And how many times have you had to reboot due to a
>>software problem?
>
>
> And how many times have you had to reboot an entire array for some upgrade or
> corrections? Even high-end arrays? Including IBM's arrays? I had to do
> it many times because I work with them. What about you? Maybe your environment
> isn't as demanding?
>

What does this have to do with the question?

But since you asked. How many times have you had to reboot because of
some upgrade or correction to your software? A hell of a lot more than
with any array.

>
>
>>Actually, Symmetrix grew big in the small and medium systems, but IBM
>>never lost the lead in the top end RAID solutions. But they also were
>>(and still are) quite a bit more expensive than EMC's.
>>
>
>
> What IBM array are you talking about? Shark? Or maybe they got
> for years something top secret only you know about?
>
>
> In 1 minute I found some links for you.
> As it seems you're fond of IBM lets start with them.
>
> http://www-03.ibm.com/systems/storage/network/software/snapvalidator/
> "
> The challenge: the risk of data corruption is inherent in data transfers
>
> Organizations of any size that rely heavily on the integrity of Oracle data need to safeguard against data corruption. Because database servers and storage devices reside at opposite ends of the I/O path, corruption can occur as each data block transfer passes through a series of logical layers involving hardware and software from multiple vendors. Other factors, such as application anomalies and human error, present additional risk. As a result, data corruption can occur at any stage of the process, even with the protection inherent in the most robust storage systems. The impact of these corruptions can cause considerable disruption to business continuity, which can be time consuming and costly to resolve.
> The solution: end-to-end data validation
>
> IBM System Storage N series with SnapValidator software is designed to provide a high level of protection for Oracle data, helping you to detect potential data corruption before it occurs. By adding intelligence and database awareness to modular storage systems - across iSCSI SAN, FC SAN and NAS protocols - the software can help extend the advantages of checksum functionality to a greater variety of organizations."
>
> Of course it's not truly end-to-end and it's only for writes, but at least IBM
> recognizes that data integrity is a problem despite using enterprise RAID arrays.
>
>
> Then something similar from EMC
> http://www.emc.com/products/software/checksum.jsp
>
> or Oracle itself
> http://www.oracle.com/technology/deploy/availability/htdocs/hardf.html
>
>
> Other major vendors also recognize data corruption as a problem and all
> know RAID isn't a complete answer. So they develop half-baked solutions like the ones above.
> Of course it's better than nothing.
>
> Then comes ZFS and completely changes the game. They (Sun) did something which
> is really ahead of the competition and is innovative. And whether you like it or
> not, and whether in your mind enterprise arrays are reliable or not, data corruption
> happens and ZFS protects against it very well. Even more - ZFS does its job excellently
> both on enterprise storage and on cheap commodity disks. Which is great, as for many
> environments you can actually build a reliable solution with orders of magnitude lower
> costs.
>
> Now I understand why IBM doesn't like it :)
>
>

You've really gone off the wall here, Robert. You have proven you are
blowing your "facts" out your ass. You have absolutely no idea about
which you speak - your outlandish claims about what ZFS can do (adjust
amp gain, etc.) is proof of that. And your claims about software data
transfer being more efficient than hardware is hilarious.

And your previous post which talked about all the neat things ZFS could
do which have absolutely nothing to do with reliability.

Robert, your credibility here is now zero. You've made way too many
claims that anyone with even a modicum of hardware knowledge could
refute - hell, even a tech school student could do it.

So nothing else you say has any credibility, either. Go back into your
hole. You're adding absolutely nothing to this conversation.


--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware,

am 14.11.2006 13:43:40 von Robert Milkowski

Jerry Stuckle wrote:
> Robert Milkowski wrote:
> >
> > It's here you don't understand. ZFS protects me from a bad driver, bad FC switch,
> > bad FC/SCSI/ESCON/... adapter corrupting data. As one SCSI adapter unfortunately
> > did a few weeks ago. So it's not only the array.
> > And while I admit that I haven't seen (yet) ZFS detecting data corruption
> > on Symmetrix boxes, I did on other arrays; it could be due to the fact that
> > I put ZFS on Symmetrix boxes not that long ago, and compared to other
> > arrays it's not that much storage under ZFS on Symmetrix here. So statistically
> > it could be that I'm just luckier. And of course I expect Symmetrix to be
> > more reliable than a JBOD or a medium array.
> >
>
> Immaterial.
>
> A bad driver won't let the system run - at least not for long. Same
> with a bad FC switch, etc. And how long did your system run with a bad
> SCSI adapter?

Actually it was writing data with some errors for hours, then the system panicked.
Then again it was writing data for another 7-9h.

Now of course the system won't reboot just because of a bad switch - it's not the first
time I've had a problem with an FC switch (long time ago, granted). With at least dual links
from each host to different switches (different fabrics) it's not a big problem.

> And yes, as I've stated before - like anything else, you get what you
> paid for. Get a good quality RAID and you won't get data corruption issues.

Is IBM's high-end array a good quality one, for example?


> And what does any of this have to do with the discussion at hand - which
> is data reliability? You seen to have a penchance for changing the
> subject when you can't refute the facts.

Reliability comes from end-to-end data integrity, which HW RAID by itself can't
provide, so your data is less protected.

Your RAID doesn't protect you from anything between itself and your host.
Vendors have recognized this for years; that's why IBM, EMC, Oracle, Sun, Hitachi, etc.
all provide some hacks for specific applications like Oracle. Of course
none of those solutions is nearly as complete as ZFS.
I know, you know better than all those vendors. You know better than people
who actually lost their data both on cheap and on 500k+ (you seem to like this number)
arrays. It's just that for some reason you can't accept simple facts.

Reliability comes from keeping metadata blocks on different LUNs + configured
protection. While it's not strictly a RAID issue, the point is ZFS has the file system
integrated with the Volume Manager, so it can do it. So even if you just do
striping on ZFS on RAID LUNs, and you overwrite an entire LUN, your file system
will still be consistent - only data (not metadata) is lost.

Reliability comes from never overwriting actual data on a medium, so you don't
have to deal with incomplete writes, etc.
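
To make the copy-on-write point concrete, here is a minimal Python sketch (a toy model with made-up names, not ZFS code): new data always goes to freshly allocated blocks, and the root pointer is switched only after the new blocks exist, so an interrupted update leaves the previous, consistent version untouched.

# Illustrative copy-on-write store: existing blocks are never overwritten in place.
class CowStore:
    def __init__(self):
        self.blocks = {}      # block_id -> bytes, write-once
        self.next_id = 0
        self.root = None      # points at the current consistent version

    def _alloc(self, data: bytes) -> int:
        bid = self.next_id
        self.next_id += 1
        self.blocks[bid] = data   # new block; nothing existing is touched
        return bid

    def commit(self, data: bytes) -> None:
        bid = self._alloc(data)   # write the new data first...
        self.root = bid           # ...then atomically switch the root pointer

    def read(self) -> bytes:
        return self.blocks[self.root]

store = CowStore()
store.commit(b"version 1")
store.commit(b"version 2")    # a crash before this line still leaves version 1 readable
print(store.read())           # b'version 2'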

Reliability comes from integrating the file system with the volume manager, so regardless
of the RAID type you used, you are always consistent on disk (both data and metadata).
Something which traditional file systems, even log-structured ones, can't guarantee. And that's
why even with journaling you sometimes end up with fsck after all.

Reliability comes from knowing exactly where on each disk your data is, so if your
RAID is not full of data ZFS will resilver a disk in an emergency MUCH
faster, by resilvering only actual data and not all blocks on the disk. Also, as it understands
the data on disk, it starts the resilver from /, so from the beginning; even if the resilver isn't
completed yet you get some protection. A classic array can't do this, as it
can't understand the data on its disks.


Now there are other things in ZFS which greatly increase reliability, manageability, and
more. But as you can't agree with basic and simple facts it doesn't really
make sense to go any further. It probably doesn't even make sense to talk to you anymore.


--
Robert Milkowski
rmilkowskiXXXXXX@wp-sa.pl
http://milek.blogspot.com

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware,

am 14.11.2006 13:59:25 von Robert Milkowski

Jerry Stuckle wrote:
> Robert Milkowski wrote:
> > Also when you're using your high-end array without ZFS, basically
> > you get less reliability than when you use the same array with ZFS.
> >
>
> Proof? Statistics?

Proof - I lost data on arrays, both small and large. ZFS has already helped me a
few times. That's all the proof I need. Now if you need some basics on why this is
possible, please try again to read what all the people here were writing, this
time with an open mind and understanding; then go to the OpenSolaris site for
more details; then if you are still curious and capable of reading code,
do it. Or you can just try it.

However I suspect you will just keep trolling and stay in a dream of yours.


> ZFS cannot adjust amp gains. It cannot change the bias. It cannot
> tweak the slew rates. And a lot more. These are all very low level
> operations available only to the disk controller. And they are much of
> what makes a difference between a high-end drive and a throw-away drive
> (platter coating being another major cost difference).

To be honest I have no idea if high-end arrays do these things.
But all I know is that if I use such an array and put ZFS on top of it,
then in addition to the protection you describe I get much more. In the end
I get better data protection. And that's exactly what people are already
doing.

Now, please try to adjust your bias, as evidently you've got an undetected
(at least by yourself) malfunction :))) no offence.


> >>who has worked with high performance, critical systems knows there is a
> >>*huge* difference between doing it in hardware and software.
> >
> >
> > Really? Actually depending on workload specifics and hardware specifics
> > I can see HW being faster than software, and the opposite.
> >
> > In some cases clever combination of both gives best results.
> >
>
> Wrong! Data transfer in hardware is ALWAYS faster than in software.

Holy.....!#>!>!
Now you want to persuade me that even if my application works faster with
software RAID it's actually slower, just because you think so.
You really have a problem grasping the reality around you.


> So how often do YOU get bad cables or connectors?
>
> Also, you're telling me you can go into your system while it's running
> and just unplug any cable you want and it will keep running? Gee,
> you've accomplished something computer manufacturers have dreamed about
> for decades!

Yep, in our HA solutions you can go and unplug any external cable you want and
the system will keep going - it doesn't matter if it's a network cable, power cable,
FC cable, ...

You know, maybe when you were studying, and that was a long time ago I guess, people
were dreaming about this, but it's really been kind of standard in the enterprise for years
if not longer. You really don't know anything about HA (but that was obvious
earlier).


> >>And data in the system and system software can be corrupted. Once the
> >>data is in the RAID device, it cannot.
> >
> >
> > Really? Unfortunately for your claims it happens.
> > And you know, even your beloved IBM's array lost some data here.
> > The array even warned us about it :) It wasn't Shark, but also
> > not low-end in IBMs arrays. And it did more than once.
> >
>
> Not with good quality RAIDS. And obviously your claim of having IBM's
> array is as full of manure as your earlier claims in this thread.

Ok, enough.
I guess from time to time I need to play a little bit with trolls, but this is
enough.

As someone else pointed - sometimes you really just can't help some people.

EOT



--
Robert Milkowski
rmilkowskiSSSS@wp-sa.pl
http://milek.blogspot.com

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, w

am 14.11.2006 15:32:23 von Bill Todd

Jerry Stuckle wrote:

....

> software implementations are poor replacements for a truly
> fault-tolerant system.

You really ought to stop saying things like that: it just makes you
look ignorant.

While special-purpose embedded systems may be implementable wholly in
firmware, general-purpose systems are not - and hence *cannot* be more
reliable than their system software (and hardware) is. Either that
software and the hardware it runs on are sufficiently reliable, or
they're not - and if they're not, then the most reliable storage system
in the world can't help the situation, because the data *processing*
cannot be relied upon.

So stop babbling about software reliability: if it's adequate to
process the data, it's adequate to perform the lower levels of data
management as well. In many cases that combination may be *more*
reliable than adding totally separate hardware and firmware designed by
a completely different organization: using the existing system hardware
that *already* must be reliable (rather than requiring that a *second*
piece of hardware be equally reliable too) by definition reduces total
exposure to hardware faults or flaws (and writing reliable system
software is in no way more bug-prone than writing equivalent but
separate RAID firmware).

> And the high end RAID devices do not require
> special software - they look like any other disk device attached to the
> system.

Which is why they cannot include the kinds of end-to-end checks that a
software implementation inside the OS can: the standard interface
doesn't support it.

>
> As for bundling write acceleration in NVRAM - again, meaningless because
> good RAID devices aren't loaded as a "special system device".

Perhaps you misunderstood: I listed transparent write-acceleration by
using NVRAM in hardware RAID as a hardware RAID *advantage* (it just has
nothing to do with *reliability*, which has been the issue under
discussion here).

>
> Prestoserve was one of the first lower-end RAID products made.

I don't believe that Prestoserve had much to do with RAID: it was
simply battery-backed (i.e., non-volatile) RAM that could be used as a
disk (or in front of disks) to make writes persistent at RAM speeds.

> However,
> there were a huge number of them before that. But you wouldn't find
> them on a PC. They were primarily medium and large system devices.

I never heard of Prestoserve being available on a PC (though don't know
for certain that it wasn't): I encountered it on mid-range DEC and Sun
systems.

> Prestoserve took some of the ideas and moved much of the hardware
> handling into software. Unfortunately, when they did it, they lost the
> ability to handle problems at a low-level (i.e. read head biasing,
> etc.). It did make the arrays a lot cheaper, but at a price.

I think you're confused: Prestoserve had nothing to do with any kind of
'hardware handling', just with write acceleration via use of NVRAM.

>
> And in the RAID devices, system address space was never a problem -
> because the data was transferred to RAID cache immediately. This did
> not come out of the system pool; the controllers have their own cache.

Which was what I said: back when amounts of system memory were limited
(by addressability if nothing else) this was a real asset that hardware
RAID could offer, whereas today it's much less significant (since a
64-bit system can now address and effectively use as much RAM as can be
connected to it).

>
> I remember 64MB caches in the controllers way back in the mid 80's. It's
> in the GB, now.

Indeed - a quick look at some current IBM arrays show support up to at
least 1 GB (and large EMC and HDS arrays offer even more). On the other
hand, system RAM in large IBM p-series systems can reach 2 TB these
days, so (as I noted) the amount that a RAID controller can add to that
is far less (relatively) significant than it once was.

>> But time marches on. Most serious operating systems now support
>> (either natively or via extremely reputable decade-old,
>> thoroughly-tested third-party system software products from people
>> like Veritas) software RAID, and as much cache memory as you can
>> afford (no more address-space limitations there) - plus (with products
>> like ZFS) are at least starting to address synchronous small-update
>> throughput (though when synchronous small-update *latency* is critical
>> there's still no match for NVRAM).
>>
>
> Sure, you can get software RAID. But it's not as reliable as a good
> hardware RAID.

That is simply incorrect: stop talking garbage.

>
>>> You don't take $89 100GB disk drives off the shelf,
>>> tack them onto an EIDE controller and add some software to the system.
>>
>>
>> Actually, you can do almost *precisely* that, as long as the software
>> is handles the situation appropriately - and that's part of what ZFS
>> is offering (and what you so obviously completely fail to be able to
>> grasp).
>>
>
> In the cheap RAID devices, sure. But not in the good ones. You're
> talking cheap. I'm talking quality.

No: I'm talking relatively inexpensive, but with *no* sacrifice in
quality. You just clearly don't understand how that can be done, but
that's your own limitation, not any actual limitation on the technology.

>
>> No disk or firmware is completely foolproof. Not one. No matter how
>> expensive and well-designed. So the question isn't whether the disks
>> and firmware are unreliable, but just the degree and manner in which
>> they are.
>>
>
> I never said they were 100% foolproof. Rather, I said they are amongst
> the most tested software made. Probably the only software tested more
> thoroughly is the microcode on CPU's. And they are as reliable as
> humanly possible.

So is the system software in several OSs - zOS and VMS, for example (I
suspect IBM's i-series as well, but I'm not as familiar with that).
Those systems can literally run for on the order of a decade without a
reboot, as long as the hardware doesn't die underneath them.

And people trust their data to the system software already, so (as I
already noted) there's no significant additional exposure if that
software handles some of the RAID duties as well (and *less* total
*hardware* exposure, since no additional hardware has been introduced
into the equation).

>
> Of course, the same thing goes for ZFS and any file system. They're not
> completely foolproof, either, are they?

No, but neither need they be any *less* foolproof: it just doesn't
matter *where* you execute the RAID operations, just *how well* they're
implemented.

>
>> There is, to be sure, no way that you can make a pair of inexpensive
>> SATA drives just as reliable as a pair of Cheetahs, all other things
>> being equal. But it is *eminently* possible, using appropriate
>> software (or firmware), to make *three or four* inexpensive SATA
>> drives *more* reliable than a pair of Cheetahs that cost far more -
>> and to obtain better performance in many areas in the bargain.
>>
>
> And there is no way to make a pair of Cheetahs as reliable as drives
> made strictly for high end RAID devices. Some of these drives still
> sell for $30-60/GB (or more).

It is possible that I'm just not familiar with the 'high-end RAID
devices' that you're talking about - so let's focus on that.

EMC took a major chunk of the mainframe storage market away from IBM in
the '90s using commodity SCSI disks from (mostly) Seagate, not any kind
of 'special' drives (well, EMC got Seagate to make a few firmware
revisions, but nothing of major significance - I've always suspected
mostly to keep people from by-passing EMC and getting the drives at
standard retail prices). At that point, IBM was still making its own
proprietary drives at *far* higher prices per GB, and EMC cleaned up (by
emulating the traditional IBM drive technology using the commodity SCSI
Seagate drives and building in reliability through redundancy and
intelligent firmware to compensate for the lower per-drive reliability -
exactly the same kind of thing that I've been describing to let today's
SATA drives substitute effectively for the currently-popular higher-end
drives in enterprise use).

Since then, every major (and what *I* would consider 'high-end') array
manufacturer has followed that path: IBM and Hitachi use commodity
FC/SCSI drives in their high-end arrays too (and even offer SATA drives
as an option for less-demanding environments). These are the kinds of
arrays used in the highest-end systems that, e.g., IBM and HP run their
largest-system TPC-C benchmark submissions on: I won't assert that even
higher-end arrays using non-standard disks don't exist at all, but I
sure don't *see* them being used *anywhere*.

So exactly what drives are you talking about that cost far more than the
best that Seagate has to offer, and offer far more features in terms of
external control over internal drive operations (beyond the standard
'SCSI mode page' tweaks)? Where can we see descriptions of the
super-high-end arrays (costing "$100-500/GB" in your words) that use
such drives (and preferably descriptions of *how* they use them)?

Demonstrating that you have at least that much of a clue what you're
talking about would not only help convince people that you might
actually be worth listening to, but would also actually teach those of
us whose idea of 'high-end arrays' stops with HDS and Symmetrix
something we don't know. Otherwise, we'll just continue to assume that
you're at best talking about '80s proprietary technology that's
completely irrelevant today (if it ever existed as you describe it even
back then).

That would not, however, change the fact that equal reliability can be
achieved at significantly lower cost by using higher numbers of low-cost
(though reputable) drives with intelligent software to hook them
together into a reliable whole. In fact, the more expensive those
alleged non-standard super-high-end drives (and arrays) are, the easier
that is to do.

....

>> And you don't attach them through Brand X SATA controllers, either:
>> ideally, you attach them directly (since you no longer need any
>> intermediate RAID hardware), using the same quality electronics you
>> have on the rest of your system board (so the SATA connection won't
>> constitute a weak link). And by virtue of being considerably simpler
>> hardware/firmware than a RAID implementation, that controller may well
>> be *more* reliable.
>>
>
> There is no way this is more reliable than a good RAID system. If you
> had ever used one, you wouldn't even try to make that claim.

One could equally observe that if you had ever used a good operating
system, you wouldn't even try to make *your* claim. You clearly don't
understand software capabilities at all.

....

>> Whether you're aware of it or not, modern SATA drives (and even
>> not-too-old ATA drives) do *all* the things that you just described in
>> your last one-and-a-half paragraphs.
>>
>
> And let's see those drives do things like dynamically adjust the
> electronics - such as amp gain, bias, slew rate... They can't do it.

Your continuing blind spot is in not being able to understand that they
don't have to: whatever marginal improvement in per-drive reliability
such special-purpose advantages may achieve (again, assuming that they
achieve any at all: as I noted, commodity SATA drives *do* support most
of what you described, and I have no reason to believe that you have any
real clue what actual differences exist), those advantages can be
outweighed simply by using more lower-cost drives (such that two or even
three can fail for every high-cost drive failure without jeopardizing data).
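
To put rough numbers on that trade-off, here is a back-of-the-envelope Python sketch. The annual failure rates are made-up assumptions purely for illustration, and rebuild windows and correlated failures are ignored.

# Crude comparison of mirror configurations. The AFR values are assumptions.
afr_premium   = 0.005   # assumed 0.5% annual failure rate per high-end drive
afr_commodity = 0.02    # assumed 2% annual failure rate per commodity drive

# In this toy model, data is lost only if every copy fails within the year.
loss_2way_premium   = afr_premium ** 2
loss_3way_commodity = afr_commodity ** 3

print(f"2-way premium mirror:   {loss_2way_premium:.2e}")    # ~2.5e-05
print(f"3-way commodity mirror: {loss_3way_commodity:.2e}")  # ~8.0e-06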

....

>> Modern disks (both FC/SCSI and ATA/SATA) do that themselves, without
>> waiting for instructions from a higher level. They report any failure
>> up so that the higher level (again, doesn't matter whether it's
>> firmware or software) can correct the data if a good copy can be found
>> elsewhere. If its internal retry succeeds, the disk doesn't report an
>> error, but does log it internally such that any interested
>> higher-level firmware or software can see whether such successful
>> retries are starting to become alarmingly frequent and act accordingly.
>>
>
> Yes, they report total failure on a read. But they can't go back and
> try to reread the sector with different parms to the read amps, for
> instance.

As I already said, I have no reason to believe that you know what you're
talking about there. Both FC/SCSI and ATA/SATA drives make *exhaustive*
attempts to read data before giving up: they make multiple passes,
jigger the heads first off to one side of the track and then off to the
other to try to improve access, and God knows what else - to the point
where they can keep working for on the order of a minute trying to read
a bad sector before finally giving up (and I suspect that part of what
keeps them working that long includes at least some of the kinds of
electrical tweaks that you describe).

> And a good RAID controller will make decisions based in part
> on what parameters it takes to read the data.
>
>>>
>>> Also, with two or more controllers, the controllers talk to each
>>> other directly, generally over a dedicated bus. They keep each other
>>> informed of their status and constantly run diagnostics on themselves
>>> and each other when the system is idle.
>>
>>
>> Which is only necessary because they're doing things like capturing
>> updates in NVRAM (updates that must survive controller failure and
>> thus need to be mirrored in NVRAM at the other controller): if you
>> eliminate that level of function, you lose any need for that level of
>> complexity (not to mention eliminating a complete layer of complex
>> hardware with its own potential to fail).
>>
>
> This has nothing to do with updates in NVRAM.

Yes, it does. In fact, that's about the *only* reason they really
*need* to talk with each other (and they don't even need to do that
unless they're configured as a fail-over pair, which itself is not an
actual 'need' when data is mirrored such that a single controller can
suffice).

> This has everything to do
> with processing the data, constant self-checks, etc. This is critical
> in high-reliability systems.

No, it's not. You clearly just don't understand the various ways in
which high reliability can be achieved.

....

>>> These tests include reading and writing
>>> test cylinders on the disks to verify proper operation.
>>
>>
>> The background disk scrubbing which both hardware and software RAID
>> approaches should be doing covers that (and if there's really *no*
>> writing going on in the system for long periods of time, the software
>> can exercise that as well once in a while).
>>
>
> No, it doesn't. For instance, these tests include things like writing
> with a lower-level signal than normal and trying to read it back. It
> helps catch potential problems in the heads and electronics. The same
> is true for writing with stronger than normal currents - and trying to
> read them back. Also checking adjacent tracks for "bit bleed". And a
> lot of other things.
>
> These are things again no software implementation can do.

Anything firmware can do, software can do - but anything beyond standard
'mode page' controls would require use of the same special-purpose disk
interface that you allege the RAID firmware uses.

Again, though, there are more ways to skin the reliability cat than
continually torturing the disk through such a special interface - the
opposite extreme being just to do none of those special checks, let the
disk die (in whole or in part) if it decides to, and use sufficient
redundancy that that doesn't matter.

>
>>>
>>> Additionally, in the more expensive RAID devices, checksums are
>>> typically at least 32 bits long (your off-the-shelf drive typically
>>> uses a 16 bit checksum), and the checksum is built in hardware - much
>>> more expensive, but much faster than doing it in firmware. Checksum
>>> comparisons are done in hardware, also.
>>
>>
>> Your hand-waving just got a bit fast to follow there.
>>
>> 1. Disks certainly use internal per-sector error-correction codes
>> when transferring data to and from their platters. They are hundreds
>> (perhaps by now *many* hundreds) of bits long.
>
> Actually, not.

Actually, yes: you really should lose your habit of making assertions
about things that you don't know a damn thing about.

> Sectors are still 512 bytes. And the checksums (or ECC,
> if they use them) are still only 16 or 32 bits.

No, they are not.

> And even if they use
> ECC, 32 bits can only correct up to 3 bad bits out of the 512
> bytes.

Which is why they use hundreds of bits.

> None use "many hundreds of bits".

Yes, they do.

> It would waste too much disk
> space.

No, it doesn't - though it and other overhead do use enough space that
the industry is slowly moving toward adopting 4 KB disk sectors to
reduce the relative impact.

Seagate's largest SATA drive generates a maximum internal bit rate of
1030 Mb/sec but a maximum net data transfer rate of only 78 MB/sec,
suggesting that less than 2/3 of each track is occupied by user data -
the rest being split between inter-record gaps and overhead (in part
ECC). IBM's largest SATA drive states that it uses a 52-byte (416-bit)
per-sector ECC internally (i.e., about 10% of the data payload size); it
claims to be able to recover 5 random burst errors and a single 330-bit
continuous burst error.
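
Re-running those published figures as a quick Python sanity check (only the numbers quoted above are used):

internal_bit_rate_mbit = 1030   # max internal (media) rate, Mb/s
net_transfer_mbyte     = 78     # max net user-data rate, MB/s

net_bit_rate_mbit = net_transfer_mbyte * 8         # 624 Mb/s
print(net_bit_rate_mbit / internal_bit_rate_mbit)  # ~0.61, i.e. "less than 2/3"

ecc_bits    = 52 * 8                               # 416-bit per-sector ECC
sector_bits = 512 * 8
print(ecc_bits / sector_bits)                      # ~0.10, i.e. about 10% of the payload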

Possibly you were confused by the use of 4-byte ECC values in the 'read
long' and 'write long' commands: those values are emulated from the
information in the longer physical ECC.

>
>>
>> 2. Disks use cyclic redundancy checks on the data that they accept
>> from and distribute to the outside world (old IDE disks did not, but
>> ATA disks do and SATA disks do as well - IIRC the width is 32 bits).
>>
>
> See above. And even the orignal IDE drives used a 16 bit checksum.

I'm not sure what you mean by 'see above', unless you're confused about the
difference between the (long) ECC used to correct data coming off the
platter and the (32-bit - I just checked the SATA 2.5 spec) CRC used to
guard data sent between the disk and the host.
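
For a feel of what such a 32-bit transfer CRC involves, here is a small Python sketch using zlib's CRC-32 over a 512-byte payload. zlib's CRC-32 is not bit-for-bit the SATA frame CRC; it merely illustrates the size of the check and how a flipped bit is caught on a per-hop basis.

import os
import zlib

# A 512-byte "sector" payload and its 32-bit CRC, checked on each hop.
payload = os.urandom(512)
crc = zlib.crc32(payload)
print(f"CRC-32: {crc:#010x}")

# A single flipped bit is caught by the per-hop check.
corrupted = bytearray(payload)
corrupted[100] ^= 0x01
print(zlib.crc32(bytes(corrupted)) == crc)   # False
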

>
>> 3. I'd certainly expect any RAID hardware to use those CRCs to
>> communicate with both disks and host systems: that hardly qualifies
>> as anything unusual. If you were talking about some *other* kind of
>> checksum, it would have to have been internal to the RAID, since the
>> disks wouldn't know anything about it (a host using special driver
>> software potentially could, but it would add nothing of obvious value
>> to the CRC mechanisms that the host already uses to communicate
>> directly with disks, so I'd just expect the RAID box to emulate a disk
>> for such communication).
>>
>
> CRC's are not transferred to the host system, either in RAID or non-RAID
> drives.

Yes, they are: read the SATA spec (see the section about 'frames').

> Yes, some drives have that capability for diagnostic purposes.
> But as a standard practice, transferring 512 bytes is 512 bytes of data
> - no more, no less.
>
>> 4. Thus data going from system memory to disk platter and back goes
>> (in each direction) through several interfaces and physical connectors
>> and multiple per-hop checks, and the probability of some undetected
>> failure, while very small for any given interface, connector, or hop,
>> is not quite as small for the sum of all of them (as well as there
>> being some errors, such as misdirected or lost writes, that none of
>> those checks can catch). What ZFS provides (that by definition
>> hardware RAID cannot, since it must emulate a standard block-level
>> interface to the host) is an end-to-end checksum that verifies data
>> from the time it is created in main memory to the time it has been
>> fetched back into main memory from disk. IBM, NetApp, and EMC use
>> somewhat analogous supplementary checksums to protect data: in the
>> i-series case I believe that they are created and checked in main
>> memory at the driver level and are thus comparably strong, while in
>> NetApp's and EMC's cases they are created and checked in the main
>> memory of the file server or hardware box but then must get to and
>> from client main memory across additional interfaces, connectors, and
>> hops which have their own individual checks and are thus not
>> comparably end-to-end in nature - though if the NetApp data is
>> accessed through a file-level protocol that includes an end-to-end
>> checksum that is created and checked in client and server main memory
>> rather than, e.g., in some NIC hardware accelerator it could be
>> *almost* comparable in strength.
>>
>
> Yes, ZFS can correct for errors like bad connectors and cables. And I
> guess you need it if you use cheap connectors or cables.

You need it, period: the only question is how often.

> But even if
> they do fail - it's not going to be a one-time occurrence. Chances are
> your system will crash within a few hundred ms.
>
> I dont' know about NetApp, but IBM doesn't work this way at all. The
> channel itself is parity checked by hardware on both ends. Any parity
> check brings the system to an immediate halt.

Exactly what part of the fact that the end-to-end ZFS mechanism is meant
to catch errors that are *not* caught elsewhere is still managing to
escape you? And that IBM uses similar mechanisms itself in its i-series
systems (as do other major vendors like EMC and NetApp) for the same reason?
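
As a purely illustrative Python sketch of that end-to-end idea (a toy model, not ZFS code): the checksum is computed in host memory before the data leaves it and re-verified after the round trip, so corruption introduced anywhere along the path is detected and a redundant copy can be consulted.

import hashlib

def write_block(data: bytes):
    digest = hashlib.sha256(data).digest()   # computed before the data leaves memory
    return data, digest                      # stored together (or kept separately)

def read_block(stored, mirror=None):
    data, digest = stored
    if hashlib.sha256(data).digest() == digest:
        return data
    if mirror is not None:                   # checksum mismatch: try the other copy
        return read_block(mirror)
    raise IOError("checksum mismatch and no good copy available")

primary = write_block(b"important row data")
mirror  = write_block(b"important row data")

# Simulate silent corruption of the primary copy somewhere along the I/O path:
primary = (b"importAnt row data", primary[1])
print(read_block(primary, mirror))           # falls back to the mirror copy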

>
>>>
>>> Plus, with verified writes, the firmware has to go back and reread
>>> the data the next time the sector comes around and compare it with
>>> the contents of the buffer. Again, this is often done in hardware on
>>> the high end RAID systems.
>>
>>
>> And can just as well be done in system software (indeed, this is often
>> a software option in high-end systems).
>>
>
> Sure, it *can* be done with software, at a price.

A much lower price than is required to write the same code as firmware
and then enshrine it in additional physical hardware.

>
>>>
>>> And, most of these RAID devices use custom chip sets - not something
>>> off the shelf.
>>
>>
>> That in itself is a red flag: they are far more complex and also get
>> far less thoroughly exercised out in the field than more standard
>> components - regardless of how diligently they're tested.
>>
> Gotten a cell phone lately? Chances are the chips in your phone are
> custom-made. Each manufacturer creates its own. Or an X-BOX, Nintendo,
> PlayStation, etc.? Most of those have custom chips. And the same is
> true for microwaves, TV sets and more.
>
> The big difference is that Nokia can make 10M custom chips for its
> phones; for a high-end RAID device, 100K is a big run.

Exactly the point I was making: they get far less exercise out in the
field to flush out the last remaining bugs. I'd be more confident in a
carefully-crafted new software RAID implementation than in an equally
carefully-crafted new hardware-plus-firmware implementation, because the
former has considerably less 'new' in it (not to mention being easier to
trouble-shoot and fix in place if something *does* go wrong).

....

> You seem to think software is the way to go. Just tell me one thing.
> When was the last time you had to have your computer fixed because of a
> hardware problem? And how many times have you had to reboot due to a
> software problem?

Are you seriously suggesting that Intel and Microsoft have comparable
implementation discipline? Not to mention the relative complexity of an
operating system plus full third-party driver and application spectrum
vs. the far more standardized relationships that typical PC hardware
pieces enjoy.

We're talking about reliability *performing the same function*, not
something more like comparing the reliability of an automobile engine
with that of the vehicle as a whole.

>
> And you say software is as reliable?

For a given degree of complexity, and an equally-carefully-crafted
implementation, software that can leverage the reliability of existing
hardware that already has to be depended upon for other processing is
inherently more reliable - because code is code whether in software or
firmware, but the software-only approach has less hardware to go wrong.

....

>> I seriously doubt that anyone who's been talking with you (or at least
>> trying to) about hardware RAID solutions has been talking about any
>> that you'd find at CompUSA. EMC's Symmetrix, for example, was the
>> gold standard of enterprise-level hardware RAID for most of the '90s -
>> only relatively recently did IBM claw back substantial market share in
>> that area (along with HDS).
>>
>
> Actually, Symmetrix grew big in the small and medium systems, but IBM
> never lost the lead in the top end RAID solutions. But they also were
> (and still are) quite a bit more expensive than EMC's.

You really need to point to specific examples of the kinds of 'higher
end' RAIDs that you keep talking about (something we can look at and
evaluate on line, rather than asking us to take your unsupported word
for it). *Then* we'll actually have concrete competing approaches to
discuss.

....

>>> the software cannot
>>> detect when a signal is getting marginal (it's either "good" or
>>> "bad", adjust the r/w head parameters, and similar things.
>>
>>
>> And neither can hardware RAID: those things happen strictly
>> internally at the disk (for that matter, by definition *anything* that
>> the disk externalizes can be handled by software as well as by RAID
>> hardware).
>>
>
> And here you show you know nothing about what you talk. RAID drives are
> specially built to work with their controllers. And RAID controllers
> are made to be able to do these things. This is very low level stuff -
> not things which are available outside the drive/controller.
>
> Effectively, the RAID controller and the disk controller become one
> unit. Separate, but one.

Provide examples we can look at if you want anyone to believe you.

>
>>> Yes, it can
>>> checksum the data coming back and read from the mirror drive if
>>> necessary.
>>
>>
>> Yup.
>>
>> Now, that *used* to be at least something of a performance issue -
>> being able to offload that into firmware was measurably useful. But
>> today's processor and memory bandwidth makes it eminently feasible -
>> even in cases where it's not effectively free (if you have to move the
>> data, or have to compress/decompress or encrypt/decrypt it, you can
>> generate the checksum as it's passing through and pay virtually no
>> additional cost at all).
>>
>
> Sorry, Bill, this statement is really off the wall.

Not at all: in fact, you just had someone tell you about very
specifically comparing ZFS performance with more conventional approaches
and finding it *better*.

>
> Then why do all the high end disk controllers use DMA to transfer data?

Because a) it's only been in the past few years that CPU and memory
bandwidth has become 'too cheap to meter' (so controllers are still
using the same approaches they used to), and b) there's no point in
*wasting* CPU bandwidth for no reason (DMA isn't magic, though: it
doesn't save any *memory* bandwidth).

When you're already moving the data, computing the checksum is free. If
you're not, it's still cheap enough to be worth the cost for the benefit
it confers (and there's often some way to achieve at least a bit of
synergy - e.g., then deciding to move the data after all because it
makes things easier and you've already got it in the processor cache to
checksum it).
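
A small Python sketch of that synergy: when data is already being streamed in chunks, the checksum falls out of the same pass at essentially no extra cost.

import hashlib
import io

def copy_with_checksum(src, dst, chunk_size=64 * 1024):
    h = hashlib.sha256()
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        h.update(chunk)   # the checksum piggybacks on the data movement
        dst.write(chunk)
    return h.hexdigest()

src = io.BytesIO(b"x" * 1_000_000)
dst = io.BytesIO()
print(copy_with_checksum(src, dst))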

> Because it's faster and takes fewer CPU cycles than doing it in software,
> that's why. And computing checksums for 512 bytes takes significantly
> longer than actually transferring the data to/from memory via
> software.

It doesn't take *any* longer: even with pipelined and prefetched
caching, today's processors can compute checksums faster than the data
can be moved.

>
> Also, instead of allocating 512 byte buffers, the OS would have to
> allocate 514 or 516 byte buffers. This removes a lot of the optimization
> possible when the system is using buffers during operations.

1. If you were talking about something like the IBM i-series approach,
that would be an example of the kind of synergy that I just mentioned:
while doing the checksum, you could also move the data to consolidate it
at minimal additional cost.

2. But the ZFS approach keeps the checksums separate from the data, and
its sectors are packed normally (just payload).

>
> Additionally, different disk drives internally use different checksums.
>
> Plus there is no way to tell the disk what to write for a checksum. This
> is hard-coded into the disk controller.

You're very confused: ZFS's checksums have nothing whatsoever to do
with disk checksums.

- bill

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware,

am 14.11.2006 18:21:50 von Kees Nuyt

On Mon, 13 Nov 2006 21:48:07 -0500, Jerry Stuckle
wrote:

>Bill Todd wrote:
>> Dear me - I just bade you a fond farewell, and here you've at last come
>> up with something at least vaguely technical (still somewhat mistaken,
>> but at least technical). So I'll respond to it in kind:
>>
>> Jerry Stuckle wrote:
>>>

>And there is no way to make a pair of Cheetahs as reliable as drives
>made strictly for high end RAID devices. Some of these drives still
>sell for $30-60/GB (or more).

They often are Cheetahs or similar, usually with custom disk
controllers on them instead of commodity controllers. Remember
the original meaning of "RAID"?
The high price has to cover the cost of the custom disk
controllers, QA, packaging, fast transport, part tracking,
failure statistics and technical support.
That explains the difference between the cost of the bare
"inexpensive disk" and TCO.

>No, I didn't say ANY drive was "crap". They're good drives, when used
>for what they are designed. But drives made for RAID arrays are in a
>class by themselves. And they can do things that standard drives can't
>(like dynamically adjust amplifiers and slew rates when reading and
>writing data).

That depends on the drive controller, not the drive.
At current densities even commodity drives will need
adaptive amplification etc.

>CRC's are not transferred to the host system, either in RAID or non-RAID
>drives. Yes, some drives have that capability for diagnostic purposes.
> But as a standard practice, transferring 512 bytes is 512 bytes of
>data - no more, no less.

CKD is still used in mainframes. Not exactly a checksum, nor at
sector level (rather at allocation unit level), but redundant
info anyway, to enhance reliability.

>I dont' know about NetApp, but IBM doesn't work this way at all. The
>channel itself is parity checked by hardware on both ends. Any parity
>check brings the system to an immediate halt.

Of course not. The I/O is invalidated, discarded and retried
over another path. You can pull a channel plug anytime, without
interrupting anything at the application level. Just a warning
on the console.

>>> And, most of these RAID devices use custom chip sets - not something
>>> off the shelf.

Or generic DSPs, FPGAs and RISC processors programmed for this
specific application.

>Actually, Symmetrix grew big in the small and medium systems, but IBM
>never lost the lead in the top end RAID solutions. But they also were
>(and still are) quite a bit more expensive than EMC's.

I guess you missed a serious round of price fighting in the past
decade. IBM offered ESS at dumping prices for quite a while
to gain market share.

>>> detect when a signal is getting marginal (it's either "good" or "bad",
>>> adjust the r/w head parameters, and similar things.

There is no reason to not use those capabilities in a ZFS
environment.

>Then why do all the high end disk controllers use DMA to transfer data?

Apples and pears. We're not talking high end PC's here.
In high end systems the disk controller doesn't have access to
system memory at all. It is just part of the storage system.
The storage system is connected to the computer system by some
sort of channel, which connects to a channel adapter of some
kind, which may have DMA access. In mainframes there's still an
IOP in between; there the IOPs use DMA.

Just my EUR 0,02
--
( Kees
)
c[_] A problem shared is a problem halved, so
is your problem really yours or just half
of someone else's? (#348)

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, w

am 14.11.2006 19:36:30 von Jerry Stuckle

Robert Milkowski wrote:
> Jerry Stuckle wrote:
>
>>Robert Milkowski wrote:
>>
>>>It's here you don't understand. ZFS protects me from a bad driver, bad FC switch,
>>>bad FC/SCSI/ESCON/... adapter corrupting data. As one SCSI adapter unfortunately
>>>did a few weeks ago. So it's not only the array.
>>>And while I admit that I haven't seen (yet) ZFS detecting data corruption
>>>on Symmetrix boxes, I did on other arrays; it could be due to the fact that
>>>I put ZFS on Symmetrix boxes not that long ago, and compared to other
>>>arrays it's not that much storage under ZFS on Symmetrix here. So statistically
>>>it could be that I'm just luckier. And of course I expect Symmetrix to be
>>>more reliable than a JBOD or a medium array.
>>>
>>
>>Immaterial.
>>
>>A bad driver won't let the system run - at least not for long. Same
>>with a bad FC switch, etc. And how long did your system run with a bad
>>SCSI adapter?
>
>
> Actually it was writing data with some errors for hours, then the system panicked.
> Then again it was writing data for another 7-9h.
>
> Now of course the system won't reboot just because of a bad switch - it's not the first
> time I've had a problem with an FC switch (long time ago, granted). With at least dual links
> from each host to different switches (different fabrics) it's not a big problem.
>

OK, that's possible, I guess. But a high end RAID device is running
diagnostics on the adapters when it's idle. And if it has two or more
controllers (which most of them do), they are also checking each other.
So your problem would have been detected in milliseconds.

>
>>And yes, as I've stated before - like anything else, you get what you
>>paid for. Get a good quality RAID and you won't get data corruption issues.
>
>
> Is IBM high-end array a good quality one for example?
>

That's one of them.

>
>
>>And what does any of this have to do with the discussion at hand - which
>>is data reliability? You seem to have a penchant for changing the
>>subject when you can't refute the facts.
>
>
> Reliability comes from end-to-end data integrity, which HW RAID itself can't
> provide so your data are less protected.
>

It can provide integrity right to the connector.

> Your RAID doesn't protect you from anything between itself and your host.
> Now vendors recognize it for years that's why IBM, EMC, Oracle, Sun, Hitachi, etc.
> all of them provide some hacks for specific applications like Oracle. Of course
> none of those solution are nearly as complete as ZFS.
> I know, you know better than all those vendors. You know better than people
> who actually lost their data both on cheap and 500k+ (you seem to like this number)
> arrays. It's just that you for some reasons can't accept simple facts.
>

That's not its job. Its job is to deliver accurate data to the bus.
If you want further integrity checking, it's quite easy to do in
hardware, also - i.e. parity checks, ECC, etc. on the bus. That's why
IBM mainframes have parity checking on their channels.

> Reliability comes from keeping meta data blocks on different LUNs + configured
> protection. While it's not strictly RAID issue the point is ZFS has file system
> integrated with Volume Manager, so it can do it. So even if you do just
> striping on ZFS on RAID LUNS, and you overwrite entire LUN your file system still
> will be consistent - only lost data (not metadata) are lost.
>

Gee, why not state the obvious? RAIDs do that quite well.

> Reliability comes from never overwriting actual data on a medium, so you don't
> have to deal with incomplete writes, etc.
>

That's where you're wrong. You're ALWAYS overwriting data on a medium.
Otherwise your disk would quickly fill. High end RAIDs ensure they
have sufficient backup power such that even in the event of a complete
power failure they can complete the current write, for instance. And
they detect the power failure.

Some are even designed to have sufficient backup power to flush anything
in the buffers to disk before they power down.

> Reliability comes from integrating file system with volume manager, so regardless
> of RAID type you used, you are always consistent on disk (both data and meta data).
> Something which traditional file systems, log structured, can't guarantee. And that's
> why even with journaling sometimes you end up with fsck after all.
>

Not at all. Volume manager has nothing to do with ensuring data
integrity on a RAID system.

> Reliability comes from knowing where exactly on each disk your data is, so if you
> do not have RAID full of data ZFS will resilver disk in case of emergency MUCH
> faster by resilvering only actual data and not all blocks on disk. Also as it understand
> data on disk it starts resilver from / so from the beginning even if resilver isn't
> completed yet you get some protection. On classic array you can't do it as they
> can't understand data on their disks.
>

Reliability comes from not caring where your data is physically on the
disk. Even the lower end disk drives have spare cylinders they can
allocate transparently in the case a sector or track goes bad. The
sector address the file system provides to the disk may or may not be
the physical sector where the data is written. It may have been mapped
to an entirely different area of the disk.
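
To make the remapping idea concrete, here is a toy Python sketch (my own
illustration, not any vendor's firmware): the host keeps addressing the same
logical sector while the drive silently moves it to a spare.

class ToyDrive:
    """Toy model of transparent bad-sector remapping (illustration only)."""
    def __init__(self, sectors, spares):
        self.data = {}                                 # physical sector -> bytes
        self.remap = {}                                # logical -> spare physical
        self.spare_pool = list(range(sectors, sectors + spares))

    def _physical(self, logical):
        return self.remap.get(logical, logical)

    def mark_bad(self, logical):
        spare = self.spare_pool.pop(0)                 # allocate a spare sector
        old = self._physical(logical)
        if old in self.data:
            self.data[spare] = self.data.pop(old)      # salvage what it can
        self.remap[logical] = spare

    def write(self, logical, payload):
        self.data[self._physical(logical)] = payload

    def read(self, logical):
        return self.data[self._physical(logical)]

d = ToyDrive(sectors=100, spares=4)
d.write(7, b"payload")
d.mark_bad(7)                  # logical sector 7 now lives on a spare
print(d.read(7))               # the host never sees the relocation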

>
> Now there are other thing in ZFS which greatly increase reliability, manegability, and
> other things. But as you can't agree with basic and simple facts it doesn't really
> make sense to go any further. It probably doesn't even make sense to talk to you anymore.
>
>

I never said there were not some good things about ZFS. And it makes
sense as a cheap replacement for an expensive RAID.

But the facts are - there are things a RAID can do that ZFS cannot do.
And there are things that ZFS can do which RAID does not do, because it
is beyond the job of a disk subsystem.

And I agree. You've shown such ignorance of the technology involved in
RAID systems that it doesn't make any sense to talk to you, either.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, w

am 14.11.2006 20:02:28 von Jerry Stuckle

Robert Milkowski wrote:
> Jerry Stuckle wrote:
>
>>Robert Milkowski wrote:
>>
>>>Also when you're using your high-end array without ZFS you basically
>>>get less reliability than when you use the same array with ZFS.
>>>
>>
>>Proof? Statistics?
>
>
> Proof - I lost data on arrays, both big and large. ZFS has already helped me
> a few times. That's all the proof I need. Now if you need some basics on why this is
> possible, please try again to read what all the people here were writing, this
> time with open mind and understanding, then go to Open Solaris site for
> more details, then if you are still more curious and capable of reading code,
> then do it. Or you can just try it.
>
> However I suspect you will just keep trolling and stay in a dream of yours.
>
Yes, you will when you use cheap arrays. The ones I worked with never
lost data, even though disks crashed, controllers went bad and other
things happened. A good RAID array will detect the problems and recover
from them. At the same time it will notify the system of the problem so
corrective action can be taken.

>
>
>>ZFS cannot adjust amp gains. It cannot change the bias. It cannot
>>tweak the slew rates. And a lot more. These are all very low level
>>operations available only to the disk controller. And they are much of
>>what makes a difference between a high-end drive and a throw-away drive
>>(platter coating being another major cost difference).
>
>
> To be honest I have no idea if high end arrays do these things.
> But all I know is that if I use such an array and put ZFS on top of it,
> then in addition to the protection you describe I get much more. In the end
> I get better data protection. And that's exactly what people are already
> doing.
>
> Now, please try to adjust your bias, as evidently you've got an undetected
> (at least by yourself) malfunction :))) no offence.
>

Of course they do. That's part of what makes them high end arrays.
This type of circuitry is much more reliable - and much more expensive
to implement. It's part of why they cost so much.

>
>
>>>>who has worked with high performance, critical systems knows there is a
>>>>*huge* difference between doing it in hardware and software.
>>>
>>>
>>>Really? Actually depending on workload specifics and hardware specifics
>>>I can see HW being faster than software, and the opposite.
>>>
>>>In some cases clever combination of both gives best results.
>>>
>>
>>Wrong! Data transfer in hardware is ALWAYS faster than in software.
>
>
> Holly.....!#>!>!
> Now you want to persuade me that even if my application works faster with
> software RAID it's actually slower, just because you think so.
> You really have a problem with grasping reality around you.
>

And with a good hardware RAID device, it would work even faster. But
you've never tried one, so you have no comparison, do you?
>
>
>>So how often do YOU get bad cables or connectors?
>>
>>Also, you're telling me you can go into your system while it's running
>>and just unplug any cable you want and it will keep running? Gee,
>>you've accomplished something computer manufacturers have dreamed about
>>for decades!
>
>
> Yep, in our HA solutions you can go and unplug any one external cable you want and
> the system will keep going - doesn't matter if it's a network cable, power cable,
> FC cable, ...
>
> You know, maybe when you were studying, and that was a long time ago I guess, people
> were dreaming about this, but it's really been kind of standard in the enterprise for years,
> if not much longer. You really don't know anything about HA (but that was obvious
> earlier).
>

Sure it's been a standard - since the 80's. And I dare say I know more
about them than you do. You've already proven that fact with your
statements.

>
>
>>>>And data in the system and system software can be corrupted. Once the
>>>>data is in the RAID device, it cannot.
>>>
>>>
>>>Really? Unfortunately for your claims it happens.
>>>And you know, even your beloved IBM's array lost some data here.
>>>The array even warned us about it :) It wasn't Shark, but also
>>>not low-end in IBMs arrays. And it did more than once.
>>>
>>
>>Not with good quality RAIDS. And obviously your claim of having IBM's
>>array is as full of manure as your earlier claims in this thread.
>
>
> Ok, enough.
> I guess from time to time I need to play a little bit with trolls, but this is
> enough.
>
> As someone else pointed - sometimes you really just can't help some people.
>
> EOT
>
>
>

Yea, you really can't help some people. You are so caught up in how
great ZFS is that you can't see reality, even when the facts are
presented to you.

But that's OK. You (and your customers or employer) are the ones who
will suffer. And there will be people like me who don't have your
preconceived notions who will come along to pick up the pieces.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware,

am 14.11.2006 21:39:37 von Toby

Jerry Stuckle wrote:
> ...you get what you
> paid for. Get a good quality RAID and you won't get data corruption issues.

ITYM "...and you're somewhat less likely to get data corruption..."

--T

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, w

am 14.11.2006 22:13:52 von Jerry Stuckle

Bill Todd wrote:
> Jerry Stuckle wrote:
>
> ...
>
> software implementations are poor replacements for a truly
>
>> fault-tolerant system.
>
>
> You really ought to stop saying things like that: it just makes you
> look ignorant.
>

No, troll. You should learn to face facts. Hardware can do things that
software can't.

> While special-purpose embedded systems may be implementable wholly in
> firmware, general-purpose systems are not - and hence *cannot* be more
> reliable than their system software (and hardware) is. Either that
> software and the hardware it runs on are sufficiently reliable, or
> they're not - and if they're not, then the most reliable storage system
> in the world can't help the situation, because the data *processing*
> cannot be relied upon.
>

Hardware systems can be more reliable than the sum of their parts.
Duplication, backup and diagnostics are all a part of the equation which
makes hardware RAID devices more reliable than a single disk drive.
This has been proven time and time again.

> So stop babbling about software reliability: if it's adequate to
> process the data, it's adequate to perform the lower levels of data
> management as well. In many cases that combination may be *more*
> reliable than adding totally separate hardware and firmware designed by
> a completely different organization: using the existing system hardware
> that *already* must be reliable (rather than requiring that a *second*
> piece of hardware be equally reliable too) by definition reduces total
> exposure to hardware faults or flaws (and writing reliable system
> software is in no way more bug-prone than writing equivalent but
> separate RAID firmware).
>

Incorrect. The software processing the data may be totally reliable.
But other software in the system can corrupt things.

Windows 3.x was a perfect example. By itself, it was quite reliable.
But applications running under it could corrupt the system. And the
programs processing the data may have been perfect - but another program
could get in there and corrupt the data.

With the advent of the 80286 and later chipsets with virtual memory,
rings, protected memory and the rest of the nice things, an application
has a much smaller possibility of corrupting the system. But under
certain conditions it can still happen. And even with all that, a
poorly written driver can really screw things up.

These things cannot happen in a RAID device, where all the firmware is
completely isolated from the rest of the system.

And adding a second piece of hardware INCREASES reliability, when
properly configured. That's why RAID devices exist, after all. Sure,
you have a slightly higher probability of ONE piece failing, but an
almost infinitesimal chance of BOTH pieces failing at the same time.
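
The arithmetic behind that trade-off, as a rough sketch (assuming independent
failures, which duplicated controllers are engineered to approximate; the 1%
figure is purely hypothetical):

def prob_all_fail(p_single, copies):
    """Chance that every one of `copies` independent components fails."""
    return p_single ** copies

def prob_any_fail(p_single, copies):
    """Chance that at least one of them fails."""
    return 1 - (1 - p_single) ** copies

p = 0.01   # hypothetical per-controller failure probability
print(prob_any_fail(p, 2))   # ~0.0199 - slightly higher chance of ONE failure
print(prob_all_fail(p, 2))   # 0.0001  - far smaller chance of losing BOTH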

> And the high end RAID devices do not require
>
>> special software - they look like any other disk device attached to
>> the system.
>
>
> Which is why they cannot include the kinds of end-to-end checks that a
> software implementation inside the OS can: the standard interface
> doesn't support it.
>

Not on the Intel systems, they don't, I agree. But mainframes, for
instance, are designed such that their I/O channels are parity checked
in the hardware. A parity problem on the bus brings the system to an
immediate screeching halt. This guarantees the data sent by the
controller arrives in the system correctly.

>>
>> As for bundling write acceleration in NVRAM - again, meaningless
>> because good RAID devices aren't loaded as a "special system device".
>
>
> Perhaps you misunderstood: I listed transparent write-acceleration by
> using NVRAM in hardware RAID as a hardware RAID *advantage* (it just has
> nothing to do with *reliability*, which has been the issue under
> discussion here).
>

OK. But asynchronous writes are buffered anyway (even in most non-RAID
controllers today), and if an immediate read finds the data still in the
buffer, most controllers will just return it from the buffer. And
synchronous writes should never return until they are physically on the
device (something that some file systems ignore).

It should never be cached in NVRAM - unless the NVRAM itself is the
storage device (which is getting more and more possible).
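
As a minimal sketch of what "never return until it is physically on the
device" means at the application level (POSIX-style, my illustration; whether
the drive's own volatile write cache honors the flush depends on the hardware
and its settings):

import os

def write_durably(path, payload):
    """Write payload and block until the kernel reports it flushed to the device."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, payload)
        os.fsync(fd)          # the synchronous part: wait for stable storage
    finally:
        os.close(fd)

write_durably("/tmp/example.dat", b"committed record\n")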

>>
>> Prestoserve was one of the first lower-end RAID products made.
>
>
> I don't believe that Prestoserve had much to do with RAID: it was
> simply battery-backed (i.e., non-volatile) RAM that could be used as a
> disk (or in front of disks) to make writes persistent at RAM speeds.
>

That's interesting. I'm thinking of a different system, then - it's
been a few years. I was thinking they actually had a RAID device, also.

> However,
>
>> there were a huge number of them before that. But you wouldn't find
>> them on a PC. They were primarily medium and large system devices.
>
>
> I never heard of Prestoserve being available on a PC (though don't know
> for certain that it wasn't): I encountered it on mid-range DEC and Sun
> systems.
>

As I said - you wouldn't find them on a PC.

>> Prestoserve took some of the ideas and moved much of the hardware
>> handling into software. Unfortunately, when they did it, they lost
>> the ability to handle problems at a low-level (i.e. read head biasing,
>> etc.). It did make the arrays a lot cheaper, but at a price.
>
>
> I think you're confused: Prestoserve had nothing to do with any kind of
> 'hardware handling', just with write acceleration via use of NVRAM.
>

OK, again - a different device than what I was thinking of. But yes,
NVRAM devices do speed up disk access tremendously.

>>
>> And in the RAID devices, system address space was never a problem -
>> because the data was transferred to RAID cache immediately. This did
>> not come out of the system pool; the controllers have their own cache.
>
>
> Which was what I said: back when amounts of system memory were limited
> (by addressability if nothing else) this was a real asset that hardware
> RAID could offer, whereas today it's much less significant (since a
> 64-bit system can now address and effectively use as much RAM as can be
> connected to it).
>

That's true today. However, it still remains that the data can be
corrupted while in system memory. Once transferred to the controller,
it cannot be corrupted.

>>
>> I remember 64MB caches in the controllers way back in the mid 80's.
>> It's in the GB, now.
>
>
> Indeed - a quick look at some current IBM arrays show support up to at
> least 1 GB (and large EMC and HDS arrays offer even more). On the other
> hand, system RAM in large IBM p-series systems can reach 2 TB these
> days, so (as I noted) the amount that a RAID controller can add to that
> is far less (relatively) significant than it once was.
>

I'm surprised it's only 1 GB. I would have figured at least 100GB. The
64MB was from back in the 80's, when disks held a little over 600MB (and
the platters were the size of the tires on your car).

>>> But time marches on. Most serious operating systems now support
>>> (either natively or via extremely reputable decade-old,
>>> thoroughly-tested third-party system software products from people
>>> like Veritas) software RAID, and as much cache memory as you can
>>> afford (no more address-space limitations there) - plus (with
>>> products like ZFS) are at least starting to address synchronous
>>> small-update throughput (though when synchronous small-update
>>> *latency* is critical there's still no match for NVRAM).
>>>
>>
>> Sure, you can get software RAID. But it's not as reliable as a good
>> hardware RAID.
>
>
> That is simply incorrect: stop talking garbage.
>

And you stop talking about something you know nothing about.

>>
>>> You don't take $89 100GB disk drives off the shelf,
>>>
>>>> tack them onto an EIDE controller and add some software to the system.
>>>
>>>
>>>
>>> Actually, you can do almost *precisely* that, as long as the software
>>> is handles the situation appropriately - and that's part of what ZFS
>>> is offering (and what you so obviously completely fail to be able to
>>> grasp).
>>>
>>
>> In the cheap RAID devices, sure. But not in the good ones. You're
>> talking cheap. I'm talking quality.
>
>
> No: I'm talking relatively inexpensive, but with *no* sacrifice in
> quality. You just clearly don't understand how that can be done, but
> that's your own limitation, not any actual limitation on the technology.
>

No, YOU don't understand that to make a RAID device "relatively
inexpensive" you must cut something out. That includes both hardware
and firmware. And that makes them less reliable.

Sure, they are reliable enough for many situations. But they are not as
reliable as the high end devices.

>>
>>> No disk or firmware is completely foolproof. Not one. No matter how
>>> expensive and well-designed. So the question isn't whether the disks
>>> and firmware are unreliable, but just the degree and manner in which
>>> they are.
>>>
>>
>> I never said they were 100% foolproof. Rather, I said they are
>> amongst the most tested software made. Probably the only software
>> tested more thoroughly is the microcode on CPU's. And they are as
>> reliable as humanly possible.
>
>
> So is the system software in several OSs - zOS and VMS, for example (I
> suspect IBM's i-series as well, but I'm not as familiar with that).
> Those systems can literally run for on the order of a decade without a
> reboot, as long as the hardware doesn't die underneath them.
>

If that were the case, then companies wouldn't have to spend so much
money on support structures, issuing fixes and the like. Sure, this
software is tested. But not nearly as thoroughly as firmware.

But that's only to be expected. Firmware has a very dedicated job, with
limited functionality. For instance, in the case of RAID devices, it
has to interpret a limited number of commands from the system and
translate those to an even smaller number of commands to the disk
electronics.

System software, OTOH, must accept a lot more commands from application
programs, handle more devices of different types, and so on. As a
result, it is much more complicated than any RAID firmware, and much
more prone to problems.


> And people trust their data to the system software already, so (as I
> already noted) there's no significant additional exposure if that
> software handles some of the RAID duties as well (and *less* total
> *hardware* exposure, since no additional hardware has been introduced
> into the equation).
>

There is a huge additional exposure, but you refuse to believe that. If
there were no additional exposure, then why are people buying hardware
RAID devices left and right? After all - according to you, there is
plenty of memory, plenty of CPU cycles, and the software is perfect.

But they are. That's because your premises are wrong.

>>
>> Of course, the same thing goes for ZFS and any file system. They're
>> not completely foolproof, either, are they?
>
>
> No, but neither need they be any *less* foolproof: it just doesn't
> matter *where* you execute the RAID operations, just *how well* they're
> implemented.
>

It makes a world of difference where you execute the RAID operations.
But you refuse to understand that.

>>
>>> There is, to be sure, no way that you can make a pair of inexpensive
>>> SATA drives just as reliable as a pair of Cheetahs, all other things
>>> being equal. But it is *eminently* possible, using appropriate
>>> software (or firmware), to make *three or four* inexpensive SATA
>>> drives *more* reliable than a pair of Cheetahs that cost far more -
>>> and to obtain better performance in many areas in the bargain.
>>>
>>
>> And there is no way to make a pair of Cheetahs as reliable as drives
>> made strictly for high end RAID devices. Some of these drives still
>> sell for $30-60/GB (or more).
>
>
> It is possible that I'm just not familiar with the 'high-end RAID
> devices' that you're talking about - so let's focus on that.
>

From what you're saying, you're not even familiar with the low-end RAID
devices. All you know about is the marketing hype you've read about
ZFS. And you take that as gospel, and everyone else is wrong.

Let me clue you in, Bill. There are millions of people out there who
know better.

> EMC took a major chunk of the mainframe storage market away from IBM in
> the '90s using commodity SCSI disks from (mostly) Seagate, not any kind
> of 'special' drives (well, EMC got Seagate to make a few firmware
> revisions, but nothing of major significance - I've always suspected
> mostly to keep people from by-passing EMC and getting the drives at
> standard retail prices). At that point, IBM was still making its own
> proprietary drives at *far* higher prices per GB, and EMC cleaned up (by
> emulating the traditional IBM drive technology using the commodity SCSI
> Seagate drives and building in reliability through redundancy and
> intelligent firmware to compensate for the lower per-drive reliability -
> exactly the same kind of thing that I've been describing to let today's
> SATA drives substitute effectively for the currently-popular higher-end
> drives in enterprise use).
>

Sure. But these weren't RAID devices, either. A completely different
market. Those who needed high reliability bought hardware RAID devices
from IBM or others. And they still do.

> Since then, every major (and what *I* would consider 'high-end') array
> manufacturer has followed that path: IBM and Hitachi use commodity
> FC/SCSI drives in their high-end arrays too (and even offer SATA drives
> as an option for less-demanding environments). These are the kinds of
> arrays used in the highest-end systems that, e.g., IBM and HP run their
> largest-system TPC-C benchmark submissions on: I won't assert that even
> higher-end arrays using non-standard disks don't exist at all, but I
> sure don't *see* them being used *anywhere*.
>

Sure, they use the drives themselves. But the high end RAID devices
have different electronics, different controllers, and in some cases,
even different coatings on the disk surfaces.

> So exactly what drives are you talking about that cost far more than the
> best that Seagate has to offer, and offer far more features in terms of
> external control over internal drive operations (beyond the standard
> 'SCSI mode page' tweaks)? Where can we see descriptions of the
> super-high-end arrays (costing "$100-500/GB" in your words) that use
> such drives (and preferably descriptions of *how* they use them)?
>

The high end RAID devices are only available to OEMs at a much higher
price. Check with your IBM sales rep for one.

> Demonstrating that you have at least that much of a clue what you're
> talking about would not only help convince people that you might
> actually be worth listening to, but would also actually teach those of
> us whose idea of 'high-end arrays' stops with HDS and Symmetrix
> something we don't know. Otherwise, we'll just continue to assume that
> you're at best talking about '80s proprietary technology that's
> completely irrelevant today (if it ever existed as you describe it even
> back then).
>

I have demonstrated that. However, you have demonstrated you are either
too stupid or too close-minded to understand basic facts. Which is it,
troll?

> That would not, however, change the fact that equal reliability can be
> achieved at significantly lower cost by using higher numbers of low-cost
> (though reputable) drives with intelligent software to hook them
> together into a reliable whole. In fact, the more expensive those
> alleged non-standard super-high-end drives (and arrays) are, the easier
> that is to do.
>

And that's where you're wrong. And those who really understand the high
end RAID devices disagree with you.

> ...
>
>>> And you don't attach them through Brand X SATA controllers, either:
>>> ideally, you attach them directly (since you no longer need any
>>> intermediate RAID hardware), using the same quality electronics you
>>> have on the rest of your system board (so the SATA connection won't
>>> constitute a weak link). And by virtue of being considerably simpler
>>> hardware/firmware than a RAID implementation, that controller may
>>> well be *more* reliable.
>>>
>>
>> There is no way this is more reliable than a good RAID system. If you
>> had ever used one, you wouldn't even try to make that claim.
>
>
> One could equally observe that if you had ever used a good operating
> system, you wouldn't even try to make *your* claim. You clearly don't
> understand software capabilities at all.
>

I have used good operating systems. And I used to work with the
internals of operating systems when I worked for IBM as a Software
Engineer. I dare say I know a lot more about system software than you
do, especially from the internals end. Your statements above about how
reliable they are prove that.

> ...
>
>>> Whether you're aware of it or not, modern SATA drives (and even
>>> not-too-old ATA drives) do *all* the things that you just described
>>> in your last one-and-a-half paragraphs.
>>>
>>
>> And let's see those drives do things like dynamically adjust the
>> electronics - such as amp gain, bias, slew rate... They can't do it.
>
>
> Your continuing blind spot is in not being able to understand that they
> don't have to: whatever marginal improvement in per-drive reliability
> such special-purpose advantages may achieve (again, assuming that they
> achieve any at all: as I noted, commodity SATA drives *do* support most
> of what you described, and I have no reason to believe that you have any
> real clue what actual differences exist), those advantages can be
> outweighed simply by using more lower-cost drives (such that two or even
> three can fail for every high-cost drive failure without jeopardizing
> data).
>

Oh, I know the differences. However, it's obvious you don't.
Fortunately, those who need to understand the differences do - and they
buy hardware RAID.

> ...
>
>>> Modern disks (both FC/SCSI and ATA/SATA) do that themselves, without
>>> waiting for instructions from a higher level. They report any
>>> failure up so that the higher level (again, doesn't matter whether
>>> it's firmware or software) can correct the data if a good copy can be
>>> found elsewhere. If its internal retry succeeds, the disk doesn't
>>> report an error, but does log it internally such that any interested
>>> higher-level firmware or software can see whether such successful
>>> retries are starting to become alarmingly frequent and act accordingly.
>>>
>>
>> Yes, they report total failure on a read. But they can't go back and
>> try to reread the sector with different parms to the read amps, for
>> instance.
>
>
> As I already said, I have no reason to believe that you know what you're
> talking about there. Both FC/SCSI and ATA/SATA drives make *exhaustive*
> attempts to read data before giving up: they make multiple passes,
> jigger the heads first off to one side of the track and then off to the
> other to try to improve access, and God knows what else - to the point
> where they can keep working for on the order of a minute trying to read
> a bad sector before finally giving up (and I suspect that part of what
> keeps them working that long includes at least some of the kinds of
> electrical tweaks that you describe).
>

Of course you have no reason to believe it. You have no electronics
background at all. You have no idea how it works.

Multiple passes are a completely different thing. Any drive can try to
reread the sector. But advanced drives can vary the read parameters to
compensate for marginal signals.

To simplify things for you - it's like a camera being able to adjust the
F-STOP and shutter speed to account for different lighting conditions.
A box camera has a single shutter speed and a fixed lens opening. It
can take excellent pictures of nearly stationary objects under specific
lighting conditions. But get very high or very low light, and the
result is either overexposure or underexposure. A flash helps in low
light conditions, but that's about all you can do.

However a good SLR camera can adjust both the shutter speed and lens
opening. It can take excellent pictures under a wide variety of
lighting conditions. It can even virtually freeze rapidly moving
objects without underexposure.

But you aren't going to find a good SLR for the same price as a box camera.

In the same way, you won't find a top of the line hardware RAID for the
same price as a cheap one. The top of the line one can do more things
to ensure data integrity.

> And a good RAID controller will make decisions based in part
>
>> on what parameters it takes to read the data.
>>
>>>>
>>>> Also, with two or more controllers, the controllers talk to each
>>>> other directly, generally over a dedicated bus. They keep each
>>>> other informed of their status and constantly run diagnostics on
>>>> themselves and each other when the system is idle.
>>>
>>>
>>>
>>> Which is only necessary because they're doing things like capturing
>>> updates in NVRAM (updates that must survive controller failure and
>>> thus need to be mirrored in NVRAM at the other controller): if you
>>> eliminate that level of function, you lose any need for that level of
>>> complexity (not to mention eliminating a complete layer of complex
>>> hardware with its own potential to fail).
>>>
>>
>> This has nothing to do with updates in NVRAM.
>
>
> Yes, it does. In fact, that's about the *only* reason they really
> *need* to talk with each other (and they don't even need to do that
> unless they're configured as a fail-over pair, which itself is not an
> actual 'need' when data is mirrored such that a single controller can
> suffice).
>

And in a top of the line system a single controller *never* suffices.
What happens if that controller dies? Good RAID devices always have at
least two controllers to cover that possibility, just as they have two
disks for mirroring.

And they still write data to disk. NVRAM still cannot pack the density
of a hard disk.

Plus, controllers still talk to each other all the time. They run
diagnostics on each other, for instance. They also constantly track
each other's operations, to ensure a failure in one is accurately
reflected back to the system. After all, a failing component cannot be
trusted to detect and report its failure to the system. That can fail,
also.

> This has everything to do
>
>> with processing the data, constant self-checks, etc. This is critical
>> in high-reliabilty systems.
>
>
> No, it's not. You clearly just don't understand the various ways in
> which high reliability can be achieved.
>
> ...
>
>>> These tests include reading and writing
>>>
>>>> test cylinders on the disks to verify proper operation.
>>>
>>>
>>>
>>> The background disk scrubbing which both hardware and software RAID
>>> approaches should be doing covers that (and if there's really *no*
>>> writing going on in the system for long periods of time, the software
>>> can exercise that as well once in a while).
>>>
>>
>> No, it doesn't. For instance, these tests include things like writing
>> with a lower-level signal than normal and trying to read it back. It
>> helps catch potential problems in the heads and electronics. The same
>> is true for writing with stronger than normal currents - and trying to
>> read them back. Also checking adjacent tracks for "bit bleed". And a
>> lot of other things.
>>
>> These are things again no software implementation can do.
>
>
> Anything firmware can do, software can do - but anything beyond standard
> 'mode page' controls would require use of the same special-purpose disk
> interface that you allege the RAID firmware uses.
>

Again, you don't understand what you're talking about. Not if the
commands are not present at the interface - which they aren't.

Such processing would add a tremendous amount of overhead to the system.
The system would have to handle potentially hundreds of variations of
parameters - every drive manufacturer has slightly different parameters,
and many drives vary even within a manufacturer. It depends on the spin
rate, density and magnetic coating used on the device. Any filesystem
which tried to manage these parameters would have to know a lot of
details about every possible disk on the market. And when a new one
came out, those parameters would be required, also.

The inexpensive disk drives are made to present a simple interface to
the system. They understand a few commands, such as initialize,
self-test, read, write and seek. Not a lot more. It's neither
practical nor necessary in most systems to have any more. And to add
these capabilities would drastically increase the price of the drives
themselves - not to mention the cost of developing the software to
handle the commands.

OTOH, high end RAID controllers are made to work closely with one
specific device (or a limited number of devices). They don't need to
worry about hundreds of different parameters. Only the set of
parameters they are made to work with.


> Again, though, there are more ways to skin the reliability cat than
> continually torturing the disk through such a special interface - the
> opposite extreme being just to do none of those special checks, let the
> disk die (in whole or in part) if it decides to, and use sufficient
> redundancy that that doesn't matter.
>

Sure, you can do that. And the low end RAID devices do just that. But
high end devices are more reliable just *because* they do that. And
that's why they are in such demand for truly critical data.

>>
>>>>
>>>> Additionally, in the more expensive RAID devices, checksums are
>>>> typically at least 32 bits long (your off-the-shelf drive typically
>>>> uses a 16 bit checksum), and the checksum is built in hardware -
>>>> much more expensive, but much faster than doing it in firmware.
>>>> Checksum comparisons are done in hardware, also.
>>>
>>>
>>>
>>> Your hand-waving just got a bit fast to follow there.
>>>
>>> 1. Disks certainly use internal per-sector error-correction codes
>>> when transferring data to and from their platters. They are hundreds
>>> (perhaps by now *many* hundreds) of bits long.
>>
>>
>> Actually, not.
>
>
> Actually, yes: you really should lose your habit of making assertions
> about things that you don't know a damn thing about.
>
> Sectors are still 512 bytes. And the checksums (or ECC,
>
>> if they use them) are still only 16 or 32 bits.
>
>
> No, they are not.
>

You had better go back and check your facts again. The system can block
data in any size it wants. But the hardware still uses 512 byte blocks.

> And even if they use
>
>> ECC, 32 bits can only can only correct up to 3 bad bits out of the 512
>> bytes.
>
>
> Which is why they use hundreds of bits.
>
>> None use "many hundreds of bits".
>
>
> Yes, they do.
>
> It would waste too much disk
>
>> space.
>
>
> No, it doesn't - though it and other overhead do use enough space that
> the industry is slowly moving toward adopting 4 KB disk sectors to
> reduce the relative impact.
>
> Seagate's largest SATA drive generates a maximum internal bit rate of
> 1030 Mb/sec but a maximum net data transfer rate of only 78 MB/sec,
> suggesting that less than 2/3 of each track is occupied by user data -
> the rest being split between inter-record gaps and overhead (in part
> ECC). IBM's largest SATA drive states that it uses a 52-byte (416-bit)
> per-sector ECC internally (i.e., about 10% of the data payload size); it
> claims to be able to recover 5 random burst errors and a single 330-bit
> continuous burst error.
>

Sure, but there are a lot of things between the internal bit rate and
the data transfer rate. Internal bit rate is the speed at which data is
read off the disk. But this is not continuous. There are seek times
(both head and sector) and inter-record gaps which slow things down, for
instance. Plus you're talking megaBITS/s for the internal rate, and
megaBYTES per second for the external transfer rate. 78MB/s translates
to about 624 Mb/s. Not bad, I do admit.

And their ECC is beyond what the normal disk drive does. Most still use
a 512 byte sector with a 16 or 32 bit checksum (a few use ECC). Those
which do claim a 4K sector generally emulate it in firmware.
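
For reference, the arithmetic behind the figures being traded here, as a
quick sketch (using only the numbers quoted above):

external_MBps = 78
print(external_MBps * 8, "Mb/s external")     # 624 Mb/s, vs ~1030 Mb/s internal

ecc_bits = 52 * 8                             # quoted 52-byte per-sector ECC
payload_bits = 512 * 8
print(f"ECC overhead ~{ecc_bits / payload_bits:.1%} of the payload")   # ~10.2%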

> Possibly you were confused by the use of 4-byte ECC values in the 'read
> long' and 'write long' commands: those values are emulated from the
> information in the longer physical ECC.
>

Nope, not at all.

>>
>>>
>>> 2. Disks use cyclic redundancy checks on the data that they accept
>>> from and distribute to the outside world (old IDE disks did not, but
>>> ATA disks do and SATA disks do as well - IIRC the width is 32 bits).
>>>
>>
>> See above. And even the orignal IDE drives used a 16 bit checksum.
>
>
> I'm not sure what you mean by 'see above', unless you confused about the
> difference between the (long) ECC used to correct data coming off the
> platter and the (32-bit - I just checked the SATA 2.5 spec) CRC used to
> guard data sent between the disk and the host.
>

No, I'm talking about the 16 or 32bit checksum still used by the
majority of the low-end drives.

>>
>>> 3. I'd certainly expect any RAID hardware to use those CRCs to
>>> communicate with both disks and host systems: that hardly qualifies
>>> as anything unusual. If you were talking about some *other* kind of
>>> checksum, it would have to have been internal to the RAID, since the
>>> disks wouldn't know anything about it (a host using special driver
>>> software potentially could, but it would add nothing of obvious value
>>> to the CRC mechanisms that the host already uses to communicate
>>> directly with disks, so I'd just expect the RAID box to emulate a
>>> disk for such communication).
>>>
>>
>> CRC's are not transferred to the host system, either in RAID or
>> non-RAID drives.
>
>
> Yes, they are: read the SATA spec (see the section about 'frames').
>
> Yes, some drives have that capability for diagnostic purposes.
>

OK, now you're talking specific drives. Yes SATA can transfer a
checksum to the system. But it's not normally done, and is specifically
for diagnostic purposes. They are a part of the test to ensure the
checksum is being correctly computed - but not transferred as part of
normal operation.

>> But as a standard practice, transferring 512 bytes is 512 bytes of
>> data - no more, no less.
>>
>>> 4. Thus data going from system memory to disk platter and back goes
>>> (in each direction) through several interfaces and physical
>>> connectors and multiple per-hop checks, and the probability of some
>>> undetected failure, while very small for any given interface,
>>> connector, or hop, is not quite as small for the sum of all of them
>>> (as well as there being some errors, such as misdirected or lost
>>> writes, that none of those checks can catch). What ZFS provides
>>> (that by definition hardware RAID cannot, since it must emulate a
>>> standard block-level interface to the host) is an end-to-end checksum
>>> that verifies data from the time it is created in main memory to the
>>> time it has been fetched back into main memory from disk. IBM,
>>> NetApp, and EMC use somewhat analogous supplementary checksums to
>>> protect data: in the i-series case I believe that they are created
>>> and checked in main memory at the driver level and are thus
>>> comparably strong, while in NetApp's and EMC's cases they are created
>>> and checked in the main memory of the file server or hardware box but
>>> then must get to and from client main memory across additional
>>> interfaces, connectors, and hops which have their own individual
>>> checks and are thus not comparably end-to-end in nature - though if
>>> the NetApp data is accessed through a file-level protocol that
>>> includes an end-to-end checksum that is created and checked in client
>>> and server main memory rather than, e.g., in some NIC hardware
>>> accelerator it could be *almost* comparable in strength.
>>>
>>
>> Yes, ZFS can correct for errors like bad connectors and cables. And I
>> guess you need it if you use cheap connectors or cables.
>
>
> You need it, period: the only question is how often.
>

Not if you have good hardware, you don't. That will detect bad data *at
the system*, like IBM's mainframe I/O channels do. But with cheap
hardware, yes you need it.

> But even if
>
>> they do fail - it's not going to be a one-time occurrance. Chances
>> are your system will crash within a few hundred ms.
>>
>> I dont' know about NetApp, but IBM doesn't work this way at all. The
>> channel itself is parity checked by hardware on both ends. Any parity
>> check brings the system to an immediate halt.
>
>
> Exactly what part of the fact that the end-to-end ZFS mechanism is meant
> to catch errors that are *not* caught elsewhere is still managing to
> escape you? And that IBM uses similar mechanisms itself in its i-series
> systems (as to other major vendors like EMC and NetApp) for the same
> reason?
>
>>
>>>>
>>>> Plus, with verified writes, the firmware has to go back and reread
>>>> the data the next time the sector comes around and compare it with
>>>> the contents of the buffer. Again, this is often done in hardware
>>>> on the high end RAID systems.
>>>
>>>
>>>
>>> And can just as well be done in system software (indeed, this is
>>> often a software option in high-end systems).
>>>
>>
>> Sure, it *can* be done with software, at a price.
>
>
> A much lower price than is required to write the same code as firmware
> and then enshrine it in additional physical hardware.
>

I never argued that you can't do *some* of it in software. But you
can't do *all* of it in software.

And of course it's cheaper to do it in software. But that doesn't make
it *more* reliable. It doesn't even make it *as* reliable.

>>
>>>>
>>>> And, most of these RAID devices use custom chip sets - not something
>>>> off the shelf.
>>>
>>>
>>>
>>> That in itself is a red flag: they are far more complex and also get
>>> far less thoroughly exercised out in the field than more standard
>>> components - regardless of how diligently they're tested.
>>>
>> Gotten a cell phone lately? Chances are the chips in your phone are
>> custom-made. Each manufacturer creates its own. Or an X-BOX,
>> Nintendo, PlayStation, etc.? Most of those have custom chips. And
>> the same is true for microwaves, TV sets and more.
>>
>> The big difference is that Nokia can make 10M custom chips for its
>> phones; for a high-end RAID device, 100K is a big run.
>
>
> Exactly the point I was making: they get far less exercise out in the
> field to flush out the last remaining bugs. I'd be more confident in a
> carefully-crafted new software RAID implementation than in an equally
> carefully-crafted new hardware-plus-firmware implementation, because the
> former has considerably less 'new' in it (not to mention being easier to
> trouble-shoot and fix in place if something *does* go wrong).
>

One correction. They get *more* exercise in the plant and therefore
*need* far less exercise in the field to flush out remaining bugs.

And I'm glad you're more confident in a new software RAID implementation
than a new hardware-plus-firmware implementation. Fortunately, people
who need total reliability for critical data disagree with you.

> ...
>
>> You seem to think software is the way to go. Just tell me one thing.
>> When was the last time you had to have your computer fixed because of
>> a hardware problem? And how many times have you had to reboot due to
>> a software problem?
>
>
> Are you seriously suggesting that Intel and Microsoft have comparable
> implementation discipline? Not to mention the relative complexity of an
> operating system plus full third-party driver and application spectrum
> vs. the far more standardized relationships that typical PC hardware
> pieces enjoy.
>

First of all, I didn't say anything about Microsoft or Intel. You could
be talking *any* software or hardware manufacturer. The same thing goes
for Linux, OS/X or any other software. And the same thing goes for
Western Digital, HP, or any hardware manufacturer.

But you sidestep the question because I'm right.

> We're talking about reliability *performing the same function*, not
> something more like comparing the reliability of an automobile engine
> with that of the vehicle as a whole.
>

We're talking about software vs. hardware reliability. It's a fair
comparison.

>>
>> And you say software is as reliable?
>
>
> For a given degree of complexity, and an equally-carefully-crafted
> implementation, software that can leverage the reliability of existing
> hardware that already has to be depended upon for other processing is
> inherently more reliable - because code is code whether in software or
> firmware, but the software-only approach has less hardware to go wrong.
>

The fact still remains that hardware is much more reliable than
software. Even your processor runs on firmware (microcode). And your
disk drives have firmware. As do your printers, video adapter, Ethernet
port and even your serial port (unless you have a winmodem). And how
often do these fail? Or how about your cell phone, microwave or even
your TV set? These all have firmware, also. Even the U.S. power grid
is run by firmware.

I know of one bug in the Intel chips, for instance, in the early
Pentiums. A few had a very obscure floating point bug which under
certain conditions gave an incorrect result. This was a firmware bug
which was promptly corrected. There may have been others, but I haven't
heard of them.

The fact remains - firmware, because of the limited job it has to do and
limited interfaces can be (and is) tested much more thoroughly than any
general software implementation.

> ...
>
>>> I seriously doubt that anyone who's been talking with you (or at
>>> least trying to) about hardware RAID solutions has been talking about
>>> any that you'd find at CompUSA. EMC's Symmetrix, for example, was
>>> the gold standard of enterprise-level hardware RAID for most of the
>>> '90s - only relatively recently did IBM claw back substantial market
>>> share in that area (along with HDS).
>>>
>>
>> Actually, Symmetrix grew big in the small and medium systems, but IBM
>> never lost the lead in the top end RAID solutions. But they also were
>> (and still are) quite a bit more expensive than EMC's.
>
>
> You really need to point to specific examples of the kinds of 'higher
> end' RAIDs that you keep talking about (something we can look at and
> evaluate on line, rather than asking us to take your unsupported word
> for it). *Then* we'll actually have concrete competing approaches to
> discuss.
>

Talk to your IBM sales rep, for one. These devices are not available
"on line". How many mainframes do you see listed online? Or other
high-end systems?

Believe it or not, there are a lot of things which aren't available
online - because they are not general consumer products. And companies
who are looking for those products are not looking online.

> ...
>
> the software cannot
>
>>>
>>>> detect when a signal is getting marginal (it's either "good" or
>>>> "bad", adjust the r/w head parameters, and similar things.
>>>
>>>
>>>
>>> And neither can hardware RAID: those things happen strictly
>>> internally at the disk (for that matter, by definition *anything*
>>> that the disk externalizes can be handled by software as well as by
>>> RAID hardware).
>>>
>>
>> And here you show you know nothing about what you talk. RAID drives
>> are specially built to work with their controllers. And RAID
>> controllers are made to be able to do these things. This is very low
>> level stuff - not things which are available outside the
>> drive/controller.
>>
>> Effectively, the RAID controller and the disk controller become one
>> unit. Separate, but one.
>
>
> Provide examples we can look at if you want anyone to believe you.
>

Again, talk to your IBM sales rep.

>>
>>> Yes, it can
>>>
>>>> checksum the data coming back and read from the mirror drive if
>>>> necessary.
>>>
>>>
>>>
>>> Yup.
>>>
>>> Now, that *used* to be at least something of a performance issue -
>>> being able to offload that into firmware was measurably useful. But
>>> today's processor and memory bandwidth makes it eminently feasible -
>>> even in cases where it's not effectively free (if you have to move
>>> the data, or have to compress/decompress or encrypt/decrypt it, you
>>> can generate the checksum as it's passing through and pay virtually
>>> no additional cost at all).
>>>
>>
>> Sorry, Bill, this statement is really off the wall.
>
>
> Not at all: in fact, you just had someone tell you about very
> specifically comparing ZFS performance with more conventional approaches
> and finding it *better*.
>

And as I noted, there he was comparing apples and oranges, because he
wasn't comparing to a high end RAID array.

>>
>> Then why do all the high end disk controllers use DMA to transfer data?
>
>
> Because a) it's only been in the past few years that CPU and memory
> bandwidth has become 'too cheap to meter' (so controllers are still
> using the same approaches they used to and b) there's no point in
> *wasting* CPU bandwidth for no reason (DMA isn't magic, though: it
> doesn't save any *memory* bandwidth).
>

Nope, it's because it's faster and creates a lower load on the
processor. People still pay extra to get that capability. There must
be a reason why.

> When you're already moving the data, computing the checksum is free. If
> you're not, it's still cheap enough to be worth the cost for the benefit
> it confers (and there's often some way to achieve at least a bit of
> synergy - e.g., then deciding to move the data after all because it
> makes things easier and you've already got it in the processor cache to
> checksum it).
>

Actually, not. It takes cycles to compute the checksum, even if you're
doing it in hardware. It just takes fewer cycles (but more electronics)
to do it in hardware.

And as I noted before, the checksum is not transferred under normal
conditions.

>> Because it's faster and takes fewer CPU cycles than doing it
>> software, that's why. And computing checkums for 512 bytes takes a
>> significantly longer time that actually transferring the data to/from
>> memory via software.
>
>
> It doesn't take *any* longer: even with pipelined and prefetched
> caching, today's processors can compute checksums faster than the data
> can be moved.
>

Sure it takes longer. 4K of data can be transferred via hardware in as
little as 1K clock cycles (assuming a 32 bit bus). The same data
transfer in software takes around 10 times that long (I'd have to check
the current number of cycles each instruction in the loop takes to make
sure).

And there is no way the software can compute a 4K checksum in 1K cycles.
It can't even do it in 10K cycles.

But you've probably never written any assembler, so you don't know the
machine instructions involved or even the fact that a single instruction
can (and almost all do) take multiple clock cycles.
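
One way to put rough numbers on this, as a sketch (results vary with CPU,
clock rate and checksum algorithm; the 3 GHz figure is an assumption of mine):

import timeit
import zlib

BUF = bytes(4096)          # a 4 KiB block
GHZ = 3.0                  # assumed clock rate

secs = timeit.timeit(lambda: zlib.crc32(BUF), number=100_000) / 100_000
print(f"~{secs * 1e9:.0f} ns per 4 KiB CRC32, "
      f"roughly {secs * GHZ * 1e9:,.0f} cycles at {GHZ} GHz")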

>>
>> Also, instead of allocating 512 byte buffers, the OS would have to
>> allocate 514 or 516 byte buffers. This removes a lot of the
>> optimizaton possible when the system is using buffers during operations.
>
>
> 1. If you were talking about something like the IBM i-series approach,
> that would be an example of the kind of synergy that I just mentioned:
> while doing the checksum, you could also move the data to consolidate it
> at minimal additional cost.
>
> 2. But the ZFS approach keeps the checksums separate from the data, and
> its sectors are packed normally (just payload).
>

OK, so ZFS has its own checksum for its data. But this is not the same
as the disk checksum. And having to move the data yet again just slows
the system down even more.

>>
>> Additionally, differerent disk drives internally use differrent
>> checksums.
>>
>> Plus there is no way to tell the disk what to write for a checksum.
>> This is hard-coded into the disk controller.
>
>
> You're very confused: ZFS's checksums have nothing whatsoever to do
> with disk checksums.
>
> - bill

But you've repeated several times how the disk drive returns its
checksum and ZFS checks it. So now you admit that isn't true.

But with a good RAID controller and hardware at the system, you can be
assured the data is received on the bus correctly. Unfortunately, PC's
don't have a way of even checking parity on the bus.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware,

am 14.11.2006 22:52:31 von Robert Milkowski

Jerry Stuckle wrote:
> Bill Todd wrote:
> > You really need to point to specific examples of the kinds of 'higher
> > end' RAIDs that you keep talking about (something we can look at and
> > evaluate on line, rather than asking us to take your unsupported word
> > for it). *Then* we'll actually have concrete competing approaches to
> > discuss.
> >
>
> Talk to your IBM sales rep, for one. These devices are not available
> "on line". How many mainframes do you see listed online? Or other
> high-end systems?
>
> Believe it or not, there are a lot of things which aren't available
> online - because they are not general consumer products. And companies
> who are looking for those products are not looking online.
>


Hehehehehehhe, this is really funny.

So only those super secret IBM arrays that no one can read about can
do some magical things Jerry was talking about - they never fail, never corrupt
data, etc. Of course you can't read about them.

You know what - I guess you came from some other plane of reality.

Here, in this universe, IBM hasn't developed such wonderful devices, at least not
yet. And unfortunately no one else did. Here, even mighty IBM has implemented for
specific applications like Oracle some form of end-to-end integrity, 'coz even
their arrays can corrupt data (or something else between can).

Now go back to your top-secret universe and praise those top-secret technologies.
You're truly a beautiful mind.


ps. I guess we should leave him now and let him go

--
Robert Milkowski
rmilkowskiXXXX@wp-sa.pl
http://milek.blogspot.com

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, w

am 14.11.2006 23:00:33 von Bill Todd

Jerry Stuckle wrote:
> Robert Milkowski wrote:

....

>> Reliability comes from end-to-end data integrity, which HW RAID itself
>> can't
>> provide so your data are less protected.
>>
>
> It can provide integrity right to the connector.

Which is not the same thing at all, and not good enough to keep
occasional data corruption out even when using the highest-grade hardware.

The point of doing end-to-end checks in main memory is not *only* to get
that last smidgeon of reliability, of course: it's also about getting
reliability comparable to the very *best* hardware solutions while using
relatively inexpensive hardware.

>
>> Your RAID doesn't protect you from anything between itself and your host.
>> Vendors have recognized this for years; that's why IBM, EMC, Oracle, Sun,
>> Hitachi, etc. all provide some hacks for specific applications like Oracle.
>> Of course none of those solutions is nearly as complete as ZFS.
>> I know, you know better than all those vendors. You know better than people
>> who actually lost their data both on cheap and 500k+ (you seem to like this
>> number) arrays. It's just that for some reason you can't accept simple facts.
>>
>
> That's not its job.

So what? This discussion is not about individual component reliability,
but about overall subsystem reliability (otherwise, it would not include
the file system layer at all).

> Its job is to deliver accurate data to the bus. If you want further
> integrity checking, it's quite easy to do in hardware, also - i.e. parity
> checks, ECC, etc. on the bus. That's why IBM mainframes have parity
> checking on their channels.

So, of course, do commodity systems and the disks on them: it's been a
*long* time (going at least back to the days of old-style IDE) since
communication between host and disk was unprotected.
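
For the curious, link-level protection of this sort amounts to the sender
appending a CRC to each transfer and the receiver recomputing it before
accepting the payload. The toy sketch below uses CRC-32 purely for
illustration; the actual polynomials and framing used by UDMA and SATA differ:

# Rough illustration of link-level protection between host and device:
# the sender appends a CRC to each transfer and the receiver recomputes it.
# (Parallel ATA UDMA and SATA use CRCs in this spirit; the exact polynomials
# and framing differ from this toy example.)
import struct
import zlib


def frame(payload: bytes) -> bytes:
    """Sender side: append a CRC-32 of the payload."""
    return payload + struct.pack("<I", zlib.crc32(payload))


def deframe(transfer: bytes) -> bytes:
    """Receiver side: verify the CRC before accepting the payload."""
    payload, (crc,) = transfer[:-4], struct.unpack("<I", transfer[-4:])
    if zlib.crc32(payload) != crc:
        raise IOError("transfer CRC mismatch: request retransmission")
    return payload


sector = bytes(512)
assert deframe(frame(sector)) == sector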

And the fact that the sum of the individual checks that you describe is
insufficient to guarantee the level of data integrity that IBM would
like to have is why IBM supplements those checks with the kind of
end-to-end checks that they build into their i-series boxes.

....

>> Reliability comes from never overwriting actual data on a medium, so
>> you don't
>> have to deal with incomplete writes, etc.
>>
>
> That's where you're wrong. You're ALWAYS overwriting data on a medium.
> Otherwise your disk would quickly fill.

No, Jerry: that's where *you're* wrong, and where it becomes
crystal-clear that (despite your assertions to the contrary) you don't
know shit about ZFS - just as it appears that you don't know shit about
so many other aspects of this subject that you've been expostulating
about so incompetently for so long.

ZFS does not overwrite data on disk: it writes updates to space on the
disk which is currently unused, and then frees up the space that the old
copy of the data occupied (if there was an old copy) to make it
available for more updates. There's only a momentary increase in space
use equal to the size of the update: as soon as it completes, the old
space gets freed up.
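
A toy model of that update pattern (purely illustrative - this is not ZFS's
allocator or transaction machinery) makes the sequence concrete: write the
new copy to free space, switch the reference, and only then free the old copy:

# Simplified sketch of the copy-on-write update pattern described above:
# an update is written to free space first, the reference is switched to
# the new copy, and only then is the old copy's space freed. A toy model,
# not ZFS's actual allocator or transaction machinery.
class CowStore:
    def __init__(self):
        self.blocks = {}        # block_id -> bytes
        self.next_id = 0
        self.live = None        # id of the current (committed) copy

    def _allocate(self, data: bytes) -> int:
        block_id, self.next_id = self.next_id, self.next_id + 1
        self.blocks[block_id] = data
        return block_id

    def update(self, data: bytes) -> None:
        new_id = self._allocate(data)          # 1. write new copy to unused space
        old_id, self.live = self.live, new_id  # 2. switch the reference
        if old_id is not None:
            del self.blocks[old_id]            # 3. free the old copy afterwards

    def read(self) -> bytes:
        return self.blocks[self.live]


store = CowStore()
store.update(b"v1")
store.update(b"v2")
assert store.read() == b"v2"    # the old copy was never overwritten in place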

....

> A volume manager has nothing to do with ensuring data
> integrity on a RAID system.

In a software implementation, the volume manager *is* the RAID system.

>
>> Reliability comes from knowing exactly where on each disk your data is,
>> so if your RAID isn't full of data ZFS will resilver a disk in an
>> emergency MUCH faster, by resilvering only the actual data and not every
>> block on the disk. Also, because it understands the data on disk, it
>> starts the resilver from /, so even before the resilver is complete you
>> get some protection. A classic array can't do that, because it can't
>> understand the data on its disks.
>>
>
> Reliability comes from not caring where your data is physically on the
> disk.

Not in this instance: once again, you don't know enough about how ZFS
works to be able to discuss it intelligently.

The aspect of reliability that Bob was referring to above is the ability
of ZFS to use existing free space in the system to restore the desired
level of redundancy after a disk fails, without requiring dedicated idle
hot-spare disks - and to restore that redundancy more quickly by copying
only the actual data that existed rather than every sector that had
been on the failed disk (including those that were not occupied by live
data).
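
The difference is easy to see in a toy comparison (numbers and structure
invented for illustration; real arrays and ZFS do far more than this): a
block-level rebuild must touch every sector on the replacement disk, while a
filesystem-integrated resilver walks only the blocks it knows are live,
starting from the root of its tree:

# Toy comparison of a sector-by-sector RAID rebuild with a data-aware
# resilver that copies only blocks the filesystem knows are live.
DISK_BLOCKS = 1_000_000           # blocks per disk
live_blocks = set(range(50_000))  # filesystem knows which blocks hold data


def raid_rebuild() -> int:
    """A block-level array reconstructs every sector, used or not."""
    copied = 0
    for _ in range(DISK_BLOCKS):
        copied += 1               # reconstruct and write each block
    return copied


def data_aware_resilver() -> int:
    """A filesystem-integrated rebuild walks only the allocated blocks."""
    copied = 0
    for _ in live_blocks:         # traversal starts at the root of the tree,
        copied += 1               # so the most critical metadata comes first
    return copied


print(raid_rebuild())             # 1000000 blocks touched
print(data_aware_resilver())      # 50000 blocks touched on a 5%-full pool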

....

> there are things a RAID can do that ZFS cannot do.

ZFS *includes* RAID (unless you think that you're better-qualified to
define what RAID is than the people who invented it before the term even
existed and the later people who formally defined the term).
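
For concreteness, the redundancy arithmetic at the heart of parity RAID is
the same whether a hardware controller or a host-side volume
manager/filesystem performs it. Below is a simplified single-parity sketch,
not RAID-Z's or any particular array's actual layout; real implementations
rotate parity, handle partial stripes, and much more:

# Toy sketch of single-parity striping (RAID-5 style) done in software, to
# make the point that the redundancy math is the same whether a hardware
# controller or a host-side volume manager / filesystem performs it.
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)


def write_stripe(data_blocks):
    """Return the blocks to store: the data plus one parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def reconstruct(stripe, missing_index):
    """Rebuild one lost block from the surviving blocks of the stripe."""
    survivors = [b for i, b in enumerate(stripe) if i != missing_index]
    return xor_blocks(survivors)


stripe = write_stripe([b"AAAA", b"BBBB", b"CCCC"])
assert reconstruct(stripe, 1) == b"BBBB"    # lost data block recovered
assert reconstruct(stripe, 3) == stripe[3]  # lost parity block recovered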

And for that matter the only things that you've been able to come up
with that even your own private little definition of RAID can do that
ZFS can't are these alleged special disk-head-level checks that your
alleged super-high-end arrays allegedly cause them to perform.

I see that you've now replied to my morning post with a great deal of
additional drivel which simply isn't worth responding to - because, despite
my challenging you several times to point to *a single real example* of this
mythical super-high-end hardware that you keep babbling about, you just
couldn't seem to come up with one that we could look at to evaluate whether
you were completely full of shit or might have at least some small basis for
your hallucinations (though you do seem to have admitted that these mythical
arrays use conventional disks after all, despite your previous clear
statements to the contrary).

When I realized that, boring and incompetent though you might be in
technical areas, you presented the opportunity to perform a moderately
interesting experiment in abnormal psychology, I decided to continue
talking to you to see whether referring to specific industry standards
(that define the communication path between host and commodity disk to
be handled very differently than you have claimed) and specific
manufacturer specifications (that define things like ECC lengths in
direct contradiction to the 'facts' that you've kept pulling out of your
ass) would make a dent in the fabric of your private little fantasy
world. Since it's now clear to me that you're both nutty as a fruitcake
and completely and utterly ineducable, that experiment is now at an end,
and so is our conversation.

- bill

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, w

am 14.11.2006 23:10:31 von Jerry Stuckle

Robert Milkowski wrote:
> Jerry Stuckle wrote:
>
>>Bill Todd wrote:
>>
>>>You really need to point to specific examples of the kinds of 'higher
>>>end' RAIDs that you keep talking about (something we can look at and
>>>evaluate on line, rather than asking us to take your unsupported word
>>>for it). *Then* we'll actually have concrete competing approaches to
>>>discuss.
>>>
>>
>>Talk to your IBM sales rep, for one. These devices are not available
>>"on line". How many mainframes do you see listed online? Or other
>>high-end systems?
>>
>>Believe it or not, there are a lot of things which aren't available
>>online - because they are not general consumer products. And companies
>>who are looking for those products are not looking online.
>>
>
>
>
> Hehehehehehhe, this is really funny.
>
> So only those super secret IBM arrays that no one can read about can
> do some magical things Jerry was talking about - they never fail, never corrupt
> data, etc. Of course you can't read about them.
>

Not at all super secret. Just not available online.

> You know what - I guess you came from some other plane of reality.
>
> Here, in this universe, IBM hasn't developed such wonderful devices, at least not
> yet. And unfortunately neither has anyone else. Here, even mighty IBM has implemented,
> for specific applications like Oracle, some form of end-to-end integrity, 'coz even
> their arrays can corrupt data (or something else in between can).
>
> Now go back to your top-secret universe and praise those top-secret technologies.
> You're a truly beautiful mind.
>
>
> ps. I guess we should leave him now and let him go
>

Try asking your IBM salesman, troll.

How many mainframe printers do you see on the internet? Tape drives?
Disk arrays? Or are you telling me IBM doesn't sell those, either?
They're sold through their Marketing division.

Crawl back into your hole with the other trolls.


--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware,

am 14.11.2006 23:38:29 von Robert Milkowski

Jerry Stuckle wrote:
>
> How many mainframe printers do you see on the internet? Tape drives?
> Disk arrays? Or are you telling me IBM doesn't sell those, either?
> They're sold through their Marketing division.

Here you can find online info about IBM's mainframes, enterprise storage, enterprise
tape drives, and libraries.

http://www-03.ibm.com/systems/z/
http://www-03.ibm.com/servers/storage/disk/
http://www-03.ibm.com/servers/storage/tape/


However, not a single word about your mythical array.
I told you, go back to your universe.

--
Robert Milkowski
rmilkowskiWW@wp-sa.pl
http://milek.blogspot.com

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, w

am 15.11.2006 02:17:43 von Jerry Stuckle

Bill Todd wrote:
> Jerry Stuckle wrote:
>
>> Robert Milkowski wrote:
>
>
> ...
>
>>> Reliability comes from end-to-end data integrity, which HW RAID
>>> itself can't
>>> provide so your data are less protected.
>>>
>>
>> It can provide integrity right to the connector.
>
>
> Which is not the same thing at all, and not good enough to keep
> occasional data corruption out even when using the highest-grade hardware.
>
> The point of doing end-to-end checks in main memory is not *only* to get
> that last smidgeon of reliability, of course: it's also about getting
> reliability comparable to the very *best* hardware solutions while using
> relatively inexpensive hardware.
>

Really, Bill, you can't remember details from one post to the next.
This has already been covered multiple times.

One last thing before I do to you what I do to all trolls.

If everything you claim were true, there would be no hardware RAID
devices. There would be no market for them, because your precious ZFS
would negate any need for them.

But it's a good thing those in charge of critical systems know better.
And they disagree with you - 100%. That's why there is such a market,
why manufacturers build them, and why customers purchase them.

But obviously great troll Bill Todd knows more than all of these
manufacturers. He knows better than all of these customers. In fact,
he's such an expert on them that he doesn't need any facts. He can make
up his own.

And BTW - I told you how to get information on them. See your IBM Rep.
He can fill you in on all the details. Because not everything is on
the internet. And only someone with their head completely up their ass
would think there is.

So long, troll.





--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, w

am 15.11.2006 02:20:49 von Jerry Stuckle

Robert Milkowski wrote:
> Jerry Stuckle wrote:
>
>>How many mainframe printers do you see on the internet? Tape drives?
>>Disk arrays? Or are you telling me IBM doesn't sell those, either?
>>They're sold through their Marketing division.
>
>
> Here you can find online info about IBM's mainframes, enterprise storage, enterprise
> tape drives, and libraries.
>
> http://www-03.ibm.com/systems/z/
> http://www-03.ibm.com/servers/storage/disk/
> http://www-03.ibm.com/servers/storage/tape/
>
>
> However, not a single word about your mythical array.
> I told you, go back to your universe.
>

Yep, you found some of their products. But do you really think these
are all of their products? Not a chance.

As I said. Contact your IBM Rep.

But I'm going to do you like I did the other troll, Bill.

So long, troll.



--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, w

am 09.12.2006 17:26:50 von alf

Jerry Stuckle wrote:

>>> There's a lot more to it. But the final result is these devices have
>>> a lot more hardware and software, a lot more internal communications,
>>> and a lot more firmware. And it costs a lot of money to design and
>>> manufacture these devices. That's why you won't find them at your
>>> local computer store.
>> [...] it's cheaper to have multiple mirrors.
>
> There are a few very high-end ones that use 3 drives and compare everything
> (2 out of 3 wins). But these are very, very rare, and only used for the
> absolutely most critical data (i.e. space missions, where they can't be
> repaired/replaced easily).
>

can you actually name them and provide links to specific hardware
manufacturers' web sites?

--
alf

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, w

am 09.12.2006 17:55:41 von alf

Jerry Stuckle wrote:

>
> Real RAID arrays are not cheap. $100-500/GB is not out of the question.
> And you won't find them at COMP-USA or other retailers.
>


Does not RAID stand for 'redundant array of inexpensive disks' :-)?

--
alfz1

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware,

am 09.12.2006 18:28:25 von Toby

alf wrote:
> Jerry Stuckle wrote:
>
> >
> > Real RAID arrays are not cheap. $100-500/GB is not out of the question.
> > And you won't find them at COMP-USA or other retailers.
> >
>
>
> Does not RAID stand for 'redundant array of inexpensive disks' :-)?

It stands for "false sense of security".

>
> --
> alfz1

Re: ZFS vs RAID, was Re: MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, w

am 09.12.2006 18:46:57 von alf

toby wrote:
> alf wrote:
>
>>Jerry Stuckle wrote:
>>
>>
>>>Real RAID arrays are not cheap. $100-500/GB is not out of the question.
>>> And you won't find them at COMP-USA or other retailers.
>>>
>>
>>
>>Does not RAID stand for 'redundant array of inexpensive disks' :-)?
>
>
> It stands for "false sense of security".
>

agreed, plus the politically correct expansion of RAID is "redundant array of
independent disks" :-)