
Tue, July 18, 2006 - 10:16 PM
Sigh. Sometimes no matter how hard I try...
Over the weekend my DBA (database administrator) copied our database from one SAN (a storage area network: a number of disk drives which appear to be a single disk drive) to a newer, faster one. It should have been a simple, risk-free operation, intended to increase the speed of our large computer system. We planned it well, or so we thought.
We scheduled six hours of downtime. If you think it's easy negotiating with our users for downtime like that, you don't know the retail industry. But we got our downtime, and the DBA began copying the data.
He had eight files, each 20 to 40 gigabytes, so he copied all eight at the same time using the simple copy command from Unix. That was the problem. He should have used a different command, but how was he to know?
He finished on time, and when I came into the office on Monday things looked fine. Then I looked at the system statistics and found something strange. On Friday, before the copy, we were doing a couple of thousand disk operations a second. On Monday, after the copy, we were doing four to five times as many. Something was wrong.
I have a couple of dozen consultants working for me, and they all shrugged. Not a problem, they said. But I've been working in this business for over 25 years now and I am very, very good at this, and the numbers made no sense at all. No one was giving me an answer, and by 5pm Monday I was getting quite upset.
The numbers didn't make sense and the performance didn't make sense.
I called some people, put together a team of experts, and more or less told them: find me the answer. Now. They worked all night long, and at 2:43am I received an email identifying the problem.
Fragmentation.
The problem was very bad. Very, very, very bad.
Files on disks can be stored in two ways: contiguous, meaning their data is more or less all together on the disk, or fragmented, meaning it is scattered in bits and pieces all over the disk.
Imagine you took a piece of paper and ripped it into four pieces. Each piece holds some of the data, say part of a treasure map, plus the location of the next piece of paper. Gather them all up and you have the complete treasure map, but you have to chase down every piece first.
Our DBA had essentially ripped each of these eight files into 17 MILLION pieces. Very tiny pieces.
So the process of gathering up the data in a file became very expensive for the hardware, because every read had to chase pieces scattered all over the disks. We found that all 144 disks were running at 100%. Literally, 100%.
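A toy sketch in Python shows how copying several files at once can shred them. It assumes a deliberately naive disk that hands out the next free block to whichever file writes next; real file systems are smarter than this, and the post does not say which file system was involved, so treat it as a worst-case illustration only. The function names and the block counts are made up for the example.

```python
def allocate(schedule):
    """Hand out consecutive block numbers in the order writes arrive.

    `schedule` lists one file name per block written. Returns a dict
    mapping each file name to the block numbers it received.
    """
    layout = {}
    for block_number, name in enumerate(schedule):
        layout.setdefault(name, []).append(block_number)
    return layout


def count_fragments(blocks):
    """Count runs of consecutive block numbers; 1 means fully contiguous."""
    if not blocks:
        return 0
    fragments = 1
    for prev, cur in zip(blocks, blocks[1:]):
        if cur != prev + 1:
            fragments += 1
    return fragments


def sequential_schedule(files, blocks_per_file):
    """One file is written completely before the next one starts."""
    return [name for name in files for _ in range(blocks_per_file)]


def interleaved_schedule(files, blocks_per_file):
    """All files are written at once, taking turns one block at a time."""
    return [name for _ in range(blocks_per_file) for name in files]


files = [f"file{i}" for i in range(1, 9)]  # eight files, as in the story
blocks_per_file = 1000                     # hypothetical, tiny for the demo

one_by_one = allocate(sequential_schedule(files, blocks_per_file))
all_at_once = allocate(interleaved_schedule(files, blocks_per_file))

worst_sequential = max(count_fragments(b) for b in one_by_one.values())
worst_interleaved = max(count_fragments(b) for b in all_at_once.values())
print("fragments per file, one at a time:", worst_sequential)
print("fragments per file, all at once:  ", worst_interleaved)
```

In this model a file copied on its own lands in a single piece, while a file copied alongside seven others ends up with one fragment per block, since its neighbours keep claiming the blocks in between.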
Now we have to repair the damage. We've scheduled a two-hour downtime slot each evening for the next couple of weeks. The DBA will be moving data around until it's back to where it should be.
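For illustration only, here is one way to keep copies from competing with each other: do them strictly one after another. This is a minimal Python sketch, not the command the DBA should have used (the post never names it) and not the repair procedure itself. The helper name `copy_one_at_a_time` and the paths in the usage note are hypothetical.

```python
import shutil
from pathlib import Path


def copy_one_at_a_time(sources, dest_dir):
    """Copy each source file into dest_dir, finishing one before starting the next.

    Returns the list of destination paths in the order they were written.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sources:
        src = Path(src)
        target = dest_dir / src.name
        # copyfile returns only after this file's copy call has completed,
        # so no two files are being written at the same moment.
        shutil.copyfile(src, target)
        copied.append(target)
    return copied
```

It would be called as something like `copy_one_at_a_time(["/old_san/data01.dbf", "/old_san/data02.dbf"], "/new_san")`, where both paths are placeholders. The trade-off is time: a serial copy takes longer than a parallel one, which matters when the downtime window is only a couple of hours.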
Sigh. In the meantime, the system is performing slower than it should, yet it's doing one hell of a lot of work to do what it's doing.
Auntie Pita (cause I am one!) (Wed, July 19, 2006 - 9:48 AM)
I'm shuddering just thinking about that!!!!
Deena (Wed, July 19, 2006 - 2:20 PM)
Oh man! That's scary!
Unless otherwise noted, all photos and text are Copyright © Richard G Lowe, Jr.