Uh, oh. Our first disaster at KH.

Karlis · Nov 3, 2015

We have been with KH for a few years. We have experienced short downtimes here and there, but support and recovery were usually excellent... until today. Our VPS went down due to a RAID controller failure. Within an hour, KH restored the VPS (excellent); however, the database (InnoDB) had crashed, cPanel did not work, and there were other issues. Everything was sorted out fairly quickly, except for the DB, which is essential for our app. We did have backups. And this is where the disaster begins...

02:18 PM - we receive a message that senior staff are looking at the problem, as MySQL fails to start correctly.
03:27 PM - more than an hour later, KH offers to restore from the nightly backup. We say that we'd rather not, but that we do have a backup taken 20 minutes before the crash (we do a backup once an hour) as a last resort for restoring our DB.
03:56 PM - we receive confirmation from KH that our backup is the best way to proceed.
04:14 PM - we have uploaded the dump and give KH a clear green light to restore our DB from our backup.
06:12 PM - a full two hours later, KH announces that the "restore" is successful and that we should re-check. We re-checked and the DB was empty. KH apparently did not proceed as we had planned and stubbornly persisted in attempting to recover the corrupted DB instead of restoring from the dump we provided.

It took us around 20 min to create a new DB from our dump and re-launch the service. We could have relaunched much earlier had we known that the recovery done by KH would fail, or that KH would not stick to the agreed-upon plan to use our DB dump. Basically, KH wasted multiple precious hours in an unsuccessful attempt to recover 20 min worth of data that we would have preferred not to lose, but were OK to lose.
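For readers who have not done this before, here is a minimal sketch of what creating a new DB from a dump can look like. It assumes the dump is a plain SQL file produced by mysqldump and that credentials come from a client config such as ~/.my.cnf. The database name and path are hypothetical, and this is not necessarily the exact procedure we followed.

```python
import shlex
from typing import List


def restore_commands(db_name: str, dump_path: str) -> List[str]:
    """Build the two shell command lines for restoring a SQL dump into a fresh DB.

    The names passed in are illustrative; nothing is executed here.
    """
    # Keep the database name simple so it is safe inside backticks.
    if not db_name.replace("_", "").isalnum():
        raise ValueError(f"unexpected database name: {db_name!r}")

    # Step 1: create an empty database to load into.
    create = "mysql -e " + shlex.quote(f"CREATE DATABASE `{db_name}`")
    # Step 2: feed the dump file to the mysql client on stdin.
    load = f"mysql {shlex.quote(db_name)} < {shlex.quote(dump_path)}"
    return [create, load]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for line in restore_commands("app_restored", "/backups/app_hourly.sql"):
        print(line)
```

The sketch only prints the commands, so they can be reviewed before anyone runs them against a live server.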

So here are the questions to KH:
1) Is it normal for InnoDB databases to crash so badly that the whole database is gone?
2) Is it normal practice to spend ~4 hours to ultimately discover that the DB data is unrecoverable?
3) Why was the proposed and confirmed plan to restore from our dump not followed?
4) Why are the responses so slow - at times we had to wait for over an hour to get the next response?

I do understand that hardware failures do happen and that data gets lost even on RAID arrays, and for this we do have hourly backups. But, come on, why does it take more than 4 hours to negotiate a simple MySQL DB recovery?

I am not going to judge KH from this one event, but I do believe that there is plenty of room for improvement in communication at least... both by KH and by us.

P.S. Many hours after the recovery, our VPS is way, way slower than it was before (CPU is idling; the slowness comes from disk reads). I'm told that this is because of recovery procedures being done by other tenants. Tomorrow we shall see if it gets any better...

Dion · Nov 3, 2015

When InnoDB crashes it is catastrophic; every InnoDB table is rendered unusable. If all the tables in the database are InnoDB, then you have lost the entire database. MyISAM crashes are table-specific; at worst you will lose a single table.

This is the reason why I will only use InnoDB on tables that are written/updated constantly, such as session tables in many applications, and the _options and _usermeta tables in WordPress.

If the database is a manageable size, you should consider changing tables that aren't written/updated constantly to MyISAM. If you are using MariaDB, the Aria engine is a better replacement than MyISAM.

Karlis · Nov 4, 2015

Dion, we cannot use MyISAM due to table locks. We have frequent writes and the DB is a few GB.

Anyway, why KH kept cycling on trying to restore our crashed InnoDB (which you are saying is on the impossible side) when we had a reasonably fresh backup is unclear. But what's done is done.

More worrying is that our East Coast VPS is still IO-slow today, at a close to unacceptable level: there is plenty of spare CPU/RAM, yet HTTP response times are multiple times slower than previously, and sometimes there is a huge disk read slowdown which causes the HTTP process queue to be exhausted. It was never this bad with KH's VPSes. We are at a loss as to what to do. My guess is that either the new RAID controller is crappy or something is wrong with the disks. It was blazing fast before (same app, same load, same traffic).

KH-DanielP · Nov 4, 2015

@Karlis,

We do apologize for the issues that resulted from the hardware failure.

As Dion mentioned, InnoDB can be quite sensitive to hard crashes, and recovery can at times be hit or miss. We do have success in recovering most of the data by using various InnoDB recovery levels and performing full database dumps of the impacted tables. Depending on the size of the information being extracted, it can take a while to determine whether this method was successful or not.
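As a rough illustration of the "recovery levels" approach described above (a sketch under stated assumptions, not KH's actual procedure): MySQL's innodb_force_recovery setting takes values from 1 to 6, and the usual advice is to start at 1, try to dump the data, and only raise the level if that fails, since the MySQL documentation warns that levels 4 and above can permanently corrupt data. The database name, dump directory, and the recovery_plan helper below are hypothetical.

```python
from typing import List, NamedTuple


class RecoveryStep(NamedTuple):
    level: int     # value to set for innodb_force_recovery
    my_cnf: str    # config fragment to place in my.cnf before restarting mysqld
    dump_cmd: str  # dump to attempt once the server starts at this level
    risky: bool    # True for levels the MySQL docs flag as potentially destructive


def recovery_plan(db_name: str, dump_dir: str, max_level: int = 6) -> List[RecoveryStep]:
    """List the escalation steps, lowest recovery level first."""
    if not 1 <= max_level <= 6:
        raise ValueError("innodb_force_recovery levels range from 1 to 6")

    steps = []
    for level in range(1, max_level + 1):
        steps.append(
            RecoveryStep(
                level=level,
                my_cnf=f"[mysqld]\ninnodb_force_recovery = {level}",
                dump_cmd=f"mysqldump {db_name} > {dump_dir}/{db_name}-level{level}.sql",
                risky=level >= 4,
            )
        )
    return steps


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for step in recovery_plan("appdb", "/root/rescue"):
        flag = " (risky: may corrupt data)" if step.risky else ""
        print(f"level {step.level}{flag}: {step.dump_cmd}")
```

Each level means a server restart plus a full dump attempt, which is part of why this route can take hours on a database of a few GB before it is clear whether it worked.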

I've not reviewed your exact ticket yet; however, I will ensure it is reviewed and that we determine why your instructions to recover from the backup you provided were not followed. I do apologize for this.

In regards to the node, I can assure you the RAID controller is not crappy, as we use identical hardware between our nodes to allow for interchangeability between them. Since the controller crashed on the previous system and the drives were moved across, the array is currently in the process of re-initializing, which does add some burden to the disks and overall I/O speed. The time it takes for this process to complete varies with a lot of factors, so I can't give an exact time frame for when it will be fully completed, but we are also monitoring the node for any other activities to help lessen the impact of this.

Once your ticket has been reviewed, I'll have our QA Manager get in touch with you regarding the outcome.

Karlis · Nov 4, 2015

Looks like another simple miscommunication.

We were not told by Roman that the RAID was rebuilding when we asked about the sluggishness yesterday, so we figured maybe this was the "new normal" situation. There was also no word on the forum thread about system status. However, when we re-asked today, Derrick responded promptly with this info and that clarified everything. We stopped worrying right that second.

I do appreciate you looking into our ticket and hope something good comes out of it for the future.

Anyway, all seems back to normal today and we do plan to stick around and even upgrade to SSD, so crisis averted...

We survived and KH regained our trust.

atul · Mar 30, 2016

How often do such hardware failures happen at KH? Where did this happen, in the West Coast DC or the East Coast DC? At present I have purchased a VPS in the West Coast DC, and now I am thinking of moving to the East Coast DC to get better speed, because our clients are from India and so on.

phpAddict · Mar 30, 2016

Hard drive failures can happen anywhere, and that's the reason for RAID; that and big cockroaches.

I would be more concerned if a host failed to recognize a failing drive for days or weeks. In this case it was noticed within an hour.

Rebuilding a RAID array = temporary sluggishness
rather than destroyed data = recovering from backups, assuming you have them, creating lengthy total down time

This is not the kind of thing to use for comparing failures, and especially not speed, between West Coast and East Coast, unless of course @KH-Jonathan is installing refurbished or used 10-year-old drives in the West Coast DC and new ones in the East. When it comes to downtime with KnowHost, as I monitor the Network and Hardware Status, almost all issues are isolated to a single server, and rarely if ever is there a site-wide issue affecting the entire DC.

KH-Jonathan · Mar 30, 2016

We have thousands of hard drives in production. Failure is inevitable. We limit the effects of this as much as possible with the use of RAID and enterprise-grade HDDs. Western Digital, for what it's worth.
