No mention of a full node failure and data loss!

opoloko

Member
10 hours ago my VPS went down, late at night here, and support said it was a node problem. 8 hours later it is still down, with support saying "sorry we lost data we need to fully restore and we have only backup up to such&such date, can we proceed?".

This is the worst thing that has happened in a decade with KH, there is no mention of it here, and that is VERY worrying. There are some databases with realtime data from a store, so transactions and other vital information will be lost.

There should be a full post-mortem report with an explanation, compensation offered, and most of all a system put in place to avoid such problems.
 
As a further update, after HOURS of keeping the server down waiting for a reply from me (not that I had any choice), support said that I was moved to a new node (what I would expect if a node fails!) and so no restore was needed!

Once again, communication was poor: the VPS should have been moved to a new node, not left down for hours waiting for a reply from me about a restore (losing data) that in the end was not even needed!
 
As yet another surprise, now they say that actually NO, I do need a full restore because files were corrupted.
Once again, this is the worst thing that has ever happened at Knownhost, and communication is incredibly bad and confusing.
 
Howdy opoloko,

Apologies for our admin on shift not posting the typical notice. Because the problems were a bit more than expected he remained focused on getting everyone evacuated and things sorted and forgot the post.

Long and the short of the post mortem is that the node experienced a partial hardware failure, which resulted in corruption of some data / user disks. After it was confirmed the system was no longer stable, plans were enacted to move all containers off of this equipment to online hot spares. In an effort to preserve as much data as possible, the existing data was moved first; only once that move is completed for a container can it be fully evaluated to determine the extent of the damage and whether a restore is needed.

So far we've only had to restore 2 containers, unfortunately your container was one of them, but this is also why we take and maintain rigorous backups.

I do apologize that you got caught up in this. We do our best to ensure the reliability and redundancy of our equipment, including monitoring for any signs of failure, redundant arrays, etc., but even with all of that, things can and do fail. Specifically in this case, the reason corruption happens in this kind of crash is that any data still in memory cannot be written to disk, which causes the underlying corruption.

Regarding any compensation, this will of course be covered under our SLA so please reach out to our billing department and they'll get that sorted for you.

Again we do apologize for the issues encountered, our admins immediately started corrective action once it happened and are still engaged on the issue for any remaining customers.
 
Hi Daniel, I appreciate that, and I know all of this happens. The worry is that I have other VPSs with you, and in those cases such a problem could cause big losses, so it worries me a bit.

I do understand it happens, and it seems I was one of the unlucky ones. I just thought that if all is in arrays this would not happen, but I suppose it depends where the hardware breakdown happens.

Anyway, I do appreciate your explanation: support's explanations were a bit confusing, and it was probably unlucky that it happened late at night here, so I was unable to reply immediately to the restore-from-backup email.

Thanks for your reply.
 
This is unacceptable. Data loss and downtime need real accountability now.
I do agree. I struggle to understand how a fully redundant setup for a VPS (which is what I expect, apart from backups) can actually lose data. Some downtime, fair enough, but not data loss.

Also, this is the second time I have had a problem with one of my VPSs because of a node failure... and in one instance it went on for months (poor performance, apps killed for supposed RAM usage spikes, and random cPanel errors), and only in the end was there an admission that it was a node problem; I was moved to a new node and all was solved.

This time I lost data and, in common with the other time (two different VPSs and accounts), the monitoring system didn't really detect any anomaly... I had to write in, and only from the reply did I learn that they knew there was a node problem.

I think KH is still a fantastic service, but these node failures and the poor management of resolving them make me worry... imagine the money lost to data loss in an ecommerce store... it's not acceptable.
 
Benjamin has been removed from this conversation as he is neither a customer nor a legitimate poster, with multiple IPs linked to known spam sources and VPNs.

I sincerely apologize for the data loss you experienced—while extremely rare, it did happen, and there's no undoing that.

Regarding your past issue, I'd need to review your previous tickets to provide a full response. That said, we’ve significantly improved our internal monitoring and balancing systems to better detect and mitigate node-level problems before they escalate.

We completely understand that downtime and data loss are unacceptable, especially for businesses relying on their VPS for critical operations. Our standard plans are designed to offer the best balance of performance, redundancy, and cost. However, for those requiring near-zero downtime and data loss mitigation, we offer higher-tier solutions with additional failover protections. These come at a higher cost, as true high-availability infrastructure requires greater investment.

We always aim to strike the best balance between cost and reliability, and while no system is infallible, we continuously refine our approach to minimize risk and improve response times.
 
Hi Daniel, as usual thanks for this.

I can go on in private about the other incident, but most crucially I'm quite interested in these higher-tier solutions with additional failover protections for one of my VPSs.

Is it something we could chat about privately in DM, or shall I ask Billing something specific or check online? We're also using the legacy plans on some VPSs as we need to (better burst performance), so if there were a way to have something more tailored, that would be great.
 
Easiest way is to just hop on sales chat on the main site and ask for me. I've got time this AM, so I can sync up with you there.
 