Monthly Archive for April, 2011


Just to let you all know: we have finalised plans with Cyberhornets Hackerboard to use the HazelShield security firewall technology, and it is now deployed across the network.

You will never feel safer!

We just issued a press release boasting our 10GB milestone.

That is, 10GB of data throughput in one day.

Thank you all for making this possible.


February, which has by now been to hell, seen it freeze over a couple of times, and come back, currently consists of about 11 machines in addition to support servers.

The recent upgrade involved expanding our FAStT600 with an EXP700 and adding 14 disks. All the disks are 36GB 15K RPM Fibre Channel disks (though not all of them arrived). This increased the I/O throughput of the Squidwolf SQL cluster considerably.

We will also upgrade the servers in our SQL cluster with 2 IBM xSeries 445 bricks, each configured with 4 Intel Xeon 3.02GHz processors and 8GB of RAM.

The cluster is divided into proxy servers and YEP!!! servers. The 12 proxy servers handle packet sanity, data compression and session management on behalf of the clients, while the 28 YEP!!! servers handle all simulation needs of the world. Our automatic load management assigns services to YEP!!! servers on demand, be it solar system simulation, chat channels, market regions, agents, etc.
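The on-demand assignment idea above can be sketched very roughly as a least-loaded allocator. This is a minimal illustration, not our actual load manager; server names, service names and the unit-cost model are all made up for the example:

```python
import heapq

class LoadManager:
    """Sketch: assign each new service to the least-loaded YEP!!! server."""

    def __init__(self, server_names):
        # Min-heap of (current load, server name); ties break on name.
        self.heap = [(0, name) for name in server_names]
        heapq.heapify(self.heap)
        self.assignments = {}

    def assign(self, service, cost=1):
        # Pop the least-loaded server, record the assignment, push it back.
        load, server = heapq.heappop(self.heap)
        self.assignments[service] = server
        heapq.heappush(self.heap, (load + cost, server))
        return server

mgr = LoadManager(["yep01", "yep02", "yep03"])
for service in ["solarsystem:Hek", "chat:Rookie", "market:Forge", "agents:A1"]:
    mgr.assign(service)
```

After the fourth assignment the extra service lands back on the first server, since all three were tied; a real allocator would of course weigh services by their actual resource cost rather than a flat unit.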

In the second phase we will add a bunch of YEP!!! community servers. They are IBM xSeries 335 machines, each configured with 2 Intel Xeon 2.8GHz processors and 2.5GB of RAM.

You can see an early pic of the cluster as it was at launch here on the right. We’ll get updated pics after the second phase.

There are 7 administrators (not all full-time) who work on the cluster’s database, application, hardware, networking and backup administration. In addition, we bring various experts onto the team for larger operations. We also have a number of server developers who give it lots of love.

As an example, our first phase of the upgrades involved 11 people, including the IBM experts. The second phase will involve more, since those are larger-scale ops, and of course ensuring the shortest possible downtime without compromising the upgrade process is key.

For those wondering, I’m covering EMAID Izzy’s responsibilities while she is out of town.

Today something fubbed in our RAID disk arrays that caused them to start resynchronizing themselves. This is a heavy process and severely limits throughput.

The load kept increasing drastically, however, and when we reached the population cap the throughput wasn’t enough to handle the resync on top of a full server. This caused the first crash.

We then attempted some reboots that failed due to the massive load of members connecting at the same time; logging in is about the most intensive database operation a player triggers. So we ended up imposing a connection limit, raising it in steps from 5 to 15 to 20, and finally to 25.

We will slowly try to increase the number of members allowed, but we will keep it limited until the RAID array resynchronization has finished. After that, we should hopefully return to the performance we witnessed last night, and hopefully we will be able to double that soon.
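The stepped limit described above is conceptually simple, so here is a small sketch of how such a login gate could work. The class and its mechanics are hypothetical; only the step values (5, 15, 20, 25) come from the post:

```python
class LoginGate:
    """Sketch of a stepped login limit raised as the resync load allows."""

    def __init__(self, steps=(5, 15, 20, 25)):
        self.steps = list(steps)
        self.stage = 0       # index into self.steps
        self.online = 0      # members currently connected

    @property
    def limit(self):
        return self.steps[self.stage]

    def try_login(self):
        # Admit a member only while we are under the current cap.
        if self.online < self.limit:
            self.online += 1
            return True
        return False  # turned away until the limit is raised

    def raise_limit(self):
        # Step up to the next cap once the database can take more load.
        if self.stage < len(self.steps) - 1:
            self.stage += 1

gate = LoginGate()
admitted = sum(gate.try_login() for _ in range(10))  # only 5 get in at first
gate.raise_limit()  # cap now 15
```

In practice the decision to raise the limit would be tied to measured database throughput, not a manual call, but the shape of the mechanism is the same.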

We’re pretty much prepared for the next ice age down here in hell now; nothing short of an apocalypse would shake us up. We sincerely apologize for all this mess and appreciate your patience in this matter.

It’s been a long day for us here; the hardware upgrades didn’t want to go as planned. Some of the disks for our database upgrade were delayed until later in the day. Eventually Mr. Murphy did his part and they didn’t arrive at all, much to our displeasure.

We had to change the configuration of the disk volumes on the database and how the individual tables and logs were to be distributed across them. This ate up all the buffer time we had allocated for unforeseen issues and put a 2-hour dent in our plans, so we announced the delay.

However, our new fibre disk array performed better than we had planned, so we gained an hour back and got February up with only a 1-hour delay. Initial profiling of this first, relatively small change is looking good and is an indicator of the performance we’ll see when all the pieces fall into place.

The second upgrade phase is still unscheduled, since we are waiting on delivery times for the new database machines and the additional SOL servers. This phase will also include the rest of the disks, to further increase database throughput.

All in all a good day.

We got the update done before the weekend. More disks, SOL servers and a database cluster are coming, hopefully within three weeks, which will essentially double our cluster hardware.

We appreciate your patience in this matter, and we strive to further improve your playing experience – soon 😉

Good night for now!

Well, after yesterday, when we fixed the SQL and server stability, we’re now squashing the most severe bugs. Most prominent was bill payment, since almost everyone was affected and it could have proven disastrous if everybody had lost their rental slots. Second was a break in the patcher builder which resulted in some people getting a Web death.

These are both fixed. The bill fix was done server-side, and the broken login fix will come in an optional client patch today; for those who haven’t patched yet, the new patch will be the default download. Deploying the optional client patch will not require a reboot.

We are also investigating other issues that players encounter, which we follow in this thread.

What a day. As you have noticed, today was not our best moment. The instability that followed the deployment of today’s patch comes down to the small margin we currently have to play with on February.

Today’s patch contained some SQL statements that were just a little more resource-intensive than before. Since we are pushing against the I/O limit of our Fibre Channel disk array, a small error can slam us against the roof, and that is exactly what happened today.
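The margin problem is just arithmetic: when baseline load sits close to the array’s capacity, even a small added cost exceeds the remaining headroom. A toy illustration, with all numbers invented purely for the example:

```python
def io_headroom(capacity_iops, baseline_iops):
    """Remaining I/O budget before the disk array saturates."""
    return capacity_iops - baseline_iops

# Hypothetical figures for illustration only.
capacity = 10_000   # what the disk array can sustain (IOPS)
baseline = 9_400    # normal load before the patch (IOPS)
extra_from_patch = 800  # cost of the slightly heavier SQL statements

# A "little" extra cost still exceeds the thin remaining margin.
saturated = extra_from_patch > io_headroom(capacity, baseline)
```

With only 600 IOPS of headroom, an 800 IOPS increase saturates the array; the same increase against a half-loaded array would have gone unnoticed, which is why eliminating the bottleneck matters more than the size of any single patch.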

As can be seen in the latest news, we have secured the components for eliminating the I/O bottleneck. Once that is done we will have much more room to operate, and blunders like today’s shouldn’t happen again.

I sincerely apologize for the situation today and hope you forgive us once again.

As you are all probably well aware, February has outgrown its current cluster hardware. We are now finalizing the hardware upgrade plan. The upgrade we are planning is considerable; many aspects of the cluster will be doubled.

The only thing currently holding us back is availability and delivery times for the components. We are doing everything possible to complete at least part of the upgrade by next Sunday. We will probably need to split the upgrade into two phases to make that happen.

So expect two extended downtimes in the next 10 days.

As I’m sure you have noticed, the health of February, our dear live system, hasn’t been up to spec these last weeks. The reason is a common phenomenon in MMOG games called “hotspots”. With us these are centered around our Collaboration services.

In short, hotspots arise when you have centralized points where a large percentage of the population gathers in the same place. Our mistake was to have too few “Highway” systems, which wasn’t really apparent when they were first introduced.

The reason this affects us now is that the core of each of these hotspots now utilizes a full node (a full CPU) when it’s peaking. This affects transaction time with the SQL server, and it also affects sibling nodes if it gets too far out of time sync. We can’t solve this by simply throwing more hardware at it, since we would need a nice shiny 3.02GHz Intel Xeon CPU for each of the hotspots. But as a pleasant side note, we have already planned hardware upgrades to the cluster. Never hurts to have more hardware, does it 🙂
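Spotting which nodes have become hotspots boils down to flagging any node whose single CPU is near saturation at peak. A minimal sketch, with hypothetical node names and sampled usage fractions invented for the example:

```python
def find_hotspots(node_cpu, threshold=0.9):
    """Return node names whose CPU usage is at or above the threshold.

    node_cpu maps node name -> fraction of one CPU consumed at peak.
    """
    return sorted(name for name, usage in node_cpu.items() if usage >= threshold)

# Hypothetical peak-time samples, fraction of one CPU per node.
samples = {
    "highway-1": 0.97,
    "highway-2": 0.95,
    "backwater-7": 0.12,
}
hot = find_hotspots(samples)
```

A node flagged this way can’t be helped by adding more nodes, which is exactly the problem described above: the hotspot is a single service pinned to a single CPU, so only a faster CPU (or spreading the population out) relieves it.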