Recently we at work became the proud new parents of an HP EVA4400 with 72 15K drives. This is significantly more spindles than the 12 disks we are currently running on. Like most environments, we need to put 20 pounds of stuff in a 1-pound bag and get the most out of the hardware we can, because there is simply no more money in our budget for more hardware. With that idea in mind I test drove several different best practice ideas from HP and VMware.
First we will review the best practices from HP. Please know that I tried to find a link to the EVA best practice document but have had no luck; the only copy I have is the one that came on the CD with the EVA itself. Moving on, the HP best practice comes in two flavors: the first is the cost-effective approach, which is to put all the disks in one large disk group, while the second is to run Vraid 1 for the best availability and speed. Note that Vraid 1 on an EVA is the equivalent of traditional RAID 1+0, which is mirroring plus striping. That holds up from an availability standpoint and from a raw performance standpoint as well, but it has a few shortcomings. The first shortcoming is the amount of lost disk space. The disks in our EVA are 146 GB each, for a total of 10,512 GB, or roughly 10.5 TB raw. With RAID 10 we would only get about 5 TB usable, and that will just not fit our needs, so we will be looking at RAID 5 and RAID 6 setups. Additionally, we have 2 very busy database servers and 2 busy Exchange mail store servers, and all four of them generate a lot of disk I/O from transaction logs and database reads and writes. The HP best practice document for the EVA does note that when running database servers, or anything else with large amounts of I/O, the one large disk group might not be the best fit. With that said, and with running the whole thing as RAID 10 eating up too much space, I turned to VMware and their best practice white papers for setting up storage.
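To put that capacity trade-off in concrete terms, here is a quick back-of-the-envelope sketch in Python. The RAID 5 and RAID 6 overheads are idealized assumptions on my part (4+1 and 4+2 layouts); the EVA's actual Vraid layouts and sparing will shave the real usable numbers down a bit.

    # Back-of-the-envelope capacity math for 72 x 146 GB disks.
    # Idealized textbook overheads; the EVA's Vraid layouts and sparing
    # reserve extra space, so real usable capacity will be lower.
    disks, disk_gb = 72, 146
    raw_gb = disks * disk_gb                  # 10,512 GB raw (~10.5 TB)

    raid10_gb = raw_gb / 2                    # mirror + stripe: half of raw
    raid5_gb = raw_gb * (4 / 5)               # assuming a 4+1 parity layout
    raid6_gb = raw_gb * (4 / 6)               # assuming a 4+2 double-parity layout

    print(f"raw:     {raw_gb:,} GB")
    print(f"RAID 10: {raid10_gb:,.0f} GB usable")
    print(f"RAID 5:  {raid5_gb:,.0f} GB usable (4+1 assumption)")
    print(f"RAID 6:  {raid6_gb:,.0f} GB usable (4+2 assumption)")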
Considering that we have 4 very active database servers, which happen to be our tier-one apps, I wanted to be sure that these servers had the I/O they need to meet the application and business requirements. There were 2 white papers from VMware I read some time back: one about MS SQL (which can be found here
http://www.vmware.com/files/pdf/perf_vsphere_sql_scalability.pdf) and the other about SAP (which can be found here
http://www.vmware.com/files/pdf/whitepaper_SAP_bestpractice_jan08.pdf). In the MS SQL document VMware created one disk group for the database, and for the T-Logs they created 2 other disk groups, laid RAID 0 arrays on them, and had the VM mirror the two. Note that this was an all-out speed test for MS SQL Server on vSphere, and the number of disks they used was much larger than anything we have, but it did offer a possible configuration for our needs. As for the SAP document, even though we are not running SAP, it addresses running more than one database server and how to configure the disks to handle the load.
Now for the fun part: test one, the SAP configuration.
In the SAP document VMware carved out one disk group of 8-12 disks for T-Logs and used the other disk groups for data and OS drives. For my part, I created 1 disk group of 8 disks for T-Logs and put 2 vdisks (LUNs) on it, each at Vraid 1 (RAID 10). The reason I created 2 LUNs was to reduce LUN locking between the hosts. I then created 4 disk groups of 13 disks each for data and OS drives. If you're doing the math, 13 * 4 + 8 does not equal 72 but 60; that is because all good admins keep a little bit of their resources unused and in their back pocket for a rainy day. On each of the 4 disk groups I first created a Vraid 6 LUN for the database drive, and then I created another LUN for data. The breakdown looks like this:
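To keep that layout straight, and to sanity-check how much usable space it leaves, here is a rough Python sketch of it. The disk-group names are just placeholders, I am assuming both LUNs on each data disk group are Vraid 6, and the usable fractions are the same idealized assumptions as before, so treat the numbers as approximations rather than exact figures.

    # Test 1 layout: one 8-disk group for T-Logs and four 13-disk groups for
    # data/OS drives, with 12 disks held back in reserve.  Usable fractions
    # are idealized assumptions, not the EVA's exact Vraid overheads.
    DISK_GB = 146
    TOTAL_DISKS = 72

    disk_groups = [
        {"name": "tlog-dg",  "disks": 8,  "luns": 2, "vraid": "vraid1"},
        {"name": "data-dg1", "disks": 13, "luns": 2, "vraid": "vraid6"},
        {"name": "data-dg2", "disks": 13, "luns": 2, "vraid": "vraid6"},
        {"name": "data-dg3", "disks": 13, "luns": 2, "vraid": "vraid6"},
        {"name": "data-dg4", "disks": 13, "luns": 2, "vraid": "vraid6"},
    ]

    EFFICIENCY = {"vraid1": 1 / 2, "vraid6": 4 / 6}   # assumed usable fractions

    used = sum(dg["disks"] for dg in disk_groups)
    print(f"disks in use: {used} of {TOTAL_DISKS} ({TOTAL_DISKS - used} in reserve)")
    for dg in disk_groups:
        usable = dg["disks"] * DISK_GB * EFFICIENCY[dg["vraid"]]
        print(f'{dg["name"]}: {dg["disks"]} disks, {dg["luns"]} LUNs, ~{usable:,.0f} GB usable')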
Next I created the test VMs. I created 4 servers to simulate the 4 database servers we run now. Each of these VMs is a Windows Server 2008 machine with 1 vCPU, 2 GB of RAM, and 2 drives. The first drive is for the OS and the second drive is for T-Logs, and each drive is on its own dedicated SCSI channel so that simultaneous reads and writes to the two drives do not contend with each other. The servers were set up on the datastores as follows:
I did not spread the OS drives over all 4 disk groups because my focus was on providing the I/O for the T-Logs, which have the highest write counts. To test the I/O I installed Iometer (which can be downloaded here
http://www.iometer.org/doc/downloads.html) on each of the servers. I set up the test to run for 10 minutes on all 4 servers at exactly the same time. The results were quite impressive considering that there were only 8 disks, split into 2 LUNs, serving 4 servers running a very disk-intensive test. The results look like this:
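Since all four servers hammer the same 8-disk T-Log group at once, the number I cared about is the aggregate load on that group rather than any single server's result. A small helper like the one below makes that roll-up explicit; the server names are placeholders and the IOPS values are left at zero to be filled in from each VM's Iometer output, not my measured results.

    # Roll up per-server Iometer totals into an aggregate view of the shared
    # 8-disk T-Log disk group.  Fill in the values from each VM's Iometer
    # results; the zeros here are placeholders, not the actual test numbers.
    tlog_spindles = 8

    per_server_iops = {
        "db-test-1": 0.0,   # total IOPS reported by Iometer on each VM
        "db-test-2": 0.0,
        "db-test-3": 0.0,
        "db-test-4": 0.0,
    }

    total_iops = sum(per_server_iops.values())
    print(f"aggregate IOPS against the T-Log group: {total_iops:,.0f}")
    print(f"average IOPS per spindle:               {total_iops / tlog_spindles:,.1f}")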
The second test was the MS SQL setup from VMware.
In the MS SQL test, VMware created 1 large disk group for the database and 2 disk groups with a raid level of 0 for the T-Logs. Since RAID 0 does not provide any resiliency against disk failure, they mirrored the disks at the server level. This was a setup for speed-testing a single SQL server, but it seemed like it could have the potential to meet our needs, and even though I am not a big fan of software RAID, I am willing to do what it takes to get the job done. For the second test I set up the EVA with 4 disk groups of 15 disks each and then created two LUNs on each disk group: one Vraid 6 LUN for data and OS drives and one Vraid 0 LUN for T-Logs. The configuration looked like this:
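As with test one, here is a rough Python sketch of that layout for comparison. The usable fractions are the same idealized assumptions, the 50/50 split of each disk group between its two LUNs is a guess on my part, and the T-Log figure is halved once more because the Vraid 0 LUNs get mirrored in software inside the guests.

    # Test 2 layout: four 15-disk groups, each with one Vraid 6 LUN (data/OS)
    # and one Vraid 0 LUN (T-Logs).  The Vraid 0 LUNs are mirrored in pairs
    # inside the guest OS, so their effective capacity is halved again.
    DISK_GB = 146
    GROUPS, DISKS_PER_GROUP = 4, 15
    EFFICIENCY = {"vraid6": 4 / 6, "vraid0": 1.0}   # assumed usable fractions

    raw_per_group = DISKS_PER_GROUP * DISK_GB
    in_use = GROUPS * DISKS_PER_GROUP
    print(f"disks in use: {in_use} of 72 ({72 - in_use} in reserve)")
    print(f"raw per disk group: {raw_per_group:,} GB")

    # Assume half of each group goes to the Vraid 6 LUN and half to the
    # Vraid 0 LUN; the real split is whatever the vdisk sizes were set to.
    vraid6_usable = (raw_per_group / 2) * EFFICIENCY["vraid6"]
    vraid0_usable = (raw_per_group / 2) * EFFICIENCY["vraid0"]
    print(f"per group, data/OS (Vraid 6): ~{vraid6_usable:,.0f} GB usable")
    print(f"per group, T-Logs (Vraid 0):  ~{vraid0_usable:,.0f} GB, "
          f"~{vraid0_usable / 2:,.0f} GB after the software mirror")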
The final test setup looked like this:
This is definitely a more complicated setup than the first test and will be more prone to misconfiguration, but at this point it is only a test. Next came setting up the software mirrors. After the mirrors were all in place I set up Iometer to test the T-Log drives and the C: drives as well, since in the final configuration the T-Log drives would be on disk groups shared with other servers. I ran the exact same 10-minute test as in the first round, and the results looked like this: