Wednesday, February 16, 2011

SSDs – are they really a solution for database performance issues?

Most of the databases I am familiar with still reside on rotating platters (a.k.a. disks). Disks, physical disks, hard drives, or whatever else you might call them, have been around for many years, and they are one of the few products that have barely advanced technologically. Moore's Law never really applied to disks, only to CPU power. Too bad, because when it comes to the number of calculations a CPU can execute in a single second, we make a great leap every 18 months; but when it comes to the number of IO operations per second we can process, even with a high-quality storage device, the mechanics working underneath are still stuck somewhere in the late 80s or early 90s.

Recently, another buzz has been catching everyone's attention: SSDs, or Solid State Drives. SSDs are really good when you want to read a lot of data. They are especially good (compared to physical rotating disks) when you want to read a lot of data in small chunks, which is also called "random IO". I've had the opportunity to work with databases residing on SSDs, or what are also called Flash devices: databases stored on Flash drives, Flash cache, Oracle Exadata, and many other solutions. They all have one property in common: when the disk operates faster, the CPU has to work harder to process the huge stream of data.
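To make the "random IO" point concrete, here is a minimal sketch (mine, not the author's benchmark) that times sequential versus shuffled 8 KB reads from a scratch file. The file name and sizes are made up, and the OS page cache will mask much of the difference here; a real benchmark would bypass it (e.g., with O_DIRECT), which this toy script does not attempt. On a rotating disk the shuffled pass is dramatically slower, while on an SSD the gap largely disappears.

```python
# A toy comparison of sequential vs. random reads (hypothetical file
# name and sizes; no cache bypass, so treat the numbers as indicative).
import os
import random
import time

PATH = "testfile.bin"   # hypothetical scratch file
BLOCK = 8192            # 8 KB, a typical database page size
BLOCKS = 10_000         # ~80 MB total

# Create the scratch file to read from.
with open(PATH, "wb") as f:
    f.write(os.urandom(BLOCK * BLOCKS))

def timed_reads(offsets):
    """Read one BLOCK at each offset; return elapsed seconds."""
    start = time.perf_counter()
    with open(PATH, "rb") as f:
        for off in offsets:
            f.seek(off)
            f.read(BLOCK)
    return time.perf_counter() - start

sequential = [i * BLOCK for i in range(BLOCKS)]
shuffled = sequential[:]
random.shuffle(shuffled)

print(f"sequential: {timed_reads(sequential):.3f}s")
print(f"random:     {timed_reads(shuffled):.3f}s")
os.remove(PATH)  # clean up the scratch file
```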

Up until now, physical disks have been the main bottleneck to performance improvement in most of the systems I have worked with. But when you replace the IO subsystem with an alternative capable of many times more IO operations per second, the disk stops being the bottleneck and that role passes to the CPU. I've seen this happen in real life.

Here is an example. I was working on a system with 4 quad-core CPUs and SAN storage capable of processing 10,000 IO/sec at most.

• When my disks were 100% utilized, my CPU utilization was ~40% on average and peaked at 70%.

• When I replaced the storage with an SSD capable of 100,000 IO/sec (i.e., 10 times faster than my old storage), CPU consumption was still ~40% on average, but it peaked at 100%!

• I also noticed that even though the SSD was capable of 100,000 IO/sec, it never went beyond 20,000 IO/sec (meaning it was only 20% utilized). Why? Because the machine no longer had enough CPU power to process this fast data stream.

We all know what happens when a machine reaches 100% CPU utilization: it crashes, times out, hangs. And if someone runs a report on this machine, where before it would have jammed only the storage device, now it jams the machine's CPU.
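The arithmetic behind this example fits in a two-line model (my own framing; the 20,000 IO/sec "CPU ceiling" is inferred from the observations above, not a measured constant): delivered throughput is the minimum of what the storage and the CPU can each sustain.

```python
# Back-of-envelope model of the example: the slower of the two
# resources caps the whole system. CPU_CEILING is inferred from the
# observation that the CPU maxed out while feeding ~20,000 IO/sec.
CPU_CEILING = 20_000  # IO/sec the CPU could drive before hitting 100%

def delivered_iops(storage_iops, cpu_ceiling=CPU_CEILING):
    """The system delivers the minimum of what storage and CPU sustain."""
    return min(storage_iops, cpu_ceiling)

for storage in (10_000, 100_000):
    iops = delivered_iops(storage)
    print(f"storage rated {storage:>7,} IO/sec -> "
          f"delivered {iops:>6,} IO/sec "
          f"({iops / storage:.0%} of the storage utilized)")

# storage rated  10,000 IO/sec -> delivered 10,000 IO/sec (100% of the storage utilized)
# storage rated 100,000 IO/sec -> delivered 20,000 IO/sec (20% of the storage utilized)
```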

Now tell me something: if you had to choose between these two options, which one would you prefer?

• Working with a jammed, over-utilized storage device?

• Working with a jammed, over-utilized CPU?

I would choose the first option. Even when the hard drive on my laptop is working hard, most of the applications residing in memory that consume only CPU still work fine. But when my CPU is 100% utilized, the laptop is completely stuck: no ability to process anything, neither IO nor CPU calculations.

People have told me, "SSD is going to solve our performance problems." Well... no. It will certainly speed up some of the things you are doing, but it is not going to solve the other problem you probably suffer from: that of QoS and SLA. As long as there are computers, performance bottlenecks will remain. If you make one component work faster without changing anything else, the bottleneck simply moves to another component.

The only way to avoid the bottleneck and guarantee QoS in this case is by controlling the amount of resources each and every transaction takes, whether those resources are CPU or IO. Either way, MoreVRP will allow you to safeguard your QoS and SLA. To date there is no overall solution that eliminates the bottleneck, but with MoreVRP you can make sure it does not affect your most critical transactions, whether you are working with SSDs or good old physical disks.
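For a feel of what "controlling the amount of resources a transaction takes" can look like, here is a generic token-bucket throttle, a common rate-limiting technique. This is not a description of MoreVRP's internals, just an illustrative sketch; the report job, its 1,000 IO/sec cap, and read_page() are hypothetical.

```python
# A generic token-bucket throttle: caps how much of a resource one
# consumer may take, so a heavy job cannot starve everyone else.
import time

class TokenBucket:
    """Allow at most `rate` operations/sec, with bursts up to `burst`."""
    def __init__(self, rate, burst):
        self.rate = float(rate)
        self.burst = float(burst)
        self.tokens = float(burst)
        self.last = time.monotonic()

    def acquire(self):
        """Block until one operation is allowed."""
        while True:
            now = time.monotonic()
            # Refill tokens at `rate` per second, up to the burst cap.
            self.tokens = min(self.burst,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return
            time.sleep((1.0 - self.tokens) / self.rate)

# Cap a hypothetical report at 1,000 IO/sec so it cannot starve the
# critical transactions sharing the same storage and CPU:
report_limiter = TokenBucket(rate=1_000, burst=100)
# for each page the report reads:
#     report_limiter.acquire()
#     read_page()  # hypothetical IO call
```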

Something to think about the next time you consider buying a new storage device.

Cheers.

http://www.more-resource.com/blog/?p=74

2 comments:

Geri Reshef said...

Sounds like Compression: less IO, more CPU.

Lior King said...

Nice article!

The right use of SSDs is in OLTP systems, where data access is random and where a significant performance improvement is felt. They are less recommended for DWH, where masses of data are scanned and regular disks do that quite well (and far more cheaply).

If your CPU reaches 100%, you should check which processes are driving it wild. Apparently what was restraining those processes before was the I/O, and once you upgraded to SSD there was nothing left to moderate them.
It is worth considering a resource governor to rein in the "runaway" processes.
Production servers usually carry serious I/O controllers. It is preferable to use HBA cards with a fiber-optic connection, which greatly reduce the number of CPU interrupts generated by I/O operations.