Most of the references I have read only say something like “Do not use RAID for HDFS” without explaining why. Hadoop: The Definitive Guide (3rd Edition) states the reason clearly:
“HDFS clusters do not benefit from using RAID (Redundant Array of Independent Disks) for datanode storage (although RAID is recommended for the namenode’s disks, to protect against corruption of its metadata). The redundancy that RAID provides is not needed, since HDFS handles it by replication between nodes.
Furthermore, RAID striping (RAID 0), which is commonly used to increase performance, turns out to be slower than the JBOD (Just a Bunch Of Disks) configuration used by HDFS, which round-robins HDFS blocks between all disks. This is because RAID 0 read and write operations are limited by the speed of the slowest disk in the RAID array. In JBOD, disk operations are independent, so the average speed of operations is greater than that of the slowest disk. Disk performance often shows considerable variation in practice, even for disks of the same model. In some benchmarking carried out on a Yahoo! cluster (http://markmail.org/message/xmzc45zi25htr7ry), JBOD performed 10% faster than RAID 0 in one test (Gridmix) and 30% better in another (HDFS write throughput).
Finally, if a disk fails in a JBOD configuration, HDFS can continue to operate without the failed disk, whereas with RAID, failure of a single disk causes the whole array (and hence the node) to become unavailable.”
Excerpt From: Tom White. “Hadoop.The.Definitive.Guide.3rd.Edition.epub.” iBooks. https://itunes.apple.com/WebObjects/MZStore.woa/wa/viewBook?id=540592910DD44A44D4D2DA62853ADD07”
So, that’s the main reason.
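To make the throughput argument from the excerpt concrete, here is a minimal sketch of a toy model. It assumes that a RAID 0 array moves every disk at the pace of its slowest member, while JBOD disks each run at their own speed. The function names and the disk speeds are made up for illustration; this is not a reproduction of the Yahoo! benchmark and it ignores caching, seek time, and controller overhead.

```python
def raid0_throughput(disk_speeds_mb_s):
    """Toy model: every stripe waits on the slowest disk, so all disks
    effectively run at the minimum speed in the array."""
    return len(disk_speeds_mb_s) * min(disk_speeds_mb_s)


def jbod_throughput(disk_speeds_mb_s):
    """Toy model: disks operate independently, so aggregate throughput
    is simply the sum of the individual speeds."""
    return sum(disk_speeds_mb_s)


# Hypothetical speeds (MB/s) for four disks of the same model.
speeds = [100, 95, 110, 80]

raid0 = raid0_throughput(speeds)
jbod = jbod_throughput(speeds)

print(f"RAID 0: {raid0} MB/s, JBOD: {jbod} MB/s")
# prints "RAID 0: 320 MB/s, JBOD: 385 MB/s"
```

In this model the two configurations tie only when all disks have exactly the same speed. Any variation between disks pulls the RAID 0 figure down to the slowest one, which matches the explanation in the excerpt.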