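Abstract
As big data systems have developed rapidly in recent years, a variety of open-source benchmarks have been designed to compare and evaluate the performance of these systems, and they have in turn driven improvements in that performance. The paper first gives an overview of popular benchmarks, then summarizes the three aspects that benchmarks focus on: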
- workload generation techniques
- workload input data generation techniques
- metrics
Today's mainstream big data systems fall into three main types:
- Hadoop and its related systems
- data stores (database management systems (DBMSs) and NoSQL)
- specialized systems (connected graphs, continuous streams, and complex scientific data)
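See the figure below for details (the figures and tables in this post are taken from the original paper).
Existing benchmarks can be divided into three main categories: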
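- Micro benchmarks. Evaluate a single system component or a specific system behavior. Common examples are Word count, NNBench, and TestDFSIO.
- End to end. Evaluate the entire system using typical application scenarios, where each scenario corresponds to a set of related workloads. Common examples are the series of OLTP (On-Line Transaction Processing) queries provided by TPC (Transaction Processing Performance Council).
- Benchmark suites. Combinations of several micro benchmarks and end-to-end benchmarks. Common examples are HiBench, CloudSuite, and BigDataBench.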
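To make the micro benchmark category more concrete, here is a minimal single-process sketch of a word-count micro benchmark. It is not taken from the paper or from any of the benchmark tools named above. The function names (`generate_input`, `word_count`, `run_micro_benchmark`), the synthetic vocabulary, and the choice of words per second as the metric are all assumptions made for illustration. It touches the three aspects listed earlier: input data generation, the workload itself, and a metric.

```python
import random
import time
from collections import Counter


def generate_input(num_words, vocab_size=1000, seed=0):
    """Input data generation (hypothetical): draw words from a fixed synthetic vocabulary."""
    rng = random.Random(seed)
    vocab = [f"w{i}" for i in range(vocab_size)]
    return [rng.choice(vocab) for _ in range(num_words)]


def word_count(words):
    """The workload under test: count the occurrences of each word."""
    return Counter(words)


def run_micro_benchmark(num_words=100_000, repeats=3):
    """Run the workload several times and report a simple throughput metric."""
    words = generate_input(num_words)
    timings = []
    counts = Counter()
    for _ in range(repeats):
        start = time.perf_counter()
        counts = word_count(words)
        timings.append(time.perf_counter() - start)
    best = min(timings)  # best-of-N reduces noise from other processes
    return {
        "num_words": num_words,
        "distinct_words": len(counts),
        "best_seconds": best,
        "words_per_second": num_words / best if best > 0 else float("inf"),
    }


if __name__ == "__main__":
    print(run_micro_benchmark())
```

A real micro benchmark for a big data system would run the same kind of workload on a cluster and over far larger inputs. This sketch only shows the shape of the measurement loop.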
Common NoSQL types and examples:
- key/value stores (e.g., Amazon Dynamo, Cassandra, LinkedIn Voldemort)
- column-oriented databases (e.g., BigTable and Hypertable)
- document-oriented stores (e.g., CouchDB and MongoDB)
Two kinds of systems target graph data:
- graph databases such as Neo4j
- distributed graph processing systems such as Google Pregel