MAPD
Brief
MapD Core Database
The MapD Core Database is an open source, in-memory SQL database that leverages the parallel processing power of GPUs to query billions of rows in milliseconds, hundreds to thousands of times faster than legacy CPU databases.
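MapD Technologies develops a big data analytics platform that, according to the company, can query and visualize big data 100 times faster than other systems. The software exploits the massive parallelism of commodity graphics processing units (GPUs) to run SQL queries over datasets of billions of rows in just a few milliseconds. The system works with its own MapD Immerse data visualization tool or with other visualization tools such as Tableau.
Todd Mostak built a prototype of the technology while at Harvard, after growing frustrated at waiting hours or even days for a computer system to process patterns in hundreds of millions of tweets; he was writing a thesis on the Arab Spring and needed the data for his research. He built his own computer cluster from gaming GPU cards, then developed the technology further at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL).
The San Francisco-based company was founded in 2013 and launched its commercial product in March.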
Website
Company
MapD Technologies, Inc
Mission
MapD’s mission is to not just make queries faster, but to create a fluid and immersive data exploration experience that removes the disconnect between an analyst and their data. Whether harnessed to explore correlations or identify anomalies, we have built the MapD platform from the ground up to make extracting insight from data effortless and lightning fast.
History
2017
May 2017 MapD Open Source and Community Editions made available
April 2017 MapD Version 3 Release with distributed scale-out, high availability, enhanced SQL, and native ODBC
March 2017 MapD receives Series B funding from New Enterprise Associates, Inc. (NEA), NVIDIA, Vanedge Capital and Verizon Ventures
2016
December 2016 MapD Version 2 Release with major improvements to the MapD Immerse Visual Analytics client
November 2016 MapD helps publicly announce GPU instances on Google Cloud with stellar benchmarks against CPU systems
October 2016 MapD awarded Business Intelligence Group Start-up of the Year
October 2016 MapD helps publicly launch powerful GPU instances on Amazon Web Services, adding a MapD AMI to the marketplace
September 2016 MapD awarded for Fast Company Innovation by Design Award - Graphic Design and Data Visualization
July 2016 MapD named one of CRN’s 10 Coolest Big Data Startups Of 2016
April 2016 MapD named Gartner Cool Vendor 2016
March 2016 MapD officially launches & receives Series A funding from Vanedge Capital, Nvidia, In-Q-Tel, and Verizon Ventures
2015
July 2015 MapD is available on the cloud with IBM Softlayer Cloud Service’s announcement, making GPUs accessible to HPC users
February 2015 MapD signs a social media giant, its first paying customer
2014
September 2014 MapD moves its HQ from Cambridge to San Francisco
April 2014 MapD takes $2M in seed funding from investors including Nvidia, GV, and Vanedge Capital
March 2014 MapD wins Nvidia’s $100K Early Stage Challenge
2013
September 2013 MapD Technologies, Inc. is founded by Todd Mostak and Tom Graham in Cambridge, MA
January–December 2013 MapD is incubated in MIT CSAIL database group
2012
March–May 2012 First prototype of MapD is built by Todd Mostak for his MIT database course
Learn
Resources
MapD Whitepaper
https://www.mapd.com/resources/
Community
Blog
Documents
MapD Documentation
http://docs.mapd.com/latest/
Github
https://github.com/mapd/mapd-core#readme
websites
The GPU database MapD outperforms traditional databases by 70x; isn't I/O the database bottleneck?
https://www.zhihu.com/question/21003317
Ending Analysis Paralysis: NVIDIA and MapD Solve Massive Big Data Woes Across Industries
https://blogs.nvidia.com/blog/2016/07/11/mapd-data-analytics/
MapD Offers a Columnar Database System that Runs on GPUs
https://thenewstack.io/new-mapd-database-system-runs-gpus/
https://en.wikipedia.org/wiki/MapD_Technologies
FAQ
https://www.mapd.com/faq/#what-are-gpus-not-good-for
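1. Why can a GPU run a database?
Because CPUs have hit a bottleneck, while GPU memory and compute bandwidth keep growing: a server fully populated with GPUs can reach close to 6 TB/s of bandwidth, about 40 times that of a CPU. Using the GPU also means that, for large results, the computed result does not have to be returned to the CPU and can be displayed directly through the client.
2. What are GPUs not good at?
GPUs suit parallelizable algorithms with few branches, where every core does the same thing in each lock-step. Word processing, for example, is a poor fit for a GPU, whereas searching for a piece of text across many files is easy for one.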
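The text-search case can be sketched as a data-parallel map, in which every worker applies the same operation to a different document. This is a CPU-side analogy in Python, not GPU code; `count_matches`, `search_all`, and the in-memory strings standing in for files are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def count_matches(text: str, needle: str) -> int:
    """The single operation every worker performs: count occurrences."""
    return text.count(needle)


def search_all(documents, needle, workers=4):
    """Apply the same operation to every document, one document per task."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda doc: count_matches(doc, needle), documents))


# Hypothetical documents standing in for file contents.
documents = ["gpu gpu cpu", "cpu only", "one gpu here"]
print(search_all(documents, "gpu"))
```

There is no per-document branching in the work itself, which is the property that makes this kind of task map well onto many simple cores.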
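3. How many GPUs can MapD use at most?
MapD can use every GPU in a server. Most 3U or 4U servers can hold eight NVIDIA K80 cards; each K80 carries two GPUs, which gives 16 GPUs per server.
In the future, MapD could be deployed on integrated high-density compute appliances that pack 16 to 32 GPUs.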
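4. What if the data does not fit in GPU memory?
MapD pools the RAM of all GPUs and uses it as its first-level cache, and it tries to keep hot data (columns) compressed in that RAM. A single node typically handles data at the terabyte scale. When the data cannot all fit in GPU memory, MapD places a larger subset in CPU memory. That data can move between the CPU and the GPU, or be processed on the CPU and the GPU at the same time.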
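The tiering described in this answer can be sketched as a greedy placement: the hottest columns go to GPU RAM first, and the overflow goes to CPU RAM. This is a minimal sketch, not MapD's actual policy. The function name, the column sizes and access counts, the CPU capacity, and the final "disk" tier are all assumptions; the 192 GB GPU figure is the per-node maximum quoted elsewhere in this document.

```python
def place_columns(columns, gpu_capacity_gb, cpu_capacity_gb):
    """Greedy placement of columns across memory tiers.

    columns: list of (name, size_gb, access_count) tuples.
    Hotter columns (higher access_count) are placed first.
    """
    placement = {"gpu": [], "cpu": [], "disk": []}
    free = {"gpu": gpu_capacity_gb, "cpu": cpu_capacity_gb}
    for name, size_gb, _hits in sorted(columns, key=lambda c: c[2], reverse=True):
        for tier in ("gpu", "cpu"):
            if size_gb <= free[tier]:
                placement[tier].append(name)
                free[tier] -= size_gb
                break
        else:
            # Assumed fallback when neither memory tier has room.
            placement["disk"].append(name)
    return placement


# Hypothetical columns: (name, compressed size in GB, access count).
columns = [
    ("fare", 100, 50),
    ("pickup_time", 120, 90),
    ("comment", 300, 1),
    ("tip", 60, 70),
]
print(place_columns(columns, gpu_capacity_gb=192, cpu_capacity_gb=256))
```

With these made-up numbers, the two hottest columns fit in GPU RAM, the next one lands in CPU RAM, and the rarely used large column stays out of memory.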
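5. How much concurrency can MapD support?
Because MapD queries are fast, it can support many concurrent users. If users only issue hand-written SQL, each server can support several hundred concurrent users. With MapD Immerse, a dozen or more users can query simultaneously without a performance impact.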
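6. For streaming data, how does MapD keep query results refreshed in real time?
- First, MapD uses no indexes and fully exploits GPU parallelism
- Queries run mainly on the GPU, so the CPU is free for parsing and other work such as data import
When the system receives a SELECT request, it first executes against the data already in GPU memory; while the query runs, new data is inserted into GPU memory asynchronously.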
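One way to picture a query running while data is appended asynchronously is a snapshot read: the query fixes the number of rows it can see when it starts, and rows loaded afterwards only show up in the next query. This is a minimal sketch under that assumption, not a description of MapD's internals; the `AppendOnlyColumn` class and its methods are hypothetical.

```python
import threading


class AppendOnlyColumn:
    """A column that only grows; readers choose how many rows they see."""

    def __init__(self):
        self._values = []
        self._lock = threading.Lock()

    def append_batch(self, batch):
        """Called by the loader; may run while queries are in flight."""
        with self._lock:
            self._values.extend(batch)

    def row_count(self):
        with self._lock:
            return len(self._values)

    def sum_first(self, n):
        """Aggregate only the first n rows, i.e. the query's snapshot."""
        with self._lock:
            visible = self._values[:n]
        return sum(visible)


column = AppendOnlyColumn()
column.append_batch([1, 2, 3])

visible_rows = column.row_count()  # the query starts and sees 3 rows
loader = threading.Thread(target=column.append_batch, args=([10, 20],))
loader.start()                     # new data arrives during the query
running_result = column.sum_first(visible_rows)
loader.join()

next_result = column.sum_first(column.row_count())
print(running_result, next_result)  # 6 36
```

The running query returns the same answer whether or not the loader finishes first, and the refreshed result appears as soon as the next query is issued.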
License: Apache
Development language:
Operating system: cross-platform
Benchmarks
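1. Mark Litwintschik, a well-known data expert and blogger, tested MapD on different hardware configurations, from consumer Titan X cards to commercial Tesla K80s. The test data consisted of 1.2 billion taxi trip records from 2009 to 2015.
Conclusion: on Tesla K80s, MapD was 55 times faster than CPU-cluster systems such as Amazon Redshift, BigQuery, Elastic, PostgreSQL and Presto.
Even on consumer GPU hardware it was 43 times faster.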
2. Verizon benchmarked MapD against a set of 20 Apache Impala servers churning through 3 billion rows. It took the Impala kit 15-20 seconds, whereas it took a single MapD server around 160 milliseconds.
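The Verizon figures can be turned into a speedup range with a line of arithmetic. The sketch below uses only the numbers quoted above (15 to 20 seconds versus roughly 160 milliseconds); the function name is illustrative.

```python
def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


mapd_seconds = 0.160                 # about 160 milliseconds
low = speedup(15.0, mapd_seconds)    # Impala at its fastest
high = speedup(20.0, mapd_seconds)   # Impala at its slowest
print(f"{low:.0f}x to {high:.0f}x")  # 94x to 125x
```

Note that this compares wall-clock time only: the Impala side used 20 servers and the MapD side a single server, so the per-server gap is larger than the ratio shown.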
GPU Advantages
1. Many cores
The beauty of GPUs is that they have hella cores. A server can run about 10 to 30 cores, but about 40,000 GPU cores. Granted, GPU cores are pretty dumb compared to CPUs, “but you can process a lot with them,” Mostak said.
But the maximum core count is not the only advantage GPUs bring.
2. High bandwidth
“People think GPUs are great because they have so much computational power but we think that you really win because GPUs have so much memory bandwidth,” Mostak said. The Pascal cards will be able to scan data at a rate of 8 TB/second, a huge jump over CPU capabilities.
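To see why memory bandwidth matters for a scan-heavy query, a back-of-the-envelope estimate helps: the time to read one column is its size divided by the bandwidth. The sketch below is idealized and rests on assumptions: a 4-byte value per row, 3 billion rows, decimal units (1 TB = 10^12 bytes), and no overhead of any kind. The 8 TB/s figure is the one quoted above; the CPU figure of 150 GB/s is derived from the FAQ's statement that about 6 TB/s is 40 times a CPU.

```python
def scan_seconds(rows: int, bytes_per_value: int, bandwidth_bytes_per_s: float) -> float:
    """Idealized time to stream one column through memory once."""
    return rows * bytes_per_value / bandwidth_bytes_per_s


ROWS = 3_000_000_000        # assumed table size
BYTES_PER_VALUE = 4         # assumed 32-bit column
GPU_BANDWIDTH = 8e12        # 8 TB/s, as quoted for Pascal cards
CPU_BANDWIDTH = 6e12 / 40   # 150 GB/s, derived from the FAQ figures

gpu_time = scan_seconds(ROWS, BYTES_PER_VALUE, GPU_BANDWIDTH)
cpu_time = scan_seconds(ROWS, BYTES_PER_VALUE, CPU_BANDWIDTH)
print(f"GPU: {gpu_time * 1e3:.1f} ms, CPU: {cpu_time * 1e3:.1f} ms")
```

Under these assumptions the 12 GB column streams in about 1.5 ms at GPU bandwidth versus about 80 ms at the derived CPU bandwidth; real queries will be slower on both sides.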
MapD Advantages
1. No master node; up to 16 GPUs per node
2. SQL support and ODBC support
3. A single node supports 8 NVIDIA K80 cards, for up to 192 GB of GPU RAM
4. Code optimized for GPUs
5. Uses LLVM
LLVM allows MapD to transform query plans into architecture-independent intermediate code (LLVM IR) and then use any of the LLVM architecture-specific “backends” to compile that IR code for the needed target, such as NVIDIA GPUs, x64 CPUs, and ARM CPUs.
6. Keeps hot data resident in GPU memory
7. Vectorized execution plans
Vectorized code allows the compute resources of a processor to process multiple data items simultaneously.
Caching of compiled execution plans
Furthermore, the system can cache templated versions of compiled query plans for reuse. This is important in situations where our visualization layer is asked to animate billions of rows over multiple correlated charts.
8. Excellent visualization charts
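The ideas in items 5 and 7 above, compiling a query plan into executable code and caching a templated version of it, can be sketched in a few lines. This is an analogy only: it generates Python source and uses the built-in `compile` in place of LLVM IR and an LLVM backend, and it handles just one predicate shape (`column op number`). The names `templatize`, `compile_template`, and `PlanCache`, and the sample `trips` data, are hypothetical.

```python
import re

# Standalone integer or decimal literals (not digits inside identifiers).
_NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")
_OPERATORS = {"<", "<=", ">", ">=", "==", "!="}


def templatize(predicate: str):
    """Split 'fare > 10' into the template 'fare > ?' and params [10.0]."""
    params = [float(text) for text in _NUMBER.findall(predicate)]
    return _NUMBER.sub("?", predicate), params


def compile_template(template: str):
    """Generate source for the template and compile it once.

    Stand-in for lowering a plan to IR and handing it to a backend.
    """
    parts = template.split()
    if len(parts) != 3:
        raise ValueError(f"unsupported predicate template: {template!r}")
    column, op, placeholder = parts
    if not column.isidentifier() or op not in _OPERATORS or placeholder != "?":
        raise ValueError(f"unsupported predicate template: {template!r}")
    source = (
        "def kernel(values, threshold):\n"
        f"    return sum(1 for v in values if v {op} threshold)\n"
    )
    namespace = {}
    exec(compile(source, "<generated kernel>", "exec"), namespace)
    return namespace["kernel"]


class PlanCache:
    """Reuses one compiled kernel for all queries sharing a template."""

    def __init__(self):
        self._kernels = {}
        self.compilations = 0

    def count_where(self, table, predicate):
        template, params = templatize(predicate)
        kernel = self._kernels.get(template)
        if kernel is None:
            kernel = compile_template(template)
            self._kernels[template] = kernel
            self.compilations += 1
        column = template.split()[0]
        return kernel(table[column], params[0])


trips = {"fare": [4.5, 12.0, 30.0, 8.0, 25.5]}  # made-up sample data
cache = PlanCache()
print(cache.count_where(trips, "fare > 10"))  # 3
print(cache.count_where(trips, "fare > 25"))  # 2
print(cache.compilations)                     # 1
```

The two queries differ only in their literal, so the second one skips compilation entirely. That is the situation described for animated charts, where the same query shape is re-issued many times with changing filter values.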
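Use Cases
1. Finding patterns in social advertising
2. Detecting telecom problems in real time
3. Discovering trends within a region
4. Making data-driven decisions in real time
5. Accelerating analysis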