Kafka Summit 2017-sf (pipeline)

Billions of Messages a Day – Yelp’s Real-time Data Pipeline

by Justin Cunningham, Technical Lead, Software Engineering, Yelp
video, slide
Yelp moved quickly into building out a comprehensive service-oriented architecture, and before long had over 100 data-owning production services. Distributing data across an organization creates a number of issues, particularly around the cost of joining disparate data sources, which dramatically increases the complexity of bulk data applications. Straightforward solutions like bulk data APIs and sharing data snapshots have significant drawbacks. Yelp’s Data Pipeline makes it easier for these services to communicate with each other, provides a framework for real-time data processing, and facilitates high-performance bulk data applications, making large SOAs easier to work with. The Data Pipeline provides a series of guarantees that make it easy to create universal data producers and consumers that can be mashed up into interesting real-time data flows. We’ll show how a few simple services at Yelp lay the foundation that powers everything from search to our experimentation framework.

Body Armor for Distributed System

by Michael Egorov, Co-founder and CTO, NuCypher
video, slide
We show a way to make Kafka end-to-end encrypted. This means that data is only ever decrypted on the side of its producers and consumers; it is never decrypted broker-side. Importantly, all Kafka clients have their own encryption keys, and there is no pre-shared encryption key. Our approach can be compared to TLS implemented for more than two parties connected together.

DNS for Data: The Need for a Stream Registry

by Praveen Hirsave, Director Cloud Engineering, HomeAway
video, slide
As organizations increasingly adopt streaming platforms such as Kafka, the need for visibility and discovery has become paramount. With the advent of self-service streaming and analytics, overall speed (not only time-to-signal, but also time to production) is increasingly becoming the difference between winners and losers. Beyond Kafka being at the core of successful streaming platforms, there is a need for a stream registry. Come to this session to find out how HomeAway is solving this with a “just right” approach to governance.

Efficient Schemas in Motion with Kafka and Schema Registry

by Pat Patterson, Community Champion, StreamSets Inc.
video, slide
Apache Avro allows data to be self-describing, but carries an overhead when used with message queues such as Apache Kafka. Confluent’s open source Schema Registry integrates with Kafka to allow Avro schemas to be passed ‘by reference’, minimizing overhead, and can be used with any application that uses Avro. Learn about Schema Registry, using it with Kafka, and leveraging it in your application.
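To make the 'by reference' idea concrete, here is a minimal sketch of the framing that Confluent's serializers put around each Kafka message: a magic byte of 0, a 4-byte big-endian schema ID, then the Avro-encoded bytes. The helper names `frame` and `unframe`, the schema ID, and the payload are hypothetical and only for illustration. A real application would normally rely on the Confluent serializer libraries rather than framing messages by hand.

```python
import struct
from typing import Tuple

MAGIC_BYTE = 0      # marks a message framed for Schema Registry
HEADER_SIZE = 5     # 1 magic byte + 4-byte schema ID


def frame(schema_id: int, avro_payload: bytes) -> bytes:
    """Prefix an Avro-encoded payload with the magic byte and schema ID."""
    # ">bI" = big-endian, one signed byte, one unsigned 32-bit integer.
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + avro_payload


def unframe(message: bytes) -> Tuple[int, bytes]:
    """Split a framed message into (schema_id, avro_payload)."""
    if len(message) < HEADER_SIZE:
        raise ValueError("message too short to contain a schema header")
    magic, schema_id = struct.unpack(">bI", message[:HEADER_SIZE])
    if magic != MAGIC_BYTE:
        raise ValueError("unexpected magic byte: %d" % magic)
    return schema_id, message[HEADER_SIZE:]


if __name__ == "__main__":
    # Hypothetical values: schema ID 42 and a placeholder payload
    # standing in for real Avro binary data.
    framed = frame(42, b"avro-bytes")
    schema_id, payload = unframe(framed)
    print(schema_id, payload)
```

The point of the design is that each message carries five bytes of header instead of a full schema; a consumer looks up the schema by ID in the registry and can cache it afterwards.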

From Scaling Nightmare to Stream Dream : Real-time Stream Processing at Scale

by Amy Boyle, Software Engineer, New Relic
video, slide
On the events pipeline team at New Relic, Kafka is the thread that stitches our microservice architecture together. We receive billions of monitoring events an hour, which customers rely on us to alert on in real time. Learn how, facing more than tenfold growth in the system, we avoided a costly scaling nightmare by switching to a streaming system based on Kafka. We follow a DevOps philosophy at New Relic, so I have a personal stake in how well our systems perform: if evaluation deadlines are missed, I lose sleep and customers lose trust. Without necessarily setting out to from the start, we’ve gone all in, using Kafka as the backbone of an event-driven pipeline, as a datastore, and for streaming updates to the system. Hear about what worked for us, what challenges we faced, and how we continue to scale our applications.

How Blizzard Used Kafka to Save Our Pipeline (and Azeroth)

by Jeff Field, Systems Engineer, Blizzard
video, slide
When Blizzard started sending gameplay data to Hadoop in 2013, we went through several iterations before settling on Flumes in many data centers around the world reading from RabbitMQ and writing to central Flumes in our Los Angeles data center. While this worked at first, by 2015 we were hitting problems scaling to the number of events required. This is how we used Kafka to save our pipeline.

Kafka Connect Best Practices – Advice from the Field

by Randall Hauch, Engineer, Confluent
video, slide
This talk will review the Kafka Connect framework and discuss building data pipelines using the library of available connectors. We’ll deploy several data integration pipelines and demonstrate:

best practices for configuring, managing, and tuning the connectors
tools to monitor data flow through the pipeline
using Kafka Streams applications to transform or enhance the data in flight.
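As a rough illustration of what configuring a connector involves, the sketch below builds the JSON body that could be sent to a Kafka Connect worker's REST API (`POST /connectors`) to create a file source connector. It uses the `FileStreamSourceConnector` that ships with Apache Kafka as an example. The helper name, connector name, file path, and topic are hypothetical, and this is not taken from the talk itself.

```python
import json


def file_source_connector(name, path, topic, tasks_max=1):
    """Build a connector definition for the Kafka Connect REST API.

    All argument values are supplied by the caller; nothing is sent
    over the network here.
    """
    return {
        "name": name,
        "config": {
            # Example connector bundled with Apache Kafka.
            "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
            # Connect config values are conventionally passed as strings.
            "tasks.max": str(tasks_max),
            "file": path,
            "topic": topic,
        },
    }


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    definition = file_source_connector(
        name="demo-file-source",
        path="/tmp/demo.txt",
        topic="demo-topic",
    )
    # This JSON would be the body of a POST to /connectors on a worker.
    print(json.dumps(definition, indent=2))
```

Keeping the definition as plain data makes it easy to put under version control and review before it reaches a worker.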

One Data Center is Not Enough: Scaling Apache Kafka Across Multiple Data Centers

by Gwen Shapira, Product Manager, Confluent
video, slide
You have made the transition from single machines and one-off solutions to distributed infrastructure in your data center powered by Apache Kafka. But what if one data center is not enough? In this session, we review resilient data pipelines with Apache Kafka that span multiple data centers. We provide an overview of best practices and common patterns including key areas such as architecture and data replication as well as disaster scenarios and failure handling.
