Dataworks Summit 2019 was held in Barcelona, Spain, this March... The Hadoop Summit used to be a huge event; as Hadoop faded, it was renamed Dataworks Summit. After Hortonworks and Cloudera merged, it is unclear whether it will continue at all. It seems there will be no Bay Area edition this year... a piece of history drawing to a close.
Let's take a look at the eight Spark-related sessions from this March's Dataworks Summit in Barcelona.
Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it the Right Way?
DeepLearning4J (DL4J) is a powerful open-source distributed framework that brings deep learning to the JVM (it can serve as a DIY tool for Java, Scala, Clojure and Kotlin programmers). It can be used on distributed GPUs and CPUs and is integrated with Hadoop and Apache Spark. ND4J is an open-source, distributed and GPU-enabled library that brings the intuitive scientific computing tools of the Python community to the JVM. Training neural network models using DL4J, ND4J and Spark is a powerful combination, but the overall cluster configuration can present some unexpected issues that compromise performance and nullify the benefits of well-written code and good model design. In this talk I will walk through some of those problems and present best practices to prevent them. The presented use cases will cover DL4J and ND4J on different Spark deployment modes (standalone, YARN, Kubernetes). The reference programming language for all code examples is Scala, but no prior Scala knowledge is required to follow the presented topics.
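The abstract itself contains no code, so here is a minimal sketch (not the speaker's example) of what DL4J training on Spark usually looks like in Scala, using the public SparkDl4jMultiLayer and ParameterAveragingTrainingMaster APIs. The network layout, hyper-parameters and synthetic data are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.deeplearning4j.nn.conf.NeuralNetConfiguration
import org.deeplearning4j.nn.conf.layers.{DenseLayer, OutputLayer}
import org.deeplearning4j.spark.impl.multilayer.SparkDl4jMultiLayer
import org.deeplearning4j.spark.impl.paramavg.ParameterAveragingTrainingMaster
import org.nd4j.linalg.activations.Activation
import org.nd4j.linalg.dataset.DataSet
import org.nd4j.linalg.factory.Nd4j
import org.nd4j.linalg.lossfunctions.LossFunctions

object Dl4jOnSparkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dl4j-on-spark-sketch").getOrCreate()
    val sc    = spark.sparkContext

    // Tiny synthetic training set (1000 examples, 784 features, 10 one-hot classes),
    // just so the job can actually run end to end.
    val rng = new scala.util.Random(42)
    val data = (0 until 1000).map { _ =>
      val features = Nd4j.rand(1, 784)
      val label    = Nd4j.zeros(1, 10)
      label.putScalar(rng.nextInt(10).toLong, 1.0) // one-hot class label
      new DataSet(features, label)
    }
    val trainRdd = sc.parallelize(data)

    // Plain DL4J network configuration; hyper-parameters are placeholders.
    val netConf = new NeuralNetConfiguration.Builder()
      .seed(42)
      .list()
      .layer(0, new DenseLayer.Builder().nIn(784).nOut(256).activation(Activation.RELU).build())
      .layer(1, new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
        .nIn(256).nOut(10).activation(Activation.SOFTMAX).build())
      .build()

    // The TrainingMaster controls how workers synchronise parameters on Spark; its knobs
    // (batch size per worker, averaging frequency, prefetching) are exactly the kind of
    // settings the talk warns can silently hurt performance if left at defaults.
    val trainingMaster = new ParameterAveragingTrainingMaster.Builder(32)
      .batchSizePerWorker(32)
      .averagingFrequency(5)
      .workerPrefetchNumBatches(2)
      .build()

    val sparkNet = new SparkDl4jMultiLayer(sc, netConf, trainingMaster)
    sparkNet.fit(trainRdd) // distributed training across Spark executors

    spark.stop()
  }
}
```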
Cobrix – a COBOL Data Source for Spark
The financial industry operates on a variety of different data and computing platforms. Integrating these different sources into a centralized data lake is crucial to support reporting and analytics tools.
Apache Spark is becoming the tool of choice for big data integration and analytics due to its scalable nature and its support for processing data from a variety of sources and formats such as JSON, Parquet, Kafka, etc. However, one of the most common platforms in the financial industry is the mainframe, which does not interoperate easily with other platforms.
COBOL is the most widely used language in the mainframe environment. It was designed in 1959 and evolved in parallel with other programming languages, and thus has its own constructs and primitives. Furthermore, data produced by COBOL is EBCDIC-encoded and uses different binary representations for numeric data types.
We have developed Cobrix, a library that extends the Spark SQL API to allow reading directly from binary files generated by mainframes.
While projects like Sqoop focus on transferring relational data by providing direct connectors to a mainframe, Cobrix can be used to parse and load hierarchical data (from IMS, for instance) after it is transferred from a mainframe by dumping records to a binary file. The schema should be provided as a COBOL copybook, which can contain nested structures and arrays. We present how the schema mapping between COBOL and Spark was done and how it was used in the implementation of the Spark COBOL data source. We also present use cases of simple and multi-segment files to illustrate how we use the library to load data from mainframes into our Hadoop data lake.
We have open sourced Cobrix at https://github.com/AbsaOSS/cobrix
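For a feel of what the library offers, here is a minimal read sketch based on my reading of the Cobrix README; the file paths are placeholders, and the format name and copybook option are quoted from memory, so check the repository for the authoritative usage:

```scala
import org.apache.spark.sql.SparkSession

object CobrixReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cobrix-read-sketch").getOrCreate()

    // Read a mainframe dump: the COBOL copybook supplies the schema, while Cobrix
    // handles EBCDIC decoding and the binary numeric formats.
    val df = spark.read
      .format("cobol")                              // short name registered by the Cobrix data source
      .option("copybook", "/path/to/copybook.cpy")  // schema definition; may contain nested groups and arrays
      .load("/path/to/mainframe_dump.dat")

    df.printSchema()
    df.show(20, truncate = false)

    spark.stop()
  }
}
```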
Near Real-time Search Index Generation with Lambda Architecture and Spark Streaming at Walmart Scale
Today Walmart offers many millions of products for purchase through its websites. All these products are managed in a large-scale product catalog that is updated thousands of times per second. The changes include product information updates, new products, availability in stores and many more attributes. In our quest to provide a seamless shopping experience for our customers, we developed a streaming indexing data pipeline which ensures that the search index is updated in a timely manner and always reflects the latest state of the product catalog in near real time. Our pipeline is a key component in keeping our search data up to date and in sync with the constantly changing product catalog and other features such as store and online availability, offers, etc.
Our indexing component, which is based on the Spark Streaming receiver approach, consumes events from multiple Kafka topics such as Product Change, Store Availability, and Offer Change, and merges the transformed product attributes with the historical signals computed by the relevance data pipeline and stored in Cassandra. This data is further processed by another streaming component, which partitions documents into a Kafka topic per shard so they can be indexed into Apache Solr for product search. Deployment of this pipeline is automated end to end.
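As a rough illustration (not Walmart's actual code), a receiver-based Spark Streaming job that consumes several Kafka topics and unions them might look like the sketch below. The topic names, ZooKeeper address and the placeholder sink are all assumptions, and the Cassandra merge and per-shard Solr partitioning are only indicated in comments:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object NrtIndexingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("nrt-indexing-sketch")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Receiver-based Kafka streams, one per event type (topic names are made up here).
    val zkQuorum = "zookeeper:2181"
    val groupId  = "indexer"
    val topics   = Seq("product-change", "store-availability", "offer-change")
    val streams  = topics.map(t => KafkaUtils.createStream(ssc, zkQuorum, groupId, Map(t -> 1)))

    // Merge all event streams and keep only the message payload.
    val events = ssc.union(streams).map { case (_, value) => value }

    events.foreachRDD { rdd =>
      // In the real pipeline this is where transformed product attributes would be
      // joined with historical relevance signals stored in Cassandra, and the merged
      // documents partitioned into per-shard Kafka topics for Solr indexing.
      rdd.foreachPartition { partition =>
        partition.foreach(event => println(event)) // placeholder sink
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```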
Apache Spark 2.4 Bridges the Gap Between Big Data and Deep Learning
Big data and AI are joined at the hip: AI applications require massive amounts of training data to build state-of-the-art models. The problem is, big data frameworks like Apache Spark and distributed deep learning frameworks like TensorFlow don’t play well together due to the disparity between how big data jobs are executed and how deep learning jobs are executed.
Apache Spark 2.4 introduced a new scheduling primitive: barrier scheduling. Users can indicate to Spark whether it should use MapReduce mode or barrier mode at each stage of the pipeline, making it easy to embed distributed deep learning training as a Spark stage and simplify the training workflow. In this talk, I will demonstrate step by step how to build a real-world pipeline that combines data processing with Spark and deep learning training with TensorFlow. I will also share best practices and hands-on experience to show the power of this new feature, and open up more discussion on the topic.
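Barrier scheduling itself is a small public API, so a minimal sketch is easy to show. Everything below uses the standard RDD.barrier() and BarrierTaskContext calls; the hand-off to TensorFlow is only hinted at in a comment, and the data is synthetic:

```scala
import org.apache.spark.BarrierTaskContext
import org.apache.spark.sql.SparkSession

object BarrierSchedulingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("barrier-scheduling-sketch").getOrCreate()
    val sc    = spark.sparkContext

    // Pretend these partitions hold preprocessed training data.
    // Note: barrier mode needs enough free slots to launch all tasks at once.
    val data = sc.parallelize(1 to 1000, numSlices = 4)

    // barrier() asks Spark to gang-schedule all 4 tasks of this stage together,
    // which is what a distributed deep learning job needs: every worker must be
    // up before training can start.
    val result = data.barrier().mapPartitions { partition =>
      val ctx = BarrierTaskContext.get()

      // Tasks in a barrier stage can discover each other, e.g. to build a
      // TF_CONFIG-style cluster spec from the peers' addresses.
      val hosts = ctx.getTaskInfos().map(_.address)

      // Wait until every task reaches this point, like a global barrier in MPI.
      ctx.barrier()

      // Here one would hand the partition over to the deep learning framework.
      Iterator.single(s"task ${ctx.partitionId()} saw ${hosts.length} peers")
    }

    result.collect().foreach(println)
    spark.stop()
  }
}
```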
Storage Requirements and Options for Running Spark on Kubernetes
In a world of serverless computing, users tend to be frugal when it comes to spending on compute, storage and other resources, and paying for them when they are not in use becomes a significant concern. Offering Spark as a service in the cloud presents unique challenges, and running Spark on Kubernetes raises a lot of them, especially around storage and persistence. Spark workloads have very specific storage requirements for intermediate data, long-term persistence and shared file systems, and those requirements become even tighter when the same setup needs to be offered as a service for enterprises that must manage GDPR and other compliance regimes such as ISO 27001 and HIPAA certifications.
This talk covers the challenges involved in providing serverless Spark clusters and shares the specific issues one can encounter when running large Kubernetes clusters in production, especially scenarios related to persistence.
It will help people using Kubernetes or a Docker runtime in production understand the various storage options available, which of them are best suited for running Spark workloads on Kubernetes, and what more can be done.
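To make the storage discussion concrete, here is a sketch of how one might mount a PersistentVolumeClaim into executor pods and point Spark's scratch space at it, using the standard spark.kubernetes.* volume options introduced around Spark 2.4. These settings are normally passed via spark-submit --conf; they are shown as builder config here only for brevity, and the volume name, claim name and paths are made up:

```scala
import org.apache.spark.sql.SparkSession

object SparkOnK8sStorageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-on-k8s-storage-sketch")
      // Mount a PersistentVolumeClaim into every executor pod...
      .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.scratch.options.claimName", "spark-scratch-pvc")
      .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.scratch.mount.path", "/scratch")
      .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.scratch.mount.readOnly", "false")
      // ...and point Spark's spill/shuffle scratch space at it, instead of the pod's
      // ephemeral storage that disappears with the container.
      .config("spark.local.dir", "/scratch")
      .getOrCreate()

    // A shuffle-heavy action that exercises the scratch directories. Long-term
    // persistence (the data lake side) would go to an object store or a shared
    // file system rather than pod-local volumes.
    println(spark.range(0, 10000000).repartition(8).count())

    spark.stop()
  }
}
```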
Performant and Reliable Apache Spark SQL Releases
In this talk, we present a comprehensive framework for assessing the correctness, stability, and performance of the Spark SQL engine. Apache Spark is one of the most actively developed open source projects, with more than 1200 contributors from all over the world. At this scale and pace of development, mistakes are bound to happen. To automatically identify correctness issues and performance regressions, we have built a testing pipeline that consists of two complementary stages: randomized testing and benchmarking.
Randomized query testing aims at extending the coverage of the typical unit testing suites, while we use micro and application-like benchmarks to measure new features and make sure existing ones do not regress. We will discuss various approaches we take, including random query generation, random data generation, random fault injection, and longevity stress tests. We will demonstrate the effectiveness of the framework by highlighting several correctness issues we have found through random query generation and critical performance regressions we were able to diagnose within hours due to our automated benchmarking tools.
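The framework described in the talk is internal, but the underlying idea of differential testing is easy to sketch: generate random data and random queries, run each query under two engine configurations, and require identical results. The toy query grammar and the whole-stage-codegen on/off comparison below are my own illustration, not the speakers' framework:

```scala
import scala.util.Random
import org.apache.spark.sql.SparkSession

object RandomQueryTestingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("random-query-testing-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Random data generation: a small table with a few typed columns.
    val rnd  = new Random(42)
    val data = Seq.fill(10000)((rnd.nextInt(100), rnd.nextLong(), rnd.nextDouble(),
      rnd.alphanumeric.take(5).mkString))
    data.toDF("a", "b", "c", "d").createOrReplaceTempView("t")

    // Random query generation: a deliberately tiny grammar of filters and aggregates.
    def randomQuery(): String = {
      val pred = s"a ${Seq("<", ">", "=", "<>")(rnd.nextInt(4))} ${rnd.nextInt(100)}"
      val agg  = Seq("count(*)", "sum(b)", "max(c)", "min(d)")(rnd.nextInt(4))
      s"SELECT a, $agg FROM t WHERE $pred GROUP BY a ORDER BY a"
    }

    def runWithCodegen(sql: String, enabled: Boolean): Array[String] = {
      spark.conf.set("spark.sql.codegen.wholeStage", enabled.toString)
      spark.sql(sql).collect().map(_.toString)
    }

    // Differential check: the same query must return the same rows under both settings.
    (1 to 100).foreach { i =>
      val q = randomQuery()
      val withCodegen    = runWithCodegen(q, enabled = true)
      val withoutCodegen = runWithCodegen(q, enabled = false)
      assert(withCodegen.sameElements(withoutCodegen), s"Mismatch for query #$i: $q")
    }

    println("All randomized queries agreed across configurations.")
    spark.stop()
  }
}
```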
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
At NMC (Nielsen Marketing Cloud) we provide our customers (marketers and publishers) real-time analytics tools to profile their target audiences.
To achieve that, we need to ingest billions of events per day into our big data stores, and we need to do it in a scalable yet cost-efficient manner.
In this session, we will discuss how we continuously transform our data infrastructure to support these goals.
Specifically, we will review how we went from CSV files and standalone Java applications all the way to multiple Kafka and Spark clusters, performing a mixture of Streaming and Batch ETLs, and supporting 10x data growth.
We will share our experience as early-adopters of Spark Streaming and Spark Structured Streaming, and how we overcame technical barriers (and there were plenty...).
We will present a rather unique solution of using Kafka to imitate streaming over our Data Lake, while significantly reducing our cloud services' costs.
Topics include:
* Kafka and Spark Streaming for stateless and stateful use-cases
* Spark Structured Streaming as a possible alternative (a minimal sketch follows this list)
* Combining Spark Streaming with batch ETLs
* "Streaming" over Data Lake using Kafka
TL;DR: How do you make Apache Spark process data efficiently? Lessons learned from running a petabyte-scale Hadoop cluster and dozens of Spark job optimisations, including the most spectacular: from 2,500 GB of RAM down to 240.
Apache Spark is extremely popular for processing data on Hadoop clusters. If your Spark executors go down, the memory allocation is increased. If processing is too slow, the number of executors is increased. This works for a while, but sooner or later you end up with a whole cluster that is fully utilized in an inefficient way.
During the presentation, we will share our lessons learned and the performance improvements we made to our Spark jobs, including the most spectacular: from 2,500 GB of RAM down to 240. We will also answer questions like:
- How do PySpark jobs differ from Scala jobs in terms of performance?
- How does caching affect dynamic resource allocation?
- Why is it worth using mapPartitions? (see the sketch after this list)
- and many more
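On the mapPartitions question, the usual answer is about amortising expensive per-record setup over a whole partition. A minimal sketch, with a made-up ExpensiveClient standing in for a database connection or heavy parser:

```scala
import org.apache.spark.sql.SparkSession

object MapPartitionsSketch {
  // Stand-in for something expensive to construct, e.g. a DB connection or a heavy parser.
  class ExpensiveClient {
    def enrich(record: String): String = record.toUpperCase
  }

  def main(args: Array[String]): Unit = {
    val spark   = SparkSession.builder().appName("map-partitions-sketch").master("local[*]").getOrCreate()
    val records = spark.sparkContext.parallelize(Seq.fill(1000000)("some record"), numSlices = 8)

    // With map(), the expensive client would be created once per record:
    // records.map(r => new ExpensiveClient().enrich(r))

    // With mapPartitions(), it is created once per partition and reused for every
    // record in that partition, which is the main reason it is often worth it.
    val enriched = records.mapPartitions { partition =>
      val client = new ExpensiveClient()
      partition.map(client.enrich)
    }

    println(enriched.count())
    spark.stop()
  }
}
```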