Entry point and basic abstraction
For Spark base
main entry point: SparkContext
basic abstraction: RDD
For Spark SQL
main entry point: SparkSession
basic abstraction: DataFrame
For Spark Streaming
main entry point: StreamingContext
basic abstraction: DStream
For Spark ML
Main entry point:
Core Classes
Spark base
pyspark.SparkContext
Main entry point for Spark functionality.
pyspark.RDD
A Resilient Distributed Dataset (RDD), the basic abstraction in Spark.
Spark Streaming
pyspark.streaming.StreamingContext
Main entry point for Spark Streaming functionality.
pyspark.streaming.DStream
A Discretized Stream (DStream), the basic abstraction in Spark Streaming.
Spark SQL and DataFrame
pyspark.sql.SQLContext
Main entry point for DataFrame and SQL functionality.
pyspark.sql.DataFrame
A distributed collection of data grouped into named columns.
Spark running mode
Locally
Cluster
Setup and run/submit job
Locally
Setup
Spark shell
./bin/spark-shell --master local[2]
OR
./bin/pyspark --master local[2]
Submit job
./bin/spark-submit examples/src/main/python/pi.py 10
OR
./bin/spark-submit examples/src/main/r/dataframe.R
Spark standalone cluster
Spark YARN cluster
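Useful spark-shell commands
:paste (enter paste mode, for multi-line code such as the chained read call below)
:help (list the available shell commands)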
Spark context available as sc.
SQL context available as sqlContext.
Read CSV files into a DataFrame with the spark-csv package, then save the DataFrame as a Parquet file.
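// In spark-shell, :paste lets the multi-line chain be entered as one expression.
// Requires the spark-csv package on the classpath.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")         // first line holds the column names
  .option("inferSchema", "true")    // infer column types from the data
  .option("mode", "DROPMALFORMED")  // skip rows that fail to parse
  .load("/home/myuser/data/log/*.csv")
// saveAsParquetFile was deprecated in Spark 1.4 and removed in 2.0; use the DataFrameWriter instead.
df.write.parquet("/home/myuser/data.parquet")
// Read a Parquet file back and inspect it.
val df_1 = sqlContext.read.parquet("/Users/user_name/Work/tmp/sample.parquet")
df_1.dtypes  // column names and types
df_1.show()  // prints the first 20 rows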