PySpark API Walkthrough (Part 1)

PySpark is the Python API for Spark.

Public classes:

SparkContext:

Main entry point for Spark functionality.

RDD:

A Resilient Distributed Dataset (RDD), the basic abstraction in Spark.

Broadcast:

A broadcast variable that gets reused across tasks.

Accumulator:

An "add-only" shared variable that tasks can only add values to.

SparkConf:

For configuring Spark.

SparkFiles:

Access files shipped with jobs.

StorageLevel:

Finer-grained cache persistence levels.

TaskContext:

Information about the currently running task, available on the workers. Currently experimental.

class pyspark.SparkConf(loadDefaults=True, _jvm=None, _jconf=None)[source]

Configuration for a Spark application. Used to set various Spark parameters as key-value pairs.

Most of the time, you would create a SparkConf object with SparkConf(), which will also load values from Java system properties. Any parameters you set directly on the SparkConf object therefore take priority over the system properties.

For unit tests, you can also call SparkConf(false) to skip loading external settings and get the same configuration no matter what the system properties are.

All setter methods in SparkConf support chaining. For example, you can write:

conf.setMaster("local").setAppName("My app")

Note:

Once a SparkConf object is passed to Spark, it is cloned and can no longer be modified by the user.
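
As a quick illustration of chaining and of the lookup methods documented below, a minimal sketch (the master URL, app name and memory value are placeholders):

>>> from pyspark import SparkConf
>>> conf = SparkConf().setMaster("local[2]").setAppName("conf-demo")
>>> conf = conf.set("spark.executor.memory", "1g")   # each setter returns the same SparkConf
>>> conf.get("spark.app.name")
'conf-demo'
>>> conf.contains("spark.executor.memory")
True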

contains(key)[source]

Does this configuration contain a given key?

get(key, defaultValue=None)[source]

Get the configured value for some key, or return a default otherwise.

getAll()[source]

Get all values as a list of key-value pairs.

set(key, value)[source]

Set a configuration property.

setAll(pairs)[source]

Set multiple parameters, passed as a list of key-value pairs.

Parameters:pairs – list of key-value pairs to set

setAppName(value)[source]

Set application name.

setExecutorEnv(key=None, value=None, pairs=None)[source]

Set an environment variable to be passed to executors.

setIfMissing(key, value)[source]

Set a configuration property, if not already set.

setMaster(value)[source]

Set the master URL to connect to.

setSparkHome(value)[source]

Set the path where Spark is installed on worker nodes.

toDebugString()[source]

Return a printable version of the configuration, as a list of key=value pairs, one per line.


class pyspark.SparkContext(master=None, appName=None, sparkHome=None, pyFiles=None, environment=None, batchSize=0, serializer=PickleSerializer(), conf=None, gateway=None, jsc=None, profiler_cls=<class 'pyspark.profiler.BasicProfiler'>)[source]

Main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs and broadcast variables on that cluster.

PACKAGE_EXTENSIONS = ('.zip', '.egg', '.jar')

accumulator(value, accum_param=None)[source]

Create an Accumulator with the given initial value, using a given AccumulatorParam helper object to define how values of that type are added together. Default AccumulatorParams are provided for integer and floating-point types if you do not supply one; for other types, a custom AccumulatorParam can be used.
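
A minimal sketch of the default numeric behaviour, assuming an existing SparkContext sc as in the other examples:

>>> acc = sc.accumulator(0)                                     # default AccumulatorParam handles ints/floats
>>> sc.parallelize([1, 2, 3, 4]).foreach(lambda x: acc.add(x))  # tasks can only add to it
>>> acc.value                                                   # read back on the driver
10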

addFile(path, recursive=False)[source]

Add a file to be downloaded with this Spark job on every node. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.

To access the file in Spark jobs, use SparkFiles.get(fileName) to find its download location.

A directory can be given if the recursive option is set to True; currently, directories are only supported for Hadoop-supported filesystems.

>>> from pyspark import SparkFiles
>>> path = os.path.join(tempdir, "test.txt")
>>> with open(path, "w") as testFile:
...     _ = testFile.write("100")
>>> sc.addFile(path)
>>> def func(iterator):
...     with open(SparkFiles.get("test.txt")) as testFile:
...         fileVal = int(testFile.readline())
...         return [x * fileVal for x in iterator]
>>> sc.parallelize([1, 2, 3, 4]).mapPartitions(func).collect()
[100, 200, 300, 400]

addPyFile(path)[source]

Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.
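
A hedged sketch: mylib.py is a hypothetical module containing a function myfunc; after addPyFile it becomes importable on the driver and on the executors:

>>> sc.addPyFile("/path/to/mylib.py")                        # hypothetical local path
>>> import mylib                                             # now on the driver's sys.path as well
>>> sc.parallelize([1, 2, 3]).map(mylib.myfunc).collect()    # executors can import it too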

applicationId

A unique identifier for the Spark application. Its format depends on the scheduler implementation.

In the case of a local Spark app: something like 'local-1433865536131'.

In the case of YARN: something like 'application_1433865536131_34483'.

binaryFiles(path, minPartitions=None)[source]

Note


Experimental

Read a directory of binary files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI as a byte array. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.

Note


Small files are preferred; large files are also allowed, but they may cause poor performance.
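
A sketch of reading such a directory; the HDFS path is hypothetical, and each element is a (file path, file content as bytes) pair:

>>> pairs = sc.binaryFiles("hdfs://namenode/data/images")   # hypothetical directory of small files
>>> sizes = pairs.mapValues(len)                            # (path, size in bytes)
>>> sizes.take(2)                                           # e.g. [('hdfs://.../a.png', 4096), ...]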

binaryRecords(path, recordLength)[source]

Note


Experimental

Load data from a flat binary file, assuming each record is a set of numbers with the specified numerical format (see ByteBuffer), and the number of bytes per record is constant.

Parameters:path – Directory to the input data files

recordLength – The length at which to split the records
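
For example, a flat file of packed little-endian 32-bit integers could be read with recordLength=4 and decoded with the struct module; the path is hypothetical:

>>> import struct
>>> recs = sc.binaryRecords("hdfs://namenode/data/ints.bin", 4)   # hypothetical file, 4 bytes per record
>>> ints = recs.map(lambda r: struct.unpack("<i", r)[0])          # one int per record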

broadcast(value)[source]

Broadcast a read-only variable to the cluster, returning a Broadcast object for reading it in distributed functions. The variable will be sent to each cluster node only once.
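
A minimal sketch, assuming an existing SparkContext sc:

>>> lookup = sc.broadcast({"a": 1, "b": 2})   # shipped to each node once
>>> sc.parallelize(["a", "b", "a"]).map(lambda k: lookup.value[k]).collect()
[1, 2, 1]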

cancelAllJobs()[source]

Cancel all jobs that have been scheduled or are running.

cancelJobGroup(groupId)[source]

Cancel active jobs for the specified group. See SparkContext.setJobGroup for more information.

defaultMinPartitions

Default min number of partitions for Hadoop RDDs when not given by user

defaultParallelism

Default level of parallelism to use when not given by user (e.g. for reduce tasks)

dump_profiles(path)[source]

Dump the profile stats into directory path

emptyRDD()[source]

Create an RDD that has no partitions or elements.

getConf()[source]

getLocalProperty(key)[source]

Get a local property set in this thread, or null if it is missing. See setLocalProperty

classmethod getOrCreate(conf=None)[source]

Get or instantiate a SparkContext and register it as a singleton object.

Parameters:conf – SparkConf (optional)
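
A short sketch; the app name is a placeholder. If a SparkContext is already running, the passed conf is ignored and the existing singleton is returned:

>>> from pyspark import SparkConf, SparkContext
>>> sc = SparkContext.getOrCreate(SparkConf().setAppName("reuse-demo"))
>>> sc is SparkContext.getOrCreate()   # later calls return the registered singleton
True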

hadoopFile(path, inputFormatClass, keyClass, valueClass, keyConverter=None, valueConverter=None, conf=None, batchSize=0)[source]

Read an ‘old’ Hadoop InputFormat with arbitrary key and value class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. The mechanism is the same as for sc.sequenceFile.

A Hadoop configuration can be passed in as a Python dict. This will be converted into a Configuration in Java.

Parameters:path – path to Hadoop file

inputFormatClass – fully qualified classname of Hadoop InputFormat (e.g. “org.apache.hadoop.mapred.TextInputFormat”)

keyClass – fully qualified classname of key Writable class (e.g. “org.apache.hadoop.io.Text”)

valueClass – fully qualified classname of value Writable class (e.g. “org.apache.hadoop.io.LongWritable”)

keyConverter – (None by default)

valueConverter – (None by default)

conf – Hadoop configuration, passed in as a dict (None by default)

batchSize – The number of Python objects represented as a single Java object. (default 0, choose batchSize automatically)
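
A hedged sketch of reading a plain text file through the old-API TextInputFormat, whose keys are LongWritable byte offsets and whose values are Text lines; the path is hypothetical:

>>> rdd = sc.hadoopFile("hdfs://namenode/logs/input.txt",          # hypothetical path
...                     "org.apache.hadoop.mapred.TextInputFormat",
...                     "org.apache.hadoop.io.LongWritable",
...                     "org.apache.hadoop.io.Text")
>>> rdd.take(1)                                                    # e.g. [(0, 'first line')]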

hadoopRDD(inputFormatClass, keyClass, valueClass, keyConverter=None, valueConverter=None, conf=None, batchSize=0)[source]

Read an ‘old’ Hadoop InputFormat with arbitrary key and value class, from an arbitrary Hadoop configuration, which is passed in as a Python dict. This will be converted into a Configuration in Java. The mechanism is the same as for sc.sequenceFile.

Parameters:inputFormatClass – fully qualified classname of Hadoop InputFormat (e.g. “org.apache.hadoop.mapred.TextInputFormat”)

keyClass – fully qualified classname of key Writable class (e.g. “org.apache.hadoop.io.Text”)

valueClass – fully qualified classname of value Writable class (e.g. “org.apache.hadoop.io.LongWritable”)

keyConverter – (None by default)

valueConverter – (None by default)

conf – Hadoop configuration, passed in as a dict (None by default)

batchSize – The number of Python objects represented as a single Java object. (default 0, choose batchSize automatically)

newAPIHadoopFile(path, inputFormatClass, keyClass, valueClass, keyConverter=None, valueConverter=None, conf=None, batchSize=0)[source]

Read a ‘new API’ Hadoop InputFormat with arbitrary key and value class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. The mechanism is the same as for sc.sequenceFile.

A Hadoop configuration can be passed in as a Python dict. This will be converted into a Configuration in Java

Parameters:path – path to Hadoop file

inputFormatClass – fully qualified classname of Hadoop InputFormat (e.g. “org.apache.hadoop.mapreduce.lib.input.TextInputFormat”)

keyClass – fully qualified classname of key Writable class (e.g. “org.apache.hadoop.io.Text”)

valueClass – fully qualified classname of value Writable class (e.g. “org.apache.hadoop.io.LongWritable”)

keyConverter – (None by default)

valueConverter – (None by default)

conf – Hadoop configuration, passed in as a dict (None by default)

batchSize – The number of Python objects represented as a single Java object. (default 0, choose batchSize automatically)
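
The new-API variant looks almost identical; the sketch below also passes a Hadoop configuration dict (the path and split-size value are only illustrative):

>>> rdd = sc.newAPIHadoopFile("hdfs://namenode/logs/input.txt",    # hypothetical path
...                           "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
...                           "org.apache.hadoop.io.LongWritable",
...                           "org.apache.hadoop.io.Text",
...                           conf={"mapreduce.input.fileinputformat.split.maxsize": "67108864"})
>>> rdd.values().take(1)                                           # the Text values, i.e. the lines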

newAPIHadoopRDD(inputFormatClass, keyClass, valueClass, keyConverter=None, valueConverter=None, conf=None, batchSize=0)[source]

Read a ‘new API’ Hadoop InputFormat with arbitrary key and value class, from an arbitrary Hadoop configuration, which is passed in as a Python dict. This will be converted into a Configuration in Java. The mechanism is the same as for sc.sequenceFile.

Parameters:inputFormatClass – fully qualified classname of Hadoop InputFormat (e.g. “org.apache.hadoop.mapreduce.lib.input.TextInputFormat”)

keyClass – fully qualified classname of key Writable class (e.g. “org.apache.hadoop.io.Text”)

valueClass – fully qualified classname of value Writable class (e.g. “org.apache.hadoop.io.LongWritable”)

keyConverter – (None by default)

valueConverter – (None by default)

conf – Hadoop configuration, passed in as a dict (None by default)

batchSize – The number of Python objects represented as a single Java object. (default 0, choose batchSize automatically)

parallelize(c, numSlices=None)[source]

Distribute a local Python collection to form an RDD. Using xrange is recommended if the input represents a range, for performance.

>>> sc.parallelize([0, 2, 3, 4, 6], 5).glom().collect()
[[0], [2], [3], [4], [6]]
>>> sc.parallelize(xrange(0, 6, 2), 5).glom().collect()
[[], [0], [], [2], [4]]

pickleFile(name, minPartitions=None)[source]

Load an RDD previously saved using RDD.saveAsPickleFile method.

>>> tmpFile = NamedTemporaryFile(delete=True)
>>> tmpFile.close()
>>> sc.parallelize(range(10)).saveAsPickleFile(tmpFile.name, 5)
>>> sorted(sc.pickleFile(tmpFile.name, 3).collect())
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

range(start, end=None, step=1, numSlices=None)[source]

Create a new RDD of int containing elements from start to end (exclusive), increased by step every element. Can be called the same way as python’s built-in range() function. If called with a single argument, the argument is interpreted as end, and start is set to 0.

Parameters:start – the start value

end – the end value (exclusive)

step – the incremental step (default: 1)

numSlices – the number of partitions of the new RDD

Returns:An RDD of int

>>> sc.range(5).collect()
[0, 1, 2, 3, 4]
>>> sc.range(2, 4).collect()
[2, 3]
>>> sc.range(1, 7, 2).collect()
[1, 3, 5]

runJob(rdd, partitionFunc, partitions=None, allowLocal=False)[source]

Executes the given partitionFunc on the specified set of partitions, returning the result as an array of elements.

If ‘partitions’ is not specified, this will run over all partitions.

>>> myRDD = sc.parallelize(range(6), 3)
>>> sc.runJob(myRDD, lambda part: [x * x for x in part])
[0, 1, 4, 9, 16, 25]

>>> myRDD = sc.parallelize(range(6), 3)
>>> sc.runJob(myRDD, lambda part: [x * x for x in part], [0, 2], True)
[0, 1, 16, 25]

sequenceFile(path, keyClass=None, valueClass=None, keyConverter=None, valueConverter=None, minSplits=None, batchSize=0)[source]

Read a Hadoop SequenceFile with arbitrary key and value Writable class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. The mechanism is as follows:

A Java RDD is created from the SequenceFile or other InputFormat, and the key and value Writable classes

Serialization is attempted via Pyrolite pickling

If this fails, the fallback is to call ‘toString’ on each key and value

PickleSerializer is used to deserialize pickled objects on the Python side

Parameters:path – path to sequence file

keyClass – fully qualified classname of key Writable class (e.g. “org.apache.hadoop.io.Text”)

valueClass – fully qualified classname of value Writable class (e.g. “org.apache.hadoop.io.LongWritable”)

keyConverter –

valueConverter –

minSplits – minimum splits in dataset (default min(2, sc.defaultParallelism))

batchSize – The number of Python objects represented as a single Java object. (default 0, choose batchSize automatically)
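
A round-trip sketch, reusing the tempdir convention of the other examples: write a small pair RDD with RDD.saveAsSequenceFile and read it back without specifying the Writable classes:

>>> seqPath = os.path.join(tempdir, "demo-seq")
>>> sc.parallelize([("a", 1), ("b", 2)]).saveAsSequenceFile(seqPath)
>>> sorted(sc.sequenceFile(seqPath).collect())   # [('a', 1), ('b', 2)]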

setCheckpointDir(dirName)[source]

Set the directory under which RDDs are going to be checkpointed. The directory must be an HDFS path if running on a cluster.
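
A local sketch, again using the tempdir from the other examples; on a cluster the directory would be an HDFS path instead:

>>> sc.setCheckpointDir(os.path.join(tempdir, "checkpoints"))
>>> rdd = sc.parallelize(range(5)).map(lambda x: x * x)
>>> rdd.checkpoint()            # mark the RDD for checkpointing
>>> rdd.count()                 # the first action writes the checkpoint
5
>>> rdd.isCheckpointed()
True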

setJobDescription(value)[source]

Set a human readable description of the current job.

setJobGroup(groupId, description, interruptOnCancel=False)[source]

Assigns a group ID to all the jobs started by this thread until the group ID is set to a different value or cleared.

Often, a unit of execution in an application consists of multiple Spark actions or jobs. Application programmers can use this method to group all those jobs together and give a group description. Once set, the Spark web UI will associate such jobs with this group.

The application can use SparkContext.cancelJobGroup to cancel all running jobs in this group.

>>> import threading
>>> from time import sleep
>>> result = "Not Set"
>>> lock = threading.Lock()
>>> def map_func(x):
...     sleep(100)
...     raise Exception("Task should have been cancelled")
>>> def start_job(x):
...     global result
...     try:
...         sc.setJobGroup("job_to_cancel", "some description")
...         result = sc.parallelize(range(x)).map(map_func).collect()
...     except Exception as e:
...         result = "Cancelled"
...     lock.release()
>>> def stop_job():
...     sleep(5)
...     sc.cancelJobGroup("job_to_cancel")
>>> supress = lock.acquire()
>>> supress = threading.Thread(target=start_job, args=(10,)).start()
>>> supress = threading.Thread(target=stop_job).start()
>>> supress = lock.acquire()
>>> print(result)
Cancelled

If interruptOnCancel is set to true for the job group, then job cancellation will result in Thread.interrupt() being called on the job’s executor threads. This is useful to help ensure that the tasks are actually stopped in a timely manner, but is off by default due to HDFS-1208, where HDFS may respond to Thread.interrupt() by marking nodes as dead.

setLocalProperty(key, value)[source]

Set a local property that affects jobs submitted from this thread, such as the Spark fair scheduler pool.
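
The typical use is routing jobs from one thread into a fair scheduler pool; the pool name below is a placeholder and only has an effect when spark.scheduler.mode is FAIR:

>>> sc.setLocalProperty("spark.scheduler.pool", "reporting")   # hypothetical pool name
>>> sc.getLocalProperty("spark.scheduler.pool")
'reporting'
>>> _ = sc.parallelize(range(100)).sum()                       # jobs from this thread run in that pool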

setLogLevel(logLevel)[source]

Control our logLevel. This overrides any user-defined log settings. Valid log levels include: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN
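
For example, to quiet INFO chatter during an interactive session:

>>> sc.setLogLevel("WARN")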

classmethod setSystemProperty(key, value)[source]

Set a Java system property, such as spark.executor.memory. This must be invoked before instantiating SparkContext.
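
A sketch of the intended call order; the memory value and app name are only illustrative, and in a session where sc already exists this would have had to run before that context was created:

>>> from pyspark import SparkContext
>>> SparkContext.setSystemProperty("spark.executor.memory", "2g")   # before any SparkContext exists
>>> sc = SparkContext("local", "sysprop-demo")                      # picks the property up at startup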

show_profiles()[source]

Print the profile stats to stdout

sparkUser()[source]

Get SPARK_USER for user who is running SparkContext.

startTime

Return the epoch time when the Spark Context was started.

statusTracker()[source]

Return StatusTracker object

stop()[source]

Shut down the SparkContext.

textFile(name, minPartitions=None, use_unicode=True)[source]

Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.

If use_unicode is False, the strings will be kept as str (encoding as utf-8), which is faster and smaller than unicode. (Added in Spark 1.2)

>>> path = os.path.join(tempdir, "sample-text.txt")
>>> with open(path, "w") as testFile:
...     _ = testFile.write("Hello world!")
>>> textFile = sc.textFile(path)
>>> textFile.collect()
['Hello world!']

uiWebUrl

Return the URL of the SparkUI instance started by this SparkContext

union(rdds)[source]

Build the union of a list of RDDs.

This supports unions() of RDDs with different serialized formats, although this forces them to be reserialized using the default serializer:

>>> path = os.path.join(tempdir, "union-text.txt")
>>> with open(path, "w") as testFile:
...     _ = testFile.write("Hello")
>>> textFile = sc.textFile(path)
>>> textFile.collect()
['Hello']
>>> parallelized = sc.parallelize(["World!"])
>>> sorted(sc.union([textFile, parallelized]).collect())
['Hello', 'World!']

version

The version of Spark on which this application is running.

wholeTextFiles(path, minPartitions=None, use_unicode=True)[source]

Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.

If use_unicode is False, the strings will be kept as str (encoding as utf-8), which is faster and smaller than unicode. (Added in Spark 1.2)

For example, if you have the following files:

hdfs://a-hdfs-path/part-00000
hdfs://a-hdfs-path/part-00001
...
hdfs://a-hdfs-path/part-nnnnn

Do rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path"), then rdd contains:

(a-hdfs-path/part-00000, its content)
(a-hdfs-path/part-00001, its content)
...
(a-hdfs-path/part-nnnnn, its content)

Note


Small files are preferred, as each file will be loaded fully in memory.

>>> dirPath = os.path.join(tempdir, "files")
>>> os.mkdir(dirPath)
>>> with open(os.path.join(dirPath, "1.txt"), "w") as file1:
...     _ = file1.write("1")
>>> with open(os.path.join(dirPath, "2.txt"), "w") as file2:
...     _ = file2.write("2")
>>> textFiles = sc.wholeTextFiles(dirPath)
>>> sorted(textFiles.collect())
[('.../1.txt', '1'), ('.../2.txt', '2')]

class pyspark.SparkFiles[source]

Resolves paths to files added through SparkContext.addFile().

SparkFiles contains only classmethods; users should not create SparkFiles instances.

classmethod get(filename)[source]

Get the absolute path of a file added through SparkContext.addFile().

classmethod getRootDirectory()[source]

Get the root directory that contains files added through SparkContext.addFile().

SparkFiles addresses the problem of shipping files to Spark: a file added this way is made available on every node, and Spark presumably keeps it in a temporary directory of its own on each node.
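
Tying the two classmethods to the addFile example shown earlier: the resolved path of an added file always lives under the per-application root directory.

>>> from pyspark import SparkFiles
>>> path = os.path.join(tempdir, "lookup.txt")
>>> with open(path, "w") as f:
...     _ = f.write("42")
>>> sc.addFile(path)
>>> SparkFiles.get("lookup.txt").startswith(SparkFiles.getRootDirectory())
True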
