For our new program "Apache Spark for Data Engineers" and the webinar on December 2, we have prepared a translation of an overview article about Spark 3.0.
Spark 3.0 ships with a set of important improvements, including: better performance with AQE, reading of binary files, improved SQL and Python support, Python 3.0, Hadoop 3 integration, and ACID support.
In this article the author gives examples of using these new features. This is the first article about Spark 3.0 functionality, and the series is planned to continue.
This article covers the following Spark 3.0 features:
Adaptive Query Execution (AQE) framework
Support for new languages
New web UI for Structured Streaming
Reading binary files
Recursive folder lookup
Multi-character delimiter support (||)
New built-in Spark functions
Switch to the proleptic Gregorian calendar
DataFrame tail
repartition function in SQL queries
Improved ANSI SQL compatibility
Adaptive Query Execution (AQE) framework
Adaptive Query Execution (AQE) is one of the most significant features introduced in Spark 3.0. It re-optimizes and adjusts query plans based on statistics collected at runtime.
Before version 3.0, Spark generated the execution plan before the query started and did not change it afterwards, even if the initial estimates turned out to be inaccurate. With AQE, Spark gathers statistics while the query runs and uses them to re-optimize the remaining stages of the plan.
By default, Adaptive Query Execution is disabled. To enable it, set spark.sql.adaptive.enabled to true. With AQE enabled, Spark has shown noticeably better performance on TPC-DS queries than Spark 2.4.
AQE in Spark 3.0 includes three main features:
dynamic coalescing of shuffle partitions;
dynamic switching of join strategies (for example, replacing a sort-merge join with a broadcast join);
dynamic optimization of skew joins.
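To get a feel for the setting, here is a minimal sketch (my own example, not from the article) that enables AQE for a session and runs a toy aggregation so there is a shuffle for AQE to optimize; spark.sql.adaptive.coalescePartitions.enabled additionally lets Spark merge small shuffle partitions at runtime:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("aqe-demo")
  .config("spark.sql.adaptive.enabled", "true")                    // AQE is off by default in 3.0
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true") // merge small shuffle partitions
  .getOrCreate()
import spark.implicits._
// A shuffle-heavy toy query: with AQE the number of post-shuffle partitions
// is picked from runtime statistics instead of the static spark.sql.shuffle.partitions value.
spark.range(0, 1000000).toDF("id")
  .groupBy(($"id" % 100).as("bucket"))
  .count()
  .show(5)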
Support for new languages
With Spark 3.0, the supported language and runtime versions are updated:
Python 3 (Python 2.x is deprecated)
Scala 2.12
JDK 11
Spark 3.0 also supports Hadoop 3 and integrates with Kafka 2.4.1.
New web UI for Spark Structured Streaming
When a Spark application runs, a web UI is started that shows the state of the application: jobs, stages, tasks, storage, SQL queries and so on. Until now, though, it offered no convenient view of Structured Streaming applications; Spark 3.0 adds a dedicated Structured Streaming tab for them.
The new tab contains two sections:
Source: Databricks
«Active Streaming Queries» lists the streaming queries that are currently running, while «Completed Streaming Queries» lists those that have finished.
Clicking a query's Run ID opens detailed statistics of that streaming query: input rate, process rate, number of input rows, batch duration, operation duration and so on. The screenshots are taken from Databricks.
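For context, here is a minimal streaming query (my own sketch, not from the article) that you can run to populate the new tab; it uses the built-in rate test source and a console sink:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("streaming-ui-demo")
  .getOrCreate()
// While it runs, the query is listed under "Active Streaming Queries";
// after stop() it moves to "Completed Streaming Queries".
val query = spark.readStream
  .format("rate")                 // test source emitting (timestamp, value) rows
  .option("rowsPerSecond", "5")
  .load()
  .writeStream
  .format("console")
  .outputMode("append")
  .start()
query.awaitTermination(30000)     // let it run for about 30 seconds
query.stop()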
Reading binary files
Spark 3.0 supports a "binaryFile" data source for reading binary files.
With the binaryFile source, DataFrameReader reads files such as images, pdf, zip, gzip, tar and others, and turns them into a DataFrame in which each row contains the file's metadata and its content as a binary column.
val df = spark.read.format("binaryFile").load("/tmp/binary/spark.png")
df.printSchema()
df.show()
root
|-- path: string (nullable = true)
|-- modificationTime: timestamp (nullable = true)
|-- length: long (nullable = true)
|-- content: binary (nullable = true)
+--------------------+--------------------+------+--------------------+
| path| modificationTime|length| content|
+--------------------+--------------------+------+--------------------+
|file:/C:/tmp/bina…|2020-07-25 10:11:…| 74675|[89 50 4E 47 0D 0...|
+--------------------+--------------------+------+--------------------+
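The same source can read an entire folder. A small sketch (my own example) that uses the standard pathGlobFilter file-source option to load only PNG files:
// Each row holds path, modificationTime, length and the binary content of one file
val pngs = spark.read.format("binaryFile")
  .option("pathGlobFilter", "*.png")
  .load("/tmp/binary/")
println(pngs.count())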
Recursive folder lookup
Spark 3.0 introduces the recursiveFileLookup option for loading files from nested folders. When it is set to true, DataFrameReader recursively reads files from all subdirectories of the specified path:
spark.read.option("recursiveFileLookup", "true").csv("/path/to/folder")
Multi-character delimiter support (||)
Spark 3.0 adds support for multi-character delimiters (such as ||) when reading CSV files. Suppose you have a CSV file with the following content:
col1||col2||col3||col4
val1||val2||val3||val4
val1||val2||val3||val4
It can be read by specifying the delimiter explicitly:
val df = spark.read
.option("delimiter","||")
.option("header","true")
.csv("/tmp/data/douplepipedata.csv")
Spark 2.x did not support multi-character delimiters, so the file had to be preprocessed before loading; trying to read it directly failed with the following error:
throws java.lang.IllegalArgumentException: Delimiter cannot be more than one character: ||
New built-in Spark functions
Below is the list of new functions added to Spark SQL; they can be used anywhere Spark SQL expressions are accepted.
sinh, cosh, tanh, asinh, acosh, atanh, any, bit_and, bit_or, bit_count, bit_xor,
bool_and, bool_or, count_if, date_part, extract, forall, from_csv,
make_date, make_interval, make_timestamp, map_entries,
map_filter, map_zip_with, max_by, min_by, schema_of_csv, to_csv,
transform_keys, transform_values, typeof, version,
xxhash64
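To show a couple of these in action, here is a small sketch with toy data (my own example, not from the article) using count_if and max_by:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().master("local[*]").appName("new-funcs").getOrCreate()
import spark.implicits._
Seq(("apples", 10), ("pears", 25), ("plums", 17))
  .toDF("item", "amount")
  .createOrReplaceTempView("sales")
// count_if counts rows matching a predicate; max_by returns the value of one column
// for the row where another column is maximal
spark.sql(
  """SELECT count_if(amount > 15) AS big_sales,
    |       max_by(item, amount)  AS top_item
    |FROM sales""".stripMargin).show()
//+---------+--------+
//|big_sales|top_item|
//+---------+--------+
//|        2|   pears|
//+---------+--------+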
Switch to the proleptic Gregorian calendar
Before version 3.0, Spark used a combination of two calendars: the Julian calendar for dates before 1582 and the Gregorian calendar for later dates. This behaviour was inherited from the legacy java.sql.Date API of JDK 7; in JDK 8 it was superseded by the java.time.LocalDate API, which uses the proleptic Gregorian calendar.
Spark 3.0 switches to the proleptic Gregorian calendar, the calendar already used by Pandas, R and Apache Arrow. For dates starting from 15 October 1582, the Date & Timestamp values produced by Spark 3.0 are the same as before; for dates earlier than 15 October 1582 the results may differ.
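To make the difference concrete, here is a small JVM-level sketch (my own example, not from the article): java.time, which Spark 3.0 builds on, treats every day of October 1582 as valid, while the old hybrid calendar behind the legacy java.sql API jumps from October 4 straight to October 15:
import java.time.LocalDate
import java.util.{Calendar, GregorianCalendar, TimeZone}
// Proleptic Gregorian calendar (java.time, used by Spark 3.0): 1582-10-10 is an ordinary date
println(LocalDate.of(1582, 10, 10))              // 1582-10-10
// Hybrid Julian/Gregorian calendar (used by java.sql.Date in Spark 2.x):
// the day after 1582-10-04 is 1582-10-15, the ten days in between do not exist
val cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
cal.clear()
cal.set(1582, Calendar.OCTOBER, 4)
cal.add(Calendar.DAY_OF_MONTH, 1)
println("%1$tF".format(cal))                     // 1582-10-15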
Spark 3.0 also adds new Date & Timestamp constructor functions: make_date(), make_timestamp() and make_interval().
make_date(year, month, day)
– builds a date from the <year>, <month> and <day> fields.
make_date(2014, 8, 13)
//returns 2014-08-13.
make_timestamp(year, month, day, hour, min, sec[, timezone])
– builds a Timestamp from the <year>, <month>, <day>, <hour>, <min>, <sec> and optional <timezone> fields.
make_timestamp(2014, 8, 13, 1, 10, 40.147)
//returns Timestamp 2014-08-13 01:10:40.147
make_timestamp(2014, 8, 13, 1, 10, 40.147, 'CET')
make_interval(years, months, weeks, days, hours, mins, secs)
– builds an interval from the given fields. Unlike make_date() and make_timestamp(), the arguments of make_interval() are optional; any field that is not specified defaults to 0.
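These can be tried directly through spark.sql (a quick sketch; it assumes the usual spark session from the other examples):
spark.sql("SELECT make_date(2014, 8, 13) AS d").show()
// |2014-08-13|
spark.sql("SELECT make_timestamp(2014, 8, 13, 1, 10, 40.147) AS ts").show(false)
// |2014-08-13 01:10:40.147|
spark.sql("SELECT make_interval(0, 0, 2, 5, 0, 0, 0) AS i").show(false)
// an interval of 2 weeks and 5 days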
DataFrame.tail()
Spark has long had a head() action for getting the first N rows of a DataFrame, but there was no counterpart for getting the last N rows, similar to tail() in Pandas for Python. Spark 3.0 introduces a tail() action that does exactly that. In Scala, tail() returns scala.Array[T].
// Take the last 5 rows of the DataFrame
val data = spark.range(1, 100).toDF("num").tail(5)
data.foreach(print)
//Returns
//[95][96][97][98][99]
The repartition function in SQL queries
Spark SQL queries do not support all of the transformations and actions available on a Dataset/DataFrame; in particular, before version 3.0 there was no way to repartition data from SQL with repartition(). Spark 3.0 makes this possible in SQL queries as well. Here is an example.
val df=spark.range(1,10000).toDF("num")
println("Before re-partition :"+df.rdd.getNumPartitions)
df.createOrReplaceTempView("RANGE¨C17CTABLE")
println("After re-partition :"+df2.rdd.getNumPartitions)
//Returns
//Before re-partition :1
//After re-partition :20
Improved ANSI SQL compatibility
Since Spark is widely used by data engineers who come from the SQL world, improved ANSI SQL compatibility was one of the most requested features, and Spark 3.0 addresses it. To enable it, set spark.sql.parser.ansi.enabled to true in the Spark configuration.
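A minimal sketch of switching the setting on at runtime (my own example; the flag name is the one given in the article for Spark 3.0, while later releases expose ANSI mode as spark.sql.ansi.enabled):
// Enable ANSI-compliant SQL behaviour for the current session
spark.conf.set("spark.sql.parser.ansi.enabled", "true")
// In newer Spark versions the equivalent switch is:
// spark.conf.set("spark.sql.ansi.enabled", "true")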
Learn more about Newprolab's Apache Spark programs:
Apache Spark for Data Engineers (Scala).
Apache Spark (Python).