For future students of the "Data Engineer" and "Hadoop, Spark, Hive Ecosystem" courses, we have prepared a translation of another useful article.
Criteo — , . , . Spark — . , , , .
:
Spark . , Spark SQL (Datasets) Spark Core API (RDD), , 2–10 , .
Test environment: Spark 2.4.6 on a 2017 MacBook Pro with a 3.5 GHz Intel Core i7.
Java- ( 100 , 90 ). Scala, Python.
,
, , :
, ;
-, , .
— 2006 , Hadoop, , MapReduce. , Spark. .
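In 2015, Kay Ousterhout and co-authors published an analysis of Spark job performance.¹ On the workloads they studied, including TPC-DS², they found that:
network optimizations could reduce job completion time by at most 2% (median);
optimizing or eliminating disk I/O could reduce job completion time by at most 19% (median).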
! , - , . :
Spark - , , .
, , , , , .
, Databricks 2016 ³ , Spark . SQL, API DataFrames Datasets.
Spark?
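As a running example, take a simple task: counting the even numbers between 0 and 10⁹. Spark is not needed for this; in plain Scala: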
var res: Long = 0L
var i: Long = 0L
while (i < 1000L * 1000 * 1000) {
  if (i % 2 == 0) res += 1
  i += 1L
}
Listing 1.
Spark RDD Spark Datasets. , Spark [1] :
val res = spark.sparkContext
  .range(0L, 1000L * 1000 * 1000)
  .filter(_ % 2 == 0)
  .count()
Listing 2. RDD
import org.apache.spark.sql.functions.{col, count}

val res = spark.range(1000L * 1000 * 1000)
  .filter(col("id") % 2 === 0)
  .select(count(col("id")))
  .first().getAs[Long](0)
Listing 3. Datasets
. , . , RDD , Datasets , .
Datasets
: API- Datasets RDD, , , , . ? .
— Volcano
, RDD, Volcano. , RDD :
RDD;
compute
Iterator[T], RDD ( private Spark).
abstract class RDD[T: ClassTag]
def compute(…): Iterator[T]
Listing 4. RDD.scala
RDD, , :
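def pseudo_rdd_count[T](rdd: RDD[T]): Long = {
  val iter = rdd.compute   // pseudocode: the real compute takes arguments
  var result = 0L          // Long, to match the declared return type
  while (iter.hasNext) {
    iter.next()            // consume the element; without this the loop never ends
    result += 1
  }
  result
}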
Listing 5. RDD
, , 1? :
: Iterator.next() , , JIT (inline).
: Java- JIT -, 5, , -, 1. , Java- JIT , .
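The iterator-based execution described above can be illustrated without Spark. The following is a minimal sketch in plain Scala, not Spark's actual code: `RangeIterator`, `FilterIterator`, `countAll` and `countEven` are hypothetical names introduced here. Each operator wraps its parent's iterator, so every element costs several `hasNext`/`next()` calls through the `Iterator` interface before it reaches the counting loop.

```scala
object VolcanoSketch {
  // Hypothetical source operator: yields start, start + 1, ..., end - 1.
  final class RangeIterator(start: Long, end: Long) extends Iterator[Long] {
    private var current = start
    def hasNext: Boolean = current < end
    def next(): Long = {
      if (!hasNext) throw new NoSuchElementException("range exhausted")
      val value = current
      current += 1
      value
    }
  }

  // Hypothetical filter operator: pulls from its parent until an element passes.
  final class FilterIterator[T](parent: Iterator[T], keep: T => Boolean) extends Iterator[T] {
    private var pending: Option[T] = None
    def hasNext: Boolean = {
      while (pending.isEmpty && parent.hasNext) {
        val candidate = parent.next()
        if (keep(candidate)) pending = Some(candidate)
      }
      pending.nonEmpty
    }
    def next(): T = {
      if (!hasNext) throw new NoSuchElementException("filter exhausted")
      val value = pending.get
      pending = None
      value
    }
  }

  // Same shape as the count pseudocode: drain the iterator, counting elements.
  def countAll[T](iter: Iterator[T]): Long = {
    var result = 0L
    while (iter.hasNext) {
      iter.next()
      result += 1
    }
    result
  }

  // Count the even numbers in [0, n) through the operator chain.
  def countEven(n: Long): Long =
    countAll(new FilterIterator[Long](new RangeIterator(0L, n), _ % 2 == 0))
}
```

For example, `VolcanoSketch.countEven(10L)` walks the values 0 to 9 through both operators and counts the five even ones.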
—
, Spark SQL⁵, , , RDD. , Spark , . (Whole-Stage Code Generation)⁶. Spark , . JVM/JIT . Spark , ., , Spark 3.
Spark , - Janino⁴. Spark SQL RDD.
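To give a feel for what whole-stage code generation aims at, here is a hand-written sketch in plain Scala. It is not the code Spark generates; `countEvenOperators` and `countEvenFused` are hypothetical names. The first function composes separate operators over an iterator, the second collapses the same pipeline into one loop over primitive values, which is roughly the shape a fused stage takes.

```scala
object FusedLoopSketch {
  // Operator-at-a-time version: a range, a filter and a count chained via iterators.
  def countEvenOperators(n: Long): Long =
    (0L until n).iterator.filter(_ % 2 == 0).size.toLong

  // Fused version: the same three steps written as a single loop,
  // with no iterator objects between them.
  def countEvenFused(n: Long): Long = {
    var count = 0L
    var i = 0L
    while (i < n) {
      if (i % 2 == 0) count += 1
      i += 1
    }
    count
  }
}
```

Both functions return the same result for any non-negative `n` that fits the `Int`-sized `size` of the first version; only the execution shape differs.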
Spark
Spark 3 API- Scala/Java: RDD, Datasets DataFrames ( Datasets). RDD Spark — , - , API , « » . , , API- Datasets .
—
, Spark SQL, API RDD. , Java, Spark SQL:
val res = spark.range(1000L * 1000 * 1000)
  .rdd
  .filter(_ % 2 == 0)
  .count()
Listing 6. Dataset RDD
43 2,1 , . RDD Java, . 3 6 (. ), , .
—
Spark SQL . ( 6 ):
val res = spark
  .range(1000L * 1000 * 1000)
  .filter(x => x % 2 == 0) // note that the condition changed
  .select(count(col("id")))
  .first()
  .getAs[Long](0)
Listing 7. Spark SQL Scala
Spark . Scala, Spark SQL, Spark , . — (. 1a), , (DAG) Spark.
Spark SQL — - ! , , : filter(condition: Column) filter(T => Boolean) select(…) map(…). Spark , (Dataset). , , RDD.
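One way to see how the two `filter` overloads differ is to print the physical plan of each query. The sketch below assumes Spark is on the classpath and that a local session is acceptable; the object name, the application name and the small range size are placeholders chosen for illustration. It uses only `Dataset.filter`, `Dataset.explain` and `Dataset.count`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object FilterPlansSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for inspecting plans.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("filter-plans-sketch")
      .getOrCreate()

    val n = 1000L // deliberately small; the plans do not depend on the size

    // Column expression: Spark can inspect and compile the condition.
    val withColumn = spark.range(n).filter(col("id") % 2 === 0)
    // Scala lambda: the condition is an opaque function to Spark.
    val withLambda = spark.range(n).filter(x => x % 2 == 0)

    withColumn.explain() // prints the physical plan for the Column version
    withLambda.explain() // prints the physical plan for the lambda version

    // Both queries select the same rows; only the execution plan differs.
    assert(withColumn.count() == withLambda.count())

    spark.stop()
  }
}
```

Comparing the two printed plans side by side shows which operators Spark adds around the lambda-based filter.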
, . , Spark SQL . , , -.
2–10 , , !
Ousterhout, Kay, et al. "Making Sense of Performance in Data Analytics Frameworks." 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15). 2015.
databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html