1.1 Overall process
1.2 SQLContext contains the parser, the analyzer, and the optimizer. The analyzer and the optimizer each consist of a set of rules used to transform the query plan tree, implemented mainly with pattern matching. The rules are grouped by how many times they are applied (Once or FixedPoint); see 1.4 for details.
1.3 Both expressions and execution plans inherit from TreeNode; expressions define how type conversion is done (see the comment in 1.4).
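Below is a minimal, self-contained sketch of the ideas in 1.2 and 1.3: a tree with a TreeNode-style bottom-up transform, a rule written as a pattern match, and a FixedPoint driver that reapplies the rule until the tree stops changing. The Expr/Lit/Add types and all names are hypothetical stand-ins, not scraper's or Spark's actual classes:

object RuleSketch {
  sealed trait Expr {
    // Bottom-up transform, in the spirit of TreeNode.transform:
    // rewrite the children first, then try the rule on this node.
    def transform(rule: PartialFunction[Expr, Expr]): Expr = {
      val withNewChildren = this match {
        case Add(l, r) => Add(l.transform(rule), r.transform(rule))
        case leaf      => leaf
      }
      rule.applyOrElse(withNewChildren, identity[Expr])
    }
  }
  case class Lit(v: Int) extends Expr
  case class Add(l: Expr, r: Expr) extends Expr

  // A rule is just a pattern match over the tree.
  val foldConstants: PartialFunction[Expr, Expr] = {
    case Add(Lit(a), Lit(b)) => Lit(a + b)
  }

  // FixedPoint: reapply until the tree stops changing (with a safety cap);
  // an Once batch would call transform exactly one time instead.
  def toFixedPoint(e: Expr, rule: PartialFunction[Expr, Expr], cap: Int = 100): Expr = {
    val next = e.transform(rule)
    if (next == e || cap == 0) next else toFixedPoint(next, rule, cap - 1)
  }

  def main(args: Array[String]): Unit = {
    println(toFixedPoint(Add(Add(Lit(1), Lit(2)), Lit(3)), foldConstants)) // Lit(6)
  }
}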
1.4 The main contents of the analyzer and the optimizer. The material here comes mostly from Lian Cheng's scraper project (see the references below for the link).
======== parser ========
- SQL query string
- parser.parse (produces the initial logical plan)
======== analyzer ========
RuleBatch("Substitution", FixedPoint.Unlimited, Seq(
SubstituteCTEs
)),
RuleBatch("Resolution", FixedPoint.Unlimited, Seq(
ResolveRelations,
ExpandStars,
ResolveReferences,
ResolveFunctions,
ResolveAliases,
DeduplicateReferences,
ResolveSortReferences,
RewriteDistinctsAsAggregates,
GlobalAggregates,
MergeHavingConditions,
MergeSortsOverAggregates,
ResolveAggregates
)),
// 类型提升是在 TypeCheck 里面做的,是自己定义的提升规则,例如:
// private val implicitlyFromInt: PartialFunction[DataType, Any => Any] = {
// case BooleanType => asInt andThen (_ == 0)
// case LongType => asInt andThen (_.toLong)
// case FloatType => asInt andThen (_.toFloat)
// case DoubleType => asInt andThen (_.toDouble)
// case StringType => _.toString
// }
RuleBatch("Type check", Once, Seq(
TypeCheck
)),
RuleBatch("Post-analysis check", Once, Seq(
PostAnalysisCheck
))
)
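To make the Resolution batch above concrete, here is a hedged sketch of the shape of a rule like ResolveRelations: an unresolved relation leaf is pattern-matched and replaced by the plan registered in the catalog. The types and the toy catalog are hypothetical simplifications, not scraper's real API:

object ResolveSketch {
  sealed trait LogicalPlan
  case class UnresolvedRelation(name: String) extends LogicalPlan
  case class Table(name: String, columns: Seq[String]) extends LogicalPlan
  case class Filter(child: LogicalPlan, condition: String) extends LogicalPlan

  // A toy catalog mapping table names to resolved relations.
  val catalog: Map[String, LogicalPlan] = Map("t" -> Table("t", Seq("i", "s")))

  // The shape of a resolution rule: pattern-match the unresolved leaf,
  // substitute the catalog entry, and recurse into children.
  def resolveRelations(plan: LogicalPlan): LogicalPlan = plan match {
    case UnresolvedRelation(name) if catalog.contains(name) => catalog(name)
    case Filter(child, cond) => Filter(resolveRelations(child), cond)
    case other => other
  }

  def main(args: Array[String]): Unit = {
    println(resolveRelations(Filter(UnresolvedRelation("t"), "i > 1")))
    // Filter(Table(t,List(i, s)),i > 1)
  }
}

Running such rules inside a FixedPoint batch means relations nested anywhere in the plan eventually get resolved, which is why the Resolution rules above are grouped under FixedPoint.Unlimited rather than Once.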
======== optimizer ========
RuleBatch("Optimizations", FixedPoint.Unlimited, Seq(
FoldConstants,
FoldConstantFilters,
FoldLogicalPredicates,
CNFConversion,
EliminateCommonPredicates,
ReduceAliases,
ReduceCasts,
MergeFilters,
ReduceLimits,
ReduceNegations,
MergeProjects,
EliminateSubqueries,
PushFiltersThroughProjects,
PushFiltersThroughJoins,
PushFiltersThroughAggregates,
PushProjectsThroughLimits,
PushLimitsThroughUnions
))
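As one concrete instance of the Optimizations batch, here is a hedged sketch of a MergeFilters-style rewrite: two adjacent Filter nodes collapse into one whose predicate is the conjunction of both. The string predicates and plan types are hypothetical simplifications, not scraper's real representation:

object MergeFiltersSketch {
  sealed trait LogicalPlan
  case class Scan(table: String) extends LogicalPlan
  case class Filter(child: LogicalPlan, condition: String) extends LogicalPlan

  // Adjacent Filters collapse into a single Filter whose predicate is
  // the AND of both, applied recursively until no adjacent pair remains.
  def mergeFilters(plan: LogicalPlan): LogicalPlan = plan match {
    case Filter(Filter(grandchild, inner), outer) =>
      mergeFilters(Filter(grandchild, s"($inner) AND ($outer)"))
    case Filter(child, cond) => Filter(mergeFilters(child), cond)
    case leaf => leaf
  }

  def main(args: Array[String]): Unit = {
    val plan = Filter(Filter(Scan("t"), "i > 1"), "s = 'a'")
    println(mergeFilters(plan)) // Filter(Scan(t),(i > 1) AND (s = 'a'))
  }
}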
2.1 A complete end-to-end walkthrough of the SparkSQL pipeline. video: https://www.youtube.com/watch?v=iKOGBr-kOks (2 hours); slides: http://www.slideshare.net/datamantra/anatomy-of-data-frame-api; code: https://github.com/phatak-dev/anatomy_of_spark_dataframe_api
2.2 OopsOutOfMemory's Spark SQL analysis series. blog: http://blog.csdn.net/oopsoom/article/details/38257749
2.3 Lian Cheng's playground for Spark SQL / Catalyst experiments. code: https://github.com/liancheng/scraper
2.4 The relationships among Spark SQL's classes (the overall-process figure in 1.1 comes from this article). blog: http://www.trongkhoanguyen.com/2015/08/sparksql-internals.html
2.5 Miscellaneous. pr: apache/spark#2354, which visualizes the query plan
3.1 case object vs case class https://www.quora.com/Whats-the-difference-between-case-class-and-case-object-in-Scala
3.2 Precedence of infix operators http://scala-lang.org/files/archive/spec/2.11/06-expressions.html#infix-operations The Scala community likes to omit parentheses; in DSL code this can be quite hard to read, e.g.:
context lift data rename ("i", "s") where 'i =/= lit(1: Byte) + 1 select ('s, 'i)
With parentheses added, the expression above becomes the following; note that the expression after where must take the precedence rules into account (precedence is decided by the operator's first character, so =/= binds more loosely than +):
context.lift(data).rename("i", "s").where('i.=/=(lit(1: Byte).+(1))).select('s, 'i)
I feel it is easier to understand when written like this:
context.lift(data).rename("i", "s").where('i =/= lit(1: Byte) + 1).select('s, 'i)
// or
context lift data rename ("i", "s") where ('i =/= lit(1: Byte) + 1) select ('s, 'i)
3.3 The Scala type system http://lampwww.epfl.ch/~amin/dot/fool_slides.pdf
3.4 sealed trait http://stackoverflow.com/questions/11203268/what-is-a-sealed-trait
3.5 lift of PartialFunction[A, B] http://stackoverflow.com/questions/17965059/what-is-lifting-in-scala
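A two-line illustration of lift, which turns a PartialFunction[A, B] into a total A => Option[B] (this is standard-library behavior; the promotion rules quoted in 1.4 are exactly such partially defined functions):

val one: PartialFunction[Int, String] = { case 1 => "one" }
val lifted: Int => Option[String] = one.lift
lifted(1) // Some("one")
lifted(2) // None, instead of throwing a MatchError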