1.1 Overall process
1.2 SQLContext contains the parser, the analyzer, and the optimizer. The analyzer and the optimizer each consist of a set of rules used to transform the query plan tree, implemented mainly with pattern matching. The rules are grouped by how many times they are applied (Once or FixedPoint); see 1.4 for details.
1.3 Both expressions and execution plans inherit from TreeNode; expressions define how type conversion is done (see the comment in 1.4).
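Below is a minimal, self-contained sketch of the ideas in 1.2 and 1.3: a tree with a TreeNode-style bottom-up transform, a rule written as a pattern match, and a FixedPoint driver that reapplies the rule until the tree stops changing. The Expr/Lit/Add types and all names are hypothetical stand-ins, not scraper's or Spark's actual classes:

object RuleSketch {
  sealed trait Expr {
    // Bottom-up transform, in the spirit of TreeNode.transform:
    // rewrite the children first, then try the rule on this node.
    def transform(rule: PartialFunction[Expr, Expr]): Expr = {
      val withNewChildren = this match {
        case Add(l, r) => Add(l.transform(rule), r.transform(rule))
        case leaf      => leaf
      }
      rule.applyOrElse(withNewChildren, identity[Expr])
    }
  }
  case class Lit(v: Int) extends Expr
  case class Add(l: Expr, r: Expr) extends Expr

  // A rule is just a pattern match over the tree.
  val foldConstants: PartialFunction[Expr, Expr] = {
    case Add(Lit(a), Lit(b)) => Lit(a + b)
  }

  // FixedPoint: reapply until the tree stops changing (with a safety cap);
  // an Once batch would call transform exactly one time instead.
  def toFixedPoint(e: Expr, rule: PartialFunction[Expr, Expr], cap: Int = 100): Expr = {
    val next = e.transform(rule)
    if (next == e || cap == 0) next else toFixedPoint(next, rule, cap - 1)
  }

  def main(args: Array[String]): Unit = {
    println(toFixedPoint(Add(Add(Lit(1), Lit(2)), Lit(3)), foldConstants)) // Lit(6)
  }
}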
1.4 The main contents of the analyzer and the optimizer. The material here comes mostly from Lian Cheng's scraper project (see the references below for the link).
======== parser ========
- SQL query string
- parser.parse (produces the initial logical plan)
======== analyzer ========
RuleBatch("Substitution", FixedPoint.Unlimited, Seq(
SubstituteCTEs
)),
RuleBatch("Resolution", FixedPoint.Unlimited, Seq(
ResolveRelations,
ExpandStars,
ResolveReferences,
ResolveFunctions,
ResolveAliases,
DeduplicateReferences,
ResolveSortReferences,
RewriteDistinctsAsAggregates,
GlobalAggregates,
MergeHavingConditions,
MergeSortsOverAggregates,
ResolveAggregates
)),
// 类型提升是在 TypeCheck 里面做的,是自己定义的提升规则,例如:
// private val implicitlyFromInt: PartialFunction[DataType, Any => Any] = {
// case BooleanType => asInt andThen (_ == 0)
// case LongType => asInt andThen (_.toLong)
// case FloatType => asInt andThen (_.toFloat)
// case DoubleType => asInt andThen (_.toDouble)
// case StringType => _.toString
// }
RuleBatch("Type check", Once, Seq(
TypeCheck
)),
RuleBatch("Post-analysis check", Once, Seq(
PostAnalysisCheck
))
)
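To make the Resolution batch above concrete, here is a hedged sketch of the shape of a rule like ResolveRelations: an unresolved relation leaf is pattern-matched and replaced by the plan registered in the catalog. The types and the toy catalog are hypothetical simplifications, not scraper's real API:

object ResolveSketch {
  sealed trait LogicalPlan
  case class UnresolvedRelation(name: String) extends LogicalPlan
  case class Table(name: String, columns: Seq[String]) extends LogicalPlan
  case class Filter(child: LogicalPlan, condition: String) extends LogicalPlan

  // A toy catalog mapping table names to resolved relations.
  val catalog: Map[String, LogicalPlan] = Map("t" -> Table("t", Seq("i", "s")))

  // The shape of a resolution rule: pattern-match the unresolved leaf,
  // substitute the catalog entry, and recurse into children.
  def resolveRelations(plan: LogicalPlan): LogicalPlan = plan match {
    case UnresolvedRelation(name) if catalog.contains(name) => catalog(name)
    case Filter(child, cond) => Filter(resolveRelations(child), cond)
    case other => other
  }

  def main(args: Array[String]): Unit = {
    println(resolveRelations(Filter(UnresolvedRelation("t"), "i > 1")))
    // Filter(Table(t,List(i, s)),i > 1)
  }
}

Running such rules inside a FixedPoint batch means relations nested anywhere in the plan eventually get resolved, which is why the Resolution rules above are grouped under FixedPoint.Unlimited rather than Once.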
======== optimizer ========
RuleBatch("Optimizations", FixedPoint.Unlimited, Seq(
FoldConstants,
FoldConstantFilters,
FoldLogicalPredicates,
CNFConversion,
EliminateCommonPredicates,
ReduceAliases,
ReduceCasts,
MergeFilters,
ReduceLimits,
ReduceNegations,
MergeProjects,
EliminateSubqueries,
PushFiltersThroughProjects,
PushFiltersThroughJoins,
PushFiltersThroughAggregates,
PushProjectsThroughLimits,
PushLimitsThroughUnions
))
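As one concrete instance of the Optimizations batch, here is a hedged sketch of a MergeFilters-style rewrite: two adjacent Filter nodes collapse into one whose predicate is the conjunction of both. The string predicates and plan types are hypothetical simplifications, not scraper's real representation:

object MergeFiltersSketch {
  sealed trait LogicalPlan
  case class Scan(table: String) extends LogicalPlan
  case class Filter(child: LogicalPlan, condition: String) extends LogicalPlan

  // Adjacent Filters collapse into a single Filter whose predicate is
  // the AND of both, applied recursively until no adjacent pair remains.
  def mergeFilters(plan: LogicalPlan): LogicalPlan = plan match {
    case Filter(Filter(grandchild, inner), outer) =>
      mergeFilters(Filter(grandchild, s"($inner) AND ($outer)"))
    case Filter(child, cond) => Filter(mergeFilters(child), cond)
    case leaf => leaf
  }

  def main(args: Array[String]): Unit = {
    val plan = Filter(Filter(Scan("t"), "i > 1"), "s = 'a'")
    println(mergeFilters(plan)) // Filter(Scan(t),(i > 1) AND (s = 'a'))
  }
}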
2.1 A complete end-to-end walkthrough of the SparkSQL pipeline. video: https://www.youtube.com/watch?v=iKOGBr-kOks (2 hours); slides: http://www.slideshare.net/datamantra/anatomy-of-data-frame-api; code: https://github.com/phatak-dev/anatomy_of_spark_dataframe_api
2.2 OopsOutOfMemory's Spark SQL analysis series. blog: http://blog.csdn.net/oopsoom/article/details/38257749
2.3 Lian Cheng's playground for Spark SQL / Catalyst experiments. code: https://github.com/liancheng/scraper
2.4 The relationships among Spark SQL's classes (the overall-process figure in 1.1 comes from this article). blog: http://www.trongkhoanguyen.com/2015/08/sparksql-internals.html
2.5 Miscellaneous. pr: apache/spark#2354, which visualizes the query plan
3.1 case object vs case class https://www.quora.com/Whats-the-difference-between-case-class-and-case-object-in-Scala
3.2 Precedence of infix operators http://scala-lang.org/files/archive/spec/2.11/06-expressions.html#infix-operations The Scala community likes to omit parentheses; in DSL code this can be quite hard to read, e.g.:
context lift data rename ("i", "s") where 'i =/= lit(1: Byte) + 1 select ('s, 'i)
With parentheses added, the expression above becomes the following; note that the expression after where must take the precedence rules into account (precedence is decided by the operator's first character, so =/= binds more loosely than +):
context.lift(data).rename("i", "s").where('i.=/=(lit(1: Byte).+(1))).select('s, 'i)
I feel it is easier to understand when written like this:
context.lift(data).rename("i", "s").where('i =/= lit(1: Byte) + 1).select('s, 'i)
// or
context lift data rename ("i", "s") where ('i =/= lit(1: Byte) + 1) select ('s, 'i)
3.3 The Scala type system http://lampwww.epfl.ch/~amin/dot/fool_slides.pdf
3.4 sealed trait http://stackoverflow.com/questions/11203268/what-is-a-sealed-trait
3.5 lift of PartialFunction[A, B] http://stackoverflow.com/questions/17965059/what-is-lifting-in-scala
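A two-line illustration of lift, which turns a PartialFunction[A, B] into a total A => Option[B] (this is standard-library behavior; the promotion rules quoted in 1.4 are exactly such partially defined functions):

val one: PartialFunction[Int, String] = { case 1 => "one" }
val lifted: Int => Option[String] = one.lift
lifted(1) // Some("one")
lifted(2) // None, instead of throwing a MatchError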