@zzl0
Last active October 18, 2016 06:15

1. Summary

1.1 Overall process (the gist embeds a figure of the overall pipeline here, taken from the article in 2.4; the image itself is not preserved in this text)

1.2 SQLContext contains the parser, the analyzer, and the optimizer. The analyzer and the optimizer each consist of a set of rules that transform the query plan tree, implemented mainly with pattern matching. The rules are grouped into batches according to how often each batch is applied (Once or FixedPoint); see item 1.4 for details.
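The rule-plus-pattern-matching mechanism described above can be sketched as follows. The Expr types and the transform helper are simplified illustrations, not Catalyst's actual API:

```scala
// A rule is a partial function over tree nodes; transform applies it bottom-up.
sealed trait Expr
case class Lit(v: Int) extends Expr
case class Add(l: Expr, r: Expr) extends Expr

def transform(e: Expr)(rule: PartialFunction[Expr, Expr]): Expr = {
  // First rewrite the children, then try the rule on the rewritten node.
  val rewritten = e match {
    case Add(l, r) => Add(transform(l)(rule), transform(r)(rule))
    case leaf      => leaf
  }
  rule.applyOrElse(rewritten, identity[Expr])
}

// In the spirit of FoldConstants: collapse an Add of two literals.
val foldConstants: PartialFunction[Expr, Expr] = {
  case Add(Lit(a), Lit(b)) => Lit(a + b)
}

val plan = Add(Add(Lit(1), Lit(2)), Lit(3))
val folded = transform(plan)(foldConstants)  // Lit(6)
```

Because the rule is a PartialFunction, each rule only has to mention the node shapes it cares about; all other nodes pass through unchanged.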

1.3 Both expressions and query plans extend TreeNode; expressions define how type conversions are performed (see the comment in item 1.4).

1.4 The main rule batches of the analyzer and the optimizer. The content below comes from Lian Cheng (连城)'s scraper project (link in the references below).

======== parser ========

- sql query
     - parser.parse

======== analyzer ========

    RuleBatch("Substitution", FixedPoint.Unlimited, Seq(
      SubstituteCTEs
    )),

    RuleBatch("Resolution", FixedPoint.Unlimited, Seq(
      ResolveRelations,
      ExpandStars,
      ResolveReferences,
      ResolveFunctions,
      ResolveAliases,
      DeduplicateReferences,

      ResolveSortReferences,
      RewriteDistinctsAsAggregates,
      GlobalAggregates,
      MergeHavingConditions,
      MergeSortsOverAggregates,
      ResolveAggregates
    )),

    // Type promotion is done in TypeCheck, using custom promotion rules, e.g.:
    // private val implicitlyFromInt: PartialFunction[DataType, Any => Any] = {
    //     case BooleanType => asInt andThen (_ == 0)
    //     case LongType    => asInt andThen (_.toLong)
    //     case FloatType   => asInt andThen (_.toFloat)
    //     case DoubleType  => asInt andThen (_.toDouble)
    //     case StringType  => _.toString
    //   }
    RuleBatch("Type check", Once, Seq(
      TypeCheck
    )),

    RuleBatch("Post-analysis check", Once, Seq(
      PostAnalysisCheck
    ))
  )
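The implicitlyFromInt snippet quoted in the comment above can be made runnable in isolation. The DataType objects and the asInt helper below are simplified stand-ins for scraper's actual definitions:

```scala
// Simplified stand-ins for scraper's data types (illustrative only).
sealed trait DataType
case object LongType extends DataType
case object DoubleType extends DataType
case object StringType extends DataType

// Downcast helper in the spirit of scraper's asInt.
val asInt: Any => Int = { case i: Int => i }

// For each target type, a function that promotes an Int value to it.
val implicitlyFromInt: PartialFunction[DataType, Any => Any] = {
  case LongType   => asInt andThen (_.toLong)
  case DoubleType => asInt andThen (_.toDouble)
  case StringType => _.toString
}

val promoted = implicitlyFromInt(LongType)(42)  // 42L
```

Looking up a target type yields the conversion function itself, so the type checker can first decide whether a promotion exists (isDefinedAt) and only then apply it to the value.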

======== optimizer ========

    RuleBatch("Optimizations", FixedPoint.Unlimited, Seq(
      FoldConstants,
      FoldConstantFilters,
      FoldLogicalPredicates,

      CNFConversion,
      EliminateCommonPredicates,

      ReduceAliases,
      ReduceCasts,
      MergeFilters,
      ReduceLimits,
      ReduceNegations,
      MergeProjects,
      EliminateSubqueries,

      PushFiltersThroughProjects,
      PushFiltersThroughJoins,
      PushFiltersThroughAggregates,
      PushProjectsThroughLimits,
      PushLimitsThroughUnions
    ))
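How a FixedPoint batch drives such rules to convergence can be sketched like this. It is a minimal illustration, not scraper's actual rule executor; once performs a single shallow rewrite per node, so several iterations are needed before the plan stops changing:

```scala
// Tiny predicate tree, in the spirit of ReduceNegations.
sealed trait Pred
case class Not(p: Pred) extends Pred
case class Leaf(name: String) extends Pred

// One pass: rewrite Not(Not(x)) => x once per node, top-down.
def once(p: Pred): Pred = p match {
  case Not(Not(inner)) => inner
  case Not(inner)      => Not(once(inner))
  case leaf            => leaf
}

// FixedPoint.Unlimited: re-run the batch until the plan no longer changes.
def toFixedPoint(p: Pred): Pred = {
  val next = once(p)
  if (next == p) p else toFixedPoint(next)
}

val reduced = toFixedPoint(Not(Not(Not(Not(Leaf("a"))))))  // Leaf("a")
```

A Once batch (like "Type check" above) runs its rules a single time; a FixedPoint batch loops like toFixedPoint because one rule's output can enable another rule to fire.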

2. Spark SQL references

2.1 A complete walkthrough of how Spark SQL works (2 hours). video: https://www.youtube.com/watch?v=iKOGBr-kOks slide: http://www.slideshare.net/datamantra/anatomy-of-data-frame-api code: https://github.com/phatak-dev/anatomy_of_spark_dataframe_api

2.2 OopsOutOfMemory's Spark SQL source-code analysis series. blog: http://blog.csdn.net/oopsoom/article/details/38257749

2.3 Lian Cheng (连城)'s Spark SQL / Catalyst playground. code: https://github.com/liancheng/scraper

2.4 The relationships among Spark SQL's classes (the figure above comes from this article). blog: http://www.trongkhoanguyen.com/2015/08/sparksql-internals.html

2.5 Other: pr apache/spark#2354, which visualizes the query plan.

3. Some issues encountered while reading the Scala code

3.1 case object vs case class https://www.quora.com/Whats-the-difference-between-case-class-and-case-object-in-Scala
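A minimal sketch of the difference, with names chosen in the spirit of the Once/FixedPoint strategies used in the rule batches above (illustrative, not scraper's definitions):

```scala
// A case object is a single parameterless instance; a case class has
// constructor parameters and produces one instance per set of arguments.
sealed trait Strategy
case object Once extends Strategy                      // singleton
case class FixedPoint(maxIters: Int) extends Strategy  // value-carrying

val a = FixedPoint(100)
val b = FixedPoint(100)
val sameValue = a == b         // true: case classes compare structurally
val sameSingleton = Once == Once  // true: the same singleton instance
```

Both forms work in pattern matching; the case object fits naturally where there is nothing to parameterize, as with Once above.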

3.2 Precedence of infix operators http://scala-lang.org/files/archive/spec/2.11/06-expressions.html#infix-operations The Scala community likes omitting parentheses, which can make DSL code hard to read at first, e.g.:

context lift data rename ("i", "s") where 'i =/= lit(1: Byte) + 1 select ('s, 'i)

With parentheses added, the expression above becomes the following; note that the expression after where has to respect operator precedence:

context.lift(data).rename("i", "s").where('i.=/=(lit(1: Byte).+(1))).select('s, 'i)

I find it easier to understand when written like this:

context.lift(data).rename("i", "s").where('i =/= lit(1: Byte) + 1).select('s, 'i)
// or
context lift data rename ("i", "s") where ('i =/= lit(1: Byte) + 1) select ('s, 'i)
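The reason the parenthesization comes out this way is that Scala derives an infix operator's precedence from its first character, and + binds tighter than operators starting with =. A small self-contained sketch (the V class and its operators are made up for illustration):

```scala
// '+' has higher precedence than '=/=' (first char '='), so
// V(1) =/= V(1) + V(1) parses as V(1) =/= (V(1) + V(1)).
case class V(n: Int) {
  def +(that: V): V = V(n + that.n)
  def =/=(that: V): Boolean = n != that.n
}

val r = V(1) =/= V(1) + V(1)  // V(1) =/= V(2), i.e. true
```

This is why 'i =/= lit(1: Byte) + 1 needs no parentheses around the addition: the + on the right is applied before =/= is.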

3.3 Scala's type system http://lampwww.epfl.ch/~amin/dot/fool_slides.pdf

3.4 sealed trait http://stackoverflow.com/questions/11203268/what-is-a-sealed-trait

3.5 lift of PartialFunction[A, B] http://stackoverflow.com/questions/17965059/what-is-lifting-in-scala
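A short illustration of lift (the partial function here is made up): it converts a PartialFunction[A, B] into a total function A => Option[B], returning None wherever the partial function is undefined instead of throwing MatchError.

```scala
// Defined only for positive numbers.
val positiveName: PartialFunction[Int, String] = {
  case n if n > 0 => "positive"
}

val lifted: Int => Option[String] = positiveName.lift

val some = lifted(3)   // Some("positive")
val none = lifted(-1)  // None
```

This is handy for rule tables like implicitlyFromInt above: lifting them lets a lookup miss surface as None rather than an exception.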
