Last active
May 29, 2018 08:02
-
-
Save thanoojgithub/f87cd7d30a5dfca41da0 to your computer and use it in GitHub Desktop.
ORDER BY and SORT BY in Hive
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| ORDER BY and SORT BY | |
| It performs a total ordering of the query result set. | |
| This means that all the data is passed through a single reducer, which may take an unacceptably long time to execute for larger data sets. | |
| Hive adds an alternative, SORT BY, that orders the data only within each reducer, thereby performing a local ordering, where each reducer’s output will be sorted. Better performance is traded for total ordering. | |
| In both cases, the syntax differs only by the use of the ORDER or SORT keyword. You can specify any columns you wish and specify whether or not the columns are ascending using the ASC keyword (the default) or descending using the DESC keyword. | |
| Here is an example using ORDER BY: | |
| hive> SELECT s.ymd, s.symbol, s.price_close | |
| >FROM stocks s | |
| >ORDER BY s.ymd ASC, s.symbol DESC; | |
| Here is the same example using SORT BY instead: | |
| hive>SELECT s.ymd, s.symbol, s.price_close | |
| >FROM stocks s | |
| >SORT BY s.ymd ASC, s.symbol DESC; | |
| The two queries look almost identical, but if more than one reducer is invoked, the output will be sorted differently. While each reducer’s output files will be sorted, the data will probably overlap with the output of other reducers. | |
| DISTRIBUTE BY with SORT BY | |
| DISTRIBUTE BY controls how map output is divided among reducers. | |
| All data that flows through a MapReduce job is organized into key-value pairs. Hive must use this feature internally when it converts your queries to MapReduce jobs. | |
| By default, MapReduce computes a hash on the keys output by mappers and tries to evenly distribute the key-value pairs among the available reducers using the hash values. | |
| hive> SELECT s.ymd, s.symbol, s.price_close | |
| > FROM stocks s | |
| > DISTRIBUTE BY s.symbol | |
| > SORT BY s.symbol ASC, s.ymd ASC; | |
| DISTRIBUTE BY works similar to GROUP BY in the sense that it controls how reducers receive rows for processing, while SORT BY controls the sorting of data inside the reducer. | |
| Note that Hive requires that the DISTRIBUTE BY clause come before the SORT BY clause. |
Author
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
CLUSTERED BY is a shortcut for both DISTRIBUTE BY and SORT BY.