Skip to content

Instantly share code, notes, and snippets.

@inkrement
Last active May 2, 2023 12:18
Show Gist options
  • Save inkrement/c456079501dc2fa5821dc93aee02562a to your computer and use it in GitHub Desktop.
Save inkrement/c456079501dc2fa5821dc93aee02562a to your computer and use it in GitHub Desktop.

SQL Introduction (DuckDB)

Expressions: Values, Operators and Functions

Expressions are the elementar building-blocks of SQL-Queries; most SQL-Clauses (e.g., SELECT, WHERE) use them as arguments. An expression is either a value (e.g., 3, 2.3, 'hello world' or column_name), an operator (e.g., +, -, /), or a function (e.g., my_function()).

Some operations, and most functions, work only on specific data types. For example, there are a number of operators and functions specifically for dates and strings.

Data types

Although SQL is primarily designed for processing data, the language only a few data types. Depending on the dialect, they are mostly integers (e.g. 42), floating point numbers (e.g. 3.1242), truth values (0 or 1), and strings (e.g. "Hello World"). Our database, DuckDB, also supports types for date and time. The data type specifies how operations or operators are to be interpreted. Thus, the division on integers is defined in a different way than for floating point numbers (3/3 returns a different result than 3.0/3).

Some Database management systems also support a special data type to encode null-values (i.e., missing values).

Semantics

While we usually do not follow too strict standards in our use of language, the computer usually reacts very sensitively to details. For example, 2 and 2. are two different values. The first is an integer and the second is a floating point number.

Another example would be quotation marks. While we often switch between single, double, French, or German quotation marks depending on our mood, they have very different meanings. Since SQL is based on the English language, single or double (but not French or German) are used to mark up strings. A similar design decision can be seen with the comma characters, where the period is used as a separator rather than the single comma.

Examples

Let's start with some examples to show the different types of expressions.

  1. An expression could be a value such as 'happy weekend' (i.e., string)
SELECT 'happy weekend'
  1. An expression can also be a more "complex" combination of other expressions, or the application of operators such as multiplications:
SELECT 5*64.34
  1. Another possibility is functions that transform the input (arguments) and yield an output (result). In this example, we call a function named log with the input 45. The result is calculated and replaces the expression such that the SELECT clause returns it. Most DBMS provide a large set of pre-defined functions, and some (e.g., BigQuery, Clickhouse) allow one to define own functions (user-defined functions). However, in the beginning, knowing how to use them is essential. A function call always starts with its name, followed by parentheses. The input of the functions is called arguments and is listed within the brackets. The number and type of arguments depend on the function at hand. For example, mathematical functions work on numeric data types, and String functions work on strings.
SELECT log(45)

WHERE clause

A thousand roads lead to Rome. So, as so often, there are umpteen ways to get the same result. The following select queries all lead to the same result.

SELECT followers_count FROM youtube_tweets WHERE followers_count % 2 = 1 LIMIT 10;
SELECT followers_count FROM youtube_tweets WHERE followers_count % 2 != 0 LIMIT 10;
SELECT followers_count FROM youtube_tweets WHERE followers_count % 2 > 0 LIMIT 10;
SELECT
 youtube_url
FROM youtube_tweets
WHERE youtube_url = 'https://www.youtube.com/watch?v=sSgD19URsyU&feature=youtu.be'
LIMIT 10;

Like

You can very quickly compare expressions directly with the equal sign. However, the computer takes the task very seriously here and returns true only if the two expressions are identical. For example, a different decimal point for numbers or a different case for upper and lower case letters does not lead to a match. With numbers you can avoid the problem by rounding to the desired decimal place, but with text you use the like operator to match patterns. Like compares the expressions on both sides. If you don't want the method to be sensitive to case, you can use the iLike operator instead. If only parts are to be matched, % can be included as placeholders.

https://duckdb.org/docs/sql/functions/patternmatching

select username from youtube_tweets where username like 'martin' limit 10;
select username from youtube_tweets where username like '%martin%' limit 10;
select username from youtube_tweets where username like 'martin%' limit 10;
select username from youtube_tweets where username ilike 'martin%' limit 10;
  • How many distinct usernames contain your first name?

I'll showcase it with my name (i.e., Christian)

SELECT
 count(distinct username)
FROM youtube_tweets
WHERE username ilike '%christian%'
LIMIT 10;
  • What is AND/OR?

Both are logical operators

Group by

  • get the top 10 users based on the number of their postings (two columns, first the name, second the number of postings)
SELECT 
 username, count(*) as postings 
FROM youtube_tweets
GROUP BY username
ORDER BY postings DESC
  • Return two rows; one for Justin Bieber and the other for U2 (column name should be artist) and the respective number of tweets for these two in a separate column named num_tweets:
SELECT
  if(text ilike '%justin bieber%', 'Justin B.', 'U2') as artist, 
  count(text) as num_tweets 
FROM youtube_tweets
WHERE text ilike '%justin bieber%' OR text like '% U2 %' 
GROUP BY artist 
LIMIT 10;
  • How many tweets are written on average on the 24th and on other days?
SELECT
 if(day = '2022-12-24', '24th', 'other') as Type,
 avg(cnt) as avg_tweets_per_day 
FROM (
  SELECT 
    created_at::DATE as day, 
    count(*) as cnt 
   FROM youtube_tweets 
   GROUP BY day
) 
GROUP BY Type;
  • What is the average number of verified authors/day (in percent & rounded)
SELECT 
  weekday, 
  round(avg(verified::INT)*100, 2) as avg_verified_authors 
FROM (
 SELECT 
   author_id, dayname(created_at::DATE) as weekday, verified
 FROM youtube_tweets
 GROUP BY author_id, weekday, verified
) 
GROUP BY weekday;
  • volume of tweets, number of likes and average likes per tweet for each saturday
select
 created_at::DATE as day, count(*) as volume, sum(like_count) as likes, likes::FLOAT/volume as avg_likes_per_tweet 
FROM youtube_tweets WHERE dayname(day) = 'Saturday'
GROUP BY day
LIMIT 10;
  • ex. pewdiepie like: on how many distinct days did the authors tweets; sorted by the count.
SELECT
 username, 
 COUNT(distinct created_at::DATE) as cnt 
FROM youtube_tweets
GROUP BY username
ORDER BY cnt DESC
LIMIT 100;
SELECT * 
FROM youtube_tweets 
ORDER BY like_count - retweet_count 
LIMIT 10;
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment