Advanced SQL Techniques

Notes on advanced SQL techniques on Mode.


SQL Data types

Data may look numeric but contain commas or currency symbols, in which case it gets stored in the database as a non-numeric type. Functions like COUNT and SUM only work on numeric columns, and date functions only work on columns that are explicitly date/time types, so watch out for this.
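
As a minimal sketch (the table raw_funding and its text column amount_text are hypothetical), you could strip the symbols before casting:

-- strip the currency symbol and thousands separators, then cast to numeric
SELECT amount_text,
REPLACE(REPLACE(amount_text, '$', ''), ',', '')::numeric AS amount_numeric
FROM raw_funding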

Changing a column’s data type

CAST(column_name AS integer), column_name::integer, or CONVERT(integer, column_name) can all do the trick (CONVERT is SQL Server syntax; the first two work in Postgres).

Example code:

SELECT CAST(funding_total_usd AS varchar) funding_total_usd, 
founded_at_clean::varchar founded_at_clean
FROM tutorial.crunchbase_companies_clean_date

SQL Date Format

Dates stored as YYYY-MM-DD sort correctly; MM-DD-YYYY causes sorting problems.
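
A quick illustration of the problem (the date strings here are made up): MM-DD-YYYY strings sort by month first, so the later date comes out first.

-- lexicographic sort puts '01-15-2014' before '12-01-2013' even though it is the later date
SELECT mdy_date
FROM (VALUES ('12-01-2013'), ('01-15-2014')) dates(mdy_date)
ORDER BY mdy_date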

When doing arithmetic on date/time values: date - date = interval, and date - interval = date.

SELECT companies.permalink,
companies.founded_at_clean,
acquisitions.acquired_at_cleaned,
acquisitions.acquired_at_cleaned - companies.founded_at_clean::timestamp AS time_to_acquisition
FROM tutorial.crunchbase_companies_clean_date companies
JOIN tutorial.crunchbase_acquisitions_clean_date acquisitions
ON acquisitions.company_permalink = companies.permalink
WHERE founded_at_clean IS NOT NULL

You can also use the INTERVAL function directly. The code below adds one week:

SELECT companies.permalink,
companies.founded_at_clean,
companies.founded_at_clean::timestamp +
INTERVAL '1 week' AS plus_one_week
FROM tutorial.crunchbase_companies_clean_date companies
WHERE founded_at_clean IS NOT NULL

If you need the current time, use NOW():

SELECT companies.permalink,
companies.founded_at_clean,
NOW() - companies.founded_at_clean::timestamp AS founded_time_ago
FROM tutorial.crunchbase_companies_clean_date companies
WHERE founded_at_clean IS NOT NULL

Practice question:

--- Write a query that counts the number of companies acquired within 3 years,
--- 5 years, and 10 years of being founded (in 3 separate columns). Include a
--- column for total companies acquired as well. Group by category and limit
--- to only rows with a founding date.

SELECT companies.category_code,
COUNT(companies.permalink) AS total,
COUNT(CASE
WHEN acquisitions.acquired_at_cleaned <= companies.founded_at_clean::timestamp + INTERVAL '3 years' THEN 1
ELSE NULL
END) AS Y3,
COUNT(CASE
WHEN acquisitions.acquired_at_cleaned <= companies.founded_at_clean::timestamp + INTERVAL '5 years' THEN 1
ELSE NULL
END) AS Y5,
COUNT(CASE
WHEN acquisitions.acquired_at_cleaned <= companies.founded_at_clean::timestamp + INTERVAL '10 years' THEN 1
ELSE NULL
END) AS Y10
FROM tutorial.crunchbase_companies_clean_date AS companies
JOIN tutorial.crunchbase_acquisitions_clean_date AS acquisitions
ON acquisitions.company_permalink = companies.permalink
WHERE founded_at_clean IS NOT NULL
GROUP BY 1
ORDER BY 2 DESC


Data Wrangling with SQL

Data munging or data wrangling is loosely the process of manually converting or mapping data from one “raw” form into another format that allows for more convenient consumption of the data with the help of semi-automated tools.

Using SQL String Functions to Clean Data

Cleaning Strings

LEFT, RIGHT and LENGTH
LEFT(string, n): returns the first n characters of the string, counting from the left
RIGHT(string, n): returns the last n characters of the string, counting from the right

sample code: in this dataset the date column holds the date followed by an automatically appended time stamp. Because every cell has the same length, fixed character counts from the left and right are enough to split the date and time apart.

SELECT incidnt_num,
date,
LEFT(date, 10) AS cleaned_date,
RIGHT(date, 17) AS cleaned_time
FROM tutorial.sf_crime_incidents_2014_01

LENGTH(string): returns the length of the string

sample code: if you are not sure every cell has the same length but you do know the fixed length of one side, you can subtract with LENGTH():

SELECT incidnt_num,
date,
LEFT(date, 10) AS cleaned_date,
RIGHT(date, LENGTH(date) - 11) AS cleaned_time
FROM tutorial.sf_crime_incidents_2014_01

TRIM
TRIM([position] 'characters to be trimmed' FROM col_name) removes the given characters from the start and/or end of a string
position: leading, trailing, or both
characters to be trimmed: e.g. '()' or '/'

sample code: strip the parentheses around location

SELECT location,
TRIM(both '()' FROM location),
TRIM(leading '(' FROM location),
TRIM(trailing ')' FROM location)
FROM tutorial.sf_crime_incidents_2014_01

POSITION and STRPOS
POSITION('A' IN B): returns the position of the first occurrence of 'A' in B, counting from the left
STRPOS(B, 'A'): same as above

Note: both functions are case sensitive. To match regardless of case, first convert the string to all upper or lower case with UPPER/LOWER and then search.

sample code:

SELECT incidnt_num,
descript,
POSITION('A' IN descript) AS a_position,
STRPOS(descript, 'A') AS a_position_s
FROM tutorial.sf_crime_incidents_2014_01
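
The sample above only matches a capital 'A'. A sketch of the case-insensitive version of the same lookup:

SELECT incidnt_num,
descript,
POSITION('a' IN LOWER(descript)) AS a_position_any_case
FROM tutorial.sf_crime_incidents_2014_01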

SUBSTR
SUBSTR(string, p, n): returns n characters of the string, starting at position p

SELECT incidnt_num,
date,
SUBSTR(date, 4, 2) AS day
FROM tutorial.sf_crime_incidents_2014_01

Practice question: split the location column, formatted as (lat, lon), into two columns: lat and lon

SELECT location,
lat,
lon,
SUBSTR(location,
POSITION('(' IN location) + 1,
POSITION(',' IN location) - POSITION('(' IN location) - 1) AS LAT1,
SUBSTR(location,
POSITION(',' IN location) + 1,
POSITION(')' IN location) - POSITION(',' IN location) - 1) AS LON1
FROM tutorial.sf_crime_incidents_2014_01

Method 2:

SELECT location,
TRIM(leading '(' FROM LEFT(location, POSITION(',' IN location) - 1)) AS latitude,
TRIM(trailing ')' FROM RIGHT(location, LENGTH(location) - POSITION(',' IN location) ) ) AS longitude
FROM tutorial.sf_crime_incidents_2014_01

CONCAT
CONCAT(a, ',', b): concatenates the two columns a and b into 'a,b'
a || ',' || b: same as above

sample code:

SELECT incidnt_num,
day_of_week,
date,
LEFT(date, 10) AS cleaned_date,
CONCAT(day_of_week, ', ', LEFT(date, 10)) AS day_and_date
FROM tutorial.sf_crime_incidents_2014_01

Practice question: Concatenate the lat and lon fields to form a field that is equivalent to the location field.

SELECT lat,
lon,
location,
CONCAT('(', lat, ', ', lon, ')') AS loc
FROM tutorial.sf_crime_incidents_2014_01

or

SELECT lat,
lon,
location,
'(' || lat || ', ' || lon || ')' AS loc
FROM tutorial.sf_crime_incidents_2014_01

Practice question 2: Write a query that creates a date column formatted YYYY-MM-DD.

SELECT date,
SUBSTR(date, 7, 4) || '-' || LEFT(date, 2) || '-' || SUBSTR(date, 4, 2) AS new_date
FROM tutorial.sf_crime_incidents_2014_01

UPPER and LOWER
UPPER(string): converts the string to all upper case
LOWER(string): converts the string to all lower case

Practice question: Write a query that returns the category field, but with the first letter capitalized and the rest of the letters in lower-case.

SELECT category,
UPPER(LEFT(category, 1)) || LOWER(RIGHT(category, LENGTH(category) - 1)) AS category_cleaned
FROM tutorial.sf_crime_incidents_2014_01

Turning Strings into Dates

Strings can be CAST (or ::) to date, or to timestamp (which also carries hours, minutes, and seconds).

sample code:

SELECT incidnt_num,
date,
(SUBSTR(date, 7, 4) || '-' || LEFT(date, 2) ||
'-' || SUBSTR(date, 4, 2))::date AS cleaned_date
FROM tutorial.sf_crime_incidents_2014_01

Practice question: Write a query that creates an accurate timestamp using the date and time columns in tutorial.sf_crime_incidents_2014_01. Include a field that is exactly 1 week later as well.

SELECT date,
time,
(SUBSTR(date, 7, 4) || '-' || LEFT(date, 2) || '-' || SUBSTR(date, 4, 2) || ' ' || time) ::timestamp AS acc_timestamp,
(SUBSTR(date, 7, 4) || '-' || LEFT(date, 2) || '-' || SUBSTR(date, 4, 2) || ' ' || time) ::timestamp + INTERVAL '1 week' AS timestamp_1w
FROM tutorial.sf_crime_incidents_2014_01

Turning Dates into More Useful Dates

EXTRACT pulls the part you need out of a date/timestamp value.
sample code:

SELECT cleaned_date,
EXTRACT('year' FROM cleaned_date) AS year,
EXTRACT('month' FROM cleaned_date) AS month,
EXTRACT('day' FROM cleaned_date) AS day,
EXTRACT('hour' FROM cleaned_date) AS hour,
EXTRACT('minute' FROM cleaned_date) AS minute,
EXTRACT('second' FROM cleaned_date) AS second,
EXTRACT('decade' FROM cleaned_date) AS decade, /* the year divided by 10, i.e. the first three digits of a four-digit year */
EXTRACT('dow' FROM cleaned_date) AS day_of_week
FROM tutorial.sf_crime_incidents_cleandate

DATE_TRUNC rounds a date/timestamp down to the precision you specify: if you only need day precision, the time becomes 00:00:00; if you only need the year, the month and day become 01-01.
sample code:

SELECT cleaned_date,
DATE_TRUNC('year' , cleaned_date) AS year,
DATE_TRUNC('month' , cleaned_date) AS month,
DATE_TRUNC('week' , cleaned_date) AS week, /* the Monday of that week */
DATE_TRUNC('day' , cleaned_date) AS day,
DATE_TRUNC('hour' , cleaned_date) AS hour,
DATE_TRUNC('minute' , cleaned_date) AS minute,
DATE_TRUNC('second' , cleaned_date) AS second,
DATE_TRUNC('decade' , cleaned_date) AS decade
FROM tutorial.sf_crime_incidents_cleandate

practice question: Write a query that counts the number of incidents reported by week. Cast the week as a date to get rid of the hours/minutes/seconds.
Note that no subquery is needed here; the first column can be used directly as the GROUP BY condition.

SELECT DATE_TRUNC('week', cleaned_date) :: date AS week,
COUNT(incidnt_num) as cnt
FROM tutorial.sf_crime_incidents_cleandate
GROUP BY 1
ORDER BY 1

Extract Current Time

If you need today's date or the current timestamp, call the corresponding function directly:

SELECT CURRENT_DATE AS date,
CURRENT_TIME AS time,
CURRENT_TIMESTAMP AS timestamp,
LOCALTIME AS localtime,
LOCALTIMESTAMP AS localtimestamp,
NOW() AS now /*timestamp*/

To convert to another time zone:
a complete list of time zones

SELECT CURRENT_TIME AS time,
CURRENT_TIME AT TIME ZONE 'PST' AS time_pst

practice question: Write a query that shows exactly how long ago each incident was reported. Assume that the dataset is in Pacific Standard Time (UTC - 8).

SELECT incidnt_num,
cleaned_date,
NOW() AT TIME ZONE 'PST' AS now,
NOW() AT TIME ZONE 'PST' - cleaned_date AS time_ago
FROM tutorial.sf_crime_incidents_cleandate

COALESCE

COALESCE(expression1, expression2, ...): returns the first non-null value. For example, COALESCE(NULL, 1, 2) returns 1.

To replace every null with another value, use COALESCE(col, 'string'): non-null values stay as they are, and nulls are replaced by 'string'.
sample code: replace every null in descript with 'No Description'

SELECT incidnt_num,
descript,
COALESCE(descript, 'No Description')
FROM tutorial.sf_crime_incidents_cleandate
ORDER BY descript DESC


Writing Subqueries in SQL

Using Subqueries to Aggregate in Multiple Stages

How many incidents happen, on average, on every day of week in each month?
Approach: for a given month and day of week, the average is taken over that month's daily counts for that day of week (roughly four of them). So first count the incidents on each individual date (grouped by day of week and date), then group by month and day of week and average those counts.

SELECT LEFT(date, 2) AS month,
a.dow,
AVG(a.cnt) AS avg
FROM (
SELECT day_of_week AS dow, date, COUNT(incidnt_num) as cnt
FROM tutorial.sf_crime_incidents_2014_01
GROUP BY 1, 2
ORDER BY 1, 2) a
GROUP BY 1, 2
ORDER BY 1, 2

Practice question: Write a query that displays the average number of monthly incidents for each category.
That is, for each category, average the 12 monthly totals.

SELECT a.category, 
AVG(a.cnt) AS average
FROM (
SELECT EXTRACT('MONTH' FROM cleaned_date) AS month,
category,
COUNT(incidnt_num) AS cnt
FROM tutorial.sf_crime_incidents_cleandate
GROUP BY 1, 2
) a
GROUP BY 1

Subqueries in Conditional Logic

Subqueries can also be used in conditional logic: in WHERE, JOIN/ON, or CASE. Note that a subquery used this way does not get an alias, because it is treated as a set of values rather than as a table.

sample code: WHERE = subquery

SELECT *
FROM tutorial.sf_crime_incidents_2014_01
WHERE date = (SELECT MIN(date)
FROM tutorial.sf_crime_incidents_2014_01
)

sample code: WHERE IN subquery

SELECT *
FROM tutorial.sf_crime_incidents_2014_01
WHERE date IN (SELECT date
FROM tutorial.sf_crime_incidents_2014_01
ORDER BY date
LIMIT 5
)

Joining Subqueries

You can also filter with a join. The code below returns the same result as the query above:

SELECT *
FROM tutorial.sf_crime_incidents_2014_01 incidents
JOIN (
SELECT date
FROM tutorial.sf_crime_incidents_2014_01
ORDER BY date LIMIT 5
) sub
ON incidents.date = sub.date

Another nice advantage of joining a subquery is that you can join a GROUP BY aggregation back onto the original table, so that every row carries an aggregated value.

SELECT incidents.*,
sub.incidents AS incidents_that_day
FROM tutorial.sf_crime_incidents_2014_01 incidents
JOIN (
SELECT date,
COUNT(incidnt_num) AS incidents
FROM tutorial.sf_crime_incidents_2014_01
GROUP BY 1
) sub
ON incidents.date = sub.date
ORDER BY sub.incidents DESC, time

If you’d like to aggregate the numbers of all of the companies receiving investment and companies acquired each month:

SELECT COALESCE(acquisitions.month, investments.month) AS month,
-- because this is a FULL JOIN, some months may exist on only one side; COALESCE fills the null months from the other side, effectively taking the union of the months from the two subqueries
acquisitions.companies_acquired,
investments.companies_rec_investment
FROM (
SELECT acquired_month AS month,
COUNT(DISTINCT company_permalink) AS companies_acquired
FROM tutorial.crunchbase_acquisitions
GROUP BY 1 ) acquisitions
FULL JOIN (
SELECT funded_month AS month,
COUNT(DISTINCT company_permalink) AS companies_rec_investment
FROM tutorial.crunchbase_investments
GROUP BY 1 )investments
ON acquisitions.month = investments.month
ORDER BY 1 DESC

Practice question: Write a query that counts the number of companies founded and acquired by quarter starting in Q1 2012. Create the aggregations in two separate queries, then join them.

SELECT COALESCE(companies.founded_quarter, acquisitions.acquired_quarter) AS quarter,
companies.num_founded,
acquisitions.num_acquired
FROM (
SELECT founded_quarter,
COUNT(DISTINCT permalink) AS num_founded
FROM tutorial.crunchbase_companies
WHERE founded_year >= 2012
GROUP BY 1) companies
FULL JOIN (
SELECT acquired_quarter,
COUNT(DISTINCT company_permalink) AS num_acquired
FROM tutorial.crunchbase_acquisitions
WHERE acquired_year >= 2012
GROUP BY 1) acquisitions
ON companies.founded_quarter = acquisitions.acquired_quarter
ORDER BY 1

Subqueries and UNIONs

UNION ALL inside a subquery can combine data that has been split across several tables:

SELECT COUNT(*) AS total_rows
FROM (
SELECT *
FROM tutorial.crunchbase_investments_part1

UNION ALL

SELECT *
FROM tutorial.crunchbase_investments_part2
) sub

Practice question: Write a query that ranks investors from the combined dataset above by the total number of investments they have made.

SELECT sub.investor_name,
COUNT(sub.company_permalink) as num_investments
FROM (
SELECT *
FROM tutorial.crunchbase_investments_part1

UNION ALL

SELECT *
FROM tutorial.crunchbase_investments_part2
) sub
GROUP BY 1
ORDER BY 2 DESC

Practice question 2: Write a query that does the same thing as in the previous problem, except only for companies that are still operating. Hint: operating status is in tutorial.crunchbase_companies.

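The answer query itself did not survive in these notes. A sketch, assuming tutorial.crunchbase_companies has a status column whose value is 'operating' for companies that are still operating:

SELECT sub.investor_name,
COUNT(sub.company_permalink) AS num_investments
FROM (
SELECT *
FROM tutorial.crunchbase_investments_part1

UNION ALL

SELECT *
FROM tutorial.crunchbase_investments_part2
) sub
JOIN tutorial.crunchbase_companies companies
ON companies.permalink = sub.company_permalink
WHERE companies.status = 'operating'   -- assumed column/value; keep only still-operating companies
GROUP BY 1
ORDER BY 2 DESC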

Window Functions

Basic windowing syntax
SELECT duration_seconds,
SUM(duration_seconds) OVER (ORDER BY start_time) AS running_total
FROM tutorial.dc_bikeshare_q1_2012

This computes a running total, i.e. the cumulative sum up to the current row. The cumulative behavior comes from the ORDER BY; the case with only PARTITION BY and no ORDER BY is covered below.

If you need to go further and compute a running total within each group, resetting whenever a new group starts, you need PARTITION BY:

SELECT start_terminal,
duration_seconds,
SUM(duration_seconds) OVER
(PARTITION BY start_terminal ORDER BY start_time)
AS running_total
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'

Here PARTITION BY first groups the rows by start_terminal, and then within each group the running sum is computed in start_time order.

SELECT start_terminal,
duration_seconds,
SUM(duration_seconds) OVER
(PARTITION BY start_terminal)
AS running_total
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'

Here the ORDER BY has been removed and only the PARTITION BY remains, so the query returns the sum within each partition. Conceptually this is the same as GROUP BY, but a window function lets every row carry that sum without collapsing the rows into aggregates.

Note: you can't put a window function inside a GROUP BY clause; more generally, window functions and standard aggregations can't be mixed at the same level of a query.
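
If you do need to aggregate over a windowed value, a common pattern (a sketch on the same table) is to compute the window function in a subquery and do the GROUP BY in the outer query:

-- the maximum running total per terminal equals that terminal's total duration
SELECT start_terminal,
MAX(running_total) AS terminal_total
FROM (
SELECT start_terminal,
SUM(duration_seconds) OVER
(PARTITION BY start_terminal ORDER BY start_time) AS running_total
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
) sub
GROUP BY 1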

You can also compute each row's duration_seconds as a percentage of its partition's total:

--- Write a query modification of the above example query that 
--- shows the duration of each ride as a percentage
--- of the total time accrued by riders from each start_terminal
SELECT start_terminal, duration_seconds,
SUM(duration_seconds) OVER (PARTITION BY start_terminal) AS start_terminal_sum,
(duration_seconds * 100.0 / SUM(duration_seconds) OVER (PARTITION BY start_terminal)) AS perc_of_parti_sum -- 100.0 avoids integer division
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
ORDER BY 1, 4 DESC

Besides running sums, we can also use COUNT and AVG as window functions.

SELECT start_terminal,
duration_seconds,
SUM(duration_seconds) OVER
(PARTITION BY start_terminal) AS running_total,
COUNT(duration_seconds) OVER
(PARTITION BY start_terminal) AS running_count,
AVG(duration_seconds) OVER
(PARTITION BY start_terminal) AS running_avg
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'

or with ORDER BY:

SELECT start_terminal,
duration_seconds,
SUM(duration_seconds) OVER
(PARTITION BY start_terminal ORDER BY start_time)
AS running_total,
COUNT(duration_seconds) OVER
(PARTITION BY start_terminal ORDER BY start_time)
AS running_count,
AVG(duration_seconds) OVER
(PARTITION BY start_terminal ORDER BY start_time)
AS running_avg
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'

Another practice question:

--- Write a query that shows a running total of the duration of bike rides (similar to the last example), but grouped by end_terminal, and with ride duration sorted in descending order.
SELECT end_terminal,
SUM(duration_seconds) OVER (PARTITION BY end_terminal ORDER BY duration_seconds DESC) AS running_total
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'

Rolling Window

If the timestamps are consecutive (here one row per uid per day), a 30-day rolling average can be computed with a window frame:

SELECT x.*,
AVG(daily_purchase_amt) OVER (PARTITION BY uid ORDER BY timestamp ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS rolling_avg
FROM (
SELECT uid, timestamp, SUM(purchase_amt) AS daily_purchase_amt
FROM tbl
GROUP BY uid, timestamp) x

ROW_NUMBER()

ROW_NUMBER() returns the current row number, starting from 1 in the order given by ORDER BY. The query below numbers rows by start_time:

SELECT start_terminal,
start_time,
duration_seconds,
ROW_NUMBER() OVER (ORDER BY start_time) AS row_number
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'

Adding PARTITION BY makes the numbering start over at 1 within each partition:

SELECT start_terminal,
start_time,
duration_seconds,
ROW_NUMBER() OVER (PARTITION BY start_terminal
ORDER BY start_time) AS row_number
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'

RANK() and DENSE_RANK()

ROW_NUMBER doesn't account for ties: it simply assigns two different row numbers to two records with the same value. For example, when ordering by start_time as above, two rows may well share the same start_time, yet ROW_NUMBER() still gives them different numbers.

RANK() does account for ties: equal values get the same rank, and the following rank is skipped. With two rows tied for fourth place, the next row is sixth.
DENSE_RANK() handles it slightly differently: with two rows tied for fourth place, the next row is still fifth; it does not skip ranks.
sample code:

SELECT start_terminal,
duration_seconds,
RANK() OVER (PARTITION BY start_terminal
ORDER BY start_time) AS rank
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
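
To see the difference directly, a sketch that computes both rankings over the same ordering:

SELECT start_terminal,
duration_seconds,
RANK() OVER (PARTITION BY start_terminal
ORDER BY duration_seconds) AS rank,
DENSE_RANK() OVER (PARTITION BY start_terminal
ORDER BY duration_seconds) AS dense_rank
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'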

Practice question:

--- Write a query that shows the 5 longest rides from each starting terminal, ordered by terminal, and longest to shortest rides within each terminal. Limit to rides that occurred before Jan. 8, 2012.
SELECT *
FROM (
SELECT start_terminal,
duration_seconds,
RANK() OVER (PARTITION BY start_terminal
ORDER BY duration_seconds DESC) AS rank
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
) a
WHERE a.rank <=5

Since this question asks for the top 5 rides, RANK() with a filter of rank <= 5 is enough (note the DESC ordering so that rank 1 is the longest ride); if there are ties, DENSE_RANK() can be used instead.

NTILE

NTILE() tells you which bucket the current record falls into; the syntax is NTILE(# of buckets you want). For example, NTILE(4) splits the rows into 4 equal buckets (quartiles) and NTILE(100) into 100 (percentiles). The value being bucketed is the field in the ORDER BY.

sample code:

SELECT start_terminal,
duration_seconds,
NTILE(4) OVER (PARTITION BY start_terminal
ORDER BY duration_seconds) AS quartile,
NTILE(5) OVER (PARTITION BY start_terminal
ORDER BY duration_seconds) AS quintile,
NTILE(100) OVER (PARTITION BY start_terminal
ORDER BY duration_seconds) AS percentile
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
ORDER BY start_terminal,
duration_seconds

Note that if the dataset is too small (fewer rows than buckets), NTILE simply keeps counting up like ROW_NUMBER() and no longer represents a true percentile.
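
A tiny sketch of that caveat: with only three rows, NTILE(100) just assigns buckets 1, 2 and 3.

SELECT v,
NTILE(100) OVER (ORDER BY v) AS percentile
FROM (VALUES (10), (20), (30)) t(v)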

Practice question:

--- Write a query that shows only the duration of the trip and the percentile into which that duration falls (across the entire dataset—not partitioned by terminal), and order by duration_seconds
SELECT duration_seconds,
NTILE(100) OVER (ORDER BY duration_seconds) as percentile
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
ORDER BY 1 DESC

LAG and LEAD

LAG and LEAD let you pull in records before and after the current one; how many rows back or forward is specified in the parentheses. LAG(duration_seconds, 1) takes duration_seconds from the previous row, and LEAD(duration_seconds, 1) takes it from the next row. The ordering is, as usual, determined by ORDER BY.

Sample code: within each start_terminal, rows are ordered by duration_seconds from shortest to longest:

SELECT start_terminal,
duration_seconds,
LAG(duration_seconds, 1) OVER
(PARTITION BY start_terminal ORDER BY duration_seconds) AS lag,
LEAD(duration_seconds, 1) OVER
(PARTITION BY start_terminal ORDER BY duration_seconds) AS lead
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
ORDER BY start_terminal, duration_seconds

Calculating across rows:

SELECT start_terminal,
duration_seconds,
duration_seconds - LAG(duration_seconds, 1) OVER
(PARTITION BY start_terminal ORDER BY duration_seconds) AS difference
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
ORDER BY start_terminal, duration_seconds

Note (using LAG(X, 1) and LEAD(X, 1) as examples): for the first row of a partition there is no previous row, so LAG is null; for the last row of a partition there is no next row, so LEAD is null.
To drop those nulls, wrap the query in an outer query:

SELECT *
FROM
( SELECT start_terminal,
duration_seconds,
duration_seconds - LAG(duration_seconds, 1) OVER (PARTITION BY start_terminal
ORDER BY duration_seconds) AS difference
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
ORDER BY start_terminal,
duration_seconds ) sub
WHERE sub.difference IS NOT NULL

Defining a Window Alias

If several window functions share the same window, give the window an alias so you don't have to repeat the same window definition over and over. For example, the following query:

SELECT start_terminal,
duration_seconds,
NTILE(4) OVER (PARTITION BY start_terminal
ORDER BY duration_seconds) AS quartile,
NTILE(5) OVER (PARTITION BY start_terminal
ORDER BY duration_seconds) AS quintile,
NTILE(100) OVER (PARTITION BY start_terminal
ORDER BY duration_seconds) AS percentile
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
ORDER BY start_terminal,
duration_seconds

can be rewritten with an alias:

SELECT start_terminal,
duration_seconds,
NTILE(4) OVER ntile_window AS quartile,
NTILE(5) OVER ntile_window AS quintile,
NTILE(100) OVER ntile_window AS percentile
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
WINDOW ntile_window AS (PARTITION BY start_terminal ORDER BY duration_seconds)
ORDER BY start_terminal,
duration_seconds

Note that the WINDOW clause must come after the WHERE clause.


Performance Tuning SQL Queries

To make SQL run faster, the main lever is reducing the number of calculations. The factors below noticeably affect the amount of calculation and are within your control:

  1. Table size: the amount of data being queried
  2. Joins: if a join produces a very large number of rows, the query can get expensive
  3. Aggregations: combining rows takes more computation than simply retrieving them

There are also factors you can't control:

  1. Other users running queries: computing capacity is finite and shared
  2. Database software and optimization

Below we only look at how to handle the controllable factors.

Reducing Table Size

  1. Filter to reduce the row count: do exploratory analysis on a subset first, make sure the query runs and the results look right, then run it on the full set
  2. Reduce the row count with a subquery before aggregating, for example by putting a LIMIT inside the subquery; note the results may then be incomplete, so this is only for testing a query (see the sketch below)
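
A sketch of the second point on the bikeshare table; the LIMIT makes this a test query only, since it aggregates over an arbitrary subset of rows:

SELECT start_terminal,
COUNT(*) AS rides
FROM (
SELECT *
FROM tutorial.dc_bikeshare_q1_2012
LIMIT 1000
) sub
GROUP BY 1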

Making Joins Less Complicated

Reduce table sizes as much as possible before joining; the gain is especially noticeable when the underlying tables are large (see the sketch below).
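
A sketch using the Crunchbase tables from earlier: instead of joining the raw acquisitions table and aggregating afterwards, aggregate it down to one row per company first and join the much smaller result:

SELECT companies.permalink,
acq.num_acquisitions
FROM tutorial.crunchbase_companies companies
JOIN (
-- pre-aggregate: one row per acquired company
SELECT company_permalink,
COUNT(*) AS num_acquisitions
FROM tutorial.crunchbase_acquisitions
GROUP BY 1
) acq
ON acq.company_permalink = companies.permalink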

EXPLAIN

Putting EXPLAIN in front of a query shows roughly how many rows each step touches and how much it is expected to cost, which is useful for deciding how to revise the query.
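
For example, prefixing one of the earlier queries (the exact output format and cost numbers depend on the database):

EXPLAIN
SELECT start_terminal,
COUNT(*) AS rides
FROM tutorial.dc_bikeshare_q1_2012
GROUP BY 1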


Pivoting Data in SQL

Pivoting Rows to Columns

The dataset at hand has three columns: conference, year (one of four class years: FR, SO, JR, SR), and players (a player count). We want each record to show a conference, its total number of players, and the number of players in each class year; that means lifting the values of year up into column names.
To move row values into column names, start with an aggregation:

SELECT teams.conference AS conference,
players.year,
COUNT(1) AS players
FROM benn.college_football_players players
JOIN benn.college_football_teams teams
ON teams.school_name = players.school_name
GROUP BY 1,2
ORDER BY 1,2

Decide what stays put and what moves into the column names. What stays put becomes the outer GROUP BY; each new column is built with CASE plus an aggregation; and the inner subquery is the aggregation table built in the previous step:

SELECT conference,
SUM(players) AS total_players,
SUM(CASE WHEN year = 'FR' THEN players ELSE NULL END) AS fr,
SUM(CASE WHEN year = 'SO' THEN players ELSE NULL END) AS so,
SUM(CASE WHEN year = 'JR' THEN players ELSE NULL END) AS jr,
SUM(CASE WHEN year = 'SR' THEN players ELSE NULL END) AS sr
FROM (
SELECT teams.conference AS conference,
players.year,
COUNT(1) AS players
FROM benn.college_football_players players
JOIN benn.college_football_teams teams
ON teams.school_name = players.school_name
GROUP BY 1,2
) sub
GROUP BY 1
ORDER BY 2 DESC

Pivoting Columns to Rows

In this dataset the column names are years (2000-2012), the row labels are magnitudes (about 10 levels), and each cell holds the corresponding number of earthquakes. When the data you need lives in the column names like this, you can reshape it into a table with three columns: "magnitude", "year", and "number of earthquakes".

First build a small table that turns the column names into rows; here that means listing the years as rows. The magnitudes, which are already row labels, can be left alone since they already form a column.

SELECT year
FROM (VALUES (2000),(2001),(2002),(2003),(2004),(2005),(2006),(2007),(2008),(2009),(2010),(2011),(2012)) v(year)

Since the target is every combination of magnitude and year, we can CROSS JOIN (i.e. take the Cartesian product of) this column of years with the original table (one row per magnitude), which yields every magnitude/year combination:

SELECT years.*,
earthquakes.*
FROM tutorial.worldwide_earthquakes earthquakes
CROSS JOIN (
SELECT year
FROM (VALUES (2000),(2001),(2002),(2003),(2004),(2005),(2006),(2007),(2008),(2009),(2010),(2011),(2012)) v(year)
) years

Then use CASE on this big table to pick, for each row, the "number of earthquakes" value from the column that matches that row's year:

SELECT years.*,
earthquakes.magnitude, /* this pins down the row */
CASE year
WHEN 2000 THEN year_2000 /* this picks which column's value to take */
WHEN 2001 THEN year_2001
WHEN 2002 THEN year_2002
WHEN 2003 THEN year_2003
WHEN 2004 THEN year_2004
WHEN 2005 THEN year_2005
WHEN 2006 THEN year_2006
WHEN 2007 THEN year_2007
WHEN 2008 THEN year_2008
WHEN 2009 THEN year_2009
WHEN 2010 THEN year_2010
WHEN 2011 THEN year_2011
WHEN 2012 THEN year_2012
ELSE NULL END
AS number_of_earthquakes
FROM tutorial.worldwide_earthquakes earthquakes
CROSS JOIN (
SELECT year
FROM (VALUES (2000),(2001),(2002),(2003),(2004),(2005),(2006),(2007),(2008),(2009),(2010),(2011),(2012)) v(year)
) years