Notes on advanced SQL techniques on Mode.
SQL Data types
数据看起来可能是Numeric的,但有可能里面有逗号或者货币符号,在存进database的时候就被当成了Non-numeric的,由于COUNT/SUM这类函数只有在numeric的时候才能用,date类的函数也只能用在明确是date/time的格式的数据,所以要注意这一点。
Changing a column’s data type
CAST(column_name AS integer)
, column_name::integer
OR CONVERT(integer, column_name)
can all do the trick.
Example code:1
2
3SELECT CAST(funding_total_usd AS varchar) funding_total_usd,
founded_at_clean::varchar founded_at_clean
FROM tutorial.crunchbase_companies_clean_date
SQL Date Format
存成YYYY-MM-DD的才好排序,MM-DD-YYYY会有排序问题。
做日期/时间格式的数据的运算的时候,date - date = interval
,data - interval = date
1
2
3
4
5
6
7
8SELECT companies.permalink,
companies.founded_at_clean,
acquisitions.acquired_at_cleaned,
acquisitions.acquired_at_cleaned - companies.founded_at_clean::timestamp AS time_to_acquisition
FROM tutorial.crunchbase_companies_clean_date companies
JOIN tutorial.crunchbase_acquisitions_clean_date acquisitions
ON acquisitions.company_permalink = companies.permalink
WHERE founded_at_clean IS NOT NULL
也可以直接用INTERVAL
function。下面的code是直接加1星期:1
2
3
4
5
6SELECT companies.permalink,
companies.founded_at_clean,
companies.founded_at_clean::timestamp +
INTERVAL '1 week' AS plus_one_week
FROM tutorial.crunchbase_companies_clean_date companies
WHERE founded_at_clean IS NOT NULL
如果需要当前的时间:NOW()
1
2
3
4
5SELECT companies.permalink,
companies.founded_at_clean,
NOW() - companies.founded_at_clean::timestamp AS founded_time_ago
FROM tutorial.crunchbase_companies_clean_date companies
WHERE founded_at_clean IS NOT NULL
Practice question:1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25--- Write a query that counts the number of companies acquired within 3 years,
--- 5 years, and 10 years of being founded (in 3 separate columns). Include a
--- column for total companies acquired as well. Group by category and limit
--- to only rows with a founding date.
SELECT companies.category_code,
COUNT(companies.permalink) AS total,
COUNT(CASE
WHEN acquisitions.acquired_at_cleaned <= companies.founded_at_clean::timestamp + INTERVAL '3 years' THEN 1
ELSE NULL
END) AS Y3,
COUNT(CASE
WHEN acquisitions.acquired_at_cleaned <= companies.founded_at_clean::timestamp + INTERVAL '5 years' THEN 1
ELSE NULL
END) AS Y5,
COUNT(CASE
WHEN acquisitions.acquired_at_cleaned <= companies.founded_at_clean::timestamp + INTERVAL '10 years' THEN 1
ELSE NULL
END) AS Y10
FROM tutorial.crunchbase_companies_clean_date AS companies
JOIN tutorial.crunchbase_acquisitions_clean_date AS acquisitions
ON acquisitions.company_permalink = companies.permalink
WHERE founded_at_clean IS NOT NULL
GROUP BY 1
ORDER BY 2 DESC
Data Wrangling with SQL
Data munging or data wrangling is loosely the process of manually converting or mapping data from one “raw” form into another format that allows for more convenient consumption of the data with the help of semi-automated tools.
Using SQL String Functions to Clean Data
Cleaning Strings
LEFT, RIGHT and LENGTHLEFT(stirng, n)
: 取某一条string从左数1到第n个characterRIGHT(stirng, n)
: 取某一条string从右数1到第n个character
sample code: 原本date是这个column是前面为日期,后面是自动附上的time stamp。这里因为知道每个cell的长度都是一样的,所以直接选定左右的数字便能把date和time拆开。1
2
3
4
5SELECT incidnt_num,
date,
LEFT(date, 10) AS cleaned_date,
RIGHT(date, 17) AS cleaned_time
FROM tutorial.sf_crime_incidents_2014_01
LENGTH(string)
: return string的长度
sample code: 如果不确定每个cell的长度是否一致,只知道一边固定的长度,可以用LENGTH()来做减法1
2
3
4
5SELECT incidnt_num,
date,
LEFT(date, 10) AS cleaned_date,
RIGHT(date, LENGTH(date) - 11) AS cleaned_time
FROM tutorial.sf_crime_incidents_2014_01
TRIMTRIM([position] 'charaters to be trimemd' FROM col_name)
可以用来去掉某个string的开头和结尾的characters
postion: leading, trailing, both
charaters to be trimmed: e.g. ‘()’ ‘/‘
sample code: 去掉location中前后的括号1
2
3
4
5SELECT location,
TRIM(both '()' FROM location),
TRIM(leading '(' FROM location),
TRIM(trailing ')' FROM location)
FROM tutorial.sf_crime_incidents_2014_01
POSTION and STRPOSPOSITION('A' IN B)
: 找到‘A’在B中从左数第一次出现的位置STRPOS(B, 'A')
: 同上
注意:这两个函数都是case sensitive的。如果想要把大小写都找出来,最好先用UPPER/LOWER把原string全变大写/小写再找
sample code:1
2
3
4
5SELECT incidnt_num,
descript,
POSITION('A' IN descript) AS a_position
STRPOS(descript, 'A') AS a_position_s
FROM tutorial.sf_crime_incidents_2014_01
SUBSTRSUBSTR(string, p, n)
: 取一条string中从第p数n个characters1
2
3
4SELECT incidnt_num,
date,
SUBSTR(date, 4, 2) AS day
FROM tutorial.sf_crime_incidents_2014_01
Practice question: 将location column中(lat, lon)分为两个column:lat, lon1
2
3
4
5
6
7
8
9
10SELECT location,
lat,
lon,
SUBSTR(location,
POSITION('(' IN location) + 1,
POSITION(',' IN location) - POSITION('(' IN location) - 1) AS LAT1,
SUBSTR(location,
POSITION(',' IN location) + 1,
POSITION(')' IN location) - POSITION(',' IN location) - 1) AS LON1
FROM tutorial.sf_crime_incidents_2014_01
方法二:1
2
3
4SELECT location,
TRIM(leading '(' FROM LEFT(location, POSITION(',' IN location) - 1)) AS lattitude,
TRIM(trailing ')' FROM RIGHT(location, LENGTH(location) - POSITION(',' IN location) ) ) AS longitude
FROM tutorial.sf_crime_incidents_2014_01
CONCATCONCAT(a, ',', b))
: 将分别的两个column a和b concat为’a,b’a || ',' || b
: 同上
sample code:1
2
3
4
5
6SELECT incidnt_num,
day_of_week,
date,
LEFT(date, 10) AS cleaned_date,
CONCAT(day_of_week, ', ', LEFT(date, 10)) AS day_and_date
FROM tutorial.sf_crime_incidents_2014_01
Practice question: Concatenate the lat
and lon
fields to form a field that is equivalent to the location field.1
2
3
4
5SELECT lat,
lon,
location,
CONCAT('(', lat, ', ', lon, ')') AS loc
FROM tutorial.sf_crime_incidents_2014_01
or1
2
3
4
5SELECT lat,
lon,
location,
'(' || lat || ', ' || lon || ')' AS loc
FROM tutorial.sf_crime_incidents_2014_01
Practice question 2: Write a query that creates a date column formatted YYYY-MM-DD.
1 | SELECT date, |
UPPER and LOWERUPPER(string)
: 把string全改为大写LOWER(string)
: 把string全改为小写
Practice question: Write a query that returns the category
field, but with the first letter capitalized and the rest of the letters in lower-case.1
2
3SELECT category,
LEFT(category, 1) || LOWER(RIGHT(category, LENGTH(category) - 1))
FROM tutorial.sf_crime_incidents_2014_01
Turning Strings into Dates
可以CAST
或::
成date, timestamp(有时分秒)
sample code:1
2
3
4
5SELECT incidnt_num,
date,
(SUBSTR(date, 7, 4) || '-' || LEFT(date, 2) ||
'-' || SUBSTR(date, 4, 2))::date AS cleaned_date
FROM tutorial.sf_crime_incidents_2014_01
Practice question: Write a query that creates an accurate timestamp using the date
and time
columns in tutorial.sf_crime_incidents_2014_01
. Include a field that is exactly 1 week later as well.1
2
3
4
5SELECT date,
time,
(SUBSTR(date, 7, 4) || '-' || LEFT(date, 2) || '-' || SUBSTR(date, 4, 2) || ' ' || time) ::timestamp AS acc_timestamp,
(SUBSTR(date, 7, 4) || '-' || LEFT(date, 2) || '-' || SUBSTR(date, 4, 2) || ' ' || time) ::timestamp + INTERVAL '1 week' AS timestamp_1w
FROM tutorial.sf_crime_incidents_2014_01
Turning Dates into More Useful Dates
用EXTRACT
从date格式的数据里提取自己需要的部分
sample code:1
2
3
4
5
6
7
8
9
10SELECT cleaned_date,
EXTRACT('year' FROM cleaned_date) AS year,
EXTRACT('month' FROM cleaned_date) AS month,
EXTRACT('day' FROM cleaned_date) AS day,
EXTRACT('hour' FROM cleaned_date) AS hour,
EXTRACT('minute' FROM cleaned_date) AS minute,
EXTRACT('second' FROM cleaned_date) AS second,
EXTRACT('decade' FROM cleaned_date) AS decade, /*YYYY年份的前三位*/
EXTRACT('dow' FROM cleaned_date) AS day_of_week
FROM tutorial.sf_crime_incidents_cleandate
用DATE_TRUNC
把date格式的数据round为某个时间单位精度的值,比如你只需要精确到天,就把时分秒全变00:00:00;如果只需要到年,就把月份变为01-01。
sample code:1
2
3
4
5
6
7
8
9
10SELECT cleaned_date,
DATE_TRUNC('year' , cleaned_date) AS year,
DATE_TRUNC('month' , cleaned_date) AS month,
DATE_TRUNC('week' , cleaned_date) AS week, /*当周的周一*/
DATE_TRUNC('day' , cleaned_date) AS day,
DATE_TRUNC('hour' , cleaned_date) AS hour,
DATE_TRUNC('minute' , cleaned_date) AS minute,
DATE_TRUNC('second' , cleaned_date) AS second,
DATE_TRUNC('decade' , cleaned_date) AS decade
FROM tutorial.sf_crime_incidents_cleandate
practice question: Write a query that counts the number of incidents reported by week. Cast the week as a date to get rid of the hours/minutes/seconds.
注意这里不需要再套一个subquery,做出来第一个column直接就可以用为groupby条件了1
2
3
4
5SELECT DATE_TRUNC('week', cleaned_date) :: date AS week,
COUNT(incidnt_num) as cnt
FROM tutorial.sf_crime_incidents_cleandate
GROUP BY 1
ORDER BY 1
Extract Current Time
如果需要今天的日期或者当下的timestamp,可以用对应的直接call出来1
2
3
4
5
6SELECT CURRENT_DATE AS date,
CURRENT_TIME AS time,
CURRENT_TIMESTAMP AS timestamp,
LOCALTIME AS localtime,
LOCALTIMESTAMP AS localtimestamp,
NOW() AS now /*timestamp*/
如果要换到别的时区:
a complete list of time zones1
2SELECT CURRENT_TIME AS time,
CURRENT_TIME AT TIME ZONE 'PST' AS time_pst
practice question: Write a query that shows exactly how long ago each indicent was reported. Assume that the dataset is in Pacific Standard Time (UTC - 8).1
2
3
4
5SELECT incidnt_num,
cleaned_date,
NOW() AT TIME ZONE 'PST' AS now,
NOW() AT TIME ZONE 'PST' - cleaned_date AS time_ago
FROM tutorial.sf_crime_incidents_cleandate
COALESCE
COALESCE(expression1, expresson2, ...)
:return第一个非null的值。比如SELECT COALESCE(NULL, 1, 2, 'W3Schools.com')
会return1.
如果想将所有null
替换为别的值,可以用COALESCE(col, 'string')
。所有非null值仍为原值,null则被替换为'string'
。
sample code: 把所有descript中的null换为’No Description’1
2
3
4
5SELECT incidnt_num,
descript,
COALESCE(descript, 'No Description')
FROM tutorial.sf_crime_incidents_cleandate
ORDER BY descript DESC
Writing Suqueries in SQL
Using Subqueries to Aggregate in Multiple Stages
How many incidents happen, on average, on every day of week in each month?
思路:要求avg需要的是一个月某一个dow的四个count,在这四个之中求average。所以要先求发生在每一个周五的事件总数,然后group by月份求每月的average。1
2
3
4
5
6
7
8
9
10SELECT LEFT(date, 2) AS month,
a.dow,
AVG(a.cnt) AS avg
FROM (
SELECT day_of_week AS dow, date, COUNT(incidnt_num) as cnt
FROM tutorial.sf_crime_incidents_2014_01
GROUP BY 1, 2
ORDER BY 1, 2) a
GROUP BY 1, 2
ORDER BY 1, 2
Practice question: Write a query that displays the average number of monthly incidents for each category.
算每个category里12个monthly sum的average1
2
3
4
5
6
7
8
9
10SELECT a.category,
AVG(a.cnt) AS average
FROM (
SELECT EXTRACT('MONTH' FROM cleaned_date) AS month,
category,
COUNT(incidnt_num) AS cnt
FROM tutorial.sf_crime_incidents_cleandate
GROUP BY 1, 2
) a
GROUP BY 1
Subqueries in Conditional Logic
可以再WHERE, JOIN/ON, CASE上用subquery。但注意这种时候就不需要再写alias了,因为subquery在这里并没有被当成一个table,而是当成了a set of values)
sample code: WHERE = subquery
1
2
3
4
5SELECT *
FROM tutorial.sf_crime_incidents_2014_01
WHERE Date = (SELECT MIN(date)
FROM tutorial.sf_crime_incidents_2014_01
)
sample code: WHERE IN subquery
1
2
3
4
5
6
7SELECT *
FROM tutorial.sf_crime_incidents_2014_01
WHERE Date IN (SELECT date
FROM tutorial.sf_crime_incidents_2014_01
ORDER BY date
LIMIT 5
)
Joining Subqueries
或者直接用join来filter。下面code结果同上:1
2
3
4
5
6
7
8SELECT *
FROM tutorial.sf_crime_incidents_2014_01 incidents
JOIN (
SELECT date
FROM tutorial.sf_crime_incidents_2014_01
ORDER BY date LIMIT 5
) sub
ON incidents.date = sub.date
join还有一个很好的优势在于,可以把groupby aggregate的结果再join回原表,使得每一行数据都能附上一个aggregated value。1
2
3
4
5
6
7
8
9
10
11SELECT incidents.*,
sub.incidents AS incidents_that_day
FROM tutorial.sf_crime_incidents_2014_01 incidents
JOIN (
SELECT date,
COUNT(incidnt_num) AS incidents
FROM tutorial.sf_crime_incidents_2014_01
GROUP BY 1
) sub
ON incidents.date = sub.date
ORDER BY sub.incidents DESC, time
If you’d like to aggregate the numbers of all of the companies receiving investment and companies acquired each month:1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16SELECT COALESCE(acquisitions.month, investments.month) AS month,
--因为是fulljoin,所以可能会有的acquisition月份在investments里面没有,COALESCE是将null的月份用investment的月份补足了,这样相当于union了两个subqueries里的month。
acquisitions.companies_acquired,
investments.companies_rec_investment
FROM (
SELECT acquired_month AS month,
COUNT(DISTINCT company_permalink) AS companies_acquired
FROM tutorial.crunchbase_acquisitions
GROUP BY 1 ) acquisitions
FULL JOIN (
SELECT funded_month AS month,
COUNT(DISTINCT company_permalink) AS companies_rec_investment
FROM tutorial.crunchbase_investments
GROUP BY 1 )investments
ON acquisitions.month = investments.month
ORDER BY 1 DESC
Practice question: Write a query that counts the number of companies founded and acquired by quarter starting in Q1 2012. Create the aggregations in two separate queries, then join them.1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17SELECT COALESCE(companies.founded_quarter, acquisitions.acquired_quarter) AS quarter,
companies.num_founded,
acquisitions.num_acquired
FROM (
SELECT founded_quarter,
COUNT(DISTINCT permalink) AS num_founded
FROM tutorial.crunchbase_companies
WHERE founded_year >= 2012
GROUP BY 1) companies
FULL JOIN (
SELECT acquired_quarter,
COUNT(DISTINCT company_permalink) AS num_acquired
FROM tutorial.crunchbase_acquisitions
WHERE acquired_year >= 2012
GROUP BY 1) acquisitions
ON companies.founded_quarter = acquired_quarter
ORDER BY 1
Subqueries and UNIONs
用UNION ALL
在subquery中合并被分成几块的数据:1
2
3
4
5
6
7
8
9
10SELECT COUNT(*) AS total_rows
FROM (
SELECT *
FROM tutorial.crunchbase_investments_part1
UNION ALL
SELECT *
FROM tutorial.crunchbase_investments_part2
) sub
Practice question: Write a query that ranks investors from the combined dataset above by the total number of investments they have made.1
2
3
4
5
6
7
8
9
10
11
12
13SELECT sub.investor_name,
COUNT(sub.company_permalink) as num_investments
FROM (
SELECT *
FROM tutorial.crunchbase_investments_part1
UNION ALL
SELECT *
FROM tutorial.crunchbase_investments_part2
) sub
GROUP BY 1
ORDER BY 2 DESC
Practice question 2: Write a query that does the same thing as in the previous problem, except only for companies that are still operating. Hint: operating status is in tutorial.crunchbase_companies.1
2
3
4
5
6
7
8
9
10```
***
## Window Functions
### Basic windowing syntax
```sql
SELECT duration_seconds,
SUM(duration_seconds) OVER (ORDER BY start_time) AS running_total
FROM tutorial.dc_bikeshare_q1_2012
这是在求running total,也就是到当前时间为止的cumulative sum(这一点是由ORDER BY决定的,下面会说只用partition不用ORDER BY的情况)
如果还需要进一步细化,针对每个group来求running total,每换一个新group就清零重记,这时候就需要partition1
2
3
4
5
6
7SELECT start_terminal,
duration_seconds,
SUM(duration_seconds) OVER
(PARTITION BY start_terminal ORDER BY start_time)
AS running_total
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
这里先是用PARTITION按照start_terminal分组,然后在每个组里order by start_time求running sum
1 | SELECT start_terminal, |
这里把ORDER BY去掉了,只剩partition,这里就是在根据partition求每个partition里的sum,跟GROUP BY的概念是一样的,但是Window function可以让每一行都出现这个sum,而不需要aggregate。
注意:GROUP BY和Partition不可共存,换句话说,在WINDOW FUNCTIONS中不可以有GROUP BY clause
还可以求每一行record的duration_seconds占每个partition sum的百分比:1
2
3
4
5
6
7
8
9--- Write a query modification of the above example query that
--- shows the duration of each ride as a percentage
--- of the total time accrued by riders from each start_terminal
SELECT start_terminal, duration_seconds,
SUM(duration_seconds) OVER (PARTITION BY start_terminal) AS start_terminal_sum,
(duration_seconds / SUM(duration_seconds) OVER (PARTITION BY start_terminal))*100 AS perc_of_parti_sum
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
ORDER BY 1, 4 DESC
除了running sum, 我们还可以做 COUNT
, AVG
.1
2
3
4
5
6
7
8
9
10SELECT start_terminal,
duration_seconds,
SUM(duration_seconds) OVER
(PARTITION BY start_terminal) AS running_total,
COUNT(duration_seconds) OVER
(PARTITION BY start_terminal) AS running_count,
AVG(duration_seconds) OVER
(PARTITION BY start_terminal) AS running_avg
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
or with ORDER BY
:1
2
3
4
5
6
7
8
9
10
11
12
13SELECT start_terminal,
duration_seconds,
SUM(duration_seconds) OVER
(PARTITION BY start_terminal ORDER BY start_time)
AS running_total,
COUNT(duration_seconds) OVER
(PARTITION BY start_terminal ORDER BY start_time)
AS running_count,
AVG(duration_seconds) OVER
(PARTITION BY start_terminal ORDER BY start_time)
AS running_avg
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
Another practice question:1
2
3
4
5--- Write a query that shows a running total of the duration of bike rides (similar to the last example), but grouped by end_terminal, and with ride duration sorted in descending order.
SELECT end_terminal,
SUM(duration_seconds) OVER (PARTITION BY end_terminal ORDER BY duration_seconds DESC) AS running_total
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
Rolling Window
If the timestamp is consecutive:1
2
3
4
5
6SELECT x.*,
AVG(daily_purchase_amt) OVER (PARTITION BY uid ORDER BY Timestamp ROWS BETWEEN 29 PRECEDING AND current ROW) AS rolling_avg
FROM (
SELECT uid, timestamp, sum(purchase_amt) as daily_purchase_amt
from tbl
group by timestamp) x
ROW_NUMBER()
ROW_NUMBER()可以显示当前行数,根据ORDER BY的顺序从1开始。比如下面的语句就是根据start_time排的序:1
2
3
4
5
6SELECT start_terminal,
start_time,
duration_seconds,
ROW_NUMBER() OVER (ORDER BY start_time)AS row_number
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
加上PARTITION BY
则可以在每个partition从1开始数1
2
3
4
5
6
7SELECT start_terminal,
start_time,
duration_seconds,
ROW_NUMBER() OVER (PARTITION BY start_terminal
ORDER BY start_time) AS row_number
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
RANK() and DENSE_RANK()
ROW_NUMBER对于并列的情况没有考虑,而是直接assign两个不同的row number给两个相同值的record。比如上面ORDER BY start_time
,很有可能有相同的start_time
,但ROW_NUMBER()
依然会给这两个相同的值不同的row number。
而RANK()
则考虑了这种情况,对相同的值就赋予相同的rank,下一级就跳一个数。比如有两个第四名时,下一个就是第六名。DENSE_RANK()
处理方式稍有不同,在有两个第四名时,下一个仍是第五名,它不会skip rank。
sample code:1
2
3
4
5
6SELECT start_terminal,
duration_seconds,
RANK() OVER (PARTITION BY start_terminal
ORDER BY start_time) AS rank
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
Practice question:1
2
3
4
5
6
7
8
9
10
11--- Write a query that shows the 5 longest rides from each starting terminal, ordered by terminal, and longest to shortest rides within each terminal. Limit to rides that occurred before Jan. 8, 2012.
SELECT *
FROM (
SELECT start_terminal,
duration_seconds,
RANK() OVER (PARTITION BY start_terminal
ORDER BY duration_seconds) AS rank
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
) a
WHERE a.rank <=5
这个问题要输出前5个,所以用RANK()直接选到5就可以,如果有并列的情况就用DENSE_RANK。
NTILE
NTILE()
可以用来找当前record的percentile,syntax是NTILE(# of buckets you want)
。比如NTILE(4)
是等分成4分相当于quartile,NTILE(1000)
是等分成100份相当于percentile。算percentile的对象是ORDER BY
后面的field
sample code:1
2
3
4
5
6
7
8
9
10
11
12SELECT start_terminal,
duration_seconds,
NTILE(4) OVER (PARTITION BY start_terminal
ORDER BY duration_seconds) AS quartile,
NTILE(5) OVER (PARTITION BY start_terminal
ORDER BY duration_seconds) AS quintile,
NTILE(100) OVER (PARTITION BY start_terminal
ORDER BY duration_seconds) AS percentile
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
ORDER BY start_terminal,
duration_seconds
需要注意的是如果dataset太小(行数小于Buckets的数量),那PERCENTILE就会像ROW_NUMBER()一样算下去,无法准确表示PERCENTILE
Practice question:1
2
3
4
5
6--- Write a query that shows only the duration of the trip and the percentile into which that duration falls (across the entire dataset—not partitioned by terminal), and order by duration_seconds
SELECT duration_seconds,
NTILE(100) OVER (ORDER BY duration_seconds) as percentile
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
ORDER BY 1 DESC
LAG AND LEAD
LAG和LEAD可以让你pull当前record之前和之后的records,具体pull之前的第几条在括号内specify即可。如LAG(duration_seconds, 1)
取的是之前一条的duration_seconds, LEAD(duration_seconds, 1)
取的是后一条的duration_seconds。当然排列顺序也由ORDER BY
决定。
Sample code: 算的是每一个start_terminal中按duration seconds从短到长排1
2
3
4
5
6
7
8
9SELECT start_terminal,
duration_seconds,
LAG(duration_seconds, 1) OVER
(PARTITION BY start_terminal ORDER BY duration_seconds) AS lag,
LEAD(duration_seconds, 1) OVER
(PARTITION BY start_terminal ORDER BY duration_seconds) AS lead
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
ORDER BY start_terminal, duration_seconds
在不同行之间计算1
2
3
4
5
6
7SELECT start_terminal,
duration_seconds,
duration_seconds - LAG(duration_seconds, 1) OVER
(PARTITION BY start_terminal ORDER BY duration_seconds)AS difference
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
ORDER BY start_terminal, duration_seconds
注意:拿LAG(X, 1)和LEAD(X, 1)来做例子,对于PARTITION里第一条(没有前一条),LAG为空。对于PARTITION里最后一条(没有后一条),LEAD为空
如果想去掉null,可以用一个outer query1
2
3
4
5
6
7
8
9
10
11SELECT *
FROM
( SELECT start_terminal,
duration_seconds,
duration_seconds - LAG(duration_seconds, 1) OVER (PARTITION BY start_terminal
ORDER BY duration_seconds) AS difference
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
ORDER BY start_terminal,
duration_seconds ) sub
WHERE sub.difference IS NOT NULL
Defining a Window Alias
如果需要写多个window function用相同的window,可以给window一个alias,就不用重复写很多遍window1的代码了。比如将下面一段:1
2
3
4
5
6
7
8
9
10
11
12SELECT start_terminal,
duration_seconds,
NTILE(4) OVER (PARTITION BY start_terminal
ORDER BY duration_seconds) AS quartile,
NTILE(5) OVER (PARTITION BY start_terminal
ORDER BY duration_seconds) AS quintile,
NTILE(100) OVER (PARTITION BY start_terminal
ORDER BY duration_seconds) AS percentile
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
ORDER BY start_terminal,
duration_seconds
改为用alias的:1
2
3
4
5
6
7
8
9
10SELECT start_terminal,
duration_seconds,
NTILE(4) OVER ntile_window AS quartile,
NTILE(5) OVER ntile_window AS quintile,
NTILE(100) OVER ntile_window AS percentile
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
WINDOW ntile_window AS (PARTITION BY start_terminal ORDER BY duration_seconds)
ORDER BY start_terminal,
duration_seconds
需要注意的是WINDOW
clause必须在WHERE
clause之后。
Performance Tuning SQL Queries
想让SQL跑的更快,主要是要减少calculation的数量。下面几点是对calculation数量有明显影响,且controllable的:
- Table size: 数据量
- Joins: 如果join出来的行数过多,很可能计算量会很大
- Aggregations: combining rows比光retrieve rows需要更大的计算量
同时还有其他uncontrollable的因素:
- Other users running queries: 运算量是有限且shared的
- Database software and optimization:
下面我们只看如何处理controllable的因素。
Reducing Table Size
- 用Filter减少行数:可以用subset先做exploratory analysis,确保query能跑,结果正确,再去fullset上跑
- 在aggregate之前,先用subquery减少行数(比如在subquery里放LIMIT,但要注意这样可能出来的结果不正确,只能用于test query)
Making Joins Less Complicated
在join之前先尽量减少table size。对于数据量本身很大的table来说增益尤为明显。
EXPLAIN
在query之前加EXPLAIN
可以查看每一步entry大概用了多少row,需要cost大约多久时间。可以用这个方法来为你提供revise query的思路。
Pivoting Data in SQL
Pivoting Rows to Columns
现有的dataset有三列:conference(conference类别), year(FR, SO, JR, SR四类年份), players(人数)。我们想让一条record能展示conference,该会总player人数,及各类年份player人数。这就需要把year的值提到colunm name上。
想将每一行里的值放在column name上,首先得做aggregation1
2
3
4
5
6
7
8SELECT teams.conference AS conference,
players.year,
COUNT(1) AS players
FROM benn.college_football_players players
JOIN benn.college_football_teams teams
ON teams.school_name = players.school_name
GROUP BY 1,2
ORDER BY 1,2
想清楚什么不动,什么要放到column name上。然后把不动的作为最外层group by的条件,用CASE+aggregation
来制作每个column,内层subquery放上一步做好的aggregation table:1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17SELECT conference,
SUM(players) AS total_players,
SUM(CASE WHEN year = 'FR' THEN players ELSE NULL END) AS fr,
SUM(CASE WHEN year = 'SO' THEN players ELSE NULL END) AS so,
SUM(CASE WHEN year = 'JR' THEN players ELSE NULL END) AS jr,
SUM(CASE WHEN year = 'SR' THEN players ELSE NULL END) AS sr
FROM (
SELECT teams.conference AS conference,
players.year,
COUNT(1) AS players
FROM benn.college_football_players players
JOIN benn.college_football_teams teams
ON teams.school_name = players.school_name
GROUP BY 1,2
) sub
GROUP BY 1
ORDER BY 2 DESC
Pivoting Columns to Rows
现有的dataset column name是年份(2000-2012),row name是magnitude(约10个级别),cell value是地震对应的数量。对于这种需要的数据出现在了column name的情况,我们可以把他转换为一个三个column的表格:”magnitude”, “year”, and “number of earthquakes”.
首先要建一个新表,使所有column name都变为rows。这里就是把年份放到rows中。原来是row name的magnitude不用管,因为它本来就是一个column了。1
2SELECT year
FROM (VALUES (2000),(2001),(2002),(2003),(2004),(2005),(2006),(2007),(2008),(2009),(2010),(2011),(2012)) v(year)
因为本身这张表就是所有magnitude和年份的排列组合,所以我们可以这一列年份直接跟原表(row为magnitude)做CROSS JOIN(即cartesian product),那么这样所有的magnitude和年份的组合就有了。1
2
3
4
5
6
7SELECT years.*,
earthquakes.*
FROM tutorial.worldwide_earthquakes earthquakes
CROSS JOIN (
SELECT year
FROM (VALUES (2000),(2001),(2002),(2003),(2004),(2005),(2006),(2007),(2008),(2009),(2010),(2011),(2012)) v(year)
) years
再利用CASE在这张大表中对”number of earthquakes”做筛选,只取对应年份的值。1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23SELECT years.*,
earthquakes.magnitude,/*这里把row给定位住了*/
CASE year
WHEN 2000 THEN year_2000/*这里定位要取的是哪个column的值*/
WHEN 2001 THEN year_2001
WHEN 2002 THEN year_2002
WHEN 2003 THEN year_2003
WHEN 2004 THEN year_2004
WHEN 2005 THEN year_2005
WHEN 2006 THEN year_2006
WHEN 2007 THEN year_2007
WHEN 2008 THEN year_2008
WHEN 2009 THEN year_2009
WHEN 2010 THEN year_2010
WHEN 2011 THEN year_2011
WHEN 2012 THEN year_2012
ELSE NULL END
AS number_of_earthquakes
FROM tutorial.worldwide_earthquakes earthquakes
CROSS JOIN (
SELECT year
FROM (VALUES (2000),(2001),(2002),(2003),(2004),(2005),(2006),(2007),(2008),(2009),(2010),(2011),(2012)) v(year)
) years