Notes on advanced SQL techniques on Mode.
SQL Data types
数据看起来可能是Numeric的,但有可能里面有逗号或者货币符号,在存进database的时候就被当成了Non-numeric的,由于COUNT/SUM这类函数只有在numeric的时候才能用,date类的函数也只能用在明确是date/time的格式的数据,所以要注意这一点。
Changing a column’s data type
CAST(column_name AS integer)
, column_name::integer
OR CONVERT(integer, column_name)
can all do the trick.
Example code:
1 | SELECT CAST(funding_total_usd AS varchar) funding_total_usd, |
SQL Date Format
存成YYYY-MM-DD的才好排序,MM-DD-YYYY会有排序问题。
做日期/时间格式的数据的运算的时候,date - date = interval
,data - interval = date
1 | SELECT companies.permalink, |
也可以直接用INTERVAL
function。下面的code是直接加1星期:
1 | SELECT companies.permalink, |
如果需要当前的时间:NOW()
1 | SELECT companies.permalink, |
Practice question:
1 | --- Write a query that counts the number of companies acquired within 3 years, |
Data Wrangling with SQL
Data munging or data wrangling is loosely the process of manually converting or mapping data from one “raw” form into another format that allows for more convenient consumption of the data with the help of semi-automated tools.
Using SQL String Functions to Clean Data
Cleaning Strings
LEFT, RIGHT and LENGTHLEFT(stirng, n)
: 取某一条string从左数1到第n个characterRIGHT(stirng, n)
: 取某一条string从右数1到第n个character
sample code: 原本date是这个column是前面为日期,后面是自动附上的time stamp。这里因为知道每个cell的长度都是一样的,所以直接选定左右的数字便能把date和time拆开。
1 | SELECT incidnt_num, |
LENGTH(string)
: return string的长度
sample code: 如果不确定每个cell的长度是否一致,只知道一边固定的长度,可以用LENGTH()来做减法
1 | SELECT incidnt_num, |
TRIMTRIM([position] 'charaters to be trimemd' FROM col_name)
可以用来去掉某个string的开头和结尾的characters
postion: leading, trailing, both
charaters to be trimmed: e.g. ‘()’ ‘/‘
sample code: 去掉location中前后的括号
1 | SELECT location, |
POSTION and STRPOSPOSITION('A' IN B)
: 找到‘A’在B中从左数第一次出现的位置STRPOS(B, 'A')
: 同上
注意:这两个函数都是case sensitive的。如果想要把大小写都找出来,最好先用UPPER/LOWER把原string全变大写/小写再找
sample code:
1 | SELECT incidnt_num, |
SUBSTRSUBSTR(string, p, n)
: 取一条string中从第p数n个characters
1 | SELECT incidnt_num, |
Practice question: 将location column中(lat, lon)分为两个column:lat, lon
1 | SELECT location, |
方法二:
1 | SELECT location, |
CONCATCONCAT(a, ',', b))
: 将分别的两个column a和b concat为’a,b’a || ',' || b
: 同上
sample code:
1 | SELECT incidnt_num, |
Practice question: Concatenate the lat
and lon
fields to form a field that is equivalent to the location field.
1 | SELECT lat, |
or
1 | SELECT lat, |
Practice question 2: Write a query that creates a date column formatted YYYY-MM-DD.
1 | SELECT date, |
UPPER and LOWERUPPER(string)
: 把string全改为大写LOWER(string)
: 把string全改为小写
Practice question: Write a query that returns the category
field, but with the first letter capitalized and the rest of the letters in lower-case.
1 | SELECT category, |
Turning Strings into Dates
可以CAST
或::
成date, timestamp(有时分秒)
sample code:
1 | SELECT incidnt_num, |
Practice question: Write a query that creates an accurate timestamp using the date
and time
columns in tutorial.sf_crime_incidents_2014_01
. Include a field that is exactly 1 week later as well.
1 | SELECT date, |
Turning Dates into More Useful Dates
用EXTRACT
从date格式的数据里提取自己需要的部分
sample code:
1 | SELECT cleaned_date, |
用DATE_TRUNC
把date格式的数据round为某个时间单位精度的值,比如你只需要精确到天,就把时分秒全变00:00:00;如果只需要到年,就把月份变为01-01。
sample code:
1 | SELECT cleaned_date, |
practice question: Write a query that counts the number of incidents reported by week. Cast the week as a date to get rid of the hours/minutes/seconds.
注意这里不需要再套一个subquery,做出来第一个column直接就可以用为groupby条件了
1 | SELECT DATE_TRUNC('week', cleaned_date) :: date AS week, |
Extract Current Time
如果需要今天的日期或者当下的timestamp,可以用对应的直接call出来
1 | SELECT CURRENT_DATE AS date, |
如果要换到别的时区:
a complete list of time zones
1 | SELECT CURRENT_TIME AS time, |
practice question: Write a query that shows exactly how long ago each indicent was reported. Assume that the dataset is in Pacific Standard Time (UTC - 8).
1 | SELECT incidnt_num, |
COALESCE
COALESCE(expression1, expresson2, ...)
:return第一个非null的值。比如SELECT COALESCE(NULL, 1, 2, 'W3Schools.com')
会return1.
如果想将所有null
替换为别的值,可以用COALESCE(col, 'string')
。所有非null值仍为原值,null则被替换为'string'
。
sample code: 把所有descript中的null换为’No Description’
1 | SELECT incidnt_num, |
Writing Suqueries in SQL
Using Subqueries to Aggregate in Multiple Stages
How many incidents happen, on average, on every day of week in each month?
思路:要求avg需要的是一个月某一个dow的四个count,在这四个之中求average。所以要先求发生在每一个周五的事件总数,然后group by月份求每月的average。
1 | SELECT LEFT(date, 2) AS month, |
Practice question: Write a query that displays the average number of monthly incidents for each category.
算每个category里12个monthly sum的average
1 | SELECT a.category, |
Subqueries in Conditional Logic
可以再WHERE, JOIN/ON, CASE上用subquery。但注意这种时候就不需要再写alias了,因为subquery在这里并没有被当成一个table,而是当成了a set of values)
sample code: WHERE = subquery
1 | SELECT * |
sample code: WHERE IN subquery
1 | SELECT * |
Joining Subqueries
或者直接用join来filter。下面code结果同上:
1 | SELECT * |
join还有一个很好的优势在于,可以把groupby aggregate的结果再join回原表,使得每一行数据都能附上一个aggregated value。
1 | SELECT incidents.*, |
If you’d like to aggregate the numbers of all of the companies receiving investment and companies acquired each month:
1 | SELECT COALESCE(acquisitions.month, investments.month) AS month, |
Practice question: Write a query that counts the number of companies founded and acquired by quarter starting in Q1 2012. Create the aggregations in two separate queries, then join them.
1 | SELECT COALESCE(companies.founded_quarter, acquisitions.acquired_quarter) AS quarter, |
Subqueries and UNIONs
用UNION ALL
在subquery中合并被分成几块的数据:
1 | SELECT COUNT(*) AS total_rows |
Practice question: Write a query that ranks investors from the combined dataset above by the total number of investments they have made.
1 | SELECT sub.investor_name, |
Practice question 2: Write a query that does the same thing as in the previous problem, except only for companies that are still operating. Hint: operating status is in tutorial.crunchbase_companies.
1 | ``` |
这是在求running total,也就是到当前时间为止的cumulative sum(这一点是由ORDER BY决定的,下面会说只用partition不用ORDER BY的情况)
如果还需要进一步细化,针对每个group来求running total,每换一个新group就清零重记,这时候就需要partition
1 | SELECT start_terminal, |
这里先是用PARTITION按照start_terminal分组,然后在每个组里order by start_time求running sum
1 | SELECT start_terminal, |
这里把ORDER BY去掉了,只剩partition,这里就是在根据partition求每个partition里的sum,跟GROUP BY的概念是一样的,但是Window function可以让每一行都出现这个sum,而不需要aggregate。
注意:GROUP BY和Partition不可共存,换句话说,在WINDOW FUNCTIONS中不可以有GROUP BY clause
还可以求每一行record的duration_seconds占每个partition sum的百分比:
1 | --- Write a query modification of the above example query that |
除了running sum, 我们还可以做 COUNT
, AVG
.
1 | SELECT start_terminal, |
or with ORDER BY
:
1 | SELECT start_terminal, |
Another practice question:
1 | --- Write a query that shows a running total of the duration of bike rides (similar to the last example), but grouped by end_terminal, and with ride duration sorted in descending order. |
Rolling Window
If the timestamp is consecutive:
1 | SELECT x.*, |
ROW_NUMBER()
ROW_NUMBER()可以显示当前行数,根据ORDER BY的顺序从1开始。比如下面的语句就是根据start_time排的序:
1 | SELECT start_terminal, |
加上PARTITION BY
则可以在每个partition从1开始数
1 | SELECT start_terminal, |
RANK() and DENSE_RANK()
ROW_NUMBER对于并列的情况没有考虑,而是直接assign两个不同的row number给两个相同值的record。比如上面ORDER BY start_time
,很有可能有相同的start_time
,但ROW_NUMBER()
依然会给这两个相同的值不同的row number。
而RANK()
则考虑了这种情况,对相同的值就赋予相同的rank,下一级就跳一个数。比如有两个第四名时,下一个就是第六名。DENSE_RANK()
处理方式稍有不同,在有两个第四名时,下一个仍是第五名,它不会skip rank。
sample code:
1 | SELECT start_terminal, |
Practice question:
1 | --- Write a query that shows the 5 longest rides from each starting terminal, ordered by terminal, and longest to shortest rides within each terminal. Limit to rides that occurred before Jan. 8, 2012. |
这个问题要输出前5个,所以用RANK()直接选到5就可以,如果有并列的情况就用DENSE_RANK。
NTILE
NTILE()
可以用来找当前record的percentile,syntax是NTILE(# of buckets you want)
。比如NTILE(4)
是等分成4分相当于quartile,NTILE(1000)
是等分成100份相当于percentile。算percentile的对象是ORDER BY
后面的field
sample code:
1 | SELECT start_terminal, |
需要注意的是如果dataset太小(行数小于Buckets的数量),那PERCENTILE就会像ROW_NUMBER()一样算下去,无法准确表示PERCENTILE
Practice question:
1 | --- Write a query that shows only the duration of the trip and the percentile into which that duration falls (across the entire dataset—not partitioned by terminal), and order by duration_seconds |
LAG AND LEAD
LAG和LEAD可以让你pull当前record之前和之后的records,具体pull之前的第几条在括号内specify即可。如LAG(duration_seconds, 1)
取的是之前一条的duration_seconds, LEAD(duration_seconds, 1)
取的是后一条的duration_seconds。当然排列顺序也由ORDER BY
决定。
Sample code: 算的是每一个start_terminal中按duration seconds从短到长排
1 | SELECT start_terminal, |
在不同行之间计算
1 | SELECT start_terminal, |
注意:拿LAG(X, 1)和LEAD(X, 1)来做例子,对于PARTITION里第一条(没有前一条),LAG为空。对于PARTITION里最后一条(没有后一条),LEAD为空
如果想去掉null,可以用一个outer query
1 | SELECT * |
Defining a Window Alias
如果需要写多个window function用相同的window,可以给window一个alias,就不用重复写很多遍window1的代码了。比如将下面一段:
1 | SELECT start_terminal, |
改为用alias的:
1 | SELECT start_terminal, |
需要注意的是WINDOW
clause必须在WHERE
clause之后。
Performance Tuning SQL Queries
想让SQL跑的更快,主要是要减少calculation的数量。下面几点是对calculation数量有明显影响,且controllable的:
- Table size: 数据量
- Joins: 如果join出来的行数过多,很可能计算量会很大
- Aggregations: combining rows比光retrieve rows需要更大的计算量
同时还有其他uncontrollable的因素:
- Other users running queries: 运算量是有限且shared的
- Database software and optimization:
下面我们只看如何处理controllable的因素。
Reducing Table Size
- 用Filter减少行数:可以用subset先做exploratory analysis,确保query能跑,结果正确,再去fullset上跑
- 在aggregate之前,先用subquery减少行数(比如在subquery里放LIMIT,但要注意这样可能出来的结果不正确,只能用于test query)
Making Joins Less Complicated
在join之前先尽量减少table size。对于数据量本身很大的table来说增益尤为明显。
EXPLAIN
在query之前加EXPLAIN
可以查看每一步entry大概用了多少row,需要cost大约多久时间。可以用这个方法来为你提供revise query的思路。
Pivoting Data in SQL
Pivoting Rows to Columns
现有的dataset有三列:conference(conference类别), year(FR, SO, JR, SR四类年份), players(人数)。我们想让一条record能展示conference,该会总player人数,及各类年份player人数。这就需要把year的值提到colunm name上。
想将每一行里的值放在column name上,首先得做aggregation
1 | SELECT teams.conference AS conference, |
想清楚什么不动,什么要放到column name上。然后把不动的作为最外层group by的条件,用CASE+aggregation
来制作每个column,内层subquery放上一步做好的aggregation table:
1 | SELECT conference, |
Pivoting Columns to Rows
现有的dataset column name是年份(2000-2012),row name是magnitude(约10个级别),cell value是地震对应的数量。对于这种需要的数据出现在了column name的情况,我们可以把他转换为一个三个column的表格:”magnitude”, “year”, and “number of earthquakes”.
首先要建一个新表,使所有column name都变为rows。这里就是把年份放到rows中。原来是row name的magnitude不用管,因为它本来就是一个column了。
1 | SELECT year |
因为本身这张表就是所有magnitude和年份的排列组合,所以我们可以这一列年份直接跟原表(row为magnitude)做CROSS JOIN(即cartesian product),那么这样所有的magnitude和年份的组合就有了。
1 | SELECT years.*, |
再利用CASE在这张大表中对”number of earthquakes”做筛选,只取对应年份的值。
1 | SELECT years.*, |