MySQL tutorial using SQL Cookbook (O'Reilly)
In this tutorial, I compiled key lessons from the SQL Cookbook (by Anthony Molinaro and published in O’Reilly) from the perspective of someone with nonzero but minimal SQL experience. I offer instructions on how to set up MySQL on a personal computer (Mac) and work through selected examples from the book, also discussing alternative solutions I came up with for some problems.
Setup
Requirements
- Macbook
- SQL Cookbook. O’Reilly Media.
How to set up MySQL and example data
Download MySQL
- Go to the MySQL download website:
- For Mac, download
macOS 10.14 (x86, 64-bit), DMG Archive
If still unclear, check out video tutorial how to install MySQL on Mac by Linda.com.
Set up example data from the book on computer:
- Turn on the MySQL server on Preference Pane
- On Terminal, enter:
cd /usr/local/mysql/bin
- Then enter:
./mysql -u root -p
- Enter the password you chose
- If this is the first time you use MySQL on Mac, check out for more detailed instructions on password video tutorial on how to start a MySQL by rudolfson.junior.
- Run code in crsqlbook_data.txt to create database “sqlbook_data” and 4 tables: “emp”, “dept”, “t10”, “t100”
- Skip running this code if you’ve already created data. Instead just load the database by entering:
use sqlbook_data;
- Skip running this code if you’ve already created data. Instead just load the database by entering:
Key lessons from SQL Cookbook
Retrieve records
-
The order in which the SQL queries are evaluated (p. 4):
FROM
clause ->WHERE
->SELECT
orGROUP BY
->SELECT
This is why we need to have a query aliasing columns in an inline view, and then select * in the outer query, e.g.
WRONG:
select sal as salary, comm as commission from emp where salary < 5000;
RIGHT:
select * from ( select sal as salary, comm as commission from emp ) x where salary < 5000;
- Alias table created in the inline view, e.g. by x
concat(col1, “ character “, col2)
to concatenate column values (even if col1, col2 are numeric)- Conditional logic variable syntax of the form; don’t forget
end
at the end!!case when col1 < 2000 then 1 when col2 > 3000 then 2 else 3 end as conditional
limit 5
at the end of query: show the first 5 observations- Randomly select 5 rows with
order by rand() limit 5
where col1 is null
orwhere col1 is not null
(never use col1!=null)coalesce(col1,0)
to fill in missing values with0
or can usecase
clause- Search for patterns
- col1 is 10 or 20:
where col1 in (10,20)
- col1 ends with ‘ER’:
where col1 regexp [ER]$
orwhere col1 like '%ER'
- col1 starts with ‘ER’:
where col1 regexp '^[ER]'
orwhere col1 like 'ER%'
- col1 contains ‘I’:
where col1 like '%I%'
- col1 is 10 or 20:
Sorting query results
order by col1, col2 desc
: sort by col1 in ascending, by col2 in descending order- Sort by the last two characters of job title: use
substr()
select ename, job from emp order by substr(job,length(job)-1); % this substring = all characters from the first to the last char
- Sort by
comm
if job is salesman, and bysal
for all other jobs -> usecase
expression in theorder by
select ename, sal, job, comm from emp order by case when job='SALESMAN' then comm else sal end;
Working with multiple tables
union all
to stack rows from multiple tables; those tables must have same number and same data types of columns but not necessarily same names; if you want to remove duplicates (as in a real set concept), useunion
inner join
syntax: e.g.select e.ename, d.loc from emp e inner join dept d on e.deptno=d.deptno where e.deptno=10;
- (Hard!!) Select values in a column that don’t exist in the other table (including null values)
select d.deptno from dept d where not exists ( select 1 from new_dept nd where d.deptno = nd.deptno);
- Select rows that don’t exist in another table
select d.* from dept d left outer join emp e on (d.deptno=e.deptno) where e.deptno is null;
- Add new information in the left table by merging with other table
select e.ename, d.loc, eb.received from emp e join dept d on (e.deptno=d.deptno) left join emp_bonus eb on (e.empno=eb.empno) order by 2;
or
select e.ename, d.loc, (select eb.received from emp_bonus eb where eb.empno=e.empno) as received from emp e, dept d where e.deptno=d.deptno order by 2;
-
Performing joins when using aggregates: e.g. sum salary and bonus for deptno=10 when bonus data contain bonuses for all employees in deptno=10
1) Merge first and then sum (with distinct inside sum function)
select deptno, sum(distinct sal) as tot_salary, sum(bonus) as tot_bonus from ( select e.empno, e.ename, e.deptno, e.sal, e.sal*case when eb.type=1 then 0.1 when eb.type=2 then 0.2 else 0.3 end as bonus from emp e, emp_bonus eb where e.empno=eb.empno and e.deptno=10 ) x group by deptno, tot_salary;
or 2) Sum employee salary first and merge
select d.deptno, d.tot_salary, sum(e.sal*case when eb.type=1 then 0.1 when eb.type=2 then 0.2 else 0.3 end) as tot_bonus from emp e, emp_bonus eb, (select deptno, sum(sal) as tot_salary from emp where deptno=10 group by deptno) d where e.empno=eb.empno and e.deptno=d.deptno group by deptno, tot_salary;
- Performing outer joins when using aggregates: sum salary and bonus for deptno=10 when bonus data don’t contain bonuses for all employees in deptno=10
select deptno, sum(distinct sal) as tot_salary, sum(bonus) as tot_bonus from (select e.deptno, e.ename, e.sal, e.sal*case when eb.type is null then 0 else eb.type/10 end as bonus from emp e left outer join emp_bonus eb on (e.empno=eb.empno) where deptno=10) x group by deptno, tot_salary;
or same as the 2nd solution in #6:
select d.deptno, d.tot_salary, sum(e.sal*eb.type/10) as tot_bonus from emp e, emp_bonus eb, (select deptno, sum(sal) as tot_salary from emp where deptno=10 group by deptno) d where e.empno=eb.empno and e.deptno=d.deptno group by deptno, tot_salary;
- Returning missing data from both merging tables ->
union
the results of 2 different outer joins (can’t be done soley byleft join
orright join
one at a time)select d.deptno, d.dname, e.ename from dept d left join emp e on (d.deptno=e.deptno) union select d.deptno, d.dname, e.ename from dept d right join emp e on (d.deptno=e.deptno);
- Operation/comparison when a column contains null: e.g. find all employees whose commission is less than the commision of employee “WARD”, including those with a null commission
select ename, comm from emp where coalesce(comm,0) < (select coalesce(comm,0) from emp where ename='WARD');
- Extract numbers from alphanumeric var: e.g. get 123321 from ‘paul123f321’
Working with numbers
avg(col)
to compute average; can be combined withgroup by
max(col)
,min(col)
,sum(col)
count(*)
counts number of rows vscount(col)
number of non-NULL values- Running total without using window function but with scalar subquery
select e.ename, e.sal, (select sum(d.sal) from emp d where d.empno <= e.empno) as run_total from emp e order by 3;
Note this can be done more concisely using
sum() over()
:select ename, sal, sum(sal) over(order by sal) from emp;
- Running product - take sum of logs
ln(sal)
and exponentiate Note: this doesn’t work ifsal
<=0 since you can’t take log of non-positive valuesselect ename, sal, round(exp(log),0) from ( select ename, sal, sum(ln(sal)) over(order by sal) as log from emp ) x;
or
select e.ename, e.sal, (select round(exp(sum(ln(d.sal))),0) from emp d where d.empno <= e.empno and d.deptno=e.deptno) as run_product from emp e where e.deptno=10;
- Running difference: only the first value in the group is conserved, multiply the rest by (-1) and sum them up
- Warning: the MySQL solution on running difference on p.172 of the book is wrong.
- Don’t know how to do this
- Finding a mode is a little tricky, involving
having count(*) >= all (count by value)
select sal from emp where deptno=20 group by sal having count(*) >= all (select count(*) from emp where deptno=20 group by sal);
- (Hard and didn’t understand!!) Finding a median is also tricky: do a self-join
select a.sal from emp a, emp b where a.deptno=b.deptno and a.deptno=20 group by a.sal having sum(case when a.sal = b.sal then 1 else 0 end) >= abs(sum(sign(e.sal - d.sal)));
- % salaries in deptno=10 out of total
select 100*sum(case when deptno=10 then sal end)/sum(sal) as pct_dept10 from emp;
- Take average after excluding the highest and lowest values
select avg(sal) from emp where sal not in ( (select min(sal) from emp), (select max(sal) from emp) );
The above code may drop multiple employees, if they have highest or lowest salary. If you want to exclude only the highest and lowest value once each, just exclude them and divide by the total (N-2):
select (sum(sal)-min(sal)-max(sal))/(count(*)-2) as pct from emp;
- Update the total balance column (i.e. running total) after each transaction row, which can be either ‘purchase’ or ‘payment’
select case when trx='PR' then 'Purchase' else 'Payment' end as trx_type, amt, sum(case when trx='PR' then amt else -amt end) over(order by id) as balance from V;
Date arithmetic
- Subtract 5 days/months/years from the date variable using
interval 5 day (month/year)
ordate_add
functionselect ename, hiredate, hiredate-interval 5 day as hd_minus_5d, hiredate+interval 5 month as hd_plus_5m, hiredate-interval 5 year as hd_minus_5y from emp where deptno=10;
or
select ename, hiredate, date_add(hiredate, interval -5 day) as hd_minus_5d, date_add(hiredate, interval +5 month) as hd_plus_5m, date_add(hiredate, interval +5 year) as hd_plus_5y from emp where deptno=10;
- Find the number of days between 2 dates (e.g. hiring dates for Allen and Ward) using
datediff(d1,d2)
:=d1-d2; put the earlier date last in the 2 argumentsselect datediff(ward_hd, allen_hd) from (select hiredate as ward_hd from emp where ename='WARD') x, (select hiredate as allen_hd from emp where ename='ALLEN') y;
-
(New trick!!) Find the number of business days between 2 dates; Requires 2 steps:
1) get two dates and create rows for each day between those dates 2) count number of rows (i.e. days), excluding weekends
select sum(case when date_format(jones_hd+interval (id-1) day, '%a') in ('Sat','Sun') then 0 else 1 end) as biz_days, max(id) as days from ( select max(case when ename = 'BLAKE' then hiredate end) as blake_hd, max(case when ename='JONES' then hiredate end) as jones_hd from emp where ename in ('BLAKE','JONES') ) x, t100 where t100.id <= datediff(blake_hd,jones_hd)+1;
- Find the number of months or years between 2 dates: use
year
andmonth
functionsselect n_month, round(n_month/12,0) as n_year from ( select (year(last_hd)-year(first_hd))*12 + (month(last_hd)-month(first_hd)) as n_month from ( select min(hiredate) as first_hd, max(hiredate) as last_hd from emp ) x ) y;
- Find the number of seconds, minutes, or hours between 2 dates
select datediff(ward_hd, allen_hd)*24 as hr, datediff(ward_hd, allen_hd)*24*60 as min, datediff(ward_hd, allen_hd)*24*60*60 as sec from (select hiredate as ward_hd from emp where ename='WARD') a, (select hiredate as allen_hd from emp where ename='ALLEN') b;
- (Involving many tricks!!) Count the occurrences of each weekday in a year
- Use
cast(string as date)
to convert string to a date variable - date_format(date, ‘%W’) returns e.g. ‘Monday’
select date_format( cast(concat(year(current_date),'-01-01') as date) + interval t500.id-1 day, '%W') day, count(*) from t500 where t500.id <= datediff( cast(concat(year(current_date)+1,'-01-01') as date), cast(concat(year(current_date),'-01-01') as date)) group by date_format( cast(concat(year(current_date),'-01-01') as date) + interval t500.id-1 day, '%W');
- Use
- (Unfamiliar but important trick!!) Find the date difference between the current record and the next one: Use the scalar subquery
select *, datediff(next_earliest_hd,hiredate) as gap from (select e.deptno, e.ename, e.hiredate, (select min(d.hiredate) from emp d where d.deptno=e.deptno and d.hiredate > e.hiredate) as next_earliest_hd from emp e) x order by 1,3;
Date manipulation
- Determine if it’s a leap year, i.e. February ends on the 29th
The solution in the book suggests using
lastday()
function but here’s an alternative solutionselect case when cast(concat(year(current_date)+1,'-02-28') as date) + interval 1 day=cast(concat(year(current_date)+1,'-03-01') as date) then 0 else 1 end as is_leapyear;
- Determine the number of days in a year
select datediff(cast(concat(year(current_date),'-12-31') as date), cast(concat(year(current_date),'-01-01') as date))+1 as ndays;
or
select datediff(first_dayofyear+interval 1 year,first_dayofyear) as ndays from (select current_date-interval (dayofyear(current_date)-1) day as first_dayofyear) x;
- Skipped 9.3-9.9
-
Filling in missing dates: Find the number of employees hired in each month from 1980 to 1983
1) First get boundary months: January 1st of the earliest hiring date year - Dec 1st of the last hiring date year
select adddate(min(hiredate),-dayofyear(min(hiredate))+1) as min_hd, cast(concat(year(max(hiredate)),'-12-01') as date) as max_hd from emp;
2) Add rows for each month in the boundary date range
select x.min_hd+interval t.id-1 month as month from ( select adddate(min(hiredate),-dayofyear(min(hiredate))+1) as min_hd, cast(concat(year(max(hiredate)),'-12-01') as date) as max_hd from emp ) x, t500 t where adddate(min_hd,interval t.id-1 month) <= max_hd;
3) Left outer join on the month of hiredate and count the number of non-NULL hiredates per month (this is final code)
select z.month, count(e.hiredate) num_hired from ( select x.min_hd+interval t.id-1 month as month from ( select adddate(min(hiredate),-dayofyear(min(hiredate))+1) as min_hd, cast(concat(year(max(hiredate)),'-12-01') as date) as max_hd from emp ) x, t500 t where adddate(min_hd,interval t.id-1 month) <= max_hd ) z left join emp e on (z.month=adddate(last_day(e.hiredate)-interval 1 month,1)) group by z.month order by 1;
- Find all employees hired in February or December, or on a Tuesday
select sum(case when month(hiredate)=2 or month(hiredate)=12 then 1 else 0 end) as emp_feb_dec, sum(case when dayname(hiredate)='Tuesday' then 1 else 0 end) as emp_tue from emp; select ename from emp where dayname(hiredate)='Tuesday' or monthname(hiredate) in ('February','December');
- Comparing records using specific units of time: Which employees were hired on the same weekday and same month?
Do a self-join to compare dates but remove reciprocals
select concat(e.ename, ' was hired on the same month and same weekday as ', d.ename) as msg from emp e, emp d where date_format(e.hiredate,'%w%M')=date_format(d.hiredate,'%w%M') and e.empno < d.empno order by e.ename;
- Identifying overlapping date ranges
select b.empno, b.ename, concat('project ',b.proj_id,' overlaps project ',a.proj_id) as msg from emp_project a, emp_project b where a.empno=b.empno and b.proj_start >= a.proj_start and b.proj_start <= a.proj_end and a.proj_id < b.proj_id;
Advanced searching
- Skipped 11.1
- Skipping n rows in a table: e.g. show employees in odd rows
select x.rn, x.ename from (select a.ename, (select count(*) from emp b where b.empno <= a.empno) as rn from emp a) x where mod(x.rn,2)!=0;
or
select x.rn, x.ename from (select a.ename, count(*) over(order by empno range between unbounded preceding and current row) as rn from emp a) x where mod(x.rn,2)!=0;
- Selecting the top n records: 1) rank, 2) limit to n records; use scalar subquery!!
If you want to find the top 3 salary employees out of all employees:
select x.ename, x.sal from (select a.ename, a.sal, (select count(*) from emp b where b.sal >= a.sal) as rn from emp a) x where x.rn <=3 order by 2 desc,1;
If you want to find the top 3 salary employees in each department:
select x.deptno, x.ename, x.sal from (select a.deptno, a.ename, a.sal, (select count(*) from emp b where b.sal >= a.sal and b.deptno=a.deptno) as rn from emp a) x where x.rn <=3 order by 1,3 desc, 2;
In case, there are multiple employees with same salry within a department, and want to count the top 3 distinct salary employees (i.e. there can be more than 3 employees per department in the output), restrict to distcint salary in each department in the scalar subquery:
select x.deptno, x.ename, x.sal from (select a.deptno, a.ename, a.sal, (select count(*) from (select distinct sal, deptno from emp) b where b.sal >= a.sal and b.deptno=a.deptno) as rn from emp a) x where x.rn <=3 order by 1,3 desc, 2;
Window function refresher
- Number of employees in the department and job, respectively, for each employee’s department/job
select ename, deptno, job, count(*) over(partition by deptno) as dept_cnt, count(*) over(partition by job) as job_cnt from emp order by 2;
- Running total of salaries for employees in deptno=10
select deptno, ename, hiredate, sal, sum(sal) over(partition by deptno) as total1, sum(sal) over() as total2, sum(sal) over(order by hiredate) as running_total from emp where deptno=10;
You can use framing clauses like
range between unbounded preceding and current row
(default if unspecified),rows between 1 preceding and 1 following
select deptno, ename, hiredate, sal, sum(sal) over(partition by deptno) as total1, sum(sal) over() as total2, sum(sal) over(order by hiredate range between unbounded preceding and current row) as running_total1, sum(sal) over(order by hiredate rows between 1 preceding and current row) as running_total2, sum(sal) over(order by hiredate range between current row and unbounded following) as running_total3, sum(sal) over(order by hiredate rows between 1 preceding and 1 following) as running_total4 from emp where deptno=10;
Note: The framing clauses didn’t work on MySQL version 8.0.15.
max
/min
over- To answer: “What is the number of employees in each dept, how many different types of employees in each dept, how many total employees in table?”
select deptno, job, count(*) over(partition by deptno) as nemp_dept, count(*) over(partition by deptno, job) as nemp_job_dept, count(*) over() as tot_employees from emp;
select deptno, emp_cnt as dept_total, total, max(case when job='CLERK' then job_cnt else 0) as clerks, max(case when job='MANAGER' then job_cnt else 0) as managers, max(case when job='PRESIDENT' then job_cnt else 0) as prez, max(case when job='ANALYST' then job_cnt else 0) as anals, max(case when job='SALESMAN' then job_cnt else 0) as smen from ( select deptno, job, count(*) over() as total, count(*) over(partition by deptno) as emp_cnt, count(*) over(partition by deptno, job) as job_cnt from emp ) x group by deptno, emp_cnt, total;
Errata I’d like to report:
- p.193 MySQL solution for using
datediff(d1,d2)
returns d1-d2, so should put earlier date last to return postiive difference. - p.195 typo: “January 10th is a Tuesday and January 11th is a Monday” - since it’s a hypothetical situation, dates don’t have to matter but just for accuracy
- Ch 8.5 uses table “T500” but the provided code doesn’t create this table; instead use the table “T100”
- p.253 typo ‘Mysol’; should be ‘MySQL’