[CMU-15-445/645]-Assignment 1

这是我在学习 Databases Systems CMU 15-445/645/ Fall 2019 过程记录的一些笔记，本文了记录我在完成第一次作业的过程中完成结果和遇到的一些疑问。

作业 & 软件安装

本次作业主要是熟悉以下SQL的各种用法，用的软件是Sqlite3，软件安装过程以及数据集下载都可以根据作业指导，下载完数据集之后用sqlite验证一下各部分是不是都完好，这一部分没遇到什么问题。以下是我的运行环境：

    系统：Ubuntu18.04
    sqlite版本：
    $ sqlite3 -version
    3.22.0 2018-01-22 18:45:57 0c55d179733b46d8d0ba4d88e01a25e10677046ee3da1d5b1581e86726f2alt1

作业

Q1 SAMPLE

运行一下样本代码并保存。

// 输出电影里面所有类别（不重复）
sqlite> select distinct(type) from titles order by type;
movie
short
tvEpisode
tvMiniSeries
tvMovie
tvSeries
tvShort
tvSpecial
video
videoGame

Q2 UNCOMMON_TYPE

List the longest title of each type along with the runtime minutes.

找出每一类电影中时长最久的电影，如果有时长一样的话全部选择，最后输出电影类型，电影名和持续时间。输出时按照类别升序，类别相同时按照名称升序输出。这道题最初想用嵌套查找+RANK()来做，类似课上讲的输出每门课上id最高的学生的例子，但是sqlite3一直报括号附近语法错误，直接运行老师的代码都不行，最后也没弄明白，最后换了一种思路用 with来写，先找出每类最长电影的持续时长存在一个中间table里面，后面输出就比较简单了，结果和答案一样，代码如下：

-- select longest runtime by type and its type
WITH temp(maxruntime, t) AS (
        SELECT MAX(runtime_minutes), titles.type FROM titles GROUP BY type
)

SELECT titles.type, titles.primary_title, titles.runtime_minutes FROM titles, temp
WHERE titles.runtime_minutes = temp.maxruntime
AND titles.type = temp.t
ORDER BY type ASC, primary_title ASC;

-- 结果
sqlite> .read ./placeholder/q2_uncommon_type.sql
movie|Logistics|51420
short|Kuriocity|461
tvEpisode|Téléthon 2012|1800
tvMiniSeries|Kôya no yôjinbô|1755
tvMovie|ArtQuench Presents Spirit Art|2112
tvSeries|The Sharing Circle|8400
tvShort|Paul McCartney Backstage at Super Bowl XXXIX|60
tvShort|The People Next Door|60
tvSpecial|Katy Perry Live: Witness World Wide|5760
video|Midnight Movie Madness: 50 Movie Mega Pack|5135
videoGame|Flushy Fish VR: Just Squidding Around|1500

Q3 TV_VS_MOVIE

List all types of titles along with the number of associated titles.

输出每个类别和该类别的电影数量，按数量升序输出。这道题相对上一道题比较简单，代码如下：

WITH type_number(type, number) AS (
        SELECT type, COUNT(*) FROM titles GROUP BY type
)
SELECT * FROM type_number ORDER BY number ASC;

-- 结果
tvShort|4075
videoGame|9044
tvSpecial|9107
tvMiniSeries|10291
tvMovie|45431
tvSeries|63631
video|90069
movie|197957
short|262038
tvEpisode|1603076

Q4 OLD_IS_NOT_GOLD

Which decades saw the most number of titles getting premiered? List the number of titles in every decade. Like 2010s|2789741.

按电影年代输出每个年代的电影数量，按电影数量降序输出。

这道题一开始想到用一个WITH创建了临时表格存储年代以及对应的电影，最后可以通过按年代COUNT输出年代以及数量（上映年份为NULL的去除），但是这样不能用数量作为排序标准，后来参考这个博主的做法，创建了两个临时表格最后输出，代码如下：

WITH period(year, movie) AS (
        SELECT SUBSTR(CAST(premiered AS TEXT), 1, 3) ||  "0s",
                primary_title FROM titles
        WHERE premiered IS NOT NULL
), year_num(year, num) AS (
        SELECT year, COUNT(*) FROM period
        GROUP BY year
)

SELECT year, num FROM year_num ORDER BY num DESC;

-- 输出
sqlite> .read ./placeholder/q4_old_is_not_gold.sql
2010s|1050732
2000s|494639
1990s|211453
1980s|119258
1970s|99707
1960s|75237
1950s|39554
1910s|26596
1920s|13153
1930s|11492
1940s|10011
1900s|9586
2020s|2492
1890s|2286
1880s|22
1870s|1

Q5 PERCENTAGE

List the decades and the percentage of titles which premiered in the corresponding decade. Display like : 2010s|45.7042

在上一题的基础上把数量换成总电影的百分比（包括NULL的部分）。这一题一开始卡在怎么计算电影总量不被 group 影响，后来发现可以用 select 生成一个临时变量，代码如下：

WITH period(year, movie) AS (
        SELECT SUBSTR(CAST(premiered AS TEXT), 1, 3) ||  "0s",
                primary_title FROM titles
        WHERE premiered IS NOT NULL)
, year_num(year, num ) AS (
        SELECT year, 
        ROUND(100.0 * COUNT(*)/ (SELECT COUNT(*) FROM titles), 4) FROM period
        GROUP BY year
)

SELECT year, num FROM year_num ORDER BY num DESC;

-- 输出
sqlite> .read ./placeholder/q5_percentage.sql
2010s|45.7891
2000s|21.5555
1990s|9.2148
1980s|5.1971
1970s|4.3451
1960s|3.2787
1950s|1.7237
1910s|1.159
1920s|0.5732
1930s|0.5008
1940s|0.4363
1900s|0.4177
2020s|0.1086
1890s|0.0996
1880s|0.001
1870s|0.0

Q6 DUBBED_SMASH

List the top 10 dubbed titles with the number of dubs.

按配音版本数量降序输出前十名配音版本最多的电影名和对应的数量，这道题先找到所有的配音版本电影数量，再用 LIMIT 来限制输出数量，代码如下。

WITH dubbed(t_id, num) AS (
        SELECT title_id, COUNT(*) FROM akas
        GROUP BY title_id
)

SELECT primary_title, num  FROM dubbed, titles
WHERE titles.title_id = dubbed.t_id
ORDER BY num DESC
LIMIT 10;

-- 输出
sqlite> .read ./placeholder/q6_dubbed_smash.sql
Mutant Virus: Vol. 1|126
The Good, the Bad and the Ugly|73
Star Wars: Episode V - The Empire Strikes Back|71
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb|68
Raiders of the Lost Ark|62
Star Wars: Episode VII - The Force Awakens|62
The Shawshank Redemption|61
Once Upon a Time in the West|60
Indiana Jones and the Kingdom of the Crystal Skull|60
Airplane!|59

Q7 IMDB_250

List the IMDB Top 250 movies along with its weighted rating.

按权重分数降序输出IMDB TOP250电影，计算规则如下

Weighted rating (WR) = (v/(v+m)) * R + (m/(v+m)) * C
R = 电影平均得分 i.e. ratings.rating
v = 电影投票数, i.e. ratings.votes
m = 要进top250的最少投票数 (current 25000)
C = 所有投票的平均得分 （计算方式是所有投票乘相应得分再除以总票数）

这道题主要就是注意以下各个变量的计算方法还有imbd要限定类型是电影，代码如下，结果太长就不打了，确认过和答案一样。

WITH wr( t_id, WR ) AS (
        SELECT ratings.title_id,
        (votes / (votes + 25000.0)) * rating +
        (25000.0 / (votes + 25000.0)) * 
        (SELECT SUM(rating * votes) / SUM(votes) FROM ratings, titles 
         WHERE ratings.title_id = titles.title_id AND titles.type = "movie")
        FROM ratings,titles
        WHERE ratings.title_id = titles.title_id AND titles.type = "movie"
)

SELECT primary_title, WR FROM wr, titles
WHERE titles.title_id = t_id
ORDER BY WR DESC
LIMIT 250;

Q8 NUMBER_OF_ACTORS

List the number of actors / actresses who have appeared in any title with Mark Hamill (born in 1951).

输出所有和mark hamill有合作过的演员。这道题的思路是可以先找到所有mark hamill参演过的节目，然后再找到参演过那些节目的演员，主要去去重，代码如下：

WITH title_with_hamill(title_id) AS (
        SELECT title_id FROM crew, people
        WHERE crew.person_id = people.person_id  
        AND people.name = "Mark Hamill" 
        AND people.born = 1951
)

SELECT COUNT(DISTINCT person_id) FROM crew, title_with_hamill
WHERE crew.title_id = title_with_hamill.title_id
AND (crew.category = "actor" OR crew.category = "actress");

-- 输出
sqlite> .read ./placeholder/q8_number_of_actors.sql
206

Q9 MOVIE_NAMES

List the movies in alphabetical order which cast both Mark Hamill (born in 1951) and George Lucas (born in 1944).

输出所有Mark Hamill 和 George Lucas 共同参演的电影名，按字母顺序输出，思路跟上一题差不多，找到mark hamill 参演的电影在在里面找 Georage Lucas 参演的过输出名字就可以了。

WITH title_with_hamill(title_id) AS (
        SELECT title_id FROM crew, people
        WHERE crew.person_id = people.person_id  
        AND people.name = "Mark Hamill"
        AND people.born = 1951
)

SELECT DISTINCT primary_title FROM crew, title_with_hamill, people, titles
WHERE crew.title_id = title_with_hamill.title_id
AND titles.type = "movie"
AND crew.title_id = titles.title_id
AND (crew.person_id = people.person_id 
        AND people.name = "George Lucas" 
        AND people.born = 1944)
ORDER BY primary_title;

-- 输出
sqlite> .read ./placeholder/q9_movie_names.sql
Star Wars: Episode V - The Empire Strikes Back

Q10 GENRE_COUNTS

List all distinct genres and the number of titles associated with them.

从title里面的genres类别提取出所有不重复的类别，并类别输出相对应的节目数量，按降序排列，这里主要参考了这种做法，将genres里的值取出来然后统一成'', '...,...,...,'的格式，然后通过迭代和用INSTR找到逗号的位置，以及SUBSTR分割字符串来提取出所有类，最后按类别输出数量，代码如下：

WITH split(word, str) AS (
    SELECT '', genres || ',' FROM titles
    UNION ALL SELECT
    SUBSTR(str, 0, INSTR(str, ',')),
    SUBSTR(str, INSTR(str, ',')+1)
    FROM split WHERE str!=''
), 
genre_num(genre, num) AS (
SELECT word, count(word) 
FROM split
GROUP BY word
HAVING word != '' AND word != '\N'
)

SELECT  * FROM genre_num
ORDER BY num DESC;

-- 结果
sqlite> .read ./placeholder/q10_genre_counts.sql
Drama|620063
Comedy|486163
Short|310619
Documentary|222187
Talk-Show|215144
Romance|211462
Family|159035
News|148941
Animation|115998
Reality-TV|113180
Music|105724
Crime|99019
Action|97544
Adventure|81686
Game-Show|75169
Adult|65704
Sport|48855
Fantasy|48341
Mystery|47155
Horror|41552
Thriller|40664
History|31675
Sci-Fi|31441
Biography|27001
Musical|17939
Western|9811
War|9309
Film-Noir|322