最近发现一个不错的数据实战小demo,总共有10个业务场景大约100个例题组成,笔者主要通过SQL与python的pandas包进行问题的解决,不足之处望读者多多指正。
SELECT * FROM `chipotle` -- 业务练习100例子.SQL -- 了解chipotle数据 -- 字段信息 order_id quantity (数量) item_name (类目) choice_description item_price -- 1、导入数据集 -- 见图 -- 2、打印出全部的列名称 desc `chipotle` -- 3、打印出前10条信息 SELECT * from chipotle limit 10 -- 4、被下单数最多商品(item)是什么? SELECT item_name ,max(num) AS max_num from ( SELECT item_name , sum(quantity) AS num from chipotle group by item_name ) a -- 5、在item_name这一列中,一共有多少种商品被下单? SELECT count(distinct item_name) from chipotle -- 6、在choice_description中,下单次数最多的商品是什么? SELECT choice_description ,max(num) AS max_num from ( SELECT choice_description , sum(quantity) AS num from chipotle where choice_description !='NULL' group by choice_description ) a -- 7、每一单(order)对应的平均总价是多少? SELECT round(avg(jg),2) from ( SELECT order_id ,sum(cast(quantity as double))*cast(SUBSTR(item_price,2) as DOUBLE) as jg from chipotle group by order_id ) a -- 8、一共有多少商品被下单? SELECT count(1) as cnt from chipotle -- 9、将item_price转换为浮点数 SELECT ROUND(cast(SUBSTR(item_price,2) as float),2) as item_price from chipotle -- 10、在该数据集对应的时期内,收入(revenue)是多少 SELECT ROUND(sum(quantity*item_price),2) as revenue FROM( SELECT order_id ,sum(cast(quantity as double)) as quantity ,cast(SUBSTR(item_price,2) as DOUBLE) as item_price from chipotle group by order_id ,cast(SUBSTR(item_price,2) as DOUBLE) ) a -- 11、在该数据集对应的时期内,一共有多少订单? SELECT count(distinct order_id) as order_id from chipotle --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- --
# 了解chipotle数据 # 字段信息 order_id quantity (数量) item_name (类目) choice_description item_price # 1、导入数据 import pandas as pd chipo= pd.read_csv(r"E:\code\jupyter_notebook\pandas_exercise\exercise_data\chipotle.csv", engine = 'python')
# 2、打印出全部的列名称 chipo.columns
Index(['order_id', 'quantity', 'item_name', 'choice_description', 'item_price'], dtype='object')
# 3、打印出前10条信息 chipo.head(10)
order_id | quantity | item_name | choice_description | item_price | |
---|---|---|---|---|---|
0 | 1 | 1 | Chips and Fresh Tomato Salsa | NaN | $2.39 |
1 | 1 | 1 | Izze | [Clementine] | $3.39 |
2 | 1 | 1 | Nantucket Nectar | [Apple] | $3.39 |
3 | 1 | 1 | Chips and Tomatillo-Green Chili Salsa | NaN | $2.39 |
4 | 2 | 2 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | $16.98 |
5 | 3 | 1 | Chicken Bowl | [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou... | $10.98 |
6 | 3 | 1 | Side of Chips | NaN | $1.69 |
7 | 4 | 1 | Steak Burrito | [Tomatillo Red Chili Salsa, [Fajita Vegetables... | $11.75 |
8 | 4 | 1 | Steak Soft Tacos | [Tomatillo Green Chili Salsa, [Pinto Beans, Ch... | $9.25 |
9 | 5 | 1 | Steak Burrito | [Fresh Tomato Salsa, [Rice, Black Beans, Pinto... | $9.25 |
# 4、被下单数最多商品(item)是什么? q4 = chipo[['item_name','quantity']].groupby(['item_name'],as_index=False).agg({'quantity':sum}) q4.sort_values(['quantity'],ascending=False,inplace=True) q4.head(1)
item_name | quantity | |
---|---|---|
17 | Chicken Bowl | 761 |
# 5、在item_name这一列中,一共有多少种商品被下单? chipo['item_name'].nunique()
50
# 6、在choice_description中,下单次数最多的商品是什么? chipo['choice_description'].value_counts().head(1)
[Diet Coke] 134 Name: choice_description, dtype: int64
# 7、每一单(order)对应的平均总价是多少? chipo[['order_id','sub_total']].groupby(by=['order_id'] ).agg({'sub_total':'sum'})['sub_total'].mean()
21.394231188658654
# 8、一共有多少商品被下单? total_items_orders = chipo['quantity'].sum() total_items_orders
4972
# 9、将item_price转换为浮点数 dollarizer = lambda x: float(x[1:-1]) chipo['item_price'] = chipo['item_price'].apply(dollarizer)
# 10、在该数据集对应的时期内,收入(revenue)是多少 chipo['sub_total'] = round(chipo['item_price'] * chipo['quantity'],2) chipo['sub_total'].sum()
39237.02
# 11、在该数据集对应的时期内,一共有多少订单? chipo['order_id'].nunique()
1834
日常工作hive用习惯了,搞这个pandas来写有点不习惯,反而没觉得方便多少,但确实是比较省代码的,暂时就玩玩当做对比SQL学习预习复习pandas的一种方式,除了数据比赛和割韭菜的培训班暂时还无法脑补啥样子的业务场景会用pandas,不足之处,望多多指正。