
Python Data Analysis: pandas Beginner Exercises (8)

This article presents part eight of a series of beginner pandas exercises for data analysis in Python. It should be a useful reference for readers working on similar problems. Interested programmers, follow along and learn together!

Python Data Analysis Basics

  • Preparation
  • Exercise 1- US - Baby Names
      • Introduction:
      • Step 1. Import the necessary libraries
      • Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv).
      • Step 3. Assign it to a variable called baby_names.
      • Step 4. See the first 10 entries
      • Step 5. Delete the column 'Unnamed: 0' and 'Id'
      • Step 6. Are there more male or female names in the dataset?
      • Step 7. Group the dataset by name and assign to names
      • Step 8. How many different names exist in the dataset?
      • Step 9. What is the name with most occurrences?
      • Step 10. How many different names have the least occurrences?
      • Step 11. What is the median name occurrence?
      • Step 12. What is the standard deviation of names?
      • Step 13. Get a summary with the mean, min, max, std and quartiles.
  • Exercise 2- Wind Statistics
      • Introduction:
      • Step 1. Import the necessary libraries
      • Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/Wind_Stats/wind.data)
      • Step 3. Assign it to a variable called data and replace the first 3 columns by a proper datetime index.
      • Step 4. Year 2061? Do we really have data from this year? Create a function to fix it and apply it.
      • Step 5. Set the right dates as the index. Pay attention to the data type, it should be datetime64[ns].
      • Step 6. Compute how many values are missing for each location over the entire record.
        • They should be ignored in all calculations below.
      • Step 7. Compute how many non-missing values there are in total.
      • Step 8. Calculate the mean windspeed over all the locations and all the times.
        • A single number for the entire dataset.
      • Step 9. Create a DataFrame called loc_stats and calculate the min, max and mean windspeeds and standard deviations of the windspeeds at each location over all the days
        • A different set of numbers for each location.
      • Step 10. Create a DataFrame called day_stats and calculate the min, max and mean windspeed and standard deviations of the windspeeds across all the locations at each day.
        • A different set of numbers for each day.
      • Step 11. Find the average windspeed in January for each location.
        • Treat January 1961 and January 1962 both as January.
      • Step 12. Downsample the record to a yearly frequency for each location.
      • Step 13. Downsample the record to a monthly frequency for each location.
      • Step 14. Downsample the record to a weekly frequency for each location.
      • Step 15. Calculate the min, max and mean windspeeds and standard deviations of the windspeeds across all locations for each week (assume that the first week starts on January 2 1961) for the first 52 weeks.
  • Conclusion

Preparation

If you need the datasets, you can find them yourself online (they are all public) or message the author. They are not uploaded to CSDN, since downloading from there would require a paid membership. The dataset links below are not guaranteed to work.

Exercise 1- US - Baby Names

Introduction:

We are going to use a subset of US Baby Names from Kaggle.
The file contains names from 2004 until 2014.

Step 1. Import the necessary libraries

The code is as follows:

import pandas as pd

Step 2. Import the dataset from this address.

Step 3. Assign it to a variable called baby_names.

The code is as follows:

baby_names = pd.read_csv("US_Baby_Names_right.csv")
baby_names.info()

The output is as follows:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1016395 entries, 0 to 1016394
Data columns (total 7 columns):
Unnamed: 0    1016395 non-null int64
Id            1016395 non-null int64
Name          1016395 non-null object
Year          1016395 non-null int64
Gender        1016395 non-null object
State         1016395 non-null object
Count         1016395 non-null int64
dtypes: int64(4), object(3)
memory usage: 54.3+ MB

Step 4. See the first 10 entries

The code is as follows:

baby_names.head(10)

The output is as follows:

   Unnamed: 0     Id      Name  Year Gender State  Count
0       11349  11350      Emma  2004      F    AK     62
1       11350  11351   Madison  2004      F    AK     48
2       11351  11352    Hannah  2004      F    AK     46
3       11352  11353     Grace  2004      F    AK     44
4       11353  11354     Emily  2004      F    AK     41
5       11354  11355   Abigail  2004      F    AK     37
6       11355  11356    Olivia  2004      F    AK     33
7       11356  11357  Isabella  2004      F    AK     30
8       11357  11358    Alyssa  2004      F    AK     29
9       11358  11359    Sophia  2004      F    AK     28

Step 5. Delete the column ‘Unnamed: 0’ and ‘Id’

The code is as follows:

del baby_names['Id']
# 'Unnamed: 0' could be deleted the same way; here a column filter drops it instead
baby_names = baby_names.loc[:, ~baby_names.columns.str.contains('^Unnamed')]
baby_names.head()

The output is as follows:

      Name  Year Gender State  Count
0     Emma  2004      F    AK     62
1  Madison  2004      F    AK     48
2   Hannah  2004      F    AK     46
3    Grace  2004      F    AK     44
4    Emily  2004      F    AK     41
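As a side note, both columns can also be removed in a single call with `drop(columns=...)`, which returns a new frame rather than mutating in place. A minimal sketch on a made-up two-row miniature of the frame:

```python
import pandas as pd

# Toy miniature of the baby-names frame (values made up to mirror the real head)
df = pd.DataFrame({
    "Unnamed: 0": [11349, 11350],
    "Id": [11350, 11351],
    "Name": ["Emma", "Madison"],
    "Count": [62, 48],
})

# drop(columns=...) removes both helper columns in one call and returns a new frame
cleaned = df.drop(columns=["Unnamed: 0", "Id"])
print(list(cleaned.columns))  # ['Name', 'Count']
```

Either style works; `drop` is handy when the column list is built dynamically.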

Step 6. Are there more male or female names in the dataset?

The code is as follows:

# baby_names['Gender'].value_counts()  # would count rows, not babies
baby_names.groupby('Gender').Count.sum()

The output is as follows:

Gender
F    16380293
M    19041199
Name: Count, dtype: int64
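It is worth being explicit about why the commented-out `value_counts()` line was rejected: it counts rows per gender, while summing `Count` totals the babies those rows represent, and the two can disagree. A toy frame (made-up numbers) where they differ:

```python
import pandas as pd

# Made-up rows: two F entries, but the single M entry covers more babies
toy = pd.DataFrame({"Gender": ["F", "F", "M"], "Count": [10, 20, 40]})

rows_per_gender = toy["Gender"].value_counts()            # rows per gender
babies_per_gender = toy.groupby("Gender")["Count"].sum()  # babies per gender

print(int(rows_per_gender["F"]), int(babies_per_gender["M"]))  # 2 40
```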

Step 7. Group the dataset by name and assign to names

The code is as follows:

del baby_names["Year"]  # drop Year so that sum() aggregates only Count
names = baby_names.groupby("Name").sum()
print(names.shape)
names.sort_values("Count", ascending=False).head()
# OR keep the grouper itself:
# names = baby_names.groupby('Name')
# names.head(1)

The output is as follows:

(17632, 1)

           Count
Name
Jacob     242874
Emma      214852
Michael   214405
Ethan     209277
Isabella  204798

Step 8. How many different names exist in the dataset?

The code is as follows:

len(names)

The output is as follows:

17632
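The same answer can be read straight off the original frame with `nunique()`, without building the grouped frame first. A sketch on a made-up name column:

```python
import pandas as pd

# Made-up column with one duplicated name
toy = pd.DataFrame({"Name": ["Emma", "Emma", "Jacob", "Olivia"]})
n_names = toy["Name"].nunique()  # number of distinct names
print(n_names)  # 3
```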

Step 9. What is the name with most occurrences?

The code is as follows:

names.Count.idxmax()  # idxmax() returns the index label of the Series' maximum value

The output is as follows:

'Jacob'

Step 10. How many different names have the least occurrences?

The code is as follows:

len(names[names.Count == names.Count.min()])

The output is as follows:

2578

Step 11. What is the median name occurrence?

The code is as follows:

names[names.Count == names.Count.median()]

The output is as follows:

            Count
Name
Aishani        49
Alara          49
Alysse         49
Ameir          49
Anely          49
Antonina       49
Aveline        49
Aziah          49
Baily          49
Caleah         49
Carlota        49
Cristine       49
Dahlila        49
Darvin         49
Deante         49
Deserae        49
Devean         49
Elizah         49
Emmaly         49
Emmanuela      49
Envy           49
Esli           49
Fay            49
Gurshaan       49
Hareem         49
Iven           49
Jaice          49
Jaiyana        49
Jamiracle      49
Jelissa        49
...           ...
Kyndle         49
Kynsley        49
Leylanie       49
Maisha         49
Malillany      49
Mariann        49
Marquell       49
Maurilio       49
Mckynzie       49
Mehdi          49
Nabeel         49
Nalleli        49
Nassir         49
Nazier         49
Nishant        49
Rebecka        49
Reghan         49
Ridwan         49
Riot           49
Rubin          49
Ryatt          49
Sameera        49
Sanjuanita     49
Shalyn         49
Skylie         49
Sriram         49
Trinton        49
Vita           49
Yoni           49
Zuleima        49

66 rows × 1 columns

Step 12. What is the standard deviation of names?

The code is as follows:

names.Count.std()

The output is as follows:

11006.069467891111

Step 13. Get a summary with the mean, min, max, std and quartiles.

The code is as follows:

names.describe()

The output is as follows:

               Count
count   17632.000000
mean     2008.932169
std     11006.069468
min         5.000000
25%        11.000000
50%        49.000000
75%       337.000000
max    242874.000000

Exercise 2- Wind Statistics

Introduction:

The data have been modified to contain some missing values, identified by NaN.
Using pandas should make this exercise
easier, in particular for the bonus question.

You should be able to perform all of these operations without using
a for loop or other looping construct.

  1. The data in ‘wind.data’ has the following format:
"""
Yr Mo Dy   RPT   VAL   ROS   KIL   SHA   BIR   DUB   CLA   MUL   CLO   BEL   MAL
61  1  1 15.04 14.96 13.17  9.29   NaN  9.87 13.67 10.25 10.83 12.58 18.50 15.04
61  1  2 14.71   NaN 10.83  6.50 12.62  7.67 11.50 10.04  9.79  9.67 17.54 13.83
61  1  3 18.50 16.88 12.33 10.13 11.17  6.17 11.25   NaN  8.50  7.67 12.75 12.71
"""

The first three columns are year, month and day. The
remaining 12 columns are average windspeeds in knots at 12
locations in Ireland on that day.

More information about the dataset can be found here.

Step 1. Import the necessary libraries

The code is as follows:

import pandas as pd
import datetime

Step 2. Import the dataset from this address

Step 3. Assign it to a variable called data and replace the first 3 columns by a proper datetime index.

The code is as follows:

data = pd.read_table('wind.data', sep=r'\s+', parse_dates=[[0, 1, 2]])  # merge the first three columns into one date column
data.head()

The output is as follows:

    Yr_Mo_Dy    RPT    VAL    ROS    KIL    SHA   BIR    DUB    CLA    MUL    CLO    BEL    MAL
0 2061-01-01  15.04  14.96  13.17   9.29    NaN  9.87  13.67  10.25  10.83  12.58  18.50  15.04
1 2061-01-02  14.71    NaN  10.83   6.50  12.62  7.67  11.50  10.04   9.79   9.67  17.54  13.83
2 2061-01-03  18.50  16.88  12.33  10.13  11.17  6.17  11.25    NaN   8.50   7.67  12.75  12.71
3 2061-01-04  10.58   6.63  11.75   4.58   4.54  2.88   8.63   1.79   5.83   5.88   5.46  10.88
4 2061-01-05  13.33  13.25  11.42   6.17  10.71  8.21  11.92   6.54  10.92  10.34  12.92  11.83

Step 4. Year 2061? Do we really have data from this year? Create a function to fix it and apply it.

The code is as follows:

def fix_century(x):
    year = x.year - 100 if x.year > 1989 else x.year
    return datetime.date(year, x.month, x.day)
data['Yr_Mo_Dy'] = data['Yr_Mo_Dy'].apply(fix_century)
data.head()

The output is as follows:

    Yr_Mo_Dy    RPT    VAL    ROS    KIL    SHA   BIR    DUB    CLA    MUL    CLO    BEL    MAL
0 1961-01-01  15.04  14.96  13.17   9.29    NaN  9.87  13.67  10.25  10.83  12.58  18.50  15.04
1 1961-01-02  14.71    NaN  10.83   6.50  12.62  7.67  11.50  10.04   9.79   9.67  17.54  13.83
2 1961-01-03  18.50  16.88  12.33  10.13  11.17  6.17  11.25    NaN   8.50   7.67  12.75  12.71
3 1961-01-04  10.58   6.63  11.75   4.58   4.54  2.88   8.63   1.79   5.83   5.88   5.46  10.88
4 1961-01-05  13.33  13.25  11.42   6.17  10.71  8.21  11.92   6.54  10.92  10.34  12.92  11.83
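The `apply`-based fix above works row by row. Under the same assumption (any parsed year after 1989 is really a 1900s year), a vectorized sketch using `Series.where` and `pd.DateOffset` on made-up dates keeps everything in datetime64 and avoids the Python-level loop:

```python
import pandas as pd

# Made-up dates showing the same two-digit-year ambiguity
dates = pd.to_datetime(pd.Series(["2061-01-01", "1978-12-31"]))

# Keep dates at or before 1989; shift the rest back one century
fixed = dates.where(dates.dt.year <= 1989, dates - pd.DateOffset(years=100))
print(list(fixed.dt.year))  # [1961, 1978]
```

This also makes Step 5's conversion back to datetime64 unnecessary, since the dtype is never lost.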

Step 5. Set the right dates as the index. Pay attention to the data type, it should be datetime64[ns].

The code is as follows:

data["Yr_Mo_Dy"] = pd.to_datetime(data["Yr_Mo_Dy"])  # convert back to datetime64[ns]
data = data.set_index('Yr_Mo_Dy')
data.head()

The output is as follows:

              RPT    VAL    ROS    KIL    SHA   BIR    DUB    CLA    MUL    CLO    BEL    MAL
Yr_Mo_Dy
1961-01-01  15.04  14.96  13.17   9.29    NaN  9.87  13.67  10.25  10.83  12.58  18.50  15.04
1961-01-02  14.71    NaN  10.83   6.50  12.62  7.67  11.50  10.04   9.79   9.67  17.54  13.83
1961-01-03  18.50  16.88  12.33  10.13  11.17  6.17  11.25    NaN   8.50   7.67  12.75  12.71
1961-01-04  10.58   6.63  11.75   4.58   4.54  2.88   8.63   1.79   5.83   5.88   5.46  10.88
1961-01-05  13.33  13.25  11.42   6.17  10.71  8.21  11.92   6.54  10.92  10.34  12.92  11.83

Step 6. Compute how many values are missing for each location over the entire record.

They should be ignored in all calculations below.

The code is as follows:

data.isnull().sum()

The output is as follows:

RPT    6
VAL    3
ROS    2
KIL    5
SHA    2
BIR    0
DUB    3
CLA    2
MUL    3
CLO    1
BEL    0
MAL    4
dtype: int64

Step 7. Compute how many non-missing values there are in total.

The code is as follows:

data.shape[0] - data.isnull().sum()
# OR: data.notnull().sum()

The output is as follows:

RPT    6568
VAL    6571
ROS    6572
KIL    6569
SHA    6572
BIR    6574
DUB    6571
CLA    6572
MUL    6571
CLO    6573
BEL    6574
MAL    6570
dtype: int64
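Strictly speaking, Step 7 asks for a single total rather than the per-column counts shown above; that total is just one more `sum()`. A sketch on a made-up frame with one NaN per column:

```python
import numpy as np
import pandas as pd

# Toy frame (made up) with one missing value per column
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 2.0, 4.0]})

per_column = toy.notnull().sum()   # non-missing values per column
total = int(per_column.sum())      # single grand total, as Step 7 asks

print(list(per_column), total)  # [2, 2] 4
```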

Step 8. Calculate the mean windspeed over all the locations and all the times.

A single number for the entire dataset.

The code is as follows:

# Note: fillna(0) biases the result slightly, since every missing value contributes a 0;
# flatten() collapses the DataFrame into a 1-D array (row-major)
data.fillna(0).values.flatten().mean()

The output is as follows:

10.223864592840483
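To truly ignore the NaNs (as Step 6 suggested) rather than zero-fill them, mask them out before averaging. A toy frame (made-up values) where the two approaches visibly differ:

```python
import numpy as np
import pandas as pd

# Toy frame (made up) where one value is missing
toy = pd.DataFrame({"a": [10.0, np.nan], "b": [20.0, 30.0]})

biased = toy.fillna(0).values.flatten().mean()  # (10 + 0 + 20 + 30) / 4

flat = toy.values.flatten()
unbiased = flat[~np.isnan(flat)].mean()         # (10 + 20 + 30) / 3

print(biased, unbiased)  # 15.0 20.0
```

On the wind data the gap is tiny because only 31 of nearly 79,000 values are missing, but the masked version is the one that matches the exercise's instruction.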

Step 9. Create a DataFrame called loc_stats and calculate the min, max and mean windspeeds and standard deviations of the windspeeds at each location over all the days

A different set of numbers for each location.

The code is as follows:

loc_stats = data.describe(percentiles=[])
loc_stats
# OR: data.loc[:, 'RPT':'MAL'].describe(percentiles=[])

The output is as follows:

               RPT          VAL          ROS          KIL          SHA          BIR          DUB          CLA          MUL          CLO          BEL          MAL
count  6568.000000  6571.000000  6572.000000  6569.000000  6572.000000  6574.000000  6571.000000  6572.000000  6571.000000  6573.000000  6574.000000  6570.000000
mean     12.362987    10.644314    11.660526     6.306468    10.455834     7.092254     9.797343     8.495053     8.493590     8.707332    13.121007    15.599079
std       5.618413     5.267356     5.008450     3.605811     4.936125     3.968683     4.977555     4.499449     4.166872     4.503954     5.835037     6.699794
min       0.670000     0.210000     1.500000     0.000000     0.130000     0.000000     0.000000     0.000000     0.000000     0.040000     0.130000     0.670000
50%      11.710000    10.170000    10.920000     5.750000     9.960000     6.830000     9.210000     8.080000     8.170000     8.290000    12.500000    15.000000
max      35.800000    33.370000    33.840000    28.460000    37.540000    26.160000    30.370000    31.080000    25.880000    28.210000    42.380000    42.540000

Step 10. Create a DataFrame called day_stats and calculate the min, max and mean windspeed and standard deviations of the windspeeds across all the locations at each day.

A different set of numbers for each day.

The code is as follows:

day_stats = pd.DataFrame()

day_stats['min'] = data.min(axis = 1)
day_stats['max'] = data.max(axis = 1)
day_stats['mean'] = data.mean(axis = 1)
day_stats['std'] = data.std(axis = 1)

day_stats.head()

The output is as follows:

             min    max       mean       std
Yr_Mo_Dy
1961-01-01  9.29  18.50  13.018182  2.808875
1961-01-02  6.50  17.54  11.336364  3.188994
1961-01-03  6.17  18.50  11.641818  3.681912
1961-01-04  1.79  11.75   6.619167  3.198126
1961-01-05  6.17  13.33  10.630000  2.445356

Step 11. Find the average windspeed in January for each location.

Treat January 1961 and January 1962 both as January.

The code is as follows:

data.loc[data.index.month == 1].mean()

The output is as follows:

RPT    14.847325
VAL    12.914560
ROS    13.299624
KIL     7.199498
SHA    11.667734
BIR     8.054839
DUB    11.819355
CLA     9.512047
MUL     9.543208
CLO    10.053566
BEL    14.550520
MAL    18.028763
dtype: float64
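The `index.month == 1` comparison builds one boolean mask across every year at once, which is exactly why January 1961 and January 1962 are pooled. A made-up three-day record showing the pooling:

```python
import pandas as pd

# Toy record (made up): two Januaries from different years plus a June day
idx = pd.to_datetime(["1961-01-10", "1961-06-10", "1962-01-10"])
toy = pd.DataFrame({"speed": [10.0, 99.0, 20.0]}, index=idx)

# index.month compares across every year at once, pooling both Januaries
january_mean = toy.loc[toy.index.month == 1, "speed"].mean()
print(january_mean)  # 15.0
```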

Step 12. Downsample the record to a yearly frequency for each location.

The code is as follows:

# pd.Period represents a span of time: a timestamp anchoring it on the time axis,
# plus a freq giving the period's length; to_period('A') converts the
# DatetimeIndex into a yearly PeriodIndex
data.groupby(data.index.to_period('A')).mean()

The output is as follows:

                RPT        VAL        ROS       KIL        SHA       BIR        DUB        CLA       MUL        CLO        BEL        MAL
Yr_Mo_Dy
1961      12.299583  10.351796  11.362369  6.958227  10.881763  7.729726   9.733923   8.858788  8.647652   9.835577  13.502795  13.680773
1962      12.246923  10.110438  11.732712  6.960440  10.657918  7.393068  11.020712   8.793753  8.316822   9.676247  12.930685  14.323956
1963      12.813452  10.836986  12.541151  7.330055  11.724110  8.434712  11.075699  10.336548  8.903589  10.224438  13.638877  14.999014
1964      12.363661  10.920164  12.104372  6.787787  11.454481  7.570874  10.259153   9.467350  7.789016  10.207951  13.740546  14.910301
1965      12.451370  11.075534  11.848767  6.858466  11.024795  7.478110  10.618712   8.879918  7.907425   9.918082  12.964247  15.591644
1966      13.461973  11.557205  12.020630  7.345726  11.805041  7.793671  10.579808   8.835096  8.514438   9.768959  14.265836  16.307260
1967      12.737151  10.990986  11.739397  7.143425  11.630740  7.368164  10.652027   9.325616  8.645014   9.547425  14.774548  17.135945
1968      11.835628  10.468197  11.409754  6.477678  10.760765  6.067322   8.859180   8.255519  7.224945   7.832978  12.808634  15.017486
1969      11.166356   9.723699  10.902000  5.767973   9.873918  6.189973   8.564493   7.711397  7.924521   7.754384  12.621233  15.762904
1970      12.600329  10.726932  11.730247  6.217178  10.567370  7.609452   9.609890   8.334630  9.297616   8.289808  13.183644  16.456027
1971      11.273123   9.095178  11.088329  5.241507   9.440329  6.097151   8.385890   6.757315  7.915370   7.229753  12.208932  15.025233
1972      12.463962  10.561311  12.058333  5.929699   9.430410  6.358825   9.704508   7.680792  8.357295   7.515273  12.727377  15.028716
1973      11.828466  10.680493  10.680493  5.547863   9.640877  6.548740   8.482110   7.614274  8.245534   7.812411  12.169699  15.441096
1974      13.643096  11.811781  12.336356  6.427041  11.110986  6.809781  10.084603   9.896986  9.331753   8.736356  13.252959  16.947671
1975      12.008575  10.293836  11.564712  5.269096   9.190082  5.668521   8.562603   7.843836  8.797945   7.382822  12.631671  15.307863
1976      11.737842  10.203115  10.761230  5.109426   8.846339  6.311038   9.149126   7.146202  8.883716   7.883087  12.332377  15.471448
1977      13.099616  11.144493  12.627836  6.073945  10.003836  8.586438  11.523205   8.378384  9.098192   8.821616  13.459068  16.590849
1978      12.504356  11.044274  11.380000  6.082356  10.167233  7.650658   9.489342   8.800466  9.089753   8.301699  12.967397  16.771370
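An equivalent yearly downsample, without constructing a PeriodIndex at all, is to group by the index's `year` attribute. A sketch on a made-up three-day record:

```python
import pandas as pd

# Toy record (made up): two days in 1961, one in 1962
idx = pd.to_datetime(["1961-01-01", "1961-07-01", "1962-01-01"])
toy = pd.DataFrame({"speed": [10.0, 20.0, 40.0]}, index=idx)

# Grouping by the year attribute avoids building a PeriodIndex
yearly = toy.groupby(toy.index.year).mean()
print(yearly["speed"].tolist())  # [15.0, 40.0]
```

The PeriodIndex version is preferable when you want the result's index to stay time-aware; the `year` version yields a plain integer index.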

Step 13. Downsample the record to a monthly frequency for each location.

The code is as follows:

data.groupby(data.index.to_period('M')).mean().head()

The output is as follows:

               RPT        VAL        ROS       KIL        SHA        BIR        DUB        CLA        MUL        CLO        BEL        MAL
Yr_Mo_Dy
1961-01  14.841333  11.988333  13.431613  7.736774  11.072759   8.588065  11.184839   9.245333   9.085806  10.107419  13.880968  14.703226
1961-02  16.269286  14.975357  14.441481  9.230741  13.852143  10.937500  11.890714  11.846071  11.821429  12.714286  18.583214  15.411786
1961-03  10.890000  11.296452  10.752903  7.284000  10.509355   8.866774   9.644194   9.829677  10.294138  11.251935  16.410968  15.720000
1961-04  10.722667   9.427667   9.998000  5.830667   8.435000   6.495000   6.925333   7.094667   7.342333   7.237000  11.147333  10.278333
1961-05   9.860968   8.850000  10.818065  5.905333   9.490323   6.574839   7.604000   8.177097   8.039355   8.499355  11.900323  12.011613

Step 14. Downsample the record to a weekly frequency for each location.

The code is as follows:

data.groupby(data.index.to_period('W')).mean().head()

The output is as follows:

                             RPT        VAL        ROS        KIL        SHA        BIR        DUB        CLA        MUL        CLO        BEL        MAL
Yr_Mo_Dy
1960-12-26/1961-01-01  15.040000  14.960000  13.170000   9.290000        NaN   9.870000  13.670000  10.250000  10.830000  12.580000  18.500000  15.040000
1961-01-02/1961-01-08  13.541429  11.486667  10.487143   6.417143   9.474286   6.435714  11.061429   6.616667   8.434286   8.497143  12.481429  13.238571
1961-01-09/1961-01-15  12.468571   8.967143  11.958571   4.630000   7.351429   5.072857   7.535714   6.820000   5.712857   7.571429  11.125714  11.024286
1961-01-16/1961-01-22  13.204286   9.862857  12.982857   6.328571   8.966667   7.417143   9.257143   7.875714   7.145714   8.124286   9.821429  11.434286
1961-01-23/1961-01-29  19.880000  16.141429  18.225714  12.720000  17.432857  14.828571  15.528571  15.160000  14.480000  15.640000  20.930000  22.530000

Step 15. Calculate the min, max and mean windspeeds and standard deviations of the windspeeds across all locations for each week (assume that the first week starts on January 2 1961) for the first 52 weeks.

The code is as follows:

# resample() re-buckets the record at weekly frequency, then agg() computes each statistic;
# weeks end on Sunday, and index[1:53] skips the partial first week so week 1 starts on 1961-01-02
weekly = data.resample('W').agg(['min', 'max', 'mean', 'std'])
weekly.loc[weekly.index[1:53], "RPT":"MAL"].head(10)

The output is as follows:

              RPT                               VAL                               ROS          ...        CLO                   BEL                               MAL
              min    max       mean       std   min    max       mean       std    min    max  ...       mean       std    min    max       mean       std    min    max       mean       std
Yr_Mo_Dy                                                                                       ...
1961-01-08  10.58  18.50  13.541429  2.631321   6.63  16.88  11.486667  3.949525   7.62  12.33  ...   8.497143  1.704941   5.46  17.54  12.481429  4.349139  10.88  16.46  13.238571  1.773062
1961-01-15   9.04  19.75  12.468571  3.555392   3.54  12.08   8.967143  3.148945   7.08  19.50  ...   7.571429  4.084293   5.25  20.71  11.125714  5.552215   5.17  16.92  11.024286  4.692355
1961-01-22   4.92  19.83  13.204286  5.337402   3.42  14.37   9.862857  3.837785   7.29  20.79  ...   8.124286  4.783952   6.50  15.92   9.821429  3.626584   6.79  17.96  11.434286  4.237239
1961-01-29  13.62  25.04  19.880000  4.619061   9.96  23.91  16.141429  5.170224  12.67  25.84  ...  15.640000  3.713368  14.04  27.71  20.930000  5.210726  17.50  27.63  22.530000  3.874721
1961-02-05  10.58  24.21  16.827143  5.251408   9.46  24.21  15.460000  5.187395   9.04  19.70  ...   9.460000  2.839501   9.17  19.33  14.012857  4.210858   7.17  19.25  11.935714  4.336104
1961-02-12  16.00  24.54  19.684286  3.587677  11.54  21.42  16.417143  3.608373  13.67  21.34  ...  14.440000  1.746749  15.21  26.38  21.832857  4.063753  17.04  21.84  19.155714  1.828705
1961-02-19   6.04  22.50  15.130000  5.064609  11.63  20.17  15.091429  3.575012   6.13  19.41  ...  13.542857  2.531361  14.09  29.63  21.167143  5.910938  10.96  22.58  16.584286  4.685377
1961-02-26   7.79  25.80  15.221429  7.020716   7.08  21.50  13.625714  5.147348   6.08  22.42  ...  12.730000  4.920064   9.59  23.21  16.304286  5.091162   6.67  23.87  14.322857  6.182283
1961-03-05  10.96  13.33  12.101429  0.997721   8.83  17.00  12.951429  2.851955   8.17  13.67  ...  12.370000  1.593685  11.58  23.45  17.842857  4.332331   8.83  17.54  13.951667  3.021387
1961-03-12   4.88  14.79   9.376667  3.732263   8.08  16.96  11.578571  3.230167   7.54  16.38  ...  10.458571  3.655113  10.21  22.71  16.701429  4.358759   5.54  22.54  14.420000  5.769890
10 rows × 48 columns
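The two-level column header in the output above comes from passing a list of functions to `agg()`: the first level is the location, the second the statistic. A sketch on a made-up single-location record covering two full Monday-to-Sunday weeks:

```python
import pandas as pd

# Toy single-location record (made up): 14 days starting Monday 1961-01-02
idx = pd.date_range("1961-01-02", periods=14, freq="D")
toy = pd.DataFrame({"RPT": [float(v) for v in range(1, 15)]}, index=idx)

# Passing a list to agg() produces the two-level (location, statistic) columns
weekly = toy.resample("W").agg(["min", "max", "mean", "std"])
print(weekly.columns.nlevels, len(weekly))  # 2 2
print(weekly[("RPT", "mean")].tolist())     # [4.0, 11.0]
```

Individual statistics are then addressed with a tuple, e.g. `weekly[("RPT", "mean")]`.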

Conclusion

Today's batch of pandas exercises is done. Keep practicing!

This concludes this article on Python data analysis pandas beginner exercises (8). We hope the articles we recommend are helpful, and we hope you will keep supporting 为之网!