Day 1
01 Preparing the Pandas Practice Environment
01-1 Installing Anaconda
01-2 Studying the Pandas Exercises
https://github.com/EasysPublishing/do_it_pandas
I pay for the JetBrains Ultimate plan every year, so after installing Anaconda
I plan to work through the book in DataSpell.
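02 Getting Started with Pandas
Pandas is an open-source data analysis library that provides two new data types, DataFrame and Series. It can load spreadsheet-like data and quickly manipulate, sort, and merge it.
A Series represents a single column of a DataFrame. A pandas DataFrame can be thought of as a dictionary or collection of several Series.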
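To make the Series/DataFrame relationship concrete, here is a minimal sketch. The toy values and the names `country`, `year`, `toy`, and `col` are made up for illustration and are not from the Gapminder data.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Gapminder file.
country = pd.Series(["Korea", "Japan"], name="country")
year = pd.Series([2007, 2007], name="year")

# A dict of Series becomes a DataFrame; each key becomes a column name.
toy = pd.DataFrame({"country": country, "year": year})

# Pulling a single column back out gives a Series again.
col = toy["year"]
print(type(toy).__name__, type(col).__name__)
```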
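Advantages of pandas
- Automation
- Reproducibility: every step of the analysis can be recorded
02-2 Loading a Dataset
This section uses the dataset provided by Gapminder, a statistical analysis service.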
In [3]:
import pandas as pd
df = pd.read_csv('../data/gapminder.tsv', sep='\t')
print(df)
          country continent  year  lifeExp       pop   gdpPercap
0     Afghanistan      Asia  1952   28.801   8425333  779.445314
1     Afghanistan      Asia  1957   30.332   9240934  820.853030
2     Afghanistan      Asia  1962   31.997  10267083  853.100710
3     Afghanistan      Asia  1967   34.020  11537966  836.197138
4     Afghanistan      Asia  1972   36.088  13079460  739.981106
...           ...       ...   ...      ...       ...         ...
1699     Zimbabwe    Africa  1987   62.351   9216418  706.157306
1700     Zimbabwe    Africa  1992   60.377  10704340  693.420786
1701     Zimbabwe    Africa  1997   46.809  11404948  792.449960
1702     Zimbabwe    Africa  2002   39.989  11926563  672.038623
1703     Zimbabwe    Africa  2007   43.487  12311143  469.709298

[1704 rows x 6 columns]
In [4]:
print(type(df))
<class 'pandas.core.frame.DataFrame'>
In [5]:
print(df.shape)
(1704, 6)
Note: shape is an attribute, not a method, so it is written without parentheses.
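A quick sketch of what that means in practice, using a made-up three-row frame (`toy` and `called` are hypothetical names): reading `shape` gives a tuple, and trying to call it fails because a tuple is not callable.

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})  # hypothetical data

# Attribute access: no parentheses, returns (rows, columns).
rows, cols = toy.shape

try:
    toy.shape()  # calling the tuple raises TypeError
    called = True
except TypeError:
    called = False

print(rows, cols, called)
```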
In [6]:
print(df.columns)
Index(['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap'], dtype='object')
In [8]:
print(df.dtypes)
country       object
continent     object
year           int64
lifeExp      float64
pop            int64
gdpPercap    float64
dtype: object
In [9]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   country    1704 non-null   object
 1   continent  1704 non-null   object
 2   year       1704 non-null   int64
 3   lifeExp    1704 non-null   float64
 4   pop        1704 non-null   int64
 5   gdpPercap  1704 non-null   float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB
02-3 Extracting Data
Extracting Column Data
In [10]:
df.head()
Out[10]:
| | country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|---|
0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 |
1 | Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.853030 |
2 | Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.100710 |
3 | Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.197138 |
4 | Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.981106 |
In [11]:
df['country'].head()
Out[11]:
0    Afghanistan
1    Afghanistan
2    Afghanistan
3    Afghanistan
4    Afghanistan
Name: country, dtype: object
In [12]:
df['country'].tail()
Out[12]:
1699    Zimbabwe
1700    Zimbabwe
1701    Zimbabwe
1702    Zimbabwe
1703    Zimbabwe
Name: country, dtype: object
In [14]:
subset = df[['country', 'continent', 'year']]
subset
Out[14]:
| | country | continent | year |
---|---|---|---|
0 | Afghanistan | Asia | 1952 |
1 | Afghanistan | Asia | 1957 |
2 | Afghanistan | Asia | 1962 |
3 | Afghanistan | Asia | 1967 |
4 | Afghanistan | Asia | 1972 |
... | ... | ... | ... |
1699 | Zimbabwe | Africa | 1987 |
1700 | Zimbabwe | Africa | 1992 |
1701 | Zimbabwe | Africa | 1997 |
1702 | Zimbabwe | Africa | 2002 |
1703 | Zimbabwe | Africa | 2007 |
1704 rows × 3 columns
The two ways of extracting a column differ in the type of object they return:
df['country'] returns a Series, while df[['country']] returns a DataFrame.
In [15]:
print(type(df['country']))
print(type(df[['country']]))
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
In [16]:
print(df.country)
print(type(df.country))
0       Afghanistan
1       Afghanistan
2       Afghanistan
3       Afghanistan
4       Afghanistan
           ...
1699       Zimbabwe
1700       Zimbabwe
1701       Zimbabwe
1702       Zimbabwe
1703       Zimbabwe
Name: country, Length: 1704, dtype: object
<class 'pandas.core.series.Series'>
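Dot notation (df.country) only works for simple column names; if a name contains special characters, you must use bracket notation.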
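A small sketch of where dot notation breaks down. The frame below is hypothetical; the second case is an addition beyond the note above: a column whose name matches an existing DataFrame method, such as `pop` in the Gapminder data, is also shadowed under dot notation.

```python
import pandas as pd

# Hypothetical frame: one name contains a space, one clashes with a method.
toy = pd.DataFrame({"life exp": [28.8, 30.3], "pop": [8425333, 9240934]})

life = toy["life exp"]   # `toy.life exp` would be a syntax error
pop_col = toy["pop"]     # the column, returned as a Series
pop_attr = toy.pop       # the DataFrame.pop method, not the column

print(type(pop_col).__name__)  # Series
print(callable(pop_attr))      # True
```

Bracket notation is therefore the safer default whenever column names come from an external file.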
Extracting Row Data
loc: extracts rows by row name (index label)
iloc: extracts rows by row number (integer position)
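With the default index, labels and positions happen to coincide, which hides the difference. A minimal sketch with a made-up frame whose labels are 10, 20, 30 (all names here are hypothetical) makes the distinction visible:

```python
import pandas as pd

# Hypothetical frame whose index labels differ from row positions.
toy = pd.DataFrame({"value": ["a", "b", "c"]}, index=[10, 20, 30])

by_label = toy.loc[10]      # the row whose label is 10
by_position = toy.iloc[0]   # the first row by position (same row here)
last = toy.iloc[-1]         # negative positions count from the end

try:
    toy.loc[-1]             # no row is labelled -1, so this raises KeyError
    found = True
except KeyError:
    found = False

print(by_label["value"], by_position["value"], last["value"], found)
```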
In [17]:
df.loc[0]
Out[17]:
country      Afghanistan
continent           Asia
year                1952
lifeExp           28.801
pop              8425333
gdpPercap     779.445314
Name: 0, dtype: object
In [18]:
df.tail(n=1)
Out[18]:
| | country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|---|
1703 | Zimbabwe | Africa | 2007 | 43.487 | 12311143 | 469.709298 |
In [19]:
type(df.tail(n=1))
Out[19]:
pandas.core.frame.DataFrame
In [20]:
df.loc[[0, 99, 999]]
Out[20]:
| | country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|---|
0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 |
99 | Bangladesh | Asia | 1967 | 43.453 | 62821884 | 721.186086 |
999 | Mongolia | Asia | 1967 | 51.253 | 1149500 | 1226.041130 |
In [21]:
df.iloc[1]
Out[21]:
country      Afghanistan
continent           Asia
year                1957
lifeExp           30.332
pop              9240934
gdpPercap      820.85303
Name: 1, dtype: object
In [22]:
df.iloc[-1]
Out[22]:
country        Zimbabwe
continent        Africa
year               2007
lifeExp          43.487
pop            12311143
gdpPercap    469.709298
Name: 1703, dtype: object
In [23]:
type(df.iloc[-1])
Out[23]:
pandas.core.series.Series
In [24]:
df.loc[[0, 99, 999], ['country', 'continent', 'year']]
Out[24]:
| | country | continent | year |
---|---|---|---|
0 | Afghanistan | Asia | 1952 |
99 | Bangladesh | Asia | 1967 |
999 | Mongolia | Asia | 1967 |
In [25]:
df.loc[:, ['country', 'continent', 'year']]
Out[25]:
| | country | continent | year |
---|---|---|---|
0 | Afghanistan | Asia | 1952 |
1 | Afghanistan | Asia | 1957 |
2 | Afghanistan | Asia | 1962 |
3 | Afghanistan | Asia | 1967 |
4 | Afghanistan | Asia | 1972 |
... | ... | ... | ... |
1699 | Zimbabwe | Africa | 1987 |
1700 | Zimbabwe | Africa | 1992 |
1701 | Zimbabwe | Africa | 1997 |
1702 | Zimbabwe | Africa | 2002 |
1703 | Zimbabwe | Africa | 2007 |
1704 rows × 3 columns
In [26]:
df.iloc[:, [2, 4, -1]]
Out[26]:
| | year | pop | gdpPercap |
---|---|---|---|
0 | 1952 | 8425333 | 779.445314 |
1 | 1957 | 9240934 | 820.853030 |
2 | 1962 | 10267083 | 853.100710 |
3 | 1967 | 11537966 | 836.197138 |
4 | 1972 | 13079460 | 739.981106 |
... | ... | ... | ... |
1699 | 1987 | 9216418 | 706.157306 |
1700 | 1992 | 10704340 | 693.420786 |
1701 | 1997 | 11404948 | 792.449960 |
1702 | 2002 | 11926563 | 672.038623 |
1703 | 2007 | 12311143 | 469.709298 |
1704 rows × 3 columns
In [27]:
df.iloc[:, range(0, 5)]
Out[27]:
| | country | continent | year | lifeExp | pop |
---|---|---|---|---|---|
0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 |
1 | Afghanistan | Asia | 1957 | 30.332 | 9240934 |
2 | Afghanistan | Asia | 1962 | 31.997 | 10267083 |
3 | Afghanistan | Asia | 1967 | 34.020 | 11537966 |
4 | Afghanistan | Asia | 1972 | 36.088 | 13079460 |
... | ... | ... | ... | ... | ... |
1699 | Zimbabwe | Africa | 1987 | 62.351 | 9216418 |
1700 | Zimbabwe | Africa | 1992 | 60.377 | 10704340 |
1701 | Zimbabwe | Africa | 1997 | 46.809 | 11404948 |
1702 | Zimbabwe | Africa | 2002 | 39.989 | 11926563 |
1703 | Zimbabwe | Africa | 2007 | 43.487 | 12311143 |
1704 rows × 5 columns
In [28]:
df.iloc[:, 0:6:2]
Out[28]:
| | country | year | pop |
---|---|---|---|
0 | Afghanistan | 1952 | 8425333 |
1 | Afghanistan | 1957 | 9240934 |
2 | Afghanistan | 1962 | 10267083 |
3 | Afghanistan | 1967 | 11537966 |
4 | Afghanistan | 1972 | 13079460 |
... | ... | ... | ... |
1699 | Zimbabwe | 1987 | 9216418 |
1700 | Zimbabwe | 1992 | 10704340 |
1701 | Zimbabwe | 1997 | 11404948 |
1702 | Zimbabwe | 2002 | 11926563 |
1703 | Zimbabwe | 2007 | 12311143 |
1704 rows × 3 columns
In [29]:
df.loc[10:13, :]
Out[29]:
| | country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|---|
10 | Afghanistan | Asia | 2002 | 42.129 | 25268405 | 726.734055 |
11 | Afghanistan | Asia | 2007 | 43.828 | 31889923 | 974.580338 |
12 | Albania | Europe | 1952 | 55.230 | 1282697 | 1601.056136 |
13 | Albania | Europe | 1957 | 59.280 | 1476505 | 1942.284244 |
In [30]:
df.iloc[10:13, :]
Out[30]:
| | country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|---|
10 | Afghanistan | Asia | 2002 | 42.129 | 25268405 | 726.734055 |
11 | Afghanistan | Asia | 2007 | 43.828 | 31889923 | 974.580338 |
12 | Albania | Europe | 1952 | 55.230 | 1282697 | 1601.056136 |
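The two outputs above differ by one row: `loc[10:13]` returned four rows and `iloc[10:13]` returned three. A minimal sketch of why, using a made-up 20-row frame (`toy`, `label_slice`, and `position_slice` are hypothetical names): label slices include the end label, while position slices exclude the end position like ordinary Python slices.

```python
import pandas as pd

toy = pd.DataFrame({"value": range(20)})  # default RangeIndex 0..19

label_slice = toy.loc[10:13]      # labels 10, 11, 12, 13 (end included)
position_slice = toy.iloc[10:13]  # positions 10, 11, 12 (end excluded)

print(len(label_slice), len(position_slice))  # 4 3
```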
In [31]:
df.groupby('year')['lifeExp'].mean()
Out[31]:
year
1952    49.057620
1957    51.507401
1962    53.609249
1967    55.678290
1972    57.647386
1977    59.570157
1982    61.533197
1987    63.212613
1992    64.160338
1997    65.014676
2002    65.694923
2007    67.007423
Name: lifeExp, dtype: float64
In [32]:
type(df.groupby('year'))
Out[32]:
pandas.core.groupby.generic.DataFrameGroupBy
In [33]:
type(df.groupby('year')['lifeExp'])
Out[33]:
pandas.core.groupby.generic.SeriesGroupBy
In [34]:
mean_lifeExp_by_year = df.groupby('year')['lifeExp'].mean()
mean_lifeExp_by_year
Out[34]:
year
1952    49.057620
1957    51.507401
1962    53.609249
1967    55.678290
1972    57.647386
1977    59.570157
1982    61.533197
1987    63.212613
1992    64.160338
1997    65.014676
2002    65.694923
2007    67.007423
Name: lifeExp, dtype: float64
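One way to read the grouped mean is as "split the rows by year, average each piece, and combine the results". The sketch below checks that reading on a tiny made-up frame; the manual loop is only an illustration of the idea, not how pandas implements `groupby` internally.

```python
import pandas as pd

# Hypothetical values, chosen so the means are easy to verify by eye.
toy = pd.DataFrame({
    "year": [1952, 1952, 1957, 1957],
    "lifeExp": [30.0, 50.0, 40.0, 60.0],
})

grouped = toy.groupby("year")["lifeExp"].mean()

# Manual equivalent: filter each year's rows, then average them.
manual = {
    int(year): float(toy.loc[toy["year"] == year, "lifeExp"].mean())
    for year in toy["year"].unique()
}

print(grouped.to_dict(), manual)
```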
In [35]:
multi_group_var = df.groupby(['year', 'continent'])[['lifeExp', 'gdpPercap']].mean()
multi_group_var
Out[35]:
year | continent | lifeExp | gdpPercap |
---|---|---|---|
1952 | Africa | 39.135500 | 1252.572466 |
| | Americas | 53.279840 | 4079.062552 |
| | Asia | 46.314394 | 5195.484004 |
| | Europe | 64.408500 | 5661.057435 |
| | Oceania | 69.255000 | 10298.085650 |
1957 | Africa | 41.266346 | 1385.236062 |
| | Americas | 55.960280 | 4616.043733 |
| | Asia | 49.318544 | 5787.732940 |
| | Europe | 66.703067 | 6963.012816 |
| | Oceania | 70.295000 | 11598.522455 |
1962 | Africa | 43.319442 | 1598.078825 |
| | Americas | 58.398760 | 4901.541870 |
| | Asia | 51.563223 | 5729.369625 |
| | Europe | 68.539233 | 8365.486814 |
| | Oceania | 71.085000 | 12696.452430 |
1967 | Africa | 45.334538 | 2050.363801 |
| | Americas | 60.410920 | 5668.253496 |
| | Asia | 54.663640 | 5971.173374 |
| | Europe | 69.737600 | 10143.823757 |
| | Oceania | 71.310000 | 14495.021790 |
1972 | Africa | 47.450942 | 2339.615674 |
| | Americas | 62.394920 | 6491.334139 |
| | Asia | 57.319269 | 8187.468699 |
| | Europe | 70.775033 | 12479.575246 |
| | Oceania | 71.910000 | 16417.333380 |
1977 | Africa | 49.580423 | 2585.938508 |
| | Americas | 64.391560 | 7352.007126 |
| | Asia | 59.610556 | 7791.314020 |
| | Europe | 71.937767 | 14283.979110 |
| | Oceania | 72.855000 | 17283.957605 |
1982 | Africa | 51.592865 | 2481.592960 |
| | Americas | 66.228840 | 7506.737088 |
| | Asia | 62.617939 | 7434.135157 |
| | Europe | 72.806400 | 15617.896551 |
| | Oceania | 74.290000 | 18554.709840 |
1987 | Africa | 53.344788 | 2282.668991 |
| | Americas | 68.090720 | 7793.400261 |
| | Asia | 64.851182 | 7608.226508 |
| | Europe | 73.642167 | 17214.310727 |
| | Oceania | 75.320000 | 20448.040160 |
1992 | Africa | 53.629577 | 2281.810333 |
| | Americas | 69.568360 | 8044.934406 |
| | Asia | 66.537212 | 8639.690248 |
| | Europe | 74.440100 | 17061.568084 |
| | Oceania | 76.945000 | 20894.045885 |
1997 | Africa | 53.598269 | 2378.759555 |
| | Americas | 71.150480 | 8889.300863 |
| | Asia | 68.020515 | 9834.093295 |
| | Europe | 75.505167 | 19076.781802 |
| | Oceania | 78.190000 | 24024.175170 |
2002 | Africa | 53.325231 | 2599.385159 |
| | Americas | 72.422040 | 9287.677107 |
| | Asia | 69.233879 | 10174.090397 |
| | Europe | 76.700600 | 21711.732422 |
| | Oceania | 79.740000 | 26938.778040 |
2007 | Africa | 54.806038 | 3089.032605 |
| | Americas | 73.608120 | 11003.031625 |
| | Asia | 70.728485 | 12473.026870 |
| | Europe | 77.648600 | 25054.481636 |
| | Oceania | 80.719500 | 29810.188275 |
In [36]:
flat = multi_group_var.reset_index()
flat
Out[36]:
| | year | continent | lifeExp | gdpPercap |
---|---|---|---|---|
0 | 1952 | Africa | 39.135500 | 1252.572466 |
1 | 1952 | Americas | 53.279840 | 4079.062552 |
2 | 1952 | Asia | 46.314394 | 5195.484004 |
3 | 1952 | Europe | 64.408500 | 5661.057435 |
4 | 1952 | Oceania | 69.255000 | 10298.085650 |
5 | 1957 | Africa | 41.266346 | 1385.236062 |
6 | 1957 | Americas | 55.960280 | 4616.043733 |
7 | 1957 | Asia | 49.318544 | 5787.732940 |
8 | 1957 | Europe | 66.703067 | 6963.012816 |
9 | 1957 | Oceania | 70.295000 | 11598.522455 |
10 | 1962 | Africa | 43.319442 | 1598.078825 |
11 | 1962 | Americas | 58.398760 | 4901.541870 |
12 | 1962 | Asia | 51.563223 | 5729.369625 |
13 | 1962 | Europe | 68.539233 | 8365.486814 |
14 | 1962 | Oceania | 71.085000 | 12696.452430 |
15 | 1967 | Africa | 45.334538 | 2050.363801 |
16 | 1967 | Americas | 60.410920 | 5668.253496 |
17 | 1967 | Asia | 54.663640 | 5971.173374 |
18 | 1967 | Europe | 69.737600 | 10143.823757 |
19 | 1967 | Oceania | 71.310000 | 14495.021790 |
20 | 1972 | Africa | 47.450942 | 2339.615674 |
21 | 1972 | Americas | 62.394920 | 6491.334139 |
22 | 1972 | Asia | 57.319269 | 8187.468699 |
23 | 1972 | Europe | 70.775033 | 12479.575246 |
24 | 1972 | Oceania | 71.910000 | 16417.333380 |
25 | 1977 | Africa | 49.580423 | 2585.938508 |
26 | 1977 | Americas | 64.391560 | 7352.007126 |
27 | 1977 | Asia | 59.610556 | 7791.314020 |
28 | 1977 | Europe | 71.937767 | 14283.979110 |
29 | 1977 | Oceania | 72.855000 | 17283.957605 |
30 | 1982 | Africa | 51.592865 | 2481.592960 |
31 | 1982 | Americas | 66.228840 | 7506.737088 |
32 | 1982 | Asia | 62.617939 | 7434.135157 |
33 | 1982 | Europe | 72.806400 | 15617.896551 |
34 | 1982 | Oceania | 74.290000 | 18554.709840 |
35 | 1987 | Africa | 53.344788 | 2282.668991 |
36 | 1987 | Americas | 68.090720 | 7793.400261 |
37 | 1987 | Asia | 64.851182 | 7608.226508 |
38 | 1987 | Europe | 73.642167 | 17214.310727 |
39 | 1987 | Oceania | 75.320000 | 20448.040160 |
40 | 1992 | Africa | 53.629577 | 2281.810333 |
41 | 1992 | Americas | 69.568360 | 8044.934406 |
42 | 1992 | Asia | 66.537212 | 8639.690248 |
43 | 1992 | Europe | 74.440100 | 17061.568084 |
44 | 1992 | Oceania | 76.945000 | 20894.045885 |
45 | 1997 | Africa | 53.598269 | 2378.759555 |
46 | 1997 | Americas | 71.150480 | 8889.300863 |
47 | 1997 | Asia | 68.020515 | 9834.093295 |
48 | 1997 | Europe | 75.505167 | 19076.781802 |
49 | 1997 | Oceania | 78.190000 | 24024.175170 |
50 | 2002 | Africa | 53.325231 | 2599.385159 |
51 | 2002 | Americas | 72.422040 | 9287.677107 |
52 | 2002 | Asia | 69.233879 | 10174.090397 |
53 | 2002 | Europe | 76.700600 | 21711.732422 |
54 | 2002 | Oceania | 79.740000 | 26938.778040 |
55 | 2007 | Africa | 54.806038 | 3089.032605 |
56 | 2007 | Americas | 73.608120 | 11003.031625 |
57 | 2007 | Asia | 70.728485 | 12473.026870 |
58 | 2007 | Europe | 77.648600 | 25054.481636 |
59 | 2007 | Oceania | 80.719500 | 29810.188275 |
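What `reset_index()` did here can be checked on a small example. In this sketch (made-up values; `toy`, `nested`, and `flat_toy` are hypothetical names), grouping by two columns yields a two-level index, and `reset_index()` moves those levels back into ordinary columns with a fresh 0-based index.

```python
import pandas as pd

# Hypothetical values, not taken from the Gapminder file.
toy = pd.DataFrame({
    "year": [1952, 1952, 1957],
    "continent": ["Asia", "Europe", "Asia"],
    "lifeExp": [46.3, 64.4, 49.3],
})

# Grouping by two keys produces a two-level (hierarchical) row index.
nested = toy.groupby(["year", "continent"])[["lifeExp"]].mean()

# reset_index() turns the index levels back into regular columns.
flat_toy = nested.reset_index()

print(nested.index.nlevels, list(flat_toy.columns))
```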
Counting Grouped Data
In [37]:
df.groupby('continent')['country'].nunique()
Out[37]:
continent
Africa      52
Americas    25
Asia        33
Europe      30
Oceania      2
Name: country, dtype: int64
In [38]:
df.groupby('continent')['country'].value_counts()
Out[38]:
continent  country
Africa     Algeria        12
           Angola         12
           Libya          12
           Ghana          12
           Guinea         12
                          ..
Europe     Germany        12
           Greece         12
           Hungary        12
Oceania    Australia      12
           New Zealand    12
Name: count, Length: 142, dtype: int64
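The two calls above answer different questions: `nunique()` counts distinct values per group, while `value_counts()` counts how many rows each value has within a group. A minimal sketch on made-up data (all names are hypothetical):

```python
import pandas as pd

# Hypothetical rows: "Korea" appears twice within "Asia".
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Asia", "Europe"],
    "country": ["Korea", "Korea", "Japan", "France"],
})

# Number of distinct countries in each continent.
distinct = toy.groupby("continent")["country"].nunique()

# Number of rows for each (continent, country) pair.
per_country = toy.groupby("continent")["country"].value_counts()

print(distinct.to_dict())
print(per_country.to_dict())
```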
02-5 How to Plot Data
In [39]:
global_yearly_life_expectancy = df.groupby('year')['lifeExp'].mean()
global_yearly_life_expectancy
Out[39]:
year
1952    49.057620
1957    51.507401
1962    53.609249
1967    55.678290
1972    57.647386
1977    59.570157
1982    61.533197
1987    63.212613
1992    64.160338
1997    65.014676
2002    65.694923
2007    67.007423
Name: lifeExp, dtype: float64
In [40]:
global_yearly_life_expectancy.plot()
Out[40]:
<Axes: xlabel='year'>