Processing Strings¶
To become a capable data analyst, you need to be able to handle strings well.
In [1]:
test = """
123
456
789"""
print(test.splitlines())
['', '123', '456', '789']
In [2]:
digits = 67890
s = f"recited {digits:,} digits"
print(s)
recited 67,890 digits
In [3]:
prop = 7 / 67890
s = f"I remember {prop:.4} or {prop:.4%} of what Lu Chao recited."
print(s)
I remember 0.0001031 or 0.0103% of what Lu Chao recited.
In [4]:
id = 42
print(f"My Id number is {id:05d}")
My Id number is 00042
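The format specifiers above can be combined freely. A minimal self-contained sketch using the same values as the cells above:

```python
# f-string format specifiers: zero-padding, thousands separator, percentage
digits = 67890
prop = 7 / digits
padded = f"{42:05d}"    # zero-pad to width 5
comma = f"{digits:,}"   # thousands separator
pct = f"{prop:.4%}"     # percentage rendered with 4 decimal places
print(padded, comma, pct)
```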
In [5]:
import re
tele_num = '1234567890'
m = re.match(pattern=r'\d\d\d\d\d\d\d\d\d\d', string=tele_num)
print(type(m))
<class 're.Match'>
In [6]:
print(m)
<re.Match object; span=(0, 10), match='1234567890'>
In [7]:
print(bool(m))
True
In [8]:
if m:
print('match')
else:
print('no match')
match
In [9]:
print(m.start())
0
In [10]:
print(m.end())
10
In [11]:
m.span()
Out[11]:
(0, 10)
In [12]:
m.group()
Out[12]:
'1234567890'
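The match-object accessors above can be summarized in one short sketch. Note the raw-string (`r'...'`) pattern, which keeps backslashes literal and avoids invalid-escape warnings:

```python
import re

# raw strings keep backslashes literal, so '\d' reaches the regex engine intact
m = re.match(pattern=r'\d{10}', string='1234567890')
assert m is not None       # a truthy match object means the pattern matched
print(m.group(), m.span()) # matched text and its (start, end) positions
```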
In [13]:
tele_num_spaces = '123 456 7890'
In [15]:
m = re.match(pattern=r'\d{10}', string=tele_num_spaces)
print(m)
None
In [16]:
p = r'\d{3}\s?\d{3}\s?\d{4}'
m = re.match(pattern=p, string=tele_num_spaces)
print(m)
<re.Match object; span=(0, 12), match='123 456 7890'>
In [17]:
tele_num_space_paren_dash = '(123) 456-7890'
p = r'\(?\d{3}\)?\s?\d{3}\s?-?\d{4}'
m = re.match(pattern=p, string=tele_num_space_paren_dash)
print(m)
<re.Match object; span=(0, 14), match='(123) 456-7890'>
In [25]:
cnty_tele_num_space_paren_dash = '+1 (123) 456-7890'
p = r'\+?1\s?\(?\d{3}\)?\s?\d{3}\s?-?\d{4}'
m = re.match(pattern=p, string=cnty_tele_num_space_paren_dash)
print(m)
<re.Match object; span=(0, 17), match='+1 (123) 456-7890'>
In [26]:
p = (
    r'\+?'  # zero or one '+'
    '1'     # the digit 1
    r'\s?'  # zero or one whitespace character
)
print(p)
\+?1\s?
In [27]:
s = (
"14 Ncuti Gatwa, "
"13 Jodie Whittaker, war John Hurt, 12 Peter Capaldi, "
"11 Matt Smith, 10 David Tennant, 9 Christopher Eccleston"
)
print(s)
14 Ncuti Gatwa, 13 Jodie Whittaker, war John Hurt, 12 Peter Capaldi, 11 Matt Smith, 10 David Tennant, 9 Christopher Eccleston
In [29]:
p = r"\d+"
m = re.findall(pattern=p, string=s)
print(m)
['14', '13', '12', '11', '10', '9']
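When the pattern contains capture groups, `findall` returns tuples instead of plain strings. A short sketch (the pattern and sample string are illustrative):

```python
import re

s = "14 Ncuti Gatwa, 13 Jodie Whittaker"
# two capture groups make findall return (number, first name) tuples
pairs = re.findall(r"(\d+) (\w+)", s)
print(pairs)
```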
In [42]:
multi_str = """Guard: What? Ridden on a horse?
King Arthur: Yes!
Guard: You're using coconuts!
King Arthur: What?
Guard: You've got ... coconut[s] and you're bangin' 'em together.
"""
p = r'\w+\s?\w+:\s?'
s = re.sub(pattern=p, string=multi_str, repl='')
print(s)
What? Ridden on a horse? Yes! You're using coconuts! What? You've got ... coconut[s] and you're bangin' 'em together.
In [43]:
guard = s.splitlines()[::2]
kinga = s.splitlines()[1::2]
print(guard)
['What? Ridden on a horse?', "You're using coconuts!", "You've got ... coconut[s] and you're bangin' 'em together."]
In [44]:
print(kinga)
['Yes!', 'What?']
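The step-2 slicing trick used above generalizes to any alternating sequence; a tiny sketch with made-up lines:

```python
# step-2 slicing splits alternating lines into two speakers
lines = ["Q1", "A1", "Q2", "A2", "Q3"]
questions = lines[::2]   # even positions: 0, 2, 4
answers = lines[1::2]    # odd positions: 1, 3
print(questions, answers)
```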
In [45]:
p = re.compile(r'\d{10}')
s = '1234567890'
m = p.match(s)
print(m)
<re.Match object; span=(0, 10), match='1234567890'>
In [46]:
p = re.compile(r'\d+')
s = (
"14 Ncuti Gatwa, "
"13 Jodie Whittaker, war John Hurt, 12 Peter Capaldi, "
"11 Matt Smith, 10 David Tennant, 9 Christopher Eccleston"
)
m = p.findall(s)
print(m)
['14', '13', '12', '11', '10', '9']
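A compiled pattern object offers the same methods as the module-level functions, and is handy when the same pattern is reused many times. A quick sketch contrasting `match`, `search`, and `findall`:

```python
import re

p = re.compile(r'\d+')             # compile once, reuse across many calls
print(p.findall("a1 b22 c333"))    # every digit run in the string
print(p.match("abc") is None)      # match() only anchors at position 0
print(p.search("abc 7").group())   # search() scans the whole string
```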
11-6 Using the regex Library¶
The re library is part of the Python standard library. If you want more advanced regular-expression features, use the third-party regex library; it supports the re module's API as-is.
Chapter 12: Exploring Time Series Data¶
In [1]:
import pandas as pd
ebola = pd.read_csv('../data/country_timeseries.csv')
In [2]:
print(ebola.iloc[:5, :5])
         Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone
0    1/5/2015  289        2776.0            NaN            10030.0
1    1/4/2015  288        2775.0            NaN             9780.0
2    1/3/2015  287        2769.0         8166.0             9722.0
3    1/2/2015  286           NaN         8157.0                NaN
4  12/31/2014  284        2730.0         8115.0             9633.0
In [3]:
print(ebola.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122 entries, 0 to 121
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Date                 122 non-null    object
 1   Day                  122 non-null    int64
 2   Cases_Guinea         93 non-null     float64
 3   Cases_Liberia        83 non-null     float64
 4   Cases_SierraLeone    87 non-null     float64
 5   Cases_Nigeria        38 non-null     float64
 6   Cases_Senegal        25 non-null     float64
 7   Cases_UnitedStates   18 non-null     float64
 8   Cases_Spain          16 non-null     float64
 9   Cases_Mali           12 non-null     float64
 10  Deaths_Guinea        92 non-null     float64
 11  Deaths_Liberia       81 non-null     float64
 12  Deaths_SierraLeone   87 non-null     float64
 13  Deaths_Nigeria       38 non-null     float64
 14  Deaths_Senegal       22 non-null     float64
 15  Deaths_UnitedStates  18 non-null     float64
 16  Deaths_Spain         16 non-null     float64
 17  Deaths_Mali          12 non-null     float64
dtypes: float64(16), int64(1), object(1)
memory usage: 17.3+ KB
None
In [4]:
ebola['data_dt'] = pd.to_datetime(ebola['Date'])
In [5]:
ebola['data_dt'] = pd.to_datetime(ebola['Date'], format='%m/%d/%Y')
In [6]:
print(ebola.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122 entries, 0 to 121
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Date                 122 non-null    object
 1   Day                  122 non-null    int64
 2   Cases_Guinea         93 non-null     float64
 3   Cases_Liberia        83 non-null     float64
 4   Cases_SierraLeone    87 non-null     float64
 5   Cases_Nigeria        38 non-null     float64
 6   Cases_Senegal        25 non-null     float64
 7   Cases_UnitedStates   18 non-null     float64
 8   Cases_Spain          16 non-null     float64
 9   Cases_Mali           12 non-null     float64
 10  Deaths_Guinea        92 non-null     float64
 11  Deaths_Liberia       81 non-null     float64
 12  Deaths_SierraLeone   87 non-null     float64
 13  Deaths_Nigeria       38 non-null     float64
 14  Deaths_Senegal       22 non-null     float64
 15  Deaths_UnitedStates  18 non-null     float64
 16  Deaths_Spain         16 non-null     float64
 17  Deaths_Mali          12 non-null     float64
 18  data_dt              122 non-null    datetime64[ns]
dtypes: datetime64[ns](1), float64(16), int64(1), object(1)
memory usage: 18.2+ KB
None
In [7]:
ebola = pd.read_csv('../data/country_timeseries.csv', parse_dates=['Date'])
print(ebola.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122 entries, 0 to 121
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Date                 122 non-null    datetime64[ns]
 1   Day                  122 non-null    int64
 2   Cases_Guinea         93 non-null     float64
 3   Cases_Liberia        83 non-null     float64
 4   Cases_SierraLeone    87 non-null     float64
 5   Cases_Nigeria        38 non-null     float64
 6   Cases_Senegal        25 non-null     float64
 7   Cases_UnitedStates   18 non-null     float64
 8   Cases_Spain          16 non-null     float64
 9   Cases_Mali           12 non-null     float64
 10  Deaths_Guinea        92 non-null     float64
 11  Deaths_Liberia       81 non-null     float64
 12  Deaths_SierraLeone   87 non-null     float64
 13  Deaths_Nigeria       38 non-null     float64
 14  Deaths_Senegal       22 non-null     float64
 15  Deaths_UnitedStates  18 non-null     float64
 16  Deaths_Spain         16 non-null     float64
 17  Deaths_Mali          12 non-null     float64
dtypes: datetime64[ns](1), float64(16), int64(1)
memory usage: 17.3 KB
None
In [8]:
d = pd.to_datetime('2021-12-14')
print(d)
2021-12-14 00:00:00
In [9]:
print(type(d))
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
In [10]:
print(d.year)
2021
In [11]:
print(d.month)
12
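Beyond `year` and `month`, a `Timestamp` exposes many other calendar attributes. A short sketch using the same date as above:

```python
import pandas as pd

d = pd.to_datetime('2021-12-14')
# dayofweek counts from Monday=0; quarter and day_name() are also available
print(d.day, d.dayofweek, d.day_name(), d.quarter)
```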
In [12]:
ebola['date_dt'] = pd.to_datetime(ebola['Date'])
print(ebola.loc[:,['Date', 'date_dt']])
          Date    date_dt
0   2015-01-05 2015-01-05
1   2015-01-04 2015-01-04
2   2015-01-03 2015-01-03
3   2015-01-02 2015-01-02
4   2014-12-31 2014-12-31
..         ...        ...
117 2014-03-27 2014-03-27
118 2014-03-26 2014-03-26
119 2014-03-25 2014-03-25
120 2014-03-24 2014-03-24
121 2014-03-22 2014-03-22

[122 rows x 2 columns]
In [13]:
ebola['year'] = ebola['date_dt'].dt.year
ebola[['Date', 'date_dt', 'year']]
Out[13]:
| Date | date_dt | year |
---|---|---|---|
0 | 2015-01-05 | 2015-01-05 | 2015 |
1 | 2015-01-04 | 2015-01-04 | 2015 |
2 | 2015-01-03 | 2015-01-03 | 2015 |
3 | 2015-01-02 | 2015-01-02 | 2015 |
4 | 2014-12-31 | 2014-12-31 | 2014 |
... | ... | ... | ... |
117 | 2014-03-27 | 2014-03-27 | 2014 |
118 | 2014-03-26 | 2014-03-26 | 2014 |
119 | 2014-03-25 | 2014-03-25 | 2014 |
120 | 2014-03-24 | 2014-03-24 | 2014 |
121 | 2014-03-22 | 2014-03-22 | 2014 |
122 rows × 3 columns
In [14]:
ebola = ebola.assign(
month=ebola["date_dt"].dt.month,
day=ebola["date_dt"].dt.day
)
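The `.dt` accessor pattern used above works on any datetime64 Series. A self-contained sketch with made-up dates (the column names mirror the cells above):

```python
import pandas as pd

# .dt exposes datetime parts column-wise on a datetime64 Series
df = pd.DataFrame({"Date": pd.to_datetime(["2015-01-05", "2014-12-31"])})
df = df.assign(
    year=df["Date"].dt.year,
    month=df["Date"].dt.month,
    day=df["Date"].dt.day,
)
print(df)
```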
In [15]:
ebola
Out[15]:
| Date | Day | Cases_Guinea | Cases_Liberia | Cases_SierraLeone | Cases_Nigeria | Cases_Senegal | Cases_UnitedStates | Cases_Spain | Cases_Mali | ... | Deaths_SierraLeone | Deaths_Nigeria | Deaths_Senegal | Deaths_UnitedStates | Deaths_Spain | Deaths_Mali | date_dt | year | month | day |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2015-01-05 | 289 | 2776.0 | NaN | 10030.0 | NaN | NaN | NaN | NaN | NaN | ... | 2977.0 | NaN | NaN | NaN | NaN | NaN | 2015-01-05 | 2015 | 1 | 5 |
1 | 2015-01-04 | 288 | 2775.0 | NaN | 9780.0 | NaN | NaN | NaN | NaN | NaN | ... | 2943.0 | NaN | NaN | NaN | NaN | NaN | 2015-01-04 | 2015 | 1 | 4 |
2 | 2015-01-03 | 287 | 2769.0 | 8166.0 | 9722.0 | NaN | NaN | NaN | NaN | NaN | ... | 2915.0 | NaN | NaN | NaN | NaN | NaN | 2015-01-03 | 2015 | 1 | 3 |
3 | 2015-01-02 | 286 | NaN | 8157.0 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 2015-01-02 | 2015 | 1 | 2 |
4 | 2014-12-31 | 284 | 2730.0 | 8115.0 | 9633.0 | NaN | NaN | NaN | NaN | NaN | ... | 2827.0 | NaN | NaN | NaN | NaN | NaN | 2014-12-31 | 2014 | 12 | 31 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
117 | 2014-03-27 | 5 | 103.0 | 8.0 | 6.0 | NaN | NaN | NaN | NaN | NaN | ... | 5.0 | NaN | NaN | NaN | NaN | NaN | 2014-03-27 | 2014 | 3 | 27 |
118 | 2014-03-26 | 4 | 86.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 2014-03-26 | 2014 | 3 | 26 |
119 | 2014-03-25 | 3 | 86.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 2014-03-25 | 2014 | 3 | 25 |
120 | 2014-03-24 | 2 | 86.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 2014-03-24 | 2014 | 3 | 24 |
121 | 2014-03-22 | 0 | 49.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 2014-03-22 | 2014 | 3 | 22 |
122 rows × 22 columns
In [16]:
print(ebola.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122 entries, 0 to 121
Data columns (total 22 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Date                 122 non-null    datetime64[ns]
 1   Day                  122 non-null    int64
 2   Cases_Guinea         93 non-null     float64
 3   Cases_Liberia        83 non-null     float64
 4   Cases_SierraLeone    87 non-null     float64
 5   Cases_Nigeria        38 non-null     float64
 6   Cases_Senegal        25 non-null     float64
 7   Cases_UnitedStates   18 non-null     float64
 8   Cases_Spain          16 non-null     float64
 9   Cases_Mali           12 non-null     float64
 10  Deaths_Guinea        92 non-null     float64
 11  Deaths_Liberia       81 non-null     float64
 12  Deaths_SierraLeone   87 non-null     float64
 13  Deaths_Nigeria       38 non-null     float64
 14  Deaths_Senegal       22 non-null     float64
 15  Deaths_UnitedStates  18 non-null     float64
 16  Deaths_Spain         16 non-null     float64
 17  Deaths_Mali          12 non-null     float64
 18  date_dt              122 non-null    datetime64[ns]
 19  year                 122 non-null    int32
 20  month                122 non-null    int32
 21  day                  122 non-null    int32
dtypes: datetime64[ns](2), float64(16), int32(3), int64(1)
memory usage: 19.7 KB
None
In [17]:
print(ebola.iloc[-5:, :5])
          Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone
117 2014-03-27    5         103.0            8.0                6.0
118 2014-03-26    4          86.0            NaN                NaN
119 2014-03-25    3          86.0            NaN                NaN
120 2014-03-24    2          86.0            NaN                NaN
121 2014-03-22    0          49.0            NaN                NaN
In [18]:
ebola['outbreak_d'] = ebola['date_dt'] - ebola['date_dt'].min()
print(ebola[['Date', 'Day', 'outbreak_d']])
          Date  Day outbreak_d
0   2015-01-05  289   289 days
1   2015-01-04  288   288 days
2   2015-01-03  287   287 days
3   2015-01-02  286   286 days
4   2014-12-31  284   284 days
..         ...  ...        ...
117 2014-03-27    5     5 days
118 2014-03-26    4     4 days
119 2014-03-25    3     3 days
120 2014-03-24    2     2 days
121 2014-03-22    0     0 days

[122 rows x 3 columns]
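Subtracting a date from a datetime column yields a timedelta64 column, as above. A self-contained sketch with three of the dates from the dataset:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2014-03-22", "2014-03-27", "2015-01-05"]))
# subtracting the earliest date gives days elapsed as timedelta64 values
elapsed = dates - dates.min()
print(elapsed)
```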
In [19]:
print(ebola.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122 entries, 0 to 121
Data columns (total 23 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Date                 122 non-null    datetime64[ns]
 1   Day                  122 non-null    int64
 2   Cases_Guinea         93 non-null     float64
 3   Cases_Liberia        83 non-null     float64
 4   Cases_SierraLeone    87 non-null     float64
 5   Cases_Nigeria        38 non-null     float64
 6   Cases_Senegal        25 non-null     float64
 7   Cases_UnitedStates   18 non-null     float64
 8   Cases_Spain          16 non-null     float64
 9   Cases_Mali           12 non-null     float64
 10  Deaths_Guinea        92 non-null     float64
 11  Deaths_Liberia       81 non-null     float64
 12  Deaths_SierraLeone   87 non-null     float64
 13  Deaths_Nigeria       38 non-null     float64
 14  Deaths_Senegal       22 non-null     float64
 15  Deaths_UnitedStates  18 non-null     float64
 16  Deaths_Spain         16 non-null     float64
 17  Deaths_Mali          12 non-null     float64
 18  date_dt              122 non-null    datetime64[ns]
 19  year                 122 non-null    int32
 20  month                122 non-null    int32
 21  day                  122 non-null    int32
 22  outbreak_d           122 non-null    timedelta64[ns]
dtypes: datetime64[ns](2), float64(16), int32(3), int64(1), timedelta64[ns](1)
memory usage: 20.6 KB
None
In [20]:
banks = pd.read_csv('../data/banklist.csv')
print(banks.head())
                                           Bank Name                City  ST  \
0                                Fayette County Bank          Saint Elmo  IL
1  Guaranty Bank, (d/b/a BestBank in Georgia & Mi...           Milwaukee  WI
2                                     First NBC Bank         New Orleans  LA
3                                      Proficio Bank  Cottonwood Heights  UT
4                      Seaway Bank and Trust Company             Chicago  IL

    CERT                Acquiring Institution Closing Date Updated Date
0   1802            United Fidelity Bank, fsb    26-May-17    26-Jul-17
1  30003  First-Citizens Bank & Trust Company     5-May-17    26-Jul-17
2  58302                         Whitney Bank    28-Apr-17    26-Jul-17
3  35495                    Cache Valley Bank     3-Mar-17    18-May-17
4  19328                  State Bank of Texas    27-Jan-17    18-May-17
In [21]:
banks = pd.read_csv('../data/banklist.csv', parse_dates=['Closing Date', 'Updated Date'])
print(banks.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 553 entries, 0 to 552
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   Bank Name              553 non-null    object
 1   City                   553 non-null    object
 2   ST                     553 non-null    object
 3   CERT                   553 non-null    int64
 4   Acquiring Institution  553 non-null    object
 5   Closing Date           553 non-null    datetime64[ns]
 6   Updated Date           553 non-null    datetime64[ns]
dtypes: datetime64[ns](2), int64(1), object(4)
memory usage: 30.4+ KB
None
C:\Users\offse\AppData\Local\Temp\ipykernel_15392\3072937290.py:1: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
  banks = pd.read_csv('../data/banklist.csv', parse_dates=['Closing Date', 'Updated Date'])
In [22]:
banks = banks.assign(
closing_quarter=banks['Closing Date'].dt.quarter,
closing_year=banks['Closing Date'].dt.year
)
In [23]:
closing_year = banks.groupby(['closing_year']).size()
print(closing_year)
closing_year
2000      2
2001      4
2002     11
2003      3
2004      4
2007      3
2008     25
2009    140
2010    157
2011     92
2012     51
2013     24
2014     18
2015      8
2016      5
2017      6
dtype: int64
In [24]:
closing_year_q = (
banks
.groupby(['closing_year', 'closing_quarter'])
.size()
)
print(closing_year_q)
closing_year  closing_quarter
2000          4                   2
2001          1                   1
              2                   1
              3                   2
2002          1                   6
              2                   2
              3                   1
              4                   2
2003          1                   1
              2                   1
              4                   1
2004          1                   3
              2                   1
2007          1                   1
              3                   1
              4                   1
2008          1                   2
              2                   2
              3                   9
              4                  12
2009          1                  21
              2                  24
              3                  50
              4                  45
2010          1                  41
              2                  45
              3                  41
              4                  30
2011          1                  26
              2                  22
              3                  26
              4                  18
2012          1                  16
              2                  15
              3                  12
              4                   8
2013          1                   4
              2                  12
              3                   6
              4                   2
2014          1                   5
              2                   7
              3                   2
              4                   4
2015          1                   4
              2                   1
              3                   1
              4                   2
2016          1                   1
              2                   2
              3                   2
2017          1                   3
              2                   3
dtype: int64
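The grouped count above follows the standard groupby-then-`size()` pattern. A self-contained sketch with made-up rows:

```python
import pandas as pd

df = pd.DataFrame({
    "closing_year": [2009, 2009, 2010],
    "closing_quarter": [1, 1, 3],
})
# size() counts rows per (year, quarter) group, returning a MultiIndex Series
counts = df.groupby(["closing_year", "closing_quarter"]).size()
print(counts)
```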
In [25]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax = closing_year.plot()
plt.show()
In [26]:
fig, ax = plt.subplots()
ax = closing_year_q.plot()
plt.show()
In [27]:
import pandas_datareader.data as web
tesla = web.DataReader('TSLA', 'stooq')
print(tesla)
                Open      High       Low    Close     Volume
Date
2024-01-08  236.1400  241.2500  235.3000  240.4500   85166580
2024-01-05  236.8600  240.1196  234.9001  237.4900   92488939
2024-01-04  239.2500  242.7000  237.7300  237.9300  102629283
2024-01-03  244.9800  245.6800  236.3200  238.4500  121082599
2024-01-02  250.0800  251.2500  244.4100  248.4200  104654163
...              ...       ...       ...      ...        ...
2019-01-16   22.9853   23.4667   22.9000  23.0700    70376085
2019-01-15   22.3333   23.2533   22.3000  22.9620    90848850
2019-01-14   22.8253   22.8333   22.2667  22.2933    78709260
2019-01-11   22.8060   23.2273   22.5847  23.1507    75585780
2019-01-10   22.2933   23.0260   22.1193  22.9980    90845310

[1257 rows x 5 columns]
In [28]:
tesla.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1257 entries, 2024-01-08 to 2019-01-10
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Open    1257 non-null   float64
 1   High    1257 non-null   float64
 2   Low     1257 non-null   float64
 3   Close   1257 non-null   float64
 4   Volume  1257 non-null   int64
dtypes: float64(4), int64(1)
memory usage: 58.9 KB
In [29]:
tesla.index
Out[29]:
DatetimeIndex(['2024-01-08', '2024-01-05', '2024-01-04', '2024-01-03', '2024-01-02', '2023-12-29', '2023-12-28', '2023-12-27', '2023-12-26', '2023-12-22', ... '2019-01-24', '2019-01-23', '2019-01-22', '2019-01-18', '2019-01-17', '2019-01-16', '2019-01-15', '2019-01-14', '2019-01-11', '2019-01-10'], dtype='datetime64[ns]', name='Date', length=1257, freq=None)
In [33]:
tesla = pd.read_csv('../data/tesla_stock_yahoo.csv', parse_dates=['Date'])
print(tesla.head())
        Date       Open   High        Low      Close  Adj Close    Volume
0 2010-06-29  19.000000  25.00  17.540001  23.889999  23.889999  18766300
1 2010-06-30  25.790001  30.42  23.299999  23.830000  23.830000  17187100
2 2010-07-01  25.000000  25.92  20.270000  21.959999  21.959999   8218800
3 2010-07-02  23.000000  23.10  18.709999  19.200001  19.200001   5139800
4 2010-07-06  20.000000  20.00  15.830000  16.110001  16.110001   6866900
In [34]:
print(tesla.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1791 entries, 0 to 1790
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Date       1791 non-null   datetime64[ns]
 1   Open       1791 non-null   float64
 2   High       1791 non-null   float64
 3   Low        1791 non-null   float64
 4   Close      1791 non-null   float64
 5   Adj Close  1791 non-null   float64
 6   Volume     1791 non-null   int64
dtypes: datetime64[ns](1), float64(5), int64(1)
memory usage: 98.1 KB
None
In [35]:
print(
tesla.loc[
(tesla.Date.dt.year == 2010) & (tesla.Date.dt.month == 6)
]
)
        Date       Open   High        Low      Close  Adj Close    Volume
0 2010-06-29  19.000000  25.00  17.540001  23.889999  23.889999  18766300
1 2010-06-30  25.790001  30.42  23.299999  23.830000  23.830000  17187100
In [36]:
tesla.index = tesla['Date']
print(tesla.index)
DatetimeIndex(['2010-06-29', '2010-06-30', '2010-07-01', '2010-07-02', '2010-07-06', '2010-07-07', '2010-07-08', '2010-07-09', '2010-07-12', '2010-07-13', ... '2017-07-26', '2017-07-27', '2017-07-28', '2017-07-31', '2017-08-01', '2017-08-02', '2017-08-03', '2017-08-04', '2017-08-07', '2017-08-08'], dtype='datetime64[ns]', name='Date', length=1791, freq=None)
In [37]:
print(tesla.loc['2015'])
                 Date        Open        High         Low       Close  \
Date
2015-01-02 2015-01-02  222.869995  223.250000  213.259995  219.309998
2015-01-05 2015-01-05  214.550003  216.500000  207.160004  210.089996
2015-01-06 2015-01-06  210.059998  214.199997  204.210007  211.279999
2015-01-07 2015-01-07  213.350006  214.779999  209.779999  210.949997
2015-01-08 2015-01-08  212.809998  213.800003  210.009995  210.619995
...               ...         ...         ...         ...         ...
2015-12-24 2015-12-24  230.559998  231.880005  228.279999  230.570007
2015-12-28 2015-12-28  231.490005  231.979996  225.539993  228.949997
2015-12-29 2015-12-29  230.059998  237.720001  229.550003  237.190002
2015-12-30 2015-12-30  236.600006  243.630005  235.669998  238.089996
2015-12-31 2015-12-31  238.509995  243.449997  238.369995  240.009995

             Adj Close   Volume
Date
2015-01-02  219.309998  4764400
2015-01-05  210.089996  5368500
2015-01-06  211.279999  6261900
2015-01-07  210.949997  2968400
2015-01-08  210.619995  3442500
...                ...      ...
2015-12-24  230.570007   708000
2015-12-28  228.949997  1901300
2015-12-29  237.190002  2406300
2015-12-30  238.089996  3697900
2015-12-31  240.009995  2683200

[252 rows x 7 columns]
In [38]:
print(tesla.loc['2010-06'])
                 Date       Open   High        Low      Close  Adj Close  \
Date
2010-06-29 2010-06-29  19.000000  25.00  17.540001  23.889999  23.889999
2010-06-30 2010-06-30  25.790001  30.42  23.299999  23.830000  23.830000

              Volume
Date
2010-06-29  18766300
2010-06-30  17187100
In [39]:
tesla['ref_date'] = tesla['Date'] - tesla['Date'].min()
In [40]:
tesla.index = tesla['ref_date']
print(tesla.index)
TimedeltaIndex([ '0 days', '1 days', '2 days', '3 days', '7 days', '8 days', '9 days', '10 days', '13 days', '14 days', ... '2584 days', '2585 days', '2586 days', '2589 days', '2590 days', '2591 days', '2592 days', '2593 days', '2596 days', '2597 days'], dtype='timedelta64[ns]', name='ref_date', length=1791, freq=None)
In [41]:
print(tesla)
                 Date        Open        High         Low       Close  \
ref_date
0 days     2010-06-29   19.000000   25.000000   17.540001   23.889999
1 days     2010-06-30   25.790001   30.420000   23.299999   23.830000
2 days     2010-07-01   25.000000   25.920000   20.270000   21.959999
3 days     2010-07-02   23.000000   23.100000   18.709999   19.200001
7 days     2010-07-06   20.000000   20.000000   15.830000   16.110001
...               ...         ...         ...         ...         ...
2591 days  2017-08-02  318.940002  327.119995  311.220001  325.890015
2592 days  2017-08-03  345.329987  350.000000  343.149994  347.089996
2593 days  2017-08-04  347.000000  357.269989  343.299988  356.910004
2596 days  2017-08-07  357.350006  359.480011  352.750000  355.170013
2597 days  2017-08-08  357.529999  368.579987  357.399994  365.220001

            Adj Close    Volume   ref_date
ref_date
0 days      23.889999  18766300     0 days
1 days      23.830000  17187100     1 days
2 days      21.959999   8218800     2 days
3 days      19.200001   5139800     3 days
7 days      16.110001   6866900     7 days
...               ...       ...        ...
2591 days  325.890015  13091500  2591 days
2592 days  347.089996  13535000  2592 days
2593 days  356.910004   9198400  2593 days
2596 days  355.170013   6276900  2596 days
2597 days  365.220001   7449837  2597 days

[1791 rows x 8 columns]
In [45]:
print(tesla.loc['0 days': '10 days'])
               Date       Open       High        Low      Close  Adj Close  \
ref_date
0 days   2010-06-29  19.000000  25.000000  17.540001  23.889999  23.889999
1 days   2010-06-30  25.790001  30.420000  23.299999  23.830000  23.830000
2 days   2010-07-01  25.000000  25.920000  20.270000  21.959999  21.959999
3 days   2010-07-02  23.000000  23.100000  18.709999  19.200001  19.200001
7 days   2010-07-06  20.000000  20.000000  15.830000  16.110001  16.110001
8 days   2010-07-07  16.400000  16.629999  14.980000  15.800000  15.800000
9 days   2010-07-08  16.139999  17.520000  15.570000  17.459999  17.459999
10 days  2010-07-09  17.580000  17.900000  16.549999  17.400000  17.400000

           Volume ref_date
ref_date
0 days   18766300   0 days
1 days   17187100   1 days
2 days    8218800   2 days
3 days    5139800   3 days
7 days    6866900   7 days
8 days    6921700   8 days
9 days    7711400   9 days
10 days   4050600  10 days
In [49]:
tesla.plot(y=['High'])
Out[49]:
<Axes: xlabel='Date'>
In [50]:
ebola = pd.read_csv(
'../data/country_timeseries.csv', parse_dates=['Date']
)
print(ebola.iloc[:, :5])
          Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone
0   2015-01-05  289        2776.0            NaN            10030.0
1   2015-01-04  288        2775.0            NaN             9780.0
2   2015-01-03  287        2769.0         8166.0             9722.0
3   2015-01-02  286           NaN         8157.0                NaN
4   2014-12-31  284        2730.0         8115.0             9633.0
..         ...  ...           ...            ...                ...
117 2014-03-27    5         103.0            8.0                6.0
118 2014-03-26    4          86.0            NaN                NaN
119 2014-03-25    3          86.0            NaN                NaN
120 2014-03-24    2          86.0            NaN                NaN
121 2014-03-22    0          49.0            NaN                NaN

[122 rows x 5 columns]
In [51]:
head_range = pd.date_range(start='2014-12-31', end='2015-01-05')
print(head_range)
DatetimeIndex(['2014-12-31', '2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04', '2015-01-05'], dtype='datetime64[ns]', freq='D')
In [52]:
ebola_5 = ebola.head()
In [53]:
ebola_5.index = ebola_5['Date']
In [54]:
ebola_5 = ebola_5.reindex(head_range)
print(ebola_5.iloc[:, :5])
                 Date    Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone
2014-12-31 2014-12-31  284.0        2730.0         8115.0             9633.0
2015-01-01        NaT    NaN           NaN            NaN                NaN
2015-01-02 2015-01-02  286.0           NaN         8157.0                NaN
2015-01-03 2015-01-03  287.0        2769.0         8166.0             9722.0
2015-01-04 2015-01-04  288.0        2775.0            NaN             9780.0
2015-01-05 2015-01-05  289.0        2776.0            NaN            10030.0
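The reindex step above can be reduced to a minimal self-contained sketch; the missing date in the range comes back as an all-NaN row:

```python
import pandas as pd

df = pd.DataFrame(
    {"value": [1, 2]},
    index=pd.to_datetime(["2014-12-31", "2015-01-02"]),
)
full_range = pd.date_range("2014-12-31", "2015-01-02")
# reindexing inserts the missing 2015-01-01 as an all-NaN row
df = df.reindex(full_range)
print(df)
```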
In [55]:
print(pd.date_range('2022-01-01', '2022-01-07', freq='B'))
DatetimeIndex(['2022-01-03', '2022-01-04', '2022-01-05', '2022-01-06', '2022-01-07'], dtype='datetime64[ns]', freq='B')
In [56]:
print(pd.date_range('2022-01-01', '2022-01-07', freq='2B'))
DatetimeIndex(['2022-01-03', '2022-01-05', '2022-01-07'], dtype='datetime64[ns]', freq='2B')
In [60]:
print(pd.date_range('2022-01-01', '2022-12-31', freq='WOM-1THU'))  # first Thursday of each month
DatetimeIndex(['2022-01-06', '2022-02-03', '2022-03-03', '2022-04-07', '2022-05-05', '2022-06-02', '2022-07-07', '2022-08-04', '2022-09-01', '2022-10-06', '2022-11-03', '2022-12-01'], dtype='datetime64[ns]', freq='WOM-1THU')
In [61]:
print(pd.date_range('2022-01-01', '2022-12-31', freq='WOM-3FRI'))  # third Friday of each month
DatetimeIndex(['2022-01-21', '2022-02-18', '2022-03-18', '2022-04-15', '2022-05-20', '2022-06-17', '2022-07-15', '2022-08-19', '2022-09-16', '2022-10-21', '2022-11-18', '2022-12-16'], dtype='datetime64[ns]', freq='WOM-3FRI')
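The frequency aliases used above compose with `date_range` in the same way; a short sketch over a smaller window:

```python
import pandas as pd

# 'B' = business days (weekends skipped); 'WOM-3FRI' = 3rd Friday of the month
bdays = pd.date_range('2022-01-01', '2022-01-07', freq='B')
third_fridays = pd.date_range('2022-01-01', '2022-02-28', freq='WOM-3FRI')
print(bdays)
print(third_fridays)
```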
In [62]:
ebola.index = ebola['Date']
fig, ax = plt.subplots()
ax = ebola.plot(ax=ax)
ax.legend(fontsize=7, loc=2, borderaxespad=0.0)
plt.show()
In [63]:
ebola_sub = ebola[['Day', 'Cases_Guinea', 'Cases_Liberia']]
print(ebola_sub.tail(10))
            Day  Cases_Guinea  Cases_Liberia
Date
2014-04-04   13         143.0           18.0
2014-04-01   10         127.0            8.0
2014-03-31    9         122.0            8.0
2014-03-29    7         112.0            7.0
2014-03-28    6         112.0            3.0
2014-03-27    5         103.0            8.0
2014-03-26    4          86.0            NaN
2014-03-25    3          86.0            NaN
2014-03-24    2          86.0            NaN
2014-03-22    0          49.0            NaN
In [64]:
ebola = pd.read_csv(
"../data/country_timeseries.csv",
parse_dates=["Date"],
index_col="Date",
)
print(ebola.iloc[:, :4])
            Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone
Date
2015-01-05  289        2776.0            NaN            10030.0
2015-01-04  288        2775.0            NaN             9780.0
2015-01-03  287        2769.0         8166.0             9722.0
2015-01-02  286           NaN         8157.0                NaN
2014-12-31  284        2730.0         8115.0             9633.0
...         ...           ...            ...                ...
2014-03-27    5         103.0            8.0                6.0
2014-03-26    4          86.0            NaN                NaN
2014-03-25    3          86.0            NaN                NaN
2014-03-24    2          86.0            NaN                NaN
2014-03-22    0          49.0            NaN                NaN

[122 rows x 4 columns]
In [65]:
new_idx = pd.date_range(ebola.index.min(), ebola.index.max())
print(new_idx)
DatetimeIndex(['2014-03-22', '2014-03-23', '2014-03-24', '2014-03-25', '2014-03-26', '2014-03-27', '2014-03-28', '2014-03-29', '2014-03-30', '2014-03-31', ... '2014-12-27', '2014-12-28', '2014-12-29', '2014-12-30', '2014-12-31', '2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04', '2015-01-05'], dtype='datetime64[ns]', length=290, freq='D')
In [66]:
new_idx = reversed(new_idx)
print(new_idx)
<reversed object at 0x000001F9ADCE0160>
In [67]:
ebola = ebola.reindex(new_idx)
In [68]:
print(ebola.iloc[:, :4])
              Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone
Date
2015-01-05  289.0        2776.0            NaN            10030.0
2015-01-04  288.0        2775.0            NaN             9780.0
2015-01-03  287.0        2769.0         8166.0             9722.0
2015-01-02  286.0           NaN         8157.0                NaN
2015-01-01    NaN           NaN            NaN                NaN
...           ...           ...            ...                ...
2014-03-26    4.0          86.0            NaN                NaN
2014-03-25    3.0          86.0            NaN                NaN
2014-03-24    2.0          86.0            NaN                NaN
2014-03-23    NaN           NaN            NaN                NaN
2014-03-22    0.0          49.0            NaN                NaN

[290 rows x 4 columns]
In [69]:
last_valid = ebola.apply(pd.Series.last_valid_index)
print(last_valid)
Day                  2014-03-22
Cases_Guinea         2014-03-22
Cases_Liberia        2014-03-27
Cases_SierraLeone    2014-03-27
Cases_Nigeria        2014-07-23
Cases_Senegal        2014-08-31
Cases_UnitedStates   2014-10-01
Cases_Spain          2014-10-08
Cases_Mali           2014-10-22
Deaths_Guinea        2014-03-22
Deaths_Liberia       2014-03-27
Deaths_SierraLeone   2014-03-27
Deaths_Nigeria       2014-07-23
Deaths_Senegal       2014-09-07
Deaths_UnitedStates  2014-10-01
Deaths_Spain         2014-10-08
Deaths_Mali          2014-10-22
dtype: datetime64[ns]
In [70]:
earliest_date = ebola.index.min()
print(earliest_date)
2014-03-22 00:00:00
In [71]:
shift_values = last_valid - earliest_date
print(shift_values)
Day                     0 days
Cases_Guinea            0 days
Cases_Liberia           5 days
Cases_SierraLeone       5 days
Cases_Nigeria         123 days
Cases_Senegal         162 days
Cases_UnitedStates    193 days
Cases_Spain           200 days
Cases_Mali            214 days
Deaths_Guinea           0 days
Deaths_Liberia          5 days
Deaths_SierraLeone      5 days
Deaths_Nigeria        123 days
Deaths_Senegal        169 days
Deaths_UnitedStates   193 days
Deaths_Spain          200 days
Deaths_Mali           214 days
dtype: timedelta64[ns]
In [73]:
ebola_dict = {}
for idx, col in enumerate(ebola):
d = shift_values[idx].days
shifted = ebola[col].shift(d)
ebola_dict[col] = shifted
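The loop above relies on `Series.shift`, which moves values along the index. A minimal sketch with a made-up Series:

```python
import pandas as pd

s = pd.Series([10, 20, 30])
# shift(1) moves every value down one row; NaN fills the vacated top slot
shifted = s.shift(1)
print(shifted)
```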
In [74]:
ebola_shift = pd.DataFrame(ebola_dict)
In [75]:
print(ebola_shift.tail())
              Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone  \
Date
2014-03-26    4.0          86.0            8.0                2.0
2014-03-25    3.0          86.0            NaN                NaN
2014-03-24    2.0          86.0            7.0                NaN
2014-03-23    NaN           NaN            3.0                2.0
2014-03-22    0.0          49.0            8.0                6.0

            Cases_Nigeria  Cases_Senegal  Cases_UnitedStates  Cases_Spain  \
Date
2014-03-26            1.0            NaN                 1.0          1.0
2014-03-25            NaN            NaN                 NaN          NaN
2014-03-24            NaN            NaN                 NaN          NaN
2014-03-23            NaN            NaN                 NaN          NaN
2014-03-22            0.0            1.0                 1.0          1.0

            Cases_Mali  Deaths_Guinea  Deaths_Liberia  Deaths_SierraLeone  \
Date
2014-03-26         NaN           62.0             4.0                 2.0
2014-03-25         NaN           60.0             NaN                 NaN
2014-03-24         NaN           59.0             2.0                 NaN
2014-03-23         NaN            NaN             3.0                 2.0
2014-03-22         1.0           29.0             6.0                 5.0

            Deaths_Nigeria  Deaths_Senegal  Deaths_UnitedStates  Deaths_Spain  \
Date
2014-03-26             1.0             NaN                  0.0           1.0
2014-03-25             NaN             NaN                  NaN           NaN
2014-03-24             NaN             NaN                  NaN           NaN
2014-03-23             NaN             NaN                  NaN           NaN
2014-03-22             0.0             0.0                  0.0           1.0

            Deaths_Mali
Date
2014-03-26          NaN
2014-03-25          NaN
2014-03-24          NaN
2014-03-23          NaN
2014-03-22          1.0
In [76]:
ebola_shift.index = ebola_shift['Day']
ebola_shift = ebola_shift.drop(['Day'], axis="columns")
print(ebola_shift.tail())
      Cases_Guinea  Cases_Liberia  Cases_SierraLeone  Cases_Nigeria  \
Day
4.0           86.0            8.0                2.0            1.0
3.0           86.0            NaN                NaN            NaN
2.0           86.0            7.0                NaN            NaN
NaN            NaN            3.0                2.0            NaN
0.0           49.0            8.0                6.0            0.0

     Cases_Senegal  Cases_UnitedStates  Cases_Spain  Cases_Mali  \
Day
4.0            NaN                 1.0          1.0         NaN
3.0            NaN                 NaN          NaN         NaN
2.0            NaN                 NaN          NaN         NaN
NaN            NaN                 NaN          NaN         NaN
0.0            1.0                 1.0          1.0         1.0

     Deaths_Guinea  Deaths_Liberia  Deaths_SierraLeone  Deaths_Nigeria  \
Day
4.0           62.0             4.0                 2.0             1.0
3.0           60.0             NaN                 NaN             NaN
2.0           59.0             2.0                 NaN             NaN
NaN            NaN             3.0                 2.0             NaN
0.0           29.0             6.0                 5.0             0.0

     Deaths_Senegal  Deaths_UnitedStates  Deaths_Spain  Deaths_Mali
Day
4.0             NaN                  0.0           1.0          NaN
3.0             NaN                  NaN           NaN          NaN
2.0             NaN                  NaN           NaN          NaN
NaN             NaN                  NaN           NaN          NaN
0.0             0.0                  0.0           1.0          1.0
In [77]:
ebola.index = ebola['Day']
fig, ax = plt.subplots()
ax = ebola.plot(ax=ax)
ax.legend(fontsize=7, loc=2, borderaxespad=0.0)
plt.show()
In [79]:
ebola = pd.read_csv(
"../data/country_timeseries.csv",
parse_dates=["Date"],
index_col="Date",
)
down = ebola.resample('M').mean()
print(down.iloc[:, :5])
                   Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone  \
Date
2014-03-31    4.500000     94.500000       6.500000           3.333333
2014-04-30   24.333333    177.818182      24.555556           2.200000
2014-05-31   51.888889    248.777778      12.555556           7.333333
2014-06-30   84.636364    373.428571      35.500000         125.571429
2014-07-31  115.700000    423.000000     212.300000         420.500000
2014-08-31  145.090909    559.818182     868.818182         844.000000
2014-09-30  177.500000    967.888889    2815.625000        1726.000000
2014-10-31  207.470588   1500.444444    4758.750000        3668.111111
2014-11-30  237.214286   1950.500000    7039.000000        5843.625000
2014-12-31  271.181818   2579.625000    7902.571429        8985.875000
2015-01-31  287.500000   2773.333333    8161.500000        9844.000000

            Cases_Nigeria
Date
2014-03-31            NaN
2014-04-30            NaN
2014-05-31            NaN
2014-06-30            NaN
2014-07-31       1.333333
2014-08-31      13.363636
2014-09-30      20.714286
2014-10-31      20.000000
2014-11-30      20.000000
2014-12-31      20.000000
2015-01-31            NaN
In [80]:
up = down.resample('D').mean()
print(up.iloc[:, :5])
              Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone  \
Date
2014-03-31    4.5     94.500000            6.5           3.333333
2014-04-01    NaN           NaN            NaN                NaN
2014-04-02    NaN           NaN            NaN                NaN
2014-04-03    NaN           NaN            NaN                NaN
2014-04-04    NaN           NaN            NaN                NaN
...           ...           ...            ...                ...
2015-01-27    NaN           NaN            NaN                NaN
2015-01-28    NaN           NaN            NaN                NaN
2015-01-29    NaN           NaN            NaN                NaN
2015-01-30    NaN           NaN            NaN                NaN
2015-01-31  287.5   2773.333333         8161.5        9844.000000

            Cases_Nigeria
Date
2014-03-31            NaN
2014-04-01            NaN
2014-04-02            NaN
2014-04-03            NaN
2014-04-04            NaN
...                   ...
2015-01-27            NaN
2015-01-28            NaN
2015-01-29            NaN
2015-01-30            NaN
2015-01-31            NaN

[307 rows x 5 columns]
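The down/upsampling above can be reduced to a tiny self-contained sketch with three made-up daily values (using the same `'M'` month-end alias as the cells above; newer pandas versions prefer `'ME'`):

```python
import pandas as pd

idx = pd.to_datetime(["2014-03-30", "2014-03-31", "2014-04-01"])
s = pd.Series([1.0, 3.0, 5.0], index=idx)
# downsample daily values to month-end means
monthly = s.resample("M").mean()
print(monthly)
```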
In [81]:
import pytz
In [82]:
print(len(pytz.all_timezones))
596
In [83]:
import re
regex = re.compile(r'^US')
selected_files = filter(regex.search, pytz.common_timezones)
print(list(selected_files))
['US/Alaska', 'US/Arizona', 'US/Central', 'US/Eastern', 'US/Hawaii', 'US/Mountain', 'US/Pacific']
In [84]:
depart = pd.Timestamp('2017-08-29 07:00', tz='US/Eastern')
print(depart)
2017-08-29 07:00:00-04:00
In [85]:
arrive = pd.Timestamp('2017-08-29 09:57')
print(arrive)
2017-08-29 09:57:00
In [86]:
arrive = arrive.tz_localize('US/Pacific')
print(arrive)
2017-08-29 09:57:00-07:00
In [87]:
print(arrive.tz_convert('US/Eastern'))
2017-08-29 12:57:00-04:00
In [88]:
duration = arrive - depart
print(duration)
0 days 05:57:00
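The whole flight-duration calculation fits in a few lines; because both timestamps are timezone-aware, the subtraction compares actual instants, so the 3-hour offset between the zones is accounted for:

```python
import pandas as pd

depart = pd.Timestamp('2017-08-29 07:00', tz='US/Eastern')
arrive = pd.Timestamp('2017-08-29 09:57').tz_localize('US/Pacific')
# tz-aware arithmetic compares instants in UTC, including the offset difference
duration = arrive - depart
print(duration)
```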