Pandas Study Notes (2)

"Pandas Study Notes"

Posted by jhljx on January 29, 2018

Table of Contents

1. Writing Unicode characters to a file in Python 2

df['petal area'] = df.apply(lambda r: r['petal length'] * r['petal width'], axis=1)

# For apply, axis=1 passes each row to the function (so r above is a row); axis=0 passes each column
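A minimal sketch, using a made-up two-column DataFrame, to illustrate the two axis values of apply:

import pandas as pd

demo = pd.DataFrame({'petal length': [1.4, 4.7], 'petal width': [0.2, 1.4]})  # hypothetical data
# axis=1: the lambda receives one row at a time, so r['petal length'] is a scalar
demo['petal area'] = demo.apply(lambda r: r['petal length'] * r['petal width'], axis=1)
# axis=0: the function receives one column (a Series) at a time
col_sums = demo.apply(lambda col: col.sum(), axis=0)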

Apply a function to every cell of a DataFrame: df.applymap(lambda v: np.log(v) if isinstance(v, float) else v)

X_train, X_test = X[:idx], X[idx:]
y_train = ipos['$ Chg Open to Close'][:idx].map(lambda x: 1 if x >= .25 else 0)
y_test = ipos['$ Chg Open to Close'][idx:].map(lambda x: 1 if x >= .25 else 0)

What does df.groupby('class') return? A DataFrameGroupBy object, which you then aggregate, e.g. df.groupby('class').mean()
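A small sketch (hypothetical DataFrame) showing that groupby only builds the grouping and the actual computation happens at aggregation time:

import pandas as pd

df_demo = pd.DataFrame({'class': ['a', 'a', 'b'], 'petal width': [0.2, 0.4, 1.4]})  # hypothetical data
grouped = df_demo.groupby('class')   # DataFrameGroupBy object, nothing computed yet
print(type(grouped))                 # pandas.core.groupby.generic.DataFrameGroupBy
print(grouped.mean())                # per-class mean of the numeric columns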

Two programs I wrote earlier:

import numpy as np
import pandas as pd
from datetime import datetime

def strToDate(strSample):
    return datetime.strptime(strSample, "%Y%m%d")

# Read only the date, fee and transaction-code columns of the card records
card_df = pd.read_csv('cards.csv', usecols=[1,3,9], header=0, names=['Date', 'Fee', 'TrCode'],
    dtype={'Date':str, 'Fee':np.float32, 'TrCode':np.int32})
card_df['Date'] = card_df['Date'].map(strToDate)
# Sum the fees per (Date, TrCode) pair, then pivot TrCode into columns
card_df = card_df['Fee'].groupby([card_df['Date'], card_df['TrCode']]).sum()
card_df = card_df.unstack('TrCode')
card_df.to_csv("card_out.csv", na_rep=0)



card_df = pd.read_csv('cards.csv', usecols=[0,1,2,3,9], header=0, names=['StuId', 'Date', 'Time', 'Fee', 'TrCode'],
    dtype={'StuId':str, 'Date':str, 'Time':np.int32, 'Fee':np.float32, 'TrCode':np.int32})
card_df['Date'] = card_df['Date'].map(strToDate)
card_df['Time'] = card_df['Time'].map(transfer_time_to_section)  # transfer_time_to_section is defined elsewhere
# Keep only transaction code 210 with a fee of at most 50
sub_card = (card_df['TrCode'] == 210) & (card_df['Fee'] <= 50.0)
card_df_grouped_count = card_df[sub_card]['Fee'].groupby([card_df['StuId']]).count()
# Sum fees per (StuId, Date, Time), pivot Time into columns, then average over dates per student
card_df_grouped_fee = card_df[sub_card]['Fee'].groupby([card_df['StuId'], card_df['Date'], card_df['Time']]).sum().unstack('Time')
card_df_grouped_fee = card_df_grouped_fee.groupby(level='StuId', axis=0).mean()
card_df_grouped_fee['Count'] = card_df_grouped_count
card_df_grouped_fee.to_csv("cards_out.csv", na_rep=0)

Creating a DataFrame: http://blog.csdn.net/zutsoft/article/details/51483710

How do you rename DataFrame columns? I have searched this before; a quick sketch is below (rename also appears again later in these notes).
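A minimal sketch (hypothetical column names) of the two usual ways to rename columns:

import pandas as pd

df_demo = pd.DataFrame({'a': [1], 'b': [2]})       # hypothetical data
df_demo = df_demo.rename(columns={'a': 'alpha'})   # rename selected columns via a mapping
df_demo.columns = ['alpha', 'beta']                # or replace the whole column list at once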

df.groupby('petal width')['class'].unique().to_frame()

df.groupby('class')['petal width'].agg({'delta': lambda x: x.max() - x.min(), 'max': np.max, 'min': np.min}) # Group petal width by class and aggregate; the dict keys become the column names of the resulting DataFrame (passing a dict to agg on a grouped Series was deprecated in later pandas versions)

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was renamed to model_selection
clf = RandomForestClassifier(max_depth=5, n_estimators=10)
X = df.iloc[:,:4]  # .ix is deprecated; use positional .iloc
y = df.iloc[:,4]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
rf = pd.DataFrame(list(zip(y_pred, y_test)), columns=['predicted', 'actual'])  # note how this DataFrame is constructed
rf['correct'] = rf.apply(lambda r: 1 if r['predicted'] == r['actual'] else 0, axis=1)
rf
rf['correct'].sum()/rf['correct'].count()
# In Python 2 this division truncates; add: from __future__ import division

sklearn's train_test_split shuffles the rows, so the original order of the data is not preserved; the relevant parameters are sketched below.
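A short sketch (made-up arrays) of the standard train_test_split arguments that control this behaviour:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # hypothetical features
y = np.arange(10)                  # hypothetical labels
# random_state makes the shuffle reproducible; shuffle=False keeps the original row order
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, shuffle=False)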

Common pandas display options:

pd.set_option("display.max_columns", 30)
pd.set_option("display.max_colwidth", 100)
pd.set_option("display.precision", 3)

df[(df['class']=='Iris-virginica')&(df['petal width']>2.2)]

df.columns returns the list of the DataFrame's column labels.

df.T transposes the DataFrame.

df.T.ix[:,1:2]
df.ix[:3,:2]
df.ix[:3, [x for x in df.columns if 'width' in x]]
# .ix is deprecated; prefer .loc (label-based) or .iloc (position-based)

df['class'].unique()

nn = df[df['routable_link/_text'].str.contains('203 Rivington')|df['routable_link/_text'].str.contains('280 E 2nd')]

What does df['xxx'].str in the statement above mean? .str is the string accessor of a Series: it exposes vectorized string methods (contains, replace, split, ...) that are applied element-wise.
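A tiny sketch (made-up addresses) of the .str accessor:

import pandas as pd

s = pd.Series(['203 Rivington St', '280 E 2nd St', 'Main St'])  # hypothetical data
print(s.str.contains('Rivington'))   # element-wise boolean Series
print(s.str.split(' ').str[0])       # first token of each string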

len(su[su['pricelarge_value_prices'].isnull()])

# Count the rows that contain neither 'bd' nor 'Studio'
len(su[~(su['propertyinfo_value'].str.contains('Studio')|su['propertyinfo_value'].str.contains('bd'))])

sucln = su[~su.index.isin(no_baths.index)]

tt[tt['title'].str.contains('Dies')]

def parse_info(row):
    if not 'sqft' in row:  # the 'in' operator checks whether the string contains 'sqft'
        br, ba = row.split('•')[:2]
        sqft = np.nan
    else:
        br, ba, sqft = row.split('•')[:3]
    return pd.Series({'Beds': br, 'Baths': ba, 'Sqft': sqft})

attr = sucln['propertyinfo_value'].apply(parse_info)

attr_cln = attr.applymap(lambda x: x.strip().split(' ')[0] if isinstance(x,str) else np.nan)  # handy when a cell mixes a number with text: keep only the first token

sujnd = sucln.join(attr_cln)

What does the DataFrame join method do? It adds the columns of another DataFrame, aligning rows on the index (a left join by default); see the reference link and the sketch below.

https://www.cnblogs.com/huiyang865/p/5553494.html
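A minimal sketch (hypothetical frames) of the default index-aligned left join:

import pandas as pd

left = pd.DataFrame({'price': [100, 200]}, index=['a', 'b'])  # hypothetical data
right = pd.DataFrame({'beds': [1]}, index=['a'])
print(left.join(right))   # rows aligned on the index; missing matches become NaN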

def parse_addy(r):
    so_zip = re.search(r', NY(\d+)', r)
    so_flr = re.search(r'(?:APT|#)\s+(\d+)[A-Z]+,', r)
    if so_zip:
        zipc = so_zip.group(1)
    else:
        zipc = np.nan
    if so_flr:
        flr = so_flr.group(1)
    else:
        flr = np.nan
    return pd.Series({'Zip': zipc, 'Floor': flr})

flrzip = sujnd['routable_link/_text'].apply(parse_addy)

Applying parse_addy to a single DataFrame column produces a new DataFrame, because the function returns a pd.Series per row; adjust the returned fields to whatever you need.

df["Status"] = df["Status"].astype("category")
df["Status"].cat.set_categories(["won","pending","presented","declined"], inplace=True)

Take only specific columns from a DataFrame: sudf = suf[['pricelarge_value_prices','Beds','Baths','Sqft','Floor','Zip']]

Rename a column: sudf.rename(columns={'pricelarge_value_prices':'Rent'}, inplace=True)

sudf.reset_index(drop=True, inplace=True)

# Replace occurrences of 'Studio' with 0
sudf.loc[:,'Beds'] = sudf['Beds'].map(lambda x: 0 if 'Studio' in x else x)

sudf.info() shows the dtype of every column of the DataFrame.

sudf.loc[:,'Rent'] = sudf['Rent'].astype(int)
sudf.loc[:,'Beds'] = sudf['Beds'].astype(int)

# For numbers written with ',' thousands separators, strip the ',' first
sudf.loc[:,'Sqft'] = sudf['Sqft'].str.replace(',','')

ipos.replace(‘N/C’,0, inplace=True)

Select the sub-DataFrame that satisfies a condition: sudf[sudf['Floor']>5]

sudf = sudf.drop([318])

What does sudf.pivot_table('Rent', 'Zip', 'Beds', aggfunc='mean') mean? With Zip as the rows and the values of Beds as the columns, it computes the mean Rent for each cell.

pd.pivot_table(df, index=["Manager","Rep"], values=["Price"], aggfunc=np.sum)

pandas grouping and aggregation functions: groupby, pivot_table, and crosstab http://blog.csdn.net/elecjack/article/details/50760736

DataFrame.pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')

pd.pivot_table(df, index=["Manager","Rep"], values=["Price"]) uses Manager and Rep as a two-level row index and Price as the value column, aggregated with the default mean; a sketch is below.
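A small sketch (made-up sales data) of pivot_table with a two-level index and an explicit aggregation:

import numpy as np
import pandas as pd

sales = pd.DataFrame({'Manager': ['Ann', 'Ann', 'Bob'],   # hypothetical data
                      'Rep': ['Joe', 'Sue', 'Joe'],
                      'Price': [100, 200, 300]})
print(pd.pivot_table(sales, index=['Manager', 'Rep'], values=['Price'], aggfunc=np.sum))
print(sales.pivot_table(values='Price', index='Manager', columns='Rep', aggfunc='mean'))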

Sorting a DataFrame? su_lt_two.sort_values('Zip') (DataFrame.sort is deprecated; use sort_values)

X.iloc[-1]

to_pred_idx = X.iloc[0].index
to_pred_zeros = np.zeros(len(to_pred_idx))
tpdf = pd.DataFrame(to_pred_zeros, index=to_pred_idx, columns=['value'])

sp20 = sp[[x for x in sp.columns if 'Close Minus' in x or x == 'Close']].iloc[20:,]
sp20 = sp20.iloc[:,::-1]

X_train = sp20[:-2000]
y_train = sp20['Close'].shift(-1)[:-2000]

What does pandas shift do? It moves the values of a Series or DataFrame by the given number of periods along the index (the index itself stays put), filling the vacated positions with NaN; shift(-1) therefore aligns each row with the next row's value, as in the target above. A sketch is below.
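A tiny sketch (made-up closing prices) of shift in both directions:

import pandas as pd

close = pd.Series([10.0, 11.0, 12.0])   # hypothetical closing prices
print(close.shift(1))    # [NaN, 10.0, 11.0], yesterday's close on today's row
print(close.shift(-1))   # [11.0, 12.0, NaN], tomorrow's close on today's row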

df.median()
df.describe()
rf.describe([.10,.25,.5,.75,.9])  # you can pass your own list of percentiles

px = [x for x in fares['price']]
ff = pd.DataFrame(px, columns=['fare']).reset_index()

ipos.sort_values('% Chg Open to Close')

db.labels_  # the DBSCAN cluster labels used in the next snippet (a DataFrame has no labels() method, so df.labels() was most likely a typo)

pf = pd.concat([ff, pd.DataFrame(db.labels_, columns=['cluster'])], axis=1)
rf = pf.groupby('cluster')['fare'].agg(['min','count']).sort_values('min', ascending=True)
rf.iloc[0]['min']

y_train = all_data.iloc[train_index]['fb'].astype(int)
X_train_nosite = all_data.iloc[train_index][['img_count', 'word_count']]

pf['fare'].min()

rf['count'].quantile(.10)

rez = pd.merge(results,zf, left_index=True, right_index=True)

What does df.assign do? It returns a copy of the DataFrame with the given columns added (or overwritten), which makes it easy to chain; a short sketch is below.

tf2 = tf2.assign(Signal = tf2.apply(get_signal, axis=1))
tf2 = tf2.assign(PnL = tf2.apply(get_ret, axis=1))
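A minimal sketch (made-up column) showing that assign returns a new frame instead of mutating in place:

import pandas as pd

prices = pd.DataFrame({'close': [10.0, 12.0]})               # hypothetical data
with_ret = prices.assign(ret=prices['close'].pct_change())   # new column 'ret'; prices itself is unchanged
print(with_ret)
print(prices.columns.tolist())   # ['close'], the original frame has no 'ret' column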

dfc[::-1]

fig, ax = plt.subplots(figsize=(8,6))
y = dfc['img_count'].value_counts().sort_index()
x = y.sort_index().index
plt.bar(x, y, color='k', align='center')
plt.title('Image Count Frequency', fontsize=16, y=1.01)
ax.set_xlim(-.5,5.5)
ax.set_ylabel('Count')
ax.set_xlabel('Number of Images')

dfc['site'].value_counts().to_frame()

mci = dfc['main_hex'].value_counts().to_frame('count')
mci['color'] = ' '

def color_cells(x):
    return 'background-color: ' + x.index

mci.style.apply(color_cells, subset=['color'], axis=0)

dfc['reds'], dfc['greens'], dfc['blues'] = zip(*dfc['main_rgb'].map(get_csplit))

sites = pd.get_dummies(all_data['site'])

What is the pandas_datareader library? It pulls remote data sources (stock prices, World Bank indicators, etc.) directly into pandas DataFrames; a hedged usage sketch is below.

from pandas_datareader import data, wb
import pandas_datareader as pdr
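A minimal usage sketch, assuming the Stooq source is still supported by your pandas_datareader version (remote sources change over time, so treat the symbol and source as illustrative):

from pandas_datareader import data

# DataReader(symbol, data_source, start, end); 'stooq' is one of the free daily-price sources
spy = data.DataReader('SPY', 'stooq', start='2017-01-01', end='2017-12-31')
print(spy.head())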

tf = pd.DataFrame(list(zip(y_test, preds)), columns=['Next Day Close', 'Predicted Next Close'], index=y_test.index)

What is the fastdtw library? It computes an approximate Dynamic Time Warping distance between two sequences (see the usage sketch after dtw_dist below).

from scipy.spatial.distance import euclidean
from fastdtw import fastdtw

def dtw_dist(x, y):
    distance, path = fastdtw(x, y, dist=euclidean)
    return distance
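A small usage sketch with made-up series; they are reshaped to column vectors so scipy's euclidean receives 1-D points:

import numpy as np
from scipy.spatial.distance import euclidean
from fastdtw import fastdtw

a = np.array([1.0, 2.0, 3.0, 4.0]).reshape(-1, 1)        # hypothetical price series
b = np.array([1.0, 1.0, 2.0, 3.0, 4.0]).reshape(-1, 1)
distance, path = fastdtw(a, b, dist=euclidean)
print(distance)   # total cost of the (approximate) optimal alignment
print(path)       # list of index pairs that were matched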

dist_frame = pd.DataFrame(dist_pairs, columns=['A','B','Dist', 'A Ret', 'B Ret'])
sf = dist_frame[dist_frame['Dist']>0].sort_values(['A','B']).reset_index(drop=1)
sfe = sf[sf['A'] < sf['B']]
winf = sfe[(sfe['Dist']<=1) & (sfe['A Ret']>0)]

kf = pd.DataFrame(k_sim).T
kf.columns = ['similarity']
kf.sort_values('similarity', ascending=False)

fdf.iloc[71,:][fdf.iloc[71,:]==1]
all_recs = fdf.iloc[[31,5,71,79],:][fdf.iloc[[31,5,71,79],:]==1].fillna(0).T
all_recs[(all_recs==1).all(axis=1)]

str_recs_tmp = all_recs[all_recs[myun]==0].copy()
str_recs = str_recs_tmp.iloc[:,:-1].copy()

str_recs[(str_recs==1).all(axis=1)]
str_recs[str_recs.sum(axis=1)>1]

ipos.groupby(ipos['Date'].dt.year)['1st Day % Px Chng ']\
    .mean().plot(kind='bar', figsize=(15,10), color='k', title='1st Day Mean IPO Percentage Change')
ipos.groupby(ipos['Date'].dt.year)['1st Day % Px Chng ']\
    .median().plot(kind='bar', figsize=(15,10), color='k', title='1st Day Median IPO Percentage Change')

ipos[ipos['Date']=='11/120']
ipos.loc[1660, 'Date'] = pd.to_datetime('2012-11-20')

Plotting from a DataFrame:

fig, ax = plt.subplots(figsize=(15,10))
rf[rf['predicted']==1]['$ chg'].plot(kind='bar')
ax.set_title('Model Predicted Buys', y=1.01)
ax.set_ylabel('$ Change Open to Close')
ax.set_xlabel('Index')