データ解析のための統計モデリング入門３章

いわゆる緑本を勉強したのだが, 全部Rで書かれてたのでpythonのほうが好きかなーやっぱｗという気持ちで色々書いてみた.

モジュールのインポート

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
import matplotlib.style
import seaborn as sns

スタイルを変更したい気持ち

matplotlib.style.use("ggplot")

第3章 : 一般化線形モデル(GLM) -ポアソン回帰-

データの読み込み

data = pd.read_csv("data3a.csv")

データの確認

data.head()

	y	x	f
0	6	8.31	C
1	6	9.44	C
2	6	9.50	C
3	12	9.07	C
4	10	10.16	C

data["x"].head()

0     8.31
1     9.44
2     9.50
3     9.07
4    10.16
Name: x, dtype: float64

data["y"].head()

0     6
1     6
2     6
3    12
4    10
Name: y, dtype: int64

data["f"].head()

0    C
1    C
2    C
3    C
4    C
Name: f, dtype: object

3.2 観測されたデータの概要を調べる

data.describe()

	y	x
count	100.000000	100.000000
mean	7.830000	10.089100
std	2.624881	1.008049
min	2.000000	7.190000
25%	6.000000	9.427500
50%	8.000000	10.155000
75%	10.000000	10.685000
max	15.000000	12.400000

fig,ax = plt.subplots() # fig は描画プロット, ax はサブプロット

X = data["x"]
y = data["y"]

ax.scatter(X,y, label = "x-y") # 散布図を出すときはscatter関数, label="hogehoge" とすることで凡例のラベルをつけられる
ax.set_xlabel("data_x") # x軸のラベル
ax.set_ylabel("data_y") # y軸のラベル
ax.set_title("data_x - data_y") # グラフの表題
ax.legend(loc = "best") # 凡例を表示する. loc="best"とすることで最も重なりのない位置に凡例を表示することができる.
plt.show()

f:id:remark_tzi:20190205040119p:plain

# 描画プロットとサブプロットの設定
fig,ax = plt.subplots() # fig は描画プロット, ax はサブプロット

## dataをf==Cとf==Tで分ける
data_C = data[data["f"]=="C"] # dataからf==Cのものだけをとってくる
data_T = data[data["f"]=="T"] # dataからf==Tのものだけをとってくる

## CとTごとにxとyに分ける
x_c = data_C["x"] 
y_c = data_C["y"]
x_t = data_T["x"]
y_t = data_T["y"]

## 散布図を出力する
ax.scatter(x_c,y_c,marker="v", label = "x_c-y_c") # marker = "v" とすると下三角
ax.scatter(x_t,y_t,marker="s", label = "x_t-y_t") # marker = "s" とすると正四角形, 他にも^や*やxなどがある
ax.set_xlabel("data_x") # x軸のラベル
ax.set_ylabel("data_y") # y軸のラベル
ax.set_title("data_x - data_y") # グラフの表題
ax.legend(loc = "best") # 凡例を表示する. loc="best"とすることで最も重なりのない位置に凡例を表示することができる.
plt.show()

f:id:remark_tzi:20190205040122p:plain

## f==cとf==Tの場合の箱ひげ図もみてみる
fig,ax = plt.subplots()

label_name = ["f = C","f = T"]
ax.boxplot((y_c,y_t), labels = label_name)

plt.show()

f:id:remark_tzi:20190205040126p:plain

## f==cとf==Tの場合の箱ひげ図もみてみる
fig,ax = plt.subplots()

#label_name = ["f = C","f = T"]
ax.boxplot(y_c,labels = "c")

plt.show()

f:id:remark_tzi:20190205040129p:plain

## f==cとf==Tの場合の箱ひげ図もみてみる
fig,ax = plt.subplots()

#label_name = ["f = C","f = T"]
ax.boxplot(y_c,labels = "ac")

plt.show()

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-17-f0fe891b5ccf> in <module>
      3 
      4 #label_name = ["f = C","f = T"]
----> 5 ax.boxplot(y_c,labels = "ac")
      6 
      7 plt.show()

c:\users\nago\documents\python\studying\python3.6\lib\site-packages\matplotlib\__init__.py in inner(ax, data, *args, **kwargs)
   1808                         "the Matplotlib list!)" % (label_namer, func.__name__),
   1809                         RuntimeWarning, stacklevel=2)
-> 1810             return func(ax, *args, **kwargs)
   1811 
   1812         inner.__doc__ = _add_data_doc(inner.__doc__,

c:\users\nago\documents\python\studying\python3.6\lib\site-packages\matplotlib\axes\_axes.py in boxplot(self, x, notch, sym, vert, whis, positions, widths, patch_artist, bootstrap, usermedians, conf_intervals, meanline, showmeans, showcaps, showbox, showfliers, boxprops, labels, flierprops, medianprops, meanprops, capprops, whiskerprops, manage_xticks, autorange, zorder)
   3501 
   3502         bxpstats = cbook.boxplot_stats(x, whis=whis, bootstrap=bootstrap,
-> 3503                                        labels=labels, autorange=autorange)
   3504         if notch is None:
   3505             notch = rcParams['boxplot.notch']

c:\users\nago\documents\python\studying\python3.6\lib\site-packages\matplotlib\cbook\__init__.py in boxplot_stats(X, whis, bootstrap, labels, autorange)
   1179         labels = itertools.repeat(None)
   1180     elif len(labels) != ncols:
-> 1181         raise ValueError("Dimensions of labels and X must be compatible")
   1182 
   1183     input_whis = whis

ValueError: Dimensions of labels and X must be compatible

ここはなんでエラー吐くのかよくわからんからメモで残しといた. そのうち調べる（これは調べないですわ）

まあでも以上のことから

xが増加するに従ってyも増加しそう(よくわからない)
fはあまり影響しなさそう

とわかる

ポアソン回帰の統計モデル

ここで使うモジュールのインポート

## 統計的にはこのモジュールを使うと良いとされている
## というのも詳細がわかりやすい
import statsmodels.api as sm

GLMのフィッティング

x-yでポアソン回帰

# 植物の大きさ
x = sm.add_constant(data["x"])
# 飛ばした種子の数
y = data["y"]
# ポアソン回帰
model = sm.GLM(y, x, family=sm.families.Poisson()) # modelの作成
results = model.fit()

results.summary()

Generalized Linear Model Regression Results
Dep. Variable:	y	No. Observations:	100
Model:	GLM	Df Residuals:	98
Model Family:	Poisson	Df Model:	1
Link Function:	log	Scale:	1.0000
Method:	IRLS	Log-Likelihood:	-235.39
Date:	Tue, 05 Feb 2019	Deviance:	84.993
Time:	03:27:42	Pearson chi2:	83.8
No. Iterations:	4	Covariance Type:	nonrobust

	coef	std err	z	P>\|z\|	[0.025	0.975]
const	1.2917	0.364	3.552	0.000	0.579	2.005
x	0.0757	0.036	2.125	0.034	0.006	0.145

f-yでポアソン回帰

## 施肥効果の変数
f = sm.add_constant(data["f"])
## 何もしてなければ0,なにかしていれば1に置き換える
f = f.replace({"C":0,"T":1})

## モデルの作成
model = sm.GLM(y, f, family=sm.families.Poisson())
results = model.fit()

## 詳細
results.summary()

Generalized Linear Model Regression Results
Dep. Variable:	y	No. Observations:	100
Model:	GLM	Df Residuals:	98
Model Family:	Poisson	Df Model:	1
Link Function:	log	Scale:	1.0000
Method:	IRLS	Log-Likelihood:	-237.63
Date:	Tue, 05 Feb 2019	Deviance:	89.475
Time:	03:27:44	Pearson chi2:	87.1
No. Iterations:	4	Covariance Type:	nonrobust

	coef	std err	z	P>\|z\|	[0.025	0.975]
const	2.0516	0.051	40.463	0.000	1.952	2.151
f	0.0128	0.071	0.179	0.858	-0.127	0.153

説明変数が２つ以上あるとき

## 説明変数をまとめる
Xs = data.drop("y",axis=1)
## objectoをダミー変数化
Xs = Xs.replace({"C":0,"T":1})
# 確率モデルをポアソン分布と仮定してモデルのフィッティング
Xs = sm.add_constant(Xs)
"""
# Xs = Xs.values # どうやらわざわざndarrayにしなくてもやってくれるらしい
"""
results = sm.GLM(endog=y, exog=Xs,family=sm.families.Poisson()).fit()
results.summary()

Generalized Linear Model Regression Results
Dep. Variable:	y	No. Observations:	100
Model:	GLM	Df Residuals:	97
Model Family:	Poisson	Df Model:	2
Link Function:	log	Scale:	1.0000
Method:	IRLS	Log-Likelihood:	-235.29
Date:	Tue, 05 Feb 2019	Deviance:	84.808
Time:	03:27:47	Pearson chi2:	83.8
No. Iterations:	4	Covariance Type:	nonrobust

	coef	std err	z	P>\|z\|	[0.025	0.975]
const	1.2631	0.370	3.417	0.001	0.539	1.988
x	0.0801	0.037	2.162	0.031	0.007	0.153
f	-0.0320	0.074	-0.430	0.667	-0.178	0.114

## 説明変数をまとめる
Xs = data.drop("y",axis=1)
## objectoをダミー変数化
Xs = Xs.replace({"C":0,"T":1})
# 確率モデルを正規分布と仮定してモデルのフィッティング
Xs = sm.add_constant(Xs)
"""
# Xs = Xs.values # どうやらわざわざndarrayにしなくてもやってくれるらしい
"""
results = sm.GLM(endog=y, exog=Xs,family=sm.families.Gaussian()).fit()
results.summary()

Generalized Linear Model Regression Results
Dep. Variable:	y	No. Observations:	100
Model:	GLM	Df Residuals:	97
Model Family:	Gaussian	Df Model:	2
Link Function:	identity	Scale:	6.6522
Method:	IRLS	Log-Likelihood:	-235.12
Date:	Tue, 05 Feb 2019	Deviance:	645.26
Time:	03:27:48	Pearson chi2:	645.
No. Iterations:	3	Covariance Type:	nonrobust

	coef	std err	z	P>\|z\|	[0.025	0.975]
const	1.6169	2.653	0.610	0.542	-3.582	6.816
x	0.6284	0.268	2.345	0.019	0.103	1.154
f	-0.2538	0.537	-0.472	0.637	-1.307	0.800

正規分布とポアソン分布の線形回帰の違いを見てみる

poisson_const = 1.2631
poisson_f = -0.0320
poisson_consts = poisson_const + poisson_f
poisson_x =0.0801

j = np.arange(7.0, 12.1, 0.5)

poisson_link = [math.exp(poisson_consts + poisson_x * k) for k in j]

gauss_const = 1.6169
gauss_f = -0.2538
gauss_consts = gauss_const + gauss_f
gauss_x = 0.6284

gauss_link = [math.exp(gauss_consts + gauss_x * k) for k in j]
x_labels = [k for k in j]

gauss_id = [gauss_consts + gauss_x * k for k in j]

poisson_id = [poisson_consts + poisson_x * k for k in j]

fig, ax = plt.subplots()

#ax.scatter(x_labels,gauss_link, marker = "s", label = "Gauss_link")
ax.scatter(x_labels,poisson_link, marker = "*", label = "Poisson_link")
#ax.scatter(x_labels,gauss_id, marker = "v", label = "Gauss_id")
ax.scatter(x_labels,poisson_id, marker = "^", label = "Poisson_id")
ax.scatter(X,y,marker = "x", label = "data")
ax.set_xlabel("x_values")
ax.set_ylabel("lambda")
ax.legend(loc = "best") 
plt.show()

f:id:remark_tzi:20190205040114p:plain

タイトルに反して結局消してしまったガウス分布. 理由は簡単でこいつでやったのアホほど跳ね上がってほかがみえないからです. ごめんね！

Aya's Prelude

大学院生のよしなごと

緑本の３章をpythonで勉強しました