In this article we replicate the factor construction of Fama and French's classic 1993 paper, "Common risk factors in the returns on stocks and bonds," and provide the Python code.
The Fama-French three-factor model is one of the most influential models in quantitative investing. Because the single-factor CAPM cannot fully explain the cross-section of stock returns, Fama and French proposed three factors to account for cross-sectional return differences. The factors capture the market return (Mkt), the small-cap effect (Size) and the value effect (Value), and they are among the earliest factor-mimicking portfolios. The three-factor model has since served as the standard benchmark for equity markets: when a new anomaly is proposed, it is usually tested first against the three-factor model.
The data come from WRDS. The factors are built by first double-sorting stocks and then computing value-weighted average returns within each portfolio. Finally, we plot the historical performance of the two factors (SMB and HML) in the US market and compare it with the factor return series published on Kenneth French's website.
Fama and French sort each stock into two groups by size and three groups by book-to-market. Here, market equity (ME) equals the share price times the number of shares outstanding, and book equity (BE) equals stockholders' equity plus balance-sheet deferred taxes and investment tax credit, minus the value of preferred stock.
The fundamentals come from the Compustat annual file, and the book value for fiscal year t-1 is taken from the annual report. The trading data come from the CRSP monthly return file (the procedure can later be extended to daily data). In this part, market equity is computed as the absolute value of the price times shares outstanding, and monthly returns are adjusted for delisting returns. Because one firm (permco) may have several securities (permno), the firm's ME is the sum of the ME of all its securities. Book-to-market (BE/ME) equals the book equity at the fiscal year end of year t-1 divided by the market equity in December of year t-1, and Size is simply the market equity in June of year t.
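In symbols (matching the variable names used in the Compustat and CRSP blocks of the code below, where ps is the preferred-stock value taken from pstkrv, with pstkl and pstk as fallbacks):

$$
\begin{aligned}
ME &= |prc| \times shrout, \\
BE_{t-1} &= seq_{t-1} + txditc_{t-1} - ps_{t-1}, \\
BM_t &= \frac{BE_{t-1}}{ME_{\mathrm{Dec},\,t-1}}, \qquad Size_t = ME_{\mathrm{Jun},\,t}.
\end{aligned}
$$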
The portfolios are constructed as follows: in June of year t, the median market equity of NYSE stocks is computed and used to split all stocks into Small and Big. Using the book-to-market ratio constructed above, all stocks are then sorted into Low, Medium and High groups at the NYSE 30% and 70% breakpoints. Stocks must have positive market equity and positive book equity, a CRSP share code of 10 or 11, and at least two years of history in Compustat. This yields six portfolios: S/L, S/M, S/H, B/L, B/M and B/H. The small-stock return is the average of the three small portfolios, the big-stock return is the average of the three big portfolios, and SMB is their difference; HML is the average of the two high-BM portfolios minus the average of the two low-BM portfolios, as written out below. The portfolios are formed at the end of June and held for the following twelve months.
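Writing the value-weighted monthly returns of the six portfolios as SL, SM, SH, BL, BM and BH, the two factor returns each month are (this is how WSMB and WHML are computed in the code below):

$$
SMB = \frac{SL + SM + SH}{3} - \frac{BL + BM + BH}{3}, \qquad
HML = \frac{SH + BH}{2} - \frac{SL + BL}{2}.
$$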
The long-run performance of the portfolios in the US market is shown below:
Comparing the result with the SMB factor return series downloaded from Kenneth French's website shows that the replication tracks the official series closely.
The factor-construction code, written in Python, follows.
import pandas as pd
import numpy as np
import datetime as dt
import wrds
import psycopg2
import matplotlib.pyplot as plt
from dateutil.relativedelta import *
from pandas.tseries.offsets import *
from scipy import stats
###################
# Compustat Block #
###################
comp_all = pd.read_sas(r'F:\data\comp\compa.sas7bdat')
# convert column names from upper case to lower case
column_upper = list(comp_all.columns)
column_lower = [s.lower() for s in column_upper]
comp_all.columns = column_lower
# keep records that meet the screening conditions; note these columns are bytes (object dtype) in the raw SAS file
comp_need = comp_all[comp_all['indfmt']==b'INDL']
comp_need = comp_need[comp_need['datafmt']==b'STD']
comp_need = comp_need[comp_need['popsrc'] == b'D']
comp_need = comp_need[comp_need['consol'] == b'C']
comp = comp_need[['gvkey', 'datadate', 'at', 'pstkl', 'txditc',
                  'pstkrv', 'seq', 'pstk']].copy()
comp['datadate'] = pd.to_datetime(comp['datadate'])   # convert datadate to date format
comp = comp[comp['datadate'] >= '1959-01-01']         # keep fiscal years ending in 1959 or later
# extract the calendar year
comp['year']=comp['datadate'].dt.year
# create preferred stock: use the redemption value (pstkrv); if missing, fall back to the liquidating value (pstkl), then the par value (pstk)
comp['ps']=np.where(comp['pstkrv'].isnull(), comp['pstkl'], comp['pstkrv'])
comp['ps']=np.where(comp['ps'].isnull(),comp['pstk'], comp['ps'])
# if still missing, set to 0
comp['ps']=np.where(comp['ps'].isnull(),0,comp['ps'])
comp['txditc']=comp['txditc'].fillna(0)
# create book equity; set to missing if non-positive
comp['be']=comp['seq']+comp['txditc']-comp['ps']
comp['be']=np.where(comp['be']>0, comp['be'], np.nan)
# number of years in Compustat
comp=comp.sort_values(by=['gvkey','datadate'])
comp['count']=comp.groupby(['gvkey']).cumcount()
# keep only the variables we need: gvkey, datadate, year, book equity and the year count
comp=comp[['gvkey','datadate','year','be','count']]
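# Optional sanity check (my addition, not part of the original post): count how many firms
# have a usable book equity value each year, to confirm the Compustat block looks reasonable
print(comp.groupby('year')['be'].count().tail())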
###################
# CRSP Block #
###################
# sql similar to crspmerge macro
crsp_sas = pd.read_sas('F:\\data\\crsp\\All_crsp_m20181231.sas7bdat')
column_upper = list(crsp_sas.columns)
column_lower = [s.lower() for s in column_upper]
crsp_sas.columns = column_lower
crsp_m = crsp_sas[['permno','permco','date','shrcd','exchcd','ret','retx','shrout','prc']].copy()
# convert date to datetime first, then keep the 1959-2017 sample period
crsp_m['date'] = pd.to_datetime(crsp_m['date'])
crsp_m = crsp_m[(crsp_m['date'] >= '1959-01-01') & (crsp_m['date'] <= '2017-12-31')]
# keep NYSE, AMEX and NASDAQ stocks (exchcd 1, 2 and 3)
crsp_m = crsp_m[(crsp_m['exchcd'] >= 1) & (crsp_m['exchcd'] <= 3)]
# change variable format to int
crsp_m[['permco','permno','shrcd','exchcd']]=crsp_m[['permco','permno','shrcd','exchcd']].astype(int)
# line up each date to the end of its month
crsp_m['jdate']=crsp_m['date']+MonthEnd(0)
# add delisting return
dlret = pd.read_sas(r'F:\data\crsp\msedelist.sas7bdat')
column_upper = list(dlret.columns)
column_lower = [s.lower() for s in column_upper]
dlret.columns = column_lower
dlret = dlret[['permno','dlret','dlstdt']]
dlret.permno=dlret.permno.astype(int)
dlret['dlstdt']=pd.to_datetime(dlret['dlstdt'])
dlret['jdate']=dlret['dlstdt']+MonthEnd(0)
# merge returns with the delisting data to compute delisting-adjusted returns; missing values are filled with 0
crsp = pd.merge(crsp_m, dlret, how='left',on=['permno','jdate'])
crsp['dlret']=crsp['dlret'].fillna(0)
crsp['ret']=crsp['ret'].fillna(0)
crsp['retx'] = crsp['retx'].fillna(0)
crsp['retadj']=(1+crsp['ret'])*(1+crsp['dlret'])-1
crsp['me']=crsp['prc'].abs()*crsp['shrout'] # market equity = |price| * shares outstanding (prc is negative when it is a bid/ask average)
crsp=crsp.drop(['dlret','dlstdt','prc','shrout'], axis=1)
crsp=crsp.sort_values(by=['jdate','permco','me'])
#------------------------------------------------------------------------------
# There are cases when the same firm (permco) has two or more
# securities (permno) at same date. For the purpose of ME for
# the firm, we aggregated all ME for a given permco, date. This
# aggregated ME will be assigned to the Permno with the largest ME
#------------------------------------------------------------------------------
### Aggregate market cap when one firm (permco) has several securities (permno) ###
# sum of me across different permno belonging to same permco a given date
crsp_summe = crsp.groupby(['jdate','permco'])['me'].sum().reset_index()
# largest mktcap within a permco/date
crsp_maxme = crsp.groupby(['jdate','permco'])['me'].max().reset_index()
# join by jdate/maxme to find the permno
# crsp1 has roughly 100,000 fewer rows than crsp, i.e. those rows belong to the same firm but a different security
crsp1=pd.merge(crsp, crsp_maxme, how='inner', on=['jdate','permco','me'])
# drop me column and replace with the sum me
crsp1=crsp1.drop(['me'], axis=1)
# join with sum of me to get the correct market cap info
crsp2=pd.merge(crsp1, crsp_summe, how='inner', on=['jdate','permco'])
# sort by permno and date and also drop duplicates
crsp2=crsp2.sort_values(by=['permno','jdate']).drop_duplicates()
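# Optional check (my addition, not part of the original post): after the aggregation each
# permco should appear at most once per month; duplicates would indicate ties in the maximum ME
print('duplicate permco-months:', crsp2.duplicated(['permco','jdate']).sum())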
# keep December market cap: extract year and month, then keep only the December market equity
crsp2['year']=crsp2['jdate'].dt.year
crsp2['month']=crsp2['jdate'].dt.month
decme=crsp2[crsp2['month']==12]
decme=decme[['permno','date','jdate','me','year']].rename(columns={'me':'dec_me'})
### July-to-June (Fama-French) dates
crsp2['ffdate']=crsp2['jdate']+MonthEnd(-6) # shift back 6 months
crsp2['ffyear']=crsp2['ffdate'].dt.year # year of the shifted date (the Fama-French portfolio year)
crsp2['ffmonth']=crsp2['ffdate'].dt.month # month of the shifted date (ffmonth 1 = calendar July)
crsp2['1+retx']=1+crsp2['retx']
# sort by stock and date
crsp2=crsp2.sort_values(by=['permno','date'])
# cumulative ex-dividend return by stock within each ffyear (portfolio year);
# for the first month of a new ffyear it is simply that month's 1+retx
crsp2['cumretx']=crsp2.groupby(['permno','ffyear'])['1+retx'].cumprod()
# lag cumret
crsp2['lcumretx']=crsp2.groupby(['permno'])['cumretx'].shift(1)
# lag market cap
crsp2['lme']=crsp2.groupby(['permno'])['me'].shift(1)
# if this is the first observation of a permno, use me/(1+retx) to fill the missing lagged me
# (fills the missing lag for each stock's first month)
# analogous to first.permno in SAS: number rows within each permno; count==0 marks the first one
crsp2['count']=crsp2.groupby(['permno']).cumcount()
crsp2['lme']=np.where(crsp2['count']==0, crsp2['me']/crsp2['1+retx'], crsp2['lme'])
# baseline me: when ffmonth==1 (calendar July), lag(me) is the end-of-June market equity and is used as the base
mebase=crsp2[crsp2['ffmonth']==1][['permno','ffyear', 'lme']].rename(columns={'lme':'mebase'})
# merge the base back in: the base ME is the lagged ME from ffmonth 1 (i.e. the June market equity)
# weight wt: in July (ffmonth==1) use lag(me), the June market equity, directly;
# in other months use mebase * lag(cumretx), i.e. June ME compounded forward by ex-dividend returns
crsp3=pd.merge(crsp2, mebase, how='left', on=['permno','ffyear'])
crsp3['wt']=np.where(crsp3['ffmonth']==1, crsp3['lme'], crsp3['mebase']*crsp3['lcumretx'])
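# Optional spot check (my addition, not part of the original post): for months after July,
# wt should equal the June market equity compounded forward by lagged ex-dividend returns
sample = crsp3[(crsp3['permno']==crsp3['permno'].iloc[0]) & (crsp3['ffmonth']>1)]
print(sample[['jdate','mebase','lcumretx','wt']].head())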
# shift the year forward by 1 so that dec_me indicates the market equity in December of the previous year
decme['year']=decme['year']+1
decme=decme[['permno','year','dec_me']]
#-------------Create a file with data for each June with ME from previous December
# for each June, merge in the market equity from December of the previous year
# Info as of June
# Each Firm's monthly return ret t will be weighted by
# ME(t-1) = ME(t-2) * (1+RETX(t-1))
# wt: me_base(t-1 me) * lag(cum_retx)
# dec_me is the december t-1 Market Equity(ME)
crsp3_jun = crsp3[crsp3['month']==6]
crsp_jun = pd.merge(crsp3_jun, decme, how='inner', on=['permno','year'])
crsp_jun=crsp_jun[['permno','date', 'jdate', 'shrcd','exchcd','retadj','me','wt','cumretx','mebase','lme','dec_me']]
crsp_jun=crsp_jun.sort_values(by=['permno','jdate']).drop_duplicates()
#-----------------------------------------------------------------------------#
#######################
# CCM Block #
#######################
# merge CRSP data with Compustat data, at June of every year
# Match fiscal year ending calendar year t-1 with June t
ccm_sas = pd.read_sas(r'F:\data\crsp\ccmxpf_linktable.sas7bdat')
ccm = ccm_sas[['gvkey','lpermno', 'linktype', 'linkprim','linkdt', 'linkenddt']]
ccm = ccm.rename(columns={'lpermno':'permno'})
ccm = ccm[(ccm['linktype']!=b'NP') & (ccm['linktype']!=b'NR') & (ccm['linktype']!=b'NU')]
ccm = ccm[(ccm['linkprim'] == b'C') | (ccm['linkprim'] == b'P')]
ccm['linkdt']=pd.to_datetime(ccm['linkdt'])
ccm['linkenddt']=pd.to_datetime(ccm['linkenddt'])
# if linkenddt is missing then set to today date
ccm['linkenddt']=ccm['linkenddt'].fillna(pd.to_datetime('today'))
# merge the Compustat fundamentals with the CCM link table
ccm1=pd.merge(comp[['gvkey','datadate','be', 'count']],ccm,how='left',on=['gvkey'])
ccm1['yearend']=ccm1['datadate']+YearEnd(0)
# shift forward 6 months: fiscal-year-end data of year t-1 are matched with CRSP data of June in year t
ccm1['jdate']=ccm1['yearend']+MonthEnd(6)
# set link date bounds
ccm2=ccm1[(ccm1['jdate']>=ccm1['linkdt'])&(ccm1['jdate']<=ccm1['linkenddt'])]
ccm2=ccm2[['gvkey','permno','datadate','yearend', 'jdate','be', 'count']]
# link Compustat and CRSP and compute book-to-market using the previous December's market equity;
# be is in $ millions while dec_me (prc * shrout) is in $ thousands, hence the factor of 1000
ccm_jun=pd.merge(crsp_jun, ccm2, how='inner', on=['permno', 'jdate'])
ccm_jun['beme']=ccm_jun['be']*1000/ccm_jun['dec_me']
#------------------------------------------------------------------------------
# Part4 Size and Book to Market Portfolios
# Forming Portfolio by ME and BEME as of each June t
# Calculating NYSE Breakpoints for Market Equity(ME) and Book-to-Market (BEME)
#------------------------------------------------------------------------------
# select NYSE stocks for bucket breakdown
# exchcd = 1 and positive beme and positive me and shrcd in (10,11) and at least 2 years in comp
# i.e. compute the NYSE size median, used to split stocks into Small and Big,
# and the NYSE 30% / 70% book-to-market percentiles, used to form the Low, Medium and High groups
nyse=ccm_jun[(ccm_jun['exchcd']==1) & (ccm_jun['beme']>0) & (ccm_jun['me']>0) & (ccm_jun['count']>1) & ((ccm_jun['shrcd']==10) | (ccm_jun['shrcd']==11))]
# size breakpoint: the median market equity (size) for each June date
nyse_sz=nyse.groupby(['jdate'])['me'].median().to_frame().reset_index().rename(columns={'me':'sizemedn'})
# beme breakpoints: the 30% and 70% book-to-market percentiles for each June date
nyse_bm=nyse.groupby(['jdate'])['beme'].describe(percentiles=[0.3, 0.7]).reset_index()
nyse_bm=nyse_bm[['jdate','30%','70%']].rename(columns={'30%':'bm30', '70%':'bm70'})
nyse_breaks = pd.merge(nyse_sz, nyse_bm, how='inner', on=['jdate'])
# join back size and beme breakdown
# use Breakpoints to classify stock only at end of all June's
ccm1_jun = pd.merge(ccm_jun, nyse_breaks, how='left', on=['jdate'])
# Create Portfolios as of June
# Size Portfolios: S[mall] or B[ig]
# Book-to-Market Portfolios: L[ow], M[edium], H[igh]
# function to assign sz and bm bucket
def sz_bucket(row):
    # np.nan == np.nan is always False, so test for a missing me with pd.isnull
    if pd.isnull(row['me']):
        value = ''
    elif row['me'] <= row['sizemedn']:
        value = 'S'
    else:
        value = 'B'
    return value

def bm_bucket(row):
    if 0 <= row['beme'] <= row['bm30']:
        value = 'L'
    elif row['beme'] <= row['bm70']:
        value = 'M'
    elif row['beme'] > row['bm70']:
        value = 'H'
    else:
        value = ''
    return value
# assign size portfolio: use the functions above to decide which portfolio each stock belongs to as of June;
# stocks that do not meet the screening conditions get an empty label
ccm1_jun['szport']=np.where((ccm1_jun['beme']>0)&(ccm1_jun['me']>0)&(ccm1_jun['count']>=1), ccm1_jun.apply(sz_bucket, axis=1), '')
# assign book-to-market portfolio
ccm1_jun['bmport']=np.where((ccm1_jun['beme']>0)&(ccm1_jun['me']>0)&(ccm1_jun['count']>=1), ccm1_jun.apply(bm_bucket, axis=1), '')
# create the posbm and nonmissport flags:
# posbm marks stocks with positive book-to-market (and positive ME, count >= 1); nonmissport marks a non-empty bm portfolio
ccm1_jun['posbm']=np.where((ccm1_jun['beme']>0)&(ccm1_jun['me']>0)&(ccm1_jun['count']>=1), 1, 0)
ccm1_jun['nonmissport']=np.where((ccm1_jun['bmport']!=''), 1, 0)
# store the portfolio assignment as of June,
# i.e. keep a record of which portfolio each stock belongs to
june=ccm1_jun[['permno','date', 'jdate', 'bmport','szport','posbm','nonmissport']].copy()
june['ffyear']=june['jdate'].dt.year
# merge back with the monthly records;
# crsp3 holds each stock's monthly return, weight and cumulative return information
crsp3 = crsp3[['date','permno','shrcd','exchcd','retadj','me','wt','cumretx','ffyear','jdate']]
ccm3=pd.merge(crsp3,
june[['permno','ffyear','szport','bmport','posbm','nonmissport']], how='left', on=['permno','ffyear'])
# keeping only records that meet the criteria
ccm4=ccm3[(ccm3['wt']>0)& (ccm3['posbm']==1) & (ccm3['nonmissport']==1) &
((ccm3['shrcd']==10) | (ccm3['shrcd']==11))]
############################
# Form Fama French Factors #
############################
# function to calculate value weighted return
def wavg(group, avg_name, weight_name):
    d = group[avg_name]       # d is the delisting-adjusted return retadj
    w = group[weight_name]    # w is the weight wt
    try:
        return (d * w).sum() / w.sum()
    except ZeroDivisionError:
        return np.nan
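# Note (my addition): a hypothetical alternative is
#   ccm4.groupby(['jdate','szport','bmport']).apply(lambda g: np.average(g['retadj'], weights=g['wt']))
# np.average raises ZeroDivisionError when the weights sum to zero, which is what the
# try/except above guards against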
# -------------------- value-weighted return
# group by trading month, size portfolio and bm portfolio, and compute the value-weighted
# average of retadj within each group (cumretx is not used here; it only enters through the weight wt)
vwret=ccm4.groupby(['jdate','szport','bmport']).apply(wavg, 'retadj','wt').to_frame().reset_index().rename(columns={0: 'vwret'})
vwret['sbport']=vwret['szport']+vwret['bmport']
# -------------------------firm count
vwret_n=ccm4.groupby(['jdate','szport','bmport'])['retadj'].count().reset_index().rename(columns={'retadj':'n_firms'})
vwret_n['sbport']=vwret_n['szport']+vwret_n['bmport']
# transpose
ff_factors=vwret.pivot(index='jdate', columns='sbport', values='vwret').reset_index()
ff_nfirms=vwret_n.pivot(index='jdate', columns='sbport', values='n_firms').reset_index()
# create SMB and HML factors
# average of the two High portfolios, average of the two Low portfolios, then High minus Low
ff_factors['WH']=(ff_factors['BH']+ff_factors['SH'])/2
ff_factors['WL']=(ff_factors['BL']+ff_factors['SL'])/2
ff_factors['WHML'] = ff_factors['WH']-ff_factors['WL']
# average of the three Small portfolios and the three Big portfolios, then Small minus Big
ff_factors['WB']=(ff_factors['BL']+ff_factors['BM']+ff_factors['BH'])/3
ff_factors['WS']=(ff_factors['SL']+ff_factors['SM']+ff_factors['SH'])/3
ff_factors['WSMB'] = ff_factors['WS']-ff_factors['WB']
ff_factors=ff_factors.rename(columns={'jdate':'date'})
# firm counts (here 'HML' and 'SMB' denote the total number of firms entering each factor)
ff_nfirms['H']=ff_nfirms['SH']+ff_nfirms['BH']
ff_nfirms['L']=ff_nfirms['SL']+ff_nfirms['BL']
ff_nfirms['HML']=ff_nfirms['H']+ff_nfirms['L']
ff_nfirms['B']=ff_nfirms['BL']+ff_nfirms['BM']+ff_nfirms['BH']
ff_nfirms['S']=ff_nfirms['SL']+ff_nfirms['SM']+ff_nfirms['SH']
ff_nfirms['SMB']=ff_nfirms['B']+ff_nfirms['S']
ff_nfirms['TOTAL']=ff_nfirms['SMB']
ff_nfirms=ff_nfirms.rename(columns={'jdate':'date'})
#-------- plot the results
plt.plot(ff_factors['WSMB'])
ff_test = ff_factors[['date','WSMB','WHML']].copy()
ff_test['SMB_cum'] = (1+ff_test['WSMB']).cumprod()
ff_test['HML_cum'] = (1+ff_test['WHML']).cumprod()
plt.plot(ff_test['date'],ff_test['SMB_cum'],label='SMB')
plt.plot(ff_test['date'],ff_test['HML_cum'],label='HML')
plt.legend()
# compare with the factor data published on Kenneth French's website
import os
os.chdir('F:\\0-intern\\【2】国君金工\\【1】风格因子构建\\Fama-French-3factors')
fama_french = pd.read_csv('F-F_Research_Data_Factors.csv')
# the csv is assumed to have been cleaned into a YYYYMM 'date' column plus the factor columns;
# cast date to string so the range filter works whether it was parsed as int or str
fama_french['date'] = fama_french['date'].astype(str)
fama = fama_french[(fama_french['date'] >= '196207') & (fama_french['date'] <= '201105')].copy()
fama['SMB_ff'] = fama['SMB'] / 100
fama['HML_ff'] = fama['HML'] / 100
fama = fama.reset_index()
fama['HML_cum'] = (1+fama['HML_ff']).cumprod()
plt.plot(fama['HML_cum'])
plt.figure()
plt.plot(fama['SMB_ff'],label='SMB from French website')
plt.plot(ff_test['WSMB'],label='My SMB')
plt.legend()
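# Optional numeric check (my addition, not part of the original post): merge the replicated
# and official series on a YYYYMM key and report their correlations
ff_cmp = ff_test.copy()
ff_cmp['yyyymm'] = ff_cmp['date'].dt.strftime('%Y%m')
merged = pd.merge(ff_cmp, fama[['date','SMB_ff','HML_ff']].rename(columns={'date':'yyyymm'}), on='yyyymm')
print('SMB correlation:', merged['WSMB'].corr(merged['SMB_ff']))
print('HML correlation:', merged['WHML'].corr(merged['HML_ff']))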