每个证候输出一个结果文件,文件中显示所有西医指标的计算结果值,按照p值从小到大排序,前几个指标即是对该证候影响较大的指标。以气虚证和非气虚证为例,检验两组数据有无差异性,通过均值和标准差可以看出来气虚和非气虚的指标的高低,
通过p值可以看出有无统计学意义,若p<0.05差异显著,否则差异不显著;通过p值最小的几个指标,可以看出这些指标对于判断是否属于气虚证候的影响。
可使用任何语言解决,我们前期介绍了使用java解决,设计了算法框架。这里使用python实现,python解决时代码更为简洁,也更简单。
充分理解需求的情况下,设计算法框架。
要求遍历所有证候,,每个证候单独输出一个文件; 在这个文件中, 对所有西医指标进行考虑,即遍历所有西医指标;
对于每个西医指标,要考虑所有行的记录(即遍历所有行), 分别计算证候为0和为1时的西医指标值。
通过上述分析,整体的算法框架需要三层循环, 第一层为遍历每个证候, 第二层为遍历每个西医指标, 第三层为遍历每行记录,伪代码如下:
> For 证候 synIndex 1:6 //假设有6个证候 > 创建证候文件, 待输出数据 > For 西医指标med 7:56 //假设有50个西医指标 > 创建动态数组, 存储指标值以及证候值 > For 每行记录row > 将西医指标值,根据证候值分别加入不同的动态数组 > End 每行记录row > 依据数组中的值, 计算p值、t值、均值等等. > 将该指标名称、p值、t值等信息作为一行,追加写入文件. > End 西医指标med > 关闭csv文件,考虑下一个证候 > End For 证候
3.1 分组存储西医指标值,由于证候0和1的个数不确定, 所以它们数组大小也不确定,
需要使用动态数组分别存储证候0和1对应的西医指标值,可采用ist等数据结构来创建动态数组,用于存储西医指标值。
3.2 读入及操作xls文件,采用 pandas 读取xls文件数据,tcm_data =
pd.read_excel('data_expert.xls'), 读取出所有列名measures =
tcm_data.columns.values.tolist()。并按照算法来操作数据,进行运算处理。
3.3 使用stats.ttest_ind函数来进行Ttest检验,test_res = stats.ttest_ind(group2, group1,
equal_var =
False),group2为证候为1的指标组,group1为证候为0的指标组;test_res[1]值为返回p值,test_res[0]为返回t值。
3.4 使用 kruskalwallis函数进行非参数检验, h, p = kruskalwallis(group2,
group1),group2和group1的含义同上,返回的p值既是结果p值。
3.5 使用stats.normaltest函数进行判断是否属于正态分布,p1 =
stats.normaltest(groupAll),groupAll是group1和group2的合并.
若返回值p1>0.003,即为正态分布,若p1<0.003,则为非正态分布。
3.6 使用save = pd.DataFrame(final_table)
save.to_csv(OneCsvName,float_format='%.3f',na_rep="NAN!", header=False
),输出结果文件csv格式。
西医指标值存在为空情况,当空值的个数大于数据条数的一半时,不再进行计算处理,设特殊标记表示。
总体算法框架如下图所示
依据上述分析,按照算法框架,使用python编制代码,其中核心的实现代码如下
4.1 首先读取文件数据和列名称,初始化证候范围和西医指标范围
tcm_data = pd.read_excel('data_expert.xls')
measures = tcm_data.columns.values.tolist()
wid_s = 5 #5
wid_e = 16 #16
cid_s = 2 #1
cid_e = 4 #8
数据文件与源码文件在同一目录下。
4.2 我们把写出文件、考虑某证候把该证候值传递出去, tmp_data来接受返回的结果;他们
单独放在一个函数中,函数名为output_result_csv(),代码如下
def output_result_csv():
'''
For Chinese medicine diagnosis rules, save the relevant results in a file,
output a csv file for one rules.
Meanwhile, it can output many csv files based on measures for one rule
'''
for cid in range(2, 4): #2,3,4
tmp_data = contrast_test(cid)
tmp_ids = tmp_data['top_ids']
tmp_samples_size = tmp_data['top_samples_size']
tmp_means = tmp_data['top_means']
tmp_variances = tmp_data['top_variances']
tmp_normality = tmp_data['top_normality']
tmp_scores = tmp_data['top_scores']
tmp_scorest = tmp_data['top_scorest']
wes_measures = []
for t in tmp_ids:
wid = t + wid_s
wes_measures.append(measures[wid])
t_scores = tmp_scorest
p_scores = tmp_scores
class_west_m = class_west_measures(wes_measures) #call function to gieve
measure's class
tableold = [class_west_m, wes_measures, tmp_means[:,1], tmp_variances[:,1], \
tmp_means[:,0], tmp_variances[:,0],tmp_normality, p_scores]
np_table = np.array(tableold) #change to numpy type
np_tableT = np_table.T#change T
rowname = ('西医指标类别','西医指标','此证候人群均值','此证候人群方差',\
'非此证候人群均值','非此证候人群方差','是否正态分布', 'p值')#add the row name
final_table = np.row_stack((rowname, np_tableT))
contain all west measures
OneCsvName = measures[cid] + '.csv';
save = pd.DataFrame(final_table)
save.to_csv(OneCsvName,float_format='%.3f',na_rep="NAN!", header=False )
4.3 我们把对某个证候的具体计算处理,单独放在一个函数中,即4.2中的contrast_test()函数,如下
def contrast_test(cid):
'''
--- input cid: the index of a Chinese medicine diagnosis rule
--- return the top 5 colost relevant western biomedical measurements and the means, variances, as well as the t-test values.
'''
name = measures[cid]
lables = tcm_data[name].values.tolist()
wid_size = wid_e - wid_s + 1
means = np.zeros((wid_size,2))
variances = np.zeros((wid_size,2))
normality = np.zeros(wid_size) # save normality or not. 1 means normality,0
means not normality
scores = np.zeros(wid_size) #save p
scorest = np.zeros(wid_size)#save t
cnt = 0
samples = list()
samples_size = list()
for j in range(wid_s, wid_e + 1):
bio_measures = tcm_data[measures[j]].values.tolist()
n1 = 0
n2 = 0
x1 = 0
x2 = 0
s1 = 0
s2 = 0
group1 = list()
group2 = list()
for i in range(len(lables)):
if not np.isnan(bio_measures[i]):
if lables[i] == 0:
n1 += 1
x1 += bio_measures[i]
group1.append(bio_measures[i])
if lables[i] == 1:
n2 += 1
x2 += bio_measures[i]
group2.append(bio_measures[i])
tmp_sample = {'group1':group1, 'group2':group2}
samples.append(tmp_sample)
samples_size.append([n1, n2])
if (n1 + n2) <= (0.5 * len(lables)):
scores[cnt] = 100
cnt += 1
continue
x1 /= n1
x2 /= n2
for i in range(len(lables)):
if not np.isnan(bio_measures[i]):
if lables[i] == 0:
s1 += (bio_measures[i] - x1) ** 2
if lables[i] == 1:
s2 += (bio_measures[i] - x2) ** 2
means[cnt, 0] = x1 #存储证候为0时的指标均值
means[cnt, 1] = x2 #存储证候为1时的指标均值
variances[cnt, 0] = np.sqrt(s1 / (n1 - 1)) #存储证候为0时的指标方差
variances[cnt, 1] = np.sqrt(s2 / (n2 - 1)) #存储证候为1时的指标方差
normality[cnt] = normality_or_not(group2, group1)
if normality[cnt] == 1: #正态分布,使用Ttest检验
test_res = stats.ttest_ind(group2, group1, equal_var = False)
scores[cnt] = test_res[1] # save p, the smaller, the more significant
scorest[cnt] = "%.2f" % test_res[0] #save t, the bigger of abs(t), the more
significant
else: #非正态分布,使用非参数检验Kruskal-Wallis
h, p = kruskalwallis(group2, group1)
scores[cnt] = p
cnt += 1
samples = np.asarray(samples)
samples_size = np.asarray(samples_size)
''' Note that the next two lines is very important
top_ids = np.argsort(scores)[0:top_k] #Sorted operation by value of p to find
top_k measurements, return the index of value
top_ids = np.arange(0,len(scores),1) #Don't sort by value of p
'''
top_k = wid_size #find all the measurements, you can change this value to any
number you need
the index of value, if you need sort, use this line and don't use next line
top_ids = np.arange(0,len(scores),1) #Don't sort by value of p,return the
original index of value.
top_means = means[top_ids,:]
top_variances = variances[top_ids,:]
top_normality = normality[top_ids]
top_scores = scores[top_ids]#p
top_scorest = scorest[top_ids]#t
top_samples = samples[top_ids]
top_samples_size = samples_size[top_ids]
data_infos = {"cid":cid, "top_ids":top_ids,
"top_samples_size":top_samples_size,"top_means":top_means,
"top_variances":top_variances, "top_normality":top_normality,
"top_scores":top_scores, "top_samples":top_samples, "top_scorest":top_scorest}
return data_infos
4.4 我们使用 normality_or_not()函数来判断是否为正态分布,正态返回1,非正态返回0;函数如下
def normality_or_not(group2, group1):
'''
如果两组数据合并一起,若是正态分布,则返回1,后期使用Ttest检验;否则返回0,后期使用非参数检验
'''
groupAll = list(group1)
groupAll.extend(group2)
p1 = stats.normaltest(groupAll)
if(p1[1]>0.003): #and (p2[1]>0.003)):
normality = 1
else:
normality = 0
return normality
结果如下图所示,假设该证候为气虚证,结果显示出了有气虚证和无气虚证,他们之间西医指标的差别,以及指标对证候的影响,还可以看出,西医指标值大部分都不服从正态分布,凝血酶时间、直接胆红素、血清总胆汁这些指标对证候的影响明显。
![](https://img-
blog.csdn.net/20180129224100331?/gravity/SouthEast)
** 特别声明: **
** 上述教程源于实际的项目案例,在发布时对这些实验数据进行了处理改造。 **
**
**
**
该教程版权归河南中医药大学信息技术学院、香港科技大学计算机系、北京中医药大学共同所有,仅供于免费学习和交流,特别为河南中医药大学计算机文化基础课,python程序设计提供教学辅助。
**
**
**
** 转载请注明出处,若用于商业用途,版权所有者将保留追究法律权利。 **