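This method originates from Luhn's paper "The Automatic Creation of Literature Abstracts". Its core idea is to cluster the keywords: each resulting "cluster" marks a place where keywords are concentrated, and we ultimately treat each cluster as a sentence fragment made up of keywords.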
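To make the cluster idea concrete, here is a minimal sketch of Luhn-style scoring for a single sentence. The function name `luhn_sentence_score` and the default `max_gap=4` are assumptions made for illustration, not taken from this project: keywords that are separated by at most `max_gap` other words are grouped into one cluster, and a cluster is scored as (number of keywords)² divided by the cluster length.

```python
def luhn_sentence_score(tokens, keywords, max_gap=4):
    """Score one tokenized sentence by its best keyword cluster."""
    # Positions of the tokens that are keywords.
    positions = [i for i, tok in enumerate(tokens) if tok in keywords]
    if not positions:
        return 0.0

    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 <= max_gap:
            # Few enough non-keywords in between: same cluster.
            count += 1
        else:
            # Close the current cluster and open a new one.
            best = max(best, count ** 2 / (prev - start + 1))
            start, count = pos, 1
        prev = pos
    # Score the cluster that is still open at the end.
    return max(best, count ** 2 / (prev - start + 1))


tokens = "the storm killed many people in the flooded coastal towns".split()
keywords = {"storm", "killed", "flooded", "towns"}
print(luhn_sentence_score(tokens, keywords))             # one cluster of 4 keywords over 9 words
print(luhn_sentence_score(tokens, keywords, max_gap=2))  # two smaller clusters
```

A sentence-level summarizer would then keep the sentences with the highest scores; how the keyword set itself is chosen (for example by frequency) is a separate decision.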



3.1.1 Dataset overview

3.1.2 Sample text

APW19981101.0843 NEWS NEWSWIRE

In Honduras, at least 231 deaths

have been blamed on Mitch, the National Emergency Commission said

Saturday. El Salvador _ where 140 people died in flash floods _

declared a state of emergency Saturday, as did Guatemala, where 21

people died when floods swept away their homes. Mexico reported one

death from Mitch last Monday. In the Caribbean, the U.S. Coast Guard

widened a search for a tourist schooner with 31 people aboard that

hasn’t been heard from since Tuesday. By late Sunday, Mitch’s winds,

once near 180 mph (290 kph), had dropped to near 30 mph (50 kph), and

the storm _ now classified as a tropical depression _ was near

Tapachula, on Mexico’s southern Pacific coast near the Guatemalan

border. Mitch was moving west at 8 mph (13 kph) and was dissipating

but threatened to strengthen again if it moved back out to sea.

3.2 Data preprocessing

3.2.1 Stemming
3.2.2 Stop words



3.2.3 Sentence splitting

‘(?<![\d.])0*(?: (\d+).?|.(0) |(.\d+?)|(\d+.\d+?) )0*(?![\d.])’

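Effect: after the two previous steps (stop-word removal and stemming), the article is split into sentences at ".", "?" and "!", and all punctuation inside each sentence is removed.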
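The splitting rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: `split_sentences` is a hypothetical helper, not the project's `Txt2Sent`, and it treats every character that is not a letter, digit or whitespace as punctuation (which also removes the "_" dashes found in the dataset).

```python
import re


def split_sentences(text):
    """Split on '.', '?' or '!' and strip punctuation inside each sentence."""
    sentences = []
    for chunk in re.split(r"[.?!]", text):
        # Replace every character that is not a letter, digit or whitespace.
        cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", chunk)
        # Collapse the runs of whitespace left behind.
        cleaned = " ".join(cleaned.split())
        if cleaned:
            sentences.append(cleaned)
    return sentences


print(split_sentences("Mitch weakened, slowly. Was it over? No!"))
```

Note that splitting on every "." also cuts decimal numbers and abbreviations in two, so a real pipeline may want a tokenizer-based splitter instead.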
3.2.4 Python implementation of data preprocessing

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize

# The NLTK stop-word list and tokenizer data must be downloaded first (nltk.download).


def stop_words(inpsen, mode='test'):
    # Remove English stop words; mode='test' runs on a built-in sample sentence.
    if mode == 'test':
        example_sent = "This is a sample sentence, showing off the stop words filtration."
    else:
        example_sent = inpsen
    stop_set = set(stopwords.words('english'))
    word_tokens = word_tokenize(example_sent)
    filtered_sentence = [w for w in word_tokens if w not in stop_set]
    return filtered_sentence


def word_porter(inpsen, mode='test'):
    # Stemming with the Porter stemmer.
    porter_stemmer = PorterStemmer()
    if mode == 'test':
        word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"
    else:
        word_data = inpsen
    # First tokenize into words, then reduce every word to its stem.
    nltk_tokens = nltk.word_tokenize(word_data)
    res = [porter_stemmer.stem(w) for w in nltk_tokens]
    return res


def Txt2Sent(text, mode='punctuation'):
    # Cut the text into sentences at '.', '?' and '!'.
    # In 'punctuation' mode the terminator is kept at the end of the sentence.
    sent, sentlist = '', []
    for alpha in text:
        if alpha in ['.', '?', '!']:
            if mode == 'punctuation':
                sent += alpha
            sentlist.append(sent)
            sent = ''
        else:
            sent += alpha
    return sentlist


def list2str(lis):
    # Join tokens back into one space-separated string.
    res = ''
    for w in lis:
        res += (w + ' ')
    return res


class datapremain():
    def DataPreMain(self, text, mode='train'):
        if mode == 'train':
            # Stem, remove stop words, then split into sentences.
            ported_word = word_porter(text, mode='using')
            ported_str = list2str(ported_word)
            final = stop_words(ported_str, mode='using')
            final_str = list2str(final)
            sent_lis = Txt2Sent(final_str)
        else:
            sent_lis = Txt2Sent(text, mode='punctuation')
        return sent_lis

3.3 Sent2Vec



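import numpy as np

# `dox` (the raw documents) and `doc` are module-level globals filled in by main() in section 4.


def tf_idf(target_word, target_text):
    tf, idf = 0, 0
    # Term frequency of target_word inside target_text (a list of sentences).
    target_str = ''
    for sent in target_text:
        target_str += sent
    target_text = target_str.split(' ')
    for word in target_text:
        if word == target_word:
            tf += 1
    tf /= len(target_text)
    # Document frequency: number of raw documents that contain target_word.
    for text in dox:
        text = text.split(' ')
        if target_word in text:
            idf += 1
    # NOTE: the numerator uses len(doc) while the count above runs over `dox`;
    # this is kept exactly as in the original code.
    idf = np.log(len(doc) / (idf + 1))
    return tf * idf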

3.3.2 Sent2Vec

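def Sent2Vec():
    # One row per sentence and 55 columns; sentences with fewer than 55 tokens
    # keep zeros in the remaining columns (zero padding).
    doc_mat = np.zeros((len(doc), 55))
    row = -1
    for text in docs:
        for sent in text:
            row += 1
            for num, word in enumerate(sent.split(' ')):
                doc_mat[row][num] = tf_idf(word, text)
    return doc_mat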

3.4 Computing the similarity between sentences



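import math
from collections import Counter


def counter_cosine_similarity(c1, c2):
    # Avoid a zero-length vector (and a division by zero) when c2 is empty.
    if not c2:
        c2.append('hhh')
    c1 = Counter(c1)
    c2 = Counter(c2)
    terms = set(c1).union(c2)
    dotprod = sum(c1.get(k, 0) * c2.get(k, 0) for k in terms)
    magA = math.sqrt(sum(c1.get(k, 0)**2 for k in terms))
    magB = math.sqrt(sum(c2.get(k, 0)**2 for k in terms))
    result = dotprod / (magA * magB)
    # cosine_similarity_backprocess is a post-processing helper that is not shown in this article.
    final = cosine_similarity_backprocess(result)
    return final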
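As a small worked example of the same idea, the sketch below computes the bag-of-words cosine similarity between two token lists. `bow_cosine` is a hypothetical, self-contained variant written for illustration: unlike the function above it returns 0.0 for an empty input instead of inserting a placeholder token, and it applies no post-processing.

```python
import math
from collections import Counter


def bow_cosine(tokens_a, tokens_b):
    """Cosine similarity between two bags of words."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    # Only terms present in both bags contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two of three words are shared, so the similarity is 2 / (sqrt(3) * sqrt(3)).
print(bow_cosine(["mitch", "hit", "honduras"], ["mitch", "hit", "nicaragua"]))
```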

3.5 Similarity clustering

3.5.1 K-Means


import numpy as np
import pandas as pd


class KMEans(object):
    def __init__(self):
        super(KMEans, self).__init__()
        self.features, self.label = np.array([]), np.array([])
        self.centroids_index, self.centroids = np.array([]), np.array([])
        self.distance_sum, self.label_validation, self.clusters, self.acc_list = [], [], [], []

    def read_data(self, path):
        # Every column except the last is a feature; the last column is the label.
        df = pd.read_csv(path)
        columns_name = df.columns.values
        self.features = np.zeros((len(df[columns_name[0]].values), len(columns_name[:-1])))
        self.label = np.zeros(len(df[columns_name[0]].values))
        self.label_validation = np.zeros_like(self.label)
        for feature in range(self.features.shape[0]):
            for elem in range(self.features.shape[1]):
                self.features[feature][elem] = np.float32(df[columns_name[elem]].values[feature])
                if elem == self.features.shape[1] - 1:
                    self.label[feature] = df[columns_name[elem + 1]].values[feature]
                    self.label_validation[feature] = int(self.label[feature])
        return (self.features, range(1, len(self.label) + 1))

    def calculate_distance(self, p1, p2):
        # Euclidean distance. np.int was removed in NumPy 1.24, so int() is used instead.
        # NOTE: as in the original, coordinates are truncated to integers first,
        # which discards the fractional part of each feature.
        dis = 0
        for elem in range(self.features.shape[1]):
            dis += np.square(int(p1[elem]) - int(p2[elem]))
        return np.sqrt(dis)

    def gen_init_centroids(self, K):
        # Pick K random samples as initial centroids (index 0 is never chosen because low=1).
        self.centroids = np.zeros((K, self.features.shape[1]))
        self.centroids_index = np.random.randint(low=1, high=self.features.shape[0], size=K)
        for i in range(len(self.centroids_index)):
            self.centroids[i] = self.features[self.centroids_index[i]]

    def fit_model(self, K):
        self.gen_init_centroids(K)
        epoches = 10
        self.acc_list = []
        self.distance_sum = []
        for epoch in range(epoches):
            self.distance_sum.append(0)
            self.clusters = [[] for x in range(K)]
            # Assignment step: attach every sample to its nearest centroid.
            for elem in range(self.features.shape[0]):
                dis_list = []
                for centorid in self.centroids:
                    dis_list.append(self.calculate_distance(centorid, self.features[elem]))
                self.label[elem] = dis_list.index(min(dis_list))
                self.distance_sum[-1] += min(dis_list)
                self.clusters[int(self.label[elem])].append(elem)
            # Update step: move each centroid to the mean of its members.
            # The original accumulated with `sum[f] += value` on a list, which raises
            # a TypeError, and divided by zero for empty clusters; both are fixed here.
            for k in range(K):
                if not self.clusters[k]:
                    continue  # keep the previous centroid of an empty cluster
                self.centroids[k] = self.features[self.clusters[k]].mean(axis=0)
            # Record the accuracy; encode_loss is not shown in this article.
            self.acc_list.append(self.encode_loss(K))
        return self.clusters

3.6 ROUGE evaluation


ROUGE is an automatic summary evaluation method proposed by Lin and Hovy of ISI. It is now widely used in the summarization evaluation tasks of DUC (Document Understanding Conference).


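ROUGE-N is defined as

$$\text{ROUGE-N} = \frac{\sum_{S \in \{\text{Ref Summaries}\}} \sum_{\text{n-gram} \in S} \text{Count}_{match}(\text{n-gram})}{\sum_{S \in \{\text{Ref Summaries}\}} \sum_{\text{n-gram} \in S} \text{Count}(\text{n-gram})}$$

where n-gram denotes a sequence of n words, {Ref Summaries} denotes the reference summaries (the gold-standard summaries obtained in advance), Count_match(n-gram) is the number of n-grams that occur in both the system summary and the reference summary, and Count(n-gram) is the number of n-grams in the reference summary.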

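It is easy to see that the ROUGE formula is derived from the formula for recall. The numerator can be read as the "number of relevant documents retrieved", that is, the number of n-grams in the system summary that match the reference summary. The denominator can be read as the "number of relevant documents", that is, the total number of n-grams in the reference summary. Python implementation: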

from rouge import Rouge


def rouge(a, b):
    scorer = Rouge()
    rouge_score = scorer.get_scores(a, b, avg=True)  # use when a and b contain several sentences
    rouge_score1 = scorer.get_scores(a, b)  # use when a and b each contain a single sentence
    # Pick whichever of the two calls above fits your needs.
    r1 = rouge_score["rouge-1"]
    r2 = rouge_score["rouge-2"]
    rl = rouge_score["rouge-l"]
    return r1, r2, rl
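To see the recall-style definition at work without any library, here is a minimal sketch that computes ROUGE-N recall by hand. The helper names `ngrams` and `rouge_n_recall` are hypothetical, tokenization is plain whitespace splitting, and the result is not guaranteed to match the `rouge` package exactly, since that package applies its own preprocessing.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, references, n=1):
    """Matched reference n-grams divided by all reference n-grams."""
    cand = ngrams(candidate.split(), n)
    matched = total = 0
    for ref in references:
        ref_counts = ngrams(ref.split(), n)
        # Clipped match: an n-gram cannot match more often than it
        # occurs in the candidate summary.
        matched += sum(min(c, cand[g]) for g, c in ref_counts.items())
        total += sum(ref_counts.values())
    return matched / total if total else 0.0


candidate = "the storm hit honduras"
reference = "the storm hit the coast"
print(rouge_n_recall(candidate, [reference], n=1))  # 3 of 5 reference unigrams match
print(rouge_n_recall(candidate, [reference], n=2))  # 2 of 4 reference bigrams match
```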

4. Summary generation

4.1 Baseline


Honduras braced for potential catastrophe Tuesday as Hurricane Mitch

roared through the northwest Caribbean, churning up high waves and

intense rain that sent coastal residents scurrying for safer ground.

Hurricane Mitch paused in its whirl through the western Caribbean

on Wednesday to punish Honduras with 120-mph (205-kph) winds, topping

trees, sweeping away bridges, flooding neighborhoods and killing at

least 32 people.

Hurricane Mitch cut through the Honduran coast like a ripsaw Thursday,

its devastating winds whirling for a third day through resort islands

and mainland communities.

At least 231 people have been confirmed dead in Honduras from former-hurricane

Mitch, bringing the storm’s death toll in the region to 357, the National

Emergency Commission said Saturday.

In Honduras, at least 231 deaths have been blamed on Mitch, the National

Emergency Commission said Saturday.

Nicaraguan Vice President Enrique Bolanos said Sunday night that between

1,000 and 1,500 people were buried in a 32-square mile (82.

BRUSSELS, Belgium (AP) - The European Union on Tuesday approved 6.

Pope John Paul II appealed for aid Wednesday for the Central American

countries stricken by hurricane Mitch and said he feels close to the

thousands who are suffering.

Better information from Honduras’ ravaged countryside enabled officials

to lower the confirmed death toll from Hurricane Mitch from 7,000

to about 6,100 on Thursday, but leaders insisted the need for help

was growing.

Aid workers struggled Friday to reach survivors of Hurricane Mitch,

who are in danger of dying from starvation and disease in the wake

of the storm that officials estimate killed more than 10,000 people.

ROUGE evaluation:

Python implementation:

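# Both functions are methods of the summarizer class; `path` is a module-level global.
def baseline(self, file, mode='BaseLine'):
    abstract = ''
    with open(file, 'r') as f:
        data = f.readlines()
    # Skip the 5 header lines and the 2 trailing lines of each file.
    txt = ''
    for row in data[5:-2]:
        txt += row
    for alpha in txt:
        if alpha != '\n':
            abstract += alpha
        # In 'BaseLine' mode stop at the first '.', which may also be a
        # decimal point or an abbreviation rather than the end of a sentence.
        if alpha == '.' and mode == 'BaseLine':
            break
    return abstract


def baselineMain(self):
    # Concatenate the first sentence of every article of the topic.
    abstracts = ''
    self.text = []
    for text in self.PathList:
        first_sentence = self.baseline(path + '\\' + text, mode='BaseLine')
        abstracts += first_sentence
        self.text.append(first_sentence)
    self.abstracts = abstracts
    return abstracts, self.text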

4.2 Summary generation by sentence relevance ranking


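[1] Preprocess the text: stop-word removal, stemming, sentence splitting and TF-IDF.
[2] Treat the 10 articles of a Topic as one long sentence.
[3] Compute the cosine similarity between every split short sentence and the long sentence.
[4] Rank the short sentences by this similarity.
[5] First, put the most relevant sentence into the summary.
[6] Then go through the ranking and compare the most relevant remaining sentence with the current summary.
[7] Manually defined threshold: the relevance threshold is the mean cosine similarity to the long sentence over all sentences ranked before the sentence that contains the 1330th (665 * 2) word.
[8] If the cosine similarity between the currently most relevant sentence and the current summary is greater than the threshold, add the sentence to the summary.
[9] Stop when the summary reaches 665 words (the rank() implementation in this section checks this limit with len(result), that is, in characters).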
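The procedure above can be sketched as follows. This is an illustration that follows the description literally, not the project's implementation: `cosine` and `select_summary` are hypothetical names, sentences are plain token lists compared as bags of words (no TF-IDF weighting), and lengths are counted in words.

```python
import math
from collections import Counter


def cosine(tokens_a, tokens_b):
    """Bag-of-words cosine similarity between two token lists."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def select_summary(sentences, max_words=665):
    """Return the indices of the selected sentences, in selection order."""
    if not sentences:
        return []
    # The whole topic is treated as one long sentence.
    topic = [tok for sent in sentences for tok in sent]
    scores = [cosine(sent, topic) for sent in sentences]
    # Rank sentences by similarity to the topic, best first.
    order = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)

    # Threshold: mean score of the sentences ranked before the one that
    # contains the (2 * max_words)-th word.
    head, words = [], 0
    for i in order:
        if words + len(sentences[i]) >= 2 * max_words:
            break
        words += len(sentences[i])
        head.append(scores[i])
    threshold = sum(head) / len(head) if head else 0.0

    # Greedy selection: start with the best sentence, then add a sentence
    # when its similarity to the current summary exceeds the threshold.
    selected = [order[0]]
    summary = list(sentences[order[0]])
    for i in order[1:]:
        if len(summary) >= max_words:
            break
        if cosine(sentences[i], summary) > threshold:
            selected.append(i)
            summary.extend(sentences[i])
    return selected


sents = [s.split() for s in (
    "storm hit honduras coast",
    "storm hit honduras hard",
    "pope appealed for aid",
)]
print(select_summary(sents, max_words=8))
```

Because a sentence is added only when it is similar enough to the summary built so far, the result stays close to the topic of the first sentence; a variant that penalizes similarity instead would favor coverage over focus.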


Nicaragua’s leftist Sandinistas, who maintained close relations

with Fidel Castro during their 1979-90 rule, had criticized the refusal by

President Arnoldo Aleman’s administration. Nicaraguan

leaders previously had refused Cuba’s offer of medical

help, saying it did not have the means to transport or support the doctors. Nicaragua

said Friday it will accept Cuba’s offer to send doctors as long as the communist

nation flies them in on its own helicopters and with their own supplies. ``It’s a coincidence that the ships

are there but they’ve got men and equipment that can

be put to work in an organized way,’’ said International Development Secretary Clare.


import os


def rank(sentmat, orignal, orignal_doc):
    result = ''
    # Similarity of every sentence vector to the whole topic.
    rank_dict = {}
    for num, sent in enumerate(sentmat):
        rank_dict[num] = counter_cosine_similarity(sent, orignal)
    # Sentence indices sorted by similarity, highest first.
    keys_list = sorted(rank_dict, key=rank_dict.get, reverse=True)
    # Append sentences until the summary exceeds 665 characters
    # (or until there are no sentences left).
    i = 0
    while len(result) <= 665 and i < len(keys_list):
        result += orignal_doc[keys_list[i]]
        i += 1
    return result


def main():
    # aa, root_path, datapre, Word2Num and str3 are defined elsewhere in the project.
    global aa, doc, docs, dox, orignal_str
    path_list = os.listdir(root_path)
    for i in range(len(path_list)):
        path_list[i] = os.path.join(root_path, path_list[i])
    for path in path_list:
        pre_abstract = aa.baseline(file=path, mode='using')
        doc.append(pre_abstract.replace('\n', ' '))
    dox = doc
    docs, orignal_list = datapre()
    doc = []
    orignal_doc = []
    orignal_str = ''
    for sent in dox:
        orignal_str += sent
    for i in docs:
        doc += i
    for i in orignal_list:
        orignal_doc += i
    sentmat = Word2Num()
    result = rank(sentmat, orignal_str, orignal_doc)
    # Print the summary, breaking the line at the first space after 25 characters.
    num = 0
    flag = 0
    for alpha in result:
        print(alpha, end='')
        num += 1
        if num >= 25:
            flag = 1
        if num >= 25 and flag == 1 and alpha == ' ':
            print('\n')
            num, flag = 0, 0
    return rouge(result, str3)

4.3 Relevance-ranked summary after K-Means clustering




In Washington on Thursday, President Bill Clinton ordered dlrs 30

million in Defense Department equipment and services and dlrs 36

million in food, fuel and other aid be sent to Honduras, Nicaragua,

El Salvador and Guatemala.At least 231 people have been confirmed dead

in Honduras from former-hurricane Mitch, bringing the storm’s death

toll in the region to 357, the National Emergency Commission said

Saturday. About 100 victims had been buried around Tegucigalpa, Mayor

Nahum Valladeres said. Until now, we have had a short amount of

time and few resources to get reliable information. Former U. It also

kicked up huge waves that pounded seaside communities. Hillary Rodham

Clinton also will travel to the region, visiting Nicaragua and

Honduras on Nov. We’re trying to move food as fast as possible to

help people as soon as possible,’’ Rowe said. commitment to

providing humanitarian relief. Mexico reported one death from Mitch

last Monday.The county is semi-destroyed and awaits the maximum

effort and most fervent and constant work of every one of its

children,’’ he said. The hurricane has destroyed almost

everything,’’ said Mike Brown, a resident of Guanaja Island which was

within miles (kms) of the eye of the hurricane.’’ The entire coast of

Honduras was under a hurricane warning and up to 15 inches (38

centimeters) of rain was forecast in mountain areas. The latest EU aid

follows an initial 400,000 ecu (dlrs 480,000).


import os

import numpy as np


def cal_simi_in(clus, clusent, data):
    # Similarity of every sentence to the centroid of its own cluster.
    clus_simi_in = {}
    for key in clus.keys():
        if key not in clus_simi_in.keys():
            clus_simi_in[key] = []
        for sent in clus[key]:
            clus_simi_in[key].append({sent: counter_cosine_similarity(data[sent], clusent[key])})
    # For each cluster keep the sentence that is closest to the centroid.
    maxi = []
    locy = ['cluster key', 'index in data', 'max similarity']
    locs = []
    for key in clus_simi_in.keys():
        maxi.append(0)
        for elem in clus_simi_in[key]:
            if list(elem.values())[0] >= maxi[-1]:
                maxi[-1] = list(elem.values())[0]
                locy = [key, list(elem.keys())[0], maxi[-1]]
        locs.append(locy)
    return locs


def cal_simi_out(clusent):
    # Similarity of every cluster centroid to the whole topic.
    doc_data = []
    simi_out = {}
    for sent in doc:
        for word in sent.split(' '):
            doc_data.append(tf_idf(word, doc))
    for key in clusent.keys():
        simi_out[key] = counter_cosine_similarity(clusent[key], doc_data)
    return simi_out


def rank(simi_in, simi_out, orignal_doc):
    result = ''
    # NOTE: as in the original, clusters are ordered by their key (descending),
    # not by their similarity value in simi_out.
    keys_list = sorted(simi_out.keys(), reverse=True)
    for i in keys_list:
        for elem in simi_in:
            if elem[0] == i:
                result += orignal_doc[elem[1]]
    print(result)
    return result


def main():
    # aa, root_path, datapre, Word2Num, clustering and str3 are defined elsewhere in the project.
    global aa, doc, docs, dox
    path_list = os.listdir(root_path)
    for i in range(len(path_list)):
        path_list[i] = os.path.join(root_path, path_list[i])
    for path in path_list:
        pre_abstract = aa.baseline(file=path, mode='using')
        doc.append(pre_abstract.replace('\n', ' '))
    dox = doc
    docs, orignal_list = datapre()
    doc = []
    orignal_doc = []
    orignal_str = ''
    for sent in dox:
        orignal_str += sent
    for i in docs:
        doc += i
    for i in orignal_list:
        orignal_doc += i
    sentmat = Word2Num()
    # The number of clusters is the square root of the number of sentences.
    clus, clusent = clustering(sentmat, np.sqrt(sentmat.shape[0]))
    simi_in = cal_simi_in(clus, clusent, sentmat)
    simi_out = cal_simi_out(clusent)
    result = rank(simi_in, simi_out, orignal_doc)
    print(result)
    return rouge(result, str3)

5. Problems encountered while building the project, and how they were solved

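[1] Dataset issues: some symbols are non-standard and the text contains line breaks, so the only option was to process the string character by character and remove the line breaks.
[2] When choosing the threshold in improved method 2, I first took the mode (at a precision of 1e-3) of the similarity to the Topic over the sentences covering the first 1330 words. The mode, however, suffers from sentences piling up at similar similarity values, so that sentences with high similarity in feature space ended up with low similarity to the summary. Switching to the mean solved this problem effectively.
[3] In improved method 1, when the vector set is built with TF-IDF, sentences differ in length, so the number of dimensions is not fixed. K-Means requires every sample to have the same number of features, so the vectors cannot be fed to K-Means directly. I do not know many NLP tricks, but I felt that the features within one sentence should be related to each other, so shorter sentences are padded with zeros, similar to zero padding in a CNN.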
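The zero padding mentioned in [3] can be sketched as below. `pad_sentence_vectors` is a hypothetical helper written for illustration; it uses plain Python lists, and the choice to truncate sentences longer than the target width is an assumption, since the article does not say how over-long sentences are handled.

```python
def pad_sentence_vectors(vectors, width=None):
    """Right-pad every vector with zeros so that all rows have `width` features."""
    if width is None:
        # Default to the longest sentence in the batch.
        width = max((len(v) for v in vectors), default=0)
    padded = []
    for vec in vectors:
        row = [float(x) for x in vec[:width]]   # truncate over-long sentences
        row += [0.0] * (width - len(row))       # zero-pad short ones
        padded.append(row)
    return padded


# Per-word TF-IDF weights of two sentences with different lengths.
weights = [[0.12, 0.40, 0.05], [0.33]]
print(pad_sentence_vectors(weights))
print(pad_sentence_vectors(weights, width=2))
```

The padded rows can be turned into a matrix with `np.array(...)` before clustering. One caveat of this representation is that column j then means "the weight of the j-th word of the sentence", so two sentences containing the same words in a different order get different vectors.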
