1200字范文 > 小语种nlp文本预处理——数据清洗

小语种nlp文本预处理——数据清洗

时间：2023-05-31 10:08:21

相关推荐

小语种nlp文本预处理——数据清洗

开始继续完成大数据实验室招新题~

（roman urdu小语种为例）

link: /home/competition/5c77ab9c1ce0af002b55af86/content/1

本练习赛所用数据，是名为「Roman Urdu DataSet」的公开数据集。

这些数据，均为文本数据。原始数据的文本，对应三类情感标签：Positive, Negative, Netural。

本练习赛，移除了标签为Netural的数据样例。因此，练习赛中，所有数据样例的标签为Positive和Negative。本练习赛的任务是「分类」。「分类目标」是用训练好的模型，对测试集中的文本情感进行预测，判断其情感为「Negative」或者「Positive」。

数据清洗

函数等相关内容转载自：/ssswill/article/details/88533623

这位大佬已经做的很完美，我仅仅只能是跟着他的步骤重新做了一遍，并记录下笔记。

当得到数据时，发现数据中存在很多的的符号，没法对数据进行直接分词处理。

并在此之前检查数据是否存在空缺

#test if nan existsdf_train.isnull().sum()

在此之后采用np.as_matrix()库将数据转换为矩阵形式

numpy_array = df_train.as_matrix()numpy_array_test = df_test.as_matrix()

因此采用了常见的清理函数，对数据中的符号标点进行了清洗替换

本次主要使用的是re库的sub函数，也可以使用python基本库中的replace库。

re.sub(pattern, repl, string, count=0, flags=0)

参数：

pattern：表示正则表达式中的模式字符串；

repl：被替换的字符串（既可以是字符串，也可以是函数）；

string：要被处理的，要被替换的字符串；

count：匹配的次数, 默认是全部替换

flags：具体用处不详

str.replace(old, new[, max])

参数：

old：将被替换的子字符串。

new：新字符串，用于替换old子字符串。

max：可选字符串, 替换不超过 max 次

#nlp清理函数def cleaner(word):word = re.sub(r'\#\.', '', word)word = re.sub(r'\n', '', word)word = re.sub(r',', '', word)word = re.sub(r'\-', ' ', word)word = re.sub(r'\.', '', word)word = re.sub(r'\\', ' ', word)word = re.sub(r'\\x\.+', '', word)word = re.sub(r'\d', '', word)word = re.sub(r'^_.', '', word)word = re.sub(r'_', ' ', word)word = re.sub(r'^ ', '', word)word = re.sub(r' $', '', word)word = re.sub(r'\?', '', word)return word.lower()def hashing(word):word = re.sub(r'ain$', r'ein', word)word = re.sub(r'ai', r'ae', word)word = re.sub(r'ay$', r'e', word)word = re.sub(r'ey$', r'e', word)word = re.sub(r'ie$', r'y', word)word = re.sub(r'^es', r'is', word)word = re.sub(r'a+', r'a', word)word = re.sub(r'j+', r'j', word)word = re.sub(r'd+', r'd', word)word = re.sub(r'u', r'o', word)word = re.sub(r'o+', r'o', word)word = re.sub(r'ee+', r'i', word)if not re.match(r'ar', word):word = re.sub(r'ar', r'r', word)word = re.sub(r'iy+', r'i', word)word = re.sub(r'ih+', r'eh', word)word = re.sub(r's+', r's', word)if re.search(r'[rst]y', 'word') and word[-1] != 'y':word = re.sub(r'y', r'i', word)if re.search(r'[bcdefghijklmnopqrtuvwxyz]i', word):word = re.sub(r'i$', r'y', word)if re.search(r'[acefghijlmnoqrstuvwxyz]h', word):word = re.sub(r'h', '', word)word = re.sub(r'k', r'q', word)return worddef array_cleaner(array):# X = arrayX = []for sentence in array:clean_sentence = ''words = sentence.split(' ')for word in words:clean_sentence = clean_sentence +' '+ cleaner(word)X.append(clean_sentence)return XX_test = numpy_array_test[:,1]X_train = numpy_array[:, 1]#Clean X hereX_train = array_cleaner(X_train)X_test = array_cleaner(X_test)y_train = numpy_array[:, 2]

处理后观察

发现数据已经被完全清洗。

本内容不代表本网观点和政治立场，如有侵犯你的权益请联系我们处理。

网友评论

网友评论仅供其表达个人看法，并不表明网站立场。