07-20 01:29 · 72 views

Does an SVM need normalization? Fitting a support vector machine with sklearn

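I am trying to build a spam classifier. I have collected several datasets from the internet (for example, the SpamAssassin spam/ham corpus) and written the following:

import os

import numpy
import pandas
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
# KFold lives in sklearn.model_selection; the older sklearn.cross_validation
# module was removed in scikit-learn 0.20.
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix, f1_score
from sklearn import svm

NEWLINE = '\n'
HAM = 'ham'
SPAM = 'spam'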

SOURCES = [
    ('C:/data/spam', SPAM),
    ('C:/data/easy_ham', HAM),
    # Commented out, since they take too long:
    # ('C:/data/hard_ham', HAM),
    # ('C:/data/beck-s', HAM),
    # ('C:/data/farmer-d', HAM),
    # ('C:/data/kaminski-v', HAM),
    # ('C:/data/kitchen-l', HAM),
    # ('C:/data/lokay-m', HAM),
    # ('C:/data/williams-w3', HAM),
    # ('C:/data/BG', SPAM),
    # ('C:/data/GP', SPAM),
    # ('C:/data/SH', SPAM),
]

SKIP_FILES = {'cmds'}

def read_files(path):
    """Yield (file_path, body) for every email file found under path."""
    # os.walk already descends into subdirectories, so no explicit
    # recursive call is needed here.
    for root, dir_names, file_names in os.walk(path):
        for file_name in file_names:
            if file_name not in SKIP_FILES:
                file_path = os.path.join(root, file_name)
                if os.path.isfile(file_path):
                    past_header, lines = False, []
                    # Everything after the first blank line is the message body.
                    with open(file_path, encoding='latin-1') as f:
                        for line in f:
                            if past_header:
                                lines.append(line)
                            elif line == NEWLINE:
                                past_header = True
                    content = NEWLINE.join(lines)
                    yield file_path, content

def build_data_frame(path, classification):
    """Build a DataFrame of email bodies and labels, indexed by file path."""
    rows = []
    index = []
    for file_name, text in read_files(path):
        rows.append({'text': text, 'class': classification})
        index.append(file_name)
    data_frame = DataFrame(rows, index=index)
    return data_frame

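# DataFrame.append was removed in pandas 2.0, so the per-source frames are
# combined with pandas.concat (this needs `import pandas`).
data = pandas.concat(
    [build_data_frame(path, classification) for path, classification in SOURCES]
)

# Shuffle the rows so that each fold sees a mix of spam and ham.
data = data.reindex(numpy.random.permutation(data.index))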

pipeline = Pipeline([
    ('count_vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('classifier', svm.SVC(gamma=0.001, C=100)),
])

# KFold from sklearn.model_selection takes n_splits and yields the
# train/test index arrays through split().
k_fold = KFold(n_splits=6)
scores = []
confusion = numpy.array([[0, 0], [0, 0]])
for train_indices, test_indices in k_fold.split(data):
    train_text = data.iloc[train_indices]['text'].values
    train_y = data.iloc[train_indices]['class'].values.astype(str)

    test_text = data.iloc[test_indices]['text'].values
    test_y = data.iloc[test_indices]['class'].values.astype(str)

    pipeline.fit(train_text, train_y)
    predictions = pipeline.predict(test_text)

    # Accumulate across folds rather than overwriting the matrix each time;
    # fixing the label order keeps the matrix 2x2 in every fold.
    confusion += confusion_matrix(test_y, predictions, labels=[HAM, SPAM])
    score = f1_score(test_y, predictions, pos_label=SPAM)
    scores.append(score)

print('Total emails classified:', len(data))
print('Support Vector Machine Output:')
print('Score: ' + str(sum(scores) / len(scores) * 100) + '%')
print('Confusion matrix:')
print(confusion)
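The listing builds its features with CountVectorizer and ngram_range=(1, 2), so every distinct word and every distinct pair of adjacent words becomes a feature. The sketch below is a simplified pure-Python illustration of how quickly that vocabulary grows. The helper word_ngrams and the three toy emails are made up for this example, and the whitespace tokenization is an assumption: it is not how CountVectorizer tokenizes internally.

```python
def word_ngrams(tokens, min_n=1, max_n=2):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


# Toy emails, invented for illustration only.
emails = [
    "win free money now",
    "free money for your account",
    "meeting agenda for monday",
]

unigram_vocab = set()
bigram_vocab = set()
for email in emails:
    tokens = email.lower().split()  # naive whitespace tokenization (assumption)
    unigram_vocab.update(word_ngrams(tokens, 1, 1))
    bigram_vocab.update(word_ngrams(tokens, 1, 2))

print(len(unigram_vocab), "features with unigrams only")
print(len(bigram_vocab), "features with unigrams and bigrams")
```

On a real corpus the gap is typically much larger, which may account for part of the running time.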

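The lines I commented out are additional email collections. Even with most of the datasets commented out and only the smallest ones selected, the script is very slow (about 15 minutes) and reaches an accuracy of roughly 91%. How can I improve both the speed and the accuracy?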
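One direction that may be worth trying, offered as a sketch rather than a definitive answer: svm.SVC trains a kernel SVM whose fit time grows at least quadratically with the number of samples, whereas LinearSVC is designed for large, sparse problems such as bag-of-words text. On the normalization question in the title, TfidfVectorizer L2-normalizes each document vector by default, which puts long and short emails on a comparable scale. The toy texts, the name fast_pipeline and the parameter values below are assumptions for illustration. They are not tuned for the SpamAssassin data and nothing here has been measured against it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for the real corpus (hypothetical data, illustration only).
texts = [
    "win free money now",
    "free prize claim your money",
    "cheap pills free offer",
    "meeting agenda for monday",
    "please review the attached report",
    "lunch on monday with the team",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

fast_pipeline = Pipeline([
    # norm="l2" is the default: each document vector is scaled to unit length.
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)),
    # Linear SVM; C=1.0 is a starting point, not a tuned value.
    ("classifier", LinearSVC(C=1.0, random_state=0)),
])

fast_pipeline.fit(texts, labels)
predictions = fast_pipeline.predict(
    ["claim your free money", "agenda for the team meeting"]
)
print(predictions)
```

If this helps on the real data, the same cross-validation loop can be reused unchanged by swapping fast_pipeline in for pipeline. C and the n-gram range would still need to be chosen by validation.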