《Python机器学习及实践》----Practical Model Techniques

wangshuang1631    2022-12-10


This blog post works through examples from the book 《Python机器学习及实践》 (*Python Machine Learning and Practice*); all of the code below has been run successfully on my local machine. The data comes either from the Baidu netdisk link specified by the book or from scikit-learn's built-in datasets downloaded for local use.
Code snippets:

# DictVectorizer turns a list of feature dicts into a numeric matrix,
# one-hot encoding string values and passing numeric values through.
measurements = [{'city': 'Dubai', 'temperature': 33.},
                {'city': 'London', 'temperature': 12.},
                {'city': 'San Fransisco', 'temperature': 18.}]
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()
print(vec.fit_transform(measurements).toarray())
print(vec.get_feature_names_out())
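To see the resulting columns side by side, the transformed array can be wrapped in a DataFrame (a small convenience sketch added here, not from the book; it assumes pandas is available):

import pandas as pd
features = vec.fit_transform(measurements).toarray()
# One column per one-hot 'city' value plus the pass-through 'temperature' column.
print(pd.DataFrame(features, columns=vec.get_feature_names_out()))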

# 20 Newsgroups text classification: bag-of-words counts + Multinomial Naive Bayes.
from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(subset='all')
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(news.data, news.target, test_size=0.25, random_state=33)
from sklearn.feature_extraction.text import CountVectorizer
count_vec = CountVectorizer()
X_count_train = count_vec.fit_transform(X_train)
X_count_test = count_vec.transform(X_test)
from sklearn.naive_bayes import MultinomialNB
mnb_count = MultinomialNB()
mnb_count.fit(X_count_train, y_train)
print('The accuracy of classifying 20newsgroups using Naive Bayes (CountVectorizer without filtering stopwords):', mnb_count.score(X_count_test, y_test))
y_count_predict = mnb_count.predict(X_count_test)
from sklearn.metrics import classification_report
print(classification_report(y_test, y_count_predict, target_names=news.target_names))
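Bag-of-words matrices are high-dimensional but extremely sparse; a quick shape check confirms the vectorizer did what was expected (an illustrative addition, not in the book):

# (number of training documents, vocabulary size), stored as a SciPy sparse matrix.
print(X_count_train.shape)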

# The same classifier on tf-idf features instead of raw counts.
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vec = TfidfVectorizer()
X_tfidf_train = tfidf_vec.fit_transform(X_train)
X_tfidf_test = tfidf_vec.transform(X_test)
mnb_tfidf = MultinomialNB()
mnb_tfidf.fit(X_tfidf_train, y_train)
print('The accuracy of classifying 20newsgroups using Naive Bayes (TfidfVectorizer without filtering stopwords):', mnb_tfidf.score(X_tfidf_test, y_test))
y_tfidf_predict = mnb_tfidf.predict(X_tfidf_test)
print(classification_report(y_test, y_tfidf_predict, target_names=news.target_names))
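TfidfVectorizer re-weights raw term counts by inverse document frequency, so ubiquitous words contribute less; the learned idf weights can be inspected on the fitted vectorizer (a quick check added for illustration, not part of the book's code):

import numpy as np
# Terms with the smallest idf appear in the most documents.
print(np.sort(tfidf_vec.idf_)[:10])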

# Repeat both experiments with English stop words removed.
count_filter_vec = CountVectorizer(analyzer='word', stop_words='english')
tfidf_filter_vec = TfidfVectorizer(analyzer='word', stop_words='english')
X_count_filter_train = count_filter_vec.fit_transform(X_train)
X_count_filter_test = count_filter_vec.transform(X_test)
X_tfidf_filter_train = tfidf_filter_vec.fit_transform(X_train)
X_tfidf_filter_test = tfidf_filter_vec.transform(X_test)
mnb_count_filter = MultinomialNB()
mnb_count_filter.fit(X_count_filter_train, y_train)
print('The accuracy of classifying 20newsgroups using Naive Bayes (CountVectorizer by filtering stopwords):', mnb_count_filter.score(X_count_filter_test, y_test))
y_count_filter_predict = mnb_count_filter.predict(X_count_filter_test)
mnb_tfidf_filter = MultinomialNB()
mnb_tfidf_filter.fit(X_tfidf_filter_train, y_train)
print('The accuracy of classifying 20newsgroups using Naive Bayes (TfidfVectorizer by filtering stopwords):', mnb_tfidf_filter.score(X_tfidf_filter_test, y_test))
y_tfidf_filter_predict = mnb_tfidf_filter.predict(X_tfidf_filter_test)
print(classification_report(y_test, y_count_filter_predict, target_names=news.target_names))
print(classification_report(y_test, y_tfidf_filter_predict, target_names=news.target_names))
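To quantify what the stop-word filter removes, the vocabulary sizes of the unfiltered and filtered vectorizers can be compared (an illustrative addition, not in the book):

# Filtering English stop words removes the most frequent, least informative terms.
print(len(count_vec.vocabulary_), len(count_filter_vec.vocabulary_))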

# Titanic survival prediction: vectorize mixed categorical/numeric features,
# then fit a decision tree as a baseline before any feature selection.
import pandas as pd
titanic = pd.read_csv('D:\\Source Code\\machinelearn\\titanic.txt')
y = titanic['survived']
X = titanic.drop(['row.names', 'name', 'survived'], axis=1)
# Impute missing ages with the mean; mark any other missing values.
X['age'] = X['age'].fillna(X['age'].mean())
X = X.fillna('UNKNOWN')
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()
X_train = vec.fit_transform(X_train.to_dict(orient='records'))
X_test = vec.transform(X_test.to_dict(orient='records'))
print(len(vec.feature_names_))
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(criterion='entropy')
dt.fit(X_train, y_train)
print(dt.score(X_test, y_test))
# Keep only the top 20% of features ranked by the chi-squared statistic.
from sklearn import feature_selection
fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=20)
X_train_fs = fs.fit_transform(X_train, y_train)
dt.fit(X_train_fs, y_train)
X_test_fs = fs.transform(X_test)
print(dt.score(X_test_fs, y_test))
# Cross-validate the tree over a range of feature percentiles to find the best one.
from sklearn.model_selection import cross_val_score
import numpy as np
percentiles = range(1, 100, 2)
results = []
for i in percentiles:
    fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=i)
    X_train_fs = fs.fit_transform(X_train, y_train)
    scores = cross_val_score(dt, X_train_fs, y_train, cv=5)
    results = np.append(results, scores.mean())
print(results)
opt = int(np.where(results == results.max())[0][0])
print('Optimal percentile of features: %d' % percentiles[opt])
import matplotlib.pyplot as plt
plt.plot(percentiles, results)
plt.xlabel('percentiles of features')
plt.ylabel('accuracy')
plt.show()
# Refit with the best percentile found above (7%).
fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=7)
X_train_fs = fs.fit_transform(X_train, y_train)
dt.fit(X_train_fs, y_train)
X_test_fs = fs.transform(X_test)
print(dt.score(X_test_fs, y_test))
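To see which vectorized columns the chi-squared filter keeps at the 7% level, SelectPercentile exposes a boolean mask via get_support() (an illustrative addition, not part of the book's code):

# Map the selection mask back onto the DictVectorizer's feature names.
selected = np.array(vec.feature_names_)[fs.get_support()]
print(selected)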

# Pizza price regression: training diameters (inches) and prices (USD).
X_train = [[6], [8], [10], [14], [18]]
y_train = [[7], [9], [13], [17.5], [18]]
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
import numpy as np
xx = np.linspace(0, 26, 100)
xx = xx.reshape(xx.shape[0], 1)
yy = regressor.predict(xx)
import matplotlib.pyplot as plt
plt.scatter(X_train, y_train)
plt1, = plt.plot(xx, yy, label="Degree=1")
plt.axis([0, 25, 0, 25])
plt.xlabel('Diameter of Pizza')
plt.ylabel('Price of Pizza')
plt.legend(handles = [plt1])
plt.show()
print('The R-squared value of Linear Regressor performing on the training data is', regressor.score(X_train, y_train))
# Expand the inputs to degree-2 polynomial features and refit a linear model.
from sklearn.preprocessing import PolynomialFeatures
poly2 = PolynomialFeatures(degree=2)
X_train_poly2 = poly2.fit_transform(X_train)
regressor_poly2 = LinearRegression()
regressor_poly2.fit(X_train_poly2, y_train)
xx_poly2 = poly2.transform(xx)
yy_poly2 = regressor_poly2.predict(xx_poly2)
plt.scatter(X_train, y_train)
plt1, = plt.plot(xx, yy, label='Degree=1')
plt2, = plt.plot(xx, yy_poly2, label='Degree=2')
plt.axis([0, 25, 0, 25])
plt.xlabel('Diameter of Pizza')
plt.ylabel('Price of Pizza')
plt.legend(handles = [plt1, plt2])
plt.show()
print('The R-squared value of Polynomial Regressor (Degree=2) performing on the training data is', regressor_poly2.score(X_train_poly2, y_train))
# Degree-4 expansion: enough capacity to pass through nearly every training point.
poly4 = PolynomialFeatures(degree=4)
X_train_poly4 = poly4.fit_transform(X_train)
regressor_poly4 = LinearRegression()
regressor_poly4.fit(X_train_poly4, y_train)
xx_poly4 = poly4.transform(xx)
yy_poly4 = regressor_poly4.predict(xx_poly4)
plt.scatter(X_train, y_train)
plt1, = plt.plot(xx, yy, label='Degree=1')
plt2, = plt.plot(xx, yy_poly2, label='Degree=2')
plt4, = plt.plot(xx, yy_poly4, label='Degree=4')
plt.axis([0, 25, 0, 25])
plt.xlabel('Diameter of Pizza')
plt.ylabel('Price of Pizza')
plt.legend(handles = [plt1, plt2, plt4])
plt.show()
print('The R-squared value of Polynomial Regressor (Degree=4) performing on the training data is', regressor_poly4.score(X_train_poly4, y_train))
# Evaluate all three models on held-out test data.
X_test = [[6], [8], [11], [16]]
y_test = [[8], [12], [15], [18]]
print(regressor.score(X_test, y_test))
X_test_poly2 = poly2.transform(X_test)
print(regressor_poly2.score(X_test_poly2, y_test))
X_test_poly4 = poly4.transform(X_test)
print(regressor_poly4.score(X_test_poly4, y_test))
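Putting the training and test R² scores of all three models side by side makes the under/overfitting pattern explicit: the degree-4 model fits the five training points almost perfectly yet generalizes worse than the degree-2 model (a small convenience loop added here, not from the book):

for name, model, X_tr, X_te in [
        ('Degree=1', regressor, X_train, X_test),
        ('Degree=2', regressor_poly2, X_train_poly2, X_test_poly2),
        ('Degree=4', regressor_poly4, X_train_poly4, X_test_poly4)]:
    print(name, 'train R^2: %.3f' % model.score(X_tr, y_train), 'test R^2: %.3f' % model.score(X_te, y_test))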

# L1 regularization (Lasso) zeroes out most of the degree-4 coefficients.
from sklearn.linear_model import Lasso
lasso_poly4 = Lasso()
lasso_poly4.fit(X_train_poly4, y_train)
print(lasso_poly4.score(X_test_poly4, y_test))
print(lasso_poly4.coef_)
# Compare against the unregularized degree-4 model and its coefficient magnitudes.
print(regressor_poly4.score(X_test_poly4, y_test))
print(regressor_poly4.coef_)
print(np.sum(regressor_poly4.coef_ ** 2))
# L2 regularization (Ridge) shrinks coefficients toward zero without eliminating them.
from sklearn.linear_model import Ridge
ridge_poly4 = Ridge()
ridge_poly4.fit(X_train_poly4, y_train)
print(ridge_poly4.score(X_test_poly4, y_test))
print(ridge_poly4.coef_)
print(np.sum(ridge_poly4.coef_ ** 2))
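Both Lasso and Ridge above use the default regularization strength alpha=1.0; sweeping alpha shows the trade-off between fitting the training points and generalizing to the test set (an illustrative sweep added here, not from the book):

from sklearn.linear_model import Ridge
for alpha in [0.01, 0.1, 1.0, 10.0]:
    # Larger alpha means stronger shrinkage of the polynomial coefficients.
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train_poly4, y_train)
    print('alpha=%.2f, test R^2: %.3f' % (alpha, ridge.score(X_test_poly4, y_test)))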

# Grid search over SVM hyperparameters on a 3000-document subset of 20 Newsgroups.
from sklearn.datasets import fetch_20newsgroups
import numpy as np
news = fetch_20newsgroups(subset='all')
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(news.data[:3000], news.target[:3000], test_size=0.25, random_state=33)
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
clf = Pipeline([('vect', TfidfVectorizer(stop_words='english', analyzer='word')), ('svc', SVC())])
# 4 gamma values x 3 C values = 12 candidates, each cross-validated 3-fold.
parameters = {'svc__gamma': np.logspace(-2, 1, 4), 'svc__C': np.logspace(-1, 1, 3)}
from sklearn.model_selection import GridSearchCV
gs = GridSearchCV(clf, parameters, verbose=2, refit=True, cv=3)
gs.fit(X_train, y_train)  # the book times this call with IPython's %time magic
print(gs.best_params_, gs.best_score_)
print(gs.score(X_test, y_test))

# The same search, parallelized across all CPU cores with n_jobs=-1.
from sklearn.datasets import fetch_20newsgroups
import numpy as np
news = fetch_20newsgroups(subset='all')
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(news.data[:3000], news.target[:3000], test_size=0.25, random_state=33)
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
clf = Pipeline([('vect', TfidfVectorizer(stop_words='english', analyzer='word')), ('svc', SVC())])
parameters = {'svc__gamma': np.logspace(-2, 1, 4), 'svc__C': np.logspace(-1, 1, 3)}
from sklearn.model_selection import GridSearchCV
gs = GridSearchCV(clf, parameters, verbose=2, refit=True, cv=3, n_jobs=-1)
gs.fit(X_train, y_train)
print(gs.best_params_, gs.best_score_)
print(gs.score(X_test, y_test))
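The book compares the serial and parallel runs with IPython's %time magic; in a plain script the same comparison can be made with the time module (a minimal sketch added here, not from the book):

import time
start = time.time()
gs.fit(X_train, y_train)  # re-run the parallel search so it can be timed
print('parallel grid search took %.1f seconds' % (time.time() - start))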
