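About the Book
Machine learning and predictive analytics are transforming the way businesses and other organizations operate.
Python Machine Learning (Reprint Edition) takes you into the world of predictive analytics and demonstrates why Python is one of the world's leading data science languages. If you want to ask better questions of your data, or need to improve and extend the capabilities of your machine learning systems, this practical book is invaluable.
Python Machine Learning (Reprint Edition) covers a wide range of powerful Python libraries, including scikit-learn, Theano, and Keras, and offers guidance and tips on everything from sentiment analysis to neural networks. You will soon be able to answer the most important questions facing you and your organization.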
About the Author
Sebastian Raschka is a PhD student at Michigan State University who develops new computational methods in the field of computational biology. He has been ranked as the number one most influential data scientist on GitHub by Analytics Vidhya. He has yearlong experience in Python programming and has conducted several seminars on the practical applications of data science and machine learning. Talking and writing about data science, machine learning, and Python motivated Sebastian to write this book in order to help people develop data-driven solutions without necessarily needing a machine learning background. He has also actively contributed to open source projects, and methods that he implemented are now successfully used in machine learning competitions such as Kaggle. In his free time, he works on models for sports predictions, and when he is not in front of the computer, he enjoys playing sports.
Table of Contents
Preface
Chapter 1: Giving Computers the Ability to Learn from Data
Building intelligent machines to transform data into knowledge
The three different types of machine learning
Making predictions about the future with supervised learning
Classification for predicting class labels
Regression for predicting continuous outcomes
Solving interactive problems with reinforcement learning
Discovering hidden structures with unsupervised learning
Finding subgroups with clustering
Dimensionality reduction for data compression
An introduction to the basic terminology and notations
A roadmap for building machine learning systems
Preprocessing - getting data into shape
Training and selecting a predictive model
Evaluating models and predicting unseen data instances
Using Python for machine learning
Installing Python packages
Summary
Chapter 2: Training Machine Learning Algorithms for Classification
Artificial neurons - a brief glimpse into the early history of machine learning
Implementing a perceptron learning algorithm in Python
Training a perceptron model on the Iris dataset
Adaptive linear neurons and the convergence of learning
Minimizing cost functions with gradient descent
Implementing an Adaptive Linear Neuron in Python
Large scale machine learning and stochastic gradient descent
Summary
Chapter 3: A Tour of Machine Learning Classifiers Using Scikit-learn
Choosing a classification algorithm
First steps with scikit-learn
Training a perceptron via scikit-learn
Modeling class probabilities via logistic regression
Logistic regression intuition and conditional probabilities
Learning the weights of the logistic cost function
Training a logistic regression model with scikit-learn
Tackling overfitting via regularization
Maximum margin classification with support vector machines
Maximum margin intuition
Dealing with the nonlinearly separable case using slack variables
Alternative implementations in scikit-learn
Solving nonlinear problems using a kernel SVM
Using the kernel trick to find separating hyperplanes in higher dimensional space
Decision tree learning
Maximizing information gain - getting the most bang for the buck
Building a decision tree
Combining weak to strong learners via random forests
K-nearest neighbors - a lazy learning algorithm
Summary
Chapter 4: Building Good Training Sets - Data Preprocessing
Dealing with missing data
Eliminating samples or features with missing values
Imputing missing values
Understanding the scikit-learn estimator API
Handling categorical data
Mapping ordinal features
Encoding class labels
Performing one-hot encoding on nominal features
Partitioning a dataset in training and test sets
Bringing features onto the same scale
Selecting meaningful features
Sparse solutions with L1 regularization
Sequential feature selection algorithms
Assessing feature importance with random forests
Summary
Chapter 5: Compressing Data via Dimensionality Reduction
Unsupervised dimensionality reduction via principal component analysis
Total and explained variance
Feature transformation
Principal component analysis in scikit-learn
Supervised data compression via linear discriminant analysis
Computing the scatter matrices
Selecting linear discriminants for the new feature subspace
Projecting samples onto the new feature space
LDA via scikit-learn
Using kernel principal component analysis for nonlinear mappings
Kernel functions and the kernel trick
Implementing a kernel principal component analysis in Python
Example 1 - separating half-moon shapes
Example 2 - separating concentric circles
Projecting new data points
Kernel principal component analysis in scikit-learn
Summary
Chapter 6: Learning Best Practices for Model Evaluation and Hyperparameter Tuning
Streamlining workflows with pipelines
Loading the Breast Cancer Wisconsin dataset
Combining transformers and estimators in a pipeline
Using k-fold cross-validation to assess model performance
The holdout method
K-fold cross-validation
Debugging algorithms with learning and validation curves
Diagnosing bias and variance problems with learning curves
Addressing overfitting and underfitting with validation curves
Fine-tuning machine learning models via grid search
Tuning hyperparameters via grid search
Algorithm selection with nested cross-validation
Looking at different performance evaluation metrics
Reading a confusion matrix
Optimizing the precision and recall of a classification model
Plotting a receiver operating characteristic
The scoring metrics for multiclass classification
Summary
Chapter 7: Combining Different Models for Ensemble Learning
Learning with ensembles
Implementing a simple majority vote classifier
Combining different algorithms for classification with majority vote
Evaluating and tuning the ensemble classifier
Bagging - building an ensemble of classifiers from bootstrap samples
Leveraging weak learners via adaptive boosting
Summary
Chapter 8: Applying Machine Learning to Sentiment Analysis
Obtaining the IMDb movie review dataset
Introducing the bag-of-words model
Transforming words into feature vectors
Assessing word relevancy via term frequency-inverse document frequency
Cleaning text data
Processing documents into tokens
Training a logistic regression model for document classification
Working with bigger data - online algorithms and out-of-core learning
Summary
Chapter 9: Embedding a Machine Learning Model into a Web Application
Serializing fitted scikit-learn estimators
Setting up a SQLite database for data storage
Developing a web application with Flask
Our first Flask web application
Form validation and rendering
Turning the movie classifier into a web application
Deploying the web application to a public server
Updating the movie review classifier
Summary
Chapter 10: Predicting Continuous Target Variables with Regression Analysis
Introducing a simple linear regression model
Exploring the Housing Dataset
Visualizing the important characteristics of a dataset
Implementing an ordinary least squares linear regression model
Solving regression for regression parameters with gradient descent
Estimating the coefficient of a regression model via scikit-learn
Fitting a robust regression model using RANSAC
Evaluating the performance of linear regression models
Using regularized methods for regression
Turning a linear regression model into a curve - polynomial regression
Modeling nonlinear relationships in the Housing Dataset
Dealing with nonlinear relationships using random forests
Decision tree regression
Random forest regression
Summary
Chapter 11: Working with Unlabeled Data - Clustering Analysis
Grouping objects by similarity using k-means
K-means++
Hard versus soft clustering
Using the elbow method to find the optimal number of clusters
Quantifying the quality of clustering via silhouette plots
Organizing clusters as a hierarchical tree
Performing hierarchical clustering on a distance matrix
Attaching dendrograms to a heat map
Applying agglomerative clustering via scikit-learn
Locating regions of high density via DBSCAN
Summary
Chapter 12: Training Artificial Neural Networks for Image Recognition
Modeling complex functions with artificial neural networks
Single-layer neural network recap
Introducing the multi-layer neural network architecture
Activating a neural network via forward propagation
Classifying handwritten digits
Obtaining the MNIST dataset
Implementing a multi-layer perceptron
Training an artificial neural network
Computing the logistic cost function
Training neural networks via backpropagation
Developing your intuition for backpropagation
Debugging neural networks with gradient checking
Convergence in neural networks
Other neural network architectures
Convolutional Neural Networks
Recurrent Neural Networks
A few last words about neural network implementation
Summary
Chapter 13: Parallelizing Neural Network Training with Theano
Building, compiling, and running expressions with Theano
What is Theano?
First steps with Theano
Configuring Theano
Working with array structures
Wrapping things up - a linear regression example
Choosing activation functions for feedforward neural networks
Logistic function recap
Estimating probabilities in multi-class classification via the softmax function
Broadening the output spectrum by using a hyperbolic tangent
Training neural networks efficiently using Keras
Summary
Index
Foreword
We live in the midst of a data deluge. According to recent estimates, 2.5 quintillion (10^18) bytes of data are generated on a daily basis. This is so much data that over 90 percent of the information that we store nowadays was generated in the past decade alone. Unfortunately, most of this information cannot be used by humans. Either the data is beyond the means of standard analytical methods, or it is simply too vast for our limited minds to even comprehend.
Through Machine Learning, we enable computers to process, learn from, and draw actionable insights out of the otherwise impenetrable walls of big data. From the massive supercomputers that support Google's search engines to the smart phones that we carry in our pockets, we rely on Machine Learning to power most of the world around us, often without even knowing it.
As modern pioneers in the brave new world of big data, it then behooves us to learn more about Machine Learning. What is Machine Learning and how does it work? How can I use Machine Learning to take a glimpse into the unknown, power my business, or just find out what the Internet at large thinks about my favorite movie? All of this and more will be covered in the following chapters authored by my good friend and colleague, Sebastian Raschka. When away from taming my otherwise irascible pet dog, Sebastian has tirelessly devoted his free time to the open source Machine Learning community. Over the past several years, Sebastian has developed dozens of popular tutorials that cover topics in Machine Learning and data visualization in Python. He has also developed and contributed to several open source Python packages, several of which are now part of the core Python Machine Learning workflow.
Owing to his vast expertise in this field, I am confident that Sebastian's insights into the world of Machine Learning in Python will be invaluable to users of all experience levels. I wholeheartedly recommend this book to anyone looking to gain a broader and more practical understanding of Machine Learning.
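Python Data Science in Practice: From Fundamentals to Advanced Topics (《Python 数据科学实战:从基础到进阶》)
An in-depth exploration of data analysis and machine learning to guide your data-driven decisions.

In this age of information overload, data has become a core driver of change. From business insight to scientific research, from optimizing user experience to improving social welfare, the importance of understanding and working with data is self-evident. Yet faced with massive, complex data, how to extract value, uncover patterns, and even predict the future is a challenge for every data practitioner and every reader who hopes to make a mark in the field. This book was written to meet that challenge, and it takes you on a systematic, comprehensive, and hands-on journey through data science.

Rather than merely listing techniques and tools, the book follows the full data science life cycle: data collection, cleaning, exploratory analysis, model building, evaluation, and deployment. Python is the core programming language because of its rich and powerful ecosystem, including NumPy, Pandas, Matplotlib, Seaborn, Scikit-learn, TensorFlow, and PyTorch, which supports every stage of data science work.

What the book covers

Part 1: Data Science Foundations and Setting Up the Python Environment
A solid foundation matters before any data exploration begins. Starting from scratch, this part builds the necessary knowledge and a working environment.
Python from beginner to advanced: For newcomers, clear explanations of the syntax, including data types, control flow, functions, and object-oriented programming, along with techniques commonly used in data science. For readers with some Python experience, more advanced topics such as generators, decorators, and context managers, to improve code efficiency and readability.
Overview of the core data science libraries:
  NumPy: The central role of the multidimensional array (ndarray); array creation, indexing, slicing, arithmetic, and broadcasting, the basis of efficient numerical computing.
  Pandas: The two core data structures, Series and DataFrame; reading data (CSV, Excel, SQL, and so on), cleaning (missing-value handling, outlier detection), transforming, merging, and grouping and aggregating. Numerous practical cases show how to process real-world datasets efficiently with Pandas.
  Matplotlib and Seaborn: Visualization is key to understanding data and presenting insights. Coverage starts with basic plots (line, scatter, bar, and pie charts), moves on to more complex statistical charts (box plots, violin plots, heat maps, distribution plots), and shows how to customize and polish figures. Seaborn makes it easy to draw attractive, informative statistical graphics.
  Jupyter Notebook/Lab and IDEs: Interactive environments such as Jupyter Notebook and Jupyter Lab, which are well suited to data exploration, prototyping, and presenting results, plus the use of mainstream IDEs such as VS Code for Python data science development.

Part 2: Data Preprocessing and Exploratory Data Analysis (EDA)
Raw data is often messy, incomplete, and noisy. This part focuses on turning raw data into clean data that is ready for analysis, and on finding the hidden patterns and insights in it.
Data cleaning techniques:
  Missing-value handling: Several strategies, such as deletion and imputation (mean, median, mode, model-based prediction), with the strengths and weaknesses of each.
  Outlier detection and handling: Identifying outliers and choosing an appropriate treatment (remove, transform, or keep) for the business context.
  Data type conversion and standardization: Ensuring data types are correct and handling encoding problems in text data, date and time formats, and similar issues.
  Duplicate handling: Identifying and removing duplicate records.
Feature engineering basics:
  Feature creation: Deriving new features from existing ones, such as date decomposition and text feature extraction (bag-of-words, TF-IDF).
  Feature encoding: Handling categorical variables with One-Hot Encoding and Label Encoding.
  Feature scaling: The principles and use cases of standardization and normalization, in preparation for model training.
Exploratory data analysis (EDA):
  Descriptive statistics: Mean, variance, quantiles, skewness, kurtosis, and other statistics that characterize the distribution of the data.
  Correlation analysis: Correlation coefficients between variables (Pearson, Spearman) to identify potential linear or monotonic relationships.
  Visualization-driven insight: Using charts to show distributions, relationships between variables, and differences between groups, and to reveal patterns, trends, and anomalies. For example, a scatter plot shows the relationship between two numeric variables, a box plot compares distributions across groups, and a heat map displays the correlation matrix of the features.

Part 3: Building and Evaluating Machine Learning Models
This is the core of the book: a systematic study of the mainstream machine learning algorithms and how to apply them to real problems.
Supervised learning:
  Regression models:
    Linear regression: The underlying model, polynomial regression, and regularization (Lasso, Ridge).
    Decision tree regression: How a tree is grown, the overfitting problem, and pruning.
    Ensemble learning (regression): How Bagging (random forests) and Boosting (Gradient Boosting, XGBoost, LightGBM) work, and why they predict so well.
  Classification models:
    Logistic regression: Its probabilistic model and decision boundary.
    K-nearest neighbors (KNN): Distance-based classification.
    Support vector machines (SVM): Applying the kernel trick to nonlinear classification.
    Naive Bayes: Probabilistic inference and its use in text classification.
    Decision tree classification.
    Ensemble learning (classification): Random forests, XGBoost, and LightGBM for classification tasks.
Unsupervised learning:
  Clustering algorithms:
    K-Means: Discovering clusters in data.
    DBSCAN: Identifying clusters of arbitrary shape.
    Hierarchical clustering: Building a hierarchy of groups.
  Dimensionality reduction algorithms:
    Principal component analysis (PCA): Finding the directions of greatest variance in the data to reduce dimensionality.
    t-SNE: Reducing high-dimensional data for visualization.
Model evaluation and selection:
  Regression metrics: MSE, RMSE, MAE, R-squared.
  Classification metrics: Accuracy, Precision, Recall, F1-score, the ROC curve and AUC, and the confusion matrix.
  Cross-validation: k-fold cross-validation and related methods for checking that a model generalizes.
  Hyperparameter tuning: Grid Search, Random Search, and similar methods.
Scikit-learn in practice: Scikit-learn provides a unified API that makes all of the models above convenient to implement. The book demonstrates how to load data, preprocess it, train models, make predictions, and evaluate the results.

Part 4: Deep Learning Foundations and Applications
Deep learning has achieved breakthroughs in fields such as image recognition and natural language processing. This part demystifies it.
Neural network basics:
  The perceptron and the multi-layer perceptron (MLP): How neurons work and the role of activation functions.
  Backpropagation: The core mechanism of model training.
  Loss functions and optimizers: Measuring model error and updating the weights.
Mainstream deep learning frameworks:
  TensorFlow and Keras: Using the concise Keras API to build and train neural networks quickly.
  PyTorch: Its dynamic computation graph and flexibility.
Common deep learning models:
  Convolutional neural networks (CNN): Especially suited to image processing; convolutional layers, pooling layers, and so on.
  Recurrent neural networks (RNN) and variants (LSTM, GRU): Suited to sequence data such as text and time series.
Deep learning application cases: Practical cases such as image classification and text sentiment analysis demonstrate what deep learning can do.

Part 5: Model Deployment and Hands-on Projects
The ultimate purpose of learning is to solve real problems. This part shows how to deploy trained models in real applications and presents comprehensive projects that draw on the whole book.
Model persistence: Saving trained models (for example, with `pickle` or `joblib`).
Model deployment basics: Introductory concepts for integrating a model into a web application (for example, with Flask or Django) or serving it through an API.
Hands-on projects:
  Case 1: House price prediction. Predicting prices from second-hand housing transaction data with linear regression, ensemble learning, and other models.
  Case 2: Customer churn prediction. Identifying customers likely to leave, using logistic regression, SVM, random forests, and other models.
  Case 3: Image recognition (such as MNIST handwritten digit recognition). Classifying images with a CNN.
  Case 4: Text sentiment analysis. Analyzing the sentiment of user reviews with naive Bayes, an RNN, or a Transformer model.

Features of the book
Theory and practice in balance: Algorithm principles are explained accessibly, and extensive code examples and projects help readers turn theory into practical skill.
Graduated difficulty: From Python basics and an introduction to data science through to advanced models and deep learning, suitable for readers at different levels.
Close to real needs: Real-world datasets and application scenarios make the learning targeted and practical.
Rich visualization: Charts are used throughout to explain concepts and present insights, making the material more intuitive.
High-quality code: The code provided is carefully designed and tested, and easy to understand and reuse.

Who this book is for
Beginners who want to master the core skills of data science.
Students who want a systematic grounding in Python data analysis and machine learning.
Software engineers and data analysts who want to strengthen their data processing and modeling skills.
Business analysts and product managers who want to apply data science to business problems.
Anyone interested in artificial intelligence and machine learning.

Python Data Science in Practice: From Fundamentals to Advanced Topics is meant to be an indispensable companion on your path in data science. By working through it, you will not only become proficient at data analysis and modeling in Python but, more importantly, develop the ability to solve complex data problems independently and gain an edge in a data-driven future. Start your data exploration journey now!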