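Description
Big data analytics is the process of examining large, complex datasets that often exceed the computing power available to you. R, a leading programming language for data science, includes many powerful functions for tackling problems related to big data processing.
Big Data Analytics with R (reprint edition, in English) begins with a brief account of the big data field and its current industry standards. It then introduces the development, structure, real-world applications, and shortcomings of the R language, followed by a review of the main R functions for data management and transformation. Readers will learn about cloud-based big data solutions (for example, Amazon EC2 instances and Amazon RDS, and Microsoft Azure with its HDInsight clusters) and how to connect R to relational and non-relational databases (such as MongoDB and HBase). Beyond that, the book covers big data tools such as Apache Hadoop, HDFS, and MapReduce, along with other R-compatible tools such as Apache Spark and its machine learning library Spark MLlib, and H2O.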
About the Author
Simon Walkowiak is a cognitive neuroscientist and the managing director of Mind Project Ltd, a Big Data and Predictive Analytics consultancy based in London, United Kingdom. As a former data curator at the UK Data Service (UKDS, University of Essex), Europe's largest socio-economic data repository, Simon has extensive experience in processing and managing large-scale datasets such as censuses, sensor and smart meter data, telecommunication data, and well-known governmental and social surveys such as the British Social Attitudes survey, Labour Force surveys, Understanding Society, the National Travel survey, and many other socio-economic datasets collected and deposited by Eurostat, the World Bank, the Office for National Statistics, the Department for Transport, NatCen, and the International Energy Agency, to mention just a few. Simon has delivered numerous data science and R training courses at public institutions and international companies. He has also taught a course in Big Data Methods in R at major UK universities and at the prestigious Big Data and Analytics Summer School organized by the Institute of Analytics and Data Science (IADS).
Table of Contents
Preface
Chapter 1: The Era of Big Data
Big Data - The monster re-defined
Big Data toolbox - dealing with the giant
Hadoop - the elephant in the room
Databases
Hadoop Spark-ed up
R - The unsung Big Data hero
Summary
Chapter 2: Introduction to R Programming Language and Statistical Environment
Learning R
Revisiting R basics
Getting R and RStudio ready
Setting the URLs to R repositories
R data structures
Vectors
Scalars
Matrices
Arrays
Data frames
Lists
Exporting R data objects
Applied data science with R
Importing data from different formats
Exploratory Data Analysis
Data aggregations and contingency tables
Hypothesis testing and statistical inference
Tests of differences
Independent t-test example (with power and effect size estimates)
ANOVA example
Tests of relationships
An example of Pearson's r correlations
Multiple regression example
Data visualization packages
Summary
Chapter 3: Unleashing the Power of R from Within
Traditional limitations of R
Out-of-memory data
Processing speed
To the memory limits and beyond
Data transformations and aggregations with the ff and ffbase packages
Generalized linear models with the ff and ffbase packages
Logistic regression example with ffbase and biglm
Expanding memory with the bigmemory package
Parallel R
From bigmemory to faster computations
An apply() example with the big.matrix object
A for() loop example with the ffdf object
Using apply() and for() loop examples on a data.frame
A parallel package example
A foreach package example
The future of parallel processing in R
Utilizing Graphics Processing Units with R
Multi-threading with Microsoft R Open distribution
Parallel machine learning with H2O and R
Boosting R performance with the data.table package and other tools
Fast data import and manipulation with the data.table package
Data import with data.table
Lightning-fast subsets and aggregations on data.table
Chaining, more complex aggregations, and pivot tables with data.table
Writing better R code
Summary
Chapter 4: Hadoop and MapReduce Framework for R
Hadoop architecture
Hadoop Distributed File System
MapReduce framework
A simple MapReduce word count example
Other Hadoop native tools
Learning Hadoop
A single-node Hadoop in Cloud
Deploying Hortonworks Sandbox on Azure
A word count example in Hadoop using Java
A word count example in Hadoop using the R language
RStudio Server on a Linux RedHat/CentOS virtual machine
Installing and configuring RHadoop packages
HDFS management and MapReduce in R - a word count example
HDInsight - a multi-node Hadoop cluster on Azure
Creating your first HDInsight cluster
Creating a new Resource Group
Deploying a Virtual Network
Creating a Network Security Group
Setting up and configuring an HDInsight cluster
Starting the cluster and exploring Ambari
Connecting to the HDInsight cluster and installing RStudio Server
Adding a new inbound security rule for port 8787
Editing the Virtual Network's public IP address for the head node
Smart energy meter readings analysis example - using R on HDInsight cluster
Summary
Chapter 5: R with Relational Database Management Systems (RDBMSs)
Relational Database Management Systems (RDBMSs)
A short overview of used RDBMSs
Structured Query Language (SQL)
SQLite with R
Preparing and importing data into a local SQLite database
Connecting to SQLite from RStudio
MariaDB with R on an Amazon EC2 instance
Preparing the EC2 instance and RStudio Server for use
Preparing MariaDB and data for use
Working with MariaDB from RStudio
PostgreSQL with R on Amazon RDS
Launching an Amazon RDS database instance
Preparing and uploading data to Amazon RDS
Remotely querying PostgreSQL on Amazon RDS from RStudio
Summary
Chapter 6: R with Non-Relational (NoSQL) Databases
Introduction to NoSQL databases
Review of leading non-relational databases
MongoDB with R
Introduction to MongoDB
MongoDB data models
Installing MongoDB with R on Amazon EC2
Processing Big Data using MongoDB with R
Importing data into MongoDB and basic MongoDB commands
MongoDB with R using the rmongodb package
MongoDB with R using the RMongo package
MongoDB with R using the mongolite package
HBase with R
Azure HDInsight with HBase and RStudio Server
Importing the data to HDFS and HBase
Reading and querying HBase using the rhbase package
Summary
Chapter 7: Faster than Hadoop - Spark with R
Spark for Big Data analytics
Spark with R on a multi-node HDInsight cluster
Launching HDInsight with Spark and R/RStudio
Reading the data into HDFS and Hive
Getting the data into HDFS
Importing data from HDFS to Hive
Bay Area Bike Share analysis using SparkR
Summary
Chapter 8: Machine Learning Methods for Big Data in R
What is machine learning?
Supervised and unsupervised machine learning methods
Classification and clustering algorithms
Machine learning methods with R
Big Data machine learning tools
GLM example with Spark and R on the HDInsight cluster
Preparing the Spark cluster and reading the data from HDFS
Logistic regression in Spark with R
Naive Bayes with H2O on Hadoop with R
Running an H2O instance on Hadoop with R
Reading and exploring the data in H2O
Naive Bayes on H2O with R
Neural Networks with H2O on Hadoop with R
How do Neural Networks work?
Running Deep Learning models on H2O
Summary
Chapter 9: The Future of R - Big, Fast, and Smart Data
The current state of Big Data analytics with R
Out-of-memory data on a single machine
Faster data processing with R
Hadoop with R
Spark with R
R with databases
Machine learning with R
The future of R
Big Data
Fast data
Smart data
Where to go next
Summary
Index
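Exploring the Depths of Data: Unlocking the Secrets of Big Data Analytics

In an age of information overload, data has become a core force driving innovation, better decisions, and even social change. From massive volumes of sensor readings to complex social network interactions to the results of precise scientific experiments, large and messy datasets hold the patterns, insights, and value we want to discover. Yet extracting meaningful information from this flood of "big data" and turning it into actionable knowledge is a challenging task. This book aims to give readers a systematic, practical methodology for mastering the key skills needed to work with big data and uncover its value.

This book is not a dry theoretical manual but a hands-on journey of exploration. We take you through the core ideas of big data analytics with a focus on R, one of today's most popular and powerful open-source languages for data science. With its rich statistical libraries, strong visualization capabilities, and active community, R has become one of the preferred tools for big data analytics. Working in R throughout, the book guides you step by step through a complete analysis workflow, from data collection and preprocessing to modeling and evaluation.

Data is where your journey begins. The early chapters focus on where data comes from and what form it takes. You will learn how to retrieve raw data efficiently from databases, file formats (such as CSV, JSON, and XML), and web APIs. More importantly, you will learn how to explore and understand that data. This includes understanding its structure, identifying missing values and outliers, running descriptive statistics to summarize distributions, and using visualization techniques (such as histograms, scatter plots, and box plots) to show relationships in the data. We believe that a deep understanding of your data is the foundation of successful analysis.

Data cleaning and transformation are the necessary road to the truth. Raw data is often "dirty", full of errors, inconsistencies, and incomplete information. The book devotes considerable space to data cleaning and preprocessing techniques. You will learn how to handle missing values (for example, through imputation, deletion, or model-based prediction), how to detect and correct outliers, how to convert data types, how to merge, join, and reshape datasets, and how to encode categorical variables. We also introduce feature engineering, including how to create new, more meaningful features from existing data to improve model performance. You will learn to use R's data manipulation packages, such as dplyr and tidyr, to turn tedious data operations into concise, elegant code.

Model building is the core of big data analytics. Once the data has been cleaned and prepared, we can build models to explore its patterns and make predictions. The book covers a range of modeling techniques commonly used in big data analytics, from classical statistical models to more modern machine learning algorithms.

Supervised learning: You will learn how to build regression models (such as linear regression, ridge regression, and Lasso regression) to predict continuous values, and classification models (such as logistic regression, decision trees, random forests, support vector machines, and k-nearest neighbors) to predict discrete classes. We examine each model's principles, assumptions, strengths, and weaknesses, as well as its implementation in R.
Unsupervised learning: For data without a clear target variable, unsupervised learning provides powerful tools. You will learn cluster analysis (such as K-Means and hierarchical clustering) to find natural groupings in the data, and dimensionality reduction techniques (such as principal component analysis (PCA) and t-SNE) to simplify high-dimensional data and reveal its underlying structure.
Time series analysis: For time-dependent data such as stock prices, sales figures, or sensor readings, time series analysis is essential. The book introduces classical time series models such as ARIMA and exponential smoothing, and shows how to use R for time series forecasting and anomaly detection.

Model evaluation and optimization are key to reliable analysis. Building a model is only the first step; evaluating its performance and improving it matter just as much. The book describes a range of evaluation metrics, including accuracy, precision, recall, F1 score, ROC curves, AUC, mean squared error (MSE), and root mean squared error (RMSE). You will learn how to use techniques such as cross-validation to obtain more reliable evaluation results and to avoid overfitting and underfitting. We also discuss optimization strategies such as hyperparameter tuning and feature selection.

Visualization is the soul of big data analytics. Even the most sophisticated model and the deepest insight lose much of their value if they cannot be communicated clearly. Visualization is the key to turning data into a story that is easy to understand. The book focuses on R's visualization tools, such as ggplot2. You will learn how to create many types of charts, including scatter plots, line charts, bar charts, pie charts, heat maps, and geospatial maps, and how to adjust color, shape, size, and labels to communicate effectively and highlight key findings. With high-quality visualization you can let the data "speak", making it easier to share your findings and drive decisions.

Real-world applications and case studies put the theory to the test. Theoretical knowledge needs practice to take hold. Through a series of case studies drawn from real application scenarios, the book shows how to use R to solve real-world big data analytics problems. These cases may cover:

Marketing analytics: analyzing customer purchasing behavior, segmenting customers, and predicting customer churn.
Financial risk management: building credit scoring models, detecting fraud, and analyzing stock market trends.
Healthcare: analyzing disease incidence, predicting patient risk, and optimizing treatment plans.
Social media analytics: analyzing user sentiment, discovering trending topics, and predicting trends.
Internet of Things data analytics: monitoring device status in real time, predicting failures, and optimizing resource use.

Through these cases you will see how the knowledge and skills from the book apply to real work, and you will gain valuable practical experience.

The book is written for a broad audience. Beginners who want to enter the field of big data analytics, working professionals who want to improve their data analysis skills in R, and students interested in using data to solve complex problems will all benefit. We assume that readers know basic programming concepts but do not necessarily have in-depth knowledge of R. We start from the basics and gradually guide you through the applications of R in data analysis.

To master big data analytics is to hold a key to understanding and shaping the future. This book offers you a key to the treasure in your data, guiding you through the fog to discover hidden patterns, see the essence of things, and ultimately make wiser, more influential decisions. Join us and begin your journey into big data analytics with R, and let data become your most powerful ally!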