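With an emphasis on the improvements and new features in Spark 2.0, authors Bill Chambers and Matei Zaharia break Spark topics down into distinct sections, each with its own goals. You will explore the basic operations and common functions of Spark's structured APIs, as well as Structured Streaming, a new high-level API for building end-to-end streaming applications. Developers and system administrators will learn the fundamentals of monitoring, tuning, and debugging Spark, and explore machine learning techniques and deployment scenarios for MLlib, Spark's scalable machine learning library.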
Preface
Part I. Gentle Overview of Big Data and Spark
1. What Is Apache Spark?
Apache Spark's Philosophy
Context: The Big Data Problem
History of Spark
The Present and Future of Spark
Running Spark
Downloading Spark Locally
Launching Spark's Interactive Consoles
Running Spark in the Cloud
Data Used in This Book
2. A Gentle Introduction to Spark
Spark's Basic Architecture
Spark Applications
Spark's Language APIs
Spark's APIs
Starting Spark
The SparkSession
DataFrames
Partitions
Transformations
Lazy Evaluation
Actions
Spark UI
An End-to-End Example
DataFrames and SQL
Conclusion
3. A Tour of Spark's Toolset
Running Production Applications
Datasets: Type-Safe Structured APIs
Structured Streaming
Machine Learning and Advanced Analytics
Lower-Level APIs
SparkR
Spark's Ecosystem and Packages
Conclusion
Part II. Structured APIs--DataFrames, SQL, and Datasets
4. Structured API Overview
DataFrames and Datasets
Schemas
Overview of Structured Spark Types
DataFrames Versus Datasets
Columns
Rows
Spark Types
Overview of Structured API Execution
Logical Planning
Physical Planning
Execution