This document discusses PySpark and how it relates to Spark, Hadoop, and the Python data analysis ecosystem (PyData). It gives an overview of key PySpark concepts such as RDDs and DataFrames. It also covers the Parquet file format and Apache Arrow, an in-memory columnar format, both of which can be used with PySpark for efficient data storage and transfer between Spark and Python tools.
29. Spark PyData
▸ CSV, JSON
▸ Parquet: Spark DataFrame API; Python: fastparquet, pyarrow
▸ Performance comparison of different file formats and storage engines in the Hadoop ecosystem