PySpark

pyspark is the Python interface for Apache Spark. It lets you use Python for large-scale data processing and real-time analytics in a distributed environment.

For a detailed guide on how to analyze datasets on the Hub with PySpark, check out this blog post.

To start working with Parquet files in PySpark, you first need to add the file to the Spark context. Here is an example of how to read a single Parquet file:

from pyspark import SparkFiles
from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession.builder.appName("WineReviews").getOrCreate()

# Add the Parquet file to the Spark context
spark.sparkContext.addFile("https://huggingface.co/api/datasets/james-burton/wine_reviews/parquet/default/train/0.parquet")

# Read the Parquet file into a DataFrame
df = spark.read.parquet(SparkFiles.get("0.parquet"))
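The name passed to `SparkFiles.get` is the last path segment of the URL that was given to `addFile`. As a minimal sketch of that relationship, the helpers below build a shard URL following the pattern used above and derive the matching local file name. `parquet_shard_url` and `local_name` are hypothetical names introduced here for illustration; they are not part of PySpark or the Hub API.

```python
from posixpath import basename
from urllib.parse import urlparse

# URL prefix taken from the example above.
API_ROOT = "https://huggingface.co/api/datasets"


def parquet_shard_url(dataset, config="default", split="train", shard=0):
    """Build a shard URL (hypothetical helper, mirrors the URL pattern above)."""
    return f"{API_ROOT}/{dataset}/parquet/{config}/{split}/{shard}.parquet"


def local_name(url):
    """Return the last path segment, i.e. the name to pass to SparkFiles.get."""
    return basename(urlparse(url).path)


url = parquet_shard_url("james-burton/wine_reviews")
print(url)
print(local_name(url))
```

With these helpers, the read step would become `spark.read.parquet(SparkFiles.get(local_name(url)))` after calling `spark.sparkContext.addFile(url)`.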

If your dataset is sharded into multiple Parquet files, you need to add each file to the Spark context individually. Here is how to do it:

import requests

# Fetch the URLs of the Parquet files for the train split
r = requests.get('https://huggingface.co/api/datasets/james-burton/wine_reviews/parquet')
train_parquet_files = r.json()['default']['train']

# Add each Parquet file to the Spark context
for url in train_parquet_files:
    spark.sparkContext.addFile(url)

# Read all Parquet files into a single DataFrame
df = spark.read.parquet(SparkFiles.getRootDirectory() + "/*.parquet")
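As the indexing `['default']['train']` above suggests, the API response is a nested mapping from configuration name to split name to a list of shard URLs. The sketch below selects the URLs for one split and checks that their file names are distinct, since the glob above reads every `*.parquet` file in the same directory. The helper `shard_urls` and the sample payload (including the dataset name `example-user/example-dataset`) are made up for illustration and are not real API output.

```python
def shard_urls(listing, config="default", split="train"):
    """Return the shard URLs for one config/split (hypothetical helper)."""
    try:
        urls = list(listing[config][split])
    except KeyError as exc:
        raise KeyError(f"no Parquet files listed for {config}/{split}") from exc
    # Each file is later looked up by its base name, so the names must be unique.
    names = [u.rsplit("/", 1)[-1] for u in urls]
    if len(set(names)) != len(names):
        raise ValueError("duplicate shard file names in listing")
    return urls


# Made-up payload with the same nesting as r.json() in the example above.
base = "https://huggingface.co/api/datasets/example-user/example-dataset/parquet"
sample_listing = {
    "default": {
        "train": [f"{base}/default/train/0.parquet", f"{base}/default/train/1.parquet"],
        "test": [f"{base}/default/test/0.parquet"],
    }
}

train_urls = shard_urls(sample_listing)
print(len(train_urls), "train shards")
```

Restricting the selection to one split at a time avoids mixing shards from different splits that share names such as `0.parquet`.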

Once your data is loaded into a PySpark DataFrame, you can perform various operations to explore and analyze it:

print(f"Shape of the dataset: {df.count()}, {len(df.columns)}")

# Display first 10 rows
df.show(n=10)

# Get a statistical summary of the data
df.describe().show()

# Print the schema of the DataFrame
df.printSchema()
