Problem description
I'm trying to read data from MySQL and write it back to a Parquet file in S3 with specific partitions, as follows:
from pyspark.sql.functions import to_date

df = (sqlContext.read.format('jdbc')
      .options(driver='com.mysql.jdbc.Driver',
               url='jdbc:mysql://<host>:3306/<db>?user=<usr>&password=<pass>',
               dbtable='tbl',
               numPartitions=4)
      .load())

df2 = df.withColumn('updated_date', to_date(df.updated_at))
df2.write.parquet(path='s3n://parquet_location', mode='append',
                  partitionBy=['updated_date'])
My problem is that it opens only one connection to MySQL (instead of 4), and it doesn't write to Parquet until it has fetched all the data from MySQL. Because my table in MySQL is huge (100M rows), the process fails with an OutOfMemory error.
Is there a way to configure Spark to open more than one connection to MySQL and to write partial data to Parquet?
Recommended answer
You should set these properties:
partitionColumn,
lowerBound,
upperBound,
numPartitions
as documented here: http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
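
For example, here is a minimal sketch of the partitioned read, assuming `tbl` has a numeric column that can be used for range splitting (the column name `id` and the bounds 0 to 100000000 are hypothetical; replace them with your real column and its approximate min/max values):

from pyspark.sql.functions import to_date

df = (sqlContext.read.format('jdbc')
      .options(driver='com.mysql.jdbc.Driver',
               url='jdbc:mysql://<host>:3306/<db>?user=<usr>&password=<pass>',
               dbtable='tbl',
               partitionColumn='id',    # numeric column used to split the reads (assumed)
               lowerBound='0',          # approximate minimum of the partition column (assumed)
               upperBound='100000000',  # approximate maximum of the partition column (assumed)
               numPartitions=4)         # number of parallel connections / tasks
      .load())

df2 = df.withColumn('updated_date', to_date(df.updated_at))
df2.write.parquet(path='s3n://parquet_location', mode='append',
                  partitionBy=['updated_date'])

With all four properties set, Spark issues one query per partition (each with a WHERE clause covering its slice of the partitionColumn range), so the 4 tasks fetch and write their slices independently instead of pulling the whole table through a single connection.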