Problem description
I'm writing a very basic Dataflow pipeline using the Python SDK v0.5.5. The pipeline uses a BigQuerySource with a query passed in, which queries BigQuery tables from datasets that reside in the EU.
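For reference, a minimal sketch of such a pipeline (the project ID and query are placeholders for illustration, not my actual code; the idioms follow the Beam-based 0.5.x SDK):

import apache_beam as beam

# Placeholder project ID; run locally with the DirectRunner.
p = beam.Pipeline(argv=[
    '--project', 'my-project',
    '--runner', 'DirectRunner',
])

# Read from a table in an EU dataset via a query
# (legacy SQL, the default in these SDK versions).
rows = p | 'ReadFromBQ' >> beam.io.Read(beam.io.BigQuerySource(
    query='SELECT word FROM [my-project:my_eu_dataset.my_table]'))

rows | 'Process' >> beam.Map(lambda row: row)

p.run()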
When executing the pipeline I'm getting the following error (project name anonymized):
HttpError: HttpError accessing <https://www.googleapis.com/bigquery/v2/projects/XXXXX/queries/93bbbecbc470470cb1bbb9c22bd83e9d?alt=json&maxResults=10000>: response: <{'status': '400', 'content-length': '292', 'x-xss-protection': '1; mode=block', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'expires': 'Thu, 09 Feb 2017 10:28:04 GMT', 'vary': 'Origin, X-Origin', 'server': 'GSE', '-content-encoding': 'gzip', 'cache-control': 'private, max-age=0', 'date': 'Thu, 09 Feb 2017 10:28:04 GMT', 'x-frame-options': 'SAMEORIGIN', 'alt-svc': 'quic=":443"; ma=2592000; v="35,34"', 'content-type': 'application/json; charset=UTF-8'}>, content <{
"error": {
"errors": [
{
"domain": "global",
"reason": "invalid",
"message": "Cannot read and write in different locations: source: EU, destination: US"
}
],
"code": 400,
"message": "Cannot read and write in different locations: source: EU, destination: US"
}
}
The error also occurs when specifying a project, dataset, and table name. However, there's no error when selecting data from the publicly available datasets (which reside in the US, like shakespeare). I also have jobs running v0.4.4 of the SDK which don't have this error.
The difference between these versions is the creation of a temp dataset, as is shown by the warning at pipeline startup:
WARNING:root:Dataset does not exist so we will create it
I've briefly taken a look at the different versions of the SDK, and the difference seems to be around this temp dataset. It looks like the current version creates a temp dataset by default with a location in the US (taken from master):
- dataset creation
- default dataset location
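This mirrors plain BigQuery behavior: a dataset created without an explicit location lands in the US multi-region, so a query reading EU tables into a table in that dataset fails with exactly this error. A minimal sketch of the fix at the BigQuery level, using the google-cloud-bigquery client (current client API and placeholder names, not the SDK's internal code):

from google.cloud import bigquery

client = bigquery.Client(project='my-project')  # placeholder project ID

# Without an explicit location a new dataset defaults to the US
# multi-region; pinning it to EU matches the source tables and
# avoids "Cannot read and write in different locations".
dataset = bigquery.Dataset('my-project.temp_dataset')  # placeholder IDs
dataset.location = 'EU'
client.create_dataset(dataset)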
I haven't found a way to disable the creation of these temp datasets. Am I overlooking something, or does this indeed no longer work when selecting data from EU datasets?
Recommended answer
Thanks for reporting this issue. I assume you are using DirectRunner. We changed the implementation of the BigQuery read transform for DirectRunner to create a temporary dataset (for SDK versions 0.5.1 and later) to support large datasets. It seems we are not setting the region correctly here. We'll look into fixing this.
This issue should not occur if you use DataflowRunner, which creates temporary datasets in the correct region.
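Until the fix lands, switching runners is only a pipeline-option change. A sketch (project, bucket, and job names are placeholders; DataflowRunner requires GCS staging and temp locations, and EU buckets keep the job's storage co-located with the EU source data):

import apache_beam as beam

# Placeholder project and bucket names; DataflowRunner needs GCS
# staging and temp locations, chosen here in the EU to match the
# source dataset's region.
p = beam.Pipeline(argv=[
    '--project', 'my-project',
    '--runner', 'DataflowRunner',
    '--staging_location', 'gs://my-eu-bucket/staging',
    '--temp_location', 'gs://my-eu-bucket/temp',
    '--job_name', 'bq-eu-read',
])

rows = p | 'ReadFromBQ' >> beam.io.Read(beam.io.BigQuerySource(
    query='SELECT word FROM [my-project:my_eu_dataset.my_table]'))

p.run()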