Problem description
I have a DynamoDB table that stores email attribute information. It has a hash key on the email and a range key on the timestamp (a number). The original reason for using the email as the hash key was to query all records for a given email. One thing I am now trying to do is retrieve all email ids (the hash key values). I am using boto for this, but I am unsure how to retrieve the distinct email ids.
My current code to pull 10,000 email records is:
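import boto.dynamodb2
from boto.dynamodb2.table import Table
conn = boto.dynamodb2.connect_to_region('us-west-2')
email_attributes = Table('email_attributes', connection=conn)
# Scan up to 10,000 items, requesting only the 'email' attribute
s = email_attributes.scan(limit=10000, attributes=['email'])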
But to retrieve the distinct records, I will have to do a full table scan and then pick out the distinct records in code. Another idea is to maintain a second table that stores only these emails, using a conditional write that writes an email id only if it does not already exist. But I am not sure whether this would be more expensive, since every write would be a conditional write.
Q1.) Is there a way to retrieve distinct records using a DynamoDB scan?
Q2.) Is there a good way to calculate the cost per query?
Recommended answer
Using a DynamoDB Scan, you would need to filter out duplicates on the client side (in your case, using boto). Even if you create a GSI with the reverse schema, you will still get duplicates. Given a hash+range (H+R) table of email_id+timestamp called stamped_emails, the list of all unique email_ids is a materialized view of that table. You could enable a DynamoDB Stream on the stamped_emails table and subscribe a Lambda function to that Stream that does a PutItem (email_id) into a hash-only table called emails_only. Then you could Scan emails_only and you would get no duplicates.
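As a rough illustration of both options above, here is a minimal Python sketch. It assumes the hash key attribute is named email (as in the question's code) and is a string. The function names and the put_item callable are hypothetical stand-ins, not part of boto or Lambda. In a real Lambda function, put_item would wrap an actual PutItem call against emails_only.

```python
def distinct_hash_keys(items, key="email"):
    """Client-side de-duplication: distinct hash key values, first-seen order.

    `items` is any iterable of mapping-like scan results (assumption:
    each one supports item[key]).
    """
    seen = set()
    ordered = []
    for item in items:
        value = item[key]
        if value not in seen:
            seen.add(value)
            ordered.append(value)
    return ordered


def handle_stream_event(event, put_item, key="email"):
    """Copy the hash key of each newly inserted item into the hash-only table.

    `event` follows the DynamoDB Streams record layout delivered to Lambda.
    `put_item` is a hypothetical callable that writes one item to emails_only.
    Writing the same key twice simply overwrites the same item, so the
    hash-only table never holds duplicates.
    """
    written = 0
    for record in event.get("Records", []):
        if record.get("eventName") != "INSERT":
            continue  # only new items can introduce a new email id
        value = record["dynamodb"]["Keys"][key]["S"]
        put_item({key: value})
        written += 1
    return written
```

The first function is what you would run over the scan results today. The second is the shape of the Stream-driven approach, where the de-duplication happens once at write time instead of on every scan.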
Finally, regarding your question about cost: first, Scan will read entire items even if you only request certain projected attributes from those items. Second, Scan has to read through every item, even if it is filtered out by a FilterExpression (Condition Expression). Third, Scan reads through items sequentially. That means that each Scan call is treated as one big read for metering purposes. The cost implication is that if a Scan call reads 200 different items, it will not necessarily cost 100 RCU (which is what 200 separate eventually consistent reads of small items would cost). If the size of each of those items is 100 bytes, that Scan call will cost ROUND_UP((20000 bytes / 1024 bytes per KB) / 8 KB per eventually consistent (EC) RCU) = 3 RCU. Even if this call only returns 123 items, if the Scan had to read 200 items, you would still incur 3 RCU in this situation.
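The arithmetic above can be checked with a small sketch. It mirrors the formula in this answer (total bytes read, divided by 8 KB per eventually consistent RCU, rounded up) and is an estimate only. The function names are hypothetical, and the metering DynamoDB actually reports may round slightly differently.

```python
import math


def scan_call_rcu(items_read, item_size_bytes, eventually_consistent=True):
    """Estimate the RCU for one Scan call, metered as a single large read."""
    total_kb = items_read * item_size_bytes / 1024.0
    # 8 KB per RCU for eventually consistent reads, 4 KB for strongly consistent
    kb_per_rcu = 8.0 if eventually_consistent else 4.0
    return int(math.ceil(total_kb / kb_per_rcu))


def individual_reads_rcu(items_read, item_size_bytes):
    """Estimate the RCU if each item were read separately (eventually consistent)."""
    units_per_item = math.ceil(item_size_bytes / 4096.0)  # 4 KB blocks per item
    return items_read * units_per_item * 0.5


print(scan_call_rcu(200, 100))         # 3
print(individual_reads_rcu(200, 100))  # 100.0
```

Note that the estimate depends on the number of items read, not the number returned, so a filter that drops 77 of the 200 items leaves the result unchanged.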