Problem description
I have a Python datetime timestamp and a large dict (index) where keys are timestamps and the values are some other information I'm interested in.
I need to find the datetime (the key) in index that is closest to timestamp, as efficiently as possible.
At the moment I'm doing something like:
for timestamp in timestamps:
    closestTimestamp = min(index, key=lambda dt: abs(timestamp - dt))
This works, but it takes too long: my index dict has millions of entries, and I'm doing the search thousands of times. I'm flexible about data structures. The timestamps are roughly sequential, so I iterate from the first timestamp to the last; likewise, the timestamps in the text file that I load into the dict are sequential.
Any ideas for optimisation would be greatly appreciated.
Recommended answer
Dictionaries aren't organized for efficient near-miss searches. They are designed for exact matches (using a hash table).
You may be better off maintaining a separate, fast-searchable ordered structure.
A simple way to start is the bisect module, which gives fast O(log N) searches on a sorted list at the cost of slower O(N) insertions:
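from bisect import bisect_left

def nearest(ts):
    # Assumes a presorted list of timestamps built once: s = sorted(index)
    i = bisect_left(s, ts)
    # Only the neighbours of the insertion point can be closest to ts
    return min(s[max(0, i-1): i+2], key=lambda t: abs(ts - t))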
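To make the bisect approach concrete, here is a minimal self-contained sketch that also maps the nearest key back to its value in the dict. The sample data and the names `sorted_keys` and `nearest_key` are made up for illustration, and this variant inspects only the two neighbours of the insertion point.

```python
from bisect import bisect_left
from datetime import datetime, timedelta

# Hypothetical sample index: one entry every 10 minutes.
start = datetime(2024, 1, 1)
index = {start + timedelta(minutes=10 * k): f"record {k}" for k in range(6)}

# Sort the keys once, up front; every later lookup reuses this list.
sorted_keys = sorted(index)

def nearest_key(ts):
    """Return the key in sorted_keys that is closest to ts."""
    i = bisect_left(sorted_keys, ts)
    # The closest key is either just before or at the insertion point.
    candidates = sorted_keys[max(0, i - 1): i + 1]
    return min(candidates, key=lambda t: abs(ts - t))

query = start + timedelta(minutes=23)
closest = nearest_key(query)
print(closest, index[closest])
```

Sorting costs O(N log N) once; after that each lookup is a binary search, so thousands of queries against millions of keys should no longer scan the whole dict.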
A more sophisticated approach, suitable for non-static, dynamically updated dicts, is to use blist, which employs a tree structure for fast O(log N) insertions and lookups. You only need this if the dict is going to change over time.
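For comparison, the plain-list version of a dynamically updated index can be sketched with the standard library's `bisect.insort`, which keeps the list sorted but pays O(N) per insertion. This is not the blist approach itself: a tree-backed sorted container would take the place of `sorted_keys` to bring insertions down to O(log N). The names `add_entry` and `nearest_key` and the sample data are assumptions for illustration.

```python
from bisect import bisect_left, insort
from datetime import datetime, timedelta

index = {}        # timestamp -> payload
sorted_keys = []  # the same timestamps, kept in ascending order

def add_entry(ts, payload):
    """Insert or update an entry, keeping sorted_keys in step with index."""
    if ts not in index:
        insort(sorted_keys, ts)  # binary search, then an O(N) element shift
    index[ts] = payload

def nearest_key(ts):
    """Return the key closest to ts among those inserted so far."""
    i = bisect_left(sorted_keys, ts)
    return min(sorted_keys[max(0, i - 1): i + 1], key=lambda t: abs(ts - t))

base = datetime(2024, 1, 1)
for minutes in (30, 0, 45, 10):  # entries arrive out of order
    add_entry(base + timedelta(minutes=minutes), f"m{minutes}")

print(nearest_key(base + timedelta(minutes=40)))
```

If insertions are rare compared with lookups, this may already be fast enough before reaching for a tree structure.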
If you want to stay with a dictionary-based approach, consider a dict-of-lists that clusters entries with nearby timestamps:
def get_closest_stamp(ts):
    'Speed up the timestamp search by looking only at entries in the same hour'
    hour = round_to_nearest_hour(ts)
    cluster = daydict[hour]         # a list of the timestamps clustered under that hour
    return min(cluster, key=lambda t: abs(ts - t))
Note that, for exact results near cluster boundaries, you should store close-to-the-boundary timestamps in both the primary cluster and the adjacent cluster.
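One possible way to build such a clustered dict, with boundary timestamps stored twice, is sketched below. The answer does not define `round_to_nearest_hour` or how `daydict` is filled, so both are assumptions here, as are `build_daydict`, the 10-minute `BOUNDARY_MARGIN`, and the extra `daydict` parameter on `get_closest_stamp`.

```python
from collections import defaultdict
from datetime import datetime, timedelta

HALF_HOUR = timedelta(minutes=30)
BOUNDARY_MARGIN = timedelta(minutes=10)  # assumed width of the overlap zone

def round_to_nearest_hour(ts):
    """Round a datetime to the nearest whole hour (half hours round up)."""
    floor = ts.replace(minute=0, second=0, microsecond=0)
    return floor + timedelta(hours=1) if ts - floor >= HALF_HOUR else floor

def build_daydict(stamps):
    """Cluster stamps by nearest hour, copying near-boundary stamps across."""
    daydict = defaultdict(list)
    for ts in stamps:
        hour = round_to_nearest_hour(ts)
        daydict[hour].append(ts)
        offset = ts - hour  # lies in [-30 min, +30 min)
        if offset >= HALF_HOUR - BOUNDARY_MARGIN:
            daydict[hour + timedelta(hours=1)].append(ts)  # near the upper edge
        elif offset <= -(HALF_HOUR - BOUNDARY_MARGIN):
            daydict[hour - timedelta(hours=1)].append(ts)  # near the lower edge
    return daydict

def get_closest_stamp(ts, daydict):
    """Search only the cluster for ts; raises ValueError if it is empty."""
    cluster = daydict.get(round_to_nearest_hour(ts), [])
    return min(cluster, key=lambda t: abs(ts - t))

day = datetime(2024, 1, 1)
stamps = [day.replace(hour=9, minute=5), day.replace(hour=9, minute=28),
          day.replace(hour=9, minute=40), day.replace(hour=10, minute=40)]
daydict = build_daydict(stamps)
result = get_closest_stamp(day.replace(hour=9, minute=31), daydict)
print(result)
```

The query at 09:31 falls in the 10:00 cluster, yet its true nearest neighbour, 09:28, belongs primarily to the 09:00 cluster; the duplicated copy is what lets the lookup find it. The result is only guaranteed exact when the true nearest timestamp lies within the margin of the boundary, so the margin should be chosen with the typical gap between timestamps in mind.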