
Finding duplicate files via hashlib?

Problem Description

I know that this question has been asked before, and I've seen some of the answers, but this question is more about my code and the best way of accomplishing this task.

I want to scan a directory and see if there are any duplicates (by checking MD5 hashes) in that directory. The following is my code:

import sys
import os
import hashlib

fileSliceLimitation = 5000000  # bytes

# if the file is big, read it in chunks to avoid loading the whole file into RAM
def getFileHashMD5(filename):
    retval = 0
    filesize = os.path.getsize(filename)

    if filesize > fileSliceLimitation:
        with open(filename, 'rb') as fh:
            m = hashlib.md5()
            while True:
                data = fh.read(8192)
                if not data:
                    break
                m.update(data)
            retval = m.hexdigest()
    else:
        retval = hashlib.md5(open(filename, 'rb').read()).hexdigest()

    return retval

searchdirpath = raw_input("Type directory you wish to search: ")
print ""
print ""
text_file = open('outPut.txt', 'w')

for dirname, dirnames, filenames in os.walk(searchdirpath):
    # print path to all filenames.
    for filename in filenames:
        fullname = os.path.join(dirname, filename)
        h_md5 = getFileHashMD5(fullname)
        print h_md5 + " " + fullname
        text_file.write("\n" + h_md5 + " " + fullname)

# close txt file
text_file.close()


print "\n\n\nReading outPut:"
text_file = open('outPut.txt', 'r')

myListOfHashes = text_file.read()

if h_md5 in myListOfHashes:
    print 'Match: ' + " " + fullname

This gives me the following output:

Please type in directory you wish to search using above syntax: /Users/bubble/Desktop/aF

033808bb457f622b05096c2f7699857v /Users/bubble/Desktop/aF/.DS_Store
409d8c1727960fddb7c8b915a76ebd35 /Users/bubble/Desktop/aF/script copy.py
409d8c1727960fddb7c8b915a76ebd25 /Users/bubble/Desktop/aF/script.py
e9289295caefef66eaf3a4dffc4fe11c /Users/bubble/Desktop/aF/simpsons.mov

Reading outPut:
Match:  /Users/bubble/Desktop/aF/simpsons.mov

My thinking is:

1) Scan the directory
2) Write MD5 hashes + filenames to a text file
3) Open the text file as read-only
4) Scan the directory AGAIN and check against the text file...

I see that this isn't a good way of doing it, and it doesn't work: the 'match' just prints out the very last file that was processed.

How can I get this script to actually find duplicates? Can someone tell me a better/easier way of accomplishing this task?

Thank you very much for any help. Sorry this is a long post.

Recommended Answer

The obvious tool for identifying duplicates is a hash table. Unless you are working with a very large number of files, you could do something like this:

from collections import defaultdict

# files is the collection of paths to check; get_file_hash returns a hash per file
file_dict = defaultdict(list)
for filename in files:
    file_dict[get_file_hash(filename)].append(filename)

At the end of this process, file_dict will contain a list for every unique hash; when two files have the same hash, they'll both appear in the list for that hash. Then filter the dict looking for value lists longer than 1, and compare the files to make sure they're the same -- something like this:

for duplicates in file_dict.values():   # file_dict.itervalues() in Python 2
    if len(duplicates) > 1:
        # double-check reported duplicates and generate output

Or this:

duplicates = [files for files in file_dict.values() if len(files) > 1]
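
A rough sketch of that verification step, which the answer leaves as a comment: one option is to confirm the reported duplicates byte-for-byte with the standard library's filecmp module before printing them (Python 3; file_dict is the dict built above):

import filecmp

for duplicates in file_dict.values():
    if len(duplicates) > 1:
        # compare every candidate against the first file in the group;
        # shallow=False makes filecmp compare file contents, not just os.stat() info
        original = duplicates[0]
        for candidate in duplicates[1:]:
            if filecmp.cmp(original, candidate, shallow=False):
                print('Match: %s == %s' % (original, candidate))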

get_file_hash could use MD5s; or it could simply get the first and last bytes of the file as Ramchandra Apte suggested in the comments above; or it could simply use file sizes as tdelaney suggested in the comments above. Each of the latter two strategies is more likely to produce false positives, though. You could combine them to reduce the false positive rate.
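
A hedged sketch of one such combination (illustrative function names, Python 3; not code from the answer): group files by size first, since files of different sizes cannot be identical, then compute full MD5s only within size groups that have more than one member:

import os
import hashlib
from collections import defaultdict

def md5_of_file(path, chunk_size=8192):
    # stream the file so large files never have to fit in RAM
    m = hashlib.md5()
    with open(path, 'rb') as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b''):
            m.update(chunk)
    return m.hexdigest()

def find_duplicates(paths):
    # cheap pass: bucket by file size
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    # expensive pass: hash only files that share a size with another file
    by_hash = defaultdict(list)
    for same_size in by_size.values():
        if len(same_size) > 1:
            for path in same_size:
                by_hash[md5_of_file(path)].append(path)

    return [group for group in by_hash.values() if len(group) > 1]

Files with a unique size are never hashed at all, which is where most of the savings come from.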

If you're working with a very large number of files, you could use a more sophisticated data structure like a Bloom Filter.
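
To give a sense of the idea (a toy sketch, not from the answer; in practice you would reach for a tested library): a Bloom filter answers "possibly seen" or "definitely not seen" using a bit array and several hash functions, so you can cheaply skip hashes that have definitely not appeared before and exactly re-check only the "possibly seen" ones:

import hashlib

class BloomFilter(object):
    # toy Bloom filter; the k hash functions are derived from MD5 with different salts
    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _indexes(self, item):
        for salt in range(self.num_hashes):
            digest = hashlib.md5(('%d:%s' % (salt, item)).encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for i in self._indexes(item):
            self.bits[i // 8] |= 1 << (i % 8)

    def might_contain(self, item):
        # False: definitely never added; True: possibly added (false positives happen)
        return all(self.bits[i // 8] & (1 << (i % 8)) for i in self._indexes(item))

seen = BloomFilter()
for path in paths:                # 'paths' is your file list (hypothetical)
    h = md5_of_file(path)         # md5_of_file from the sketch above
    if seen.might_contain(h):
        print('Possible duplicate, verify exactly: ' + path)
    seen.add(h)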
