Remove duplicate rows from a large file in Python

Question

I have a CSV file that I want to remove duplicate rows from, but it's too large to fit into memory. I found a way to get it done, but my guess is that it's not the best way.

Each row contains 15 fields and several hundred characters, and all fields are needed to determine uniqueness. Instead of comparing entire rows to find duplicates, I compare hash(row-as-a-string) in an attempt to save memory. I set a filter that partitions the data into roughly equal numbers of rows (e.g. days of the week), where each partition is small enough that a lookup table of hash values for that partition fits in memory. I pass through the file once for each partition, checking for unique rows and writing them out to a second file (pseudo code):

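import csv

# Column names; every field takes part in the uniqueness check
headers=['DayOfWeek','a','b']
outs=csv.DictWriter(open('c:dedupedFile.csv','wb'),headers)
days=['Mon','Tue','Wed','Thu','Fri','Sat','Sun']

# Python 2.5 has no DictWriter.writeheader(), so write the header row by hand
outs.writerow(dict((h,h) for h in headers))

for day in days:
    htable=set()  # hashes seen so far, for this partition only
    # One full pass over the input file per partition
    ins=csv.DictReader(open('c:igfile.csv','rb'),headers)
    for line in ins:
        if line['DayOfWeek']==day:
            # Hash the fields in a fixed order; distinct rows whose hashes
            # collide would wrongly be treated as duplicates
            hvalue=hash(tuple([line[h] for h in headers]))
            if hvalue not in htable:
                htable.add(hvalue)
                outs.writerow(line)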
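
The hash-per-row idea can also be sketched with a fixed-size digest in place of the built-in hash(). This is only an illustration under a few assumptions: it is written for a current Python 3 rather than 2.5, the function name dedupe_rows and the sample data are hypothetical, and swapping hash() for an MD5 digest is a substitution made here (to make accidental collisions far less likely), not something the question does.

```python
import csv
import hashlib
import io


def dedupe_rows(rows):
    """Yield each distinct row once, keeping only a 16-byte digest per row."""
    seen = set()
    for row in rows:
        # Join fields with a separator that should not occur in the data,
        # so that ('ab', 'c') and ('a', 'bc') produce different keys.
        key = hashlib.md5("\x1f".join(row).encode("utf-8")).digest()
        if key not in seen:
            seen.add(key)
            yield row


# Hypothetical sample input standing in for the large file
sample = io.StringIO("Mon,1,2\nTue,1,2\nMon,1,2\n")
unique = list(dedupe_rows(csv.reader(sample)))
print(unique)
```

Each remembered row costs a 16-byte digest plus set overhead instead of several hundred characters, which is the memory saving the question is after. If even the digests do not fit in memory, the partitioning described in the question is still needed.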

One way I was thinking of speeding this up is to find a better filter that reduces the number of passes necessary. Assuming the lengths of the rows are uniformly distributed, maybe instead of

for day in days: 

if line['DayOfWeek']==day:

we have

for i in range(n):

if len(reduce(lambda x,y:x+y,line.itervalues()))%n==i:

where 'n' is as small as memory will allow. But this is still using the same method.
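
The length-modulo-n filter can be sketched as follows. This is a rough illustration, not a tuned implementation: it assumes Python 3, works on plain strings rather than CSV rows, and the names dedupe_in_passes and read_lines are hypothetical. read_lines is any callable that returns a fresh iterator over the input, standing in for reopening the file on each pass.

```python
def dedupe_in_passes(read_lines, n):
    """Make n passes; pass i only handles lines whose length % n == i."""
    out = []
    for i in range(n):
        seen = set()  # reset per pass: only one partition's hashes are held
        for line in read_lines():
            if len(line) % n != i:
                continue  # this line belongs to another pass
            h = hash(line)
            if h not in seen:
                seen.add(h)
                out.append(line)
    return out


# Hypothetical sample data standing in for the large file
lines = ["aa", "b", "aa", "ccc", "b", "dddd"]
result = dedupe_in_passes(lambda: iter(lines), 2)
print(result)
```

The filter is safe because two identical rows always have the same length, so duplicates can never land in different partitions. As with the day-of-week version, the output comes out grouped by partition rather than in the original order.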

Wayne Werner provided a good practical solution below; I was curious whether there is a better/faster/simpler way to do this from an algorithmic perspective.

P.S. I'm limited to Python 2.5.

Recommended answer

If you want a really simple way to do this, just create a sqlite database:

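import csv
import sqlite3

conn = sqlite3.connect('single.db')
cur = conn.cursor()
# A primary key spanning all 15 columns makes sqlite reject exact duplicate rows
cur.execute("""create table test(
f1 text,
f2 text,
f3 text,
f4 text,
f5 text,
f6 text,
f7 text,
f8 text,
f9 text,
f10 text,
f11 text,
f12 text,
f13 text,
f14 text,
f15 text,
primary key(f1,  f2,  f3,  f4,  f5,  f6,  f7,
            f8,  f9,  f10,  f11,  f12,  f13,  f14,  f15))
""")
conn.commit()

#simplified code; the file names are assumed to match those in the question
reader = csv.reader(open('c:igfile.csv', 'rb'))
for row in reader:
    #each row is a list of 15 strings
    try:
        cur.execute('''insert into test values(?, ?, ?, ?, ?, ?, ?,
                       ?, ?, ?, ?, ?, ?, ?, ?)''', row)
    except sqlite3.IntegrityError:
        #duplicate row: the primary key already exists, so skip it
        pass

#commit once at the end; committing after every insert is much slower
conn.commit()
cur.execute('select * from test')

writer = csv.writer(open('c:dedupedFile.csv', 'wb'))
for row in cur:
    writer.writerow(row)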

Then you wouldn't have to worry about any of the comparison logic yourself - just let sqlite take care of it for you. It probably won't be much faster than hashing the strings, but it's probably a lot easier. You could also change the column types stored in the database if you wanted. And since you're already converting the data to a string, you could just have one field instead. Plenty of options here.
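
The single-field option might look something like the sketch below. It assumes Python 3, an in-memory database for demonstration, and a hypothetical function name dedupe_with_sqlite and table name seen. It also uses SQLite's INSERT OR IGNORE instead of catching IntegrityError, which is a substitution made here rather than part of the answer above.

```python
import sqlite3


def dedupe_with_sqlite(lines, db_path=":memory:"):
    """Return the distinct lines in first-seen order, letting sqlite dedupe."""
    conn = sqlite3.connect(db_path)
    # One text column holding the whole row; the primary key enforces uniqueness
    conn.execute("create table if not exists seen(line text primary key)")
    # INSERT OR IGNORE silently skips rows that would violate the primary key
    conn.executemany(
        "insert or ignore into seen values (?)",
        ((line,) for line in lines),
    )
    conn.commit()
    result = [r[0] for r in conn.execute("select line from seen order by rowid")]
    conn.close()
    return result


# Hypothetical sample rows, already joined into one string each
print(dedupe_with_sqlite(["a,1", "b,2", "a,1"]))
```

For a file that does not fit in memory, db_path would point at a file on disk (such as the single.db used above) so that sqlite can spill the index to disk.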
