Problem description
I am attempting to remove some observations in a pandas DataFrame where the similarities are ALMOST 100% but not quite. See frame below:
Notice how "John", "Mary", and "Wesley" each have nearly identical observations that differ in just one column. The real data set has 15 columns and 215,000+ observations. In every case I could visually verify, the pattern was the same: out of 15 columns, the other observation matched on up to 14. For the purpose of the project I have decided to remove the repeated observations (and store them in another DataFrame just in case my boss asks to see them).
I have obviously thought of drop_duplicates(keep='something'), but that would not work since the observations are not ENTIRELY identical. Has anyone ever encountered such an issue? Any idea for a remedy?
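To make the problem concrete, here is a minimal sketch on a small made-up frame (the rows and the names `sample` and `deduped` are illustrative assumptions, not the real data set). When no subset is given, pandas' drop_duplicates() compares every column, so two rows that differ in even one column are both kept.

```python
import pandas as pd

# Illustrative data only (an assumption, not the real data set):
# rows 0 and 2 agree on every column except Salary.
sample = pd.DataFrame(
    [
        ['John', 45, 85000, 'DC'],
        ['Netcha', 25, 48000, 'NYC'],
        ['John', 45, 105500, 'DC'],
    ],
    columns=['Name', 'Age', 'Salary', 'City'])

# With no subset, only rows that match on ALL columns count as duplicates,
# so the two near-identical "John" rows both survive.
deduped = sample.drop_duplicates(keep='first')

print(len(sample), len(deduped))
```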
Recommended answer
What about a simple loop over subsets of the columns:
import pandas as pd

df = pd.DataFrame(
    [
        ['John', 45, 85000, 'DC'],
        ['Netcha', 25, 48000, 'NYC'],
        ['Mary', 45, 85000, 'DC'],
        ['Wesley', 36, 72500, 'LA'],
        ['Porter', 22, 98750, 'Seattle'],
        ['John', 45, 105500, 'DC'],
        ['Mary', 28, 85000, 'DC'],
        ['Wesley', 36, 72500, 'Boston'],
    ],
    columns=['Name', 'Age', 'Salary', 'City'])

# Every column except 'Name' is allowed to be the one that differs.
cols = df.columns.tolist()
cols.remove('Name')

for col in cols:
    # Ignore one column at a time and drop rows that match on all the others.
    observed_cols = df.drop(col, axis=1).columns.tolist()
    df.drop_duplicates(subset=observed_cols, keep='first', inplace=True)

print(df)
Returns:
     Name  Age  Salary     City
0    John   45   85000       DC
1  Netcha   25   48000      NYC
2    Mary   45   85000       DC
3  Wesley   36   72500       LA
4  Porter   22   98750  Seattle
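The question also asks to keep the removed rows in a separate DataFrame. One possible sketch, assuming the frame has a unique index (true for the default RangeIndex used here), is to run the same loop without inplace=True and then select the rows whose index labels did not survive. The names `kept` and `removed` are illustrative choices, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ['John', 45, 85000, 'DC'],
        ['Netcha', 25, 48000, 'NYC'],
        ['Mary', 45, 85000, 'DC'],
        ['Wesley', 36, 72500, 'LA'],
        ['Porter', 22, 98750, 'Seattle'],
        ['John', 45, 105500, 'DC'],
        ['Mary', 28, 85000, 'DC'],
        ['Wesley', 36, 72500, 'Boston'],
    ],
    columns=['Name', 'Age', 'Salary', 'City'])

# Every column except 'Name' may be the single column that differs.
cols = df.columns.tolist()
cols.remove('Name')

# Work on a separate result so the original frame stays intact.
kept = df
for col in cols:
    observed_cols = [c for c in kept.columns if c != col]
    kept = kept.drop_duplicates(subset=observed_cols, keep='first')

# Assumption: index labels are unique, so any label missing from `kept`
# identifies a row that was dropped as a near-duplicate.
removed = df.loc[~df.index.isin(kept.index)]

print(kept)
print(removed)
```

Leaving `df` untouched means the two results can be checked against the original: every row ends up in exactly one of `kept` or `removed`.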