Problem description
I am still new to working with databases, so please be patient with me. I have read through a number of similar questions, but none of them seem to cover the issue I am facing.
A bit of background on what I am doing: I have a table filled with contact information, and some of the contacts are duplicated. Most of the duplicated rows have a truncated phone number, which makes that data useless.
I wrote the following query to search for the duplicates:
WITH CTE (CID, Firstname, lastname, phone, email, length, dupcnt) AS
(
    SELECT
        CID, Firstname, lastname, phone, email, LEN(phone) AS length,
        ROW_NUMBER() OVER (PARTITION BY Firstname, lastname, email
                           ORDER BY Firstname) AS dupcnt
    FROM
        [data.com_raw]
)
SELECT *
FROM CTE
WHERE dupcnt > 1
  AND length <= 10
I assumed that this query would find all records that have duplicates based on the three columns I specified, and select any with dupcnt greater than 1 and a phone column with a length less than or equal to 10. But when I run the query more than once, I get a different result set on each execution. There must be some logic that I am missing here, but I am completely baffled by this. All of the columns are of varchar datatype, except for CID, which is int.
Recommended answer
Instead of ROW_NUMBER(), use COUNT(*), and remove the ORDER BY, since it is not needed with COUNT(*).
The way you have it now, you are chunking records into groups/partitions of similar records by firstname/lastname/email. Then you are ordering each group/partition by firstname. Firstname is part of the partition, meaning every firstname in that group/partition is identical, so the ORDER BY does not actually decide the order of the rows. You will get different results depending on how SQL Server fetches the rows from storage (whichever record it finds first is 1, the one it finds second is 2). Every time it fetches records (every time you run this SQL), it may fetch them from disk or cache in a different order.
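If you do want to keep ROW_NUMBER(), the numbering only becomes repeatable when the ORDER BY can tell the rows in a partition apart. The sketch below is an alternative to the COUNT(*) approach, not part of it. It assumes CID is unique per row and that the row with the longest phone number is the one worth keeping; neither assumption is confirmed in the question.

```sql
-- Sketch only: deterministic numbering inside each duplicate group.
-- Assumptions (not stated in the question): CID is unique, and the row
-- with the longest phone value is the "good" copy of the contact.
WITH CTE AS
(
    SELECT
        CID, Firstname, lastname, phone, email,
        LEN(phone) AS length,
        ROW_NUMBER() OVER (PARTITION BY Firstname, lastname, email
                           ORDER BY LEN(phone) DESC, CID) AS dupcnt
    FROM [data.com_raw]
)
SELECT *
FROM CTE
WHERE dupcnt > 1      -- every row except the first one in its group
  AND length <= 10;
```

Because CID breaks any tie on phone length, the same row gets number 1 on every run, so the rows returned with dupcnt > 1 no longer change between executions.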
COUNT(*) will return all of the duplicated rows. Change it to:
COUNT(*) OVER (PARTITION BY Firstname, lastname, email ) AS dupcnt
This returns the number of records that share the same firstname, lastname, and email. You then keep any record whose dupcnt is greater than 1.
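Put together, the query from the question with that one change might look like the sketch below. The table and column names are taken from the question; the final ORDER BY is an addition that only affects how the output is displayed, and it assumes you want the duplicates listed next to each other.

```sql
-- Sketch: the original query with ROW_NUMBER() swapped for COUNT(*).
WITH CTE AS
(
    SELECT
        CID, Firstname, lastname, phone, email,
        LEN(phone) AS length,
        -- number of rows sharing this Firstname/lastname/email
        COUNT(*) OVER (PARTITION BY Firstname, lastname, email) AS dupcnt
    FROM [data.com_raw]
)
SELECT *
FROM CTE
WHERE dupcnt > 1       -- the contact appears more than once
  AND length <= 10     -- and this copy has a short phone value
ORDER BY Firstname, lastname, email, CID;  -- display order only (added here)
```

Since dupcnt is now the same value for every row in a group, the set of rows returned no longer depends on the order in which SQL Server happens to read them.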