Assuming I have a list of shifts for an event (in the format start date/time, end date/time) - is there some sort of algorithm I could use to create a generalized summary of the schedule? It is quite common for most of the shifts to fall into some sort of common recurrence pattern (i.e. Mondays from 9:00 am to 1:00 pm, Tuesdays from 10:00 am to 3:00 pm, etc.). However, there can be (and will be) exceptions to this rule (e.g. one of the shifts fell on a holiday and was rescheduled for the next day). It would be fine to exclude those from my "summary", as I'm looking to provide a more general answer to the question of when this event usually occurs.
I guess I'm looking for some sort of statistical method to determine the day and time occurrences and create a description based on the most frequent occurrences found in the list. Is there some sort of general algorithm for something like this? Has anyone created something similar?
Ideally I'm looking for a solution in C# or VB.NET, but don't mind porting from any other language.
Thanks in advance!
You may use Cluster Analysis.
Clustering is a way to segregate a set of data into similar components (subsets). The "similarity" concept involves some definition of "distance" between points. Many formulas for the distance exist, among them the usual Euclidean distance.
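To make the "distance" idea concrete, here is a minimal sketch in Python (not the Mathematica used further down). It assumes each shift is represented as a point (day, start hour, end hour); that layout and the name shift_distance are illustrative choices, not part of any library.

```python
import math

def shift_distance(a, b):
    """Euclidean distance between two shifts given as (day, start_hour, end_hour)."""
    return math.dist(a, b)

monday_a = (1, 10.0, 15.0)
monday_b = (1, 10.5, 14.5)    # the same shift with half an hour of noise
wednesday = (3, 14.0, 17.0)   # a different shift

print(shift_distance(monday_a, monday_b))   # small: likely the same shift
print(shift_distance(monday_a, wednesday))  # large: a different shift
```

Two noisy instances of the same shift end up much closer to each other than to any other shift, which is what the clustering exploits.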
Practical Case
Before pointing you to the tricks of the trade, let's show a practical case for your problem, so you can either dig into the algorithms and packages or discard them upfront.
For simplicity, I modelled the problem in Mathematica, because Cluster Analysis is included in the software and is very straightforward to set up.
First, generate the data. The format is { DAY, START TIME, END TIME }.
The start and end times have a random variable added {+half hour, zero, -half hour} to show the capability of the algorithm to cope with "noise".
There are three days, two shifts per day, and one extra (the last one) "anomalous" shift, which starts at 7 AM and ends at 9 AM (poor guys!).
There are 150 events in each "normal" shift and only two in the exceptional one.
As you can see, some shifts are not very far apart from each other.
I include the code in Mathematica, in case you have access to the software. I'm trying to avoid using the functional syntax, to make the code easier to read for "foreigners".
Here is the data generation code:
Rn[] := 0.5 * RandomInteger[{-1, 1}];
monshft1 = Table[{ 1 , 10 + Rn[] , 15 + Rn[] }, {150}]; (* 1 *)
monshft2 = Table[{ 1 , 12 + Rn[] , 17 + Rn[] }, {150}]; (* 2 *)
wedshft1 = Table[{ 3 , 10 + Rn[] , 15 + Rn[] }, {150}]; (* 3 *)
wedshft2 = Table[{ 3 , 14 + Rn[] , 17 + Rn[] }, {150}]; (* 4 *)
frishft1 = Table[{ 5 , 10 + Rn[] , 15 + Rn[] }, {150}]; (* 5 *)
frishft2 = Table[{ 5 , 11 + Rn[] , 15 + Rn[] }, {150}]; (* 6 *)
monexcp = Table[{ 1 , 7 + Rn[] , 9 + Rn[] }, {2}]; (* 7 *)
Now we join the data, obtaining one big dataset:
data = Join[monshft1, monshft2, wedshft1, wedshft2, frishft1, frishft2, monexcp];
Let's run a cluster analysis for the data:
clusters = FindClusters[data, 7, Method->{"Agglomerate","Linkage"->"Complete"}]
"Agglomerate" and "Linkage" -> "Complete" are two fine tuning options of the clustering methods implemented in Mathematica. They just specify we are trying to find very compact clusters.
I asked it to detect 7 clusters. If the right number of shifts is unknown, you can try several reasonable values and compare the results, or let the algorithm select the most suitable value.
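For readers without Mathematica, the same idea can be sketched in plain Python. This is not what FindClusters does internally; it is a naive complete-linkage agglomerative clustering written from the textbook definition, and the names complete_linkage and noisy are made up for this example. To keep the pure-Python version fast and its outcome predictable, it uses a reduced dataset (one regular shift per day plus the two-event exception) rather than the full data above.

```python
import math
import random

def complete_linkage(points, k):
    """Repeatedly merge the two closest clusters until only k remain.
    The distance between two clusters is the LARGEST distance between
    any pair of their points (complete linkage)."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = max(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters.pop(j)      # j > i, so index i is still valid
        clusters[i].extend(merged)
    return clusters

def noisy(day, start, end, n, rng):
    """n copies of a shift with -0.5, 0 or +0.5 hours added to start and end."""
    return [(day, start + 0.5 * rng.randint(-1, 1), end + 0.5 * rng.randint(-1, 1))
            for _ in range(n)]

rng = random.Random(0)
data = (noisy(1, 10, 15, 20, rng)     # Monday 10 to 15
        + noisy(3, 14, 17, 20, rng)   # Wednesday 14 to 17
        + noisy(5, 11, 15, 20, rng)   # Friday 11 to 15
        + noisy(1, 7, 9, 2, rng))     # the exceptional Monday shift

clusters = complete_linkage(data, 4)
for c in clusters:
    print(len(c), "events, for example", c[0])
```

The groups in this reduced dataset are farther apart than the noise can spread any one of them, so each shift comes back as its own cluster. When two shifts sit closer together than the noise (the two Friday shifts in the full data can even produce identical points), the boundary between their clusters becomes blurry. The naive loop is also roughly cubic in the number of events, so a real implementation would use a library routine.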
We can get a chart with the results, each cluster in a different color (don't mind the code)
ListPointPlot3D[ clusters,
PlotStyle->{{PointSize[Large], Pink}, {PointSize[Large], Green},
{PointSize[Large], Yellow}, {PointSize[Large], Red},
{PointSize[Large], Black}, {PointSize[Large], Blue},
{PointSize[Large], Purple}, {PointSize[Large], Brown}},
AxesLabel -> {"DAY", "START TIME", "END TIME"}]
The resulting plot (not reproduced here) shows our seven clusters clearly separated.
That solves part of your problem: identifying the data. Now you also want to be able to label it.
So, we'll get each cluster and take means (rounded):
Table[Round[Mean[clusters[[i]]]], {i, 7}]
The result is:
Day Start End
{"1", "10", "15"},
{"1", "12", "17"},
{"3", "10", "15"},
{"3", "14", "17"},
{"5", "10", "15"},
{"5", "11", "15"},
{"1", "7", "9"}
And with that you recover your seven classes.
Now, perhaps you want to classify the shifts regardless of the day. If the same people do the same task at the same time every day, it is not useful to call it "Monday shift from 10 to 15", because it also happens on Wednesdays and Fridays (as in our example).
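The labelling step, plus the exclusion of rare exceptions that the question asks for, might look like this in Python. It is only a sketch: the names label and summarize, the min_share threshold, and the tiny hand-written clusters are assumptions for illustration, and days are numbered 1 = Monday as in the generated data.

```python
DAY_NAMES = {1: "Monday", 2: "Tuesday", 3: "Wednesday", 4: "Thursday",
             5: "Friday", 6: "Saturday", 7: "Sunday"}

def label(cluster):
    """Rounded column means of a cluster of (day, start_hour, end_hour) points."""
    n = len(cluster)
    return tuple(round(sum(p[i] for p in cluster) / n) for i in range(3))

def summarize(clusters, min_share=0.2):
    """Describe each cluster, skipping those holding less than min_share of all events."""
    total = sum(len(c) for c in clusters)
    lines = []
    for c in clusters:
        if len(c) / total < min_share:
            continue  # too rare: treat it as an exception, not part of the pattern
        day, start, end = label(c)
        lines.append(f"{DAY_NAMES[day]} {start}:00 to {end}:00")
    return lines

clusters = [
    [(1, 10.0, 15.0), (1, 10.5, 14.5), (1, 9.5, 15.0), (1, 10.0, 15.5)],
    [(3, 14.0, 17.0), (3, 13.5, 17.0), (3, 14.0, 17.5), (3, 14.5, 16.5)],
    [(1, 7.0, 9.5)],  # a one-off rescheduled shift
]
print(summarize(clusters))
```

Rounding to whole hours mirrors the Round[Mean[...]] call above; if your shifts start on the half hour you would round to a finer grid instead. The right min_share depends on how many events you have per regular shift.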
Let's analyze the data disregarding the first column:
clusters=
FindClusters[Take[data, All, -2],Method->{"Agglomerate","Linkage"->"Complete"}];
In this case, we are not selecting the number of clusters to retrieve, leaving the decision to the package.
In the result (figure not reproduced here), you can see that five clusters have been identified.
Let's try to "label" them as before:
Grid[Table[Round[Mean[clusters[[i]]]], {i, 5}]]
The result is:
START END
{"10", "15"},
{"12", "17"},
{"14", "17"},
{"11", "15"},
{ "7", "9"}
Which is exactly what we "suspected": there are repeated events each day at the same time that could be grouped together.
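If you already have the seven per-day labels from the first run, a simpler way to see the same grouping (a different technique from re-clustering, shown only as a cross-check) is to group those labels by time slot. The function name days_by_slot is made up for this sketch; the labels are the ones listed earlier.

```python
from collections import defaultdict

# (day, start, end) labels from the seven-cluster run
labels = [(1, 10, 15), (1, 12, 17), (3, 10, 15), (3, 14, 17),
          (5, 10, 15), (5, 11, 15), (1, 7, 9)]

def days_by_slot(labels):
    """Map each (start, end) time slot to the list of days on which it occurs."""
    slots = defaultdict(list)
    for day, start, end in labels:
        slots[(start, end)].append(day)
    return dict(slots)

for (start, end), days in days_by_slot(labels).items():
    print(f"{start} to {end}: days {days}")
```

This yields five distinct time slots, with the 10 to 15 slot shared by days 1, 3 and 5, which agrees with the five clusters found above.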
Edit: Overnight Shifts and Normalization
If you have (or plan to have) shifts that start one day and end on the following, it's better to model
{Start-Day Start-Hour Length} // Correct!
than
{Start-Day Start-Hour End-Day End-Hour} // Incorrect!
That's because, as with any statistical method, the correlation between the variables must be made explicit, or the method fails miserably. In the second form, End-Day is almost entirely determined by Start-Day, so the two attributes are not independent. The principle could be stated as something like "keep your candidate data normalized". Both concepts are almost the same (the attributes should be independent).
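Converting raw start/end timestamps into the recommended {Start-Day Start-Hour Length} form could be sketched in Python as follows. The function name to_features and the choice of ISO weekday numbering (1 = Monday) are assumptions for this example.

```python
from datetime import datetime

def to_features(start, end):
    """Map a shift to (ISO weekday of start, start hour as a decimal, length in hours)."""
    length_hours = (end - start).total_seconds() / 3600
    start_hour = start.hour + start.minute / 60
    return (start.isoweekday(), start_hour, length_hours)

# An overnight shift: Friday 22:00 to Saturday 06:30
overnight = to_features(datetime(2024, 1, 5, 22, 0), datetime(2024, 1, 6, 6, 30))
print(overnight)  # (5, 22.0, 8.5)
```

Because the length is computed from the full timestamps, a shift that crosses midnight needs no special handling.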
--- Edit end ---
By now I guess you understand pretty well what kind of things you can do with this kind of analysis.
Some references
- Of course, Wikipedia, its "references" and "further reading" are a good guide.
- A nice video here shows the capabilities of Statsoft, and it can also give you many ideas about other things you can do with the algorithm.
- Here is a basic explanation of the algorithms involved.
- Here you can find the impressive functionality of R for Cluster Analysis (R is a VERY good option)
- Finally, here you can find a long list of free and commercial software for statistics in general, including clustering.
HTH!