Mining Frequent Patterns from Distributed Data Streams over P2P File-Sharing Networks and the Cloud
Research Team Lead
Nick Cercone
OCAD University
Abstract:
In many systems, data is distributed over a large, dynamic network of nodes with no dedicated server or client nodes. Peer-to-Peer (P2P) systems, in which a huge mass of data is generated every few seconds, are an interesting example of such large-scale distributed systems. We also need smarter computing that is designed for big data, tuned for specific tasks and managed in the cloud.
Research Description:
Standard data mining algorithms are not usable in P2P networks or the cloud unless all of the data is centralized in one place. However, the data is distributed so widely that collecting it for central processing is usually not feasible. It must be processed in place by distributed algorithms suited to this kind of computing environment.

A scenario that centralizes all of the data is not scalable because every change must be announced to the central peer. Moreover, the data changes faster than it can be centralized. There is a large body of literature on frequent itemset mining in distributed environments, but most of this work addresses small-scale distributed environments. Data mining in P2P networks and/or the cloud needs different kinds of algorithms because these environments introduce new problems. The first problem is that global communication is impossible in large P2P/cloud systems. Therefore, nodes should communicate through local negotiation, exchanging information about their local databases with their immediate peers. Another problem comes from the dynamic nature of large-scale systems, in which a node may depart from or join the system in mid-computation. A further complication comes from the dynamic nature of the data stream: a data stream is an unbounded sequence of data elements that is continuously generated at a rapid rate and whose data distribution often changes over time.
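
To make the idea of local negotiation concrete, the following is a minimal sketch of pairwise gossip averaging, a well-known decentralized technique and not necessarily the algorithm this project will develop. Each peer holds only its local support count for one itemset and its local transaction count, and repeatedly averages both with an immediate neighbour. The function name `gossip_support`, the ring topology and the sample numbers are assumptions made for illustration; the sketch is also synchronous and uses a static network, unlike the setting described above.

```python
def gossip_support(counts, sizes, edges, rounds=100):
    """Estimate the global support of one itemset using only neighbour exchanges.

    counts[i] -- local number of transactions at peer i containing the itemset
    sizes[i]  -- local number of transactions at peer i
    edges     -- list of (a, b) pairs of immediate neighbours (hypothetical topology)
    """
    c = [float(x) for x in counts]
    n = [float(x) for x in sizes]
    for _ in range(rounds):
        for a, b in edges:
            # Two neighbours average their values; the network-wide sums
            # of c and n are preserved by every exchange.
            c[a] = c[b] = (c[a] + c[b]) / 2
            n[a] = n[b] = (n[a] + n[b]) / 2
    # Each peer's ratio approaches (total count) / (total transactions).
    return [ci / ni for ci, ni in zip(c, n)]


# Four peers on a ring; the true global support is 100 / 500 = 0.2.
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
estimates = gossip_support([30, 5, 10, 55], [100, 100, 200, 100], ring)
min_support = 0.15  # illustrative threshold
locally_decided = [e >= min_support for e in estimates]
```

Every peer ends up with an estimate of the global support without any node ever seeing the whole data set, so each can decide locally whether the itemset is frequent.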

Mining data streams is one of the most challenging problems in data mining, and it poses several inherent challenges. First, each data element can be examined at most once. Second, although data elements are generated continuously, memory consumption must stay bounded. Third, every incoming data element should be processed as fast as possible. Fourth, the analytical results for the data stream should be available at an acceptable quality whenever users request them. Because of these characteristics, traditional frequent pattern mining algorithms cannot be applied directly. Data mining in P2P networks/clouds therefore requires algorithms that are decentralized and asynchronous and that can cope with dynamically changing data and a dynamically changing network. We will develop a completely decentralized, asynchronous and scalable algorithm for frequent itemset mining in P2P networks/clouds that can handle changes in both the data stream and the network. One of our main design goals is to limit the communication required to find the global frequent itemsets.
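
The first three constraints (a single pass, bounded memory, fast per-element processing) can be illustrated with the classic Misra-Gries summary. This is a minimal sketch for single items rather than itemsets, and it is a standard textbook technique, not the algorithm proposed by this project. The function name `misra_gries` and the sample stream are assumptions made for illustration.

```python
def misra_gries(stream, k):
    """One-pass frequent-item summary that keeps at most k - 1 counters.

    Any item occurring more than len(stream) / k times is guaranteed to
    survive in the summary; the stored counts are lower bounds, not exact.
    """
    counters = {}
    for item in stream:  # each element is examined exactly once
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop the zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# "a" makes up half of this ten-element stream, so with k = 3 it must survive.
summary = misra_gries("a b a c a d a e a f".split(), k=3)
```

Because the summary may also retain infrequent items, a real system would either verify candidates against exact counts or accept an approximate answer, which matches the "acceptable quality" requirement above.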
Aijun An
Professor in the Department of Electrical Engineering and Computer Science at York University
Amir Asif
Professor in the Department of Electrical Engineering and Computer Science at York University
Adetokunbo Makanju
Post-doctoral Fellow, York University
Morteza Zihayat-Kermani
PhD Candidate, York University
Vida Movahedi
Post-doctoral Fellow, York University