Abstract
Because graphs can represent more general and more complicated relationships among objects, graph mining plays a significant role in data mining and has attracted increasing attention from the data mining community. Frequent coherent subgraphs can provide valuable knowledge about the underlying internal structure of a graph database, and mining frequently occurring coherent subgraphs from large dense graph databases has found several applications and recently received considerable attention in the graph mining community. In this article, we study how to efficiently mine the complete set of coherent closed quasi-cliques from large dense graph databases, an especially challenging task because the downward-closure property no longer holds. By fully exploiting properties of quasi-cliques, we propose several novel optimization techniques that effectively prune unpromising and redundant subsearch spaces. We also devise an efficient closure checking scheme so that only closed quasi-cliques are discovered. Since large databases cannot be held in main memory, we further design an out-of-core solution with efficient index structures for mining coherent closed quasi-cliques from large dense graph databases. We call this solution Cocain*. A thorough performance study shows that Cocain* is very efficient and scalable for large dense graph databases.
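To illustrate why the loss of downward closure matters, the following minimal sketch assumes the commonly used degree-based definition of a quasi-clique, which the abstract itself does not spell out: a vertex set S is a γ-quasi-clique if every vertex in S is adjacent to at least ⌈γ·(|S|−1)⌉ other vertices of S. The function name `is_quasi_clique` and the adjacency-dictionary representation are hypothetical choices for illustration and are not taken from Cocain*.

```python
import math


def is_quasi_clique(adj, vertices, gamma):
    """Check the (assumed) degree-based gamma-quasi-clique condition.

    adj      -- dict mapping each vertex to the set of its neighbours
    vertices -- the vertex subset whose induced subgraph is tested
    gamma    -- density threshold in (0, 1]
    """
    vs = set(vertices)
    if not vs:
        return False
    # Minimum number of neighbours each vertex needs inside the subset.
    need = math.ceil(gamma * (len(vs) - 1))
    return all(len(adj[v] & vs) >= need for v in vs)


# A 4-cycle 0-1-2-3-0: every vertex has exactly two neighbours.
cycle = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}

# The whole cycle satisfies the condition for gamma = 0.5 ...
whole = is_quasi_clique(cycle, {0, 1, 2, 3}, 0.5)
# ... but the subset {0, 2} does not, because 0 and 2 are not adjacent.
part = is_quasi_clique(cycle, {0, 2}, 0.5)
```

Under this assumed definition, the full 4-cycle qualifies while one of its subsets does not, so a miner cannot discard a candidate's supersets just because the candidate fails the test, as Apriori-style algorithms do for itemsets. That is why dedicated pruning techniques are needed.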
Out-of-core coherent closed quasi-clique mining from large dense graph databases