Out-of-core coherent closed quasi-clique mining from large dense graph databases

Published: 01 June 2007

Abstract

Due to the ability of graphs to represent more generic and more complicated relationships among different objects, graph mining has played a significant role in data mining, attracting increasing attention in the data mining community. In addition, frequent coherent subgraphs can provide valuable knowledge about the underlying internal structure of a graph database, and mining frequently occurring coherent subgraphs from large dense graph databases has witnessed several applications and received considerable attention in the graph mining community recently. In this article, we study how to efficiently mine the complete set of coherent closed quasi-cliques from large dense graph databases, which is an especially challenging task due to the fact that the downward-closure property no longer holds. By fully exploring some properties of quasi-cliques, we propose several novel optimization techniques which can prune the unpromising and redundant subsearch spaces effectively. Meanwhile, we devise an efficient closure checking scheme to facilitate the discovery of closed quasi-cliques only. Since large databases cannot be held in main memory, we also design an out-of-core solution with efficient index structures for mining coherent closed quasi-cliques from large dense graph databases. We call this solution Cocain*. A thorough performance study shows that Cocain* is very efficient and scalable for large dense graph databases.
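The remark that the downward-closure property no longer holds can be illustrated with a small sketch. The abstract does not define a quasi-clique, so the code below assumes the common degree-based definition: a vertex set of size n is a γ-quasi-clique if every vertex has at least ⌈γ·(n − 1)⌉ neighbours inside the induced subgraph. The function name `is_quasi_clique` and the 4-cycle example graph are hypothetical and are not taken from the article.

```python
from itertools import combinations
from math import ceil


def is_quasi_clique(vertices, edges, gamma):
    """Check the assumed degree-based gamma-quasi-clique condition.

    `edges` is a set of frozenset pairs; the test is applied to the
    subgraph induced by `vertices`.
    """
    vs = set(vertices)
    # Minimum number of neighbours each vertex needs inside the subgraph.
    need = ceil(gamma * (len(vs) - 1))
    for v in vs:
        degree = sum(1 for u in vs if u != v and frozenset((u, v)) in edges)
        if degree < need:
            return False
    return True


# Hypothetical example graph: the 4-cycle a-b-c-d-a.
edges = {frozenset(e) for e in [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]}

# The whole cycle satisfies the condition for gamma = 0.5 (each degree is 2).
whole_is_qc = is_quasi_clique(["a", "b", "c", "d"], edges, 0.5)

# Some 2-vertex subsets do not: the non-adjacent pairs have degree 0.
violations = [
    pair for pair in combinations("abcd", 2)
    if not is_quasi_clique(pair, edges, 0.5)
]
```

Under this assumed definition, the full cycle passes the test while the subsets {a, c} and {b, d} fail it, so a search cannot discard every superset of a vertex set just because that set fails. This is why Apriori-style pruning does not carry over directly.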

    • Published in

      ACM Transactions on Database Systems  Volume 32, Issue 2
      June 2007
      267 pages
      ISSN:0362-5915
      EISSN:1557-4644
      DOI:10.1145/1242524

      Copyright © 2007 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States