Clustering
Generalized Principal Component Analysis 豆瓣
Author: René Vidal / Yi Ma Springer 2016 - 4
This book provides a comprehensive introduction to the latest advances in the mathematical theory and computational tools for modeling high-dimensional data drawn from one or multiple low-dimensional subspaces (or manifolds) and potentially corrupted by noise, gross errors, or outliers. This challenging task requires the development of new algebraic, geometric, statistical, and computational methods for efficient and robust estimation and segmentation of one or multiple subspaces. The book also presents interesting real-world applications of these new methods in image processing, image and video segmentation, face recognition and clustering, and hybrid system identification.
This book is intended to serve as a textbook for graduate students and beginning researchers in data science, machine learning, computer vision, image and signal processing, and systems theory. It contains ample illustrations, examples, and exercises, and is made largely self-contained by three appendices that survey the basic concepts and principles from statistics, optimization, and algebraic geometry used in the book.
Graph Representation Learning 豆瓣
Author: William L. Hamilton Morgan & Claypool 2020 - 9
Graph-structured data is ubiquitous throughout the natural and social sciences, from telecommunication networks to quantum chemistry. Building relational inductive biases into deep learning architectures is crucial for creating systems that can learn, reason, and generalize from this kind of data. Recent years have seen a surge in research on graph representation learning, including techniques for deep graph embeddings, generalizations of convolutional neural networks to graph-structured data, and neural message-passing approaches inspired by belief propagation. These advances in graph representation learning have led to new state-of-the-art results in numerous domains, including chemical synthesis, 3D vision, recommender systems, question answering, and social network analysis.
This book provides a synthesis and overview of graph representation learning. It begins with a discussion of the goals of graph representation learning as well as key methodological foundations in graph theory and network analysis. Following this, the book introduces and reviews methods for learning node embeddings, including random-walk-based methods and applications to knowledge graphs. It then provides a technical synthesis and introduction to the highly successful graph neural network (GNN) formalism, which has become a dominant and fast-growing paradigm for deep learning with graph data. The book concludes with a synthesis of recent advancements in deep generative models for graphs--a nascent but quickly growing subset of graph representation learning.
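As a rough illustration of the message-passing formalism the book synthesizes, here is a hypothetical minimal mean-aggregation GNN layer in NumPy. The function name, the mean aggregator, and the tiny example graph are my own illustrative assumptions, not code or notation from the book:

```python
import numpy as np

def message_passing_layer(A, H, W):
    """One mean-aggregation message-passing step (illustrative, not the
    book's formulation): each node averages the features of its neighbors
    and itself, then applies a shared linear map and a ReLU."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)  # neighborhood sizes
    H_agg = (A_hat @ H) / deg               # mean over each neighborhood
    return np.maximum(H_agg @ W, 0.0)       # linear map + ReLU

# Tiny 3-node path graph 0 - 1 - 2 with 2-dimensional node features.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])
W = np.eye(2)  # identity weights keep the arithmetic easy to follow
H1 = message_passing_layer(A, H, W)
print(H1)
```

Stacking several such layers lets information propagate along paths in the graph, which is the core intuition behind the GNN formalism the book develops.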
Partitional Clustering via Nonsmooth Optimization 豆瓣
Author: Adil M. Bagirov / Napsu Karmitsa Springer
This book describes optimization models of clustering problems and clustering algorithms based on optimization techniques, including their implementation, evaluation, and applications. The book gives a comprehensive and detailed description of optimization approaches for solving clustering problems; the authors' emphasis on clustering algorithms is based on deterministic methods of optimization. The book also includes results on real-time clustering algorithms based on optimization techniques, addresses implementation issues of these clustering algorithms, and discusses new challenges arising from big data. The book is ideal for anyone teaching or learning clustering algorithms. It provides an accessible introduction to the field and it is well suited for practitioners already familiar with the basics of optimization.
Statistics for High-Dimensional Data 豆瓣
Author: Peter Bühlmann / Sara van de Geer Springer 2011 - 6
Modern statistics deals with large and complex data sets, and consequently with models containing a large number of parameters. This book presents a detailed account of recently developed approaches, including the Lasso and versions of it for various models, boosting methods, undirected graphical modeling, and procedures controlling false positive selections. A special characteristic of the book is that it contains comprehensive mathematical theory on high-dimensional statistics combined with methodology, algorithms and illustrations with real data examples. This in-depth approach highlights the methods’ great potential and practical applicability in a variety of settings. As such, it is a valuable resource for researchers, graduate students and experts in statistics, applied mathematics and computer science.
High-Dimensional Probability 豆瓣
Author: Roman Vershynin Cambridge University Press 2018 - 9
High-dimensional probability offers insight into the behavior of random vectors, random matrices, random subspaces, and the objects used to quantify uncertainty in high dimensions. Drawing on ideas from probability, analysis, and geometry, it lends itself to applications in mathematics, statistics, theoretical computer science, signal processing, optimization, and more. This book is the first to integrate the theory, key tools, and modern applications of high-dimensional probability. Concentration inequalities form the core, and the book covers both classical results, such as Hoeffding's and Chernoff's inequalities, and modern developments, such as the matrix Bernstein inequality. It then introduces the powerful methods based on stochastic processes, including tools such as Slepian's, Sudakov's, and Dudley's inequalities, as well as generic chaining and bounds based on VC dimension. A broad range of illustrations is embedded throughout, including classical and modern results for covariance estimation, clustering, networks, semidefinite programming, coding, dimension reduction, matrix completion, machine learning, compressed sensing, and sparse regression.
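The concentration phenomenon at the book's core can be checked numerically. The sketch below (mine, not the book's) compares the two-sided Hoeffding bound for the mean of n i.i.d. variables in [0, 1] against a Monte Carlo estimate of the actual tail; the function names and parameter choices are illustrative assumptions:

```python
import math
import random

def hoeffding_bound(n, t):
    """Two-sided Hoeffding bound for the mean of n i.i.d. samples in [0, 1]:
    P(|mean - mu| >= t) <= 2 * exp(-2 * n * t^2)."""
    return 2 * math.exp(-2 * n * t * t)

def empirical_tail(n, t, trials=10000, seed=0):
    """Monte Carlo estimate of the probability that the mean of n
    Uniform(0, 1) draws deviates from its expectation 1/2 by at least t."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean = sum(rng.random() for _ in range(n)) / n
        if abs(mean - 0.5) >= t:
            hits += 1
    return hits / trials

n, t = 100, 0.1
print(empirical_tail(n, t), "<=", hoeffding_bound(n, t))
```

The bound 2·exp(−2) ≈ 0.27 is loose here (the true tail for uniforms is far smaller), which is typical: Hoeffding's inequality trades sharpness for generality over all bounded distributions.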
Foundations of Data Science 豆瓣
Author: Avrim Blum / John Hopcroft Cambridge University Press 2020 - 1
This book provides an introduction to the mathematical and algorithmic foundations of data science, including machine learning, high-dimensional geometry, and analysis of large networks. Topics include the counterintuitive nature of data in high dimensions, important linear algebraic techniques such as singular value decomposition, the theory of random walks and Markov chains, the fundamentals of and important algorithms for machine learning, algorithms and analysis for clustering, probabilistic models for large networks, representation learning including topic modelling and non-negative matrix factorization, wavelets and compressed sensing. Important probabilistic techniques are developed including the law of large numbers, tail inequalities, analysis of random projections, generalization guarantees in machine learning, and moment methods for analysis of phase transitions in large random graphs. Additionally, important structural and complexity measures are discussed such as matrix norms and VC-dimension. This book is suitable for both undergraduate and graduate courses in the design and analysis of algorithms for data.
October 2, 2019 · Currently reading. The table of contents looks very practical! Microsoft page: https://www.microsoft.com/en-us/research/video/foundations-of-ds/ Manuscript (2018): https://www.cs.cornell.edu/jeh/book.pdf
Clustering Data_Mining
Optimization Algorithms on Matrix Manifolds 豆瓣
Author: P.-A. Absil / Robert Mahony / Rodolphe Sepulchre Princeton University Press 2007
Many problems in the sciences and engineering can be rephrased as optimization problems on matrix search spaces endowed with a so-called manifold structure. This book shows how to exploit the special structure of such problems to develop efficient numerical algorithms. It places careful emphasis on both the numerical formulation of the algorithm and its differential geometric abstraction - illustrating how good algorithms draw equally from the insights of differential geometry, optimization, and numerical analysis. Two more theoretical chapters provide readers with the background in differential geometry necessary to algorithmic development. In the other chapters, several well-known optimization methods such as steepest descent and conjugate gradients are generalized to abstract manifolds. The book provides a generic development of each of these methods, building upon the material of the geometric chapters. It then guides readers through the calculations that turn these geometrically formulated methods into concrete numerical algorithms. The state-of-the-art algorithms given as examples are competitive with the best existing algorithms for a selection of eigenspace problems in numerical linear algebra. "Optimization Algorithms on Matrix Manifolds" offers techniques with broad applications in linear algebra, signal processing, data mining, computer vision, and statistical analysis. It can serve as a graduate-level textbook and will be of interest to applied mathematicians, engineers, and computer scientists.
Distributed Optimization and Statistical Learning Via the Alternating Direction Method of Multipliers 豆瓣
Author: Stephen Boyd / Neal Parikh Now Publishers Inc 2011
https://web.stanford.edu/~boyd/papers/admm_distr_stats.html
Many problems of recent interest in statistics and machine learning can be posed in the framework of convex optimization. Due to the explosion in size and complexity of modern datasets, it is increasingly important to be able to solve problems with a very large number of features or training examples. As a result, both the decentralized collection or storage of these datasets as well as accompanying distributed solution methods are either necessary or at least highly desirable. In this review, we argue that the alternating direction method of multipliers is well suited to distributed convex optimization, and in particular to large-scale problems arising in statistics, machine learning, and related areas. The method was developed in the 1970s, with roots in the 1950s, and is equivalent or closely related to many other algorithms, such as dual decomposition, the method of multipliers, Douglas–Rachford splitting, Spingarn's method of partial inverses, Dykstra's alternating projections, Bregman iterative algorithms for ℓ1 problems, proximal methods, and others. After briefly surveying the theory and history of the algorithm, we discuss applications to a wide variety of statistical and machine learning problems of recent interest, including the lasso, sparse logistic regression, basis pursuit, covariance selection, support vector machines, and many others. We also discuss general distributed optimization, extensions to the nonconvex setting, and efficient implementation, including some details on distributed MPI and Hadoop Map Reduce implementations.
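The splitting scheme described above can be sketched for the lasso, one of the applications the review covers. Below is a hypothetical minimal single-machine ADMM for min ½‖Ax − b‖² + λ‖x‖₁; the function names, penalty parameter ρ, and toy data are my own choices, not code from the paper:

```python
import numpy as np

def soft_threshold(v, k):
    """Proximal operator of k * ||.||_1 (elementwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - k, 0.0)

def admm_lasso(A, b, lam, rho=1.0, iters=500):
    """ADMM for the lasso, splitting the objective as f(x) + g(z)
    with the consensus constraint x = z (illustrative sketch)."""
    n = A.shape[1]
    AtA, Atb = A.T @ A, A.T @ b
    L = np.linalg.inv(AtA + rho * np.eye(n))   # cache the x-update solve
    x = z = u = np.zeros(n)
    for _ in range(iters):
        x = L @ (Atb + rho * (z - u))          # quadratic x-update
        z = soft_threshold(x + u, lam / rho)   # shrinkage z-update
        u = u + x - z                          # scaled dual update on x = z
    return z

# Recover a sparse signal from noisy Gaussian measurements (toy sizes).
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
x_true = np.zeros(10)
x_true[[0, 3]] = [3.0, -2.0]
b = A @ x_true + 0.01 * rng.standard_normal(50)
x_hat = admm_lasso(A, b, lam=0.5)
print(np.round(x_hat, 2))
```

In the distributed setting the review emphasizes, the same x- and z-updates are carried out per data block, with the dual variable coordinating consensus across machines.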
High-Dimensional Statistics 豆瓣 Google Books
Author: Martin J. Wainwright Cambridge University Press 2019 - 1
Recent years have witnessed an explosion in the volume and variety of data collected in all scientific disciplines and industrial settings. Such massive data sets present a number of challenges to researchers in statistics and machine learning. This book provides a self-contained introduction to the area of high-dimensional statistics, aimed at the first-year graduate level. It includes chapters focused on core methodology and theory - including tail bounds, concentration inequalities, uniform laws and empirical processes, and random matrices - as well as chapters devoted to in-depth exploration of particular model classes - including sparse linear models, matrix models with rank constraints, graphical models, and various types of non-parametric models. With hundreds of worked examples and exercises, this text is intended both for courses and for self-study by graduate students and researchers in statistics, machine learning, and related fields who must understand, apply, and adapt modern statistical methods suited to large-scale data.
Computational Optimal Transport 豆瓣
Author: Gabriel Peyré / Marco Cuturi Now Publishers Inc 2019 - 5
https://optimaltransport.github.io/
https://www.nowpublishers.com/article/Details/MAL-073
The goal of Optimal Transport (OT) is to define geometric tools for comparing probability distributions; their use dates back to Monge's work in 1781.
Recent years have witnessed a new revolution in the spread of OT, thanks to the emergence of approximate solvers that can scale to sizes and dimensions relevant to data science. Thanks to this newfound scalability, OT is being increasingly used to unlock various problems in imaging sciences (such as color or texture processing), computer vision and graphics (for shape manipulation), and machine learning (for regression, classification, and density fitting). This monograph reviews OT with a bias toward numerical methods and their applications in data science, and sheds light on the theoretical properties of OT that make it particularly useful for some of these applications.
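The best-known of these scalable approximate solvers is the Sinkhorn iteration for entropy-regularized OT, which the monograph develops at length. A minimal NumPy sketch (the function name, regularization strength, and toy histograms are my own illustrative assumptions, not code from the book):

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, iters=1000):
    """Entropy-regularized OT via Sinkhorn iterations: returns a coupling
    P approximately minimizing <P, C> + eps * entropy penalty, with
    marginals a (rows) and b (columns)."""
    K = np.exp(-C / eps)                  # Gibbs kernel of the cost
    v = np.ones_like(b)
    for _ in range(iters):
        u = a / (K @ v)                   # rescale to match row marginals
        v = b / (K.T @ u)                 # rescale to match column marginals
    return u[:, None] * K * v[None, :]    # coupling diag(u) K diag(v)

# Transport between two histograms on a 1-D grid with squared-distance cost.
x = np.linspace(0, 1, 5)
C = (x[:, None] - x[None, :]) ** 2
a = np.array([0.5, 0.5, 0.0, 0.0, 0.0])  # mass on the left of the grid
b = np.array([0.0, 0.0, 0.0, 0.5, 0.5])  # mass on the right of the grid
P = sinkhorn(a, b, C)
print(np.round(P, 3))
print("transport cost:", round(float((P * C).sum()), 3))
```

Each iteration is just two matrix-vector products, which is what lets this scheme scale to the sizes mentioned above; smaller eps approaches the unregularized OT cost at the price of slower convergence.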
Computational Optimal Transport presents an overview of the main theoretical insights that support the practical effectiveness of OT before explaining how to turn these insights into fast computational schemes. The authors describe the foundational theory at two levels: a treatment generally accessible to all readers, and specially identified, more general mathematical expositions of optimal transport tailored for discrete measures, aimed at more advanced readers. Furthermore, several chapters deal with the interplay between continuous and discrete measures and thus target a more mathematically inclined audience.
This monograph will be a valuable reference for researchers and students wishing to get a thorough understanding of Computational Optimal Transport, a mathematical gem at the interface of probability, analysis and optimization.
Handbook of Cluster Analysis 豆瓣
Author: Christian Hennig / Marina Meila CRC Press 2016
Handbook of Cluster Analysis provides a comprehensive and unified account of the main research developments in cluster analysis. Written by active, distinguished researchers in this area, the book helps readers make informed choices of the most suitable clustering approach for their problem and make better use of existing cluster analysis tools.
The book is organized according to the traditional core approaches to cluster analysis, from the origins to recent developments. After an overview of approaches and a quick journey through the history of cluster analysis, the book focuses on the four major approaches to cluster analysis. These approaches include methods for optimizing an objective function that describes how well data is grouped around centroids, dissimilarity-based methods, mixture models and partitioning models, and clustering methods inspired by nonparametric density estimation. The book also describes additional approaches to cluster analysis, including constrained and semi-supervised clustering, and explores other relevant issues, such as evaluating the quality of a cluster.
This handbook is accessible to readers from various disciplines, reflecting the interdisciplinary nature of cluster analysis. For those already experienced with cluster analysis, the book offers a broad and structured overview. For newcomers to the field, it presents an introduction to key issues. For researchers who are temporarily or marginally involved with cluster analysis problems, the book gives enough algorithmic and practical details to facilitate working knowledge of specific clustering areas.
Deep Learning through Sparse and Low-Rank Modeling 豆瓣
Author: Zhangyang Wang / Yun Fu Academic Press 2019 - 4
https://www.elsevier.com/books/deep-learning-through-sparse-and-low-rank-modeling/wang/978-0-12-813659-1
Description:
Deep Learning through Sparse Representation and Low-Rank Modeling bridges classical sparse and low-rank models (those that emphasize problem-specific interpretability) with recent deep network models that have enabled larger learning capacity and better utilization of Big Data. It shows how the toolkit of deep learning is closely tied to sparse/low-rank methods and algorithms, providing a rich variety of theoretical and analytic tools to guide the design and interpretation of deep learning models. The development of the theory and models is supported by a wide variety of applications in computer vision, machine learning, signal processing, and data mining.
This book will be highly useful for researchers, graduate students and practitioners working in the fields of computer vision, machine learning, signal processing, optimization and statistics.
Key Features:
Combines classical sparse and low-rank models and algorithms with the latest advances in deep learning networks
Shows how the structure and algorithms of sparse and low-rank methods improve the performance and interpretability of deep learning models
Provides tactics on how to build and apply customized deep learning models for various applications
Readership:
Researchers and graduate students in computer vision, machine learning, signal processing, optimization, and statistics
April 23, 2019 · Currently reading. Uploaded: http://booksdescr.org/item/index.php?md5=167383D00A7B6D3B368DCEA48960BD30
Clustering Machine_Learning
Graph Embedding for Pattern Analysis 豆瓣
Author: Yun Fu / Yunqian Ma 2013
Graph Embedding for Pattern Analysis covers theory, methods, computation, and applications widely used in statistics, machine learning, image processing, and computer vision. This book presents the latest advances in graph embedding theories, such as nonlinear manifold graphs, linearization methods, graph-based subspace analysis, L1 graphs, hypergraphs, undirected graphs, and graphs in vector spaces. Real-world applications of these theories span dimensionality reduction, subspace learning, manifold learning, clustering, classification, and feature selection. A select group of experts contributes to the different chapters of this book, providing a comprehensive perspective on the field.
Clustering 豆瓣
Author: Boris Mirkin Chapman and Hall/CRC 2012 - 10
Often considered more of an art than a science, books on clustering have been dominated by learning through example, with techniques chosen almost through trial and error. Even the two most popular, and most closely related, clustering methods (K-Means for partitioning and Ward's method for hierarchical clustering) have lacked the theoretical underpinning required to establish a firm relationship between the two methods and relevant interpretation aids. Other approaches, such as spectral clustering or consensus clustering, are considered absolutely unrelated to each other or to the two methods mentioned above. Clustering: A Data Recovery Approach, Second Edition presents a unified modeling approach for the most popular clustering methods: K-Means and hierarchical techniques, especially divisive clustering. It significantly expands coverage of the mathematics of data recovery and includes a new chapter covering more recent popular network clustering approaches (spectral, modularity and uniform, additive, and consensus) treated within the same data recovery approach. Another added chapter covers cluster validation and interpretation, including recent developments in ontology-driven interpretation of clusters. Altogether, the insertions added a hundred pages to the book, even though fragments unrelated to the main topics were removed. Illustrated with a set of small real-world datasets and more than a hundred examples, the book is oriented toward students, practitioners, and theoreticians of cluster analysis. Covering topics beyond the scope of most texts, the author's explanations of data recovery methods, theory-based advice, pre- and post-processing issues, and clear, practical instructions for real-world data mining make this book ideally suited for teaching, self-study, and professional reference.
Data Mining 豆瓣
Author: Jiawei Han / Micheline Kamber Morgan Kaufmann 2011 - 7
The increasing volume of data in modern business and science calls for more complex and sophisticated tools. Although advances in data mining technology have made extensive data collection much easier, the field is still always evolving, and there is a constant need for new techniques and tools that can help us transform this data into useful information and knowledge. Since the previous edition's publication, great advances have been made in the field of data mining. Not only does the third edition of Data Mining: Concepts and Techniques continue the tradition of equipping you with an understanding and application of the theory and practice of discovering patterns hidden in large data sets, it also focuses on new, important topics in the field: data warehouses and data cube technology, stream mining, social network mining, and mining of spatial, multimedia, and other complex data. Each chapter is a stand-alone guide to a critical topic, presenting proven algorithms and sound implementations ready to be used directly or with strategic modification against live data. This is the resource you need if you want to apply today's most powerful data mining techniques to meet real business challenges.
* Presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects.
* Addresses advanced topics such as mining object-relational databases, spatial databases, multimedia databases, time-series databases, text databases, the World Wide Web, and applications in several fields.
* Provides a comprehensive, practical look at the concepts and techniques you need to get the most out of your data.
November 4, 2018 · Currently reading. [Cluster Analysis in Data Mining, Chapters 2, 10, 11, 13] https://www.coursera.org/learn/cluster-analysis/supplement/5zaoa/syllabus
Clustering
Data Clustering 豆瓣
Author: Charu C. Aggarwal / Chandan K. Reddy Chapman and Hall/CRC 2013 - 8
Research on the problem of clustering tends to be fragmented across the pattern recognition, database, data mining, and machine learning communities. Addressing this problem in a unified way, Data Clustering: Algorithms and Applications provides complete coverage of the entire area of clustering, from basic methods to more refined and complex data clustering approaches. It pays special attention to recent issues in graphs, social networks, and other domains. The book focuses on three primary aspects of data clustering:
* Methods, describing key techniques commonly used for clustering, such as feature selection, agglomerative clustering, partitional clustering, density-based clustering, probabilistic clustering, grid-based clustering, spectral clustering, and nonnegative matrix factorization.
* Domains, covering methods used for different domains of data, such as categorical data, text data, multimedia data, graph data, biological data, stream data, uncertain data, time series clustering, high-dimensional clustering, and big data.
* Variations and Insights, discussing important variations of the clustering process, such as semisupervised clustering, interactive clustering, multiview clustering, cluster ensembles, and cluster validation.
In this book, top researchers from around the world explore the characteristics of clustering problems in a variety of application areas. They also explain how to glean detailed insight from the clustering process, including how to verify the quality of the underlying clusters, through supervision, human intervention, or the automated generation of alternative clusters.
数据聚类 (Data Clustering, hardcover edition) 豆瓣
Author: 张宪超 2018 - 4
In early 2016, Google's Go program AlphaGo defeated the human world champion Lee Sedol 4:1, drawing worldwide attention and marking a new stage in the development of artificial intelligence. In recent years AI has advanced rapidly, achieving breakthrough progress in fields such as image recognition and speech recognition. AI research has attracted intense attention from academia and industry worldwide and entered a new boom period. All signs suggest that an era of all-around intelligence is not far off. Nearly all of this is owed to the discovery and development of a new neural-network technique, deep learning (interestingly, every boom in AI has come from advances in neural networks, a testament to their vitality). The concept of deep learning was proposed by Hinton et al. in 2006; in recent years it has gradually become the mainstream technique in machine learning, clearly outperforming existing methods in most application areas.
Machine learning comprises supervised and unsupervised learning. So far, deep learning has essentially delivered progress only in supervised learning, but supervised learning alone cannot achieve complete artificial intelligence. As an intelligent system, supervised learning seems sufficiently "capable" but not sufficiently "intelligent." Its capability shows in its power to mine knowledge from big data, something even the human brain cannot do; the brain is not a big-data processor, and human knowledge in any field is limited (each of us knows only a few thousand characters or words). Its lack of intelligence shows in its need for large numbers of manually labeled training samples. The brain does not learn from massive labeled samples: humans acquire knowledge with little or no supervision, continually learn, and keep reinforcing their knowledge across domains, displaying astonishing creativity on the basis of limited knowledge. Brain-like intelligent systems need capabilities such as unsupervised learning, few-shot learning, reinforcement learning, and transfer learning. The development of artificial intelligence therefore still has a long way to go.
This book discusses clustering. Clustering is the main subject of unsupervised learning; in much of the literature the two terms are even used interchangeably. Clustering has long been a core topic in machine learning, data mining, and pattern recognition, and has received even greater attention in recent years. In 2015, Li Deyi, president of the Chinese Association for Artificial Intelligence, remarked at the New Generation Information Technology Industry Development Summit: "For cognitive science to make a breakthrough, we must first make a breakthrough in big-data clustering; clustering is the first step in mining the value of big-data assets." In the same year, the deep-learning pioneers LeCun, Bengio, and Hinton wrote in their Nature review: "Human and animal learning is largely unsupervised: we discover the structure of the world by observing it, not by being told the name of every object."
So what is clustering? The I Ching (Xici I) says: "Things of a kind cluster together; creatures divide into groups, and thus fortune and misfortune arise." Natural things are organized according to certain regularities; by recognizing the structural features of these organizations, people gain knowledge and make decisions. Take biology as an example (it is living things that make our world lively): based on the degree of similarity among organisms, in morphological structure, physiological function, and so on, people divide them into ranks such as species and genera, and scientifically describe the characteristics of each group in order to clarify the kinship and evolutionary relations among groups. Many of us were surprised in childhood biology class to learn that whales are mammals rather than fish, or that cats and tigers belong to the same family.
Unlike classification (the main task of supervised learning), clustering groups data without labeled samples, thereby discovering the data's natural structure. Clustering plays an important role in data analysis and is typically used in three ways.
(1) Discovering the latent structure of data: gaining deep insight into the data, generating hypotheses, detecting anomalies, and identifying salient features.
(2) Naturally grouping the data: determining the degree of similarity between groups (systematic relationships).
(3) Compressing the data: using cluster prototypes as a way to organize and summarize the data.
These functions let clustering serve both as a preprocessing step and as a standalone data-analysis tool.
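All three uses above (structure discovery, natural grouping, and compression via prototypes) are visible in the simplest clustering algorithm, K-means via Lloyd's iteration. The sketch below is a minimal pure-Python illustration of mine, not code from the book:

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Lloyd's algorithm on 2-D points: alternate assigning each point to
    its nearest centroid and moving each centroid to the mean of its
    cluster. The centroids are the 'cluster prototypes' that summarize
    (compress) the data."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # initialize from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                   # assignment step
            j = min(range(k),
                    key=lambda m: (p[0] - centroids[m][0]) ** 2
                                + (p[1] - centroids[m][1]) ** 2)
            clusters[j].append(p)
        for j, c in enumerate(clusters):   # update step
            if c:
                centroids[j] = (sum(p[0] for p in c) / len(c),
                                sum(p[1] for p in c) / len(c))
    return centroids, clusters

# Two well-separated 2-D blobs; K-means recovers the natural grouping
# without any labels.
pts = [(0.0, 0.0), (0.1, 0.2), (-0.2, 0.1),
       (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
cents, cls = kmeans(pts, 2)
print(sorted(round(c[0], 2) for c in cents))
```

The two returned centroids land near the blob means, so each data point can be replaced by its prototype, which is exactly the compression use (3) above.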
Clustering is a quintessentially interdisciplinary subject with wide applications in many fields, and its study goes back more than sixty years. Taxonomists, sociologists, philosophers, biologists, statisticians, mathematicians, engineers, computer scientists, medical researchers, and many others who collect and process real data have all contributed to clustering methods. In different fields, clustering may also be known as Q-analysis, typology, clumping, or classification. The concept of clustering first appeared in a 1954 paper on anthropological data, and it has remained an important research topic in related fields ever since.
July 15, 2018 · Currently reading. Emmmm... the entry Douban imported automatically was too unprofessional, so I filled it in myself. Also, a list price of 188 yuan is really steep; my advisor says he will ask 张宪超 for a copy at the workshop in Jinan in August, and I'll go along.
Clustering Machine_Learning
机器学习：从公理到算法 (Machine Learning: From Axioms to Algorithms) / China Computer Federation Academic Monograph Series 豆瓣
Author: 于剑 2017 - 7
Machine Learning: From Axioms to Algorithms (China Computer Federation Academic Monograph Series) studies learning algorithms on an axiomatic basis. Its 17 chapters form two parts. The first part presents the axioms of machine learning together with some theoretical deductions; it comprises Chapters 1, 2, 6, and 8, which discuss the learning axioms and the corresponding clustering and classification theory. The second part focuses on deriving classical learning algorithms from the axioms, covering single-class, multi-class, and multi-source problems. Chapters 3-5 treat single-class problems: density estimation, regression, and dimensionality reduction for single-class data. Chapters 7 and 9-16 treat multi-class problems, including classical algorithms such as clustering, neural networks, K-nearest neighbors, support vector machines, logistic regression, Bayesian classification, decision trees, and multi-class dimensionality reduction and expansion. Finally, Chapter 17 studies learning from multi-source data.
The book can serve as a graduate textbook for computer science, automation, mathematics, statistics, artificial intelligence, and related programs, and as a reference for machine learning enthusiasts.
Unsupervised Learning Algorithms 豆瓣
Springer International Publishing AG 2016 - 5
This book summarizes the state of the art in unsupervised learning. The contributors discuss how, with the proliferation of massive amounts of unlabeled data, unsupervised learning algorithms, which can automatically discover interesting and useful patterns in such data, have gained popularity among researchers and practitioners. The authors outline how these algorithms have found numerous applications, including pattern recognition, market basket analysis, web mining, social network analysis, information retrieval, recommender systems, market research, intrusion detection, and fraud detection. They explain how the difficulty of developing theoretically sound approaches that are amenable to objective evaluation has resulted in the proposal of numerous unsupervised learning algorithms over the past half-century. The intended audience includes researchers and practitioners who are increasingly using unsupervised learning algorithms to analyze their data. Topics of interest include anomaly detection, clustering, feature extraction, and applications of unsupervised learning. Each chapter is contributed by a leading expert in the field.