RESOURCE – Data Mining Lab (数据挖掘实验室)

“Everything that informs us of something useful that we didn’t already know is a potential signal. If it matters and deserves a response, its potential is actualized.”

– Stephen Few

Data Collections

1. Computer Vision Data

CIFAR-10 dataset: The CIFAR-10 dataset consists of 60,000 32×32 colour images in 10 classes, with 6,000 images per class. The 10 classes are: airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck. (MD5 Checksum: c99cafc152244af753f735de768cd75f. File Size: 29.55MB.)
Fashion MNIST: Fashion-MNIST is a dataset of Zalando’s article images—consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28×28 grayscale image, associated with a label from 10 classes.
ImageNet: ImageNet is an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images. The project has been instrumental in advancing computer vision and deep learning research. The data is available for free to researchers for non-commercial use.
IMDB-WIKI faces: Built from the list of the 100,000 most popular actors on the IMDb website, whose profiles were crawled (automatically) for date of birth, name, gender and all images related to that person. It contains 460,723 face images of 20,284 celebrities from IMDb and 62,328 from Wikipedia, 523,051 in total.
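
Some entries on this page list an MD5 checksum for the download. A minimal sketch of how a downloaded file could be checked against such a value is given below; the function names and the local file name in the usage note are hypothetical, and only Python's standard `hashlib` module is used.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Compare a file's MD5 digest with a published value (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `matches_checksum("cifar10_download.bin", "c99cafc152244af753f735de768cd75f")` would return True only if the local file (the name here is an assumption) matches the checksum listed for CIFAR-10.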

2. Natural Language Processing Data

BERT pre-training model: Bidirectional Encoder Representations from Transformers (BERT) is a popular pre-training NLP model introduced in 2018 by researchers at Google. The pre-trained model can be fine-tuned on small-data NLP tasks like question answering and sentiment analysis, resulting in substantial accuracy improvements compared to training on these datasets from scratch. (MD5 Checksum: 950f63930cd0e17b57057d810eb8e4db. File Size: 417.71MB.)
Yelp open dataset: The Yelp dataset contains data about businesses, reviews, and user data for use in personal, educational, and academic purposes. Available in both JSON and SQL files.
Twitter100k: Twitter100k dataset is characterized by two aspects: 1) it has 100,000 image-text pairs randomly crawled from Twitter and thus has no constraint in the image categories; 2) text in Twitter100k is written in informal language by the users.
MultiNLI: The Multi-Genre Natural Language Inference (MultiNLI) corpus is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information. The corpus is modeled on the SNLI corpus, but differs in that it covers a range of genres of spoken and written text, and supports a distinctive cross-genre generalization evaluation.

3. Network Data

Stanford Large Network Dataset Collection: A substantial collection of data sets describing very large networks, including social networks, communications networks, and transportation networks.
UCI: The UC Irvine Machine Learning Repository, which currently maintains 349 data sets as a service to the machine learning community.
FODAVA: The FODAVA initiative focuses on the creation of mathematical and computational sciences required to transform all types of digital data into ways that make visual understanding possible.
UCINet data sets: Social network data sets released with the UCINet software by Steve Borgatti et al.
Pajek data sets: Example data sets released with the Pajek software by Vladimir Batagelj and Andrej Mrvar.
Duncan Watts’ data sets: Data compiled by Prof. Duncan Watts and collaborators at Columbia University, including data on the structure of the Western States Power Grid and the neural network of the worm C. elegans.
Laszlo Barabasi’s data sets: Data compiled by Prof. Albert-Laszlo Barabasi and collaborators at the University of Notre Dame, including web data and biochemical networks.
Alex Arenas’s data sets: Data compiled by Prof. Alexandre Arenas and collaborators at Universidad Rovira i Virgili, including metabolic network data and the network from their study of the collaboration patterns of jazz musicians.
kdnuggets’s data sets: KDnuggets covers data mining, analytics, big data, and data science. The site includes data mining software, tutorials, courses and other resources, and in particular a variety of data sets. Organization: KDnuggets.
Dynamic network data: A collection of datasets obtained through the SocioPatterns sensing platform. The data sets are described as dynamic networks of communication in schools, and are provided by the ISI Foundation and other institutions and companies.
Erik Demaine’s data sets: Data sets provided by Erik Demaine (MIT) and MohammadTaghi Hajiaghayi (UMD), all describing dynamic networks, such as paper citation data, brain connectome data, web graph data, DBLP data, social network data, Google social network data and Twitter data.
CAIDA: CAIDA collects several different types of data at geographically and topologically diverse locations, and makes this data available to the research community to the extent possible while preserving the privacy of individuals and organizations who donate data or network access.
Attributed Graphs: Data compiled by Prof. Adriana Prado and collaborators, including static and dynamic attributed graphs.
Explore and Compare Network Statistics Interactively: A graph and network repository containing hundreds of real-world networks and benchmark datasets.
Microsoft Academic Graph: A large and heterogeneous graph containing scientific publication records and citation relationships between publications, as well as authors, institutions, journals and conferences.
GDM@FUDAN: GDM@FUDAN focuses on studying and developing effective and efficient solutions to manage and mine graph data, aiming at understanding real graphs and supporting real applications built upon large real graphs. Recently, the group has been especially interested in knowledge graphs and their applications.
KONECT: The Koblenz Network Collection is a project to collect large network datasets of all types in order to perform research in network science and related fields, collected by the Institute of Web Science and Technologies at the University of Koblenz–Landau. KONECT contains over a hundred network datasets of various types, including directed, undirected, bipartite, weighted, unweighted, signed and rating networks.
Pew Research Center: Pew Research Center makes its data available to the public for secondary analysis after a period of time.
ASU Network Data: The Social Computing Data Repository hosts data from a collection of many different social media sites, most of which have blogging capacity. Some of the prominent social media sites included in this repository are BlogCatalog, Twitter, MyBlogLog, Digg, StumbleUpon, del.icio.us, MySpace, LiveJournal, The Unofficial Apple Weblog (TUAW), Reddit, etc. The repository contains various facets of blog data, including blog site metadata (user-defined tags, predefined categories, blog site description); blog post level metadata (user-defined tags, date and time of posting); blog posts; blog post mood (defined as the blogger’s emotions when (s)he wrote the blog post); blogger name; blog post comments; and blogger social network.
Datasets for Social Analysis: Contains over 300 datasets that are made available for a course at UCI.
Duke Network Dataset: Datasets collected by the Duke Network Analysis Center.
Mark Newman Dataset: Contains links to some network data sets he has compiled over the years. All of these are free for scientific use to the best of his knowledge, meaning that the original authors have already made the data freely available, or that he has consulted the authors and received permission to post the data, or that the data are his own.

4. Data Streams

Forest Covertype: Contains the forest cover type for 30 x 30 meter cells obtained from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. It contains 581,012 instances and 54 attributes, and it has been used in several papers on data stream classification. From: UCI Machine Learning Repository. Normalized Dataset.
Poker-Hand: Consists of 1,000,000 instances and 11 attributes. Each record of the Poker-Hand dataset is an example of a hand consisting of five playing cards drawn from a standard deck of 52. Each card is described using two attributes (suit and rank), for a total of 10 predictive attributes. There is one class attribute that describes the “Poker Hand”. From: UCI Machine Learning Repository. Normalized Dataset.
Electricity: Another widely used dataset, described by M. Harries and analysed by Gama. The data was collected from the Australian New South Wales Electricity Market. In this market, prices are not fixed; they are affected by the demand and supply of the market and are set every five minutes. The ELEC dataset contains 45,312 instances. The class label identifies the change of the price relative to a moving average of the last 24 hours. Original Dataset. Normalized Dataset.
The normalized datasets above are versions in which the numerical values lie between 0 and 1. In the Poker-Hand dataset, the cards are not ordered, i.e. a hand can be represented by any permutation, which makes it very hard for propositional learners, especially linear ones. The normalized version is modified so that cards are sorted by rank and suit and duplicates have been removed.
Airlines Dataset: Inspired by the regression dataset from Elena Ikonomovska. The task is to predict whether a given flight will be delayed, given the information of the scheduled departure. Dataset
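
The preprocessing described for the normalized data stream datasets (numerical values scaled to between 0 and 1; Poker-Hand cards sorted and duplicate records removed) can be illustrated with a small sketch. This is not the code used to produce the published files: the function names are hypothetical, min-max scaling is assumed as the normalization, and the sort order (rank first, then suit) is one reading of "sorted by rank and suit".

```python
def min_max_normalize(column):
    """Scale a list of numbers linearly into [0, 1] (assumed min-max scaling)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant column is mapped to 0.0 to avoid dividing by zero
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def canonical_hand(hand):
    """Return a hand of (suit, rank) pairs in one fixed order.

    Sorting by rank, then suit (an assumed order), makes every permutation
    of the same five cards map to the same tuple.
    """
    return tuple(sorted(hand, key=lambda card: (card[1], card[0])))


def deduplicate_hands(records):
    """Canonicalize (hand, label) records and keep the first copy of each."""
    seen = set()
    unique = []
    for hand, label in records:
        key = (canonical_hand(hand), label)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```

With this representation, two records that hold the same five cards in a different order collapse into a single record, which is the property the modified Poker-Hand dataset relies on.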

5. Spatio-temporal Data

T-Drive Taxi Trajectories: A sample of trajectories from Microsoft Research T-Drive project, generated by over 10,000 taxicabs in a week of 2008 in Beijing.
Movebank Animal Tracking Data: Movebank is a free, online database of animal tracking data, helping animal tracking researchers to manage, share, protect, analyze, and archive their data.
Hurricane Trajectories: This dataset is provided by the National Hurricane Service (NHS), containing 1,740 trajectories of Atlantic hurricanes (formally defined as tropical cyclones) from 1851 to 2012. NHS also provides annotations of typical hurricane tracks for each month throughout the annual hurricane season, which spans from June to November. The dataset can be used to test trajectory clustering and uncertainty.
The Greek Truck Trajectories: This dataset contains 1,100 trajectories from 50 different trucks delivering concrete around Athens, Greece. It was used to evaluate a trajectory pattern mining task in Giannotti et al.

6. Neuroimaging Data

GraphVIS: Hundreds of data sets covering many kinds of networks, such as biological networks, brain networks, collaboration networks, cheminformatics and so on. It also provides tools for data analysis and visualization.
HPMS: HPMS Public Release Shapefiles spatially represent limited data from the Highway Performance Monitoring System (HPMS) for 2011-2013 data years.

7. Bioinformatical (gene expression) Datasets

Bioinformatical datasets for data mining: Collected by School of Informatics at The University of Edinburgh, contains a list of datasets that were selected for the projects for Data Mining and Exploration.
Biclustering of Expression Data: This is an online repository of high-dimensional biomedical data sets, including gene expression data, protein profiling data and genomic sequence data, that are related to classification and were recently published in Science, Nature and other prestigious journals. These biomedical applications are also challenging problems for the machine learning and data mining community. As the file formats of the original raw data differ from the common ones used in most machine learning software, the authors have transformed these data sets into the standard .data and .names format and stored them in this repository. It also provides the data in the .arff format used by Weka.

8. Text datasets

Popular Text Datasets: The page provides a series of popular text datasets in Matlab format

9. Hybrid Datasets

Data repositories: This is a list of repositories and databases for open data. The page contains the datasets related to Archaeology, Astronomy, Biology, Chemistry, Computer science, Energy, Environmental sciences, Geology, Geosciences and geospatial data, Linguistics, Marine sciences, Medicine, Multidisciplinary repositories, Physics, Social sciences
A collection of MATLAB data sets used by PMTK: This is a collection of Matlab/Octave functions, written by Matt Dunham, Kevin Murphy and various other people. The toolkit is primarily designed to accompany Kevin Murphy’s textbook Machine learning: a probabilistic perspective, but can also be used independently of this book. The goal is to provide a unified conceptual and software framework encompassing machine learning, graphical models, and Bayesian statistics.
Indiana University data sets: A set of very large data sets, including some non-network data sets, compiled by the School of Library and Information Science at Indiana University. Network data sets include the NBER data set of US patent citations and a data set of links between articles in the on-line encyclopedia Wikipedia.
Awesome Public Datasets: Datasets cited by blogs, covering many categories: Agriculture, Biology, Climate/Weather, Complex Networks, Computer Networks, Contextual Data, Data Challenges, Economics, Education, Finance, Geology, GIS/Environment, Government, Healthcare, Image Processing, Machine Learning, Museums, Natural Language, Physics, Psychology/Cognition, Public Domains, Search Engines, Social Networks, Social Sciences, Software, Sports, Time Series, Transportation, Complementary Collections.
KDD Summary of data sets: A summary of data sets by data type, including Discrete Sequence Data, Image Data, Multivariate Data, Relational Data, Spatio-Temporal Data, Text, Time Series and Web Data. The data sets come from The UCI KDD Archive; Information and Computer Science; University of California, Irvine.
KONECT: KONECT contains over a hundred network datasets of various types, including directed, undirected, bipartite, weighted, unweighted, signed and rating networks. The networks of KONECT are collected from many diverse areas such as social networks, hyperlink networks, authorship networks, physical networks, interaction networks and communication networks.

10. Remote Sensing Data

11. Recommendation

GroupLens: GroupLens is a research lab in the Department of Computer Science and Engineering at the University of Minnesota, Twin Cities specializing in recommender systems, online communities, mobile and ubiquitous technologies, digital libraries, and local geographic information systems.
Yelp Dataset Challenge: A dataset about local businesses, with their reviews and users.

12. Compression Benchmark Datasets

VSCode Data: A memory image captured during the execution of the Visual Studio Code (VSCode) application. It includes code segments, libraries, runtime data, and stack/heap memory, serving as a representative benchmark for testing compression techniques on real-world, high-entropy memory content.
Win11: A system image representing a full image (with all zeroed and duplicated 8 KB blocks removed) of a machine running the Windows 11 operating system with some common applications installed. This dataset includes system files, executables, libraries (DLLs), configuration data, temporary files, and possibly user data.
Ubuntu: A system image representing a full image (with all zeroed and duplicated 8 KB blocks removed) of a machine running the Ubuntu 20.04 operating system with some common applications installed. This dataset includes the OS kernel, system libraries, user-space applications, configuration files, and potentially user data and logs.
MacOS: A system image representing a full image (with all zeroed and duplicated 8 KB blocks removed) of a machine running the macOS Monterey operating system with some common applications installed. This dataset includes a wide range of components such as system frameworks, kernel extensions, user applications, configuration files, runtime caches, and user data.
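
The system images above are described as having all zeroed and duplicated 8 KB blocks removed. The sketch below shows one way such a filter could work; it is an illustration only, not the procedure used to build these datasets. It assumes fixed-offset blocks of 8192 bytes, detects duplicates with SHA-256 fingerprints, and uses a hypothetical function name.

```python
import hashlib

BLOCK_SIZE = 8 * 1024  # assumption: "8 KB" means 8192 bytes


def unique_nonzero_blocks(data, block_size=BLOCK_SIZE):
    """Drop all-zero blocks and repeated blocks from a bytes object.

    The input is cut at fixed offsets; the first copy of each distinct
    block is kept in its original order.
    """
    zero_block = bytes(block_size)
    seen = set()
    kept = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        # The final block may be shorter, so compare against a zero run of equal length
        if block == zero_block[:len(block)]:
            continue
        fingerprint = hashlib.sha256(block).digest()
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        kept.append(block)
    return b"".join(kept)
```

Removing these blocks first means a compression benchmark measures how well an algorithm handles distinct, non-trivial content rather than how well it collapses empty or repeated regions.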

Public Data Mining Packages

MTBA: A new Matlab toolbox, developed by the Intelligent Informatics and Automation Laboratory, designed to perform a variety of biclustering algorithms under a common user interface. Although some implementations are available for the proposed biclustering algorithms, each program comes with a different user interface and uses different input-output formats. MTBA tries to fill this gap by providing multiple functionalities for data handling, preprocessing, biclustering and visualization.
Code for some state-of-the-art algorithms, developed by Professor Mohammed J. Zaki’s research group.
WEKA: Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.
ELKI: ELKI is an open source (AGPLv3) data mining software written in Java. The focus of ELKI is research in algorithms, with an emphasis on unsupervised methods in cluster analysis and outlier detection.
Scikit: Simple and efficient tools for data mining and data analysis. Accessible to everybody, and reusable in various contexts. Built on NumPy, SciPy, and matplotlib. Open source, commercially usable – BSD license
LibSVM: LIBSVM is an integrated software for support vector classification (C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR) and distribution estimation (one-class SVM). It supports multi-class classification.
MOA: MOA is the most popular open source framework for data stream mining, with a very active growing community (blog). It includes a collection of machine learning algorithms (classification, regression, clustering, outlier detection, concept drift detection and recommender systems) and tools for evaluation. Related to the WEKA project, MOA is also written in Java, while scaling to more demanding problems.
LibRec: LibRec is a GPL-licensed Java library for recommender systems (Java version 1.7 or higher required). It implements a suite of state-of-the-art recommendation algorithms. It consists of three major components: interfaces, data structures and recommendation algorithms.
Libfm: Factorization machines (FM) are a generic approach that can mimic most factorization models through feature engineering. In this way, factorization machines combine the generality of feature engineering with the superiority of factorization models in estimating interactions between categorical variables of large domain. libFM is a software implementation of factorization machines that features stochastic gradient descent (SGD) and alternating least squares (ALS) optimization as well as Bayesian inference using Markov Chain Monte Carlo (MCMC).
Svdfeature: SVDFeature is a toolkit designed to efficiently solve large-scale collaborative filtering problems with auxiliary information. Unlike traditional engineering approaches to collaborative filtering, which require writing a specific solver for each algorithm, SVDFeature solves a general form of CF problems, so new models can be developed just by defining new features. The feature-based setting allows many kinds of information to be included in the model. Using the toolkit, information such as temporal dynamics, neighborhood relationships, and hierarchical information can easily be incorporated into the model.
MyMediaLite Recommender System Library: MyMediaLite is a recommender system library for the Common Language Runtime (CLR, often called .NET).
PyMc3: PyMC3 is a Python module for Bayesian statistical modeling and model fitting which focuses on advanced Markov chain Monte Carlo fitting algorithms. Its flexibility and extensibility make it applicable to a large suite of problems.
Plotly: A magnificent visualization module compatible with a variety of tools (Python, R, Matlab, Excel, JS).
openML: Open Media Library (OpenML) is a free, cross-platform programming environment designed by the Khronos Group for capturing, transporting, processing, displaying, and synchronizing digital media (2D and 3D graphics, audio and video processing, I/O, and networking).
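
As a small illustration of the factorization machines mentioned in the Libfm entry above, the sketch below computes the standard second-order FM score (a bias, a linear term, and pairwise interactions weighted by inner products of latent factors). It is not libFM code: the function name and the plain-list data layout are assumptions, and only prediction is shown, not SGD, ALS or MCMC training.

```python
def fm_predict(x, w0, w, V):
    """Second-order factorization machine score for one feature vector.

    x  : list of n feature values
    w0 : global bias
    w  : list of n linear weights
    V  : n rows of k latent factors, one row per feature
    """
    n = len(x)
    linear = sum(wi * xi for wi, xi in zip(w, x))
    k = len(V[0]) if V else 0
    pairwise = 0.0
    for f in range(k):
        # Sum over i<j of V[i][f]*V[j][f]*x[i]*x[j], computed in O(n) via
        # 0.5 * ((sum of terms)^2 - sum of squared terms)
        s = sum(V[i][f] * x[i] for i in range(n))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise
```

If `x` is the concatenation of a one-hot user vector and a one-hot item vector, the pairwise term reduces to the inner product of that user's and that item's latent factors, which is how an FM can mimic a matrix factorization model purely through the choice of features.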

Software

Data Generator
PREA: PREA (Personalized Recommendation Algorithms Toolkit) is open source Java software that provides easy comparison of collaborative filtering algorithms.

Algorithms & Codes

Infomap: Code for the Infomap algorithm, provided by Associate Professor Martin Rosvall and based on the paper "Multilevel compression of random walks on networks reveals hierarchical organization in large integrated systems". It can be used to cluster undirected, directed and weighted graphs.
MCL: A graph clustering algorithm proposed by Stijn van Dongen.
METIS: A graph clustering algorithm provided by George Karypis, a Professor at the Department of Computer Science & Engineering at the University of Minnesota in the Twin Cities of Minneapolis and Saint Paul and a member of the Digital Technology Center (DTC) at the University of Minnesota.
Ncut: A data clustering method.
Matlab Tools for Network Analysis: A Matlab toolbox for network mining, containing basic graph representations and algorithms and some community detection methods such as modularity optimization.
Semi-supervised Learning: This web page collects many semi-supervised learning algorithms, provided to facilitate machine learning research, covering semi-supervised classification and clustering.

Video

PGM: The lecturer in these videos, Daphne Koller of Stanford, is also an author of the textbook on probabilistic graphical models, and the course introduces most of the core material on the subject.
Genomes, Networks, Evolution.
A list of public classes on machine learning: Franck Dernoncourt, a PhD student in AI at MIT, has compiled on Quora a list of links to public machine learning classes, covering topics such as Gaussian processes, SVMs, deep learning and statistical learning. It is a pretty good list.
Machine Learning: Stanford University’s Machine Learning course, given by Andrew Ng. The course provides a broad introduction to machine learning, data mining, and statistical pattern recognition.
Recommender system: In this class we will study the most important of those recommender systems including how they work, how to use them, how to evaluate them, and their strengths and weaknesses in practice. The algorithms we will study include content-based filtering, user-user collaborative filtering, item-item collaborative filtering, dimensionality reduction, and interactive critique-based recommenders.
Convex Optimization: Stanford open online course on convex optimization, whose lecturers are the authors of the well-known book ‘Convex Optimization’.
Scientific Writing: Stanford open online course on scientific writing, which trains students’ ability to write academic papers.
Social and Information Network Analysis: The course will cover recent research on the structure and analysis of such large social and information networks and on models and algorithms that abstract their basic properties.

Books & Research Summary

Mining Heterogeneous Information Networks: Principles and Methodologies
Recommender Systems An Introduction: This book offers an overview of approaches to developing state-of-the-art recommender systems. The authors present current algorithmic approaches for generating personalized buying proposals, such as collaborative and content-based filtering, as well as more interactive and knowledge-based approaches. They also discuss how to measure the effectiveness of recommender systems and illustrate the methods with practical case studies.