Research Guides: ENGG/ENGM 182: Data Analytics: Open Data Sources

Archives and repositories

100+ Interesting Data Sets for Statistics
Academic Torrents
Service designed to facilitate storage of all the data used in research, including datasets as well as publications.
AIRBNB data
Data behind the Inside Airbnb site is sourced from publicly available information from the Airbnb site. The data has been analyzed, cleansed and aggregated where appropriate to facilitate public discussion.
American Economic Association U.S. macroeconomic data
AudioSet (Google)
Ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos.
Awesome Public Datasets
High-quality, topic-centric public data sources, collected and tidied from blogs, answers, and user responses.
AWS Registry of Open Data
Sentinel-2, Landsat 8, IRS 990 Filings, Global Database of Events, Language and Tone (GDELT), New York City Taxi and Limousine Commission (TLC) Trip Record Data, 1000 Genomes ...
Berkeley Deep Drive
Explore 100,000 HD video sequences of over 1,100-hour driving experience across many different times in the day, weather conditions, and driving scenarios. Video sequences also include GPS locations, IMU data, and timestamps.

https://arxiv.org/abs/1805.04687
CaseLaw Access Project
Caselaw Access Project (“CAP”) expands public access to U.S. law. Our goal is to make all published U.S. court decisions freely available to the public online, in a consistent format, digitized from the collection of the Harvard Law Library.
Citizens Police Data Project
Collects and publishes information about police misconduct in Chicago
Data For Everyone
Favorite open datasets created on the Figure Eight platform. They’re free for any and everyone to download.
DataHub
DataSearch (from Elsevier)
Search for research data across domains and types, from many domain-specific, cross-domain and institutional data repositories.
Data USA
Data USA puts public US Government data in your hands. Instead of searching through multiple data sources that are often incomplete and difficult to access, you can simply point to Data USA to answer your questions. Data USA provides an open, easy-to-use platform that turns data into knowledge. It allows millions of people to conduct their own analyses and create their own stories about America – its people, places, industries, skill sets and educational institutions.
DataZoa
Access to many public data series; mix data from hundreds of authoritative data sites; add the data to your account once, stay current forever.
EarthWorks
GIS data and maps
ENERGYDATA.INFO
Open data platform providing access to datasets and data analytics that are relevant to the energy sector.
Enigma Public Data
European Data Portal
The European Data Portal harvests the metadata of Public Sector Information available on public data portals across European countries. Information regarding the provision of data and the benefits of re-using data is also included.
European Union Open Data Portal
Access to open data published by EU institutions and bodies.
Facebook Research
Fact Extraction and VERification (FEVER) dataset
Dataset of 200,000 true and false claims
Fashion-MNIST
Dataset of Zalando's article images—consisting of a training set of 60,000 examples and a test set of 10,000 examples.
Figshare
Academic research with datasets.
FiveThirtyEight Data
Data and code behind some of its articles and graphics.

https://github.com/fivethirtyeight/data
FMA: A Dataset For Music Analysis
A data dump of the Free Music Archive (FMA), an interactive library of high-quality, legal audio downloads.
Food Environment Atlas
Food environment factors--such as store/restaurant proximity, food prices, food and nutrition assistance programs, and community characteristics--interact to influence food choices and diet quality. Research is beginning to document the complexity of these interactions, but more is needed to identify causal relationships and effective policy interventions. The objectives of the Atlas are to assemble statistics on food environment indicators to stimulate research on the determinants of food choices and diet quality, and to provide a spatial overview of a community's ability to access healthy food and its success in doing so.
GeoCommons Archive
Community-contributed collection of open data from around the world. Uploaded by the public, data are often from public and open government websites and sources. The searchable archive includes over 150,000 datasets as GeoJSON.
GESIS
Over 5,000 German and international studies, covering a broad range of topics, are made available for secondary analysis. Data from historical studies are also available.
gesisDataSearch
Search for social and economic research data across a diverse portfolio of data repositories and metadata services.
Github Data Packaged Core Datasets
Important, commonly-used datasets in high quality, easy-to-use & open form as data packages
Global Human Settlement Layer
Produces global spatial information about the human presence on the planet over time, in the form of built-up maps, population density maps, and settlement maps. This information is generated with evidence-based analytics and knowledge using new spatial data mining technologies. The framework uses heterogeneous data, including global archives of fine-scale satellite imagery, census data, and volunteered geographic information. The data is processed fully automatically to generate analytics and knowledge that report objectively and systematically on the presence of population and built-up infrastructure.
Google Dataset Search
Google Public Data
Harvard Dataverse
A collaboration with Harvard Library, Harvard University IT, and IQSS
ICPSR
Maintains a data archive of more than 250,000 files of research in the social and behavioral sciences. It hosts 21 specialized collections of data in education, aging, criminal justice, substance abuse, terrorism, and other fields.
IMDb Datasets
Kaggle Datasets
Open datasets on everything from government, health, and science to popular games and dating trends.
Machine Learning Repository from University of California at Irvine
Collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms.
Microsoft R Application Network Data Sources on the Web
Million Song Dataset
Collection of audio features and metadata for a million contemporary popular music tracks.
THE MNIST DATABASE of handwritten digits
Training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST.
Netflix Prize Data Set
Official data set used in the Netflix Prize competition
Netlytic
Community-supported text and social networks analyzer that can automatically summarize and discover social networks from online conversations on social media sites.

Capture data from social media sites (Twitter, Facebook, YouTube, RSS Feed & text/csv file)
Discover popular topics
Find & explore emerging themes of discussions
Build, visualize and analyze online social networks using social network analysis
Map geo-coded social media data
OpenDataSoft Open Data portals around the world
Geotagged intergovernmental organization portals
Open Images Dataset V4
Open Images is a dataset of ~9 million images that have been annotated with image-level labels and object bounding boxes.
The training set of V4 contains 14.6M bounding boxes for 600 object classes on 1.74M images, making it the largest existing dataset with object location annotations. The boxes have been largely manually drawn by professional annotators to ensure accuracy and consistency. The images are very diverse and often contain complex scenes with several objects (8.4 per image on average). Moreover, the dataset is annotated with image-level labels spanning thousands of classes.
OpenML
Open, collaborative, frictionless, automated machine learning environment
OpenStreetMap
Data about roads, trails, cafés, railway stations, and much more, all over the world.
OPUS the open parallel corpus
Collection of translated texts. The OPUS project tries to convert and align free online data, to add linguistic annotation, and create a publicly available parallel corpus.
Our World in Data
Trends in health, food provision, the growth and distribution of incomes, violence, rights, wars, culture, energy use, education, and environmental changes are empirically analyzed and visualized; for each topic the quality of the data is discussed and the data sources provided.
Papers With Code
The mission of Papers With Code is to create a free and open resource with Machine Learning papers, code and evaluation tables.
Pew Research Center Raw Datasets
Register to download data.
Pew Research Data
Pew Research Center regularly makes available the full datasets that underlie most of its reports. Includes topics:
U.S. Politics & Policy; Journalism & Media; Internet, Science & Tech; Religion & Public Life; Hispanic Trends; Global Attitudes & Trends; Social & Demographic Trends; American Trends Panel
Planet OSM
Regularly-updated, complete copies of the OpenStreetMap.org database
POPGRID Data Collaborative
Enhanced Population, Settlement and Infrastructure Data

Spatially accurate and up-to-date population and settlement data are widely used in planning and decision making in both the public and private sectors to improve the effectiveness and efficiency of decisions, monitor impacts, and identify those who might otherwise be left behind. Understanding where people live and work, and the type and condition of their housing and other infrastructure, is critical in times of disaster, enabling emergency responders to reach those most in need more quickly with appropriate assistance. Such data can help improve access to public and private services, increase the sustainability of natural resources, and facilitate progress towards meeting the internationally accepted Sustainable Development Goals (SDGs). The POPGRID Data Collaborative aims to bring together and expand the international community of data providers, users, and sponsors concerned with georeferenced data on population, human settlements and infrastructure.
Public transport networks for research
Browse, visualize, & download curated public transport network data for 20+ cities.
Data formats: GTFS, network edge lists, event lists, GeoJson, SQLite databases.
Qualitative Data Repository
Dedicated archive for storing and sharing digital data (and accompanying documentation) generated or collected through qualitative and multi-method research in the social sciences. QDR provides search tools to facilitate the discovery of data, and also serves as a portal to material beyond its own holdings, with links to U.S. and international archives. The repository’s initial emphasis is on political science.
RadioTalk: a large-scale corpus of talk radio transcripts
The corpus is available in the S3 bucket radio-talk at s3://radio-talk/v1.0/. The entire corpus is available as one file of about 9.3 GB at s3://radio-talk/v1.0/radiotalk.json.gz, and there's also a version with one file per month under s3://radio-talk/v1.0/monthly/.
Rdatasets
Collection of 1161 datasets that were originally distributed alongside the statistical software environment R and some of its add-on packages. The goal is to make these data more broadly accessible for teaching and statistical software development.
re3data.org
Global registry of research data repositories that covers research data repositories from different academic disciplines. It presents repositories for the permanent storage and access of data sets to researchers, funding bodies, publishers and scholarly institutions.
Reddit Datasets
Research at Google Data
In order to contribute to the broader research community, Google periodically releases data of interest to researchers in a wide range of computer science disciplines.
Science On a Sphere
Datasets from NOAA, NASA, universities, science centers and other organizations. The datasets are divided into the categories of Atmosphere, Ocean, Land, Astronomy, Models and Simulations, and Extras.
Socioeconomic Data and Applications Center
SEDAC, the Socioeconomic Data and Applications Center, is one of the Distributed Active Archive Centers (DAACs) in the Earth Observing System Data and Information System (EOSDIS) of the U.S. National Aeronautics and Space Administration. Focusing on human interactions in the environment, SEDAC has as its mission to develop and operate applications that support the integration of socioeconomic and earth science data and to serve as an "Information Gateway" between earth sciences and social sciences.
Stanford Large Network Dataset Collection
The underlying SNAP library has been actively developed since 2004 and is growing organically as a result of our research pursuits in the analysis of large social and information networks. The largest network we have analyzed so far using the library was the Microsoft Instant Messenger network from 2006, with 240 million nodes and 1.3 billion edges.

The datasets available on the website were mostly collected (scraped) for the purposes of our research.

The website was launched in July 2009.
STL-10 dataset
Image recognition dataset with a corpus of 100,000 unlabeled images and 500 labeled training images per class
TransitFeeds
Extensive archive of public transit data
Transitland
A community-edited data service aggregating transit networks across metropolitan and rural areas around the world. Aggregates stop, route, and schedule data from transit operators' authoritative GTFS feeds.
TweetsKB
TweetsKB is a public RDF corpus of anonymized data for a large collection of annotated tweets. The dataset currently contains data for more than 1.5 billion tweets, spanning almost 5 years (January 2013 - November 2017). Metadata information about the tweets as well as extracted entities, sentiments, hashtags and user mentions are exposed in RDF using established RDF/S vocabularies. For the sake of privacy, we encrypt the usernames and we do not provide the text of the tweets. However, through the tweet IDs, actual tweet content and further information can be fetched.
U.S. Bureau of Labor Statistics
U.S. Bureau of Transportation Statistics TranStats
U.S. Census Data
U.S. Data.gov
U.S. Department of Energy Data
U.S. Geological Survey Science Data Catalog
U.S. Housing and Urban Development Data Sets
Original data sets generated by PD&R-sponsored data collection efforts, including the American Housing Survey, median family incomes and income limits, as well as microdata from research initiatives on topics such as housing discrimination, the HUD-insured multifamily housing stock, and the public housing population.
U.S. NOAA Data Discovery
U.S. Senate Lobbying Databases
Downloadable files include all documents received from January 1 through December 31 of any year, except the current year, by quarter. The current year includes all LD-1 and LD-2 documents received from January 1 to date by quarter.
UCR Spatio-temporal Active Repository
Provides access to large spatio-temporal datasets through an interactive exploratory interface.
UK Data Archive
Acquire, curate and provide access to the UK's largest collection of social and economic data.
University of Florida Stats Department
USPTO Bulk Data
Patent and trademark bulk data
Where to find data: an incomplete list
from Storytelling with Data
Wolfram Data Repository
A public resource that hosts an expanding collection of computable datasets, curated and structured to be suitable for immediate use in computation, visualization, analysis and more. Get Wolfram: https://caligari.dartmouth.edu/public/downloads/mathematica/
World Bank Data Catalog
Yahoo Webscope Datasets
The Yahoo Webscope Program is a reference library of interesting and scientifically useful datasets for non-commercial use by academics and other scientists.
Yelp Open Dataset
Subset of Yelp's businesses, reviews, and user data
Zenodo
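
Example: reading a corpus published as an S3 path. The RadioTalk entry above gives its location as an s3:// URI. The sketch below shows one way to split such a URI into a bucket and key, and to iterate over a gzip-compressed file once it has been downloaded. The helper names are hypothetical, and the one-JSON-object-per-line layout is an assumption that the guide does not confirm; check the corpus documentation before relying on it.

```python
import gzip
import json
from urllib.parse import urlparse

# Path quoted in the RadioTalk entry of this guide.
CORPUS_URI = "s3://radio-talk/v1.0/radiotalk.json.gz"


def split_s3_uri(uri):
    """Split an s3:// URI into a (bucket, key) pair."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def iter_json_lines(binary_stream):
    """Yield one parsed object per non-empty line of a gzip-compressed stream.

    Assumption: the file holds one JSON object per line (JSON Lines).
    """
    with gzip.open(binary_stream, mode="rt", encoding="utf-8") as text:
        for line in text:
            line = line.strip()
            if line:
                yield json.loads(line)


if __name__ == "__main__":
    bucket, key = split_s3_uri(CORPUS_URI)
    print(bucket, key)
```

The download itself needs an S3 client (not shown). Because the full file is about 9.3 GB, streaming it line by line as above is likely to be more practical than loading it into memory at once.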

Data APIs

Awesome_APIs
Collection of APIs for developers.
CORE API
CORE harvests, maintains, enriches and makes available metadata and full-text content (typically a PDF) from many Open Access journals and repositories.
Data USA
Dat Project
Nonprofit-backed data sharing protocol for applications
Europeana APIs
Access to collections drawn from the major museums and galleries across Europe.
HathiTrust Research Center Analytics
HTRC Extracted Features and HathiTrust+Bookworm: 15,722,079 volumes
HTRC Analytics algorithms and Data Capsules: 5,978,217 volumes (public domain only)
JSTOR Data for Research
Data for Research (DfR) provides datasets of content on JSTOR for use in research and teaching. Researchers may use DfR to define and submit their desired dataset to be automatically processed. Data available through the service includes metadata, n-grams, and word counts for most articles and book chapters, and for all research reports and pamphlets on JSTOR. Datasets are produced at no cost to researchers and may include data for up to 25,000 documents.
OECD data for developers
Provides access to datasets in the catalogue of OECD databases.
ProgrammableWeb's API Directory
Public-APIs
Attempt to categorise different APIs scoured from the web which make their resources available for consumption.
rOpenSci Packages
Carefully vetted, staff- and community-contributed R software tools for working with scientific data sources and data sources that support research applications.
SHARE
Harvests metadata nightly from 100+ repositories, transforms that metadata into one format, and makes it accessible via a web API.
Simple API for UCI Machine Learning Dataset Repository
Presents a simple and intuitive API for the UCI ML portal, where users can easily look up a dataset description, search for a particular dataset they are interested in, and even download datasets categorized by size or machine learning task.
U.S. Bureau of Labor Statistics' APIs
U.S. Census: Available APIs
U.S. Library of Congress for Robots
U.S. NASA API portal
Make NASA data, including imagery, eminently accessible to application developers.
U.S. National Library Medicine Products and Services
U.S. NLM NCBI Entrez Programming Utilities
U.S. OpenFDA
Provides APIs and full sets of downloadable files to a number of high-value, high priority and scalable structured datasets, including adverse events, drug product labeling, and recall enforcement reports.
UN Comtrade Web Services / API
Access data from the United Nations Commodity Trade Statistics database, including International Merchandise Trade Statistics (IMTS) and the work of the International Merchandise Trade Statistics Section (IMTSS) of the United Nations Statistics Division.
World Bank Developer Information
Currently has three different APIs to provide access to different datasets: one for Indicators (or time series data), one for Projects (or data on the World Bank’s operations), and one for the World Bank financial data (World Bank Finances API).
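
Example: querying an indicator time series. As a rough illustration of what using the Indicators API might look like, the sketch below builds a request URL and extracts (year, value) pairs from the response. The base URL, the path layout, the `per_page` parameter, and the two-element response shape are assumptions drawn from general familiarity with the service rather than from this guide, and the function names are hypothetical; consult the World Bank developer pages for the authoritative details.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Assumed base URL; it is not stated in this guide.
BASE_URL = "https://api.worldbank.org/v2"


def indicator_url(country, indicator, per_page=100):
    """Build a request URL for one country and one indicator code."""
    query = urlencode({"format": "json", "per_page": per_page})
    return f"{BASE_URL}/country/{quote(country)}/indicator/{quote(indicator)}?{query}"


def parse_indicator_response(payload):
    """Return (year, value) pairs from a decoded response.

    Assumption: the JSON body is a two-element list, [paging metadata, records],
    and each record carries "date" and "value" keys.
    """
    if not isinstance(payload, list) or len(payload) < 2 or payload[1] is None:
        return []
    return [(rec["date"], rec["value"]) for rec in payload[1] if rec["value"] is not None]


def fetch_indicator(country, indicator):
    """Fetch and parse one page of results (requires network access)."""
    with urlopen(indicator_url(country, indicator), timeout=30) as response:
        return parse_indicator_response(json.load(response))
```

Keeping URL construction and response parsing separate from the network call makes both easy to check offline. A call such as `fetch_indicator("br", "NY.GDP.MKTP.CD")` would then return whatever the live service reports.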

Health & Medicine Data

Big Cities Health Inventory Data
Access and analyze health data from 26 cities, for 34 health indicators, and across six demographic indicators.
Centers for Disease Control and Prevention
Child Health and Developmental Studies
Data on how health and disease are passed on between generations--not just by genes, but also through social, personal, and environmental surroundings.
Dartmouth Atlas
Medicare data to provide information and analysis about national, regional, and local markets, as well as hospitals and their affiliated physicians.
Healthcare Cost and Utilization Project (HCUP)
Largest collection of longitudinal hospital care data in the United States.
Healthcare Delivery Research Program Public Data
HealthData.gov
Includes clinical care provider quality information, nationwide health service provider directories, databases of the latest medical and scientific knowledge, consumer product data, community health performance information, and government spending data.
Human Mortality Database
Detailed mortality and population data
Mammographic Image Analysis
Mammographic Image Analysis Society (MIAS) database and the Digital Database for Screening Mammography (DDSM)
Medicare Provider Utilization and Payment Data: Physician and Other Supplier
Information about services and procedures provided to Medicare beneficiaries by physicians and other healthcare professionals, with information about utilization, payment, and submitted charges organized by National Provider Identifier (NPI), Healthcare Common Procedure Coding System (HCPCS) code, and place of service.
National Cancer Institute Data Catalog
National Center for Health Statistics (NCHS)
Data visualization, searchable statistics, and interactive queries on health and health care.
OpenNEURO
Sharing neuroimaging data

Guides & Tutorials

Beginner's Guide to Twitter Data
Learn how to acquire Twitter data and process them to make them usable for further analysis.
Where to get Twitter data for academic research
Describes the options for getting Twitter data for academic research.

Frameworks & Tools

DAGsHub
DAGsHub was created to be a home for open source data science, where everyone can contribute and make the research and development process transparent, inclusive and better for everyone.
DVC
Open-source Version Control System for Machine Learning Projects
Essential list of useful R packages for data scientists
Preset of the most needed packages for data science, statistics, and everyday use with R.