Resources


Open Source Tools

  1. Apache Hadoop is a project developing open-source software for reliable, scalable, distributed computing

  2. Apache Flink is an open-source platform for distributed stream and batch data processing

  3. Apache Spark is a general engine for large-scale data processing


Data Privacy

  1. Enabling Big Data through Europe’s New Data Protection Regulation. Viktor Mayer-Schönberger & Yann Padova

  2. Privacy in the Age of Big Data, The Stanford Law Review


Ethics and Big Data

  1. Perspectives on Big Data, Ethics, and Society. May 23, 2016 / By Jacob Metcalf, Emily F. Keller, and Danah Boyd

  2. The Social, Cultural, & Ethical Dimensions of “Big Data”, March 17, 2014 – New York, NY

  3. Council for Big Data, Ethics, and Society



Advanced AI Tools

  1. TensorFlow: An open-source software library for numerical computation using data flow graphs.

  2. The Microsoft Cognitive Toolkit: A free, open-source, commercial-grade toolkit that trains deep learning algorithms to learn like the human brain.


Geographic Datasets

  1. Global Map: A set of consistent GIS layers covering the whole globe at 1km resolution including: transportation, elevation, drainage, vegetation, administrative boundaries, land cover, land use and population centres. Produced by the International Steering Committee on Global Mapping.

  2. Koordinates: GIS data aggregation site including data in a number of categories such as elevation, environment, climate etc. Some global datasets, some based on continents, some for specific countries. Registration required.

  3. European Environment Agency: Maps and datasets from the European Environment Agency, covering a huge range of physical geography and environmental topics. Europe only.

  4. Satellite Application Facility on Climate Monitoring: Provides near real-time and retroactively-generated datasets of cloud cover, type and temperature, surface radiation budget and temperatures, among others.

  5. Gridded climatic data for North America, South America and Europe: A huge range of climatic data at 1km and 4km resolution, derived from various models, including temperature, precipitation, snow and derived variables such as water deficit.

  6. Natural Disaster Hazards: Hazard Frequency, Mortality and Economic Loss Risk as gridded data for the globe. Covers cyclones, drought, earthquakes, flood, landslide, volcano and a combination of them all.

  7. Natural Disaster Hotspots: A wide range of geographic data on natural disasters (including volcanoes, earthquakes, landslide, flood and 'multihazards') with hazard frequency, economic loss etc.

  8. Open Flights: Airport, airline and route data across the globe. Data is provided as CSV files which can be easily processed to produce GIS outputs. Data includes all known airports, and a large number of routes between airports.

  9. Global Roads Open Access Data Set: A vector dataset of roads across the world, using a globally consistent data model, and suitable for mapping at the 1:250,000 level. Only roads between settlements are included, not residential streets, and the dataset is accurate to approximately 50m.

  10. Earth Engine’s public data catalog includes a variety of standard Earth science raster datasets. 

  11. Capitaine European Train Stations: Metadata for all train stations in Europe including latitude and longitude.

  12. GAR15: UN dataset for Global Assessment of Risk, showing the amount of capital invested in infrastructure at a 5km resolution. Useful for assessment of infrastructure risk and cost of natural disasters.

  13. MODIS provides continuous global coverage every one to two days, and collects data from 36 spectral bands. Resolution: 250-1000m. Available from 1999. Wide range of different datasets.
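
The Open Flights entry above notes that CSV data can be processed into GIS outputs. A minimal sketch of that idea follows, converting CSV rows into a GeoJSON FeatureCollection with the Python standard library only. The column names (name, latitude, longitude) and the sample rows are assumptions made for illustration; they are not taken from the Open Flights files, whose actual layout should be checked before use.

```python
import csv
import io
import json


def csv_to_geojson(csv_text, lat_field="latitude", lon_field="longitude"):
    """Convert CSV rows that carry coordinate columns into a GeoJSON FeatureCollection.

    The field names are hypothetical defaults; pass the real column names
    of the file being processed.
    """
    features = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            lat = float(row[lat_field])
            lon = float(row[lon_field])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows without usable coordinates
        # Keep every non-coordinate column as a feature property.
        properties = {k: v for k, v in row.items() if k not in (lat_field, lon_field)}
        features.append({
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": properties,
        })
    return {"type": "FeatureCollection", "features": features}


# Made-up sample rows, for illustration only (not real airport data).
SAMPLE_CSV = """name,latitude,longitude
Example Airport A,60.0,11.0
Example Airport B,not-a-number,5.0
Example Airport C,59.5,10.5
"""

collection = csv_to_geojson(SAMPLE_CSV)
print(json.dumps(collection, indent=2))
```

The resulting JSON can be saved with a .geojson extension and opened in most GIS packages. Rows whose coordinates cannot be parsed are skipped rather than raising an error, which is a design choice for this sketch, not a requirement.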


Datasets for Data Science, Machine Learning and AI Courses

The following datasets have been filtered and refined from a social media (Twitter) dataset and can be used in courses on big data, machine learning, data science and AI.

  1. DTdata has a header row with four attributes (Topic, TWDate, RTNumber and Demand) and 27 rows of training data. Demand is the output variable (the predicted class), and the others are the input variables.

  2. NBdata is the same as the dataset above (i.e. DTdata), except for the number of retweets. The numerical RTNumber column is converted to categorical values to make the probabilities easier to calculate. In addition, the dataset contains one record as test data.

  3. KMdata contains 161 tweets with location data (i.e. GPS coordinates) to be grouped. Note that we derived the latitude and longitude by geocoding the physical addresses extracted from the collected tweets, and that the negative values of the west longitudes were changed into positive values for the k-means clustering.

  4. SVMdata1 is generated by grouping the KMdata into two or three clusters. It contains four columns: TWNumber, Latitude, Longitude and ClusterValue. The ClusterValue column gives the group number assigned by the k-means clustering.

  5. ANNdata is derived from an original dataset and contains five columns: TWDate, RTNumber (as an integer), Latitude, Longitude and Demand. TWDate was converted to the day of generation (i.e. 27, 28 and 29), and Demand was divided into three values (i.e. 0, 0.5 and 1). The values denote how relevant a tweet is to demand: "0" represents "no relevance for demand" and "1" represents "related to demand."

  6. Data Github Repository. This repository provides direct links to public datasets across many domains.

  7. UCI (University of California, Irvine) datasets. Free datasets are available here.

  8. OpenML (https://openml.org). You can find more than 20,000 datasets here.
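
The preprocessing described for NBdata and KMdata above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the procedure actually used to build the datasets: the retweet thresholds, the category names and the sample coordinates are all hypothetical, and the k-means function is a plain textbook version that treats latitude and longitude as flat Euclidean coordinates.

```python
import math
import random


def bin_retweets(count, low=10, high=100):
    """Map a numerical retweet count to a category.

    The thresholds and the labels are hypothetical; the datasets above
    may use different cut-offs.
    """
    if count < low:
        return "low"
    if count < high:
        return "medium"
    return "high"


def kmeans(points, k, max_iterations=50, seed=0):
    """Plain k-means on a list of coordinate tuples.

    Returns one cluster index per point and the final centroids.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct input points
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


# Made-up (latitude, longitude) pairs with west longitudes as negative values.
raw_points = [
    (59.9, -10.7), (59.8, -10.8), (60.0, -10.6),
    (40.7, -74.0), (40.8, -73.9), (40.6, -74.1),
]
# Mirror the step described for KMdata: negative longitudes become positive.
points = [(lat, abs(lon)) for lat, lon in raw_points]

labels, centroids = kmeans(points, k=2)
categories = [bin_retweets(n) for n in (3, 50, 500)]
print(labels, centroids, categories)
```

The cluster indices returned by kmeans play the role of the ClusterValue column described for SVMdata1. Euclidean distance on degrees is only a rough proxy for distance on the ground, so a real analysis might project the coordinates first.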


Datasets in Norway

  1. Norwegian Mapping Agency Open Data: Open data from the Norwegian Mapping Agency, including topographical maps, road networks, elevation data, place names etc.

  2. An API with ready-made datasets from SSB (Statistics Norway)

  3. Flood datasets in Norway

  4. Transport datasets

  5. Norwegian Land Cover: Various datasets concerning land resources in Norway provided by the Norwegian Forest and Landscape Institute, including land type, forest, tree species and site index.

  6. Open and free geospatial data from Norway

  7. Geological Survey of Norway: Geological data for Norway

  8. Norwegian Petroleum Directorate: Data on licensed extraction areas, wells, fields, pipelines and survey data

  9. HSDPA-bandwidth logs for mobile HTTP streaming scenarios (source: UiO)

  10. Soccer Video and Player Position Dataset

  11. Statistics and Social Network of YouTube Videos Dataset


Other Video/Audio Datasets

  1. Berkeley DeepDrive BDD100k: A dataset for self-driving AI. It contains over 100,000 videos covering more than 1,100 hours of driving across different times of day and weather conditions. The annotated images come from the New York and San Francisco areas.

  2. Europeana Data contains open metadata on 20 million texts, images, videos and sounds gathered by Europeana, the trusted and comprehensive resource for European cultural heritage content.

  3. Pouring Dataset: Videos of people pouring a variety of liquids from and into a variety of receptacles, used for research on unsupervised imitation learning (This data is licensed by Google Inc. under a Creative Commons Attribution 4.0 International License.)

  4. An autonomous driving dataset and benchmark for optical flow: HD1K Benchmark Suite.

  5. Multi-view video datasets based on 360° cameras.

  6. The Cityscapes Dataset focuses on semantic understanding of urban street scenes.

  7. DriveU Traffic Light Dataset: A dataset aimed at researchers in the field of traffic light recognition and detection.


Related Websites, Datasets and Software

  1. The Humanitarian Computing Library 

  2. iRevolutions 

  3. Crisis Commons 

  4. Social Media for Good 

  5. Humanitarian OpenStreetMap Team (HOT) 

  6. MapAction 

  7. Sahana (Open and free system)

  8. Ushahidi (Open and free system)

  9. GeoNames (Geo-tagging software)

  10. OpenStreetMap (Geographical information, important for gazetteers)

  11. PyBossa (Crowdsourcing software)

  12. Data visualization tools

  13. GATE (Text processing)

  14. WEKA (Open-source data mining software in Java)

  15. ArkNLP (Twitter specific Natural Language Processing)

  16. HDX (Humanitarian Data eXchange, datasets of humanitarian variables by UN OCHA)

  17. TREC Temporal Summarization Track (Corpus for social media update summarization)

  18. Twitter Events Corpus (120 million tweets, with relevance judgments for over 500 events)

  19. Disaster Risk - Datasets

  20. TREC Microblog Corpus (Corpus of social media messages)

  21. TREC Temporal Summarization – crisis events from 2012 aligned with TREC KBA Corpus

  22. CrisisLex (Corpora of disaster-related social media messages)

  23. CredBank (Corpus for credibility research)

  24. Google Person Finder

  25. Google Crisis Map

  26. Japan Radiation Map (derived from the SPEEDI data set) 


Open Source Projects (Machine Learning, AI) 

  1. TensorFlow is designed to facilitate research in machine learning, and to make it quick and easy to transition from research prototype to production system. Github URL: Tensorflow

  2. Scikit-learn provides simple and efficient tools for data mining and data analysis. It is accessible to anyone, reusable in several contexts, and built on NumPy, SciPy, and matplotlib. It is open source and commercially usable under the BSD license. Github URL: Scikit-learn

  3. Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano.

  4. PyBrain is a modular Machine Learning Library for Python. Github URL: PyBrain

  5. Fuel is a data pipeline framework which provides your machine learning models with the data they need. It is planned to be used by both the Blocks and Pylearn2 neural network libraries. Github URL: Fuel

  6. PyTorch provides tensors and dynamic neural networks in Python with strong GPU acceleration. Github URL: pytorch

  7. Theano allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Github URL: Theano

  8. Gensim is a free Python library for scalable statistical semantics. It analyses plain-text documents for semantic structure and retrieves semantically similar documents. Github URL: Gensim

  9. Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and community contributors. Github URL: Caffe

  10. Chainer is a Python-based, standalone open source framework for deep learning models.  Github URL: Chainer

  11. Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests.  Github URL: Statsmodels

  12. Shogun is a machine learning toolbox which provides a wide range of unified and efficient Machine Learning (ML) methods. Github URL: Shogun

  13. Neon is Nervana's Python-based deep learning library. It provides ease of use while delivering the highest performance. Contributors: 78 (66% up), Commits: 1112, Github URL: Neon

  14. Nilearn is a Python module for fast and easy statistical learning on NeuroImaging data. It leverages the scikit-learn Python toolbox for multivariate statistics with applications such as predictive modelling, classification, decoding, or connectivity analysis. Github URL: Nilearn

  15. Pylearn2 is a machine learning library. Most of its functionality is built on top of Theano. Github URL: Pylearn2

  16. NuPIC is an open source project based on a theory of neocortex called Hierarchical Temporal Memory (HTM).  Github URL: NuPIC

  17. Orange3 is an open-source machine learning and data visualization tool for novices and experts. It offers interactive data analysis workflows with a large toolbox. Github URL: Orange3

  18. Pymc is a Python module that implements Bayesian statistical models and fitting algorithms, including Markov chain Monte Carlo. Github URL: Pymc

  19. Deap is a novel evolutionary computation framework for rapid prototyping and testing of ideas. Github URL: Deap

  20. Annoy (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings to search for points in space that are close to a given query point. Github URL: Annoy