After years of being left for dead, SQL today is making a comeback. How come? And what effect will this have on the data community?
Since the dawn of computing, we have been collecting exponentially growing amounts of data, constantly asking more from our data storage, processing, and analysis technology. In the past decade, this caused software developers to cast aside SQL as a relic that couldn’t scale with these growing data volumes, leading to the rise of NoSQL: MapReduce and Bigtable, Cassandra, MongoDB, and more.
In this post we examine why the pendulum today is swinging back to SQL, and what this means for the future of the data engineering and analysis community.
Part 1: A New Hope
To understand why SQL is making a comeback, let’s start with why it was designed in the first place.
Our story starts at IBM Research in the early 1970s, where the relational database was born. At that time, query languages relied on complex mathematical logic and notation. Two newly minted PhDs, Donald Chamberlin and Raymond Boyce, were impressed by the relational data model but saw that the query language would be a major bottleneck to adoption. They set out to design a new query language that would be (in their own words): “more accessible to users without formal training in mathematics or computer programming.”
70 free data sources for 2017 on government, crime, health, financial and economic data, marketing and social media, journalism and media, real estate, company directory and review, and more to start working on your data projects.
Every great data visualization starts with good and clean data. Most of people believe that collecting big data would be a rough thing, but it’s simply not true. There are thousands of free data sets available online, ready to be analyzed and visualized by anyone. Here we’ve rounded up 70 free data sources for 2017 on government, crime, health, financial and economic data,marketing and social media, journalism and media, real estate, company directory and review, and more.
We hope you could enjoy this and save a lot time and energy searching blindly online.
Free Data Source: Government
Data.gov: It is the first stage and acts as a portal to all sorts of amazing information on everything from climate to crime freely by the US Government.
Data.gov.uk: There are datasets from all UK central departments and a number of other public sector and local authorities. It acts as a portal to all sorts of information on everything, including business and economy, crime and justice, defence, education, environment, government, health, society and transportation.
US. Census Bureau: The website is about the government-informed statistics on the lives of US citizens including population, economy, education, geography, and more.
The CIA World Factbook: Facts on every country in the world; focuses on history, government, population, economy, energy, geography, communications, transportation, military, and transnational issues of 267 countries.
Socrata: Socratais a mission-driven software company that is another interesting place to explore government-related data with some visualization tools built-in. Its data as a service has been adopted by more than 1200 government agencies for open data, performance management and data-driven government.
European Union Open Data Portal: It is the single point of access to a growing range of data from the institutions and other bodies of the European Union. The data boosts includes economic development within the EU and transparency within the EU institutions, including geographic, geopolitical and financial data, statistics, election results, legal acts, and data on crime, health, the environment, transport and scientific research. They could be reused in different databases and reports. And more, a variety of digital formats are available from the EU institutions and other EU bodies. The portal provides a standardised catalogue, a list of apps and web tools reusing these data, a SPARQL endpoint query editor and rest API access, and tips on how to make best use of the site.
Canada Open Datais a pilot project with many government and geospatial datasets. It could help you explore how the Government of Canada creates greater transparency, accountability, increases citizen engagement, and drives innovation and economic opportunities through open data, open information, and open dialogue.
Datacatalogs.org: It offers open government data from US, EU, Canada, CKAN, and more.
UK Data Service: The UK Data Service collection includes major UK government-sponsored surveys, cross-national surveys, longitudinal studies, UK census data, international aggregate, business data, and qualitative data.
Free Data Source: Crime
Uniform Crime Reporting: The UCR Program has been the starting place for law enforcement executives, students, researchers, members of the media, and the public seeking information on crime in the US.
FBI Crime Statistics: Statistical crime reports and publications detailing specific offenses and outlining trends to understand crime threats at both local and national levels.
Bureau of Justice Statistics: Information on anything related to U.S. justice system, including arrest-related deaths, census of jail inmates, national survey of DNA crime labs, surveys of law enforcement gang units, etc.
National Sex Offender Search: It is an unprecedented public safety resource that provides the public with access to sex offender data nationwide. It presents the most up-to-date information as provided by each Jurisdiction.
Free Data Source: Health
U.S. Food & Drug Administration: Here you will find a compressed data file of the Drugs@FDA database. Drugs@FDA, is updated daily, this data file is updated once per week, on Tuesday.
UNICEF: UNICEF gathers evidence on the situation of children and women around the world. The data sets include accurate, nationally representative data from household surveys and other sources.
Healthdata.gov: 125 years of US healthcare data including claim-level Medicare data, epidemiology and population statistics.
NHS Health and Social Care Information Centre: Health data sets from the UK National Health Service. The organization produces more than 260 official and national statistical publications. This includes national comparative data for secondary uses, developed from the long-running Hospital Episode Statistics which can help local decision makers to improve the quality and efficiency of frontline care.
Free Data Source: Financial and Economic Data
World Bank Open Data: Education statistics about everything from finances to service delivery indicators around the world.
IMF Economic Data: An incredibly useful source of information that includes global financial stability reports, regional economic reports, international financial statistics, exchange rates, directions of trade, and more.
UN Comtrade Database: Free access to detailed global trade data with visualizations. UN Comtrade is a repository of official international trade statistics and relevant analytical tables. All data is accessible through API.
Global Financial Data: With data on over 60,000 companies covering 300 years, Global Financial Data offers a unique source to analyze the twists and turns of the global economy.
Google Finance: Real-time stock quotes and charts, financial news, currency conversions, or tracked portfolios.
Google Public Data Explorer: Google’s Public Data Explorer provides public data and forecasts from a range of international organizations and academic institutions including the World Bank, OECD, Eurostat and the University of Denver. These can be displayed as line graphs, bar graphs, cross sectional plots or on maps.
U.S. Bureau of Economic Analysis: U.S. official macroeconomic and industry statistics, most notably reports about the gross domestic product (GDP) of the United States and its various units. They also provide information about personal income, corporate profits, and government spending in their National Income and Product Accounts (NIPAs).
Financial Data Finder at OSU: Plentiful links to anything related to finance, no matter how obscure, including World Development Indicators Online, World Bank Open Data, Global Financial Data, International Monetary Fund Statistical Databases, and EMIS Intelligence.
Financial Times: The Financial Times provides a broad range of information, news and services for the global business community.
Free Data Source: Marketing and Social Media
Amazon API: Browse Amazon Web Services’Public Data Sets by category for a huge wealth of information. Amazon API Gateway allows developers to securely connect mobile and web applications to APIs that run on Amazon Web(AWS) Lambda, Amazon EC2, or other publicly addressable web services that are hosted outside of AWS.
American Society of Travel Agents: ASTA is the world’s largest association of travel professionals. It provides members information including travel agents and the companies whose products they sell such as tours, cruises, hotels, car rentals, etc.
Social Mention: Social Mention is a social media search and analysis platform that aggregates user-generated content from across the universe into a single stream of information.
Google Trends: Google Trends shows how often a particular search-term is entered relative to the total search-volume across various regions of the world in various languages.
Facebook API: Learn how to publish to and retrieve data from Facebook using the Graph API.
Twitter API: The Twitter Platform connects your website or application with the worldwide conversation happening on Twitter.
Instagram API: The Instagram API Platform can be used to build non-automated, authentic, high-quality apps and services.
Foursquare API: The Foursquare API gives you access to our world-class places database and the ability to interact with Foursquare users and merchants.
HubSpot: A large repository of marketing data. You could find the latest marketing stats and trends here. It also provides tools for social media marketing, content management, web analytics, landing pages and search engine optimization.
Moz: Insights on SEO that includes keyword research, link building, site audits, and page optimization insights in order to help companies to have a better view of the position they have on search engines and how to improve their ranking.
The New York Times Developer Network– Search Times articles from 1851 to today, retrieving headlines, abstracts and links to associated multimedia. You can also search book reviews, NYC event listings, movie reviews, top stories with images and more.
Associated Press API: The AP Content API allows you to search and download content using your own editorial tools, without having to visit AP portals. It provides access to images from AP-owned, member-owned and third-party, and videos produced by AP and selected third-party.
Google Books Ngram Viewer: It is an online search engine that charts frequencies of any set of comma-delimited search strings using a yearly count of n-grams found in sources printed between 1500 and 2008 in Google’s text corpora.
Wikipedia Database: Wikipedia offers free copies of all available content to interested users.
FiveThirtyEight: It is a website that focuses on opinion poll analysis, politics, economics, and sports blogging. The data and code on Github is behind the stories and interactives at FiveThirtyEight.
Google Scholar: Google Scholar is a freely accessible web search engine that indexes the full text or metadata of scholarly literature across an array of publishing formats and disciplines. It includes most peer-reviewed online academic journals and books, conference papers, theses and dissertations, preprints, abstracts, technical reports, and other scholarly literature, including court opinions and patents.
Free Data Source: Real Estate
Castles: Castles are a successful, privately owned independent agency. Established in 1981, they offer a comprehensive service incorporating residential sales, letting and management, and surveys and valuations.
Realestate.com: RealEstate.com serves as the ultimate resource for first-time home buyers, offering easy-to-understand tools and expert advice at every stage in the process.
Gumtree: Gumtree is the first site for free classifieds ads in the UK. Buy and sell items, cars, properties, and find or offer jobs in your area is all available on the website.
James Hayward: It provides an innovative database approach to residential sales, lettings & management.
LinkedIn: LinkedIn is a business- and employment-oriented social networking service that operates via websites and mobile apps. It has 500 million members in 200 countries and you could find the business directory here.
OpenCorporates: OpenCorporates is the largest open database of companies and company data in the world, with in excess of 100 million companies in a similarly large number of jurisdictions. Our primary goal is to make information on companies more usable and more widely available for the public benefit, particularly to tackle the use of companies for criminal or anti-social purposes, for example corruption, money laundering and organised crime.
Yellowpages: The original source to find and connect with local plumbers, handymen, mechanics, attorneys, dentists, and more.
Craigslist: Craigslist is an American classified advertisements website with sections devoted to jobs, housing, personals, for sale, items wanted, services, community, gigs, résumés, and discussion forums.
GAF Master Elite Contractor: Founded in 1886, GAF has become North America’s largest manufacturer of commercial and residential roofing (Source: Fredonia Group study). Our success in growing the company to nearly $3 billion in sales has been a result of our relentless pursuit of quality, combined with industry-leading expertise and comprehensive roofing solutions. Jim Schnepper is the President of GAF, an operating subsidiary of Standard Industries. When you are looking to protect the things you treasure most, here are just some of the reasons why we believe you should choose GAF.
CertainTeed: You could find contractors, remodelers, installers or builders in the US or Canada on your residential or commercial project here.
Manta: Manta is one of the largest online resources that deliver products, services and educational opportunities. The Manta directory boasts millions of unique visitors every month who search comprehensive database for individual businesses, industry segments and geographic-specific listings.
Kansas Bar Association: Directory for lawyers. The Kansas Bar Association (KBA) was founded in 1882 as a voluntary association for dedicated legal professionals and has more than 7,000 members, including lawyers, judges, law students, and paralegals.
Free Data Source: Other Portal Websites
Capterra: Directory about business software and reviews.
Monster: Data source for jobs and career opportunities.
Glassdoor: Directory about jobs and information about inside scoop on companies with employee reviews, personalized salary tools, and more.
Google recently announced that Google Cloud Dataprep—the new managed data wrangling service developed in collaboration with Trifacta—is now available in public beta. This service enables analysts and data scientists to visually explore and prepare data for analysis in seconds within the Google Cloud Platform.
Now that the Google Cloud Dataprep beta is open to the public, more companies can experience the benefits of Trifacta’s data preparation platform. From predictive transformation to interactive exploration, Trifacta’s intuitive workflow has accelerated the data preparation process for Google Cloud Dataprep customers who have tried it out within private beta.
True SaaS offering With Google Cloud Dataprep, there’s no software to install or manage. Unlike a marketplace offering that deploys into Google ecosystem, Cloud Dataprep is a fully-managed service that does not require configuration or administration.
Single Sign On through Google Cloud Identity & Access Management All users can easily access Cloud Dataprep using the same login / credential that they already used for any other Google service. This ensures highly secure and consistent access to Google services and data based on the permissions and roles defined through Google IAM.
Integration to Google Cloud Storage and Google BigQuery (read & write) Users can browse, preview, import data from and publish results to Google Cloud Storage and Google BigQuery directly through Cloud Dataprep. This is a huge boon for the teams that rely upon Google-generated data. For example:
Marketing teams leveraging DoubleClick Ads data can make that data available in Google BigQuery, then use Cloud Dataprep to prepare and publish the result back into BigQuery for downstream analysis. Learn more here.
Telematics data scientists can connect Cloud Dataprep directly to raw log data (often in JSON format) stored on Google Cloud Storage, and then prepare it for machine learning models executed in TensorFlow.
Retail business analysts can upload Excel data from their desktop to Google Cloud Storage, parse and combine it with BigQuery data to augment the results (beyond the limits of Excel), and eventually making the data available to various analytic tools like Google Data Studio, Looker, Tableau, Qlik or Zoomdata.
Big data scale provided by Cloud Dataflow By leveraging a serverless, auto-scaling data processing engine (Google Cloud Dataflow), Cloud Dataprep can handle any size of data, located anywhere in the world. This means that users don’t have to worry about optimizing their logic as their data grows, nor have to choose where their jobs run. At the same time, IT can rely on Cloud Dataflow to efficiently scale resources only as needed. Finally, it allows for enterprise-grade monitoring and logging in Google Stackdriver.
World-class Google support As a Google service, Cloud Dataprep is subject to the same standards as other Google Beta product & services. These benefits include:
World class uptime and availability around the world
Official support provided by Google Cloud Platform
Centralized usage-based billing managed on a per project basis with quotas and detailed reports
Early Google Cloud Dataprep Customer Feedback
Although Cloud Dataprep has only been in private beta for a short amount of time, we’ve had substantial participation from thousands of early private beta users and we’re excited to share some of the great feedback. Here’s a sample of what early users are saying:
Merkle Inc. “Cloud Dataprep allows us to quickly view and understand new datasets, and its flexibility supports our data transformation needs. The GUI is nicely designed, so the learning curve is minimal. Our initial data preparation work is now completed in minutes, not hours or days,” says Henry Culver, IT Architect at Merkle. “The ability to rapidly see our data, and to be offered transformation suggestions in data delivery, is a huge help to us as we look to rapidly assimilate new datasets.”
Venture Development Center
“We needed a platform that was versatile, easy to utilize and provided a migration path as our needs for data review, evaluation, hygiene, interlinking and analysis advanced. We immediately knew that Google Cloud Platform, with Cloud Dataprep and BigQuery, were exactly what we were looking for. As we develop our capability and movement into the data cataloging, QA and delivery cycle, Cloud Dataprep allows us to accomplish this quickly and adeptly,” says Matthew W. Staudt, President of Venture Development Center.
For more information on these customers check out Google’s blog on the public beta launch here.
Cloud Dataprep Public Beta: Furthering Wrangling Success
Now that the beta version of Google Cloud Dataprep is open to the public, we’re excited to see more organizations achieve data wrangling success from the launch of the public beta of Google Cloud Dataprep. From multinational banks to consumer retail companies to government agencies, there’s a growing number of customers using Trifacta’s consistent transformation logic, user experience, workflow, metadata management, and comprehensive data governance to reduce data preparation times and improve data quality.
If you’re interested in Google Dataprep, you can sign up with your own personal account for free access OR login using your company’s existing Google account. Visit cloud.google.com/dataprep to learn more.
Some tasks are common to almost all users, though, regardless of subject area: data import, data wrangling and data visualization. The table below show my favorite go-to packages for one of these three tasks (plus a few miscellaneous ones tossed in). The package names in the table are clickable if you want more information. To find out more about a package once you’ve installed it, type help(package = "packagename") in your R console (of course substituting the actual package name ).
As Python has gained a lot of traction in the recent years in Data Science industry, I wanted to outline some of its most useful libraries for data scientists and engineers, based on recent experience.
And, since all of the libraries are open sourced, we have added commits, contributors count and other metrics from Github, which could be served as a proxy metrics for library popularity.
1. NumPy (Commits: 15980, Contributors: 522)
When starting to deal with the scientific task in Python, one inevitably comes for help to Python’s SciPy Stack, which is a collection of software specifically designed for scientific computing in Python (do not confuse with SciPy library, which is part of this stack, and the community around this stack). This way we want to start with a look at it. However, the stack is pretty vast, there is more than a dozen of libraries in it, and we want to put a focal point on the core packages (particularly the most essential ones).
The most fundamental package, around which the scientific computation stack is built, is NumPy (stands for Numerical Python). It provides an abundance of useful features for operations on n-arrays and matrices in Python. The library provides vectorization of mathematical operations on the NumPy array type, which ameliorates performance and accordingly speeds up the execution.
2. SciPy (Commits: 17213, Contributors: 489)
SciPy is a library of software for engineering and science. Again you need to understand the difference between SciPy Stack and SciPy Library. SciPy contains modules for linear algebra, optimization, integration, and statistics. The main functionality of SciPy library is built upon NumPy, and its arrays thus make substantial use of NumPy. It provides efficient numerical routines as numerical integration, optimization, and many others via its specific submodules. The functions in all submodules of SciPy are well documented — another coin in its pot.