70 free data sources for 2017 on government, crime, health, financial and economic data, marketing and social media, journalism and media, real estate, company directory and review, and more to start working on your data projects.
Every great data visualization starts with good, clean data. Most people believe that collecting big data is hard, but that’s simply not true. There are thousands of free data sets available online, ready to be analyzed and visualized by anyone. Here we’ve rounded up 70 free data sources for 2017 on government, crime, health, financial and economic data, marketing and social media, journalism and media, real estate, company directory and review, and more.
We hope you enjoy this list and that it saves you a lot of time and energy otherwise spent searching blindly online.
Free Data Source: Government
Data.gov: The natural first stop, it acts as a portal to all sorts of information, on everything from climate to crime, made freely available by the US Government.
Data.gov.uk: There are datasets from all UK central departments and a number of other public sector and local authorities. It acts as a portal to all sorts of information on everything, including business and economy, crime and justice, defence, education, environment, government, health, society and transportation.
U.S. Census Bureau: Government statistics on the lives of US citizens, including population, economy, education, geography, and more.
The CIA World Factbook: Facts on every country in the world; focuses on history, government, population, economy, energy, geography, communications, transportation, military, and transnational issues of 267 countries.
Socrata: Socrata is a mission-driven software company and another interesting place to explore government-related data, with some visualization tools built in. Its data-as-a-service offering has been adopted by more than 1200 government agencies for open data, performance management and data-driven government.
European Union Open Data Portal: It is the single point of access to a growing range of data from the institutions and other bodies of the European Union. The data is intended to boost economic development within the EU and transparency within the EU institutions, and it includes geographic, geopolitical and financial data, statistics, election results, legal acts, and data on crime, health, the environment, transport and scientific research. The data can be reused in different databases and reports, and it is available from the EU institutions and other EU bodies in a variety of digital formats. The portal provides a standardised catalogue, a list of apps and web tools reusing these data, a SPARQL endpoint query editor and REST API access, and tips on how to make best use of the site.
Canada Open Data: A pilot project with many government and geospatial datasets. It can help you explore how the Government of Canada creates greater transparency and accountability, increases citizen engagement, and drives innovation and economic opportunities through open data, open information, and open dialogue.
Datacatalogs.org: It offers open government data from US, EU, Canada, CKAN, and more.
UK Data Service: The UK Data Service collection includes major UK government-sponsored surveys, cross-national surveys, longitudinal studies, UK census data, international aggregate, business data, and qualitative data.
Free Data Source: Crime
Uniform Crime Reporting: The UCR Program has been the starting place for law enforcement executives, students, researchers, members of the media, and the public seeking information on crime in the US.
FBI Crime Statistics: Statistical crime reports and publications detailing specific offenses and outlining trends to understand crime threats at both local and national levels.
Bureau of Justice Statistics: Information on anything related to the U.S. justice system, including arrest-related deaths, census of jail inmates, national survey of DNA crime labs, surveys of law enforcement gang units, etc.
National Sex Offender Search: It is an unprecedented public safety resource that provides the public with access to sex offender data nationwide. It presents the most up-to-date information as provided by each Jurisdiction.
Free Data Source: Health
U.S. Food & Drug Administration: Here you will find a compressed data file of the Drugs@FDA database. Drugs@FDA itself is updated daily; this data file is updated once per week, on Tuesday.
UNICEF: UNICEF gathers evidence on the situation of children and women around the world. The data sets include accurate, nationally representative data from household surveys and other sources.
Healthdata.gov: 125 years of US healthcare data including claim-level Medicare data, epidemiology and population statistics.
NHS Health and Social Care Information Centre: Health data sets from the UK National Health Service. The organization produces more than 260 official and national statistical publications. This includes national comparative data for secondary uses, developed from the long-running Hospital Episode Statistics which can help local decision makers to improve the quality and efficiency of frontline care.
Free Data Source: Financial and Economic Data
World Bank Open Data: Education statistics about everything from finances to service delivery indicators around the world.
IMF Economic Data: An incredibly useful source of information that includes global financial stability reports, regional economic reports, international financial statistics, exchange rates, directions of trade, and more.
UN Comtrade Database: Free access to detailed global trade data with visualizations. UN Comtrade is a repository of official international trade statistics and relevant analytical tables. All data is accessible through API.
Global Financial Data: With data on over 60,000 companies covering 300 years, Global Financial Data offers a unique source to analyze the twists and turns of the global economy.
Google Finance: Real-time stock quotes and charts, financial news, currency conversions, or tracked portfolios.
Google Public Data Explorer: Google’s Public Data Explorer provides public data and forecasts from a range of international organizations and academic institutions including the World Bank, OECD, Eurostat and the University of Denver. These can be displayed as line graphs, bar graphs, cross sectional plots or on maps.
U.S. Bureau of Economic Analysis: U.S. official macroeconomic and industry statistics, most notably reports about the gross domestic product (GDP) of the United States and its various units. They also provide information about personal income, corporate profits, and government spending in their National Income and Product Accounts (NIPAs).
Financial Data Finder at OSU: Plentiful links to anything related to finance, no matter how obscure, including World Development Indicators Online, World Bank Open Data, Global Financial Data, International Monetary Fund Statistical Databases, and EMIS Intelligence.
Financial Times: The Financial Times provides a broad range of information, news and services for the global business community.
Free Data Source: Marketing and Social Media
Amazon API: Browse Amazon Web Services’ Public Data Sets by category for a huge wealth of information. Amazon API Gateway allows developers to securely connect mobile and web applications to APIs that run on AWS Lambda, Amazon EC2, or other publicly addressable web services that are hosted outside of AWS.
American Society of Travel Agents: ASTA is the world’s largest association of travel professionals. It provides members information including travel agents and the companies whose products they sell such as tours, cruises, hotels, car rentals, etc.
Social Mention: Social Mention is a social media search and analysis platform that aggregates user-generated content from across the universe into a single stream of information.
Google Trends: Google Trends shows how often a particular search-term is entered relative to the total search-volume across various regions of the world in various languages.
Facebook API: Learn how to publish to and retrieve data from Facebook using the Graph API.
Twitter API: The Twitter Platform connects your website or application with the worldwide conversation happening on Twitter.
Instagram API: The Instagram API Platform can be used to build non-automated, authentic, high-quality apps and services.
Foursquare API: The Foursquare API gives you access to our world-class places database and the ability to interact with Foursquare users and merchants.
HubSpot: A large repository of marketing data. You can find the latest marketing stats and trends here. It also provides tools for social media marketing, content management, web analytics, landing pages and search engine optimization.
Moz: Insights on SEO, including keyword research, link building, site audits, and page optimization, to help companies get a better view of their position on search engines and how to improve their ranking.
Free Data Source: Journalism and Media
The New York Times Developer Network: Search Times articles from 1851 to today, retrieving headlines, abstracts and links to associated multimedia. You can also search book reviews, NYC event listings, movie reviews, top stories with images and more.
Associated Press API: The AP Content API allows you to search and download content using your own editorial tools, without having to visit AP portals. It provides access to images from AP-owned, member-owned and third-party, and videos produced by AP and selected third-party.
Google Books Ngram Viewer: It is an online search engine that charts frequencies of any set of comma-delimited search strings using a yearly count of n-grams found in sources printed between 1500 and 2008 in Google’s text corpora.
Wikipedia Database: Wikipedia offers free copies of all available content to interested users.
FiveThirtyEight: A website that focuses on opinion poll analysis, politics, economics, and sports blogging. The data and code behind its stories and interactives are available on GitHub.
Google Scholar: Google Scholar is a freely accessible web search engine that indexes the full text or metadata of scholarly literature across an array of publishing formats and disciplines. It includes most peer-reviewed online academic journals and books, conference papers, theses and dissertations, preprints, abstracts, technical reports, and other scholarly literature, including court opinions and patents.
Free Data Source: Real Estate
Castles: Castles are a successful, privately owned independent agency. Established in 1981, they offer a comprehensive service incorporating residential sales, letting and management, and surveys and valuations.
Realestate.com: RealEstate.com serves as the ultimate resource for first-time home buyers, offering easy-to-understand tools and expert advice at every stage in the process.
Gumtree: Gumtree is the first site for free classifieds ads in the UK. You can buy and sell items, cars, and properties, and find or offer jobs in your area, all on the website.
James Hayward: It provides an innovative database approach to residential sales, lettings & management.
Free Data Source: Company Directory and Review
LinkedIn: LinkedIn is a business- and employment-oriented social networking service that operates via websites and mobile apps. It has 500 million members in 200 countries, and you can find a business directory here.
OpenCorporates: OpenCorporates is the largest open database of companies and company data in the world, with in excess of 100 million companies in a similarly large number of jurisdictions. Its primary goal is to make information on companies more usable and more widely available for the public benefit, particularly to tackle the use of companies for criminal or anti-social purposes, for example corruption, money laundering and organised crime.
Yellowpages: The original source to find and connect with local plumbers, handymen, mechanics, attorneys, dentists, and more.
Craigslist: Craigslist is an American classified advertisements website with sections devoted to jobs, housing, personals, for sale, items wanted, services, community, gigs, résumés, and discussion forums.
GAF Master Elite Contractor: Founded in 1886, GAF has become North America’s largest manufacturer of commercial and residential roofing (Source: Fredonia Group study). The company attributes its growth to nearly $3 billion in sales to a relentless pursuit of quality, combined with industry-leading expertise and comprehensive roofing solutions. Jim Schnepper is the President of GAF, an operating subsidiary of Standard Industries.
CertainTeed: You can find contractors, remodelers, installers or builders in the US or Canada for your residential or commercial project here.
Manta: Manta is one of the largest online resources that deliver products, services and educational opportunities. The Manta directory boasts millions of unique visitors every month who search its comprehensive database for individual businesses, industry segments and geographic-specific listings.
Kansas Bar Association: Directory for lawyers. The Kansas Bar Association (KBA) was founded in 1882 as a voluntary association for dedicated legal professionals and has more than 7,000 members, including lawyers, judges, law students, and paralegals.
Free Data Source: Other Portal Websites
Capterra: Directory of business software and reviews.
Monster: Data source for jobs and career opportunities.
Glassdoor: Directory of jobs, plus the inside scoop on companies, with employee reviews, personalized salary tools, and more.
Some tasks are common to almost all users, though, regardless of subject area: data import, data wrangling and data visualization. The table below shows my favorite go-to packages for each of these three tasks (plus a few miscellaneous ones tossed in). The package names in the table are clickable if you want more information. To find out more about a package once you’ve installed it, type help(package = "packagename") in your R console (of course substituting the actual package name).
As Python has gained a lot of traction in the data science industry in recent years, I wanted to outline some of its most useful libraries for data scientists and engineers, based on recent experience.
And, since all of the libraries are open source, we have added commit counts, contributor counts and other metrics from GitHub, which can serve as proxy metrics for library popularity.
1. NumPy (Commits: 15980, Contributors: 522)
When starting out with scientific tasks in Python, one inevitably turns to Python’s SciPy Stack, a collection of software specifically designed for scientific computing in Python (not to be confused with the SciPy library, which is part of this stack, or with the community around the stack). So we want to start with a look at it. However, the stack is pretty vast, with more than a dozen libraries in it, so we want to focus on the core packages (particularly the most essential ones).
The most fundamental package, around which the scientific computation stack is built, is NumPy (short for Numerical Python). It provides an abundance of useful features for operations on n-dimensional arrays and matrices in Python. The library provides vectorization of mathematical operations on the NumPy array type, which improves performance and speeds up execution accordingly.
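As a minimal sketch of what vectorization means in practice, the snippet below applies one arithmetic expression to a whole array at once and compares it with an explicit Python loop. The height values and variable names are made up for illustration.

```python
import numpy as np

# Hypothetical data: heights in centimetres (values made up for illustration).
heights_cm = np.array([150.0, 162.5, 171.0, 180.3, 190.2])

# Vectorized: one expression converts every element, with the loop running
# inside NumPy's compiled code rather than in the Python interpreter.
heights_in = heights_cm / 2.54

# Equivalent pure-Python loop, shown only for comparison.
heights_in_loop = [h / 2.54 for h in heights_cm.tolist()]

# The same style works on n-dimensional arrays and on reductions.
grid = np.arange(6).reshape(2, 3)   # 2x3 array: [[0, 1, 2], [3, 4, 5]]
column_sums = grid.sum(axis=0)      # sums down each column: [3, 5, 7]
```

On arrays this small the speed difference is negligible; the benefit of the vectorized form tends to show up as arrays grow to many thousands of elements.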
2. SciPy (Commits: 17213, Contributors: 489)
SciPy is a library of software for engineering and science. Again, you need to understand the difference between the SciPy Stack and the SciPy library. SciPy contains modules for linear algebra, optimization, integration, and statistics. The main functionality of the SciPy library is built upon NumPy, so it makes substantial use of NumPy arrays. It provides efficient numerical routines, such as numerical integration, optimization, and many others, via its specific submodules. The functions in all submodules of SciPy are well documented, which is another point in its favor.
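To give a feel for the submodules mentioned above, here is a small sketch that uses `scipy.integrate` and `scipy.optimize` on two toy problems with known answers. The functions being integrated and minimized are chosen purely for illustration.

```python
import numpy as np
from scipy import integrate, optimize

# Numerical integration: the integral of sin(x) from 0 to pi is exactly 2.
area, abs_error = integrate.quad(np.sin, 0.0, np.pi)

# Optimization: minimize a simple quadratic whose minimum sits at x = 3.
result = optimize.minimize_scalar(lambda x: (x - 3.0) ** 2 + 1.0)

print(f"integral ~ {area:.6f} (estimated error {abs_error:.1e})")
print(f"minimum near x = {result.x:.4f}, f(x) = {result.fun:.4f}")
```

Both routines accept ordinary Python callables, including NumPy functions, which is one way SciPy builds on NumPy.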
Python caught up with R and (barely) overtook it; Deep Learning usage surges to 32%; RapidMiner remains top general Data Science platform; Five languages of Data Science.
The 18th annual KDnuggets Software Poll again got huge participation from the analytics and data science community and vendors, attracting about 2,900 voters, almost exactly the same as last year. Here is the initial analysis, with more detailed results to be posted later.
Python, whose share has been growing faster than R for the last several years, has finally caught up with R, and (barely) overtook it, with 52.6% share vs 52.1% for R.
The biggest surprise is probably the phenomenal share of Deep Learning tools, now used by 32% of all respondents, while only 18% used DL in 2016 and 9% in 2015. Google TensorFlow rapidly became the leading Deep Learning platform with 20.2% share, up from only 6.8% in the 2016 poll, and entered the top 10 tools.
RapidMiner remains the most popular general platform for data mining/data science, with about 33% share, almost exactly the same as in 2016.
We note that many vendors have encouraged their users to vote, but all vendors had equal chances, so this does not violate KDnuggets guidelines. We have not seen any bot voting or direct links to vote for only one tool this year.
Spark grew to about 23% and kept its place in the top 10 ahead of Hadoop.
Besides TensorFlow, another new tool in the top tier is Anaconda, with 22% share.
Top Analytics/Data Science Tools
Fig 1: KDnuggets Analytics/Data Science 2017 Software Poll: top tools in 2017, and their share in the 2015-6 polls
From data scooping to facial recognition, Amazon’s latest additions give devs new, wide-ranging powers in the cloud
In the beginning, life in the cloud was simple. Type in your credit card number and—voilà—you had root on a machine you didn’t have to unpack, plug in, or bolt into a rack.
That has changed drastically. The cloud has grown so complex and multifunctional that it’s hard to jam all the activity into one word, even a word as protean and unstructured as “cloud.” There are still root logins on machines to rent, but there are also services for slicing, dicing, and storing your data. Programmers don’t need to write and install as much as subscribe and configure.
Here, Amazon has led the way. That’s not to say there isn’t competition. Microsoft, Google, IBM, Rackspace, and Joyent are all churning out brilliant solutions and clever software packages for the cloud, but no company has done more to create feature-rich bundles of services for the cloud than Amazon. Now Amazon Web Services is zooming ahead with a collection of new products that blow apart the idea of the cloud as a blank slate. With the latest round of tools for AWS, the cloud is that much closer to becoming a concierge waiting for you to wave your hand and give it simple instructions.
Here are 10 new services that show how Amazon is redefining what computing in the cloud can be.
Anyone who has done much data science knows it’s often more challenging to collect data than it is to perform analysis. Gathering data and putting it into a standard data format is often more than 90 percent of the job.
Glue is a new collection of Python scripts that automatically crawls your data sources, collects data, applies any necessary transforms, and sticks it in Amazon’s cloud. It reaches into your data sources, snagging data using all the standard acronyms, like JSON, CSV, and JDBC. Once it grabs the data, it can analyze the schema and make suggestions.
The Python layer is interesting because you can use it without writing or understanding Python—although it certainly helps if you want to customize what’s going on. Glue will run these jobs as needed to keep all the data flowing. It won’t think for you, but it will juggle many of the details, leaving you to think about the big picture.
Field Programmable Gate Arrays have long been a secret weapon of hardware designers. Anyone who needs a special chip can build one out of software. There’s no need to build custom masks or fret over fitting all the transistors into the smallest amount of silicon. An FPGA takes your software description of how the transistors should work and rewires itself to act like a real chip.
Amazon’s new AWS EC2 F1 brings the power of FPGA to the cloud. If you have highly structured and repetitive computing to do, an EC2 F1 instance is for you. With EC2 F1, you can create a software description of a hypothetical chip and compile it down to a tiny number of gates that will compute the answer in the shortest amount of time. The only thing faster is etching the transistors in real silicon.
Who might need this? Bitcoin miners compute the same cryptographically secure hash function a bazillion times each day, which is why many bitcoin miners use FPGAs to speed up the search. If you have a similarly compact, repetitive algorithm that you could write into silicon, the FPGA instance lets you rent machines to do it now. The biggest winners are those who need to run calculations that don’t map easily onto standard instruction sets, for example when you’re dealing with bit-level functions and other nonstandard, nonarithmetic calculations. If you’re simply adding a column of numbers, the standard instances are better for you. But for some, EC2 with FPGA might be a big win.
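To make the "same hash function a bazillion times" workload concrete, here is a toy sketch of a brute-force hash search in plain Python. It is only loosely modeled on mining: the input bytes, the 8-byte nonce encoding, and the "leading zero hex digits" rule are simplifying assumptions, not Bitcoin's real header format or difficulty target.

```python
import hashlib

def find_nonce(data, zero_hex_digits):
    """Try nonces 0, 1, 2, ... until the double SHA-256 digest of
    data + nonce starts with the requested number of zero hex digits."""
    prefix = "0" * zero_hex_digits
    nonce = 0
    while True:
        payload = data + nonce.to_bytes(8, "little")
        digest = hashlib.sha256(hashlib.sha256(payload).digest()).hexdigest()
        if digest.startswith(prefix):
            return nonce, digest
        nonce += 1

# Hypothetical input; 4 zero hex digits needs roughly 65,000 tries on average.
nonce, digest = find_nonce(b"example block header", 4)
print(nonce, digest)
```

The inner loop is a fixed sequence of bit-level operations repeated with a changing counter, which is the kind of compact, repetitive computation the text describes as a good fit for an FPGA.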
As Docker eats its way into the stack, Amazon is trying to make it easier for anyone to run Docker instances anywhere, anytime. Blox is designed to juggle the clusters of instances so that the optimum number are running—no more, no less.
Blox is event driven, so it’s a bit simpler to write the logic. You don’t need to constantly poll the machines to see what they’re running. They all report back, so the right number can run. Blox is also open source, which makes it easier to reuse Blox outside of the Amazon cloud, if you should need to do so.
Monitoring the efficiency and load of your instances used to be simply another job. If you wanted your cluster to work smoothly, you had to write the code to track everything. Many people brought in third parties with impressive suites of tools. Now Amazon’s X-Ray is offering to do much of the work for you. It’s competing with many third-party tools for watching your stack.
When a website gets a request for data, X-Ray traces it as it flows through your network of machines and services. Then X-Ray will aggregate the data from multiple instances, regions, and zones so that you can stop in one place to flag a recalcitrant server or a wedged database. You can watch your vast empire with only one page.
Rekognition is a new AWS tool aimed at image work. If you want your app to do more than store images, Rekognition will chew through images searching for objects and faces using some of the best-known and tested machine vision and neural-network algorithms. There’s no need to spend years learning the science; you simply point the algorithm at an image stored in Amazon’s cloud, and voilà, you get a list of objects and a confidence score that indicates how likely it is that each answer is correct. You pay per image.
The algorithms are heavily tuned for facial recognition. The algorithms will flag faces, then compare them to each other and to reference images to help you identify them. Your application can store the meta information about the faces for later processing. Once you put a name to the metadata, your app will find people wherever they appear. Identification is only the beginning. Is someone smiling? Are their eyes closed? The service will deliver the answer, so you don’t need to get your fingers dirty with pixels. If you want to use impressive machine vision, Amazon will charge you not by the click but by the glance at each image.
Working with Amazon’s S3 has always been simple. If you want a data structure, you request it and S3 looks for the part you want. Amazon’s Athena now makes it much simpler. It will run the queries on S3, so you don’t need to write the looping code yourself. Yes, we’ve become too lazy to write loops.
Athena uses SQL syntax, which should make database admins happy. Amazon will charge you for every byte that Athena churns through while looking for your answer. But don’t get too worried about the meter running out of control because the price is only $5 per terabyte. That’s about half a billionth of a cent per byte. It makes the penny candy stores look expensive.
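The per-byte arithmetic is easy to check. The sketch below works from the $5 per terabyte price quoted in the text and assumes a decimal terabyte (10^12 bytes); any minimum charge or rounding the service may apply is ignored, and the function name is made up for illustration.

```python
PRICE_PER_TB_USD = 5.0     # price quoted in the text
BYTES_PER_TB = 10 ** 12    # assumption: decimal terabyte

def scan_cost_usd(bytes_scanned):
    """Estimated charge for a query that scans the given number of bytes."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD

# Cost of a single byte, in cents: 5e-10, i.e. half a billionth of a cent.
cents_per_byte = scan_cost_usd(1) * 100

# A hypothetical query that scans 50 GB costs about a quarter.
cost_50_gb = scan_cost_usd(50 * 10 ** 9)

print(f"{cents_per_byte:.1e} cents per byte, ${cost_50_gb:.2f} for 50 GB")
```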
The original idea of a content delivery network was to speed up the delivery of simple files like JPG images and CSS files by pushing out copies to a vast array of content servers parked near the edges of the Internet. Amazon is taking this a step further by letting us push Node.js code out to these edges, where it will run and respond. Your code won’t sit on one central server waiting for the requests to poke along the backbone from people around the world. It will clone itself, so it can respond in microseconds without being impeded by all that network latency.
Amazon will bill your code only when it’s running. You won’t need to set up separate instances or rent out full machines to keep the service up. It is currently in a closed test, and you must apply to get your code in their stack.
If you want some kind of physical control of your data, the cloud isn’t for you. The power and reassurance that comes from touching the hard drive, DVD-ROM, or thumb drive holding your data isn’t available to you in the cloud. Where is my data exactly? How can I get it? How can I make a backup copy? The cloud makes anyone who cares about these things break out in cold sweats.
The Snowball Edge is a box filled with data that can be delivered anywhere you want. It even has a shipping label that’s really an E-Ink display exactly like Amazon puts on a Kindle. When you want a copy of massive amounts of data that you’ve stored in Amazon’s cloud, Amazon will copy it to the box and ship the box to wherever you are. (The documentation doesn’t say whether Prime members get free shipping.)
Snowball Edge serves a practical purpose. Many developers have collected large blocks of data through cloud applications and downloading these blocks across the open internet is far too slow. If Amazon wants to attract large data-processing jobs, it needs to make it easier to get large volumes of data out of the system.
If you’ve accumulated an exabyte of data that you need somewhere else for processing, Amazon has a bigger version called Snowmobile that’s built into an 18-wheel truck complete with GPS tracking.
Oh, it’s worth noting that the boxes aren’t dumb storage boxes. They can run arbitrary Node.js code too so you can search, filter, or analyze … just in case.
Once you’ve amassed a list of customers, members, or subscribers, there will be times when you want to push a message out to them. Perhaps you’ve updated your app or want to convey a special offer. You could blast an email to everyone on your list, but that’s a step above spam. A better solution is to target your message, and Amazon’s new Pinpoint tool offers the infrastructure to make that simpler.
You’ll need to integrate some code with your app. Once you’ve done that, Pinpoint helps you send out the messages when your users seem ready to receive them. Once you’re done with a so-called targeted campaign, Pinpoint will collect and report data about the level of engagement with your campaign, so you can tune your targeting efforts in the future.
Who gets the last word? Your app can, if you use Polly, the latest generation of speech synthesis. In goes text and out comes sound—sound waves that form words that our ears can hear, all the better to make audio interfaces for the internet of things.
Bayesian Inference is a way of combining information from data with things we think we already know. For example, if we wanted to get an estimate of the mean height of people, we could use our prior knowledge that people are generally between 5 and 6 feet tall to inform the results from the data we collect. If our prior is informative and we don’t have much data, this will help us to get a better estimate. If we have a lot of data, even if the prior is wrong (say, our population is NBA players), the prior won’t change the estimate much. You might say that including such “subjective” information in a statistical model isn’t right, but there’s subjectivity in the selection of any statistical model. Bayesian Inference makes that subjectivity explicit.
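The height example can be sketched with the simplest textbook case: a normal prior on the mean combined with normally distributed data whose spread is assumed known. All numbers below (prior of 5.5 ft, sample mean of 6.6 ft, and the standard deviations) are hypothetical values chosen for illustration.

```python
def posterior_mean_height(prior_mean, prior_sd, data_mean, data_sd, n):
    """Conjugate normal-normal update for an unknown mean.

    Precisions (1 / variance) add, and the posterior mean is the
    precision-weighted average of the prior mean and the sample mean.
    """
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = n / data_sd ** 2
    post_precision = prior_precision + data_precision
    post_mean = (prior_precision * prior_mean
                 + data_precision * data_mean) / post_precision
    post_sd = post_precision ** -0.5
    return post_mean, post_sd

# Prior belief: mean height around 5.5 ft, give or take 0.25 ft.
# Data: a sample of NBA players averaging 6.6 ft (spread assumed 0.3 ft).
few = posterior_mean_height(5.5, 0.25, 6.6, 0.3, n=2)     # mean about 6.14
many = posterior_mean_height(5.5, 0.25, 6.6, 0.3, n=500)  # mean about 6.60

print(few, many)
```

With two observations the estimate lands between the prior and the data; with 500 it is almost entirely driven by the data, which is the behavior the paragraph describes.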
IBM announced that Watson Analytics, a breakthrough natural language-based cognitive service that can provide instant access to powerful predictive and visual analytic tools for businesses, is available in beta. See Vine (vine.co/v/Ov6uvi1m7lT) for a sneak peek now.
I’m pleased to announce that I have my access, and it’s amazing. Uploading raw CSV data and playing with it is a great shortcut to finding insights. It works really well and really quickly.
IBM Watson Analytics automates the once time-consuming tasks such as data preparation, predictive analysis, and visual storytelling for business professionals. Offered as a cloud-based freemium service, all business users can now access Watson Analytics from any desktop or mobile device. Since being announced on September 16, more than 22,000 people have already registered for the beta. The Watson Analytics Community, a user group for sharing news, best practices, technical support and training, is also accessible starting today.
This news follows IBM’s recently announced global partnership with Twitter, which includes plans to offer Twitter data as part of IBM Watson Analytics.
Learn more about how IBM Watson Analytics works:
As part of its effort to equip all professionals with the tools needed to do their jobs better, Watson Analytics provides business professionals with a unified experience and natural language dialogue so they can better understand data and more quickly reach business goals. For example, a marketing, HR or sales rep can quickly source data, cleanse and refine it, discover insights, predict outcomes, visualize results, create reports and dashboards and explain results in familiar business terms.
Once I paid attention, it started to feel like there were a whole lot of national food and drink days in the United States. National Chili Dog Day. National Donut Day. National Beer Day. I’m totally for this, as I will accept any excuse to consume any of these items. But still, there seems to be a lot.
According to this list, 214 days of the year are a food or drink day. Every single day of July was one. How is one to keep track? I gotta plan, you know?
So here are all the days in calendar form. Today, August 18, is National Pinot Noir Day. Tomorrow is National Potato Day.
You’ve heard of the “margin of error” in polling. Just about every article on a new poll dutifully notes that the margin of error due to sampling is plus or minus three or four percentage points.
But in truth, the “margin of sampling error” – basically, the chance that polling different people would have produced a different result – doesn’t even come close to capturing the potential for error in surveys.
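For reference, the headline "margin of sampling error" usually comes from a short formula. The sketch below assumes a simple random sample, a 95 percent confidence level (z = 1.96), and the worst case of a 50/50 split; the sample sizes are hypothetical.

```python
import math

def margin_of_sampling_error(n, p=0.5, z=1.96):
    """Half-width of the confidence interval for a proportion p
    estimated from a simple random sample of n respondents."""
    return z * math.sqrt(p * (1.0 - p) / n)

for n in (600, 1000):
    moe = margin_of_sampling_error(n)
    print(f"n = {n}: plus or minus {moe * 100:.1f} points")
```

Samples of 600 and 1,000 respondents give roughly four and three points respectively, which is where the familiar "three or four percentage points" comes from. The formula says nothing about the weighting and likely-voter decisions discussed here.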
Polling results rely as much on the judgments of pollsters as on the science of survey methodology. Two good pollsters, both looking at the same underlying data, could come up with two very different results.
How so? Because pollsters make a series of decisions when designing their survey, from determining likely voters to adjusting their respondents to match the demographics of the electorate. These decisions are hard. They usually take place behind the scenes, and they can make a huge difference.
To illustrate this, we decided to conduct a little experiment. On Monday, in partnership with Siena College, the Upshot published a poll of 867 likely Florida voters. Our poll showed Hillary Clinton leading Donald J. Trump by one percentage point.
We decided to share our raw data with four well-respected pollsters and asked them to estimate the result of the poll themselves.
• Margie Omero, Robert Green and Adam Rosenblatt, of Penn Schoen Berland Research, a Democratic polling and research firm that conducted surveys for Mrs. Clinton in 2008.
• Sam Corbett-Davies, Andrew Gelman and David Rothschild, of Stanford University, Columbia University and Microsoft Research. They’re at the forefront of using statistical modeling in survey research.
Here’s what they found:
Charles Franklin: Clinton +3
Patrick Ruffini: Clinton +1
Omero, Green, Rosenblatt (Penn Schoen Berland Research): Clinton +4
Corbett-Davies, Gelman, Rothschild (Stanford University/Columbia University/Microsoft Research): Trump +1
NYT Upshot/Siena College: Clinton +1
Well, well, well. Look at that. A net five-point difference between the five measures, including our own, even though all are based on identical data. Remember: There are no sampling differences in this exercise. Everyone is coming up with a number based on the same interviews.
Their answers shouldn’t be interpreted as an indication of what they would have found if they had conducted their own survey. They all would have designed the survey at least a little differently – some almost entirely differently.
But their answers illustrate just a few of the different ways that pollsters can handle the same data – and how those choices can affect the result.
So what’s going on? The pollsters made different decisions in adjusting the sample and identifying likely voters. The result was four different electorates, and four different results.
There are two basic kinds of choices that our participants are making: one about adjusting the sample and one about identifying likely voters.
How to make the sample representative?
Pollsters usually make statistical adjustments to make sure that their sample represents the population – in this case, voters in Florida. They usually do so by giving more weight to respondents from underrepresented groups. But this is not so simple.
What source? Most public pollsters try to reach every type of adult at random and adjust their survey samples to match the demographic composition of adults in the census. Most campaign pollsters take surveys from lists of registered voters and adjust their sample to match information from the voter file.
Which variables? What types of characteristics should the pollster weight by? Race, sex and age are very standard. But what about region, party registration, education or past turnout?
How? There are subtly different ways to weight a survey. One of our participants doesn’t actually weight the survey in a traditional sense, but builds a statistical model to make inferences about all registered voters (the same technique that yields our pretty dot maps).
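To make the idea of weighting concrete, here is a minimal Python sketch of its simplest form, weighting on a single variable. The respondents and the target shares are invented for illustration and are not taken from the Upshot/Siena data; real pollsters weight on several variables at once.

```python
from collections import Counter

# Hypothetical respondents: (age_group, candidate). Invented for illustration.
respondents = [
    ("under_45", "A"), ("under_45", "A"),
    ("45_plus", "A"), ("45_plus", "B"), ("45_plus", "B"),
    ("45_plus", "B"), ("45_plus", "A"), ("45_plus", "B"),
]

# Assumed population shares, e.g. from the census or a voter file.
target_share = {"under_45": 0.40, "45_plus": 0.60}

def group_weights(sample, targets):
    """Weight per group = target share / share observed in the sample."""
    counts = Counter(group for group, _ in sample)
    n = len(sample)
    return {g: targets[g] / (counts[g] / n) for g in counts}

def weighted_share(sample, weights, candidate):
    """Weighted fraction of respondents who support `candidate`."""
    total = sum(weights[g] for g, _ in sample)
    support = sum(weights[g] for g, c in sample if c == candidate)
    return support / total

weights = group_weights(respondents, target_share)
raw_a = sum(1 for _, c in respondents if c == "A") / len(respondents)
weighted_a = weighted_share(respondents, weights, "A")
print(f"unweighted A: {raw_a:.1%}, weighted A: {weighted_a:.1%}")
```

In this toy sample the underrepresented younger group counts for more after weighting, which moves candidate A's share even though no interview changed.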
Who is a likely voter?
There are two basic ways that our participants selected likely voters:
Self-reported vote intention: Public pollsters often use the self-reported vote intention of respondents to choose who is likely to vote and who is not.
Vote history: Partisan pollsters often use voter file data on the past vote history of registered voters to decide who is likely to cast a ballot, since past turnout is a strong predictor of future turnout.
Our participants’ choices
The participants split across all these choices.
[Table: for each participant (Omero, Green, Rosenblatt of Penn Schoen Berland Research; Corbett-Davies, Gelman, Rothschild of Stanford University/Columbia University/Microsoft Research; NYT Upshot/Siena College), the columns “Who is a likely voter?”, “Type of weight” and “Tries to match…”. The only cell value that survives is “Report + history”.]
Their varying decisions on these questions add up to big differences in the result. In general, the participants who used vote history in the likely-voter model showed a better result for Mr. Trump.
At the end of this article, we’ve posted detailed methodological choices of each of our pollsters. Before that, a few of my own observations from this exercise:
• These are all good pollsters, who made sensible and defensible decisions. I have seen polls that make truly outlandish decisions with the potential to produce even greater variance than this.
• Clearly, the reported margin of error due to sampling, even when including a design effect (which purports to capture the added uncertainty of weighting), doesn’t even come close to capturing total survey error. That’s why we didn’t report a margin of error in our original article.
• You can see why “herding,” the phenomenon in which pollsters make decisions that bring them close to expectations, can be such a problem. There really is a lot of flexibility for pollsters to make choices that generate a fundamentally different result. And I get it: If our result had come back as “Clinton +10,” I would have dreaded having to publish it.
• You can see why we say it’s best to average polls, and to stop fretting so much about single polls.
Finally, a word of thanks to the four pollsters for joining us in this exercise. Election season is as busy for pollsters as it is for political journalists. We’re grateful for their time.
Below, the methodological choices of the other pollsters.
Charles Franklin Clinton +3
Mr. Franklin approximated the approach of a traditional pollster and did not use any of the information on the voter registration file. He weighted the sample to an estimate of the demographic composition of Florida’s registered voters in 2016, based on census data, by age, sex, education, gender and race. Mr. Franklin’s likely voters were those who said they were “almost certain” to vote.
Patrick Ruffini Clinton +1
Mr. Ruffini weighted the sample by voter file data on age, race, gender and party registration. He next added turnout scores: an estimate for how likely each voter is to turn out, based exclusively on their voting history. He then weighted the sample to the likely turnout profile of both registered and likely voters – basically making sure that there were the right number of likely and unlikely voters in the voter file. This is probably the approach most similar to the Upshot/Siena methodology, so it is not surprising that it also is the closest result.
Sam Corbett-Davies, Andrew Gelman and David Rothschild Trump +1
Stanford University/Columbia University/Microsoft Research
Long story short: They built a model that tries to figure out what characteristics predict support for Mrs. Clinton and Mr. Trump based on many of the same variables used for weighting. They then predicted how every person in the state would vote, based on that model. It’s the same approach we used to make the pretty dot maps of Florida. The likely electorate was determined exclusively by vote history, not self-reported vote intention. They included 2012 voters – which is why their electorate has more black voters than the others – and then included newly registered voters according to a model of voting history based on registration.
Margie Omero, Robert Green, Adam Rosenblatt Clinton +4
Penn Schoen Berland Research
The sample was weighted to state voter file data for party registration, gender, race and ethnicity. They then excluded the people who said they were unlikely to vote. These self-reported unlikely voters were 7 percent of the sample, so this is the most permissive likely voter screen of the groups. In part as a result, it’s also Mrs. Clinton’s best performance. In an email, Ms. Omero noted that every scenario they examined showed an advantage for Clinton.
When assessing academic studies, media members are often confronted by pages not only full of numbers, but also loaded with concepts such as “selection bias,” “p-value” and “statistical inference.”
Statistics courses are available at most universities, of course, but are often viewed as something to be taken, passed and quickly forgotten. However, for media members and public communicators of many kinds it is imperative to do more than just read study abstracts; understanding the methods and concepts that underpin academic studies is essential to being able to judge the merits of a particular piece of research. Even if one can’t master statistics, knowing the basic language can help in formulating better, more critical questions for experts, and it can foster deeper thinking, and skepticism, about findings.
Further, the emerging field of data journalism requires that reporters bring more analytical rigor to the increasingly large amounts of numbers, figures and data they use. Grasping some of the academic theory behind statistics can help ensure that rigor.
Most studies attempt to establish a correlation between two variables: for example, how having good teachers might be “associated with” (a phrase often used by academics) better outcomes later in life, or how the weight of a car is associated with fatal collisions. But detecting such a relationship is only a first step; the ultimate goal is to determine causation: that one of the two variables drives the other. There is a time-honored phrase to keep in mind: “Correlation is not causation.” (This can be usefully amended to “correlation is not necessarily causation,” as the nature of the relationship needs to be determined.)
Another key distinction to keep in mind is that studies can either explore observed data (descriptive statistics) or use observed data to predict what is true of areas beyond the data (inferential statistics). The statement “From 2000 to 2005, 70% of the land cleared in the Amazon and recorded in Brazilian government data was transformed into pasture” is a descriptive statistic; “Receiving your college degree increases your lifetime earnings by 50%” is an inferential statistic.
Here are some other basic statistical concepts with which journalism students and working journalists should be familiar:
A sample is a portion of an entire population. Inferential statistics seek to make predictions about a population based on the results observed in a sample of that population.
There are two primary types of population samples: random and stratified. For a random sample, study subjects are chosen completely by chance, while a stratified sample is constructed to reflect the characteristics of the population at large (gender, age or ethnicity, for example). There is a wide range of sampling methods, each with its advantages and disadvantages.
Attempting to extend the results of a sample to a population is called generalization. This can be done only when the sample is truly representative of the entire population.
Generalizing results from a sample to the population must take into account sample variation. Even if the sample selected is completely random, there is still a degree of variance within the population that will require your results from within a sample to include a margin of error. For example, the results of a poll of likely voters could give the margin of error in percentage points: “47% of those polled said they would vote for the measure, with a margin of error of 3 percentage points.” Thus, if the actual percentage voting for the measure was as low as 44% or as high as 50%, this result would be consistent with the poll.
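As a rough illustration of where a figure like “3 percentage points” comes from, the sketch below uses the standard formula for a simple random sample at a 95% confidence level. The sample size of 1,000 is an assumption made for this example; the text above does not give one.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Margin of error for a proportion p from a simple random sample of size n.

    z = 1.96 corresponds to a 95% confidence level.
    """
    return z * math.sqrt(p * (1 - p) / n)

share = 0.47   # 47% said they would vote for the measure
n = 1000       # assumed sample size, not stated in the text

moe = margin_of_error(share, n)
low, high = share - moe, share + moe
print(f"margin of error: {moe:.1%}, range: {low:.1%} to {high:.1%}")

# Quadrupling the sample size halves the margin of error.
moe_4x = margin_of_error(share, 4 * n)
print(f"with {4 * n} respondents: {moe_4x:.1%}")
```

This covers sampling error only; it says nothing about the weighting and likely-voter choices that can move a result further.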
The greater the sample size, the more representative it tends to be of a population as a whole. Thus the margin of error falls at a given confidence level, or equivalently the confidence level rises for a given margin of error.
Significance tests of the study’s results determine the probability of seeing such results if the null hypothesis were true; the p-value indicates how unlikely this would be. If the p-value is 0.05, there is only a 5% probability of seeing such “interesting” results if the null hypothesis were true; if the p-value is 0.01, there is only a 1% probability.
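A small worked case may help. The sketch below computes an exact one-sided p-value for a hypothetical experiment: a coin lands heads 15 times in 20 flips, and the null hypothesis is that the coin is fair. The scenario is invented for illustration.

```python
from math import comb

def one_sided_p_value(heads, flips, p_null=0.5):
    """Probability of at least `heads` heads in `flips` flips under the null."""
    return sum(
        comb(flips, k) * p_null**k * (1 - p_null) ** (flips - k)
        for k in range(heads, flips + 1)
    )

p_value = one_sided_p_value(15, 20)
print(f"p-value: {p_value:.4f}")
print("below 0.05" if p_value < 0.05 else "not below 0.05")
```

A small p-value says the result would be unusual if the coin were fair; it does not by itself say how biased the coin is.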
The other threat to a sample’s validity is the notion of bias. Bias comes in many forms, but the most common is based on the selection of subjects. For example, if subjects self-select into a sample group, then the results are no longer externally valid, as the type of person who wants to be in a study is not necessarily similar to the population that we are seeking to draw inference about.
When two variables move together, they are said to be correlated. Positive correlation means that as one variable rises or falls, the other does as well (caloric intake and weight, for example). Negative correlation indicates that two variables move in opposite directions (say, vehicle speed and travel time). So if a scholar writes “Income is negatively correlated with poverty rates,” what he or she means is that as income rises, poverty rates fall.
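The direction and strength of a correlation are usually summarized with the Pearson correlation coefficient, which runs from -1 to +1. Here is a minimal sketch; the calorie and weight figures are invented, and the travel times assume a hypothetical 120-mile trip.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: +1 is perfectly positive, -1 perfectly negative."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical 120-mile trip driven at different average speeds.
speeds_mph = [30, 40, 50, 60]
hours = [120 / s for s in speeds_mph]

# Invented figures for daily calories and body weight in pounds.
calories = [1800, 2200, 2600, 3000]
weight_lb = [150, 160, 175, 185]

r_speed_time = pearson_r(speeds_mph, hours)
r_cal_weight = pearson_r(calories, weight_lb)
print(f"speed vs. travel time: {r_speed_time:.2f}")
print(f"calories vs. weight:   {r_cal_weight:.2f}")
```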
Causation is when change in one variable alters another. For example, air temperature and sunlight are correlated (when the sun is up, temperatures rise), but causation flows in only one direction. This is also known as cause and effect.
Regression analysis is a way to determine if there is or isn’t a correlation between two (or more) variables and how strong any correlation may be. At its most basic, this involves plotting data points on an X/Y axis (in our example cited above, vehicle weight and fatal accidents) and looking for the average relationship between them. This means looking at how the graph’s dots are distributed and establishing a trend line. Again, correlation isn’t necessarily causation.
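The trend line is typically fitted by ordinary least squares, which picks the line that minimizes the squared vertical distances to the points. A minimal sketch with made-up data points (they do not represent vehicle weights or accidents):

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Invented data points for illustration only.
xs = [1, 2, 3, 4]
ys = [3.1, 4.9, 7.2, 8.8]

slope, intercept = fit_line(xs, ys)
print(f"trend line: y = {slope:.2f} * x + {intercept:.2f}")
```

The slope describes how much y changes, on average, for each one-unit change in x; it describes an association, not proof that x causes y.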
While causation is sometimes easy to prove, it is often difficult to establish because of confounding variables (unknown factors that affect the two variables being studied). Studies require well-designed and executed experiments to ensure that the results are reliable.
When causation has been established, the factor that drives change (in the above example, sunlight) is the independent variable. The variable that is driven is the dependent variable.
Elasticity, a term frequently used in economics studies, measures how much a change in one variable affects another. For example, if the price of vegetables rises 10% and consumers respond by cutting back purchases by 10%, the price elasticity of demand is 1.0 in absolute terms: the percentage increase in price equals the percentage drop in consumption. But if purchases fall by 15%, the elasticity is 1.5, and consumers are said to be “price sensitive” for that item. If consumption were to fall only 5%, the elasticity is 0.5 and consumers are “price insensitive”: a rise in price of a certain amount doesn’t reduce purchases to the same degree.
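The arithmetic in the vegetable example is a simple ratio of percentage changes. A minimal sketch, reported as an absolute value to match the figures in the text (the function name is chosen here for illustration):

```python
def elasticity(pct_change_quantity, pct_change_price):
    """Absolute ratio of the % change in quantity to the % change in price."""
    return abs(pct_change_quantity / pct_change_price)

# Price rises 10% in each case; purchases fall by 10%, 15% and 5%.
for quantity_change in (-10, -15, -5):
    e = elasticity(quantity_change, 10)
    label = "price sensitive" if e > 1 else "price insensitive" if e < 1 else "unit elastic"
    print(f"purchases {quantity_change}%: elasticity {e:.1f} ({label})")
```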
Standard deviation provides insight into how much variation there is within a group of values. It measures the deviation (difference) from the group’s mean (average).
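Two groups can share the same mean and still look very different. The sketch below uses Python's built-in statistics module on two invented sets of values to show how standard deviation captures that difference.

```python
import statistics

# Two invented groups with the same mean but different spread.
steady = [48, 50, 52]
volatile = [10, 50, 90]

mean_steady = statistics.mean(steady)
mean_volatile = statistics.mean(volatile)

# pstdev treats the list as the whole population;
# statistics.stdev would treat it as a sample instead.
sd_steady = statistics.pstdev(steady)
sd_volatile = statistics.pstdev(volatile)

print(f"steady:   mean {mean_steady}, standard deviation {sd_steady:.1f}")
print(f"volatile: mean {mean_volatile}, standard deviation {sd_volatile:.1f}")
```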
Be careful to distinguish the following terms as you interpret results: average, mean and median. The first two terms are synonymous, and refer to the average value of a group of numbers. Add up all the figures, divide by the number of values, and that’s the average or mean. A median, on the other hand, is the central value, and can be useful if there’s an extremely high or low value in a collection of values, such as a Wall Street CEO’s salary in a list of people’s incomes. (For more information, read “Math for Journalists.”)
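The effect of a single extreme value is easy to see in a few lines of Python. The incomes below are invented for illustration.

```python
import statistics

# Invented incomes: four ordinary salaries and one CEO.
incomes = [40_000, 45_000, 50_000, 55_000, 20_000_000]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

print(f"mean:   ${mean_income:,.0f}")
print(f"median: ${median_income:,.0f}")
```

The mean is pulled far above what any of the four ordinary earners makes, while the median stays at the middle salary.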
Pay close attention to percentages versus percentage points — they’re not the same thing. For example, if 40 out of 100 homes in a distressed suburb have “underwater” mortgages, the rate is 40%. If a new law allows 10 homeowners to refinance, now only 30 mortgages are troubled. The new rate is 30%, a drop of 10 percentage points (40 – 30 = 10). This is not 10% less than the old rate, however — in fact, the decrease is 25% (10 / 40 = 0.25 = 25%).
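The mortgage example can be checked with a short calculation that keeps the two measures apart (the function names are chosen here for illustration):

```python
def percentage_point_change(old_rate, new_rate):
    """Simple difference between two rates expressed in percent."""
    return new_rate - old_rate

def percent_change(old_rate, new_rate):
    """Relative change, as a percentage of the old rate."""
    return (new_rate - old_rate) / old_rate * 100

old_rate = 40   # 40 of 100 homes underwater
new_rate = 30   # after 10 homeowners refinance

points = percentage_point_change(old_rate, new_rate)
relative = percent_change(old_rate, new_rate)
print(f"change: {points} percentage points, or {relative:.0f}%")
```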
In descriptive statistics, quantiles can be used to divide data into equal-sized subsets. For example, dividing a list of individuals sorted by height into two parts (the tallest and the shortest) results in two quantiles, with the median height value as the dividing line. Quartiles separate a data set into four equal-sized groups, deciles into 10 groups, and so forth. Individual items can be described as being “in the upper decile,” meaning that they fall in the group with the largest values and are higher than 90% of those in the dataset.
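Python's statistics module can compute the cut points between such groups. The heights below are invented, and note that software packages use slightly different interpolation rules, so cut points can differ a little between tools.

```python
import statistics

# Invented heights in centimeters for 20 people.
heights = list(range(150, 190, 2))

# quantiles() returns the cut points between groups: n groups, n - 1 cut points.
quartile_cuts = statistics.quantiles(heights, n=4)
decile_cuts = statistics.quantiles(heights, n=10)

# People in the upper decile are taller than the last decile cut point.
upper_decile = [h for h in heights if h > decile_cuts[-1]]

print("quartile cut points:", quartile_cuts)
print("people in the upper decile:", upper_decile)
```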
Note that understanding statistical terms isn’t a license to freely salt your stories with them. Always work to present studies’ key findings in clear, jargon-free language. You’ll be doing a service not only for your readers, but also for the researchers.
Related: See this more general overview of academic theory and critical reasoning courtesy of MIT’s Stephen Van Evera. A new open, online course offered on Harvard’s EdX platform, “Introduction to Statistics: Inference,” from UC Berkeley professors, explores “statistical ideas and methods commonly used to make valid conclusions based on data from random samples.”
There are lots of good reasons you might want to analyze public data, from detecting salary trends in government data to uncovering insights about a potential investment (or your favorite sports team).
But before you can run analyses and visualize trends, you need to have the data. The packages listed below make it easy to find economic, sports, weather, political and other publicly available data and import it directly into R — in a format that’s ready for you to work your analytics magic.
Packages that are on CRAN can be installed on your system by using the R command install.packages("packageName"); you only need to run this once. GitHub packages are best installed with the devtools package: install that once with install.packages("devtools") and then use it to install packages from GitHub using the format devtools::install_github("repositoryName/packageName"). Once installed, you can load a package into your working session once each session using the format library("packageName").
Some of the sample code below comes from package documentation or blog posts by package authors. For more information about a package, you can run help(package="packageName") in R to get info on functions included in the package and, if available, links to package vignettes (R-speak for additional documentation). To see sample code for a particular function, try example(topic="functionName", package="packageName") or simply ?functionName for all available help about a function including any sample code (not all documentation includes samples).
There are several other R packages that work with data from the U.S. Census, but this aims to be complete and offer data from all the bureau’s APIs, not just from one or two surveys. API key required. GitHub.
Pull historical weather data from cities/airports around the world. CRAN. If you have trouble pulling data, especially on a Mac, try uninstalling and re-installing a different version with the code install_github("ozagordi/weatherData").
We now have our presidential candidates, and for the next few months you get to hear about the changing probability of Hillary Clinton and Donald Trump winning the election. As of this writing, the Upshot estimates a 68% probability for Clinton and 32% for Donald Trump. FiveThirtyEight estimates 52% and 48% for Clinton and Trump, respectively. Forecasts are kind of all over the place this far out from November. Plus, the numbers aren’t especially accurate post-convention.
But the probabilities will start to converge and grow more significant.
So what does it mean when Clinton has a 68% chance of becoming president? What if there were a 90% chance that Trump wins?
Some interpret a high percentage as a landslide, which often isn’t the case with these election forecasts, and it certainly doesn’t mean the candidate with a low chance will lose. If this were the case, the Cleveland Cavaliers would not have beaten the Golden State Warriors, and I would not be sitting here hating basketball.
Fiddle with the probabilities in the graphic simulation here to see what I mean.
Even when you shift the probability far left or far right, the opposing candidate still gets some wins. That doesn’t mean a forecast was wrong. That’s just randomness and uncertainty at play.
The probability estimates the percentage of times you get an outcome if you were to do something multiple times. In the case of Clinton’s 68% chance, run an election hundreds of times, and the statistical model that spit out the percentage thinks that Clinton wins about 68% of those theoretical elections. Conversely, it thinks Trump wins 32% of them.
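The "run the election many times" idea can be tried directly. This is a minimal sketch, not the forecasters' actual models: it treats each theoretical election as a single weighted coin flip, and the number of runs and the random seed are arbitrary choices.

```python
import random

def simulate_elections(win_probability, runs=100_000, seed=2016):
    """Share of simulated elections won by a candidate with `win_probability`."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    wins = sum(1 for _ in range(runs) if rng.random() < win_probability)
    return wins / runs

favorite_share = simulate_elections(0.68)
underdog_share = 1 - favorite_share
print(f"favorite wins {favorite_share:.1%} of simulated elections")
print(f"underdog wins {underdog_share:.1%} of simulated elections")

# Even a 90% favorite loses some of the time.
heavy_favorite_share = simulate_elections(0.90)
print(f"a 90% favorite still loses {1 - heavy_favorite_share:.1%} of the time")
```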
So as we get closer to election day, even if there’s a high probability for one candidate over the other, what I’m saying is — there’s a chance.