Completing your first project is a major milestone on the road to becoming a data scientist and helps to both reinforce your skills and provide something you can discuss during the interview process. It’s also an intimidating process. The first step is to find an appropriate, interesting data set. You should decide how large and how messy a data set you want to work with; while cleaning data is an integral part of data science, you may want to start with a clean data set for your first project so that you can focus on the analysis rather than on cleaning the data.
We’ve selected data sets of varying types and complexity that we think work well for first projects (some of them work for research projects as well!). These data sets cover a variety of sources: demographic data, economic data, text data, and corporate data.
1. United States Census Data
The U.S. Census Bureau publishes reams of demographic data at the state, city, and even ZIP code level. It is a fantastic data set for students interested in creating geographic data visualizations and can be accessed on the Census Bureau website. Alternatively, the data can be accessed via an API. One convenient way to use that API is through the choroplethr R package. In general, this data is very clean, comprehensive, and nuanced, which makes it a good choice for data visualization projects because it requires little manual cleaning.
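If you prefer to call the API directly instead of going through a package, here is a minimal Python sketch of the post-processing step. It assumes the response is a JSON table whose first row holds the column names. The payload, the `POP` column, and every figure below are made-up placeholders, not real Census values.

```python
import json

# Hypothetical payload shaped like a header-row-first JSON table.
# The column names and numbers are placeholders, not real Census figures.
SAMPLE_RESPONSE = '''
[["NAME", "POP", "state"],
 ["Examplia", "1000", "01"],
 ["Samplevania", "2500", "02"]]
'''


def rows_to_records(payload: str) -> list[dict[str, str]]:
    """Convert a header-row-first JSON table into a list of dicts."""
    table = json.loads(payload)
    header, *rows = table
    return [dict(zip(header, row)) for row in rows]


records = rows_to_records(SAMPLE_RESPONSE)
# Values arrive as strings, so cast before doing arithmetic.
total_population = sum(int(r["POP"]) for r in records)
print(records[0]["NAME"], total_population)
```

Keeping each row as a dictionary makes it easy to hand the records to a plotting or mapping library later.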
2. FBI Crime Data
3. CDC Cause of Death
The Centers for Disease Control and Prevention maintains a database on cause of death. The data can be segmented in almost every way imaginable: age, race, year, and so on. Since this is such a massive data set, it’s good to use for data processing projects.
4. Medicare Hospital Quality
The Centers for Medicare & Medicaid Services maintains a database on quality of care at more than 4,000 Medicare-certified hospitals across the U.S., providing for interesting comparisons. Since this data will be spread over multiple files and might take a bit of research to fully understand, this could be a good data cleaning project.
5. SEER Cancer Incidence
The U.S. government also has data about cancer incidence, again segmented by age, race, gender, year, and other factors. It comes from the National Cancer Institute’s Surveillance, Epidemiology, and End Results Program. The data goes back to 1975 and has 18 databases, so you’ll have plenty of options for analysis.
6. Bureau of Labor Statistics
Many important economic indicators for the United States (like unemployment and inflation) can be found on the Bureau of Labor Statistics website. Most of the data can be segmented both by time and by geography. This large data set can be used for data processing and data visualization projects.
7. Bureau of Economic Analysis
The Bureau of Economic Analysis also has national and regional economic data, including gross domestic product and exchange rates. There’s a huge range in the different groups of data found here—you can browse by place, economic accounts, and topics—and these groups are organized into even smaller subsets throughout.
8. IMF Economic Data
For access to global financial statistics and other data, check out the International Monetary Fund’s website. There are a few different sets here, so you can use them for a wide range of projects like visualization or even cleaning.
9. Dow Jones Weekly Returns
Predicting stock prices is a major application of data analysis and machine learning. One relevant data set to explore is the weekly returns of the Dow Jones Index from the Center for Machine Learning and Intelligent Systems at the University of California, Irvine. This is one of the sets specially made for machine learning projects.
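As a warm-up before any modeling, it can help to compute a weekly return by hand. The Python sketch below defines a weekly return as the percent change from a week's opening price to its closing price. That definition and the prices are assumptions for illustration, not values or column definitions taken from the data set.

```python
def weekly_return(open_price: float, close_price: float) -> float:
    """Percent change from the week's open to its close."""
    if open_price <= 0:
        raise ValueError("open_price must be positive")
    return (close_price - open_price) / open_price * 100.0


# Hypothetical (open, close) pairs for three consecutive weeks.
weeks = [(100.0, 102.0), (102.0, 99.96), (99.96, 99.96)]
returns = [round(weekly_return(o, c), 2) for o, c in weeks]
print(returns)
```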
10. British Government Data
The British government’s official data portal offers access to tens of thousands of data sets on topics such as crime, education, transportation, and health. Since this is an open data source with millions of entries, you’ll be able to practice data cleaning across different groupings.
11. Enron Emails
After the collapse of Enron, a free data set of roughly 500,000 emails with message text and metadata was released. The data set is now famous and provides an excellent testing ground for text-related analysis. You can also explore other research uses of this data set through the data set’s page.
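A first step in most text projects is separating metadata from message text. The Python sketch below does this with the standard library's `email` module, assuming each message is stored as raw text with standard headers. The sample message is invented for illustration and is not taken from the corpus.

```python
from email import message_from_string

# A made-up message in standard header/body layout; not from the corpus.
RAW_MESSAGE = """\
From: alice@example.com
To: bob@example.com
Subject: Quarterly numbers
Date: Mon, 14 May 2001 09:30:00 -0700

Please send the quarterly numbers by Friday.
"""

msg = message_from_string(RAW_MESSAGE)
# Header fields become the metadata; the payload is the message text.
metadata = {key: msg[key] for key in ("From", "To", "Subject", "Date")}
body = msg.get_payload()

# A crude whitespace tokenization as a starting point for text analysis.
words = body.lower().split()
print(metadata["Subject"], len(words))
```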
12. Google Books Ngrams
If you’re interested in truly massive data, the Ngram viewer data set counts the frequency of words and phrases by year across a huge number of text sources. The resulting file is 2.2 TB! While this might be difficult to use for a visualization project, it’s an excellent data set for cleaning as it’s nuanced and will require additional research.
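With files this large, loading everything into memory is rarely practical, so a common approach is to process one line at a time. The Python sketch below sums counts per year for a single phrase. The tab-separated layout (ngram, year, match count, volume count) is an assumption about the file format, and the sample lines are made up.

```python
from collections import defaultdict

# Made-up lines; the tab-separated layout (ngram, year, match_count,
# volume_count) is an assumption about the export format.
SAMPLE_LINES = [
    "data science\t2005\t120\t40",
    "data science\t2006\t180\t55",
    "data set\t2005\t900\t300",
    "data science\t2005\t30\t10",
]


def counts_by_year(lines, phrase):
    """Sum match counts per year for one phrase, streaming line by line."""
    totals = defaultdict(int)
    for line in lines:
        ngram, year, match_count, _volume_count = line.rstrip("\n").split("\t")
        if ngram == phrase:
            totals[int(year)] += int(match_count)
    return dict(totals)


print(counts_by_year(SAMPLE_LINES, "data science"))
```

Because `counts_by_year` accepts any iterable of lines, the same function could be passed an open file handle instead of a list.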
13. UNICEF
If data about the lives of children around the world is of interest, UNICEF is the most credible source. The organization’s public data sets touch upon nutrition, immunization, and education, among others, making for a great resource for visualization projects.
14. Reddit Comments
15. Wikipedia
Wikipedia provides instructions for downloading the text of English-language articles, in addition to other projects from the Wikimedia Foundation. The Wikipedia Database Download is available for mirroring and personal use and even has its own open-source application that you can use to download the entirety of Wikipedia to your computer, leaving you with limitless options for processing and cleaning projects.
16. Lending Club
Lending Club provides data about loan applications it has rejected as well as the performance of loans that it has issued. The free data set lends itself both to categorization techniques (will a given loan default?) and to regressions (how much will be paid back on a given loan?).
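The difference between the two framings comes down to the target you derive from each loan. The Python sketch below builds a binary label for classification and a continuous one for regression. The records and field names are hypothetical and are not the actual Lending Club column names.

```python
# Hypothetical loan records; field names are illustrative, not the
# actual Lending Club column names.
loans = [
    {"id": 1, "amount": 10000.0, "repaid": 10000.0, "status": "Fully Paid"},
    {"id": 2, "amount": 5000.0, "repaid": 1250.0, "status": "Charged Off"},
    {"id": 3, "amount": 8000.0, "repaid": 6000.0, "status": "Charged Off"},
]


def classification_target(loan: dict) -> int:
    """Binary label: 1 if the loan defaulted, else 0."""
    return 1 if loan["status"] == "Charged Off" else 0


def regression_target(loan: dict) -> float:
    """Continuous label: fraction of the principal paid back."""
    return loan["repaid"] / loan["amount"]


y_class = [classification_target(loan) for loan in loans]
y_reg = [regression_target(loan) for loan in loans]
print(y_class, y_reg)
```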
17. Walmart
Walmart has released historical sales data for 45 stores located in different regions across the United States. This offers a huge set of data to read and analyze, and many different questions to ask about it, making for a solid resource for data processing projects.
18. Airbnb
Inside Airbnb offers different data sets related to Airbnb listings in dozens of cities around the world. This dataset, given its specificity to the travel industry, is great for practicing your visualization skills.
19. Yelp
Yelp maintains a free dataset for use in personal, educational, and academic purposes. It includes 6 million reviews spanning 189,000 businesses in 10 metropolitan areas. Students are welcome to participate in Yelp’s dataset challenge, giving you quite a few options and an additional incentive for various types of data projects.
20. Google Trends Data
Google has one of the most interesting data sets to analyze. With Google Trends, you can explore different search terms and go as far back as 2004. To analyze the data outside of the Google Trends webpage, all you have to do is download it as a CSV file. You can download data on interest levels for a given search term, interest by location, related topics, categories, search types (video, images, etc.), and more! Google also lists a large collection of publicly available datasets on the Google Public Data Explorer. Make sure to check it out!
21. World Trade Organization
For students looking to learn through analysis, the World Trade Organization offers many data sets available for download that give insight into trade flows and predictions. Those with a knack for business insights will particularly appreciate this dataset, as it provides tons of opportunities to not only get into data science but also deepen your understanding of the trading industry.
22. International Monetary Fund
This site has several free Excel data sets for download on different key economic indicators, from Gross Domestic Product (GDP) to inflation. Taking the data from multiple files and condensing it for clarity and patterns is an excellent (and satisfying!) way to practice data cleaning.
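Condensing several files usually means joining them on shared keys. The Python sketch below merges two indicator tables by country and year, assuming the spreadsheets have first been exported to CSV. The countries, column names, and values are invented for illustration.

```python
import csv
import io

# Two made-up CSV "files"; real downloads would be opened from disk.
GDP_CSV = "country,year,gdp\nExamplia,2020,100\nSamplevania,2020,250\n"
INFLATION_CSV = (
    "country,year,inflation\nExamplia,2020,2.5\nSamplevania,2020,4.0\n"
)


def merge_indicators(*sources: str) -> dict:
    """Merge rows from several CSV texts, keyed by (country, year)."""
    merged: dict = {}
    for text in sources:
        for row in csv.DictReader(io.StringIO(text)):
            key = (row.pop("country"), int(row.pop("year")))
            # Convert the remaining indicator columns to floats.
            merged.setdefault(key, {}).update(
                {name: float(value) for name, value in row.items()}
            )
    return merged


table = merge_indicators(GDP_CSV, INFLATION_CSV)
print(table[("Examplia", 2020)])
```

Keying on a (country, year) pair keeps one row per observation, which makes missing indicators easy to spot: they simply do not appear in that row's dictionary.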
23. U.S. Energy Information Administration Open Data
This source has free and open data that is available as a bulk file, in Excel via an add-in, in Google Sheets via an add-on, and via widgets that embed interactive data visualizations of EIA data on any website. The website also notes that the EIA data is available in machine-readable formats, making it a great resource for machine learning projects.
24. TensorFlow Image Dataset: CelebA
For practice with machine learning, you’ll need a specialized dataset, such as those provided with TensorFlow. The TensorFlow library includes all sorts of tools, models, and machine learning guides along with its datasets. CelebA is an extremely large, publicly available dataset that contains over 200,000 celebrity images.
25. TensorFlow Text Dataset
Another TensorFlow dataset is C4, a cleaned version of Common Crawl’s web crawl corpus. Available in 40+ languages, this open-source repository of web page data spans seven years, making it an excellent resource for machine learning practice.
26. Our World In Data
Our World In Data is an interesting case study in open data. Not only can you find the underlying public data sets, but visualizations are already presented to slice up the data. The site mainly deals with large-scale country-by-country comparisons on important statistical trends, from the rate of literacy to economic progress.
27. Crypto Data Download
Do you want some insight into the emergence of cryptocurrencies? Cryptodatadownload offers free public data sets of cryptocurrency exchanges and historical data that tracks the exchanges and prices of cryptocurrencies. Use it to do historical analyses or try to piece together if you can predict the madness.
28. Kaggle Data
Kaggle datasets are an aggregation of user-submitted and curated datasets. It’s a bit like Reddit for datasets, with rich tooling to get started with different datasets, comment and upvote functionality, and a view of which projects are already being worked on in Kaggle. It is a great all-around resource for a variety of open datasets across many domains.
29. GitHub Collection (Open Data)
GitHub is a central hub of open data and open-source code. With different open datasets that are hosted on GitHub itself (including data on every member of Congress from 1789 onwards and data on food inspections in Chicago), this collection lets you get familiar with GitHub and the vast amount of open data that resides on it.
30. GitHub (Awesome Public Datasets)
The Awesome collection of repositories on GitHub is a user-contributed collection of resources. In this case, the repository contains a variety of open data sources categorized across different domains. Use this resource to find different open datasets, and contribute back to it if you can.
31. Microsoft Azure Open Datasets
Microsoft Azure is Microsoft’s cloud platform, and it offers a variety of open public datasets that are connected to Azure services. You can access featured datasets on everything from weather to satellite imagery.
32. Google BigQuery Datasets
Google BigQuery is Google’s cloud solution for processing large datasets in a SQL-like manner. You can preview these very large public datasets through the wiki of the subreddit dedicated to BigQuery, which covers everything from rich Wikipedia data to datasets dedicated to cancer genomics.