DataViz – Collecting & Cleaning

This past week I tasked myself with doing some data collection for the sake of sharpening my data cleaning skills. I finished up with three different datasets that came from reliable sources such as the data.gov, the United States Census Bureau, and the Billboard archive.

I found that searching for data came to be tougher than I anticipated. When doing a search, especially on large sites such as the United States Census Bureau and data.gov, the amount of data in each set can be a bit overwhelming. It’s up to the researcher to decide how much of a certain dataset we want to utilize and if we need to grab any additional data from a different, but related set.

At the end of my search I had gathered the following datasets:

  1. Fresh Fruit – The average retail price per pound or pint and per cup equivalent, 2016 (USDA)
  2. Occupation by Sex for the Civilian Employed Population 16 Years And Over, 2013-2017 American Community Survey 5-Year Estimates
  3. Number One Songs on the Billboard Hot 100, January 2018 – June 2019

 I intend to take these sets of data and create some visualizations with them, so let’s talk through my plans for each.

Fresh Fruit – The average retail price per pound or pint and per cup equivalent, 2016 (USDA)

When I stumbled upon this data I thought it was pretty interesting because the prices of produce and food, in general, is something that affects us all. Especially with many Americans trying to live healthier lifestyles nowadays, being able to easily see the cost of all fruit in comparison to one another would be useful for budgetary reasons. The USDA also included something called the “preparation yield factor” in these datasets, which help consumers see how much of the fruit is actually edible after things like peels, cores, pits, etc. are taken into consideration. I thought this was a great dimension of the data because it even further allows the consumers to see what fruits you get the more bang for your buck from.

In order to get the spreadsheet of data I wanted for this set I had to do some cleaning. The USDA has all of its fruit retail information separated. Apples have their own spreadsheet, bananas have their own, and so on and so forth. I started to compile all of the spreadsheets into one so that you can easily compare one fruit to another. After all the fruits were on one spreadsheet, I then began to eliminate data I felt I did not need. This included the prices of frozen fruit, fruit juices, and dried fruit. I wanted to keep these numbers to strictly fresh options.

I ended up with the following spreadsheet of data:

Screen Shot 2019-06-09 at 4.40.03 PM

Planning ahead for this DataViz, I see myself creating a scatter chart. I will explain. I want my chart to answer the question, “What fruits can I buy for a good price that yield the most edible amount of fruit?” Therefore, I will have a chart that has a Y-axis to display the Preparation Yield Factor (continuous ratio data) and an X-axis to display prices (continuous ratio data) because these are the two dimensions I want to make a correlation between. Where the points meet for each fruit will be where my plotting points will lay on the chart. Onlookers will then be able to easily see the true value of each kind of fruit.

Occupation by Sex for the Civilian Employed Population 16 Years And Over, 2013-2017 American Community Survey 5-Year Estimates

My second dataset is one that discusses the comparison of women and men in jobs across the country. This data came from the United States Census Bureau.

This set of data didn’t require as much cleaning and compilation as my last. The Census Bureau makes it very easy for people looking for data because they allow you to get rid of any additional columns or rows of data you may not want before downloading their excel spreadsheet. This is done with a simple toggle feature on their website. Very efficient!

My final dataset looks like this…

Screen Shot 2019-06-09 at 9.27.07 PM

To create a graphic to show this information, I think a stacked column or bar chart would be best. With a Y-axis to display the job field and an X-axis to show the number of people in each job, it will be easy to compare each job field with the others. Not only that but within each individual bar will be the comparison of men and women in each job. The viewer will be able to see which job fields are the most popular and least popular overall along with which jobs are the most and least popular between the sexes.

Number One Songs on the Billboard Hot 100, January 2018 – June 2019

My final data set was one that I researched on my own, so no cleaning necessary, just some compilation. Billboard has a great archive of their music charts that dates back to the ’50s. It made getting my data together pretty easy, which was great!

I decided to go through each chart from the last year and a half and collect information on each number one song. I recorded the song name, artist name, the number of weeks that the chart stayed at number one, how many times each song re-entered the number one spot, how many weeks each of the artists had at number one, and lastly how many unique songs each artist had at number one over the time period.

My final dataset can be seen below.

Screen Shot 2019-06-09 at 9.29.42 PM

I want the onlookers of this final DataViz to answer questions like, “Which songs have spent the most time at number one?” “What artists have dominated the charts in the last year and a half?” and “How many songs reached the number one spot in 2018 and 2019?” In order to have all of these questions answered, I have to create a visual that has multiple pieces and charts that can give onlookers an overall picture of the Hot 100 over the last year and a half. I plan to do a series of pie and donut charts to display song and artist ratios. Each song and artist is “part of a whole.” The “whole” being all 76 Billboard charts in review.

Leave a comment