Data prep is the first step every data scientist takes when encountering a new dataset. It is an important process that turns data found in the wild into a form usable by models. Even the most diligently assembled datasets won’t be a perfect fit for your work, so you must put in the effort to clean them up until they’re all nice and shiny and ready for your model. In this post, I will go over some of the steps I take when encountering a new dataset.
In this post I will be using the Python library pandas in a Jupyter Notebook, along with the SQL dataset found in the link below!
1. Exploring data
The first obvious step is to get a good overview of how your data looks. To do this, a simple
fires_df.head() call will suffice.
With this, we can tell that our dataset has 39 columns in total, each with its own data type. To actually see the data types, or dtypes, we can run fires_df.dtypes.
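As a quick sketch (using a tiny stand-in frame, since the real fires_df has 39 columns), the dtype check looks like this:

```python
import pandas as pd

# Tiny stand-in for fires_df -- the real table has 39 columns.
fires_df = pd.DataFrame({
    "FIRE_YEAR": [2005, 2006],    # integer column
    "FIRE_SIZE": [0.10, 12.50],   # float column
    "STATE": ["CA", "NV"],        # string (object) column
})

# One dtype per column: int64, float64, object, etc.
print(fires_df.dtypes)
```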
What a nice variety of integers, floats, and strings; we will come back to this later when we explore the individual columns.
Some more things we can do for our preliminary data exploring is to look at the number of duplicates and NaN values in our data. For duplicates,
fires_df.duplicated().sum() will return the number of duplicated rows in our data, and similarly we can use
fires_df.isna().sum()/fires_df.shape[0] to check the proportion of NaN values in each column.
Luckily in this dataset, we do not have any duplicate rows.
The NaN results, on the other hand, are much more interesting. While some columns have around 40% of their values as NaNs, others are sitting at 98% or 99% NaNs. Those columns can be dropped without much thought with a quick fires_df.drop() call.
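One way to script that threshold-based drop (the column names here are illustrative; in practice the NaN-heavy columns come from the dataset itself):

```python
import numpy as np
import pandas as pd

# Stand-in frame: one useful column and one hypothetical 75%-NaN column.
fires_df = pd.DataFrame({
    "FIRE_SIZE":  [1.0, 2.0, 3.0, 4.0],
    "MOSTLY_NAN": [np.nan, np.nan, np.nan, 7.0],
})

# Fraction of NaNs per column, then drop anything above a 50% cutoff.
nan_frac = fires_df.isna().sum() / fires_df.shape[0]
fires_df = fires_df.drop(columns=nan_frac[nan_frac > 0.5].index)
```

Computing the fraction once and reusing it keeps the cutoff in a single, easy-to-tune place.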
While we still have several columns with high NaN counts, we will put them on standby for now and come back to them when we’re exploring each individual column.
2. Removing unnecessary data
Now we will delve into the data itself and start making judgments based on our goals. From earlier, our goal is to build a model that predicts the size of a fire given the other information. Since we only care about the fire itself, we can cut quite a number of columns that have little relation to it. In particular, columns relating to bureaucratic policies are ones we can leave out. Take FIPS_CODE, which according to the source of the data is a “Three-digit code from the Federal Information Processing Standards (FIPS) publication 6–4 for representation of counties and equivalent entities.” Adding columns like this to our model is more likely to lower performance than raise it, so it is best to drop them now so we never have to look at them again. Like before, this can be done with another
fires_df.drop() call.
Other columns that can be easily dropped are those relating to codes or IDs used in-house by fire agencies. Since we are simple laymen who most likely have no clue what the codes stand for, we can only make an educated guess from their names and descriptions as to whether they could possibly be useful.
Here are the columns that I chose to drop.
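As a sketch, the drop itself is a single call. Only FIPS_CODE is named above; the other column name here is a stand-in for a similar in-house ID column:

```python
import pandas as pd

# Stand-in frame with a couple of bureaucratic columns to discard.
fires_df = pd.DataFrame({
    "FIRE_SIZE": [0.5, 3.2],
    "FIPS_CODE": ["001", "013"],
    "LOCAL_INCIDENT_ID": ["A1", "B2"],  # hypothetical in-house ID column
})

bureaucratic_cols = ["FIPS_CODE", "LOCAL_INCIDENT_ID"]
fires_df = fires_df.drop(columns=bureaucratic_cols)
```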
Now we can return to the columns that still had NaN values and apply the same logic to decide whether they are worth keeping around. Of those columns, only the containment columns (CONT_DATE, CONT_DOY, CONT_TIME) and DISCOVERY_TIME seem usable, so we keep those and drop the rest.
For the remaining columns with NaN values, we must decide how to deal with them. For now, we will take the easy way out and check whether simply dropping every row with a NaN in those columns will suffice, using fires_df.dropna().
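A minimal sketch of the row-dropping approach, with DISCOVERY_TIME standing in for the NaN-carrying columns:

```python
import numpy as np
import pandas as pd

fires_df = pd.DataFrame({
    "FIRE_SIZE": [1.0, 2.0, 3.0],
    "DISCOVERY_TIME": ["1300", np.nan, "0845"],
})

# Keep only rows where every column in `subset` is non-NaN.
fires_df = fires_df.dropna(subset=["DISCOVERY_TIME"])
```

Passing `subset` drops rows based on NaNs in just the listed columns, rather than anywhere in the row.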
Luckily for us, we have an abundance of data to work with; even if we drop half of it, we still have nearly 900,000 rows, which should be plenty. For now we can use this method and come back later if we ever want to optimize.
Looking at our data now, we are down to 892,000 rows and 13 columns, a third of our initial column count. This cuts down on the clutter quite a bit and will be easier on both our model and our eyes when viewing the data.
3. Cleaning up the remaining data
While our dataset is now fairly clean from an overall perspective, we can still improve the data within individual columns. One thing to check is that all of our dtypes are set properly. For example, a column might contain numbers that were entered as strings instead of actual integers. Thus we must go through our data in search of such cases.
I discovered that the times in our TIME columns were kept in string form. Since the values are stored in the format 0000 to 2400, converting them to integers makes comparing two times a simple task.
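The conversion itself is a one-liner; here DISCOVERY_TIME stands in for the TIME columns:

```python
import pandas as pd

fires_df = pd.DataFrame({"DISCOVERY_TIME": ["0000", "0845", "2330"]})

# Integer comparison is safer than string comparison and also
# enables arithmetic on the times. Note "0845" becomes 845.
fires_df["DISCOVERY_TIME"] = fires_df["DISCOVERY_TIME"].astype(int)
```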
Another technique is to use mapping to group values together. For example, given a distribution of numbers between 0 and 10, we could group them into ‘Odds’ and ‘Evens’. Similarly, in this dataset our day-of-year columns store their values as a number between 0 and 365. Since it is quite a hassle to visualize what a number like 95 means, we can use mapping to group the values into something much easier to understand. In this case, we can split the year into four groups and sort each date by the season it belongs to. The time column can be treated the same way, grouped instead by time of day, such as Morning, Noon, Afternoon, or Night.
By using the
fires_df.apply() function, we can sort our data into a more visually appealing form. While this will somewhat lower the potential performance of our models, it greatly helps our understanding of the data and can be reversed at any time.
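A sketch of the season grouping; the day-of-year boundaries here are approximate, and DISCOVERY_DOY stands in for the day-of-year columns:

```python
import pandas as pd

fires_df = pd.DataFrame({"DISCOVERY_DOY": [15, 95, 200, 300]})

def doy_to_season(doy: int) -> str:
    """Map a day of year (0-365) to an approximate season."""
    if doy < 80 or doy >= 355:  # roughly late Dec through mid Mar
        return "Winter"
    if doy < 172:               # mid Mar through late Jun
        return "Spring"
    if doy < 266:               # late Jun through late Sep
        return "Summer"
    return "Fall"

# Series.apply runs the function on each value in the column.
fires_df["SEASON"] = fires_df["DISCOVERY_DOY"].apply(doy_to_season)
```

Keeping the mapping in a named function makes it easy to reverse or refine later, which matters since this grouping trades model precision for readability.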
Now that our dataset is reasonably clean, it is ready for experimenting with models! When it comes to feature engineering, we can retrace our steps and redo some of the cleaning methods to improve performance. Different approaches to handling NaNs or grouping can either hurt or help the model, so cleaning the data to better your understanding of the dataset can help immensely!