Data Type Conversion in pandas


This entry is part 3 of 3 in the series Pandas EDA Structuring

Convert Strings to Floats

This post covers one of the EDA structuring tasks in Python and pandas: converting a DataFrame column of strings to floats. In the example below, the sales column holds strings such as '$1.286M', so the values are in millions of dollars. You can copy and paste this Python code into a Jupyter Notebook and follow along.

import pandas as pd
data = {'company': ['ABC Inc.', 'XYZ Corp.', 'Acme Ltd', 'Widget LLC'],
        'sales': ['$1.286M', '$6.722M', '$3.320M', '$4.197M'],
        'industry': ['Technology', 'Foods', 'Foods', 'Technology'],
        'date_founded': ['2/25/2006', '5/17/2003', '3/7/2011', '11/2/2012']}
df = pd.DataFrame(data)
df


# Strip the leading '$' and trailing 'M', then convert the sales column to float
df['sales'] = df['sales'].str.strip('$M').astype(float)
df

The sales column now holds floats instead of strings, still expressed in millions of dollars.
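If the column might contain values that cannot be parsed, astype(float) raises an error. As a hedged alternative, the sketch below uses pd.to_numeric with errors='coerce', which turns unparseable values into NaN. The Series named raw and its 'n/a' entry are hypothetical and only there for illustration.

```python
import pandas as pd

# Hypothetical sales values; 'n/a' stands in for a malformed entry
raw = pd.Series(['$1.286M', '$6.722M', 'n/a', '$4.197M'])

# Strip the leading '$' and trailing 'M', then coerce anything non-numeric to NaN
sales = pd.to_numeric(raw.str.strip('$M'), errors='coerce')

print(sales.dtype)         # float64
print(sales.isna().sum())  # count of values that could not be parsed
```

Counting the NaN values afterwards is a quick way to see how many rows need cleaning before any analysis.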

Convert Numeric to String

Here we use astype() again, this time with str:

df['sales'] = df['sales'].astype(str)
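As a small self-contained sketch, the example below converts a float Series to strings with astype(str) and, as an optional extra, rebuilds a display label with a format string. The Series and the '$...M' label layout are assumptions chosen to mirror the sales data above.

```python
import pandas as pd

# Hypothetical float values, in millions of dollars
sales = pd.Series([1.286, 6.722, 3.32, 4.197])

# Plain conversion: each float becomes its string representation
as_text = sales.astype(str)

# Optional: control the layout with a format string instead
labels = sales.map('${:.3f}M'.format)

print(as_text.tolist())
print(labels.tolist())  # ['$1.286M', '$6.722M', '$3.320M', '$4.197M']
```

Note that astype(str) does not pad decimals, so a format string is the better choice when every value needs the same number of digits.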

Remove Commas

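Suppose we had commas in our column of numbers, for example as thousands separators. How would we get rid of those? str.strip(',') doesn't work, because it only removes characters from the beginning and end of each string. You could use str.replace() instead, as in the following code.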

df2['sales'] = df2['sales'].str.replace(',','').astype(float)
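To see the difference between strip and replace, here is a minimal runnable sketch. The DataFrame df2 and its values are hypothetical. Passing regex=False is an optional choice that makes it explicit the comma is a literal character.

```python
import pandas as pd

# Hypothetical df2 with thousands separators in the sales column
df2 = pd.DataFrame({'sales': ['1,286,000', '6,722,000', '3,320,000']})

# strip(',') leaves the values unchanged: no comma sits at either end
stripped = df2['sales'].str.strip(',')
print(stripped.tolist())

# replace removes every comma, so the strings can be converted to float
df2['sales'] = df2['sales'].str.replace(',', '', regex=False).astype(float)
print(df2['sales'].tolist())  # [1286000.0, 6722000.0, 3320000.0]
```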

Convert an Object to a Datetime

Suppose your DataFrame is called df and your column is called my_date. Here’s how you could convert the my_date column from an object (string) dtype to a datetime.

df['my_date'] = pd.to_datetime(df['my_date'])

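You can also pass the format argument to tell pandas exactly how the date strings are laid out. Here’s an example for values with a 12-hour time and an AM/PM marker.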

# Convert `my_date` column to datetime format 
df['my_date'] = pd.to_datetime(df['my_date'],format='%m/%d/%Y %I:%M:%S %p')
print('Data type of my_date:', df['my_date'].dtype)
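The same approach can be tried on the date_founded values from the first example. This is a sketch under the assumption that every value follows the month/day/year layout, so the format string is '%m/%d/%Y' with no time part.

```python
import pandas as pd

# date_founded values from the example DataFrame at the top of the post
df = pd.DataFrame({'date_founded': ['2/25/2006', '5/17/2003', '3/7/2011', '11/2/2012']})

# Parse month/day/year strings into datetimes
df['date_founded'] = pd.to_datetime(df['date_founded'], format='%m/%d/%Y')

print('Data type of date_founded:', df['date_founded'].dtype)

# Once the column is a datetime, the .dt accessor exposes its parts
print(df['date_founded'].dt.year.tolist())  # [2006, 2003, 2011, 2012]
```

Stating the format explicitly avoids ambiguity for values such as 3/7/2011, which could otherwise be read as either March 7 or July 3.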

Convert a Float to a Boolean (0 or 1)

Suppose you have a column of floating-point numbers and you wish to create another column of zeros and ones. The new value should be 1 if the floating-point number is at or above 20% (0.2), and 0 otherwise. Your DataFrame is called df and the column is called percent. Your new column is called generous_tipper.

df['generous_tipper'] = df['percent']   # create new column
df['generous_tipper'] = (df['generous_tipper'] >= 0.2)   # convert to True or False
df['generous_tipper'] = df['generous_tipper'].astype(int)   # convert to an integer
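The three steps above can also be collapsed into a single line. Here is a minimal sketch with hypothetical percent values, including one exactly at the 0.2 threshold to show that >= includes it.

```python
import pandas as pd

# Hypothetical tip percentages expressed as fractions
df = pd.DataFrame({'percent': [0.15, 0.2, 0.25, 0.1]})

# Compare against the threshold, then cast True/False to 1/0
df['generous_tipper'] = (df['percent'] >= 0.2).astype(int)

print(df['generous_tipper'].tolist())  # [0, 1, 1, 0]
```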