Pandas Overview


Pandas is a widely used Python library for data analysis. It often appears alongside TensorFlow, the popular machine learning framework, as a way to load and prepare data.
We go through it briefly here.

Installation and import

The prerequisite Python package is numpy (scipy is optional). We can use pip or conda to install pandas, and either tool pulls in the required dependencies automatically.

pip install pandas

In a Python 3 environment, importing pandas is straightforward (the from __future__ import print_function line is only needed on Python 2):

import pandas as pd
print(pd.__version__)  # check the installed version

DataFrame & Series

A pandas DataFrame, like a table in Excel, consists of one or more Series, which resemble Excel columns and share the same number of rows. See the following example:

city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
population = pd.Series([852469, 1015785, 485199])
sample = pd.DataFrame({'City name': city_names, 'Population': population})

It is very convenient to get summary statistics with the describe() method:

sample.describe()
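
As a rough sketch of what describe() returns for the table above: by default it summarizes only the numeric columns, so 'City name' is left out. The variable name summary is my own, chosen for illustration.

```python
import pandas as pd

city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
population = pd.Series([852469, 1015785, 485199])
sample = pd.DataFrame({'City name': city_names, 'Population': population})

# describe() summarizes numeric columns only by default,
# so the result has a single 'Population' column.
summary = sample.describe()
print(summary)

# Individual statistics are looked up by row label and column name.
print(summary.loc['count', 'Population'])  # number of non-missing values
print(summary.loc['max', 'Population'])    # largest population
```

The row labels of the result are count, mean, std, min, 25%, 50%, 75% and max.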

Pandas can also import data from a file. Here we create a CSV file named 'housing_train.csv' with the following contents:

"longitude","latitude","housing_median_age","total_rooms","total_bedrooms","population","households","median_income","median_house_value"
-114.310000,34.190000,15.000000,5612.000000,1283.000000,1015.000000,472.000000,1.493600,66900.000000
-114.470000,34.400000,19.000000,7650.000000,1901.000000,1129.000000,463.000000,1.820000,80100.000000
-114.560000,33.690000,17.000000,720.000000,174.000000,333.000000,117.000000,1.650900,85700.000000
-114.570000,33.640000,14.000000,1501.000000,337.000000,515.000000,226.000000,3.191700,73400.000000
-114.570000,33.570000,20.000000,1454.000000,326.000000,624.000000,262.000000,1.925000,65500.000000
-114.580000,33.630000,29.000000,1387.000000,236.000000,671.000000,239.000000,3.343800,74000.000000
-114.580000,33.610000,25.000000,2907.000000,680.000000,1841.000000,633.000000,2.676800,82400.000000
-114.590000,34.830000,41.000000,812.000000,168.000000,375.000000,158.000000,1.708300,48500.000000
-114.590000,33.610000,34.000000,4789.000000,1175.000000,3134.000000,1056.000000,2.178200,58400.000000

I saved it in my Documents directory, so I load it with:

housing_dataframe = pd.read_csv("~/Documents/housing_train.csv", sep=",")
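
If you want to try read_csv without creating a file first, one option is to pass it a file-like object. The sketch below inlines the header and the first two data rows of the file above; the names csv_text and preview are assumptions made for this example.

```python
import io
import pandas as pd

# Header and first two data rows of housing_train.csv, inlined as text.
csv_text = '''"longitude","latitude","housing_median_age","total_rooms","total_bedrooms","population","households","median_income","median_house_value"
-114.310000,34.190000,15.000000,5612.000000,1283.000000,1015.000000,472.000000,1.493600,66900.000000
-114.470000,34.400000,19.000000,7650.000000,1901.000000,1129.000000,463.000000,1.820000,80100.000000
'''

# read_csv accepts a path or a file-like object; sep="," is the default.
preview = pd.read_csv(io.StringIO(csv_text))

print(preview.shape)   # (number of rows, number of columns)
print(preview.dtypes)  # the type pandas inferred for each column
```

Since the comma is the default separator, the sep="," argument in the call above is optional.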

If the dataset is large, you can inspect just the first few rows with:

housing_dataframe.head()
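
A small sketch of how head() behaves, using a hypothetical 10-row frame (big, with a made-up row_id column) in place of a large dataset:

```python
import pandas as pd

# A hypothetical 10-row frame standing in for a large dataset.
big = pd.DataFrame({'row_id': range(10)})

print(big.head())    # first 5 rows by default
print(big.head(3))   # first 3 rows
print(big.tail(2))   # last 2 rows
```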

Plot a histogram of a column with:

import matplotlib.pyplot as plt
housing_dataframe.hist('housing_median_age')
plt.show()

Columns can also be accessed using familiar Python dict / list operations:

>>> type(housing_dataframe['households'])
<class 'pandas.core.series.Series'>
>>> housing_dataframe['households'][0:2]
0    472.0
1    463.0
Name: households, dtype: float64
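
The access patterns above can be sketched on a small inline table. The frame df below is a stand-in built from the first three rows of two columns of the file, so the example runs without the CSV; all variable names are my own.

```python
import pandas as pd

# First three rows of two columns from the file above.
df = pd.DataFrame({
    'households': [472.0, 463.0, 117.0],
    'median_income': [1.4936, 1.8200, 1.6509],
})

col = df['households']       # dict-style: one column as a Series
first = df['households'][0]  # then look up one element by its index label
rows = df[0:2]               # slicing the DataFrame selects rows 0 and 1
pair = df[['households', 'median_income']]  # list of names: a smaller DataFrame

print(type(col))
print(first)
print(rows)
```

Note that a single name in brackets yields a Series, while a list of names yields a DataFrame.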

Manipulating data

Pandas Series can also be used as arguments to most NumPy functions:

>>> import numpy as np
>>> np.log(population)
0    13.655892
1    13.831172
2    13.092314
dtype: float64
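
For transformations that are not a single NumPy call, two common options are plain arithmetic on the Series and Series.apply with a lambda. This is a minimal sketch; in_thousands and over_one_million are names picked for the example.

```python
import pandas as pd

population = pd.Series([852469, 1015785, 485199])

# Arithmetic on a Series is applied element by element.
in_thousands = population / 1000

# apply() calls the function once per value and returns a new Series.
over_one_million = population.apply(lambda val: val > 1000000)

print(in_thousands)
print(over_one_million)
```

Only San Jose (index 1) has a population above one million, so only that entry of over_one_million is True.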

Modifying DataFrames is also straightforward. For example, the following code adds two Series to an existing DataFrame:

cities = pd.DataFrame({'City name': city_names, 'Population': population })
cities['Area square miles'] = pd.Series([46.87, 176.53, 97.92])
cities['Population density'] = cities['Population'] / cities['Area square miles']
cities

Exercise 1

Modify the cities table by adding a new boolean column that is True if and only if both of the following are True:

The city is named after a saint.
The city has an area greater than 50 square miles.

Hint: use a lambda function with apply, and combine the two boolean Series with the element-wise operator & rather than the keyword and.
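
One possible solution, offered as a sketch rather than the only answer. It assumes that "named after a saint" can be approximated by checking whether the name starts with "San", and the new column name is hypothetical.

```python
import pandas as pd

city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
population = pd.Series([852469, 1015785, 485199])
cities = pd.DataFrame({'City name': city_names, 'Population': population})
cities['Area square miles'] = pd.Series([46.87, 176.53, 97.92])

# Condition 1 (assumption): the name starts with "San".
is_saint = cities['City name'].apply(lambda name: name.startswith('San'))
# Condition 2: the area exceeds 50 square miles.
is_wide = cities['Area square miles'] > 50

# & combines two boolean Series element by element.
cities['Is wide and has saint name'] = is_saint & is_wide
print(cities)
```

San Francisco fails the area test (46.87) and Sacramento fails the name test, so only San Jose ends up True.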

Indexes

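By default, pandas assigns index values to a DataFrame or Series that reflect the order of the source data. Once created, these index values are stable: they do not change when the rows are reordered. We can call reindex to change the row order manually: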

reordered = cities.reindex([2, 0, 1])  # returns a new DataFrame; cities itself is unchanged
reordered

Reindexing is a great way to shuffle (randomise) a DataFrame. In the example below, we take the index, which is array-like, and pass it to NumPy’s random.permutation function, which returns a shuffled copy of its values (the original index is left untouched). Calling reindex with this shuffled array returns a DataFrame whose rows are shuffled in the same way. Try running the following cell multiple times!

cities.reindex(np.random.permutation(cities.index))
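
To keep the shuffled result, assign it to a variable, since reindex returns a new DataFrame. The sketch below rebuilds a two-column cities table so it runs on its own; shuffled_labels and shuffled are names chosen for the example.

```python
import numpy as np
import pandas as pd

cities = pd.DataFrame({
    'City name': ['San Francisco', 'San Jose', 'Sacramento'],
    'Population': [852469, 1015785, 485199],
})

# permutation() returns a shuffled copy; cities.index itself is untouched.
shuffled_labels = np.random.permutation(cities.index)
shuffled = cities.reindex(shuffled_labels)

print(list(cities.index))    # the original order is preserved
print(list(shuffled.index))  # some ordering of 0, 1, 2
```

Each row keeps its original index label after the shuffle, so shuffled.loc[1] is still the San Jose row wherever it lands.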
