pandas is a data-analysis library that is widely used alongside TensorFlow, the popular machine learning framework.
We go through it briefly here.
Installation and import
pandas requires NumPy; SciPy is optional and only needed by some features. We can install pandas with pip or conda:
pip install pandas
In a Python 3 environment, importing pandas is straightforward:
from __future__ import print_function
import pandas as pd
pd.__version__ # check it
DataFrame & Series
A pandas DataFrame, like a table in Excel, consists of one or more Series, which resemble Excel columns and share the same number of rows. See the following example:
city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
population = pd.Series([852469, 1015785, 485199])
sample = pd.DataFrame({'City name': city_names, 'Population': population})
It is very convenient to get summary statistics with the describe() method:
sample.describe()
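As a rough illustration of what describe() returns, the sketch below rebuilds the same small table so that it runs on its own, then reads individual statistics back out of the result. The variable name stats is only an illustrative choice.

```python
import pandas as pd

# Rebuild the small example table so this snippet is self-contained.
city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
population = pd.Series([852469, 1015785, 485199])
sample = pd.DataFrame({'City name': city_names, 'Population': population})

# By default describe() summarises only the numeric columns,
# so 'City name' does not appear in the result.
stats = sample.describe()
print(stats)

# The result is itself a DataFrame: statistics are rows, columns are columns.
print(stats.loc['mean', 'Population'])
print(stats.loc['max', 'Population'])
```

Because the result is an ordinary DataFrame, it can be sliced or saved like any other table.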
pandas can also import data from a file. Here we create a CSV file named ‘housing_train.csv’ with the following contents:
"longitude","latitude","housing_median_age","total_rooms","total_bedrooms","population","households","median_income","median_house_value"
-114.310000,34.190000,15.000000,5612.000000,1283.000000,1015.000000,472.000000,1.493600,66900.000000
-114.470000,34.400000,19.000000,7650.000000,1901.000000,1129.000000,463.000000,1.820000,80100.000000
-114.560000,33.690000,17.000000,720.000000,174.000000,333.000000,117.000000,1.650900,85700.000000
-114.570000,33.640000,14.000000,1501.000000,337.000000,515.000000,226.000000,3.191700,73400.000000
-114.570000,33.570000,20.000000,1454.000000,326.000000,624.000000,262.000000,1.925000,65500.000000
-114.580000,33.630000,29.000000,1387.000000,236.000000,671.000000,239.000000,3.343800,74000.000000
-114.580000,33.610000,25.000000,2907.000000,680.000000,1841.000000,633.000000,2.676800,82400.000000
-114.590000,34.830000,41.000000,812.000000,168.000000,375.000000,158.000000,1.708300,48500.000000
-114.590000,33.610000,34.000000,4789.000000,1175.000000,3134.000000,1056.000000,2.178200,58400.000000
I saved it in my Documents directory, so I load it with:
housing_dataframe = pd.read_csv("~/Documents/housing_train.csv", sep=",")
If the dataset is large, you can inspect just the first few rows with:
housing_dataframe.head()
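To try read_csv and head() without creating a file first, one option is to feed the CSV text to pandas through io.StringIO. The sketch below uses the header and the first three data rows listed above; the names csv_text and preview are illustrative only.

```python
import io

import pandas as pd

# Header plus the first three data rows of housing_train.csv, as plain text.
csv_text = '''"longitude","latitude","housing_median_age","total_rooms","total_bedrooms","population","households","median_income","median_house_value"
-114.310000,34.190000,15.000000,5612.000000,1283.000000,1015.000000,472.000000,1.493600,66900.000000
-114.470000,34.400000,19.000000,7650.000000,1901.000000,1129.000000,463.000000,1.820000,80100.000000
-114.560000,33.690000,17.000000,720.000000,174.000000,333.000000,117.000000,1.650900,85700.000000
'''

# read_csv accepts a file-like object as well as a path.
preview = pd.read_csv(io.StringIO(csv_text), sep=",")

print(preview.shape)    # (rows, columns)
print(preview.head(2))  # head(n) returns only the first n rows
```

Calling head() with no argument returns the first five rows, which is usually enough to check that the columns were parsed as expected.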
To plot a histogram of a column, type:
import matplotlib.pyplot as plt
housing_dataframe.hist('housing_median_age')
plt.show()
The data in a DataFrame can also be accessed using familiar Python dict and list operations:
>>> type(housing_dataframe['households'])
<class 'pandas.core.series.Series'>
>>> housing_dataframe['households'][0:2]
0 472.0
1 463.0
Name: households, dtype: float64
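The same access patterns can be tried on a tiny hand-built table. This is a minimal sketch, assuming a two-column frame filled with the first three households and population values from the file above; the variable names are illustrative.

```python
import pandas as pd

# A small stand-in for housing_dataframe with two of its columns.
housing = pd.DataFrame({
    'households': [472.0, 463.0, 117.0],
    'population': [1015.0, 1129.0, 333.0],
})

column = housing['households']                  # dict-style lookup returns a Series
first_two = column[0:2]                         # list-style slice: first two values
first_rows = housing[0:2]                       # slicing the DataFrame selects rows
subset = housing[['households', 'population']]  # a list of names returns a DataFrame

print(type(column))
print(first_two)
print(first_rows)
```

Note the difference between a single column name, which yields a Series, and a list of column names, which yields a DataFrame.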
Manipulating data
pandas Series can also be passed as arguments to most NumPy functions:
>>> import numpy as np
>>> np.log(population)
0 13.655892
1 13.831172
2 13.092314
dtype: float64
Modifying DataFrames is also straightforward. For example, the following code adds two Series to an existing DataFrame:
cities = pd.DataFrame({'City name': city_names, 'Population': population })
cities['Area square miles'] = pd.Series([46.87, 176.53, 97.92])
cities['Population density'] = cities['Population'] / cities['Area square miles']
cities
Exercise 1
Modify the cities table by adding a new boolean column that is True if and only if both of the following are True:
- The city is named after a saint (its name begins with "San").
- The city has an area greater than 50 square miles.
Hint: use a lambda function and the element-wise logical operator &.
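One possible solution is sketched below. It assumes that "named after a saint" can be approximated by checking whether the name starts with "San", and the new column name 'Is wide and has saint name' is a hypothetical choice, not something fixed by the exercise.

```python
import pandas as pd

# Rebuild the cities table so this snippet runs on its own.
city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
population = pd.Series([852469, 1015785, 485199])
cities = pd.DataFrame({'City name': city_names, 'Population': population})
cities['Area square miles'] = pd.Series([46.87, 176.53, 97.92])

# Assumption: a name starting with 'San' stands in for "named after a saint".
has_saint_name = cities['City name'].apply(lambda name: name.startswith('San'))
is_wide = cities['Area square miles'] > 50

# & combines the two boolean Series element by element.
cities['Is wide and has saint name'] = is_wide & has_saint_name
print(cities)
```

Only San Jose satisfies both conditions: San Francisco is smaller than 50 square miles, and Sacramento does not start with "San". Python's `and` would not work here, because it tries to reduce each whole Series to a single truth value.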
Indexes
By default, the index values assigned to a DataFrame or Series are stable: they do not change when the data is reordered. We can call reindex to reorder the rows manually:
cities.reindex([2, 0, 1])  # returns a new, reordered DataFrame; cities itself is unchanged
Reindexing is a great way to shuffle (randomise) a DataFrame. In the example below, we take the index, which is array-like, and pass it to NumPy’s random.permutation function, which returns a shuffled copy of its values and leaves the original index untouched. Calling reindex with this shuffled array causes the DataFrame rows to be shuffled in the same way. Try running the following cell multiple times!
cities.reindex(np.random.permutation(cities.index))
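The sketch below makes the shuffle reproducible and checks that each row keeps its original index label. It assumes a seeded generator from np.random.default_rng in place of the global np.random.permutation call; the seed value and the name shuffled are arbitrary illustrative choices.

```python
import numpy as np
import pandas as pd

# Rebuild the cities table so this snippet runs on its own.
cities = pd.DataFrame({
    'City name': ['San Francisco', 'San Jose', 'Sacramento'],
    'Population': [852469, 1015785, 485199],
})

# Assumption: a fixed seed, so repeated runs give the same order.
rng = np.random.default_rng(0)
new_order = rng.permutation(cities.index.to_numpy())  # shuffled copy of the labels

# reindex returns a new DataFrame; the original is left as it was.
shuffled = cities.reindex(new_order)

print(shuffled)
print(cities.index.tolist())  # still [0, 1, 2]
```

To keep the shuffled order, assign the result back, for example `cities = cities.reindex(new_order)`.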