Introduction
- Pandas is a Python-based tool for data manipulation and analysis
Importing and getting started
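The entries in these notes all use the pd prefix. A minimal sketch of the import that makes that prefix available, assuming pandas is already installed:

```python
import pandas as pd  # "pd" is the conventional alias used in these notes

# Printing the version is a quick way to confirm the import worked.
print(pd.__version__)
```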
Datatypes
- Two main data types:
pd.Series
: one-dimensional labeled array
pd.DataFrame
: two-dimensional labeled table of rows and columns
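As a rough illustration of the two datatypes, the sketch below builds one of each. The car data and the variable names are made up for the example and are not part of the original notes.

```python
import pandas as pd

# Hypothetical example data
colours = pd.Series(["red", "blue", "white"])  # one dimension: a single column of values

cars = pd.DataFrame({
    "make": ["Toyota", "Honda", "BMW"],
    "colour": colours,                         # a Series can become a DataFrame column
    "odometer_km": [150000, 87000, 11000],
})                                             # two dimensions: rows and columns

print(colours.ndim, cars.ndim)  # number of dimensions of each object
print(cars.shape)               # (rows, columns)
```

Selecting a single column of a DataFrame, for example cars["make"], gives back a Series.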
Importing & exporting data
pd.read_csv("path/filename.csv")
: imports your CSV file as a DataFrame
DataFrame.to_csv("filename.csv")
: exports your DataFrame as a CSV file
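A small round trip can make the import and export entries concrete. This is only a sketch: the file name cars.csv and the data are hypothetical, and a temporary directory is used so nothing is left on disk. Note that to_csv is called on the DataFrame itself, while read_csv is a function of the pd module.

```python
import os
import tempfile

import pandas as pd

# Hypothetical example data
cars = pd.DataFrame({"make": ["Toyota", "Honda"], "doors": [4, 2]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "cars.csv")  # hypothetical file name
    # index=False leaves the row index out of the file; without it the
    # index is written as an extra unnamed first column.
    cars.to_csv(csv_path, index=False)
    loaded = pd.read_csv(csv_path)  # read the file back as a DataFrame

print(loaded)
```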
Describing data
.dtypes
: shows what datatype each column contains
.describe()
: returns a quick statistical overview of the numerical columns
.info()
: shows a summary of a DataFrame, such as how many rows there are, the non-null count of each column (which reveals missing values), and the datatype of each column
- Statistical methods
.columns
: shows all the columns of the DataFrame
.index
: shows the values in a DataFrame’s index
len(DataFrame)
: shows the length (number of rows) of a DataFrame
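The describing tools above could be tried together on a small DataFrame, as in the sketch below. The data is invented for illustration. The call to .mean() is an assumption about what the "Statistical methods" bullet refers to; the notes do not name specific methods.

```python
import pandas as pd

# Hypothetical example data with one missing price
cars = pd.DataFrame({
    "make": ["Toyota", "Honda", "BMW", "Nissan"],
    "doors": [4, 4, 5, 4],
    "price": [4000.0, 5000.0, None, 3500.0],
})

print(cars.dtypes)          # datatype of each column
summary = cars.describe()   # statistics for the numerical columns only
print(summary)
cars.info()                 # prints row count, non-null counts and dtypes
print(list(cars.columns))   # column names
print(cars.index)           # row labels
print(len(cars))            # number of rows

# Assumed example of a statistical method; missing values are skipped by default
average_price = cars["price"].mean()
print(average_price)
```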
Viewing & selecting data
DataFrame.head()
: shows the first 5 rows by default
DataFrame.tail()
: shows the last 5 rows by default
DataFrame.loc[]
: accesses a group of rows and columns by labels or a boolean array
DataFrame.iloc[]
: accesses a group of rows and columns by integer indices
DataFrame['column name']
: selects a single column as a Series
DataFrame[DataFrame['column name'] > 5]
: filters and shows only the rows that meet the condition
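DataFrame.plot()
: plots the columns of the DataFrame (requires matplotlib)
DataFrame.hist()
: draws a histogram of each numerical column (requires matplotlib)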
pd.crosstab
: computes a cross-tabulation of two or more factors
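The selection tools above might be combined as in the following sketch. The data, the string row labels and the variable names are assumptions made for the example. A comparison such as cars["odometer"] > 5 produces a boolean Series on its own, and passing that Series back into the DataFrame keeps only the matching rows. plot() and hist() are left out because they need matplotlib.

```python
import pandas as pd

# Hypothetical example data with string row labels
cars = pd.DataFrame(
    {
        "make": ["Toyota", "Honda", "Toyota", "BMW"],
        "doors": [4, 2, 4, 5],
        "odometer": [3, 9, 7, 1],
    },
    index=["a", "b", "c", "d"],
)

first_two = cars.head(2)           # first 2 rows
last_one = cars.tail(1)            # last row
by_label = cars.loc["b", "make"]   # row labelled "b", column "make"
by_position = cars.iloc[1, 0]      # second row, first column (positions start at 0)
doors = cars["doors"]              # a single column as a Series

mask = cars["odometer"] > 5        # boolean Series, one True/False per row
filtered = cars[mask]              # only the rows where the mask is True

# Count how often each combination of make and doors occurs
table = pd.crosstab(cars["make"], cars["doors"])
print(table)
```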
Manipulating data
.fillna()
: fills in missing data
.dropna()
: removes the rows that contain missing values (by default)
.drop('column name', axis=1)
: removes the specified column
.sample(frac=1)
: randomly samples rows from a DataFrame; the frac parameter sets the fraction of rows to return (1 = 100%, 0.5 = 50%, and so on)
.sample(n=1)
: same as above, but the n parameter sets the number of rows to sample instead of a fraction
.reset_index()
: resets the index to the default integer index and keeps the old index as a new column (pass drop=True to discard it)
.apply(lambda x: x / 1.6)
: applies the given function to the data; this example converts the values from km to mi (1 mi ≈ 1.6 km)
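The manipulation methods above can be tried on a small DataFrame with a few gaps in it. This is a sketch with made-up data and variable names; random_state=42 is an arbitrary seed added only to make the sampling repeatable. Each call returns a new object and leaves the original cars DataFrame unchanged.

```python
import pandas as pd

# Hypothetical example data with missing values
cars = pd.DataFrame({
    "make": ["Toyota", "Honda", "BMW", "Nissan"],
    "odometer_km": [160.0, None, 80.0, 32.0],
    "colour": ["white", "red", None, "blue"],
})

filled = cars.fillna({"odometer_km": 0})   # fill gaps in one column with 0
complete = cars.dropna()                   # drop rows with any missing value
no_colour = cars.drop("colour", axis=1)    # axis=1 means "drop a column"

shuffled = cars.sample(frac=1, random_state=42)  # all rows, in random order
one_row = cars.sample(n=1, random_state=42)      # a single random row

with_old_index = shuffled.reset_index()          # old index kept as an "index" column
renumbered = shuffled.reset_index(drop=True)     # old index discarded

# Approximate km to miles conversion, using the 1.6 factor from the notes
miles = cars["odometer_km"].apply(lambda x: x / 1.6)
print(miles)
```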
Other functions & methods
DataFrame[column].cumsum()
: cumulative sum of specified column
pd.date_range("1/1/2020", periods=n)
: creates n consecutive dates starting on 1/1/2020 (daily by default); periods is how many entries to generate
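The two entries above could be used together, for example to keep a running total per day. The sketch below uses invented sales figures. periods is passed by keyword because the second positional argument of pd.date_range is the end date.

```python
import pandas as pd

# Four consecutive days starting on 1 January 2020
dates = pd.date_range("1/1/2020", periods=4)

# Hypothetical example data
sales = pd.DataFrame({"date": dates, "units": [3, 1, 4, 2]})

# Each row holds the sum of "units" up to and including that row
sales["running_total"] = sales["units"].cumsum()
print(sales)
```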