Applied Data Science with Python – Part 1

This article introduces the basics of the Python programming environment and applied data science with Python, including fundamental Python programming techniques such as lambdas, reading and manipulating CSV files, and the NumPy library.

Functions

add_numbers is a function that takes two numbers and adds them together.

add_numbers updated to take an optional 3rd parameter. Using print allows printing of multiple expressions within a single cell.

add_numbers updated to take an optional flag parameter.

Assign the function add_numbers to a variable a.
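A minimal sketch of the steps above. The default values (z=None, flag=False) and the printed message are illustrative assumptions, not taken from the text.

```python
def add_numbers(x, y, z=None, flag=False):
    """Add x and y, plus z when it is supplied."""
    if flag:
        print("Flag is true!")  # illustrative message
    if z is None:
        return x + y
    return x + y + z


# print lets one cell show several expressions.
print(add_numbers(1, 2))
print(add_numbers(1, 2, 3))
print(add_numbers(1, 2, flag=True))

# A function is an object, so it can be bound to another name.
a = add_numbers
print(a(1, 2))
```

Calling a(1, 2) runs exactly the same code as add_numbers(1, 2), because both names refer to one function object.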

Types and Sequences

Use type to return the object’s type.

 

Tuples are an immutable data structure (cannot be altered).

Lists are a mutable data structure.

Use append to append an object to a list.

This is an example of how to loop through each item in the list.

Or using the indexing operator:

Use + to concatenate lists.

Use * to repeat lists.

Use the in operator to check if something is inside a list.
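The operations in this section can be sketched together as follows. The variable names and sample values are illustrative assumptions.

```python
print(type("This is a string"))
print(type(None))
print(type(1))
print(type(1.0))

x = (1, "a", 2, "b")  # tuple: immutable
y = [1, "a", 2, "b"]  # list: mutable
y.append(3.3)

# Loop through each item directly.
for item in y:
    print(item)

# The same loop using the indexing operator.
i = 0
while i < len(y):
    print(y[i])
    i += 1

concatenated = [1, 2] + [3, 4]  # [1, 2, 3, 4]
repeated = [1] * 3              # [1, 1, 1]
has_one = 1 in [1, 2, 3]        # True
```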

Now let’s look at strings. Use bracket notation to slice a string.

This will return the last element of the string.

This will return the slice starting from the 4th element from the end and stopping before the 2nd element from the end.

This is a slice from the beginning of the string and stopping before the 3rd element.

And this is a slice starting from the 3rd element of the string and going all the way to the end.

split returns a list of all the words in a string, or a list split on a specific character.

Make sure you convert objects to strings before concatenating.
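A short sketch of the string operations above. The sample strings are illustrative, and it assumes the positions in the text correspond to the zero-based slices x[:3] and x[3:].

```python
x = "This is a string"

last_char = x[-1]      # 'g'
near_end = x[-4:-2]    # 'ri'
first_three = x[:3]    # 'Thi'
from_index_3 = x[3:]   # 's is a string'

words = x.split()                # ['This', 'is', 'a', 'string']
parts = "2024-01-15".split("-")  # ['2024', '01', '15']

# "Chris" + 2 would raise TypeError, so convert the number first.
label = "Chris" + str(2)         # 'Chris2'
```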

Dictionaries associate keys with values.

Iterate over all of the keys:

Iterate over all of the values:

Iterate over all of the items in the dictionary:

You can unpack a sequence into different variables:

Make sure the number of values you are unpacking matches the number of variables being assigned.
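A minimal sketch of dictionary iteration and sequence unpacking. The names and example.com addresses are made up for illustration.

```python
emails = {"Alice": "alice@example.com", "Bob": "bob@example.com"}
emails["Carol"] = "carol@example.com"  # add a new key/value pair

for name in emails:                    # keys
    print(emails[name])

for email in emails.values():          # values
    print(email)

for name, email in emails.items():     # (key, value) pairs
    print(name, email)

record = ("Alice", "Smith", "alice@example.com")
first, last, address = record          # three values, three variables

# Unpacking three values into two variables raises ValueError.
mismatch_raised = False
try:
    only_first, only_last = record
except ValueError:
    mismatch_raised = True
```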

More on Strings

Python has a built in method for convenient string formatting.
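One way to illustrate this is str.format, assumed here to be the method the text refers to. The sales record is a made-up example.

```python
sales_record = {"price": 3.24, "num_items": 4, "person": "Chris"}

sales_statement = "{} bought {} item(s) at a price of {} each for a total of {}"

message = sales_statement.format(
    sales_record["person"],
    sales_record["num_items"],
    sales_record["price"],
    sales_record["num_items"] * sales_record["price"],
)
print(message)
```

Each {} placeholder is filled, in order, by the corresponding argument to format.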

Reading and Writing CSV files

Let’s import our data file mpg.csv, which contains fuel economy data for 234 cars.

  • mpg: miles per gallon
  • class: car classification
  • cty: city mpg
  • cyl: # of cylinders
  • displ: engine displacement in litres
  • drv: f = front-wheel drive, r = rear-wheel drive, 4 = 4wd
  • fl: fuel (e = ethanol E85, d = diesel, r = regular, p = premium, c = CNG)
  • hwy: highway mpg
  • manufacturer: automobile manufacturer
  • model: the model of car
  • trans: type of transmission
  • year: model year

csv.DictReader has read in each row of our csv file as a dictionary. len shows that our list consists of 234 dictionaries.

keys gives us the column names of our csv.

This is how to find the average cty fuel economy across all cars. All values in the dictionaries are strings, so we need to convert to float.

Similarly, this is how to find the average hwy fuel economy across all cars.
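A minimal sketch of reading the file and computing the two averages. To stay self-contained it reads a three-row stand-in from memory; those rows are invented for illustration and are not values from mpg.csv. With the real file, the io.StringIO object would be replaced by an open file handle.

```python
import csv
import io

# Hypothetical stand-in for mpg.csv; the values are illustrative only.
sample = io.StringIO(
    "manufacturer,model,cyl,cty,hwy,class\n"
    "maker_a,model_x,4,20,30,compact\n"
    "maker_b,model_y,6,16,24,suv\n"
    "maker_c,model_z,4,22,32,compact\n"
)

mpg = list(csv.DictReader(sample))  # one dictionary per row

print(len(mpg))               # 3 for this sample
print(list(mpg[0].keys()))    # column names

# Every value is a string, so convert to float before summing.
avg_cty = sum(float(d["cty"]) for d in mpg) / len(mpg)
avg_hwy = sum(float(d["hwy"]) for d in mpg) / len(mpg)
print(avg_cty, avg_hwy)
```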

Use set to return the unique values for the number of cylinders the cars in our dataset have.

Here’s a more complex example where we are grouping the cars by the number of cylinders, and finding the average cty mpg for each group.

Use set to return the unique values for the class types in our dataset.

And here’s an example of how to find the average hwy mpg for each class of vehicle in our dataset.
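The two grouping examples follow the same pattern, so they can be sketched with one helper. The function average_by is a hypothetical name introduced here, and the three sample rows are invented stand-ins for mpg.csv.

```python
import csv
import io

# Hypothetical stand-in for mpg.csv; the values are illustrative only.
sample = io.StringIO(
    "cyl,cty,hwy,class\n"
    "4,20,30,compact\n"
    "6,16,24,suv\n"
    "4,22,32,compact\n"
)
mpg = list(csv.DictReader(sample))


def average_by(rows, group_key, value_key):
    """Return sorted (group, average) pairs; hypothetical helper."""
    averages = []
    # set() yields the unique values of the grouping column.
    for group in set(row[group_key] for row in rows):
        values = [float(row[value_key]) for row in rows if row[group_key] == group]
        averages.append((group, sum(values) / len(values)))
    return sorted(averages)


cty_by_cyl = average_by(mpg, "cyl", "cty")
hwy_by_class = average_by(mpg, "class", "hwy")
print(cty_by_cyl)
print(hwy_by_class)
```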

Dates and Times

time returns the current time in seconds since the Epoch (January 1st, 1970).

Convert the timestamp to datetime.

Handy datetime attributes:

timedelta is a duration expressing the difference between two dates.

date.today returns the current local date.
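A short sketch of the date and time calls above. The 100-day duration is an arbitrary illustrative value.

```python
import datetime as dt
import time

timestamp = time.time()                     # seconds since the Epoch
now = dt.datetime.fromtimestamp(timestamp)  # timestamp -> datetime

# Handy datetime attributes.
print(now.year, now.month, now.day, now.hour, now.minute, now.second)

delta = dt.timedelta(days=100)  # a duration of 100 days
today = dt.date.today()         # current local date

earlier = today - delta         # the date 100 days ago
print(earlier)
print(today > earlier)          # dates can be compared
```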

Objects and map()

An example of a class in Python:
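A minimal sketch of such a class. The class name, attribute names and values are illustrative assumptions.

```python
class Person:
    department = "School of Information"  # class variable shared by all instances

    def set_name(self, new_name):
        self.name = new_name              # instance variable

    def set_location(self, new_location):
        self.location = new_location


person = Person()
person.set_name("Alice")
person.set_location("Ann Arbor")
print(f"{person.name} lives in {person.location} and works in {person.department}")
```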

Here’s an example of mapping the min function between two lists.

Now let’s iterate through the map object to see the values.
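A sketch of mapping min across two lists, using made-up prices from two hypothetical stores. map returns a lazy iterator, so the values only appear once it is iterated.

```python
store1 = [10.00, 11.00, 12.34, 2.34]
store2 = [9.00, 11.10, 12.34, 2.01]

cheapest = map(min, store1, store2)  # lazy map object, nothing computed yet
print(cheapest)

cheapest_prices = []
for price in cheapest:               # iterating produces the values
    print(price)
    cheapest_prices.append(price)
```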

Lambda and List Comprehensions

Here’s an example of lambda that takes in three parameters and adds the first two.

Let’s iterate from 0 to 999 and return the even numbers.

Now the same thing but with a list comprehension.
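A sketch of the three examples in this section. The variable names are illustrative.

```python
# A lambda with three parameters that adds only the first two.
my_function = lambda a, b, c: a + b
print(my_function(1, 2, 3))  # 3

# Even numbers from 0 to 999 with an explicit loop.
evens_loop = []
for number in range(0, 1000):
    if number % 2 == 0:
        evens_loop.append(number)

# The same result with a list comprehension.
evens_comprehension = [number for number in range(0, 1000) if number % 2 == 0]
```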

Numerical Python (NumPy)

Creating Arrays

Create a list and convert it to a NumPy array.

Or just pass in a list directly.

Pass in a list of lists to create a multidimensional array.

Use the shape attribute to find the dimensions of the array. (rows, columns)
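A minimal sketch of the array-creation steps above, with illustrative values.

```python
import numpy as np

mylist = [1, 2, 3]
x = np.array(mylist)                     # convert an existing list

y = np.array([4, 5, 6])                  # pass a list directly

m = np.array([[7, 8, 9], [10, 11, 12]])  # list of lists -> 2-D array

print(m.shape)  # (2, 3): 2 rows, 3 columns
```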

arange returns evenly spaced values within a given interval.

reshape returns an array with the same data with a new shape.

linspace returns evenly spaced numbers over a specified interval.

resize changes the shape and size of the array in-place.
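A sketch of these four calls with illustrative arguments. The in-place resize is shown on an array built with np.array, and refcheck=False is passed as a defensive assumption because the default reference check can fail in some interactive environments.

```python
import numpy as np

n = np.arange(0, 30, 2)     # start at 0, step by 2, stop before 30
n = n.reshape(3, 5)         # same 15 values, viewed as 3 rows x 5 columns

o = np.linspace(0, 4, 9)    # 9 evenly spaced values from 0 to 4 inclusive

r = np.array([0, 1, 2, 3, 4, 5])
r.resize((2, 3), refcheck=False)  # changes r itself; nothing is returned

print(n)
print(o)
print(r)
```

reshape returns a new array object and leaves the original untouched, while resize modifies the array it is called on.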

ones returns a new array of given shape and type, filled with ones.

zeros returns a new array of given shape and type, filled with zeros.

eye returns a 2-D array with ones on the diagonal and zeros elsewhere.

diag extracts a diagonal or constructs a diagonal array.

Create an array from a repeated list (or see np.tile).

Repeat elements of an array using repeat.
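A short sketch of the constructors above, using small illustrative shapes and values.

```python
import numpy as np

ones = np.ones((3, 2))      # 3x2 array of 1.0
zeros = np.zeros((2, 3))    # 2x3 array of 0.0
identity = np.eye(3)        # 3x3 identity matrix

y = np.array([4, 5, 6])
diagonal_matrix = np.diag(y)                # 3x3 with 4, 5, 6 on the diagonal
diagonal_values = np.diag(diagonal_matrix)  # extracts the diagonal again

tiled = np.array([1, 2, 3] * 3)     # the whole list repeated: 1 2 3 1 2 3 1 2 3
repeated = np.repeat([1, 2, 3], 3)  # each element repeated: 1 1 1 2 2 2 3 3 3
```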

Combining Arrays

Use vstack to stack arrays in sequence vertically (row wise).

Use hstack to stack arrays in sequence horizontally (column wise).
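A minimal sketch of both stacking functions, using an illustrative 2 by 3 array of ones.

```python
import numpy as np

p = np.ones([2, 3], int)

v = np.vstack([p, 2 * p])  # rows of 2*p placed under p -> shape (4, 3)
h = np.hstack([p, 2 * p])  # columns of 2*p placed beside p -> shape (2, 6)

print(v)
print(h)
```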

Operations

Use +, -, *, / and ** to perform element-wise addition, subtraction, multiplication, division and power.

Use dot to compute the dot product of two arrays.
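A sketch of the element-wise operators and the dot product on two small illustrative arrays.

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

print(x + y)    # [5 7 9]
print(x - y)    # [-3 -3 -3]
print(x * y)    # [ 4 10 18]
print(x / y)    # [0.25 0.4  0.5 ]
print(x ** 2)   # [1 4 9]

dot_product = x.dot(y)  # 1*4 + 2*5 + 3*6 = 32
print(dot_product)
```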

Let’s look at transposing arrays. Transposing permutes the dimensions of the array.

The shape of array z is (2, 3) before transposing.

Use .T to get the transpose.

The number of rows has swapped with the number of columns.

Use .dtype to see the data type of the elements in the array.

Use .astype to cast to a specific type.
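A sketch of transposing and type casting. The contents of z are an illustrative assumption chosen to give the (2, 3) shape mentioned in the text, and the default integer dtype depends on the platform.

```python
import numpy as np

y = np.array([4, 5, 6])
z = np.array([y, y ** 2])  # two rows, three columns

print(z.shape)     # (2, 3)
print(z.T.shape)   # (3, 2): rows and columns have swapped

print(z.dtype)     # platform-dependent integer type

z_float = z.astype("f")  # "f" is the type code for 32-bit floats
print(z_float.dtype)     # float32
```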

Math Functions

NumPy has many built-in math functions that can be performed on arrays.

argmax and argmin return the index of the maximum and minimum values in the array.
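A short sketch of a few of these functions on an illustrative array.

```python
import numpy as np

a = np.array([-4, -2, 1, 3, 5])

print(a.sum())     # 3
print(a.max())     # 5
print(a.min())     # -4
print(a.mean())    # 0.6
print(a.std())     # standard deviation

print(a.argmax())  # 4: index of the maximum value
print(a.argmin())  # 0: index of the minimum value
```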

Indexing / Slicing

Use bracket notation to get the value at a specific index. Remember that indexing starts at 0.

Use : to indicate a range. array[start:stop]
Leaving start or stop empty will default to the beginning/end of the array.

Use negatives to count from the back.

A second : can be used to indicate step-size. array[start:stop:stepsize]
Here we are starting at the 5th element from the end and counting backwards by 2 until the beginning of the array is reached.
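A sketch of one-dimensional indexing and slicing, using an illustrative array of the first 13 squares.

```python
import numpy as np

s = np.arange(13) ** 2  # 0, 1, 4, 9, ..., 144

print(s[0], s[4], s[-1])  # 0 16 144

print(s[1:5])     # [ 1  4  9 16]
print(s[-4:])     # [ 81 100 121 144]

# Start at the 5th element from the end (64) and step backwards by 2.
print(s[-5::-2])  # [64 36 16  4  0]
```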

Let’s look at a multidimensional array.

Use bracket notation to slice: array[row, column]

And use : to select a range of rows or columns

This is a slice of the last row, and only every other element.

We can also perform conditional indexing. Here we are selecting values from the array that are greater than 30. (Also see np.where)

Here we are assigning all values in the array that are greater than 30 to the value of 30.
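A sketch of two-dimensional slicing and conditional indexing on an illustrative 6 by 6 array holding 0 to 35.

```python
import numpy as np

r = np.arange(36).reshape(6, 6)

print(r[2, 2])     # 14: row 2, column 2
print(r[3, 3:6])   # [21 22 23]: row 3, columns 3 to 5
print(r[:2, :-1])  # first two rows, every column except the last
print(r[-1, ::2])  # [30 32 34]: last row, every other element

above_30 = r[r > 30]  # boolean mask selects matching values
print(above_30)       # [31 32 33 34 35]

r[r > 30] = 30        # cap every value above 30 at 30
print(r)
```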

Copying Data

Be careful with copying and modifying arrays in NumPy!
r2 is a slice of r

Set this slice’s values to zero ([:] selects the entire array)

r has also been changed!

To avoid this, use r.copy to create a copy that will not affect the original array

Now when r_copy is modified, r will not be changed.
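A sketch of the difference between a slice, which is a view of the original data, and an explicit copy. The 6 by 6 array and the value 10 are illustrative.

```python
import numpy as np

r = np.arange(36).reshape(6, 6)

r2 = r[:3, :3]  # a slice is a view onto r's data
r2[:] = 0       # [:] selects the whole slice
print(r)        # the top-left 3x3 block of r is now zero too

r_copy = r.copy()  # independent copy of the data
r_copy[:] = 10
print(r_copy)
print(r)           # r is unchanged by edits to r_copy
```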

Iterating Over Arrays

Let’s create a new 4 by 3 array of random integers from 0 to 9.

Iterate by row:

Iterate by index:

Iterate by row and index:

Use zip to iterate over multiple iterables.
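A sketch of the iteration patterns above. The values are random, so the printed output differs between runs. The second array, test2, is an illustrative assumption added to give zip two iterables.

```python
import numpy as np

test = np.random.randint(0, 10, (4, 3))  # integers 0 to 9, shape (4, 3)

for row in test:                 # by row
    print(row)

for i in range(len(test)):       # by index
    print(test[i])

for i, row in enumerate(test):   # by row and index
    print("row", i, "is", row)

test2 = test ** 2
for row_a, row_b in zip(test, test2):  # pair up rows of both arrays
    print(row_a, "+", row_b, "=", row_a + row_b)
```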

Next On Data Science Using Python:
The Series Data Structure
The DataFrame Data Structure
DataFrame Indexing and Loading
Querying a DataFrame
Indexing DataFrames
Missing values
