Joseph Nathan Cohen

Department of Sociology, CUNY Queens College, New York, NY

Starting Your Analysis Script in Python

Lines to start your data analysis session in Python.

When teaching analysis, I advise my students to start each session with a standard setup.  This setup looks like this:

# Clear Memory
%reset -f

# Set Working Directory
import os
os.chdir('insert your working directory path here')

# Set Random Seed
import numpy as np
np.random.seed(55)  # replace 55 with any seed number you choose

Clear the Memory

Example:

%reset -f

This line clears the memory in Jupyter notebooks. It is an IPython magic command, so it works in Jupyter and IPython sessions but not in a plain Python script. It deletes all variables, functions, and imported modules from the session's namespace. Starting from a clean session ensures that your script does not produce, or rely on, leftover data or definitions from previous sessions.

Set the Working Directory

Example:

import os
os.chdir('C:\\Users\\JCohen\\Documents\\Research\\Household Finance Paper')

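Note that each backslash is doubled in the Python string, whereas Windows itself displays the path with single backslashes. Python treats a single backslash inside an ordinary string as the start of an escape sequence, so a literal backslash is written as two.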
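To illustrate why the backslashes are doubled, here is a minimal sketch. The variable names are my own, and the path is shortened for illustration. It shows that a doubled-backslash string and a raw string (prefixed with r) hold the same text, and that a single backslash can change the meaning of the character after it.

```python
# Two equivalent ways to write the same Windows path in Python.
# The variable names are illustrative only.
escaped = 'C:\\Users\\JCohen\\Documents'   # doubled backslashes
raw = r'C:\Users\JCohen\Documents'          # raw string: backslashes are taken literally

print(escaped == raw)  # True

# A single backslash followed by certain letters is read as an escape code.
tab_length = len('\t')        # 1: a single tab character
literal_length = len('\\t')   # 2: a backslash followed by the letter t
print(tab_length, literal_length)
```

Either form can be passed to os.chdir. Which one to use is a matter of taste; the raw string is often easier to paste from the Windows file explorer.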

Here, I am setting the directory to the “Household Finance Paper” folder, contained in my “Research” folder, on my Windows device. This is where Python will look for the data and script files that your script refers to, and where it will save any files created during your session, unless you give a full path. Setting it at the start reduces the risk of path-related problems in your script.
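As a rough check that the working directory was set as intended, something like the following can help. This is a sketch under assumptions: a temporary folder stands in for a real project folder so that the lines run on any machine, and the file name example_output.txt is hypothetical.

```python
import os
import tempfile

# A temporary folder stands in for a real project folder here.
# In practice, replace it with the path to your own project.
project_dir = tempfile.mkdtemp()

start_dir = os.getcwd()   # remember where the session started
os.chdir(project_dir)     # set the working directory
print(os.getcwd())        # confirm the change took effect

# A bare file name is now resolved inside the working directory.
with open('example_output.txt', 'w') as f:
    f.write('saved in the working directory')

saved_in_project = os.path.exists(os.path.join(project_dir, 'example_output.txt'))
print(saved_in_project)   # True

os.chdir(start_dir)       # go back to the original directory
```

Printing os.getcwd() right after os.chdir is a cheap way to catch a mistyped path before any data is loaded or saved.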

Load Essential Libraries

Although it is not always necessary, I encourage students to start out by loading some workhorse Python libraries used for data analysis:

  • Numpy (numpy): Essential for numerical computations, especially those involving arrays and matrices. It’s often used in the background by other libraries for performing efficient numerical operations.
  • Pandas (pandas): Indispensable for data manipulation and cleaning. It introduces DataFrame and Series data structures that are ideal for handling and analyzing structured data.
  • Matplotlib (matplotlib): A foundational plotting library, useful for creating a wide range of static, animated, and interactive visualizations.
  • Seaborn (seaborn): Builds on Matplotlib and provides a high-level interface for drawing attractive and informative statistical graphics.

To install them:

pip install numpy
pip install pandas
pip install matplotlib
pip install seaborn

To call up the libraries in your session:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
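If one of these import lines fails with a ModuleNotFoundError, the library is probably not installed in the Python environment your session is using. A minimal sketch for checking all four at once, using only the standard library (the variable names are my own):

```python
import importlib.util

# Package names from the list above.
libraries = ['numpy', 'pandas', 'matplotlib', 'seaborn']

# find_spec returns None when a top-level package cannot be found.
missing = [name for name in libraries if importlib.util.find_spec(name) is None]

if missing:
    print('Not installed:', ', '.join(missing))
else:
    print('All four libraries are installed.')
```

Anything reported as not installed can then be added with the matching pip install line.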

Set Random Seed

Many statistical operations rely on randomly generated numbers. This makes replication difficult, because the results can change each time the analysis is run. When you set a random seed, you are effectively calling up a predetermined sequence of random numbers associated with that seed number. If two analysts set the same seed, their analyses will use the same random numbers, and the results should replicate.

The following line sets the random seed in numpy to 55. You can choose any whole number from 0 to 2**32 - 1:

np.random.seed(55)
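To see what the seed does, here is a small sketch that draws random numbers twice from the same seed and compares the results. The variable names are illustrative. The second half uses NumPy's Generator interface (np.random.default_rng), which is an alternative to the global seed and is not part of the setup above.

```python
import numpy as np

np.random.seed(55)
first_draw = np.random.rand(3)    # three random numbers between 0 and 1

np.random.seed(55)                # reset to the same seed
second_draw = np.random.rand(3)

same_global = np.array_equal(first_draw, second_draw)
print(same_global)  # True

# Alternative: a Generator object carries its own seed,
# separate from the global one set by np.random.seed.
rng_a = np.random.default_rng(55)
rng_b = np.random.default_rng(55)
same_generator = np.array_equal(rng_a.normal(size=3), rng_b.normal(size=3))
print(same_generator)  # True
```

Note that np.random.seed only affects NumPy's random functions. Python's built-in random module keeps its own seed, which is set with random.seed.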
