Joseph Nathan Cohen

Department of Sociology, CUNY Queens College, New York, NY

Starting Your Analysis Script in Python

Lines to start your data analysis session in Python.

When teaching analysis, I advise my students to start each session with a standard setup.  This setup looks like this:

Clear the Memory

Example:
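In a Jupyter notebook cell, this is usually typed as the magic command `%reset -f` on a line of its own. The sketch below does the same thing through a function call, so that it also runs as an ordinary Python script; the helper name `clear_session` is my own, chosen for illustration.

```python
def clear_session():
    """Clear the interactive namespace when running under IPython or Jupyter."""
    try:
        shell = get_ipython()  # this name is defined only inside IPython/Jupyter
    except NameError:
        return False  # plain Python: there is no interactive namespace to reset
    if shell is None:
        return False
    # Same effect as typing "%reset -f" in a notebook cell.
    shell.run_line_magic("reset", "-f")
    return True


cleared = clear_session()
print("Session cleared:", cleared)
```

The `-f` flag skips the confirmation prompt, so the reset runs without asking you to type "y".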

This line is used to clear the memory in Jupyter notebooks. It deletes all variables, functions, and imported modules. This step ensures that you work from a clean Python session, so that your script neither produces nor relies on leftover data or definitions from previous sessions.

Set the Working Directory

Example:
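A minimal sketch using `os.chdir` from the standard library. The path shown is hypothetical (the user name in particular is a placeholder), so substitute the location of your own project folder.

```python
import os

# Hypothetical Windows path; replace it with the location of your own project folder.
project_dir = "C:\\Users\\yourname\\Research\\Household Finance Paper"

# Change directory only if the folder exists, so this sketch runs on any machine.
if os.path.isdir(project_dir):
    os.chdir(project_dir)

# Show the working directory that Python is now using.
print(os.getcwd())
```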

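Note that the path is written with double backslashes (\\), whereas Windows itself uses a single backslash (\). In a Python string, a single backslash begins an escape sequence, so each backslash in the path has to be doubled.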

Here, I am setting the directory to the “Household Finance Paper” folder, contained in my “Research” folder, on my Windows device.  This is where Python will look for the data files and scripts that your script refers to by relative path, and where it will save any files created during your session.  Setting it at the outset reduces the risk of path-related problems with your script.

Load Essential Libraries

Although it is not always necessary, I encourage students to start out by loading some workhorse Python libraries used for data analysis:

  • NumPy (numpy): Essential for numerical computations, especially those involving arrays and matrices. It’s often used in the background by other libraries for performing efficient numerical operations.
  • Pandas (pandas): Indispensable for data manipulation and cleaning. It introduces DataFrame and Series data structures that are ideal for handling and analyzing structured data.
  • Matplotlib (matplotlib): A foundational plotting library, useful for creating a wide range of static, animated, and interactive visualizations.
  • Seaborn (seaborn): Builds on Matplotlib and provides a high-level interface for drawing attractive and informative statistical graphics.

To install them:
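Installation is normally done once, from a terminal, with `pip install numpy pandas matplotlib seaborn` (or `%pip install ...` inside a notebook cell). As a hedged sketch, the snippet below checks which of the four libraries are missing and prints the command you would need; the variable names are my own.

```python
import importlib.util

# The four libraries discussed above, by their import names.
required = ["numpy", "pandas", "matplotlib", "seaborn"]

# find_spec returns None when a package cannot be found in this environment.
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print("Run in a terminal: pip install " + " ".join(missing))
else:
    print("All four libraries are already installed.")
```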

To call up the libraries in your session:
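A typical import block, assuming all four libraries are installed. The short aliases (`np`, `pd`, `plt`, `sns`) are community conventions rather than requirements.

```python
import numpy as np               # numerical arrays and matrices
import pandas as pd              # DataFrame and Series data structures
import matplotlib.pyplot as plt  # plotting interface of Matplotlib
import seaborn as sns            # statistical graphics built on Matplotlib
```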

Set Random Seed

Many statistical operations rely on randomly generated numbers. This makes replication difficult, because each run draws different numbers, so results can change from one run to the next. When you set a random seed, you are effectively calling up a predetermined sequence of random numbers associated with that seed number. If two analysts set the same seed, then their analyses will use the same random numbers, and the analysis should replicate.

To set a random seed in NumPy, call `np.random.seed(55)`, which sets the seed at 55. You can choose any number.
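A short sketch of why this helps with replication: resetting the same seed before each draw produces identical numbers. The variable names are illustrative.

```python
import numpy as np

# Set the seed, then draw three random numbers.
np.random.seed(55)
first_draw = np.random.rand(3)

# Reset the same seed and draw again.
np.random.seed(55)
second_draw = np.random.rand(3)

# Both draws match because they start from the same seed.
print(np.array_equal(first_draw, second_draw))  # True
```

Note that `np.random.seed` affects only NumPy's global random number generator. Python's built-in `random` module, for example, keeps its own seed.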
