So, what is the best Python Integrated Development Environment (IDE)?
This question gets asked all the time. The quick answer is... “It depends”.
What problem are you trying to solve, and where in the CRISP-DM methodology are you operating?
Figure 2. CRISP-DM Methodology
Some IDEs are better suited to the Data Understanding and Data Preparation phases, while others are stronger for Modeling, Deployment, and sharing analysis.
We actually have three architecture options for Python development: the command line, an IDE, or a notebook. For tool selection, we need to look at which part of the data science process we are in and how well the tool meets our trade-offs between cost, quality, and time to market.
For example, in the data cleansing phase of a project you may only need the command line, and there are many benefits to this. One great use case for the command line is maximizing your memory and CPU resources with parallel processing for large data sets (see the article by Adam Drake). Lightweight Python scripts driven from the shell are a great way to parallelize work across the cores and memory you already have.
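To make that concrete, here is a minimal sketch of the pattern, assuming a hypothetical transactions.csv with an `amount` column: it splits the file into chunks and aggregates them in parallel using only the Python standard library.

```python
# parallel_sum.py -- toy example: aggregate a large CSV in parallel chunks.
# Assumes a hypothetical transactions.csv with a header row and an "amount" column.
import csv
from multiprocessing import Pool

def sum_chunk(rows):
    """Sum the 'amount' field for one chunk of parsed rows."""
    return sum(float(row["amount"]) for row in rows)

def chunked(iterable, size=100_000):
    """Yield lists of up to `size` items from any iterator."""
    chunk = []
    for item in iterable:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

if __name__ == "__main__":
    with open("transactions.csv", newline="") as f:
        reader = csv.DictReader(f)
        with Pool() as pool:  # one worker process per CPU core by default
            total = sum(pool.imap_unordered(sum_chunk, chunked(reader)))
    print(f"Total amount: {total:,.2f}")
```

Running `python parallel_sum.py` from the shell fans the work out across all available cores; the same idea extends to piping pre-split files through shell tools such as xargs, in the spirit of the command-line approach Drake describes.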
However, if we want to integrate these tools into the data exploration and model-building phases of a project, as well as reuse them in other applications, we are going to need an Integrated Development Environment (IDE). IDEs provide the features for authoring, modifying, compiling, deploying, and debugging software.
There are many IDEs out there, and I have experimented with several. I've tried Yhat's Rodeo platform (released after the Stack Overflow spreadsheet in Figure 1 was put together), Spyder, PyCharm, Jupyter, and RStudio. I have also done extensive research on Stack Overflow and various data science blog reviews. My best source, however, was the Operation Code Slack channel. Operation Code is the largest community dedicated to helping military veterans and their families launch software development careers, and it offers great content and collaboration for anyone making that transition (https://operationcode.org).
Here are my thoughts. For Python development and initial code syntax training, you want PyCharm or a similar IDE with IntelliSense, the intelligent code completion that only a few IDEs offer; it helps new developers with syntax and proper formatting techniques. I was fond of all four Python IDEs that I worked with and tested directly, and found them very easy to use, with Yhat's Rodeo and PyCharm my overall favorites. Yhat has a great data science blog (http://blog.yhat.com) that initially brought me to Rodeo. Ultimately, I had to use PyCharm for a class and stuck with it due to its overall functionality, nice layout, and ease of use.
Figure 3: PyCharm Example
In Figure 3, our PyCharm example, the yellow highlights indicate places where the Python code deviates from best-practice syntax. The color-coded marks in the right margin show where the issues are and how severe they are: yellow flags a formatting or best-practice tip, while red would indicate a syntax or logic issue that prevents the code from running.
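As a concrete (and hypothetical) illustration, here is the kind of code PyCharm marks up: the first function draws yellow unused-import and PEP 8 warnings, while the cleaned-up version below it passes without any marks.

```python
import os                            # yellow: 'os' is imported but unused

def total_price(qty,unit_price):     # yellow: PEP 8 - missing whitespace after ','
    total=qty * unit_price           # yellow: PEP 8 - missing whitespace around '='
    return total

def total_price_clean(qty, unit_price):
    """Same logic, formatted to PEP 8 style, so no warnings appear."""
    return qty * unit_price
```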
For data understanding and data preparation, we are going to want something like RStudio, Spyder, or Rodeo. The big positive with these IDEs is the variable explorer view: you can see which variables are stored and double-click to inspect the underlying data. Rodeo also automates, or at least greatly simplifies, saving the images from graphs.
I like RStudio the best because of how easy it is to switch between Python, R, and SQL. The ability to move seamlessly between R and Python in a single environment is particularly useful for cleaning and manipulating large datasets; some tasks are simply better suited to Python, and others to R. One additional benefit of RStudio and Jupyter notebooks is how the code executes: everything stays loaded in the session's memory, so there is minimal lag time, whereas PyCharm, Rodeo, and Spyder have to re-import packages each time you run a script, and some dataframes can take a while to load. Notebooks also make it very easy to share analysis and demonstrate findings. Another great feature of RStudio is the ability to convert a notebook and its analysis to slides with a simple declaration in the output line of the document header:
• beamer_presentation - PDF presentations with beamer
• ioslides_presentation - HTML presentations with ioslides
• slidy_presentation - HTML presentations with slidy
• revealjs::revealjs_presentation - HTML presentations with reveal.js
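As a minimal sketch (the title and author below are hypothetical placeholders), the notebook's YAML header only needs its output value changed to produce a slide deck instead of the usual HTML report:

```yaml
---
title: "Python IDE Notes"                  # hypothetical title
author: "Your Name"                        # hypothetical author
output: revealjs::revealjs_presentation    # any of the formats listed above works here
---
```

(The reveal.js format does require installing the revealjs R package first.)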
Figure 4: RStudio Notebook IDE with 'revealjs_presentation' Slide Output
My preferred method for new functionality is to develop and test large functions in PyCharm and then move to an RStudio notebook for data exploration and building analytics pipelines. You can actually cut and paste Python code directly into R Markdown; all you have to do is tell R Markdown what type of 'chunk' to run.
For Python:

```{python}
…
```

For SQL:

```{r}
library(DBI)
db <- dbConnect(RSQLite::SQLite(), dbname = "chinook.db")
query <- "SELECT * FROM tracks"
```

```{sql, connection=db, code = query}
```

Note: A future blog post will talk about the convergence in functionality on large datasets between Structured Query Language (SQL) and the R package 'dplyr'.
Figure 5: An example of Python running in an R Markdown document inside the RStudio Notebook IDE
For model development and final deployment, it depends on the size of the dataset and whether or not we need distributed processing with Spark. If we have a large number of images, or any other type of large dataset, we should use the Spark-based Databricks platform. Databricks works interactively with Amazon Web Services (AWS) to quickly set up and terminate server clusters for distributed processing.
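As a rough sketch of what that looks like inside a Databricks notebook (where a `spark` session and the `display` helper come predefined; the S3 path and column names are hypothetical placeholders), distributed processing only takes a few lines of PySpark:

```python
# Minimal PySpark sketch for a Databricks notebook.
# `spark` (a SparkSession) and `display()` are predefined in Databricks;
# the S3 path and column names below are hypothetical.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("s3://my-bucket/transactions/*.csv"))

# The aggregation is executed in parallel across the cluster's worker nodes.
summary = (df.groupBy("customer_id")
             .sum("amount")
             .withColumnRenamed("sum(amount)", "total_amount"))

display(summary.orderBy("total_amount", ascending=False).limit(10))
```

The same code runs unchanged whether the cluster has two workers or two hundred; Databricks handles provisioning and terminating the underlying AWS instances.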
Figure 6. Databricks Notebook Workspace
Databricks also automates the installation of software packages and libraries onto the Amazon cluster, greatly decreasing environment setup and configuration time.
Figure 7. Databricks Spark Deep Learning Package
With the Databricks Community Edition, users have access to a 6 GB cluster, a cluster manager, and the notebook environment to prototype simple applications. Community Edition access is not time-limited, and users do not incur AWS costs for their cluster usage.
The full Databricks platform offers production-grade functionality, such as an unlimited number of clusters that can easily scale up or down, a job launcher, collaboration features, advanced security controls, JDBC/ODBC integrations, and expert support. Users can process data at scale or build Apache Spark applications in a team setting. Pricing on top of the AWS charges is based on Databricks Units (DBUs).
Figure 8. Databricks Pricing Model (https://databricks.com/product/pricing)
Figure 9: Databricks Pricing Example for Production Edition
You will need to balance the time saved with Databricks against the cost of analysts setting up the same environment with other tools, but the automated Spark and AWS cluster integration makes it a wonderful environment to work in.
Conclusion

My top picks...
If you are going to develop a custom algorithm or a custom package in Python – PyCharm
If you are performing data exploration, building analytics pipelines, and sharing results – RStudio
If you have a large dataset that needs Spark distributed processing – Databricks
Please comment with your command line/IDE/Notebook best practices and tips.