Python's rich ecosystem of data science tools is a big draw for users. The only downside of such a broad and deep collection is that sometimes the best tools can get overlooked.
Here's a rundown of some of the best newer or lesser-known data science projects available for Python. Some, like Polars, are getting more attention than before but still deserve wider notice; others, like ConnectorX, are hidden gems.
Most data sits in a database somewhere, but computation typically happens outside of a database. Moving data to and from the database for actual work can become a bottleneck. ConnectorX loads data from databases into many common data-wrangling tools in Python, and it keeps things fast by minimizing the amount of work to be done.
Like Polars (which I'll discuss soon), ConnectorX uses a Rust library at its core. This allows for optimizations like being able to load from a data source in parallel with partitioning. Data in PostgreSQL, for instance, can be loaded this way by specifying a partition column.
Aside from PostgreSQL, ConnectorX also supports reading from MySQL/MariaDB, SQLite, Amazon Redshift, Microsoft SQL Server and Azure SQL, and Oracle. The results can be funneled into a Pandas DataFrame or a PyArrow table, or into Modin, Dask, or Polars by way of PyArrow.
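A partitioned load along the lines described above might look like the sketch below. The connection string, table, and column you pass in are up to you, and the `partition_bounds` helper is hypothetical: it only illustrates how a numeric column's range could be divided among parallel queries, and is not ConnectorX's internal code. The `read_sql` call with `partition_on` and `partition_num` follows the project's documented usage.

```python
from __future__ import annotations


def partition_bounds(lo: int, hi: int, parts: int) -> list[tuple[int, int]]:
    """Split the inclusive range [lo, hi] into contiguous inclusive sub-ranges.

    Hypothetical helper: it only illustrates the idea of range partitioning.
    """
    if parts < 1 or hi < lo:
        raise ValueError("need parts >= 1 and hi >= lo")
    total = hi - lo + 1
    parts = min(parts, total)  # never create empty partitions
    base, extra = divmod(total, parts)
    bounds, start = [], lo
    for i in range(parts):
        # The first `extra` partitions absorb the remainder, one value each.
        size = base + (1 if i < extra else 0)
        bounds.append((start, start + size - 1))
        start += size
    return bounds


def load_partitioned(conn: str, query: str, column: str, parts: int = 4):
    """Load query results with ConnectorX, split into `parts` parallel reads."""
    import connectorx as cx  # third-party: pip install connectorx

    # partition_on names a numeric column; ConnectorX splits the query on it
    # and fetches the pieces in parallel.
    return cx.read_sql(conn, query, partition_on=column, partition_num=parts)
```

Given a reachable PostgreSQL server, a call such as `load_partitioned("postgresql://user:password@host:5432/db", "SELECT * FROM lineitem", "l_orderkey")` (all placeholder values) would return a Pandas DataFrame, which is the default return type of `read_sql`.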
Data science folks who use Python ought to be aware of SQLite—a small, but powerful and speedy, relational database packaged with Python. Since it runs as an in-process library, rather than a separate application, it's lightweight and responsive.
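Because the `sqlite3` module ships with Python's standard library, a working database is only a few lines away. The sketch below uses an in-memory database; the `readings` table and its values are made up for illustration.

```python
import sqlite3


def average_by_city(rows):
    """Load (city, temp) rows into an in-memory SQLite table and average them."""
    con = sqlite3.connect(":memory:")  # runs inside this process; no server
    try:
        con.execute("CREATE TABLE readings (city TEXT, temp REAL)")
        con.executemany("INSERT INTO readings VALUES (?, ?)", rows)
        cur = con.execute(
            "SELECT city, AVG(temp) FROM readings GROUP BY city ORDER BY city"
        )
        return cur.fetchall()
    finally:
        con.close()


SAMPLE = [("Oslo", 4.0), ("Oslo", 6.0), ("Lima", 20.0)]
print(average_by_city(SAMPLE))  # [('Lima', 20.0), ('Oslo', 5.0)]
```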
DuckDB is a little like someone answered the question, "What if we made SQLite for OLAP?" Like other OLAP database engines, it uses a columnar datastore and is optimized for long-running analytical query workloads. But it gives you all the things you expect from a conventional database, like ACID transactions. And there's no separate software suite to configure; you can get it running in a Python environment with a single pip install command.
DuckDB can directly ingest data in CSV, JSON, or Parquet format. The resulting databases can also be partitioned into multiple physical files for efficiency, based on keys (e.g., by year and month). Querying works like any other SQL-powered relational database, but with additional built-in features like the ability to take random samples of data or construct window functions.
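As a rough sketch of the features just mentioned, the helpers below build two queries: one that reads a CSV file directly and adds a running total with a window function, and one that draws a random sample. The `year`, `month`, and `total` columns are hypothetical, and the queries assume DuckDB's `read_csv_auto` table function and `USING SAMPLE` clause. The `run` helper imports `duckdb` only when called, so the query builders work without the package installed.

```python
def _quote_path(csv_path: str) -> str:
    """Escape single quotes so the path can sit inside a SQL string literal."""
    return csv_path.replace("'", "''")


def running_total_sql(csv_path: str) -> str:
    """Build a query that reads a CSV directly and adds a window function."""
    return (
        "SELECT year, month, total, "
        "SUM(total) OVER (PARTITION BY year ORDER BY month) AS running_total "
        f"FROM read_csv_auto('{_quote_path(csv_path)}') "
        "ORDER BY year, month"
    )


def sample_sql(csv_path: str, percent: int = 10) -> str:
    """Build a query that returns a random sample of the CSV's rows."""
    return (
        f"SELECT * FROM read_csv_auto('{_quote_path(csv_path)}') "
        f"USING SAMPLE {int(percent)}%"
    )


def run(sql: str):
    """Execute a query against a throwaway in-memory DuckDB database."""
    import duckdb  # third-party: pip install duckdb

    con = duckdb.connect()  # no arguments: in-memory database
    try:
        return con.execute(sql).fetchall()
    finally:
        con.close()
```

With a suitable file on disk, `run(running_total_sql("sales.csv"))` would return one tuple per row, where `sales.csv` is a placeholder name.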
DuckDB also has a small but useful collection of extensions, including full-text search, Excel import/export, direct connections to SQLite and PostgreSQL, Parquet file export, and support for many common geospatial data formats and types.
One of the least enviable jobs you can be stuck with is cleaning and preparing data for use in a DataFrame-centric project. Optimus is an all-in-one toolset for loading, exploring, cleansing, and writing data back out to a variety of data sources.
Optimus can use Pandas, Dask, cuDF (and Dask + cuDF), Vaex, or Spark as its underlying data engine. Data can be loaded from and saved back out to Arrow, Parquet, Excel, a variety of common database sources, or flat-file formats like CSV and JSON.
The data manipulation API resembles Pandas, but adds .cols() accessors to make it easy to do things like sort a DataFrame, filter by column values, alter data according to criteria, or narrow the range of operations based on some criteria. Optimus also comes bundled with processors for handling common real-world data types like email addresses and URLs.
One possible issue with Optimus is that, although it's still under active development, its last official release was in 2020. This means it may not be as up-to-date as other components in your stack.
If you spend much of your time working with DataFrames and you're frustrated by the performance limits of Pandas, reach for Polars. This DataFrame library for Python offers a convenient syntax similar to Pandas.
Unlike Pandas, though, Polars uses a library written in Rust that takes maximum advantage of your hardware out of the box. You don't need to use special syntax to take advantage of performance-enhancing features like parallel processing or SIMD; it's all automatic. Even simple operations like reading from a CSV file are faster.
Polars also provides eager and lazy execution modes, so queries can be executed immediately or deferred until needed. It offers a streaming API for processing queries incrementally as well, although streaming isn't available yet for many functions. And Rust developers can craft their own Polars extensions using PyO3.
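The difference between the two execution modes can be sketched as follows. The `READINGS` data is invented, and `mean_above` is a hypothetical plain-Python reference included only so the Polars results have something to be compared against. The Polars import lives inside the function, so the reference helper runs even where Polars isn't installed.

```python
READINGS = {
    "city": ["Oslo", "Lima", "Cairo", "Lima"],
    "temp": [4.0, 20.0, 30.0, 22.0],
}


def mean_above(readings, threshold):
    """Plain-Python reference: mean of temperatures strictly above threshold."""
    kept = [t for t in readings["temp"] if t > threshold]
    return sum(kept) / len(kept)


def eager_and_lazy(readings, threshold):
    """Compute the same filtered mean in Polars' eager and lazy modes."""
    import polars as pl  # third-party: pip install polars

    df = pl.DataFrame(readings)

    # Eager: each method call executes immediately and returns a DataFrame.
    eager = df.filter(pl.col("temp") > threshold).select(pl.col("temp").mean())

    # Lazy: the calls only build a query plan; collect() executes it.
    lazy = (
        df.lazy()
        .filter(pl.col("temp") > threshold)
        .select(pl.col("temp").mean())
        .collect()
    )
    return eager["temp"][0], lazy["temp"][0]
```

Both values returned by `eager_and_lazy(READINGS, 10)` should match `mean_above(READINGS, 10)`. The lazy form gives Polars the whole query to optimize before any work is done.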
Data science workflows are hard to set up, and even harder to set up in a consistent, predictable way. Snakemake was created to enable just that: automatically setting up data analyses in Python in ways that ensure everyone else gets the same results you do. Many existing data science projects rely on Snakemake. The more moving parts you have in your data science workflow, the more likely you'll benefit from automating it with Snakemake.
Snakemake workflows resemble GNU make workflows—you define the things you want to create with rules, which define what they take in, what they put out, and what commands to execute to accomplish that. Workflow rules can be multithreaded (assuming that gives them any benefit), and configuration data can be piped in from JSON/YAML files. You can also define functions in your workflows to transform data used in rules, and write the actions taken at each step to logs.
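The make-style model behind such rules can be sketched in plain Python. To be clear, this is not Snakemake syntax (Snakefiles use their own Python-based rule language); the `Rule` class and `build` function are hypothetical and only illustrate how a target's inputs get built before the target itself, with each step recorded in a log. The sketch does no cycle detection.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    output: str                           # the thing this rule creates
    inputs: list[str]                     # outputs of other rules it takes in
    action: Callable[[list[str]], str]    # command that turns inputs into output


def build(target: str, rules: dict[str, Rule],
          done: dict[str, str], log: list[str]) -> str:
    """Build `target`, first building whatever its rule takes in."""
    if target in done:                    # already built: reuse the result
        return done[target]
    rule = rules[target]
    values = [build(name, rules, done, log) for name in rule.inputs]
    done[target] = rule.action(values)
    log.append(target)                    # record the step, in execution order
    return done[target]


RULES = {
    "raw": Rule("raw", [], lambda _: "3,1,2"),
    "sorted": Rule("sorted", ["raw"],
                   lambda v: ",".join(sorted(v[0].split(",")))),
    "report": Rule("report", ["sorted", "raw"],
                   lambda v: f"{v[1]} -> {v[0]}"),
}

steps: list[str] = []
print(build("report", RULES, {}, steps))  # 3,1,2 -> 1,2,3
print(steps)                              # ['raw', 'sorted', 'report']
```

Asking for `report` is enough: the resolver works backward through the rules, and `raw` is built only once even though two rules depend on it.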
Snakemake jobs are designed to be portable—they can be deployed on any Kubernetes-managed environment, or in specific cloud environments like Google Cloud Life Sciences or Tibanna on AWS. Workflows can be "frozen" to use some exact set of packages, and any successfully executed workflow can have unit tests automatically generated and stored with it. And for long-term archiving, you can store the workflow as a tarball.
Serdar Yegulalp is a senior writer at InfoWorld, focused on machine learning, containerization, devops, the Python ecosystem, and periodic reviews.
Copyright © 2023 IDG Communications, Inc.