Tabular Data – Summer of Code

Parquet.jl enhancements

Difficulty: Medium

Duration: 175 hours

Apache Parquet is a binary data format for tabular data. It has features for compression and memory-mapping of datasets on disk. A decent implementation of Parquet in Julia is likely to be highly performant. It will be useful as a standard format for distributing tabular data in a binary format. There exists a Parquet.jl package that has a Parquet reader and a writer. It currently conforms to the Julia Tabular file IO interface at a very basic level. It needs more work to add support for critical elements that would make Parquet.jl usable for fast large scale parallel data processing. Each of these goals can be targeted as a single, short duration (175 hrs) project.

Lazy loading and support for out-of-core processing, with Arrow.jl and Tables.jl integration. Improved usability and performance of Parquet reader and writer for large files.
Reading from and writing data on to cloud data stores, including support for partitioned data.
Support for missing data types and encodings making the Julia implementation fully featured.

Resources:

The Parquet file format (also are many articles and talks on the Parquet storage format on the internet)
A tour of the data ecosystem in Julia
Tables.jl
Arrow.jl

Recommended skills: Good knowledge of Julia language, Julia data stack and writing performant Julia code.

Expected Results: Depends on the specific projects we would agree on.

Mentors: Tanmay Mohapatra

DataFrames.jl join enhancements

Difficulty: Hard

Duration: 175 hours

DataFrames.jl is one of the more popular implementations of tabular data type for Julia. One of the features it supports is data frame joining. However, more work is needed to improve this functionality. The specific targets for this project are (a final list of targets included in the scope of the project can be decided later).

fully implement multi-threading support by joins, reduce memory requirements of used join algorithms (which should additionally improve their performance), verify efficiency of alternative joining strategies in comparison to those currently used and implement them along with adaptive algorithm choosing the right joining strategy depending on the passed data;
implement join allowing for efficient matching on non-equal keys; special attention should be made to matching on keys that are date/time and spatial objects;
implement join allowing for an in-place update of columns of one data frame by values stored in another data frame based on matching key and condition specifying when an update should be performed;
implement an more flexible mechanizm than currently available allowing to define output data frame column names when performing a join.

Resources:

Recommended skills: Good knowledge of Julia language, Julia data stack and writing performant multi-threaded Julia code. Experience with benchmarking code and writing tests. Knowledge of join algorithms (as e.g. used in databases like DuckDB or other tabular data manipulation ecosystems e.g. Polars or data.table).

Expected Results: Depends on the specific projects we would agree on.

Mentors: Bogumił Kamiński