Day 3 - June 7th

Track 2 (Waterlink Atrium)

14:30 - 15:15

Gallia: A Schema-Aware Library for Practical Data Transformation in Scala

Gallia is a schema-aware data transformation library for Scala that I created with an emphasis on practicality, readability, and scalability (in that order of importance). In other words, it aims to make it easy to get the job done, to keep the code readable to domain experts, and to process big data when actually needed (by leveraging Apache Spark RDDs).

In this session I will describe what Gallia is, how to use it, and why one might want to use it, especially in contrast with alternative tools such as Pandas or Apache Spark on its own. I will briefly discuss the internal workings of the library, notably its two underlying directed acyclic graphs (DAGs), which process the schema and the data respectively. I will also do some live coding to showcase more involved use cases, for both small and big data. I will conclude with a discussion of the library’s strengths (and weaknesses), some of its latest features (such as support for Avro/Parquet), and its future direction.

Anthony Cros

Gallia Project

I am an independent software engineer/architect with 20 years of professional coding experience (see LinkedIn). My focus is on data transformations (especially big data), domain modeling, software architecture in general, and bioinformatics.

My past experience is primarily in the biomedical field, with positions held at the Ontario Institute for Cancer Research, the Hospital for Sick Children in Toronto, the Children’s Hospital of Philadelphia, the BF2I lab (INSA Lyon), and the bacteriology lab at UCBL (Lyon). I also worked briefly in the telecom industry, which I found less exciting.

The above experiences included a great many situations in which data had to be modeled with the most extreme care, and processed with just the right trade-offs of practicality and performance. In the biomedical field in particular, errors in judgment on these aspects can have real consequences for patient care, whether indirectly via portals used by researchers (e.g. International Cancer Genome Consortium data portal), or directly in the case of internal hospital systems (typically kept private).

All these experiences led me to develop my own tool, Gallia, which aims to streamline data transformation and draws on the issues I’ve encountered with existing tools (as well as their strengths!). I’m also developing a tool intended to help with the modeling aspect, with an emphasis on semantics, and hope to publish it as well at some point in the future.
