Dr. Roberto de Almeida is not only Marinexplore’s ocean data engineer but also a very talented open source developer. In 2003 Rob released the first version of Pydap, a pure Python library that enables access to scientific datasets published on OPeNDAP servers. With Pydap it’s possible to introspect and manipulate a dataset as if it were stored locally, with data being downloaded on-the-fly. Because Pydap is an integral part of the Marinexplore architecture, we asked Rob to share his motivations and experiences working on the library over the past decade.
The History of Pydap
You recently celebrated the 10th anniversary of Pydap. Can you tell us a little about the history of the project and how you got started?
I started using Python during my Ph.D., back in 2001. I was studying the variability of the sea surface temperature in the South Atlantic Ocean for my thesis, and needed to compute something called “empirical orthogonal functions” (EOFs). I discovered a package called PyClimate which had a straightforward example of EOFs. PyClimate was written in this new language called “Python”, and I was impressed by the expressiveness and simplicity of the language. I soon started learning more about Python and using it for my research.
A lot of the data I needed for my thesis was available on OPeNDAP servers. At the time, OPeNDAP was already widely used for distributing oceanographic and climate data, and since I was now using Python, I decided to write a client library to access the data I needed. I had very little programming experience back in 2003, so my first version was somewhat crude, but it worked. Since I was a great fan of the free software movement, I released my code as open source.
The nice thing about open source is that once people start using the code you have released, you are hooked. You feel responsible for the code, so you keep improving it and implementing the features your users request. I continued developing Pydap, adding better parsers, binary data transfer, etc. while at the same time learning more about programming and improving my computer science skills.
In 2005 Google announced a program called Summer of Code, where students at any level could participate in open source projects during the summer. I submitted a proposal to develop version 2.0 of Pydap, with the Python Software Foundation as the mentoring organization. My plan was to develop a new version of Pydap, including not only the client but also a server for NetCDF, SQL and other data formats. The proposal was accepted, and I ended up releasing Pydap 2.0. This was the first version of Pydap that I was happy with; it should have been the 1.0 version.
After that, development continued. There have been a few rewrites from scratch since then, usually to simplify the code base or to take advantage of better third-party libraries as they appeared, like requests and NumPy. Over these past 10 years, the code became stable enough that only minor bug fixes were needed, usually to support quirky servers.
Who are the users of Pydap, and for what applications is it being implemented?
Over the years Pydap developed a small and friendly community of users. From what I gather from the statistics and the user surveys I conduct from time to time, most people are using Pydap as a client to access scientific data. This is usually pretty straightforward and well documented, so I don’t get much feedback from the majority of the users.
Most of the activity on the mailing list comes from people who want to set up a Pydap server, since this involves a lot of options: how to proxy the server (using Apache or nginx, for example), how to install handlers for particular formats in which the data is stored, how to handle authentication, and so on. These are the users that normally contribute to the code base, sending patches and creating new data handlers.
My favorite use of Pydap was when it was used to distribute model output from the 4th Assessment Report (AR4) of the Intergovernmental Panel on Climate Change (IPCC), in 2007. The report summarized the state-of-the-art knowledge about climate change at the time, and earned the IPCC and Al Gore the Nobel Peace Prize. Pydap was also considered for distributing data for the 5th report (AR5), but I’m not sure about the final decision. It fared extremely well in a benchmark comparing different OPeNDAP servers done a few years ago, leading in a good number of the results. This was quite a surprise, since people usually consider Python a slow language.
Other than that, there are a good number of institutions and companies using Pydap: NOAA, NASA, Petrobras and, of course, Marinexplore. It is also used as a “glue” to bring data into other products. I remember seeing a Pydap-based plugin for ArcGIS, which allowed it to access data on OPeNDAP servers.
Pydap On Marinexplore
How is Marinexplore using Pydap?
The first use we have for Pydap is in our collection pipeline, the part of our system that imports data from public providers. A lot of the data we import is available through OPeNDAP servers, so Pydap is used as a client to access and retrieve that data. I’m always happy when I see that some data we need can be accessed using Pydap — it means I don’t need to download, uncompress, read and remove files to import the data.
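To illustrate why OPeNDAP access avoids the download-and-unpack cycle, here is a minimal sketch of how a client can turn array-style indexing into a request for just the needed slice. The LazyVariable class, the example URL and the variable name are hypothetical and are not Pydap’s actual API; the sketch only assumes the DAP2 convention of appending a hyperslab of the form [start:stride:stop], with an inclusive stop, to the variable name in a .dods request.

```python
class LazyVariable:
    """Hypothetical stand-in for a remote array (not Pydap's real class)."""

    def __init__(self, base_url, name, shape):
        self.base_url = base_url
        self.name = name
        self.shape = shape

    def request_url(self, key):
        """Translate ints/slices into the URL of a DAP2 hyperslab request."""
        if not isinstance(key, tuple):
            key = (key,)
        # Dimensions that were not indexed are requested in full.
        key = key + (slice(None),) * (len(self.shape) - len(key))
        parts = []
        for index, size in zip(key, self.shape):
            if isinstance(index, int):
                if index < 0:
                    index += size
                index = slice(index, index + 1)
            start, stop, step = index.indices(size)
            if step < 1 or stop <= start:
                raise ValueError("this sketch supports non-empty forward slices only")
            # Python's stop is exclusive, the DAP2 hyperslab stop is inclusive.
            parts.append("[%d:%d:%d]" % (start, step, stop - 1))
        return "%s.dods?%s%s" % (self.base_url, self.name, "".join(parts))


# Assumed dataset: 12 time steps on a 90 x 180 grid.
sst = LazyVariable("http://example.com/sst.nc", "sst", (12, 90, 180))
print(sst.request_url((0, slice(10, 20), slice(None, None, 2))))
```

A real client would go on to fetch that URL and decode the binary response into an array, so only the requested subset ever crosses the network.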
We also have the OPeNDAP Bridge in Marinexplore, which is built on top of Pydap. The OPeNDAP Bridge allows prepared datasets to be accessed using different clients, making it easier for our users to access and work with the data they need. I did a lot of work to make sure that different OPeNDAP clients can access the data correctly, since the protocol is quite generic and not fully supported by all clients. Most clients implement only a subset of the data model (Ferret, for instance, works better with gridded datasets), so making a single dataset that can be accessed by all clients is quite a challenge.
My mantra for software development is “a lot of effort went into making this effortless.” This is hopefully how it works when using our OPeNDAP Bridge.
Are you planning a new release of Pydap to celebrate the 10th anniversary?
Yes, indeed! While working on the OPeNDAP Bridge for Marinexplore I implemented a lot of optimizations in Pydap, in particular for serving sequential data from non-gridded datasets. I wanted to be sure that my co-workers understood the code I was writing, so I greatly simplified the way data handlers work by moving many OPeNDAP-related idiosyncrasies out of the handlers and into the library. The code that we run is in my personal repository and will soon be released as Pydap 3.2.
I also learned a lot during the time I have been at Marinexplore, so this version is more robust from a computer engineering point of view. The code has full test coverage, continuous integration, support for Python 2.6+ and 3.3+, and I’m now using the amazing git-flow tool for development.
The Future of Pydap
What is next for Pydap?
A new version of the OPeNDAP protocol, DAP4, should be released in the coming months, and I plan to implement the changes in Pydap. I’ll probably implement this as Pydap 4, to keep the version in sync with the protocol’s. I have no plans to implement new features for Pydap, other than bug fixes, protocol updates and new handlers (which are very easy to write).
Instead, Pydap is stable enough that I would like to implement new projects on top of it. I see Pydap as a library with which we can build geospatial servers, pipelines for data analytics, visual tools and rich web applications. This is the direction where I want to see it grow. Pydap itself is very modular and was written with applications like this in mind.
One application that I would like to work on in particular is a data server with support for inserting and editing proper metadata describing the dataset, making it trivially easy to upload data to the server and share it. Maybe a drag and drop interface, where you drop files onto your server, and a click interface to describe them using a metadata model, so scientists can share their data. Working at Marinexplore taught me that metadata is as important as data but is, unfortunately, very often ignored by data providers.
With regard to the OPeNDAP protocol itself, while it has been used for some 20 years and widely adopted due to its simplicity and usefulness, there is still room for improvement. One issue I see is that as the amount of data grows exponentially, so does the need to send computations to the data, instead of downloading it to run computations on a local machine. This happens because the amount of data is growing faster than the available bandwidth. OPeNDAP was designed mostly as a data access protocol, as the name says, though it has support for performing simple operations on the server before the data is downloaded.
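As a rough illustration of the simple server-side operations mentioned above, the sketch below builds a DAP2 constraint expression that asks the server to return only selected fields of a sequence, filtered by relational tests. The helper name, the URL, and the sequence and field names are all hypothetical; the only assumption about the protocol is the DAP2 form of a comma-separated projection followed by selection clauses joined with &.

```python
# Relational operators defined for DAP2 selection clauses.
DAP2_OPERATORS = {"<", "<=", ">", ">=", "=", "!=", "=~"}


def build_selection_url(base_url, sequence, fields, filters):
    """Hypothetical helper: build a DAP2 request that filters on the server.

    fields  -- names of the sequence members to return (the projection)
    filters -- (field, operator, value) triples (the selection)
    """
    projection = ",".join("%s.%s" % (sequence, f) for f in fields)
    clauses = []
    for field, op, value in filters:
        if op not in DAP2_OPERATORS:
            raise ValueError("unsupported operator: %r" % (op,))
        clauses.append("&%s.%s%s%s" % (sequence, field, op, value))
    # Left unencoded for readability; a real client would percent-encode
    # the query string before sending the request.
    return "%s.dods?%s%s" % (base_url, projection, "".join(clauses))


# Assumed dataset: a sequence "obs" of buoy observations.
print(build_selection_url(
    "http://example.com/buoys",
    "obs",
    ["time", "temperature"],
    [("temperature", ">", 20), ("depth", "<=", 10)],
))
```

The server evaluates the filter and sends back only the matching records, which is as far as the protocol's built-in operations go; anything more elaborate needs the kind of server-side processing discussed next.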
There are a number of non-standard solutions for this problem. As far as I know, both the GrADS and Ferret data servers (GDS and FDS, respectively) implement their own incompatible solutions for server-side processing, and OPeNDAP developers have had a long-running discussion about how to agree on a common solution. I proposed a backend-neutral solution in 2007 during a meeting in Boulder, CO, and this year the topic has resurfaced as it becomes more and more important.
This is a discussion that is highly relevant for Marinexplore. How do we improve existing APIs and build better ones, so that we can make data “invisible” for our users? By invisible I mean that users should not be concerned with the data itself: they should not have to think or worry about it. Instead, data should be always there, ubiquitous, at our fingertips so that the ocean community can focus on exploring ideas and producing results. I think we’re moving closer to that every day.