

> the benefits that this format brings

Its column-based layout means that a query can read just the columns it needs, without loading the full file.

More can be found at [1]
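To make the column-based point concrete, here is a toy sketch in Python. It is not Parquet and not how any real reader is implemented; the file layout and the names `write_columnar`, `read_column` and `demo` are made up for illustration. It only borrows the general idea of storing each column as its own chunk and putting an index in a footer, so a reader can seek straight to one column.

```python
import json
import os
import struct
import tempfile


def write_columnar(path, columns):
    """Write {name: [values]} as one chunk per column, then a footer index."""
    index = {}
    with open(path, "wb") as f:
        for name, values in columns.items():
            chunk = json.dumps(values).encode("utf-8")
            index[name] = (f.tell(), len(chunk))  # where this column lives
            f.write(chunk)
        footer = json.dumps(index).encode("utf-8")
        f.write(footer)
        # The last 4 bytes hold the footer length (little-endian).
        f.write(struct.pack("<I", len(footer)))


def read_column(path, name):
    """Return one column's values and the number of bytes actually read."""
    with open(path, "rb") as f:
        f.seek(-4, os.SEEK_END)
        (footer_len,) = struct.unpack("<I", f.read(4))
        f.seek(-(4 + footer_len), os.SEEK_END)
        index = json.loads(f.read(footer_len))
        offset, length = index[name]
        f.seek(offset)  # jump over every other column
        values = json.loads(f.read(length))
    return values, 4 + footer_len + length


def demo():
    """Write three columns, read back one, and report bytes read vs. file size."""
    columns = {
        "id": list(range(1000)),
        "name": [f"user{i}" for i in range(1000)],
        "score": [i * 0.5 for i in range(1000)],
    }
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy.col")
        write_columnar(path, columns)
        values, bytes_read = read_column(path, "score")
        return values[:3], bytes_read, os.path.getsize(path)


if __name__ == "__main__":
    first_values, bytes_read, total = demo()
    print(first_values, f"read {bytes_read} of {total} bytes")
```

With a row-based format like CSV, getting the `score` column means scanning every line of the file; here the reader only touches the footer and one chunk.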

I see 2 distinct considerations:

1. Convenience: sometimes I build a programmatic process that spits out a bunch of parquet files, then I query them with AWS Athena or Apache Drill. If I want to peek into a parquet file, I have to either write a pandas script or open it with VisiData. If I find a problem with the file, I need to go back to the process that generated it, modify it, and re-generate the file (or write a script specifically for editing the file). If I could just double-click, edit, and save, it would be so much easier. That's a major advantage for CSV despite its inefficiency in query/filtering performance.

2. Performance: On the other hand, a spreadsheet editor might not be designed to exploit this column-based format for better efficiency. It would probably load the whole file anyway. Maybe filtering a worksheet backed by parquet would be faster than one backed by CSV, but that depends on how the filtering is implemented. I have no idea how it's done in Calc or other editors. We all know the dread of opening a large file in a spreadsheet editor. But then again, maybe that's the point at which the data should be moved into a database rather than stay in a heavy spreadsheet.


> https://github.com/apache/parquet-format

> The best place to learn about the specifics of this file format

Yes, that's it. I don't want to sound self-contradictory, but maybe it's NOT a good idea to support Parquet. I was just bringing it up, and maybe this needs some more thought about how useful it would be and whether people would actually use it. Chicken-and-egg problem?


Links:

[1] https://stackoverflow.com/questions/36822224/what-are-the-pros-and-cons-of-parquet-format-compared-to-other-formats

Shadi Akiki
Founder & CEO, AutofitCloud
https://autofitcloud.com/
+1 813 579 4935

On 11/22/19 2:42 PM, Kohei Yoshida wrote:
> On 22.11.2019 02:37, Shadi Akiki wrote:
>
>> I'm wondering why Parquet is not yet a supported format in LibreOffice
>> Calc (and most desktop worksheet processing tools for that matter).
>
> Well, one reason may be that nobody had asked for it yet! On that note, asking about it and raising awareness (which you did) is a necessary first step.
>
> Also, it would be nice to know the benefits that this format brings that any other existing formats currently do not. I use pandas occasionally and I do work with people who use it on a regular basis, but I had not heard this file format mentioned in our conversations to this day.
>
> Is this page
>
> https://github.com/apache/parquet-format
>
> the best place to learn about the specifics of this file format, or is there any other page that provides more details?
>
> One way we can add support for a new file format such as this one to Calc is to add it to the orcus library [1], which Calc uses internally to handle a subset of file formats. That may potentially be a much easier route than adding it to the LibreOffice code base directly... Full disclosure: I do maintain this library.
>
> Kohei
>
> [1] https://gitlab.com/orcus/orcus

