> the benefits that this format brings
Its column-based layout means that a query can read just the columns it
needs instead of loading the full file.
More can be found at [1]
I see 2 distinct advantages:
1. Convenience: sometimes I build a programmatic process that spits out
a bunch of parquet files, then I query them with AWS Athena or Apache
Drill. If I want to peek into a parquet file, I have to either write a
pandas script or open it with visidata. If I find a problem with the
file, I need to go back to the process that generated it, modify it,
and re-generate the file (or write a script specifically for editing
the file). If I could just double-click, edit, save, it would be so
much easier. That's a major advantage of CSV, despite its inefficiency
in query/filtering performance.
2. Performance: On the other hand, a spreadsheet editor might not be
designed to exploit this column-based format for better efficiency. It's
expected to open the whole file anyway. Maybe filtering the worksheet
with parquet would be faster than with CSV, but that depends on how the
filtering is implemented. I have no idea how it's done in Calc or other
editors. We all know the dread of opening a large file in a spreadsheet
editor. But then again, maybe that's when the data should be moved into
a database rather than stay in a heavy spreadsheet.
> https://github.com/apache/parquet-format
> The best place to learn about the specifics of this file format
Yes, that's it. I don't want to sound self-contradictory, but maybe it's
NOT a good idea to support Parquet. I was just bringing it up, and maybe
this needs some more thought about how useful it would be and whether
people would actually use it. Chicken-and-egg problem?
Links:
[1] https://stackoverflow.com/questions/36822224/what-are-the-pros-and-cons-of-parquet-format-compared-to-other-formats
Shadi Akiki
Founder & CEO, AutofitCloud
https://autofitcloud.com/
+1 813 579 4935
On 11/22/19 2:42 PM, Kohei Yoshida wrote:
> On 22.11.2019 02:37, Shadi Akiki wrote:
>> I'm wondering why Parquet is not yet a supported format in LibreOffice
>> Calc (and most desktop worksheet processing tools for that matter).
> Well, one reason may be that nobody had asked for it yet! On that
> note, asking about it and raising awareness (which you did) is a
> necessary first step.
> Also, it would be nice to know the benefits that this format brings
> that any other existing formats currently do not. I use pandas
> occasionally and I do work with people who use it on a regular basis,
> but I had not heard this file format mentioned in our conversations to
> this day.
> Is this page
> https://github.com/apache/parquet-format
> the best place to learn about the specifics of this file format, or is
> there any other page that provides more details?
> One way we can add support for a new file format such as this one to
> Calc is to add it to the orcus library [1], which Calc uses internally
> to handle a subset of file formats. That may potentially be a much
> easier route than adding it to the LibreOffice code base directly...
> Full disclosure: I do maintain this library.
> Kohei
> [1] https://gitlab.com/orcus/orcus