Re: Fw: benchmark of Excel, Calc, Google Docs

Aditya Parameswaran <adityagp -AT- berkeley.edu>
Mon, 9 Dec 2019 11:14:46 -0800

Oops, for some reason I forgot to add Sajjadur to the email list!

Aditya

On Sun, Dec 8, 2019 at 3:14 PM Aditya Parameswaran <adityagp@berkeley.edu>
wrote:

Michael,

Great to connect with you.  Sajjadur forwarded your email to me.

Just a bit of context about the work that you're referring to --

I'm Aditya Parameswaran, an assistant professor at UC Berkeley.  Along
with Prof. Karrie Karahalios at the University of Illinois and many Ph.D.
student researchers, we've been working on developing a scalable
spreadsheet system, DataSpread (http://dataspread.github.io), for about
half a decade now. We've published a number of papers on this topic as well
as developed a prototype system.  Sajjadur led the effort on benchmarking
that you referenced in your email and a navigation plugin (more on that in
a bit), while Mangesh (also cc-ed, when he was at the University of
Illinois) led the efforts to integrate spreadsheets and databases and to
build a more interactive (asynchronous) formula execution engine when he
was at Illinois.

We'd be very keen to collaborate to see if some of the ideas that we've
developed and opportunities we've identified would make sense in Calc.  Our
ultimate aim is to percolate some of these ideas back into popular
spreadsheet systems like Calc, so I'm excited to have this opportunity.

Answers to some of your questions in-line below:


        I was interested in a number of things: particularly whether we
can get
your test sheets / macros so that we can run the tests under a profiler
& of course see what stupid things jump out that we can fix =)


Yes, of course. Sajjadur, with Kelly's help, is looking into packaging
this and sending it your way.


        We do have a columnular layout; checkout:


https://gerrit.libreoffice.org/plugins/gitiles/core/+/master/sc/inc/column.hxx#111

... assuming you have reasonably uniform, contiguous data types down a
column your test shouldn't have concluded:

        "Takeaway: Spreadsheet systems do not employ a columnar
data layout to improve computational (e.g., aggregation) performance"


Looking at Figure 10 in our preprint,  Calc does seem to be somewhat
faster for sequential read than random read at least on larger sizes.

So I am not sure why we concluded outright that none of the spreadsheet
systems employ a columnar layout -- this is a good catch; we will fix.

That said, looking at Figure 10, it is surprising that the gains for the
sequential read are not a lot more;  and the gains should increase
proportionally.  So something funky is going on. Worth investigating.

        For incremental updates - re-computation is frequently chosen
over more
smarts  since correctness is a far more dominating concern than
performance generally, and there are plenty of know performance
optimizations that can be done before we try to complicate things
further. Also the assumption that after deleting A100:

        SUM(A1:A100) - A100 === SUM(A1:A99)

        falls foul of potential precision problems.


Makes sense.


        The idea of converting to SQL queries is an interesting one but I
find
it very hard to believe it would provide any performance advantage at
the same memory footprint. Furthermore - I'd be interested to know how
you do other spreadsheet operations: row & column insertion, addressing,
and dependency work on top of a SQL database with any efficiency.


We started by having the relational database be a simple persistent
storage layer, when coupled with an index to retrieve data by position, can
allow us to scroll through large datasets of billions of rows at ease. We
developed a new positional index to handle insertions and deletions in
O(log(n)) -- https://arxiv.org/pdf/1708.06712.pdf. I agree that pushing
the computation to the relational database does have overheads; but at the
same time, it allows for scaling to arbitrarily large datasets.

Happy to explain this on the phone.

        It is also worth bearing in mind that dependency management and
tracking of what to update when something changes consumes a far larger
proportion of a spreadsheet than is commonly expected: what to
re-calculate having changed A1 is far from trivial for large, twisty
real-world sheets.


Ah, for this, we have a new asynchronous formula computation capability
(led by Mangesh) that looks something like this:
https://twitter.com/adityagp/status/1146105708024229888?s=20 (gif in
tweet).  When a change is made, we use a compact dependency maintenance
structure to identify dependent cells that are hidden away with a progress
bar, and then compute them in the background.


        Anyhow - would be happy to have a chat with your team at some
stage if
you're interested in helping us to improving things here: a good start
would be just getting more representative benchmarks for the workload
you're interested in would be really useful.


Would love to chat and see if any of the work that we're doing can
translate into Calc, and how we can contribute.

One other project that may be of interest is one where we're trying to
build a spreadsheet summarization and navigation tool, which can be
especially helpful on very large spreadsheets.
http://srahman7.web.engr.illinois.edu/papers/NOAH.pdf


        And finally of course, you used a really old version. I'd
recommend
using the 6.4 snapshots, or 6.3 if you must.


https://gerrit.libreoffice.org/plugins/gitiles/core/+/46d0afba738d8ee7c9b63384fef513f42ee587f3

https://gerrit.libreoffice.org/plugins/gitiles/core/+/845e1cdca3349c72e3083186502285d5b776abbe
...


Agreed. We started the benchmarking effort a couple years ago, and the old
version was the new version back then :-)

        And lots of others. Indeed, we have a customer who has large
~100k+ row
spreadsheets with complex sorts across many rows who claims that Calc
sorts the data in second to minutes vs. Excels' hours - so I was
surprised to see your sort results; would be good to inspect what you do.

        I hope that helps, would love to get in touch with you & your team,

there is plenty of fun stuff to do here to make things quicker & better =)


Again, happy to share what we know!  Let's find a time to chat.  I see
that you're in Europe, so mornings for us (PT/CT) may work better?
Sajjadur is traveling, so I'm not entirely sure if he's around, but I
should be able to find time to chat early in the morning any day next week.

Aditya

Context

benchmark of Excel, Calc, Google Docs · Michael Meeks
- Re: benchmark of Excel, Calc, Google Docs · Rahman, Sajjadur
  - Re: benchmark of Excel, Calc, Google Docs · Michael Meeks
    - Re: benchmark of Excel, Calc, Google Docs · Rahman, Sajjadur
- (message not available)
  - Re: Fw: benchmark of Excel, Calc, Google Docs · Aditya Parameswaran
    - Re: Fw: benchmark of Excel, Calc, Google Docs · Aditya Parameswaran
      - Re: Fw: benchmark of Excel, Calc, Google Docs · Wols Lists
        
        Re: Fw: benchmark of Excel, Calc, Google Docs · Aditya Parameswaran
        
        Re: Fw: benchmark of Excel, Calc, Google Docs · Wol
        
        Re: benchmark of Excel, Calc, Google Docs · Chris Sherlock
    - Re: Fw: benchmark of Excel, Calc, Google Docs · Michael Meeks
      - Re: Fw: benchmark of Excel, Calc, Google Docs · Aditya Parameswaran
        
        Re: Fw: benchmark of Excel, Calc, Google Docs · Aditya Parameswaran
- Re: benchmark of Excel, Calc, Google Docs · Kohei Yoshida

Privacy Policy | Impressum (Legal Info) | Copyright information: Unless otherwise specified, all text and images on this website are licensed under the Creative Commons Attribution-Share Alike 3.0 License. This does not include the source code of LibreOffice, which is licensed under the Mozilla Public License (MPLv2). "LibreOffice" and "The Document Foundation" are registered trademarks of their corresponding registered owners or are in actual use as trademarks in one or more countries. Their respective logos and icons are also subject to international copyright laws. Use thereof is explained in our trademark policy.