Hi Aditya,
On 08/12/2019 23:14, Aditya Parameswaran wrote:
I'm Aditya Parameswaran, an assistant professor at UC Berkeley.  Along
with Prof. Karrie Karahalios at the University of Illinois and many
Ph.D. student researchers, we've been working on developing a scalable
spreadsheet system, DataSpread (http://dataspread.github.io), for about
half a decade now.
        Interesting stuff.
We'd be very keen to collaborate to see if some of the ideas that we've
developed and opportunities we've identified would make sense in Calc.
        Sounds good. Very busy this week, but would you be up for a conference
(with whomever is interested) sometime in the evening (UK time) of the
17th or 19th? We could use https://meet.jit.si/CalcChat, e.g.
Our ultimate aim is to percolate some of these ideas back into popular
spreadsheet systems like Calc, so I'm excited to have this opportunity.
        Great. Some good ideas to include there, only a chunk of typing is
required =)
Yes, of course. Sajjadur, with Kelly's help, is looking into packaging
this and sending it your way.
        Excellent; thanks.
So I am not sure why we concluded outright that none of the spreadsheet
systems employ a columnar layout -- this is a good catch; we will fix.
        =)
That said, looking at Figure 10, it is surprising that the gains for the
sequential read are not a lot more;  and the gains should increase
proportionally.  So something funky is going on. Worth investigating. 
        Ah - well ... so ;-) as I said it depends on your data-set, and its
type homogeneity down the column to a degree, and also we can improve
our lookup algorithm there.
We started by having the relational database be a simple persistent
storage layer, which, when coupled with an index to retrieve data by
position, can allow us to scroll through large datasets of billions of rows
with ease. We developed a new positional index to handle insertions and
deletions in O(log(n)) -- https://arxiv.org/pdf/1708.06712.pdf. I agree
that pushing the computation to the relational database does have
overheads; but at the same time, it allows for scaling to arbitrarily
large datasets. 
        Ooh - nice paper. Your crawled data-set looks quite interesting too; we
run wide-scale crash-testing on the LibreOffice code-base across ~100k
files, and enlarging our corpus there, or better, getting some
statistical view of which OOXML attributes (and thus features) are most
used out there, would be extremely useful to us as we develop the core.
        I like the data on spreadsheet and formula shape - that is very useful.
Do you have data on the geometry of formulae - as in rows vs. columns ?
[ we switched to columnar storage based mostly on experience rather than
hard data ;-].
        It is also interesting to have access to very large (1.3m row)
data-sets that can have useful analysis done on them - would love to see
the source data there.
Would love to chat and see if any of the work that we're doing can
translate into Calc, and how we can contribute. 
        Great.
One other project that may be of interest is one where we're trying to
build a spreadsheet summarization and navigation tool, which can be
especially helpful on very large
spreadsheets.  http://srahman7.web.engr.illinois.edu/papers/NOAH.pdf
        Sounds good too. Of course, most useful on the huge corpus of existing
sheets out there in XLS[X] / ODS format.
Agreed. We started the benchmarking effort a couple years ago, and the
old version was the new version back then :-) 
        Heh ;-)
Again, happy to share what we know!  Let's find a time to chat.  I see
that you're in Europe, so mornings for us (PT/CT) may work better? 
Sajjadur is traveling, so I'm not entirely sure if he's around, but I
should be able to find time to chat early in the morning any day next week. 
        Sounds good, cf. above - if we can't make that - early in the new year
would be great.
        I look forward to talking,
                Michael.
-- 
michael.meeks@collabora.com <><, GM Collabora Productivity
Hangout: mejmeeks@gmail.com, Skype: mmeeks
(M) +44 7795 666 147 - timezone usually UK / Europe