Solved: DataFrames.jl Handling of strings for column indexing
✔️Accepted Answer
but we are aspiring towards that, no?
I do not want to say "no", but at least for now I do not see how to achieve this.
The current design is the following:
- we are type unstable
- we have all the benefits of type instability - we can add/remove columns, we can change column types, we can change column names, we can have thousands of heterogeneous columns without huge compilation cost (even recently we changed some bits of code to be type unstable, as otherwise CSV.jl was very slow when saving files)
- all exposed methods are type stable internally - i.e. they process things fast and only the input and output are type unstable - roughly there is at most one dynamic dispatch per column processed (unless we explicitly want to be unstable or we have forgotten to fix something); in particular, by default we purposefully drop column names when processing data to avoid constant recompilation even in type-stable branches (as passing around column names would trigger recompilation each time the names change)
- if you want type stability for your own methods, then call Tables.columntable or Tables.namedtupleiterator to have a no-copy type-stable object (or Tables.rowtable, at the cost of performing a copy)
- hopefully one day Julia will be able to cache compiled functions between sessions better than it does now, so given points 1-4 DataFrames.jl will have very fast loading and response times (as many things will already be compiled in cache)
- There are other packages in the ecosystem that focus on type stability, so if despite points 1-5 one still needs type stability there are other options
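The "type unstable outside, type stable inside" pattern from the list above can be sketched with Base Julia only. This is an illustration, not DataFrames.jl code: the cols container and the count_positive kernel are hypothetical names. The container does not know its column types, so each call pays one dynamic dispatch, but the kernel itself is compiled for the concrete column type and runs fast.

```julia
# Hypothetical stand-in for a data frame: a container whose element type
# is abstract, so the concrete column types are unknown to the compiler.
cols = AbstractVector[[1, -2, 3], [1.5, 2.5], ["a", "b"]]

# Type-stable kernel (a "function barrier"): specialized once per
# concrete column type, e.g. Vector{Int} or Vector{Float64}.
function count_positive(col::AbstractVector{<:Real})
    n = 0
    for x in col
        n += x > 0
    end
    return n
end

# Fallback for non-numeric columns.
count_positive(col::AbstractVector) = 0

# One dynamic dispatch per column; the loop inside each call is type stable.
results = [count_positive(c) for c in cols]
println(results)
```

Adding, removing, or retyping a column only changes the contents of cols, so nothing that merely passes the container around needs to be recompiled.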
Other Answers:
I may be late to this, but a point of interest: I thought Julia tended to embrace the "don't offer too many ways to do the same thing" design philosophy (e.g. using " for strings, and not allowing multiple options like ' and " as Python does)? Or am I making that up?
I will confess (particularly if there's a 2x performance advantage) I'm inclined to restrict ourselves to the current Option 1. This is more flexible, but also a potential invitation for confusion among new users?
I suspect that if we go to Option 3, then over time Strings will become what everyone uses (similar to pandas / R), people will basically forget about the Symbol functionality most of the time, and people will end up with column accesses that are 2x slower than they could be. But we'll still have to support two use cases instead of one.
EDIT: Typo.
EDIT 2: Add note about eventual shift in use tendencies.
I am still fighting with myself over which is better.
@kleinschmidt - what do you think about this PR in the context of StatsModels.jl and the formula interface (which now requires Symbols)?
The alternative is to stick to Option 1 + recommend Wrangling.jl (it is more powerful than what we would have anyway) in the manual if someone wants more flexibility (possibly we would change the rename! and unstack API for convenience).
Now the report from the field (I have started preparing for the PR): almost all functions would need some minor update with this change (this is not super bad, but just shows how big this PR is).
Maybe let me also ask - is there someone who "strongly" wants strings accepted? (Just to hear the other side, as it seems that most people do not have a strong preference, only a mild one in one direction or the other.)
I am sorry for possible bikeshedding here, but this is a fundamental design decision for me that will have very long lasting consequences.
I'm a fairly new Julia user, and just to share my experience, for the first week or two I was definitely a bit confused between Symbol("col_1") and :col_1. I was also a little thrown off by why Symbol("col 1") worked but not :col 1. I obviously figured these things out, but for someone just starting Julia, and if the assumption is that many new Julia users are coming from Python like myself, I think switching to string indexing instead of symbol would be great.
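For readers who hit the same confusion, here is a minimal Base Julia sketch of the two ways of writing a Symbol. The column names are made up for illustration.

```julia
# The literal form and the constructor form produce the same Symbol
# when the name is a valid identifier.
a = :col_1
b = Symbol("col_1")
println(a === b)

# A name containing a space is not a valid identifier, so it cannot be
# written as a `:...` literal; the constructor form accepts any string.
c = Symbol("col 1")
println(string(c))

# Symbol also concatenates several arguments into a single name.
d = Symbol("col_", 1)
println(d === a)
```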
So this was exactly my thinking, but I was afraid that just switching from symbols to strings (and dropping symbols) seems "too breaking" even for 2.0, and would require really serious consideration (to avoid problems like Python had moving from 2 to 3).
Some additional comments for a decision:
- string lookup vs symbol lookup is ~ 2.5x slower (for typical column name length); this is not hugely problematic, but still I wanted to note this
- we should consider other tabular data formats (and Tables.jl in general), where Tables.jl strictly assumes that column names should be Symbols
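The lookup cost mentioned in the first point can be probed with a rough Base Julia sketch. It assumes a plain Dict from name to column position as a stand-in for the real column index, and the helper total is hypothetical. Timings vary by machine and name length, so no ratio is claimed here.

```julia
# Assumed stand-in for a column index: name => column position.
col_syms = [Symbol("column_", i) for i in 1:100]
col_strs = string.(col_syms)
sym_index = Dict(n => i for (i, n) in enumerate(col_syms))
str_index = Dict(n => i for (i, n) in enumerate(col_strs))

# Look up every key `reps` times and sum the positions found.
function total(index, keys, reps)
    s = 0
    for _ in 1:reps, k in keys
        s += index[k]
    end
    return s
end

# Warm up so compilation time is not measured.
total(sym_index, col_syms, 1)
total(str_index, col_strs, 1)

t_sym = @elapsed total(sym_index, col_syms, 10_000)
t_str = @elapsed total(str_index, col_strs, 10_000)
println("Symbol keys: ", t_sym, " s; String keys: ", t_str, " s")
```

Symbols are interned, so comparing two of them is an identity check, whereas String keys are hashed and compared by content.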
Essentially our current design is:
- fast to lookup, and consistent with Tables.jl
- at the cost that if the user wants to work with strings, they need to use Symbol and string for conversions in both directions.
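The round trip described above looks roughly like this when applied to a whole set of names. This is a Base-only sketch and the names are invented for illustration.

```julia
# Column names as Symbols, including one that needs the constructor form.
col_syms = [:id, :name, Symbol("unit price")]

# Symbol -> String, e.g. to use string functions on the names.
col_strs = string.(col_syms)

# String -> Symbol, to go back to the form used for indexing.
back = Symbol.(col_strs)
println(back == col_syms)

# A typical case that forces the conversion: selecting names by substring.
priced = [s for s in col_syms if occursin("price", string(s))]
println(priced)
```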
And the additional question is in what cases this is really problematic (i.e. worth considering to change) given that:
- in rename we already handle strings
- we allow using regex for column name matching
(as maybe it is enough to just add 1-2 convenience functions to cover 90% of use cases where strings are needed)
Something @StefanKarpinski pointed out on Slack, and which I was not aware of, is that we could overload getproperty and setproperty! to accept a string as an argument, so that property-style access and assignment with a string name would work.
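A minimal sketch of that idea on a toy type follows. ToyTable is hypothetical and is not the DataFrames.jl implementation. It relies on Julia parsing t."name" as a getproperty call with a String argument.

```julia
# Hypothetical toy container: column name => column data.
struct ToyTable
    cols::Dict{Symbol, Vector}
end

# Symbol access, as in `t.a`. getfield is used to reach the real field,
# because getproperty is being overloaded for this type.
Base.getproperty(t::ToyTable, name::Symbol) = getfield(t, :cols)[name]

# String access, as in `t."col 1"`: convert and forward to the Symbol method.
Base.getproperty(t::ToyTable, name::String) = getproperty(t, Symbol(name))

# Assignment, for both Symbol and String names.
function Base.setproperty!(t::ToyTable, name::Symbol, v::Vector)
    getfield(t, :cols)[name] = v
    return v
end
Base.setproperty!(t::ToyTable, name::String, v::Vector) =
    setproperty!(t, Symbol(name), v)

t = ToyTable(Dict{Symbol, Vector}(:a => [1, 2, 3]))
t."col 1" = [4.0, 5.0]      # a name that cannot be written as `t.col 1`
println(t.a)
println(t."col 1")
```

With this approach the stored names stay Symbols, so Symbol lookups keep their speed and the String form only pays for one conversion.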
@nalimilan - do you think we want to allow this?