What syntax would you propose?
I would suggest a familiar notation like "[a, b] -> c" in a dedicated dag block:
dag text_stats {
    tee -> [ split_words, count_chars ]

    # word-based frequencies
    split_words -> tee_words
    tee_words -> ngram2 -> save_digram
    tee_words -> ngram3 -> save_trigram
    tee_words -> ranked_frequency -> save_words

    # character-based frequencies
    count_chars -> add_percentage
    chars_to_lines -> ranked_frequency -> add_percentage -> save_chars
}
run text_stats < input.txt
https://www2.dmst.aueb.gr/dds/sw/dgsh/#text-properties

or
dag commit_graph {
    git_log -> filter_recent -> sort -n -> [ uniq_committers, sort_by_email ]
    uniq_committers -> [ last_commit, first_commit, committer_positions ]
    [ last_commit, first_commit ] -> cat -> tr '\n' ' ' -> days_between
    [ committer_positions, sort_by_email ] -> join_by_email -> sort -k2n -> [ make_bitmap_header, plot_per_day ]
    [ uniq_committers, days_between ] -> emit_dims -> plot_per_day
    make_bitmap_header -> cat
    plot_per_day -> morphconv -> [ to_png_large, to_png_small ]
}
run commit_graph
https://www2.dmst.aueb.gr/dds/sw/dgsh/#committer-plot

The translations above are computer-assisted and may contain mistakes, but you get the idea.
Having dgsh output a Graphviz file in dry-run mode would be a neat feature.
git_log() {
    git log --pretty=tformat:'%at %ae'
}
Separating function definitions allows you to run, test, and re-use them.

I'm on my phone and cooking at the moment, so I can't type out any examples, but if I get time I'll throw together some comparisons later tonight.
However, Murex does support CSP-style concurrency. So while there's no syntactic sugar for writing graphs, you can very easily create ad hoc pipes and pass them around instead of using stdout/stderr.
So it wouldn’t actually take much to refine that with some DAG-friendly syntax.
In fact maybe that can be my next project…
Looking at this properly, I can see that no iteration is needed, which actually makes the Murex implementation even easier, because Murex already has tee pipes just like dgsh. It's just not (yet) particularly well documented.
Dgsh – Directed Graph Shell - https://news.ycombinator.com/item?id=21700014 - Dec 2019 (11 comments)
Dgsh – Directed graph shell - https://news.ycombinator.com/item?id=13352659 - Jan 2017 (51 comments)
I.e. it's much faster to use dgsh for a basic processing DAG, but much more painful to use it for a large ETL pipeline.
Python with something like Prefect isn't something you'd use a REPL to bang out a one-off on, but it'd be more maintainable. dgsh would let you use a REPL to bang out a quick and dirty DAG.
Even creating tools in Python that can be connected together in a Unix shell pipeline isn't trivial. By default, if a downstream program stops reading Python's output you get an unsightly BrokenPipeError traceback, so you need to call signal.signal(signal.SIGPIPE, signal.SIG_DFL) to avoid this.
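A minimal sketch of what that fix looks like in practice (POSIX only; signal.SIGPIPE doesn't exist on Windows). The writer script restores the default SIGPIPE disposition, so when `head` stops reading after one line the writer is killed quietly, the way a normal Unix filter would be, instead of dumping a traceback to stderr:

```python
import subprocess
import sys

# A writer that would normally raise BrokenPipeError when the reader
# (here, `head -n 1`) closes the pipe early. Restoring SIG_DFL makes
# it die silently like cat or grep would.
writer = (
    "import signal, sys; "
    "signal.signal(signal.SIGPIPE, signal.SIG_DFL); "
    "[print(i) for i in range(1000000)]"
)
result = subprocess.run(
    f'"{sys.executable}" -c "{writer}" | head -n 1',
    shell=True, capture_output=True, text=True,
)
print(repr(result.stdout))  # '0\n' -- head saw one line
print(repr(result.stderr))  # ''    -- no traceback from the writer
```

Without the signal.signal call, the same pipeline leaves a BrokenPipeError traceback on stderr.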
The upgrade was a nightmare for so many organizations. It shouldn't be that way but it was.
which have their own subculture. You could solve the same problems they do with pandas and scikit-learn, but people who use those tools would never use pandas and scikit-learn, and vice versa.
Circa 2015 I was thinking those tools all had an architectural flaw: they pass relational rows over the links as opposed to JSON objects (or equivalent). That means you have to realize joins as highly complex graphs, where things that seem like local concerns to me require a global structure, and where what seems like a little change to management changes the whole graph in a big way.
I found that the people who were buying up that sort of tool didn't give a damn, because they thought customers demanded the speed of columnar execution, which our approach couldn't deliver.
I made a prototype that gave the right answers every time, and then went to work for a place that had some luck selling their own version that didn't always give the right answers, because they didn't know what algebra it supported, didn't believe something like that had an algebra, and didn't properly tear the pipeline down at the end.
There are probably libraries that could help, but then you need to install dependencies, which is sad in Python for other reasons.
Others use Nextflow, but that requires learning Groovy and it's less intuitive.
awk -F\; '
    !($1 in max) || $2 > max[$1] { max[$1] = $2 }
    !($1 in min) || $2 < min[$1] { min[$1] = $2 }
    { sum[$1] += $2; count[$1]++ }
    END {
        for (n in sum)
            printf("%s=%.1f/%.1f/%.1f, ", n, min[n], sum[n] / count[n], max[n])
    }'
Can't see how dgsh could be applied to it.
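For what it's worth, the aggregation that awk program performs, a per-key min/mean/max over `name;value` records, is small enough to restate as a Python sketch (the function name `summarize` is mine), which may make the single-pipeline-vs-DAG comparison clearer:

```python
from collections import defaultdict

def summarize(lines):
    """Per-key min/mean/max over "name;value" records, mirroring the
    awk one-liner above: track running min, max, sum and count, then
    emit min/mean/max per key at the end."""
    mins, maxs = {}, {}
    sums, counts = defaultdict(float), defaultdict(int)
    for line in lines:
        name, _, value = line.strip().partition(";")
        v = float(value)
        if name not in mins or v < mins[name]:
            mins[name] = v
        if name not in maxs or v > maxs[name]:
            maxs[name] = v
        sums[name] += v
        counts[name] += 1
    return {n: (mins[n], sums[n] / counts[n], maxs[n]) for n in sums}

print(summarize(["a;1", "a;3", "b;2"]))
# {'a': (1.0, 2.0, 3.0), 'b': (2.0, 2.0, 2.0)}
```

It's a single fold over the input with no fan-out or fan-in, which is exactly why a DAG shell has nothing to offer here.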