116 points by ndhandala 6 days ago | 13 comments
N_Lens 2 days ago
OTEL as a set of standards is admirable and ambitious, though in my experience actual implementations differ significantly between vendors, and they all seem to overcomplicate it.
eurekin 2 days ago
Plus the tens of terabytes of data you have to store for a week's worth of traces.
c2h5oh 2 days ago
That's why you sample just enough instead of storing everything
voidfunc 2 days ago
That sounds great until you have a massive issue that costs the company real money and leadership asks why you weren't logging everything in full fidelity?

We run with Debug logging on in prod for that reason too. We also ingest insane amounts of data, but it does seem to be worth it for a sufficiently complex and important system to really have it all.

majormajor 2 days ago
> That sounds great until you have a massive issue that costs the company real money and leadership asks why you weren't logging everything in full fidelity?

You should have an answer, right? Like, in your case, you run a lot of logging, and you know why. So if it's off, you say "because it would cost $X million a year and we decided not to do it."

Of course, if you're the one who set it up, you should have the receipts for when that decision was made. This can be tricky because a lot of software dev ICs are strangely insulated from direct budgets, but if you're presented with an option that would be helpful but would cost a ton of money, it's generally a good idea to at least quickly run it by someone higher up to confirm the desired direction.

TYPE_FASTER 1 day ago
I've used feature flags to manage logging verbosity and sample rate. It's really nice to be able to go from logging very little to incrementally pumping up the volume when there's an incident.
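
A rough sketch of one way to wire that up with the Python SDK, assuming a hypothetical get_flag() feature-flag client (the sampler interface itself is the real SDK API):

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.sampling import Sampler, TraceIdRatioBased

    def get_flag(name, default):
        # Stand-in for your feature-flag client; bump the value during an incident.
        return default

    class FlagControlledSampler(Sampler):
        # Re-reads the sample ratio from a flag on every sampling decision.
        def should_sample(self, parent_context, trace_id, name, kind=None,
                          attributes=None, links=None, trace_state=None):
            ratio = get_flag("trace_sample_ratio", default=0.005)
            return TraceIdRatioBased(ratio).should_sample(
                parent_context, trace_id, name, kind, attributes, links, trace_state)

        def get_description(self):
            return "FlagControlledSampler"

    trace.set_tracer_provider(TracerProvider(sampler=FlagControlledSampler()))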
evidencetamper 2 days ago
> and leadership asks why you weren't logging everything in full fidelity?

I haven't been asked this question ever. In a way, I wish I was. I wish leadership was engaged in the details of the capabilities of the systems they lead.

But I don't see anyone asking me this question any time soon either.

no_wizard 2 days ago
Have you ever been asked “why didn’t we catch this sooner?”. I feel like it’s the same question worded differently
voidfunc 2 days ago
It's really two questions:

1. Why didn't we catch this sooner

2. Why did it take so long to mitigate

Without the debug logging, #2 can be really tricky as well, since you can be flying blind to some deep internal conditional branch firing off.

vlovich123 2 days ago
Sampling unconditionally at the start of the request is worth less than sampling at the end of it (so that you sample 1% of successful traces but 100% of traces with issues).
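
This is usually done with tail-based sampling in the OpenTelemetry Collector rather than in the app. Purely to illustrate the idea in-process, a toy Python span processor might look like this (single-service only, not production code):

    import random
    from opentelemetry.sdk.trace import SpanProcessor
    from opentelemetry.trace import StatusCode

    class ToyTailSampler(SpanProcessor):
        # Buffers finished spans per trace and, when the root span ends,
        # exports the whole trace only if it errored or hits a small sample.
        def __init__(self, exporter, ok_ratio=0.01):
            self._exporter = exporter
            self._ok_ratio = ok_ratio
            self._buffer = {}  # trace_id -> finished spans

        def on_end(self, span):
            self._buffer.setdefault(span.context.trace_id, []).append(span)
            if span.parent is None:  # root span finished: decide for the whole trace
                spans = self._buffer.pop(span.context.trace_id)
                errored = any(s.status.status_code == StatusCode.ERROR for s in spans)
                if errored or random.random() < self._ok_ratio:
                    self._exporter.export(spans)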
eurekin 2 days ago
We do. 0.5%
drivenextfunc 2 days ago
Has anyone used OpenTelemetry for long-running batch jobs? OTel seems designed for web apps where spans last seconds/minutes, but batch jobs run for hours or days. Since spans are only submitted after completion, there's no way to track progress during execution, making OTel nearly unusable for batch workloads.

I have a similar issue with Prometheus -- not great for batch job metrics either. It's frustrating how many otherwise excellent OSS tools are optimized for web applications but fall short for batch processing use cases.

nucleardog 17 hours ago
Nothing running for days, but sometimes a half hour or so. When the process kicks off, it starts a trace, but individual steps of the process create separate spans within that trace (and sometimes further nested spans) that don't run the entire length of the job. As the job progresses, the spans and their related events, logs, etc. all appear.
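
A minimal sketch of that shape with the Python SDK (the step functions are placeholders, and a TracerProvider is assumed to be configured elsewhere):

    from opentelemetry import trace

    tracer = trace.get_tracer("batch-job")

    def run_job(items):
        with tracer.start_as_current_span("nightly-import") as job_span:
            job_span.set_attribute("item.count", len(items))
            with tracer.start_as_current_span("extract"):
                data = extract(items)       # placeholder step
            with tracer.start_as_current_span("transform"):
                data = transform(data)      # placeholder step
            with tracer.start_as_current_span("load"):
                load(data)                  # placeholder step

Each step span is exported as soon as it ends, so progress is visible long before the outer job span finishes.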

I think this does highlight the biggest weakness of OTel to me: the actual documentation and examples for "how to solve problems with this" really suck.

ekzy 2 days ago
I've implemented OTEL for background jobs: async jobs get picked up from the DB, and I store the trace context in the DB and pass it along to multiple async jobs. Some jobs that fail and retry with a backoff strategy can take many hours, and we can see the traces fine in Grafana. Each job creates its own span, but they are all within the same trace.

Works well for us, I’m not sure I understand the issue you’re facing?
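
For anyone wanting to copy the pattern, a hedged sketch: serialise the current context with the propagation API, store it next to the job row, and restore it in the worker (the db helpers and handle() are made up):

    import json
    from opentelemetry import propagate, trace

    tracer = trace.get_tracer("jobs")

    def enqueue_job(db, payload):
        carrier = {}
        propagate.inject(carrier)  # fills in W3C traceparent/tracestate
        db.insert("jobs", payload=json.dumps(payload),
                  trace_context=json.dumps(carrier))

    def run_job(row):
        ctx = propagate.extract(json.loads(row.trace_context))
        # The job's span joins the original trace, however much later it runs.
        with tracer.start_as_current_span("process-job", context=ctx):
            handle(row)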

ekzy 2 days ago
Ok, after re-reading I think your issue is with long-running spans; I think you should break your spans down into smaller chunks. But a trace can take many hours or days and be analysed even when it's not finished.
scottgg 1 day ago
You could use span links for this. The idea is you have a bunch of discrete traces that indicate they are downstream or upstream of some other trace. You'd just have to bend it a bit to work in your (probably single-process) batch executor!
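A small sketch of what that can look like with the Python API (do_work() is a placeholder): each chunk gets its own trace, linked back to the previous one.

    from opentelemetry import trace

    tracer = trace.get_tracer("batch")

    def process_chunk(chunk, previous_span_context=None):
        links = [trace.Link(previous_span_context)] if previous_span_context else []
        with tracer.start_as_current_span("process-chunk", links=links) as span:
            do_work(chunk)  # placeholder
            return span.get_span_context()  # link target for the next chunk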
nickzelei 2 days ago
Hm from what I’ve seen it emits metrics at a regular interval just like Prometheus. Maybe I’m thinking of something else though.
dvfjsdhgfv 13 hours ago
> I have a similar issue with Prometheus -- not great for batch job metrics either.

How do you mean? The metrics are available for 15 days by default. What exactly are you missing?

zug_zug 2 days ago
This is sort of all just a reframing of existing technologies.

Span = an event (which is basically just a log with an associated trace), plus some data fields. Trace = a log for a request with a unique ID.

A useful thing about opentelemetry is that there's auto-instrumentation so you can get this all out-of-the-box for most JVM apps. Of course you could probably log your queries instead, so it's not necessarily a game-changer but a nice-to-have.

Also the standardization is nice.

jeffbee 2 days ago
I always preach the isomorphism between traces and logs, but you left out the key thing. A span is a log entry associated with a trace, but the other key attributes of the span are its own unique identifier and a reference to the event that caused it. With those three attributes you can interpret the trace as a causal graph.
zug_zug 2 days ago
True. I think I'm emphasizing their similarities because what I'm seeing is companies treating them as unrelated (e.g. Splunk and SignalFx making entirely different query languages and visualization tools for logs vs. spans).

Imo spans and logs should be understood as the same thing and displayed and queried the same way (it's trivial to add a span ID to each log). It almost feels like people are trying to make something trivially simple seem more substantial or complex than it is.
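
For example, stamping every log record with the current trace/span IDs is only a few lines in Python (a sketch; the format string is arbitrary):

    import logging
    from opentelemetry import trace

    class SpanContextFilter(logging.Filter):
        # Copies the current trace/span IDs onto each log record.
        def filter(self, record):
            ctx = trace.get_current_span().get_span_context()
            record.trace_id = format(ctx.trace_id, "032x")
            record.span_id = format(ctx.span_id, "016x")
            return True

    handler = logging.StreamHandler()
    handler.addFilter(SpanContextFilter())
    handler.setFormatter(logging.Formatter(
        "%(levelname)s trace=%(trace_id)s span=%(span_id)s %(message)s"))
    logging.getLogger().addHandler(handler)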

BoiledCabbage 2 days ago
Traces and spans can be extended from or added to existing logging, but they aren't the same.

Logs are point in time, spans are a duration. Logs are flat, spans have a hierarchy.

It's the difference between logging a message in a function, and logging the beginning and end of a function while noting the specific instance of the fn caller.

If you have many threads or callers to the same function that difference is critical in tracing causality of failures or any other type of action of note.

zug_zug 15 hours ago
Logs can represent a point in time, or spans, or anything you choose. Logs can have the span ID attached to them (and normally should), so they are hierarchical.

In short, logs can do everything spans can do, depending on how you use them. So really there isn't much distinction.

I'd say that spans and logs are about 95% similar, whereas metrics are wildly dissimilar to both.

Most tools would be better if they treated spans and logs as two nearly-identical things:

- Logs should be viewable in a hierarchy if they have a span ID associated.

- Spans should be queryable, countable, alarmable, and dashboardable with the same tools as logs.

bboreham 2 days ago
A span has a beginning and an end time. An event typically just has a time when it happened.
tnolet 2 days ago
yeah, but spans can have events!
freetonik 1 day ago
So, events are recursive?
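For reference, span events are point-in-time annotations hanging off a span. A minimal Python sketch, assuming a configured tracer:

    from opentelemetry import trace

    tracer = trace.get_tracer(__name__)

    with tracer.start_as_current_span("checkout") as span:
        span.add_event("cache.miss", attributes={"cache.key": "cart:123"})
        span.add_event("payment.retried", attributes={"retry.attempt": 2})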
digianarchist 2 days ago
I've been tasked with adding telemetry to an AWS based service at work:

CLI -> Web API Gateway -> Lambda returning a signed S3 URL -> S3 upload -> SQS -> Lambda which writes to S3 and updates a Dynamo record -> CLI polls for changes

This flow isn't only over HTTP and relies on AWS to fire events. I worked around this by embedding the trace ID into the signed URL metadata. It doesn't look like this is possible with all AWS services.
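
A hedged sketch of the "trace context in the signed URL metadata" trick, using boto3 and the OTel propagation API (bucket/key handling simplified; the uploader has to send the matching x-amz-meta-* headers, and the downstream Lambda can read them back via HeadObject and propagate.extract()):

    import boto3
    from opentelemetry import propagate

    s3 = boto3.client("s3")

    def make_upload_url(bucket, key):
        carrier = {}
        propagate.inject(carrier)  # e.g. {"traceparent": "00-<trace-id>-<span-id>-01"}
        return s3.generate_presigned_url(
            "put_object",
            Params={"Bucket": bucket, "Key": key, "Metadata": carrier},
            ExpiresIn=900,
        )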

I wonder if X-Ray can help here?

It can also be tedious to initialize spans everywhere. Aspects could help a lot here, and orchestrion [0] is a good example of how it could be done in Go. I haven't found an OTEL equivalent yet (though I haven't looked hard).

[0] - https://datadoghq.dev/orchestrion/docs/architecture/#code-in...

scottgg 1 day ago
There's an OTel SIG working on something similar, based on orchestrion and some other prior art - so it's just a matter of time!
psnehanshu 2 days ago
The amount of additional code that it needs is horrible. We will now have to spend more brain juice on telemetry when working on a feature.
SirHackalot 2 days ago
It's really not that bad; integrating it with dashboards is where I found most of the difficulty (due to bad documentation). I spent 4 days implementing observability for this new backend project I'm working on. OTEL logging, tracing, and metric emission took less than a day to implement; the instrumentation was very well documented. When I tried to integrate with Grafana dashboards, that's when things started getting pretty frustrating…
gazpacho 2 days ago
I work for Pydantic. We make Logfire, a commercial OTEL backend. But we've made wrappers around the OTEL SDKs in various languages that simplify configuration and usage. They can be used with any OTEL-compatible backend (although we'd love it if you try our SaaS offering):

- JavaScript / TypeScript: https://github.com/pydantic/logfire-js

- Rust: https://github.com/pydantic/logfire-rust

- Python: https://github.com/pydantic/logfire
diegojromero 2 days ago
Thanks for your comment! It has given me an idea for a project: a simple library that provides a Python decorator that can be used to include basic telemetry for functions in Python code (think spans with the input parameters as attributes): https://github.com/diegojromerolopez/otelize

Feedback welcome!
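
For the curious, the general shape of such a decorator, as a hedged sketch (not necessarily how the library above does it):

    import functools
    from opentelemetry import trace

    tracer = trace.get_tracer(__name__)

    def traced(func):
        # Wraps a function in a span and records its arguments as attributes.
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            with tracer.start_as_current_span(func.__qualname__) as span:
                for i, arg in enumerate(args):
                    span.set_attribute(f"arg.{i}", str(arg))
                for name, value in kwargs.items():
                    span.set_attribute(f"arg.{name}", str(value))
                return func(*args, **kwargs)
        return wrapper

    @traced
    def resize(image_id, width=128):  # example target function
        ...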

pm90 2 days ago
There’s certainly some overhead, nothing is free. But the tradeoff is better insight into your system and better tools to validate issues when they arise. It can be very powerful in those scenarios.

I've spent countless hours on issues where customers complain about performance or a bug and it just can't be reproduced. Telemetry allows us to get more information to locate and fix these issues.

matsemann 2 days ago
I don't really agree. It's mostly setup done once: configuring it and, for example, attaching a span generator to the library you use to talk to the database. Then future queries get it "for free". And it's just a single line if you want something custom, like an annotation in Java or a with-block in Python.
bavell 2 days ago
Nah, if you have an important application this is very low cost for adding tons of insight into how your app is running.
rednafi 2 days ago
What clicked for me is:

A span is a set of key-value attributes about some point-in-time event.

A trace is a DAG of spans that tells you a story about some related events

eterm 2 days ago
What do you mean exactly by "point in time event"?

As I understand it, a metric is information at a point in time.

A span however has a start timestamp and end timestamp, and is about a single operation that happens across that time.

https://opentelemetry.io/docs/specs/otel/metrics/

vs

https://opentelemetry.io/docs/specs/otel/trace/api/#span

alkonaut 2 days ago
Trying to use OTel in any scenario outside of web backends, such as desktop, is a frustrating exercise in trying to find exactly what small subset you should use. I wish they had more examples for other types of software.
sdedovic 2 days ago
I agree. An anecdote:

A while ago I was working on some CUDA kernels for n-body physics simulations. It wasn't too complicated and the end result was generative art. The problem was that it was quite slow and I didn't know why. Well, the core of the application was written in Clojure, so I wrote a simple macro to wrap every function in a ns with a span and then ship all the data to Jaeger. This ended up being exactly what I needed - I found out that the two slowest functions were the GPU memory transfer and writing out a frame (image) to disk.

I see the usefulness of this approach in many other places, but OTel is too often geared towards HTTP services. Even simple async/queue processing is not straightforward. Though there have been improvements (like span links and trace links).

QuiCasseRien 1 day ago
> Metrics tell you what changed. Logs tell you why something happened. Traces tell you where time was spent and how a request moved across your system.

Maybe the first time I've read a crystal-clear difference between metrics, logs, and traces.

Nice post.

lucketone 2 days ago
Nice summary at the start.
andoando 2 days ago
Is there anything that wraps multiple requests?
mdaniel 2 days ago
I doubt "wraps" but almost certainly what you're shopping for is a correlation identifier on the (logs, traces, metrics) that would enable you to group the related requests. Sometimes just the session id can get you where you want to go, but in more complicated setups you may have to annotate from the client side to indicate "I'm doing these 5 things as part of this one logical operation"
geoffbp 2 days ago
Good article, thanks for sharing.
mihaitodor 2 days ago
While I do like the comprehensive writeup, there's something about the style which triggers my "it's AI-generated" reflex...