If you're developing a FOSS project, be aware of cryptobros trying to PR a tea.yml into it.

db0@lemmy.dbzer0.com to Open Source@lemmy.ml – 418 points –
The disappointing tea.xyz
connortumbleson.com

Yet another "brilliant" scheme from a cryptobro. Naturally this caused a gold rush for scammers, who hired random people via the gig economy to open PRs for this yml file.

The easy red flag here is YAML. It’s a hideous, overly-complex format for anything so of course a scam would choose it.

That's a patently ridiculous statement

Have you read the spec? It’s a total mess

I have read the 1.2 spec (I'm trying to make a round-trip parser for JS, and I do maintenance on a fork of the ruamel.yaml Python package). I actually think it's very well thought out, with things I hadn't considered like future extensibility, streaming applications, and data-corruption detection.

The diagrams, color coding, and relative informality of the spec were much appreciated, especially compared to something like the ECMAScript spec, which reads like a math textbook had a child with a legal document.

I'm not saying YAML is perfect; round trip (the thing I'm working on) is nearly impossible because it wasn't a design goal. It has a few too many features (I've never seen a declaration in the wild), but it does a good job at accomplishing the creators' goals, and the additional features basically only slow down parser implementers like me. I often pick it because of the tag support, which I've struggled to find an equivalent for in other serialization languages. I use anchors in recursive data structures, and complex keys for serializing complex data structures (not human readable). The "document end" marker has been nice when I'm worried about detecting partial writes. And the merge key is nice for config files.

The application/perspective matters. YAML might be bad for you, but it's not bad for everyone.

Even if anchors are pretty novel… I’ve watched myself & others fail for things that seem like they should be simple like scalars, quoting, & indentation rules all for being confusing (while failing to understand how/why the tab character isn’t supported).

That sounds like a skill issue. Something isn’t bad because you don’t understand it. Suggesting quoting is an issue for yaml is beyond the pale; it happens to be an issue everywhere.

Despite my love of YAML, I actually think he has a small point with unquoted strings. I teach students and see their struggles. Bash also does unquoted strings, and basically all students go years and years without realizing:

cat --help
cat "--help"
# ^ same thing

cat *
cat "*"
# ^ not same thing

cat $thing
cat "$thing"
# ^ similar but not the same 

To know the difference between special and normal-but-no-quotes you have to know literally every special symbol. And, for example, it's rare to realize the -- in --help isn't special at a language level; it's only special at a convention level.

Same thing can happen in yaml files, but actually a little worse I'd say. In bash all the "special" things are at least symbols. But in yaml there are more special cases. Imagine editing this kind of a list:

js_keywords:
- if
- else
- while
- break
- continue
- import
- from
- default
- class
- const
- var
- let
- new
- async
- function
- undefined
- null
- true
- false
- Nan
- Infinity

Three of those are not strings. Syntax highlighting can help (which is why I don't think it's a real issue). But still: "why are three not strings? Well ... just because". AKA there isn't a syntax pattern, there's just a hardcoded list of names that need to be memorized. What is actually challenging is that, unless students start with a proper YAML tutorial or see examples of quotes in the config, it's not obvious that quotes will solve the problem (students think "true" behaves like "\"true\""). So even when they see true is highlighted funny, they don't really know what to do about it. I've seen some try stuff like \true.
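To make the "hardcoded list of names" concrete, here is a minimal Python sketch of how unquoted scalars get a type. It assumes the YAML 1.2 core schema patterns and skips everything else a real parser does; the names CORE_SCHEMA and resolve_plain_scalar are made up for illustration and are not any library's API.

```python
import re

# Simplified YAML 1.2 core-schema patterns for plain (unquoted) scalars.
# Assumption: this covers only the implicit typing rules, nothing else.
CORE_SCHEMA = [
    ("null", re.compile(r"null|Null|NULL|~|")),
    ("bool", re.compile(r"true|True|TRUE|false|False|FALSE")),
    ("int", re.compile(r"[-+]?[0-9]+|0o[0-7]+|0x[0-9a-fA-F]+")),
    ("float", re.compile(
        r"[-+]?(?:\.[0-9]+|[0-9]+(?:\.[0-9]*)?)(?:[eE][-+]?[0-9]+)?"
        r"|[-+]?\.(?:inf|Inf|INF)"
        r"|\.(?:nan|NaN|NAN)"
    )),
]


def resolve_plain_scalar(text):
    """Return the type name an unquoted scalar would resolve to."""
    for type_name, pattern in CORE_SCHEMA:
        if pattern.fullmatch(text):
            return type_name
    return "str"  # anything that matches no pattern stays a string


JS_KEYWORDS = [
    "if", "else", "while", "break", "continue", "import", "from",
    "default", "class", "const", "var", "let", "new", "async",
    "function", "undefined", "null", "true", "false", "Nan", "Infinity",
]

non_strings = [word for word in JS_KEYWORDS if resolve_plain_scalar(word) != "str"]
print(non_strings)  # ['null', 'true', 'false']
```

Nan and Infinity stay strings here because the float patterns only accept the dotted spellings (.nan, .inf). Quoting an entry ("true") takes it out of this lookup entirely, which is the fix students tend not to guess.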

Still doesn't mean yaml is bad, every language has edge cases.

While the subjective assessment that quote handling in yaml is worse than bash is understandable, it is really just two of many many cases where quotes complicate things. And for a pretty good reason. They are used to isolate strings in many languages, even prose. They, therefore, always get special handling in lexical analysis. Understanding which languages use single quotes, double quotes, backticks, heredocs, etc and when to use them is really just part of the game or the struggle I guess.

Most languages require you to put quotes around strings as the norm… breaking that is part of what causes all of the confusion in the first place. Better design upfront would lead to less common errors. I have way more quoting issues in YAML than I do JSON, Nix, Nickel, Dhall, etc. because they aren’t trying to be cute with strings.

When you're editing yaml, why not just always write JSON?

Almost all Nix attr keys are unquoted strings. Maybe I'm missing the point, but I kinda wouldn't expect it to be on the list.

It's easy for me to say "just start writing JSON in the YAML. It doesn't get more simple than JSON", but actually I do think there's a small point with the unquoted strings.

Back before I knew programming, I was trying to change grammar settings in Sublime Text 2, which uses YAML. I had no idea what YAML was. The default setting values used unquoted strings for regex. I knew PCRE regex and escapes, but suddenly they didn't work, and when I tried to match a single quote inside of a regex that also didn't work. I didn't know I was editing a YAML file (it had a .tmLanguage extension). Even worse, if I remember correctly, unparsable settings just silently failed. Not only did I have no errors to google, I didn't have any reason to believe the escapes were the cause of the problem (they worked on the command line). Sometimes I edited the regex and it was fine, and other times it just seemed to break. I didn't learn about quoting in YAML until years later.

For me that was an unfortunate combination, which was exacerbated by YAML's unquoted weirdness. But when you're talking about "did you read the spec", that's a whole other story. .nan for NaN, tabs vs spaces, unquoted string weirdness, etc. should just be one error message + google away. I think they're small hiccups in what is overall a great format.

Brief history of YAML:

"Oh no! All of these configuration file formats are complicated. I want to make things simpler!"

(Years go by)

"...I have made things more complicated, haven't I?"

YAML is generally good if it's used for what it was originally designed for (relatively short data files, e.g. configuration data). Problem is, people use it for so much more. (My personal favourite pain example: i18n stuff in Ruby on Rails. YAML language files work for small apps, but when the app grows, so does the pain.)

Ansible uses YAML and it's orders of magnitude more readable than any other config engine, like Puppet or CFEngine.

Ideally, yes it can be beautifully written, certainly more than bash scripts.

With that said, I've also seen some hideous ansible scripts...

originally designed for (relatively short data files, e.g. configuration data)

This I can get behind. But because it’s not bad in those spaces folks think it’ll be a good idea in all spaces. Anchors do neat things, but organizing large files with YAML’s weird rules around quoting, & no support for tab indentation rub me the wrong way.

What? I love having 20 ambiguous ways to express the same data with weird and unexpected conversion rules. JSON is so much worse - if data types are explicit and obvious, how can I properly express my feelings when writing a config file?

{"foo":true,"bar":{"baz":1}} is valid YAML; better throw it out.

I have no issues with using a strict and unambiguous subset of YAML :)

And what would your ideal, legible, general-purpose data markup language be? XML?

Yaml Ain't Markup Language: am i a joke to you

(JSON for data, TOML for configuration)

I've used both YAML and a TOML-adjacent INI format for Ansible. While I wouldn't use YAML for massive data serialization (because significant whitespaces are fucking stupid), it's much better suited for manual data entry compared to most options, including TOML, when nested data structures are required.

And if YAML's structure is too complicated, that's honestly a skill issue.

Not that YAML's structure is too complicated, but its syntax is too flexible. All the shit about being whitespace sensitive yet with whitespace errors leading to a syntactically valid YAML document. TOML's syntax is rigid which makes it unsuitable for expressing complex nested data structures, which is good because that's not what you should use TOML for. Ultimately the dependence on a highly flexible baseline language like YAML to create complex DSLs is a failure on the developers' part, and the entire configuration system should be reworked.

Do you use a linter like the ansible vscode extension?

I used to hate writing ansible, and yaml, until I installed the ansible lint vscode extension, and everything became much, much easier.

Later on, when I was working on a docker-compose, I noticed that the vscode yaml extension (which the ansible extension pulled in as a dependency) caught errors. It's quite intelligent, able to spot errors exactly like what you mentioned, where the yaml syntax is correct, but the docker-compose, or the ansible syntax is wrong.

Of course. If you're working in a DSL that's popular enough for someone to have written a good schema/parser for then tooling can help.

Significant white space is awesome! Not supporting tabs tho shows you don’t know what you are doing, YAML.

They very well know what they are doing. Take your filthy tabs and get out of here. Spaces only.

Tabs for indentation, spaces for alignment. It's perfect. Lets people visually indent as much as they want in their settings, but manually aligned things stay manually aligned. Forcing indents to always be... whatever number of spaces you personally like is dumb.

Plus then you can outdent with a single Backspace in every text editor ever.

😧

Accessibility be damned

set expandtab

That just converts tabs to space but doesn’t address the underlying accessibility needs where some folks demand different indentation due to vision issues or nonstandard IO devices like braille readers. Tabs allow the user to configure the width for their needs. Being static spaces ignores the needs of many folks.

Very good point re. braille readers. I was being flippant and did not think of that. My apologies. Tabs for indentation may be useful there, as would a blind-friendly pre- and post-processor for programming-language-specific files (a braille liner, could call it black-er for Python :)

I don't know how braille readers actually work, but I guess they process a bytestream. How do they handle UTF-16 and other nonstandard character sets? This is a known problem for a lot of systems; it would be interesting to know how they address it.

Have you met any such person?

I have interacted with a triple-digit number of developers and I have met exactly zero folks with such a need. If there is an actual need by a team member, sure, we will be accommodating.

Until then, however, the much more common thing is for people to have their own preferences for tab width and ignore the current code style, ending up in an indentation abomination that sucks for everyone. Therefore, no thanks, forced width for everyone, using spaces.

No point ruining the happy path scenario for a theoretical person with such a disability. If there will be an actual need, sure, let's convert to tabs only then.

their own preferences for tab width... ending up in an indentation abomination

Can you give an example where a person's personal tab width breaks things? One tab per logical indent, and then spaces for alignment. How does this break anything? I know for a fact it doesn't or else people like me wouldn't advocate for it. What breaks indenting is mixing tabs and spaces for indents, and obviously that's foolish. You can't blame that mistake for causing an "abomination" when it's something that would violate any code style specification, whether using spaces or tabs. You yourself could set your IDE to emit only 2 spaces when you hit Tab, and that would also violate your code style spec if you mix those indents into a file with 4-space indents, and that has nothing to do with tabs at all.

Doing stupid things in the code that violate the code style are stupid things that violate the code style, no matter what whitespace you use. But having a personal setting to see 8 spaces per tab isn't one of them if you only use tabs for logical indenting and not for alignment.
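The claim is easy to check with a small sketch. This assumes Python's str.expandtabs is a fair stand-in for how an editor renders a tab at a given width; the two-line snippets and the helper argument_columns are made up for illustration.

```python
# Hypothetical snippet: one tab per indent level, spaces for alignment.
TABS_THEN_SPACES = "\tcall(arg1,\n\t     arg2)"
# Same snippet, but a tab is misused as part of the alignment.
TABS_FOR_ALIGNMENT = "\tcall(arg1,\n\t\t arg2)"


def argument_columns(snippet, tab_width):
    """Return the rendered columns of arg1 and arg2 for a given tab width."""
    first, second = snippet.expandtabs(tab_width).splitlines()
    return first.index("arg1"), second.index("arg2")


for width in (2, 4, 8):
    print(
        width,
        argument_columns(TABS_THEN_SPACES, width),
        argument_columns(TABS_FOR_ALIGNMENT, width),
    )
# 2 (7, 7) (7, 5)
# 4 (9, 9) (9, 9)
# 8 (13, 13) (13, 17)
```

With tabs only for the indent, the two arguments line up at every width. The second variant only lines up at the width its author happened to use, which is the mixing mistake, not a tab-width problem.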

All tabs or all spaces for indents result in the exact same thing: good looking code. But tabs then have further advantages. Easier outdenting, better accessibility, etc. The only benefit to forcing spaces is that some random program you use for code comparison or whatever might default to something other than 4 columns for a Tab and your code looks a little wide until you change your settings. That's nothing compared to the advantages of tabs. Turns out that "benefit" of spaces is actually a drawback because no one is allowed to view the indents as anything but whatever column width you personally think it should be.

I am literally that person, but kay

Depends on the use case but XML is good for markup—especially if you need extensibility.

For config, Nickel & Dhall take the cake for being typed & having LSPs so the configuration writer can get immediate feedback about possible options (while eliminating invalid states) without requiring the manual—with configuration readers not needing to mess around with marshaling their types. Both these configuration languages let you import files & write little loops to make your config more DRY & makes maintaining large files (like say Kubernetes) easier.

XML is great if the (de-)serialization is already implemented. Otherwise traversing the document is a massive pain.

True. Something like XPath can really help & there are use cases where that is more concise, but it requires loading XPath into your head like regex (which tends to get unloaded). The extensibility shines tho, as seen by XMPP continuing to this day with very good backwards compatibility across 2 decades of updates, since everything is an extension to the base.

RON (Rusty Object Notation). It's like JSON but better.

Do you remember CSON? CoffeeScript Object Notation was a cute way to make JSON readable before CoffeeScript kinda died.

CSON looks like a slightly worse version of YAML to me

Parsing rules are way simpler… it was also a different time.

As a Rustacean, I cannot thank you enough for notifying me of this.

I see you get downvoted a lot. But as a Norwegian who has repeatedly run into the Norwegian problem when trying to use some program... I see you.

YAML 1.2 was released 15 years ago and fixed this issue. The problem is not YAML but the libraries people are using to parse it being a decade and a half out of date.
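For anyone who hasn't hit it, a rough sketch of the difference. It assumes the boolean pattern from the YAML 1.1 type repository and the one from the YAML 1.2 core schema, and only models the matching step, not a full parser; the country list and the is_bool helper are illustrative.

```python
import re

# Boolean spellings accepted for unquoted scalars (assumed from the specs).
YAML_1_1_BOOL = re.compile(
    r"y|Y|yes|Yes|YES|n|N|no|No|NO"
    r"|true|True|TRUE|false|False|FALSE"
    r"|on|On|ON|off|Off|OFF"
)
YAML_1_2_CORE_BOOL = re.compile(r"true|True|TRUE|false|False|FALSE")


def is_bool(text, pattern):
    """True if the whole unquoted scalar matches the boolean pattern."""
    return pattern.fullmatch(text) is not None


# Illustrative list of unquoted country codes, as they might appear in a config.
COUNTRY_CODES = ["SE", "DK", "NO", "FI"]

misread_by_1_1 = [code for code in COUNTRY_CODES if is_bool(code, YAML_1_1_BOOL)]
misread_by_1_2 = [code for code in COUNTRY_CODES if is_bool(code, YAML_1_2_CORE_BOOL)]

print(misread_by_1_1)  # ['NO']
print(misread_by_1_2)  # []
```

Under the 1.1 pattern an unquoted NO becomes a boolean, while under the 1.2 core schema it stays a string. Quoting the value ("NO") sidesteps the question with either version.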