ZickZack

@ZickZack@kbin.social
0 Post – 22 Comments
Joined 1 year ago

Go to the relevant domain's front page (e.g. https://kbin.social/d/kbin.social for kbin.social).
The URL scheme is "https://kbin.social/d/DOMAINHERE" assuming you are currently on kbin.social.
On the right in the sidebar you can see "Domain" and below that options to subscribe or to block.
Really it's the same thing as magazines, just that you generally don't visit the domain itself.


They will make it open source, just tremendously complicated and expensive to comply with.
In general, if you see a group proposing regulations, it's usually to cement their own positions: e.g. openai is a frontrunner in ML for the masses, but doesn't really have a technical edge against anyone else, therefore they run to congress to "please regulate us".
Regulatory compliance is always expensive and difficult, which means it favors people that already have money and systems running right now.

There are so many ways this can be broken, intentionally or unintentionally. It's also a great way to detect e.g. government critics and shut them down (if you are Chinese and everything you write is uniquely tagged to you: would you write about Tiananmen Square?), or to get monopolies on (dis)information.
This is not literally trying to force everyone to get a license for producing creative or factual work, but it's very close, since you can easily discriminate against any creative or factual sources you find unwanted.

In short, even if this is an absolutely flawless, perfect implementation of what they want to do, it will have catastrophic consequences.

That's not what lossless data compression schemes do:
In lossless compression the general idea is to create a codebook of commonly occurring patterns and use those as shorthand.
For example, one of the simplest and now ancient algorithms LZW does the following:

  • Initialize the dictionary to contain all strings of length one.
  • Find the longest string W in the dictionary that matches the current input.
  • Emit the dictionary index for W to output and remove W from the input.
  • Add W followed by the next symbol in the input to the dictionary.
  • Repeat from step 2.

Basically, instead of rewriting long sequences, it just writes down the index into an existing dictionary of already seen sequences (see the sketch below).
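
To make the dictionary idea concrete, here is a minimal, unoptimized Python sketch of LZW-style compression (it only emits dictionary indices; a real compressor would still have to turn those indices into bits):

```python
# Minimal LZW-style compressor sketch (illustrative, not production code).
def lzw_compress(text: str) -> list[int]:
    # Start the dictionary with all single characters occurring in the input.
    dictionary = {ch: i for i, ch in enumerate(sorted(set(text)))}
    output = []
    w = ""
    for ch in text:
        wc = w + ch
        if wc in dictionary:                  # keep extending the current match
            w = wc
        else:
            output.append(dictionary[w])      # emit index of the longest known match
            dictionary[wc] = len(dictionary)  # add the new sequence to the dictionary
            w = ch
    if w:
        output.append(dictionary[w])
    return output

print(lzw_compress("abababab"))  # repeated patterns collapse into few indices
```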

However, once this is done, you now need to find an encoding that takes your character set (the original characters + the new dictionary references) and turns it into bits.
It turns out that we can do this optimally: Using an algorithm called Arithmetic coding we can align the length of a bitstring to the amount of information it contains.
"Information" here meaning the statistical concept of information, which depends on the inverse likelihood a certain character is observed.
Logically this makes sense:
Let's say you have a system that measures earthquakes. As one would expect, most of the time, let's say 99% of the time, you will see "no earthquake", while in 1% of the cases you will observe "earthquake".
Since "no earthquake" is a lot more common, the information gain is relatively small (if I told you "the system said no earthquake", you could have guessed that with 99% confidence: not very surprising).
However if I tell you "there is an earthquake" this is much more important and therefore is worth more information.

From information theory (a branch of mathematics), we know that if we want to maximize the efficiency of our codec, we have to match the length of every character to its information content. Arithmetic coding now gives us a general way of doing this.
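
As a worked version of the earthquake example (using the 99%/1% numbers from above), the information content is just the negative log-probability, and the entropy is the average code length an optimal coder like arithmetic coding can approach:

```python
import math

p = {"no earthquake": 0.99, "earthquake": 0.01}

# Information content in bits: -log2(p). Rare events carry more information.
for event, prob in p.items():
    print(f"{event}: {-math.log2(prob):.3f} bits")
# "no earthquake" is worth ~0.014 bits, "earthquake" ~6.644 bits

# Entropy = average bits per measurement that an optimal coder can approach.
entropy = -sum(prob * math.log2(prob) for prob in p.values())
print(f"average: {entropy:.3f} bits per measurement")  # ~0.081 bits
```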

However, we can do even better:
Instead of just considering individual characters, we can also add in character pairs!
Of course, it doesn't make sense to add in every possible character pair, but for some of them it makes a ton of sense:
For example, if we want to compress english text, we could give a separate codebook entry to the entire sequence "the" and save a ton of bits!
To do this for pairs of characters in the english alphabet, we have to consider 26*26=676 combinations.
We can still do that: just scan the text 600 times.
With 3 character combinations it becomes a lot harder 26*26*26=17576 combinations.
But with 4 characters it's impossible: you already have almost half a million combinations!
In reality, this is even worse, since you have way more than 26 characters: you have things like ", . ? ! and your codebook ids which blow up the size even more!

So, how are we supposed to figure out which character pairs to combine and how many bits we should give them?
We can try to predict it!
This technique, called prediction by partial matching (PPM), is already very old (~1980s), but still used in many compression algorithms.
The important trick is now that with deep learning, we can train even more efficient estimators, without losing the lossless property:
Remember, we only predict what things we want to combine, and how many bits we want to assign to them!
The worst-case scenario is that your compression gets worse because the model predicts nonsensical character-combinations to store, but that never changes the actual information you store, just how close you can get to the optimal compression.
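
As a toy illustration of the "predict what to combine" idea (this is just frequency counting, not actual PPM), one could rank character pairs by how often they occur and how short their codes would ideally be:

```python
from collections import Counter
from math import log2

text = "the quick brown fox jumps over the lazy dog " * 100

# Count how often each character pair occurs in the text.
pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
total = sum(pairs.values())

# Frequent pairs are the ones worth their own codebook entries / short codes.
for pair, count in pairs.most_common(5):
    prob = count / total
    print(f"{pair!r}: p={prob:.3f}, ideal code length ~{-log2(prob):.2f} bits")
```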

The state-of-the-art in text compression has used this for a long time (see the Hutter Prize); it's just now getting to a stage where systems become fast and accurate enough to also make the compression useful for other domains/general-purpose compression.


They chose to do this. Daedalic has historically been a point-and-click developer, but they wanted to diversify, especially since their previous title "Pillars of the Earth" flopped. They first tried their hand at RTS with "A Year of Rain", which is simply not that good, and then looked into Gollum.
You also can't really make the argument that the project was rushed out the door, considering the game was supposed to release in 2021 (two years ago).

They tried something they had no experience in, not through coercion but because they wanted to, and produced a game of shockingly low quality. Since this wasn't the first flop, but just the latest in a huge series of flops, (though it was the most expensive and high profile one) the studio closed.


The inability to source is a huge problem, but you also have to keep in mind that complaining about AI has other objectives beyond the obvious "AI bad".

  • it's marketing: "Our thing is so powerful it could irreparably change someone's life" is still advertising even if that irreparable change is bad. Saying "AI so powerful it's dangerous" just sounds less advertis-y than "AI so powerful you cannot not invest in it" despite both leading to similar conclusions (you can look back at the "fearvertising" done during the original AI boom: same paint, different color)
  • it's begging for regulatory barriers to be put into place: Everyone with a couple of millions can build an LLM from scratch. That might sound like a lot, but it's only getting cheaper and it doesn't need highly intricate systems to replicate. Specifically the ability to finetune a large model with few datapoints allows even open-source non-profits like OpenAssistant to compete against the likes of google and openai: Google has made that very explicit in their leaked We have no moat memo. This is why you see people like Sam Altman talking to congress about the dangers of AI: He has no serious competitive advantage and hopes that with sufficient fear-mongering he can get the government to give him one.

Complaining about AI is as much about the AI as it is about the economic incentives behind AI.

And don't forget that even after that you still have to watch baked-in "This video is sponsored by <insert shady company here>" ads, since the actual revenue that gets passed to creators from youtube is so low that they have to look for additional revenue streams to keep the ship afloat.

It's $\mathbb{X}$ or unicode 𝕏 (U+1D54F)
Maybe he really likes metric spaces??

24, always driven manual, EU.
From my experience most people in the EU can or at least could: This is because many (if not all, not sure) countries make a distinction between manual and automatic licenses (see e.g. https://www.learn-automatic.com/qualified/automatic-driving-licence/).
I.e. if you want to drive a manual, you have to take the test on a manual, but if you take the test on a manual transmission, you are allowed to drive automatics as well.

That paper makes a bunch of (implicit) assumptions that make it pretty unrealistic: basically they assume that once we have decently working models already, we would still continue to do normal "brain-off" web scraping.
In practice you can use even relatively simple models to start filtering and creating more training data:
Think about it like the original LLM being a huge trashcan in which you try to compress terabytes of mostly garbage web data.
Then, you use fine-tuning (like the instruction tuning used in the assistant models) to increase the likelihood of deriving non-trash from the model (or to accurately classify trash vs non-trash).
In general this will produce a dataset that is of significantly higher quality simply because you got rid of all the low-quality stuff.

This is not even a theoretical construction: Phi-1 (https://arxiv.org/abs/2306.11644) does exactly that to train a state-of-the-art language model on a tiny amount of high quality data (the model is also tiny: only half a percent the size of gpt-3).
Previously, TinyStories (https://arxiv.org/abs/2305.07759) showed something similar: you can build high-quality models with very little data, if you have good data (in the case of TinyStories they generate simple stories to train small language models).

In general LLM people seem to re-discover that good data is actually good and you don't really need these "shotgun approach" web scrape datasets.
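
As a very rough sketch of that "filter with a model" idea (the scoring function below is a made-up heuristic stand-in; Phi-1 uses a trained classifier / LLM-based quality rating instead):

```python
# Toy filtering pass: keep only documents that a quality scorer likes.
def quality_score(doc: str) -> float:
    # Stand-in heuristic; in practice this would be a trained quality classifier.
    words = doc.split()
    if not words:
        return 0.0
    unique_ratio = len(set(words)) / len(words)  # penalize repetitive spam
    length_bonus = min(len(words) / 20, 1.0)     # penalize tiny fragments
    return unique_ratio * length_bonus

raw_scrape = [
    "click here click here click here buy now",
    "A binary search repeatedly halves the search interval until the target is found.",
]

filtered = [doc for doc in raw_scrape if quality_score(doc) > 0.3]
print(filtered)  # only the informative sentence survives the filter
```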


No he doesn't?
Don't get me wrong, there are many places where the paper can be wrong (e.g. fig. 1, or their magnetism data looking more similar to diamagnetism than superconductivity), but you are mixing him up with Ranga Dias, who has had a history of data fabrication.
Dias has nothing to do with this paper though.

Zeiss is German, and they also produce substantially more than just the optics: https://en.m.wikipedia.org/wiki/Carl_Zeiss_SMT

Peertube is inherently very scalable with relatively little cost due to an artifact of all social media platforms: most of the traffic is driven by a tiny fraction of videos/magazines/etc...

For services like youtube, you can use this as a way to quickly cache data close to the place it's going to be streamed: e.g. Netflix works with ISPs to install small servers at their locations to lessen the burden on their (and the ISPs) systems.
But with centralised systems you can only push this so far since ultimately everything is still concentrated at one central location.

Hypothetically, if you could stop this super-linear scaling with the number of users (you need to pay per user, plus the overhead generated from managing them at scale), you could easily compete against the likes of youtube, simply because, at sufficient scale, all the other effects get amortized away.

Peertube does exactly this by serving the videos as webtorrents: essentially this means that for every "chunk" of a video you downloaded, you also host that chunk for other people to download. That means that peertube itself theoretically only has to host every unique video once (or less than once, since the chunks stay in the network for a while), meaning you rid yourself of the curse of scaling linearly with the number of users and only scale sub-linearly with the number of unique videos (how sub-linear depends on the lifetime of your individual torrents, i.e. how long a single video chunk stays available for others).

The costs that remain for every peertube instance are essentially the file hosting costs (and encoding the video, but that also only scales with the number of videos and could be pushed onto the uploader using WASM video encoders).
Storage itself isn't cheap, but also not ungodly expensive (especially since you can amortize the costs over a long time as your platform grows, with storage prices in a continual massive decline).
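
A toy back-of-the-envelope model of that scaling argument (all numbers invented, only the shape matters): a centralised origin pays roughly per view, a peer-assisted origin pays roughly per unique video.

```python
# Toy model: origin bandwidth for centralised vs. peer-assisted delivery.
# Assumes every viewer watches one 1 GB video; numbers are made up on purpose.
video_size_gb = 1.0
unique_videos = 1_000

for viewers in (1_000, 100_000, 10_000_000):
    centralised = viewers * video_size_gb          # every view hits the origin
    peer_assisted = unique_videos * video_size_gb  # origin seeds each video ~once,
                                                   # peers redistribute the chunks
    print(f"{viewers:>12,} viewers: centralised {centralised:>14,.0f} GB, "
          f"peer-assisted ~{peer_assisted:,.0f} GB from the origin")
```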

Platforms like Netflix and youtube cannot do this because

  1. Netflix is a paid-service and people don't want to do the hosting job for netflix after having already paid for the service
  2. Youtube has to serve ads, which is incompatible with the "users host the content" method

In general torrenting is a highly reliable and well tested method that scales fantastically well to large data needs (it quite literally becomes more efficient the more people use it)

Have a look at Kraken which has many state-of-the-art models for both HTR and OCR

Not really: you have to keep in mind the amount of expertise and resources that already went into silicon, as well as the geopolitics and sheer availability of silicon. The closest currently available competitor is probably gallium arsenide. That has a couple of disadvantages compared to silicon:

  • It's more expensive (both due to economies of scale and the fact that silicon is just much more abundant in general)
  • GaAs crystals are less stable, leading to smaller boules.
  • GaAs is a worse thermal conductor
  • GaAs has no native "oxide" (compare to SiO₂) which can be directly used as an insulator
  • GaAs hole mobilities are worse (roughly 400 vs. 500 cm²/(V·s) for Si), which means p-channel FETs are naturally slower in GaAs, which makes CMOS structures impossible
  • GaAs is not a pure element, which means you get into trouble with mixing the elements.
    You usually see GaAs combined with germanium substrates for solar panels, but rarely independently of that (GaAs is simply bad for logic circuits).
    In short: it's not really useful for logic gates.

Germanium itself is another potential candidate, especially since it can be alloyed with silicon which makes it interesting from an integration point-of-view.
SiGe is very interesting from a logic POV considering its high forward and low reverse gain, which makes it attractive for low-current, high-frequency applications, since you naturally get heterojunctions that allow you to tune the band-gap (on the other hand you get the same problem as in GaAs: it's not a pure element, so you need to tune the band-gap in the first place).
One problem specifically for mosfets is the fact that you don't get stable silicon-germanium oxides, which means you can't use the established silicon-on-insulator techniques.
Cost is also a limiting factor: before even starting to grow crystals you have the pure material cost, which is roughly $10/kg for silicon, and $800/kg for germanium.
That's why, despite the fact that the early semiconductors all relied on germanium, germanium based systems never really became practical: It's harder to do mass production, and even if you can start mass production it will be very expensive (that's why if you do see germanium based tech, it's usually in low-production runs for high cost specialised components)

There's some research going on in commercialising these techniques but that's still years away.

What you are alluding to is called "DIDs" = "Decentralized identifiers" (see https://en.wikipedia.org/wiki/Decentralized_identifier).
The idea of most of these methods is that you identify yourself using a private key, while a public key is spread throughout the network.
If you want to log into a server on that network, the server would "challenge" your identity by encrypting something (e.g. a random number) using the public key, which you, the holder of the private key, can then decrypt and send back to prove you are who you say you are.
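
A minimal sketch of that challenge-response idea, using plain RSA via the Python cryptography package (real DID methods specify their own key types and protocols; this just shows the principle):

```python
# Challenge-response sketch: the server encrypts a random challenge with the
# user's public key; only the holder of the private key can decrypt it.
import os
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes

# Client side: the private key never leaves the user; the public key is published.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

# Server side: encrypt a random challenge with the published public key.
challenge = os.urandom(32)
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
encrypted_challenge = public_key.encrypt(challenge, oaep)

# Client side: decrypt the challenge and send it back to prove key ownership.
response = private_key.decrypt(encrypted_challenge, oaep)

# Server side: accept the login if the response matches the original challenge.
print("authenticated:", response == challenge)
```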

This method has already been standardized by the W3C, but only for less than a year. You also have to keep in mind that all federated social network systems (such as lemmy and kbin) are still in early development.

Just as a quick check: are you sure you are in your "subscribed" view?
KBIN by default uses an "all" view, which you can change at the top right next to your username (the "table" menu).

I see no indication that this was a top-down decision forced by management (just from having talked to some developers at Gamescom a couple of years ago).
The concept really wasn't horrible, it just looks like it now that we have seen the product, but a stealth game themed after Gollum is not a dumb idea.
There's lots of stuff you could do, like e.g. use the ring for temporary invisibility but at the cost of losing some e.g. sanity resource you need to recover.

The problem with this game is that the idea being bad doesn't even really factor into its quality, since the actual bare-bones graphics and fundamental gameplay are so broken that the lack of original ideas isn't really a factor.

If this was just a no-frills e.g. Thief clone with a Gollum skin, nobody would bat an eye. The problem is that even this low bar of "some stealth game + Gollum" is not reached.

In fact, we have a very direct comparison to a different "Gollum-like stealth game produced by an indie developer" that was a smash hit: "Styx: Master of Shadows" is a climbing-based stealth game featuring a small, green, goblin-like protagonist that has to deal with a powerful but risky-to-use substance.

I can just go to the search tab and look for the magazine (e.g. search for retro gaming) and find it on the other instances.
I think a fair number of people forget to switch the search to magazines before looking (or are actually subscribing to other instances but don't notice it).

You can use keepassXC and "self-host" your passwords on any cloud-storage you want (it's just a file after all), but if you are using 1Pass at the moment, I don't see an opt-in anonymized telemetry system as a reason to switch.

The "adequate covering" of our distribution p is also pretty self-explanatory: We don't need to see the statement "elephants are big" a thousand times to learn it, but we do need to see it at least once:

Think of the p distribution as e.g. defining a function on the real numbers. We want to learn that function using a finite amount of samples. It now makes sense to place our samples at interesting points (e.g. where the function changes direction), rather than just randomly throwing billions of points against the problem.

That means that even if our estimator is bad (i.e. it can barely distinguish real and fake data), it is still better than just randomly sampling (e.g. you can say "let's generate 100 samples of law, 100 samples of math, 100 samples of XYZ,..." rather than just having a big mush where you hope that everything appears).
That makes a few assumptions: the estimator is better than 0% accurate, the estimator has no statistical bias (e.g. the estimator didn't learn things like "add all sentences that start with an A", since that would shift our distribution), and some other things that are too intricate to explain here.

Importantly: even if your estimator is bad, it is better than not having it. You can also manually tune it towards being a little bit biased, either to reduce variance (e.g. let's filter out all HTML code), or to reduce the impact of certain real-world effects (like that most stuff on the internet is english: you may want to balance that down to get a more multilingual model).

However, you have to note here that these are LANGUAGE MODELS. They are not everything models.
These models don't aim for factual accuracy, nor do they have any way of verifying it: That's simply not the purview of these systems.
People use them as everything models, because empirically there's a lot more true stuff than nonsense in those scrapes and language models have to know something about the world to e.g. solve ambiguity, but these are side-effects of the model's training as a language model.
If you have a model that produces completely realistic (but semantically wrong) language, that's still good data for a language model.
"Good data" for a language model does not have to be "true data", since these models don't care about truth: that's not their objective!
They just complete sentences by predicting the next token, which is independent of factuality.
There are people working on making these models more factual (same idea: you bias your estimator towards things that are more likely to be true, like boosting reliable sources such as wikipedia, rather than training on uniformly weighted webscrapes), but to do that you need a lot more overview over your data, for which you need more efficient models, for which you need better distributions, for which you need better estimators (though in that case they would be "factuality estimators").
In general though the same "better than nothing" sentiment applies: if you have a sampling strategy that is not completely wrong, you can still beat completely random sample models. If your estimator is good, you can substantially beat them (and LLMs are pretty good in almost everything, which means you will get pretty good samples if you just sample according to the probability that the LLM tells you "this data is good")

For actually making sure that the stuff these models produce is true, you need very different systems that actually model facts, rather than just modelling language. Another way is to remove the bottleneck of machine learning models with respect to accuracy (i.e. you build a model that may be bad, but can never give you a wrong answer):
One example would be vector-search engines that, like search engines, retrieve information from a corpus based on the similarity as predicted by a machine learning model. Since you retrieve from a fixed corpus (like wikipedia) the model will never give you wrong information (assuming the corpus is not wrong)! A bad model may just not find the correct e.g. wikipedia entry to present to you.
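
A minimal sketch of that retrieval idea (the hashing-based embedding below is a deliberately crude stand-in for a learned embedding model, and the tiny corpus is made up):

```python
import numpy as np

# Tiny stand-in corpus; in practice this would be e.g. Wikipedia paragraphs.
corpus = [
    "The Eiffel Tower is located in Paris and was completed in 1889.",
    "Python is a programming language emphasizing readability.",
    "Elephants are the largest living land animals.",
]

def embed(text: str) -> np.ndarray:
    # Crude bag-of-words hashing embedding; a real system would use a trained model.
    vec = np.zeros(64)
    for word in text.lower().split():
        vec[hash(word) % 64] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

corpus_vectors = np.stack([embed(doc) for doc in corpus])

query = "how big are elephants"
scores = corpus_vectors @ embed(query)  # cosine similarity, since vectors are normalized
print(corpus[int(np.argmax(scores))])   # always an existing corpus entry, never invented text
```

A bad embedding model here can only return a less relevant passage; it can never hallucinate a passage that isn't in the corpus.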

I really like patreon since it gives creators some independence from the whims of platforms and advertising companies.
It also allows certain content to exist that doesn't (currently) work on e.g. youtube: (very) long-form videos or highly produced documentaries that may take half a year to plan and shoot just cannot exist within youtube due to the limited per-click revenue.

That doesn't mean this system is perfect: E.g. I would like to have an option to put some money into a monthly pot, which gets distributed based on my viewing habits and current interests. E.g. Twitch has "bits" which can be bought in bulk and distributed freely as donations.
Having a monthly system for "tokens" according to which a monthly donation gets divided into (i.e. a person got 25% of my tokens, so he gets 25% of the pot) would be nice (this does have the potential issue of hurting long-form content, but I could still donate the normal way).

Yes: keep in mind that with "good" nobody is talking about the content of the data, but rather how statistically interesting it is for the model.

Really, what machine learning is doing is trying to learn a probability distribution q that approximates the true distribution p, given only samples x ~ p(x).
The problem with statistical learning is that we only ever see an infinitesimally small amount of the true distribution (we only have finite samples from an infinite sample space of images/language/etc....).

So now what we really need to do is pick samples that adequately cover the entire distribution, without being redundant, since redundancy produces both more work (you simply have more things to fit against), and can obscure the true distribution:
Let's say that we have a uniform probability distribution over [1,2,3] (uniform means everything has the same probability of 1/3).

If we faithfully sample from this we can learn a distribution that will also return [1,2,3] with equal probability.
But let's say we have some redundancy in there (either direct duplicates, or, in the case of language, close-to duplicates):
The empirical distribution may look like {1,1,1,2,2,3} which seems to make ones a lot more likely than they are.
One way to deal with this is to just sample a lot more points: if we sample 6000 points, we are naturally going to get closer to the true distribution (similar to how flipping a coin twice can give you 100% tails, even if the coin is actually fair. Once you flip it more often, it will return to the true probability).

Another way is to correct our observations towards what we already know to be true in our distribution (e.g. a direct 1:1 duplicate in language is presumably a copy-paste rather than a true increase in probability for a subsequence).
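
A quick numerical illustration of both points (duplicates distort the empirical distribution, more samples pull it back towards the truth):

```python
import random
from collections import Counter

random.seed(0)

def empirical(samples):
    # Empirical probability of each value in the sample.
    counts = Counter(samples)
    return {k: counts[k] / len(samples) for k in sorted(counts)}

# A small sample with duplicates: 1 looks twice as likely as it really is.
print(empirical([1, 1, 1, 2, 2, 3]))

# Sampling many more points from the true uniform distribution over {1, 2, 3}
# brings every probability back close to 1/3.
big_sample = [random.choice([1, 2, 3]) for _ in range(6000)]
print(empirical(big_sample))
```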

<continued in next comment>