💣 Fuck-up night tales over a beer 🍻
Vladislav Ertel
Amsterdam, 2025
I'm a team lead and developer on Toolbox and Feedback Hub
Previously I worked on Rider, Code With Me and Remote Dev
And I'm...
only human.
And since I'm human, I make mistakes.
And today we are going to talk about funny fuck-ups worth learning from
Today we are going to talk about:
One of 1001 wrong ways to download a file
What is the most unobvious place for test flakiness to occur on CI?
What should you keep in mind when developing a public service?
Chapter #1: How to DDOS yourself
Hehe, Alexandr thanks for the slide. Anyway.
Back in 2020, during the good old days of the Great Indoors for humanity **CLICK**
We, developers, were sitting at home and were missing interactions with each other
Code With Me
And Code With Me was supposed to help with that
I joined Code With Me team back in 2020.
Code With Me was an ambitious startup-like project on top of a legacy codebase
The idea of Code With Me is fairly simple
"Fairly simple"
Imagine you want to work on a project together with your buddy**CLICK**
You can find a small button with a person icon in the top right corner**CLICK**
Click it - a popup with session creation options will appear. You start the session**CLICK**
"Fairly simple"
When you start a session, the IDE goes to the CWM server**CLICK**
and asks for a session link**CLICK**
Then you share the session link with your buddy**CLICK**
He downloads the thin client**CLICK**
And connects to your IDE
That's how Code With Me looks from both sides
One more minute of boring stuff
As you can see, Code With Me includes IDE, Thin Client and Server**CLICK**
Two instances of the CWM server were hosted on a Kubernetes cluster**CLICK**
And the load from all the IDEs**CLICK** was evenly spread across the two instances
Phew, boring part - check. Now comes the story...
Back in the beginning of 2021, when we first launched CWM - everything was
fine
The grass was green and the sky was blue
The usage was growing naturally.
total tranquility. BUT SUDDENLY!
One of the servers goes down
And usually that's not a big deal if you are hosted on Kubernetes
But this time the usage had grown to the point where one replica couldn't handle the load
And because of that, the traffic is redirected to the other instance while the first instance is rebooting**CLICK**
The second gets overwhelmed by the incoming traffic and also goes down**CLICK**
repeat until fixed
the Fix
The fix is simple**CLICK**
just scale up vertically and horizontally, no big deal
the proper fix
val isSuccess = makeRequest()
if (!isSuccess) {
retry(2.seconds)
}
But the proper fix - is to prevent that from happening in the future
The main problem here was the following
Everything worked fine while everything worked fine**CLICK**
But as soon as the IDE requests start failing
the IDE begins to generate even more traffic while trying to recover by using retries
The delay between these retries was hardcoded to 2 seconds
the proper fix
val isSuccess = makeRequest()
if (!isSuccess) {
val expPower = attempt + small_random
val delayTime = exp(2, expPower) + bigger_random
retry(delayTime.seconds)
}
And to prevent the problem in the future, I changed the retry strategy
The delays now grow exponentially, with random jitter
The randomness in the delay allowed us to smear the load evenly over time
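The slide pseudocode can be turned into something runnable. This is only a sketch under assumptions: the names `backoffDelay` and `retryWithBackoff`, the 1-second base, and the 60-second cap are all hypothetical, and the jitter formula is a simplified variant of the one on the slide, not the exact code that shipped.

```kotlin
import kotlin.math.pow
import kotlin.random.Random
import kotlin.time.Duration
import kotlin.time.Duration.Companion.milliseconds

// Hypothetical helper: delay before retry number `attempt` (0-based).
// The base delay doubles with each attempt, is capped at maxMs,
// and then gets a random jitter of up to the same amount added on top.
fun backoffDelay(
    attempt: Int,
    random: Random = Random.Default,
    baseMs: Long = 1_000,
    maxMs: Long = 60_000,
): Duration {
    val exponentialMs = (baseMs * 2.0.pow(attempt)).toLong().coerceAtMost(maxMs)
    val jitterMs = random.nextLong(exponentialMs + 1) // 0..exponentialMs inclusive
    return (exponentialMs + jitterMs).milliseconds
}

// Hypothetical retry loop: calls `request` until it succeeds or attempts run out.
// `sleep` is injectable so the loop can be exercised without real waiting.
fun retryWithBackoff(
    maxAttempts: Int,
    sleep: (Duration) -> Unit = { Thread.sleep(it.inWholeMilliseconds) },
    request: () -> Boolean,
): Boolean {
    repeat(maxAttempts) { attempt ->
        if (request()) return true
        if (attempt < maxAttempts - 1) sleep(backoffDelay(attempt))
    }
    return false
}
```

The cap matters as much as the exponent: without it, a client that has been offline for a while would wait absurdly long before its next attempt.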
Takeaway #1
1. Scale up your infrastructure in advance
2. Make your retries exponential
3. Ask the SRE team for advice
What can we take away from this story?**CLICK**
Scale up your infrastructure in advance**CLICK**
If you are using retries - I advise you to make them exponential and randomized**CLICK**
Don't hesitate to consult our SRE team, they are cool people who are keen to help you
Give them a round of applause
Ok, going further
Chapter #2: It's all about files
Back in 2021, during the good old days of the Great Indoors for humanity
Kirill Skrygan approached me and asked:
Look, we have Code With Me. What if instead of connecting to another person's IDE
we start the IDE remotely and connect to it?
If you had one shot or one opportunity
To seize everything you ever wanted in one moment
Would you make a prototype?
And I was like:
Fuck Yeah! Mom's spaghetti! Let's go!
So, I made the first Gateway prototype together with Nikolay Kuznetsov and Evgeniy Stepanov
really fast, in a couple of months
And speaking about spaghetti
To be honest, there was some amount of spaghetti code
that remote dev team fixes up to this day
Please give them a round of applause
We, programmers, like files. But when the time finally comes to download them
there are tons of ways to fuck it up.
File buffering in memory, checksums, retries, smart retries with range header, parallel streams, whatever
1. Download an IDE on a remote host
2. Download an IDE locally
3. Connect them together
So, the Gateway. It was basically a thing that had 3 jobs
1. Download an IDE on a remote host
2. Download an IDE locally
3. Connect them together
Quick recap of Gateway
Alright, let's quickly recap how Gateway works.
Where is the data coming from?
https://data.services.jetbrains.com/products?code=QA
But how does Gateway know which builds are available?
Well... **CLICK**
We have a special service on the internet that provides this information as JSON
[{"code":"QA","intellijProductCode":"QA","alternativeCodes":["QA"],"salesCode":"QA","name":"Aqua",
"productFamilyName":"Aqua","link":"https://www.jetbrains.com/aqua/","description":"A powerful IDE for
test automation","tags":[{"id":"java","name":"Java"},{"id":"kotlin","name":"Kotlin"},{"id":"python","
name":"Python"},{"id":"js","name":"JavaScript, TypeScript"},{"id":"sql","name":"SQL/NoSQL"}],"types":[],"
categories":["IDE"],"releases":[{"date":"2025-01-24","type":"release","downloads":{"linuxARM64":{"link":"htt
ps://download.jetbrains.com/aqua/aqua-2024.3.2-aarch64.tar.gz","size":1077325995,"checksumLink":"https://do
wnload.jetbrains.com/aqua/aqua-2024.3.2-aarch64.tar.gz.sha256"},"linux":{"link":"https://download.jetbrai
ns.com/aqua/aqua-2024.3.2.tar.gz","size":1077154153,"checksumLink":"https://download.jetbrains.com/aqua/aqua-2
024.3.2.tar.gz.sha256"},"thirdPartyLibrariesJson":{"link":"https://resources.jetbrains.com/storage/third-party-
libraries/aqua/aqua-2024.3.2-third-party-libraries.json","size":68408,"checksumLink":"https://resources.jetbrain
s.com/storage/third-party-libraries/aqua/aqua-2024.3.2-third-party-libraries.json.sha256"},"windows":{"link":"ht
tps://download.jetbrains.com/aqua/aqua-2024.3.2.exe","size":778054112,"checksumLink":"https://download.jetbrains
.com/aqua/aqua-2024.3.2.exe.sha256"},"windowsARM64":{"link":"https://download.jetbrains.com/aqua/aqua-2024.3.2-aa
rch64.exe","size":749261640,"checksumLink":"https://download.jetbrains.com/aqua/aqua-2024.3.2-aarch64.exe.sha256
"},"macM1":{"link":"https://download.jetbrains.com/aqua/aqua-2024.3.2-aarch64.dmg","size":1029961397,"checksumLi
nk":"https://download.jetbrains.com/aqua/aqua-2024.3.2-aarch64.dmg.sha256"},"mac":{"link":"https://download.jetb
rains.com/aqua/aqua-2024.3.2.dmg","size":1039127577,"checksumLink":"https://download.jetbrains.com/aqua/aqua-
2024.3.2.dmg.sha256"}},"patches":{},"notesLink":null,"licenseRequired":true,"version":"2024.3.2","majorVersi
on":"2024.3","build":"243.23654.154","whatsnew":"Aqua 2024.3.2 is out. Whats' new:\n Minor
fixes and stability improvements","uninstallFeedbackLinks":null,"printableReleaseType":null},{"d
ate":"2024-12-20","type":"release","downloads":{"linuxARM64":{"link":"https://download.jetbrains.com/aqua/aqua
@Serializable
data class Download(
val link: String,
val size: Int? = null,
val checksumLink: String? = null,
)
Do you see the problem?
Alright, we can use Kotlin serialization.
We declare a Download class that has a link and a size field
And... **CLICK** we already made a mistake
Do you see the problem here?
2048MB ought to be enough for anybody
But then comes the Rider.
Rider is a BIG boy
With a BIG installer
And when this installer appears in the products JSON ...
Rider stops being suggested in Gateway completely
because Gateway can't parse the JSON
because the size of the latest build is just too big for Int
And of course all the Rider releases are gone from Gateway.
So we have to fix it.
Fix
@Serializable
data class Download(
val link: String,
val size: Int? = null,
val checksumLink: String? = null,
)
Well, the fix is simple - just change Int to Long and call it a day.
Fix
@Serializable
data class Download(
val link: String,
val size: Long? = null,
val checksumLink: String? = null,
)
We don't use the size field anywhere in our codebase.
It existed purely to be a reminder that we exceeded 2GB installer size
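To make the boundary concrete, here is a small sketch using only the Kotlin standard library. The helper names are hypothetical, and `toIntOrNull` merely stands in for what a strict deserializer does with an out-of-range number. The first size is the Aqua Linux installer from the JSON above; the second is an illustrative value just past `Int.MAX_VALUE`, not a real installer size.

```kotlin
// Hypothetical helpers: read a "size" value into a 32-bit or a 64-bit field.
fun parseSizeAsInt(raw: String): Int? = raw.toIntOrNull()
fun parseSizeAsLong(raw: String): Long? = raw.toLongOrNull()

// Size of aqua-2024.3.2.tar.gz as listed in the products JSON: fits in Int.
const val AQUA_LINUX_SIZE = "1077154153"

// Illustrative value: Int.MAX_VALUE + 1 bytes, i.e. exactly 2 GiB.
const val JUST_OVER_INT = "2147483648"

// Describes what each field type makes of a given size string.
fun describe(raw: String): String =
    "size=$raw asInt=${parseSizeAsInt(raw)} asLong=${parseSizeAsLong(raw)}"
```

Calling `describe(JUST_OVER_INT)` shows `asInt=null` next to a perfectly valid `asLong`, which is the whole bug in one line.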
Takeaway #2
1. It's 2025. Don't use Int when working with files, use Long
2. If you have gone past 2 GB: prepare for 4 GB
What can we take away from this story?**CLICK**
1. It's 2025. Don't use Int when working with files, use Long
Four bytes of difference is nothing today, but it will save your nerves and time.**CLICK**
2. If you have gone past 2 GB - prepare for 4 GB
Eventually you won't be able to fit in a zip file. Just be prepared
Ok, moving forward.
Chapter #3: STDIO
It works but there is a nuance
So... The Gateway had two jobs
Download IDEs and connect them to each other
We already talked about the downloading part
Let's talk about "connecting them to each other"
Boring slides once again
$PATH, $http_proxy, keychain, ulimit
$ ssh my.lovely.host
~/.profile, ~/.bash_profile, ~/.zprofile, ~/.bashrc, ~/.zshrc, ....
First of all: before connecting IDEs to each other, we need to launch them
The backend IDE was launched in a login shell**CLICK**
Why? To be as close as possible to the user's environment**CLICK**
$PATH, keychain agents, proxy, ulimit tweaks, whatever
A login shell is the default type of shell you get when connecting via "ssh somehost"**CLICK**
It includes configuration from ~/.*profile**CLICK**
Most users don't think about where to put their configuration - *profile or *rc, system-wide or user-wide = OK
So, we got the environment as close as possible to the user's own
But here comes the catch**CLICK**
We got the environment as close as possible to the user's own
It also includes some garbage that gets into our precious STDOUT and can ruin the parsing
Let's start with a short historical excursion to see the problem deeper
Gateway used a tiny program written in Go (called "go worker") to perform various basic operations on the backend
It was launched with different arguments every time we needed anything from the backend
And the output of "go worker" was parsed by Gateway in a very cunning way
/**
* People of the future, I'm really sorry for this regex,
* but I needed somehow to pass all information
* about deploy via log message :(
*/
private const val SuccessfulDeployDelimiter = "#\$hi#"
private const val SuccessfulDeployPattern = "$SuccessfulDeployPrefix\\. " +
"idePath=#\\\$hi#(.*)#\\\$hi# " +
"productCode=#\\\$hi#($productCodePattern)#\\\$hi# " +
...
So remember that I said sometimes we get garbage in our stdio
So, let's try to fight it with the most naive solution
The comment on the top already looks promising**CLICK**
Since we control the go worker, let's just wrap the message we await with some well-known barriers
And parse it with regexes
Alright, now deploy fails less often but still fails
Alright, people of the future became smarter, got some manners and decided to enhance it
@Serializable
private data class SuccessfulDeployInfo (
val idePath: String,
val productCode: String,
....)
...
internal const val SuccessfulDeployPrefix = "Deployment Successful #\$hi#"
private const val messageSuffix = "#\$poka#"
...
val jsonStr = message.substringAfter(SuccessfulDeployPrefix)
.substringBefore(messageSuffix)
val deployInfo = try {
SuccessfulDeployInfo.fromJson(jsonStr)
} catch (e: SerializationException) { ... }
Not only did they decide to use JSON,**CLICK** but they also added a new suffix - POKA**CLICK**
OK, the new approach works better, since the JSON is printed on one line and in a more structured way
But it still fails in some cases
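The marker idea can be sketched on its own. This is a minimal sketch that reuses the two marker strings from the slide; `extractPayload` and the sample shell noise are made up for illustration. One detail worth knowing: `substringAfter` and `substringBefore` return the whole input when the delimiter is missing, so this sketch returns null in that case instead of passing noise on to the JSON parser.

```kotlin
const val PREFIX = "Deployment Successful #\$hi#"
const val SUFFIX = "#\$poka#"

// Hypothetical helper: returns the text between the two markers,
// or null when either marker is missing (e.g. the output was cut off).
fun extractPayload(stdout: String): String? {
    val start = stdout.indexOf(PREFIX)
    if (start < 0) return null
    val payloadStart = start + PREFIX.length
    val end = stdout.indexOf(SUFFIX, payloadStart)
    if (end < 0) return null
    return stdout.substring(payloadStart, end)
}
```

Anything a login shell prints before or after the markers is ignored. Noise that lands between them still corrupts the payload, which is one way such a scheme can keep failing.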
ssh my.host ./go-worker --trace do_stuff
Alright, let's debug it**CLICK**
We launch our small go worker with the --trace flag to produce more logs
And suddenly, here comes another problem. It just gets stuck. The command never finishes.
However, it works without the --trace flag
Any ideas here?
val returnCode = execute(command)
val stdout = command.readStdout()
remote program Gateway code
⬇️ ⬆️
⬇️ ⬆️
sshd (server) -> TCP -> sshj
(client library)
You see, when gateway calls a binary on remote - it awaits the exit code of go worker
But go worker can not end its execution and give the exit code
write(stdout, log_string) // <- blocked the thread
You see, when gateway calls a binary on remote - it awaits the exit code of go worker
But go worker can not end its execution and give the exit code
because the pipe is a kernel object that has its own buffer limit
And if you try to write a message while the buffer is already full, the thread will block
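The blocking behaviour can be reproduced without SSH at all. The sketch below is a simulation, not the Gateway code: it uses `java.io` piped streams with a deliberately tiny 1024-byte buffer in place of a kernel pipe, and the function name and parameters are hypothetical. The "remote program" is a thread that writes trace lines; the caller either drains the output or just waits for the program to finish.

```kotlin
import java.io.PipedInputStream
import java.io.PipedOutputStream
import kotlin.concurrent.thread

// Simulates a program that logs `lines` lines into a pipe with a bounded buffer.
// Returns true if the program managed to finish within `timeoutMs`.
fun finishesWithin(lines: Int, drainOutput: Boolean, timeoutMs: Long): Boolean {
    val pipeIn = PipedInputStream(1024) // small bounded buffer, standing in for a kernel pipe
    val pipeOut = PipedOutputStream(pipeIn)

    // The "remote program": write() blocks once the buffer is full.
    val program = thread(isDaemon = true) {
        pipeOut.use { out ->
            repeat(lines) { i -> out.write("trace line $i\n".toByteArray()) }
        }
    }

    // The well-behaved caller reads stdout while the program is still running.
    if (drainOutput) {
        thread(isDaemon = true) { pipeIn.readBytes() }
    }

    // The naive caller only waits for the program to exit.
    program.join(timeoutMs)
    return !program.isAlive
}
```

With a handful of lines the program finishes even if nobody reads, because everything fits in the buffer. That is why the command worked without --trace and got stuck with it.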
Takeaway #3
Be careful when using stdio in complex situations:
1. There is no structure in the output payload
2. You have to read from the stdio pipe of a process
3. Don't use login shell if you need to parse output of a command
Be careful when using stdio in complex situations:**CLICK**
1. There is no structure in the output payload**CLICK**
2. You have to read from the stdio pipe of a process, or it will block the whole process on attempt to write**CLICK**
3. Don't use login shell if you need to parse output of a command. In Toolbox we don't use stdout for communication anymore and it works much more reliably
As you can see, I can speak about questionable
engineering decisions in Gateway for a long time
But that's a tale for another day.
Chapter #4: Chained by a build chain
In this section we will talk about everyday interaction with TeamCity
from the perspective of a developer who is not TeamCity-certified.
After Gateway and Remote Development I joined the Toolbox team in 2024.
And I started with small things here and there.**CLICK**
One of them was to add a small integration test that checks if a jetbrains:// link was handled by Toolbox.
Ok. But how do we check that a link was handled?**CLICK**
In integration tests, Toolbox is started with a special environment variable
that instructs it to start an HTTP server on a predefined port.
/installIde
/listIde
/login
...
/listHandledUrls
This server handles different HTTP requests that simulate user actions like:
**CLICK** install an ide, **CLICK** listing installed ides, **CLICK** login,**CLICK** etc.
I added one more handler **CLICK** — give me the list of handled URLs
Everything works fine locally, tests are passing, BUT on TeamCity the tests are failing
/installIde
/listIde
/login
...
/listHandledUrls - 404: Not found
Artifact dependencies!
Artifact dependencies is the answer!
But let's look closer at what exactly happened.
Our test infrastructure was simple
1. You build an honest distribution in one build**CLICK**
2. You unpack the archive and run tests against the unpacked distribution in another build
As simple as that! What can go wrong? Well...
There are two types of dependencies in TeamCity:**CLICK**
Artifact and Snapshot. And they represent different ideas.
But how do they work?
Artifact vs Snapshot dependencies
Snapshot = run build with the same sources (vcs revision)*
Artifact = give me an artifact
Snapshot + Artifact = give me an artifact with the same source revision*
*checkout rules apply
You can treat them like this:**CLICK**
Snapshot means trigger a build with the same sources (vcs revision) as the current one**CLICK**
Artifact means give me an artifact from the build I depend on. But the same source revision is not
mandatory, it can be different**CLICK**
Snapshot + Artifact means give me an artifact from the build I depend on, with the same source
revision
/installIde
/listIde
/login
...
/listHandledUrls - 404: Not found
So, getting back to the 404 problem
The problem here is that we were taking the distribution from another build chain that already existed **CLICK**
And the tests were talking to a build that didn't have the /listHandledUrls endpoint
Let's quickly see what it looks like in TeamCity DSL, so you know what to watch out for
How it was
dependencies {
dependency(installerBuild) {
artifacts {
buildRule = lastSuccessful()
artifactRules = artifactRule
}
}
}
How it had to be
dependencies {
dependency(installerBuild) {
snapshot {
synchronizeRevisions = true
}
artifacts {
artifactRules = artifactRule
}
}
}
Takeaway #4
Be careful when using artifact-only dependencies
Be careful when using artifact-only dependencies Teamcity
What wisdom can we take from this story?**CLICK**
Be careful when using artifact-only dependencies.
**CLICK** Actually, be careful when using TeamCity itself
TeamCity is a complex CI engine and has its own quirks
To use it wisely, you probably need a TeamCity-certified build engineer
And if you use an artifact-only dependency deliberately, you are probably already a TeamCity-certified developer
One last takeaway
Don't be afraid to ask experienced colleagues
Don't be afraid to ask experienced colleagues. By talking before doing something, you may avoid a non-obvious mistake.
@Vlada
@Vladislav.Ertel
Also, I encourage you to share your own mistakes that are worth a talk on this stage. Please contact me or Vlada Danilova if you are interested
And one more thing
We are hiring
We are hiring!
We are looking for brave and cheerful developers
If you want to join Toolbox team - drop me a message
Settings Drift: A Tale of Broken Import and IDE Amnesia
You know where to find this creature, and this creature knows where to find you.
⇒
Say, you are releasing a new major IDE version, for example 2024.1. It is great, it has a lot of shiny new features.
But then a bug report comes: a user says their settings are wiped out by the migration.
We all work in software development here, and we know that a lot of weird things happen. This could be caused by dozens of things outside our control: a hardware issue, a user environment issue, some misconfiguration.
⇒
But nope, you investigate and see that none of this happened.
⇒
Then you figure out: the user can still start the previous product version but cannot start the new one.
Then you start thinking. How exactly do settings from the previous version interact with the new version? And what could've been broken in there? Huh?
Settings Migration
~/…/Rider2023.3
↓↓↓↓↓
~/…/Rider2024.1
🎉 ✓ CORRECT ✓ 🎊
Each major product version has its own versioned folder.
So we need to copy the folders on update. Simple, right?
Yes. We are done. Great job.
Settings Migration
(the silent update)
ConfigImportHelper.java: 1819 lines
Settings Migration
Toolbox-performed update
Custom migration command support
Per-product customizations (AppCode, CLion, JetBrainsClient, Rider)
Special migrations for plugins
Many more
Of course we have many more things going on here.
IDE Update Via Toolbox
Download the IDE archive (problems)
Unpack the archive (problems)
(upvote TBX-6708 )
Run the IDE with "update" arg (problems)
(generate the actual update script)
Run the IDE with "update" arg (problems)
(perform the plugin update)
Finish (problems)
…and then we finish, we allow the IDE to copy the settings on the next run.
IDE Update Via Toolbox
Download the IDE archive (problems)
Unpack the archive (problems)
(upvote TBX-6708 )
Run the IDE with "update" arg (problems)
(generate the actual update script)
Run the IDE with "update" arg (problems)
(perform the plugin update)
Finish (problems)
We'll focus on two of these items for the purpose of this talk.
The Source
IDE will only perform the silent import into the unused settings folder.
Non-empty settings folder is not replaced.
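The rule on this slide can be written down as a tiny check. This is only a sketch of the idea, not the logic in ConfigImportHelper.java: the function name is hypothetical and the real code may treat some entries specially.

```kotlin
import java.nio.file.Files
import java.nio.file.Path

// Hypothetical check mirroring the rule above:
// import only when the target settings folder is missing or has no entries.
fun shouldImportSettings(targetDir: Path): Boolean {
    if (!Files.isDirectory(targetDir)) return true
    // Files.list returns a stream that must be closed, hence `use`.
    return Files.list(targetDir).use { entries -> !entries.findFirst().isPresent }
}
```

The consequence is that a single stray file is enough: if anything writes into the new folder before the first real start, the check flips to false and the old settings are never imported.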
IDE Update [1/2]
Run the IDE with "update" arg
(ruined update)
So, imagine you are working on an IDE that requires an always running backend process. You spawn it during the startup.
But nothing is supposed to trigger during this special update phase:
it's too dangerous, because it could trigger components that touch the settings folder, and if they do, the migration will be ruined!
IDE Update [1/2]
Run the IDE with "update" arg
(backend not triggered)
Normally it's not even supposed to trigger the component that controls the backend, so everything should be fine. Right?
IDE Update [1/2]
Run the IDE with "update" arg
(ruined update with EBS)
Of course not: we have this thing called "EBS"!
IDE Update [1/2]
Run the IDE with "update" arg
(ruined update with EBS)
So, at some point, we figured it out and made it check for the update mode and not start the backend.
IDE Update [1/2]
Run the IDE with "update" arg
Problem: ReSharper EBS process
Problem: ReSharper settings import
(only if ReSharper is used)
This wasn't the only problem: we also have a workflow of importing the ReSharper settings during the first run, but only on machines where ReSharper is actually used. You can imagine how hard it was to figure out that this could also cause problems.
IDE Update [2/2]
Finish (copy the settings during the next run)
Live processes keeping the settings dir from being copied (upvote IJPL-148314 )
IDE Update [3/2]
The Unknown
We also have some suspicions that this might still be problematic (reports from other IDEs).
Investigation
Rider: Must be a Toolbox problem
Toolbox: Must be a Rider problem
Rider: Must be a Toolbox problem?
…⚽…
IJPL-159952 : Import diagnostics
Here's how the investigation went. Like soccer!
Investigation
Collect logs from both sources:
--- IDE STARTED ---
#c.i.p.i.b.AppStarter - Will skip the config import to directory
"C:\Users\fried\AppData\Roaming\JetBrains\IntelliJIdea2025.1"
(exists = true). Current entries: "app-internal-state.db",
"bundled_plugins.txt", "c.kdbx", "c.pwd", …
Conclusion
When having a problem with import, remember that we have logs (both IDE and Toolbox).
Coordinate investigation of important issues.
Do not play ⚽soccer, play 🏈football.
We have logs. And let's do better next time!