How to handle race condition with Coroutines in Kotlin? - android

I have a coroutine/flow problem that I'm trying to solve.
I have this method getClosestRegion that's supposed to do the following:
Attempt to connect to every region
The first region to connect (I use launch to attempt to connect to all concurrently), should be returned and the rest of the region requests should be cancelled
If all regions failed to connect OR after a 30 second timeout, throw an exception
That's currently what I have:
override suspend fun getClosestRegion(): Region {
val regions = regionsRepository.getRegions()
val firstSuccessResult = MutableSharedFlow<Region>(replay = 1)
val scope = CoroutineScope(Dispatchers.IO)
// Attempts to connect to every region until the first success
scope.launch {
regions.forEach { region ->
launch {
val retrofitClient = buildRetrofitClient(region.backendUrl)
val regionAuthenticationAPI = retrofitClient.create(
val response = regionAuthenticationAPI.canConnect()
if (response.isSuccessful && scope.isActive) {
val result = withTimeoutOrNull(TimeUnit.SECONDS.toMillis(30)) { firstSuccessResult.first() }
if (result != null)
return result
throw Exception("Failed to connect to any region")
Issues with current code:
If 1 region was successfully connected, I expect that the rest of the requests will be cancelled (by scope.cancel()), but in reality other regions that have successfully connected AFTER the first one are also emitting a value to the flow (scope.isActive returns true)
I don't know how to handle the race condition of throw exception if all regions failed to connect or after 30 second timeout
Also I'm pretty new to kotlin Flow and Coroutines so I don't know if creating a flow is really necessary here

You don't need to create a CoroutineScope and manage it from within a coroutine. You can use the coroutineScope function instead.
I of course didn't test any of the below, so please excuse syntax errors and omitted <types> that the compiler can't infer.
Here's how you might do it using a select clause, but I think it's kind of awkward:
override suspend fun getClosestRegion(): Region = coroutineScope {
val regions = regionsRepository.getRegions()
val result = select<Region?> {
onTimeout(30.seconds) { null }
for (region in regions) {
launch {
val retrofitClient = buildRetrofitClient(region.backendUrl)
val regionAuthenticationAPI = retrofitClient.create(
val result = regionAuthenticationAPI.canConnect()
if (!it.isSuccessful) {
delay(30.seconds) // prevent this one from being selected
}.onJoin { region }
coroutineContext.cancelChildren() // Cancel any remaining async jobs
requireNotNull(result) { "Failed to connect to any region" }
Here's how you could do it with channelFlow:
override suspend fun getClosestRegion(): Region = coroutineScope {
val regions = regionsRepository.getRegions()
val flow = channelFlow {
for (region in regions) {
launch {
val retrofitClient = buildRetrofitClient(region.backendUrl)
val regionAuthenticationAPI = retrofitClient.create(
val result = regionAuthenticationAPI.canConnect()
if (result.isSuccessful) {
val result = withTimeoutOrNull(30.seconds) {
coroutineContext.cancelChildren() // Cancel any remaining async jobs
requireNotNull(result) { "Failed to connect to any region" }
I think your MutableSharedFlow technique could also work if you dropped the isActive check and used coroutineScope { } and cancelChildren() like I did above. But it seems awkward to create a shared flow that isn't shared by anything (it's only used by the same coroutine that created it).
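For reference, here is a compact, self-contained sketch of the channelFlow idea. It is untested against a real backend: FakeRegion, canConnect and firstReachable are hypothetical stand-ins for the Retrofit calls in the question, and the latencies are simulated with delay.

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.flow.*
import kotlin.time.Duration
import kotlin.time.Duration.Companion.milliseconds
import kotlin.time.Duration.Companion.seconds

// Hypothetical stand-in for a region; latency and reachable simulate the network.
data class FakeRegion(val name: String, val latency: Duration, val reachable: Boolean)

// Hypothetical stand-in for regionAuthenticationAPI.canConnect().
suspend fun canConnect(region: FakeRegion): Boolean {
    delay(region.latency) // cancellable suspension point
    return region.reachable
}

suspend fun firstReachable(
    regions: List<FakeRegion>,
    timeout: Duration = 30.seconds,
): FakeRegion {
    val probes = channelFlow {
        for (region in regions) {
            // One child coroutine per region; only successful probes emit.
            launch { if (canConnect(region)) send(region) }
        }
    }
    // firstOrNull() stops collecting after the first emission, which cancels
    // the remaining probes. It returns null if every probe finishes without emitting.
    val winner = withTimeoutOrNull(timeout) { probes.firstOrNull() }
    return requireNotNull(winner) { "Failed to connect to any region" }
}

fun demoFirstReachable(): String = runBlocking {
    firstReachable(
        listOf(
            FakeRegion("slow", 300.milliseconds, reachable = true),
            FakeRegion("fast", 50.milliseconds, reachable = true),
            FakeRegion("down", 10.milliseconds, reachable = false),
        )
    ).name
}
```

The three outcomes (first success, all failed, timeout) collapse into one nullable value, so no counters or shared state are needed.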

If 1 region was successfully connected, I expect that the rest of the requests will be cancelled (by scope.cancel()), but in reality other regions that have successfully connected AFTER the first one are also emitting a value to the flow (scope.isActive returns true)
To quote the documentation...
Coroutine cancellation is cooperative. A coroutine code has to cooperate to be cancellable.
Once your client is initiated, you can't cancel it - the client has to be able to interrupt what it's doing. That probably isn't happening inside of Retrofit.
I'll presume that it's not a problem that you're sending more requests than you need - otherwise you won't be able to make simultaneous requests.
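To make "cooperative" concrete, here is a minimal sketch. It is not taken from the documentation; countUntilCancelled and its timings are made up for illustration, and Thread.sleep stands in for blocking work that knows nothing about coroutines.

```kotlin
import kotlinx.coroutines.*

// Runs a blocking loop for up to ~300 ms, cancels it after ~50 ms, and
// reports how many iterations ran. `cooperative` decides whether the loop
// checks for cancellation on each pass.
suspend fun countUntilCancelled(cooperative: Boolean): Int = coroutineScope {
    var iterations = 0
    val job = launch(Dispatchers.Default) {
        val deadline = System.currentTimeMillis() + 300
        while (System.currentTimeMillis() < deadline) {
            if (cooperative) ensureActive() // throws CancellationException once cancelled
            iterations++
            Thread.sleep(10) // blocking call: ignores coroutine cancellation
        }
    }
    delay(50)
    job.cancelAndJoin() // waits for the loop to actually stop
    iterations
}
```

Without the ensureActive() check, cancelAndJoin() has to wait until the loop finishes on its own, because nothing inside the loop ever looks at the job's state.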
I don't know how to handle the race condition of throw exception if all regions failed to connect or after 30 second timeout
As I understand there are three situations
There's one successful response - other responses should be ignored
All responses are unsuccessful - an error should be thrown
All responses take longer than 30 seconds - again, throw an error
Additionally I don't want to keep track of how many requests are active/failed/successful. That requires shared state, and is complicated and brittle. Instead, I want to use parent-child relationships to manage this.
The timeout is already handled by withTimeoutOrNull() - easy enough!
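As a small illustration of that pattern, the sketch below turns the null returned on timeout into an exception. orFailAfter and demoTimeout are hypothetical names used only for this example.

```kotlin
import kotlinx.coroutines.*
import kotlin.time.Duration
import kotlin.time.Duration.Companion.milliseconds

// Runs block with a time limit; a timeout becomes an IllegalStateException.
// T is non-null so that null can only mean "timed out".
suspend fun <T : Any> orFailAfter(timeout: Duration, block: suspend () -> T): T =
    withTimeoutOrNull(timeout) { block() } ?: error("Timed out after $timeout")

fun demoTimeout(): Pair<String, Boolean> = runBlocking {
    val fast = orFailAfter(200.milliseconds) { delay(10); "ok" }
    val slowFailed = runCatching {
        orFailAfter(20.milliseconds) { delay(500); "late" }
    }.isFailure
    fast to slowFailed
}
```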
First success
Selects could be useful here, and I see @Tenfour04 has provided that answer. I'll give an alternative.
Using suspendCancellableCoroutine() provides a way to
return as soon as there's a success - resume(...)
throw an error when all requests fail - resumeWithException
suspend fun getClosestRegion(
regions: List<Region>
): Region = withTimeoutOrNull(10.seconds) {
// don't give the supervisor a parent, because if one response is successful
// the parent will await the cancellation of the other children
val supervisorJob = SupervisorJob()
// suspend the current coroutine. We'll use cont to continue when
// there's a definite outcome
suspendCancellableCoroutine<Region> { cont ->
launch(supervisorJob) {
.map { region ->
// note: use async instead of launch so we can do awaitAll()
// to track when all tasks have completed, but none have resumed
async(supervisorJob) {
coroutineContext.job.invokeOnCompletion {
log("cancelling async job for $region")
val retrofitClient = buildRetrofitClient(region)
val response = retrofitClient.connect()
// if there's a success, then try to complete the supervisor.
// complete() prevents multiple jobs from continuing the suspended
// coroutine
if (response.isSuccess && supervisorJob.complete()) {
log("got success for $region - resuming")
// happy flow - we can return
// uh-oh, nothing was a success
if (supervisorJob.complete()) {
log("no successful regions - throwing exception & resuming")
cont.resumeWithException(Exception("no region response was successful"))
} ?: error("Timeout error - unable to get region")
all responses are successful
If all tasks are successful, then it takes the shortest amount of time to return
List(5) {
Region("attempt1-region$it", success = true)
log("result for all success: $regionSuccess, time $time")
got success for Region(name=attempt1-region1, success=true, delay=2s) - resuming
cancelling async job for Region(name=attempt1-region3, success=true, delay=2s)
result for all success: Region(name=attempt1-region1, success=true, delay=2s), time 2.131312600s
cancelling async job for Region(name=attempt1-region1, success=true, delay=2s)
all responses fail
When all responses fail, it should take only as long as the slowest response (5 seconds in the output below).
List(5) {
Region("attempt2-region$it", success = false)
log("failure: $allFailEx, time $time")
[DefaultDispatcher-worker-6 #all-fail#6] cancelling async job for Region(name=attempt2-region4, success=false, delay=1s)
[DefaultDispatcher-worker-4 #all-fail#4] cancelling async job for Region(name=attempt2-region2, success=false, delay=4s)
[DefaultDispatcher-worker-3 #all-fail#3] cancelling async job for Region(name=attempt2-region1, success=false, delay=4s)
[DefaultDispatcher-worker-6 #all-fail#5] cancelling async job for Region(name=attempt2-region3, success=false, delay=4s)
[DefaultDispatcher-worker-6 #all-fail#2] cancelling async job for Region(name=attempt2-region0, success=false, delay=5s)
[DefaultDispatcher-worker-6 #all-fail#1] no successful regions - throwing exception & resuming
[DefaultDispatcher-worker-6 #all-fail#1] failure: java.lang.Exception: no region response was successful, time 5.225431500s
all responses timeout
And if all responses take longer than the timeout (I reduced it to 10 seconds in my example), then an exception will be thrown.
List(5) {
Region("attempt3-region$it", false, 100.seconds)
log("timeout: $timeoutEx, time $time")
[kotlinx.coroutines.DefaultExecutor] timeout: java.lang.IllegalStateException: Timeout error - unable to get region, time 10.070052700s
Full demo code
import kotlin.coroutines.*
import kotlin.random.*
import kotlin.time.Duration.Companion.seconds
import kotlin.time.*
import kotlinx.coroutines.*
suspend fun main() {
System.getProperties().setProperty("kotlinx.coroutines.debug", "")
withContext(CoroutineName("all-success")) {
val (regionSuccess, time) = measureTimedValue {
List(5) {
Region("attempt1-region$it", true)
log("result for all success: $regionSuccess, time $time")
withContext(CoroutineName("all-fail")) {
val (allFailEx, time) = measureTimedValue {
try {
List(5) {
Region("attempt2-region$it", false)
} catch (exception: Exception) {
log("failure: $allFailEx, time $time")
withContext(CoroutineName("timeout")) {
val (timeoutEx, time) = measureTimedValue {
try {
List(5) {
Region("attempt3-region$it", false, 100.seconds)
} catch (exception: Exception) {
log("timeout: $timeoutEx, time $time")
suspend fun getClosestRegion(
regions: List<Region>
): Region = withTimeoutOrNull(10.seconds) {
val supervisorJob = SupervisorJob()
suspendCancellableCoroutine<Region> { cont ->
launch(supervisorJob) {
.map { region ->
async(supervisorJob) {
coroutineContext.job.invokeOnCompletion {
log("cancelling async job for $region")
val retrofitClient = buildRetrofitClient(region)
val response = retrofitClient.connect()
if (response.isSuccess && supervisorJob.complete()) {
log("got success for $region - resuming")
// uh-oh, nothing was a success
if (supervisorJob.complete()) {
log("no successful regions - throwing exception & resuming")
cont.resumeWithException(Exception("no region response was successful"))
} ?: error("Timeout error - unable to get region")
data class Region(
val name: String,
val success: Boolean,
val delay: Duration = Random(name.hashCode()).nextInt(1..5).seconds,
) {
val backendUrl = "http://localhost/$name"
fun buildRetrofitClient(region: Region) = RetrofitClient(region)
class RetrofitClient(private val region: Region) {
suspend fun connect(): ClientResponse {
return ClientResponse(region.backendUrl, region.success)
data class ClientResponse(
val url: String,
val isSuccess: Boolean,
fun log(msg: String) = println("[${Thread.currentThread().name}] $msg")


Kotlin coroutine does not run synchronously

In all the cases where I have used coroutines so far, they have executed their "lines" synchronously, so I have been able to use the result of a variable in the next line of code.
I have the ImageRepository class that calls the server, gets a list of images, and once obtained, creates a json with the images and related information.
class ImageRepository {
val API_IMAGES = "https://api.MY_API_IMAGES"
suspend fun fetch (activity: AppCompatActivity) {
activity.lifecycleScope.launch() {
val imagesResponse = withContext(Dispatchers.IO) {
if (imagesResponse != null) {
val jsonWithImagesAndInfo = composeJsonWithImagesAndInfo(imagesResponse)
} else {
// TODO Warning to user
Log.e(TAG, "Error: Get request returned no response")
...// All the rest of code
Well, the suspend function itself executes sequentially as expected: it first makes the call to the server in getRequest and, once there is a response, composes the JSON. So far, so good.
And this is the call to the "ImageRepository" suspension function from my main activity:
lifecycleScope.launch {
val result = withContext(Dispatchers.IO) { neoRepository.fetch(this@MainActivity) }
Log.i(TAG, "After suspend fun")
The problem is that, as soon as it is executed, it calls the suspension function and then displays the log, obviously empty. It doesn't wait for the suspension function to finish and then display the log.
Why? What am I doing wrong?
I have tried the different Dispatchers, etc, but without success.
I appreciate any help.
Thanks and best regards.
It’s because you are launching another coroutine in parallel from inside your suspend function. Instead of launching another coroutine there, call the contents of that launch directly in your suspend function.
A suspend function is just like a regular function, it executes one instruction after another. The only difference is that it can be suspended, meaning the runtime environment can decide to halt / suspend execution to do other work and then resume execution later.
This is true unless you start an asynchronous operation which you should not be doing. Your fetch operation should look like:
class ImageRepository {
suspend fun fetch () {
val imagesResponse = getRequest(API_IMAGES)
if (imagesResponse != null) {
val jsonWithImagesAndInfo = composeJsonWithImagesAndInfo(imagesResponse)
} else {
// TODO Warning to user
Log.e(TAG, "Error: Get request returned no response")
... // All the rest of code
-> just like a regular function. Of course you need to call it from a coroutine:
lifecycleScope.launch {
val result = withContext(Dispatchers.IO) { neoRepository.fetch() }
Log.i(TAG, "After suspend fun")
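The difference can be reproduced without Android. The sketch below is illustrative only: fetchFireAndForget, fetchSequential and demoOrdering are hypothetical names, and delay(100) stands in for the network call.

```kotlin
import kotlinx.coroutines.*

// Mirrors the original fetch(): the work is launched as a separate coroutine,
// so the function returns before the work is done.
suspend fun fetchFireAndForget(scope: CoroutineScope, events: MutableList<String>): Job =
    scope.launch {
        delay(100) // simulated network call
        events += "fetched"
    }

// Mirrors the corrected fetch(): the work runs inline, so the caller is
// suspended until it finishes.
suspend fun fetchSequential(events: MutableList<String>) {
    delay(100) // simulated network call
    events += "fetched"
}

fun demoOrdering(): Pair<List<String>, List<String>> = runBlocking {
    val launched = mutableListOf<String>()
    val job = fetchFireAndForget(this, launched)
    launched += "after call" // runs before the launched work completes
    job.join()

    val sequential = mutableListOf<String>()
    fetchSequential(sequential)
    sequential += "after call" // runs only after the fetch has finished

    launched.toList() to sequential.toList()
}
```

In the first list "after call" comes before "fetched", which is the behaviour described in the question; in the second list the order is reversed.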
Google recommends injecting the dispatcher into the lower-level classes, so ideally you'd do:
val neoRepository = ImageRepository(Dispatchers.IO)
lifecycleScope.launch {
val result = neoRepository.fetch()
Log.i(TAG, "After suspend fun")
class ImageRepository(private val dispatcher: CoroutineDispatcher) {
suspend fun fetch () = withContext(dispatcher) {
val imagesResponse = getRequest(API_IMAGES)
if (imagesResponse != null) {
val jsonWithImagesAndInfo = composeJsonWithImagesAndInfo(imagesResponse)
} else {
// TODO Warning to user
Log.e(TAG, "Error: Get request returned no response")
... // All the rest of code
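One practical benefit of injecting the dispatcher is that the caller, or a test, decides where the work runs. The sketch below is an assumption-laden illustration: GreetingRepository and demoInjection are hypothetical, and a named single-thread executor takes the place of Dispatchers.IO so the effect is visible.

```kotlin
import java.util.concurrent.Executors
import kotlinx.coroutines.*

// Hypothetical repository: it does not know which dispatcher it runs on.
class GreetingRepository(private val dispatcher: CoroutineDispatcher) {
    suspend fun fetch(): String = withContext(dispatcher) {
        // Report the thread so the demo can show where the work ran.
        "loaded on ${Thread.currentThread().name}"
    }
}

fun demoInjection(): String {
    val executor = Executors.newSingleThreadExecutor { r -> Thread(r, "repo-io") }
    val dispatcher = executor.asCoroutineDispatcher()
    return try {
        runBlocking { GreetingRepository(dispatcher).fetch() }
    } finally {
        dispatcher.close() // also shuts down the underlying executor
    }
}
```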

Async multiple download with retrofit get the data in response

Let's say that I have a model
data class PendingFile(val segment: Int, val fileHash: String, val url: String)
So when I have a list of pendingFiles, I want to download each file concurrently.
private suspend fun downloadLinks(pendingFiles: List<PendingFile>) {
scope.launch {
val deferredList = pendingFiles.map {
async(Dispatchers.IO) {
// runs in parallel in background thread
// Waiting all requests are finished without blocking the current thread
val listOfReturnData = deferredList.awaitAll()
val (success, failed) = listOfReturnData.partition {
// What should i put here??
if (failed.isNotEmpty()) {
// Back off to the half size
currentDownloadParts /= 2
if (success.isNotEmpty()) {
// Continue double size
currentDownloadParts *= 2
I want my success / failed results to be distinguished, and I also want the lists to contain the corresponding PendingFile models so that I know which ones succeeded and which ones failed. How can I do that?
You can improve the concurrent code by using coroutineScope.
To see what failed and what worked, you can use null as a fallback value in the list (or create a sealed class with Success/Failure values)
suspend fun downloadLinks(pendingFiles: List<PendingFile>) = coroutineScope {
val deferredList = pendingFiles.map {
async(Dispatchers.IO) {
// runs in parallel in background thread
try {
} catch (e: Exception) { // might wanna adjust this depending on your use case
null // null here means failure, alternately you could use a sealed class with success and failure
// Waiting all requests are finished without blocking the current thread
val listOfReturnData = deferredList.awaitAll()
val (success, failed) = listOfReturnData.partition {
it != null
TODO() // rest of your code
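If you also need to know which PendingFile each result belongs to, one option is to pair every file with its outcome before partitioning. The following is a sketch only: downloadAll, fakeDownload and the fetch parameter are hypothetical and stand in for the real Retrofit call.

```kotlin
import kotlinx.coroutines.*

data class PendingFile(val segment: Int, val fileHash: String, val url: String)

// Hypothetical download: fails when the URL is blank.
suspend fun fakeDownload(file: PendingFile): ByteArray {
    delay(10) // simulated network latency
    check(file.url.isNotBlank()) { "no url for segment ${file.segment}" }
    return ByteArray(file.segment)
}

// Returns (succeeded, failed) lists of the original PendingFile models.
suspend fun downloadAll(
    files: List<PendingFile>,
    fetch: suspend (PendingFile) -> ByteArray = ::fakeDownload,
): Pair<List<PendingFile>, List<PendingFile>> = coroutineScope {
    val outcomes = files.map { file ->
        async(Dispatchers.IO) {
            val ok = try {
                fetch(file)
                true
            } catch (e: CancellationException) {
                throw e // never swallow cancellation
            } catch (e: Exception) {
                false
            }
            file to ok // keep the model next to its outcome
        }
    }.awaitAll()
    val (succeeded, failed) = outcomes.partition { it.second }
    succeeded.map { it.first } to failed.map { it.first }
}
```

Because each async catches its own exception, one failed download does not cancel its siblings through the enclosing coroutineScope.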

Kotlin Flow, callback

I have created the following extension function :
fun <T> Flow<T>.handleErrors(showError: Boolean = false, retry: Boolean = false,
navigateBack: Boolean = true): Flow<T> =
catch { throwable ->
var message: String? = null
if (showError) {
when (throwable) {
is HttpException -> {
The extension function then posts the throwable type to a Base Activity and based on the event type posted a relevant dialog is displayed.
If the event is a retry, I would like to retry the failed flow.
For example, if the HTTP exception is a 400, I would like to retry the failed call when retry is selected in the dialog.
Is it possible to add callback to a Kotlin Flow, that has failed and can be called, from a different activity or fragment?
I don't think you want to retry in a separate block, you can organize your code like this
fun presentDialog(onClick: (Boolean) -> Unit) {
// some code to compose your dialog / install callbacks / show it
onClick(true) // user-click-retry
suspend fun main() {
val source = flow {
while (true) {
if (Math.random() > 0.5) Result.success(100) else Result.failure(IllegalArgumentException("woo"))
source.collect { result ->
suspendCancellableCoroutine<Unit> { cont ->
result.onSuccess {
println("we are done")
}.onFailure {
presentDialog { choice ->
if (choice) {
} else {
println("we are done")
Now some explanations.
As you already know, a flow is cold: if you don't collect it, it never produces anything. It follows that if your collect block does not return, the rest of the flow builder after emit is not executed.
You suspend the flow builder by calling suspendCancellableCoroutine in the collect block. If an HTTP error occurs, you show your dialog and resume execution according to the user's response; if no error occurs, or the user doesn't click retry, you leave everything alone. The sample code above is somewhat misleading: in the general case you don't offer a retry when everything goes fine, so the while (true) block could change to
flow {
do {
val response = response()
} while(response.isFailure)
}.collect {
it.onSuccess { println("of-coz-we-are-done") }.onFailure {
// suspend execution and show dialog
// resume execution when user click retry
This may reassure you that the flow has an end, but it is essentially the same thing.
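A different technique that gets a similar effect is the retryWhen operator, whose predicate is a suspending lambda and can therefore wait for the user's answer. This is a swapped-in alternative, not the code above. In this sketch askRetry, flakyCall and loadWithRetry are hypothetical, and the dialog is simulated by a list of pre-recorded answers.

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.flow.*

// Hypothetical stand-in for the dialog: suspends, then returns the next
// recorded answer (false when there are no answers left).
suspend fun askRetry(answers: Iterator<Boolean>): Boolean {
    delay(10) // simulated time the user takes to click
    return answers.hasNext() && answers.next()
}

// Hypothetical call that fails a fixed number of times before succeeding.
fun flakyCall(failuresBeforeSuccess: Int): Flow<Int> {
    var attempts = 0
    return flow {
        if (attempts++ < failuresBeforeSuccess) {
            throw IllegalStateException("HTTP 400 (simulated)")
        }
        emit(100)
    }
}

// Returns the value, or null if the user declined to retry.
suspend fun loadWithRetry(failures: Int, answers: List<Boolean>): Int? {
    val replies = answers.iterator()
    return flakyCall(failures)
        .retryWhen { cause, _ -> cause is IllegalStateException && askRetry(replies) }
        .catch { /* user declined: end the flow without a value */ }
        .firstOrNull()
}
```

Each retry re-runs the flow builder from the top, which matches "retry the failed call"; the upstream stays suspended while the predicate waits for the answer.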

Kotlin Coroutines - Suspend function returning a Flow runs forever

I am making a network repository that supports multiple data retrieval configs, therefore I want to separate those configs' logic into functions.
However, I have a config that fetches the data continuously at specified intervals. Everything is fine when I emit those values to the original Flow. But when I take the logic into another function and return another Flow through it, it stops caring about its coroutine scope. Even after the scope's cancelation, it keeps on fetching the data.
TLDR: Suspend function returning a flow runs forever when currentCoroutineContext is used to control its loop's termination.
What am I doing wrong here?
Here's the simplified version of my code:
The Fragment calls the ViewModel's function, which basically calls getData():
lifecycleScope.launch {
suspend fun getData(config: MyConfig): Flow<List<Data>>
return flow {
when (config)
//It worked fine when fetchContinuously was ingrained to here and emitted directly to the current flow
//And now it keeps on running eternally
fetchContinuously().collect { updatedList ->
//Note logic of this function is greatly reduced to keep the focus on the problem
private suspend fun fetchContinuously(): Flow<List<Data>>
return flow {
while (currentCoroutineContext().isActive)
val updatedList = fetchDataListOverNetwork().await()
if (updatedList != null)
Timber.i("Context is no longer active - terminating the continuous-fetch coroutine")
private suspend fun fetchDataListOverNetwork(): Deferred<List<Data>?> =
withContext(Dispatchers.IO) {
return@withContext async {
var list: List<Data>? = null
val response = apiService.getDataList().execute()
if (response.isSuccessful && response.body() != null)
list = response.body()!!.list
Timber.w("Failed to fetch data from the network database. Error body: ${response.errorBody()}, Response body: ${response.body()}")
catch (e: Exception)
Timber.w("Exception while trying to fetch data from the network database. Stacktrace: ${e.printStackTrace()}")
return@async list
list //IDE is not smart enough to realize we are already returning no matter what inside of the finally block; therefore, this needs to stay here
I am not sure whether this is a solution to your problem, but you do not need to have a suspending function that returns a Flow. The lambda you are passing is a suspending function itself:
fun <T> flow(block: suspend FlowCollector<T>.() -> Unit): Flow<T> (source)
Here is an example of a flow that repeats a (GraphQl) query (simplified - without type parameters) I am using:
override fun query(query: Query,
updateIntervalMillis: Long): Flow<Result<T>> {
return flow {
// this ensures at least one query
val result: Result<T> = execute(query)
while (coroutineContext[Job]?.isActive == true && updateIntervalMillis > 0) {
val otherResult: Result<T> = execute(query)
I'm not that good at Flow but I think the problem is that you are delaying only the getData() flow instead of delaying both of them.
Try adding this:
suspend fun getData(config: MyConfig): Flow<List<Data>>
return flow {
when (config)
fetchContinuously().collect { updatedList ->
Take note of the delay(refreshIntervalInMs).
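To illustrate the points made in the answers, here is a minimal sketch of a polling flow built from a plain (non-suspend) function, with a cancellable delay inside the loop. poll and demoPolling are hypothetical names, and a counter stands in for the network fetch; this is not a drop-in fix for the code in the question.

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.flow.*

// Cold flow that fetches repeatedly until its collector is cancelled.
fun <T> poll(intervalMs: Long, fetch: suspend () -> T): Flow<T> = flow {
    while (true) {
        emit(fetch())     // emit() inside flow { } checks for cancellation
        delay(intervalMs) // delay() is a cancellable suspension point
    }
}

// Returns the fetch count at cancellation and again 100 ms later.
fun demoPolling(): Pair<Int, Int> = runBlocking {
    var fetches = 0
    val job = launch { poll(20) { ++fetches }.collect { } }
    delay(110)
    job.cancelAndJoin()
    val atCancel = fetches
    delay(100) // give a leaked loop the chance to keep running
    atCancel to fetches
}
```

Both numbers are equal when the loop stops together with its collector, so there is no need to test isActive by hand in this shape.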

Testing RxJava repeatWhen with Mockk returnsMany

I'm trying to test multiple server responses with Mockk library. Something like I found in this answer for Mockito.
Here is my sample UseCase code, which repeats the call to load the system from a remote server every few seconds and stops running (onComplete is executed) when the remote system contains more users than the local one.
override fun execute(localSystem: System, delay: Long): Completable {
return cloudRepository.getSystem(
.repeatWhen { repeatHandler -> // Repeat every [delay] seconds
repeatHandler.delay(params.delay, TimeUnit.SECONDS)
.takeUntil { // Repeat until remote count of users is greater than local count
return@takeUntil it.users.count() > localSystem.users.count()
.ignoreElements() // Ignore onNext() calls and wait for onComplete()/onError() call
To test this behavior I'm mocking the cloudRepository.getSystem() method with the Mockk library:
fun testListeningEnds() {
every { getSystem(TEST_SYSTEM_ID) } returnsMany listOf(
Single.just(testSystemGetResponse), // return the same amount of users as local system has
Single.just(testSystemGetResponse), // return the same amount of users as local system has
Single.just( // return the greater amount of users as local system has
owners = listOf(
TEST_USER.copy(id = UUID.randomUUID().toString())
localSystem = TEST_SYSTEM,
delay = 3L
As you can see I'm using the returnsMany Answer which should return a different value on every call.
The main problem is that returnsMany returns the same first value every time and .takeUntil {} never succeeds, which means that onComplete() is never called for this Completable. How can I make returnsMany return a different value on each call?
This comes down to how .repeatWhen() works. You expect cloudRepository.getSystem(id) to be called every time a repetition is requested, but that is not the case. Each repetition resubscribes to the same instance of the mocked Single: the first Single.just(testSystemGetResponse) in your case.
How do you make sure getSystem() is called every time? Wrap your Single in Single.defer(). It's similar to Single.fromCallable(), but the lambdas differ in return type: the lambda passed to .defer() must return an Rx type (a Single in our case).
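The effect can be seen in isolation with a small counter. This sketch assumes RxJava 3 (package io.reactivex.rxjava3; with RxJava 2 the import would be io.reactivex.Single), uses repeat(3) instead of repeatWhen to keep it short, and CallCounter is a hypothetical stand-in for the repository.

```kotlin
import io.reactivex.rxjava3.core.Single

// Hypothetical source: every call to next() creates a new Single.
class CallCounter {
    var calls = 0
    fun next(): Single<Int> {
        calls++
        return Single.just(calls)
    }
}

// next() is called once; repeat() resubscribes to that one Single.
fun withoutDefer(): Pair<List<Int>, Int> {
    val counter = CallCounter()
    val values = counter.next().repeat(3).toList().blockingGet()
    return values to counter.calls
}

// defer() runs the lambda on every subscription, so next() is called each time.
fun withDefer(): Pair<List<Int>, Int> {
    val counter = CallCounter()
    val values = Single.defer { counter.next() }.repeat(3).toList().blockingGet()
    return values to counter.calls
}
```

Without defer the three repetitions all see the value of the first call; with defer each repetition triggers a new call, which is what lets returnsMany advance through its list.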
Final implementation (I have made a few changes to make it compile successfully):
data class User(val id: String)
data class System(val users: List<User>, val id: Long)
class CloudRepository {
fun getSystem(id: Long) = Single.just(System(mutableListOf(), id))
class SO63506574(
private val cloudRepository: CloudRepository
) {
fun execute(localSystem: System, delay: Long): Completable {
return Single.defer { cloudRepository.getSystem( } // <-- defer
.repeatWhen { repeatHandler ->
repeatHandler.delay(delay, TimeUnit.SECONDS)
.takeUntil {
return@takeUntil it.users.count() > localSystem.users.count()
And test (succeeds after ~8s):
class SO63506574Test {
fun testListeningEnds() {
val TEST_USER = User("UUID")
val TEST_SYSTEM = System(mutableListOf(), 10)
val repository = mockk<CloudRepository>()
val useCase = SO63506574(repository)
val testSystemGetResponse = System(mutableListOf(), 10)
every { repository.getSystem(10) } returnsMany listOf(
Single.just(testSystemGetResponse), // return the same amount of users as local system has
Single.just(testSystemGetResponse), // return the same amount of users as local system has
Single.just( // return the greater amount of users as local system has
users = listOf(
TEST_USER.copy(id = UUID.randomUUID().toString())
localSystem = TEST_SYSTEM,
delay = 3L

