Persist
persist starts the computation and stores the results in memory. It differs from compute in what it returns: persist gives back a Dask object whose results are either still being computed or already computed and held in the memory of the distributed cluster. compute, on the other hand, returns a concrete NumPy array or pandas DataFrame, so the whole result has to fit in one place. For large data that may exceed the memory of a single machine. If a cluster is available, persist can instead keep the results spread across the cluster's distributed memory.
# returns an in-memory non-dask object
result = y.compute()
# returns an in-memory dask object that uses distributed storage
# if available
result = y.persist()
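The difference in return types can be checked directly. The following is a minimal sketch, assuming dask and numpy are installed and no distributed client is running; the array x and the expression y are made up for illustration. With only the default local scheduler, persist blocks until the work is done and keeps the chunks in local memory, whereas with a distributed client it would return immediately while the cluster works in the background.

```python
import dask.array as da
import numpy as np

# Hypothetical input: a 1000 x 1000 array of ones split into 16 chunks
x = da.ones((1000, 1000), chunks=(250, 250))
y = (x + 1).sum(axis=0)

# compute() collapses everything into one concrete NumPy array
computed = y.compute()

# persist() returns a Dask array whose chunks are already evaluated
persisted = y.persist()

print(type(computed))
print(type(persisted))
```

The persisted object can still be used like any other Dask array, so further lazy operations can be chained onto it.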
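One use case of persist is to persist the data after an expensive preprocessing step on a large dataset and then reuse the persisted data for further processing. Later computations are faster because they start from the in-memory results instead of repeating the preprocessing.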
import dask.array as da

a = da.arange(10000, chunks=100)
processed = a % 2
processed = processed.persist()
# These run faster because the data they need is already in memory
processed.sum().compute()
processed.mean().compute()
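One way to see why the later calls are cheaper is to compare the size of the task graph before and after persisting. This is a rough sketch, assuming the default local scheduler; the names lazy, tasks_before and tasks_after are introduced here for illustration, and counting graph keys is only a proxy for the amount of work, not a timing measurement.

```python
import dask.array as da

a = da.arange(10000, chunks=100)
lazy = a % 2

# Before persisting, the graph still holds the tasks that create
# the chunks of `a` and the tasks that apply `% 2` to each of them
tasks_before = len(dict(lazy.__dask_graph__()))

processed = lazy.persist()

# After persisting, the graph only maps each chunk key to its result
tasks_after = len(dict(processed.__dask_graph__()))

print(tasks_before, tasks_after)

# Both reductions start from the stored chunks instead of recomputing them
total = processed.sum().compute()
average = processed.mean().compute()
```

Without the persist step, each of the two reductions would rebuild and re-evaluate the preprocessing tasks on its own.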