Persist

persist starts the computation and keeps the results in memory. It differs from compute in what it returns: persist gives back a Dask object whose chunks are being computed, or have already been computed, and are held in memory, on the distributed cluster if one is in use. compute, on the other hand, returns a single concrete result such as a NumPy array or a pandas DataFrame, held entirely in the memory of the local process. For large data, that result may not fit in the memory of a single machine. If a cluster is available, persist can instead keep the results spread across the cluster's distributed memory.

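# y is any Dask collection (array, dataframe, bag, ...)

# returns an in-memory non-Dask object (e.g. a NumPy array or pandas DataFrame)
result = y.compute()

# returns an in-memory Dask object; the data lives in distributed
# memory if a cluster is available
result = y.persist()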
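
The difference in return types can be checked directly. The following is a minimal sketch, assuming Dask and NumPy are installed and no distributed client is running, so persist keeps the chunks in local memory. The array `x` and its shape are made up for illustration.

```python
import numpy as np
import dask.array as da

# Hypothetical input: a small chunked array standing in for large data.
x = da.ones((1000, 1000), chunks=(250, 250))
y = (x + 1).sum(axis=0)

computed = y.compute()   # a concrete NumPy array
persisted = y.persist()  # still a Dask array; its chunks are already computed

print(type(computed))
print(type(persisted))
```

Because `persisted` is still a Dask array, further lazy operations can be chained onto it before a final compute.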

One use case for persist is to persist large data after some preprocessing and then reuse the persisted data in later steps. Those steps run faster because the preprocessing is not repeated.

import dask.array as da

a = da.arange(10000, chunks=100)
processed = a % 2
processed = processed.persist()

# These are relatively fast because the preprocessed data is already in memory
processed.sum().compute()
processed.mean().compute()
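
One way to see why the later steps are cheaper is to compare the size of the task graph before and after persisting. This is only a rough sketch, assuming the default local scheduler. Counting keys through `__dask_graph__()` is an illustrative inspection trick, not a recommended benchmark, and the exact counts may vary between Dask versions.

```python
import dask.array as da

a = da.arange(10000, chunks=100)
lazy = a % 2                # nothing computed yet
persisted = lazy.persist()  # chunks computed and held in memory

# Number of entries in each task graph (illustrative only).
tasks_before = len(dict(lazy.__dask_graph__()))
tasks_after = len(dict(persisted.__dask_graph__()))
print(tasks_before, tasks_after)

# Later reductions start from the stored chunks instead of recomputing them.
total = int(persisted.sum().compute())
print(total)
```

After persisting, the graph only needs to reference the stored chunks, so each later reduction does not have to rebuild the array or reapply the modulo.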
