author Petr Tesarik <ptesarik@suse.cz> 2019-08-20 15:24:13 +0200
committer Petr Tesarik <ptesarik@suse.cz> 2019-08-20 15:24:13 +0200
commit 62b9ee69b0ce30c825db8cb84d00466aaf7235eb
tree a7f896366ff25cfd96f6ea296a4e63afdd35369f
parent b706dfc8718287e633650bfd55dfdae37b63c445
parent 931e4bb8a8b41f60f5707bc992babc4bb781ec40
Merge branch 'users/mgorman/SLE15-SP1/for-next' into SLE15-SP1
Pull a mm fix from Mel Gorman
-rw-r--r-- patches.suse/mm-vmscan-do-not-special-case-slab-reclaim-when-watermarks-are-boosted.patch 292
-rw-r--r-- series.conf 1
2 files changed, 293 insertions, 0 deletions
diff --git a/patches.suse/mm-vmscan-do-not-special-case-slab-reclaim-when-watermarks-are-boosted.patch b/patches.suse/mm-vmscan-do-not-special-case-slab-reclaim-when-watermarks-are-boosted.patch
new file mode 100644
index 0000000000..aa7ff21f7f
--- /dev/null
+++ b/patches.suse/mm-vmscan-do-not-special-case-slab-reclaim-when-watermarks-are-boosted.patch
@@ -0,0 +1,292 @@
+From 496a61c34e3f964f5530688cd72895eed729bd23 Mon Sep 17 00:00:00 2001
+From: Mel Gorman <mgorman@suse.de>
+Date: Tue, 13 Aug 2019 15:37:57 -0700
+Subject: [PATCH] mm, vmscan: do not special-case slab reclaim when watermarks
+ are boosted
+
+References: git fixes (mm/vmscan)
+Patch-mainline: v5.3
+Git-commit: 28360f398778d7623a5ff8a8e90958c0d925e120
+
+Dave Chinner reported a problem pointing a finger at commit 1c30844d2dfe
+("mm: reclaim small amounts of memory when an external fragmentation
+event occurs").
+
+The report is extensive:
+
+ https://lore.kernel.org/linux-mm/20190807091858.2857-1-david@fromorbit.com/
+
+and it's worth recording the most relevant parts (colorful language and
+typos included).
+
+ When running a simple, steady state 4kB file creation test to
+ simulate extracting tarballs larger than memory full of small
+ files into the filesystem, I noticed that once memory fills up
+ the cache balance goes to hell.
+
+ The workload is creating one dirty cached inode for every dirty
+ page, both of which should require a single IO each to clean and
+ reclaim, and creation of inodes is throttled by the rate at which
+ dirty writeback runs at (via balance dirty pages). Hence the ingest
+ rate of new cached inodes and page cache pages is identical and
+ steady. As a result, memory reclaim should quickly find a steady
+ balance between page cache and inode caches.
+
+ The moment memory fills, the page cache is reclaimed at a much
+ faster rate than the inode cache, and evidence suggests that
+ the inode cache shrinker is not being called when large batches
+ of pages are being reclaimed. In roughly the same time period
+ that it takes to fill memory with 50% pages and 50% slab caches,
+ memory reclaim reduces the page cache down to just dirty pages
+ and slab caches fill the entirety of memory.
+
+ The LRU is largely full of dirty pages, and we're getting spikes
+ of random writeback from memory reclaim so it's all going to shit.
+ Behaviour never recovers, the page cache remains pinned at just
+ dirty pages, and nothing I could tune would make any difference.
+ vfs_cache_pressure makes no difference - I would set it so high
+ it should trim the entire inode caches in a single pass, yet it
+ didn't do anything. It was clear from tracing and live telemetry
+ that the shrinkers were pretty much not running except when
+ there was absolutely no memory free at all, and then they did
+ the minimum necessary to free memory to make progress.
+
+ So I went looking at the code, trying to find places where pages
+ got reclaimed and the shrinkers weren't called. There's only one
+ - kswapd doing boosted reclaim as per commit 1c30844d2dfe ("mm:
+ reclaim small amounts of memory when an external fragmentation
+ event occurs").
+
+The watermark boosting introduced by the commit is triggered in response
+to an allocation "fragmentation event". The boosting was not intended
+to target THP specifically and triggers even if THP is disabled.
+However, with Dave's perfectly reasonable workload, fragmentation events
+can be very common given the ratio of slab to page cache allocations so
+boosting remains active for long periods of time.
+
+As high-order allocations might use compaction, and compaction cannot
+move slab pages, the decision was made in the commit to special-case
+kswapd when watermarks are boosted -- kswapd avoids reclaiming slab as
+reclaiming slab does not directly help compaction.
+
+As Dave notes, this decision means that slab can be artificially
+protected for long periods of time, which upsets the balance between
+slab and page caches.
+
+Removing the special casing can still indirectly help avoid
+fragmentation by avoiding fragmentation-causing events due to slab
+allocation as pages from a slab pageblock will have some slab objects
+freed. Furthermore, with the special casing, reclaim behaviour is
+unpredictable as kswapd sometimes examines slab and sometimes does not
+in a manner that is tricky to tune or analyse.
+
+This patch removes the special casing. The downside is that this is not
+a universal performance win. Some benchmarks that depend on the
+residency of data when rereading metadata may see a regression when slab
+reclaim is restored to its original behaviour. Similarly, some
+benchmarks that only read-once or write-once may perform better when
+page reclaim is too aggressive. The primary upside is that the slab
+shrinker is less surprising (arguably more sane but that's a matter of
+opinion), behaves consistently regardless of the fragmentation state of
+the system and properly obeys VM sysctls.
+
+A fsmark benchmark configuration was constructed similar to what Dave
+reported and is codified by the mmtest configuration
+config-io-fsmark-small-file-stream. It was evaluated on a 1-socket
+machine to avoid dealing with NUMA-related issues and the timing of
+reclaim. The storage was an SSD Samsung Evo and a fresh trimmed XFS
+filesystem was used for the test data.
+
+This is not an exact replication of Dave's setup. The configuration
+scales its parameters depending on the memory size of the SUT to behave
+similarly across machines. The parameters mean the first sample
+reported by fs_mark is using 50% of RAM which will barely be throttled
+and look like a big outlier. Dave used fake NUMA to have multiple
+kswapd instances which I didn't replicate. Finally, the number of
+iterations differs from Dave's test as the target disk was not large
+enough. While not identical, it should be representative.
+
+  fsmark
+                                    5.3.0-rc3            5.3.0-rc3
+                                      vanilla        shrinker-v1r1
+  Min         1-files/sec   4444.80 (  0.00%)    4765.60 (  7.22%)
+  1st-qrtle   1-files/sec   5005.10 (  0.00%)    5091.70 (  1.73%)
+  2nd-qrtle   1-files/sec   4917.80 (  0.00%)    4855.60 ( -1.26%)
+  3rd-qrtle   1-files/sec   4667.40 (  0.00%)    4831.20 (  3.51%)
+  Max-1       1-files/sec  11421.50 (  0.00%)    9999.30 (-12.45%)
+  Max-5       1-files/sec  11421.50 (  0.00%)    9999.30 (-12.45%)
+  Max-10      1-files/sec  11421.50 (  0.00%)    9999.30 (-12.45%)
+  Max-90      1-files/sec   4649.60 (  0.00%)    4780.70 (  2.82%)
+  Max-95      1-files/sec   4491.00 (  0.00%)    4768.20 (  6.17%)
+  Max-99      1-files/sec   4491.00 (  0.00%)    4768.20 (  6.17%)
+  Max         1-files/sec  11421.50 (  0.00%)    9999.30 (-12.45%)
+  Hmean       1-files/sec   5004.75 (  0.00%)    5075.96 (  1.42%)
+  Stddev      1-files/sec   1778.70 (  0.00%)    1369.66 ( 23.00%)
+  CoeffVar    1-files/sec     33.70 (  0.00%)      26.05 ( 22.71%)
+  BHmean-99   1-files/sec   5053.72 (  0.00%)    5101.52 (  0.95%)
+  BHmean-95   1-files/sec   5053.72 (  0.00%)    5101.52 (  0.95%)
+  BHmean-90   1-files/sec   5107.05 (  0.00%)    5131.41 (  0.48%)
+  BHmean-75   1-files/sec   5208.45 (  0.00%)    5206.68 ( -0.03%)
+  BHmean-50   1-files/sec   5405.53 (  0.00%)    5381.62 ( -0.44%)
+  BHmean-25   1-files/sec   6179.75 (  0.00%)    6095.14 ( -1.37%)
+
+                     5.3.0-rc3     5.3.0-rc3
+                       vanilla shrinker-v1r1
+  Duration User         501.82        497.29
+  Duration System      4401.44       4424.08
+  Duration Elapsed     8124.76       8358.05
+
+This shows a slight skew in the max result, which represents a large
+outlier. The 1st, 2nd and 3rd quartiles are similar, indicating that
+the bulk of the results show little difference. Note that an earlier
+version of the fsmark configuration showed a regression but that
+included more samples taken while memory was still filling.
+
+Note that the elapsed time is higher. Part of this is that the
+configuration included time to delete all the test files when the test
+completes -- the test automation handles the possibility of testing
+fsmark with multiple thread counts. Without the patch, many of these
+objects would be memory resident which is part of what the patch is
+addressing.
+
+There are other important observations that justify the patch.
+
+1. With the vanilla kernel, the number of dirty pages in the system is
+ very low for much of the test. With this patch, the dirty page level
+ is generally kept at 10%, which matches vm.dirty_background_ratio and
+ is the normal, historically expected behaviour.
+
+2. With the vanilla kernel, the ratio of Slab/Pagecache is close to
+ 0.95 for much of the test i.e. Slab is being left alone and
+ dominating memory consumption. With the patch applied, the ratio
+ varies between 0.35 and 0.45 with the bulk of the measured ratios
+ roughly half way between those values. This is a different balance to
+ what Dave reported but it was at least consistent.
+
+3. Slabs are scanned throughout the entire test with the patch applied.
+ The vanilla kernel has periods with no scan activity and then
+ relatively massive spikes.
+
+4. Without the patch, kswapd scan rates are very variable. With the
+ patch, the scan rates remain quite steady.
+
+5. Overall vmstats are closer to normal expectations.
+
+                                   5.3.0-rc3      5.3.0-rc3
+                                     vanilla  shrinker-v1r1
+  Ops Direct pages scanned          99388.00      328410.00
+  Ops Kswapd pages scanned       45382917.00    33451026.00
+  Ops Kswapd pages reclaimed     30869570.00    25239655.00
+  Ops Direct pages reclaimed        74131.00        5830.00
+  Ops Kswapd efficiency %              68.02          75.45
+  Ops Kswapd velocity                5585.75        4002.25
+  Ops Page reclaim immediate      1179721.00      430927.00
+  Ops Slabs scanned              62367361.00    73581394.00
+  Ops Direct inode steals            2103.00        1002.00
+  Ops Kswapd inode steals          570180.00     5183206.00
+
+ o Vanilla kernel is hitting direct reclaim more frequently,
+ not very much in absolute terms but the fact the patch
+ reduces it is interesting
+ o "Page reclaim immediate" in the vanilla kernel indicates
+ dirty pages are being encountered at the tail of the LRU.
+ This is generally bad and means in this case that the LRU
+ is not long enough for dirty pages to be cleaned by the
+ background flush in time. This is much reduced by the
+ patch.
+ o With the patch, kswapd is reclaiming 10 times more slab
+ pages than with the vanilla kernel. This is indicative
+ of the watermark boosting over-protecting slab
+
+A more complete set of tests, which were part of the basis for
+introducing boosting, was run and while there are some differences,
+they are well within tolerances.
+
+Bottom line, special-casing kswapd to avoid slab reclaim makes behaviour
+unpredictable and can lead to abnormal results for normal workloads.
+
+This patch restores the expected behaviour that slab and page cache are
+balanced consistently for a workload with a steady allocation ratio of
+slab/pagecache pages. It also means that workloads that favour the
+preservation of slab over pagecache can be tuned via
+vm.vfs_cache_pressure, whereas the vanilla kernel effectively ignores
+the parameter when boosting is active.
+
+Link: http://lkml.kernel.org/r/20190808182946.GM2739@techsingularity.net
+Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
+Signed-off-by: Mel Gorman <mgorman@suse.de>
+Reviewed-by: Dave Chinner <dchinner@redhat.com>
+Acked-by: Vlastimil Babka <vbabka@suse.cz>
+Cc: Michal Hocko <mhocko@kernel.org>
+Cc: <stable@vger.kernel.org> [5.0+]
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
+Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
+Signed-off-by: Mel Gorman <mgorman@suse.de>
+---
+ mm/vmscan.c | 11 ++---------
+ 1 file changed, 2 insertions(+), 9 deletions(-)
+
+diff --git a/mm/vmscan.c b/mm/vmscan.c
+index e32e44c5137d..66934bc43654 100644
+--- a/mm/vmscan.c
++++ b/mm/vmscan.c
+@@ -97,9 +97,6 @@ struct scan_control {
+ /* Can pages be swapped as part of reclaim? */
+ unsigned int may_swap:1;
+
+- /* e.g. boosted watermark reclaim leaves slabs alone */
+- unsigned int may_shrinkslab:1;
+-
+ /*
+ * Cgroups are not reclaimed below their configured memory.low,
+ * unless we threaten to OOM. If any cgroups are skipped due to
+@@ -2557,7 +2554,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
+ shrink_node_memcg(pgdat, memcg, sc, &lru_pages);
+ node_lru_pages += lru_pages;
+
+- if (memcg && sc->may_shrinkslab)
++ if (memcg)
+ shrink_slab(sc->gfp_mask, pgdat->node_id,
+ memcg, sc->nr_scanned - scanned,
+ lru_pages);
+@@ -2588,7 +2585,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
+ * Shrink the slab caches in the same proportion that
+ * the eligible LRU pages were scanned.
+ */
+- if (global_reclaim(sc) && sc->may_shrinkslab)
++ if (global_reclaim(sc))
+ shrink_slab(sc->gfp_mask, pgdat->node_id, NULL,
+ sc->nr_scanned - nr_scanned,
+ node_lru_pages);
+@@ -2989,7 +2986,6 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
+ .may_writepage = !laptop_mode,
+ .may_unmap = 1,
+ .may_swap = 1,
+- .may_shrinkslab = 1,
+ };
+
+ /*
+@@ -3026,7 +3022,6 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
+ .may_unmap = 1,
+ .reclaim_idx = MAX_NR_ZONES - 1,
+ .may_swap = !noswap,
+- .may_shrinkslab = 1,
+ };
+ unsigned long lru_pages;
+
+@@ -3072,7 +3067,6 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
+ .may_writepage = !laptop_mode,
+ .may_unmap = 1,
+ .may_swap = may_swap,
+- .may_shrinkslab = 1,
+ };
+
+ /*
+@@ -3374,7 +3368,6 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
+ */
+ sc.may_writepage = !laptop_mode && !nr_boost_reclaim;
+ sc.may_swap = !nr_boost_reclaim;
+- sc.may_shrinkslab = !nr_boost_reclaim;
+
+ /*
+ * Do some background aging of the anon list, to give
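Reviewer note: the mm/vmscan.c hunks above all remove one gate, so a small userspace model may help when reasoning about the behavioural change. This is only a sketch, not kernel code: the function names below are hypothetical, and the model keeps just two inputs from the patch, the result of global_reclaim(sc) and the value of nr_boost_reclaim that balance_pgdat() used to copy into sc.may_shrinkslab.

```c
#include <stdbool.h>

/*
 * Hypothetical userspace model of the slab gate in shrink_node().
 * It is not kernel code; it only mirrors the two conditions the patch edits.
 */

/* Before the patch: balance_pgdat() set sc.may_shrinkslab = !nr_boost_reclaim,
 * and shrink_node() required that bit before calling shrink_slab(). */
static bool slab_shrunk_before_patch(bool is_global_reclaim,
                                     unsigned long nr_boost_reclaim)
{
    bool may_shrinkslab = !nr_boost_reclaim;

    return is_global_reclaim && may_shrinkslab;
}

/* After the patch: the may_shrinkslab bit is gone, so the boost state
 * no longer influences whether shrink_slab() is called. */
static bool slab_shrunk_after_patch(bool is_global_reclaim,
                                    unsigned long nr_boost_reclaim)
{
    (void)nr_boost_reclaim; /* intentionally ignored */

    return is_global_reclaim;
}
```

With any non-zero boost value the first function returns false and the second returns true, which corresponds to the commit message's observation that slab was left alone for as long as boosting stayed active.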
diff --git a/series.conf b/series.conf
index f6b9504939..0b66325677 100644
--- a/series.conf
+++ b/series.conf
@@ -48682,6 +48682,7 @@
patches.drivers/usb-iowarrior-fix-deadlock-on-disconnect.patch
patches.drivers/iio-adc-max9611-Fix-misuse-of-GENMASK-macro.patch
patches.fixes/driver_core-Fix_use-after-free_and_double_free_on_glue.patch
+ patches.suse/mm-vmscan-do-not-special-case-slab-reclaim-when-watermarks-are-boosted.patch
patches.drivers/iommu-dma-handle-sg-length-overflow-better
patches.drivers/ALSA-hda-Apply-workaround-for-another-AMD-chip-1022-.patch
patches.drivers/ALSA-hda-Fix-a-memory-leak-bug.patch
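
Reviewer note: the derived rows of the vmstat table in the patch description can be cross-checked from its raw counters. The sketch below assumes that "Kswapd efficiency %" is pages reclaimed divided by pages scanned, and that "Kswapd velocity" is pages scanned divided by the elapsed duration in seconds. Those definitions are an assumption on my part and are not stated in the commit message, although they do reproduce the reported figures. The helper names are hypothetical.

```c
/*
 * Hypothetical helpers for cross-checking the derived vmstat rows.
 * The formulas are assumptions that happen to match the reported values.
 */

/* Percentage of scanned pages that were actually reclaimed. */
static double reclaim_efficiency_pct(double pages_reclaimed, double pages_scanned)
{
    if (pages_scanned <= 0.0)
        return 0.0; /* avoid dividing by zero when nothing was scanned */
    return 100.0 * pages_reclaimed / pages_scanned;
}

/* Pages scanned per second of elapsed test time. */
static double scan_velocity(double pages_scanned, double elapsed_seconds)
{
    if (elapsed_seconds <= 0.0)
        return 0.0;
    return pages_scanned / elapsed_seconds;
}

/* How many times larger one counter is than another. */
static double counter_ratio(double patched, double vanilla)
{
    if (vanilla <= 0.0)
        return 0.0;
    return patched / vanilla;
}
```

For the vanilla column, reclaim_efficiency_pct(30869570.0, 45382917.0) comes out at roughly 68.02 and scan_velocity(45382917.0, 8124.76) at roughly 5585.75. counter_ratio(5183206.0, 570180.0) on the kswapd inode steals gives about 9.1, in line with the "10 times more" remark in the description.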