Reading notes: The Complete Guide to Rails Performance

Motivation

A good book, but I read it mainly for profiling and GC, so I skipped some parts

Skipped: webfonts, CDN, SSL, HTTP cache, Rails cache, backgrounding work

principle

Benford’s Law

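The larger the leading digit (or digits), the less often numbers starting with it occur. This can be used to check whether a data set has been falsified

Zipf’s law

A word's frequency is inversely proportional to its rank in the frequency table: only a small number of words are used often

Zipf’s law is a discrete distribution; its continuous counterpart is Pareto’s law (80/20)

Pareto’s law

80% of the wealth is held by 20% of the people

Pareto’s law tells us to spend our time finding the 20% that determines the 80%

That is why a large part of the chapters below is about how to measure and what to measure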

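Little's law

$l = \lambda w$, where $l$ is the number of servers, $\lambda$ is the number of requests per unit time, and $w$ is a server's average response time

The formula holds in the long run and is mainly used to check whether the app is over-scaled or under-scaled (too many or too few servers compared with the computed number). Little's law also assumes that requests are independent of each other, that servers are independent of each other (no blocking one another on I/O, no contention for limited CPU or memory), and that response times do not stray far from the average (which is why 95th percentile response times matter)

It also says that scaling does not change the average response time. What it does change is

throughput
the time users spend waiting in the queue
- if on average fewer than one request is waiting in the queue, or the time spent in the queue is very short
  - scaling has little visible effect
  - the servers are not working at full capacity (100%)
- so scale before reaching this point
  - all the more so if the average response time is slow
    - but watch for diminishing returns (Amdahl’s law)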
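
As a rough worked example of the formula above, the sketch below computes $l$ from an assumed arrival rate and average response time. The traffic numbers and the helper name `servers_needed` are made up for illustration, and "server" here means a worker that handles one request at a time.

```ruby
# Little's law: l = lambda * w
# lambda: requests per second, w: average response time in seconds.
def servers_needed(requests_per_second, avg_response_seconds)
  requests_per_second * avg_response_seconds
end

# Assumed traffic: 120 requests/second at 250ms average response time.
busy = servers_needed(120, 0.25)
puts "average requests in flight: #{busy}"
puts "workers to provision (rounded up): #{busy.ceil}"
```

The result is a long-run average, so a real deployment would want some headroom on top of it for the long tail of slow responses.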

The Performance Workflow

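Check whether any metric looks off
Profile to find the culprit: where is most of the time being spent
Benchmark a small scope, measuring time or resource usage
- benchmark results in the development environment will differ from production
  - e.g. 500ms in production vs 1000ms locally
    - Generally a factor of 3 is acceptable
- profile results are not always accurate
  - removing a method that accounts for 50% does not necessarily make the result 50% better
Benchmark and profile the whole thing to confirm the change was the right one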

profile

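profile: measures how much of each resource every part of the code uses

profile mode

The measured time differs depending on the mode

CPU - clock counter
- counts clock cycles
  - “Amount of clock cycles” / “CPU frequency”
  - but modern CPUs use frequency stepping
    - the clock frequency is raised under heavy load
- system-wide
  - the time stamp counter is used to compute time instead
  - so other work on the machine also affects the current profile
- recommendation
  - Use CPU time when you’re interested in seeing the profile without I/O
Wall time
- simply the end time minus the start time
  - "wall" as in the clock on the wall
- things that affect the current profile
  - other processes using the same resources
  - Network or I/O
- recommendation
  - Despite its flaws, wall time is usually the mode you’ll want to use
Process time
- measures only the time spent by the current process
  - forked processes are not included
- recommendation
  - process time, if available, is usually a better choice over CPU time.
  - If you have code that spawns subprocesses, you may need to stick with CPU time (or wall time).
- some profilers treat CPU time as what is called process time here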
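
To make the difference between wall time and CPU time concrete, here is a minimal sketch using Ruby's own `Process.clock_gettime` rather than a profiler. It assumes a platform where `Process::CLOCK_PROCESS_CPUTIME_ID` is available (Linux and macOS); the helper name `measure` is hypothetical, and these clock IDs are not the ruby-prof measure modes themselves.

```ruby
# Measure the same block with a wall clock and with the process CPU clock.
def measure
  wall_start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  cpu_start  = Process.clock_gettime(Process::CLOCK_PROCESS_CPUTIME_ID)
  yield
  {
    wall: Process.clock_gettime(Process::CLOCK_MONOTONIC) - wall_start,
    cpu:  Process.clock_gettime(Process::CLOCK_PROCESS_CPUTIME_ID) - cpu_start
  }
end

# sleep stands in for I/O: the wall clock advances, almost no CPU is used.
io_like = measure { sleep 0.2 }
puts format('wall=%.3fs cpu=%.3fs', io_like[:wall], io_like[:cpu])
```

A block that waits on the network or the database behaves like the `sleep` here, which is why a CPU-time profile hides I/O while a wall-time profile includes it.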

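Then there is the measurement method. Profilers compute each method's share from the time it spends on the stack

Tracing
- records every invocation
  - very accurate
  - very expensive
Sampling
- looks at the stack at fixed intervals and records the shares
  - needs enough samples to be accurate
- low overhead, so it can be used to profile in the production environment

ruby: Ruby-Prof

Ruby-Prof hooks directly into MRI (tracing), so once it is running the program is 2 to 3 times slower than usual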

require 'ruby-prof'
SORTED_ARRAY = Array.new(10_000) { rand(100_000) }.sort!
array_size = SORTED_ARRAY.size
RubyProf.measure_mode = RubyProf::CPU_TIME
result = RubyProf.profile do
    1_000_000.times { bsearch2(SORTED_ARRAY, rand(array_size)) }
end
printer = RubyProf::FlatPrinter.new(result)
printer.print(STDOUT)

%self total self wait child calls name
17.22 19.117 12.777 0.000 6.340 13182869 Fixnum#==
8.54 6.340 6.340 0.000 0.000 13182869 BasicObject#==
5.73 71.918 4.252 0.000 67.666 14182869 *Object#bsearch2
2.14 74.196 1.590 0.000 72.606 1 Integer#times
0.93 0.688 0.688 0.000 0.000 1000000 Kernel#rand
0.68 0.508 0.508 0.000 0.000 1000000 Array#count
0.00 74.196 0.000 0.000 74.196 1 Global#[No method]

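%self: share of the time spent in this method itself
total: time spent in this method plus its children
self: time spent in this method itself
child: total - self
calls: number of times the method was called

Start with the highest %self!!

ruby: Stackprof

A sampling profiler, and the backend of rack-mini-profiler

Not usually used during development, because Ruby-Prof is more accurate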

rack: rack-mini-profiler

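Beyond ordinary profiling it can also show

SQL queries
server response time
method flamegraphs
memory leaks (gc)

rack-mini-profiler was also designed for production use from the start!!

Remember to switch to production mode when profiling speed, because development mode enables many developer conveniences that slow things down

Once installed, start the site and a badge appears on the page

From there you can ask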

How many SQL queries am I generating?
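- generally, each ORM class should produce only one SQL query
- a simple page usually needs 1~3 SQL queries
What’s my total request time?
- ideally under 50ms
What % of time am I spending in SQL?
- best measured against the production DB
  - production usually holds far more data than development
How long until DOMContentLoaded fires?
- there is a gap between receiving the response and showing it on screen
  - that is left to front-end optimization
Are any of the parts of the page taking up an extreme amount of time compared to others?

Click an item that has SQL to see the partial render time, the SQL time, and the time spent in that item (on the left)

What about the difference between them? That is time spent in code; for details look at the flamegraph

When a partial issues SQL, you can try

removing it entirely
caching the partial
using includes to load more in one go

There are also profile-gc and profile-memory for looking at memory and GC!! profile-gc is essentially GC.stat and shows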

New bytes allocated outside of Ruby heaps
- pay special attention once this goes over 10MB
ObjectSpace delta caused by request
- how many objects, and of which types, were added by the request

rack-mini-profiler uses profile-memory to show

allocated memory by gem
allocated memory by file
allocated objects by gem

memory profile

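Because of the extra VM layer, memory profiling is a hassle. The tools below are all based on MRI

ObjectSpace and objspace.so

ObjectSpace is where all objects live. It is tightly coupled to MRI, so do not use it in production

Here are some of its interesting features

For example ObjectSpace.count_objects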

irb(main):001:0> ObjectSpace.count_objects
=> {:TOTAL=>53802, :FREE=>31, :T_OBJECT=>3373,
:T_CLASS=>888, :T_MODULE=>30, :T_FLOAT=>4,
:T_STRING=>36497, :T_REGEXP=>164, :T_ARRAY=>9399,
:T_HASH=>789, :T_STRUCT=>2, :T_BIGNUM=>2, :T_FILE=>7,
:T_DATA=>1443, :T_MATCH=>85, :T_COMPLEX=>1,
:T_NODE=>1050, :T_ICLASS=>37}

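Most entries are self-explanatory, but a few look odd

T_NODE: AST
T_DATA: the interpreter's own internal data

Here it is enough to look at the basic types we already know

Combined with switching GC off and on, this gives a simple benchmark of how many objects a piece of code creates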

def allocate_count
    GC.disable
    before = ObjectSpace.count_objects
    yield
    after = ObjectSpace.count_objects
    after.each { |k,v| after[k] = v - before[k] }
    after[:T_HASH] -= 1 # probe effect - we created the before hash.
    after[:FREE] += 1 # same
    GC.enable
    after.reject { |k,v| v == 0 }
end

allocate_count { 100.times { 'hello' + 'hi' }}

You can also see how many live objects there are right now

puts ObjectSpace.each_object.count
puts ObjectSpace.each_object(Numeric).count
puts ObjectSpace.each_object(Complex).count
ObjectSpace.each_object(Complex) { |c| puts c }

See how much memory each type uses in total

irb(main):057:0> ObjectSpace.count_objects_size
{
:T_OBJECT => 198560,
:T_CLASS => 614784,
:T_MODULE => 66712,
:T_FLOAT => 160,
:T_STRING => 1578522,
:T_REGEXP => 122875,
:T_ARRAY => 630976,
:T_HASH => 165672,
:T_STRUCT => 160
...

See how much memory a single object takes

irb(main):062:0> ObjectSpace.memsize_of("The quick brown fox jumps over the lazy dog")
40 # NOT ACCURATE
irb(main):063:0> ObjectSpace.memsize_of("The quick brown fox")
40
irb(main):064:0> ObjectSpace.memsize_of([])
40
irb(main):065:0>ObjectSpace.memsize_of(Array.new(10_000) { :a })
80040

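Why 40? An RVALUE in the Ruby VM is 40 bytes

When to use:

Experimenting in all sorts of ways to build up GC knowledge
Checking live objects with ObjectSpace.each_object
When off-the-shelf profilers are not enough, this can be hacked into a profiler of your own

GC::Profiler

Ruby's GC is a generational garbage collector

Objects are classified by how many GCs they have survived
- surviving once makes an object old (Ruby 2.1; from Ruby 2.2 on an object must survive three GCs)
  - watch old_objects (GC.stat) to see whether a memory leak is happening
    - if this value creeps up steadily, there is one
Minor GCs
- only handle new objects
- the core idea is that few objects live long

GC.count is the total number of GC runs since the program started, both major and minor. GC.stat exposes all kinds of counters, covering memory itself as well as GC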
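
As a small illustration of these counters, the sketch below snapshots them before and after a forced GC. The helper name `gc_snapshot` is hypothetical; the `GC.stat` keys used are the ones MRI reports, and other Ruby implementations may not provide them.

```ruby
# Take a small snapshot of the counters discussed above.
def gc_snapshot
  stat = GC.stat
  {
    count:       GC.count,
    minor:       stat[:minor_gc_count],
    major:       stat[:major_gc_count],
    old_objects: stat[:old_objects]
  }
end

before = gc_snapshot
GC.start # a full (major) GC by default
after = gc_snapshot

after.each { |key, value| puts "#{key}: #{before[key]} -> #{value}" }
```

Logging a snapshot like this periodically in a long-running process is one way to watch whether `old_objects` keeps climbing.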

Let's run GC::Profiler

GC::Profiler.enable
require 'set'
GC.start
GC::Profiler.report
GC::Profiler.disable

GC 133 invokes.
Index Invoke    Time(sec)    Use Size(byte)    Total Size(byte)    Total Object    GC Time(ms)
1    1.966    801240    6315840    157896    2.33700000000003349498

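The invokes value matches GC.count, i.e. the total number of GC runs since the program started

When to use: if GC takes too long, GC and GC::Profiler are a good place to start

derailed_benchmarks

This one looks at memory usage and can track down memory bloat

For example, bundle exec derailed bundle:mem lists how much memory each gem uses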

delayed_job: 18.9805 MiB (Also required by:
delayed/railtie, delayed_job_active_record)
delayed/performable_mailer: 17.8633 MiB
mail: 17.8555 MiB (Also required by: TOP)
mime/types: 12.9492 MiB (Also required by:
/Users/nateberkopec/.gem/ruby/2.3.0/gems/rest-client1.8.0/lib/restclient/request,
/Users/nateberkopec/.gem/ruby/2.3.0/gems/rest-client1.8.0/lib/restclient/payload)
mail/field: 2.0039 MiB
mail/message: 0.8477 MiB
delayed/worker: 0.6055 MiB
rails/all: 15.8125 MiB
rails: 7.5352 MiB (Also required by:
active_record/railtie, active_model/railtie, and 10
others)
rails/application: 5.3867 MiB
[… continues on and on]

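Run the app many times and watch how much memory it ends up using: derailed exec perf:mem_over_time. If it keeps rising, there is a memory leak

See where objects are actually created: derailed exec perf:objects. This helps track down which call uses too much memory

When to use:

bundle:mem to check each gem's memory footprint and reduce memory bloat
tracing memory leaks

memory_profiler

memory_profiler is in fact the backend of derailed_benchmarks

memory_profiler can profile just one section of code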

require 'memory_profiler'
report = MemoryProfiler.report do
    # run your code here
end
report.pretty_print

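The report then contains

Retained memory
- objects allocated during the profiled block that are still alive after it finishes
Allocated memory
- objects allocated while the profiler was running
  - high allocated memory means GC runs more often and the program gets slower

Also, the usage memory_profiler reports will be lower than what ps shows, because Ruby has memory fragmentation

memory_profiler can also profile the memory of C extensions

When to use

tracking down memory issues in non-Rack apps and background jobs
for Rack apps, use derailed and rack-mini-profiler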

front-end: chrome timeline

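For the end user, server response time (100ms~300ms) is not the main factor: it is only a small share (about 10%) of a total load time of roughly 1~3 seconds

gzipped size matters for how long the download takes
unzipped size matters for how long parsing and construction take
New Relic's real user monitoring (RUM) gives roughly the page load time the end user experiences
Chrome Timeline shows what actually happens at each step
- Chrome Timeline also records events from extensions!! (remember to turn other extensions off)

The whole flow (from the browser's point of view)

DNS/TCP/SSL setup
download html
parsing html. When the parser hits another resource it stops and waits until that resource is loaded and run before it continues
- css does not block the parser
- js with async or defer does not block the parser
For the non-blocking resources just mentioned, the preloader does a quick scan and preloads them!!

So we should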

Don’t stop the parser.
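- the preloader scans ahead of the parser for things that can be downloaded early
  - head, script, …
- DOM generated dynamically by js (script) is invisible to the preloader!!
Get out of the browser preloader’s way.
- the preloader does not handle
  - iframe
  - webfont
  - HTML5 audio/video
  - css @import
Use HTTP caching - but not too much.
- cache what is used often and bundle it yourself (jquery…)
  - if you bet on the user already having a big vendor's copy (from Google's CDN and the like)
    - they have it: fine
    - they do not: the whole parse is blocked!!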
Use the Resource Hint API.
- DNS Prefetch
- Preconnect
- Prefetch
- Prerender

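The whole flow (from the page's point of view)

Send the request, wait for the response
- this does not show up in the timeline; the blank stretch at the front is this period
- it includes
  - service response time (about 10ms)
  - network latency (about 10ms~300ms, depending on whether it crosses borders)
    - even at the speed of light, Singapore to the US takes 70ms!!
Receive Response
- fires as soon as any byte is received
- this is the download; once it completes there is a Finish Loading
  - so many of these events show up later on
Parse HTML
- turns the html into the DOM
- downloads the resources it needs
  - <script src="/assets/application.js" async="async" ... />
    - this one has async, so it does not block the parse
    - without it, the parse waits for the download before continuing
  - CSS does not block
- runs the JS it contains (the corresponding script events appear)
  - async runs as soon as the download finishes (interrupting the parse)
  - defer runs after the download, once the parse has finished
Recalculate Styles
- parses the CSS and builds the CSSOM for the DOM
- what if the css has not finished loading?
  - the browser's defaults are used first
- if a lot of time is spent here, the css is too complex
Layout
- walks the DOM and computes
  - visibility
  - applicable CSS styles
  - relative geometry (width)
- complex CSS and HTML make this event take longer
- layout thrashing (reflow)
  - Any time you change the geometry of an element (its height, width, whatever), you trigger a layout event
  - browsers generally cannot tell which parts need to be recomputed
    - so most of the time everything is recomputed (reflow)
  - usually happens when
    - js is manipulating the DOM
    - there are too many stylesheets
  - What forces layout / reflow
DomContentLoaded
- the html, the css, and the js not marked async (the whole html) have all run (in order)
  - but other resources have not finished loading
    - when they all have, load is triggered
  - try it in practice
- but at this point nothing is on screen yet
Paint
- paints the CSSOM onto the screen!!

After that, further CSS and JS may generate the corresponding events again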

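How to profile with the timeline

Hard reload (ctrl-shift-r) and load the Timeline with fresh data
Look at the pie graph for the entire page load
- Reduce Idle
  - slow server responses
  - asset requests
- Reduce Loading
  - the HTML/CSS is too big
- Reduce Scripting
  - usually spent downloading other scripts
    - use async or defer
  - or profile the js
- Reduce Rendering and Painting
  - this relates to css optimization

Why bundle everything into one file? HTML, TCP and latency are the problems, not bandwidth. Compared with rendering or execution, network latency weighs heavily

Between a 1MB page with everything inlined and a 1MB page with 100 external requests, the inlined one is always fastest; the downloads alone eat up the time. What matters is when the page paints, not when it is loaded

The times that matter most to the end user

First paint: only boxes are visible at this point, but it still matters because it shapes perceived speed
First paint of text content
The load event

Encoding

http header
meta tag
- must come first, otherwise the parse starts over from the beginning
the browser guesses

Viewports: put it first, otherwise later css makes the browser reflow

css first: if the head contains js without async, putting that js before the css blocks the css download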

3rd-party: New Relic

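Why development and production environments differ

Application settings
- code reloading
Caching behavior
Differences in data
- production always has more data (use includes)
Network latency
- rough numbers
  - within the same city: 10ms
  - between two states: 20ms
  - US east coast to US west coast: 100ms
  - to the other side of the world: up to 300ms
  - on a mobile network, possibly 4 times more
JavaScript and devices
- the same js code on different devices
  - PC
  - mobile: much more painful to run
System configuration and resources
- the same container can run on different hardware
- the program may be built with a different compiler or compile flags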
Virtualization
- negatively and unpredictably impact performance when one virtualized server is hogging up the resources available

New Relic: profile in production env

Transactions: responses
Real-User Monitoring (also RUM and Browser monitoring):

inserts JS into every page to measure time
- NavigationTimingAPI
- Events set include domContentLoaded, domComplete, requestStart and responseEnd

process time

The web transaction response time graph

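The default window is 30 mins. The longer the window the better, ideally a month; New Relic goes up to 7 days at most, which is still enough

Back end performance only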

App server avg response time	Status
< 100ms	Fast!
< 300ms	Average
> 300ms	Slow!

For a JSON API server, halve these times

Back end plus front end performance

Browser avg load time	Status
< 3 sec	Fast!
< 6 sec	Average
> 6 sec	Slow!

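Next, I’m considering the shape of the response time graph. The point is where the time goes in each part

Normally most of the time should be spent in ruby
If database, web external, or other processes take more, something is wrong
- web external means someone is waiting on an external API
- request queueing means more servers are needed

Here, look at which requests (transactions) stand out the most

Look at the far left (95th Percentiles) and optimize those, but also remember to check the far right and ask why they are so fast: Are they asset requests? Redirects? Errors?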

Requests per minute	Scale
< 10	Should only have 1 server/Heroku dyno.
10 - 1000	Average
> 1000	High. “Just add more servers” may not work anymore.

Above 1000, start thinking about how to handle databases or cache stores, and about bringing in devops

Transactions

If requests per minute is on the high end of the scale, sort by most time consuming (80% of the time is spent in 20% of the controllers)

If requests per minute is low, sort by slowest average response time

Because turning a 100ms response into 10ms barely changes the user experience (so watch for requests over 500ms)

database

Use most time consuming to see whether any query takes too long

Common symptoms

Lots of time in #find
- Pay attention to the “time consumption by caller” graph
  - where is this query being called the most?
    - Go check out those controllers and see
      - the column in the where clause has no index
      - N+1 query
SQL - OTHER
- Rails periodically issues queries
  - ignore them

External Services

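Most Ruby applications will block on network requests. Rails is typically blocked by requests to external APIs. Depending on the timeout, this can delay loading by roughly 200ms~500ms, and at the 95th percentile by up to 20 seconds

One fix is to run the call in a background worker and put the result in a cache. Another is to set up a Circuit Breaker: once requests keep timing out, later requests fail immediately

GC stats and Reports

Inaccurate, forget about it

Browser / Real user monitoring (RUM)

Break it down by “Browser page load time”, then look at each component's average load time

Request queueing
- usually 10-20ms at most
Web application
- this is your app, but note how small its share of the time is
Network
- usually longer than response + queueing
- this counts both directions
DOM Processing
- takes a lot of time, more than Web application + Request queueing
- measured from load finish to DOMContentReady
- at this point only the html has been parsed
  - other CSS and JS still follow
- the screen is still blank at this point
Page Rendering
- measured from DOMContentReady to load
  - DOMContentReady is $(document).ready
  - load fires only when every resource is ready
- before load, the browser may already show something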

Turbolinks & “HTML-over-the-wire”

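HTML-over-the-wire and SPA differ in that one sends HTML and the other sends data

A typical rails app (about 1 second)

return a response in 100-300ms,
spend about 200ms loading the HTML and CSSOM, a few hundred more ms rendering and painting
then likely loads of JS scripting tied to the onload event.

Turbolinks can cut 200-700ms from the above

The cost

js cannot be written the usual way
- idempotent function
- you cannot keep attaching hooks to ready
  - Turbolinks has taken it over!! (load has been taken over too)
    - so other events have to be used
It cannot coexist with other client side JS frameworks
- load has been taken over
Integration testing becomes painful
Basically useless on mobile
No offline support (an SPA can have it)

Common mistakes

Make sure the page is actually going through Turbolinks
- open the console and look for Navigated to http://www.whatever.com/foo
  - if it is there, Turbolinks is not in effect
Modifying the page by appending to the dom
- Turbolinks returns the whole html, so you should
  - produce the data in the controller, pass it to erb, and generate the html from there
  - rather than keep injecting it with js

About response time

0.1 second: very fast
1 second: acceptable, though some people may find it slow
10 seconds: the limit of what people will tolerate; feedback is needed so the user knows how far along things are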

benchmark

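benchmark: measures how much time or how many resources code uses

A piece of code may benchmark well, but if its share of the total is small there is no need to change it. It may also be fast in an isolated benchmark yet have no effect on the whole, or even drag the whole down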

require 'benchmark/ips'
Benchmark.ips do |x|
    SORTED_ARRAY = Array.new(10_000) { rand(100_000) }.sort!
    array_size = SORTED_ARRAY.size
    # Typical mode, runs the block as many times as it can
    x.report("bsearch1") { bsearch1(SORTED_ARRAY, rand(array_size))}
    x.report("bsearch2") { bsearch2(SORTED_ARRAY, rand(array_size))}
    x.compare!
end
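
The bsearch1 and bsearch2 methods benchmarked above are not shown in these notes. As a self-contained sketch of the same idea, the block below times two hypothetical implementations (`linear_include?` and `binary_include?`) with the standard library's `Benchmark.realtime` instead of benchmark-ips; the array size and number of lookups are arbitrary assumptions.

```ruby
require 'benchmark'

SORTED = Array.new(10_000) { |i| i * 2 } # sorted even numbers

# Two hypothetical implementations answering "is target in the array?".
def linear_include?(array, target)
  array.include?(target)
end

def binary_include?(array, target)
  # find-minimum mode: first element >= target, or nil if there is none
  array.bsearch { |x| x >= target } == target
end

targets = Array.new(2_000) { rand(20_000) }

linear = Benchmark.realtime { targets.each { |t| linear_include?(SORTED, t) } }
binary = Benchmark.realtime { targets.each { |t| binary_include?(SORTED, t) } }

puts format('linear: %.4fs  binary: %.4fs', linear, binary)
```

Whether a difference like this matters still depends on how much of a real request is spent in the search, which is what the profile is for.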

The whole site

Performance for the business

Create a Performance Culture
- measure performance in dollars
- set a front-end load time budget
  - DOMContentLoaded
  - window.load
  - start render time
- set MART and/or M95RT
  - Set a maximum average response time and/or a maximum 95th percentile response time for your server responses
  - it’s important to capture what’s going on in the “long tail” as well as what’s happening to the average case.
- set a page weight budget
  - cannot exceed <projected user bandwidth in megabytes/second> × <load time budget in seconds>
- set integration costs
- Add automated performance and page weight tests
  - An acceptance test
    - make a GET request to this page
      - record two or three numbers
        Server response time
        User page load timings (DOMContentLoaded & load)
    - benchmark “hot code”
  - Run the performance acceptance tests separately from your unit and acceptance/integration tests.
  - there will always be gray areas
  - third-party services exist
    - Blazemeter
    - Loader.io

DB optimization

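Good places to add an index

Foreign keys
Primary keys
Polymorphic relationships
updated_at
- for Russian Doll caching

When sql performance looks off, use EXPLAIN

MVCC creates new row versions alongside the old ones. The old versions are normally cleaned up once the transaction is done, but some always slip through

VACUUM!!

saves space
makes the query planner more effective

When scaling, besides having more processes, there is also the question of how the processes talk to shared resources

database
Redis, memcache, and other key-value stores

This matters because the number of connections is limited!! Remember to count them!!

In tests, ACID can be relaxed to speed the tests up

put the db on a RAM disk
turn off fsync and synchronous commit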

Rails slow?

log to disk
useless gems
- Sprockets
- ActionMailer
- …
useless rack middleware
- Rack::Sendfile
- ActionDispatch::Cookies
- ActionDispatch::Session::CookieStore
- ActionDispatch::Flash
- ActionDispatch::RemoteIp
- ActionDispatch::ShowExceptions
- ActionDispatch::DebugExceptions
- ActionDispatch::Callbacks
- ActionDispatch::RequestId
- Rack::Runtime
- …

exception slow!!

Exceptions should not be used for flow control, use throw/catch for that.

This reserves exceptions for true failure conditions.

catch(:done) do
    i = 0
    loop do
        i += 1
        throw :done if i > 100_000
    end
end
finish_up
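
To see the cost difference on a given machine, a rough sketch like the one below can time both styles of early exit. The error class `LimitReached` and both helper names are hypothetical; actual timings depend on the Ruby version and on how deep the call stack is when the exception is raised.

```ruby
require 'benchmark'

class LimitReached < StandardError; end # hypothetical error class

# Leave a block early n times by raising and rescuing.
def exits_with_exception(n)
  n.times.count do
    begin
      raise LimitReached
    rescue LimitReached
      true
    end
  end
end

# Leave a block early n times with throw/catch.
def exits_with_throw(n)
  n.times.count do
    catch(:done) { throw :done, true }
  end
end

n = 100_000
Benchmark.bm(10) do |x|
  x.report('raise') { exits_with_exception(n) }
  x.report('throw') { exits_with_throw(n) }
end
```

Both helpers return the number of early exits performed, so they can be checked against each other before comparing the timings.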

memory bloat

What to look at?

Resident Set Size (RSS): the memory used by the process (including shared)
- Shared Memory
- Private Memory: includes forked children
  - Real Memory = Shared Memory + Private Memory

How to look?

ps
get_process_mem
Oink

Reducing memory bloat

Beware Big Allocations
- GC does not hand back all unused memory once it has run
  - think of it as giving memory back very slowly
- the alternative is streaming: file.gets
Gemfile Auditing
- audit gems: derailed_benchmarks
jemalloc
- a comparison of ptmalloc, tcmalloc, and jemalloc
GC Parameters

Memory Leaks

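They come at different levels

Managed Ruby object leaks
C-extension leaks
Leaks in Ruby itself (the VM)

Memory usage usually levels off after about 2~3 hours; if it has not done so within 24 hours at the latest, a leak is likely

How to reproduce

Raise the environment's memory limit and configure it not to kill the process
Let it run and see whether the usage converges

Use siege to run repeated tests, then look at

RSS memory usage
GC.stat[:heap_live_slots]
- the number of slots occupied by objects
- if RSS rises but this stays flat
  - it may be a C-extension leak
GC.stat[:heap_free_slots]
- the number of slots not occupied by objects
- if this number is large, it means
  - the ruby vm has not given the memory back to the OS
  - something allocated a lot of memory and then stopped using it
  - this is memory bloat
ObjectSpace.count_objects
- the number of objects currently in the ruby vm
- if one type of object keeps growing, it means
  - a Ruby memory leak

Here is a small program that records the information above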

require 'logger'

Thread.new do
    logger = Logger.new('mem_log.txt')
    # Emit the bare message plus a newline so each sample is one CSV row.
    logger.formatter = proc { |sev, date, prog, msg| "#{msg}\n" }
    headers = [
        "RSS",
        "Live slots",
        "Free slots",
        ObjectSpace.count_objects.keys
    ].flatten
    logger.info headers.join(",")
    while true
        pid = Process.pid
        # Ask ps for this pid only; grepping the full list can match other pids.
        rss = `ps -o rss= -p #{pid}`
        memory_info = [
            rss.strip,
            GC.stat[:heap_live_slots],
            GC.stat[:heap_free_slots],
            ObjectSpace.count_objects.values
        ].flatten
        logger.info memory_info.join(",")
        sleep 5
    end
end

Put the code above in config/initializers and it produces a csv. Then hit the app with siege to generate 10~15k rows of data, which can then be analyzed

Managed Ruby object leaks
- heap live slots & RSS rise, heap free slots stay low
- use memory_profiler to look at retained objects by location
C-extension leaks
- heap live slots & heap free slots stay flat, RSS rises
- Ruby heap dumping
- jemalloc Introspection
Leaks in Ruby itself (the VM)
- heap live slots & heap free slots stay flat, RSS rises
  - but no C-extension leak can be found!!
- report it upstream
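
The rules of thumb above can be written down as a small classifier over the first and last samples of a load test. This is only a sketch: the function name, the sample format, and the 10% growth threshold are all assumptions made for illustration, and real data is noisy enough that the whole series should be inspected as well.

```ruby
# Classify a memory trend from the first and last samples of a load test.
# Each sample is a hash with :rss, :live_slots and :free_slots.
# The 10% growth threshold is an arbitrary assumption for this sketch.
def classify_memory_trend(first, last, threshold: 0.10)
  grew = lambda do |key|
    (last[key] - first[key]).to_f / first[key] > threshold
  end

  return :no_obvious_leak unless grew.call(:rss)
  return :managed_ruby_object_leak if grew.call(:live_slots)
  return :memory_bloat if grew.call(:free_slots)

  # RSS grows while both slot counts stay flat.
  :c_extension_or_vm_leak
end

first = { rss: 200_000, live_slots: 500_000, free_slots: 10_000 }
last  = { rss: 320_000, live_slots: 505_000, free_slots: 10_200 }
puts classify_memory_trend(first, last)
```

The last branch cannot tell a C-extension leak from a leak in the VM itself; as noted above, that distinction needs heap dumps or allocator introspection.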

If all else fails, use Worker-Killers

Memory Fragmentation

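Memory fragmentation makes memory usage grow logarithmically, up to a frightening limit

The main cause is that ruby cannot move objects around in memory

ObjectSpace is the ruby vm's memory. Each slot holds an RVALUE (40 bytes), i.e. the pointer to an object. RVALUEs are grouped together into pages

So ruby itself also has fragmentation. Look at GC.stat

heap_live_slots: how many slots are currently occupied by an RVALUE, aka how many objects are alive
heap_eden_pages
- an eden_page is a page with at least one live slot
- a tomb_page is a page with no live slot at all
  - only a tomb_page can be returned to the OS
heap_sorted_length
- memory is initially allocated block after block
  - the length allocated so far is heap_sorted_length
- but if a few blocks in the middle are freed…
  - heap_sorted_length stays the same, because the space is no longer contiguous
  - but the gaps in the middle cannot be used (Fragmentation)

So fragmentation can be looked at in two ways

heap_live_slots / the number of slots in heap_eden_pages

GC.stat[:heap_live_slots] # 24508
GC.stat[:heap_eden_pages] # 83
GC::INTERNAL_CONSTANTS[:HEAP_PAGE_OBJ_LIMIT] # 408
# 1 - live_slots / (eden_pages * slots_per_page)
# 24508 / (83 * 408) = 72.4%
# 100% - 72.4% = 27.6%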

the number of pages still in use relative to GC.stat[:heap_sorted_length], e.g. GC.stat[:heap_eden_pages] / GC.stat[:heap_sorted_length]
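
The first measure above can be wrapped in a small helper. This is a sketch: `free_slot_ratio` is a hypothetical name, and `HEAP_PAGE_OBJ_LIMIT` is an MRI internal constant that may differ or be missing on other Ruby versions and implementations, so the live reading is guarded.

```ruby
# Share of the slots in eden pages that hold no live object.
def free_slot_ratio(live_slots, eden_pages, slots_per_page)
  1.0 - live_slots.to_f / (eden_pages * slots_per_page)
end

# The numbers from the example above.
puts format('%.1f%%', free_slot_ratio(24_508, 83, 408) * 100) # prints 27.6%

# The same measure for the current process, when the constant is available.
limit = defined?(GC::INTERNAL_CONSTANTS) && GC::INTERNAL_CONSTANTS[:HEAP_PAGE_OBJ_LIMIT]
if limit
  stat = GC.stat
  puts free_slot_ratio(stat[:heap_live_slots], stat[:heap_eden_pages], limit)
end
```

On newer Rubies pages can hold slots of different sizes, so the live reading is only an approximation there.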

per-thread memory arena

We call malloc in a thread
The thread attempts to obtain the lock for the memory arena it accessed previously
If that arena is not available, try the next memory arena
If none of the memory arenas are available, create a new arena and use that
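- and add it back to the list of arenas

So an arena is really just memory!! But if the number of arenas is not limited

you end up with many small chunks of memory that cannot be merged
ruby's pointers cannot be moved (an RVALUE's pointer points directly at memory)

Fewer arenas means less memory usage but more contention

So the next time this comes up

Reduce Memory Arenas (change MALLOC_ARENA_MAX)
Use jemalloc
Compacting GC (a dream)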

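About the application server itself

aws and heroku make scaling easy, which also makes over-scaling easy

Scaling increases throughput, not speed. Scaling only improves response times when there is a queue

So do not scale on response times alone; look at how many requests are queued

Servers differ in their io model and process/thread model, which makes a huge difference in how they scale

The life of a request

The point is where requests get queued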

load balancer
Heroku router
- it will then wait up to five seconds for that dyno to accept the request and open a connection.
available host
- backlog: the socket on the dyno will accept the connection even if the webserver is busy processing other requests.

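The two most important points above

the router waits up to 5 seconds for the connection to succeed
a request can live in the host's backlog (the server must support this)

How do servers differ when scaling?

They mainly differ in how they handle two things

slow client protection
- request buffering: the server waits until the request has been fully downloaded before passing it to the app
slow response protection
- kind of concurrency - either multithreading or multiprocess/forking
  - at least I/O does not block (if threads handle it properly)
  - but with multithreading, the GIL keeps other threads off the cpu
    - so ruby's multithreading is a poor fit for cpu-bound requests

Seen this way, only the following are suitable for scaling a server

Puma in clustered mode
Phusion Passenger 5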

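Setting server parameters

Goals

maximize memory and CPU utilization
maximize throughput

Four things to watch

number of processes
- only processes are truly parallel
- at least 3 processes per server are recommended
  - the maximum?
    - depends on mem and cpu
      - mem
        do not use too much mem, or you will overcommit and swap
        measure the ruby app's mem usage
        let it run for 12~24 hours
        check with ps
        procs = (TOTAL_RAM / (RAM_PER_PROCESS * 1.2))
      - cpu
        check cpu load every 5 or 15 minutes
        if it is near or at 100%, reduce the number of processes
        procs = 1.2~1.5 times the hyperthreads
    - typically 8
- what is the benefit of multiple processes?
  - the OS can do the load balancing
  - which is better than having the load balancer do it
    - the OS knows the state of each process!!
number of threads
- ruby threads only help with IO (db)
- so how many
  - 5~6 at most
  - any more and you
    - run into Amdahl’s law
    - blow up mem (see mem fragmentation)
copy-on-write
- fork after init is done (preload)
- but it saves less space than you might expect
  - with huge pages, changing a single bit triggers a copy, so a lot of data gets copied
    - think of how the ruby vm uses pages: many objects are packed into the same page
  - fragmentation!!
Container size
- how much cpu and mem to get
- depends on
  - what your app needs (mem-hungry? cpu-hungry?)
  - the number of processes from above
    - 3 processes, at about 300MB each for a ruby app
      - so at least 1G of mem

Steps

Find out how much mem 1 process running 5 threads needs
The number of child processes that fit is (TOTAL_RAM / (RAM_PER_PROCESS * 1.2))
- a server needs 3 processes; work out the total mem required from that
Make sure there are enough hyperthreads
- the number of child processes should be 1.25~1.5 times the hyperthreads
Monitor cpu and mem usage, and adjust the number of processes and the container spec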
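
The sizing steps above can be sketched as two small helpers. The container figures (2048MB RAM, 4 hyperthreads, 300MB per process) are assumptions for illustration, and both helper names are hypothetical.

```ruby
# procs = TOTAL_RAM / (RAM_PER_PROCESS * 1.2), rounded down.
def max_processes_by_memory(total_ram_mb, ram_per_process_mb, headroom: 1.2)
  (total_ram_mb / (ram_per_process_mb * headroom)).floor
end

# Upper bound from CPU: 1.25~1.5 times the hyperthreads (1.5 used here).
def max_processes_by_cpu(hyperthreads, factor: 1.5)
  (hyperthreads * factor).floor
end

# Assumed container: 2048MB RAM, 4 hyperthreads, 300MB per process.
by_mem = max_processes_by_memory(2048, 300)
by_cpu = max_processes_by_cpu(4)
puts "memory allows #{by_mem}, CPU allows #{by_cpu}, use #{[by_mem, by_cpu].min}"
```

The smaller of the two bounds is only a starting point; the monitoring step decides whether the process count should move up or down from there.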

gc

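Generational GC assumes that it is usually the young objects that die, so it is split into two kinds of gc

minor gc only handles new objects (objects that have survived fewer than 3 GCs)
- starts when there are no free slots
  - handles new objects, objects in the remember set, and objects without a write barrier
    - remember set: old objects that hold pointers to new objects
    - write barrier: the interface between the ruby runtime and an object
major gc handles all objects
- starts in the 2 cases below
  - there are still no free slots after a minor gc
  - one of 4 limits is exceeded
    - malloc_increase_bytes_limit
      - malloc_increase_bytes
        when an RVALUE is too small to hold the data, the data is allocated elsewhere
        malloc_increase_bytes is the size of that memory
    - oldmalloc_increase_bytes_limit
      - the same idea as malloc_increase_bytes, but only for old objects
    - old_objects_limit
      - the slots of old objects
    - remembered_wb_unprotected_objects_limit
      - the remember set plus objects without a write barrier

Tracing the gc count shows whether a request or background job keeps triggering gc. Below is a middleware that can be used for the tracing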

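class GCCounter
    def initialize(app)
        @app = app
    end

    def call(env)
        gc_counts_before = GC.stat.select { |k, _v| k =~ /count/ }
        response = @app.call(env)
        gc_counts_after = GC.stat.select { |k, _v| k =~ /count/ }
        # Print how many GC runs (total, minor, major) this request triggered.
        puts gc_counts_before.merge(gc_counts_after) { |_k, vb, va| va - vb }
        # Rack expects the downstream [status, headers, body] triplet back.
        response
    end
end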

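ruby's ObjectSpace (heap) is memory: one pointer (RVALUE) maps to one slot, and many slots make up a page

heap_sorted_length is the contiguous length currently allocated (think of how a vma is implemented)
heap_allocated_pages is how many pages there are (mem that has already been turned into pages)
heap_allocatable_pages is how many more pages there can be (mem that has already been malloc'ed)

heap_live_slots is how many objects exist right now
heap_free_slots is how many slots are empty
heap_final_slots is how many slots are being finalized
heap_marked_slots is the old objects plus the objects without a write barrier (mem of C extensions)

tomb_pages are pages whose slots are all free (they can be returned to the OS)
eden_pages are pages with at least one live slot

Goals of gc tuning

reduce memory bloat
reduce the time spent running gc

Core idea: keep the number of free slots from getting too large

Tune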

RUBY_GC_HEAP_FREE_SLOTS_GOAL_RATIO
RUBY_GC_HEAP_INIT_SLOTS
RUBY_GC_HEAP_FREE_SLOTS_MAX_RATIO
RUBY_GC_HEAP_FREE_SLOTS_MIN_RATIO

RUBY_GC_HEAP_FREE_SLOTS_GOAL_RATIO=0.1
RUBY_GC_HEAP_FREE_SLOTS_MAX_RATIO=0.2
RUBY_GC_HEAP_FREE_SLOTS_MIN_RATIO=0.05
RUBY_GC_HEAP_INIT_SLOTS=1000000
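
To see where a running process currently stands relative to those ratios, the free-slot share can be read from `GC.stat`. This is a minimal sketch that assumes MRI's `heap_free_slots` and `heap_available_slots` keys; it only reports the current value and does not change any GC setting.

```ruby
# Ratio of free slots to all available slots in the current process.
stat = GC.stat
free_ratio = stat[:heap_free_slots].to_f / stat[:heap_available_slots]

puts format('free slots: %d of %d (%.1f%%)',
            stat[:heap_free_slots], stat[:heap_available_slots], free_ratio * 100)
```

Comparing this number before and after changing the environment variables, under the same load, gives a rough idea of whether the tuning had the intended effect.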

orm

Core idea: avoiding instantiating ActiveRecord objects

With a lot of records, do not read them one at a time with each, or mem will take off
- find_each and in_batches loads them in batches
Select Only What You Need
Preloads somehow
- eager_load use LEFT OUTER JOIN when eager loading the model associations.
- includes: consider this one first
- preload issues a query to load the specified model, then ruby stitches the rest together
- Each eager load increases the number of instantiated objects, and in turn slows down the query
Don’t Use Many Queries When One Will Do
- a common way to create too many ActiveRecord objects is when doing mass updates
- if a single sql statement can handle it, let sql handle it
  - update_all, destroy_all
Do Math In The Database
- when statistics are needed, leave them to the db
N+1
- run with production data and read the sql log
- find where the SQL is generated (see the code below)
- run rack-mini-profiler

module LogQuerySource
    def debug(*args, &block)
        return unless super
        
        backtrace = Rails.backtrace_cleaner.clean caller
        relevant_caller_line = backtrace.detect do |caller_line|
        !caller_line.include?('/initializers/')
        end
        if relevant_caller_line
            logger.debug(" -> #{ relevant_caller_line.sub("#{
        end
    end
end

ActiveRecord::LogSubscriber.send :prepend, LogQuerySource

The book's example is this: a partial calls find_by for every item in the collection!!

using an ActiveRecord query method like find_by which is called on every element in a collection - is extremely common. Any use on one element of a collection carries an N+1 risk

The flow is

Methods on a model trigger SQL queries (by using the ActiveRecord API)
those methods get called in the view
they end up being used in a partial or something that gets iterated for every element in a collection,
N+1

Solutions

Instead of using ActiveRecord methods that trigger SQL queries, we’re going to rewrite this method to use regular Arrays and Enumerable methods.
Do not use ActiveRecord query methods inside models, especially not on a model’s instance methods.
- Use them only in controllers and helpers.
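
The shape of that rewrite can be shown without ActiveRecord. In the sketch below, `Comment`, `COMMENTS`, `QUERY_LOG`, and `fetch_comments` are hypothetical stand-ins: `fetch_comments` plays the role of a query method and records one entry per simulated SQL query, so the two approaches can be compared by query count.

```ruby
# Stand-ins for ActiveRecord models; QUERY_LOG counts simulated SQL queries.
Comment = Struct.new(:id, :post_id, :body)
QUERY_LOG = []

COMMENTS = [
  Comment.new(1, 10, 'first'),
  Comment.new(2, 10, 'second'),
  Comment.new(3, 11, 'third')
].freeze

# Simulates a query such as Comment.where(post_id: ...): one "query" per call.
def fetch_comments(post_ids)
  QUERY_LOG << post_ids
  COMMENTS.select { |c| Array(post_ids).include?(c.post_id) }
end

post_ids = [10, 11, 12]

# N+1 shape: one query per element of the collection.
QUERY_LOG.clear
per_post = post_ids.map { |id| [id, fetch_comments(id)] }.to_h
n_plus_one_queries = QUERY_LOG.size

# Single query, then plain Enumerable methods in memory.
QUERY_LOG.clear
grouped = fetch_comments(post_ids).group_by(&:post_id)
batched = post_ids.map { |id| [id, grouped.fetch(id, [])] }.to_h
batched_queries = QUERY_LOG.size

puts "N+1: #{n_plus_one_queries} queries, batched: #{batched_queries} query"
```

Both versions build the same hash of comments per post; only the number of round trips to the simulated database differs.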
