More on 40% better single-threaded performance in MariaDB

In my previous post I wrote about how I achived a >40% speedup on sysbench read-only using profile-guided optimisation (PGO). While this is a preliminary result, I though it was so interesting that it deserved early mention. The fact that any benchmark can be improved that much shows clearly that PGO is something worth looking into. Even if we will probably not improve all workloads by 40%, it seems highly likely that we can obtain significant gains also for many real workloads.

I had one or two interesting comments on the post that raise valid concerns, so I wanted to write a follow-up here, explaining some of the points in more details and going deeper into the performance counter measurements. As I wrote before, actual observations and measurements are crucial to fully understand performance of complex code on modern CPUs. Intuition and guesswork just does not suffice.

On branch mispredictions

bludwarf suggested that branch mispredictions could be an important part of the picture:

> It would be interesting to see the hottest spots.... Having to execute just
> a lot of code is not an explanation. If it would be so, then profiler
> feedback won't affect performance so much. I believe the reson for so big
> difference for the results is branch misprediction.

I had similar thoughts when I first saw the results, but I was able to verify
my explanation with actual measurements of instructions retired and icache
misses before and after profile-guided optimisations:

    171821  INST_RETIRED.ANY
     17730  ICACHE.MISSES

    166686  INST_RETIRED.ANY
     13643  ICACHE.MISSES

Before PGO, ICACHE.MISSES/INST_RETIRED.ANY = 10.3%. After, we get 8.2%. So we do get significant less icache misses. I wrote in my previous post some plausible reasons for that with my remarks on basic blocks, though I have not tried to verify those with actual measurements.

I also checked branch mispredictions. The Intel manual has two measures to
estimate the cost of branch misprediction. One is time wasted due to
speculatively issued uops that do not retire, %Bad_Speculation =
(UOPS_ISSUED.ANY – UOPS_RETIRED.RETIRE_SLOTS + 4 * INT_MISC.RECOVERY_CYCLES )
/ (4*CPU_CLK_UNHALTED.THREAD). The other is cycles wasted based on an
estimated cost of 20 cycles per misprediction, %BR.MISP.COST = 20 *
BR_MISP_RETIRED.ALL_BRANCHES_PS / CPU_CLK_UNHALTED.THREAD. Here are the
relevant measurements taken before and after (two sets of measurements, as
Sandy Bridge supports only 4 simultaneous performance counters):

    486641  CPU_CLK_UNHALTED.THREAD
    205880  UOPS_RETIRED.RETIRE_SLOTS
    226184  UOPS_ISSUED.ANY
      7219  INT_MISC.RECOVERY_CYCLES
    %Bad_Speculation = 2.5%

    335051  CPU_CLK_UNHALTED.THREAD
    194616  UOPS_RETIRED.RETIRE_SLOTS
    216353  UOPS_ISSUED.ANY
      6830  INT_MISC.RECOVERY_CYCLES
    %Bad_Speculation = 3.7%

    490250  CPU_CLK_UNHALTED.THREAD
      1310  BR_MISP_RETIRED.ALL_BRANCHES_PS
    %BR.MISP.COST = 5.3%

    338524  CPU_CLK_UNHALTED.THREAD
      1147  BR_MISP_RETIRED.ALL_BRANCHES_PS
    %BR.MISP.COST = 6.8%

Note that the actual branch misprediction is much the same. But with less time spent on other bottlenecks, the estimated relative cost becomes higher.

Most of the branch mispredictions are due to indirect branches (virtual functions and @plt entries in shared libraries). Even without PGO the CPU is able to predict other branches very well, so compiler or __builtin_expect can not help much.

At the end of this post, I have appended the top of some per-function profiles. They show how branch mispredictions are mostly from commonly called library functions like memcpy() (these are mispredictions of the indirect jumps in the PLT), and from virtual function calls. They also show how the icache misses are spread out more or less evenly, correlating strongly with the count of actual instructions executed in various parts of the code.

On better coverage of the PGO

A comment from Anonymous raised another valid concern:

> If I read you correctly, you are using the same benchmark for optimizing as
> well as measuring performance increases. You need to have different test and
> training sets or your measurements don't generalize.

This is true, of course. The sysbench read-only exercises only an extremely narrow subset of mysqld functionality. The next step, on which I am already working, is to implement a more suitable and general workload for the profile, which can hopefully cover, and hence speedup, a significant portion of common real-life workloads.

More tests will be needed to be sure that real-world gains can be achieved, but I am very hopeful. After all, much of the code of sys-bench read-only is also used in many other workloads, and much of the properties (eg. infrequency of the lots and lots of error check branches) should be similar.

My main concern is that GCC will choose to de-optimise code that is not covered by the test load, to the extent that it significantly harms performance in some cases. But it should not be too much of a problem – after all, GCC will still try to optimise the generated code as best it can, and if the information it has from the profile is incomplete, it should not be much worse than having no information, as is the case without PGO. And over time, we can extend the coverage of the test load to help eliminate any problems that do turn up.

Since the last post, I implemented a first version of a program that generates a test load. For now, I put it on github: gen_profile_load. The program generates a range of different loads, such as simple key lookups, small joins, inserts/updates/deletes, under various conditions such as with/without binlogging, with/without prepared statements, and so on.

I plan to next do some tests with using this for doing the PGO, and checking the effect on sysbench read-only and possibly other benchmarks. Another idea I had was to try to use only parts of the generated load, and test that this does not cause significant regressions on the omitted parts. This could help in getting some idea of what kind of (hopefully small) performance regressions can be expected for workloads that are not well covered by the PGO.

Detailed profiles

These are pasted from the output of perf report. They show the top functions in terms of instructions executed, icache misses suffered, and branches mispredicted.

Instructions retired (INST_RETIRED.ANY), before PGO:

Samples: 171K of event 'rc0', Event count (approx.): 171821
  1,94%  mysqld  libc-2.13.so         [.] __memcpy_ssse3
  1,92%  mysqld  mysqld               [.] my_hash_sort_bin
  1,60%  mysqld  mysqld               [.] make_join_statistics(JOIN*, List<TABLE_LIST>&, Item*, st_dynamic_array*)
  1,48%  mysqld  mysqld               [.] malloc
  1,36%  mysqld  mysqld               [.] free
  1,34%  mysqld  mysqld               [.] row_search_for_mysql(unsigned char*, unsigned long, row_prebuilt_t*, unsigned long, unsigned long)
  1,31%  mysqld  mysqld               [.] alloc_root
  1,30%  mysqld  mysqld               [.] JOIN::optimize_inner()
  1,16%  mysqld  mysqld               [.] my_malloc_size_cb_func
  1,14%  mysqld  mysqld               [.] ha_innobase::build_template(bool)
  1,08%  mysqld  mysqld               [.] cmp_dtuple_rec_with_match_low(dtuple_t const*, unsigned char const*, unsigned long const*, unsigned long, unsigned long*, unsigned long*)
  1,08%  mysqld  mysqld               [.] JOIN::save_explain_data_intern(Explain_query*, bool, bool, bool, char const*)
  1,07%  mysqld  libpthread-2.13.so   [.] __pthread_mutex_unlock_usercnt
  1,01%  mysqld  libpthread-2.13.so   [.] pthread_mutex_lock
  1,00%  mysqld  mysqld               [.] Protocol::net_store_data(unsigned char const*, unsigned long, charset_info_st const*, charset_info_st const*)
  0,98%  mysqld  libpthread-2.13.so   [.] pthread_getspecific
  0,93%  mysqld  mysqld               [.] mysql_select(THD*, Item***, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsigned long long, select_result*, st_
  0,90%  mysqld  mysqld               [.] build_template_field(row_prebuilt_t*, dict_index_t*, dict_index_t*, TABLE*, Field const*, unsigned long)
  0,88%  mysqld  [vdso]               [.] 0x000000000000070c
  0,83%  mysqld  mysqld               [.] rec_get_offsets_func(unsigned char const*, dict_index_t const*, unsigned long*, unsigned long, mem_block_info_t**, char const*, unsigned long)
  0,79%  mysqld  libc-2.13.so         [.] __memset_sse2
  0,78%  mysqld  libc-2.13.so         [.] __strlen_sse42
  0,77%  mysqld  [kernel.kallsyms]    [k] unix_stream_recvmsg
  0,77%  mysqld  mysqld               [.] THD::enter_stage(PSI_stage_info_none const*, PSI_stage_info_none*, char const*, char const*, unsigned int)
  0,71%  mysqld  mysqld               [.] _ZN4JOIN7prepareEPPP4ItemP10TABLE_LISTjS1_jP8st_orderbS7_S1_S7_P13st_select_lexP18st_select_lex_unit.part.227
  0,70%  mysqld  libc-2.13.so         [.] __strcmp_sse42
  0,70%  mysqld  mysqld               [.] Protocol::send_result_set_metadata(List<Item>*, unsigned int)
  0,70%  mysqld  [kernel.kallsyms]    [k] update_cfs_shares
  0,69%  mysqld  mysqld               [.] open_tables(THD*, TABLE_LIST**, unsigned int*, unsigned int, Prelocking_strategy*)
  0,69%  mysqld  mysqld               [.] _ZL19update_ref_and_keysP3THDP16st_dynamic_arrayP13st_join_tablejP4ItemyP13st_select_lexPP17st_sargable_param.isra.231
  0,68%  mysqld  mysqld               [.] dispatch_command(enum_server_command, THD*, char*, unsigned int)
  0,60%  mysqld  mysqld               [.] create_ref_for_key(JOIN*, st_join_table*, keyuse_t*, bool, unsigned long long)
  0,59%  mysqld  [kernel.kallsyms]    [k] effective_load
  0,56%  mysqld  [kernel.kallsyms]    [k] select_task_rq_fair
  0,56%  mysqld  [kernel.kallsyms]    [k] _raw_spin_lock_irqsave
  0,50%  mysqld  mysqld               [.] ha_innobase::external_lock(THD*, int)
  0,48%  mysqld  mysqld               [.] my_malloc
  0,48%  mysqld  [kernel.kallsyms]    [k] enqueue_task_fair
  0,48%  mysqld  mysqld               [.] lock_tables(THD*, TABLE_LIST*, unsigned int, unsigned int)
  0,47%  mysqld  [kernel.kallsyms]    [k] update_entity_load_avg
  0,46%  mysqld  mysqld               [.] mysql_execute_command(THD*)
  0,46%  mysqld  mysqld               [.] btr_search_guess_on_hash(dict_index_t*, btr_search_t*, dtuple_t const*, unsigned long, unsigned long, btr_cur_t*, unsigned long, mtr_t*)
  0,45%  mysqld  mysqld               [.] Item_func::type() const
  0,44%  mysqld  [kernel.kallsyms]    [k] update_curr
  0,44%  mysqld  [kernel.kallsyms]    [k] copy_user_generic_string
  0,44%  mysqld  [kernel.kallsyms]    [k] do_raw_spin_lock
  0,43%  mysqld  mysqld               [.] my_lengthsp_8bit
  0,43%  mysqld  [kernel.kallsyms]    [k] _raw_spin_unlock_irqrestore
  0,43%  mysqld  [kernel.kallsyms]    [k] init_sync_kiocb.constprop.15
  0,41%  mysqld  mysqld               [.] get_best_combination(JOIN*)
  0,40%  mysqld  mysqld               [.] sort_and_filter_keyuse(THD*, st_dynamic_array*, bool)
  0,36%  mysqld  mysqld               [.] Item_equal::Item_equal(Item*, Item*, bool)
  0,36%  mysqld  mysqld               [.] Item::Item()
  0,35%  mysqld  [kernel.kallsyms]    [k] atomic64_dec_and_test
  0,35%  mysqld  mysqld               [.] Arg_comparator::set_cmp_func(Item_result_field*, Item**, Item**, Item_result)
  0,35%  mysqld  mysqld               [.] JOIN::exec_inner()
  0,35%  mysqld  [kernel.kallsyms]    [k] account_entity_enqueue
  0,35%  mysqld  [kernel.kallsyms]    [k] do_sys_poll
  0,35%  mysqld  [kernel.kallsyms]    [k] system_call
  0,34%  mysqld  mysqld               [.] JOIN::destroy()
  0,34%  mysqld  mysqld               [.] ha_innobase::store_lock(THD*, st_thr_lock_data**, thr_lock_type)

Icache misses (ICACHE.MISSES), before PGO:

Samples: 17K of event 'r280', Event count (approx.): 17730
  2,21%  mysqld  mysqld               [.] JOIN::optimize_inner()
  2,04%  mysqld  mysqld               [.] row_search_for_mysql(unsigned char*, unsigned long, row_prebuilt_t*, unsigned long, unsigned long)
  1,91%  mysqld  mysqld               [.] make_join_statistics(JOIN*, List<TABLE_LIST>&, Item*, st_dynamic_array*)
  1,42%  mysqld  mysqld               [.] JOIN::save_explain_data_intern(Explain_query*, bool, bool, bool, char const*)
  1,27%  mysqld  libc-2.13.so         [.] __memcpy_ssse3
  1,20%  mysqld  mysqld               [.] malloc
  1,09%  mysqld  mysqld               [.] mysql_select(THD*, Item***, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsigned long long, select_result*, st_
  1,07%  mysqld  mysqld               [.] open_tables(THD*, TABLE_LIST**, unsigned int*, unsigned int, Prelocking_strategy*)
  0,90%  mysqld  mysqld               [.] create_ref_for_key(JOIN*, st_join_table*, keyuse_t*, bool, unsigned long long)
  0,90%  mysqld  [vdso]               [.] 0x0000000000000771
  0,86%  mysqld  libpthread-2.13.so   [.] pthread_mutex_lock
  0,83%  mysqld  mysqld               [.] _ZN4JOIN7prepareEPPP4ItemP10TABLE_LISTjS1_jP8st_orderbS7_S1_S7_P13st_select_lexP18st_select_lex_unit.part.227
  0,73%  mysqld  mysqld               [.] mysql_execute_command(THD*)
  0,72%  mysqld  mysqld               [.] alloc_root
  0,68%  mysqld  mysqld               [.] JOIN::exec_inner()
  0,65%  mysqld  libc-2.13.so         [.] __memset_sse2
  0,64%  mysqld  mysqld               [.] JOIN::destroy()
  0,63%  mysqld  mysqld               [.] free
  0,59%  mysqld  [kernel.kallsyms]    [k] _raw_spin_lock_irqsave
  0,55%  mysqld  mysqld               [.] _ZL19update_ref_and_keysP3THDP16st_dynamic_arrayP13st_join_tablejP4ItemyP13st_select_lexPP17st_sargable_param.isra.231
  0,55%  mysqld  [kernel.kallsyms]    [k] do_raw_spin_lock
  0,54%  mysqld  mysqld               [.] dispatch_command(enum_server_command, THD*, char*, unsigned int)
  0,54%  mysqld  mysqld               [.] ha_innobase::external_lock(THD*, int)
  0,52%  mysqld  mysqld               [.] THD::enter_stage(PSI_stage_info_none const*, PSI_stage_info_none*, char const*, char const*, unsigned int)
  0,51%  mysqld  mysqld               [.] my_malloc_size_cb_func
  0,51%  mysqld  mysqld               [.] Protocol::send_result_set_metadata(List<Item>*, unsigned int)
  0,51%  mysqld  mysqld               [.] btr_search_guess_on_hash(dict_index_t*, btr_search_t*, dtuple_t const*, unsigned long, unsigned long, btr_cur_t*, unsigned long, mtr_t*)
  0,50%  mysqld  mysqld               [.] ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function)
  0,47%  mysqld  mysqld               [.] setup_conds(THD*, TABLE_LIST*, List<TABLE_LIST>&, Item**)
  0,47%  mysqld  mysqld               [.] get_best_combination(JOIN*)
  0,47%  mysqld  mysqld               [.] ha_innobase::store_lock(THD*, st_thr_lock_data**, thr_lock_type)
  0,47%  mysqld  libc-2.13.so         [.] __strcmp_sse42
  0,47%  mysqld  [kernel.kallsyms]    [k] unix_stream_recvmsg
  0,46%  mysqld  mysqld               [.] trx_commit(trx_t*)
  0,44%  mysqld  [kernel.kallsyms]    [k] fget_light
  0,43%  mysqld  mysqld               [.] sort_and_filter_keyuse(THD*, st_dynamic_array*, bool)
  0,43%  mysqld  [kernel.kallsyms]    [k] system_call
  0,42%  mysqld  mysqld               [.] JOIN::cleanup(bool)
  0,41%  mysqld  mysqld               [.] handle_select(THD*, LEX*, select_result*, unsigned long)
  0,41%  mysqld  mysqld               [.] mtr_commit(mtr_t*)
  0,41%  mysqld  mysqld               [.] cmp_dtuple_rec_with_match_low(dtuple_t const*, unsigned char const*, unsigned long const*, unsigned long, unsigned long*, unsigned long*)
  0,41%  mysqld  mysqld               [.] my_hash_sort_bin
  0,40%  mysqld  mysqld               [.] open_table(THD*, TABLE_LIST*, st_mem_root*, Open_table_context*)
  0,40%  mysqld  libpthread-2.13.so   [.] pthread_getspecific
  0,39%  mysqld  mysqld               [.] buf_page_get_known_nowait(unsigned long, buf_block_t*, unsigned long, char const*, unsigned long, mtr_t*)
  0,39%  mysqld  libpthread-2.13.so   [.] __pthread_mutex_unlock_usercnt
  0,38%  mysqld  mysqld               [.] _ZL21join_read_const_tableP13st_join_tableP11st_position.isra.226
  0,38%  mysqld  mysqld               [.] lock_tables(THD*, TABLE_LIST*, unsigned int, unsigned int)
  0,38%  mysqld  mysqld               [.] find_field_in_table(THD*, TABLE*, char const*, unsigned int, bool, unsigned int*)
  0,37%  mysqld  mysqld               [.] ha_innobase::extra(ha_extra_function)
  0,37%  mysqld  mysqld               [.] ha_innobase::build_template(bool)
  0,37%  mysqld  [kernel.kallsyms]    [k] enqueue_task_fair
  0,36%  mysqld  mysqld               [.] add_key_field(JOIN*, key_field_t**, unsigned int, Item_func*, Field*, bool, Item**, unsigned int, unsigned long long, st_sargable_param**)
  0,36%  mysqld  mysqld               [.] add_key_fields(JOIN*, key_field_t**, unsigned int*, Item*, unsigned long long, st_sargable_param**)
  0,35%  mysqld  [kernel.kallsyms]    [k] update_rq_clock
  0,34%  mysqld  mysqld               [.] do_select(JOIN*, List<Item>*, TABLE*, Procedure*)
  0,33%  mysqld  mysqld               [.] my_real_read(st_net*, unsigned long*)
  0,33%  mysqld  mysqld               [.] Arg_comparator::set_cmp_func(Item_result_field*, Item**, Item**, Item_result)
  0,33%  mysqld  [kernel.kallsyms]    [k] __schedule
  0,33%  mysqld  mysqld               [.] Item::Item()
  0,32%  mysqld  mysqld               [.] thd_ha_data

Branch mispredictions (BR_MISP_RETIRED.ALL_BRANCHES_PS), before PGO:

Samples: 1K of event 'rc5:p', Event count (approx.): 1310
  7,02%  mysqld  libc-2.13.so         [.] __memcpy_ssse3
  6,64%  mysqld  libpthread-2.13.so   [.] pthread_getspecific
  3,74%  mysqld  libpthread-2.13.so   [.] pthread_mutex_unlock
  3,66%  mysqld  mysqld               [.] Item_func::type() const
  3,05%  mysqld  libc-2.13.so         [.] __memset_sse2
  2,67%  mysqld  mysqld               [.] Field_long::type() const
  2,60%  mysqld  libpthread-2.13.so   [.] pthread_mutex_lock
  2,52%  mysqld  libc-2.13.so         [.] __strcmp_sse42
  2,06%  mysqld  mysqld               [.] Item_field::used_tables() const
  2,06%  mysqld  mysqld               [.] Item_param::used_tables() const
  2,06%  mysqld  libc-2.13.so         [.] __strlen_sse42
  1,76%  mysqld  librt-2.13.so        [.] clock_gettime
  1,68%  mysqld  mysqld               [.] Item::cmp_type() const
  1,60%  mysqld  mysqld               [.] st_select_lex::master_unit()
  1,37%  mysqld  mysqld               [.] Item_equal::functype() const
  1,30%  mysqld  mysqld               [.] Item::const_item() const
  1,30%  mysqld  mysqld               [.] Item::real_item()
  1,07%  mysqld  mysqld               [.] Field_str::charset() const
  0,92%  mysqld  mysqld               [.] Protocol::end_statement()
  0,92%  mysqld  mysqld               [.] Item_field::type() const
  0,84%  mysqld  mysqld               [.] mysql_execute_command(THD*)
  0,84%  mysqld  mysqld               [.] Item_field::cleanup()
  0,84%  mysqld  mysqld               [.] Item_field::field_type() const
  0,84%  mysqld  mysqld               [.] ha_innobase::extra(ha_extra_function)
  0,84%  mysqld  libpthread-2.13.so   [.] __libc_recv
  0,76%  mysqld  mysqld               [.] Item_param::save_in_field(Field*, bool)
  0,76%  mysqld  mysqld               [.] my_hash_sort_bin
  0,69%  mysqld  mysqld               [.] make_join_statistics(JOIN*, List<TABLE_LIST>&, Item*, st_dynamic_array*)
  0,69%  mysqld  mysqld               [.] my_real_read(st_net*, unsigned long*)
  0,69%  mysqld  mysqld               [.] scheduler_wait_net_end
  0,69%  mysqld  mysqld               [.] Item_param::val_int()
  0,69%  mysqld  mysqld               [.] Item::check_cols(unsigned int)
  0,69%  mysqld  mysqld               [.] Item_field::result_type() const
  0,61%  mysqld  mysqld               [.] Field_long::pack_length() const
  0,61%  mysqld  mysqld               [.] vio_read
  0,61%  mysqld  [kernel.kallsyms]    [k] system_call
  0,53%  mysqld  mysqld               [.] Item::is_expensive()
  0,53%  mysqld  mysqld               [.] st_select_lex::outer_select()
  0,53%  mysqld  mysqld               [.] do_handle_one_connection(THD*)
  0,53%  mysqld  mysqld               [.] Explain_select::~Explain_select()
  0,53%  mysqld  mysqld               [.] Field_num::result_type() const
  0,53%  mysqld  mysqld               [.] Item_param::result_type() const
  0,53%  mysqld  mysqld               [.] Item_param::field_type() const
  0,53%  mysqld  mysqld               [.] Item_equal::count_sargable_conds(unsigned char*)
  0,53%  mysqld  mysqld               [.] Item_func_eq::functype() const
  0,53%  mysqld  mysqld               [.] ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function)
  0,53%  mysqld  [kernel.kallsyms]    [k] __pollwait
  0,53%  mysqld  [kernel.kallsyms]    [k] sock_def_readable
  0,46%  mysqld  mysqld               [.] net_before_header_psi(st_net*, void*, unsigned long)
  0,46%  mysqld  mysqld               [.] dispatch_command(enum_server_command, THD*, char*, unsigned int)
  0,46%  mysqld  mysqld               [.] mysqld_stmt_execute(THD*, char*, unsigned int)
  0,46%  mysqld  mysqld               [.] handler::index_read_map(unsigned char*, unsigned char const*, unsigned long, ha_rkey_function)
  0,46%  mysqld  mysqld               [.] Item_equal::select_optimize() const
  0,46%  mysqld  mysqld               [.] Item_func::used_tables() const
  0,46%  mysqld  mysqld               [.] ha_innobase::info(unsigned int)
  0,46%  mysqld  libc-2.13.so         [.] __memmove_ssse3
  0,46%  mysqld  libc-2.13.so         [.] __memcmp_sse4_1
  0,46%  mysqld  [kernel.kallsyms]    [k] dequeue_task_fair
  0,46%  mysqld  [kernel.kallsyms]    [k] pick_next_task_fair
  0,38%  mysqld  mysqld               [.] Protocol_binary::store(char const*, unsigned long, charset_info_st const*)
  0,38%  mysqld  mysqld               [.] Protocol::send_result_set_metadata(List<Item>*, unsigned int)

Instructions retired (INST_RETIRED.ANY), after PGO:

Samples: 166K of event 'rc0', Event count (approx.): 166686
  2,55%  mysqld  libc-2.13.so         [.] __memcpy_ssse3
  1,79%  mysqld  mysqld               [.] my_hash_sort_bin
  1,60%  mysqld  mysqld               [.] alloc_root
  1,48%  mysqld  mysqld               [.] malloc
  1,42%  mysqld  mysqld               [.] my_malloc_size_cb_func
  1,42%  mysqld  mysqld               [.] free
  1,29%  mysqld  libc-2.13.so         [.] __memset_sse2
  1,23%  mysqld  libpthread-2.13.so   [.] __pthread_mutex_unlock_usercnt
  1,20%  mysqld  mysqld               [.] ha_innobase::build_template(bool)
  1,16%  mysqld  libpthread-2.13.so   [.] pthread_mutex_lock
  1,15%  mysqld  mysqld               [.] make_join_statistics(JOIN*, List<TABLE_LIST>&, Item*, st_dynamic_array*)
  1,15%  mysqld  mysqld               [.] JOIN::save_explain_data_intern(Explain_query*, bool, bool, bool, char const*)
  1,13%  mysqld  mysqld               [.] JOIN::optimize_inner()
  1,12%  mysqld  mysqld               [.] build_template_field(row_prebuilt_t*, dict_index_t*, dict_index_t*, TABLE*, Field const*, unsigned long)
  1,07%  mysqld  [vdso]               [.] 0x00000000000006a1
  1,03%  mysqld  mysqld               [.] row_search_for_mysql(unsigned char*, unsigned long, row_prebuilt_t*, unsigned long, unsigned long)
  1,00%  mysqld  libpthread-2.13.so   [.] pthread_getspecific
  0,99%  mysqld  mysqld               [.] THD::enter_stage(PSI_stage_info_none const*, PSI_stage_info_none*, char const*, char const*, unsigned int)
  0,94%  mysqld  libc-2.13.so         [.] __strlen_sse42
  0,83%  mysqld  mysqld               [.] cmp_dtuple_rec_with_match_low(dtuple_t const*, unsigned char const*, unsigned long const*, unsigned long, unsigned long*, unsigned long*)
  0,80%  mysqld  libc-2.13.so         [.] __strcmp_sse42
  0,78%  mysqld  [kernel.kallsyms]    [k] unix_stream_recvmsg
  0,78%  mysqld  mysqld               [.] Protocol::net_store_data(unsigned char const*, unsigned long, charset_info_st const*, charset_info_st const*)
  0,75%  mysqld  mysqld               [.] rec_get_offsets_func(unsigned char const*, dict_index_t const*, unsigned long*, unsigned long, mem_block_info_t**, char const*, unsigned long)
  0,69%  mysqld  mysqld               [.] JOIN::prepare(Item***, TABLE_LIST*, unsigned int, Item*, unsigned int, st_order*, bool, st_order*, Item*, st_order*, st_select_lex*, st_select_lex_unit*)
  0,68%  mysqld  mysqld               [.] _ZL19update_ref_and_keysP3THDP16st_dynamic_arrayP13st_join_tablejP4ItemyP13st_select_lexPP17st_sargable_param.isra.183
  0,68%  mysqld  [kernel.kallsyms]    [k] _raw_spin_lock_irqsave
  0,63%  mysqld  [kernel.kallsyms]    [k] effective_load
  0,63%  mysqld  mysqld               [.] mysql_select(THD*, Item***, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsigned long long, select_result*, st_
  0,62%  mysqld  mysqld               [.] create_ref_for_key(JOIN*, st_join_table*, keyuse_t*, bool, unsigned long long)
  0,59%  mysqld  [kernel.kallsyms]    [k] _raw_spin_unlock_irqrestore
  0,57%  mysqld  [kernel.kallsyms]    [k] select_task_rq_fair
  0,56%  mysqld  [kernel.kallsyms]    [k] update_cfs_shares
  0,56%  mysqld  mysqld               [.] open_tables(THD*, TABLE_LIST**, unsigned int*, unsigned int, Prelocking_strategy*)
  0,52%  mysqld  [kernel.kallsyms]    [k] update_curr
  0,52%  mysqld  mysqld               [.] Protocol::send_result_set_metadata(List<Item>*, unsigned int)
  0,52%  mysqld  mysqld               [.] ha_innobase::external_lock(THD*, int)
  0,51%  mysqld  mysqld               [.] my_malloc
  0,50%  mysqld  [kernel.kallsyms]    [k] update_entity_load_avg
  0,46%  mysqld  mysqld               [.] dispatch_command(enum_server_command, THD*, char*, unsigned int)
  0,46%  mysqld  [kernel.kallsyms]    [k] copy_user_generic_string
  0,45%  mysqld  mysqld               [.] btr_search_guess_on_hash(dict_index_t*, btr_search_t*, dtuple_t const*, unsigned long, unsigned long, btr_cur_t*, unsigned long, mtr_t*)
  0,43%  mysqld  mysqld               [.] get_best_combination(JOIN*)
  0,43%  mysqld  [kernel.kallsyms]    [k] do_raw_spin_lock
  0,42%  mysqld  mysqld               [.] find_field_in_table(THD*, TABLE*, char const*, unsigned int, bool, unsigned int*)
  0,42%  mysqld  [kernel.kallsyms]    [k] enqueue_task_fair
  0,41%  mysqld  [kernel.kallsyms]    [k] account_entity_enqueue
  0,39%  mysqld  [kernel.kallsyms]    [k] init_sync_kiocb.constprop.15
  0,39%  mysqld  mysqld               [.] sort_and_filter_keyuse(THD*, st_dynamic_array*, bool)
  0,39%  mysqld  mysqld               [.] mysql_execute_command(THD*)
  0,38%  mysqld  mysqld               [.] my_lengthsp_8bit
  0,38%  mysqld  [kernel.kallsyms]    [k] system_call
  0,38%  mysqld  mysqld               [.] lock_tables(THD*, TABLE_LIST*, unsigned int, unsigned int)
  0,37%  mysqld  mysqld               [.] Item::Item()
  0,36%  mysqld  [kernel.kallsyms]    [k] atomic64_dec_and_test
  0,35%  mysqld  librt-2.13.so        [.] clock_gettime
  0,34%  mysqld  mysqld               [.] JOIN::cleanup(bool)
  0,34%  mysqld  mysqld               [.] Item_func::type() const
  0,33%  mysqld  mysqld               [.] dict_index_copy_types(dtuple_t*, dict_index_t const*, unsigned long)
  0,33%  mysqld  mysqld               [.] MDL_context::try_acquire_lock_impl(MDL_request*, MDL_ticket**)
  0,33%  mysqld  [kernel.kallsyms]    [k] __update_tg_runnable_avg.isra.25

Icache misses (ICACHE.MISSES), after PGO:

Samples: 13K of event 'r280', Event count (approx.): 13643
  2,07%  mysqld  libc-2.13.so         [.] __memcpy_ssse3
  1,39%  mysqld  mysqld               [.] malloc
  1,33%  mysqld  mysqld               [.] JOIN::optimize_inner()
  1,26%  mysqld  mysqld               [.] JOIN::save_explain_data_intern(Explain_query*, bool, bool, bool, char const*)
  1,17%  mysqld  mysqld               [.] make_join_statistics(JOIN*, List<TABLE_LIST>&, Item*, st_dynamic_array*)
  1,08%  mysqld  [vdso]               [.] 0x0000000000000864
  1,07%  mysqld  mysqld               [.] row_search_for_mysql(unsigned char*, unsigned long, row_prebuilt_t*, unsigned long, unsigned long)
  1,06%  mysqld  libc-2.13.so         [.] __memset_sse2
  1,01%  mysqld  mysqld               [.] free
  0,97%  mysqld  libpthread-2.13.so   [.] pthread_mutex_lock
  0,92%  mysqld  [kernel.kallsyms]    [k] _raw_spin_lock_irqsave
  0,80%  mysqld  mysqld               [.] THD::enter_stage(PSI_stage_info_none const*, PSI_stage_info_none*, char const*, char const*, unsigned int)
  0,76%  mysqld  mysqld               [.] alloc_root
  0,74%  mysqld  libpthread-2.13.so   [.] __pthread_mutex_unlock_usercnt
  0,73%  mysqld  libc-2.13.so         [.] __strcmp_sse42
  0,71%  mysqld  mysqld               [.] open_tables(THD*, TABLE_LIST**, unsigned int*, unsigned int, Prelocking_strategy*)
  0,69%  mysqld  mysqld               [.] my_hash_sort_bin
  0,67%  mysqld  [kernel.kallsyms]    [k] fget_light
  0,63%  mysqld  [kernel.kallsyms]    [k] do_raw_spin_lock
  0,62%  mysqld  mysqld               [.] mysql_select(THD*, Item***, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsigned long long, select_result*, st_
  0,59%  mysqld  mysqld               [.] Item::Item()
  0,57%  mysqld  mysqld               [.] _ZL19update_ref_and_keysP3THDP16st_dynamic_arrayP13st_join_tablejP4ItemyP13st_select_lexPP17st_sargable_param.isra.183
  0,53%  mysqld  libpthread-2.13.so   [.] pthread_getspecific
  0,52%  mysqld  [kernel.kallsyms]    [k] update_cfs_shares
  0,51%  mysqld  mysqld               [.] ha_innobase::external_lock(THD*, int)
  0,51%  mysqld  mysqld               [.] my_malloc_size_cb_func
  0,50%  mysqld  mysqld               [.] JOIN::prepare(Item***, TABLE_LIST*, unsigned int, Item*, unsigned int, st_order*, bool, st_order*, Item*, st_order*, st_select_lex*, st_select_lex_unit*)
  0,48%  mysqld  mysqld               [.] mysql_execute_command(THD*)
  0,47%  mysqld  mysqld               [.] free_root
  0,47%  mysqld  [kernel.kallsyms]    [k] idle_cpu
  0,47%  mysqld  [kernel.kallsyms]    [k] unix_stream_sendmsg
  0,46%  mysqld  [kernel.kallsyms]    [k] system_call
  0,44%  mysqld  [kernel.kallsyms]    [k] unix_stream_recvmsg
  0,43%  mysqld  mysqld               [.] ha_innobase::info_low(unsigned int, bool)
  0,43%  mysqld  mysqld               [.] MDL_context::try_acquire_lock_impl(MDL_request*, MDL_ticket**)
  0,43%  mysqld  librt-2.13.so        [.] clock_gettime
  0,43%  mysqld  [kernel.kallsyms]    [k] enqueue_task_fair
  0,42%  mysqld  mysqld               [.] JOIN::init(THD*, List<Item>&, unsigned long long, select_result*)
  0,42%  mysqld  mysqld               [.] Item_field::used_tables() const
  0,41%  mysqld  mysqld               [.] create_ref_for_key(JOIN*, st_join_table*, keyuse_t*, bool, unsigned long long)
  0,40%  mysqld  mysqld               [.] Prepared_statement::execute_loop(String*, bool, unsigned char*, unsigned char*)
  0,40%  mysqld  [kernel.kallsyms]    [k] emulate_vsyscall
  0,40%  mysqld  mysqld               [.] dict_index_copy_types(dtuple_t*, dict_index_t const*, unsigned long)
  0,40%  mysqld  [kernel.kallsyms]    [k] update_rq_clock
  0,39%  mysqld  mysqld               [.] dispatch_command(enum_server_command, THD*, char*, unsigned int)
  0,38%  mysqld  mysqld               [.] ha_innobase::extra(ha_extra_function)
  0,38%  mysqld  [kernel.kallsyms]    [k] update_entity_load_avg
  0,37%  mysqld  mysqld               [.] get_best_combination(JOIN*)
  0,37%  mysqld  mysqld               [.] Item_field::fix_fields(THD*, Item**)
  0,37%  mysqld  mysqld               [.] my_malloc
  0,37%  mysqld  [kernel.kallsyms]    [k] __perf_event_task_sched_out
  0,37%  mysqld  [kernel.kallsyms]    [k] __schedule
  0,36%  mysqld  mysqld               [.] lock_tables(THD*, TABLE_LIST*, unsigned int, unsigned int)
  0,36%  mysqld  mysqld               [.] mysqld_stmt_execute(THD*, char*, unsigned int)
  0,35%  mysqld  mysqld               [.] Protocol::send_result_set_metadata(List<Item>*, unsigned int)
  0,35%  mysqld  mysqld               [.] Prepared_statement::execute(String*, bool)
  0,34%  mysqld  mysqld               [.] thd_ha_data
  0,34%  mysqld  mysqld               [.] ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function)
  0,34%  mysqld  [kernel.kallsyms]    [k] select_task_rq_fair
  0,34%  mysqld  [kernel.kallsyms]    [k] do_sys_poll
  0,34%  mysqld  mysqld               [.] st_select_lex::master_unit()

Branch mispredictions (BR_MISP_RETIRED.ALL_BRANCHES_PS), after PGO:

Samples: 1K of event 'rc5:p', Event count (approx.): 1147
  8,37%  mysqld  libc-2.13.so         [.] __memcpy_ssse3
  7,06%  mysqld  libpthread-2.13.so   [.] pthread_getspecific
  3,84%  mysqld  mysqld               [.] Item_func::type() const
  2,88%  mysqld  libc-2.13.so         [.] __memset_sse2
  2,70%  mysqld  libpthread-2.13.so   [.] pthread_mutex_lock
  2,35%  mysqld  mysqld               [.] st_select_lex::master_unit()
  2,27%  mysqld  libpthread-2.13.so   [.] pthread_mutex_unlock
  2,18%  mysqld  mysqld               [.] Item_param::used_tables() const
  2,01%  mysqld  mysqld               [.] Field_long::type() const
  2,01%  mysqld  libc-2.13.so         [.] __strlen_sse42
  1,92%  mysqld  libc-2.13.so         [.] __strcmp_sse42
  1,39%  mysqld  mysqld               [.] Item::const_item() const
  1,31%  mysqld  mysqld               [.] ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function)
  1,22%  mysqld  mysqld               [.] handler::index_read_idx_map(unsigned char*, unsigned int, unsigned char const*, unsigned long, ha_rkey_function)
  1,13%  mysqld  mysqld               [.] dispatch_command(enum_server_command, THD*, char*, unsigned int)
  1,13%  mysqld  mysqld               [.] end_send(JOIN*, st_join_table*, bool)
  1,05%  mysqld  mysqld               [.] my_real_read(st_net*, unsigned long*)
  1,05%  mysqld  mysqld               [.] Item_equal::functype() const
  1,05%  mysqld  mysqld               [.] Item_bool_func2::cleanup()
  0,96%  mysqld  mysqld               [.] Item::is_expensive()
  0,96%  mysqld  mysqld               [.] Item_field::type() const
  0,96%  mysqld  mysqld               [.] Item_param::type() const
  0,87%  mysqld  mysqld               [.] Item::real_item()
  0,78%  mysqld  mysqld               [.] Item_field::used_tables() const
  0,78%  mysqld  mysqld               [.] scheduler_wait_net_end
  0,78%  mysqld  librt-2.13.so        [.] clock_gettime
  0,70%  mysqld  mysqld               [.] Statement::set_statement(Statement*)
  0,70%  mysqld  mysqld               [.] _ZL21join_read_const_tableP13st_join_tableP11st_position.isra.180
  0,70%  mysqld  mysqld               [.] JOIN::optimize()
  0,70%  mysqld  mysqld               [.] Field_num::size_of() const
  0,70%  mysqld  mysqld               [.] Field_str::charset() const
  0,70%  mysqld  mysqld               [.] Item_field::cleanup()
  0,70%  mysqld  libpthread-2.13.so   [.] pthread_rwlock_rdlock
  0,61%  mysqld  mysqld               [.] select_result::initialize_tables(JOIN*)
  0,61%  mysqld  mysqld               [.] mysql_execute_command(THD*)
  0,61%  mysqld  mysqld               [.] mysqld_stmt_execute(THD*, char*, unsigned int)
  0,61%  mysqld  mysqld               [.] JOIN::optimize_inner()
  0,61%  mysqld  mysqld               [.] Field_str::decimals() const
  0,61%  mysqld  mysqld               [.] handler::rebind_psi()
  0,61%  mysqld  mysqld               [.] Item_field::field_type() const
  0,61%  mysqld  mysqld               [.] Item::cmp_type() const
  0,52%  mysqld  mysqld               [.] net_before_header_psi(st_net*, void*, unsigned long)
  0,52%  mysqld  mysqld               [.] my_net_read
  0,52%  mysqld  mysqld               [.] mysql_select(THD*, Item***, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsigned long long, select_result*, st_
  0,52%  mysqld  mysqld               [.] Field::new_key_field(st_mem_root*, TABLE*, unsigned char*, unsigned char*, unsigned int)
  0,52%  mysqld  mysqld               [.] Field::send_binary(Protocol*)
  0,52%  mysqld  mysqld               [.] row_search_for_mysql(unsigned char*, unsigned long, row_prebuilt_t*, unsigned long, unsigned long)
  0,52%  mysqld  mysqld               [.] btr_search_guess_on_hash(dict_index_t*, btr_search_t*, dtuple_t const*, unsigned long, unsigned long, btr_cur_t*, unsigned long, mtr_t*)
  0,52%  mysqld  [vdso]               [.] 0x0000000000000630
  0,44%  mysqld  mysqld               [.] Protocol::send_eof(unsigned int, unsigned int)
  0,44%  mysqld  mysqld               [.] Item::has_subquery() const
  0,44%  mysqld  mysqld               [.] Item::mark_as_condition_AND_part(TABLE_LIST*)
  0,44%  mysqld  mysqld               [.] select_send::send_eof()
  0,44%  mysqld  mysqld               [.] select_send::send_data(List<Item>&)
  0,44%  mysqld  mysqld               [.] add_key_fields(JOIN*, key_field_t**, unsigned int*, Item*, unsigned long long, st_sargable_param**)
  0,44%  mysqld  mysqld               [.] do_handle_one_connection(THD*)
  0,44%  mysqld  mysqld               [.] Field_str::repertoire() const
  0,44%  mysqld  mysqld               [.] handler::cancel_pushed_idx_cond()
  0,44%  mysqld  mysqld               [.] Item_field::get_item_equal()
  0,44%  mysqld  mysqld               [.] Item::check_cols(unsigned int)
  0,44%  mysqld  mysqld               [.] Item_field::fix_fields(THD*, Item**)

4 comments

  1. Hi Kristian,

    very interesting results.

    Before PGO my_hash_sort_bin was the second just after the hottest spot (__memcpy_ssse3) – 1,92% and 1,94% correspondingly. After the optimization it’s still the second, but the gap is significantly larger – 2,55% and 1,79%. So it seems the function was optimized. Actually, it’s just a simple loop calculating a hash, so it’s assembly code should be very short. Is it possible to have a look at the assembly for the function before and after PGO? Maybe there is some cycle unrolling?.. The loop depends on input data (key) so probably compiler unrolls some iterations?..

    1. Looks like the compiler detected that the code is run sufficiently often that it decided to unroll the loop a couple of times. See dissasembly here: http://knielsen-hq.dk/pgo-disassembly.txt

      However, note that this profile is about instructions retired, and my_hash_sort_bin() did not change that much. The reason for the larger gap seems to be that the compiler decided to use __memcpy_ssse3() in more places (instead of inline rep movsb), hence more instructions are executed inside it.

      I also noticed that the my_hash_sort_bin() function is written poorly. The compiler is forced to store and re-load the nr1[0] and nr2[0] values in each iteration, which is very costly – but this is necessary, otherwise the semantics would be wrong in case nr1 and nr2 alias with one another (or with key). Of course, they never alias in practise, but the compiler does not know this. I will fix this (__restrict might work, but it is easier to just copy into and out of temporary variables).

      And thanks for you interesting observations and comments, keep them coming 🙂

      – Kristian.

Leave a Reply to bludwarf Cancel reply

Your email address will not be published. Required fields are marked *